The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.

Cluster EG-Annotation guiding principles

| posted 10 Apr, 2018 18:08

We just started annotating OneinaGillian (Cluster EG) and I wanted to clarify a couple of things before moving forward. There are some discrepancies between the gene calls made by DNA Master and those we see in phagesdb (phamerator) and in PECAAN.

For example, in phamerator, genes 6 and 8 are listed as being transcribed right to left, while gene 7 is on the opposite strand and transcribed left to right. Based on the Guiding Principles in the lab manual, I’m inclined to say that “genes” 6 and 8 are not real since they’re both less than 120 bp long. However, when I look at GeneMark, there’s no coding potential in the top three reading frames that would correspond to gene 7. Does this mean that none are actual genes and all three should be deleted, or am I missing something? Should we automatically delete anything that’s less than 120 bp even if there’s coding potential but no hits on HHPred and BLAST?

On a related note, one of the guiding principles states that each double-stranded segment of DNA is generally part of only one gene. Genes 13 and 14 are on opposite strands and overlap over the entire length of gene 14 (201 base pairs). Is this possible? If not, since gene 14 is part of pham 285541 (encodes an HTH DNA binding protein) while gene 13 is an orphan, should we either delete or trim gene 13?
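The overlap check described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `GeneCall` class, the helper functions, and the coordinates are all hypothetical (they are not OneinaGillian's real coordinates and are not part of DNA Master, Phamerator, or PECAAN); only the 201 bp length of gene 14 comes from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneCall:
    """A gene call with 1-based, inclusive genome coordinates (start <= stop)."""
    name: str
    start: int
    stop: int
    strand: str  # "+" for left to right, "-" for right to left

    @property
    def length(self) -> int:
        return self.stop - self.start + 1


def overlap_bp(a: GeneCall, b: GeneCall) -> int:
    """Number of base pairs shared by two calls (0 if they do not overlap)."""
    return max(0, min(a.stop, b.stop) - max(a.start, b.start) + 1)


def opposite_strand_overlaps(calls, min_overlap=1):
    """List (name, name, shared_bp) for overlapping calls on opposite strands."""
    hits = []
    for i, a in enumerate(calls):
        for b in calls[i + 1:]:
            shared = overlap_bp(a, b)
            if a.strand != b.strand and shared >= min_overlap:
                hits.append((a.name, b.name, shared))
    return hits


# Hypothetical coordinates, chosen so that gene 14 is 201 bp long and lies
# entirely inside gene 13 on the opposite strand.
gene_13 = GeneCall("13", 9000, 9500, "+")
gene_14 = GeneCall("14", 9100, 9300, "-")

print(opposite_strand_overlaps([gene_13, gene_14]))
```

A hit from a check like this only marks a pair for closer inspection; it does not say which call to keep or trim.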

It seems like we’re going to run into similar issues in other parts of the genome, so I wanted to get your input to ensure that we’re on the right track.

Thanks for your help,

| posted 13 Apr, 2018 13:30
Hi Arturo,
This draft annotation in Phamerator is an excellent illustration of the limitations of the gene prediction programs.
Genes 6, 7, and 8 are all in the draft annotation because something about their nucleotide content was scored highly enough by the algorithms to rate as a "gene". However, we also know that the gene prediction programs are wrong somewhere between 5 and 10% of the time.

You are also correct that these calls, (6, 7, and 8) and (13, 14), violate the guiding principles and should be resolved. You should explore all the predictions via BLAST and HHPred and see if the sequences are found in other phages and/or if they have predicted functions. From looking at an EG Phamerator map, it looks like keeping 14 and trimming 13 is the choice that was made for the related genomes.

And to be clear: the guide states that 120 bp is a normal lower size limit for genes, not a hard and fast rule. We know of a number of exceptions that we've characterized at the bench. So you should absolutely NOT delete small ORFs from a draft annotation just because they are small.
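As a rough sketch of that point, a script can flag short calls for manual review instead of deleting them. Everything here is an assumption made for illustration: the `triage` function, its labels, and the example calls are hypothetical, and the script stands in for a manual check of GeneMark, BLAST, and HHPred output rather than replacing it.

```python
TYPICAL_MIN_BP = 120  # normal lower size limit from the guide, not a hard cutoff


def triage(length_bp, has_coding_potential, has_database_hit):
    """Return a review label for a draft gene call; never recommends deletion."""
    if length_bp >= TYPICAL_MIN_BP:
        return "no size flag"
    if has_coding_potential or has_database_hit:
        return "review: short but supported"
    return "review: short, no supporting evidence found yet"


# Hypothetical draft calls: (name, length in bp, coding potential?, BLAST/HHPred hit?)
draft_calls = [
    ("example_A", 96, True, False),
    ("example_B", 105, False, False),
    ("example_C", 450, True, True),
]

for name, length_bp, coding_potential, database_hit in draft_calls:
    print(name, "->", triage(length_bp, coding_potential, database_hit))
```

Every short call ends up with a "review" label, so the final decision still rests on the evidence gathered for each gene.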
