The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.


F1 gene needs help on start site

| posted 13 May, 2021 19:33
Good Afternoon,
We identified JalFarm20_51 as a gene in PECAAN.
The original start was 34172 (the Glimmer start); the GeneMark start is 34331.
It is an orpham.
The stop is 34047, and it is a reverse gene.
We concluded it was a gene based on coding potential. In addition, both Glimmer and GeneMark called the gene.
The PhagesDB BLAST with the original start (34172) shows partial alignment with DNA primase/polymerase. If we use this start, JalFarm20_51 will be 41 AA long, but all the other best matches are 239 AA long.
NCBI BLAST with the same start did not find a putative conserved domain.
When we ran a blastp with start 34331 (longest ORF), the phage SiSi_42 had the best alignment.
JalFarm20 was 51 AA long and SiSi_42 was 58 AA long.
HHpred had better coverage with the 34331 start.
We are requesting assistance to determine which start to use.
| posted 13 May, 2021 22:26
Hi, here's how I think about this. I see that the coding potential extends further upstream than the Glimmer start at 34172, which strongly suggests that the start is further upstream. So the only possible starts are two that are right next to each other upstream of this, 34331 and 34328. Neither of them has a significant match to a Shine-Dalgarno sequence, so they seem equal with regard to SD. To me, choosing one over the other seems arbitrary, and the only difference in the protein would be one amino acid at the N-terminus. It sounds like GeneMark called the 34331 site in your analysis (it did not in my DNA Master auto-annotation just now). This start makes the longest possible ORF and would match 1:1 with the start of SiSi_42 and two others. This would tip the balance for me, and I would choose the 34331 start. Best wishes! -Kirk
| posted 14 May, 2021 18:20
Just a follow-up. When I had two tandem start codons, I always picked the longer gene model (based on the "All other things being equal, a longer call is usually preferable" rule), but recent work with mass spec on phage proteins suggests otherwise. I am quoting now from the online guide (this page on revising your annotations) with the somewhat obscure rule that came out of that mass spec work [note: I have added the underline for emphasis]

Can the start site of the downstream gene be extended so that the gene covers more of the gap? Carefully consider all possible start sites for the downstream gene. If a longer one is available, compare it to the current start site to see if it is a similar or better choice. All other things being equal, a longer call is usually preferable, but do not extend genes just to fill a gap. The exception to this are genes with two start codons in tandem, in these cases all of our wet bench experiments support the second of the two codons as the correct start.
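
For anyone who wants to see the tandem rule spelled out, here is a minimal sketch in Python. It is only an illustration of the guide's sentence above, not something taken from the guide, PECAAN, or DNA Master. The function name `second_of_tandem` is hypothetical, and it assumes each start is given as the coordinate of the first base of the start codon, so that on a reverse gene the upstream start has the higher coordinate.

```python
def second_of_tandem(starts, strand):
    """Return the downstream start of a tandem start-codon pair, or None.

    starts: candidate start coordinates (first base of each start codon)
    strand: "+" for a forward gene, "-" for a reverse gene
    Hypothetical helper for illustration only.
    """
    # One codon downstream is +3 on the forward strand and -3 on the reverse strand.
    step = 3 if strand == "+" else -3
    candidates = set(starts)
    # Walk the candidates from most upstream to most downstream.
    for start in sorted(candidates, reverse=(strand == "-")):
        if start + step in candidates:
            # Two starts exactly one codon apart: take the second (downstream) one.
            return start + step
    return None


# Candidate starts discussed in this thread for the reverse gene JalFarm20_51.
print(second_of_tandem([34331, 34328, 34172], "-"))  # 34328
```

With the three candidate starts from this thread, the only tandem pair is 34331/34328, so the sketch returns 34328. It says nothing about coding potential, Shine-Dalgarno scores, or BLAST alignments, which still have to be weighed by hand.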
| posted 14 May, 2021 19:14
Thanks for this, Chris. I would like to change my suggestion to the 34328 start!

Knowing that mass spec data exist, I've been applying the "second-of-tandem-start-codons" rule when there is a tandem set at 4 bp and 1 bp overlaps, choosing the 1 bp overlap start. I had thought the data applied only to cases where a ribosome was potentially restarting after termination of the upstream gene (4 bp and 1 bp overlaps).

But the statement you quoted in the guide suggests it is more broadly observed.

Which makes me wonder: are the mass spec data on tandem start codons published someplace? It would be interesting to look at the range of cases observed.
| posted 18 May, 2021 19:55
Thank you for your help!