The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.


Second opinion Cluster F1 Gene Start Site

| posted 16 Apr, 2021 16:39
Hello everyone,

I’d welcome some thoughts on our JalFarm20_3 start site. Thank you! Please see attached for screen shots.

1. Glimmer and GeneMark agree on the start: nucleotide 797. This is not the longest ORF. The start could be nucleotide 671 or 659, but that would create an overlap with JalFarm20_2 of 28 or 40 bp.
2. The start is not conserved among all members of the pham (Starterator). The JalFarm20_3 start is 152, the same as UncleRicky_3 (UPitt annotation, sequenced in 2017); the most-called start is 103; similar phages use an earlier start, 93 (e.g. SpikeLee_3, Bobi_3).
3. Currently predicted start site does not include all coding potential in GeneMark in +2 frame.
4. For genes with functional predictions, start bp 797 includes the full-length protein using BLASTP (terminase large subunit). BLASTP with start bp 671 (which overlaps JalFarm20_2 by 28 bp) produces alignments as well.
5. Start bp 671 has a better final RBS score.
6. Synteny evident.
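
The candidate starts in point 1 can be compared with a little arithmetic. This is a minimal sketch, assuming a forward-strand gene and that the coordinates are the ones quoted above; the function name `extra_residues` is hypothetical and not part of any SEA-PHAGES tool.

```python
def extra_residues(called_start: int, alt_start: int) -> int:
    """Amino acids added to the N-terminus when a forward-strand start
    is moved upstream from called_start to alt_start."""
    diff = called_start - alt_start
    if diff < 0:
        raise ValueError("alt_start must be upstream of called_start")
    if diff % 3 != 0:
        # A candidate start must sit in the same reading frame as the call.
        raise ValueError("starts are not in the same reading frame")
    return diff // 3


called = 797
for alt in (671, 659):
    print(alt, extra_residues(called, alt))
```

Under these assumptions, start 671 adds 42 amino acids and start 659 adds 46.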
| posted 19 Apr, 2021 20:36
Hi Andy,
This is a big pham of genes with no easy answers about starts. It is an easy function call, with this gene being the terminase. Starterator was not easily helpful. The conservation of the overlap may just be because that upstream gene is driving that portion of the sequence. There is no good indication for me to push to change what Glimmer and GeneMark agree on.
If I recollect correctly, the N-terminus of a terminase gene is its ATPase domain. If you carve out that domain (so that your bioinformatic comparisons don't get overpowered by the rest of the protein) you may be able to align the N-terminus sequence with other ATPase domains and find an alignment that is informative.

Without more data, I would stick with the start at 797.

Just my two cents on the subject.
| posted 19 Apr, 2021 22:09
Thanks Debbie! JalFarm20_3 looked very similar to U Pitt's annotation of UncleRicky_3 (e.g. coding potential, length, function) which pointed to retaining the Glimmer and GeneMark call of 797.

Hope you're well out there!

| posted 20 Apr, 2021 15:32
I think Deb is right that you should check for alignments to domains. I can see quite a few HHpred matches that start in the middle of the subject but align to amino acid 1 or 2 when start 797 is selected.

When I get situations like this, I have my students take the amino acid sequence of the longer form and do an HHpred search. Then we look at the results and ask: do those "extra" amino acids at the beginning (42 amino acids in this case) also align to the subject? If those amino acids do align, we take it as pretty good evidence that those first amino acids are in the protein and we pick the longer form; if the amino acids do not align, we pick the shorter form.
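
The rule of thumb above could be sketched roughly as follows. This is only an illustration: the function name, the `min_covered` threshold, and the example coordinates are all assumptions made up for the sketch, not real HHpred output, and the hit ranges would have to be read off the HHpred results by hand (1-based query positions in the longer form).

```python
def extension_is_aligned(hit_query_ranges, n_extra, min_covered=10):
    """Return True if any hit aligns to at least min_covered of the
    first n_extra residues of the longer protein form.

    hit_query_ranges: list of (query_start, query_end), 1-based inclusive.
    min_covered: arbitrary illustrative threshold, not a published cutoff.
    """
    for q_start, q_end in hit_query_ranges:
        # Overlap between the hit and the extra N-terminal region [1, n_extra].
        covered = min(q_end, n_extra) - max(q_start, 1) + 1
        if covered >= min_covered:
            return True
    return False


# Made-up example: both hits begin after residue 42, so the extension
# is not supported and the shorter form would be kept.
hits = [(45, 480), (43, 300)]
choice = "longer" if extension_is_aligned(hits, n_extra=42) else "shorter"
print(choice)
```

A hit that only grazes the last residue or two of the extension is weak support, which is why the sketch asks for a minimum number of covered residues rather than any overlap at all.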
| posted 20 Apr, 2021 16:44
Thank you, that was very helpful! I just searched on HHpred using the longer sequence; the extra amino acids did not align to the subject.
