new minor tail?

| posted 31 Jul, 2019 06:24
Hello, this is kind of a complicated question. I'm looking at Tanis gp25 (stop 18226), which is in cluster 19534 as of today (it also contains Gravy_25 and Kerry_25); this gene is directly downstream of the second major tail gene. The most commonly called start is the longest one, but the 4th start in Starterator (17738) is more conserved. When I run a PhagesDB BLAST and HHpred on the longest ORF, there aren't any good functional hits. However, when I use the later, most conserved start, I get pretty strong BLAST and HHpred hits to tail genes.

I'm leaning towards calling the shorter ORF and assigning it as a minor tail protein. Thoughts?

Here are the sequences to check for yourself, if you like.


| posted 31 Jul, 2019 11:52
I would call this a minor tail protein. I would also call the longest ORF (start 17699). There is too much coding potential upstream of the 4th start, and the difference between starts 1 and 2 is negligible.
| posted 31 Jul, 2019 20:07
Jordan (and anyone else who reads this terrific post),
I asked Chris Shaffer if I was correct in thinking that starts 1 and 2 in the Phamerator report 19534 were basically the same. He explained that they ARE the same. Check it out! Chris just showed me that I am smarter than a computer! lol!

What an interesting case.

TL;DR: No, starts 1 and 2 are not really different, even though Starterator says they are.

Long version, if interested:
This is a weird corner case where the Starterator program is giving the literally correct answer, which is really "wrong". So no, starts 1 and 2 should not be considered different. Looking very closely at the diagram, you can see that all the 1 starts have a tiny white sliver on the right edge, while all the 2 starts do not. Here is the actual multiple sequence alignment (MSA), where you can see the single A base insertion in some sequences and not others (the white sliver represents the gap). I highlighted the start codon in yellow (attached):

When you ask a computer whether those starts are "the same", the answer is no, so Starterator considers the top 3 sequences as start 2 and the bottom 4 as start 1. What we have run across is one of the fundamental problems of MSAs: multiple sequence aligners like Clustal just don't have any way to know whether they should align as -A or A-. This is why people "hand tweak" alignments when they want the best possible alignments (for publishing conserved domains, for example): it just isn't possible to code in the external evidence needed to decide between "A-" and "-A". There might be a way to fix this in Starterator; I'm not sure, and I will have to think on it.
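For anyone who wants to see the "-A" versus "A-" problem in miniature, here is a rough Python sketch. The sequences are made-up toy strings, not the real Tanis/Gravy/Kerry alignment, and the helper names (`column_of_residue`, `mismatches`) are hypothetical, not anything from Starterator or Clustal. It only illustrates that two gap placements can be equally good while putting the same start codon in different alignment columns.

```python
def column_of_residue(aligned_row, residue_index):
    """Map a 0-based index in the ungapped sequence to its alignment column."""
    seen = -1
    for col, ch in enumerate(aligned_row):
        if ch != "-":
            seen += 1
            if seen == residue_index:
                return col
    raise IndexError("residue_index is beyond the end of the sequence")


def mismatches(row_x, row_y):
    """Count columns where both rows have a base and the bases differ."""
    return sum(1 for x, y in zip(row_x, row_y)
               if x != y and "-" not in (x, y))


# Toy data (assumption): the long row carries one extra A before its ATG.
long_row = "CCAATGAAA"      # ATG begins at ungapped index 3
gap_then_a = "CC-ATGAAA"    # short sequence aligned as "-A"
a_then_gap = "CCA-TGAAA"    # same short sequence aligned as "A-"
short_start_index = 2       # ATG begins at ungapped index 2 in "CCATGAAA"

# Both rows are the same sequence once the gap is removed.
same_sequence = gap_then_a.replace("-", "") == a_then_gap.replace("-", "")

# Neither placement costs a mismatch, so an aligner has no basis to prefer one.
score_gap_then_a = mismatches(long_row, gap_then_a)
score_a_then_gap = mismatches(long_row, a_then_gap)

# But the start codon's first base lands in a different column.
long_start_col = column_of_residue(long_row, 3)
col_gap_then_a = column_of_residue(gap_then_a, short_start_index)
col_a_then_gap = column_of_residue(a_then_gap, short_start_index)

print(same_sequence, score_gap_then_a, score_a_then_gap)
print(long_start_col, col_gap_then_a, col_a_then_gap)
```

With the "-A" placement the short sequence's start sits in the same column as the long one; with "A-" it sits one column to the left, which is the kind of difference that would get a separate start number.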

As for a recommendation, I typically consider starts that are clearly that close on a Starterator diagram as "the same", even if the computer gives them different numbers.
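That rule of thumb could be sketched roughly as below. This is not how Starterator works; `merge_close_starts` and its `tolerance` parameter are hypothetical, and the column numbers are invented for illustration. The idea is just to lump start positions whose alignment columns are within a small distance of each other, which still needs a human check of the alignment before trusting it.

```python
def merge_close_starts(start_columns, tolerance=1):
    """Group alignment columns that lie within `tolerance` of the previous column.

    Note: grouping is chained, so 100, 101, 102 would all land in one group.
    """
    groups = []
    for col in sorted(set(start_columns)):
        if groups and col - groups[-1][-1] <= tolerance:
            groups[-1].append(col)   # close enough: treat as the same start
        else:
            groups.append([col])     # far enough: a genuinely different start
    return groups


# Invented alignment columns for four numbered starts (assumption).
example_columns = [100, 101, 113, 139]
merged = merge_close_starts(example_columns)
print(merged)
```

Here the first two columns collapse into one group, while the other two stay separate.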


Edited 31 Jul, 2019 20:11
