Help! I don't understand what this means!

posted 25 May, 2018 22:04
I seem to have done a crappy job teaching the new notes format, so I'm having to do a lot of cleanup on a very late file. Believe it or not, it is only now occurring to me that I don't have a clue what this language, taken straight from the Starterator guide, means:

When interpreting Starterator data, in general the start that is present in all genes
that yields the longest possible gene is the correct one. The underlying rationale
for this is that upstream sequence is more likely to vary than protein encoding
sequence, and so the most conserved start that yields the longest genes should be
selected. As always, there are exceptions to this, and so sometimes the analysis is
not informative or not applicable. Examples of this will be described below.

I've simply been noting "SS" when the auto-annotated start is the same one that the large majority of nondraft annotations use, without even asking whether it gives the longest gene. If no start is used by significantly more than half of the nondrafts, I call it NI, and when the start isn't even available, I call it NA. I don't know whether what I'm doing is correct.
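
To make the rule of thumb in this post concrete, here is a minimal Python sketch of it. This is not part of Starterator or any SEA-PHAGES tool: the function name, its arguments, and the 0.6 cutoff standing in for "significantly more than half" are all assumptions made for illustration. Only the SS, NI, and NA codes come from the post.

```python
from collections import Counter


def starterator_note(auto_start, available_starts, nondraft_starts, majority=0.6):
    """Hypothetical helper mirroring the note-taking habit described above.

    auto_start       -- start number picked by auto-annotation for this gene
    available_starts -- set of start numbers actually present in this gene
    nondraft_starts  -- start numbers called in the nondraft annotations
    majority         -- assumed cutoff for "significantly more than half"
    """
    if not nondraft_starts:
        return "NI"  # nothing to compare against

    # Most frequently called start among the nondraft annotations.
    common_start, count = Counter(nondraft_starts).most_common(1)[0]

    if count / len(nondraft_starts) < majority:
        return "NI"  # no start is clearly favored
    if common_start not in available_starts:
        return "NA"  # the favored start is not present in this gene
    if auto_start == common_start:
        return "SS"  # auto-annotation agrees with the favored start
    return None  # the post does not cover this case; needs a manual look
```

Note that this sketch never asks whether the favored start gives the longest gene, which is exactly the gap the question is about.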
posted 25 May, 2018 22:35
Hi Joe,
You've got most of the nuances.

The idea with the "longest start" language is that there could be a conserved start upstream of the starts chosen in all the files. In our dataset, we've annotated phages as we sequenced them, so the start we selected when the phage was a singleton may not be the best choice. You could imagine a scenario in which we now have ten phages in the cluster, and they all have a longer start in common, even though we selected the shorter start for all the others because we didn't have enough comparative data.
So if you find such a start in the alignment, regardless of whether it was chosen most often in the GenBank files, it is probably time to do a reassessment and reannotate all equivalent genes across the cluster.

Does that make more sense?
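
The scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Starterator itself works: the data layout (each gene mapped to its candidate start numbers and the gene length each one would give) and both function names are hypothetical.

```python
def longest_conserved_start(candidates):
    """Return the start present in every gene that gives the longest genes.

    candidates -- assumed layout: {gene_name: {start_number: gene_length_bp}}
    Returns None when no start is shared by all genes.
    """
    if not candidates:
        return None
    per_gene = list(candidates.values())

    # Keep only the starts that every gene in the group has.
    conserved = set(per_gene[0])
    for starts in per_gene[1:]:
        conserved &= set(starts)
    if not conserved:
        return None

    # Among the conserved starts, pick the one with the greatest total length.
    return max(sorted(conserved), key=lambda s: sum(g[s] for g in per_gene))


def needs_reassessment(candidates, annotated):
    """True if a conserved start would lengthen any currently annotated gene.

    annotated -- assumed layout: {gene_name: start_number currently called};
                 each called start must appear in candidates[gene_name].
    """
    best = longest_conserved_start(candidates)
    if best is None:
        return False
    return any(
        candidates[gene][best] > candidates[gene][annotated[gene]]
        for gene in annotated
    )
```

With three made-up genes that all share a longer start 1 but were all annotated with start 2, `needs_reassessment` returns `True`, even though start 2 is the one "chosen most often" in the existing files.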
posted 26 May, 2018 21:51
In reply to Welkin Pope:
YES, THANKS
 