SEA-PHAGES | Help! I don't understand what this means!

Link to this post | posted 25 May, 2018 22:04

jross1025

I seem to have done a crappy job teaching the new notes format, so I'm having to do a lot of cleanup on a very late file. Believe it or not, it is only now occurring to me that I don't have a clue what this language, taken straight from the Starterator guide, means:

When interpreting Starterator data, in general the start that is present in all genes
that yields the longest possible gene is the correct one. The underlying rationale
for this is that upstream sequence is more likely to vary than protein encoding
sequence, and so the most conserved start that yields the longest genes should be
selected. As always, there are exceptions to this, and so sometimes the analysis is
not informative or not applicable. Examples of this will be described below.

I've simply been noting "SS" when the auto-annotated start is the same one that the large majority of nondraft annotations use, without even asking whether it gives the longest gene. If no start is used by significantly more than half of the nondrafts, I call it NI, and when the start isn't even available, I call it NA. I don't know whether what I'm doing is correct.
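For readers following along, the labeling habit described in this post can be sketched as a small function. This is only an illustration of the poster's stated rule, not anything taken from Starterator or the guide: the function name, the input shapes, and the 0.6 cutoff standing in for "significantly more than half" are all assumptions.

```python
from collections import Counter


def label_start(auto_start, nondraft_calls, starts_in_gene, threshold=0.6):
    """Label a gene's start the way the post describes.

    auto_start      -- start number picked by auto-annotation
    nondraft_calls  -- list of start numbers, one per nondraft annotation
    starts_in_gene  -- set of start numbers actually present in this gene
    threshold       -- hypothetical cutoff for "significantly more than half"

    Returns "SS", "NI", "NA", or None for a case the post does not cover.
    """
    counts = Counter(nondraft_calls)
    total = sum(counts.values())
    if total == 0:
        return "NI"  # no nondraft annotations to compare against

    top_start, top_count = counts.most_common(1)[0]
    if top_count / total < threshold:
        return "NI"  # no start is used by a clear majority of nondrafts
    if top_start not in starts_in_gene:
        return "NA"  # the majority start does not exist in this gene
    if auto_start == top_start:
        return "SS"  # auto-annotated start matches the majority start
    return None  # majority start exists but was not chosen; not covered


print(label_start(5, [5, 5, 5, 7], {3, 5, 7}))
```

Note that this rule never asks which start gives the longest gene, which is exactly the gap the question is about.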

Link to this post | posted 25 May, 2018 22:35

welkin

Hi Joe,
you've got most of the nuances.

The idea behind the "longest start" language is that there could be a conserved start upstream of the starts chosen in all the files. In our dataset, we've annotated phages as we sequenced them, so the start we selected when a phage was a singleton may not be the best choice. You could imagine a scenario in which we now have ten phages in the cluster, and they all have a longer start in common, even though we selected the shorter start for all the others because we didn't have enough comparative data.
So if you find such a start in the alignment, regardless of whether it was chosen most often in the GenBank files, it is probably time to do a reassessment and reannotate all equivalent genes across the cluster.

Does that make more sense?
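The check described in this reply can be sketched roughly as follows. This is a hedged illustration, not Starterator's own logic: the function name, the dictionary layout, and the gene names are hypothetical, and it assumes starts are numbered so that a lower number lies further upstream and therefore yields a longer gene.

```python
def conserved_upstream_starts(starts_by_gene, most_called_start):
    """Return starts present in every gene that lie upstream of the
    most-called start.

    starts_by_gene    -- dict mapping a gene name to the set of start
                         numbers available in that gene
    most_called_start -- start number chosen most often in existing files

    Assumption: a lower start number is further upstream (longer gene).
    """
    if not starts_by_gene:
        return []
    # Keep only the starts that every gene in the group has available.
    shared = set.intersection(*(set(s) for s in starts_by_gene.values()))
    # Of those, keep the ones upstream of the most-called start.
    return sorted(s for s in shared if s < most_called_start)


# Hypothetical genes from three phages in one cluster.
genes = {
    "PhageA_12": {2, 5, 8},
    "PhageB_11": {2, 5},
    "PhageC_13": {2, 5, 9},
}
candidates = conserved_upstream_starts(genes, most_called_start=5)
if candidates:
    print("Reassess: conserved longer start(s) available:", candidates)
```

A non-empty result only flags the group for a second look; the usual evidence for the chosen start still has to be weighed by the annotator.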

Link to this post | posted 26 May, 2018 21:51

jross1025

YES, THANKS
