SEA-PHAGES | All posts created by cdshaffer

Link to this post | posted 16 Apr, 2021 16:13

Yes, this is a database sync issue. The new database should appear by the end of today. In the meantime, the results for vine_74 are still available using the older number you mentioned:

http://phages.wustl.edu/starterator/Pham58426Report.pdf

The URL has the exact same pattern for all phams, so if you get a link that does not work and you see the pham number has changed, you can always manually change the URL back to the old number and see if that works. In this case, the old 58426 number does work. Sometime later this link will stop working and the newer link will work instead:

http://phages.wustl.edu/starterator/Pham57943Report.pdf
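
Since the report URL follows one fixed pattern, swapping pham numbers by hand can be sketched in a few lines of Python. This is only an illustration: the function name `starterator_report_url` is made up for this example, and it just builds the URL string without checking whether the report currently exists on the server.

```python
def starterator_report_url(pham_number: int) -> str:
    """Build a Starterator report URL from a pham number.

    The pattern is copied from the two links in this post; the function
    name itself is hypothetical and not part of Starterator.
    """
    return f"http://phages.wustl.edu/starterator/Pham{pham_number}Report.pdf"


# Old and new pham numbers for the same gene, as discussed in this post.
print(starterator_report_url(58426))
print(starterator_report_url(57943))
```

If a link built from the new number fails, try the old number, and vice versa, until the database sync has finished.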

Posted in: Starterator → Pham not found in Starterator

Link to this post | posted 28 Mar, 2021 20:06

cdshaffer

Just a heads up.

Christian in the Hatfull lab has been working on optimizing the parameters for the clustering of phage proteins into phams. The most recent version of the database (ver 400) shows a much larger than average shift in both the number and make-up of phams. We don't know how these changes will affect starterator analysis. It may help overall, in that more genes will be grouped, resulting in fewer genes ending up as orphams with no starterator report. It may also not help, in that the added genes may be so divergent that they provide little evidence to interpret within the reports.

All users should be on the lookout for changes that affect the usefulness of the starterator reports. If anything appears "off" or "confusing" in the starterator results, let us know. If things seem to be working better for you, let us know that too. You can use this forum or send me an email.

Posted in: Starterator → phameration tweeks and effects on starterator

Link to this post | posted 19 Mar, 2021 16:43

cdshaffer

I am still not convinced it is not one amino acid back (i.e. the slip is D/P instead of K/P). Supporting the former is base conservation; supporting the latter is the "observed pattern" for many slippery sequences. I know of no evidence to tell me which is more informative in this situation. I will certainly say that either annotation has enough support that it will qualify as "less worse" than going with the up-to-now policy of "annotate T as a separate gene and pick the longest ORF". We have several BK1 phages and will annotate them using the CCCAAAT pattern accordingly.

Posted in: Frameshifts and Introns → No frameshift in cluster BK1?

Link to this post | posted 18 Mar, 2021 16:49

cdshaffer

1st: Yup, stop codons are a no go as far as I am concerned. I was just looking at conservation in the MSA, which is why I mentioned backing up; but you are correct, I would not think it a good gene model to add in "stop codon read through" (I know these do exist in eukaryotes; do they even exist in prokaryotes?)

2nd: I am fine if Joyce or anyone else wants to include this data in a poster. It's kind of a pain to create all those DNA sequences if you don't have Starterator running in a VM, so I am happy to send anyone the sequences or the alignment for any pham; just send me an email.

As I said above, I think the only evidence that could be relatively easily collected that would help me make up my mind is to get a sense of how often the slippery sequence changes in other phams. If we NEVER see it change in other phams, then that would make me pause here on the side of caution and stick with the "least worst" model. On the other hand, if we do see it happening in other phams, then I could see calling it here too.

So just like in all my wet bench experiments: if you are not sure of your conclusions, run another experiment.

Edited 18 Mar, 2021 16:50

Posted in: Frameshifts and Introns → No frameshift in cluster BK1?

Link to this post | posted 17 Mar, 2021 22:38

cdshaffer

What an interesting and cool question!
Here is an update with some more evidence:

I checked 5 BK1 phages by hand and all have that CCCAAAT sequence. I then realized we should just look at all the sequences in the pham. So I looked at the multiple sequence alignment for pham 5495, which includes the G gene for BE and BK phages. The CCCAAAT is found in all the BK1 G genes (gene numbers 29 to 35 in the alignment below), but that sequence is not found in any of the BE genes (gene numbers in the 50's-60's). So if this is the slippery sequence, you have to argue either that it changed to CCCGGAA and yet is still slippery -or- that the location of the slip has moved since the BE and BK genes diverged. This is fruitful ground for reasonable, well-trained annotators to disagree, since it is all based on individual estimations of the likelihoods of certain events occurring over evolution.

Do we have any evidence of the frequency of slippery sequence turnover in the mycobacteriophages? That is, a much more comprehensive set might be informative.

Alternatively, if you back up a few bases there is a sequence which is conserved for 7 of 8 residues across all phage sequences, and the one degenerate position is always a pyrimidine, i.e. AA(C/T)GACCC. This may not fit any pattern seen among the bench-validated slippery sequences, but the sample size there is low enough that I am not sure how much confidence we should put in those observed patterns.

SaltySpitoon_CDS_62        AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
MindFlayer_CDS_56          AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
Wipeout_CDS_56             AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
Quaran19_CDS_62            AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
TomSawyer_CDS_56           AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
JimJam_CDS_62              AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
PumpkinSpice_CDS_62        AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
Starbow_CDS_56             AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
Battuta_CDS_56             AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
Birchlyn_CDS_55            AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
Bordeaux_CDS_56            AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
Karimac_CDS_57             AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
IchabodCrane_CDS_55        AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC
LukeCage_CDS_57            AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAGGGCGGCGATGATGAC
StarPlatinum_CDS_58        AAGCTGAACGACCCGGAACTGGAAGCCGCAGCGAGGGCGGCGATGATGAC
Enygma_CDS_63              AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAGGGCGGCGATGATGAC
Genie2_CDS_58              AAGCTTAACGACCCGGAACTGGAAGCCGCAGCGAGAGCGGCGATGATGAC
BoomerJR_CDS_58            AAGCTTAACGACCCGGAACTGGAAGCCGCAGCGAGAGCGGCGATGATGAC
Yaboi_CDS_58               AAGCTTAACGACCCGGAACTGGAAGCCGCAGCGAGAGCGGCGATGATGAC
Wofford_CDS_57             AAGCTGAATGACCCGGAACTGGAGGCCGCAGCGAAGGCGGCTCTGATGAG
Evy_CDS_56                 AAGCTCAATGACCCGGAACTGATGGCCGCAGCAGCGGCGATAATGGAGAA
Jay2Jay_CDS_61             AAGCTCAATGACCCGGAACTGATGGCCGCAGCAGCGGCGATAATGGAGAA
Warpy_CDS_60               AAGCTCAATGACCCGGAACTGATGGCCGCAGCAGCGGCGATAATGGAGAA
Targaryen_CDS_59           AAGCTCAATGACCCGGAACTGATGGCCGCAGCAGCGGCGATAATGGAGAA
Sushi23_CDS_56             AAGTTGAATGACCCGGAACTGATGGCCGCAGCGGCGGCAGTGATGGAGCA
Teutsch_CDS_56             AAGTTGAATGACCCGGAACTGATGGCCGCAGCGGCGGCAGTGATGGAGCA
Tribute_CDS_54             AAGTTGAATGACCCGGAACTGATGGCCGCAGCGGCGGCAGTGATGGAGCA
Peebs_CDS_55               AAGTTGAATGACCCGGAACTGATGGCCGCAGCGGCGGCAGTGATGGAGCA
Cross_CDS_56               AAGTTGAATGACCCGGAACTGATGGCCGCAGCGGCGGCAGTGATGGAGCA
Samisti12_CDS_56           AAGTTGAATGACCCGGAACTGATGGCCGCAGCGGCGGCAGTGATGGAGCA
EGole_CDS_56               AAGTTGAATGACCCGGAACTGATGGCCGCAGCGGCGGCAGTGATGGAGCA
NootNoot_CDS_52            AAGCTTAACGACCCGGAACTGATGGCCGCAGCGGCGGCAATGATGGAGAA
Paradiddles_CDS_52         AAGCTTAACGACCCGGAACTGATGGCCGCAGCGGCGGCAATGATGGAGAA
Bartholomune_CDS_54        AAGCTTAACGACCCGGAACTGATGGCCGCAGCGGCGGCAATGATGGAGAA
Braelyn_CDS_55             AAGCTTAACGACCCGGAACTGATGGCCGCAGCGGCGGCAATGATGGAGAA
LilMartin_CDS_53           AAGCTGAATGACCCGGAACTGATGGCCGCAGCAGCGGCAGTGATGGAGCA
MulchMansion_CDS_53        AAGCTGAATGACCCGGAACTGATGGCCGCAGCAGCGGCAGTGATGGAGCA
Mildred21_CDS_54           AAGCTAAATGACCCGGAACTGATGGCCGCAGCGGCGGCAGTGATGGAGCA
Bmoc_CDS_54                AAGCTGAATGACCCGGAACTGATGGCCGCAGCAGCGGCAGTGATGGAGCA
Daubenski_CDS_57           AAGCTCAACGACCCGGAACTGATGGCCGCAGCAGCGGCAGCGATGGAACA
Tomas_CDS_67               AAGCTGAATGACCCGGAACTGATGGCCGCAGCAGCGGCAGCAGTGGAGAG
Annadreamy_CDS_31          AAGTTGAATGACCCAAATCTTCTAGCGGCGGCTCAGGAGGCTCTTGGGAA
Limpid_CDS_31              AAGTTGAATGACCCAAATCTTCTAGCGGCGGCTCAGGAGGCTCTTGGGAA
Beuffert_CDS_32            AAGTTGAATGACCCAAATCTTCTAGCGGCGGCTCAGGAGGCTCTTGGGAA
Blueeyedbeauty_CDS_33      AAGTTGAATGACCCAAATCTTCTAGCGGCGGCTCAGGAGGCTCTTGGGAA
Sham_CDS_32                AAGCTCAACGACCCAAATCTTCTAGCGATGGCTCAGGAGGCACTTGGAAG
TunaTartare_CDS_32         AAGCTCAACGACCCAAATCTTCTAGCGATGGCTCAGGAGGCACTTGGAAG
Faust_CDS_34               AAGTTGAACGACCCAAATCTTCTAGCGATGGCTCAGGAGGCACTTGGCAG
Jada_CDS_32                AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
Forrest_CDS_35             AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
Gilson_CDS_34              AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
MeganTheeKilla_CDS_32      AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
Emma1919_CDS_34            AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
SparkleGoddess_CDS_34      AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
Stigma_CDS_35              AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
Karp_CDS_34                AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
Belfort_CDS_35             AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
Comrade_CDS_34             AAGTTGAATGACCCAAATCTTCTAGCGGCGGCAGCGGAGGCTCTTGGGAG
Moab_CDS_34                AAGTTGAACGACCCAAATCTTCTAGCGGCGGCTCAGGAAGCACTTGGGAG
Circinus_CDS_31            AAGCTGAACGACCCAAATCTTCTAGCGATGGCAGCGGAAGCACTTGGGAA
BillNye_CDS_29             AAGCTGAACGACCCAAATCTTCTAGCGATGGCAGCGGAAGCACTTGGGAA
Muntaha_CDS_30             AAGCTGAACGACCCAAATCTTCTAGCGGCAGCGGCGGAGGCTCTTGGGAA
Wakanda_CDS_30             AAGCTGAACGACCCAAATCTTCTAGCGGCAGCGGCGGAGGCTCTTGGGAA
                           *** * ** *****  * **    **    **    *      *
                                      ^^^^^^^
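
For anyone who wants to repeat this check on other phams, here is a rough Python sketch of the comparison described above. It uses only four rows copied from the alignment as sample data; the dictionary name `ROWS` and the function `classify` are made up for this example, and a real analysis would load the full alignment from a file.

```python
import re

# Four rows copied from the alignment above (two BE genes, two BK1 genes).
ROWS = {
    "SaltySpitoon_CDS_62": "AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC",
    "StarPlatinum_CDS_58": "AAGCTGAACGACCCGGAACTGGAAGCCGCAGCGAGGGCGGCGATGATGAC",
    "Annadreamy_CDS_31": "AAGTTGAATGACCCAAATCTTCTAGCGGCGGCTCAGGAGGCTCTTGGGAA",
    "Wakanda_CDS_30": "AAGCTGAACGACCCAAATCTTCTAGCGGCAGCGGCGGAGGCTCTTGGGAA",
}

# Conserved upstream stretch AA(C/T)GACCC: the third base is C or T.
UPSTREAM = re.compile(r"AA[CT]GACCC")


def classify(seq: str) -> dict:
    """Report which of the candidate motifs occur in one sequence."""
    return {
        "CCCAAAT": "CCCAAAT" in seq,
        "CCCGGAA": "CCCGGAA" in seq,
        "AA[CT]GACCC": bool(UPSTREAM.search(seq)),
    }


for name, seq in ROWS.items():
    print(name, classify(seq))
```

In this small sample the BK1 rows carry CCCAAAT, the BE rows carry CCCGGAA, and all four carry the upstream AA(C/T)GACCC stretch, which matches what I saw by eye in the full alignment.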

Posted in: Frameshifts and Introns → No frameshift in cluster BK1?

Link to this post | posted 17 Mar, 2021 21:41

cdshaffer

I would guess there is a slippery sequence here, but there is no way to find it as it has yet to be discovered in the lab. As an annotator I would never intentionally "make up" a slippery sequence. So even though there is likely a slippery sequence somewhere in that genome, I have no way to find it. This means I know I cannot get the "right" answer. Then, if I cannot get the "right" answer, the best I can do is try to find the "least worst" answer. For me, the "least worst" is to annotate as much of the T region as I can as a gene. I know this is very likely wrong, but it is "less wrong" than the alternatives of either picking a slippery sequence with no support or having no gene annotated for that region at all. And yes, for many of the BK1's the "longest" form is really, really short, so we just annotate that tiny gene, give it the tail assembly chaperone function, and hope that anyone who runs across the annotation will know enough (or go to the literature to find out) what is really going on here. But there is really no way to annotate these regions that works well for a naive reader.

But I agree with Deb: if we can come up with a hypothesis that makes sense based on the published properties of slippery sequences, then that is better than the current solution. I will look for the XXXYYYZ in our BK1's.
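
A quick way to scan for XXXYYYZ candidates is a regular expression with back-references. The sketch below is an assumption-laden illustration, not a validated slippery sequence finder: it treats XXXYYYZ purely as "three identical bases, three identical bases, then any base", applies none of the further restrictions reported for bench-validated slippery sequences, and ignores reading frame. The function name `find_heptamers` is made up for this example.

```python
import re

# Lookahead so overlapping candidates are all reported.
# \2 repeats the first captured base (X), \3 repeats the second (Y).
HEPTAMER = re.compile(r"(?=(([ACGT])\2\2([ACGT])\3\3[ACGT]))")


def find_heptamers(seq: str) -> list:
    """Return (0-based position, heptamer) for every XXXYYYZ-like match."""
    return [(m.start(), m.group(1)) for m in HEPTAMER.finditer(seq.upper())]


# 50-base windows from one BK1 G gene and one BE G gene.
bk1_window = "AAGTTGAATGACCCAAATCTTCTAGCGGCGGCTCAGGAGGCTCTTGGGAA"
be_window = "AAGCTGAATGACCCGGAACTGGAAGCCGCAGCGAAGGCGGCGATGATGAC"

print(find_heptamers(bk1_window))
print(find_heptamers(be_window))
```

Any hit from a scan like this is only a candidate to examine by eye; it still has to sit in the G/T overlap and make sense in the correct reading frame.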

P.S. for those unfamiliar with the G/T nomenclature see this page:
https://seaphagesbioinformatics.helpdocsonline.com/article-6

Edited 17 Mar, 2021 21:46

Posted in: Frameshifts and Introns → No frameshift in cluster BK1?

Link to this post | posted 12 Mar, 2021 18:47

cdshaffer

Your question about false positives is an interesting one. I had always assumed the algorithms were specifically designed to distinguish between membrane domains and simple hydrophobic helices. So I went back to the 2001 paper for TMHMM (doi:10.1006/jmbi.2000.4315). The intro in the paper has a really good discussion of the early methods used to address just that issue. There is also a whole section of the paper on this issue. Bottom line is there are other structures and length requirements that help distinguish a "real" transmembrane domain. Might even be worth pointing out this paper to students who are interested, if only to read the intro.

As for the issue of false positives with TMHMM, according to the paper, the algorithm has a specificity of around 99% if there is not a leader peptide, so I think the protocol as defined is a pretty good one and further support from BLAST is not required.
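
To put that 99% figure in perspective, here is a back-of-the-envelope calculation in Python. It assumes "specificity" means the fraction of non-membrane proteins that are correctly called non-membrane, and the count of 100 proteins is a hypothetical round number, not taken from the paper or from any particular genome.

```python
def expected_false_positives(n_non_membrane: int, specificity: float) -> float:
    """Expected number of non-membrane proteins wrongly called membrane proteins.

    Assumes specificity = true negatives / (true negatives + false positives).
    """
    return n_non_membrane * (1.0 - specificity)


# Hypothetical example: 100 proteins with no membrane domain and no leader peptide.
print(round(expected_false_positives(100, 0.99), 2))
```

Under those assumptions you would expect roughly one false call per hundred non-membrane proteins, which is why I am comfortable with the protocol as written.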

But this issue that a leader peptide reduces the quality of the results is very interesting. Maybe the SEA-PHAGES protocol should be amended for cases where a leader peptide is predicted. That is really a good question for a faculty workshop, I think.

Edited 12 Mar, 2021 18:52

Posted in: Annotation → Membrane proteins

Link to this post | posted 11 Mar, 2021 17:39

cdshaffer

Yes, I agree; I was not trying to imply otherwise. My thought experiment was more with the idea of an outside reviewer. There are many papers out there that talk about how poor the annotations are in GenBank as a whole, so I could imagine a naive reviewer liking the "published set" over and above the "all GenBank set".

This brings up the point that one might pick all SEA-PHAGES phages as a test set over all phages in GenBank or all published phages if annotation consistency was important to the analysis. Again, the experimenter should pick the best possible dataset for the question.

Posted in: General Message Board → Comparative analysis

Link to this post | posted 10 Mar, 2021 18:03

cdshaffer

To me this is mostly an issue of good experimental design and picking the right dataset for your experiments. So consider these two possible sentences you could write in a hypothetical paper and decide which one would be better at convincing the reviewer to accept your paper and its conclusions:

1. "To cast a wide net and compare as many as possible we collected and analyzed all phage in genbank"

2. "To ensure the highest quality of data and gene annotations we selected only phage genomes which have been published in peer-reviewed journals"

If I were a reviewer, I would be fine with option 1 if the study was mostly about sequence variation with little or no input from annotations. The more your experimental conclusions rely on annotations, the more I would favor option 2. As a middle ground, you could also consider all the phages in the RefSeq database instead of all of GenBank.

Posted in: General Message Board → Comparative analysis

Link to this post | posted 06 Mar, 2021 21:29

cdshaffer

I would always prefer the HHPRED matches (if I find them) over the BLAST results. This is due, in no small part, to the quality of the different databases being searched and the relative sensitivity of the algorithms. The source of many "discrepancies" like those in your list is that the alignments are only matching to part of your protein or to just part of the subject. Since some proteins have multiple functional parts all connected together in a single polypeptide chain, this can lead to what I would call a "partial annotation".

Also note that your first two possibilities do not really "disagree"; they really are just different levels of specificity. When trying to decide on levels of specificity, I first direct my students to try to understand the differences in what the terms mean. Good sources include the SEA-PHAGES approved terms list, the ExPASy enzyme class list, Wikipedia, intro bio textbooks, etc. Once you have a better understanding of the terms, you can then look for evidence to help you decide if a higher level of specificity is justified or not.

As for this particular protein, if I scan through the top 15-20 hits from prokaryotes (i.e. I am going to ignore the two human mitochondrial proteins) I see many hits to proteins that are described to have BOTH a helicase activity AND a nuclease activity. This explains the "discrepant" results, so the question becomes: does this protein from crewmate also have those two domains or just one? This is why you see annotators often talk about the size of the protein and the size of the match. I quickly focus on the length of the alignment and which part of the subject is matching. Most of these alignments cover about 75% of crewmate 28, but you can see that they only match a much shorter part of the subjects that are described to have both a nuclease and a helicase activity (like residues 1005-1232, 790-1014 or 129-368). So likely crewmate is similar to either the nuclease or the helicase part of these larger proteins, but I cannot tell which based simply on the summary data presented in the table. Looking at the other matches, I can see hits to pfam domains and "cd conserved domains" that are all different types of exonucleases. So what we have here is likely an exonuclease that is often found as part of a large helicase/nuclease combo protein. So I am pretty convinced that either of the first two options on your list could be appropriate here.
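
The "size of the match" reasoning can be made concrete with a little arithmetic. The sketch below computes the aligned span for the three subject ranges quoted above, plus a coverage fraction. The subject length of 1300 residues is a hypothetical placeholder (the real lengths are in the hit table, which is not reproduced here), and the function names are made up for this example.

```python
def span_length(start: int, end: int) -> int:
    """Number of residues in an inclusive alignment range."""
    return end - start + 1


def coverage(start: int, end: int, total_length: int) -> float:
    """Fraction of a protein of total_length covered by the range start-end."""
    return span_length(start, end) / total_length


# Subject ranges quoted in this post.
for start, end in [(1005, 1232), (790, 1014), (129, 368)]:
    print(f"{start}-{end}: {span_length(start, end)} residues aligned")

# Hypothetical subject of 1300 residues, for illustration only.
print(f"{coverage(1005, 1232, 1300):.0%} of a 1300-residue subject")
```

A hit that covers most of the query but only a small slice of a long, multi-domain subject is the classic signature of a "partial annotation": the subject's full description may list activities that live in the part that did not align.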

When talking about this with my students, I would point out that since they are the first author, it is really up to them to read up on the two terms and decide if they think the "cas4" is better than the generic "exo". But I would be willing to put my name as an author on either of those annotations, since one is just a more specific subtype of the other.

Posted in: Functional Annotation → Phage gene annotation has matching phage genes have 4 different proteins - which one is a match?
