SEA-PHAGES | All posts created by cdshaffer

Link to this post | posted 15 Mar, 2022 22:37

Cluster BE phages have two genes that have been annotated as "endolysin". See the phams for Cross 40 and Cross 44. I am linking to the proteins in case pham numbers change; they are currently 98222 and 98787.

The first pham has genes ~500 bp long with some endolysin annotations and some hydrolases, and all members are in the BE cluster. The latter pham has genes in the ~1000 base range, also with "endolysin" as well as "LysM-like peptidoglycan binding protein" and variants. It is a considerably larger group that spans multiple Streptomyces clusters, as well as cluster AS.

Most BE phages have both of these proteins, and current annotations call one "endolysin" and the other something else.

On close inspection with HHPRED, the Cross 44 group has the more typical "Lysin A" structure with two domains. In this case there is an N-terminal domain of 150 amino acids with high-quality HHPRED hits to "N-acetylmuramoyl-L-alanine amidase" (e.g. crystal 6SSC) and a C-terminal domain with high-quality HHPRED hits to transglycosylases and "lysozyme".

Cross 40 and the other members of its pham also have high-quality hits, which probably explain the "endolysin" annotations. In particular, there is a very good hit at the C-terminal region to the peptidoglycan hydrolase domain of the M. tuberculosis resuscitation protein RpfB, as well as another alignment to the 6TAB crystal, termed a "lysozyme". Both of these proteins are glycosyltransferases of one type or another.

Given its structure and distribution across multiple clusters, I would suggest that the larger Cross 44-like proteins be called "endolysin". For the other, an annotation of "glycosyltransferase" might be best, though a second "endolysin" could also be considered.

Posted in: Cluster BE Annotation Tips → two endolysins

Link to this post | posted 13 Mar, 2022 22:11

cdshaffer

According to this page: https://biostar.usegalaxy.org/p/28273/
the Galaxy instance at Texas A&M has a Circos wrapper as well as other graphics methods.
The good news is that Galaxy is a web-based graphical system for bioinformatic analysis and there is no charge. The bad news is that there is still a non-trivial learning curve. Galaxy is a really nice middle ground for doing bioinformatics, and the Texas A&M Galaxy instance is specifically geared to phage analysis.

So, it might be worth considering, but if you have never used Galaxy you will need to commit a non-trivial amount of time just to train on it. I have used Galaxy and think it is one of the best web-based systems for complex computational workflows. However, some tools work better than others when implemented in Galaxy, and since I have never used that Circos wrapper, I have no idea how good it is. So if you don't know Galaxy, there could be a considerable investment of time only to find out the wrapper doesn't give you what you are looking for.

On the other hand, Galaxy is a pretty good system to learn if you are looking to dive deeper into bioinformatics and still keep everything in a graphical, web-based format where you don't have to worry about the command line, package management, and installation.

Posted in: Bioinformatic Tools and Analyses → Circular Genome Visualization

Link to this post | posted 11 Mar, 2022 17:55

cdshaffer

First, I would say that having an orpham or two in a phage is not unusual enough to really worry me. Looking at the Phamerator map of all AZ phages, I can see that KeAlii has at least 3 orphams and several phages have 1 or 2, so having two orphams is not so unusual as to raise real questions about these genes.

The longer gene, 54, has such good coding potential that I would always call that one. Gene 53 is just long enough to call; see rule 8 of the Guiding Principles. I would say 35 amino acids is in the grey zone, meaning it requires some evidence other than just an open reading frame to call the gene. However, both have pretty good coding potential, so I would call both.

No BLAST hits is also not so surprising, as we already know it is an orpham (i.e. unique among all 400K proteins in PhagesDB). So while a good BLASTp match might make you feel more confident there really is a gene here, the lack of a BLAST hit is not good evidence that this region is not a gene. Said more formally: a positive BLASTp result is good evidence for a gene, but a negative result is not good evidence against one; BLAST simply has nothing to say one way or the other.

As for the overlap, we would call this a 4 base overlap (gap score of -4). Since gene coordinates describe intervals, not counts, you cannot just subtract the coordinates; you have to adjust by 1. I run across this issue all the time with my students, and I have them draw out a tiny sequence with a few "genes" to see the difference between interval math and normal math.
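To make the "adjust by 1" point concrete, here is a minimal sketch in Python. The coordinates are made up for illustration (they are not the real coordinates of genes 53 and 54), and the function names are my own. It assumes 1-based, inclusive coordinates with both genes on the forward strand.

```python
def gene_length(start, stop):
    """Number of bases in a gene with 1-based, inclusive coordinates."""
    return abs(stop - start) + 1


def gap_between(prev_stop, next_start):
    """Bases between two adjacent genes; a negative value is an overlap."""
    return next_start - prev_stop - 1


# Hypothetical coordinates: the upstream gene ends at base 1000 and the
# downstream gene starts at base 997, so bases 997, 998, 999 and 1000
# belong to both genes.
prev_stop = 1000
next_start = 997

print(next_start - prev_stop)              # plain subtraction gives -3
print(gap_between(prev_stop, next_start))  # interval math gives -4
print(gap_between(1000, 1001))             # genes that simply abut: gap of 0
```

Counting the shared bases by hand (997 through 1000) gives four, which matches the interval calculation and not the plain subtraction.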

Finally, for gene calls, it is better to have a false positive (i.e. call a gene which really isn't there) than it is to have a false negative (miss a gene). So even if I were not sure of gene 53, I would still call it given this rule.

So, given all of the above, I would keep both of these genes in the annotation and just be amazed at how diverse the gene collection is among all these phages. I am sure Deb could quote you a few papers that discuss the idea of phages as "engines of gene creation", and I think, at least for 54, that we could have an example of that.

Posted in: Gene or not a Gene → Orpham genes in AZ phage

Link to this post | posted 01 Mar, 2022 17:20

cdshaffer

The first thing to try is to confirm that the preferences are set correctly. See this page in the bioinformatics guide: https://seaphagesbioinformatics.helpdocsonline.com/article-66
There are settings which worked in the past but will now cause the connection to fail. I had issues with updates on my old install until I double-checked and changed a few preferences.

Posted in: DNA Master → DNA Master Failing to Update - 01.23.2020

Link to this post | posted 22 Feb, 2022 16:35

cdshaffer

I heard from Steve; Phamerator is now up and working for me. If it is not working for you, be sure to post again.

Posted in: Phamerator → Phamerator not loading

Link to this post | posted 21 Feb, 2022 18:23

cdshaffer

Not loading for me either. Also 502 error on 2 different browsers. I have sent an email to Steve.

Edited 21 Feb, 2022 18:32

Posted in: Phamerator → Phamerator not loading

Link to this post | posted 18 Feb, 2022 17:30

cdshaffer

I also tell my students to put GenBank submissions in a section other than "publications" if they are going on a formal CV, since GenBank submissions are not really "peer reviewed". I feel that some scientists might object to listing them as publications, seeing it as dishonest padding of a CV. I do tell students they should put this on their CV though, as it is important work that deserves recognition (especially for a freshman), so I suggest creating a section called "authored submissions" or "authored contributions". The idea is that we all agree that the work displayed in GenBank is important enough that it deserves authorship, so it should also be on your CV; just don't call it a "publication", and keep that section for authorship on a peer-reviewed article (i.e. MRA).

Posted in: Annotation → In Genbank

Link to this post | posted 16 Feb, 2022 22:45

cdshaffer

These sound more like generic issues with Win 11 and may not be specific to a version of Win 11 running in a VM. I have not had any chance to work with Win 11 yet; hopefully someone who has tried and succeeded at a Win 11 installation can help.

If all else fails, you can delete the VM and start over. Also note that these virtual systems allow you to create save points which you can go back to. They go by different names; on VirtualBox they are called "snapshots", and I think they are the same in Parallels. So anyone doing this, take advantage of this snapshot feature. Which is to say: create the VM, install Windows, run all updaters, set the Windows settings as you like, THEN create a snapshot. Now try to install DNA Master. If things get messed up, you can revert to the snapshot and your Windows machine will be exactly as it was before you did any DNA Master work. Now you can try something different. Once you get DNA Master installed the way you like, create another snapshot; that way, if things break in the future, you can revert to the version where DNA Master was successfully installed. Wash, rinse, repeat.
Good luck and please, if you find any solutions, post a follow-up here for all of us who will likely be facing these same issues once we migrate to Win 11.
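For anyone who would rather script the snapshot routine than click through the GUI, here is a rough sketch for VirtualBox only, using its `VBoxManage snapshot` command from Python. The VM name and snapshot names are placeholders I made up, I have not tested this on a Win 11 guest, and as far as I know a restore needs the VM to be powered off first. Parallels has its own command line tool, which is not covered here.

```python
import subprocess


def take_snapshot_cmd(vm_name, snapshot_name):
    """Build the VBoxManage command that saves the VM's current state."""
    return ["VBoxManage", "snapshot", vm_name, "take", snapshot_name]


def restore_snapshot_cmd(vm_name, snapshot_name):
    """Build the VBoxManage command that reverts the VM to a saved state."""
    return ["VBoxManage", "snapshot", vm_name, "restore", snapshot_name]


def run(cmd):
    """Run a command and raise an error if VBoxManage reports a failure."""
    subprocess.run(cmd, check=True)


# Hypothetical names; substitute whatever your own VM is called.
VM = "Win11-DNAMaster"

# 1. After Windows is installed, updated and configured:
#        run(take_snapshot_cmd(VM, "clean-windows"))
# 2. If a DNA Master install attempt goes wrong (VM powered off):
#        run(restore_snapshot_cmd(VM, "clean-windows"))
# 3. Once DNA Master works the way you like:
#        run(take_snapshot_cmd(VM, "dna-master-working"))
print(" ".join(take_snapshot_cmd(VM, "clean-windows")))
```

The `run` calls are left commented out so that nothing touches a real VM until you have put in your own names.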

Posted in: DNA Master → DNA Master on M1 Mac

Link to this post | posted 16 Feb, 2022 19:01

cdshaffer

OK, I have looked into this more. DNAbinder has an estimated false positive rate of 5-7% when using the "realistic dataset". The biggest issue for me is that it appears you can only load 1 sequence at a time, which makes general use for all genes quite labor and time intensive; that is not ideal for a program to recommend to everyone. Also, it runs in an old version of Perl and the server is running on Solaris, so it could be very time consuming to update it to a modern server and make it easily usable as a general program (i.e. to make it less labor intensive).
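To put the 5-7% figure in perspective, here is a quick back-of-the-envelope sketch in Python. The count of 60 screened proteins is a number I made up purely for illustration, and the calculation assumes none of them truly binds DNA and that the quoted rate carries over to phage proteins, which is not established.

```python
def expected_false_positives(n_non_binders, false_positive_rate):
    """Expected number of non-binding proteins wrongly called DNA binding."""
    return n_non_binders * false_positive_rate


# Hypothetical screen: 60 NKF proteins, none of which truly binds DNA.
n_screened = 60

for rate in (0.05, 0.07):
    expected = expected_false_positives(n_screened, rate)
    print(f"FPR {rate:.0%}: about {expected:.1f} false calls in {n_screened} proteins")
```

Under those assumptions a single-tool screen would hand out roughly three or four wrong "DNA binding" calls per genome.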

DNABIND does allow for multiple submissions, so you could just dump in all the protein sequences from a phage, but the program is really designed to analyze 3D structures, and they say on the page: "Although it [DNABIND] can predict DNA binding from the protein sequence alone, pure sequence-based prediction was only validated on a very small set of sequences (all of them belonging to structures in the Protein Data Bank)."
So there is no way to know its performance with just FASTA peptide sequences. The interesting idea would be to combine this program with the newest 3D structure predictors and see how it performs. Again, this would be so time consuming that I would not recommend it as a general protocol for all phages, but it is a very interesting question that a student could investigate.

Bottom line: I think the practicality issues mean it should not be added as a recommended protocol for all annotations of all phages, but it would be suitable for individual "in depth" investigations. I always have my students do some kind of individual research during the last 3-4 weeks of the semester, and using these two programs to search through all the NKF proteins for possible function would certainly qualify as a suitable project, especially with an eye on specificity and sensitivity issues.

I guess the remaining question is: should these be used to add DNA binding to the approved annotations when someone does the work? To me, again, that would rely on a convincing data analysis showing a very low rate of false positives.

Edited 16 Feb, 2022 19:31

Posted in: Functional Annotation → Can we call DNA Binding proteins based on DNABIND and DNA Binder results?

Link to this post | posted 16 Feb, 2022 17:42

cdshaffer

I also think that the issue of false positives (i.e. specificity) is as important as Christian's initial investigation into false negatives (sensitivity). So I would like to know: if we pick 50 or so phams that we know are NOT DNA binding (structural proteins come to mind first, then lysis enzymes and holins; other suggestions?), do we get any false positives for these proteins? I always think that keeping a protein NKF is better than giving it a false function. So if no false positives are evident, then no major harm would be expected from doing the screen with the two programs, and I think it might be worth considering.

If we were to consider it, I would follow the common practice we use for "membrane proteins", which is that we only screen proteins at the end that are without a better annotation. That is, screen the NKFs at the end for possible "DNA binding" activity. Then only add the annotation "DNA Binding" to the protein if BOTH programs predicted a DNA binding activity.
But I will say again that this all depends on having an excellent specificity (like 100%).
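A minimal sketch of the two checks described above, in Python. All protein names and calls here are made-up toy values, not real DNAbinder or DNABIND output, and the function names are mine. The first function scores specificity on a control set of known non-binders; the second keeps only proteins that BOTH programs called DNA binding.

```python
def specificity(calls_on_known_negatives):
    """Fraction of known non-binders that the tool correctly left uncalled.

    The argument maps protein name -> True if the tool called it DNA binding.
    """
    total = len(calls_on_known_negatives)
    false_positives = sum(calls_on_known_negatives.values())
    return (total - false_positives) / total


def consensus_dna_binding(calls_a, calls_b):
    """Names of proteins that both tools called DNA binding."""
    return sorted(
        name for name in calls_a
        if calls_a[name] and calls_b.get(name, False)
    )


# Toy control set of proteins known NOT to bind DNA (invented results).
control_calls = {
    "major capsid protein": False,
    "portal protein": False,
    "holin": True,        # a false positive in this toy example
    "endolysin": False,
}
print(specificity(control_calls))  # 0.75

# Toy NKF screen with two tools (invented results).
tool_a_calls = {"gp12": True, "gp30": True, "gp41": False}
tool_b_calls = {"gp12": True, "gp30": False, "gp41": False}
print(consensus_dna_binding(tool_a_calls, tool_b_calls))  # ['gp12']
```

Requiring agreement between the two tools trades some sensitivity for fewer false calls, which fits the view that leaving a protein NKF is better than giving it a false function.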

Edited 16 Feb, 2022 18:33

Posted in: Functional Annotation → Can we call DNA Binding proteins based on DNABIND and DNA Binder results?
