The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.

All posts created by cdshaffer

| posted 10 Apr, 2019 17:30
Lee is correct about the presence of non-printable or control characters in your notes field. If you see this line in your documentation:

/note=""

then you know there is something in the notes that will prevent "hypothetical protein" from being added correctly.

I see this a lot, and I have always assumed it is an instance of a very common problem in bioinformatics: differing newline definitions on Mac vs. DOS vs. Windows vs. Linux (it's complicated, but see this wiki page if you are curious). If my suspicion is correct, then PECAAN is not the problem and cannot fix anything; rather, it's an issue with one of the specific programs you use and/or your OS, and how they handle text as you process the downloaded file.
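For anyone who wants to check a downloaded file for this kind of problem before pasting it anywhere, here is a minimal Python sketch. It is not part of PECAAN or DNA Master; the function names are made up for illustration, and it assumes the culprit really is stray carriage returns or other control characters.

```python
import re

# Control characters other than tab (\x09) and newline (\x0a).
# Carriage returns (\x0d) are converted to newlines before this is applied.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Same set, plus carriage return, used only for reporting.
_SUSPECT_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def find_control_chars(text):
    """Return (offset, repr) for each carriage return or control character."""
    return [(m.start(), repr(m.group())) for m in _SUSPECT_CHARS.finditer(text)]


def clean_notes_text(text):
    """Normalize line endings to "\\n" and drop other control characters."""
    # Convert Windows (\r\n) and classic Mac (\r) line endings to Unix (\n).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return _CONTROL_CHARS.sub("", text)
```

Running `find_control_chars` on the text of the downloaded file shows where any suspicious characters sit, and `clean_notes_text` returns a copy with Unix newlines only. Whether a cleaned file parses correctly in DNA Master is something you would still need to confirm yourself.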

The solution I have found is to use the "recreate" button in DNA Master and then parse again. This has fixed the issue for me. So the protocol then is:

1. Paste the downloaded file into the documentation window
2. Parse this entry by clicking the "parse" button, the next "parse" button, then "yes"
3. Recreate the documentation by clicking the "recreate" button, then "yes"
4. Now parse this version of the documentation by following step 2 again

I usually do one more round of recreate and parse just to be sure. So the click stream I use is:
paste, parse, parse, yes, recreate, yes, parse, parse, yes, recreate, yes, parse, parse, yes
Edited 10 Apr, 2019 17:49
Posted in: PECAAN > New Features in PECAAN
| posted 21 Mar, 2019 15:47
To me, annotations like helix-turn-helix DNA binding domain should be added only if there is sufficient evidence that the protein really does have a domain of that type AND, more importantly, there is not a more specific approved term that is also supported by the evidence. For example, many (if not all??) sigma factors contain HTH domains, so for such a protein either "sigma factor" or "HTH domain" would apply. However, sigma factor is a much better annotation than HTH binding domain protein since it is the more specific term. So you have to look at the evidence with an eye toward the validity of "sigma factor" vs. "HTH domain"; this is why each match should be evaluated with respect to the size and location of the exact match relative to the whole proteins. Full-length alignments are much better than short little domain matches, but if all you have is a short, high-quality match to an HTH domain then I would add it. For short domain matches I would also focus on the HHPRED results. A "helix-turn-helix domain" is really annotating the presence of a structural domain, so I would want to focus on the programs that try to find similarity at the structural level (HHPRED, Phyre2, etc.), not at the primary amino acid level (i.e. BLASTP).

In this particular case, since FIC has been rejected as a term, there is no better approved term than HTH domain. Thus, I would just evaluate the HHPRED evidence on the question "do I really have an HTH domain?" and add it if I felt the evidence justified it.
Posted in: Functional Annotation > FIC family protein
| posted 20 Mar, 2019 16:23
Hmmm, that pham no longer exists; it was probably in the old database but not the newest one. If you are using the virtual machine, it is important to run Phamerator first to check for and install any database updates before running Starterator.

I have used the most recent "stable" version of Starterator, which has a few updates that you will not have seen if you are used to the whole phage report from the version in the virtual machine, so just be aware that the output will be a bit different. I have posted the results here:

https://wustl.box.com/s/irzr8fel3z6e1tmr9qirymj3fl91j0ng

You should also know that most users no longer bother with the whole phage reports but instead go to the pre-computed online versions of the per pham starterator reports. You can see in this article how to get access to those online reports. I always try to keep these reports up to date with new versions of the database.
Posted in: Starterator > phage that crash starterator
| posted 12 Mar, 2019 18:02
I think it is always better to be as specific as the evidence allows. The three terms you cite are not inconsistent, just different levels of specificity. I think most, if not all, kinases use ATP as the source of the phosphate, so it is not surprising that a kinase has an identifiable "ATP binding cassette".

I would pay particular attention to the length of the alignments. Does a particular alignment include the majority of your protein? Does that alignment run along the majority of the subject? Those "full length by full length" alignments are the most informative, and for them I would pick the most specific term (i.e. the thymidylate kinase). On the other hand, if the region of your protein that matches the "ATP binding cassette" is the same region that matches a subpart of some thymidylate kinase, then you likely just have an ATP binding domain that is particularly similar to the ATP binding domain found in a thymidylate kinase. In that case I would use the more general ATP binding domain.
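To make the "full length by full length" idea concrete, here is a small Python sketch that computes how much of the query and how much of the subject an alignment covers. The function names and the 0.8 cutoff are my own assumptions for illustration, not an official SEA-PHAGES threshold; coordinates are taken to be 1-based and inclusive, as they are typically reported in alignment output.

```python
def coverage(start, end, length):
    """Fraction of a sequence covered by an alignment (1-based, inclusive)."""
    if not (1 <= start <= end <= length):
        raise ValueError("expected 1 <= start <= end <= length")
    return (end - start + 1) / length


def is_full_length(q_start, q_end, q_len, s_start, s_end, s_len, threshold=0.8):
    """True when the alignment covers at least `threshold` of BOTH sequences.

    The default threshold of 0.8 is an arbitrary illustrative value.
    """
    query_cov = coverage(q_start, q_end, q_len)
    subject_cov = coverage(s_start, s_end, s_len)
    return query_cov >= threshold and subject_cov >= threshold
```

With made-up numbers, `is_full_length(5, 195, 200, 3, 205, 210)` is true because both sequences are almost entirely covered, while `is_full_length(10, 60, 200, 100, 150, 210)` is false because the match is only a short region of each. A cutoff like this is only a first filter; you still need to look at the actual alignments.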
Posted in: Functional Annotation > Pham 5614 function
| posted 11 Mar, 2019 19:24
The alternative would be to create two approved annotation terms that indicate how the subdomains of the protein are divided between the two polypeptide chains. This is similar to how some AY phage have the large terminase split into an "ATP-ase domain" and a "nuclease domain".

I think the issue will be whether there is really value added by defining two new terms. If the split in your DNA pol correlates well with domain locations, like the above terminase example, it might make sense. If, on the other hand, the split does not divide the domains in a sensible way (for example, the polypeptides split a domain into two halves), then there is probably not a better solution than the one currently in use, which is to give each part the name of the whole. As a reminder, to propose the addition of the two terms you think would better represent the gene products, use the "request a new function" topic.
Posted in: Cluster EF Annotation Tips > Two piece DnaE-like DNA polymerse III (alpha)
| posted 11 Mar, 2019 19:07
cristian,
I have my students use this online tool for ANI calculations. Sorry it doesn't answer your question, but a workaround is better than nothing:

http://enve-omics.ce.gatech.edu/ani/
Posted in: DNA Master > Genome Comparison
| posted 28 Feb, 2019 17:20
These are not mutually exclusive results. Looking at the internal structure for RecA {I used this link: [Rec A page at UniProt]}, I can see that RecA includes an AAA-ATPase-like domain. So both terms apply; it's just a question of specificity.

To me, RecA is a much better description of a function than a general ATPase fold seen in diverse cellular activities {see this}. So if you have a good match to RecA across most of RecA, I would say use RecA. If, on the other hand, the match is simply to the AAA ATPase as found in RecA, I would go with the less specific AAA-ATPase. A detailed look at which parts of RecA are aligning to your protein, by looking at the actual HHPred alignment, should answer that question.
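One way to frame that last check is to ask what fraction of the aligned region of the subject falls inside the annotated domain. The Python sketch below does that arithmetic; the function names are hypothetical, and you would have to supply the domain boundaries yourself (for example from the subject's database entry), since none are given here.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two 1-based, inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def fraction_within_domain(aln_start, aln_end, dom_start, dom_end):
    """Fraction of the aligned subject region that lies inside the domain."""
    if aln_start > aln_end or dom_start > dom_end:
        raise ValueError("interval start must not exceed interval end")
    aligned = aln_end - aln_start + 1
    return overlap_length(aln_start, aln_end, dom_start, dom_end) / aligned
```

A value near 1.0 combined with low coverage of the whole subject suggests the match is essentially to the domain alone, which argues for the less specific term. A match that extends well outside the domain and spans most of the subject argues for the more specific one.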
Posted in: Functional Annotation > R cluster Candle pham 4972 function
| posted 26 Feb, 2019 18:49
Just a heads up.
It looks like this more recent database update had a very large number of changes. Since starterator tries to use previous results when possible, the large number of changes means this analysis requires a lot more processing and is taking much longer than is typical. I will post as soon as the results are available.

Data has been posted. If you are still missing proper pham links, please repost your message.
Edited 26 Feb, 2019 22:02
Posted in: Starterator > Starterator not matching up with listed phams
| posted 26 Feb, 2019 16:10
The most likely reason is a database version mismatch. There was a database update to version 256 late yesterday, and the wustl website is still showing the results from version 255. Starterator is working on the new database, but there are approximately 15 thousand reports, so it just takes time. I expect all the runs to be completed in a couple more hours.
If you still have trouble after the update is posted, then please post a specific example or two of exactly which genes and pham numbers are giving you issues and I will investigate further.
Posted in: Starterator > Starterator not matching up with listed phams
| posted 04 Feb, 2019 19:08
It appears from your picture that you are still running an older version of Starterator. The newer version does not crash. Updating Starterator is non-trivial, which is one of the reasons we went with online reports.
I have posted a whole phage report for phage Zolita which you can download.
If you want to try to update, I can help you with that; just post a followup or email me directly (address is my last name @ wustl.edu).
Posted in: Starterator > phage that crash starterator