The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.

How is the most annotated start determined when 2 starts have the same number of manual annotations in the same pham?

| posted 10 Feb, 2022 18:20
Hi,

This is regarding Pytheas gene 29 in the current version of the Starterator report as of 02/10/22. My student caught the following and asked why start 44 is not the most annotated start, since it has the same number of manual annotations as start 39. Please see the attached slide for details. I don’t have the answer and would appreciate your input.
Thank you!

Ping An from University of Pittsburgh
| posted 11 Feb, 2022 02:37
My guess is that Starterator just doesn't have the capability to say "there's a tie." It probably just chose the first of the two starts and designated that as the most-annotated one even though they are actually equal.
| posted 11 Feb, 2022 18:42
I think that the answer is more complicated than that. Can we set up a time to chat about this? The distinction may be arbitrary, but more information is needed, and Starterator, by itself, will not have all of the answers. I am happy to review this with you and your students!
debbie
| posted 11 Feb, 2022 18:55
Thank you both for replying!
I agree with you, Debbie. Starterator by itself does not have all the answers. In this case, one of the two starts with the same number of MAs is actually not present in this gene. It really does not affect our analysis but I think we're just being extra nuanced. I'll email you to set up a time to chat.
| posted 11 Feb, 2022 20:08
There are two issues here. One is the code and what Starterator should put on the report in situations like this. The other is how best to interpret the data to arrive at the start choice best supported by the evidence.

With respect to the former, Amanda is correct: the code that handles this is quite simple and is not built to detect ties or to break them in any formal way. Coding, testing, and publishing changes all take time, so for many issues like this the question is always "is the problem worth fixing?" or "is it good enough even though not perfect?" There are probably dozens of issues like this, so there are always more problems that need fixing than time to fix them. Thus, these kinds of issues can be quite common, especially in bioinformatic software where there is one or only a small number of maintainers. This is a good teaching moment to remind students that, for all bioinformatic software like this, it is always wise to be wary of the results from any one program, especially when running across unusual or rare situations. As you use a program more and more you will learn what the program does well and where it "fails," but before that time (to mis-quote an old TV cop show) "Be careful out there."
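To make the tie problem concrete, here is a minimal Python sketch of how a simple "pick the maximum" selection can hide a tie. This is not the actual Starterator code: the function names and the annotation counts are hypothetical and chosen only for illustration. It assumes the manual annotation counts are held in a dictionary keyed by start number.

```python
def most_annotated_naive(manual_counts):
    """Return one start with the highest manual annotation count.

    max() returns the first maximal item it sees, so a tie is broken
    silently by dictionary insertion order.
    """
    return max(manual_counts, key=manual_counts.get)


def most_annotated_with_ties(manual_counts):
    """Return every start that shares the highest count, sorted."""
    top = max(manual_counts.values())
    return sorted(start for start, n in manual_counts.items() if n == top)


# Hypothetical counts (NOT taken from the real pham): starts 39 and 44 tie.
counts = {39: 5, 44: 5, 41: 1}

print(most_annotated_naive(counts))      # reports a single start
print(most_annotated_with_ties(counts))  # reports all tied starts
```

With the naive version, which start is reported depends only on the order in which the starts were stored, whereas the tie-aware version makes the tie visible so that a person can weigh the other evidence.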

In this particular case, I do have time to work on Starterator (pretty much only in the Fall), and I use feedback from users to decide which issues to fix or which new features to add. So it would be totally appropriate to submit this as an "issue" that needs fixing. This is done on the GitHub pages where the official version of the code lives. There on GitHub is a discussion board called "Issues" where anyone can post bug reports and feature requests. I encourage anyone and everyone to provide feedback there by creating a new issue and posting your requests/comments, or by adding your own comments to other issues. Any software is only as good as its ability to serve the needs of its users, which is why user feedback is so important. When I get time to work on Starterator I go to the issue board to see what's up, and any issue with lots of comments is far more likely to be worked on than an issue that is never mentioned.

As for the interpretation of Starterator reports during gene start analysis I will leave that for your discussion with Deb.
| posted 15 Feb, 2022 16:28
Hi Chris,

Thanks so much for the in-depth discussion/explanation and advice!
I really appreciate it!
I wouldn't claim to have extensive experience with bioinformatics tools, but I am fully aware that no computational tool is 100% perfect. And chances are there are minor issues that are not necessarily worth fixing. In this case, I do NOT think the program "failed" at all. I'm just glad that I got clarifications from experts like you and thank you for putting this great tool to work in the first place!
| posted 15 Feb, 2022 17:30
Debbie and I chatted about this particular pham today.
Debbie pointed out that this pham is particularly diverse and shared the multiple sequence alignment of all members in this pham, which is attached to this post. You may need a multiple sequence alignment viewer to open the alignment, and Debbie recommended the following:
https://macdownload.informer.com/aliview/
The AliView tool is free to download.
To reiterate the important message: Starterator does not have all the answers for the choice of start site, especially when a pham is very diverse. A closer look at the specific cluster to which the phage belongs may be more informative.

In addition, the gene of interest, Pytheas gp29, appears to be a lysin B coding gene. Drs. Kim Payne (a former graduate student from the Hatfull lab) and Graham Hatfull have published a great article on mycobacteriophage endolysins in PLoS ONE (https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0034052&type=printable). This article describes the diversity and modular structure of endolysins and provides very helpful background knowledge for annotation work.