
The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.


Suggestions for Starterator Report Upgrades

| posted 25 Mar, 2017 13:47
Chris,

I know you have more than enough things on your plate, but I have been thinking about how the Starterator Reports, the newest PDF pham reports that we are pulling from <http://phages.wustl.edu/starterator/PhamXXXXReport.pdf>, could be made more useful for calling the most conserved starts.

When a given pham member does not contain the most commonly annotated start in that pham, I then look at the bands for other pham members that also are missing the suggested start. The current printed report gives the percentage of calls for each of the other possible starts in annotations and autoannotations. At that point, I am not very interested in the start calls, but in the available starts.

If the sequential start list provided the percentage of all pham members that contain that possible start, I could quickly evaluate the most conserved start(s), regardless of what had been called in past annotations or in autoannotations. I would look for the first start that is found in the greatest percentage of pham members. Could those data be added to the Pham Reports?
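As a rough illustration of the request above, here is a minimal Python sketch of how "percentage of pham members that contain each possible start" could be computed, and how the first start with the greatest presence could be picked. This is not Starterator code: the function names, the input layout (gene name mapped to the set of start numbers available in that gene) and the gene names are all made up for the example, and it assumes that lower start numbers come first in the sequential start list.

```python
from collections import Counter


def start_presence(candidate_starts):
    """Map each start number to the fraction of pham members that contain it."""
    total = len(candidate_starts)
    if total == 0:
        return {}
    counts = Counter(
        start for starts in candidate_starts.values() for start in set(starts)
    )
    return {start: counts[start] / total for start in sorted(counts)}


def most_conserved_start(candidate_starts):
    """Return the lowest-numbered start with the highest presence, or None."""
    presence = start_presence(candidate_starts)
    if not presence:
        return None
    best = max(presence.values())
    # Assumption: a lower start number means an earlier start in the list.
    return min(start for start, frac in presence.items() if frac == best)


# Hypothetical pham: gene name -> start numbers available in that gene.
example_pham = {
    "GeneA_2": {5, 18},
    "GeneB_2": {5, 18, 19},
    "GeneC_2": {18, 19},
    "GeneD_2": {19},
}

for start, frac in start_presence(example_pham).items():
    print(f"Start {start}: present in {frac:.1%} of members")
print("Most conserved start:", most_conserved_start(example_pham))
```

In this made-up pham, starts 18 and 19 are each present in 3 of 4 members, so the tie goes to the earlier one, start 18.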

| posted 29 Mar, 2017 01:58
Having dealt with these new reports in class, I agree that the current set of calculated numbers is not the best for higher-variance phams. In our discussions last fall we wanted to get away from the "Suggested Start" because it counted human and computational annotations equally and thus put too much emphasis on Glimmer/GeneMark.

I wonder if a good number to report is the fraction of times a start is annotated as the start of the gene, considering only the manually annotated genes that actually have that start present.

I can see two places to put that kind of info that might help: in the "Summary by start number" section and/or in the "Gene information" section. Here are examples; the details can easily be changed, but they give you the idea and something to comment on:

Summary by start number
• Start number 18 is called in: Spectropatronm_Draft_2, Rima_2, Namo_2,
Percent of genes with start 18 present: 37.5% ( 3 of 8 )
Start 18 was manually annotated as the start 100% (2 of 2) of the time when present.

• Start number 19 is called in: Scap1_2,
Percent of genes with start 19 present: 25.0% ( 2 of 8 )
Start 19 was manually annotated as the start 50% (1 of 2) of the time when present.

So in the above example I imagine there are 8 members of the pham. Start 18 is present in 3 of the 8 members, so Starterator reports "present in 37.5%". Of those three members that have start 18, two have been manually annotated (Rima and Namo), and in both cases start 18 was the annotated start of the gene. Thus Starterator reports 2 out of 2, or 100%. Start 19 is present in 2 phages (Scap1 and one other); both are manually annotated, but in only one of them was 19 the annotated start, so Starterator reports 50%. Thus we have a % presence, which examines all genes and gives a sense of the overall level of conservation, and a % manually annotated, which gives the fraction of the time a start is picked when present.
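The arithmetic in that example can be sketched in Python roughly as follows. This is only an illustration, not the Starterator implementation: the `Gene` record, the function names, and the four "Placeholder" genes are invented to pad the pham out to 8 members so that the counts match the numbers described above. Which starts are present in each gene is also assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    present: frozenset  # start numbers available in this gene
    called: int         # start number annotated as the gene start
    manual: bool        # True if the annotation was made by a person


def start_stats(genes, start):
    """Return (n_present, n_total, n_manual_called, n_manual_present)."""
    with_start = [g for g in genes if start in g.present]
    manual = [g for g in with_start if g.manual]
    called = [g for g in manual if g.called == start]
    return len(with_start), len(genes), len(called), len(manual)


def summarize(genes, start):
    """Build the two proposed report lines for one start number."""
    n_present, n_total, n_called, n_manual = start_stats(genes, start)
    if n_total == 0:
        return "No genes in pham."
    lines = [
        f"Percent of genes with start {start} present: "
        f"{100 * n_present / n_total:.1f}% ({n_present} of {n_total})"
    ]
    if n_manual:
        lines.append(
            f"Start {start} was manually annotated as the start "
            f"{100 * n_called / n_manual:.0f}% ({n_called} of {n_manual}) "
            "of the time when present."
        )
    else:
        lines.append(f"Start {start} is not present in any manually annotated gene.")
    return "\n".join(lines)


# Assumed data: only the first four names come from the example above.
pham = [
    Gene("Spectropatronm_Draft_2", frozenset({5, 18}), 18, False),
    Gene("Rima_2", frozenset({5, 18}), 18, True),
    Gene("Namo_2", frozenset({18}), 18, True),
    Gene("Scap1_2", frozenset({19}), 19, True),
    Gene("Placeholder1_2", frozenset({5, 19}), 5, True),
    Gene("Placeholder2_2", frozenset({5}), 5, True),
    Gene("Placeholder3_2", frozenset({5}), 5, False),
    Gene("Placeholder4_2", frozenset({5}), 5, True),
]

for start in (18, 19):
    print(summarize(pham, start))
```

Keeping the two denominators separate (all members for presence, manually annotated members that contain the start for the call rate) is the point of the proposal: the draft gene counts toward the 3 of 8 but not toward the 2 of 2.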

A different way to encode the same information would be to put those numbers in the Gene information section, like this:

Gene information
• Gene: Spectropatronm_Draft_2 Start: 485, Stop: 892
Candidate Starts for Spectropatronm_Draft_2:
[(5, 395, 0%), (18, 485, 100%), (19, 563, 50%)]
I guess a third option is to do both.
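For the Gene information variant, a small Python sketch of how the extended candidate-start list might be rendered is shown here. The function name and both input structures are assumptions for illustration only; the numbers are copied from the example line above.

```python
def format_candidate_starts(candidates, called_fraction):
    """Render (start number, coordinate, percent called when present) triples."""
    items = [
        f"({number}, {coord}, {called_fraction.get(number, 0.0):.0%})"
        for number, coord in candidates
    ]
    return "[" + ", ".join(items) + "]"


# (start number, coordinate) pairs for one gene, as in the example above.
candidates = [(5, 395), (18, 485), (19, 563)]
# Assumed input: start number -> fraction of manual annotations calling it.
called_fraction = {5: 0.0, 18: 1.0, 19: 0.5}

print(format_candidate_starts(candidates, called_fraction))
```

This prints `[(5, 395, 0%), (18, 485, 100%), (19, 563, 50%)]`, which matches the layout proposed above.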

I am a little reluctant to put lots of detail into the Gene information section, since that section already gets quite long for phams with lots of members and long genes.

It would be great to hear feedback on any of the above. I agree that more would be needed; I am just not sure if there is something better. Are there tweaks to the above that make better sense to you? Are there other numbers that could be calculated and reported that would be more informative? Thoughtful feedback is much appreciated.
| posted 29 Mar, 2017 16:19
Thanks Chris. The additional information in the Summary by Start Number is exactly what I would like to see, and it would be most helpful when making decisions about a member of a pham that does not have the most frequently manually annotated start available. Absolutely keep the information about manually annotated starts, but adding the available starts would be great.

I think the Gene Information is fine as it is now. If you make a change, make it in Option 1 as you have detailed.
| posted 29 Mar, 2017 17:18
OK, I have updated this issue for Starterator on the issue tracker. You can see the issue posted here. There are still some unresolved bugs/crashes, as well as some ideas for other changes, on the issue tracker.

Not sure how easy or difficult this change will be to implement, so not sure if/when it will be added. It should be fairly easy in theory, but some of the data constructs are a little tricky to work with, especially the ones in the section of the code that generates the PDF, so it will take some time to dig into the code and see what can be done.
| posted 22 Apr, 2017 21:18
Here is a possible version of the report that changes the "Summary by start number" section.

http://phages.wustl.edu/Pham6711Report.pdf

I have tried to incorporate the above discussion by including info on how often a given start is found in the pham as well as how often it is called as the start of the protein. You can compare it to the current version here:

http://phages.wustl.edu/103/Pham6711Report.pdf

Feedback would be appreciated.
| posted 23 Apr, 2017 17:45
The possible new version looks very good to me. This change in the Pham Reports would be very helpful and would make them much more useful than the current version when trying to make calls on ORFs that do not contain the most frequently annotated start.

Thanks Chris. You have my vote of support on this change.

Larry
 