The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.

Relative importance of criteria when annotating an uncommon phage

| posted 21 Mar, 2017 13:42
We made very careful use of the Starterator Pham reports from WUSTL in this annotation, and I think it really helped us make better start calls, but it has raised new questions about calling starts.

We are working on Nairb, a Cluster T phage, one of only 5. Only two annotations (RonRayGun and Bernal13) have been previously published. In a number of cases, Starterator data are available only for Cluster T phages, or the most common Starterator start is not present in Cluster T phams. When the calls previously made in RonRayGun and Bernal13 do not agree, and there is no special reason to choose a start, such as the magic 4bp overlap, we have been making calling decisions based on the following criteria (in this order):
1. Start is conserved in all Pham members from Cluster T
2. Captures more coding potential
3. The longest possible ORF length called
4. The SD score is the best of the choices available
5. The start was called by Glimmer and/or GeneMark
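
As a rough illustration only, the ordered criteria above can be read as a lexicographic sort: compare on criterion 1 first and fall through to the next criterion only on a tie. The sketch below assumes that reading. The class name, field names, and all candidate values are hypothetical and are not taken from the Nairb annotation; the SD score is treated as "closer to zero is better", matching how the scores are compared later in this thread.

```python
from dataclasses import dataclass


@dataclass
class StartCandidate:
    start_number: int           # Starterator start number (made-up example values)
    conserved_in_cluster: bool  # criterion 1: present in all Cluster T pham members
    coding_potential: float     # criterion 2: fraction of coding potential captured
    orf_length: int             # criterion 3: ORF length in bp
    sd_score: float             # criterion 4: less negative is better
    called_by_tool: bool        # criterion 5: called by Glimmer and/or GeneMark


def priority_key(c):
    # Tuples compare element by element, so earlier criteria dominate later ones.
    return (c.conserved_in_cluster, c.coding_potential, c.orf_length,
            c.sd_score, c.called_by_tool)


def rank_starts(candidates):
    # Best candidate first under the strict ordering of the five criteria.
    return sorted(candidates, key=priority_key, reverse=True)


# Hypothetical candidates, for illustration only.
a = StartCandidate(1, True, 0.9, 300, -3.5, True)
b = StartCandidate(2, False, 1.0, 330, -2.9, False)
c = StartCandidate(3, True, 0.9, 300, -4.2, True)

ranked = rank_starts([b, c, a])
print([s.start_number for s in ranked])
```

A strict ordering like this never lets a lower criterion outvote a higher one, so candidate 2 ranks last here despite capturing more coding potential and having the best SD score.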

Our problem Phams are ORFs 21, 26, 30, 32, 33, 34, 39, 55.

Are my criteria in the order you would choose? Are there criteria for using the conserved start for all but one member of a Cluster? One of the draft annotations in Cluster T, Mendoksei, has Phams that are regularly shorter than all the other members, and that suggests a different conserved start than if just the other four members are considered.

I am asking for any words of wisdom on making the hard choices given that calling conserved starts has become a relatively high priority.

For example, in ORF 34 REV (Pham 4779) the start we called was previously called in only one of the two previous Cluster T annotations. One member of this Pham is not from Cluster T, but its starts are completely different from those of the Cluster T members, so we ignored it. We called the most conserved start (30), which also had the best SD score and was the Glimmer call. However, this start did not capture all the coding potential and was not the longest ORF. An earlier start was available in all but one of the Cluster T phages in this Pham that would have captured all the coding potential. Did we make the call appropriately?

ORF 55 (Pham 17577) has 14 members but the most frequently called start is not present in the Cluster T members of this Pham. The two annotated Cluster T members called different starts, 10 and 15. Yet, start 14 is the conserved start in all five of the Cluster T phages. Glimmer called start 15, GeneMark called start 13. I am inclined to call the start at 14. It has a better Z value but a worse SD score than start 15 which is also found in all five Cluster T phages. Just to make things more interesting, start 10 is present in all but one of the Cluster T phages and it would be the most conserved start in all but that one. Start 10 is present in both of the annotated phages, was the called start in one of them, and would include all the coding potential. Thoughts?
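
To make the ORF 55 conservation pattern concrete, here is a small sketch that tabulates which starts are present in every Cluster T member and which are missing from exactly one. The phage labels T1 to T5 are placeholders, and which genome lacks start 10 is an arbitrary assumption, since the post does not say. Only the pattern (starts 14 and 15 in all five, start 10 in four of five) comes from the description above.

```python
from collections import Counter


def starts_in_all(starts_by_phage):
    # Starts present in every genome of the group.
    sets = [set(s) for s in starts_by_phage.values()]
    return set.intersection(*sets) if sets else set()


def starts_missing_from_one(starts_by_phage):
    # Starts present in all genomes except exactly one.
    counts = Counter(s for starts in starts_by_phage.values() for s in set(starts))
    n = len(starts_by_phage)
    return {s for s, k in counts.items() if k == n - 1}


# Placeholder labels; the genome lacking start 10 is chosen arbitrarily.
cluster_t = {
    "T1": {10, 14, 15},
    "T2": {10, 14, 15},
    "T3": {10, 14, 15},
    "T4": {10, 14, 15},
    "T5": {14, 15},
}

print(sorted(starts_in_all(cluster_t)))
print(sorted(starts_missing_from_one(cluster_t)))
```

Separating "present in all" from "present in all but one" keeps a single shorter genome from silently removing a start from consideration.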
| posted 26 Mar, 2017 17:06
Larry,
Hi. Your questions are thoughtful and difficult.
First, I can't order the tools as you would like me to. Each case is solved in context, so I can't prescribe a weight to any individual piece of data. Second, there are 5 Cluster T genomes, only 2 of which are finals. My opinion is that all 5 inform your decision making, so you can't ignore the data just because they are drafts. Soon, we will have to reconcile all of the Cluster T phages, but I can't tell if we have enough data yet.

So, I started to look gene by gene. I would not worry about which previous call I agree with (when they differ - between the 2 in GenBank). Just make your best call. Your cover sheet to SMART will be lengthy in places where you don't like one of the final 2 choices better than the other.

Here is as far as I got so far:
gp 21 I would call 19663

gp 26 I would call 23229

gp 30 I would call 26046. It is the only start that captures ALL coding potential. It took some work to throw away the GeneMark call on the opposite strand. Cool.

gp 32 I think that I would call 27508. I don’t like any of the choices, but this is the one that captures all coding potential but doesn’t have a shitty SDS.

gp 33 This one is hard. I’m still thinking on this. I don’t find that the Starterator program is all that helpful. I need to think about this one again. It is the integrase, so we should be able to see the different domains that need to be there…. and it is significantly different from Bxb1's integrase (the one with experimental data to determine the start).

gp 34 This one is the immunity repressor. Again, I couldn't resolve this one easily.

It might be helpful to look for the attP site around here too.

That is all I have for now. I will email Welkin to ask that she chime in.
Keep up the good work,
debbie
| posted 27 Mar, 2017 18:23
Debbie,

This helps, thanks. When a Pham has very few members, it is difficult to decide which members are outliers, as in Nairb ORF 32. When there are many members, a few ORFs with deletions should not change our sense of the conserved start.

I am still curious about your call on ORF 21. The 19663 start had a worse SD score (-4.569) than 19657 (-3.104), but they have the same Z values (I wrote "lower SD score" in my annotation notes, which may be confusing). 19663 is a TTG start while 19657 is an ATG start. Neither call changes the inclusion of coding potential, but 19657 has a 1bp overlap with the prior ORF while 19663 has a gap of 5bp from the prior ORF. What indicates that 19663 is the better call?
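
For anyone following the overlap and gap figures, the arithmetic can be sketched as below. This assumes forward-strand genes and 1-based inclusive coordinates. The upstream ORF's end coordinate is not stated in the thread; 19657 is simply the value consistent with both the 1bp overlap and the 5bp gap under those assumptions, and the function name is made up for illustration.

```python
def spacing_to_upstream(upstream_end, start):
    """Return the spacing between an upstream gene's last base and a candidate start.

    Positive = gap in bp, negative = overlap in bp, 0 = directly adjacent.
    Assumes forward-strand genes and 1-based inclusive coordinates.
    """
    return start - upstream_end - 1


# Assumed end of the prior ORF (inferred from the figures quoted above, not stated).
UPSTREAM_END = 19657

for candidate in (19657, 19663):
    spacing = spacing_to_upstream(UPSTREAM_END, candidate)
    kind = "gap" if spacing > 0 else "overlap" if spacing < 0 else "adjacent"
    print(candidate, kind, abs(spacing))
```

The two candidate starts are 6bp apart, which is why moving from a 1bp overlap to a 5bp gap costs exactly two codons.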
| posted 29 Mar, 2017 13:35
Wow, Larry, you sure don't pick easy ones!
I may have to break these up into several posts because it is taking me a while to get through them.
First of all, the integrase. This is a tyrosine integrase, rather than a serine (which is the Bxb1 integrase Debbie was referring to). We have several tyrosine integrase starts characterized at the bench, including L5. If I do an ideal alignment using Smith-Waterman at ENI between the two, it looks like the cluster T integrase should be around 1266 in your Nairb. This start lines up better with the Cluster F integrases that have been called on phagesdb and is start "20" on the Starterator report.

I'll post others as I get a chance. I am also going to post your .dnam5 file, so others can follow along if they need to.

-Welkin
| posted 29 Mar, 2017 14:43
Hi Larry,
I am sticking to my call for gene 21. Though my data is slim, my rationale is that the TTG start is under-called and if I can pick it, I do. (That would be my bias - remember it is not favored by any of our gene calling programs, so someone has to stick up for it.) In addition, our limited (very limited) mass spec data suggests that the second start is more likely. (Why? Because the mass spec data has the start of the protein at the next amino acid downstream of the second start.) Until experiments are done, these are predictions. In the big scheme of things, I would find either acceptable.
-debbie
Edited 29 Mar, 2017 14:43
| posted 29 Mar, 2017 17:08
Hi Welkin,

Thanks for this detail on ORF 33 of Nairb. I see now that I mistakenly thought that the most frequently annotated start (20) was not available in this ORF. I have made corrections accordingly. Rejecting both Glimmer and GeneMark calls, and both prior annotations in Cluster T feels strange, but the logic is sound.
| posted 29 Mar, 2017 17:14
Hi Debbie,
The underdog criterion was unexpected, so you put a smile on my face. I am good with the understanding that ORF 21 of Nairb could be called either way.
 