The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.

Gap or overlap in Superstar (BD2)

| posted 02 Feb, 2024 01:18
We are annotating phage Superstar (BD2) and aren't sure if we should call a large overlap (~150bp) between gp73 and gp72 (~44,250), or a large gap (~300bp) by deleting gp73.

Some phages (Issmi) have deleted the gene, others (Diane) have kept it in.

Good coding potential in gp73 (and in the two comparators) that ends abruptly where gp72 begins, but no STOP for ~150bp.

My bias (unfounded? Maybe based on my R2I session with Sally Molloy?) is to fill the gap, trust the coding potential, and accept the overlap. Or should we delete gp73? The QCers have had different opinions (though the overlaps do seem to vary somewhat - perhaps more evidence for the idea below).

One solution would be a programmed translational slip, and this is "called" (the black triangle) in the GeneMark files of Diane and Issmi (actually, Diane calls slips between gp74, gp73, and gp72!). My student found a potential slippery sequence in the correct location in Superstar, and we may test it in E. coli, but I know we shouldn't call this without evidence.
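
For readers who want to screen a region for candidate slippery sequences before committing to bench work, here is a minimal sketch. It assumes the commonly described -1 frameshift heptamer X XXY YYZ (XXX any three identical bases, YYY either AAA or TTT, Z anything but G); that motif definition, the function name `find_slippery_sites`, and the toy sequence are illustrative assumptions, not taken from Superstar or from any SEA-PHAGES tool. A hit only flags a place to look and is not evidence for calling a frameshift.

```python
import re

# Assumed motif: X XXY YYZ written in the DNA alphabet.
# XXX = any three identical bases, YYY = AAA or TTT, Z = A, C or T.
# The lookahead lets overlapping candidates be reported too.
SLIPPERY = re.compile(r"(?=(([ACGT])\2\2(?:AAA|TTT)[ACT]))")


def find_slippery_sites(seq):
    """Return (1-based position, heptamer) for every candidate in seq."""
    seq = seq.upper()
    return [(m.start() + 1, m.group(1)) for m in SLIPPERY.finditer(seq)]


# Invented toy sequence, used only to show the call.
toy = "ATGCCGGGAAACTTGACCCTTTAGG"
print(find_slippery_sites(toy))  # [(6, 'GGGAAAC'), (17, 'CCCTTTA')]
```

The scan reads the sequence as given, so for a gene on the bottom strand it would need to be run on the reverse complement of the region. It also ignores reading frame, so any hit still has to be checked against the frame of the upstream gene.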

Any advice/opinions?

Adam
| posted 02 Feb, 2024 13:17
Hi Adam,
This is not a gap-filling exercise. There is great coding potential. If I am at the correct gene (you didn't provide coordinates), there is a 31 base overlap between the gene that starts at 44145 and the next one that ends at 44114. Not that troublesome in this context. Just as important is that there are interesting hits to the gene that ends at 44114. More investigation is needed but I would want to know if I can make a 'toxin' or a DNA binding functional assignment.
Edited 02 Feb, 2024 13:20
| posted 02 Feb, 2024 14:24
hi Debbie.

To not lose coding potential in gp72, the student selected the start site at 44271 (also has better scores/spacer) and this gives a 158bp overlap.

It sounds like you are suggesting we use 44145 to minimize overlap, and not worry about cutting off coding potential?

Part of why I posted this is because there is a clear difference in the way the other related genomes were annotated - one QCer added a gene, and I'm guessing the other decided the overlap was too big an issue.

I didn't get into the functional calls, which are obviously important, but I first wanted to get an opinion on the 158bp overlap. Phamerator map screenshot attached. Diane on top, Superstar_draft middle, Issmi bottom.

Adam
Edited 02 Feb, 2024 14:26
| posted 02 Feb, 2024 15:02
Adam,
Next time, please use coordinates. I just looked too quickly, my apologies.
I would call the big overlap.
This is a great opportunity to imagine how a toxin/antitoxin pair would work.
And I say that without being sure it is even a toxin/antitoxin pair.

Another key point: you aren't done with your positional annotation until you have looked at the functional annotation. You can convince yourself that the gene and the start are the right calls, but investigating the functional call may show you that you are wrong because you are missing a needed domain, or the opposite: there is no functional information that helps.

As for the picture that you sent, the dissimilar nucleotide sequence in the bottom genome negates inferring anything about what genes to call in your genome without further inspections.
Do you use the multiple protein sequence alignments from phagesDB/Phamerator?

In this case, you may want to call a toxin, but can you identify an anti-toxin?
so many good questions!
debbie
| posted 02 Feb, 2024 17:30
Hi Debbie

The student is carefully annotating this region, and I understand the functional information could influence the start site call, but what if these were two NKFs? I'm interested in the more generic case because it would change the way I think about the annotation guiding principles regarding overlap. If they were NKFs, would I make the same call and use the 44271 start, accepting the 158bp overlap? Based on the coding potential, I think I would want to.

Even though Issmi has dissimilar nucleotide sequence, the GeneMark files look quite similar for all three phages: good coding potential but a lot of overlap. I probably would have included the gene, and I was wondering if there is a reason it wasn't changed in QC that I'm missing. Based on synteny, I suspect we'll find the same pham in Issmi (though we haven't checked).

A
Edited 02 Feb, 2024 17:31
| posted 06 Feb, 2024 02:10
Hi Adam,
I know very little about Streptomyces phages annotations.
When I first looked at this, I most wanted to respect coding potential. Tonight, after your most recent questions, I had to ask how many BD phages there are. It looks like there are 43 members in BD, with 11 in subcluster BD2, and additional phages in the rest of the BD subclusters. My impression is that the right arms of BD phages are filled with small genes and lots of HGT.

Because of the overlap, I expected to see lots of hits to the C terminus of this overlapping gene. But if you blastn 44271-43849, you can see that this looks like there are 2 genes in the region. My take, with the short amount of time I have looked at this, is that there is no clear answer here. If you choose the longer overlap, please be sure to include your justification when you submit.

There is definitely more digging to do. And no, this won't get answered until you look at all the data; don't leave out the functional investigation as you continue to consider this.

debbie
| posted 06 Feb, 2024 16:44
Hi Adam and Debbie,

I took a very quick look at this yesterday. There are 32 phages in BD2, but only 13 had this particular gene call based on Phamerator data (and 3 of those, including Superstar, are draft annotations). I only glanced through them and didn't see a clear pattern in the overlap sizes that have been called, but I didn't really have a chance to look into them in detail.

Please continue to share your progress here. We are annotating the two other draft genomes at UNT right now.

Thanks,
Lee
| posted 07 Feb, 2024 22:08
Hi Debbie

Continuing this thread with a different, but related question about Superstar.

The genes in question are between 25,200 and 27,000. The auto-annotation added two bottom strand genes, but we see stronger coding potential on the top strand, and we think we should call the gene (a membrane protein with no significant HHpred hits) from 25,195 to 26,052. However, this gene has a very large overlap with the next gene (an exonuclease). To preserve coding potential, use the best RBS scores and match the Glimmer/Genemark calls, the best call for the second gene is 25,700 to 26,641. This would create a 353 bp overlap between the two genes.
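
The overlap arithmetic here can be checked with a short sketch, assuming the coordinates are 1-based and inclusive at both ends (an assumption, though it reproduces the 353 bp quoted above). The helper names `span`, `gene_length`, and `overlap_bp` are hypothetical and not part of DNA Master, Phamerator, or any other SEA-PHAGES tool.

```python
def span(a, b):
    """Return (low, high) for a gene given by two inclusive coordinates."""
    return (min(a, b), max(a, b))


def gene_length(gene):
    """Length in bp, counting both endpoints."""
    low, high = span(*gene)
    return high - low + 1


def overlap_bp(gene_a, gene_b):
    """Number of bases shared by two genes; 0 if they do not overlap."""
    low_a, high_a = span(*gene_a)
    low_b, high_b = span(*gene_b)
    return max(0, min(high_a, high_b) - max(low_a, low_b) + 1)


# Coordinates as quoted in the post above.
membrane = (25195, 26052)
exonuclease = (25700, 26641)

print(overlap_bp(membrane, exonuclease))  # 353
print(gene_length(membrane), gene_length(exonuclease))  # 858 942
```

Because `span` sorts the two coordinates, the same function works for bottom-strand genes whose start is the larger number. The same inclusive count between 44114 and 44271 gives the 158 bp discussed earlier in the thread.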

A related phage, Caelum in the attached document, also has a gene in the same pham on the top strand (gp31 in the attached phamerator map), but with a smaller overlap (~50bp).

Would a 353bp overlap be acceptable? Or does the large overlap of the gene at 25,195 to 26,052 in Superstar suggest that the similar gene in Caelum (and Issmi) might have been incorrectly called, and all three should be removed? This would leave a gap and unassigned coding potential.

We can move the start site of the second gene, but the start at 25,700 is the most frequently called start and creates a gene in Superstar of similar size to those in related phages.

thanks!
Adam
| posted 07 Feb, 2024 23:42
Adam,
What do you want to call here? What is your evidence?
debbie
| posted 08 Feb, 2024 03:25
Hi Debbie

The question and the evidence have been generated by a student in my course.

I think the second gene (25,700 to 26,641) should be annotated as an exonuclease. Clear coding potential, strong functional hits, annotated in all comparators, most used start site that covers all coding potential.

I'm not sure what to do for the first gene. Either:

1) Annotate the gene (25,195 to 26,052) as a membrane protein, respect the coding potential and allow a 353 bp overlap. Two other phages annotate this gene, but as I mentioned above the encoded genes are shorter. No hits on Blastp or HHpred to mention.

2) Leave a gap before the exonuclease and call nothing. I'm skeptical the auto-annotated bottom strand calls are real: only very weak GeneMarkS coding potential (dotted red lines) and no matches on HHpred or Blastp. The coding potential is much higher on the top strand.

3) I suppose a third option would be a translational frameshift, though we'd obviously need experimental evidence to make that call. I didn't see any evidence for a slippery site near the transition in coding potential between the frames.

I think the decision hinges on whether a 353bp overlap would be acceptable. It severely violates a Guiding Principle.

I haven't annotated enough genomes to have a sense what the correct decision is.

thanks!
Adam