SEA-PHAGES | when Glimmer and Genemark call genes in different strands

Link to this post | posted 26 Feb, 2025 18:15

lisabono

Our section of SEA-PHAGES is annotating PrairieDogTown in Cluster FO. We noticed that PrairieDogTown has 82 predicted features, while the other phages in FO have 52-54. When we dug into this a bit, we realized that the discrepancy appears to come from the number of reverse genes: PrairieDogTown has ~27, while JanetJ and Aoka have 3-5. The rest of the phages in the cluster are still draft genomes. Does anyone have ideas about why we would see such a large discrepancy in the number of reverses? Is there a technical reason why more reverses would be detected? Thanks!

Note that there's no Cluster FO specific topic, so I'm posting in F.

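As a quick way to quantify the forward/reverse split described above, here is a minimal Python sketch. It assumes you have already exported each gene call as a (start, stop, strand) tuple; the function name `count_strands` and the coordinates are hypothetical and are not real PrairieDogTown calls.

```python
from collections import Counter


def count_strands(features):
    """Count forward ('+') and reverse ('-') calls in (start, stop, strand) tuples."""
    counts = Counter(strand for _start, _stop, strand in features)
    return {"forward": counts.get("+", 0), "reverse": counts.get("-", 0)}


# Hypothetical coordinates, for illustration only.
example_features = [
    (100, 450, "+"),
    (460, 900, "+"),
    (880, 1200, "-"),
    (1250, 1900, "+"),
    (1850, 2100, "-"),
]

strand_counts = count_strands(example_features)
print(strand_counts)
```

Running the same count on the draft and on the annotated reference genomes gives a side-by-side view of where the extra reverse calls come from.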

Link to this post | posted 26 Feb, 2025 20:46

debbie

It is not common, but it is definitely seen, that Glimmer and GeneMark each find a big open reading frame, on opposite strands, and then call most genes in that orientation. These genomes cannot have that many genes. Pay close attention to where the other supporting data align. Look for the ORFs that have known phage functions to build your case for where to call genes.

Link to this post | posted 26 Feb, 2025 20:50

debbie

Hi Lisa,
In the regions where both forward and reverse genes are called, the previous annotators decided that the evidence for the forward genes in that region was more compelling.
Please be sure to make your decisions based on the evidence that you have.

Note that gene prediction models use a 4 nucleotide sequence to find the most abundant patterns in the big open reading frames, then apply that pattern to the whole genome. The math will break down for smaller genes. My guess is that the regions where genes are called simultaneously on both strands are not very GC rich, so the patterns of ATCG are somewhat equivalent and they all get called. Check whether one of the programs (Glimmer or GeneMark) calls one strand more than the other; you may see a bias due to the program's algorithm.

debbie

Edited 26 Feb, 2025 20:53
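The two checks suggested in this post (how GC rich a region is, and whether one program favors one strand) can be sketched in a few lines of Python. This is only an illustration: the function names, the input format of (program, strand) pairs, and the example calls are assumptions, and strands are assumed to be written as "+" or "-".

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA string (0.0 for empty input)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def strand_calls_by_program(calls):
    """Tally gene calls per program and strand from (program, strand) pairs."""
    tally = {}
    for program, strand in calls:
        per_program = tally.setdefault(program, {"+": 0, "-": 0})
        per_program[strand] += 1
    return tally


# Hypothetical calls from a region with genes predicted on both strands.
example_calls = [
    ("Glimmer", "+"),
    ("Glimmer", "-"),
    ("GeneMark", "+"),
    ("Glimmer", "-"),
]

print(gc_content("ATGC"))
print(strand_calls_by_program(example_calls))
```

Comparing `gc_content` for the disputed region against the genome as a whole, and looking at the per-program tally, gives a rough sense of whether the extra calls track a low-GC stretch or one program's bias.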

Link to this post | posted 08 Mar, 2025 22:12

Pollenz

Hello

A strategy to approach this "mess" is to pull up your phage and TWO annotated FO reference phages with decent identity. The BLAST of PrairieDogTown shows that JanetJ and Aoka are good reference phages, with good matches to PrairieDogTown in several genomic areas. If you have good nucleotide identity (PURPLE), you should have a similar genomic organization and can see what was done with the other annotated phages. Functional hits and the number of pham members can be very helpful when deciding which genes are valid and which to delete. The guiding principles are also important here with regard to overlapping genes and transitions from REV to FOR.

RS Pollenz

Link to this post | posted 10 Mar, 2025 19:30

cdshaffer

I always remind my students of rule 9 in cases like this as a good first bit of evidence:

9. Switches in gene orientation (from forward to reverse, or vice versa) are relatively rare. In other words, it is common to find groups of genes transcribed in the same direction.

So if all the neighboring genes are on one strand, then of the two predictions I generally prefer the one on the same strand (there are rare cases of a single gene on the opposite strand, so use the Pollenz method described above to check for that).

Another thing I have students do is run both predicted protein sequences through HHPRED. I think a true gene is much more likely to have an HHPRED hit than a false positive ORF on the other strand. You don't need to see >90% probability. If one gene has HHPRED hits in the 70-80% probability range and the other has hits below 40%, you have found good evidence for which is likely the true gene and which is the false positive.
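The two lines of evidence above (neighbor orientation from rule 9 and the better HHPRED probability) can be tabulated with a short Python sketch. It assumes you have recorded the strand and the best HHPRED probability for each competing call by hand; the function names, dictionary keys, and numbers are hypothetical, and the output only summarizes evidence rather than making the final call.

```python
def majority_neighbor_strand(neighbor_strands):
    """Return '+' or '-' if one strand is strictly in the majority, else None."""
    forward = neighbor_strands.count("+")
    reverse = neighbor_strands.count("-")
    if forward == reverse:
        return None
    return "+" if forward > reverse else "-"


def compare_candidates(candidates, neighbor_strands):
    """Summarize, per candidate, agreement with neighbors and HHPRED support."""
    preferred = majority_neighbor_strand(neighbor_strands)
    best_prob = max(c["hhpred_probability"] for c in candidates)
    return [
        {
            "name": c["name"],
            "matches_neighbors": c["strand"] == preferred,
            "best_hhpred": c["hhpred_probability"] == best_prob,
        }
        for c in candidates
    ]


# Hypothetical competing calls covering the same region.
example_candidates = [
    {"name": "forward_call", "strand": "+", "hhpred_probability": 78.0},
    {"name": "reverse_call", "strand": "-", "hhpred_probability": 35.0},
]
example_neighbors = ["+", "+", "+", "+"]

summary = compare_candidates(example_candidates, example_neighbors)
for row in summary:
    print(row)
```

When the two columns disagree, that is a cue to look harder at the reference genomes, since a lone gene on the opposite strand does occasionally turn out to be real.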
