SEA-PHAGES | Typical vs atypical GeneMarkS coding potential

Link to this post | posted 06 Feb, 2016 20:18

jross1025

Page 48 of the current Annotation guide makes reference to "typical" and "atypical" coding potential in the GeneMarkS output, which I believe is the heuristically generated version. What is the distinction here? In both LilDestine and Teardrop, the two types of coding potential appear to be largely, although not entirely, overlapping; in those areas of sequence containing atypical, but not typical, cp, the atypical regions are generally small.

Link to this post | posted 12 Feb, 2018 16:06

ivanerill

Hi Joseph,

The typical and atypical models in GeneMark are described in:
https://www.ncbi.nlm.nih.gov/pubmed/9847079

Essentially, GeneMark uses a heuristic to figure out the coding potential in a genome. It detects the longest possible ORFs and assumes they are real ORFs. Based on these, it starts the iterative training of Hidden Markov Models to predict coding and non-coding regions. Once the HMM is set, GeneMark then performs clustering of the predicted protein-coding genes. The most common clustering uses two clusters, assuming that the majority of genes follow a "coherent" codon usage pattern, and that there is a minority, likely the result of lateral gene transfer (LGT), that does not stick to those rules.

Final models are trained on these two clusters, resulting in the "Typical" model for "coherent" genes, and the "Atypical" model for the weirder ones.
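
To make the two-cluster idea concrete, here is a toy Python sketch. It is not GeneMark's actual code: it swaps in a plain 2-means clustering on codon frequency vectors, and the function names (`codon_freqs`, `two_cluster`) and the seeding strategy are assumptions made up for illustration. It only shows how genes with a "coherent" codon usage end up in one group and the odd ones in another.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every gene maps to a comparable vector.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_freqs(gene):
    """Return a 64-element codon frequency vector for an in-frame gene."""
    gene = gene.upper()
    usable = len(gene) - len(gene) % 3
    counts = Counter(gene[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS) or 1  # ignore codons with N etc.
    return [counts[c] / total for c in CODONS]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def two_cluster(vectors, iterations=20):
    """Toy 2-means. Returns labels: 0 = majority cluster, 1 = minority."""
    # Assumed seeding: the overall mean, and the vector farthest from it.
    overall = mean(vectors)
    centroids = [overall, max(vectors, key=lambda v: sq_dist(v, overall))]
    labels = None
    for _ in range(iterations):
        new_labels = [
            0 if sq_dist(v, centroids[0]) <= sq_dist(v, centroids[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = mean(members)
    # Call the larger cluster "typical" (0) and the smaller one "atypical" (1).
    if labels.count(1) > labels.count(0):
        labels = [1 - lab for lab in labels]
    return labels


# Made-up sequences: four GC-rich "genes" and one AT-rich outlier.
genes = [
    "ATGGCCGGCGACCTGGCC",
    "ATGGCCGACGGCCTGGGC",
    "ATGCTGGCCGGCGACGCC",
    "ATGGGCGCCCTGGACGGC",
    "ATGAAATTTAATTTAAAA",
]
labels = two_cluster([codon_freqs(g) for g in genes])
print(labels)  # the AT-rich gene is the only one labeled 1
```

In the real program, each cluster would then be used to train its own model; here the labels just mark which genes would feed the "Typical" model and which the "Atypical" one.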

Hope this helps!

Ivan
