The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.


All posts created by welkin

| posted 13 Apr, 2018 18:23
Cluster CQ does not have an identifiable lysin B.
Posted in: Cluster CQ Annotation Tips / no lysin B
| posted 13 Apr, 2018 18:21
Lysin A, which is a single multi-domain protein in many Actinobacteriophages, is split at the domain boundaries in Cluster CQ and encoded by two adjacent genes. The functions should be reported as "lysin A, whatever the domain is" to indicate that each gene does not encode the full lysin A.
Posted in: Cluster CQ Annotation Tips / lysin A in two parts
| posted 13 Apr, 2018 17:32
Like the Cluster A mycobacteriophages, Cluster CA phages have two overlapping DNA primase genes back to back. It is unclear how the correct full length primase gets made.
Posted in: Cluster CA Annotation Tips / Double DNA primase
| posted 13 Apr, 2018 14:38
Yep, thanks, Lee. Not an error.

For the exact slip coordinate, I usually pick the middle nucleotide of the slippery sequence.
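
As a rough illustration of that rule of thumb, here is a minimal sketch in Python. The function name and the example coordinates are made up for illustration, and it assumes 1-based, inclusive genome coordinates. For a slippery sequence with an even number of nucleotides there is no single middle base, so this sketch arbitrarily takes the left one of the two.

```python
def slip_coordinate(slippery_start: int, slippery_end: int) -> int:
    """Return the middle nucleotide of a slippery sequence.

    Coordinates are assumed to be 1-based and inclusive.
    For an even-length sequence, the left of the two middle bases is returned.
    """
    if slippery_end < slippery_start:
        raise ValueError("slippery_end must not be smaller than slippery_start")
    return (slippery_start + slippery_end) // 2


# Hypothetical 7-nt slippery sequence spanning 1000..1006
print(slip_coordinate(1000, 1006))  # 1003
```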
Posted in: Frameshifts and Introns / Frameshift-C1
| posted 13 Apr, 2018 14:30
You've identified a neat little region in the B3s that is not well conserved. It seems like every B3 has its own nucleotide sequence right here, which makes it hard to use comparative genomics to give you the answer. The Phamerator map has actually called a tiny forward gene, and your GeneMark TB data looks like it should be a slightly larger reverse gene in the same place, although the coding potential overlaps the RNA ligase, so that's no good.

I guess I am leaning towards "not a gene", mostly because of the lack of small genes upstream of the RNA ligase in the other B3s.
Edited 13 Apr, 2018 14:35
Posted in: Gene or not a Gene / A gene or not a gene - Morty007 #68
| posted 13 Apr, 2018 13:30
Hi Arturo,
This draft annotation in Phamerator is an excellent illustration of the limitations of the gene prediction programs.
Genes 6, 7, and 8 are all in the draft annotation because something about their nucleotide content was scored highly enough by the algorithms to rate as a "gene". However, we also know that the gene prediction programs are wrong somewhere between 5 and 10% of the time.

You are also correct that these calls, (6, 7, and 8) and (13, 14), violate the guiding principles and should be resolved. You should explore all the predictions via BLAST and HHpred and see if the sequences are found in other phages and/or if they have predicted functions. From looking at an EG Phamerator map, it looks like keeping 14 and trimming 13 is the choice that was made for the related genomes.

And to be clear: the guide states that 120 bp is a normal lower size limit for genes, not a hard-and-fast rule. We know of a number of exceptions that we've characterized at the bench. So you should absolutely NOT delete small ORFs from a draft annotation just because they are small.
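
If it helps to make that concrete, here is a small sketch in Python of treating 120 bp as a flag for closer review rather than a filter. The function name, the tuple layout, and the example ORFs are hypothetical, and coordinates are assumed to be 1-based and inclusive. Nothing is removed; short ORFs are only listed so they can be checked by hand.

```python
from typing import List, Tuple

# The usual lower size limit from the guide; not a hard cutoff.
TYPICAL_MIN_LENGTH_BP = 120


def flag_short_orfs(orfs: List[Tuple[str, int, int]]) -> List[str]:
    """Return the names of ORFs shorter than the typical minimum length.

    Each ORF is (name, start, stop) in 1-based inclusive coordinates.
    The input list is left untouched: flagged ORFs still need review.
    """
    flagged = []
    for name, start, stop in orfs:
        length = abs(stop - start) + 1
        if length < TYPICAL_MIN_LENGTH_BP:
            flagged.append(name)
    return flagged


# Hypothetical draft calls
draft = [("gene 6", 100, 189), ("gene 7", 200, 500)]
print(flag_short_orfs(draft))  # ['gene 6']
```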

Posted in: Gene or not a Gene / Cluster EG-Annotation guiding principles
| posted 13 Apr, 2018 13:19
Hi Joe,
For BLAST:
The gene accession number is available when you BLAST on NCBI. The database you are BLASTing against is either NCBI or phagesdb. We have worked hard to sync these two databases with respect to our own data; however, in the phagesdb database we have altered the annotation for some phages that we did not isolate or sequence. For phagesdb, there won't be a gene accession number.

In HHpred, you are doing your alignment against four databases at a time. They do not all have equivalently reliable information. So if your function comes from a crystal structure, you'd write "PDB". If it is a Pfam entry, you'd write "Pfam", and so on.

Regarding the lines of evidence: we are asking you to investigate all three for every gene to make sure that you don't find conflicting answers. You may find that a scaffolding protein doesn't have any sequence similarity to anything via BLAST, and no entries via HHpred, but it is still the only small protein between the protease and the capsid protein. In that case, it is fine that two lines are NKF (no known function), and synteny gets you "scaffolding".
Posted in: Notes and Final Files / Clarification regarding "SIF"
| posted 12 Apr, 2018 17:59
Cluster Q phages have a natural gap in coding sequences in the right arm, starting around 51600 in Giles. This is because they have a conserved small RNA, demonstrated in this paper:
Dedrick et al. 2013.
Posted in: Cluster Q Annotation Tips / small RNA in right arm
| posted 05 Apr, 2018 12:19
Hi Arturo,
You are right that in a circularly permuted genome you should pay attention to the gap between gene 1 and its upstream gene (in this case, the last gene).
However, we ask you to write down the gap to highlight the space or overlap between genes and to get you to think about whether or not you've chosen the correct start. Since you know where gene 1 starts, it becomes somewhat irrelevant to note the gap. So gene 1 in every genome has always received a pass on the "gap". You can just write "n/a" for gene 1.
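
To show the bookkeeping, here is a minimal sketch in Python. The function name and the example coordinates are hypothetical. It assumes each gene is given as its (left, right) ends in 1-based inclusive coordinates, listed in genome order, and it uses the sign convention that a negative number means an overlap; your notes may use a different convention.

```python
from typing import List, Tuple, Union


def gene_gaps(genes: List[Tuple[int, int]]) -> List[Union[int, str]]:
    """Return the gap before each gene, in genome order.

    Gene 1 gets "n/a". For every other gene, the gap is the number of
    bases between it and the upstream gene; a negative value is an overlap.
    """
    if not genes:
        return []
    gaps: List[Union[int, str]] = ["n/a"]
    for (_, upstream_right), (left, _) in zip(genes, genes[1:]):
        gaps.append(left - upstream_right - 1)
    return gaps


# Hypothetical genes: a 9 bp gap, then a 4 bp overlap
print(gene_gaps([(1, 300), (310, 600), (597, 900)]))  # ['n/a', 9, -4]
```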
Posted in: Choosing Start Sites / How to choose the start of the first gene for a circularly permuted genome
| posted 05 Apr, 2018 12:15
Hi Miriam,
There is a document on the Faculty Information page in the Bioinformatics section that describes how to fix a corrupted file:

Posted in: DNA Master / Corrupted files for merge