The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.


All posts created by Pollenz

| posted 09 May, 2022 16:00
In the DJ phages (and also in many of the Gordonia clusters) there is evidence of numerous intergenic 50-75 bp repeated sequences that are located directly upstream of the ATG start sites for 7-10 genes in the middle of the genome (in the DJ phages, the area is just upstream of the tape measure gene). These areas are easily identified in the Phamerator maps as solid lines that diverge to numerous areas and hit the genome UPSTREAM of numerous genes. The nature of the sequence can be found using the SCAN function in DNA Master, and the alignments can be identified with programs like Clustal Omega. The key is that although gaps of 50-75 bp between genes are atypical, in some cases there are start sites within these regions that create longer ORFs. These longer starts should not be called, and it's important to assess both the Glimmer and GeneMark calls, as they are usually the correct starts for these types of genes. In some cases the longer starts have been annotated, and this creates confusion.
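As a rough complement to the Phamerator and DNA Master checks described above, the sketch below pulls the window upstream of each start codon and reports k-mers shared by several windows. It is only an illustration: the function names, the 75 bp window, the 20 bp k-mer size and the forward-strand, 1-based coordinates are all assumptions, not part of any SEA-PHAGES tool.

```python
from collections import defaultdict


def upstream_windows(genome, starts, width=75):
    """Return {start: up to `width` bp immediately upstream of that start}.

    Assumes forward-strand genes; `starts` are 1-based coordinates of the
    first base of each start codon (hypothetical input format).
    """
    windows = {}
    for s in starts:
        lo = max(0, s - 1 - width)
        windows[s] = genome[lo:s - 1].upper()
    return windows


def shared_kmers(windows, k=20, min_genes=3):
    """Map each k-mer to the set of starts whose upstream window contains it.

    Only k-mers found upstream of at least `min_genes` genes are kept.
    """
    seen = defaultdict(set)
    for start, seq in windows.items():
        for i in range(len(seq) - k + 1):
            seen[seq[i:i + k]].add(start)
    return {kmer: hits for kmer, hits in seen.items() if len(hits) >= min_genes}
```

Exact k-mer matching will miss repeats that carry mismatches, so any hits are candidates to confirm with a proper alignment rather than a replacement for one.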
RS Pollenz
Posted in: Cluster DJ Annotation Tips > Intergenic Repeated Sequences
| posted 08 Feb, 2022 02:05
It's been ~2 years since this thread, and there have been several different scenarios regarding the tail assembly chaperone (TAC) in DR phages. There are currently 8 annotated phages with 5 different TAC annotations (see enclosed file). All of the DR phages have high nucleotide similarity and the same set of genes upstream of the tape measure. All of the genes are from huge PHAMs with 350-550 members across multiple clusters. The 3rd gene upstream of the tape measure has a majority call of hypothetical protein; only 3 DR phages, out of 406 members, call it TAC. The 2nd gene upstream has a MAJORITY call of TAC, and the gene directly upstream of the tape measure has 55 of 384 calls to TAC.

The issue is that in the DR phages, there are FIVE different annotation scenarios for the 8 annotated genomes in this region:

1) No TAC calls to any of the genes
2) TAC call to only the 2nd and 3rd upstream genes (Mariokart)
3) TAC calls with programmed frameshift between 2nd and 3rd upstream genes (Sour and NHagos)
4) TAC call to only the 2nd upstream gene (AnClar)
5) TAC with programmed frameshift between 1st and 2nd upstream genes (CloverMinnie)

We are annotating CaiB. We cannot have a programmed frameshift between the 2nd and 3rd upstream genes because an in-frame stop codon in the 2nd upstream gene limits the sequence area that can be used for the slippage to ~25 bp (and there is no slippery sequence within this region).

We CAN find a non-canonical CGGGGCCG sequence in the 2nd upstream gene that would allow slippage to the 1st upstream gene (just as annotated in CloverMinnie, with the SAME slippery sequence).

Thoughts or ideas on this DR cluster? There should be consistency based on genes and sequences. Given the lack of wet lab data on slippery sequences in Gordonia, a non-canonical slippery call seems premature. Based on synteny and HHpred, it seems that TAC should be called for the 2nd upstream gene and possibly the 1st upstream gene, but NOT for the 3rd upstream gene.
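For anyone who wants to screen the candidate region quickly, here is a minimal sketch that looks for heptamers of the form XXXYYYZ and for an exact motif such as the CGGGGCCG sequence mentioned above. The XXXYYYZ pattern (three identical bases, three identical bases, any base) is a simplifying assumption about what counts as "canonical", and the function names are hypothetical.

```python
import re

# Simplified heptamer pattern XXXYYYZ (an assumption, see above).
# The lookahead lets overlapping matches be reported.
HEPTAMER = re.compile(r"(?=(([ACGT])\2\2([ACGT])\3\3[ACGT]))")


def find_heptamers(seq):
    """Return (0-based offset, heptamer) for every XXXYYYZ match in `seq`."""
    return [(m.start(), m.group(1)) for m in HEPTAMER.finditer(seq.upper())]


def find_motif(seq, motif):
    """Return 0-based offsets of every occurrence of `motif`, overlaps included."""
    seq, motif = seq.upper(), motif.upper()
    return [i for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]
```

Running `find_heptamers` on the ~25 bp window left by the in-frame stop, and `find_motif` with `"CGGGGCCG"` on the 2nd upstream gene, would reproduce the two checks described in this post. A match only flags a candidate; it says nothing about whether slippage actually occurs.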
RS Pollenz
Posted in: Cluster DR Annotation Tips > Tail Assembly Chaperone
| posted 03 Dec, 2021 14:22
Using GC/MS of several EK2 phages, one of the most highly expressed proteins identified is the product of the gene directly downstream of the called portal protein. Although we do not have direct evidence that this is the major capsid protein (MCP), its location, its similarity across all EK phages and its level of expression in a phage protein lysate suggest that it is a good candidate to be the MCP. We are also using cryo-EM to begin to assess which protein is the MCP in these Podoviridae phages.
RS Pollenz
Posted in: Cluster EK Annotation Tips > major capsid protein
| posted 03 Jun, 2021 14:19
We have found in Phage Leperchaun (cluster F1) a split methyltransferase with very good HHpred evidence (gp68/gp69). In other cluster F1 phages there is one large methylase gene in this region, but in Leperchuan the first, smaller ORF has a stop codon, and the switch to frame 2 picks up the sequence with an 80 bp overlap and a large second gene. Several of the HHpred hits are alignments to the same PDB entry that show continuous coverage between the 2 Leperchuan genes. There is no consensus slippery sequence, but it will be interesting to see whether a frameshift is the case. See attachment.
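The overlap and frame relationship between two adjacent ORFs like these can be checked from coordinates alone. The sketch below assumes forward-strand genes with 1-based, inclusive coordinates; the function names and the example coordinates in the usage note are hypothetical and are not the actual gp68/gp69 positions.

```python
def overlap_bp(stop1, start2):
    """Number of bases shared by ORF 1 (ending at `stop1`) and ORF 2
    (starting at `start2`); 0 if they do not overlap."""
    return max(0, stop1 - start2 + 1)


def frame_offset(start1, start2):
    """Reading-frame offset of ORF 2 relative to ORF 1.

    0 means both ORFs are in the same frame; 1 or 2 means ORF 2 is
    shifted by that many bases relative to ORF 1.
    """
    return (start2 - start1) % 3
```

For example, with made-up coordinates `start1=1000`, `stop1=1500` and `start2=1421`, `overlap_bp(1500, 1421)` gives an 80 bp overlap and `frame_offset(1000, 1421)` is non-zero, meaning the second ORF is read in a different frame from the first.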
RS Pollenz
Posted in: Annotation > Mid-gene deletion causing frameshift and orphams
| posted 13 May, 2021 19:39
tyrosine integrase #1 = gp 34 (many strong HHpred hits)
immunity repressor = gp 35 (strong HHpred evidence and genes in the Pham)
tyrosine integrase #2 = gp 37 (many strong HHpred hits, as for gp 34)
cro = gp 41 (7CSV_A HTH cro/C1-type domain-containing protein; dimer, ANTITOXIN; 1.71A {Pseudomonas aeruginosa PAO1})
excise = gp 42 (PF06806.1 Putative excisionase; 1Y6U_A Excisionase from transposon Tn916; DNA architectural protein, Tyrosine recombinase, Excisionase, Winged-helix protein, C

So all of the required genes can be identified.
RS Pollenz
Posted in: Cluster FF Annotation Tips > Immunity Cluster Genes in FF Phage Popper
| posted 13 May, 2021 18:50
Given that these FF cluster phages have genes in PHAM 54987 (92 members; the majority call is immunity repressor) that hit numerous repressors, such as 5FD4_B (ComR; Streptococcus, Competence, Quorum sensing, ComR, TRANSCRIPTION REGULATOR; 2.9A {Streptococcus suis (strain 05ZYH33)}), 6H49_A (Orf20; SaPI, Repressor, STRUCTURAL PROTEIN; HET: SO4; 1.8A {Staphylococcus aureus}) and 5D50_D (Repressor; Repressor, Anti-repressor, complex, DNA BINDING PROTEIN; 2.49A {Salmonella phage SPC32H}), an immunity repressor call is consistent with the data and with the location of the genes within an immunity cluster (gp35 in Popper, gp37 in Nandita and gp37 in Ryan).
RS Pollenz
Edited 13 May, 2021 19:16
Posted in: Cluster FF Annotation Tips > Repressor vs HTH DNA binding proteins
| posted 22 Jan, 2021 20:44
The PHAM you reference is now 44108 and contains ~500 members. The majority of calls are major capsid, even though the HHpred hits have 25% probability, poor E-values and minimal coverage. I have tested many of these proteins and do not see any evidence of hits to major capsid PDB entries. Even when the HHpred query is run with the Pfam database selected, there are no hits to phages. We are annotating a DE3 Gordonia phage (EdmundFerry) where gp19 is in this pham. It would appear that most calls are based on synteny.

BUT: in EdmundFerry there are also very few structural genes identified in the 5' region of the genome, as in others we have annotated. So calling genes based on their location relative to other known structural proteins is iffy and is not supported by condition 2 of the synteny rules ("2. adjacent to other structural genes of known, verifiable function").

For example, we can find small and large terminases with excellent HHpred evidence (gp1 and gp2), but genes 3-10 are NKF. There is a portal (gp11, good HHpred hits), but that is about it. The tape measure is clearly gp29 based on its size and location, and upstream of it there are two major tail proteins with low-probability HHpred and Pfam hits. The call of a MuF-like minor capsid (gp13) has NO HHpred PDB hits of any consequence, but DOES have low-coverage Pfam hits to MuF and minor capsid. BUT: gp15 (PHAM 46435) is called as a capsid maturation protease or RNA ligase with essentially NO hits to relevant proteins that have anything to do with proteases or capsids... so the sequence of clearly identifiable structural genes does not fit the standard.

Any other thoughts here?
RS Pollenz
Posted in: Functional Annotation > Singleton FuzzBuster Major Capsid
| posted 25 Feb, 2020 12:58
The short duplicated areas in the DJ phages appear to have sequence similarity to the sequences in the Cluster BI1 phages even though they come from different species.

The DJ phages have eight of these directly upstream of start codons:

RS Pollenz
Posted in: Choosing Start Sites > SD scoring matrix
| posted 24 Feb, 2020 15:22
Yes, very interesting. There is one sequence of ~44 bp that is replicated PRIOR to 5 consecutive genes...
RS Pollenz
Posted in: Choosing Start Sites > SD scoring matrix
| posted 24 Feb, 2020 14:03
We are annotating a Gordonia DJ phage, Secretariat. There appears to be a region across most of these DJ genomes that begins at about 32,000 bp and contains ~15 small genes that are divergent across the different phages. All of the genes are separated by 20-100 bp gaps (no 1-4 bp overlaps or small <10 bp gaps). The issue is that the called start for many of these is NOT the LO, which in many cases would significantly reduce the gaps between these genes. It appears that the RBS data is very poor for many of these (for example, Z value 0.446, spacer 8, final score -8, compared to 1.77/11/-5.3 for the "called" start). All of the genes are hypothetical, so there is no evidence of how these genes are related (an operon??). The coding potential may drop off at the LO on some of these, but it's not that different from many genes that have been called. We know that many factors beyond the RBS/SD impact translation initiation, such as RNA structure, operons, translation of previous genes, etc. What are your thoughts on selecting the LO and reducing gaps when the RBS data is so poor?
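To make the trade-off concrete, the sketch below tabulates candidate starts by the intergenic gap each would leave, next to its RBS numbers. The RBS values are the ones quoted above, on the assumption that the poorer set belongs to the LO candidate; the coordinates, class name and function names are hypothetical, and genes are assumed to be on the forward strand with 1-based coordinates.

```python
from dataclasses import dataclass


@dataclass
class StartCandidate:
    label: str
    start: int          # 1-based coordinate of the first base of the start codon
    z_value: float
    spacer: int
    final_score: float


def gap_to_upstream(upstream_stop, start):
    """Intergenic gap in bp between the last base of the upstream gene and
    this start; a negative value means the genes overlap by that many bases."""
    return start - upstream_stop - 1


def summarize(upstream_stop, candidates):
    """Return (label, gap, z_value, spacer, final_score) rows, smallest gap first."""
    rows = [
        (c.label, gap_to_upstream(upstream_stop, c.start),
         c.z_value, c.spacer, c.final_score)
        for c in candidates
    ]
    return sorted(rows, key=lambda row: row[1])


# Hypothetical coordinates; RBS values as quoted in the post.
candidates = [
    StartCandidate("called", 32096, 1.77, 11, -5.3),
    StartCandidate("LO", 32021, 0.446, 8, -8.0),
]
for row in summarize(32000, candidates):
    print(row)
```

Laying the candidates out this way does not decide the question; it just puts the gap reduction and the RBS penalty side by side for each gene in the region.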
RS Pollenz
Posted in: Choosing Start Sites > SD scoring matrix