
The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.

Welcome to the forums at seaphages.org. Please feel free to ask any questions related to the SEA-PHAGES program. Any logged-in user may post new topics and reply to existing topics. If you'd like to see a new forum created, please contact us using our form or email us at info@seaphages.org.

All posts created by cdshaffer

| posted 17 Jul, 2023 18:38
I use FastQC to check read quality, but for the last few years the data has been so good that I would probably not bother if I had to set up a new pipeline.

I use Trimmomatic to trim, mostly because it was an easy install using brew (a Mac package installer). But, as Dan said above, the most recent machines are so good at short reads (I am currently using 2x150) that trimming is probably not necessary. So this will depend on the machine your sequencing facility uses.
I then move the data into the old SEA VM and use Newbler to assemble and Consed to screen quality, determine strand, and find base 1.

I would not recommend you use the whole fastq file unless it contains only a small number of reads. One of the main jobs of an assembler is to distinguish read errors from valid sequence. If you give the assembler too many reads, it will see the same error multiple times and evaluate that sequence as valid. I typically use 50,000 to 150,000 reads in an assembly (depending on genome size). This gives about 200X coverage for my typical genomes. I use the unix command "head" to get just the number of reads I want from the beginning of the fastq file. Just remember that each read in a fastq file takes up 4 lines. So, as an example, to get the first 50,000 reads (which would be 200,000 lines of data) from my "data.fastq" file and create a new "data_50k.fastq" file, I would execute this on the unix command line:

head -200000 data.fastq > data_50k.fastq
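
If you want to work out the read count for a particular genome rather than guess, here is a minimal Python sketch of the arithmetic (coverage = reads x read length / genome size). The function names and the 50,000 bp genome size are made up for illustration, and it assumes every read is a full-length phage read. Trimmed reads or bacterial contamination mean you will need more reads, and with paired-end data taking the same number of reads from both files doubles the bases.

```python
def reads_for_coverage(genome_size_bp, read_length_bp, target_coverage):
    """Reads needed so that reads * read_length / genome_size >= target_coverage.

    Assumes integer inputs and that every read is a full-length phage read.
    """
    total_bases = genome_size_bp * target_coverage
    # Ceiling division, so we never end up just under the target coverage.
    return -(-total_bases // read_length_bp)


def fastq_lines(n_reads):
    """Each fastq record is 4 lines, so this is the line count to give to head."""
    return 4 * n_reads


if __name__ == "__main__":
    genome_size = 50_000  # hypothetical genome size in bp, not from any real phage
    n_reads = reads_for_coverage(genome_size, read_length_bp=150, target_coverage=200)
    # Print the matching head command for a file called data.fastq.
    print(f"head -{fastq_lines(n_reads)} data.fastq > data_subset.fastq")
```

Treat the result as a starting point only; the best number still has to be found by trying a few assemblies.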

Contamination will only be an issue if the fraction of bacterial reads is high; assemblers are totally fine with low levels of contamination. So try it, and if you get a number of smaller contigs with lower than expected coverage, try again with more reads. The optimal number of reads will need to be determined empirically; just aim for a number of phage reads that gives you 100 - 200X coverage of the phage genome. If the bacterial contamination rate is too high you may, as you suggest, need to try to "fish out" the phage reads from your fastq file before you attempt assembly. But that would require the substantial effort of testing each read, and you likely have several million reads, so you only want to go that route if there is just too much contamination for the assembler to handle.
Posted in: Newbler > Getting Started with Phage Assembly
| posted 28 Jun, 2023 20:06
Not updating for me either. No news since this discussion:
https://seaphages.org/forums/topic/5558/
Posted in: Bioinformatic Tools and Analyses > DNA Master not updating
| posted 16 May, 2023 15:47
I am looking on the server and everything looks good here. I see vordorf gene 5 in this Starterator report for pham 80704, as described in Phamerator: http://phages.wustl.edu/starterator/Pham80704Report.pdf and gene 21 here: http://phages.wustl.edu/starterator/Pham80706Report.pdf

The pham numbers you quote for Starterator are out of sync. Those numbers, 78088 and 78092, are from version 510 of the database. You can actually get older Starterator reports using the version number in the URL, so here are the same genes in the older Starterator reports for version 510 and the pham numbers you mentioned:

http://phages.wustl.edu/510/Pham78088Report.pdf
http://phages.wustl.edu/510/Pham78092Report.pdf
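
For anyone who needs to look up several of these, here is a small Python sketch that builds the report URL. The URL pattern is only inferred from the four links in this post, and the function name is hypothetical; I am not claiming every database version is still kept on the server.

```python
BASE_URL = "http://phages.wustl.edu"


def starterator_report_url(pham, db_version=None):
    """Build a Starterator report URL from a pham number.

    With no db_version, use the current-report folder ("starterator");
    otherwise use the database version number as the folder name.
    The pattern is an assumption based on the example links above.
    """
    folder = "starterator" if db_version is None else str(db_version)
    return f"{BASE_URL}/{folder}/Pham{pham}Report.pdf"


if __name__ == "__main__":
    print(starterator_report_url(80704))       # current database
    print(starterator_report_url(78088, 510))  # older database version 510
```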
Posted in: Starterator > Pham not found in Starterator
| posted 12 May, 2023 20:08
I too think we could call gp98 an HNH. I did an HHPRED search with gp98 against the pfam database, since there is a pfam motif with the label "HNH endonuclease" (PF01844). Looking at the alignment, gp98 would be of the HKH type. So given the definition used by that paper from Deb, gp98 should be annotated as "HNH endonuclease".
Others may want a stricter definition, "only those endonucleases that actually have those exact 3 specific amino acids", and might argue that we should call gp98 an HKH endonuclease or just endonuclease.

There is no right or wrong answer to how a term should be defined, but given Fred's totally valid points and the comments from Deb's paper, I think we should just change the note on the approved terms list to "Has H-N-H within 30 aa span but minor variations allowed, see forum topic 5505" or something similar.
Posted in: Functional Annotation > Clarification Question About HNH Endonuclease Function Determination in view of hits to the Ref Sequences
| posted 05 May, 2023 19:33
As for a simple method for students to use:
I just copied the sequences into Word, then used advanced find and replace to make all the H's red and all the N's green; that took all of 30 seconds. It then took me less than 5 minutes to screen all the proteins by eye, and I was able to find an HNH pattern in all the sequences except 1 (gp98). One had HNNH. Since any amino acid can be between the H's and the N, I would say that having more than 1 N is OK, but maybe not. So if HNNH should be rejected, we need to clarify the simple test, “Must have H-N-H over a 30 aa span.”

Easy enough for students to do.

See attached, with the colors and my underlines for the HNH patterns I found.
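
If someone prefers a script over the Word trick, here is a rough Python sketch of the same screen. It is only my reading of the rule: the 30 aa span is taken to include both H's, any residues are allowed in between (so HNNH counts as a match), and the function name is made up. It checks the letter pattern only and says nothing about structure.

```python
def find_hnh(seq, max_span=30):
    """Find H...N...H patterns that fit within max_span residues.

    Returns a list of (start, end) 1-based positions, one per starting H,
    where end is the nearest H that follows an N. The span counts both H's
    (an assumption about how the 30 aa rule is meant).
    """
    seq = seq.upper()
    hits = []
    for i, aa in enumerate(seq):
        if aa != "H":
            continue
        window = seq[i:i + max_span]
        # First N after the starting H (position 0 of the window).
        n_pos = window.find("N", 1)
        if n_pos == -1:
            continue
        # First H after that N, still inside the window.
        h_pos = window.find("H", n_pos + 1)
        if h_pos != -1:
            hits.append((i + 1, i + h_pos + 1))
    return hits


if __name__ == "__main__":
    # Toy sequences for illustration only, not real phage proteins.
    for name, seq in [("toy_HNH", "MAHAANAAHAK"), ("toy_HKH", "MAHAAKAAHAK")]:
        print(name, find_hnh(seq))
```

An empty list means no H-N-H was found within the span, which is the case where you would want to look at the alignment by hand, as with gp98.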
Posted in: Functional Annotation > Clarification Question About HNH Endonuclease Function Determination in view of hits to the Ref Sequences
| posted 12 Apr, 2023 17:18
#32 has a clear and strong HHPRED match to several different HTH pfams and, as Adam mentioned, HHPRED does predict helix-turn-helix-like secondary structure for the amino acids around 19 - 55. So I would add an annotation of "helix-turn-helix DNA binding domain".

For me, I teach my students that these words we are adding are "annotations", not necessarily "functions". Only rarely is the evidence sufficient to be convinced of the exact biological role of a protein. But that does not mean that we cannot add value by adding annotations that help the readers of our "publication" (i.e. GenBank entries). So adding a "helix-turn-helix DNA binding domain" annotation helps the reader limit the possible roles (and is therefore a good annotation), even if the evidence is not sufficient to support a more informative annotation like "sigma factor".

If a protein has good evidence for an HTH domain, I assume there is a high probability it does indeed bind DNA. But I typically only use the approved term "DNA binding protein" if I find good evidence for a DNA binding motif that is not an HTH, like a zinc finger, leucine zipper or talons. If I see an HTH domain, I prefer "helix-turn-helix DNA binding domain" over "DNA binding", even though a protein with an HTH very likely does bind DNA. I will just cite the evidence (matches HTH domain) and leave the assumption of activity (actually binds DNA) up to the reader.
Posted in: Annotation > helix-turn-helix binding domain or protein?
| posted 07 Apr, 2023 17:35
TL;DR: I see on Pecaan that your current annotation is "DNA Primase/helicase", and I would annotate it that way as well.

Long story:
I ran a full HHPRED search and looked at the overall general structure of this protein. See the attached annotated image. I noted 3 regions in the results, each with a unique set of hits. In the N-terminal region, the consensus of the 4 hits in the green box strongly suggests some kind of primase or polymerase. Since primases are a specialized type of polymerase, I cannot really tell from just those 4 hits what the best annotation would be. The middle section, however, has over 100 hits, and I cut off the image at the top 15 or so. The general consensus of all the hits in the blue box is (as you said) some kind of helicase, which is also supported by the AAA-ATPase hits. The really curious part is the region of the orange box, where the general consensus of the hits is a DNA binding domain. I don't ever remember seeing a DNA binding domain attached to a helicase domain. I did look through the pfam architectures to see if this combination has been seen before, but my 10 minutes of poking did not reveal any examples like this.

So taken together we have a DNA binding domain, a helicase domain, and a primase/polymerase domain. It looks to me like maybe a polymerase specifically designed to start replication at a single location (maybe?). It would take a really deep dive to know for sure whether this combination of domains is truly novel, but in my 30 minutes of poking around I could not find anything quite like it. Hence I have no good annotation, because I could find nothing like this in the published literature to link to. So the practical solution in my mind is to pick the best of the approved terms, which in this case means highlighting the presence of both the primase domain and the helicase domain. Hence I would annotate as you did: DNA primase/helicase

As with many annotations, this is not ideal (one might say "not even good"), but in the end it is the best term we have.
Posted in: Functional Annotation > GG cluster DNA primse/helicase
| posted 16 Mar, 2023 21:08
I would discount the observation that "majority of the functions as hypothetical", because most of those "hypothetical proteins" were probably annotated before the above discussion and thus before the addition of "helicase loader" to the approved terms. So in this case, "absence of evidence is NOT evidence of absence". New crystal structures are constantly being published, and the approved terms list is a living document that changes as we find new functions and fine-tune the nomenclature. This is one of those places where I would tell a student, "This is why we keep doing annotation by hand: we keep learning more and more, and we keep getting better and better at annotation".

Looking at the positive evidence, though, this call is indeed tricky. There are several HHPRED hits suggestive of helicase loader that all have really high probability but only about 40-50% coverage. So this is where reasonable annotators can disagree. Looking at the crystal data here, I can see that the part that does not align is "disordered". One could use that to argue that a strictly similar structure in this region is not required for function (as this region is not highly structured in the crystallized protein), and thus that the lack of a structural match there is not good evidence against this new protein being another example of a helicase loader. Bottom line, the fact that this region is disordered means that I discount the evidence that HHPRED is not matching them (i.e. it weakens the negative result). I never like adding annotations based on just a single piece of weak evidence, even if I can make a handwavy argument for why it is weak, so I would want more evidence. My own sense would be to look for synteny evidence to strengthen the call for a helicase loader. Since genes for proteins that interact are often found near each other in phage genomes, you might find some positive results that give you more confidence you have a helicase loader. Is this gene near other genes that look to be part of a replisome, or near some type of helicase? If you find a nearby helicase, then you have found additional evidence. Synteny is never strong evidence, but combining two pieces of weak evidence (synteny and partial HHPRED) can sometimes provide sufficient evidence and give you the confidence to "make the call".
Posted in: Request a new function on the SEA-PHAGES official list > phage helicase loader protein
| posted 03 Mar, 2023 16:45
Unfortunately, being an "Orpham" means that the protein has no other similar proteins and is thus placed in its own pham group. Starterator uses evolutionary conservation to gather evidence for start codons by comparing the related genes within a pham. For orphams, however, with only one member in the pham, there is nothing to compare, so there is nothing to report. Start codon choice will just have to proceed without evidence from comparative evolution and rely on whatever other evidence is available.
Posted in: Starterator > Pham not found in Starterator
| posted 02 Mar, 2023 17:11
OK
DeepTMHMM is working for me again, but I had to create an account. It was still failing when I tried to use DeepTMHMM as a guest. I used my GitHub account to sign in through OAuth, but it looks like you might be able to just create an account de novo with an email address.
Posted in: Functional Annotation > Deep TMHMM?