
The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.


All posts created by cdshaffer

| posted 11 Sep, 2023 22:48
Unfortunately Newbler is no longer being developed.

For me, slow assembly times mostly happen when I don't have enough memory; that can easily slow things down by orders of magnitude. So start by trying to increase the memory available to the VM if you can. I give my machines 4 or 6 GB if I can. Ask Google, or post another query here if you need help with that.

If you have given the VM the maximum memory you can and it is still really slow, then try fewer reads. Most of my 100X coverage genomes assemble just fine, so you could easily reduce your read count by half and still very likely get a good assembly. If that fails, try 50X (i.e. reduce the read count by a factor of 4). It is really just trial and error: you need enough data to get a good assembly, but not so much data that you overload the memory available on your machine. This is why, when I last assembled a whole genome of a Drosophila species (~180 Mb haploid genome), I used a campus computer with 128 GB of available memory; that made the assembly take 4-6 hours instead of the 4-6 months it would have taken on my laptop.
Posted in: Newbler / Getting Started with Phage Assembly
| posted 07 Sep, 2023 17:12
So your example large contig has ~35X coverage (6,441 reads × 150 bp per read ÷ ~28,000 bp). 35X is a bit low for Illumina; the recommended minimum is 50X, but for these tiny genomes, since sequencing is so cheap, I typically go for 200-300X.

For a 70,000 bp genome and 150 bp reads I would probably use 100,000 to 150,000 reads. So adjust your "head" command to extract more reads and try another assembly. I would just work with the R1 reads; they tend to be better quality than the R2 reads. R2 reads are really useful for mapping reads to large complex genomes, but for de novo assembly I stick with the R1 reads. Since each sequence takes up 4 lines, you want somewhere between 400,000 and 600,000 lines of your fastq file to get the 100,000 to 150,000 reads. So instead of the 20,000 in your example command, use 500,000. That would give an estimated coverage of ~268X for a 70 kb genome.
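If it helps, the arithmetic above is easy to script. This is just a sketch of the calculations in this thread (the function names are mine; swap in your own genome size and read length):

```python
# Sketch of the coverage arithmetic from this thread. Coverage = total
# sequenced bases / genome size; a fastq record is 4 lines per read.

def coverage(n_reads, read_len, genome_size):
    """Estimated fold coverage for n_reads reads of read_len bp each."""
    return n_reads * read_len / genome_size

def fastq_lines_for_coverage(target_cov, read_len, genome_size):
    """Number of fastq lines to extract (e.g. with head) for a target coverage."""
    n_reads = target_cov * genome_size / read_len
    return 4 * round(n_reads)

# The large contig discussed above: 6,441 reads of 150 bp over ~28,000 bp
print(round(coverage(6441, 150, 28000)))          # 35

# 500,000 fastq lines = 125,000 reads; 150 bp reads on a 70,000 bp genome
print(round(coverage(500000 // 4, 150, 70000)))   # 268
```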
Posted in: Newbler / Getting Started with Phage Assembly
| posted 05 Sep, 2023 18:50
How to handle R1 and R2 reads depends on your exact sequencer, the quality of the reads, the library prep method, and the read length. There are enough variables here that you will just need to do trial and error and see what works for you. Lately I have been using 150 bp reads, and the R1 reads have been of such high quality that I get good assemblies by just using the correct number of R1 reads. I would suggest you try this simple approach first and see if you get a good assembly. If you do, great; if not, then more work prepping the reads prior to assembly is worth trying (see the next paragraph). The other issue is using the correct number of reads; see farther below.

On older machines, whose reads had higher error rates, I would run the program "pear" to merge the R1 and R2 reads into a single longer, higher quality "read" that would improve assembly, but this requires a library prep protocol with shorter 300-500 bp DNA fragments and longer reads. I used to do this when running 250-300 bp reads. I used pear because it was easy to install on my old Intel Mac. Not sure what I would use now that I am on a newer Mac, or if I had a PC.

As for the issue of a small number of large contigs and hundreds of smaller ones, this is exactly the result you will get if you use too many reads. See my comments above on error and why too many reads can be a "Bad Thing". I would recommend you try the "head …" command, where you extract a smaller number of reads and try assembling those. That is not really a step you can skip if you want a nice clean assembly. If you did reduce the number of reads and you are still getting this result, you may have either contamination or too few reads. The next paragraph covers how to get evidence on this question.

Have you looked at your contigs? Do they look like phage genomes by BLAST, or like contaminants? For the large contigs, how many reads are in each contig and how long is it? More specific details here would help. Note that the newbler assembler creates a file called "454LargeContigs.fna" containing the sequences of all the "large" contigs. You can open this file with a text editor and copy out sections of sequence to use in BLAST searches to see if a contig is likely phage sequence or some other contaminant. If you get phage hits, you can likely estimate the size of your genome to help you pick the correct number of reads.
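If you would rather not eyeball the file, a few lines of Python can list each contig's name and length (this is plain FASTA parsing, a sketch I use as an example; the filename is the standard newbler output):

```python
# Summarize a newbler 454LargeContigs.fna file: collect each contig's
# name and length. Plain FASTA parsing, no external libraries.

def contig_lengths(path):
    """Return {contig_name: length_in_bp} for a FASTA file."""
    lengths = {}
    name = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]   # name = header up to first space
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths

# for name, length in contig_lengths("454LargeContigs.fna").items():
#     print(name, length)
```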

See if all that advice solves your issue; if not, post specific answers to as many of the above questions as you can and we can go from there.
Posted in: Newbler / Getting Started with Phage Assembly
| posted 21 Jul, 2023 19:01
The password discussed above is the password for the MySQL database; it is not the password for running sudo commands. For sudo, you use the password you used to log in. But there is a more important issue.

Are you logged in as SEA student or SEA faculty? The SEA student account does not have permission to run sudo commands, so the seastudent password will fail. If this is the problem, you should get an error message about not being in "sudoers". You have to do all this by logging into the SEA faculty account, as that account has permission to run the sudo command. If you are on the seafaculty account and it is still failing, there is some unusual issue with your setup; in that case, the next step is probably to post the exact steps you are on and the exact error message.

Note: The password discussed above is the password for the MySQL database, not the sudo password. You can use mysql with the password "phage" from the seastudent account; you just cannot run the sudo command with the seastudent login password.
Edited 21 Jul, 2023 19:04
Posted in: Phamerator / Install Guest Additions to VM ---- Without "SEAFaculty Login ability"
| posted 17 Jul, 2023 18:38
I use fastqc to check, but the last few years the data has been so good I would probably not bother if I had to set up a new pipeline.

I use trimmomatic to trim, mostly because it was an easy install using brew (a Mac package manager). But, as Dan said above, the most recent machines are so good at short reads (I am currently using 2x150) that it is probably not necessary. So this will depend on the machine your sequencing center uses.
I then move the data into the old SEA VM and use newbler to assemble and consed to screen quality, determine strand, and find base 1.

I would not recommend you use the whole fastq file unless it contains a tiny number of reads. One of the main jobs of an assembler is to distinguish read errors from valid sequence. If you give the assembler too many reads, it will see the same error multiple times and evaluate that sequence as valid. I typically use 50,000 to 150,000 reads in an assembly (depending on genome size); this gives about 200X coverage for my typical genomes. I use the unix command "head" to get just the number of reads I want from the beginning of the fastq file. Just remember that each read in a fastq file takes up 4 lines. So, as an example, to get the first 50,000 reads (which would be 200,000 lines of data) from my "data.fastq" file and create a new "data_50k.fastq" file, I would execute this on the unix command line:

head -n 200000 data.fastq > data_50k.fastq
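If you are on a system without the unix "head" command, the same subsetting can be done with a few lines of Python (a sketch; the function name is mine):

```python
# Python equivalent of using head to take the first reads of a fastq
# file, remembering that each fastq record is exactly 4 lines.

def take_reads(src_path, dst_path, n_reads):
    """Copy the first n_reads fastq records from src_path into dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for _ in range(4 * n_reads):
            line = src.readline()
            if not line:          # the file had fewer reads than requested
                break
            dst.write(line)

# take_reads("data.fastq", "data_50k.fastq", 50000)
```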

Contamination will only be an issue if the fraction of bacterial reads is high; assemblers are totally fine with low levels of contamination. So try an assembly, and if you get a number of smaller contigs with lower-than-expected coverage, try again with more reads. The optimal number of reads will need to be determined empirically; just try to get the number of phage reads to where you have 100-200X coverage of the phage genome in your sequence data. If the bacterial contamination rate is too high you may, as you suggest, need to "fish out" the phage reads from your fastq file before you attempt assembly, but that would require the substantial effort of testing each read, and you likely have several million reads, so you only want to go that route if there is just too much contamination for the assembler to handle.
Posted in: Newbler / Getting Started with Phage Assembly
| posted 28 Jun, 2023 20:06
Not updating for me either. No news since this discussion:
https://seaphages.org/forums/topic/5558/
Posted in: Bioinformatic Tools and Analyses / DNA Master not updating
| posted 16 May, 2023 15:47
I am looking on the server and everything looks good here. I see vordorf gene 5 in this starterator report for pham 80704, as described in Phamerator: http://phages.wustl.edu/starterator/Pham80704Report.pdf, and gene 21 here: http://phages.wustl.edu/starterator/Pham80706Report.pdf

The pham numbers you quote for starterator are out of sync. Those numbers, 78088 and 78092, are from version 510 of the database. You can actually get older starterator reports by using the version number in the URL, so here are the same genes in the older starterator reports for version 510 and the pham numbers you mentioned:

http://phages.wustl.edu/510/Pham78088Report.pdf
http://phages.wustl.edu/510/Pham78092Report.pdf
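The URL pattern is regular enough to build programmatically; here is a small sketch based on the links above (the pattern is just what these URLs show, not a documented API, and the function name is mine):

```python
# Build starterator report URLs following the pattern of the links in
# this post: current reports live under /starterator/, archived ones
# under /<database version>/.

def starterator_url(pham, version=None):
    base = "http://phages.wustl.edu"
    if version is None:
        return f"{base}/starterator/Pham{pham}Report.pdf"
    return f"{base}/{version}/Pham{pham}Report.pdf"

print(starterator_url(78088, version=510))
# http://phages.wustl.edu/510/Pham78088Report.pdf
```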
Posted in: Starterator / Pham not found in Starterator
| posted 12 May, 2023 20:08
I too think we could call gp98 an HNH. I did an HHPRED search with gp98 against the pfam database, since there is a pfam motif with the label "HNH endonuclease" (PF01844). In this case, looking at the alignment, gp98 would be of the HKH type. So given the definition used by that paper from Deb, gp98 should be annotated as "HNH endonuclease".
Others may want a stricter definition of "only those endonucleases that actually have those exact 3 specific amino acids" and might argue that we should call gp98 an HKH endonuclease, or just endonuclease.

There is no right or wrong answer to how a term should be defined, but given Fred's totally valid points and the comments from Deb's paper, I think we should just change the note on the approved terms list to "Has H-N-H within 30 aa span but minor variations allowed, see forum topic 5505" or something similar.
Posted in: Functional Annotation / Clarification Question About HNH Endonuclease Function Determination in view of hits to the Ref Sequences
| posted 05 May, 2023 19:33
As for a simple method for students to use:
I just copied the sequences into Word, then used advanced find and replace to make all the H's a red font style and the N's green; that took all of 30 seconds. It then took me less than 5 minutes to screen all the proteins by eye, and I was able to find an HNH pattern in all the sequences except one (gp98). One had HNNH; since any amino acid can be between the H's and the N, I would say that having more than 1 N is OK, but maybe not. If HNNH should be rejected, we need to clarify the simple test that there "Must have H-N-H over a 30 aa span."

Easy enough for students to do.

See attached, with the colors and my underlines for the HNH patterns I found.
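The same screen could also be scripted. Here is a sketch in Python of the rule as I read it (an H, then an N, then an H, with the whole match spanning at most 30 residues; whether HNNH should count is exactly the open question above, and this sketch accepts it):

```python
# Find an H-N-H pattern within a 30 amino acid span: an H, followed by
# an N, followed by an H, where the whole match covers at most max_span
# residues. Returns (start, end) indices of the first match, else None.

def find_hnh(seq, max_span=30):
    seq = seq.upper()
    for i, aa in enumerate(seq):
        if aa != "H":
            continue
        window = seq[i:i + max_span]         # the match must fit in here
        n_pos = window.find("N", 1)          # an N strictly after the first H
        if n_pos == -1:
            continue
        h_pos = window.find("H", n_pos + 1)  # an H after that N
        if h_pos != -1:
            return (i, i + h_pos)
    return None

print(find_hnh("GSHACNLMHKL"))   # (2, 8)
print(find_hnh("HNNH"))          # (0, 3) -- the debatable HNNH case
```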
Posted in: Functional Annotation / Clarification Question About HNH Endonuclease Function Determination in view of hits to the Ref Sequences
| posted 12 Apr, 2023 17:18
#32 has a clear and strong HHPRED match to several different HTH pfams, and as Adam mentioned, HHPRED does predict helix-turn-helix-like secondary structure for the amino acids around 19-55. So I would add an annotation of "helix-turn-helix DNA binding domain".

For me, I teach my students that these words we are adding are "annotations", not necessarily "functions". Only rarely is the evidence sufficient to be convinced of the exact biological role of a protein. But that does not mean that we cannot add value by adding annotations that help the readers of our "publication" (i.e. GenBank entries). So adding a "helix-turn-helix DNA binding domain" annotation helps the reader limit the possible roles (and is therefore a good annotation) even if the evidence is not sufficient to generate a better, more informative annotation like "sigma factor" etc.

If a protein has good evidence for an HTH domain, I assume there is a high probability it does indeed bind DNA, but I typically only use the approved term "DNA binding protein" if I find good evidence for a DNA binding motif that is not an HTH, like a zinc finger, leucine zipper, or talons. If I see an HTH domain, I prefer "helix-turn-helix DNA binding domain" over "DNA binding protein": even though it very likely does bind DNA if it has an HTH, I will just cite the evidence (matches an HTH domain) and leave the assumption of activity (actually binds DNA) up to the reader.
Posted in: Annotation / helix-turn-helix binding domain or protein?