SEA-PHAGES | Getting Started with Phage Assembly

Link to this post \| posted 05 Sep, 2023 18:50
cdshaffer

How to handle R1 and R2 depends on your exact sequencer, the quality of the reads, the library prep method, and the read length. There are enough variables here that you will just need to do trial and error and see what works for you. Lately, I have been using 150 bp reads, and the R1 reads have been of such high quality that I get good assemblies by just using the correct number of R1 reads. I would suggest you try this simple solution first and see if you get a good assembly. If you do, great; if not, then more work prepping the reads prior to assembly is worth trying (see the next paragraph). The other problem is using the correct number of reads; see farther below.

On older machines, with reads with higher error rates, I would run the program "pear" to merge the R1 and R2 reads into a longer, higher quality "read" that would improve assembly, but this requires a library prep protocol with shorter 300-500 bp DNA fragments and longer reads. I used to do this when running 250-300 bp reads. I used pear because it was easy to install on my old Intel Mac. Not sure what I would use now that I am on a newer Mac, or if I had a PC.

As for the issue of a small number of large contigs and hundreds of smaller ones, this is exactly the result you will get if you use too many reads. See my comments above on error and why too many reads can be a "Bad Thing". I would recommend you try the "head …" command, where you extract a smaller number of reads, and try assembling those. That is not really a step you can skip if you want a nice clean assembly. If you did reduce the number of reads and you are still getting this result, it may be that you have either contamination or too few reads. How to get evidence on this question is in the next paragraph.

Have you looked at your contigs? Do they look like phage genomes by BLAST, or like contaminants? For the large contigs, how many reads are in the contig and how long is the contig? More specific details here would help. Note that the Newbler assembler creates a file called "454LargeContigs.fna" that has the sequences of all the "large" contigs. You can open this file with a text editor and copy out sections of sequence to use in BLAST searches, to see if the contig is likely phage sequence or some contaminant. If you get phage hits, you can likely estimate the size of your genome, which helps you pick the correct number of reads.

See if all that advice solves your issue. If not, post specific answers to as many of the above questions as you can and we can go from there.
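
For anyone who would rather not scroll through 454LargeContigs.fna in a text editor, here is a minimal sketch of the same check from the command line. It assumes only that the file is ordinary FASTA (header lines starting with ">"); the function names `contig_lengths` and `first_contigs` are made up for this example and are not part of Newbler.

```shell
# List each contig's header and its sequence length, tab-separated.
# Lengths are counted from the sequence lines themselves, so this does not
# rely on how the assembler formats its header lines.
contig_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($0, 2); len = 0
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Print the first N contigs in full, ready to paste into a BLAST search box.
first_contigs() {
    awk -v n="$2" '/^>/ { seen++ } seen > n { exit } { print }' "$1"
}
```

Something like `contig_lengths 454LargeContigs.fna | sort -t "$(printf '\t')" -k2,2nr | head -n 5` should then show the five longest contigs, and `first_contigs 454LargeContigs.fna 1` prints the first contig for BLAST.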

Link to this post \| posted 05 Sep, 2023 20:58
jcaoyao@gmail.com

I am very grateful, cdshaffer. My genomes were also sequenced in 2x150 paired-end fashion. I did assemble using the R1 and R2 reads individually, which did not come out well. Since I have a PC, I guess pear won't work either. The messy results I got were indeed after downsampling using the 'head -n 20000' command. I checked my 454LargeContigs.fna file: the longest contig is 27,916 bp and contains 6,441 reads. In BLASTn, a long list of hits came up, and all were phage genomes. The top hit was phage Zakhe101, whose genome is 69,653 bp.

Link to this post \| posted 07 Sep, 2023 17:12
cdshaffer

So your example large contig has about 35X coverage (6,441 reads times 150 bp per read, divided by ~28,000 bp). 35X is a bit low for Illumina. The recommended minimum for Illumina is 50X, but for these tiny genomes, since sequencing is so cheap, I typically go for 200-300X.

For a 70,000 bp genome and 150 bp reads I would probably use 100,000 to 150,000 reads. So adjust your "head" command to extract more reads and try another assembly. I would just work with the R1 reads; they tend to be better quality than the R2 reads. R2 reads are really good for mapping reads to large complex genomes, but for de novo assembly I stick with the R1 reads. Since each sequence takes up 4 lines, you want somewhere between 400,000 and 600,000 lines of your fastq file to get the 100,000 to 150,000 reads. So instead of the 20000 in your example command, use 500000. That would give an estimated coverage of about 268X for a 70 kb genome.
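
The arithmetic above can be sketched as a few small shell functions. This is only an illustration under the same assumptions as the post (single-end R1 reads of uniform length, four FASTQ lines per read); the function names are invented for this example.

```shell
# Estimated coverage = reads * read_length / genome_length, rounded to nearest.
estimate_coverage() {
    local reads=$1 read_len=$2 genome_len=$3
    echo $(( (reads * read_len + genome_len / 2) / genome_len ))
}

# Number of reads needed to reach a target coverage, rounded up.
reads_for_coverage() {
    local target=$1 read_len=$2 genome_len=$3
    echo $(( (target * genome_len + read_len - 1) / read_len ))
}

# Keep the first N reads of a FASTQ file (4 lines per read), written to stdout.
subsample_fastq() {
    local fastq=$1 reads=$2
    head -n "$(( reads * 4 ))" "$fastq"
}

estimate_coverage 6441 150 27916     # the large contig discussed above
estimate_coverage 125000 150 70000   # 500,000 FASTQ lines on a 70 kb genome
reads_for_coverage 200 150 70000     # reads needed for 200X
```

With a real file, `subsample_fastq my_R1.fastq 125000 > subset.fastq` does the same job as `head -n 500000`; `my_R1.fastq` is a placeholder for your own file name.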

Link to this post \| posted 08 Sep, 2023 11:33
jcaoyao@gmail.com

Thank you very much, Sir. Would you know why assembling, for example, 200,000 lines using that command takes Newbler forever to run? On most of my genomes it simply gets stuck after hours of waiting, and I have had to shut down the program. Should my Newbler be upgraded?

Link to this post \| posted 11 Sep, 2023 22:48
cdshaffer

Unfortunately Newbler is no longer being developed.

For me, slow assembly times mostly happen when I don't have enough memory. This can easily slow things down by many orders of magnitude. So start by trying to increase the memory available to the VM if you can. I give my machines 4 or 6 GB if I can. Ask Google, or post another query here, if you need help with that.

If you have given the VM the maximum memory you can and it is still really slow, then try fewer reads. Most of my 100X coverage genomes assemble just fine, so you could easily reduce your read count by half and still very likely get a good assembly. If that fails, try 50X (i.e. reduce the read count by a factor of 4). It is really just trial and error: you need enough data to get a good assembly, but not so much that you overload the memory available on your machine. This is why, when I last assembled a whole genome for a Drosophila species (180 Mb haploid genome), I used a campus computer with 128 GB of available memory. That made the assembly take 4-6 hours instead of the 4-6 months it would have taken on my laptop.
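
The "halve the reads and try again" approach can be prepared in one go. The sketch below writes a series of progressively smaller subsets so each retry has less data to hold in memory. The function name and the output naming scheme (`<prefix>_<reads>.fastq`) are arbitrary choices for this example, and the memory check assumes a Linux guest where the `free` command is available.

```shell
# Write progressively smaller subsets of a FASTQ file, halving the read
# count at each step, and print the name of every file created.
make_read_ladder() {
    local fastq=$1 start_reads=$2 steps=$3 prefix=$4
    local reads=$start_reads i=0
    while [ "$i" -lt "$steps" ]; do
        # 4 FASTQ lines per read
        head -n "$(( reads * 4 ))" "$fastq" > "${prefix}_${reads}.fastq"
        echo "${prefix}_${reads}.fastq"
        reads=$(( reads / 2 ))
        i=$(( i + 1 ))
    done
}

# Show how much memory the VM currently sees, if `free` is installed.
if command -v free >/dev/null 2>&1; then
    free -h
fi
```

For example, `make_read_ladder my_R1.fastq 125000 3 subset` would write subsets of 125,000, 62,500 and 31,250 reads (`my_R1.fastq` being a placeholder), which you can then assemble from largest to smallest until one finishes in reasonable time.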

Link to this post \| posted 17 Sep, 2023 17:02
jcaoyao@gmail.com

Thank you a million for all your support.

Link to this post \| posted 18 Sep, 2023 09:53
jcaoyao@gmail.com

You're right, I would like to know how to increase the memory available. Are you talking about RAM? What specifications should the ideal computer have for this sort of job?

Edited 18 Sep, 2023 09:55

Link to this post \| posted 18 Sep, 2023 16:19
cdshaffer

Yes, RAM. How are you running Newbler? I run it using the old SEA VM in VirtualBox with an older Intel Mac host. For that setup I go to Machine -> Settings -> System -> Motherboard; on that page is a slider called "Base memory". My iMac has 16 GB, so VirtualBox allows me up to about 12 GB. I have it set to 10 GB, which is plenty. I would recommend setting it to 8 GB, or as high as your computer setup allows. Then try the assembly again. If it still takes too long, reduce the number of reads and try again. If you have a different setup, post the description.

Link to this post \| posted 18 Sep, 2023 19:15
jcaoyao@gmail.com

Thank you, Professor. I am running Newbler using the 2017 SEA VM in VirtualBox on my HP laptop with Windows 11 Home. I went to the same place you said, Machine -> Settings -> System -> Motherboard, but everything there is greyed out, so neither the slider nor anything else can be adjusted. Is there a different way?

Link to this post \| posted 18 Sep, 2023 20:10
cdshaffer

You can only change those settings while the machine is OFF. So go to the SEA VM and shut it down. Once the VM is off, you should see a green area on the slider, which marks the permissible settings. Once you have made and saved the changes, you can boot up the SEA VM again. I would also recommend you shut down most other programs running on the host while you work on assembly (email clients, web browsers, Word, etc.). This will give your machine as much free memory as possible to work with.

Edited 18 Sep, 2023 20:11
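
If the settings dialog is awkward to use, VirtualBox also ships a command-line tool, `VBoxManage`, that can change the same base memory setting from the host. The sketch below is an assumption-laden illustration rather than a tested recipe for the SEA VM: the VM name "SEA VM" is a placeholder (`VBoxManage list vms` shows the real names), the wrapper function and its `DRY_RUN` switch are invented for this example, and on a Windows host it needs a Unix-style shell with `VBoxManage` on the PATH. As with the slider, the VM must be powered off first.

```shell
# Set a powered-off VirtualBox VM's base memory (in MB) from the host.
# With DRY_RUN=1 (the default in this sketch) the command is only printed.
set_vm_memory() {
    local vm_name=$1 memory_mb=$2
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "VBoxManage modifyvm \"$vm_name\" --memory $memory_mb"
    else
        VBoxManage modifyvm "$vm_name" --memory "$memory_mb"
    fi
}

# "SEA VM" is a placeholder name; 8192 MB corresponds to the 8 GB suggested above.
set_vm_memory "SEA VM" 8192
```

Running it with `DRY_RUN=0` applies the change for real; check the printed command first.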
