SEA-PHAGES | Getting Started with Phage Assembly

Link to this post | posted 11 Nov, 2020 05:16

kmaclea

So, I have a fair bit of experience assembling bacterial genomes on a basic level. Let's say I want to do my own phage genome sequencing, and then assembly.

Obviously, to get the best quality control and the benefit of all your experience, getting your eyeballs on the assembly is probably critical.

But is there a process written anywhere with the typical Newbler or other parameters you use for assembly? How much is customized on each assembly run?

Is there a pathway where we could sequence and assemble on our own, then ask you to review our assembly and tell us whether you concur with it? Or would we be totally on our own?

Thank you as always!

Kyle

–
Kyle MacLea
Associate Professor, University of New Hampshire at Manchester
kyle.maclea@unh.edu +1 603-641-4129

Link to this post | posted 11 Nov, 2020 15:51

DanRussell

Hi Kyle,

Good questions! There are a few resources that might be helpful here. One is that I wrote a small software package that helps streamline some of the assembly/QC process for phage genomes. It's called phageAssembler and is on GitHub.

https://github.com/SEA-PHAGES/phageAssembler

It's only really meant to be installed on the 2017 SEA Virtual Machine. (I didn't really spend the time to make it thoroughly cross-platform.) But it should work there if you follow the Quick Start instructions. Because Newbler and consed are already installed on the SEA VM, it can use those installations and basically does the following:

INPUT: fastq file
1. Downsample reads from your fastq file to get a workable number (default 80,000)
2. Assemble those reads with Newbler
3. Report #s of contigs & sizes
4. BLAST large contigs against a phage database and report possible cluster
5. Attempt to locate base 1 by similarity to genomes in the database
6. Report coverage and GC% of assembled contigs
7. Run AceUtil to search and tag assembly weak areas
8. Create consed-ready file for review
9. Write findings to a log file
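
Steps 3 and 6 in the list above (contig sizes and GC%) can be approximated by hand with standard tools. The following is only a minimal sketch, not phageAssembler's own code: the function name `contig_stats` and the file name `contigs.fna` are hypothetical, and it assumes the contigs are in a plain FASTA file.

```shell
# Hypothetical helper (not part of phageAssembler): print the name, length,
# and GC% of every record in a FASTA file, one tab-separated line per contig.
contig_stats() {
  awk '
    # Print the stats for the record collected so far, if there is one.
    function report() {
      if (name != "") printf "%s\t%d\t%.1f\n", name, len, (len ? 100 * gc / len : 0)
    }
    # Header line: flush the previous record and start a new one.
    /^>/ { report(); name = substr($1, 2); len = 0; gc = 0; next }
    # Sequence line: add its length and count its G and C bases.
    {
      seq = toupper($0)
      len += length(seq)
      gc += gsub(/[GC]/, "", seq)
    }
    END { report() }
  ' "$1"
}
```

Calling `contig_stats contigs.fna` would print one line per contig, which makes it easy to spot a single large contig of roughly phage-genome size.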

You can certainly do all those steps independently if you'd like to learn the process, but this script kind of gets you to the actual analysis part, skipping a lot of the need to learn command-line stuff for many different programs.

The second resource is a chapter I wrote that details the whole process:
https://pubmed.ncbi.nlm.nih.gov/29134591/

(If you can't get access, I can share the manuscript.) It's a more general look at what things you need to think about when sequencing and finishing phage genomes.

Finally, there are some video tutorials I made that walk through some of the assembly/finishing process. These are a bit old and potentially outdated, but probably still have some useful info if you want to do more of the steps yourself.
https://phagesdb.org/workflow/Sequencing/

And also, if you do sequence/assemble your own, we would definitely like to double-check them and include them in PhagesDB. To do so, we'd need your final sequence file and the sequencing reads.

Hope that helps!
–Dan

Edited 11 Nov, 2020 16:00

Link to this post | posted 11 Nov, 2020 17:08

kmaclea

Dear Dan,

That is exactly what I needed to get started on this. I really appreciate all the links and tips. I will plan to try this out and then of course bounce this off of you for corrections and helpful advice–BIG help to me, thank you.

All best,
Kyle

–
Kyle MacLea
Associate Professor, University of New Hampshire at Manchester
kyle.maclea@unh.edu +1 603-641-4129

Link to this post | posted 26 Mar, 2021 02:51

kmaclea

DanRussell
Hi Kyle,

INPUT: fastq file
1. Downsample reads from your fastq file to get a workable number (default 80,000)
2. Assemble those reads with Newbler
3. Report #s of contigs & sizes
4. BLAST large contigs against a phage database and report possible cluster
5. Attempt to locate base 1 by similarity to genomes in the database
6. Report coverage and GC% of assembled contigs
7. Run AceUtil to search and tag assembly weak areas
8. Create consed-ready file for review
9. Write findings to a log file
–Dan

So apparently the data files we have are too large (the fastq.gz files are 4.2-4.5 GB each), so we can't upload them to the virtual machine. Is there a way to downsample them ourselves prior to upload? I have the data on another Linux-based server already, so if the software to downsample were available there, I could do the downsampling before trying to upload to the virtual machine.

Or, alternatively, is there another way we can transfer the data onto the VM so the size is not a problem? I figure such huge files are probably not something you want on there, though.

Kyle

–
Kyle MacLea
Associate Professor, University of New Hampshire at Manchester
kyle.maclea@unh.edu +1 603-641-4129

Link to this post | posted 26 Mar, 2021 13:09

DanRussell

Hi Kyle,

Yes, the way I most commonly "downsample" is by using a simple "head" command, which should be available on almost any Unix/Linux system. So if PhageReads.fastq is your big file of all reads, you'd do something like the following:

`head -n 400000 PhageReads.fastq > 100k_PhageReads.fastq`

You can play with the exact number. "head" just gives you the first "n" lines of the file. Each read in a fastq file is stored in 4 lines, and so when I ask for 400,000 lines, I'm getting 100,000 reads. If you use 1,000,000 in the command, you'll get 250,000 reads.
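
The lines-to-reads arithmetic above can be checked on any uncompressed FASTQ file. This is a minimal sketch; the function name `count_reads` is hypothetical, and it assumes every read occupies exactly 4 lines (no wrapped sequence lines).

```shell
# Hypothetical helper: estimate the number of reads in an uncompressed FASTQ
# file, assuming exactly 4 lines per read (header, sequence, "+", qualities).
count_reads() {
  lines=$(wc -l < "$1")
  echo $(( lines / 4 ))
}
```

For example, `count_reads 100k_PhageReads.fastq` should report 100000 for a file made with `head -n 400000`, provided the original file held at least that many reads.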

The ">" tells it to store the output in a new file, which you can name whatever you want. Then you can use that new file to move through the assembly process.
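
Since the files in question are gzipped (fastq.gz), the same idea can be applied without unpacking the whole file first. The sketch below is an assumption-laden variant of the command above, not a tested part of any SEA-PHAGES workflow: the function name `downsample_gz` and its argument order are hypothetical.

```shell
# Hypothetical helper: write the first N reads of a gzipped FASTQ file to a
# new, uncompressed file.
# Usage: downsample_gz IN.fastq.gz N_READS OUT.fastq
downsample_gz() {
  in_gz=$1
  n_reads=$2
  out_fastq=$3
  # Each read is 4 lines, so N reads is N * 4 lines. head stops reading once
  # it has enough lines, so the rest of the archive is never decompressed.
  gzip -dc "$in_gz" | head -n $(( n_reads * 4 )) > "$out_fastq"
}
```

For instance, `downsample_gz PhageReads.fastq.gz 100000 100k_PhageReads.fastq` would produce a file small enough to move to the VM. Like the plain `head` command, this keeps the first reads in the file rather than a random sample.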

Good luck!
–Dan

Link to this post | posted 26 Mar, 2021 16:55

kmaclea

Perfect! And I have even used the head command before. THANK YOU!

Kyle

–
Kyle MacLea
Associate Professor, University of New Hampshire at Manchester
kyle.maclea@unh.edu +1 603-641-4129

Link to this post | posted 12 Aug, 2021 19:11

byrumc@cofc.edu

Quick question… During assembly of the genomes, is there a program used to trim the reads, or is there no need to trim the 150-bp single-end reads before looking at them in Newbler? Thanks!

Christine

Link to this post | posted 12 Aug, 2021 19:19

DanRussell

byrumc@cofc.edu
Quick question…During assembly of the genomes, is there a program used to trim the reads or is there no need to trim the 150-bp single end reads before looking at them in Newbler? Thanks!

Christine

Hi Christine,

We usually don't bother trimming the reads when doing phage assembly since it is usually fairly straightforward. So raw reads are totally fine.

We do use skewer to trim reads for our bacterial assemblies, partly because those are often 300-base reads, which more often have lower quality toward their ends.

https://github.com/relipmoc/skewer

–Dan

Link to this post | posted 12 Aug, 2021 20:05

byrumc@cofc.edu

Thanks, Dan! That helps!

Christine
