SEA-PHAGES | Getting Started with Phage Assembly

Link to this post \| posted 12 Aug, 2021 20:05
byrumc@cofc.edu	Thanks, Dan! That helps! Christine

Link to this post \| posted 07 Aug, 2022 17:41
stpage	Hi, I have a (circular) genome that was assembled at NCSU and its closest hit is a phage that was annotated in 2010 at Wellcome Trust Sanger back in 2010 (NC_015296.1). So, no UPitt assembly QC on either, I'm afraid. Hopefully, someone will help me wrap my head around this. The strand is completely opposite between the two and there is a relative inversion in the middle. That is, all my forward genes are listed as complement in the other phage and vice versa. But, (with gp210 approximately the ends of both) the gp numbers are partially inverted gp1+strand=gp64-strand, gp2+strand=63-strand….. gp61-=gp1+; gp 62-=gp210+, gp 63=209….etc). So, it is like there are two giant crossovers if you map them in a phamerator fashion. I know that BLASTN searches also the reverse complement, but all of these genes are listed as Q1:S1 essentially, not reverse. So, I think even if I assume that my genome should have gp 61 as gp1, there is still an inversion relative to each other. Otherwise, I would have gp1=gp210, gp2=gp209…. Finally, there are other phages (published more recently but less sequence conservation) that have the same strand but whose halves are flipped relative to mine. gp1+=gp142+, gp27-=gp168-, gp32-=gp172- So, my current hypothesis is that phage A may be in GenBank as reverse complement but that doesn't matter. Coding is strand agnostic, so I think if I had gp1=gp210, gp2=gp209…. for the whole genome it would not matter, right? Q1: Is it possible that my assembly should be inverted in some way? "Complementing the contig" for the whole genome? Even if I were to base it on the reverse complement, it would still have an inversion compared to the other. Q2: Or, should I start with gp1 in the middle and flip the halves? Q3: Alternatively, am I thinking about this whole thing stupidly and there really is no consensus on which strand or where to start gp1? Thanks.

Link to this post | posted 07 Aug, 2022 17:41

stpage

Hi, I have a (circular) genome that was assembled at NCSU and its closest hit is a phage that was annotated in 2010 at Wellcome Trust Sanger back in 2010 (NC_015296.1). So, no UPitt assembly QC on either, I'm afraid. Hopefully, someone will help me wrap my head around this.

The strand is completely opposite between the two and there is a relative inversion in the middle. That is, all my forward genes are listed as complement in the other phage and vice versa. But, (with gp210 approximately the ends of both) the gp numbers are partially inverted gp1+strand=gp64-strand, gp2+strand=63-strand….. gp61-=gp1+; gp 62-=gp210+, gp 63=209….etc). So, it is like there are two giant crossovers if you map them in a phamerator fashion. I know that BLASTN searches also the reverse complement, but all of these genes are listed as Q1:S1 essentially, not reverse.

So, I think even if I assume that my genome should have gp 61 as gp1, there is still an inversion relative to each other. Otherwise, I would have gp1=gp210, gp2=gp209….

Finally, there are other phages (published more recently but less sequence conservation) that have the same strand but whose halves are flipped relative to mine. gp1+=gp142+, gp27-=gp168-, gp32-=gp172-

So, my current hypothesis is that phage A may be in GenBank as reverse complement but that doesn't matter. Coding is strand agnostic, so I think if I had gp1=gp210, gp2=gp209…. for the whole genome it would not matter, right?

Q1: Is it possible that my assembly should be inverted in some way? "Complementing the contig" for the whole genome? Even if I were to base it on the reverse complement, it would still have an inversion compared to the other.

Q2: Or, should I start with gp1 in the middle and flip the halves?

Q3: Alternatively, am I thinking about this whole thing stupidly and there really is no consensus on which strand or where to start gp1?

Thanks.

Link to this post \| posted 07 Aug, 2022 18:04
stpage	More info added in bold Hi, I have a (circular) genome that was assembled at NCSU and its closest hit is a phage that was annotated in 2010 at Wellcome Trust Sanger back in 2010 (NC_015296.1). So, no UPitt assembly QC on either, I'm afraid. Hopefully, someone will help me wrap my head around this. The strand is completely opposite between the two and there is a relative inversion in the middle. That is, all my forward genes are listed as complement in the other phage and vice versa. But, (with gp210 approximately the ends of both) the gp numbers are partially inverted gp1+strand=gp64-strand, gp2+strand=63-strand….. gp61-=gp1+; gp 62-=gp210+, gp 63=209….etc). So, it is like there are two giant crossovers if you map them in a phamerator fashion. I know that BLASTN searches also the reverse complement, but all of these genes are listed as Q1:S1 essentially, not reverse. So, I think even if I assume that my genome should have gp 61 as gp1, there is still an inversion relative to each other. Otherwise, I would have gp1=gp210, gp2=gp209…. Finally, there are other phages (published more recently but less sequence conservation) that have the same strand but whose halves are flipped relative to mine. gp1+=gp142+, gp27-=gp168-, gp32-=gp172- So, my current hypothesis is that phage A may be in GenBank as reverse complement but that doesn't matter. Coding is strand agnostic, so I think if I had gp1=gp210, gp2=gp209…. But, we have about 80% NKF and so large-scale functional synteny is a challenge but tapemeasure protein appears to be after the transition from forward to reverse around our gp 30. So it is early in the genome with our gp1 but not in the forward direction, and a couple of tail fiber proteins are "after" tapemeasure as reverse genes. So that would indicate that our assembly is maybe entirely reverse complement compared to the consensus setup? Q1: Is it possible that my assembly should be inverted in some way? "Complementing the contig" for the whole genome? (Even if I were to base it on the reverse complement, it would still have an inversion compared to the other.) Q2: Or, should I start with gp1 in the middle and flip the halves? Thanks.

Link to this post | posted 07 Aug, 2022 18:04

stpage

More info added in bold

Hi, I have a (circular) genome that was assembled at NCSU and its closest hit is a phage that was annotated in 2010 at Wellcome Trust Sanger back in 2010 (NC_015296.1). So, no UPitt assembly QC on either, I'm afraid. Hopefully, someone will help me wrap my head around this.

The strand is completely opposite between the two and there is a relative inversion in the middle. That is, all my forward genes are listed as complement in the other phage and vice versa. But, (with gp210 approximately the ends of both) the gp numbers are partially inverted gp1+strand=gp64-strand, gp2+strand=63-strand….. gp61-=gp1+; gp 62-=gp210+, gp 63=209….etc). So, it is like there are two giant crossovers if you map them in a phamerator fashion. I know that BLASTN searches also the reverse complement, but all of these genes are listed as Q1:S1 essentially, not reverse.

So, I think even if I assume that my genome should have gp 61 as gp1, there is still an inversion relative to each other. Otherwise, I would have gp1=gp210, gp2=gp209….

Finally, there are other phages (published more recently but less sequence conservation) that have the same strand but whose halves are flipped relative to mine. gp1+=gp142+, gp27-=gp168-, gp32-=gp172-

So, my current hypothesis is that phage A may be in GenBank as reverse complement but that doesn't matter. Coding is strand agnostic, so I think if I had gp1=gp210, gp2=gp209…. But, we have about 80% NKF and so large-scale functional synteny is a challenge but tapemeasure protein appears to be after the transition from forward to reverse around our gp 30. So it is early in the genome with our gp1 but not in the forward direction, and a couple of tail fiber proteins are "after" tapemeasure as reverse genes. So that would indicate that our assembly is maybe entirely reverse complement compared to the consensus setup?

Q1: Is it possible that my assembly should be inverted in some way? "Complementing the contig" for the whole genome? (Even if I were to base it on the reverse complement, it would still have an inversion compared to the other.)

Q2: Or, should I start with gp1 in the middle and flip the halves?

Thanks.

Link to this post \| posted 08 Aug, 2022 19:10
cdshaffer	q1: no assembler that I know of is aware of scientific standards about which strand should be the top strand and which should be the bottom strand. These standards are determined in a community-by-community way and so vary from one system to another. For example in eukaryotes we typically use the standard set by the cytologists and how they present whole chromosomes. Thus, in a very high quality assembly (where we probably have evidence for the locations of centromeres and telomeres) we will publish the sequence to match the typical cytological display. The phagesdb community has standards for determination of base 1 and strand. For all our phage deterination of Base 1 determination depends on the type of phage end structure, while strand is usually picked so the structural genes are top strand and near the beginning of the sequence. Dan posted some videos here with lots of help on this but you need to be able to look at the raw assembly to answer some of these questions. Finally, a lot of published sequences are actually sequences of prophage and base 1 and orientation are set by the location of the insertion site and the standard orientation for the host genome. [based on your gene matching I think this is the case for NC_015296.1] There are a collection of phage like this in the phamerator database where the order and orientation of the sequence has been changed from the genbank record to a different order and orientation so as to match (as best as possible) the typical order in the phamerator database, this makes drawing and interpreting the comparison maps at phamerator.org much easier. Q2: I would recommend you set your base 1 and strand using a similar stratagy, that is to say, pick the base 1 and orientation to make the subsequent steps of comparison as easy as possible. But that of course depends on what you're comparing your genome to. The good news here is that DNA master has a nice feature if you want to "roll" the genome around to set a different base 1 as well as the ability to switch to the complementary strand.

Link to this post | posted 08 Aug, 2022 19:10

cdshaffer

q1: no assembler that I know of is aware of scientific standards about which strand should be the top strand and which should be the bottom strand. These standards are determined in a community-by-community way and so vary from one system to another. For example in eukaryotes we typically use the standard set by the cytologists and how they present whole chromosomes. Thus, in a very high quality assembly (where we probably have evidence for the locations of centromeres and telomeres) we will publish the sequence to match the typical cytological display.

The phagesdb community has standards for determination of base 1 and strand. For all our phage deterination of Base 1 determination depends on the type of phage end structure, while strand is usually picked so the structural genes are top strand and near the beginning of the sequence. Dan posted some videos here with lots of help on this but you need to be able to look at the raw assembly to answer some of these questions.

Finally, a lot of published sequences are actually sequences of prophage and base 1 and orientation are set by the location of the insertion site and the standard orientation for the host genome. [based on your gene matching I think this is the case for NC_015296.1] There are a collection of phage like this in the phamerator database where the order and orientation of the sequence has been changed from the genbank record to a different order and orientation so as to match (as best as possible) the typical order in the phamerator database, this makes drawing and interpreting the comparison maps at phamerator.org much easier.

Q2: I would recommend you set your base 1 and strand using a similar stratagy, that is to say, pick the base 1 and orientation to make the subsequent steps of comparison as easy as possible. But that of course depends on what you're comparing your genome to. The good news here is that DNA master has a nice feature if you want to "roll" the genome around to set a different base 1 as well as the ability to switch to the complementary strand.

Link to this post \| posted 08 Aug, 2022 20:56
stpage	Finally, a lot of published sequences are actually sequences of prophage and base 1 and orientation are set by the location of the insertion site and the standard orientation for the host genome. [based on your gene matching I think this is the case for NC_015296.1] Ah, ok, that is interesting. I hadn't suspected that. I would recommend you set your base 1 and strand using a similar stratagy, that is to say, pick the base 1 and orientation to make the subsequent steps of comparison as easy as possible. But that of course depends on what you're comparing your genome to. The good news here is that DNA master has a nice feature if you want to "roll" the genome around to set a different base 1 as well as the ability to switch to the complementary strand. Great information, Chris. I'll give that a try. Maybe it will also help us pull out a few more functions for structural genes by synteny. Thank you so much for your help.

Link to this post | posted 08 Aug, 2022 20:56

stpage

Finally, a lot of published sequences are actually sequences of prophage and base 1 and orientation are set by the location of the insertion site and the standard orientation for the host genome. [based on your gene matching I think this is the case for NC_015296.1]
Ah, ok, that is interesting. I hadn't suspected that.

I would recommend you set your base 1 and strand using a similar stratagy, that is to say, pick the base 1 and orientation to make the subsequent steps of comparison as easy as possible. But that of course depends on what you're comparing your genome to. The good news here is that DNA master has a nice feature if you want to "roll" the genome around to set a different base 1 as well as the ability to switch to the complementary strand.
Great information, Chris. I'll give that a try. Maybe it will also help us pull out a few more functions for structural genes by synteny. Thank you so much for your help.

Link to this post \| posted 15 Jul, 2023 13:55
jcaoyao@gmail.com	Dear SEA PHAGES, Thousand thanks for support with a few URGENT questions. 1. Once you have the raw files generated from Illumina, do you first use fastQC to quality-control them, then use fastP to clean them, before using Newbler at all? If not, how do you quality-check and clean them before ever using Newbler or Consed? 2. I suspect the fastq files of my phage genomes are contaminated with bacterial reads. Would that be an issue when assembling the phage reads in Newbler? If it would, how might I filter out those unwanted contaminant reads? I couldn’t a tutorial or instructions manual on the matter, but if you have any, please feel free to point them to me. Edited 15 Jul, 2023 13:57

Link to this post | posted 15 Jul, 2023 13:55

jcaoyao@gmail.com

Dear SEA PHAGES,

Thousand thanks for support with a few URGENT questions.

1. Once you have the raw files generated from Illumina, do you first use fastQC to quality-control them, then use fastP to clean them, before using Newbler at all? If not, how do you quality-check and clean them before ever using Newbler or Consed?

2. I suspect the fastq files of my phage genomes are contaminated with bacterial reads. Would that be an issue when assembling the phage reads in Newbler? If it would, how might I filter out those unwanted contaminant reads? I couldn’t a tutorial or instructions manual on the matter, but if you have any, please feel free to point them to me.

Edited 15 Jul, 2023 13:57

Link to this post \| posted 17 Jul, 2023 18:38
cdshaffer	I use fastqc to check, but the last few years the data has been so good i would probably not bother if I have to set up a new pipeline. I use trimmomatic to trim. Most because it was an easy install using brew (a mac package installer); but, as Dan said above the most recent machines are so good at the short reads (I am currently using 2x150) that it is probably not necessary. So this will depend on the machine used by your sequencer. I then move the data into the old SEA VM and use newbler to assemble and consed to screen quality, determine strand, and find base 1. I would not recommend you use the whole fastq file unless it is a tiny number of reads. One of the main jobs of an assembler is to distinguish read errors from valid sequence. If you give the assembler too many reads it will see the same error multiple times and evaluate that sequence as valid. I typically use 50,000 to 150,000 reads in an assembly (depending on genome size). this gives about 200X coverage for my typical genomes. I use the unix command "head" to get just the number of reads I want from the beginning of the fastq file. Just remember that each read in a fastq file takes up 4 lines. So as an example to get the first 50,000 reads (which would be 200,000 lines of data) from my "data.fastq" file and create a new "data_50k.fastq" file I would execute this on the unix command line: `head -200000 data.fastq > data_50k.fastq` Contamination will only be an issue if the fraction of bacterial reads is high, assemblers are totally fine with low levels of contamination. So try and if you get a number of smaller contigs with lower than expected coverage try again adding more reads. The optimal number of reads will need to be determined empirically, just try to get the number of phage reads where you have 100 - 200X coverage of the phage genome in your sequence data. If the bacterial contamination rate is too high you may, as you suggest, need to try to "fish out" the phage reads from your fastq file before you attempt assembly but that that would require substantial effort of testing each read and you likely have several million reads so you only to go that route if there is just too much contamination for the assembler to handle.

Link to this post | posted 17 Jul, 2023 18:38

cdshaffer

I use fastqc to check, but the last few years the data has been so good i would probably not bother if I have to set up a new pipeline.

I use trimmomatic to trim. Most because it was an easy install using brew (a mac package installer); but, as Dan said above the most recent machines are so good at the short reads (I am currently using 2x150) that it is probably not necessary. So this will depend on the machine used by your sequencer.
I then move the data into the old SEA VM and use newbler to assemble and consed to screen quality, determine strand, and find base 1.

I would not recommend you use the whole fastq file unless it is a tiny number of reads. One of the main jobs of an assembler is to distinguish read errors from valid sequence. If you give the assembler too many reads it will see the same error multiple times and evaluate that sequence as valid. I typically use 50,000 to 150,000 reads in an assembly (depending on genome size). this gives about 200X coverage for my typical genomes. I use the unix command "head" to get just the number of reads I want from the beginning of the fastq file. Just remember that each read in a fastq file takes up 4 lines. So as an example to get the first 50,000 reads (which would be 200,000 lines of data) from my "data.fastq" file and create a new "data_50k.fastq" file I would execute this on the unix command line:

head -200000 data.fastq > data_50k.fastq

Contamination will only be an issue if the fraction of bacterial reads is high, assemblers are totally fine with low levels of contamination. So try and if you get a number of smaller contigs with lower than expected coverage try again adding more reads. The optimal number of reads will need to be determined empirically, just try to get the number of phage reads where you have 100 - 200X coverage of the phage genome in your sequence data. If the bacterial contamination rate is too high you may, as you suggest, need to try to "fish out" the phage reads from your fastq file before you attempt assembly but that that would require substantial effort of testing each read and you likely have several million reads so you only to go that route if there is just too much contamination for the assembler to handle.

Link to this post \| posted 17 Jul, 2023 21:15
jcaoyao@gmail.com	Thank you Dr Shaffer for your thorough clarifications.

Link to this post \| posted 19 Jul, 2023 10:22
jcaoyao@gmail.com	DanRussell Hi Kyle, Good questions! There are a few resources that might be helpful here. One is that I wrote a small software package that helps streamline some of the assembly/QC process for phage genomes. It's called phageAssembler and is on github. https://github.com/SEA-PHAGES/phageAssembler It's only really meant to be installed on the 2017 SEA Virtual Machine. (I didn't really spend the time to make it thoroughly cross-platform.) But it should work there if you follow the Quick Start instructions. Because Newbler and consed are already installed on the SEA VM, it can use those installations and basically does the following: INPUT: fastq file 1. Downsample reads from your fastq file to get a workable number (default 80,000) 2. Assemble those reads with Newbler 3. Report #s of contigs & sizes 4. BLAST large contigs against a phage database and report possible cluster 5. Attempt to locate base 1 by similarity to genomes in the database 6. Report coverage and GC% of assembled contigs 7. Run AceUtil to search and tag assembly weak areas 8. Create consed-ready file for review 9. Write findings to a log file You can certainly do all those steps independently if you'd like to learn the process, but this script kind of gets you to the actual analysis part, skipping a lot of the need to learn command-line stuff for many different programs. The second resource is a chapter I wrote that details the whole process: https://pubmed.ncbi.nlm.nih.gov/29134591/ (If you can't get access, I can share the manuscript.) It's a more general look at what things you need to think about when sequencing and finishing phage genomes. Finally, there are some video tutorials I made that walk through some of the assembly/finishing process. These are a bit old and potentially outdated, but probably still have some useful info if you want to do more of the steps yourself. https://phagesdb.org/workflow/Sequencing/ And also, if you do sequence/assemble your own, we would definitely like to double-check them and include them in PhagesDB. To do so, we'd need your final sequence file and the sequencing reads. Hope that helps! –Dan Hi Dan, I would love to use your script, but it asks for some passwords. May I ask what they are? Thanks very much.

Link to this post | posted 19 Jul, 2023 10:22

jcaoyao@gmail.com

DanRussell
Hi Kyle,

Good questions! There are a few resources that might be helpful here. One is that I wrote a small software package that helps streamline some of the assembly/QC process for phage genomes. It's called phageAssembler and is on github.

https://github.com/SEA-PHAGES/phageAssembler

It's only really meant to be installed on the 2017 SEA Virtual Machine. (I didn't really spend the time to make it thoroughly cross-platform.) But it should work there if you follow the Quick Start instructions. Because Newbler and consed are already installed on the SEA VM, it can use those installations and basically does the following:

INPUT: fastq file
1. Downsample reads from your fastq file to get a workable number (default 80,000)
2. Assemble those reads with Newbler
3. Report #s of contigs & sizes
4. BLAST large contigs against a phage database and report possible cluster
5. Attempt to locate base 1 by similarity to genomes in the database
6. Report coverage and GC% of assembled contigs
7. Run AceUtil to search and tag assembly weak areas
8. Create consed-ready file for review
9. Write findings to a log file

You can certainly do all those steps independently if you'd like to learn the process, but this script kind of gets you to the actual analysis part, skipping a lot of the need to learn command-line stuff for many different programs.

The second resource is a chapter I wrote that details the whole process:
https://pubmed.ncbi.nlm.nih.gov/29134591/

(If you can't get access, I can share the manuscript.) It's a more general look at what things you need to think about when sequencing and finishing phage genomes.

Finally, there are some video tutorials I made that walk through some of the assembly/finishing process. These are a bit old and potentially outdated, but probably still have some useful info if you want to do more of the steps yourself.
https://phagesdb.org/workflow/Sequencing/

And also, if you do sequence/assemble your own, we would definitely like to double-check them and include them in PhagesDB. To do so, we'd need your final sequence file and the sequencing reads.

Hope that helps!
–Dan

Hi Dan, I would love to use your script, but it asks for some passwords. May I ask what they are? Thanks very much.

Link to this post \| posted 05 Sep, 2023 10:49
jcaoyao@gmail.com	Hello Dear Dr Russell, The Illumina generates 2 fastq files per phage genome, called R1 and R2. Your Newbler assembly tutorials used only one fastq file to assemble the genome. Is that the R1 or R2 file? Should I use either R1 or R2 fastq files separately to assemble and see how both results corroborate each other? Or should I merge both R1 and R2 files into one fastq and assemble that? So far, each file when run individually, has given me 100+ contigs (a couple large ones plus around 100 short contigs of about 100 bps), despite having been downsampled. Could this be due to contamination during DNA extraction? Thank you a lot.

Link to this post | posted 05 Sep, 2023 10:49

jcaoyao@gmail.com

Hello Dear Dr Russell,

The Illumina generates 2 fastq files per phage genome, called R1 and R2. Your Newbler assembly tutorials used only one fastq file to assemble the genome. Is that the R1 or R2 file?

Should I use either R1 or R2 fastq files separately to assemble and see how both results corroborate each other? Or should I merge both R1 and R2 files into one fastq and assemble that? So far, each file when run individually, has given me 100+ contigs (a couple large ones plus around 100 short contigs of about 100 bps), despite having been downsampled. Could this be due to contamination during DNA extraction?

Thank you a lot.

Recent Activity

Getting Started with Phage Assembly