Welcome to the forums at seaphages.org. Please feel free to ask any questions related to the SEA-PHAGES program. Any logged-in user may post new topics and reply to existing topics. If you'd like to see a new forum created, please contact us using our form or email us at info@seaphages.org.
Recent Activity
Getting Started with Phage Assembly
Link to this post | posted 12 Aug, 2021 20:05 | |
---|---|
|
Thanks, Dan! That helps! Christine |
Link to this post | posted 07 Aug, 2022 17:41 | |
---|---|
|
Hi, I have a (circular) genome that was assembled at NCSU and its closest hit is a phage that was annotated in 2010 at Wellcome Trust Sanger back in 2010 (NC_015296.1). So, no UPitt assembly QC on either, I'm afraid. Hopefully, someone will help me wrap my head around this. The strand is completely opposite between the two and there is a relative inversion in the middle. That is, all my forward genes are listed as complement in the other phage and vice versa. But, (with gp210 approximately the ends of both) the gp numbers are partially inverted gp1+strand=gp64-strand, gp2+strand=63-strand….. gp61-=gp1+; gp 62-=gp210+, gp 63=209….etc). So, it is like there are two giant crossovers if you map them in a phamerator fashion. I know that BLASTN searches also the reverse complement, but all of these genes are listed as Q1:S1 essentially, not reverse. So, I think even if I assume that my genome should have gp 61 as gp1, there is still an inversion relative to each other. Otherwise, I would have gp1=gp210, gp2=gp209…. Finally, there are other phages (published more recently but less sequence conservation) that have the same strand but whose halves are flipped relative to mine. gp1+=gp142+, gp27-=gp168-, gp32-=gp172- So, my current hypothesis is that phage A may be in GenBank as reverse complement but that doesn't matter. Coding is strand agnostic, so I think if I had gp1=gp210, gp2=gp209…. for the whole genome it would not matter, right? Q1: Is it possible that my assembly should be inverted in some way? "Complementing the contig" for the whole genome? Even if I were to base it on the reverse complement, it would still have an inversion compared to the other. Q2: Or, should I start with gp1 in the middle and flip the halves? Q3: Alternatively, am I thinking about this whole thing stupidly and there really is no consensus on which strand or where to start gp1? Thanks. |
Link to this post | posted 07 Aug, 2022 18:04 | |
---|---|
|
More info added in bold Hi, I have a (circular) genome that was assembled at NCSU and its closest hit is a phage that was annotated in 2010 at Wellcome Trust Sanger back in 2010 (NC_015296.1). So, no UPitt assembly QC on either, I'm afraid. Hopefully, someone will help me wrap my head around this. The strand is completely opposite between the two and there is a relative inversion in the middle. That is, all my forward genes are listed as complement in the other phage and vice versa. But, (with gp210 approximately the ends of both) the gp numbers are partially inverted gp1+strand=gp64-strand, gp2+strand=63-strand….. gp61-=gp1+; gp 62-=gp210+, gp 63=209….etc). So, it is like there are two giant crossovers if you map them in a phamerator fashion. I know that BLASTN searches also the reverse complement, but all of these genes are listed as Q1:S1 essentially, not reverse. So, I think even if I assume that my genome should have gp 61 as gp1, there is still an inversion relative to each other. Otherwise, I would have gp1=gp210, gp2=gp209…. Finally, there are other phages (published more recently but less sequence conservation) that have the same strand but whose halves are flipped relative to mine. gp1+=gp142+, gp27-=gp168-, gp32-=gp172- So, my current hypothesis is that phage A may be in GenBank as reverse complement but that doesn't matter. Coding is strand agnostic, so I think if I had gp1=gp210, gp2=gp209…. But, we have about 80% NKF and so large-scale functional synteny is a challenge but tapemeasure protein appears to be after the transition from forward to reverse around our gp 30. So it is early in the genome with our gp1 but not in the forward direction, and a couple of tail fiber proteins are "after" tapemeasure as reverse genes. So that would indicate that our assembly is maybe entirely reverse complement compared to the consensus setup? Q1: Is it possible that my assembly should be inverted in some way? "Complementing the contig" for the whole genome? (Even if I were to base it on the reverse complement, it would still have an inversion compared to the other.) Q2: Or, should I start with gp1 in the middle and flip the halves? Thanks. |
Link to this post | posted 08 Aug, 2022 19:10 | |
---|---|
|
q1: no assembler that I know of is aware of scientific standards about which strand should be the top strand and which should be the bottom strand. These standards are determined in a community-by-community way and so vary from one system to another. For example in eukaryotes we typically use the standard set by the cytologists and how they present whole chromosomes. Thus, in a very high quality assembly (where we probably have evidence for the locations of centromeres and telomeres) we will publish the sequence to match the typical cytological display. The phagesdb community has standards for determination of base 1 and strand. For all our phage deterination of Base 1 determination depends on the type of phage end structure, while strand is usually picked so the structural genes are top strand and near the beginning of the sequence. Dan posted some videos here with lots of help on this but you need to be able to look at the raw assembly to answer some of these questions. Finally, a lot of published sequences are actually sequences of prophage and base 1 and orientation are set by the location of the insertion site and the standard orientation for the host genome. [based on your gene matching I think this is the case for NC_015296.1] There are a collection of phage like this in the phamerator database where the order and orientation of the sequence has been changed from the genbank record to a different order and orientation so as to match (as best as possible) the typical order in the phamerator database, this makes drawing and interpreting the comparison maps at phamerator.org much easier. Q2: I would recommend you set your base 1 and strand using a similar stratagy, that is to say, pick the base 1 and orientation to make the subsequent steps of comparison as easy as possible. But that of course depends on what you're comparing your genome to. The good news here is that DNA master has a nice feature if you want to "roll" the genome around to set a different base 1 as well as the ability to switch to the complementary strand. |
Link to this post | posted 08 Aug, 2022 20:56 | |
---|---|
|
Finally, a lot of published sequences are actually sequences of prophage and base 1 and orientation are set by the location of the insertion site and the standard orientation for the host genome. [based on your gene matching I think this is the case for NC_015296.1] Ah, ok, that is interesting. I hadn't suspected that. I would recommend you set your base 1 and strand using a similar stratagy, that is to say, pick the base 1 and orientation to make the subsequent steps of comparison as easy as possible. But that of course depends on what you're comparing your genome to. The good news here is that DNA master has a nice feature if you want to "roll" the genome around to set a different base 1 as well as the ability to switch to the complementary strand. Great information, Chris. I'll give that a try. Maybe it will also help us pull out a few more functions for structural genes by synteny. Thank you so much for your help. |
Link to this post | posted 15 Jul, 2023 13:55 | |
---|---|
|
Dear SEA PHAGES, Thousand thanks for support with a few URGENT questions. 1. Once you have the raw files generated from Illumina, do you first use fastQC to quality-control them, then use fastP to clean them, before using Newbler at all? If not, how do you quality-check and clean them before ever using Newbler or Consed? 2. I suspect the fastq files of my phage genomes are contaminated with bacterial reads. Would that be an issue when assembling the phage reads in Newbler? If it would, how might I filter out those unwanted contaminant reads? I couldn’t a tutorial or instructions manual on the matter, but if you have any, please feel free to point them to me. |
Link to this post | posted 17 Jul, 2023 18:38 | |
---|---|
|
I use fastqc to check, but the last few years the data has been so good i would probably not bother if I have to set up a new pipeline. I use trimmomatic to trim. Most because it was an easy install using brew (a mac package installer); but, as Dan said above the most recent machines are so good at the short reads (I am currently using 2x150) that it is probably not necessary. So this will depend on the machine used by your sequencer. I then move the data into the old SEA VM and use newbler to assemble and consed to screen quality, determine strand, and find base 1. I would not recommend you use the whole fastq file unless it is a tiny number of reads. One of the main jobs of an assembler is to distinguish read errors from valid sequence. If you give the assembler too many reads it will see the same error multiple times and evaluate that sequence as valid. I typically use 50,000 to 150,000 reads in an assembly (depending on genome size). this gives about 200X coverage for my typical genomes. I use the unix command "head" to get just the number of reads I want from the beginning of the fastq file. Just remember that each read in a fastq file takes up 4 lines. So as an example to get the first 50,000 reads (which would be 200,000 lines of data) from my "data.fastq" file and create a new "data_50k.fastq" file I would execute this on the unix command line:
Contamination will only be an issue if the fraction of bacterial reads is high, assemblers are totally fine with low levels of contamination. So try and if you get a number of smaller contigs with lower than expected coverage try again adding more reads. The optimal number of reads will need to be determined empirically, just try to get the number of phage reads where you have 100 - 200X coverage of the phage genome in your sequence data. If the bacterial contamination rate is too high you may, as you suggest, need to try to "fish out" the phage reads from your fastq file before you attempt assembly but that that would require substantial effort of testing each read and you likely have several million reads so you only to go that route if there is just too much contamination for the assembler to handle. |
Link to this post | posted 17 Jul, 2023 21:15 | |
---|---|
|
Thank you Dr Shaffer for your thorough clarifications. |
Link to this post | posted 19 Jul, 2023 10:22 | |
---|---|
|
DanRussell Hi Dan, I would love to use your script, but it asks for some passwords. May I ask what they are? Thanks very much. |
Link to this post | posted 05 Sep, 2023 10:49 | |
---|---|
|
Hello Dear Dr Russell, The Illumina generates 2 fastq files per phage genome, called R1 and R2. Your Newbler assembly tutorials used only one fastq file to assemble the genome. Is that the R1 or R2 file? Should I use either R1 or R2 fastq files separately to assemble and see how both results corroborate each other? Or should I merge both R1 and R2 files into one fastq and assemble that? So far, each file when run individually, has given me 100+ contigs (a couple large ones plus around 100 short contigs of about 100 bps), despite having been downsampled. Could this be due to contamination during DNA extraction? Thank you a lot. |