SEA-PHAGES | All posts created by cdshaffer



Link to this post | posted 28 Feb, 2024 18:10

Yes. If you are using a dot plot tool to compare genomes and it checks both strands, you are good. In your case, if large sections of one genome are inverted in another genome (and thus on the other strand), this will show up in the dot plot as long diagonal lines that change slope from positive to negative.

However, the protocols as posted on QUBES use Gepard, which is really fast but only compares the top strand of each sequence. So to look for similarity when you suspect one sequence is inverted, you would need to compare the reverse complement of one phage to the normal-strand sequence of the other.
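If you need to make that reverse-complemented copy yourself before loading it into Gepard, something like the following would do it. This is only a minimal sketch in plain Python: the function names and the file names in the usage note are my own placeholders, and it assumes a FASTA file holding a single phage genome.

```python
# Complement table for DNA; N and lower-case bases are passed through sensibly.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_fasta(in_path, out_path, width=70):
    """Write the reverse complement of a single-sequence FASTA file.

    Returns the sequence length. Raises ValueError if the file does not
    contain exactly one FASTA record.
    """
    header, chunks = None, []
    with open(in_path) as src:
        for line in src:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    raise ValueError("expected a single-sequence FASTA file")
                header = line
            else:
                chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")

    rc = reverse_complement("".join(chunks))
    with open(out_path, "w") as dst:
        # Tag the header so the two files are not confused later.
        dst.write(header + " reverse_complement\n")
        for i in range(0, len(rc), width):
            dst.write(rc[i:i + width] + "\n")
    return len(rc)
```

For example, `reverse_complement_fasta("phageA.fasta", "phageA_rc.fasta")` (hypothetical file names) writes a second file that can be compared against the other genome's normal strand.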

Other programs like NCBI BLASTN compare both strands (use the "Align two or more sequences" check box). BLASTN can be quite a bit slower when dealing with multiple phage sequences, and may fail totally if your sequences are too long, but if you want to look for large-scale similarity and you are not sure which strand to look at, BLASTN will probably do better. I would do an initial assessment with BLAST on a single genome vs. a single genome, and once I knew which strands to compare I could do the final comparisons in Gepard.

Edited 28 Feb, 2024 22:48

Posted in: Bioinformatic Tools and Analyses → Phage Comparative Genomics Lab Manual - QUBES Resource


Link to this post | posted 22 Feb, 2024 21:31

cdshaffer

When I use that sequence in an HHpred search I get an alignment to roughly the first half of crystal structure 5LD9, the JAMM/MPN(+) protease (amino acids 10-90). On the PDB page for the crystal it looks like the crystal has the same amino acid coordinates as the native protein, so I can use those ~10-90 coordinates when I look at the literature on this protein. According to this paper the active site residues of the JAMM protease motif are ExnHxHx7Sx2D. This motif has a nice match in the phage protein (the HxH are at 73 and 75, and the S and D are also there at the correct distance), so I think this phage protein is also, like JAMM/MPN(+), a metalloprotease.
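A quick way to check a protein for this spacing is a regular expression. The sketch below only encodes the motif as written above (H-x-H, seven residues, S, two residues, D); because the number of residues between the Glu and the first His is variable, it just lists upstream E positions within a window. The function name and the `max_gap` window are my own illustrative assumptions, not values from the paper.

```python
import re

# Fixed-spacing core of the motif as written in the post: H-x-H-x(7)-S-x(2)-D.
JAMM_CORE = re.compile(r"H.H.{7}S.{2}D")


def find_jamm_core(protein, max_gap=80):
    """Report 1-based residue positions for every match to the core motif.

    max_gap is an arbitrary search window for the upstream Glu (the
    'E-x(n)-' part of the motif); it is an assumption made for illustration.
    """
    hits = []
    for match in JAMM_CORE.finditer(protein):
        start = match.start()                 # 0-based index of the first His
        window_start = max(0, start - max_gap)
        upstream_glu = [
            window_start + i + 1              # convert to 1-based position
            for i, aa in enumerate(protein[window_start:start])
            if aa == "E"
        ]
        hits.append({
            "H1": start + 1,
            "H2": start + 3,
            "S": start + 11,
            "D": start + 14,
            "upstream_E": upstream_glu,
        })
    return hits
```

A hit only says the spacing is present; it still needs to line up with the active site residues in the HHpred alignment to count as evidence.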

So now the question is more an issue of nomenclature/semantics. Should there be two terms in the approved list (something like "metalloprotease HEXXH type" and "metalloprotease EHHSD type"), or should we lump the HEXXH and EHHSD types together under the same "metalloprotease" term and update the approved terms list to say something like "Typically has HEXXH motif, but other metalloprotease motifs (e.g. ExnHxHx7Sx2D) have been described and can be used to support this function if present", or words to that effect?

Edited 22 Feb, 2024 21:40

Posted in: Functional Annotation → Metalloprotease without HEXXH motif?


Link to this post | posted 16 Feb, 2024 18:33

cdshaffer

Very short answer: use the official function list.

Long answer:
Many times when you do a deep dive into issues like this (where the evidence is strong enough to call two different terms), you find one of two things going on. Usually it turns out the two terms are mostly synonymous. For example, one term traces back to an E. coli protein and the other comes from studies in B. subtilis. Both proteins probably fulfill the same biological role, so they are pretty much the "same protein"; they just have different names. The other likely result is that one term is more specific than the other. Like, is it a "Car" or a "Ford" or a "Mustang"? All these terms might apply.

In this case, and I am guessing here, I would not be surprised if RecA and UvsX are very similar to each other and we really just have two synonyms. This is easy to check: do an HHpred search with RecA or UvsX and see how well they align to each other. If they are basically the same protein, then you know you are in the first situation above.

I am going to assume that the two proteins are mostly the same and not different levels of specificity; in that case the way to proceed is to use the official function list. If one term is on the list and the other is not, you have two choices: 1. use the term on the list, OR 2. decide that even though they are "mostly" the same, they are in fact different enough that both terms should be on the list. If you think that is the case, post your proposal to add the term to the approved list on the proper forum; once you get it approved, everyone can use it and everyone's annotations are all the better for your contribution.

This is why I always tell my students that while this second option can be a lot of work it is also a real accomplishment. Finding a new, novel, and fundamentally different function that is not on the list and convincing the list keepers of this, is very impressive indeed! But it takes time and effort, reading papers and developing the evidence to get to a convincing argument that the two terms are distinct enough to justify both on the list.

Edited 16 Feb, 2024 18:37

Posted in: Annotation → RecA-like recombinase or UvsX-like recombinase for KentuckyRacer 62351-62378


Link to this post | posted 15 Feb, 2024 22:30

cdshaffer

I call this "gene content analysis", and according to the guiding principles, rule 2: "Genes do not overlap by more than a few bp, although up to about 30 is legitimate". I would also add that, like all rules, exceptions exist. All that is to say that you are correct to be suspicious: given the very large overlap, one or the other is very likely a false positive from the gene predictors used to create the draft annotations.
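To put a number on an overlap like this, the arithmetic is simple enough to script. This is a minimal sketch, assuming each gene is given as a (start, stop) pair in 1-based inclusive genome coordinates; the function names are mine, and the 30 bp cutoff is just the rough figure quoted from rule 2, not a hard limit.

```python
def overlap_bp(gene_a, gene_b):
    """Number of bases shared by two genes given as (start, stop) pairs.

    Coordinates are 1-based and inclusive; start and stop may be given in
    either order, so reverse-strand genes work too.
    """
    a_lo, a_hi = sorted(gene_a)
    b_lo, b_hi = sorted(gene_b)
    return max(0, min(a_hi, b_hi) - max(a_lo, b_lo) + 1)


def is_suspicious_overlap(gene_a, gene_b, limit=30):
    """True when the overlap exceeds the rough limit from guiding principle 2."""
    return overlap_bp(gene_a, gene_b) > limit
```

A 4 bp overlap such as `overlap_bp((100, 500), (497, 900))` is routine, while anything in the hundreds of bp is worth the evidence review described in this post.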

So for evidence as to which are real genes and which are false positives, I would rank the evidence in this order, listing the evidence FOR a real gene and against the hypothesis that it is a false positive (from strongest to weakest, not in the order I look at it):
1. Strong HHpred alignments to well-characterized crystallized proteins (this will almost never happen to a false positive)
2. Strong BLAST alignment to a well-characterized protein with an assigned function (again, almost never happens to a false positive)
3a. Good coding potential with the BLACK signal, not the red signal
3b. Good BLAST hits to other well-annotated phages
[3a and 3b are tied for quality in my mind]

4. Then would come Rule 9 in the guiding principles: "Switches in gene orientation are relatively rare" [does not apply in your case, but added here as another source of evidence, as many times the two genes that overlap are on different strands]

So you probably want to check 1 above. As for 2, you did not state whether the matches in the other phage have an assigned function, so you have some more investigation to do, but by item 3 you at least have a good hypothesis as to which is more likely the false positive.

Edited 15 Feb, 2024 22:34

Posted in: Annotation → 2 genes in same place of Cluster BE phage, Kentucky Racer


Link to this post | posted 09 Feb, 2024 16:32

cdshaffer

There is one more important difference in the installation for Arm-based Macs. Once you get mysql installed and are setting up conda (see here), you need to change the conda create command by adding some bits at the beginning. So you want to change the create command from

conda create --name pdm_utils curl python pip biopython==1.77 networkx paramiko pymysql sqlalchemy tabulate urllib3

to

CONDA_SUBDIR=osx-64 conda create --name pdm_utils ….etc.

Then, the first time you activate the conda environment, add this second command:

conda activate pdm_utils
conda env config vars set CONDA_SUBDIR=osx-64

You should only have to run the "conda env config…." line one time to set things up. From then on you can just use

conda activate

and

conda deactivate

as outlined in the instructions.
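One way to confirm the environment really is using Intel (osx-64) packages is to ask the Python inside it what architecture it is running as. This is a small sketch, not part of the pdm_utils instructions; the function name is mine, and it assumes the usual behavior that an Intel build of Python running under Rosetta reports `x86_64` while a native Arm build reports `arm64`.

```python
import platform


def running_as_intel(machine=None):
    """Return True when the interpreter reports an Intel (x86_64) build.

    machine defaults to platform.machine(); passing a value explicitly is
    only there so the check can be exercised without an Arm Mac.
    """
    if machine is None:
        machine = platform.machine()
    return machine == "x86_64"
```

After `conda activate pdm_utils`, calling `running_as_intel()` from that environment's Python should give `True` if the `CONDA_SUBDIR=osx-64` setting took effect.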

Posted in: Bioinformatic Tools and Analyses → PDM utils on a Mac M1


Link to this post | posted 28 Jan, 2024 00:08

cdshaffer

Looking at the database I can see that gene 33 in Poultris is a tRNA gene, so it does not show up on the list of protein-coding genes (i.e. the list on PhagesDB). The prediction has the tRNA gene from 21834 to 21936, which is completely within protein-coding gene 32. The database does not give the provenance of the prediction, so there is no way to tell whether it was called by tRNAscan-SE or Aragorn, nor which version of those programs was used. But given the 100% overlap with gene 32 it is probably a false positive, so I am going to guess tRNAscan-SE (no shade on tRNAscan-SE). tRNAscan-SE gives a score with its calls, so its whole design philosophy is to call everything, no matter how unlikely, and just give the really unlikely ones a very bad score. Just another example of why human manual annotation is still a "Good thing" ™

Posted in: Phamerator → Missing gene in PhagesDB draft genome


Link to this post | posted 26 Sep, 2023 17:46

cdshaffer

I cannot help you much with the add solexa reads perl script; I have never used it. I always just use the Newbler graphical interface. I create a new project from scratch and set everything up like this:

Start by creating a project folder, usually on the desktop. I copy the fastq file with just the reads I want to try to assemble into that folder. Open the Newbler graphical interface, select new project, navigate to the new project folder I just created, give the project a name and click OK.

I then go to the project tab, select the "fastq reads" sub-tab, then hit the plus sign on the left side. I then select the fastq file that I prepared for proper size. I then go to the parameters tab and in the input sub-tab make sure large/complex genome is unchecked and Heterozygotic mode is also unchecked. In the computation sub-tab, number of CPUs is set to 0 (so that all CPUs are used). In the output sub-tab, include consensus and quick output are checked; reads limited to one contig and output trimmed reads are unchecked. For the other settings I use
Pairwise alignment None
Ace format consed16
Ace read mode Default
alignment info: output small
all contig threshold 100
large contig threshold 500
scaffold length threshold 2000

I run the assembly (click start), then I use consed to open the ace file, which I will find in a folder called edit_dir buried a few folder levels down within the project folder. Typically edit_dir will be in a folder called consed, in a folder called assembly, in a folder with the name of the project, in the project folder.

As for your other question about using the perl script, my guess is you need the full description of the location of the ace file. This means that in the command line you need to name every folder, in the exact order, to tell the perl script exactly where to find the ace file. In the above example, where I create a project folder on the desktop, this part of the command would be quite long, something like:
-ace /home/seafaculty/Desktop/projectfolder/projectname/assembly/consed/edit_dir/454Contigs.ace.1

where several of those entries between the / need to be the exact names of the folders on your system. Also make sure no folders have spaces in their names or it gets really tricky. You can get the exact thing to type if you can find the ace file in the graphical interface: right click on it, select properties, and copy and paste from the Location entry.
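If clicking through the folders is a pain, a few lines of Python can print the full path for you. This is just a sketch using the standard library; the function names are mine, and it assumes the ace file sits in a folder named edit_dir somewhere under the project folder, as in the layout described above.

```python
from pathlib import Path


def find_ace_files(project_folder):
    """Return absolute paths of ace files inside any edit_dir under project_folder."""
    root = Path(project_folder).expanduser().resolve()
    return sorted(
        p for p in root.rglob("*.ace*")
        if p.is_file() and p.parent.name == "edit_dir"
    )


def has_spaces(path):
    """True if any part of the path contains a space (awkward on the command line)."""
    return " " in str(path)
```

For example, looping over `find_ace_files("~/Desktop/projectfolder")` and printing each path (plus a warning when `has_spaces(path)` is true) gives you the exact text to paste after `-ace`.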

If you want, you can copy and paste your exact command and the exact response from the computer. Also run the

pwd

command and copy/paste the output

Posted in: Newbler → Getting Started with Phage Assembly


Link to this post | posted 19 Sep, 2023 15:15

cdshaffer

5 GB should work; I had less than that with my last laptop and had many successful assemblies. So set it to 5, boot up the machine and try the assembly again. If it does not finish in 5 or 10 minutes, I would try cutting the number of reads in half and try the assembly again. Hopefully you can find a working solution with enough data for a good assembly but not so much data that everything slows down drastically with memory overflow issues.
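Cutting the read count can be scripted if you do not already have a tool for it. The sketch below is one simple way, not the only one: it copies the first N records of a FASTQ file, assuming the standard four lines per read with no wrapped sequence lines. The function and file names are hypothetical.

```python
from itertools import islice


def count_reads(fastq_path):
    """Count reads in a FASTQ file, assuming exactly four lines per read."""
    with open(fastq_path) as src:
        return sum(1 for _ in src) // 4


def take_reads(fastq_in, fastq_out, max_reads):
    """Copy the first max_reads records from fastq_in to fastq_out.

    Returns the number of reads actually written, which is smaller than
    max_reads when the input file runs out first.
    """
    written = 0
    with open(fastq_in) as src, open(fastq_out, "w") as dst:
        while written < max_reads:
            record = list(islice(src, 4))   # header, sequence, '+', qualities
            if len(record) < 4:
                break
            dst.writelines(record)
            written += 1
    return written
```

To halve a file you could run `take_reads("reads.fastq", "reads_half.fastq", count_reads("reads.fastq") // 2)` and point the new assembly project at the smaller file.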

Posted in: Newbler → Getting Started with Phage Assembly


Link to this post | posted 18 Sep, 2023 20:10

cdshaffer

You can only change those settings while the machine is OFF. So go to the SEA VM and shut it down. Once the VM is off you should see a green area, which marks the permissible settings. Once you have made and saved the changes you can boot up the SEA VM again. I would also recommend you shut down most other programs running on the host while you work on assembly (like email clients, web browsers, Word, etc.). This will give your machine as much free memory as possible to work with.

Edited 18 Sep, 2023 20:11

Posted in: Newbler → Getting Started with Phage Assembly


Link to this post | posted 18 Sep, 2023 16:19

cdshaffer

Yes, RAM. How are you running Newbler? I run it using the old SEA VM in VirtualBox with an older Intel Mac host. For that setup I go to Machine -> Settings -> System -> Motherboard; on that page is a slider called "Base Memory". My iMac has 16 GB, so VirtualBox allows me up to about 12 GB. I have it set to 10 GB, which is plenty. I would recommend setting it to 8 GB, or as high as allowed given your computer setup. Then try the assembly again. If it still takes too long, reduce the number of reads and try again. If you have a different setup, post the description.

Posted in: Newbler → Getting Started with Phage Assembly
