SEA-PHAGES | SIF-Blast; SIF-HHPred; SIF-Syn

Link to this post | posted 13 Feb, 2018 16:43

GregFrederick@letu.edu

Our analysis process (tell me if we can save time anywhere or streamline the process):

1. If, via the DNA Master internal NCBI BlastP analysis, we ID a definite, defined function (according to the current approved function list), we list this function under SIF-Blast.
2. If the automated NCBI Blast inside DNA Master only lists "hypothetical protein" or proteins of NKF, we blast the product in PhagesDB and individually on the NCBI BlastP website to examine lower-ranked hits for a potential function.
3. If the Blast analyses in #1 and #2 do not lead us to a function, we move on to HHPred analysis. If we see an HHPred-defined function with an E-value better than e-16 or >90% probability across the majority of the protein, we record it under SIF-HHPred.
4. If we do not find a function via steps #1 through #3 above, we look for a function via synteny.
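For readers who find it easier to follow as code, the four steps above can be sketched roughly as below. This is only an illustration of the order of checks as described in this post; the `Evidence` class, its field names, and the function name are hypothetical, and the "across the majority of the protein" condition is left out for brevity. Note that the reply further down says the sources are not redundant, so this shows the process being asked about, not an approved shortcut.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Thresholds quoted in step 3 of the post; everything else is a hypothetical illustration.
HHPRED_MAX_EVALUE = 1e-16
HHPRED_MIN_PROBABILITY = 90.0


@dataclass
class Evidence:
    """What was found for one feature (all field names are assumptions)."""
    blast_function: Optional[str] = None      # from DNA Master, PhagesDB, or NCBI BlastP (steps 1-2)
    hhpred_function: Optional[str] = None     # step 3
    hhpred_evalue: Optional[float] = None
    hhpred_probability: Optional[float] = None
    synteny_function: Optional[str] = None    # step 4


def first_source_with_function(ev: Evidence) -> Tuple[str, str]:
    """Return (SIF code, function), checking the sources in the order of steps 1-4."""
    if ev.blast_function:
        return ("SIF-Blast", ev.blast_function)
    if ev.hhpred_function:
        good_e = ev.hhpred_evalue is not None and ev.hhpred_evalue < HHPRED_MAX_EVALUE
        good_p = (ev.hhpred_probability is not None
                  and ev.hhpred_probability > HHPRED_MIN_PROBABILITY)
        if good_e or good_p:
            return ("SIF-HHPred", ev.hhpred_function)
    if ev.synteny_function:
        return ("SIF-Syn", ev.synteny_function)
    # Nothing found by any source: no known function.
    return ("", "NKF")
```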

QUESTIONS:

A. If we ID an allowed function via Blast analysis, it seems unnecessary to do the HHPred analysis. Do we really need to record the SIF-HHPred results for those features where BlastP IDs are 100%?

B. If the condition in Question A above is met, do we actually need to record the SIF-Syn info from Phamerator in the notes for DNA Master?

We are annotating a lot of phages this semester with very few students, and I don't think we will get some of them fully annotated if these seemingly redundant data points must be included for every feature.

Help?!?!

Edited 13 Feb, 2018 16:46

Link to this post | posted 21 Feb, 2018 15:45

welkin

Hi Greg,
Unfortunately, these data points are not redundant. The problem with assigning functions via only one source is that if an error has been introduced into the database, the error will be propagated in your annotation. The only way to make sure is to confirm your functional assignments through all sources. We are working to make all of the phage-related functional assignments match the entries in GenBank. This should happen shortly, in which case you would only need to look at one of these databases for functions in our phage genes. You should still look in GenBank for functions from sources outside of our database.

And it is OK to relinquish genomes if you don't think you can do them all. I'd rather have fewer annotations done at a high standard than more that we have to fix on the back end.

Best,
Welkin

Link to this post | posted 21 Feb, 2018 16:10

GregFrederick@letu.edu

Ok. Thanks. So if we use PECAAN (as we are for one larger genome) I understand that it does not currently include the SIF codes in the output.

Does this mean that we need to manually go back in and enter them after we import from PECAAN? Thoughts?

Link to this post | posted 21 Feb, 2018 16:13

GregFrederick@letu.edu

One more clarification question, Welkin: When you say "look in GenBank" you mean BLASTP against the NCBI database, correct?

Link to this post | posted 21 Feb, 2018 16:16

welkin

Yes. For the most part, I doubt you will find much in GenBank that is not one of our phage proteins. But it is always worth looking, or even doing a limited search in which you deliberately exclude our phages to find new things, as new sequences are being added daily.

Link to this post | posted 21 Feb, 2018 16:17

welkin

And regarding PECAAN: yes, you will have to enter them manually for now. I know Claire was planning on updating PECAAN, so that may become available.

Link to this post | posted 21 Feb, 2018 16:36

GregFrederick@letu.edu

Thank you. GF

Link to this post | posted 22 Feb, 2018 17:03

GregFrederick@letu.edu

Welkin,

I greatly appreciate you and your responses. We are trying to be diligent in providing accurate and valid information. So I have a few additional questions concerning the instructions in the new online guide related to the SIF reports.

SIF-Blast:

The instruction guide states to consider any hit with an e-value of e-4 or smaller.

The manual states to report as follows:

SIF-BLAST [NKF / function, database, phage name, gene number, database gene accession number, %alignment, evalue]

We are using the NCBI website blast for this analysis (in addition to PhagesDB).

Obviously, if we only hit hypothetical proteins, we enter "SIF-BLAST NKF".

But if we see something that looks like a function in the NCBI Blast we report as follows (for example):

function: as listed in NCBI
database: NCBI (For this SIF-BLAST we are only using NCBI and not PhagesDB. Is this the correct approach?)
phage name: If the hit is a phage, we list it. If it is not a phage, do you want us to list the organism name?
gene number: This is not easy to find in the NCBI data. I think we may have found a way to get it, but the process is cumbersome. Do you need this gp# if the hit is not a phage?
database gene accession number: This we can get through NCBI but not PhagesDB. This section of the User Manual does not clarify which BLAST source should be reported here, but since PhagesDB does not provide this number, we are using NCBI. Is this correct?
%alignment: NCBI provides this. Again, PhagesDB does not provide this value. Should this be corrected on the PhagesDB Blast result output?
e value: We are looking for e-4 and smaller (closer to zero) and only reporting those. This is correct, right? I thought I remembered an e-value of e-16 and smaller last year.
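To make the field order concrete, here is a small sketch that assembles a note in the sequence the manual's template gives (function, database, phage name, gene number, accession number, %alignment, e-value) and falls back to "SIF-BLAST NKF" when nothing passes the e-4 cutoff. The function name, parameter names, and separator handling are assumptions made for illustration, and the values in the usage line are made-up placeholders, not real hits or accessions.

```python
from typing import Optional

EVALUE_CUTOFF = 1e-4  # "e-4 or smaller", per the guide as quoted in the post


def sif_blast_note(function: Optional[str], database: str = "", source_name: str = "",
                   gene_number: str = "", accession: str = "",
                   percent_alignment: Optional[float] = None,
                   evalue: Optional[float] = None) -> str:
    """Build a SIF-BLAST note; returns the NKF form when no usable hit is given."""
    if function is None or evalue is None or evalue > EVALUE_CUTOFF:
        return "SIF-BLAST NKF"
    pct = "" if percent_alignment is None else f"{percent_alignment:g}%"
    fields = [function, database, source_name, gene_number, accession, pct, f"{evalue:.0e}"]
    return "SIF-BLAST " + ", ".join(fields)


# Hypothetical placeholder values, for illustration only.
print(sif_blast_note("portal protein", "NCBI", "ExamplePhage", "12", "XX_000000.1", 100.0, 1e-20))
```

Whether a non-phage organism name belongs in the phage-name slot is exactly the open question above, so the sketch simply passes through whatever name is supplied.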

SIF-HHPred:

For HHPred the user guide gives the following notes format.

SIF-HHPred [NKF / function, database, phage name, gene number, database accession number, %alignment, probability]

But HHPred (with the databases indicated in the User Manual) rarely finds phages. It does find conserved domains (and thus some possible functions).

QUESTION: I thought the purpose of HHPred was to attempt to ID previously undetermined functions or domains. Is this still correct?

For HHPred, following the instructions in the new User Guide, we are looking for functions or conserved domains with a probability of 90% or greater.

QUESTION: Do you still want us to list non-phage-related functions with a probability of 90% or greater? If so, what is the correct syntax?

SIF-SYNTENY:

This note code is also new for us this year. So I want to make certain we are doing this correctly.

We are looking at Synteny via Phamerator comparisons. Normally we pull up 3-5 phages in the same sub-cluster for comparison.

QUESTION: Is comparing 3-5 members of the same cluster for synteny enough data?

The current version of the online User Guide lists the following code for SIF-SYNTENY.

SIF-Syn: [NKF / function, phage(s) used to infer ]
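As a rough illustration of that template, the sketch below joins a function with the phage(s) used to infer it. The function name and the phage names in the test values are hypothetical, and using "NKF" when nothing can be inferred is an assumption; whether NKF or NA is the right placeholder is one of the questions raised just below.

```python
from typing import Optional, Sequence


def sif_syn_note(function: Optional[str], phages: Sequence[str] = ()) -> str:
    """Build a SIF-Syn note from a function and the phages that support it."""
    if not function or not phages:
        # Assumed fallback; the thread does not settle NKF versus NA here.
        return "SIF-Syn: NKF"
    return f"SIF-Syn: {function}, {', '.join(phages)}"
```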

QUESTIONS: The code states "phage(s)". If there is only one, that is easy. But that raises the following questions.

Q: According to the User Guide synteny can only be defined for the following 12 genes.

Terminase
Portal protein
Capsid maturation protease
Scaffolding protein
Major capsid protein
Major tail protein
Tail assembly chaperones
Tape measure protein
Minor tail proteins
lysin A
holin
lysin B

But the notes options provided limit us to NKF or a list of phage(s). NKF does not seem appropriate for those genes where we can actually see functions assigned in related phages.

Q: In the case described directly above, is NKF correct, or would NA be more applicable?

Q: If only 1 phage of 4-5 lists a function, we list that function and phage. Do you also want to know that there are 37 (hypothetical number) phages that do not list that same function?

Q: What if related phages list several different functions for a given feature, all of which are on the approved function list? How do we list those? Do we list one or all of the phage/function pairs? We have found a few such features (in one case with three functions). We are going with the most specific function, but I am not certain that is y'all's desire.

Q: How many surrounding genes should be considered in the assessment of synteny?
Are there any minimum limits on the number of conserved features before we list a phage as SIF-SYN info? I.e., if two genes are adjacent in another solo phage in the same order as ours, with listed functions, is that enough data to be considered valuable from the synteny perspective?

Lots of questions. Thank you in advance for taking time to help with these answers. Your time is much appreciated. Thank you.

Edited 22 Feb, 2018 22:14

Link to this post | posted 06 Mar, 2018 16:09

GregFrederick@letu.edu

We really need help with this question. We have just about everything else completed. But this line:

SIF-HHPred:

For HHPred the user guide gives the following notes format.

SIF-HHPred [NKF / function, database, phage name, gene number, database accession number, %alignment, probability]

But HHPred (with the databases indicated in the User Manual) rarely finds phages. It does find conserved domains (and thus some possible functions).

QUESTION: I thought the purpose of HHPred was to attempt to ID previously undetermined functions or domains. Is this still correct?

For HHPred, following the instructions in the new User Guide, we are looking for functions or conserved domains with a probability of 90% or greater.

So, based on the SIF-HHPred datascript line in the user guide and above, do we indicate "NKF" if we do not hit a phage? We have been listing identified conserved domains as we have in previous years, but the information required by the above datascript does not really fit that well.

HELP???

Link to this post | posted 06 Mar, 2018 16:27

GregFrederick@letu.edu

Related question: The script for SIF-HHPred also asks for "%alignment". But the HHPred website only gives the following data for "hits".

Probability: E-value: Score: Aligned Cols: Identities: Similarity:

Is the percent alignment something we need to calculate? Or should we use one of the above provided values?

If we need to calculate the percent alignment, can you please tell us how this should be done?

Again, we are typically only finding small domains of 20 to 100 amino acid residues with this HHPred analysis, so percent alignment values will normally be very small if I understand correctly what you want. But please show us how to calculate these. Thank you. gf
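The thread does not say how %alignment should be derived from the HHPred output. One possible reading, offered here purely as an assumption and not as the guide's definition, is the share of the query protein covered by the hit: the "Aligned Cols" value divided by the query length. The function and parameter names below are hypothetical.

```python
def percent_aligned(aligned_cols: int, query_length: int) -> float:
    """Assumed definition: aligned columns as a percentage of the query length."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    if not 0 <= aligned_cols <= query_length:
        raise ValueError("aligned_cols must be between 0 and query_length")
    return round(100.0 * aligned_cols / query_length, 1)
```

Under this reading, a 60-residue domain hit on a 300-residue protein would come out at 20%, which matches the expectation above that the values will usually be small.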

Edited 06 Mar, 2018 16:29
