SEA-PHAGES | All posts created by welkin

Link to this post | posted 06 Mar, 2018 21:11
welkin	Sorry, I don't think we are putting this one in any time soon. There are too many different types of unrelated proteins that could provide a similar effect. For now, stick to membrane protein. Once we start getting a better handle on the proteins and their classes, maybe we can come up with something more detailed.

Posted in: Request a new function on the SEA-PHAGES official list → Superinfection Immunity Protein

Link to this post | posted 06 Mar, 2018 21:09
welkin	No. Never seen that before. But usually freshmen find a way to do all sorts of things I've never seen before.

Posted in: DNA Master → Validation error messages

Link to this post | posted 06 Mar, 2018 19:24

welkin

GregFrederick@letu.edu
We really need help with this question. We have just about everything else completed. But this line:

SIF-HHPred:

For HHPred the user guide gives the following notes format.

SIF-HHPred [NKF / function, database, phage name, gene number, database accession number, %alignment, probability]

But HHPred (with the databases indicated in the User Manual) rarely finds phages. It does find conserved domains (and thus some possible functions).

QUESTION: I thought the purpose of HHPred was to attempt to ID previously undetermined functions or domains. Is this still correct?

In the case of HHPred and following the instructions in the new User Guide we are looking for functions or conserved domains with a P value of 90% or greater.

So based on the SIF-HHPred datascript line in the user guide and above, do we indicate "NKF" if we do not hit a phage. We have been listing identified conserved domains as we have in previous years. But the information required by the above datascript does not really fit that well.

HELP???
.
.
Related question: The script for SIF-HHPred also asks for "%alignment". But the HHPred website only gives the following data for "hits".

Probability: E-value: Score: Aligned Cols: Identities: Similarity:

Is the percent alignment something we need to calculate? Or should we use one of the above provided values?

If we need to calculate the percent alignment, can you please tell us how this should be done?

Again, we are typically only finding small domains of 20 to 100 amino acid residues with this HHPred analysis so percent alignment values will normally be very very small if I understand correctly what you want. But please show us how to calculate these. Thank you. gf

Hi Greg,
I have updated the Guide, and I think I've clarified all your questions.

Best,
Welkin
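
The thread does not say how the updated guide defines %alignment, so the following is only a sketch of the reading Greg describes above: the "Aligned Cols" value from an HHPred hit divided by the length of the query protein. The function names and the idea of passing the query length in are assumptions for illustration; the 90% probability cutoff is the one quoted in the question.

```python
def percent_alignment(aligned_cols: int, query_length: int) -> float:
    """Share of the query covered by the hit, as a percentage.

    Assumption: %alignment = Aligned Cols / query length. The thread
    does not confirm this definition.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return 100.0 * aligned_cols / query_length


def passes_probability(probability: float, cutoff: float = 90.0) -> bool:
    """True if an HHPred hit meets the 90% probability cutoff from the question."""
    return probability >= cutoff


# A 50-residue domain hit on a 200-residue protein covers a quarter of it.
print(percent_alignment(50, 200))
print(passes_probability(95.2))
```

Under this reading, a short conserved domain on a long protein does give a small value, which matches the concern raised in the question.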

Posted in: Notes and Final Files → SIF-Blast; SIF-HHPred; SIF-Syn

Link to this post | posted 06 Mar, 2018 19:10

welkin

Hi Greg,
I will try to get to all of this and/or update the guide to clarify for everyone.

As of right now, all of the entries in the phagesdb database should match the entries for the same phages in the NCBI GenBank database. The reason for BLASTing against NCBI is to find information that is not found at phagesdb. So SIF-BLAST will be more complete if you use NCBI.

If your top hit is not a phage but has a good e-value and % alignment, that is OK. You should still report it. As we move into more distantly related hosts, we are likely to see more database matches that are not just actinobacteriophages.

It certainly is a pain to find the gene number when it is not a phage. You may omit the gene number if it is not a phage. Make sure you supply the NCBI gene record number if you can't find the gene number.

phagesdb uses the BLAST package that NCBI provides. The % alignment does not come with the package as a reported number the way it does when you BLAST on the NCBI site or through DNA Master.
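
As a rough illustration of the BLAST advice above, here is a minimal sketch that assembles a SIF-BLAST note from one hit. The field order follows the format quoted from the guide in Greg's message below; the exact punctuation, the function name, and the example hit values are assumptions, not an official template. The gene number is optional so it can be left out for non-phage hits, and the e-4 cutoff is the one quoted from the guide.

```python
from typing import Optional

E_VALUE_CUTOFF = 1e-4  # "e-4 or smaller", as quoted from the guide in this thread


def sif_blast_note(
    function: Optional[str],
    database: str = "",
    organism: str = "",
    accession: str = "",
    pct_alignment: Optional[float] = None,
    e_value: Optional[float] = None,
    gene_number: Optional[str] = None,
) -> str:
    """Build a SIF-BLAST note string (layout is an assumption)."""
    # No functional hit, or a hit weaker than the cutoff: report NKF.
    if function is None or e_value is None or e_value > E_VALUE_CUTOFF:
        return "SIF-BLAST NKF"
    fields = [function, database, organism]
    # The gene number may be omitted when the hit is not a phage.
    if gene_number is not None:
        fields.append(gene_number)
    fields.append(accession)
    # phagesdb BLAST does not report % alignment, so it may be missing.
    if pct_alignment is not None:
        fields.append(f"{pct_alignment:g}%")
    fields.append(f"{e_value:.0e}")
    return "SIF-BLAST " + ", ".join(fields)


# Hypothetical hit: names and accession are placeholders, not real records.
print(sif_blast_note("lysin A", "NCBI", "ExamplePhage", "ABC12345",
                     98.0, 1e-30, gene_number="10"))
```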

HHPred is both for finding new functions and for supporting your BLAST functional assignments. The two outputs should agree, or at least not assign two completely different functions.
There are many phage genes that have been crystallized or added to the Pfam database. If your best match is not a phage, supply the organism name and the database record number.

Synteny: comparing three to five phages should give you a good idea about what genes to look for. You should also scroll through all the pham pages on phagesdb to make sure that you are not missing underreported functions, which is what can happen when you choose 3-5 genomes at random.

Synteny can be used for more than those twelve genes. Those 12 are the minimum that it can be used for. I will clarify in the guide.

I agree that NA is probably more correct. Either NA or NKF is fine.

You do not know that the five phages are correct simply because they have a function listed and the rest don't. The first phage gene could have been assigned that function in error and the rest could be blind copies. This is why we are having three lines of function investigations for every gene.
Which brings me to the next point: conflicting functional assignments. If all the assignments are variations on the same function (LysA, endolysin, lysin A), choose the function that matches the official list. If the functions do not agree (portal vs capsid morphogenesis protein), you've found a database error. You will have to use the rest of your investigations to figure out what the right answer is.
Hopefully soon we will get some kind of tracker going for people to report database errors so we can fix them.
You should not pick the most specific function for your gene unless you can support it. We want the most specific supportable function for each gene.
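
The rule for conflicting assignments can be sketched roughly as follows. The synonym table is hypothetical apart from the lysin A variants named above, and the function name is made up for illustration; a real table would have to be built from the official function list.

```python
# Hypothetical synonym table; only the lysin A variants come from this thread.
OFFICIAL_NAME = {
    "lysa": "lysin A",
    "endolysin": "lysin A",
    "lysin a": "lysin A",
}


def reconcile(functions):
    """Return (function, conflict) for the functions reported for one gene.

    Variations on the same function collapse to the official name.
    Genuinely different functions are flagged as a possible database
    error that needs the other lines of evidence to resolve.
    """
    normalized = set()
    for name in functions:
        key = name.strip().lower()
        normalized.add(OFFICIAL_NAME.get(key, key))
    if not normalized:
        return None, False  # nothing reported, nothing to reconcile
    if len(normalized) == 1:
        return normalized.pop(), False
    return None, True


print(reconcile(["LysA", "endolysin", "lysin A"]))
print(reconcile(["portal protein", "capsid morphogenesis protein"]))
```

A flagged conflict only says that the database entries disagree. It does not say which one is right.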

As far as synteny goes, yes, you can use it on everything, but it just isn't as important in all cases. It is very important in the structural genes, less so in the integration cassette, and still worthwhile in genes that have partners, like RecE and RecT. As we uncover more functions, we may find more gene sets that are always together, and therefore synteny should always be evaluated.

GregFrederick@letu.edu
Welkin,

I greatly appreciate you and your responses. We are trying to be diligent in providing accurate and valid information. So I have a few additional questions concerning the instructions in the new online guide related to the SIF reports.

SIF-Blast:

The instruction guide states to consider anything of e-4 or smaller.

The manual states to report as follows:

SIF-BLAST [NKF / function, database, phage name, gene number, database gene accession number, %alignment, evalue]

We are using the NCBI website blast for this analysis (in addition to PhagesDB).

Obviously, if we only hit hypothetical proteins, we enter "SIF-BLAST NKF"

But if we see something that looks like a function in the NCBI Blast we report as follows (for example):

function: as listed in NCBI
database: NCBI (For this SIF-BLAST we are only using NCBI and not phagesDB. Is this the correct approach?)
phage name: If the hit is a phage we list it. If it is not a phage, do you want us to list the organism name?
gene number: This is not easy to find in the NCBI data. I think we may have found a way to get it. But the process is cumbersome. Do you need this gp# if the hit is not a phage?
database gene accession number: This we can get through NCBI but not phagesDB. There is really no clarification in this section of the User Manual which BLAST source should be reported here. But since PhagesDB does not provide this number we are using NCBI. Is this correct?
%alignment: NCBI provides this. Again, PhagesDB does not provide this value. Should this be corrected on the PhagesDB Blast result output?
e value: We are looking for e-4 and smaller (closer to zero) and only reporting those. This is correct, right? I thought I remembered an e-value of e-16 and smaller last year.

SIF-HHPred:

For HHPred the user guide gives the following notes format.

SIF-HHPred [NKF / function, database, phage name, gene number, database accession number, %alignment, probability]

But HHPred (with the databases indicated in the User Manual) rarely finds phages. It does find conserved domains (and thus some possible functions).

QUESTION: I thought the purpose of HHPred was to attempt to ID previously undetermined functions or domains. Is this still correct?

In the case of HHPred and following the instructions in the new User Guide we are looking for functions or conserved domains with a P value of 90% or greater.

QUESTION: Do you still want us to list non-phage related functions with P-value of 90% or greater? If so, what is the correct syntax?

SIF-SYNTENY:

This note code is also new for us this year. So I want to make certain we are doing this correctly.

We are looking at Synteny via Phamerator comparisons. Normally we pull up 3-5 phages in the same sub-cluster for comparison.

QUESTION: Is comparing 3-5 members of the same cluster for synteny enough data?

The code listed in the current version of the online User Guide lists the following code for SIF-SYNTENY.

SIF-Syn: [NKF / function, phage(s) used to infer ]

QUESTIONS: The code states "phage(s)" If there is only one, that is easy. But that raises the following questions.

Q: According to the User Guide synteny can only be defined for the following 12 genes.

Terminase
Portal protein
Capsid maturation protease
Scaffolding protein
Major capsid protein
Major tail protein
Tail assembly chaperones
Tape measure protein
Minor tail proteins
lysin A
holin
lysin B

But the notes options provided limit us to NKF or phage(s) list. NKF does not seem appropriate for those genes where we can actually see functions assigned in related phages.

Q: In the case described directly above, is NKF correct or would NA be more applicable?

Q: If only 1 phage of 4-5 lists a function, we list that function and phage. Do you also want to know that there are 37 (hypothetical number) phages that do not list that same function?

Q: What if multiple functions are listed in multiple phages and multiple functions for a given feature, all of which are on the approved function name list as "approved". How do we list those? Do we list one or all phage/functions? We have found a few features that have approved function names (in one case three functions) with all of them on the approved list. We are going with the most specific function. But I am not certain that is y'alls desire.

Q: How many surrounding genes should be considered in the assessment of Synteny?
Are there any minimal limits on numbers of conserved features before we list a phage as SIF-SYN info? i.e. if two genes are adjacent in another solo phage in the same order as ours, with listed function, is that enough data to consider valuable from the synteny perspective?

Lots of questions. Thank you in advance for taking time to help with these answers. Your time is much appreciated. Thank you.

Posted in: Notes and Final Files → SIF-Blast; SIF-HHPred; SIF-Syn

Link to this post | posted 06 Mar, 2018 18:36
welkin	There are 32-36 tRNAs in each Cluster C phage. These will be located in three different clusters of tRNA genes in the genome. Make sure you run both tRNAscan-SE and Aragorn online to find them all and get the correct end coordinates. And remember that Phamerator maps don't display tRNA genes, so there will only be a gap in coding sequences on the map of phages with finished annotations.

Posted in: Cluster C Annotation Tips → tRNAs in Cluster C

Link to this post | posted 06 Mar, 2018 18:34
welkin	Cluster A phages have between 0 and 3 tRNAs, found in the left arm around gene 5ish. Make sure you run Aragorn online to get the exact coordinates. And remember that tRNAs don't appear on Phamerator maps, so you will just see a gap in the coding genes in these areas in other annotated phages.
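
The tRNA counts quoted in the Cluster C and Cluster A tips can double as a quick sanity check on a finished annotation. This is only a sketch: the table holds just the two ranges given in these posts, and the function name is made up for illustration.

```python
# Expected tRNA counts per cluster, taken from the two tips in this listing.
EXPECTED_TRNA_COUNTS = {
    "A": (0, 3),
    "C": (32, 36),
}


def trna_count_ok(cluster: str, count: int) -> bool:
    """True if the number of annotated tRNAs falls in the quoted range."""
    low, high = EXPECTED_TRNA_COUNTS[cluster]
    return low <= count <= high


print(trna_count_ok("C", 34))
print(trna_count_ok("A", 5))
```

A count outside the range is a prompt to rerun the tRNA finders and recheck, not proof of an error.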

Posted in: Cluster A Annotation Tips → tRNAs in cluster A

Link to this post | posted 02 Mar, 2018 16:08
welkin	Yep– this thread is just about Cluster B, which currently only has Mycobacterium phage members.

Posted in: Cluster B Annotation Tips → Tail assembly chaperones?

Link to this post | posted 02 Mar, 2018 15:59

welkin

Singletons are challenging because you lack the comparative data that can really drive the decision as to which genes are in or out, which starts are conserved across a cluster, etc.
When you annotate a Singleton, rely on 4bp overlaps for start choices, your gene prediction algorithm outputs, and solid functional data.
And don't worry, as we find new cluster members, we can make changes to older annotations when we have more comparative data to work from.
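
As a small illustration of the 4bp-overlap rule for start choices, the sketch below picks out candidate starts that overlap the upstream gene's stop by exactly four bases. It assumes both genes are on the forward strand and that coordinates are 1-based and inclusive; the function names are hypothetical.

```python
def overlap_bp(prev_stop: int, next_start: int) -> int:
    """Bases shared by two forward-strand genes (1-based, inclusive coordinates)."""
    return max(0, prev_stop - next_start + 1)


def starts_with_4bp_overlap(prev_stop: int, candidate_starts):
    """Candidate starts that overlap the upstream gene by exactly 4 bp."""
    return [s for s in candidate_starts if overlap_bp(prev_stop, s) == 4]


# Upstream gene ends at 1000; a start at 997 shares bases 997-1000 with it.
print(starts_with_4bp_overlap(1000, [950, 997, 1010]))
```

This only narrows the candidates. The gene prediction outputs and functional data still have to support the choice.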

Edited 02 Mar, 2018 15:59

Posted in: Singleton Annotation Tips → Don't Panic!

Link to this post | posted 02 Mar, 2018 15:56

welkin

The most important thing to do as you start to annotate a member of a new cluster is to make sure you review all of the sequences and not just your own. Starterator and Phamerator will be really helpful in identifying genes and starts conserved throughout the cluster, and you may need to look at DNA Master files for all the genomes to resolve features like the slippery sequence in the tail assembly chaperone.

Posted in: My Cluster is all draft annotations! Help! → Use All The Sequences

Link to this post | posted 01 Mar, 2018 17:44
welkin	We only recently found the slippery site in the Cluster E tail assembly chaperone genes. This means that there are many Cluster E phages in which the frameshift has not been added to the annotation. Make sure you find a Cluster E phage with the frameshift, as all E annotations from now on must have one.

Posted in: Cluster E Annotation Tips → Tail assembly chaperones?
