SEA-PHAGES | New Features in PECAAN

Link to this post | posted 09 Apr, 2019 19:22

jawsWPI

Claire Rinehart
Tammy,
We currently do not place the function into the function box in our DNA Master Full Annotation export. Welkin just pointed out to me that the online guide tells how to generate minimal files from the complete notes when functions are recorded in the function field. We will work on adding this feature to the complete file export. Meanwhile, you can use the Export CDS Function button on the export page of PECAAN to export a file that can be copy/pasted into the DNA Master Documentation page which will parse the functions into the notes field. These minimal functions can then be copied en masse into the product or function fields in DNA Master by clicking on the right hand triangle in the Notes field.
Thanks,
Claire

I'm trying to generate a DNA Master minimal file using copy/paste/parse from an exported CDS function file, but it won't auto-label the blank products as "hypothetical"; it's just leaving them blank. I tried using the 'new SEA format' full annotation, but that puts NKF in the function field, which causes the same problem if you follow the entire step sequence for generating a minimal file from the guide (from the guide: "'NKF' is also not considered default, and won't be overwritten").

Is anyone else having this problem? Is there a solution besides pasting "hypothetical" into all the blank product fields?

Link to this post | posted 10 Apr, 2019 13:54

lhughes

My best guess is that the field isn't actually empty, but has something in it (a space perhaps, or a hard return), which is why it isn't being labeled. You probably need to make sure those fields are actually empty in order to get them to write. If you are copying and pasting into the Function field at some point in the process, you might have an extra hard return in the data (this can happen when copying from other programs) that is still in the field and causing the problem.

Lee

Link to this post | posted 10 Apr, 2019 15:27

jawsWPI

Lee Hughes
My best guess is that the field isn't actually empty, but has something in it (a space perhaps, or a hard return), which is why it isn't being labeled. You probably need to make sure those fields are actually empty in order to get them to write. If you are copying and pasting into the Function field at some point in the process, you might have an extra hard return in the data (this can happen when copying from other programs) that is still in the field and causing the problem.

Lee

This was also my thought, and I have done everything I can think of to make sure the fields are truly empty before using the DNA Master tools to copy data from one field to another. The hidden space or hard return appears to be embedded in the PECAAN text file that I'm copying into the DNA Master documentation before parsing. Here is the only hint I have: the last gene product, which is NKF, actually is labeled as hypothetical. I've looked at the text file (copied below), and the only difference is that there isn't a "CDS" immediately after the double quotes (signifying the start of the next feature call) for this last gene.
Attached is what the text file looks like in Notepad, but below is what I got when I copied the highlighted section and pasted that text into this message. There definitely seems to be a carriage return right after every note except the last one.

CDS 54801 - 55664
/gene="81"
/product="gp81"
/locus tag="NarutoRun_81"
/note=""

CDS 55666 - 56145
/gene="82"
/product="gp82"
/locus tag="NarutoRun_82"
/note=""

Claire, is this something new that we have to work around (this is the first time I have used PECAAN for the full annotations)? I assume there haven't been issues before when parsing PECAAN CDS function text files.
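One way to check whether a note that looks empty really is empty is to print its contents in escaped form. The following is a minimal Python sketch, not part of PECAAN or DNA Master: the function name, the sample text, and the assumption that the stray characters sit between the quotes of `/note=""` are all illustrative.

```python
import re

# Hypothetical helper. The /note="..." layout follows the pasted example
# above; where the hidden characters actually sit is an assumption.
NOTE_PATTERN = re.compile(r'/note="(.*?)"', re.DOTALL)


def find_suspicious_notes(text):
    """Return (note_number, escaped_value) for notes that only look empty."""
    suspicious = []
    for number, match in enumerate(NOTE_PATTERN.finditer(text), start=1):
        value = match.group(1)
        # A truly empty note has length zero. A note made up only of
        # whitespace or non-printable characters looks empty on screen.
        if value and all(ch.isspace() or not ch.isprintable() for ch in value):
            suspicious.append((number, repr(value)))
    return suspicious


# Illustrative input: the first note hides a carriage return and line feed.
sample = (
    'CDS 54801 - 55664\n/gene="81"\n/note="\r\n"\n'
    'CDS 55666 - 56145\n/gene="82"\n/note=""\n'
)
for number, shown in find_suspicious_notes(sample):
    print(f"note {number} contains hidden characters: {shown}")
```

Reading the exported file in binary mode and decoding it yourself avoids having the editor or clipboard change the characters before you can inspect them.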

Link to this post | posted 10 Apr, 2019 16:55

ClaireRinehart

JoAnn,
We just processed Chotabhai from PECAAN into DNA Master and then into the submission pipeline without any problems of getting the Hypothetical Protein tag to populate.

Usually in these situations, we have found that the software that you are using to copy the file and paste it into the DNA Master documentation is inserting a character that is not compatible with the DNA Master parsing.

Would you please copy the PECAAN "Export CDS Function" file, paste it into a new file, save it, and then send it to us so that we can compare it to our processed file? Please email it to claire.rinehart@wku.edu.

Please also indicate what software package you are using to view the PECAAN "Export CDS Function" file and to copy from.

Thanks,
Claire

Link to this post | posted 10 Apr, 2019 17:30

cdshaffer

Lee is correct about the presence of non-printable or control characters in your notes field. If you see this line in your documentation:

/note=""

then you know there is something in the notes, which means "hypothetical protein" will not be added correctly.

I see this a lot, and I have always assumed it is an instance of a very common problem in bioinformatics that has to do with newline definitions on Mac vs. DOS vs. Windows vs. Linux (it's complicated, but see this wiki page if you are curious). If my suspicion is correct, then PECAAN is not the problem and cannot fix anything; rather, it's an issue with one of the specific programs you use and/or your OS and how they handle text as you process the downloaded file.

The solution I have found is to use the "recreate" button in DNA Master and then parse again. This has fixed the issue for me. So the protocol then is:

1. Paste the downloaded file into the documentation window
2. Parse this entry by clicking the "parse" button, the next "parse" button, then "yes"
3. Recreate the documentation by clicking the "recreate" button, then "yes"
4. Now parse this version of the documentation by following step 2 again

I usually do one more round of recreate and parse just to be sure. So the click stream I use is:
paste, parse, parse, yes, recreate, yes, parse, parse, yes, recreate, yes, parse, parse, yes

Edited 10 Apr, 2019 17:49
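The newline mismatch suspected above can be made visible outside DNA Master. This is a minimal Python sketch under stated assumptions: the two function names and the sample bytes are hypothetical, and the thread does not say which line ending DNA Master expects, so converting everything to LF is only one possible choice. The fix confirmed in the thread remains the recreate-and-parse sequence.

```python
def count_line_endings(raw: bytes) -> dict:
    """Count Windows (CRLF), old Mac (CR), and Unix (LF) line endings."""
    crlf = raw.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Subtract the CR and LF bytes that belong to CRLF pairs.
        "CR": raw.count(b"\r") - crlf,
        "LF": raw.count(b"\n") - crlf,
    }


def normalize_newlines(raw: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so the lone-CR pass does not convert it twice.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Illustrative bytes, as a Windows program might save the export.
windows_text = b'/product="gp81"\r\n/note=""\r\n'
print(count_line_endings(windows_text))
print(normalize_newlines(windows_text))
```

A file that reports a mix of ending types is a hint that some program along the way rewrote part of the text.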

Link to this post | posted 10 Apr, 2019 19:05

jawsWPI

Thanks Chris. That worked and I could see the changes during each iteration.

Link to this post | posted 26 Jul, 2019 05:01

DrHHNZ

Hi Claire!
We are annotating now in the southern hemisphere. Can you tell us how you use the "region" information in the NCBI Blast output? This means that the Blast match being described includes a "region" in the annotation … but what does that mean?

Thank you for everything you do!
All best,
Heather

Link to this post | posted 26 Jul, 2019 13:14

ClaireRinehart

Heather,
In the NCBI outputs there are several tagged descriptor lines like /note and /product. Occasionally, when the editors at NCBI find that a protein has a domain that they feel matches one of the functional domains, they will insert a /region note. Whenever you find a Yes under the Region header in the NCBI BLAST, it will be a blue link. If you click on this link, a separate window will pop up that contains the /region note and additional annotation lines from the NCBI output. So, the Region column is just a flag that lets you see that there is additional information or confirmation that has been added to the original annotation by NCBI. You will also notice that the Yes / No designators are only present for matches that have greater than 70% identity; this was an arbitrary cutoff that we chose to save search time.

Enjoy!

Claire

Link to this post | posted 16 Aug, 2019 03:27

heather4

ClaireRinehart
Heather,
In the NCBI outputs there are several tagged descriptor lines like /note and /product. Occasionally, when the editors at NCBI find that a protein has a domain that they feel matches one of the functional domains, they will insert a /region note. Whenever you find a Yes under the Region header in the NCBI BLAST, it will be a blue link. If you click on this link, a separate window will pop up that contains the /region note and additional annotation lines from the NCBI output. So, the Region column is just a flag that lets you see that there is additional information or confirmation that has been added to the original annotation by NCBI. You will also notice that the Yes / No designators are only present for matches that have greater than 70% identity; this was an arbitrary cutoff that we chose to save search time.

Enjoy!

Claire

Thanks Claire! That was great. I do have a new question today. In the function window in PECAAN we have no option to choose "integrase"; the only options are "serine integrase" or "tyrosine integrase". However, there are no instructions that I can find that would give the students a hint about which of these to choose. Does anyone know if this is coming?
The annotated genomes that we are finding matches to all use the function "integrase".

Thanks!
Heather

Link to this post | posted 16 Aug, 2019 18:49

ClaireRinehart

Heather,
Yes, only having the tyrosine and serine integrase options does often require a little more work.
One place that I like to go for this information is HHPred. If you can find the hits that have four letter/number names before a _ in the left column, these links lead to the PDB database that usually has a rich set of information. I like to read the collapsed PubMed Abstract under the literature section. This often has reference to the type of integrase. If there is nothing there, search down to the Small Molecules section and you can sometimes find reference to a serine or tyrosine interaction.

Another place in PECAAN to look is at the Pham link under the Starterator dropdown box. This takes you to the Phagesdb summary for the Pham that has the Phages, their functions and sizes. You should see a consistent set of either serine integrases or tyrosine integrases in this pham list.

Another quick summary of the hits found in Phagesdb is in the Phages Function Frequency table above the Phagesdb BLAST. This shows all of the top 100 function hits and will give you a feel for the number of hits called as y-int or s-int as well as their associated phams.

If there are Conserved Domain Database hits these will usually define the integrase type also. Finally, some of the top NCBI hits will often contain either the serine or tyrosine type.
I hope this is helpful.
Thanks,
Claire

Edited 16 Aug, 2019 19:02
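The "consistent set" check described above amounts to tallying the function calls of the other pham members. Here is a minimal Python sketch of that idea; the function name and the example list are hypothetical and hand-typed for illustration, not drawn from Phagesdb or PECAAN.

```python
from collections import Counter


def tally_integrase_calls(functions):
    """Count how many function names specify each integrase type."""
    counts = Counter()
    for name in functions:
        label = name.strip().lower()
        if "serine integrase" in label:
            counts["serine"] += 1
        elif "tyrosine integrase" in label:
            counts["tyrosine"] += 1
        elif "integrase" in label:
            # Older annotations may say only "integrase".
            counts["unspecified"] += 1
    return counts


# Hypothetical function calls copied by hand from a pham member list.
pham_functions = [
    "tyrosine integrase",
    "Tyrosine integrase",
    "integrase",
    "tyrosine integrase",
]
print(tally_integrase_calls(pham_functions).most_common())
```

A tally dominated by one type supports that call; an even split or mostly "unspecified" entries suggests checking the HHPred and Conserved Domain Database evidence instead.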
