
The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.


Can we call DNA Binding proteins based on DNABIND and DNA Binder results?

| posted 15 Feb, 2022 16:34
During the recent workshop, two programs, DNABIND and DNA Binder, were mentioned for predicting DNA binding proteins. We have found several genes that both programs predict to be DNA binding proteins (with varying strengths) but that do not necessarily have strong HHpred alignments to DNA binding proteins. Current BLASTp hits in NCBI and PhagesDB are “Hypothetical Proteins.” However, these genes also appear to be either in an operon or in a syntenic region with other DNA binding proteins, such as DNA methylase, translocase, resolvase, and specific DNA binding proteins. Can we give these genes the general name “DNA binding protein” based on the two programs and the strong possibility that they sit in an operon or syntenic region? In general, is it possible to call DNA binding proteins based on these two programs alone? Two examples from the P1 phage Dynamo are gp 44 (start/stop: 31954-32103) and gp 51 (36150-36428). See attached file.
Thanks!
Fred
| posted 15 Feb, 2022 17:44
Hi Fred,
I don't know enough about DNABIND or DNA Binder to know how good they are at predicting DNA binding proteins. I think we would first need to run these programs on the proteins we have already called DNA binding proteins before deciding whether to adopt them.
Make sense?
debbie
| posted 15 Feb, 2022 18:26
Hi Debbie and Fred,

I've attempted to begin a systematic analysis to determine how much we can trust the outputs from either of these programs.

I accumulated a list of diverse types of DNA binding proteins: tyrosine or serine integrases, terminase large subunits, HTH DNA-binding proteins, RecE exonuclease, RecT ssDNA binding protein, etc. I pulled representative sequences from a subset of phams predominated by proteins with these functions.

With the caveat that I've only run 6 sequences so far, I'm not impressed by DNABIND. It's very fast (which is nice!), but only two of the sequences were predicted as DNA-binding proteins (a tyrosine integrase and an HTH DNA-binding protein). The others were all reported as having a probability less than 40% of being DNA-binding.
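
To put a number on that caveat, here is a minimal sketch of how uncertain a sensitivity estimate from 2 hits out of 6 known DNA-binding proteins is. The Wilson score interval is my choice for illustration, not something either program reports, and the function name is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Counts from the post above: 2 of 6 known DNA-binding proteins were predicted.
detected, tested = 2, 6
low, high = wilson_interval(detected, tested)
print(f"sensitivity = {detected / tested:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

With only six sequences the interval runs from roughly 0.10 to 0.70, so the early result is a reason to test more sequences rather than a verdict on DNABIND.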

DNABINDER is MUCH slower - I'm still waiting on the first protein sequence, nearly an hour later. Ignoring the question of whether we can trust its output, I'm of the opinion that this program is too slow to warrant systematic use by SEA-PHAGES annotators.

-Christian
| posted 15 Feb, 2022 21:38
Hi Debbie,
I concur with you that "An analysis of what we have called DNA binding proteins with these programs is in order to determine if we would want to adopt this."
For now, I would settle for NKF until we get more support. Christian's current work is in order, and we too are starting on preliminary work to confirm.
Fred
| posted 15 Feb, 2022 21:45
Hi Christian,
DNABINDER will eat your lunch if you leave the setting as PSSM! If you want to go home earlier, please change the selection to "amino acid composition." The PSSM model looks at evolutionary trends, which is why it takes so long; the amino acid composition model predicts from the composition of the sequence alone, which is what we want here (more details are explained on their website).
Cheers!
Fred
| posted 16 Feb, 2022 17:42
I also think that the issue of false positives (i.e. specificity) is as important as Christian's initial investigation into false negatives (sensitivity). So I would like to know: if we pick 50 or so phams that we know are NOT DNA binding (structural proteins come to mind first, then lysis enzymes and holins; other suggestions?), do we get any false positives for these proteins? I always think that keeping a protein NKF is better than giving it a false function. So if no false positives are evident, then no major harm would be expected from doing the screen with the two programs, and I think it might be worth considering.

If we were to consider it, I would follow the practice we already use for "membrane proteins": we only screen proteins that are left without a better annotation. That is, screen the NKFs at the end for possible "DNA binding" activity, then add the annotation "DNA binding" to a protein only if BOTH programs predict DNA binding activity.
But I will say again that all depends on having an excellent specificity (like 100%).
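
A rough sketch of the screen described above, assuming each program's output has already been reduced by hand to a yes/no call per protein. The function names and the placeholder results are hypothetical, not real output from either server.

```python
def consensus_call(dnabind_positive, dnabinder_positive):
    """Annotate as DNA binding only when BOTH programs predict it."""
    return dnabind_positive and dnabinder_positive

def specificity(calls_on_known_negatives):
    """Fraction of known non-DNA-binding proteins that were correctly NOT called."""
    if not calls_on_known_negatives:
        raise ValueError("need at least one known negative")
    false_positives = sum(1 for called in calls_on_known_negatives if called)
    return 1 - false_positives / len(calls_on_known_negatives)

def upper_bound_if_zero_false_positives(n, confidence=0.95):
    """Largest false positive rate still consistent with 0 false positives in n tests."""
    return 1 - (1 - confidence) ** (1 / n)

# Placeholder results for 50 known negatives: (DNABIND call, DNA Binder call).
# These values are made up purely to exercise the functions.
negatives = [(False, False)] * 46 + [(True, False)] * 3 + [(False, True)] * 1
calls = [consensus_call(a, b) for a, b in negatives]
print("specificity of the consensus rule:", specificity(calls))
print("95% upper bound on FP rate:", round(upper_bound_if_zero_false_positives(50), 3))
```

One thing this makes visible: even a clean result of zero false positives in 50 known negatives only limits the false positive rate to about 6% at 95% confidence, so showing specificity near 100% would take a much larger negative set.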
Edited 16 Feb, 2022 18:33
| posted 16 Feb, 2022 18:50
Sounds like a great student project!
| posted 16 Feb, 2022 19:01
OK, I have looked into this more. DNAbinder has an estimated false positive rate of 5-7% when using the "realistic dataset". The biggest issue for me is that it appears you can only load one sequence at a time, which makes general use for all genes quite labor and time intensive; that is not ideal for a program we would recommend everyone use. It also runs on an old version of Perl, and the server is running Solaris, so it could be very time consuming to move it to a modern server and make it easily usable as a general program (i.e. to make it less labor intensive).
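
To get a feel for what a 5-7% false positive rate means in practice, here is a small sketch. It assumes a hypothetical genome with 60 NKF proteins, treats every one of them as truly non-DNA-binding (the worst case), and assumes errors on different proteins are independent. It covers DNAbinder alone; requiring agreement with DNABIND would change the rate by an amount we do not know.

```python
def expected_false_positives(n_screened, fp_rate):
    """Expected number of wrong 'DNA binding' calls among true non-binders."""
    return n_screened * fp_rate

def prob_at_least_one_false_positive(n_screened, fp_rate):
    """Chance of one or more wrong calls, assuming independent errors."""
    return 1 - (1 - fp_rate) ** n_screened

n_nkf = 60  # hypothetical number of NKF proteins screened in one genome
for rate in (0.05, 0.07):  # the 5-7% range quoted for the "realistic dataset"
    expected = expected_false_positives(n_nkf, rate)
    chance = prob_at_least_one_false_positive(n_nkf, rate)
    print(f"rate {rate:.0%}: about {expected:.1f} expected false calls, "
          f"P(at least one) = {chance:.2f}")
```

Under these assumptions a single genome would be expected to pick up around three or four wrong calls, which is why the one-program result alone does not look like enough for an approved annotation.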

DNABIND does allow for multiple submissions, so you could just dump all the protein sequences from a phage in, but the program is really designed to analyze 3D structures, and they say on the page: "Although it [DNABIND] can predict DNA binding from the protein sequence alone, pure sequence-based prediction was only validated on a very small set of sequences (all of them belonging to structures in the Protein Data Bank)."
So there is no way to know its performance with just FASTA peptide sequences. An interesting idea would be to combine this program with the newest 3D structure predictors and see how it performs. Again, this would be so time consuming that I would not recommend it as a general protocol for all phage, but it is a very interesting question that a student could investigate.

Bottom line here, I think the issues with practicality mean it should not be added as a recommended protocol for all annotations of all phage but it would be suitable for individual "in depth" investigations. I always have my students do some kind of individual research during the last 3-4 weeks of the semester and using these two programs to search through all the NKF proteins for possible function would certainly qualify for a suitable project. Especially with an eye on specificity and sensitivity issues.

I guess the question that remains is: should these programs be used to add "DNA binding" to approved annotations when someone does the work? To me, again, that would rely on a convincing data analysis showing a very low rate of false positives.
Edited 16 Feb, 2022 19:31