
The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.


Clarification Question About HNH Endonuclease Function Determination in view of hits to the Ref Sequences

| posted 04 May, 2023 23:30
This is a loaded multi-question but please bear with me!

HNH is expected to have a typical ββα-metal fold and Zn-finger motif (which would need protein modeling software to decipher; DOI: https://doi.org/10.1038/srep42542), and the Official Function List simply states that it “Must have H-N-H over a 30 aa span.” It would help students if there were an easy way to make this determination, since it may not always be obvious in HHPred. Besides the percent probability, should we also consider the e-values (and perhaps set an e-value cut-off)? Additionally, must the hit always be to chain A and to the Zn-finger motif, or could it be to other chains such as chain D, with motifs for other metal ions such as manganese or strontium?

In view of phagesDB & HHPred data, we are seeking clarification of the HNH function status of the following five draft Glaske16 genes at positions: 44853-45341 bp (gp 70), 51656-52198 bp (gp 83), 54100-54426 bp (gp 91), 56773-57150 bp (gp 98), and 60940-61320 bp (gp 117). Their respective sequences are provided, along with background information.

>Glaske16_gp70_(44853-45341 bp)
MPDGNQPACKYGACNDPVLARGFCKLHYYRNRDGKPMDGPRRSYSTGPRAWTYERLASVPITSTGAHQRVRRLWGSASLYPCATCGGPAKDWAYDGTDPTHYYEQGRKAWSHFSRWPEFYMPMCKPCHSNHDRRAAADELREYRQWKMRNPGKTLEDLEGVAZ

>Glaske16_gp83_(51656-52198 bp)
MDTIWKPIPQDPTGLYLASQDGRILRKEYVIEKLQSHGHLYRRVMPEKIVKQCIKDRAPSHGVHPIIQMRSSTQYASTVERRVSSLIAAAWHGLPYEAGDRTAQNDWRIGFIDGDPSNVHADNLEWVSNQGVNTHHSHDFYYENLKAYRAQAAVETAESFLARYYSPDEIDWSTAERIAAZ

>Glaske16_gp91_(54100-54426 bp)
MPTNSKNGPRSRGRTGGKFERAKWRVLKANQICAHPDCRQLIDLDLKWPDPMSPTVNHIIPVKDLAWDDPLTYSVENLEPMHLVCNQRLGAGPRKKKPKHPQSRNWREZ

>Glaske16_gp98_(56773-57150 bp)
MALAGEAKREYQRQWRANRRAAWFAGKACVRCGSDEDLELDHVDPTLKVTNAVWSWSQERRDVELAKCQVLCNACHKAKTISQTVITIGLKAYRHGTCSMYEHHRCRCGLCRLWARNKKRRQRAAZ

>Glaske16_gp117_(60940-61320 bp)
MQREYMRRWVANRRSAFFASKQCAMCGAGEELELDHIDPTKKVDHRIWSWTDARRSEELAKCQVLCASCHKKKTGEQWYANRSVSENAHHGTSRRYRKMKCRCGLCRLGNTNRSRALRQRHRVPVEZ

The reference sequences for HNH endonuclease provided in the Official SEA-PHAGES Function List (as of May 9, 2023) are Sisi gp 99 and Arianna gp 54. Both match Geobacillus virus E2 hit 5H0M_A in PDB, with Sisi having a 93.5% alignment, 98.7% probability, and E-value: 1e-7, while Arianna has 67.3% alignment, 98.7% probability, and E-value: 2.2e-7.
>Sisi_gp99
MPRAPKVCRHAGCTTLTTTGTCPQHTTHRWGNHQGRKVPHRLQQATFRRDNWTCQSCGHTATPGSGQLHADHIQPRSRGGADTLDNMRTLCKACHAPKSRAEARGSNT
>Arianna_gp54
MAWSNGSSRTSSKHWQALRASAKKQLGYYCCAVCGITPAGGARLELDHIIPVAEGGSDEMANLQWLCARHHAIKTRAESRRGAQRRAARRRLPQRPHPGLR
HHPred for Arianna is: PDB, Geobacillus virus E2, hit # 5H0M_A, 67.3% alignment, Probability: 98.67%, E-value: 2.2e-7.
In view of the above, we can now specifically ask about the following five draft Glaske16 genes at positions: 44853-45341 bp (gp 70), 51656-52198 bp (gp 83), 54100-54426 bp (gp 91), 56773-57150 bp (gp 98), and 60940-61320 bp (gp 117).
Glaske16 gp 70 (44853-45341 bp) has Skinny gp 71 as its top phagesDB hit, which is called Hypothetical Protein even though it is 100% identical (q1:s1), but it has >10 hits to HNH endonuclease. I am inclined to call this an HNH endonuclease, unless the forum suggests otherwise. Again, below is its aa sequence:
MPDGNQPACKYGACNDPVLARGFCKLHYYRNRDGKPMDGPRRSYSTGPRAWTYERLASVPITSTGAHQRVRRLWGSASLYPCATCGGPAKDWAYDGTDPTHYYEQGRKAWSHFSRWPEFYMPMCKPCHSNHDRRAAADELREYRQWKMRNPGKTLEDLEGVAZ

However, this gene, like the two reference sequences, hits HNH chain A of the same Geobacillus virus E2 hit 5H0M_A in PDB, with 75.5% alignment, 99.19% probability, and E-value: 2.5e-11, with everything exactly as seen above for the two reference sequences, including the HNH endonuclease at position 76-124 (https://www.rcsb.org/structure/5H0M).
Notably, Skinny gp 93, which is called HNH, has poor e-values.

Next is Glaske16 gene at 51656-52198 bp (draft gp 83); its sequence is below:
MDTIWKPIPQDPTGLYLASQDGRILRKEYVIEKLQSHGHLYRRVMPEKIVKQCIKDRAPSHGVHPIIQMRSSTQYASTVERRVSSLIAAAWHGLPYEAGDRTAQNDWRIGFIDGDPSNVHADNLEWVSNQGVNTHHSHDFYYENLKAYRAQAAVETAESFLARYYSPDEIDWSTAERIAAZ
This one also hits HNH endonuclease in phagesDB. HHPred shows it in PDB with 54.1% alignment, Probability: 99.76%, E-value: 6.2e-18. Notably, it does not hit the same chain as the reference sequences (it hits 1U3E_M; https://www.rcsb.org/structure/1U3E), and it has no Zn2+ motif but instead Mn2+ and Sr2+, although it also has the βα.
What is your verdict on this gene in Glaske16 at 51656-52198 bp in view of the above? I am inclined to call it HNH endonuclease, unless the forum suggests otherwise.

Next is Glaske16 gp 91 at position 54100-54426 bp. It has several hits to HNH in phagesDB.
MPTNSKNGPRSRGRTGGKFERAKWRVLKANQICAHPDCRQLIDLDLKWPDPMSPTVNHIIPVKDLAWDDPLTYSVENLEPMHLVCNQRLGAGPRKKKPKHPQSRNWREZ
This has a low e-value but hits the same chain as the reference sequences, and the zinc motif (https://www.rcsb.org/structure/5H0M); that hit is called HNH endonuclease. There is another hit at 4H9D_A (https://www.rcsb.org/structure/4H9D).
What is your verdict on this gene in Glaske16 gp91 at 54100-54426 bp in view of the above? I am inclined, but wary, to call it HNH endonuclease because of the e-values; then again, its hits are the same as those of the reference sequences. Any suggestions?

The next question is about the Glaske16 gp98 at position 56773-57150 bp. It has more than 60 hits to HNH endonuclease in phagesDB. Its sequence is below:
MALAGEAKREYQRQWRANRRAAWFAGKACVRCGSDEDLELDHVDPTLKVTNAVWSWSQERRDVELAKCQVLCNACHKAKTISQTVITIGLKAYRHGTCSMYEHHRCRCGLCRLWARNKKRRQRAAZ

It also hits the same hit 5H0M_A in PDB with the same everything as the reference sequences, and high probability (98%), alignment 52.4%, but with not as great an e value (0.000029). What is your verdict on this one?

Finally, the Glaske16 gp117 at 60940-61320 bp. This gene has more than 70 hits to HNH endonuclease in phagesDB. What is your verdict on this one? Its sequence is:
MQREYMRRWVANRRSAFFASKQCAMCGAGEELELDHIDPTKKVDHRIWSWTDARRSEELAKCQVLCASCHKKKTGEQWYANRSVSENAHHGTSRRYRKMKCRCGLCRLGNTNRSRALRQRHRVPVEZ
It also hits the same hit 5H0M_A in PDB (https://www.rcsb.org/structure/5H0M) with the same everything as the reference sequences Sisi gp 99 and Arianna gp 54, and high probability (98.05%), alignment 49.6%, but with not as great an e-value (0.000017). What is your verdict on this one?
See details in attached file.
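
The HHPred numbers quoted in this post can be lined up side by side. The following is a minimal Python sketch that only restates the probabilities and E-values reported above and flags the Glaske16 genes whose E-value is weaker (larger) than that of either reference sequence. The dictionary layout and variable names are illustrative assumptions, gp 91 is left out because no numeric values were given for it, and no particular cut-off is implied.

```python
# HHPred values as quoted in this post: (top PDB hit, probability %, E-value).
hits = {
    "Sisi gp99": ("5H0M_A", 98.7, 1e-7),
    "Arianna gp54": ("5H0M_A", 98.67, 2.2e-7),
    "Glaske16 gp70": ("5H0M_A", 99.19, 2.5e-11),
    "Glaske16 gp83": ("1U3E_M", 99.76, 6.2e-18),
    "Glaske16 gp98": ("5H0M_A", 98.0, 2.9e-5),
    "Glaske16 gp117": ("5H0M_A", 98.05, 1.7e-5),
}
references = ("Sisi gp99", "Arianna gp54")

# The weakest (largest) E-value among the two reference sequences.
worst_ref_evalue = max(hits[name][2] for name in references)

# Query genes whose E-value is weaker than both references.
weaker = sorted(
    name
    for name, (_, _, evalue) in hits.items()
    if name not in references and evalue > worst_ref_evalue
)

for name in weaker:
    pdb_hit, prob, evalue = hits[name]
    print(f"{name}: {pdb_hit}, probability {prob}%, E-value {evalue:g}")
```

This lists gp 98 and gp 117, the two genes described above as having a "not as great" e-value, while gp 70 and gp 83 score better than both references.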
Edited 09 May, 2023 18:55
| posted 05 May, 2023 14:44
Fred,
I'm going to provide a general answer to this and then if you need further clarification, please ask.
First, I would recommend that we ignore RefSeq at NCBI Blast for now. When we curate and correct our annotations they are not being carried over to the RefSeq data. We do not have control of the RefSeqs so they are there, but I would not use the data.

In general, what is ascribed in a GenBank file is someone's interpretation of their investigations for that protein. We continually ask that you find more 'primary' data than that. So use the conserved domain database and the PDB in HHPred. All other pieces of data can again get you to hearsay, so evaluate them carefully.

*And I am attaching the disclaimer that all points here can be broken.

As for the specific HNH call, SMART has discovered that when you look at the alignment hits to an HNH, you must have helix-turn helices in your protein, and there must be a conserved H-N-H present. I can confirm that we have called HNHs inappropriately in the past and are still rectifying that a bit at a time.

Finally, I cannot stress enough to all that using BLAST hits to assign function is fraught with error. And the degree to which it is erroneous is protein dependent. For example, if you blast one of the first big genes in the genome, it is likely the terminase, so when you blast it, and it hits a terminase, you are likely to be right. BUT, when you HHPred it, it will also hit a crystal structure of a terminase, so what is the rationale for your call: matching what others called, OR a significant hit to a crystal structure with all parts identified? Terminases are not confusing, so if you stop at a blast hit, you will still be right. But there are other calls that require closer inspection, like HNHs, endonucleases, and endolysins, to name a few.

There is no catch-all direction I can provide for how to navigate this except to do the entire bit of work and make the case.
Full disclosure: I have not investigated what you asked. So if I have not directly answered your questions, or this would not allow you to correctly assign a function to your protein, let me know.
best,
debbie
| posted 05 May, 2023 19:33
As for a simple method for students to use:
I just copied the sequences into Word, then used advanced find and replace to make all the H's red and the N's green; that took all of 30 seconds. It then took me less than 5 minutes to screen all the proteins by eye, and I was able to find an HNH pattern in all the sequences except one (gp98). One had HNNH [since any amino acid can be between the H's and the N, I would say that having more than one N is OK, but maybe not], so if HNNH should be rejected, we need to clarify the simple test that there “Must have H-N-H over a 30 aa span.”

Easy enough for students to do.

See attached with the colors and my underlines for the HNH patterns I found.
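
The by-eye screen described above can also be sketched in a few lines of Python. This is a minimal sketch, assuming the rule means an H, then an N, then another H, in that order, all within `max_span` residues; the function name `find_hnh_spans` and that reading of the rule are assumptions for illustration, not an official SEA-PHAGES definition.

```python
def find_hnh_spans(seq, max_span=30):
    """Return 1-based (start, end) positions of H...N...H runs.

    For every H in the sequence, look at the next `max_span` residues
    (counting that H) and report the span up to the first H that
    follows the first N. Other residues may sit in between.
    """
    seq = seq.upper()
    spans = []
    for i, aa in enumerate(seq):
        if aa != "H":
            continue
        window = seq[i:i + max_span]
        n_pos = window.find("N", 1)          # first N after the leading H
        if n_pos == -1:
            continue
        h_pos = window.find("H", n_pos + 1)  # first H after that N
        if h_pos == -1:
            continue
        spans.append((i + 1, i + h_pos + 1))
    return spans


if __name__ == "__main__":
    # Toy sequence (not from any phage): H at 3, N at 6, H at 9.
    print(find_hnh_spans("AAHAANAAHAA"))
```

A protein with no H-N-H run inside the window returns an empty list; widening `max_span` (for example to 35) shows whether a pattern appears just outside the 30 aa limit. Because the test only looks at residue order, a match is a prompt to inspect the HHPred alignment, not evidence of function on its own.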
| posted 09 May, 2023 19:10
Thank you Debbie & Christopher.
I think the simple test would be very helpful, as the PDB hits were matching the Reference sequences as illustrated in the attachment with my initial post. I have edited the initial post to clarify that I was not referring to NCBI Ref sequences, but rather to the reference sequences provided in the Official Function List. The edit is, "The reference sequences for HNH endonuclease provided in the Official SEA-PHAGES Function List (as of May 9, 2023)." If SMART could clarify the "HNNH" pointed out by Christopher, that would be great as well.

I have also taken another look at Glaske16 gp 98 at position 56773-57150 bp, which Christopher pointed out is not an HNH endonuclease (see attached), and note that it has HNNH in a 35 aa span (not exactly HNH), whereas the Official Function List states that an HNH endonuclease “Must have H-N-H over a 30 aa span.” Many previous annotators have called it HNH (Glaske16 gp98 has more than 60 hits to HNH endonuclease in phagesDB!). If we will no longer be calling this an HNH endonuclease, could the Official Function List state that for any gene to be called an HNH endonuclease, it “Must have H-N-H within a span of not more than 30 aa,” and also clarify whether H-N-N-H is acceptable if it falls within a 30 aa span?

Thanks!
Fred
Edited 10 May, 2023 05:01
| posted 12 May, 2023 04:34
Hi Fred and Chris,
I am not convinced that this isn't something that we could call an HNH.
I just skimmed a paper, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3592412/ , that seems to suggest that the motif could be HNH, HNK, or HNN.
In general, there are a number of endonuclease 'types' that we have not addressed. I am not sure we have a complete enough understanding to differentiate.
debbie
| posted 12 May, 2023 20:08
I too think we could call gp98 an HNH. I did an HHPred search with gp98 against the Pfam database, since there is a Pfam motif with the label "HNH endonuclease" (PF01844). In this case, looking at the alignment, gp98 would be of the HKH type. So given the definition used by the paper Deb linked, gp98 should be annotated as "HNH endonuclease".
Others may want a stricter definition of "only those endonucleases that actually have those exact 3 specific amino acids" and might argue that we should call gp98 an HKH endonuclease or just endonuclease.

There is no right or wrong answer to how a term should be defined, but given Fred's totally valid points and the comments from Deb's paper, I think we should just change the note on the approved terms list to "Has H-N-H within 30 aa span but minor variations allowed, see forum topic 5505" or something similar.
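
A relaxed version of the test, allowing the variants named in this thread, might look like the following Python sketch. It assumes each variant is read as three residues occurring in order within the span, with other residues allowed in between; the names `has_ordered_motif`, `matching_variants` and `VARIANTS` are hypothetical, and the variant list (HNH, HNK, HNN, HKH) is taken only from the posts above, not from the Official Function List.

```python
# Motif variants mentioned in this thread; adjust to whatever SMART approves.
VARIANTS = ("HNH", "HNK", "HNN", "HKH")


def has_ordered_motif(seq, motif, max_span=30):
    """True if the three residues of `motif` occur in order within `max_span` aa."""
    first, second, third = motif
    seq = seq.upper()
    for i, aa in enumerate(seq):
        if aa != first:
            continue
        window = seq[i:i + max_span]
        j = window.find(second, 1)        # first middle residue after the start
        if j == -1:
            continue
        if window.find(third, j + 1) != -1:
            return True
    return False


def matching_variants(seq, max_span=30):
    """List the variants from VARIANTS that `seq` satisfies."""
    return [m for m in VARIANTS if has_ordered_motif(seq, m, max_span)]


if __name__ == "__main__":
    # Toy sequences, not taken from any phage genome.
    for toy in ("AAHAAKAAHAA", "HANAK", "HNNH"):
        print(toy, matching_variants(toy))
```

Reporting which variant matched, rather than a plain yes or no, keeps the stricter reading available to anyone who wants to annotate only exact H-N-H proteins.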
| posted 13 May, 2023 00:14
Thank you Debbie & Chris!
I think a note such as “Has H-N-H within 30-40 aa span but minor variations such as HNK, HNN, HNNH allowed, see forum topic 5505” or something similar as Chris suggests would be very helpful.
Fred
 