The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.

RefSeq and INSDC name disagreements in NCBI BLAST for Functional Assignment

| posted 17 Feb, 2023 19:50
I start out this post with the caveat that I may be doing something really dumb here, so I apologize if this is a known issue and I have just somehow missed it.

Here's the situation: We're annotating MulchSalad (F) and had just started to teach function calling.

We picked gene 1 to demonstrate with (big mistake!).

Basically here's the issue: If you BLAST on Phagesdb this gene hits to terminase small subunit. This is shown in the Pham view for other annotated genomes.

If we BLAST (blastp) on NCBI with default settings, we see "minor tail protein" for the annotation:
https://capture.dropbox.com/ZrKNG9p6b1vT2pDa

I was confused by this and dug deeper. Apparently, this is because NCBI BLAST shows the RefSeq entry by default for matches. The regular INSDC GenBank entries are still there with the names given by SEA-PHAGES, but they are hidden in the hits by default. You can directly compare this if you click around:
https://capture.dropbox.com/u8NpcBuUFOAcg2UT

If you look at the RefSeq entry it gives "minor tail protein" and if you look at GenBank it says "terminase small subunit"

I am not sure if this is a common problem, but we definitely found it here. I am not sure there is really a question here so much as an observation. If I *do* have a question it's likely unanswerable–why did RefSeq call this MTP? HHPred agrees this is a terminase large subunit or terminase.

Is this a common issue? This is the first time I'd ever seen anything like this. If it's isolated, I can deal with it ad hoc… if it's common maybe I need to plan a workaround.

Thanks, all!
Kyle

Kyle MacLea
Associate Professor, University of New Hampshire at Manchester
kyle.maclea@unh.edu +1 603-641-4129
| posted 17 Feb, 2023 20:24
Hi Kyle,
Well, this is quite messed up, isn't it?
I will investigate further.
In the meantime, here is what I think of BLASTp function calls at NCBI: I don't value them very much, not without supporting evidence. If you continue to investigate, you will find no supporting data for a minor tail protein except that NCBI said so. A functional call cannot be made from BLAST data when the HHPred data sources do not support it. (HHPred starts from an iterative, PSI-BLAST-style search, so it finds more distant relationships to a protein than a single BLAST could.)
Looks like this is a terminase, small subunit to me.
debbie
| posted 19 Feb, 2023 13:21
That is very helpful, Debbie! Thank you!

My inclination was to ignore this particular NCBI RefSeq call–so I am glad to hear you agree. And in the meantime I am definitely hoping this isn't more widespread!

Kyle

Kyle MacLea
Associate Professor, University of New Hampshire at Manchester
kyle.maclea@unh.edu +1 603-641-4129
| posted 22 Feb, 2023 20:00
debbie wrote:
"So if you continue to investigate, there is no supporting data for a minor tail protein except NCBI said so. there is no way that a functional call can be made from the blast data that HHPred data sources does not support."

We're getting no blastp CDD or HHPred support for at least two "minor tail proteins" in cluster K phages, including in the publication supplementary tables posted there. I would call them hypothetical proteins. I haven't dug yet for any synteny rules or anything related to these. Can you point me toward something to use?
I commiserate with you Kyle.
| posted 23 Feb, 2023 01:14
Hi Allison,
Minor tail proteins are the most common functional assignments that are acceptable to call with no CDD or HHPred supporting data. The genes that are "eligible" are the 4-6 large genes downstream of tape measure. I would not make the assignment if the gene is not a relatively big gene.
In the case of Kyle's question, there is discrepant data that contradicts a minor tail protein call - synteny and HHPred hits.
Best,
debbie
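
For anyone who wants to screen a genome against the rule of thumb above, here is a minimal Python sketch. It is only an illustration, not an official SEA-PHAGES tool. The `Gene` class, the `minor_tail_candidates` function, the gene numbers, and especially the `min_length_bp=900` cutoff are hypothetical placeholders: the post gives no numeric size threshold, so pick your own. It also assumes that "downstream" means increasing gene number, and it reads "4-6" as a window of up to six genes after tape measure.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    number: int                          # gene number in the annotation
    length_bp: int                       # gene length in base pairs
    function: str                        # current functional call
    conflicting_evidence: bool = False   # e.g. HHPred or synteny points elsewhere


def minor_tail_candidates(genes, window=6, min_length_bp=900):
    """Return gene numbers that could be screened as minor tail proteins.

    window and min_length_bp are assumptions made for this sketch only.
    """
    ordered = sorted(genes, key=lambda g: g.number)
    for i, gene in enumerate(ordered):
        if "tape measure" in gene.function.lower():
            # Look only at the genes immediately downstream of tape measure.
            downstream = ordered[i + 1 : i + 1 + window]
            return [
                g.number
                for g in downstream
                if g.length_bp >= min_length_bp and not g.conflicting_evidence
            ]
    return []  # no tape measure gene found


# Made-up genome fragment, for illustration only.
genes = [
    Gene(14, 600, "hypothetical protein"),
    Gene(15, 3600, "tape measure protein"),
    Gene(16, 1500, "hypothetical protein"),
    Gene(17, 1800, "hypothetical protein"),
    Gene(18, 300, "hypothetical protein"),
    Gene(19, 1200, "hypothetical protein", conflicting_evidence=True),
    Gene(20, 1000, "hypothetical protein"),
]
candidates = minor_tail_candidates(genes)
print(candidates)
```

In this made-up example, gene 18 is dropped for being small and gene 19 is dropped because other evidence contradicts the call, which mirrors the terminase case discussed earlier in the thread.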
| posted 23 Feb, 2023 13:40
Perfect, thank you.
| posted 23 Feb, 2023 17:33
Thank you–I like to use the MTPs as an example and it's nice to be reminded of the parameters around the function call!

(This silly RefSeq issue notwithstanding!)

Kyle MacLea
Associate Professor, University of New Hampshire at Manchester
kyle.maclea@unh.edu +1 603-641-4129
| posted 24 Feb, 2023 20:52
So we've got another instance of the same issue, Debbie.

MulchSalad_Draft_37 is an integrase/tyrosine integrase by analysis on Phagesdb BLAST/HHPred as shown in the pham report:

https://capture.dropbox.com/TobVF9m4QnDFz6Hl

NCBI BLAST is showing mostly "endonuclease" as the annotation:

https://capture.dropbox.com/g0O9azXO34T0rhG9

And if you do the "Identical Proteins" analysis on any of the predicted "endonuclease" genes you see the difference again:

https://capture.dropbox.com/RTPOWS6fziZ4TMpK

What we're seeing again is that the RefSeq ("curated") database is calling most of our Cluster F pham 68632 integrases endonucleases, whereas GenBank/INSDC and Phagesdb are calling them integrases. So somewhere along the way RefSeq "curation" is changing the default name shown for these genes.

And since apparently, by default, NCBI BLAST shows the RefSeq results and only shows the GenBank results if you dig deeper, I'm guessing this will cause a certain amount of confusion going forward.

Meh.

Kyle

Kyle MacLea
Associate Professor, University of New Hampshire at Manchester
kyle.maclea@unh.edu +1 603-641-4129
Edited 24 Feb, 2023 20:54
| posted 03 Mar, 2023 19:29
Just a note that we continue to find this in working on MulchSalad (F1) genes.

We saw some genes today in which RefSeq is calling the gene one thing and INSDC is instead calling it a hypothetical protein! So it gets even more complicated!

Kyle

Kyle MacLea
Associate Professor, University of New Hampshire at Manchester
kyle.maclea@unh.edu +1 603-641-4129
| posted 05 Jul, 2024 21:52
I have noticed that NCBI is revising whole blocks of phages and submitting them as new submissions with different accession numbers than the original SEA-PHAGES submissions. This is easy to detect when looking at the new NCBI file because a new reference (1) has been added ahead of the original reference (2):
LOCUS       YP_010057231              46 aa            linear   PHG 10-JAN-2023
DEFINITION  HNH endonuclease [Mycobacterium phage Cane17].
ACCESSION   YP_010057231
VERSION     YP_010057231.1
DBLINK      BioProject: PRJNA485481
DBSOURCE    REFSEQ: accession NC_054716.1
KEYWORDS    RefSeq.
SOURCE      Mycobacterium phage Cane17
  ORGANISM  Mycobacterium phage Cane17
            Viruses; Duplodnaviria; Heunggongvirae; Uroviricota;
            Caudoviricetes; Ceeclamvirinae; Bixzunavirus; Bixzunavirus cane17.
REFERENCE   1  (residues 1 to 46)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (07-MAY-2021) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   2  (residues 1 to 46)
  AUTHORS   Fast,K.M., Castleberry,S., Jones,I.K., Larrimore,J.D., Long,C.A.,
            Pritchett,N.C., Keener,T., Sandel,M.W., Bollivar,D.W.,
            Garlena,R.A., Russell,D.A., Pope,W.H., Jacobs-Sera,D. and
            Hatfull,G.F.
  TITLE     Direct Submission
  JOURNAL   Submitted (28-JUL-2018) Biology, Illinois Wesleyan University, 1312
            Park Street, Bloomington, IL 61701, USA
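
If you want to check many records for this pattern, a rough Python sketch like the one below could do it. It is an assumption-laden illustration, not an NCBI tool: `summarize_record` is a hypothetical helper name, it uses only the standard library, and it looks only for the two signals visible in the record above (the `RefSeq.` keyword and a `CONSRTM NCBI Genome Project` reference). The embedded sample is a shortened copy of that record.

```python
import re


def summarize_record(flatfile_text):
    """Report simple provenance signals from one GenBank-format record."""
    lines = flatfile_text.splitlines()
    is_refseq = any(re.match(r"\s*KEYWORDS\s+.*RefSeq", ln) for ln in lines)

    # Collect each REFERENCE block and any consortium line inside it.
    references = []
    for ln in lines:
        stripped = ln.strip()
        if stripped.startswith("REFERENCE"):
            references.append({"consortium": None})
        elif references and stripped.startswith("CONSRTM"):
            references[-1]["consortium"] = stripped[len("CONSRTM"):].strip()

    ncbi_reference = any(
        ref["consortium"] == "NCBI Genome Project" for ref in references
    )
    return {
        "is_refseq": is_refseq,
        "reference_count": len(references),
        "ncbi_reference": ncbi_reference,
    }


# Shortened copy of the record shown above.
record = """\
DEFINITION  HNH endonuclease [Mycobacterium phage Cane17].
KEYWORDS    RefSeq.
REFERENCE   1  (residues 1 to 46)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
REFERENCE   2  (residues 1 to 46)
  AUTHORS   Fast,K.M., Castleberry,S.
  TITLE     Direct Submission
"""
summary = summarize_record(record)
print(summary)
```

A record that is flagged as RefSeq and carries the NCBI reference is one where the displayed function may differ from the original submission, so it is worth opening the matching GenBank entry before trusting the name.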

The major problem that I have with this is that we are not able to see the evidence or documentation that led to this huge change. This causes problems for our students, who see an overwhelming block of identical functions in NCBI and usually do not notice the original submissions with different functions. Sometimes the original submissions are visible in the NCBI BLAST results, as shown in the PECAAN output below:

HNH endonuclease [Mycobacterium phage Cane17]
>gb|AXQ51660.1| hypothetical protein SEA_CANE17_46 [Mycobacterium phage Cane17] >gb|QAY13996.1| hypothetical protein SEA_COLT_48 [Mycobacterium phage Colt]

and other times the original SEA_ evidence is sorted way down the list of results.
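
As a possible way to surface those buried names, here is a small Python sketch that compares the displayed hit title with each member of the merged defline. It assumes the `>db|accession| product [organism]` layout seen in the output above. The helper names `product_of` and `conflicting_members` are made up for this example, and product names that themselves contain square brackets would not parse correctly.

```python
import re

# One member of a merged BLAST defline: ">db|accession| product [organism]"
MEMBER = re.compile(
    r">(?P<db>\w+)\|(?P<acc>[^|\s]+)\|\s*"
    r"(?P<product>[^\[\]>]+?)\s*\[(?P<organism>[^\]]+)\]"
)


def product_of(title):
    """Strip the trailing '[organism]' from a hit title."""
    return re.sub(r"\s*\[[^\]]*\]\s*$", "", title).strip()


def conflicting_members(title, defline):
    """List defline members whose product does not start with the shown name."""
    shown = product_of(title).lower()
    conflicts = []
    for m in MEMBER.finditer(defline):
        # startswith() tolerates a locus tag appended after the product name.
        if not m.group("product").lower().startswith(shown):
            conflicts.append((m.group("db"), m.group("acc"), m.group("product")))
    return conflicts


title = "HNH endonuclease [Mycobacterium phage Cane17]"
defline = (
    ">gb|AXQ51660.1| hypothetical protein SEA_CANE17_46 [Mycobacterium phage Cane17] "
    ">gb|QAY13996.1| hypothetical protein SEA_COLT_48 [Mycobacterium phage Colt]"
)
conflicts = conflicting_members(title, defline)
for db, acc, product in conflicts:
    print(f"{db} {acc}: '{product}' differs from displayed '{product_of(title)}'")
```

Here both GenBank members are reported, because the original submissions call the protein hypothetical while the displayed title says HNH endonuclease.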

We instruct our students to go with the Phagesdb results which are supported by HHPred or CDD evidence.

The NCBI results are great for confirming the 1:1 start correlations.

Enjoy!
Claire Rinehart
 