SEA-PHAGES | Blastp with ClusteredNR

posted today, 18:44
aajohnson

How do we feel about blastp with the clusteredNR (nr_cluster_seq) database for positional annotation? It seems like clusteredNR is the new default; I'm not sure whether that is by design or because it's at the top of the list.

For me the search returns much faster than with the non-redundant protein sequences (nr) database. I'm working on a mycobacteriophage and seeing a wider diversity of hits (non-mycobacteriophage) higher in the descriptions table, which is interesting. But the alignment tab shows only one hit from each cluster, which doesn't really illustrate how many other sequences have a 1:1 amino acid match between query and subject. It doesn't give you a message like "See 13 other titles", which is something I previously told my students to look for to understand the depth of matches. The number of sequences in the cluster on the descriptions tab (20) matches neither the number of sequences that are identical according to the clusteredNR result (15; I haven't really inspected these yet) nor the number with a 1:1 match using the nr database (13).

[Yes, I do tell them to also use phagesdb blastp, but I want them using a tool they will use more broadly after this class.]

Anyone [Chris S] have thoughts?

posted 39 minutes ago
debbie

Hi Allison,
In general, I would discourage using blastp at NCBI for start information.
My rationale is this. Do not use any RefSeq data, because it is not provided by the owner of the sequence. For most data outside the SEA-PHAGES program, annotations are done with automated software and do not receive the same scrutiny that we apply. Finally, having Starterator data (with 'raw' nucleotide alignments) surpasses alignments of called genes that have not gone through our scrutiny.

As for the clustering part, again, that is provided in Starterator. Also, matches to identical sequences don't provide much depth either. When sequences are identical, you are looking at the same instance, so whether the calls agree tells you only whether 'we' agree with each other, not how a sequence is conserved over time.

The hits to non-actinobacteriophage data are of interest, but again, what criteria were used to make that call, especially as it pertains to starts?

There is some really nice data available from the NCBI hits that could also be investigated, like multiple sequence alignments.

best,
debbie
