4

Is there a paper or web page describing the procedure for creating the nr database used by NCBI's BLAST implementation?

I presume it's some type of clustering, but I'm curious about how exactly sequences are condensed into non-redundant representatives.

juniper-
  • Hi, what were your search terms that didn't work? (Knowing them would help save other people's time.) What do you expect them to do differently from what you did? – llrs May 06 '19 at 07:22
  • Did you perhaps mean to leave this comment for a different question? I'm not sure what search terms you are referring to. – juniper- May 06 '19 at 18:55
  • No, I didn't confuse your question with another. I'm sure you tried to find the answer to your question yourself. I was referring to the search terms you used to look for it. I hoped we could learn what you searched for and suggest other search terms, or explain how we arrived at the answer (to help you in your future searches). – llrs May 07 '19 at 07:13
  • 2
    You are right, I did try to find the answer to my question myself. I searched for the terms in the question itself: NCBI BLAST nr, BLAST non-redundant, etc... But your comment got me to search some more and I ended up finding my answer. It's in an answer below :-) – juniper- May 07 '19 at 16:32
  • So your answer appears to confirm mine, which is good. I can delete mine if it's not helpful or needed? – Chris_Rands May 08 '19 at 08:47
  • 1
    Please don't delete your answer! The links to the papers are helpful and it provides more context. – juniper- May 09 '19 at 04:13

2 Answers

4

The RefSeq team and the NCBI resource coordinators each publish a new paper every few years, so check out the many papers (e.g. here or here). To answer your second question: non-redundancy here is (I think) defined very strictly, as proteins that are identical in both sequence and length. The clustering is therefore trivial and does not need the kind of sophisticated clustering algorithm required to detect more remote homologs.
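To illustrate why this kind of clustering is trivial, here is a minimal Python sketch that groups records by exact sequence identity using a dictionary. It is only an illustration of the idea, not NCBI's actual pipeline; the function name `collapse_identical` and the accessions and sequences in the example are made up.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (accession, sequence) pairs by exact sequence identity.

    Two sequences land in the same group only if the strings are equal,
    which implies identical length and identical residues at every position.
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence].append(accession)
    return dict(groups)


# Hypothetical toy records, not real database entries.
records = [
    ("P1", "MKTAYIAK"),
    ("P2", "MKTAYIAK"),   # exact duplicate of P1
    ("P3", "MKTAYIAKQ"),  # one residue longer, so it stays separate
]

clusters = collapse_identical(records)
for sequence, accessions in clusters.items():
    print(accessions, sequence)
```

Because the grouping key is the full sequence string, a sequence that differs by even one residue, or that is a prefix of another, forms its own group.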

Chris_Rands
  • This is certainly helpful reading, but RefSeq non-redundant protein sequences (a recent development, if I'm not mistaken) are not the same thing as the nr database (which has been around a while). – Daniel Standage May 06 '19 at 15:51
  • 1
    @DanielStandage agreed, "nr" is used ambiguously to describe non-redundant proteins (nr), nucleotides (nt), or these RefSeq records. But yes, the BLAST nr databases are not just RefSeq records but also GenBank, PDB, etc. I think they still use the strict identical-sequence definition of non-redundant, though? – Chris_Rands May 06 '19 at 16:30
2

Did a little more searching and found the answer in the README on BLAST's ftp site: ftp://ftp.ncbi.nlm.nih.gov/blast/db/README

6. Non-redundant defline syntax

The non-redundant databases are nr, nt and pataa. Identical sequences are merged into one entry in these databases. To be merged two sequences must have identical lengths and every residue at every position must be the same. The FASTA deflines for the different entries that belong to one record are separated by control-A characters invisible to most programs. In the example below both entries Q57293.1 and AAB05030.1 have the same sequence, in every respect:

>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC [Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
ANPDQFDPDATKAFIHFTEQGIFLLNKE

Individual sequences are now identified simply by their accession.version.

For databases whose entries are not from official NCBI sequence databases, such as Trace database, the gnl| convention is used. For custom databases, this convention should be followed and the id for each sequence must be unique, if one would like to take the advantage of indexed database, which enables specific sequence retrieval using blastdbcmd program included in the blast executable package. One should refer to documents distributed in the standalone BLAST package for more details.

Landed on that README from this question on biostars.org: https://www.biostars.org/p/217456/
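For anyone who wants to pull the individual entries back out of a merged defline, here is a minimal Python sketch. It assumes the control-A separator described in the README excerpt above is the byte `\x01` (shown there as `^A`, so `^AAAB05030.1` is control-A followed by `AAB05030.1`). The function name `split_merged_defline` is my own and not part of any BLAST tool.

```python
CTRL_A = "\x01"  # the control-A character, displayed as ^A in the README


def split_merged_defline(defline):
    """Split a merged nr-style defline into (accession, description) pairs."""
    entries = []
    for part in defline.lstrip(">").split(CTRL_A):
        # The accession.version is the first whitespace-delimited token.
        accession, _, description = part.strip().partition(" ")
        entries.append((accession, description))
    return entries


# Defline from the README example, with ^A written as an explicit \x01.
defline = (
    ">Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC "
    "\x01AAB05030.1 afuC [Actinobacillus pleuropneumoniae] "
    "\x01AAB17216.1 afuC [Actinobacillus pleuropneumoniae]"
)

entries = split_merged_defline(defline)
for accession, description in entries:
    print(accession, "|", description)
```

Each tuple corresponds to one of the source records that share the single stored sequence.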

Edit

In that same README file is some information on the origin of the sequences in the non-redundant sets:

+-----------------------+-----------------------------------------------------+
|File Name              | Content Description                                 |
+-----------------------+-----------------------------------------------------+
|nr.gz*                 | non-redundant protein sequence database with entries|
|                       | from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq  |
|nt.gz*                 | nucleotide sequence database, with entries from all |
|                       | traditional divisions of GenBank, EMBL, and DDBJ;   |
|                       | excluding bulk divisions (gss, sts, pat, est, htg)  |
|                       | and wgs entries. Partially non-redundant.           |
+-----------------------+-----------------------------------------------------+
juniper-
  • 1
    Great job @juniper! It would be useful for the community if you could add to your answer that nr is a mix of different databases (not just RefSeq) and then accept your own answer. Not so humble... but that would clarify the outcome and save time for everyone I think. – gui11aume Dec 30 '20 at 14:31
  • 1
    Sure thing. It is done. – juniper- Dec 30 '20 at 19:56