I recently had a script fail due to poor handling of BLAST output. The BLAST -outfmt staxids field usually returns a single taxid, but occasionally it returns two or more taxids separated by a semicolon, such as 556514;701533. Fixing the script to handle this should be fairly straightforward. But the original failure happened when I was searching a local copy of the NCBI nt database, and I'd like to create a test case with a small query and small database that reproduces this issue more rapidly.
I am familiar with the -taxid and -taxid_map flags of the makeblastdb command, but I have been unable to reproduce this behavior with BLAST; that is, make BLAST return multiple taxids for a sequence.
- If I provide
556514;701533or556514,701533as a sequence's taxid with-taxidor-taxid_map, themakeblastdbcommand fails citing an integer conversion error. - If I provide
556514 701533as the taxid in thetaxid_mapfile, the database will be indexed successfully but BLAST will return the invalid value0forstaxids. - If I duplicate each entry in the
-taxid_mapfile so that there is an entry for each taxid (see below), only a single taxid is reported by BLAST.seq1 556514 seq1 701533 seq2 556514 seq2 701533 seq3 556514 seq3 701533 ...
When constructing a custom local BLAST database, how do I associated a database sequence with more than one taxid?
UPDATE: The offending entry (or one such entry) appears to be a duplicate or merged entry in the nt database. When I use blastdbcmd to retrieve the subject sequence by accession, nt returns a sequence with two deflines on single line, like so.
>accession1 plain text description >accession2 other description
ACGT...
Using either of the two accessions, blastdbcmd will return the exact same entry, and the two staxids shown above correspond to the two distinct accessions.
So part of the issue seems to be how to construct a BLAST database such that multiple accessions can point to a single entry/subject.