Assume that I wish to find all complete human mitochondrial genome records on GenBank (or rather, NCBI nuccore) that also have an entry in NCBI's Biosample database.

MYQUERY="(mitochondrion[TITLE] OR mitochondrial[TITLE]) \
    AND complete genome[TITLE] \
    AND (human[TITLE] or homo sapiens[TITLE])"

I know that 61,719 such records exist on NCBI nuccore as of Sep-23-2023.

esearch -db nuccore -query "$MYQUERY" | \
    xtract -pattern ENTREZ_DIRECT -element Count

I also know that 452 such records exist on NCBI Biosample as of Sep-23-2023.

esearch -db biosample -query "$MYQUERY" | \
    xtract -pattern ENTREZ_DIRECT -element Count

However, the two record sets either do not intersect or are not correctly linked (or I am using the EDirect tools incorrectly), because linking the nuccore hits to the biosample database returns 0 hits.

esearch -db nuccore -query "$MYQUERY" | \
    elink -target biosample | \
    xtract -pattern ENTREZ_DIRECT -element Count

On the other hand, first esearching in the biosample database and then linking the hits to the nuccore database does seem to give 271 hits.

esearch -db biosample -query "$MYQUERY" | \
    elink -target nuccore | \
    xtract -pattern ENTREZ_DIRECT -element Count

What is the reason for this seemingly unpredictable behavior?
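One way to cross-check the link counts without relying on elink is to read the BioSample accession straight from each nuccore document summary. The following is a minimal sketch, not a verified fix. It assumes that the nuccore docsum carries a `BioSample` element and that xtract's `-def` option fills in a placeholder when that element is empty. The function names `fetch_pairs` and `count_linked` are hypothetical helpers introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count nuccore records whose document summary names a BioSample.
# Assumption: the nuccore docsum contains a <BioSample> element.

# Print "accession<TAB>biosample" for every hit of the query in $1.
# "-" is used as a placeholder when no BioSample value is present.
# Requires network access and EDirect on PATH; only defined here, not run.
fetch_pairs() {
    esearch -db nuccore -query "$1" |
        efetch -format docsum |
        xtract -pattern DocumentSummary -def "-" -element Caption BioSample
}

# Read tab-separated rows from stdin and count those whose second
# column holds a real value (neither empty nor the "-" placeholder).
count_linked() {
    awk -F '\t' '$2 != "" && $2 != "-" { n++ } END { print n + 0 }'
}

# Usage:
#   fetch_pairs "$MYQUERY" | count_linked
```

If this count differs from what `elink -target biosample` reports, that would suggest the discrepancy lies in the link data rather than in the query. Note that fetching docsums for tens of thousands of records can take a while.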

  • It's because the elink is messed up and not consistent. It's quite a lot of code to sort it out, but it's doable (I've done it). In an ideal world the elink would be the key across the relational database - it ain't always the case but it's the best thing there is. – M__ Sep 24 '23 at 00:22
  • Not related to your issue, just a couple of points about good practices in shell scripts: i) avoid using CAPS for shell variable names. Since, by convention, global environment variables are capitalized, it is bad practice to also use caps for your own, local variables because this can lead to naming collisions and hard to find bugs. ii) There is no need for \ after a | to break a line. | is a list terminator, so it's fine as a line break character. See – terdon Sep 24 '23 at 10:01
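The two shell conventions from the comment above can be sketched as follows. This is an illustrative example using only standard tools. The variable `my_query` and the function `count_matches` are hypothetical names, and the input lines are made-up sample text rather than real records.

```shell
#!/usr/bin/env bash
# Lowercase name for a script-local variable; all-caps names are
# conventionally left to environment variables.
my_query="complete genome"

# Count the lines on stdin that contain the query string.
count_matches() {
    grep -c "$my_query"
}

# A trailing | already tells the shell that the command continues
# on the next line, so no backslash is needed after it.
printf '%s\n' "example complete genome" "example partial sequence" |
    count_matches
```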
