
The Question

I want to find high coverage SRA entries, e.g., above 100x.

I guess the best way is to use https://www.ncbi.nlm.nih.gov/sra with an appropriate search term. I don't mind if the search results contain some false positives (i.e., if I aim for 100x and some of the returned entries only have coverage above 80x, that's fine). I want to avoid calling Entrez.esummary on a very large number of SRA entries (e.g., through Biopython), because, if I understand correctly, that would be significantly slower.

How can I do that?


My Attempt

I tried using the Mbases field.
Assuming the reference genome is at most 100 kbp (e.g., if I am interested in viruses), I thought a good approximation would be to search for SRA entries with at least 100 × 100 kbp = 10M bases, i.e., at least 10 Mbases. Based on https://www.ncbi.nlm.nih.gov/books/NBK3837/, I expected the search term 10:9999999[Mbases] to work, but disappointingly, it didn't. Just to make sure my range syntax is right, I verified that 100:101[ReadLength] works. So it seems that Mbases doesn't support the range syntax.
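
For reference, the arithmetic behind the 10 Mbases threshold can be sketched as below. The function name `min_mbases` is just something made up for illustration, and the estimate assumes coverage is simply total sequenced bases divided by genome length.

```python
def min_mbases(target_coverage, genome_length_bp):
    """Megabases of sequence needed to reach target_coverage on a genome
    of genome_length_bp bases (coverage = total bases / genome length)."""
    return target_coverage * genome_length_bp / 1_000_000


# 100x coverage of a 100 kbp genome
print(min_mbases(100, 100_000))  # 10.0
```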

Oren Milman

1 Answer


If processing a huge table with lots of false positives is OK, then you could enter your search term and, from the top right of the results page, choose "Send to" > "File" with Format set to "RunInfo". You can then filter the downloaded table yourself.
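
Once the RunInfo file is downloaded, filtering it by estimated coverage could look roughly like the sketch below. It assumes the export is a CSV with `Run` and `bases` columns (check the header of your own file), and that coverage can be approximated as bases divided by genome length. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def high_coverage_runs(runinfo_csv, genome_length_bp, min_coverage):
    """Yield (run accession, estimated coverage) for runs at or above min_coverage.

    Assumes the CSV has 'Run' and 'bases' columns; verify against your export.
    """
    for row in csv.DictReader(runinfo_csv):
        try:
            bases = int(row["bases"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric base count
        coverage = bases / genome_length_bp
        if coverage >= min_coverage:
            yield row["Run"], coverage


# Tiny made-up sample standing in for a real RunInfo download
sample = io.StringIO(
    "Run,spots,bases\n"
    "SRR0000001,100000,15000000\n"
    "SRR0000002,1000,150000\n"
)
hits = list(high_coverage_runs(sample, genome_length_bp=100_000, min_coverage=100))
print(hits)  # [('SRR0000001', 150.0)]
```

For a real file, replace `sample` with something like `open(path_to_runinfo, newline="")`, where `path_to_runinfo` is whatever filename the download produced.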

Michael