How to automate NCBI genome download

Question

I need to download the GenBank files (.gbff) of all completely assembled cyanobacterial genomes from NCBI (RefSeq or INSDC FTP data).

For this, I think the steps are:

1. Find the completely assembled genomes.
2. Find the GenBank file URL based on the taxonomic name.
3. Download the GenBank file (.gbff).

Is there any way to do this with a Python module, or any other approach?

Biopython has tools that could help you for steps 2 and 3, but I'm not sure how to get the information of your step 1. — bli, Sep 05 '18 at 07:23
@Arijit Searching for "Download biopython Genebank" lead you here — llrs, Sep 05 '18 at 07:31

Peter Menzel · Accepted Answer · 2018-09-29T07:24:04.260

Outline of solution:

get this file: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
filter for lines having "Complete Genome" in column 12
filter for lines having a taxid (column 6) that is a descendant of taxon id 1117 (phylum Cyanobacteria); to find the descendants, you can use nodes.dmp from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
The file download URL is column 20, followed by '/', then the last field of column 20 (after splitting it at '/'), then '_genomic.gbff.gz'. For example: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/365/GCF_000007365.1_ASM736v1/GCF_000007365.1_ASM736v1_genomic.gbff.gz
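
The outline above can be sketched in Python roughly as follows. This is only an illustration of the steps, not the Perl script mentioned below. The function names are made up for this example, and it assumes the column positions given in the outline (taxid in column 6, assembly level in column 12, FTP path in column 20) and the usual '|'-separated layout of nodes.dmp, with taxid and parent taxid as the first two fields.

```python
from collections import defaultdict

CYANOBACTERIA_TAXID = "1117"  # phylum Cyanobacteria, as in the outline


def load_children(nodes_dmp_lines):
    """Map parent taxid -> child taxids from the lines of nodes.dmp."""
    children = defaultdict(list)
    for line in nodes_dmp_lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2:
            continue
        taxid, parent = fields[0], fields[1]
        if taxid != parent:  # the root node lists itself as its parent
            children[parent].append(taxid)
    return children


def descendants(children, root):
    """Return the set containing root and every taxid below it."""
    seen = {root}
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def gbff_url(ftp_path):
    """Column 20 + '/' + its last path component + '_genomic.gbff.gz'."""
    ftp_path = ftp_path.rstrip("/")
    basename = ftp_path.rsplit("/", 1)[-1]
    return f"{ftp_path}/{basename}_genomic.gbff.gz"


def select_urls(summary_lines, wanted_taxids):
    """Return .gbff.gz URLs for complete genomes whose taxid is wanted."""
    urls = []
    for line in summary_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 20:
            continue
        taxid, level, ftp_path = cols[5], cols[11], cols[19]
        if level == "Complete Genome" and taxid in wanted_taxids and ftp_path != "na":
            urls.append(gbff_url(ftp_path))
    return urls
```

To use it, open the extracted nodes.dmp and the downloaded assembly_summary.txt as text files, pass the first to load_children and descendants (with CYANOBACTERIA_TAXID as root), and pass the second together with the resulting taxid set to select_urls. The returned URLs can then be fetched with any downloader, such as wget or urllib.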

Edit:
Since I will need something like that soon, I made a Perl script for downloading genomes.
For Cyanobacteria, you would do ./download-refseq-genomes.pl 1117

score 2 · Answer 2 · answered Sep 29 '20 at 20:44

As an alternative, you can use the NCBI Datasets CLI in a two-step process: first generate a dehydrated dataset, then rehydrate it later.

# Download metadata for all Cyanobacteria 
# And pull out accession for 'Complete' genomes
./datasets assembly-descriptors taxon Cyanobacteria --limit ALL \
     | jq -r '.assemblies[].assembly | "\(.assembly_accession) \(.assembly_level)"' \
     | grep Complete \
     | cut -f 1 -d ' ' > accs.txt
# Download a dehydrated dataset for the provided set of accessions
./datasets download assembly -i accs.txt -f 1117_complete.zip \
     --dehydrated \
     --exclude-protein \
     --exclude-rna \
     --exclude-seq \
     --exclude-gff3 \
     --include-gbff

You may be interested in some of the other files that are excluded.

Once you have the dehydrated dataset, you can rehydrate it when you are ready to analyze the data with:

unzip 1117_complete.zip -d 1117_complete
./datasets rehydrate -f 1117_complete

score 0 · Answer 3 · answered Dec 06 '22 at 07:40

Use the NCBI Datasets command line tool to download genomes by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank).

datasets download genome taxon cyanobacteria \
  --assembly-source refseq --assembly-level complete_genome \
  --exclude-genomic-cds --exclude-gff3 --exclude-protein \
  --exclude-rna --exclude-seq --include-gbff \
  --filename genomes.zip
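
Once the archive is downloaded, a quick way to check what arrived is to list the GenBank flat files inside it. The snippet below is a minimal sketch using only Python's standard zipfile module. The helper name is hypothetical, and it assumes only that the GenBank files in the archive have names ending in .gbff, not any particular directory layout.

```python
import zipfile


def list_gbff_members(zip_path):
    """Return the sorted names of all members ending in .gbff."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist() if name.endswith(".gbff")
        )
```

For example, list_gbff_members("genomes.zip") returns the paths of the .gbff members, which can then be read with archive.open() or extracted and parsed with Biopython.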
