5

I need to download the GenBank files (.gbff) for all completely assembled cyanobacterial genomes from NCBI (RefSeq or INSDC FTP data).

I think the steps are:

  1. Find the completely assembled genomes.
  2. Find the GenBank file URL for each one, based on the taxonomic name.
  3. Download the GenBank file (.gbff).

Is there any way to do this using a Python module, or any other approach?

Arijit Panda

3 Answers

3

Outline of solution:

Edit:
Since I will need something like that soon, I made a Perl script for downloading genomes.
For Cyanobacteria, you would do ./download-refseq-genomes.pl 1117
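
For readers who prefer Python, here is a minimal sketch of the three steps from the question. It is not the Perl script mentioned above, only an illustration under stated assumptions: the RefSeq summary table at https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt is tab-separated with a `#`-prefixed header naming the columns `assembly_accession`, `taxid`, `assembly_level` and `ftp_path`; the taxonomy comes from `nodes.dmp` in the NCBI taxdump; and the GenBank file sits in the assembly directory as `<directory name>_genomic.gbff.gz`. All function names are hypothetical.

```python
import urllib.request

COMPLETE = "Complete Genome"  # assumed value of the assembly_level column


def read_parents(nodes_lines):
    """Map taxid -> parent taxid from nodes.dmp lines (fields split by TAB|TAB)."""
    parent_of = {}
    for line in nodes_lines:
        parts = line.rstrip("\n").split("\t|\t")
        if len(parts) >= 2:
            parent_of[parts[0]] = parts[1]
    return parent_of


def descendants(parent_of, root):
    """Return root plus every taxid below it (e.g. root='1117' for Cyanobacteria)."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in found:  # also guards against the root node being its own parent
            continue
        found.add(node)
        stack.extend(children.get(node, []))
    return found


def gbff_url(ftp_path):
    """Build the .gbff.gz URL from an assembly directory path."""
    base = ftp_path.rstrip("/")
    return f"{base}/{base.rsplit('/', 1)[-1]}_genomic.gbff.gz"


def complete_gbff_urls(summary_lines, taxids):
    """Return {accession: url} for complete genomes whose taxid is in taxids."""
    header, urls = None, {}
    for line in summary_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields  # the comment line that names the columns
            continue
        if header is None or not line:
            continue
        row = dict(zip(header, line.split("\t")))
        if row.get("assembly_level") != COMPLETE or row.get("taxid") not in taxids:
            continue
        if row.get("ftp_path", "na") == "na":
            continue
        urls[row["assembly_accession"]] = gbff_url(row["ftp_path"])
    return urls


def download(url, dest):
    """Fetch one file; ftp_path values may need 'ftp://' swapped for 'https://'."""
    urllib.request.urlretrieve(url, dest)
```

With both files saved locally, something like `complete_gbff_urls(open("assembly_summary.txt"), descendants(read_parents(open("nodes.dmp")), "1117"))` would give the list of URLs to pass to `download`. Walking the taxonomy is needed because the `taxid` column holds the organism's own ID, not the phylum.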

Peter Menzel
2

As an alternative, you can use the NCBI Datasets CLI in a two-step process that generates a dehydrated dataset for later hydration:

# Download metadata for all Cyanobacteria 
# And pull out accession for 'Complete' genomes
./datasets assembly-descriptors taxon Cyanobacteria --limit ALL \
     | jq -r '.assemblies[].assembly | "\(.assembly_accession) \(.assembly_level)"' \
     | grep Complete \
     | cut -f 1 -d ' ' > accs.txt

# Download a dehydrated dataset for the provided set of accessions
./datasets download assembly -i accs.txt -f 1117_complete.zip \
     --dehydrated \
     --exclude-protein \
     --exclude-rna \
     --exclude-seq \
     --exclude-gff3 \
     --include-gbff

You may be interested in some of the other files that are excluded.

Once you have the dehydrated dataset, you can rehydrate it when you are ready to analyze the data with:

unzip 1117_complete.zip -d 1117_complete
./datasets rehydrate -f 1117_complete
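
If `jq` is not available, the filtering in the first step can be sketched in Python instead. This assumes only the JSON layout implied by the `jq` expression above (a top-level `assemblies` list whose items carry an `assembly` object with `assembly_accession` and `assembly_level`); the function name is hypothetical.

```python
import json


def complete_accessions(report):
    """Return accessions whose assembly_level mentions 'Complete'.

    Mirrors the jq | grep Complete | cut pipeline, but checks only the
    level field instead of the whole output line.
    """
    accessions = []
    for entry in report.get("assemblies", []):
        assembly = entry.get("assembly", {})
        if "Complete" in assembly.get("assembly_level", ""):
            accessions.append(assembly["assembly_accession"])
    return accessions


def write_accessions(json_text, out_path):
    """Parse the metadata JSON text and write one accession per line."""
    accessions = complete_accessions(json.loads(json_text))
    with open(out_path, "w") as handle:
        handle.write("\n".join(accessions) + ("\n" if accessions else ""))
    return accessions
```

Saving the metadata command's output to a file and calling `write_accessions(open("meta.json").read(), "accs.txt")` would produce the same kind of `accs.txt` used by the download step.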
0

Use the NCBI Datasets command-line tool to download genomes by taxon (NCBI Taxonomy ID, or a scientific or common name at any taxonomic rank).

datasets download genome taxon cyanobacteria \
  --assembly-source refseq --assembly-level complete_genome \
  --exclude-genomic-cds --exclude-gff3 --exclude-protein \
  --exclude-rna --exclude-seq --include-gbff \
  --filename genomes.zip
Forrest Vigor