10

I would like to do an easily reproducible analysis using publicly available data from NCBI, so I have chosen Snakemake.

I would like to write a single rule that can download any genome, given a species code name and a tab-separated table of species and their NCBI IDs. So I wrote a script, scripts/download.sh, that takes a species code and downloads the genome to data/<species_code>/genome.fa.gz. Internally, the script reads the table tables/download_table.tsv, which lists the species code names and their corresponding NCBI IDs.

So I tried writing a Snakefile like this:

species='Cbir Avag Fcan Lcla Dcor Dpac Pdav Psp62 Psp79 Minc1 Minc2 Mjav Mare Mflo Mhap Pant'

rule download:
    input :
        "tables/download_table.tsv"
    output :
        "data/{sp}/genome.fa.gz"
    shell :
        "scripts/download.sh {sp}"

However, Snakemake returned an error message I do not really understand:

Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

Is there a way to write a single rule for downloading all the genomes?

Kamil S Jaron

2 Answers

9

The problem is that you need a master rule that requires all of your desired outputs as inputs. In your case it would be:

rule all:
    input:
        expand("data/{sp}/genome.fa.gz", sp=species.split(' '))
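
To see why this helps: with no target given, Snakemake takes the first rule in the file as the target, and `download` only has an output pattern, not a concrete file. Once `rule all` asks for concrete paths, each one is matched against `data/{sp}/genome.fa.gz` to fill in `sp`. Here is a rough plain-Python sketch of that matching step. It is not Snakemake's actual code, and `match_wildcards` is a hypothetical helper; the only thing borrowed from Snakemake is that a wildcard matches the regex `.+` by default.

```python
import re

def match_wildcards(pattern, path):
    """Return {wildcard: value} if path fits the pattern, else None."""
    # Escape the literal parts, then turn each escaped "{name}" into a named group.
    escaped = re.escape(pattern)
    regex = re.sub(
        r"\\\{(\w+)\\\}",
        lambda m: "(?P<%s>.+)" % m.group(1),
        escaped,
    )
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None

# A concrete target requested by "rule all" resolves the wildcard:
print(match_wildcards("data/{sp}/genome.fa.gz", "data/Cbir/genome.fa.gz"))
# A path that does not fit the pattern cannot be produced by the rule:
print(match_wildcards("data/{sp}/genome.fa.gz", "results/Cbir.txt"))
```

Inside a `shell:` string, the matched value is available as `{wildcards.sp}` rather than `{sp}`.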

You'll also need separate download link inputs for each species. You could make a separate download_table.tsv for each species, but it would probably be easier to make a config file with this information, and add a params keyword to your rule. Something like:

rule download:
    params:
        # look up the entry for the species matched by the {sp} wildcard
        url=lambda wildcards: config['locations'][wildcards.sp]
    output:
        "data/{sp}/genome.fa.gz"
    shell:
        "scripts/download.sh {params.url}"
Kamil S Jaron
heathobrien
  • I wouldn't erase it if I were you. A lot of people get confused about Snakemake error messages, so it's helpful to have some examples on here. – heathobrien Nov 03 '17 at 15:45
  • Then I will at least dramatically edit it. (I'll try to do it in a way that your answer still fits.) – Kamil S Jaron Nov 03 '17 at 15:47
  • @KamilSJaron I agree the question should be kept, but maybe change the title to reflect the error message, this has nothing to do with NCBI – Chris_Rands Nov 03 '17 at 19:13
  • Also note that Snakemake has an NCBI remote provider: http://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html#genbank-ncbi-entrez – Johannes Köster Nov 06 '17 at 10:51
  • @JohannesKöster I was looking at it. But since I could not make it work without remote, I have not even tried with remote. And the problem I originally had with the non-remote solution turned out to be unrelated to downloading sequences. It's a good tip though, maybe worth an answer? – Kamil S Jaron Nov 10 '17 at 15:53
3

Given the Makefile you provided (which someone has since deleted from the question), you should probably just add this rule, and it should be the first rule in the file:

rule all:
    input:
        ["data/{}/genome.fa.gz".format(x) for x in species.split()]

This rule just specifies a list of expected output files and corresponds to the following lines from the original Makefile:

GENOMES=$(patsubst %, data/%/genome.fa.gz, $(SPECIES))
all : $(GENOMES)

Karel Břinda