
Consider the following scenario.

├── sample-alice
│   ├── sequence_1.fastq
│   ├── sequence_2.fastq
│   ├── ...
│   └── sequence_n.fastq
└── sample-bob
    ├── sequence_1.fastq
    ├── sequence_2.fastq
    ├── ...
    └── sequence_m.fastq

I'm trying to write a single rule in my Snakefile that will preprocess each sample generically. The rule will execute a single command that takes all of the FASTQ files associated with the sample at once. The samples do not all have the same number of FASTQ files.

If there were only a single sample, I could do something like this.

from glob import glob

rule preprocess:
    input: glob('sample-alice/*.fastq')
    output: 'alice-clean.fastq'
    shell: 'mypreproccmd {input} > {output}'

On the other hand, if there were only a single FASTQ file for each sample, I could use a wildcard to write a generic rule like this.

rule preprocess:
    input: 'sample-{samp}/sequence_1.fastq'
    output: '{samp}-clean.fastq'
    shell: 'mypreproccmd {input} > {output}'

Is it possible to combine these two approaches to write a rule for handling multi-file samples generically?

Daniel Standage
I'm not sure if this exactly solves your question, but in one of my workflows I use the directory as input (as specified in the config.yaml). The shell command then uses {input}/*.fastq.gz. See some examples here: https://github.com/wdecoster/nano-snakemake/blob/master/rules/align.smk – Wouter De Coster, Mar 05 '19 at 20:29

1 Answer


A nice solution comes from a colleague of mine, as seen in this Snakemake workflow. The trick is to access the wildcards programmatically using an anonymous (lambda) function. In my example above, it would be implemented as follows.

from glob import glob

rule preprocess:
    input: lambda wildcards: glob('sample-{samp}/*.fastq'.format(samp=wildcards.samp))
    output: '{samp}-clean.fastq'
    shell: 'mypreproccmd {input} > {output}'
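
To see what the input function hands back for each sample, here is a minimal plain-Python sketch that can be run outside Snakemake. The names fastqs_for_sample and make_demo_tree are hypothetical helpers introduced only for illustration, the temporary directory stands in for the real data layout, and SimpleNamespace is just a stand-in for the wildcards object that Snakemake would pass. The results are sorted because glob does not guarantee any particular order.

```python
import os
import tempfile
from glob import glob
from types import SimpleNamespace


def fastqs_for_sample(wildcards, root="."):
    """Return the sorted FASTQ paths under sample-<samp>/ (hypothetical helper)."""
    sample_dir = "sample-{samp}".format(samp=wildcards.samp)
    return sorted(glob(os.path.join(root, sample_dir, "*.fastq")))


def make_demo_tree(root, layout):
    """Create empty FASTQ files mimicking the layout in the question (demo only)."""
    for sample, count in layout.items():
        sample_dir = os.path.join(root, "sample-" + sample)
        os.makedirs(sample_dir)
        for i in range(1, count + 1):
            path = os.path.join(sample_dir, "sequence_{}.fastq".format(i))
            open(path, "w").close()


# Samples with different numbers of files, as in the question.
demo_root = tempfile.mkdtemp()
make_demo_tree(demo_root, {"alice": 3, "bob": 2})

# SimpleNamespace is an assumption standing in for Snakemake's wildcards object.
alice_files = fastqs_for_sample(SimpleNamespace(samp="alice"), demo_root)
bob_files = fastqs_for_sample(SimpleNamespace(samp="bob"), demo_root)

for path in alice_files + bob_files:
    print(os.path.relpath(path, demo_root))
```

In a Snakefile, a helper like this could presumably be wrapped the same way as the lambda above, for example input: lambda wildcards: fastqs_for_sample(wildcards), so that each value of {samp} resolves to its own list of files regardless of how many there are.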

Daniel Standage