This is the typical kind of job for Snakemake.
Assuming you have one file per replicate, named for instance T9/Infected/Rep1/Rep1.fastq.gz, you can prepare a file called Snakefile with the following content:
timepoints = list(range(10))
conditions = ["Control", "Infected"]
replicates = [1, 2, 3]

rule all:
    input:
        expand(
            "T{time}/{cond}/Rep{rep}/Rep{rep}_fastqc.html",
            time=timepoints,
            cond=conditions,
            rep=replicates)

rule do_fastqc:
    input:
        fastq = "T{time}/{cond}/Rep{rep}/Rep{rep}.fastq.gz"
    output:
        html = "T{time}/{cond}/Rep{rep}/Rep{rep}_fastqc.html"
    shell:
        """
        fastqc {input.fastq}
        """
Put this file in the directory that contains the T* directories and run snakemake from there.
The top all rule declares which files you want. The do_fastqc rule explains how to make one fastqc report from one fastq.gz file.
With a bit more work, this can be used to submit jobs to a computing cluster. Snakemake has some tools for this.
If you don't know the exact names of the fastq files but they all follow the same pattern, you will need to use Python's glob module and do a little bit of programming to determine the possible values for rep, cond and time. The Snakefile can contain any Python code you want.
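As a rough sketch of that approach (the helper name parse_wildcards and the sample paths are made up for illustration; in a real Snakefile you would feed it the result of glob.glob("T*/*/Rep*/Rep*.fastq.gz") instead of a hard-coded list):

```python
import re

# Matches paths of the form T<time>/<cond>/Rep<rep>/Rep<rep>.fastq.gz;
# the backreference \3 enforces that the directory and file agree on <rep>.
FASTQ_RE = re.compile(r"T(\d+)/([^/]+)/Rep(\d+)/Rep\3\.fastq\.gz$")

def parse_wildcards(paths):
    """Collect the distinct time, cond and rep values seen in the paths."""
    times, conds, reps = set(), set(), set()
    for p in paths:
        m = FASTQ_RE.match(p)
        if m:
            times.add(int(m.group(1)))
            conds.add(m.group(2))
            reps.add(int(m.group(3)))
    return sorted(times), sorted(conds), sorted(reps)

# Hypothetical glob result, for illustration only:
example = [
    "T9/Infected/Rep1/Rep1.fastq.gz",
    "T9/Control/Rep2/Rep2.fastq.gz",
    "T0/Infected/Rep1/Rep1.fastq.gz",
]
print(parse_wildcards(example))
# → ([0, 9], ['Control', 'Infected'], [1, 2])
```

The three lists can then be assigned to timepoints, conditions and replicates at the top of the Snakefile. Snakemake also ships a glob_wildcards() helper that does roughly this in one call, if you prefer not to write the regex yourself.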
If there is no regular pattern in the file names, fix that issue first ;)
find . -name '*.fastq.gz' | awk '{printf("fastqc \"%s\"\n", $0)}' but that still fails in the (even more unlikely) case where a file name contains a newline. This should work for anything (but requires a version of find with -printf, like GNU find): find . -name '*.fastq.gz' -printf '"%p"\n' | parallel -j 25 --verbose. – terdon Oct 26 '18 at 17:18