nextflow: tumor normal sample - how to make code organic

Question

I'm developing pipeline in nextflow it uses tools: fastp, bwa mem index, bwamem, gatk mark duplicates, gatk setupnmd, gatk applybqsr, gatk recalibrate. I do this using FASTQ.gz of normal and tumor DNA samples.
I'm interested to use mutect2, lancet and strelka in nextsteps. However, I am unable to write code in the pipeline that makes sense. I use loads of shell scripting in mutect2 to use tumor-normal.
To expand, I've input in FASTQ as tiny_n, tiny_t, where _n is normal and _t is tumor. I run pipeline until applybqsr for each individually. However, as I reach mutect2, I check if sample contains _n or otherwise, I then create an empty file for the output as required in the mutect2 process.

How do I overcome creating dummy/empty file?

Please see below code of main.nf:

params.raw_files="/sc/arion/projects/tiny/all_tumors_normals/tiny_*{R1,R2}.fastq"
include { FASTP} from './fastp_process.nf'
//include {bwa_index} from './index_process.nf'
include { align_bwa_mem} from './bwamem_process_already_index.nf'
include { gatk_markduplicates} from './gatk_markduplicates_process.nf'
include { recalibrate_bam } from './recalibratebam_process.nf'
include {setupnmdtags} from './setupnmdtags_process.nf'
include { applybqsr } from './applybqsr_process.nf'
include { mutect2 } from './mutect2_process.nf'
workflow {
    read_pairs_ch = Channel.fromFilePairs(params.raw_files) 

    FASTP(read_pairs_ch) 

    align_bwa_mem(FASTP.out.reads) 
    gatk_markduplicates(align_bwa_mem.out.sorted_bams)
    setupnmdtags(gatk_markduplicates.out.sorted_bam_mark_duplicates)
    recalibrate_bam(setupnmdtags.out.setupnmdtags_bam)


applybqsr(gatk_markduplicates.out.sorted_bam_mark_duplicates,recalibrate_bam.out.recalib_table)
mutect2(applybqsr.out.recal_bam)
}

below is code of mutect2 to handle only one sample, run for normal, create dummy/empty file for tumor sample, mutect2_process.nf:

process mutect2 {
        maxForks 3
        debug true
        tag "${sample_id}"
 memory '15 GB'
    input:
    tuple val(sample_id),path(bqsrbam)

    output:
    tuple val(sample_id),path(&quot;${sample_id}.mutect2out.vcf&quot;)


    script:
                     def avail_mem = task.memory ? task.memory.toGiga() : 0
            def java_options = [
    avail_mem ? &quot;-Xmx${avail_mem}G&quot; : &quot;&quot;,
    &quot;-XX:+UseSerialGC&quot;,
]

    &quot;&quot;&quot;
    echo &quot;${sample_id}\n&quot;

    grep_value=\\\$(echo &quot;${sample_id}|  | grep &quot;_n&quot; &quot;)

    if [[ ! -z &quot;\\\${grep_value}&quot; ]]
    then 

            identifier=\\\$(echo ${sample_id} | awk -F&quot;_&quot; '{print \\\$1}')
            echo &quot;\\\${identifier}&quot;_t.sorted.markdup.recal.bam&quot;&quot;

            ml gatk/4.1.3.0

gatk --java-options "${java_options.join(' ')}" Mutect2 -L $params.target -I "${params.outdir}/applybqsr/""\${identifier}"_t.sorted.markdup.recal.bam"" -I  "${params.outdir}/applybqsr/""\${identifier}"_n.sorted.markdup.recal.bam"" -normal \${identifier}"_n" -tumor \${identifier}"_t" --output ${sample_id}".mutect2out.vcf" -R $params.hg38genome
    else 
            echo &quot;Nothing to do\n&quot;
            touch ${sample_id}&quot;.mutect2out.vcf&quot;

    fi

    &quot;&quot;&quot;
    }



workflow.onComplete { 
        log.info ( workflow.success ? "Done mutect2!" : "Oops .. something went wrong in mutect2" )
}

There is a load of unix/shell code that does trivial checks. It runs gatk mutect2 oly when _n is found. How do I overcome: output defined tuple val(sample_id),path("${sample_id}.mutect2out.vcf") ?
Is there a better way to do this?

Steve · Accepted Answer · 2023-07-14T00:51:33.487

I think you have a design problem here. A better way would be perform this operation in your workflow block. This would allow you to feed in the tumor and normal BAMs (and avoid accessing files outside of the working directory as previously discussed). You can use the map operator to extract the patient or case id from the sample names and the filter operator to separate the samples into two channels. You could also use the branch operator for this like in the example below. The join operator can then be used to join together the items emitted in a consistent way. For example:

workflow {
ref_fasta = file( params.ref_fasta )
targets = file( params.targets )

...

applybqsr.out.recal_bam
    .branch { sample_id, bam -&gt;

        def case_id = sample_id.substring(0, sample_id.lastIndexOf('_'))

        tumor: sample_id.toUpperCase().endsWith('_T')
            return tuple( case_id, sample_id, bam )

        normal: sample_id.toUpperCase().endsWith('_N')
            return tuple( case_id, sample_id, bam )

    }
    .set { cases_ch }

joined_ch = cases_ch.tumor.join( cases_ch.normal )

mutect2( joined_ch, ref_fasta, targets )

}

process mutect2 {

    tag { sample_id }

    module 'gatk/4.1.3.0'

    input:
    tuple val(case_id), val(tumor), path(tumor_bam), val(normal), path(normal_bam)
    path reference_fasta
    path intervals_bed

    output:
    tuple val(case_id), path("${case_id}.mutect2.vcf")

    script:
    def avail_mem = task.memory ? task.memory.toGiga() : 0
    def java_options = [
        avail_mem ? "-Xmx${avail_mem}G" : "",
        "-XX:+UseSerialGC",
    ]

    """
    gatk \\
        --java-options "${java_options.join(' ')}" \\
    Mutect2 \\
        -R "${reference_fasta}" \\
        -L "${intervals_bed}" \\
        -I "${tumor_bam}" \\
        -I "${normal_bam}" \\
        -normal "${normal}" \\
        -tumor "${tumor}" \\
        -O "${case_id}".mutect2.vcf"
    """
}

There's also likely an issue with how applybqsr is invoked. If you require multiple input channels, most of the time what you'll want is one queue channel and one or more value channels. I suspect the inputs for applybqsr also need to be joined into a single queue channel.

I really don't understand what you did with .branch in applybqsr. Nextflow is so mysterious. How are you defining normal: and tumor: without define keyword? Also, I have samples as: MM_tiny_n_R1.fastq.gz MM_tiny_t_R1.fastq.gz Thank you for your reply. — Death Metal, Jul 13 '23 at 13:39
@DeathMetal Sorry that should have been cases_ch.tumor. Fixed now. The boolean expressions we define in our branch closure can be given unique labels, which we can then use as named channels. — Steve, Jul 13 '23 at 14:16
@DeathMetal With your current glob pattern, the fromFilePairs factory method should give you the sample names: MM_tiny_n and MM_tiny_t. I assumed your FASTP and dowstream processes also return tuples containing the sample names/ids, but I might be wrong.. — Steve, Jul 13 '23 at 14:19
This is complicated for me, sorry. Would you mind explaining in tumor: sample_id.endsWith('_T') return tuple( case_id, sample_id, bam ) what is case_id? Second, how does tuple has three elements in it? — Death Metal, Jul 13 '23 at 18:32
If you get a chance, I've a new question at https://bioinformatics.stackexchange.com/questions/21310/nextflow-missing-value-declared-as-output-parameter-output-variable-defined — Death Metal, Jul 13 '23 at 18:33
@DeathMetal The case_id is really just the patient or case or donor from which the samples are derived. For example, MM_tiny would be the case or donor id here. Tuples can contain any number of elements. The elements don't even need to be of the same type necessarily. — Steve, Jul 14 '23 at 00:55

nextflow: tumor normal sample - how to make code organic

1 Answers1

Linked