I'm developing pipeline in nextflow it uses tools: fastp, bwa mem index, bwamem, gatk mark duplicates, gatk setupnmd, gatk applybqsr, gatk recalibrate. I do this using FASTQ.gz of normal and tumor DNA samples.
I'm interested to use mutect2, lancet and strelka in nextsteps. However, I am unable to write code in the pipeline that makes sense. I use loads of shell scripting in mutect2 to use tumor-normal.
To expand, I've input in FASTQ as tiny_n, tiny_t, where _n is normal and _t is tumor. I run pipeline until applybqsr for each individually. However, as I reach mutect2, I check if sample contains _n or otherwise, I then create an empty file for the output as required in the mutect2 process.
How do I overcome creating dummy/empty file?
Please see below code of main.nf:
params.raw_files="/sc/arion/projects/tiny/all_tumors_normals/tiny_*{R1,R2}.fastq"
include { FASTP} from './fastp_process.nf'
//include {bwa_index} from './index_process.nf'
include { align_bwa_mem} from './bwamem_process_already_index.nf'
include { gatk_markduplicates} from './gatk_markduplicates_process.nf'
include { recalibrate_bam } from './recalibratebam_process.nf'
include {setupnmdtags} from './setupnmdtags_process.nf'
include { applybqsr } from './applybqsr_process.nf'
include { mutect2 } from './mutect2_process.nf'
workflow {
read_pairs_ch = Channel.fromFilePairs(params.raw_files)
FASTP(read_pairs_ch)
align_bwa_mem(FASTP.out.reads)
gatk_markduplicates(align_bwa_mem.out.sorted_bams)
setupnmdtags(gatk_markduplicates.out.sorted_bam_mark_duplicates)
recalibrate_bam(setupnmdtags.out.setupnmdtags_bam)
applybqsr(gatk_markduplicates.out.sorted_bam_mark_duplicates,recalibrate_bam.out.recalib_table)
mutect2(applybqsr.out.recal_bam)
}
below is code of mutect2 to handle only one sample, run for normal, create dummy/empty file for tumor sample, mutect2_process.nf:
process mutect2 {
maxForks 3
debug true
tag "${sample_id}"
memory '15 GB'
input:
tuple val(sample_id),path(bqsrbam)
output:
tuple val(sample_id),path("${sample_id}.mutect2out.vcf")
script:
def avail_mem = task.memory ? task.memory.toGiga() : 0
def java_options = [
avail_mem ? "-Xmx${avail_mem}G" : "",
"-XX:+UseSerialGC",
]
"""
echo "${sample_id}\n"
grep_value=\\\$(echo "${sample_id}| | grep "_n" ")
if [[ ! -z "\\\${grep_value}" ]]
then
identifier=\\\$(echo ${sample_id} | awk -F"_" '{print \\\$1}')
echo "\\\${identifier}"_t.sorted.markdup.recal.bam""
ml gatk/4.1.3.0
gatk --java-options "${java_options.join(' ')}" Mutect2 -L $params.target -I "${params.outdir}/applybqsr/""\${identifier}"_t.sorted.markdup.recal.bam"" -I "${params.outdir}/applybqsr/""\${identifier}"_n.sorted.markdup.recal.bam"" -normal \${identifier}"_n" -tumor \${identifier}"_t" --output ${sample_id}".mutect2out.vcf" -R $params.hg38genome
else
echo "Nothing to do\n"
touch ${sample_id}".mutect2out.vcf"
fi
"""
}
workflow.onComplete {
log.info ( workflow.success ? "Done mutect2!" : "Oops .. something went wrong in mutect2" )
}
There is a load of unix/shell code that does trivial checks. It runs gatk mutect2 oly when _n is found.
How do I overcome: output defined tuple val(sample_id),path("${sample_id}.mutect2out.vcf") ?
Is there a better way to do this?
.branchin applybqsr. Nextflow is so mysterious. How are you definingnormal:andtumor:without define keyword? Also, I have samples as: MM_tiny_n_R1.fastq.gz MM_tiny_t_R1.fastq.gz Thank you for your reply. – Death Metal Jul 13 '23 at 13:39cases.tumorstored/created? – Death Metal Jul 13 '23 at 13:41cases_ch.tumor. Fixed now. The boolean expressions we define in our branch closure can be given unique labels, which we can then use as named channels. – Steve Jul 13 '23 at 14:16fromFilePairsfactory method should give you the sample names:MM_tiny_nandMM_tiny_t. I assumed your FASTP and dowstream processes also return tuples containing the sample names/ids, but I might be wrong.. – Steve Jul 13 '23 at 14:19tumor: sample_id.endsWith('_T') return tuple( case_id, sample_id, bam )what is case_id? Second, how does tuple has three elements in it? – Death Metal Jul 13 '23 at 18:32MM_tinywould be the case or donor id here. Tuples can contain any number of elements. The elements don't even need to be of the same type necessarily. – Steve Jul 14 '23 at 00:55