How to run a nextflow process for each file generated by another process separately - Input tuple does not match input set cardinality error

Question

I have a process called FINDTAIL that generates different number of files depending on the input data. Its either 2 files or four files i.e.

1. read_1{,.pr,.sl}.fasta and read_2{,.pr,.sl}.fasta

or

2. read_1.pr.fasta, read_1.sl.fasta, read_2.pr.fasta and read_2.sl.fasta

/*
 * the `FINDTAIL` process takes "${pair_id}_strand_check.txt" file  
 * and executes commands according to the reads strandedness. 
 * The output generated is FASTA files.  
 */
process FINDTAIL {
tag { pair_id }

debug true

input:
path strand_check_file
tuple val(pair_id), path(reads)

output:
tuple val(pair_id), path(&quot;tail_${pair_id}_{1,2}{,.pr,.sl}.fasta&quot;)

script:
&quot;&quot;&quot;

#!/usr/bin/env python3
import subprocess
import pandas as pd
import os
result = pd.read_csv("${strand_check_file}", sep="\r\n", header=None, engine='python')
failed = float(result.iloc[1,0].replace('Fraction of reads failed to determine: ', ''))
fwd = float(result.iloc[2,0].replace('Fraction of reads explained by "1++,1--,2+-,2-+": ', ''))
rev = float(result.iloc[3,0].replace('Fraction of reads explained by "1+-,1-+,2++,2--": ', ''))
fwd_percent = fwd/(fwd+rev)
rev_percent = rev/(fwd+rev)
if float(result.iloc[1,0].replace('Fraction of reads failed to determine: ', '')) > 0.50:
    cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail_${pair_id}1.sl.fasta"
    cmd2 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail${pair_id}_2.sl.fasta"
    print(cmd1)
    subprocess.call(cmd1, shell=True)
    print(cmd2)
    subprocess.call(cmd2, shell=True)
if fwd_percent > 0.9:
    #Over 90% of reads explained by "1++,1--,2+-,2-+
    #Data is likely FR/fr-secondstrand
    cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type pr --d > " + "tail_${pair_id}1.pr.fasta"
    cmd2 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type pr --d > " + "tail${pair_id}_2.pr.fasta"
    print(cmd1)
    subprocess.call(cmd1, shell=True)
    print(cmd2)
    subprocess.call(cmd2, shell=True)
elif rev_percent > 0.9:
    # Over 90% of reads explained by "1+-,1-+,2++,2--
    # Data is likely RF/fr-firststrand
    cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail_${pair_id}1.sl.fasta"
    cmd2 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail${pair_id}_2.sl.fasta"
    print(cmd1)
    subprocess.call(cmd1, shell=True)
    print(cmd2)
    subprocess.call(cmd2, shell=True)
elif max(fwd_percent, rev_percent) < 0.6:
    #Under 60% of reads explained by one direction
    #Data is likely unstranded
    cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail_${pair_id}1.sl.fasta"
    cmd2 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type pr --d > " + "tail${pair_id}1.pr.fasta"
    cmd3 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail${pair_id}2.sl.fasta"
    cmd4 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type pr --d > " + "tail${pair_id}_2.pr.fasta"
    print(cmd1)
    subprocess.call(cmd1, shell=True)
    print(cmd2)
    subprocess.call(cmd2, shell=True)
    print(cmd3)
    subprocess.call(cmd3, shell=True)
    print(cmd4)
    subprocess.call(cmd4, shell=True)
else:
    #if strand couldnt be detected
    # we assume it is first-stranded
    # since most illumina reads are
    cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail_${pair_id}1.sl.fasta"
    cmd2 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail${pair_id}_2.sl.fasta"
    print(cmd1)
    subprocess.call(cmd1, shell=True)
    print(cmd2)
    subprocess.call(cmd2, shell=True)
    """
}

I want to pass these fasta file outputs of the FINDTAIL process into an ALIGN process that runs on each file separately.

The ALIGN process:

/*
 * define the `ALIGN` process that outputs unmapped files in FASTA format
 * It will take the FASTA files creates by the `FINDTAIL` process 
 * and try to align them INDIVIDUALLY.
 */
fasta_file_ch = find_tail_ch.flatten()
process ALIGN {
tag { pair_id }

input:
path index_files, stageAs: &quot;bt2_index/*&quot;
tuple val(pair_id)
file(fasta_file) from fasta_file_ch

output:
tuple val(pair_id), path(&quot;unmapped_tail_${pair_id}_{1,2}{,.pr,.sl}.fasta&quot;)

script:
def idx = index_files[0].getBaseName(2)    

&quot;&quot;&quot;
bowtie2 -x &quot;<span class="math-container">${idx}" -f "$</span>{fasta_file}&quot; --un &quot;unmapped_tail_<span class="math-container">${pair_id}/$</span>{it.baseName}.fasta&quot;
&quot;&quot;&quot;

}

I have tried using the flatten() function to format my find_tail_ch but it still does not work.

Best wishes~

Steve · Accepted Answer · 2022-12-27T02:56:06.523

4

If the output channel produces a tuple where the first element is a simple value and the second element is a collection of files, like in this case, you can simply use the transpose operator to produce a transposition where the first element in each resulting tuple is the pair_id and the second is a FASTA file. You'll then need to update your process definition to receive this tuple, for example:

process ALIGN {
tag { pair_id }

input:
path index_files, stageAs: &quot;bt2_index/*&quot;
tuple val(pair_id), path(fasta_file)

output:
tuple val(pair_id), path(&quot;${fasta_file.baseName}.unmapped.fasta&quot;)

script:
def idx = index_files[0].getBaseName(2)    

&quot;&quot;&quot;
bowtie2 \\
    -x &quot;<span class="math-container">${idx}" \\
    -f "$</span>{fasta_file}&quot; \\
    --un &quot;${fasta_file.baseName}.unmapped.fasta&quot;
&quot;&quot;&quot;

}

Your workflow block might look like:

workflow {
bt2_index = ...

...

align_input_ch = find_tail_ch.transpose()

ALIGN( bt2_index, align_input_ch )

}

edited Dec 27 '22 at 02:56

answered Dec 26 '22 at 14:39

Steve

3,099
1
4
12

Thank you very very much Steve! – pubsurfted Dec 27 '22 at 08:15
If you could clarify one thing: since we have used the transpose function for align_input_ch, if I were to pass down the contents i.e. two separate fasta files, of the ALIGN process to another process i.e. file by file separately, will it automatically pass separately? – pubsurfted Dec 27 '22 at 09:53
1

@pubsurfted Yes, if I understand correctly. But if you're ever in doubt, you can use the view() operator to see what's going on. In the above example, you could try: ALIGN.out.view(). I think you should be able to see either two or four process with the sample pair_id in the first element of each tuple. Be sure to add -ansi-log false to your command line options. – Steve Dec 27 '22 at 11:39
1

Thanks again! I'm so grateful for your help. – pubsurfted Dec 28 '22 at 05:05

How to run a nextflow process for each file generated by another process separately - Input tuple does not match input set cardinality error

1 Answers1