4

I have a process called FINDTAIL that generates different number of files depending on the input data. Its either 2 files or four files i.e.

1. read_1{,.pr,.sl}.fasta and read_2{,.pr,.sl}.fasta

or

2. read_1.pr.fasta, read_1.sl.fasta, read_2.pr.fasta and read_2.sl.fasta

/*
 * the `FINDTAIL` process takes "${pair_id}_strand_check.txt" file  
 * and executes commands according to the reads strandedness. 
 * The output generated is FASTA files.  
 */

process FINDTAIL {

tag { pair_id }

debug true

input:
path strand_check_file
tuple val(pair_id), path(reads)

output:
tuple val(pair_id), path("tail_${pair_id}_{1,2}{,.pr,.sl}.fasta")

script:
"""

#!/usr/bin/env python3

import subprocess import pandas as pd import os

result = pd.read_csv("${strand_check_file}", sep="\r\n", header=None, engine='python')

failed = float(result.iloc[1,0].replace('Fraction of reads failed to determine: ', '')) fwd = float(result.iloc[2,0].replace('Fraction of reads explained by "1++,1--,2+-,2-+": ', '')) rev = float(result.iloc[3,0].replace('Fraction of reads explained by "1+-,1-+,2++,2--": ', '')) fwd_percent = fwd/(fwd+rev) rev_percent = rev/(fwd+rev)

if float(result.iloc[1,0].replace('Fraction of reads failed to determine: ', '')) > 0.50: cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail_${pair_id}1.sl.fasta" cmd2 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail${pair_id}_2.sl.fasta" print(cmd1) subprocess.call(cmd1, shell=True) print(cmd2) subprocess.call(cmd2, shell=True)

if fwd_percent > 0.9: #Over 90% of reads explained by "1++,1--,2+-,2-+ #Data is likely FR/fr-secondstrand cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type pr --d > " + "tail_${pair_id}1.pr.fasta" cmd2 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type pr --d > " + "tail${pair_id}_2.pr.fasta" print(cmd1) subprocess.call(cmd1, shell=True) print(cmd2) subprocess.call(cmd2, shell=True)

elif rev_percent > 0.9: # Over 90% of reads explained by "1+-,1-+,2++,2-- # Data is likely RF/fr-firststrand cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail_${pair_id}1.sl.fasta" cmd2 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail${pair_id}_2.sl.fasta" print(cmd1) subprocess.call(cmd1, shell=True) print(cmd2) subprocess.call(cmd2, shell=True)

elif max(fwd_percent, rev_percent) < 0.6: #Under 60% of reads explained by one direction #Data is likely unstranded cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail_${pair_id}1.sl.fasta" cmd2 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type pr --d > " + "tail${pair_id}1.pr.fasta" cmd3 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail${pair_id}2.sl.fasta" cmd4 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type pr --d > " + "tail${pair_id}_2.pr.fasta" print(cmd1) subprocess.call(cmd1, shell=True) print(cmd2) subprocess.call(cmd2, shell=True) print(cmd3) subprocess.call(cmd3, shell=True) print(cmd4) subprocess.call(cmd4, shell=True)

else: #if strand couldnt be detected # we assume it is first-stranded # since most illumina reads are cmd1 = "findtail_v1.01 --input_file " + "${reads[0]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail_${pair_id}1.sl.fasta" cmd2 = "findtail_v1.01 --input_file " + "${reads[1]}" + " --seqlength 500 --endgap 2 --taillength 10 --identity 10 --ptype A --stype T --output_format fasta --output_type sl --d > " + "tail${pair_id}_2.sl.fasta" print(cmd1) subprocess.call(cmd1, shell=True) print(cmd2) subprocess.call(cmd2, shell=True) """ }

I want to pass these fasta file outputs of the FINDTAIL process into an ALIGN process that runs on each file separately.

The ALIGN process:

/*
 * define the `ALIGN` process that outputs unmapped files in FASTA format
 * It will take the FASTA files creates by the `FINDTAIL` process 
 * and try to align them INDIVIDUALLY.
 */

fasta_file_ch = find_tail_ch.flatten()

process ALIGN {

tag { pair_id }

input:
path index_files, stageAs: &quot;bt2_index/*&quot;
tuple val(pair_id)
file(fasta_file) from fasta_file_ch

output:
tuple val(pair_id), path(&quot;unmapped_tail_${pair_id}_{1,2}{,.pr,.sl}.fasta&quot;)

script:
def idx = index_files[0].getBaseName(2)    

&quot;&quot;&quot;
bowtie2 -x &quot;<span class="math-container">${idx}" -f "$</span>{fasta_file}&quot; --un &quot;unmapped_tail_<span class="math-container">${pair_id}/$</span>{it.baseName}.fasta&quot;
&quot;&quot;&quot;

}

I have tried using the flatten() function to format my find_tail_ch but it still does not work.

Best wishes~

pubsurfted
  • 383
  • 1
  • 6

1 Answers1

4

If the output channel produces a tuple where the first element is a simple value and the second element is a collection of files, like in this case, you can simply use the transpose operator to produce a transposition where the first element in each resulting tuple is the pair_id and the second is a FASTA file. You'll then need to update your process definition to receive this tuple, for example:

process ALIGN {
tag { pair_id }

input:
path index_files, stageAs: &quot;bt2_index/*&quot;
tuple val(pair_id), path(fasta_file)

output:
tuple val(pair_id), path(&quot;${fasta_file.baseName}.unmapped.fasta&quot;)

script:
def idx = index_files[0].getBaseName(2)    

&quot;&quot;&quot;
bowtie2 \\
    -x &quot;<span class="math-container">${idx}" \\
    -f "$</span>{fasta_file}&quot; \\
    --un &quot;${fasta_file.baseName}.unmapped.fasta&quot;
&quot;&quot;&quot;

}

Your workflow block might look like:

workflow {
bt2_index = ...

...

align_input_ch = find_tail_ch.transpose()

ALIGN( bt2_index, align_input_ch )

}

Steve
  • 3,099
  • 1
  • 4
  • 12
  • Thank you very very much Steve! – pubsurfted Dec 27 '22 at 08:15
  • If you could clarify one thing: since we have used the transpose function for align_input_ch, if I were to pass down the contents i.e. two separate fasta files, of the ALIGN process to another process i.e. file by file separately, will it automatically pass separately? – pubsurfted Dec 27 '22 at 09:53
  • 1
    @pubsurfted Yes, if I understand correctly. But if you're ever in doubt, you can use the view() operator to see what's going on. In the above example, you could try: ALIGN.out.view(). I think you should be able to see either two or four process with the sample pair_id in the first element of each tuple. Be sure to add -ansi-log false to your command line options. – Steve Dec 27 '22 at 11:39
  • 1
    Thanks again! I'm so grateful for your help. – pubsurfted Dec 28 '22 at 05:05