Update
The issue was that bwa was running out of memory and failing, but that error wasn't floating to the top (see @Steve's answer, below). I was getting an error from samtools. I should have been able to figure that out, but hopefully this post saves someone else the trouble.
Original post
I'm experiencing a strange intermittent parsing error when piping bwa mem to samtools view. It only seems to happen when the command runs within a Nextflow pipeline (though it seems unlikely that Nextflow has anything to do with it). If I cd to the directory and run bash .command.sh, it runs to completion without issue.
Nextflow process
Here's the Nextflow process I'm running (code credit to @Steve, though modified to handle .bam or .cram). The portion that's running in this case
process bwa_mem_proc {
tag { "<span class="math-container">${sample_name}:$</span>{fastq.name}" }
label 'bwa_mem_proc'
input:
tuple val(sample_name), path(fastq), path(header)
output:
tuple val(sample_name), path("${fastq.baseName}.unsorted.mini.bam")
script:
def task_cpus = task.cpus > 1 ? task.cpus - 1 : task.cpus
"""
if [ "<span class="math-container">$params.output_format" = "bam" ]; then
bwa mem \\
-p \\
-t $</span>{task_cpus} \\
-M \\
-C \\
-H <(grep "^@RG" "<span class="math-container">${header}") \\
"$</span>{params.align_to_ref}" \\
"<span class="math-container">${fastq}" |
samtools view \\
-1 \\
-o "$</span>{fastq.baseName}.unsorted.mini.bam" \\
-
else
bwa mem \\
-p \\
-t <span class="math-container">${task_cpus} \\
-M \\
-C \\
-H <(grep "^@RG" "$</span>{header}") \\
"<span class="math-container">${params.align_to_ref}" \\
"$</span>{fastq}" |
samtools view \\
-C \\
-T params.align_to_ref \\
-o "${fastq.baseName}.unsorted.mini.bam" \\
-
fi
"""
}
Exact error
Here's an example of the intermittent errors I'm getting. I would have assumed that some read (e.g., read # 2_102_429 in the .fastq) is malformed, but as I mentioned, running bash .command.sh in the working directory for the failed run finishes without issue.
Command error:
[M::mem_pestat] skip orientation FF
[M::mem_process_seqs] Processed 700000 reads in 109.002 CPU sec, 16.411 real sec
[M::process] 0 single-end sequences; 700000 paired-end sequences
[M::process] read 700000 sequences (70000000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (24, 301945, 4, 10)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (118, 180, 399)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 961)
[M::mem_pestat] mean and std.dev: (179.38, 124.36)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1242)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (247, 319, 420)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 766)
[M::mem_pestat] mean and std.dev: (342.90, 127.88)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 939)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (326, 764, 1452)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3704)
[M::mem_pestat] mean and std.dev: (893.30, 530.98)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 4830)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 700000 reads in 97.823 CPU sec, 14.292 real sec
[M::process] 0 single-end sequences; 700000 paired-end sequences
[M::process] read 700000 sequences (70000000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (14, 196658, 9, 15)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (206, 422, 1021)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2651)
[M::mem_pestat] mean and std.dev: (506.92, 458.91)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 3466)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (248, 322, 429)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 791)
[M::mem_pestat] mean and std.dev: (345.17, 129.22)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 972)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (485, 777, 1315)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2975)
[M::mem_pestat] mean and std.dev: (961.60, 569.27)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 3805)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[M::mem_process_seqs] Processed 700000 reads in 275.897 CPU sec, 39.910 real sec
[M::process] 0 single-end sequences; 700000 paired-end sequences
[M::process] read 700000 sequences (70000000 bp)...
[W::sam_read1_sam] Parse error at line 2102429
samtools view: error reading file "-"
A previous sample complained about an @SQ header line missing the LN: tag. It's like the pipe gets interrupted and lines get truncated, or something.
[E::sam_hrecs_update_hashes] Header includes @SQ line "KI270330" with no LN: tag
samtools view: failed to add PG line to the header
Anyone seen this before?
bwawas running out of memory but that error wasn't floating to the top. – Mark Ebbert Feb 23 '22 at 16:26