Removing repeated reads from nanopore 1D² reads

Question

This is a summary of discussions and a question that was posted by Devon O'Rourke on Twitter.

Following Albacore basecalls on a 1D² library I get two sets of .fq files, summary stats, etc.: one for the 1D basecalling script, and one for the 1D² script. I'd like to use the reads generated from these scripts in a genome assembly.

After a little bit of grep searching, it seems like there are overlapping reads among the 1D and 1D² directories - namely, those 1D reads that generated 1D² reads!

How do people choose which directory to pull each unique read from when planning a genome assembly? I'd think that the 1D² reads are of higher quality and should be preferred, but perhaps there are 1D² reads which are lower in quality than one of the two 1D reads they were built from?

Just looking for advice on how folks approach 1D² libraries, fastq outputs, and subsequent mapping. For what it's worth I'm hoping to use Minimap2 to proceed with alignments but am happy to hear others' thoughts on alternatives.

Update

I tried the first solution proposed by @gringer, but my filtered_1d_reads.fq file contains the same number of reads as the raw_1d_reads.fq file.

I'm pretty sure the problem is the difference in naming convention between the linear and paired reads. Linear reads always have an alphanumeric read_id of the form (8)-(4)-(4)-(4)-(12), while paired reads have the form (8)-(4)-(4)-(4)-(20)-(4)-(4)-(4)-(12). When I looked at it in more detail, the paired read_id is just a concatenation of the read_ids of the two linear reads that were paired. So that (20) value is really the last (12) of the first linear read followed by the first (8) of the second linear read.
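To illustrate the naming convention described above, here is a minimal Python sketch that splits a paired read_id back into its two linear read_ids. It assumes each linear read_id is exactly the 36-character (8)-(4)-(4)-(4)-(12) alphanumeric pattern and that a paired read_id is two of them joined with no separator; the function name split_paired_read_id is made up for this example and is not part of Albacore.

```python
import re

# Assumption: a linear (1D) read_id is 36 characters, alphanumeric groups of
# 8-4-4-4-12 separated by hyphens, as described in the question.
LINEAR_ID = r"[0-9A-Za-z]{8}-[0-9A-Za-z]{4}-[0-9A-Za-z]{4}-[0-9A-Za-z]{4}-[0-9A-Za-z]{12}"
# Assumption: a paired (1D²) read_id is two linear read_ids concatenated directly.
PAIRED_ID = re.compile("(" + LINEAR_ID + ")(" + LINEAR_ID + ")")


def split_paired_read_id(read_id):
    """Return (first_id, second_id) for a paired read_id, or None otherwise."""
    match = PAIRED_ID.fullmatch(read_id)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Any read_id that does not match the paired pattern (for example an ordinary linear read_id) returns None, so the same function can be used to tell the two kinds of ID apart.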

An additional QC step (as suggested by @Wouter_De_Coster) would be a good idea. It could probably be done by using the run statistics file in the workspace directory to generate the read ids for filtering. — gringer, Jan 18 '18 at 09:52
Parsing the sequencing_summary.txt is convenient and fast. I have an old 2D example here in my test data folder, showing that it contains qscores of template, complement, and 2d. If that would be the same for 1D then filtering out 1D2 reads which we don't want to keep because they have lower quality than parents would be a breeze. — Wouter De Coster, Jan 18 '18 at 11:22
Bit silly, but here it goes. "I'd think that the 1D² reads are of higher quality and be preferred, but perhaps there are 1D² reads which are lower in quality than one of the two 1D pairs?" Based on this it would make the most sense to have a script preferentially select 1D2 reads unless they have a lower quality than the 1D 'parents', and keep all 1D reads that did not contribute to a 1D2 read. I don't have a dataset to play with, but this shouldn't be too hard to write I believe. — Wouter De Coster, Jan 18 '18 at 08:06
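The selection rule suggested in this comment could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the mean quality scores and parent read_ids are taken as already loaded into dictionaries (how to get them out of the summary file is not shown, since the column layout for 1D² runs is not confirmed in this thread), and the function name select_reads is hypothetical. One further assumption goes beyond the comment: when a 1D² read is rejected, only its better-scoring parent is kept, so the same molecule is not represented twice.

```python
def select_reads(paired, linear_q):
    """Pick which read IDs to keep.

    paired:   dict of 1D² read_id -> (mean_qscore, parent_id_1, parent_id_2)
    linear_q: dict of 1D read_id -> mean_qscore
    Returns (kept 1D² read_ids, kept 1D read_ids) as two sets.
    """
    keep_paired = set()
    keep_linear = set()
    parents = set()
    for paired_id, (qscore, parent_1, parent_2) in paired.items():
        parents.update((parent_1, parent_2))
        # Best-scoring parent that is actually present in the 1D data.
        known = [p for p in (parent_1, parent_2) if p in linear_q]
        best = max(known, key=linear_q.get, default=None)
        if best is None or qscore >= linear_q[best]:
            # Prefer the 1D² read unless a parent scores higher.
            keep_paired.add(paired_id)
        else:
            # Assumption: fall back to the better parent only.
            keep_linear.add(best)
    # Keep every 1D read that did not contribute to any 1D² read.
    keep_linear.update(set(linear_q) - parents)
    return keep_paired, keep_linear
```

The two returned sets can then be used as ID lists for filtering the 1D² and 1D fastq files respectively.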

gringer · Accepted Answer · 2018-01-17T23:23:54.350

You can exclude the 1D reads that were used to build 1D² reads by generating a list of the ids from the 1D² reads, then filtering those reads out of the 1D fastq files. Here's one way to do that using some accessory scripts that I have written:

$ cat 1d2_reads.fq | ~/scripts/fastx-length.pl | awk '{print $2}' > readNames.txt
$ cat 1d_reads.fq | ~/scripts/fastx-fetch.pl -v -i readNames.txt > filtered_1d_reads.fq

In order to deal with the concatenated read IDs, it's necessary to switch to a language like Perl that supports the use of backreferences for the splitting. The following Perl command (replacing the first line in the above procedure) inserts a line break after the 36th character of the 2nd value (the read ID), then prints that value, so the two linear read IDs end up on separate lines:

$ cat 1d2_reads.fq | ~/scripts/fastx-length.pl | perl -lane '$F[1] =~ s/(.{36})/$1\n/; print $F[1];' > readNames.txt

So a full sequence of operations would look something like this:

$ cat 1d2_reads.fq | ~/scripts/fastx-length.pl | perl -lane '$F[1] =~ s/(.{36})/$1\n/; print $F[1];' > readNames.txt
$ cat 1d_reads.fq | ~/scripts/fastx-fetch.pl -v -i readNames.txt > filtered_1d_reads.fq
$ cat 1d2_reads.fq filtered_1d_reads.fq > unique_reads.fq
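For anyone without the accessory scripts, the first two steps above can be approximated with a short Python sketch. It is not a port of fastx-length.pl or fastx-fetch.pl; it assumes plain four-line fastq records, that the read ID is the first whitespace-delimited token after the '@', and that each linear read ID is 36 characters long. The function names are made up for this example.

```python
LINEAR_ID_LENGTH = 36  # assumption: length of one 1D read ID (8-4-4-4-12 plus hyphens)


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a four-line-per-record fastq."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        yield header, sequence, plus, quality


def record_id(header):
    """Read ID: first whitespace-delimited token of the header, minus the '@'."""
    return header[1:].split()[0]


def parent_ids(paired_fastq):
    """Collect the 1D read IDs embedded in the concatenated 1D² read IDs."""
    ids = set()
    for header, _, _, _ in read_fastq(paired_fastq):
        read_id = record_id(header)
        ids.add(read_id[:LINEAR_ID_LENGTH])
        if len(read_id) > LINEAR_ID_LENGTH:
            ids.add(read_id[LINEAR_ID_LENGTH:])
    return ids


def filter_linear(linear_fastq, exclude, out):
    """Write 1D records whose ID is not in exclude; return how many were kept."""
    kept = 0
    for record in read_fastq(linear_fastq):
        if record_id(record[0]) not in exclude:
            out.write("\n".join(record) + "\n")
            kept += 1
    return kept
```

In use, open 1d2_reads.fq and pass it to parent_ids, then open 1d_reads.fq for reading and filtered_1d_reads.fq for writing and pass them with the resulting set to filter_linear. The final concatenation step stays the same as in the shell version.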
