This is a summary of a discussion and a question that was posted by Devon O'Rourke on Twitter.
Following Albacore basecalls on a 1D² library I get two sets of .fq files, summary stats, etc.: one for the 1D basecalling script, and one for the 1D² script. I'd like to use the reads generated from these scripts in a genome assembly.
After a little bit of grep searching, it seems like there are overlapping reads among the 1D and 1D² directories - namely, those 1D reads that generated 1D² reads!
How do people choose which directory to pull each unique read from when planning to assemble a genome? I'd think that the 1D² reads are of higher quality and should be preferred, but perhaps there are 1D² reads that are lower in quality than one of the two 1D reads they were built from?
Just looking for advice on how folks approach 1D² libraries, fastq outputs, and subsequent mapping. For what it's worth, I'm hoping to use Minimap2 for the alignments, but I'm happy to hear others' thoughts on alternatives.
Update
I tried the first solution proposed by @gringer, but my filtered_1d_reads.fq file contains the same number of reads as the raw_1d_reads.fq file.
I'm pretty sure the problem is the difference in naming convention between the linear and paired reads. Linear reads always have an alphanumeric read_id of the form (8)-(4)-(4)-(4)-(12), while paired reads have (8)-(4)-(4)-(4)-(20)-(4)-(4)-(4)-(12). When I looked at it in more detail, the paired read_id turned out to be just a concatenation of the read_ids of the two linear reads that are partnered. So that (20) block is really the last (12) of the first linear read followed by the first (8) of the second linear read.
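If the naming pattern described above holds, one way to filter the 1D reads is to cut each paired read_id back into its two linear read_ids and drop those from the 1D set. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes that every linear read_id is exactly 36 characters long, that the paired read_id is the two linear read_ids joined with no separator, and that the fastq files are plain 4-line records with the read_id as the first token after "@". The function names (split_paired_id, read_fastq, record_id, filter_1d_reads) are made up for illustration and are not part of Albacore.

```python
import re

# Assumption: a linear read_id is (8)-(4)-(4)-(4)-(12) alphanumeric, 36 chars in total.
LINEAR_ID_LEN = 36
LINEAR_ID_RE = re.compile(
    r"^[0-9A-Za-z]{8}-[0-9A-Za-z]{4}-[0-9A-Za-z]{4}-[0-9A-Za-z]{4}-[0-9A-Za-z]{12}$"
)


def split_paired_id(paired_id):
    """Split a 1D^2 read_id into the two linear read_ids it was built from."""
    first = paired_id[:LINEAR_ID_LEN]
    second = paired_id[LINEAR_ID_LEN:]
    if not (LINEAR_ID_RE.match(first) and LINEAR_ID_RE.match(second)):
        raise ValueError("unexpected paired read_id format: %r" % paired_id)
    return first, second


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line fastq file."""
    while True:
        header = handle.readline()
        if not header:
            return
        sequence = handle.readline()
        plus = handle.readline()
        quality = handle.readline()
        yield (header.rstrip("\n"), sequence.rstrip("\n"),
               plus.rstrip("\n"), quality.rstrip("\n"))


def record_id(header):
    """Return the read_id: the first whitespace-delimited token after '@'."""
    return header[1:].split()[0]


def filter_1d_reads(records_1d, records_1d2):
    """Keep only the 1D records that did not contribute to any 1D^2 read."""
    used = set()
    for record in records_1d2:
        used.update(split_paired_id(record_id(record[0])))
    return [record for record in records_1d if record_id(record[0]) not in used]
```

To use it, open raw_1d_reads.fq and the 1D² fastq file, pass read_fastq(handle) for each to filter_1d_reads, and write the four fields of every returned record to filtered_1d_reads.fq, one per line. The ValueError is deliberate: if any paired read_id does not split cleanly into two linear read_ids, the assumption about the naming is wrong and the filter should not be trusted.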