
I'm trying to read a gzip-compressed file line by line.

I used the method suggested in this post. It works for roughly the first 700 lines of the file, then stops without any error and silently skips the millions of lines that follow.

Here is a minimal working example (Rust 1.57.0):

use std::io::{prelude::*, BufReader};
use std::fs::File;
use flate2; // 1.0
use flate2::read::GzDecoder;

fn main() {
    let r1 = "/home/path/to/bigfile.gz";
    let file = File::open(r1).unwrap();
    let reader = BufReader::new(GzDecoder::new(file));
    let mut i = 0;
    for l in reader.lines() {
        println!("{}", i);
        i+=1;
    }
}

Since this code compiles and is able to read the start of the file, why does it stop at some point?

E_net4 - Krabbe mit Hüten
Lurk
  • 1
    Cannot reproduce the issue with a text file containing two million lines and that exact code. What may help diagnose the problem is to heed the compiler warning and handle the `Result` in `l`. – E_net4 - Krabbe mit Hüten Feb 23 '22 at 15:33
  • @E_net4thecurator Thanks for your response. The compiler warning refers to the unused variable `l`; it goes away with `println!("{} {}", i, l.unwrap());`, but the issue remains the same. I assume there is something wrong with my file, but I don't know how to test it. – Lurk Feb 23 '22 at 15:38
  • 1
    Two things come to mind: 1) give more details (rustc version, and the specified version of the `flate2` crate); 2) try a different file, and provide a way to produce one that reproduces the problem. – E_net4 - Krabbe mit Hüten Feb 23 '22 at 15:46
  • @E_net4thecurator So, for the versions, I have rustc 1.57.0 (f1edd0429 2021-11-29) and flate2 = "1.0". Regarding the file, I can't share it since it's medical data. To give more details: the files are fastq.gz files containing sequencing reads, and each one always fails at the same line. I have two types of files, some with indexes and others with reads. The read files have longer lines and fail earlier than the index files. – Lurk Feb 23 '22 at 15:59

1 Answer


I found the issue: my files were not plain gzip but bgzip (BGZF) encoded, so the flate2 decoder treated the end of the first bgzip block as the end of the file.

The solution is to use rust_htslib::bgzf::Reader, like this:

let r1_reader = BufReader::new(Reader::from_path(r1).unwrap());
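
For anyone unsure whether a .gz file is really bgzip, here is a minimal sketch of a check that uses only the standard library. It assumes the header layout described in the BGZF section of the SAM specification: a gzip member header with the FEXTRA flag set and an extra subfield tagged "BC" with a 2-byte payload. The names `is_bgzf` and `file_is_bgzf` are made up for this illustration and are not part of flate2 or rust_htslib.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Returns true if `buf` starts with a gzip member header that carries
/// the BGZF "BC" extra subfield (layout assumed from the SAM spec).
fn is_bgzf(buf: &[u8]) -> bool {
    // Fixed gzip header: magic 1f 8b, CM = 8 (deflate), FLG with FEXTRA (bit 2) set.
    if buf.len() < 12 || buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8 || buf[3] & 4 == 0 {
        return false;
    }
    // XLEN: little-endian length of the extra field, which starts at offset 12.
    let xlen = u16::from_le_bytes([buf[10], buf[11]]) as usize;
    let end = (12 + xlen).min(buf.len());
    let mut pos = 12;
    // Walk the subfields: SI1, SI2, SLEN (u16 LE), then SLEN payload bytes.
    while pos + 4 <= end {
        let slen = u16::from_le_bytes([buf[pos + 2], buf[pos + 3]]) as usize;
        if buf[pos] == b'B' && buf[pos + 1] == b'C' && slen == 2 {
            return true;
        }
        pos += 4 + slen;
    }
    false
}

/// Hypothetical helper: reads the first bytes of `path` and checks them.
fn file_is_bgzf(path: &str) -> io::Result<bool> {
    let mut head = Vec::new();
    File::open(path)?.take(512).read_to_end(&mut head)?;
    Ok(is_bgzf(&head))
}
```

Calling `file_is_bgzf(r1)` on the input file returns `Ok(true)` when the first block looks like BGZF. In that case a single-member decoder such as `GzDecoder` stops after the first block, so a reader that handles every block is needed, such as the one above. As far as I know, flate2's `MultiGzDecoder` should also work, since each BGZF block is a complete gzip member.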
Lurk