I am new to Nanopore sequencing analysis and have a few questions about fast5 files:

  1. How do I know whether my fast5 file is a multi-read or a single-read file?
  2. Is there a way to combine all the fast5 files into a single fast5 file?
  3. How can I extract the flowcell type and sequencing kit information from a fast5 file? I know I can use the following command to retrieve it:
h5dump -a /UniqueGlobalKey/context_tags/sequencing_kit -a /UniqueGlobalKey/context_tags/flowcell_type <fast5 file>

However, it works for single-read files but not for multi-read files. Is there a command that works for both? Please correct me if I am wrong. Thanks!

1 Answer

[Note: As of 2024, ONT uses pod5 as the standard output format of its MinKNOW sequencing program, and all pod5 files are "multi" pod5 files.]

1. The quickest way to tell whether it's a single or multi fast5 is the file name. If it mentions channel and read numbers (e.g. ..._read_9_ch_471_...), then it's a single-read file. Alternatively, look inside the file: multi-read files have only read IDs at the top level (i.e. read_<32-character-string-with-additional-dashes>). Example:

$ h5ls <read_file_name>.fast5 | head
read_0115573a-042a-42a3-b4e2-f0338b4a6c66 Group
read_015b870d-8a03-4f22-a73e-fb9f833b94c7 Group
read_0167393a-dede-4a27-91af-494622b63261 Group
read_01e658de-853b-4f58-93ad-87fbd735254c Group
read_0202acc2-33f4-4d29-9dac-c7f80c758c86 Group
read_022d68f0-6f8e-4124-bfb0-11262efa4a67 Group
read_0258a3c9-4ffe-42ab-a42d-89466fcefbf5 Group
read_032da6db-e84a-43ec-8c25-bcaead992066 Group
read_03383754-be43-420b-87fc-60c2eb833963 Group
read_036b4824-bb0a-42dd-ade0-29d89112ff07 Group
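
The check above can be scripted. What follows is a minimal sketch, not an official ONT tool: the function name fast5_kind is made up for illustration, and it assumes that single-read files have a UniqueGlobalKey group at the top level (as in 3b below) while multi-read files have nothing but read_<ID> groups there. It reads an h5ls listing on standard input, so it only needs awk to run.

```shell
# Classify a fast5 file from its top-level `h5ls` listing (read on stdin).
# fast5_kind is a hypothetical helper name, not part of any ONT package.
# Prints "single" if a UniqueGlobalKey group is listed, "multi" if every
# listed entry is named read_<ID>, and "unknown" otherwise.
fast5_kind() {
    awk '
        $1 == "UniqueGlobalKey"   { single = 1 }   # marker of a single-read file
        $1 ~ /^read_[0-9a-f-]+$/  { reads++ }      # read_<ID> entry
        NF > 0                    { total++ }      # any non-empty line
        END {
            if (single)                            print "single"
            else if (total > 0 && reads == total)  print "multi"
            else                                   print "unknown"
        }
    '
}

# Usage (assumes the HDF5 command-line tools are installed):
#   h5ls <read_file_name>.fast5 | fast5_kind
```

If the layout of the files changes in a future release, the two patterns in the awk script are the only parts that should need adjusting.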

2. ONT provides a single_to_multi_fast5 program (part of its ont_fast5_api package) for converting single-read files to the multi-read format. Details can be found on this GitHub page. Here's a quick example of how it could be run:

$ single_to_multi_fast5 --input_path fast5 --save_path fast5_multi \
    --filename_base multi_file_myRunID --batch_size 4000 --recursive

3a. For multi-fast5 files, the kit/cell attributes are in the context_tags group of each read rather than at the top level of the file, because a multi-fast5 file could contain reads from multiple sequencing runs with different run parameters:

$ h5dump -g /read_<readID>/context_tags multi_file_myRunID_0.fast5 | \
  grep -A 11 -e 'sequencing_kit' -e 'flowcell_type'
   ATTRIBUTE "flowcell_type" {
      DATATYPE  H5T_STRING {
         STRSIZE 11;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "flo-min106"
      }
   }
--
   ATTRIBUTE "sequencing_kit" {
      DATATYPE  H5T_STRING {
         STRSIZE 11;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "sqk-pbk004"
      }
   }

3b. For single fast5 files (at least in their most recent version), the same attributes are found under the UniqueGlobalKey group:

$ h5dump -g /UniqueGlobalKey single_file_read_X_ch_Y_strand.fast5 | \
  grep -A 11 -e 'sequencing_kit' -e 'flowcell_type'
      ATTRIBUTE "flowcell_type" {
         DATATYPE  H5T_STRING {
            STRSIZE 11;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "flo-min106"
         }
      }
--
      ATTRIBUTE "sequencing_kit" {
         DATATYPE  H5T_STRING {
            STRSIZE 11;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "sqk-pcs108"
         }
      }

[note that this file structure may change in the future]
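
If only the values are wanted, the h5dump output shown in 3a and 3b can be reduced to one line per attribute. This is a rough sketch that assumes the output layout stays as shown above (an ATTRIBUTE "name" line followed later by a (0): "value" line); the function name h5_attr_values is hypothetical, and the filter only handles scalar string attributes like the ones in this answer.

```shell
# Reduce `h5dump` attribute output (read on stdin) to tab-separated
# "name<TAB>value" lines. h5_attr_values is a hypothetical helper name.
h5_attr_values() {
    awk '
        # Remember the name from lines like:   ATTRIBUTE "flowcell_type" {
        $1 == "ATTRIBUTE" { name = $2; gsub(/"/, "", name) }
        # Print the value from lines like:     (0): "flo-min106"
        $1 == "(0):" && name != "" {
            value = $0
            sub(/^[^"]*"/, "", value)   # drop everything up to the opening quote
            sub(/"[^"]*$/, "", value)   # drop the closing quote and what follows
            print name "\t" value
            name = ""
        }
    '
}

# Usage (assumes the HDF5 command-line tools are installed):
#   h5dump -g /UniqueGlobalKey single_file_read_X_ch_Y_strand.fast5 | \
#     h5_attr_values | grep -e '^sequencing_kit' -e '^flowcell_type'
```

The same filter works on the per-read context_tags output from 3a, since both commands print attributes in the same layout.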

gringer