4

I found here this awk script:

BEGIN {
  headertype="";
  }
{
  if($0 ~ "^@") {
    countread++;
    headertype="@";
  }
  else if($0 ~ "^+") {
    headertype="+";
  }
  else if(headertype="@") { # This is a nuc sequence
    len=length($0);
    if (len>4) {
      readlength[len]++;
    }
  }
}
END {
  for (i in readlength){
    countstored+=readlength[i];
    lensum+=readlength[i]*i;
    print i, readlength[i];
  }
  print "reads read = "countread > "/dev/stderr";
  print "reads stored = "countstored > "/dev/stderr";
  print "read average length = "lensum/countstored > "/dev/stderr";
}

and I just wonder if it is possible to shorten it with bioawk?

gringer
  • 14,012
  • 5
  • 23
  • 79
user977828
  • 453
  • 3
  • 9

2 Answers2

8

This can also be done with regular awk.

awk '{if(NR%4==2) {count++; bases += length} } END{print bases/count}' <fastq_file>

The NR%4==2 count the second line out of every block of 4. length is a built-in that defaults to the length of the line, same as length($0). In this case you can inject you custom printing to the END{} block but countread and countstored will always be the same since the averaging is done on the fly.

Bioathlete
  • 2,574
  • 12
  • 29
  • 1
    Note that, like OP’s code, this will only work on FASTQ files that don’t break the sequence and/or quality string across multiple lines. – Konrad Rudolph Aug 20 '18 at 18:01
  • 2
    That is correct. However that practice is highly frowned upon so should not be an issue with most FASTQ files – Bioathlete Aug 20 '18 at 22:04
  • I've yet to ever encounter a FASTQ file which has sequence or quality on multiple lines, I think for all intensive purposes assuming 4 lines per record should be fine. – Matt Bashton Aug 21 '18 at 09:43
3

This script is wrong because a quality string may start with @. With bioawk, it can be simplified to:

bioawk -c fastx '{ readlength[length($seq)]++; countread++ } END{...}'

The END{} block is the same as your original version.

EDIT: forgot -c fastx is the original answer. Thank @MattBashton for pointing this out.

user172818
  • 6,515
  • 2
  • 13
  • 29