
I was wondering if I could use the gff parsing capability of bioawk to facilitate the parsing of gtf files, and I looked at the following help message:

$ bioawk -c help
bed:
    1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts 
sam:
    1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual 
vcf:
    1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info 
gff:
    1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute 
fastx:
    1:name 2:seq 3:qual 4:comment 

I see that there are 10 fields defined for gff parsing.

However, the pages describing gtf and gff on the Ensembl website (gff/gtf and gff3) all list only 9 fields.

I'm curious about these "filter" and "group" fields. What are they meant for? Are they extracted from some of the columns mentioned in the above pages?

bli

1 Answer


I would consider the description there a bug. The field named "filter" is actually the strand, "strand" is the frame, "group" is the attribute column, and "attribute" does nothing. These are really meant to be the 9 standard columns.

Edit: There's a bug report related to this.

Edit 2: I've made a pull request to clarify this and fix the aforementioned bug report.

Edit 3: I realized that I never directly answered the title of your question (mea culpa). bioawk itself will work with gff, gff3, or gtf files. It really is just treating them as tab-separated files with named columns (this is surprisingly convenient, since it's a PITA to remember what column does what).
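To illustrate the "tab-separated file with named columns" idea, here is a minimal sketch that emulates it with plain awk rather than bioawk itself. The two records are made up for the example, and the column names are the usual nine GTF/GFF names, which I am assuming here; they are not taken from bioawk's source.

```shell
# Two made-up GTF records (tab-separated, 9 columns); not real annotation data.
gtf=$(printf 'chr1\tdemo\tgene\t100\t500\t.\t+\t.\tgene_id "g1";\nchr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1";\n')

# Bind a name to each column number with -v, so fields can be written as $name.
# This only mimics what bioawk -c gff does; it is plain awk, not bioawk.
out=$(printf '%s\n' "$gtf" | awk -F'\t' -v OFS='\t' \
  -v seqname=1 -v source=2 -v feature=3 -v start=4 -v end=5 \
  -v score=6 -v strand=7 -v frame=8 -v attribute=9 \
  '$feature == "exon" { print $seqname, $start, $end, $strand, $attribute }')

# Show the selected exon record.
printf '%s\n' "$out"
```

With bioawk the same filter should be roughly `bioawk -c gff '$feature == "exon" {print $seqname, $start, $end}' annotation.gtf` (the file name is hypothetical). Those four names are the same before and after the fix; only the names for columns 7 to 9 differ.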

Edit 4: The PR has been merged. If you install from GitHub, you'll see the corrected field names.

Devon Ryan