5

I tried the following command:

uniq -u reference.fasta >> reference_uniq.fasta

I'd like a count of the unique headers.

terdon
crispr

3 Answers

5

If you just want the number of unique headers, you can do this:

grep '>' reference.fasta | sort | uniq | wc -l

If you want a list of the unique headers, you can do this:

grep '>' reference.fasta | sort | uniq

If you want a histogram of how many times each header occurs, you can do this:

grep '>' reference.fasta | sort | uniq -c | awk '{n = $1; sub(/^ *[0-9]+ /, ""); printf("%s\t%s\n", n, $0)}'
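
As a minimal sketch of how the histogram can be used, the snippet below lists only the headers that occur more than once, most frequent first. The sample file and the function name `duplicated_headers` are made up for illustration and are not part of the original answer.

```shell
# Build a tiny FASTA file for illustration (hypothetical data).
demo_fasta=$(mktemp)
printf '%s\n' '>seq1 sample A' 'ACGT' '>seq2 sample B' 'GGCC' '>seq1 sample A' 'ACGT' > "$demo_fasta"

# Count each header, sort by the count (numeric, descending),
# and keep only the headers seen more than once.
duplicated_headers() {
    grep '^>' "$1" | sort | uniq -c | sort -k1,1nr | awk '$1 > 1'
}

duplicated_headers "$demo_fasta"
```

If this prints nothing, every header in the file is already unique.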
terdon
conchoecia
3

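The uniq command only collapses adjacent duplicate lines, so it expects sorted input. Note also that uniq -u prints only the lines that are not repeated, rather than one copy of each line. The sort command has its own "unique" option, -u, which means uniq is not strictly needed. For the fastest processing, you can look for the '>' character at the start of lines with grep: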

grep '^>' reference.fasta | sort -u > reference_headers_unique.fasta

For returning the number of unique lines, pipe through wc -l:

grep '^>' reference.fasta | sort -u | wc -l
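
To illustrate why the sorting matters, here is a small sketch on a made-up three-record file (the data and variable names are assumptions for the example). It compares uniq on unsorted input, sort -u, and the uniq -u flag used in the question.

```shell
# Hypothetical sample: '>chr1' appears twice, but not on adjacent header lines.
demo_fasta=$(mktemp)
printf '%s\n' '>chr1' 'ACGT' '>chr2' 'GGCC' '>chr1' 'ACGT' > "$demo_fasta"

# Unsorted: the two '>chr1' lines are not adjacent, so uniq keeps both.
# (tr strips the padding that some wc implementations add.)
unsorted_count=$(grep '^>' "$demo_fasta" | uniq | wc -l | tr -d ' ')

# Sorted: duplicates become adjacent and collapse to a single line.
sorted_count=$(grep '^>' "$demo_fasta" | sort -u | wc -l | tr -d ' ')

# uniq -u keeps only lines that occur exactly once, so '>chr1' disappears.
only_once=$(grep '^>' "$demo_fasta" | sort | uniq -u)

echo "uniq without sort: $unsorted_count"
echo "sort -u:           $sorted_count"
echo "sort | uniq -u:    $only_once"
```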

For more information about regular expressions, see here.

gringer
    This seems like the best solution given. Just note that it assumes the desired output order for the unique headers is the order provided by sort (which may not be stable depending on the locale). – Chris_Rands Oct 10 '18 at 09:39
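
One way to make the order reproducible across machines, as a sketch: set LC_ALL=C for the sort step so that lines are compared byte by byte. The sample headers and the function name `unique_headers_c_locale` are invented for the example.

```shell
# Hypothetical sample with mixed-case headers, where locale-aware
# collation and byte order can disagree.
demo_fasta=$(mktemp)
printf '%s\n' '>b' 'ACGT' '>A' 'GGCC' '>a' 'TTAA' '>b' 'ACGT' > "$demo_fasta"

# Force byte-wise comparison so the order does not depend on the user's locale.
unique_headers_c_locale() {
    grep '^>' "$1" | LC_ALL=C sort -u
}

unique_headers_c_locale "$demo_fasta"
```

In the C locale uppercase letters sort before lowercase ones, so this prints >A, >a, >b regardless of the environment it runs in.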
1

If, for some reason, you need to keep the original order of the input file, so you only want the first occurrence of each header, you can do:

awk '/^>/ && !a[$0]++' reference.fasta > headers

Or, to get the number only:

awk '/^>/ && !a[$0]++{k++}; END{print k}' reference.fasta 

However, these only make sense if you need to keep the order of the original file. If not, use gringer's grep approach, which took about half the wall-clock time in the test below (reference.fasta is hg19):

$ time awk '/^>/ && !a[$0]++{k++}; END{print k}' reference.fasta 
93

real    0m34.510s
user    0m31.861s
sys     0m2.634s

$ time grep '^>' reference.fasta | sort -u | wc -l
93

real    0m16.597s
user    0m0.820s
sys     0m3.725s
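
The difference in output order between the two approaches can be seen on a tiny made-up file (the data and variable names below are assumptions for illustration):

```shell
# Hypothetical sample: '>chr2' comes first in the file and is repeated later.
demo_fasta=$(mktemp)
printf '%s\n' '>chr2' 'GGCC' '>chr1' 'ACGT' '>chr2' 'GGCC' > "$demo_fasta"

# seen[$0]++ evaluates to 0 the first time a header appears, so
# !seen[$0]++ is true only for the first occurrence: file order is kept.
first_seen=$(awk '/^>/ && !seen[$0]++' "$demo_fasta")

# sort -u removes the same duplicates but returns the headers in sorted order.
sorted=$(grep '^>' "$demo_fasta" | sort -u)

printf 'awk (file order):\n%s\n' "$first_seen"
printf 'sort -u (sorted):\n%s\n' "$sorted"
```

Both variants report two unique headers here. Only the awk version keeps >chr2 ahead of >chr1.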
terdon