Assumptions:
- fields are delimited by non-alphanumerics (ie, [^[:alnum:]])
- comparisons are case-insensitive
- in OP's expected output the intergenic count should be 4
- ignore blank lines
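To illustrate these assumptions on a single line, here's a small sketch. The name first_field is a hypothetical helper used only for this demonstration, and it assumes an awk that supports POSIX character classes (eg, gawk):

```shell
# first_field is a hypothetical helper name (not part of the solution below)
first_field() {
    # split on any non-alphanumeric, then lowercase the first piece
    printf '%s\n' "$1" | awk '{ split($0,arr,"[^[:alnum:]]"); print tolower(arr[1]) }'
}

first_field 'promoter-TSS (NM_004753)'    # first delimiter == "-"
first_field 'Intergenic7+e3u'             # first delimiter == "+"
```

This should print promoter and intergenic7, one per line.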
Sample data:
$ cat list.dat
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
Intron (NM_201628, intron 1 of 14) # case change for first character
# blank line
Intergenic7+e3u # first delimiter == "+"
intron9(NM_201628, intron 1 of 14) # first delimiter == "("
One awk idea:
awk '
{ split($0,arr,"[^[:alnum:]]")            # split line using all non-alphanumerics as delimiters
  if ( arr[1] != "" )                     # if first field is non-empty (skips blank lines) ...
     count[tolower(arr[1])]++             # lowercase/count the first field
}
END { for (i in count)                    # loop through list of words
          printf "%s count : %s\n", i, count[i]
    }
' list.dat
# or as a one-liner (sans comments)
awk '{split($0,arr,"[^[:alnum:]]"); if (arr[1] != "") count[tolower(arr[1])]++} END {for (i in count) printf "%s count : %s\n", i, count[i]}' list.dat
This generates:
intergenic7 count : 1
intron9 count : 1
intron count : 3
intergenic count : 4
promoter count : 1
NOTES:
- we're using tolower() to facilitate case-insensitive comparison, so all output is lowercase
- OP hasn't stipulated how the output is to be used so this solution could be further modified to take into consideration display order, saving to a file, saving to an associative array, etc
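
On the display order point: awk does not guarantee the order in which for (i in count) visits the array, so the lines may come out in a different order than shown above. One possible approach, sketched here under the assumption that "highest count first, then alphabetical" is the desired order, is to pipe the output through sort. The function name count_first_fields is hypothetical, and the sample file is recreated without the trailing annotations:

```shell
# count_first_fields is a hypothetical wrapper name around the awk script above
count_first_fields() {
    awk '
    { split($0,arr,"[^[:alnum:]]")        # split line on non-alphanumerics
      if ( arr[1] != "" )                 # skip lines with an empty first field
         count[tolower(arr[1])]++
    }
    END { for (i in count)
              printf "%s count : %s\n", i, count[i]
        }
    ' "$1" |
    LC_ALL=C sort -t: -k2,2nr -k1,1       # count (descending), then word (ascending)
}

# recreate the sample data (annotations omitted; includes one blank line)
cat > list.dat <<'EOF'
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
Intron (NM_201628, intron 1 of 14)

Intergenic7+e3u
intron9(NM_201628, intron 1 of 14)
EOF

count_first_fields list.dat
```

With the sample data this should list intergenic (4) first, then intron (3), followed by the three single-count words in alphabetical order. Appending > counts.txt (an example filename) to the last command would save the result to a file instead.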