Count a word in a line only once while for looping

Question

I need help with a specific approach.

I have a list like this:

promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic  
intron (NM_001135610, intron 6 of 7)
Intergenic  
Intergenic  
Intergenic  
intron (NM_201628, intron 1 of 14)

For example, intron is counted more than once per line. I want each word to be counted only once per line.

For the above list, the output should be:

promoter count : 1
intron count : 3 #not 6
Intergenic count : 5

I generally manipulate these kinds of txt files on the command line. I need to run this on a big set, so I would really appreciate any help. Thank you so much!

Any how, search here S.O. there are some duplicates already, like https://stackoverflow.com/questions/15984414/bash-script-count-unique-lines-in-file — Jetchisel, Sep 12 '21 at 00:35
Or this, https://stackoverflow.com/questions/68333498/shell-script-to-get-hosts-and-the-total-number-of-requests-per-host-from-log-fil — Jetchisel, Sep 12 '21 at 00:45
You want the counts of the first column where space or - is the column delimiter? — Shawn, Sep 12 '21 at 00:58

score 0 · Answer 1 · answered Sep 12 '21 at 02:01

With awk:

$ awk -F'[ \t]+|-' '
    { counts[$1]++ }
    END { for (c in counts) printf "%s count : %d\n", c, counts[c] }' input.txt
Intergenic count : 4
intron count : 3
promoter count : 1
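
The 6 versus 3 in the question comes down to counting every occurrence versus counting lines. A small sketch with grep illustrates the difference, assuming the sample from the question is saved as input.txt (the file is recreated here only so the snippet runs on its own; the variable names are made up for illustration):

```shell
# Recreate the sample input from the question
cat > input.txt <<'EOF'
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
intron (NM_201628, intron 1 of 14)
EOF

# grep -o prints every match on its own line, so a line with two hits counts twice
per_occurrence=$(grep -o 'intron' input.txt | wc -l)

# grep -c counts matching lines, so each line contributes at most once
per_line=$(grep -c 'intron' input.txt)

echo "occurrences: $per_occurrence, lines: $per_line"
```

Counting only the first field, as the awk command does, gives the per-line behaviour for every word at once rather than for one pattern at a time.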

Shawn

score 0 · Answer 2 · answered Sep 12 '21 at 09:34

When the first word is always followed by - or a space, you can count the first words with:

sed 's/[ -].*//' file | sort | uniq -c
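
uniq -c prints each count before its word, so the result does not yet match the "word count : N" layout asked for in the question. One possible way to reshape it is a short awk step at the end of the pipeline. This is a minimal sketch, assuming the input is named file as above (the sample is recreated here only so the snippet is self-contained, and the variable name result is made up for illustration):

```shell
# Recreate the sample input from the question under the name "file"
cat > file <<'EOF'
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
intron (NM_201628, intron 1 of 14)
EOF

# uniq -c emits "<count> <word>"; awk swaps the two columns into the requested layout
result=$(sed 's/[ -].*//' file | sort | uniq -c | awk '{ print $2 " count : " $1 }')
printf '%s\n' "$result"
```

The sort step is required because uniq only merges adjacent duplicate lines.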

Walter A

markp-fuso · Answer 3 · 2021-09-12T15:16:06.630

Assumptions:

fields are delimited by non-alphanumerics (i.e., [^[:alnum:]])
comparisons are case-insensitive
in OP's expected output the Intergenic count should be 4, not 5
blank lines are ignored

Sample data:

$ cat list.dat
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
Intron (NM_201628, intron 1 of 14)              # case change for first character
                                                # blank line
Intergenic7+e3u                                 # first delimiter == "+"
intron9(NM_201628, intron 1 of 14)              # first delimiter == "("

One awk idea:

awk '
    { split($0,arr,"[^[:alnum:]]")              # split line using all non-alphanumerics as delimiters
      if ( arr[1] != "" )                       # if not a blank line ...
         count[tolower(arr[1])]++               # lowercase and count the first field
    }
END { for (i in count)                          # loop through list of words
      printf "%s count : %s\n", i, count[i]
    }
' list.dat

# or as a one-liner (sans comments)

awk '{split($0,arr,"[^[:alnum:]]"); if (arr[1] != "") count[tolower(arr[1])]++} END {for (i in count) printf "%s count : %s\n", i, count[i]}' list.dat

This generates:

intergenic7 count : 1
intron9 count : 1
intron count : 3
intergenic count : 4
promoter count : 1

NOTES:

we're using tolower() to facilitate case-insensitive comparison so all output is lowercase
OP hasn't stipulated how the output is to be used so this solution could be further modified to take into consideration display order, saving to a file, saving to an associative array, etc
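
As one example of the display-order point, the for (i in count) loop returns entries in an unspecified order, so a stable order has to be imposed afterwards. A minimal sketch that pipes the result through sort, highest count first, is shown below. It assumes list.dat holds the sample data without the trailing comments, writes the delimiter class as [^A-Za-z0-9] (an ASCII-only stand-in for [^[:alnum:]], chosen here for awk builds that lack POSIX class support), and uses a made-up variable name sorted:

```shell
# Recreate the sample data (without the annotation comments)
cat > list.dat <<'EOF'
promoter-TSS (NM_004753)
intron (NM_001013630, intron 2 of 3)
Intergenic
intron (NM_001135610, intron 6 of 7)
Intergenic
Intergenic
Intergenic
Intron (NM_201628, intron 1 of 14)

Intergenic7+e3u
intron9(NM_201628, intron 1 of 14)
EOF

# Count the lowercased first field per line, then sort on the number after ":"
# (numeric, descending) and break ties on the text before ":"
sorted=$(awk '
    { split($0, arr, "[^A-Za-z0-9]")
      if (arr[1] != "") count[tolower(arr[1])]++
    }
END { for (i in count) printf "%s count : %s\n", i, count[i] }
' list.dat | sort -t: -k2,2nr -k1,1)
printf '%s\n' "$sorted"
```

Redirecting the final printf to a file would cover the "saving to a file" case in the same way.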
