How to count the number of characters in each line of a file, excluding a list of specific characters?

Question

How can I count the number of characters on each line of a file, excluding those in a specific list? Here is an example file:

你好吗？
我很好，你呢？
我也很好。

I want to exclude any occurrences of ？, ，, and 。 from the count. The output would look like this:

3
5
4
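
To try the commands in the answers, the sample can be recreated as sketched below. The file name `file` is an assumption taken from those commands, and the text is assumed to be stored as UTF-8.

```shell
# Recreate the three-line sample from the question in a file named "file".
printf '%s\n' '你好吗？' '我很好，你呢？' '我也很好。' > file

# Sanity check: 3 lines and 51 bytes
# (16 characters of 3 bytes each in UTF-8, plus 3 newlines).
wc -l < file
wc -c < file
```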

score 3 · Answer 1 · answered Nov 16 '13 at 10:37

A pure bash solution:

while IFS= read -r l; do
    l=${l//[？，。]/}
    echo "${#l}"
done < file
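
The loop depends on bash counting characters rather than bytes, which it only does when the current locale is a UTF-8 one. A minimal sketch of the same idea wrapped in a function follows; the name `count_chars` and its two arguments are hypothetical, not part of the original answer.

```shell
#!/usr/bin/env bash
# count_chars FILE CHARS (hypothetical helper): for each line of FILE, print
# the number of characters left after deleting every character in CHARS.
count_chars() {
    local file=$1 exclude=$2 line
    # '|| [ -n "$line" ]' keeps a final line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line//["$exclude"]/}   # delete the listed characters
        printf '%s\n' "${#line}"      # characters in a UTF-8 locale, bytes in C
    done < "$file"
}

# Usage, assuming a UTF-8 locale and the sample saved as "file":
#   count_chars file '？，。'
```

In a UTF-8 locale this should print 3, 5 and 4 for the sample; in the C locale both the pattern and the length work on bytes, so the numbers will differ.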

gniourf_gniourf

score 2 · Answer 2 · answered Nov 15 '13 at 06:23

Try

sed 's/[，。？]//g' file | perl -C -nle 'print length'

The sed part removes unwanted characters, and the perl part counts the remaining characters.
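
As the comments on this answer point out, both steps can also be done by perl alone. A minimal sketch, assuming the input is UTF-8; the function name `count_chars_perl` is hypothetical, and the `-Mutf8` and `-CSD` switches are there so that perl treats both the one-liner and its input as UTF-8 regardless of the locale.

```shell
# Sketch: delete the listed characters and count what is left, in one perl call.
#   -Mutf8  the one-liner's own text (the character class) is UTF-8
#   -CSD    decode input files and the standard streams as UTF-8
#   -n -l   loop over the lines, stripping the newline before counting
count_chars_perl() {
    perl -Mutf8 -CSD -nle 's/[？，。]//g; print length' "$@"
}

# Usage: count_chars_perl file
```

Without `-Mutf8` the character class would be read as a list of single bytes, and without `-C` the length would be a byte count.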

Hari Menon

We both used `perl`, but on the opposite sides of the pipeline – jordanm Nov 15 '13 at 06:25
yep. The only problem is I am not very sure of how perl handles unicode. Not an expert in perl. – Hari Menon Nov 15 '13 at 06:27
Piping `sed` to `perl` feels very redundant. `perl -Mutf8 -CSD -nle 's/[，。？]//g; print length'` does the trick. – Mark Reed Nov 16 '13 at 01:44
It probably is for perl pros. But from the point of view of a perl newbie, sed+minimum perl will be less intimidating. – Hari Menon Nov 16 '13 at 08:12
Except that the Perl substitution looks *exactly* like the `sed` substitution. See my comment on jordanm's answer. – Dennis Williamson Dec 04 '13 at 17:21

score 2 · Answer 3 · answered Nov 15 '13 at 06:24

One way is to remove those characters from the stream and then use wc -m. Here is an example that uses perl to remove the characters:

perl -pe 's/(？|,|，|。)//g' file.txt |
  while IFS= read -r line; do
    printf '%s' "$line" | wc -m
  done

jordanm

`perl -Mutf8 -CSD -lne 's/[？,，。]//g; print length'` – Dennis Williamson Dec 04 '13 at 17:20

score 2 · Answer 4 · answered Nov 16 '13 at 01:31

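Or, more simply, if one total for the whole file is enough:

tr -d '？,，。' < file | wc -m

Note that wc -m also counts the newline characters, and that tr has to be multibyte-aware for this to work; GNU tr operates on bytes and can mangle UTF-8 text here.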

thom

score 1 · Answer 5 · edited May 23 '17 at 12:01

A simple solution, similar to the sed-based one above, but using awk:

sed 's/[？，。]//g' file | awk '{ print length($0) }'

edited May 23 '17 at 12:01

Community

answered Nov 15 '13 at 08:01

Radu Rădeanu

Awk can do that substitution - no need for `sed`. `awk '{gsub("[？，。]", ""); print length()}'` – Dennis Williamson Dec 04 '13 at 17:27
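
The awk-only variant from the comment can be sketched as a small function that takes the characters to drop as an argument. The name `count_chars_awk` is hypothetical. The sketch assumes an awk that handles multibyte characters (for example gawk in a UTF-8 locale); byte-oriented implementations such as mawk will report different numbers for non-ASCII text.

```shell
# count_chars_awk FILE CHARS (hypothetical helper): delete every character in
# CHARS from each line with gsub(), then print the remaining length.
count_chars_awk() {
    awk -v set="$2" '{ gsub("[" set "]", ""); print length($0) }' "$1"
}

# Usage, assuming gawk and a UTF-8 locale: count_chars_awk file '？，。'
```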
