
The NLTK book has a couple of examples of word counts, but they are really token counts. For instance, Chapter 1 ("Counting Vocabulary") says that the following gives a word count:

import nltk

text = nltk.Text(tokens)  # tokens: a list of tokens, e.g. from nltk.word_tokenize
len(text)

However, it doesn't - it gives a word and punctuation count. How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word? The obvious answer is:

word_average_length = len(string_of_text) / len(text)

However, this would be off because:

  1. len(string_of_text) is a character count, including spaces
  2. len(text) is a token count, excluding spaces but including punctuation marks, which aren't words.
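
A toy example makes the gap concrete (a minimal sketch, assuming NLTK's word_tokenize; the comments show the resulting values):

import nltk

string_of_text = "Hello, world."
tokens = nltk.word_tokenize(string_of_text)  # ['Hello', ',', 'world', '.']

len(string_of_text)                       # 13 - counts spaces and punctuation
len(tokens)                               # 4  - includes the two punctuation tokens
float(len(string_of_text)) / len(tokens)  # 3.25, though the true average word length is 5.0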

Am I missing something here? This must be a very common NLP task...

Zach

3 Answers


Tokenization with nltk

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It icludes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)

Returns

['This', 'is', 'my', 'text', 'It', 'includes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
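
From here, the two counts the question asks for fall out directly (a sketch using the tokens list from above):

word_count = len(tokens)                                            # 15
avg_word_length = float(sum(len(w) for w in tokens)) / len(tokens)  # 4.0

Note that the \w+ pattern splits "U.S." into the two tokens "U" and "S", which may or may not be what you want.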
petra

Removing Punctuation

Use a regular expression to filter out the punctuation

import re
from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})

Average Number of Characters

Sum the lengths of each word. Divide by the number of words.

>>> float(sum(map(len, filtered))) / len(filtered)
3.75

Or you could reuse the counts you already computed to avoid re-scanning the words. This multiplies each word's length by the number of times it was seen, then sums it all up.

>>> float(sum(len(w)*c for w,c in counts.iteritems())) / len(filtered)
3.75
dhg
  • So the NLTK doesn't have any functions for these operations? – Zach May 20 '12 at 21:14
  • Alternatively, you can use `re.split()` on punctuation and whitespace. – Joel Cornett May 20 '12 at 21:43
  • @Joel: That would cause problems for punctuation that is embedded inside of words (eg, `U.S.`). – dhg May 20 '12 at 21:46
  • @dhg: `U.S.` pretty ambiguous, but I see what you're saying. Out of curiosity, is there any reason you're not using `re.findall()` ? – Joel Cornett May 20 '12 at 22:12
  • 1
    @Joel: 1) I mean't that if you split the word `U.S.` on punctuation, you would get the two words `U` and `S`, and that is wrong. 2) `findall` would work in this particular case, but the way I've written it, you can use the regex to define exactly what it means to be a "punctuation token" (perhaps in a more complex way than I have). – dhg May 20 '12 at 22:19
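
To see the difference these comments are getting at (a sketch; it assumes the tokenizer already produced "U.S." as a single token):

import re

nonPunct = re.compile('.*[A-Za-z0-9].*')
[w for w in ['U.S.', 'economy', '.'] if nonPunct.match(w)]  # ['U.S.', 'economy'] - embedded punctuation survives
re.split(r'[\s.]+', 'U.S. economy')                         # ['U', 'S', 'economy'] - splitting tears the token apart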

Removing Punctuation (with no regex)

Use the same solution as dhg, but test whether each token is alphanumeric instead of matching a regex pattern.

from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> filtered = [w for w in text if w.isalnum()]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})

Advantages:

  • Works better with non-English languages, since "À".isalnum() is True while bool(nonPunct.match("à")) is False (an "à" is not a punctuation mark, at least in French); see the check below.
  • Does not require the re module.
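
A quick interactive check of the first point (run under Python 3, where strings are Unicode-aware):

>>> "À".isalnum()
True
>>> import re
>>> bool(re.match('.*[A-Za-z0-9].*', "à"))  # the ASCII-only pattern misses accented letters
False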
Adrien Pacifico