
The NLTK book has a couple of examples of word counts, but they are really token counts. For instance, Chapter 1 ("Counting Vocabulary") says that the following gives a word count:

import nltk

text = nltk.Text(tokens)  # tokens: a list of tokens, e.g. from nltk.word_tokenize
len(text)

However, it doesn't - it gives a word and punctuation count. How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word? The obvious answer is:

word_average_length = len(string_of_text) / len(text)

However, this would be off because:

  1. len(string_of_text) is a character count, including spaces
  2. len(text) is a token count, excluding spaces but including punctuation marks, which aren't words.
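
A toy example makes the gap concrete (a minimal sketch, assuming NLTK's word_tokenize; the comments show the resulting values):

import nltk

string_of_text = "Hello, world."
tokens = nltk.word_tokenize(string_of_text)  # ['Hello', ',', 'world', '.']

len(string_of_text)                       # 13 - counts spaces and punctuation
len(tokens)                               # 4  - includes the two punctuation tokens
float(len(string_of_text)) / len(tokens)  # 3.25, though the true average word length is 5.0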

Am I missing something here? This must be a very common NLP task...

Zach

3 Answers


Tokenization with nltk

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It icludes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)

Returns

['This', 'is', 'my', 'text', 'It', 'includes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
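
From here, the two counts the question asks for fall out directly (a sketch using the tokens list from above):

word_count = len(tokens)                                            # 15
avg_word_length = float(sum(len(w) for w in tokens)) / len(tokens)  # 4.0

Note that the \w+ pattern splits "U.S." into the two tokens "U" and "S", which may or may not be what you want.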
petra

Removing Punctuation

Use a regular expression to filter out the punctuation

import re
from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})

Average Number of Characters

Sum the lengths of each word. Divide by the number of words.

>>> float(sum(map(len, filtered))) / len(filtered)
3.75

Or you could reuse the counts you already computed to avoid re-scanning the words. This multiplies each word's length by the number of times it was seen, then sums it all up.

>>> float(sum(len(w)*c for w,c in counts.iteritems())) / len(filtered)
3.75
dhg
  • So the NLTK doesn't have any functions for these operations? – Zach May 20 '12 at 21:14
  • Alternatively, you can use `re.split()` on punctuation and whitespace. – Joel Cornett May 20 '12 at 21:43
  • @Joel: That would cause problems for punctuation that is embedded inside of words (eg, `U.S.`). – dhg May 20 '12 at 21:46
  • @dhg: `U.S.` pretty ambiguous, but I see what you're saying. Out of curiosity, is there any reason you're not using `re.findall()` ? – Joel Cornett May 20 '12 at 22:12
  • 1
    @Joel: 1) I mean't that if you split the word `U.S.` on punctuation, you would get the two words `U` and `S`, and that is wrong. 2) `findall` would work in this particular case, but the way I've written it, you can use the regex to define exactly what it means to be a "punctuation token" (perhaps in a more complex way than I have). – dhg May 20 '12 at 22:19
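
To see the difference these comments are getting at (a sketch; it assumes the tokenizer already produced "U.S." as a single token):

import re

nonPunct = re.compile('.*[A-Za-z0-9].*')
[w for w in ['U.S.', 'economy', '.'] if nonPunct.match(w)]  # ['U.S.', 'economy'] - embedded punctuation survives
re.split(r'[\s.]+', 'U.S. economy')                         # ['U', 'S', 'economy'] - splitting tears the token apart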

Removing Punctuation (with no regex)

Use the same solution as dhg, but test whether each token is alphanumeric instead of matching a regex pattern.

from collections import Counter

>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> filtered = [w for w in text if w.isalnum()]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})

Advantages:

  • Works better with non-English languages, since "À".isalnum() is True while bool(nonPunct.match("à")) is False (an "à" is not a punctuation mark, at least in French); see the check below.
  • Does not require the re module.
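
A quick interactive check of the first point (run under Python 3, where strings are Unicode-aware):

>>> "À".isalnum()
True
>>> import re
>>> bool(re.match('.*[A-Za-z0-9].*', "à"))  # the ASCII-only pattern misses accented letters
False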
Adrien Pacifico