16

I have a string "Hello I am going to I with hello am". I want to find how many times a word occurs in the string. For example, hello occurs 2 times. I tried this approach, but it only counts characters:

def countWord(input_string):
    d = {}
    for word in input_string:
        try:
            d[word] += 1
        except:
            d[word] = 1

    for k in d.keys():
        print "%s: %d" % (k, d[k])
print countWord("Hello I am going to I with Hello am")

I want to learn how to find the word count.

Ashwini Chaudhary
  • 232,417
  • 55
  • 437
  • 487
Varun
  • 981
  • 3
  • 16
  • 28
  • 1
    `Hello` and `hello` are same? – Ashwini Chaudhary Jul 02 '12 at 20:11
  • 1
    Depending on your use case, there's one more thing you might need to consider: some words have their meanings change depending upon their capitalization, like `Polish` and `polish`. Probably that won't matter for you, but it's worth remembering. – DSM Jul 02 '12 at 20:28
  • Could you define your data set more for us? Will you worry about punctuation such as in `I'll`, `don't`, etc.? Some of these issues are raised in comments below. And what about differences in case? – Levon Jul 02 '12 at 20:38

9 Answers

43

If you want to find the count of an individual word, just use count:

input_string.count("Hello")

Use collections.Counter and split() to tally up all the words:

from collections import Counter

words = input_string.split()
wordCount = Counter(words)
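
As a rough sketch of the difference (the variable names are only illustrative): iterating over a string directly yields characters, which is why the loop in the question tallied letters, whereas split() yields whitespace-separated words that Counter can tally. The last two lines show the substring caveat of str.count raised in the comments.

```python
from collections import Counter

input_string = "Hello I am going to I with Hello am"

# Iterating over a string yields single characters, not words.
first_items = list(input_string)[:5]   # ['H', 'e', 'l', 'l', 'o']

# split() with no arguments breaks the string on runs of whitespace.
words = input_string.split()
word_count = Counter(words)

print(word_count["Hello"])   # 2
print(word_count["I"])       # 2
print(word_count["hello"])   # 0: matching is case-sensitive, missing keys count as 0

# str.count counts substrings, so it can over-count whole words.
print("am ham".count("am"))              # 2
print(Counter("am ham".split())["am"])   # 1
```

Counter is a dict subclass, so individual words can be looked up with ordinary indexing.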
Joel Cornett
  • 23,166
  • 9
  • 59
  • 85
  • Is collections module part of basic python installation? – Varun Jul 02 '12 at 20:13
  • 1
    I'm copying part of a comment by @DSM left for me since I also used `str.count()` as my initial solution - this has a problem since `"am ham".count("am")` will yield 2 rather than 1 – Levon Jul 02 '12 at 20:35
  • 1
    @Varun: I believe `collections` is in Python 2.4 and above. – Joel Cornett Jul 02 '12 at 23:32
  • @Levon: You're absolutely right. I believe using Counter, along with a regex word collector is probably the best option. Will edit answer accordingly. – Joel Cornett Jul 02 '12 at 23:33
  • 1
    Well .. credit goes to @DSM who made me aware of this in the first place (since I was using `str.count()` too) – Levon Jul 02 '12 at 23:37
  • why not just use len() instead of count()? words = input_string.split() ... wordCount = len(words) – Bimo Jun 27 '17 at 19:09
6

Counter from collections is your friend:

>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())
Martijn Pieters
  • 963,270
  • 265
  • 3,804
  • 3,187
5
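from collections import Counter
import re

def countWords(text):
    # Lower-case the text, then collect runs of word characters and apostrophes
    return Counter(re.findall(r"[\w']+", text.lower()))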

Using re.findall is more versatile than split: it drops surrounding punctuation, so "hello," and "hello" count as the same word, while the pattern [\w']+ still keeps contractions such as "don't" and "I'll" in one piece.

Demo (using your example):

>>> countWords("Hello I am going to I with hello am")
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

If you expect to be making many of these queries, this will only do O(N) work once, rather than O(N*#queries) work.
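
A minimal sketch of that point, assuming a hypothetical helper named build_index (it is not part of the original answer): the text is tokenized once, and each later lookup is a dictionary access instead of another pass over the text.

```python
import re
from collections import Counter

def build_index(text):
    """Tokenize `text` once and return a Counter of lower-cased words."""
    return Counter(re.findall(r"[\w']+", text.lower()))

text = "Hello I am going to I with hello am. Don't panic, don't!"
index = build_index(text)   # O(N) work, done once

# Each query is now a dictionary lookup rather than a rescan of the text.
for word in ("hello", "don't", "panic", "missing"):
    print(word, index[word])
```

Words that never occur simply report 0, because Counter returns 0 for missing keys.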

ninjagecko
  • 83,651
  • 23
  • 134
  • 142
3

The vector of occurrence counts of words is called bag-of-words.

Scikit-learn provides a nice module to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,  # the default: keep terms seen in at least one document
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
# get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

Output (the default token_pattern only keeps tokens of two or more characters, so "I" is dropped):

2 am
1 going
2 hello
1 to
1 with

Part of the code was taken from this Kaggle tutorial on bag-of-words.

FYI: How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

Franck Dernoncourt
  • 69,497
  • 68
  • 312
  • 474
2

Here is an alternative, case-insensitive, approach

sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2

It matches by converting the string and target into lower-case.

ps: Takes care of the "am ham".count("am") == 2 problem with str.count() pointed out by @DSM below too :)
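
The same idea wrapped in a function, as a sketch only; count_word is a hypothetical name, and ignoring case by default is an assumption about what the asker wants.

```python
def count_word(text, word, ignore_case=True):
    """Count whole-word occurrences of `word` in whitespace-separated `text`."""
    if ignore_case:
        text, word = text.lower(), word.lower()
    # Comparing whole tokens avoids the substring problem of str.count.
    return sum(1 for w in text.split() if w == word)

s = "Hello I am going to I with hello am"
print(count_word(s, "Hello"))                      # 2
print(count_word(s, "Hello", ignore_case=False))   # 1
print(count_word("am ham", "am"))                  # 1
```

Because it splits on whitespace only, a token such as "hello," with trailing punctuation would not match "hello".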

Levon
  • 129,246
  • 33
  • 194
  • 186
  • 2
    Using count by itself can lead to unexpected results, though: `"am ham".count("am") == 2`. – DSM Jul 02 '12 at 20:07
  • @DSM .. good point .. I'm not happy with this solution anyway since it's case sensitive, looking at an alternative right now ... – Levon Jul 02 '12 at 20:08
2

Considering Hello and hello as same words, irrespective of their cases:

>>> from collections import Counter
>>> strs="Hello I am going to I with hello am"
>>> Counter(map(str.lower,strs.split()))
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})
Ashwini Chaudhary
  • 232,417
  • 55
  • 437
  • 487
1

You can split the string into words and count them. Note that this gives the total number of words in the string, not the number of times a particular word occurs:

count = len(my_string.split())

Booharin
  • 709
  • 9
  • 8
  • Code-only answers are considered low quality: make sure to provide an explanation what your code does and how it solves the problem. It will help the asker and future readers both if you can add more information in your post. See also Explaining entirely code-based answers: https://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers – borchvm Jan 23 '20 at 12:28
0

You can use Python's regex module re to find all matches of a substring in the string. re.findall returns them as a list, whose length is the count.

import re

input_string = "Hello I am going to I with Hello am"

print(len(re.findall('hello', input_string.lower())))

Prints:

2
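
Note that a bare pattern such as 'hello' also matches inside longer words. One possible refinement, sketched here with a hypothetical helper count_whole_word, anchors the pattern with \b word boundaries and escapes the search term with re.escape:

```python
import re

def count_whole_word(word, text):
    """Count case-insensitive occurrences of `word` as a whole word in `text`."""
    # \b matches at a word boundary; re.escape protects any regex metacharacters.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

print(count_whole_word("hello", "Hello I am going to I with Hello am"))  # 2
print(count_whole_word("am", "am ham"))                                   # 1
print(len(re.findall("am", "am ham")))                                    # 2 (substring matches)
```

Passing re.IGNORECASE makes the separate call to lower() unnecessary.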
ode2k
  • 2,585
  • 12
  • 20
0
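def countSub(pat, string):
    """Count occurrences of pat in string, including overlapping ones."""
    result = 0
    # Try every position where pat could start
    for i in range(len(string) - len(pat) + 1):
        for j in range(len(pat)):
            if string[i + j] != pat[j]:
                break  # mismatch: pat does not start at i
        else:
            # Inner loop finished without break: full match at i
            result += 1
    return result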
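
For comparison, a short self-contained sketch (count_overlapping is a hypothetical name, written with str.startswith rather than the character-by-character loop above) showing how counting at every start position differs from str.count, which skips past each match it finds:

```python
def count_overlapping(pat, string):
    """Count occurrences of `pat` in `string`, allowing matches to overlap."""
    if not pat:
        return 0  # assumption: treat an empty pattern as having no matches
    return sum(1 for i in range(len(string) - len(pat) + 1)
               if string.startswith(pat, i))

print(count_overlapping("aa", "aaaa"))     # 3: matches start at indexes 0, 1 and 2
print("aaaa".count("aa"))                  # 2: str.count only counts non-overlapping matches
print(count_overlapping("am", "am ham"))   # 2: still substring-based, not whole-word
```

Like str.count, this approach counts substrings, so it does not by itself distinguish "am" from the "am" inside "ham".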
  • 2
    Hello, welcome to SO. Your answer contains only code. It would be better if you could also add some commentary to explain what it does and how. Can you please [edit] your answer and add it? Thank you! – Fabio says Reinstate Monica Nov 01 '18 at 21:51