How to find the count of a word in a string?

Question

I have a string "Hello I am going to I with hello am". I want to find how many times a word occurs in the string. For example, hello occurs 2 times. I tried this approach, but it only counts characters:

def countWord(input_string):
    d = {}
    for word in input_string:
        try:
            d[word] += 1
        except:
            d[word] = 1

    for k in d.keys():
        print "%s: %d" % (k, d[k])
print countWord("Hello I am going to I with Hello am")

I want to learn how to find the word count.

Depending on your use case, there's one more thing you might need to consider: some words have their meanings change depending upon their capitalization, like `Polish` and `polish`. Probably that won't matter for you, but it's worth remembering. — DSM, Jul 02 '12 at 20:28
Could you define your data set more for us? Will you worry about punctuation, such as in `I'll`, `don't`, etc. (some of these are raised in the comments below)? And differences in case? — Levon, Jul 02 '12 at 20:38

score 43 · Accepted Answer · answered Jul 02 '12 at 20:05

If you want to find the count of an individual word, just use count:

input_string.count("Hello")
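As the comments below point out, str.count counts substring occurrences, so it can over-count when the word also appears inside longer words. A minimal sketch of the difference, using a hypothetical helper count_whole_word that is not part of the original answer:

```python
def count_whole_word(text, word):
    """Count whitespace-separated tokens equal to `word` (case-sensitive)."""
    return text.split().count(word)

sample = "am ham"
substring_hits = sample.count("am")               # also matches inside "ham"
whole_word_hits = count_whole_word(sample, "am")  # compares whole tokens only
print(substring_hits, whole_word_hits)  # prints: 2 1
```

Splitting first only handles whitespace-separated words; punctuation stays attached to the tokens.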

Use collections.Counter and split() to tally up all the words:

from collections import Counter

words = input_string.split()
wordCount = Counter(words)
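For a concrete run, here is a small sketch that assumes input_string holds the string from the question's function call. Note that split() only breaks on whitespace and Counter keys are case-sensitive, so "Hello" and "hello" would be tallied separately.

```python
from collections import Counter

input_string = "Hello I am going to I with Hello am"

words = input_string.split()   # split on runs of whitespace
wordCount = Counter(words)     # maps each word to its number of occurrences

print(wordCount["Hello"])      # prints: 2
print(wordCount["going"])      # prints: 1
print(wordCount["missing"])    # prints: 0 (absent words count as zero)
```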

Joel Cornett

Is collections module part of basic python installation? – Varun Jul 02 '12 at 20:13
I'm copying part of a comment by @DSM left for me since I also used `str.count()` as my initial solution - this has a problem since `"am ham".count("am")` will yield 2 rather than 1 – Levon Jul 02 '12 at 20:35
@Varun: I believe `collections` is in Python 2.4 and above. – Joel Cornett Jul 02 '12 at 23:32
@Levon: You're absolutely right. I believe using Counter, along with a regex word collector is probably the best option. Will edit answer accordingly. – Joel Cornett Jul 02 '12 at 23:33
Well .. credit goes to @DSM who made me aware of this in the first place (since I was using `str.count()` too) – Levon Jul 02 '12 at 23:37
why not just use len() instead of count()? words = input_string.split() ... wordCount = len(words) – Bimo Jun 27 '17 at 19:09

score 6 · Answer 2 · answered Jul 02 '12 at 20:05

Counter from collections is your friend:

>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())

Martijn Pieters

score 5 · Answer 3 · answered Jul 02 '12 at 20:05

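import re
from collections import Counter

def countWords(text):
    # Treat runs of word characters and apostrophes as words, ignoring case
    return Counter(re.findall(r"[\w']+", text.lower()))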

Using re.findall is more versatile than split: it drops surrounding punctuation, and because the character class includes the apostrophe, contractions such as "don't" and "I'll" are kept as single words.

Demo (using your example):

>>> countWords("Hello I am going to I with hello am")
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

If you expect to be making many of these queries, this will only do O(N) work once, rather than O(N*#queries) work.
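A small sketch of how the two tokenizers differ on punctuated text, and of reusing one Counter for several queries. The sample sentence is made up for illustration and is not from the question.

```python
import re
from collections import Counter

text = "Hello, I'll say hello. Don't say hello again!"

split_counts = Counter(text.lower().split())
regex_counts = Counter(re.findall(r"[\w']+", text.lower()))

# split() keeps punctuation attached, so "hello," and "hello." are separate keys
print(split_counts["hello"])   # prints: 1
# the regex drops the punctuation but keeps apostrophes inside contractions
print(regex_counts["hello"])   # prints: 3
print(regex_counts["don't"])   # prints: 1

# One O(N) pass builds the Counter; each later query is a dictionary lookup
for query in ("hello", "say", "i'll"):
    print(query, regex_counts[query])
```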

ninjagecko

+1 for re. `split` solutions won't work with phrases containing punctuation. – georg Jul 02 '12 at 20:35
This is the best answer for me +1 – Nahko Feb 25 '20 at 22:45

score 3 · Answer 4 · edited May 23 '17 at 11:54

The vector of occurrence counts of words is called a bag-of-words.

Scikit-learn provides a nice class to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,  # the default; keeps every term
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
# get_feature_names_out() needs scikit-learn 1.0+; older releases use get_feature_names()
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

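Output (the single-letter word "I" is missing because the default token pattern only keeps tokens of two or more characters):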

2 am
1 going
2 hello
1 to
1 with

Part of the code was taken from this Kaggle tutorial on bag-of-words.

FYI: How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

score 2 · Answer 5 · edited Jul 02 '12 at 20:43

Here is an alternative, case-insensitive approach:

>>> s = "Hello I am going to I with hello am"
>>> sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2

It matches by converting the string and target into lower-case.

ps: Takes care of the "am ham".count("am") == 2 problem with str.count() pointed out by @DSM below too :)

edited Jul 02 '12 at 20:43

answered Jul 02 '12 at 20:05

Levon

Using count by itself can lead to unexpected results, though: `"am ham".count("am") == 2`. – DSM Jul 02 '12 at 20:07
@DSM .. good point .. I'm not happy with this solution anyway since it's case sensitive, looking at an alternative right now ... – Levon Jul 02 '12 at 20:08

score 2 · Answer 6 · edited Jul 02 '12 at 20:22

Treating Hello and hello as the same word, regardless of case:

>>> from collections import Counter
>>> strs="Hello I am going to I with hello am"
>>> Counter(map(str.lower,strs.split()))
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

edited Jul 02 '12 at 20:22

answered Jul 02 '12 at 20:14

Ashwini Chaudhary

I would go with `Counter(strs.lower().split())`. Reduces some of the overhead for a faster runtime – inspectorG4dget Jul 02 '12 at 20:15
Isn't this just Martijn Pieters' solution now, though? – DSM Jul 02 '12 at 20:21
@DSM I somehow didn't see his solution, updated my solution back to the original version. :) – Ashwini Chaudhary Jul 02 '12 at 20:23

score 1 · Answer 7 · edited Jan 23 '20 at 13:05

You can split the string into words and count how many there are:

count = len(my_string.split())
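Note that this gives the total number of words rather than the count of one particular word. A brief sketch of the distinction, with my_string assumed to hold the question's example:

```python
my_string = "Hello I am going to I with hello am"

tokens = my_string.split()
total_words = len(tokens)             # every whitespace-separated token
hello_count = tokens.count("hello")   # exact, case-sensitive matches only

print(total_words)  # prints: 9
print(hello_count)  # prints: 1
```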

edited Jan 23 '20 at 13:05

answered Jan 23 '20 at 10:02

Booharin

Code-only answers are considered low quality: make sure to provide an explanation what your code does and how it solves the problem. It will help the asker and future readers both if you can add more information in your post. See also Explaining entirely code-based answers: https://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers – borchvm Jan 23 '20 at 12:28

score 0 · Answer 8 · answered Sep 09 '16 at 20:06

You can use Python's regex library re to find all matches of a substring in the string; re.findall returns them as a list.

import re

input_string = "Hello I am going to I with Hello am"

print(len(re.findall('hello', input_string.lower())))

Prints:

2
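Because a plain pattern matches substrings, 'hello' would also be found inside a longer word such as "Othello". One possible refinement, sketched here with a hypothetical helper count_word, wraps the escaped word in \b word boundaries and uses re.IGNORECASE instead of lowercasing the input:

```python
import re

def count_word(word, text):
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

print(count_word("hello", "Hello I am going to I with Hello am"))  # prints: 2
print(count_word("am", "am ham"))                                   # prints: 1
```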

ode2k

score 0 · Answer 9 · answered Nov 01 '18 at 19:45

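def countSub(pat, string):
    # Count occurrences of pat in string, including overlapping ones
    result = 0
    for i in range(len(string) - len(pat) + 1):
        # Compare pat character by character against string starting at i
        for j in range(len(pat)):
            if string[i + j] != pat[j]:
                break
        else:
            # No mismatch was found, so pat matches at position i
            result += 1
    return result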
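The function above slides the pattern across the string one position at a time, so unlike str.count it also counts overlapping occurrences, and like str.count it matches substrings rather than whole words. A compact restatement for comparison (count_overlapping is a hypothetical name, and it uses slicing rather than the character-by-character loop):

```python
def count_overlapping(pat, string):
    """Count occurrences of `pat` in `string`, including overlapping ones."""
    if not pat:
        return 0  # assumption: treat an empty pattern as having no matches
    return sum(
        1
        for i in range(len(string) - len(pat) + 1)
        if string[i:i + len(pat)] == pat
    )

print(count_overlapping("aa", "aaaa"))    # prints: 3
print("aaaa".count("aa"))                 # prints: 2 (non-overlapping matches)
print(count_overlapping("am", "am ham"))  # prints: 2
```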

Hello, welcome to SO. Your answer contains only code. It would be better if you could also add some commentary to explain what it does and how. Can you please [edit] your answer and add it? Thank you! – Fabio says Reinstate Monica Nov 01 '18 at 21:51
