2

I am working with a string of text that I want to search through to find only 4-letter words. It works, except that it also matches words with more than 4 letters.

import re
test ="hello, how are you doing tonight?"
total = len(re.findall(r'[a-zA-Z]{3}', text))
print (total)

It finds 15, although I am not sure how it found that many. I thought I might have to use \b to pick the beginning and the end of the word, but that didn't seem to work for me.

Aran-Fey
netrate

3 Answers

12

Try this:

re.findall(r'\b\w{4}\b',text)

The regex matches:

\b, which is a word boundary. It matches the beginning or end of a word.

\w{4} matches exactly four word characters (a-z, A-Z, 0-9 or _; in Python 3, \w also matches other Unicode letters and digits unless re.ASCII is set).

\b is yet another word boundary.

As a side note, your code contains a typo: the second argument to re.findall should be the name of your string variable, which is test, not text. Also, your string does not contain any 4-letter words, so the suggested code returns an empty list, giving a count of 0.
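To make the effect of the word boundaries visible, here is a minimal sketch. The sentence in `sample` is a made-up example (it is not from the question), chosen because it actually contains 4-letter words.

```python
import re

# Hypothetical sample sentence that, unlike the question's string,
# contains several four-letter words.
sample = "This text will have some four letter words in it"

# Without \b, {4} also matches the first four letters of longer words.
unanchored = re.findall(r'[a-zA-Z]{4}', sample)

# With \b on both sides, only whole words of exactly four characters match.
anchored = re.findall(r'\b\w{4}\b', sample)

print(unanchored)
# ['This', 'text', 'will', 'have', 'some', 'four', 'lett', 'word']
print(anchored)
# ['This', 'text', 'will', 'have', 'some', 'four']

# The question's original string has no four-letter words at all.
print(re.findall(r'\b\w{4}\b', "hello, how are you doing tonight?"))
# []
```

The extra 'lett' and 'word' entries in the first list are the partial matches inside "letter" and "words", which is the behaviour described in the question.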

Wiktor Stribiżew
diypcjourney
  • Yes, thank you. I did notice that afterwards. I have made the correction and added the \b as well. Great! – netrate Feb 19 '18 at 23:44
0

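Here's a way without regex. Note that, as written, it keeps words longer than 4 letters; use len(i) == 4 instead to keep only words of exactly 4 letters: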

from string import punctuation

s = "hello, how are you doing tonight?"

[i for i in s.translate(str.maketrans('', '', punctuation)).split(' ') if len(i) > 4]

# ['hello', 'doing', 'tonight']
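Since the list comprehension above uses `len(i) > 4`, it keeps words longer than four letters. If exactly four letters is the goal, as in the question, a variant might look like the following sketch. The `sample` string is a hypothetical example, and `split()` without arguments is swapped in for `split(' ')` so that repeated whitespace does not produce empty strings.

```python
from string import punctuation

# Hypothetical sample sentence containing several four-letter words.
sample = "Well, this text does have some short words!"

# Strip punctuation, split on whitespace, keep words of exactly four characters.
cleaned = sample.translate(str.maketrans('', '', punctuation))
four_letter_words = [w for w in cleaned.split() if len(w) == 4]

print(four_letter_words)
# ['Well', 'this', 'text', 'does', 'have', 'some']
```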
jpp
0

You can use re.findall to locate every run of letters, and then filter the results by length:

import re
test = "hello, how are you doing tonight?"
# Find every run of letters, then keep only the runs that are exactly four letters long.
final_words = list(filter(lambda x: len(x) == 4, re.findall(r'[a-zA-Z]+', test)))
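One thing worth checking is how the letters-only pattern and the `\w`-based pattern differ when the text contains digits or underscores. The sketch below uses a hypothetical `sample` string made up for illustration; the third pattern, `\b[a-zA-Z]{4}\b`, is an additional variant and not part of the answer above.

```python
import re

# Hypothetical sample with digits and an underscore-joined identifier.
sample = "Back in 2018 the user_ids were a1b2 and abcd"

# \w counts digits and underscores as word characters.
word_chars = re.findall(r'\b\w{4}\b', sample)

# Runs of letters filtered by length: "user" is picked out of "user_ids".
letter_runs = [w for w in re.findall(r'[a-zA-Z]+', sample) if len(w) == 4]

# Letters only, but still bounded by \b on both sides.
bounded_letters = re.findall(r'\b[a-zA-Z]{4}\b', sample)

print(word_chars)       # ['Back', '2018', 'were', 'a1b2', 'abcd']
print(letter_runs)      # ['Back', 'user', 'were', 'abcd']
print(bounded_letters)  # ['Back', 'were', 'abcd']
```

Which of the three is appropriate depends on whether tokens such as "2018" or "user_ids" should count as words in your text.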
Ajax1234