Getting "hello" from "1. hello" with Regex

Question

I am just learning regex and am having trouble extracting a word from each string in a list.

From a list like:

[ "1. hello - jeff", "2. gello - meff", "3. fellow - gef", "12. willow - left"]

I want to retrieve the words: "hello", "gello", "fellow", and "willow".

Here is my simplified code so far:

for i in [ARRAY OF LISTED WORDS]:
  word = re.findall(r'^((?![0-9]?[0-9]. ))\w+', i)
  print(word)

I honestly tried a lot of combinations and couldn't find an article online that I understood. Thanks in advance!

Does this answer your question? [Learning Regular Expressions](https://stackoverflow.com/questions/4736/learning-regular-expressions) — mkrieger1, Oct 06 '20 at 23:52
@ggorlen I have tried r"\w+", I only got the number: "1.", "2.", "3." etc. Your second suggestion got me the same thing. I hope I did it right. — Curtis Hu, Oct 06 '20 at 23:54
@CurtisHu Don't listen to ggorlen. I'm not sure what his pattern is going to achieve. I've posted an answer below. https://stackoverflow.com/a/64235386/2847946 — Mark Moretto, Oct 07 '20 at 00:06
@MarkMoretto, you probably don't realize that OP changed their requirements. See the edit history. For all you know, it'll change again. OP needs to clarify their spec fully or the question is essentially unanswerable, lucky guesswork aside. — ggorlen, Oct 07 '20 at 00:08
Thank you for pointing that out. I still don't see how "\w+" will skip over: a numeric value, a period, and space in the string "1. hello" to only capture "hello." — Mark Moretto, Oct 07 '20 at 00:16
Yeah, as I commented (and removed once requirements changed) `\w+` gives you a list of all words using `findall`. You can take the last or middle item in the list `findall` returns. The point is to show that there are many ways to achieve the result and it's unclear why some might be better than others absent more information. Is it possible some items won't be in the shown format? — ggorlen, Oct 07 '20 at 00:28
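
To make the `\w+` suggestion from the comments concrete, here is a minimal sketch using one of the sample strings from the question. The variable names `item` and `tokens` are illustrative only, and picking index 1 assumes every string has exactly the "number. word - name" shape shown in the question.

```python
import re

item = "1. hello - jeff"

# \w+ matches each run of word characters, so the number, the word,
# and the trailing name all come back as separate list entries.
tokens = re.findall(r"\w+", item)
print(tokens)      # ['1', 'hello', 'jeff']

# The wanted word is the middle entry, assuming the format never varies.
print(tokens[1])   # hello
```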

score 0 · Accepted Answer · answered Oct 06 '20 at 23:53

You are looking for one or more non-space characters ('\S+') that come after one or more digits, a period, and a space ('\d+\.\s') and before a space and a dash ('\s-'):

import re
pattern = r'\d+\.\s(\S+)\s-'
# your_list: the list of strings from the question
[re.findall(pattern, s)[0] for s in your_list]
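
A minimal, self-contained sketch of the approach above, run against the sample list from the question. The names `extract_word` and `items` are illustrative and not part of the original answer. Returning `None` for a string that does not match is an assumption about the desired behavior; with `re.findall(...)[0]`, such a string would raise an `IndexError` instead.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r'\d+\.\s(\S+)\s-')

def extract_word(item):
    """Return the text between 'N. ' and ' -', or None if the item does not match."""
    match = PATTERN.search(item)
    return match.group(1) if match else None

items = ["1. hello - jeff", "2. gello - meff", "3. fellow - gef", "12. willow - left"]
words = [extract_word(item) for item in items]
print(words)  # ['hello', 'gello', 'fellow', 'willow']
```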

DYZ

score 0 · Answer 2 · answered Oct 07 '20 at 00:05

Your regex pattern:

import re

pattern = r"""
    \d+     # 1 or more digits
    \.      # Escaped period character
    \s+?    # 1 or more whitespace characters (lazy)
    (\w+)   # 1 or more word characters (letters, digits, underscore), captured
    \s+     # 1 or more whitespace characters
    -       # hyphen
    .*      # zero or more of anything besides newline.
"""

List of strings:

words = [ "1. hello - jeff", "2. gello - meff", "3. fellow - gef", "12. willow - left"]


for word in words:
    # capture results in a variable
    # re.X for verbose pattern format.
    tmp = re.search(pattern, word, flags = re.X)
    # If variable is not None, print results of the first captured group.
    if tmp:
        print(tmp.group(1))

Output:

hello
gello
fellow
willow
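
Both patterns in the answers require the " - " part after the word. If some items look like the title's "1. hello", with nothing after the word, a variation that anchors on the leading number might be used instead. This is only a sketch under that assumption about the input; the names `LEADING_NUMBER` and `first_word` are hypothetical and not from either answer.

```python
import re

# Anchor at the start: digits, a period, whitespace, then capture the word.
LEADING_NUMBER = re.compile(r'^\d+\.\s+(\w+)')

def first_word(item):
    """Return the word right after the leading 'N. ', or None if there is none."""
    match = LEADING_NUMBER.match(item)
    return match.group(1) if match else None

print(first_word("1. hello"))           # hello
print(first_word("12. willow - left"))  # willow
```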
