Python Tokenize Contractions using regex

Question

I tried to follow this question to create a regex that separates contractions from the word.

Here is my attempt:

 line = re.sub( r'\s|(n\'t)|\'m|(\'ll)|(\'ve)|(\'s)|(\'re)|(\'d)', r" \1",line) #tokenize contractions

However, only the n't contractions are tokenized; the others are removed. For example, `should've can't mustn't we'll` changes to `should ca n't must n't we`.

No need to use `\1` or wrap the whole pattern in yet another pair of parentheses. To refer to the whole match, you just need `\g<0>`. — Wiktor Stribiżew, Aug 13 '21 at 20:22

horcrux · Accepted Answer · 2021-08-13T13:20:36.163

1

\1 refers to the first capturing group!
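To make this concrete, here is a small sketch that inspects what each match of the question's pattern captures. The sample string and the variable names are only for illustration, and the `re.sub` result assumes Python 3.5 or later, where an unmatched group in the replacement is substituted with an empty string.

```python
import re

# The pattern from the question: (n\'t) is the only group numbered 1.
pattern = r'\s|(n\'t)|\'m|(\'ll)|(\'ve)|(\'s)|(\'re)|(\'d)'
sample = "can't we'll"  # illustrative input, not from the question

# Pair the whole match (group 0) with whatever group 1 captured.
matches = [(m.group(0), m.group(1)) for m in re.finditer(pattern, sample)]
print(matches)

# Group 1 is None for the space and for 'll, so " \1" becomes just " ".
result = re.sub(pattern, r" \1", sample)
print(repr(result))
```

Only the n't match has anything in group 1, which is why every other contraction disappears from the output.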

You could put all the options in the same capturing group:

(n\'t|\'m|\'ll|\'ve|\'s|\'re|\'d)
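As a minimal sketch of how this group could be used, the snippet below applies it to the sample sentence from the question with the same `r" \1"` replacement. The variable names are illustrative, and the pattern is wrapped in double quotes so the apostrophes need no backslash.

```python
import re

line = "should've can't mustn't we'll"

# All alternatives live in one capturing group, so \1 always holds the match.
pattern = r"(n't|'m|'ll|'ve|'s|'re|'d)"

# Insert a space in front of whichever contraction matched.
tokenized = re.sub(pattern, r" \1", line)
print(tokenized)
```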


To dig deeper into the topic, I suggest reading "Parentheses for Grouping and Capturing".

edited Aug 13 '21 at 13:20

answered Aug 13 '21 at 13:18

horcrux

Thank you, but this removes the contractions entirely. I want to tokenize them (separate them) – M.A.G Aug 13 '21 at 13:20
@M.A.G You obviously still have to replace with `r" \1"` – horcrux Aug 13 '21 at 13:21
Oh right, I mistakenly removed it. Thank you! – M.A.G Aug 13 '21 at 13:25

score 1 · Answer 2 · answered Aug 13 '21 at 13:34

Another variation, without capture groups, uses the full match \g<0> in the replacement.

The single-character alternatives 'm, 's and 'd can be shortened using a character class: '[msd]

Note that the ' does not have to be escaped when the pattern is wrapped in double quotes.

n't|'(?:ll|[vr]e|[msd])

import re

line = "should've can't mustn't we'll"
line = re.sub(r"n't|'(?:ll|[vr]e|[msd])", r" \g<0>", line)
print(line)

Output

should 've ca n't must n't we 'll
