I tried to follow this question to create a regular expression that separates contractions from the word.

Here is my attempt:

 line = re.sub( r'\s|(n\'t)|\'m|(\'ll)|(\'ve)|(\'s)|(\'re)|(\'d)', r" \1",line) #tokenize contractions

However, only the first match is tokenized. For example, "should've can't mustn't we'll" changes to "should ca n't must n't we".

Wiktor Stribiżew
M.A.G

2 Answers

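\1 refers to the first capturing group only! In your pattern that group is (n\'t), so \1 is empty whenever any other alternative matches, and the matched contraction is replaced by a bare space.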
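As a rough illustration (not part of the original question), the sketch below runs the pattern from the question over a short sample string and records, for each match, the full match alongside the contents of group 1. The variable names `pattern` and `matches` are made up for this example.

```python
import re

# Pattern copied from the question; group 1 is only (n\'t).
pattern = re.compile(r'\s|(n\'t)|\'m|(\'ll)|(\'ve)|(\'s)|(\'re)|(\'d)')

# Pair each full match with whatever group 1 captured (None if it did not participate).
matches = [(m.group(0), m.group(1)) for m in pattern.finditer("should've can't")]
print(matches)
# [("'ve", None), (' ', None), ("n't", "n't")]
```

Only the n't match fills group 1, which is why the other contractions disappear from the result.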

You could put all the options in the same capturing group:

(n\'t|\'m|\'ll|\'ve|\'s|\'re|\'d)
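A minimal sketch of how that single-group pattern could be used with re.sub, assuming the sample sentence from the question and leaving out the \s alternative; the variable name `tokenized` is only for illustration.

```python
import re

line = "should've can't mustn't we'll"

# Every contraction suffix sits in the same capturing group,
# so \1 is set for any alternative that matches.
tokenized = re.sub(r"(n't|'m|'ll|'ve|'s|'re|'d)", r" \1", line)
print(tokenized)
# should 've ca n't must n't we 'll
```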


For more on the topic, I suggest reading Parentheses for Grouping and Capturing.

horcrux

Another variation avoids capture groups altogether by referring to the full match with \g<0> in the replacement.

The single-character alternatives 'm, 's and 'd could be shortened using a character class: '[msd]

Note that the ' does not have to be escaped when the pattern is wrapped in double quotes.

n't|'(?:ll|[vr]e|[msd])


import re

line = "should've can't mustn't we'll"
line = re.sub(r"n't|'(?:ll|[vr]e|[msd])", r" \g<0>", line)
print(line)

Output

should 've ca n't must n't we 'll
The fourth bird