I am a beginner and have spent a considerable amount of time on this. I was only partially able to solve it.

Problem: I want to ignore all words that contain either the or The. E.g. atheist, others, The, the will be excluded. However, hottie shouldn't be excluded, because the doesn't occur inside the word as a contiguous sequence of letters.

I am using Python's re engine.

Here's my regex:

\b               - Start at word boundary
(?!              - Negative lookahead to avoid starting with the or The
   [t|T]he       - the and The
)
\w+              - Other letters are fine
(?<!             - Negative look behind
    [t|T]he      - the or The shouldn't occur before \w+
)
\b               - Word boundary

Expected output for a given input:

Input: Atheist Others Their Hello the The bathe hottie tahaie theater

Expected Output: Hello hottie tahaie

As one can see in regex101, I am able to exclude most of the words, except words like atheist, i.e. cases where the or The appears inside a word. I searched for this on SO and found some threads, such as How to exclude specific string using regex in Python?, but they don't seem to be directly related to what I am trying to do.

Any help will be greatly appreciated.


Please note that I am interested in solving this problem using regex only. I am not looking for solutions using Python's string manipulation.

watchtower

1 Answer

The approach is simpler than your original regular expression:

\b(?!\w*[tT]he)\w+\b

We match a word, but make sure that there is no the anywhere within the word by using a "padded" negative lookahead. Your original approach only disallowed the at the front or the back of the word, because it allowed for no padding between the word boundary and the.
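
To check this against the sample input from the question, here is a minimal sketch using Python's re module. The variable names are only for illustration, and the character class is written as [tT], since inside brackets | is a literal pipe rather than alternation.

```python
import re

# Padded negative lookahead: from the start of each word, \w* may consume
# any number of word characters before [tT]he is tried, so the lookahead
# rejects a word that contains "the" or "The" anywhere.
pattern = re.compile(r"\b(?!\w*[tT]he)\w+\b")

# Sample input taken from the question.
text = "Atheist Others Their Hello the The bathe hottie tahaie theater"

matches = pattern.findall(text)
print(matches)  # ['Hello', 'hottie', 'tahaie']
```

Because \w does not match a space, the lookahead cannot run past the end of the current word, so each word is judged on its own.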

(?![tT]he) only looks for the at the current position, while (?!\w*[tT]he) allows the check to extend from the current position, because the \w* can be used as filler.
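
The difference is easiest to see by running both lookaheads side by side. This is just a sketch on the question's sample input; the names unpadded and padded are made up for the comparison.

```python
import re

# Sample input taken from the question.
text = "Atheist Others Their Hello the The bathe hottie tahaie theater"

# Unpadded lookahead: only rejects words that START with "the"/"The",
# because the check happens right at the word boundary.
unpadded = re.findall(r"\b(?![tT]he)\w+\b", text)

# Padded lookahead: \w* acts as filler, so "the"/"The" is rejected
# wherever it occurs inside the word.
padded = re.findall(r"\b(?!\w*[tT]he)\w+\b", text)

print(unpadded)  # ['Atheist', 'Others', 'Hello', 'bathe', 'hottie', 'tahaie']
print(padded)    # ['Hello', 'hottie', 'tahaie']
```

The unpadded version still lets Atheist, Others and bathe through, since in those words the comes after the first letter.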

Corion
  • Thanks. Do you mind explaining this a bit? I am having a hard time understanding your logic. Specifically, why do you have a negative lookahead for `\w*` with `[tT]he`? I didn't get this part. Also, why is it that words that end in `the` (e.g. bathe) are ignored? – watchtower Oct 13 '18 at 08:06
  • I use the `\w*` to allow the RE engine to look forward an arbitrary number of characters into the word. That way, it checks the whole word for `the`, and if it finds it, it fails the whole match. – Corion Oct 13 '18 at 11:15
  • Thanks Corion. This is very helpful. It seems `?!\w*[tT]he` would search for `the` and `The` anywhere in a word. I am curious: what's the difference between `?!\w*[tT]he` and `?![tT]he`? I tried this on regex101 and was not sure about the highlights it threw at me. Could you please explain this, if you don't mind? – watchtower Oct 13 '18 at 16:20
  • `(?![tT]he)` only looks for `the` _at the current position_, while `(?!\w*[tT]he)` allows the check to extend from the current position, because the `\w*` can be used as filler. – Corion Oct 13 '18 at 16:25