I am a beginner and have spent a considerable amount of time on this. I was only partially able to solve it.
Problem: I want to ignore all words that contain either the or The anywhere in them. E.g. atheist, others, The, the will be excluded. However, hottie shouldn't be excluded, because the doesn't occur inside it as a contiguous sequence of letters.
I am using Python's re engine.
Here's my regex:
\b - Start at word boundary
(?! - Negative lookahead to avoid starting with the or The
[tT]he - the or The
)
\w+ - Other letters are fine
(?<! - Negative lookbehind
[tT]he - the or The shouldn't occur at the end of \w+
)
\b - Word boundary
Expected output for a given input:
Input: Atheist Others Their Hello the The bathe hottie tahaie theater
Expected Output: Hello hottie tahaie
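For reference, here is a minimal Python sketch that reproduces what I see. It assumes the pattern is compiled with re.VERBOSE so the layout above can be kept, and it writes the character class as [tT], since a | inside square brackets is a literal pipe character rather than alternation. The variable names are only for illustration.

```python
import re

# The pattern from the question, compiled in verbose mode so the layout can be kept.
pattern = re.compile(r"""
    \b              # start at a word boundary
    (?![tT]he)      # negative lookahead: the word must not start with the/The
    \w+             # the word itself
    (?<![tT]he)     # negative lookbehind: the word must not end with the/The
    \b              # end at a word boundary
""", re.VERBOSE)

text = "Atheist Others Their Hello the The bathe hottie tahaie theater"
expected = ["Hello", "hottie", "tahaie"]

matches = pattern.findall(text)
print(matches)
# ['Atheist', 'Others', 'Hello', 'hottie', 'tahaie']

# Words that match but should have been excluded.
unwanted = [w for w in matches if w not in expected]
print(unwanted)
# ['Atheist', 'Others']
```

So the two lookarounds only check the start and the end of each word. Nothing in the pattern looks at the letters in between, which is where Atheist and Others contain the.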
As one can see in regex101, I am able to exclude most of the words, except words like atheist, i.e. cases where the or The appears in the middle of a word. I searched for this on SO and found some threads, such as "How to exclude specific string using regex in Python?", but they don't seem to be directly related to what I am trying to do.
Any help will be greatly appreciated.
Please note that I am interested in solving this problem using regex only. I am not looking for solutions that use Python's string manipulation.