Explanation for a complicated regular expression

Question

I have some text data as follows.

{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}

From the text above, I am extracting only the title (Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance) using the following regular expression:

appcodename = re.search(r'Title: ((?:(?!<br>).)+)', message).group(1)

I am trying to understand how the above regular expression works.

(?!<br>) is a negative lookahead for <br>.

(?:(?!<br>).)+ - what does this mean? Can someone break it down for me? Also, how many capture groups are there in the regular expression?

This isn't complicated. You obviously haven't seen [this](http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html). — rr-, May 21 '15 at 08:19
@rr - Just started wrapping my head around regular expression.Will probably get to that in a couple of years :) — liv2hak, May 21 '15 at 08:20
@liv2hak: :) Keep up your experiments, mind that everyone who is answering here also study it more and more every day. — Wiktor Stribiżew, May 21 '15 at 08:39
Pick a regular expression visualizer/debugger, either a local GUI app or an online one like [Debuggex](https://www.debuggex.com/); it'll make your life a whole lot easier. — abarnert, May 21 '15 at 08:57
This looks like a broken JSON. You should probably ask the site owner to fix the data. — nhahtdh, May 21 '15 at 09:30
@nhahtdh - I deliberately removed some sensitive data from JSON :) — liv2hak, May 21 '15 at 09:42
@liv2hak: "Don't trouble trouble till trouble troubles you. It only doubles trouble and troubles others, too." :) — Wiktor Stribiżew, May 21 '15 at 09:50

score 3 · Answer 1 · edited May 23 '17 at 12:14

You do not need such a complicated regex to get the title. Use

Title:\s*(.*?)(?=\s*<br/?>)

See demo

We match Title:, then whitespace \s*, then any characters up to <br> with (.*?)(?=\s*<br/?>).
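A minimal sketch of how this pattern behaves in Python. The `message` value below is a shortened, made-up stand-in for the asker's data, used only for illustration:

```python
import re

# Shortened, made-up stand-in for the asker's message (an assumption).
message = 'Title: Indian Herald: Latest News, Finance <br><br>Product: Gecko<br>CPUs: 8'

# The lazy (.*?) grows one character at a time until the lookahead
# (?=\s*<br/?>) succeeds, so whitespace before <br> stays out of group 1.
pattern = re.compile(r'Title:\s*(.*?)(?=\s*<br/?>)')
title = pattern.search(message).group(1)
print(repr(title))
```

Because the lookahead only checks and does not consume, the match itself ends right after the title text.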

As for (?:(?!<br>).)+, it means match 1 or more characters, each of which is not the start of a <br>. There is an SO post where this construction is explained in detail.

Here is an image from regex101 (go to Regex Debugger tab, then click + on the right) with a visualization of what that construction is doing (checks whether the next characters are <br>, and if not, consumes a character, backtracks where needed, etc.):

[Image: regex101 Regex Debugger screenshot]

As for the question regarding how many capture groups are there in the regular expression, Title: ((?:(?!<br>).)+) has 1 capturing (((?:(?!<br>).)+)) and 1 non-capturing ((?:(?!<br>).)) groups.
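One way to check the group count yourself is the `groups` attribute of a compiled pattern, which counts capturing groups only. This is a small sketch; the sample string passed to `search` is made up for illustration:

```python
import re

pattern = re.compile(r'Title: ((?:(?!<br>).)+)')

# .groups counts capturing groups only; the non-capturing (?:...) group
# and the lookahead (?!...) do not add to the count.
print(pattern.groups)

# Made-up sample input (an assumption, not the asker's data).
match = pattern.search('Title: Some page <br>Product: Gecko')
print(match.groups())
```

The tuple returned by `match.groups()` has a single element, which is what `group(1)` returns in the original code.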

answered May 21 '15 at 08:18

Wiktor Stribiżew

thanks :) But I am trying to learn, so I would like to know why my regex works as well :) – liv2hak May 21 '15 at 08:19
I added the explanation. – Wiktor Stribiżew May 21 '15 at 08:19
Also, please check the link I added. – Wiktor Stribiżew May 21 '15 at 08:22
You can also check the visualization of what `(?:(?!<br>).)+` is doing. – Wiktor Stribiżew May 21 '15 at 08:30
@liv2hak: `.` in `(?:(?!<br>).)+` does not refer to anything, it tells the regex engine to match any character but a newline. However, before matching ("consuming") that character, the regex engine checks whether `<br>` starts at that position. If it does, the `.` pattern does not trigger and the character is not consumed. If it does not, the character is consumed and the engine goes on matching the substring in our input. – Wiktor Stribiżew May 21 '15 at 22:12

score 2 · Answer 2 · edited May 21 '15 at 08:23

First of all, you don't need a lookahead here. What you're doing can also be done with this simpler regex:

>>> re.search(r'Title: *(.+?) *<br>', message).group(1)
"Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance"

btw your regex:

Title: ((?:(?!<br>).)+)

is using a negative lookahead (?!<br>), which checks for the presence of <br> before matching each character after the literal text Title: .
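A small side-by-side sketch of the two patterns. The `message` string is a shortened, made-up stand-in for the real data, and the variable names are only illustrative:

```python
import re

# Shortened, made-up stand-in for the asker's message (an assumption).
message = 'Title: Example Herald: News, Finance <br><br>Product: Gecko'

# Lookahead version: consumes every character that does not start "<br>",
# so the space just before <br> ends up inside the capture.
tempered = re.search(r'Title: ((?:(?!<br>).)+)', message).group(1)

# Simpler version: " *" outside the group absorbs surrounding spaces.
simple = re.search(r'Title: *(.+?) *<br>', message).group(1)

print(repr(tempered))
print(repr(simple))
```

The only difference on this input is the trailing space, which the lookahead version keeps and the simpler version leaves out.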

edited May 21 '15 at 08:23

answered May 21 '15 at 08:18

anubhava

if it is after each character why is '.' coming after the negative look ahead? – liv2hak May 21 '15 at 08:22
Yes my bad, it is checking for presence of `<br>` **before** matching a character. – anubhava May 21 '15 at 08:23

score 1 · Answer 3 · answered May 21 '15 at 08:40

What ((?:(?!<br>).)+) means is:

((?:(?!<br>).)+)
^... Match the regex and capture its match into backreference 1

((?:(?!<br>).)+)
 ^... Match the regex (non capturing group)

((?:(?!<br>).)+)
    ^... Assert that it is not possible to match the regex <br>

((?:(?!<br>).)+)
            ^... Match a single character, that is not a line break character 

((?:(?!<br>).)+)
              ^... Between one and unlimited times
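The steps above can be sketched as a plain loop. This is a simplified illustration, not how the regex engine is actually implemented: the helper name `tempered_match` is hypothetical, the sample text is made up, and backtracking is ignored (it is not needed here because nothing follows the group in the pattern).

```python
import re

def tempered_match(text, start, stop='<br>'):
    """Hand-rolled equivalent of (?:(?!<br>).)+ starting at index `start`.

    Returns the consumed substring, or None if not even one character
    can be consumed (the + requires at least one repetition).
    """
    i = start
    while i < len(text):
        if text.startswith(stop, i):  # (?!<br>) fails here: stop repeating
            break
        if text[i] == '\n':           # . does not match a newline by default
            break
        i += 1                        # . consumes one character
    return text[start:i] if i > start else None

# Made-up sample input (an assumption, not the asker's data).
text = 'Title: abc <br>def'
by_hand = tempered_match(text, len('Title: '))
by_regex = re.search(r'Title: ((?:(?!<br>).)+)', text).group(1)
print(repr(by_hand), repr(by_regex))
```

Both approaches stop at the same position, just before the first `<br>`.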

Andie2302

capture its match into backreference 1 - means it is capturing group 1? – liv2hak May 21 '15 at 09:46
@liv2hak yes, backreference 1 = capturing group 1 – Andie2302 May 21 '15 at 09:47
The non-capturing group applies only to `<br>` or also to `.`? – liv2hak May 21 '15 at 09:51
The non-capturing group applies also to `.` --> `(?:(?!<br>).)` – Andie2302 May 21 '15 at 09:55
So the `.` here refers to the characters following `<br>` and not to the characters before the `<br>` but after `Title` – liv2hak May 21 '15 at 10:04
