1

I have some text data as follows.

{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",Title: Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance <br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}

From the above text, I am extracting only the title (Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance) using the following regular expression:

appcodename = re.search(r'Title: ((?:(?!<br>).)+)', message).group(1)

I am trying to understand how the above regular expression works.

(?!<br>) is a negative lookahead for <br>

(?:(?!<br>).)+ - what does this mean? Can someone break it down for me? Also, how many capture groups are there in the regular expression?

Wiktor Stribiżew
liv2hak

3 Answers

3

You do not need such a complicated regex to get the title. Use

Title:\s*(.*?)(?=\s*<br/?>)

See demo

We match Title:, then optional whitespace with \s*, then any characters up to <br> or <br/> with (.*?)(?=\s*<br/?>).
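As a minimal sketch of how this pattern could be used from Python, assuming the sample data from the question is held in a string named message (the name used in the question's own snippet):

```python
import re

# Sample data copied from the question.
message = (
    '{"Timestamp": "Tue Apr 07 00:32:29 EDT 2015",'
    "Title: Indian Herald: India's Latest News, Business, Sport, Weather, "
    "Travel, Technology, Entertainment, Politics, Finance "
    '<br><br>Product: Gecko<br>CPUs: 8<br>Language: en-GB"}'
)

# Lazy capture up to (but not including) optional whitespace plus <br> or <br/>.
pattern = re.compile(r'Title:\s*(.*?)(?=\s*<br/?>)')

match = pattern.search(message)
title = match.group(1) if match else None
print(title)
```

Because the lookahead includes \s*, the captured title does not carry the trailing space that sits before the first <br> in the sample data.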

As for (?:(?!<br>).)+, it matches one or more characters, each of which is not the start of a <br> sequence; the group itself is non-capturing. There is an SO post where this construction is explained in detail.

Here is an image from regex101 (go to the Regex Debugger tab, then click + on the right) visualizing what that construction does (at each position it checks whether <br> comes next and, if not, consumes one character, then repeats the check):

[image: regex101 debugger screenshot]

As for the question regarding how many capture groups are there in the regular expression, Title: ((?:(?!<br>).)+) has 1 capturing (((?:(?!<br>).)+)) and 1 non-capturing ((?:(?!<br>).)) groups.
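The group count can also be checked from Python itself. This is a small sketch; the shortened sample string is made up for illustration and is not the data from the question:

```python
import re

pattern = re.compile(r'Title: ((?:(?!<br>).)+)')

# Number of capturing groups in the compiled pattern.
group_count = pattern.groups
print(group_count)

# Hypothetical shortened input, only for illustration.
sample = 'Title: Some page <br>Product: Gecko'
captured = pattern.search(sample).groups()
print(captured)
```

Only the outer parentheses show up in groups(); the (?:...) group and the (?!...) lookahead do not create captures.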

Community
Wiktor Stribiżew
2

First of all, you don't need a lookahead here. The same thing can be done with this simpler regex:

>>> re.search(r'Title: *(.+?) *<br>', message).group(1)
"Indian Herald: India's Latest News, Business, Sport, Weather, Travel, Technology, Entertainment, Politics, Finance"

btw your regex:

Title: ((?:(?!<br>).)+)

uses a negative lookahead (?!<br>), which checks that <br> does not start at the current position before matching each character after the literal text Title:.

anubhava
1

What ((?:(?!<br>).)+) means is:

((?:(?!<br>).)+)
^... Match the regex and capture its match into backreference 1

((?:(?!<br>).)+)
 ^... Match the regex (non capturing group)

((?:(?!<br>).)+)
    ^... Assert that it is not possible to match the regex <br>

((?:(?!<br>).)+)
            ^... Match a single character that is not a line break character

((?:(?!<br>).)+)
              ^... Between one and unlimited times
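The steps above can be mimicked with a plain loop. This is only an illustrative sketch of the matching logic, not how the re module works internally; the helper name tempered_match and the short sample string are invented for the example:

```python
import re

def tempered_match(text, start, stop="<br>"):
    """Roughly emulate (?:(?!<br>).)+ beginning at index `start`."""
    pos = start
    while pos < len(text):
        if text.startswith(stop, pos):  # (?!<br>) fails: <br> begins here
            break
        if text[pos] == "\n":           # . does not match a newline by default
            break
        pos += 1                        # . consumes one character
    # The + quantifier needs at least one consumed character.
    return text[start:pos] if pos > start else None

# Hypothetical shortened input, only for illustration.
sample = "Title: abc <br>def"
prefix = "Title: "

by_hand = tempered_match(sample, len(prefix))
by_regex = re.search(r'Title: ((?:(?!<br>).)+)', sample).group(1)
print(repr(by_hand), repr(by_regex))
```

Both approaches stop right before the first <br>, so the trailing space before it stays in the result.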
Andie2302