I asked a similar question earlier for which Nokogiri was recommended as a solution. I've used Nokogiri and it certainly works fine.

But for certain reasons, I must use a regex to extract a keyword from an HTTP response body.

The format of the keyword is as follows:

<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>

Here, Date is a dynamic value, and I need to extract 'TestExample [Date]' from the HTTP response body. Also, the <title> tag can be in lower or upper case.

Assuming 'response' holds the HTTP response body, I have tried the following:

>> response
=> "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

Then I build a regex to search with:

>> regex
=> /<title>TestExample (.*?)<\/title>/mi

When I call response[regex], I get no result. I also get no results from response.match(regex) or response.scan(regex).

How can I do this task using regex?


Update:

For this task, this regex works fine:

response.match(/<title>(.*)<\/title>/mi).captures.first
the Tin Man
Sunshine

3 Answers

3

As other people said, Regex is not the way to go. If you're really bound to using Regexes (not just being too lazy to refactor?), this should do the trick:

response.match(/<title>(.*)<\/title>/mi).captures.first
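Note that the one-liner above raises NoMethodError when nothing matches, because match returns nil in that case. A minimal sketch of a nil-safe variant follows. It assumes the response body is already in a string; extract_title is a hypothetical helper name, and the lazy .*? is a swapped-in choice of mine (not what the line above uses) so the match stops at the first closing tag:

```ruby
# Sample body copied from the question.
response = "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

# extract_title is a hypothetical helper name, not part of any library.
def extract_title(body)
  # i: ignore case; m: let the dot match newlines; .*? is lazy, so the
  # match ends at the first closing tag rather than the last one.
  match = body.match(/<title>(.*?)<\/title>/mi)
  # match is nil when the pattern is not found. captures is the array of
  # parenthesised groups, so captures.first is the text of group 1.
  match && match.captures.first
end

title   = extract_title(response)
missing = extract_title("<html></html>")

puts title.inspect    # => "TestExample [Date]"
puts missing.inspect  # => nil
```

The caller can then test the return value for nil instead of rescuing an exception.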
Patrick Oscity
  • Thanks for the answer. It sure works. :) ... I am noobie but not too lazy. :) ... as I mentioned earlier, I have used Nokogiri for other tasks but for this one, I must use regex only. Could you please tell about captures.first? – Sunshine Jun 10 '13 at 20:03
  • Ok then, no offense, I was just trying to find out ;-) I'm just curious: why can't you use Nokogiri? – Patrick Oscity Jun 10 '13 at 20:07
  • `captures` gives you the capture groups, i.e. the parts of the regex enclosed in parentheses `(...)`, as an array. `first` will give you the first element of the array. – Patrick Oscity Jun 10 '13 at 20:08
  • I've seen HTML with duplicated and multiple `<title>` tags, which would cause this to behave badly, especially when the `<head>` block was repeated after the body. – the Tin Man Jun 10 '13 at 20:30
  • @theTinMan I hear ya. Let me try to share some more info on this regex use. There is a web app running on a device. It's got a welcome page where user first lands in post log in. This url is pretty minimal in content & has certain standard keywords. The regex is to extract & match these keyword. It is not a frequently updated app, the target page has a pre-set content & most (dynamic) functionality lies on separate pages. Do you still see any issue with using regex? – Sunshine Jun 10 '13 at 20:39
  • @Sunshine the problem is not how complex your specific HTML is, regular expressions just aren't the right tool for this because in general **they cannot parse HTML**. Regular expressions can only parse regular languages and HTML is not such a language. [Here's a fun read about this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Patrick Oscity Jun 10 '13 at 20:58
2

The correct way to handle this IS using a parser. Nokogiri will handle every requirement you stated, without breaking because of case differences or a difference in date.

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>
EOT
doc.at('title').text
=> "TestExample [Date]"

doc = Nokogiri::HTML(<<EOT)
<HTML>
<HEAD> <TITLE>TestExample [1/1/2000]</TITLE></HEAD>
</HTML>
EOT
doc.at('title').text
=> "TestExample [1/1/2000]"

doc = Nokogiri::HTML(<<EOT)
<HTML>
<HEAD> <TiTlE>TestExample [Jan. 1, 2000]</tItLe></HEAD>
</HTML>
EOT
doc.at('title').text
=> "TestExample [Jan. 1, 2000]"

doc.title
=> "TestExample [Jan. 1, 2000]"
the Tin Man
  • just curious, can I locate the keyword if it is *anywhere* in the http response? – Sunshine Jun 10 '13 at 20:42
  • The keyword? You mean tag? If it's in the parsed HTML body, yes. More importantly, it won't be fooled if `"<title>"` is in text somewhere, unlike a regex which would have a very hard time telling. – the Tin Man Jun 10 '13 at 21:47
1

You can try with this pattern too:

/(?<=<title>)[^<]++/i

[^<] means any character except < (a negated character class)
[^<]+ means 1 or more characters from this class
[^<]++ means 1 or more characters from this class, matched possessively

A possessive quantifier tells the regex engine not to backtrack into what it has already matched, which can improve performance.

Example:

response.match(/(?<=<title>)[^<]++/i)

The idea is to avoid the dot and replace it with a character class that excludes <.

Note that the result is the whole match, so there is no need for a capture group here and no need to test what comes after. I removed the m modifier (Ruby's equivalent of DOTALL) because I don't use the dot.

I only check with a lookbehind that <title> comes immediately before.
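
As a rough sketch of how the pattern above could be used, the snippet below runs it against the sample body from the question. It assumes the body is already in a string named response; the variable names title and whole are only illustrative:

```ruby
# Sample body copied from the question.
response = "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

# String#[] with a regex returns the matched text, or nil if nothing matches.
# The lookbehind only asserts that <title> precedes the match, so the
# tag itself is not part of the result.
title = response[/(?<=<title>)[^<]++/i]
puts title.inspect  # => "TestExample [Date]"

# Equivalent with match: index 0 of the MatchData is the whole match.
m = response.match(/(?<=<title>)[^<]++/i)
whole = m && m[0]
```

Because + requires at least one character, an empty <title></title> yields nil rather than an empty string.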

Casimir et Hippolyte