0

Currently, I am grabbing titles using the following method:

title = html_response[/<title[^>]*>(.*?)<\/title>/,1]

This works well for extracting "This is a title" from <title>This is a title</title>. However, some web pages open the title tag on one line, put the title text on the next line, and then close the title tag.

The Ruby line above doesn't match titles laid out that way, so I'm looking for a fix.

halfer
LewlSauce

2 Answers

4

A famous Stack Overflow post explains why it's a bad idea to use regular expressions to parse HTML. A better approach is to use a gem like Nokogiri to extract the title tag.

Community
Mori
1

Obligatory caveat: don't use regex to parse HTML.

title = html_response[/<title[^>]*>(.*?)<\/title>/m,1]

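The m modifier enables multiline mode: in Ruby it makes . match newline characters as well, so the capture can span several lines.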
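As a rough illustration of the difference, here is a minimal sketch. The sample string is a hypothetical page made up for this example, and the call to strip is an added assumption to trim the newlines and indentation that the multiline capture picks up.

```ruby
# Hypothetical sample page with the title text on its own line.
html_response = "<html><head><title>\n  This is a title\n</title></head><body></body></html>"

# Without /m, `.` does not match "\n", so the pattern finds no match.
single_line = html_response[/<title[^>]*>(.*?)<\/title>/, 1]

# With /m, `.` also matches "\n", so the capture spans the line breaks.
multi_line = html_response[/<title[^>]*>(.*?)<\/title>/m, 1]

# The capture keeps the surrounding whitespace; strip it if a match was found.
title = multi_line && multi_line.strip
```

Here single_line comes back as nil, while title holds the text with the leading and trailing whitespace removed.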

cfeduke