5

I need to parse an array out of a website. The part of the JavaScript I want to parse looks like this:

_arPic[0] = "http://example.org/image1.jpg";
_arPic[1] = "http://example.org/image2.jpg";
_arPic[2] = "http://example.org/image3.jpg";
_arPic[3] = "http://example.org/image4.jpg";
_arPic[4] = "http://example.org/image5.jpg";
_arPic[5] = "http://example.org/image6.jpg";

I get the whole JavaScript using something like this:

require 'nokogiri'
require 'open-uri'
product_page = Nokogiri::HTML(URI.open(full_url))  # Kernel#open no longer opens URLs in Ruby 3+
product_page.css("div#main_column script")[0]

Is there an easy way to parse all the variables?

the Tin Man
nohayeye

2 Answers

2

If I read you correctly, you're trying to parse the JavaScript and get a Ruby array with your image URLs, yes?

Nokogiri only parses HTML/XML, so you're going to need a different library. A cursory search turns up the RKelly library, which has a parse function that takes a JavaScript string and returns a parse tree.

Once you have a parse tree, you'll need to traverse it, find the nodes of interest by name (e.g. _arPic), and then get the string content on the other side of the assignment.

Alternatively, if it doesn't have to be too robust (and a regex wouldn't be), you can just search the JavaScript with a regex:

/^\s*_arPic\[\d+\]\s*=\s*"(.+?)";\s*$/

might be a good starter regex.
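As a rough sketch of the regex route, something like the following could turn the script text into a Ruby array. The method name extract_ar_pic and the sample script_text are made up for illustration; in practice the text would come from product_page.at("div#main_column script").text. It assumes every assignment sits on its own line and uses double quotes, as in the question.

```ruby
# Hypothetical helper: collects the _arPic assignments from a chunk of
# JavaScript source into a Ruby array, keeping the original indices.
def extract_ar_pic(js)
  # Group 1 is the array index, group 2 is the quoted string.
  # In Ruby, ^ and $ match at line boundaries, so each line is checked on its own.
  line_pattern = /^\s*_arPic\[(\d+)\]\s*=\s*"([^"]+)";\s*$/
  pics = []
  js.scan(line_pattern) { |index, url| pics[index.to_i] = url }
  pics
end

# Sample input copied from the question (assumed layout: one assignment per line).
script_text = <<~JS
  _arPic[0] = "http://example.org/image1.jpg";
  _arPic[1] = "http://example.org/image2.jpg";
  _arPic[2] = "http://example.org/image3.jpg";
JS

ar_pic = extract_ar_pic(script_text)
puts ar_pic
```

Storing each URL at its captured index keeps the Ruby array aligned with the JavaScript one even if the lines appear out of order; any index missing from the script would be left as nil.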

the Tin Man
Ron Warholic
0

The easy way:

_arPic = URI.extract product_page.css("div#main_column script")[0].text

which can be shortened to:

_arPic = URI.extract product_page.at("div#main_column script").text
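One caveat: URI.extract returns every URI-like string it finds in the script, not only the ones assigned to _arPic. A minimal sketch that at least restricts the result to http/https URLs is shown below; the method name http_urls and the sample text are hypothetical. Note that URI.extract is marked obsolete in the Ruby docs, although it is still available.

```ruby
require 'uri'

# Hypothetical helper: returns only the http/https URLs found in a string.
# The second argument to URI.extract limits matches to the given schemes.
def http_urls(text)
  URI.extract(text, %w[http https])
end

# Sample input modelled on the question's script.
script_text = <<~JS
  _arPic[0] = "http://example.org/image1.jpg";
  _arPic[1] = "http://example.org/image2.jpg";
JS

puts http_urls(script_text)
```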
pguardiario