24

In browsers such as Firefox or Safari, with a website open, I can right-click the page and select something like "View Page Source" or "View Source." This shows the HTML source for the page.

In Ruby, is there a function (maybe a library) that allows me to store this HTML source as a variable? Something like this:

source = view_source('http://stackoverflow.com')

where source would be this text:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Stack Overflow</title>
etc
the Tin Man
Eric

8 Answers

30

Use Net::HTTP:

require 'net/http'

source = Net::HTTP.get('stackoverflow.com', '/index.html')
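Note that `Net::HTTP.get` does not follow redirects, so for a site that redirects plain HTTP to HTTPS it returns the body of the redirect response rather than the page. A minimal sketch of one way to handle that is below; `fetch_html` and its `limit` parameter are hypothetical names for illustration, not part of Net::HTTP.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Net::HTTP): returns the page body,
# following at most `limit` redirects.
def fetch_html(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit <= 0

  # get_response with a URI object also handles https:// addresses.
  response = Net::HTTP.get_response(URI.parse(url))

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    fetch_html(URI.join(url, response['location']).to_s, limit - 1)
  else
    response.value # raises an HTTP error for any other non-2xx response
  end
end
```

With that defined, `source = fetch_html('http://stackoverflow.com/')` should give the final page source as a String, assuming the host is reachable.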
Alan W. Smith
robbrit
17
require 'open-uri'
# Since Ruby 3.0, Kernel#open no longer accepts URLs; use URI.open instead.
source = URI.open(url) { |f| f.read }

Update: a more concise syntax:

require 'open-uri'
source = URI.open(url, &:read)
Nakilon
13
require 'open-uri'
source = URI.open(url).read

Short, simple, sweet.
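One caveat with the open-uri approach: since Ruby 3.0, `Kernel#open` no longer handles URLs, and `URI.open` falls back to opening a local file when the string is not a URL. A rough sketch that guards against that follows; `page_source` is a hypothetical helper name, not something open-uri provides.

```ruby
require 'open-uri'

# Hypothetical helper: returns the HTML source for an http(s) URL.
def page_source(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes
  # and rejects local paths, ftp://, and anything else.
  raise ArgumentError, "not an HTTP(S) URL: #{url}" unless uri.is_a?(URI::HTTP)

  # The block form closes the underlying IO once the body has been read.
  uri.open(&:read)
end
```

Usage would be `source = page_source('https://stackoverflow.com/')`. A failed request raises `OpenURI::HTTPError`, which the caller can rescue.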

Matt Rose
7

Yes, like this:

require 'open-uri'

URI.open('http://stackoverflow.com') do |file|
  # Use the source, Eric,
  # e.g. file.each_line { |line| puts line }
end
Skilldrick
3

You could use the built-in Net::HTTP:

>> require 'net/http'
>> Net::HTTP.get 'stackoverflow.com', '/'

Or one of the several libraries suggested in "Equivalent of cURL for Ruby?".

Josh Lee
3

Another thing you might be interested in is Nokogiri. It is a parser for HTML, XML, and related formats that is very easy to use. Its front page has some example code that should get you started and help you decide whether it's right for what you need.

Topher Fangio
  • Nokogiri has nothing to do with retrieving a page; it only parses the page once it's been retrieved by an HTTP client or read from a file. It's a very important distinction. – the Tin Man Dec 18 '15 at 19:18
  • @theTinMan - Indeed, this was more informational and perhaps should have been posted as a comment rather than an answer. My assumption was that after getting the HTML, the OP would want to do something with it :-) – Topher Fangio Dec 18 '15 at 19:21
  • We hope they'd want to do something more with it, rather than clog a network and bog down a CPU. – the Tin Man Dec 18 '15 at 19:22
3
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')

puts page.body

You can then do a lot of other cool stuff with Mechanize as well.

Beanish
1

If you have cURL installed, you could simply shell out to it:

url = 'http://stackoverflow.com'
html = `curl #{url}`

If you want to use pure Ruby, look at the Net::HTTP library:

require 'net/http'
stack = Net::HTTP.new 'stackoverflow.com'
# ...later...
page = '/questions/4217223/how-to-get-the-html-source-of-a-webpage-in-ruby'
html = stack.get(page).body
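If you plan to request several pages from the same host, it can help to keep one connection open for all of them. Here is a minimal sketch using `Net::HTTP.start`; the `fetch_paths` helper and its `use_ssl` keyword are hypothetical names for illustration.

```ruby
require 'net/http'

# Hypothetical helper: fetches several paths from one host over a single
# connection and returns a Hash of path => body.
def fetch_paths(host, paths, use_ssl: true)
  return {} if paths.empty? # nothing to fetch, so don't open a connection

  port = use_ssl ? 443 : 80
  Net::HTTP.start(host, port, use_ssl: use_ssl) do |http|
    # The connection stays open until the block returns.
    paths.each_with_object({}) do |path, pages|
      pages[path] = http.get(path).body
    end
  end
end
```

For example, `fetch_paths('stackoverflow.com', ['/questions', '/tags'])` would return both bodies keyed by path. Like the snippet above, this does not follow redirects.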
Nakilon
Phrogz