How to get the HTML source of a webpage in Ruby

Question

In browsers such as Firefox or Safari, with a website open, I can right-click the page and select something like "View Page Source" or "View Source." This shows the HTML source for the page.

In Ruby, is there a function (maybe a library) that allows me to store this HTML source as a variable? Something like this:

source = view_source('http://stackoverflow.com')

where source would be this text:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Stack Overflow</title>
etc

score 30 · Accepted Answer · edited May 03 '13 at 20:36

Use Net::HTTP:

require 'net/http'

source = Net::HTTP.get('stackoverflow.com', '/index.html')
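
Note that the two-argument form requests plain HTTP on port 80 and does not follow redirects, so a site that redirects (for example to HTTPS) hands back the redirect response rather than the page. A minimal sketch of one way around that, assuming a hypothetical helper named `fetch_source` (not part of Net::HTTP) and an arbitrary default cap of five redirects:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper, not a library method: returns the page body,
# following at most `limit` redirects (the default of 5 is arbitrary).
def fetch_source(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  # Passing a URI lets Net::HTTP pick HTTP or HTTPS from the scheme.
  response = Net::HTTP.get_response(URI(url))

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # Location may be relative, so resolve it against the current URL.
    fetch_source(URI.join(url, response['location']).to_s, limit - 1)
  else
    response.value # raises for error responses such as 404 or 500
  end
end
```

Usage would then look like `source = fetch_source('https://stackoverflow.com/')`.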

edited May 03 '13 at 20:36

Alan W. Smith

answered Nov 18 '10 at 16:37

robbrit

Nakilon · Answer 2 · 2013-06-21T12:39:32.637

17

require 'open-uri'
source = open(url) { |f| f.read }

UPD: more modern syntax

require 'open-uri'
source = open(url, &:read)
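
On current Rubies the same idea is spelled `URI.open`: open-uri deprecated passing URLs to `Kernel#open` in Ruby 2.7 and dropped it in 3.0. A small sketch, where `page_source` is a hypothetical wrapper name chosen only for illustration:

```ruby
require 'open-uri'

# Hypothetical wrapper: returns the response body as a String.
# With a block, open-uri closes the IO object once the block returns.
def page_source(url)
  URI.open(url, &:read)
end
```

For example: `source = page_source('https://stackoverflow.com')`.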

edited Jun 21 '13 at 12:39

answered Nov 18 '10 at 16:38

Nakilon

2

Even shorter: `source = open(url).read` – Mark Thomas Nov 18 '10 at 18:01
2

@Mark Thomas, it will not close connection. – Nakilon Nov 18 '10 at 19:16
2

Both of these will close the connection? – Tom Rossi Sep 08 '13 at 20:05

score 13 · Answer 3 · answered Nov 18 '10 at 17:36

require 'open-uri'
source = open(url).read

short, simple, sweet.
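
One caveat with the non-block form: the IO-like object it returns stays open until it is closed explicitly or garbage-collected. A hedged sketch of two ways to keep the call short while still cleaning up. Both method names are made up for this example, and `URI.open` is used on the assumption of Ruby 2.5 or newer:

```ruby
require 'open-uri'

# Hypothetical helper: non-block form with an explicit close.
def read_then_close(url)
  io = URI.open(url)
  begin
    io.read
  ensure
    io.close
  end
end

# Hypothetical helper: open-uri also adds #read to HTTP(S) URI objects,
# which opens the URL, reads the body, and closes in a single call.
def read_via_uri(url)
  URI.parse(url).read
end
```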

answered Nov 18 '10 at 17:36

Matt Rose

6

Will not close connection. – Nakilon Nov 18 '10 at 19:17
And I had to do `URI.open...`. – Sixtyfive Mar 22 '22 at 13:36

score 7 · Answer 4 · answered Nov 18 '10 at 16:38

Yes, like this:

require 'open-uri'

open('http://stackoverflow.com') do |file|
    #use the source Eric
    #e.g. file.each_line { |line| puts line }
end
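
As a concrete version of the commented-out idea, here is a sketch that walks the response line by line and keeps only the lines matching a pattern. `lines_matching` is a hypothetical name, and `URI.open` is an assumption for Ruby 2.5+ (a bare `open` no longer accepts URLs on Ruby 3.0+):

```ruby
require 'open-uri'

# Hypothetical helper: returns an Array of the lines that match `pattern`.
def lines_matching(url, pattern)
  URI.open(url) do |file|
    # each_line without a block returns an Enumerator we can filter.
    file.each_line.select { |line| line.match?(pattern) }
  end
end
```

For example, `lines_matching('https://stackoverflow.com', /<title>/)` would collect any lines containing a title tag.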

answered Nov 18 '10 at 16:38

Skilldrick

1

+1 for use the source :D – tckmn Sep 01 '13 at 16:58

score 3 · Answer 5 · edited May 23 '17 at 12:34

You could use the built-in Net::HTTP:

>> require 'net/http'
>> Net::HTTP.get 'stackoverflow.com', '/'

Or one of the several libraries suggested in "Equivalent of cURL for Ruby?".
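
Because `Net::HTTP.get` returns only the body, it can help to look at the status as well when a page comes back unexpectedly short. A minimal sketch using `Net::HTTP.get_response`; the helper name `fetch_with_status` and the layout of the returned hash are illustrative choices, not part of the library:

```ruby
require 'net/http'

# Hypothetical helper: returns status code, content type, and body together.
def fetch_with_status(host, path = '/', port = 80)
  response = Net::HTTP.get_response(host, path, port)
  {
    code: response.code,                    # a String such as "200" or "301"
    content_type: response['content-type'], # nil when the header is absent
    body: response.body
  }
end
```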

edited May 23 '17 at 12:34

Community

answered Nov 18 '10 at 16:37

Josh Lee

score 3 · Answer 6 · answered Nov 18 '10 at 16:39

Another thing you might be interested in is Nokogiri, an HTML and XML parser that is very easy to use. Its front page has example code that should get you started and help you see whether it fits what you need.

answered Nov 18 '10 at 16:39

Topher Fangio

1

Nokogiri has nothing to do with retrieving a page, it only parses the page once it's been retrieved by a HTTP client or read from a file. It's a very important distinction. – the Tin Man Dec 18 '15 at 19:18
@theTinMan - Indeed, this was more informational and perhaps should have been posted as a comment rather than an answer. My assumption was that after getting the HTML, the OP would want to do something with it :-) – Topher Fangio Dec 18 '15 at 19:21
1

We hope they'd want to do something more with it, rather than clog a network and bog down a CPU. – the Tin Man Dec 18 '15 at 19:22

score 3 · Answer 7 · answered Nov 18 '10 at 16:42

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')

puts page.body

You can then do a lot of other cool stuff with Mechanize as well.

answered Nov 18 '10 at 16:42

Beanish

score 1 · Answer 8 · edited Nov 18 '10 at 16:51

If you have cURL installed, you could simply shell out to it:

url = 'http://stackoverflow.com'
html = `curl #{url}`
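
Interpolating a URL into backticks passes it through the shell, so a URL containing characters such as `;` or `&` can change what gets executed. A sketch that avoids the shell by passing the arguments as a list, assuming curl is on the PATH; `curl_source` and its `curl:` keyword are hypothetical names introduced here:

```ruby
require 'open3'

# Hypothetical helper: runs curl without a shell and returns its stdout.
# The `curl:` keyword only exists so a different binary can be named.
def curl_source(url, curl: 'curl')
  # --silent hides the progress meter, --location follows redirects,
  # --fail makes curl exit non-zero on HTTP error responses.
  stdout, status = Open3.capture2(curl, '--silent', '--location', '--fail', url)
  raise "curl exited with status #{status.exitstatus}" unless status.success?

  stdout
end
```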

If you want to use pure Ruby, look at the Net::HTTP library:

require 'net/http'
stack = Net::HTTP.new 'stackoverflow.com'
# ...later...
page = '/questions/4217223/how-to-get-the-html-source-of-a-webpage-in-ruby'
html = stack.get(page).body
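
If several pages come from the same host, `Net::HTTP.start` with a block opens one session and finishes it when the block ends. A rough sketch along those lines; `fetch_pages` and its `port:` keyword are illustrative names, and an HTTPS site would additionally need `use_ssl: true` passed to `start`:

```ruby
require 'net/http'

# Hypothetical helper: returns a Hash mapping each path to its body.
def fetch_pages(host, paths, port: 80)
  Net::HTTP.start(host, port) do |http|
    paths.each_with_object({}) do |path, pages|
      pages[path] = http.get(path).body
    end
  end
end
```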

edited Nov 18 '10 at 16:51

Nakilon

answered Nov 18 '10 at 16:40

Phrogz
