23

How do you use wget to download an entire site (domain A) when its resources are on another domain (domain B)?
I've tried:
wget -r --level=inf -p -k -E --domains=domainA,domainB http://www.domainA

Parsa
  • Wow! No one after all this time? – Parsa Oct 01 '10 at 08:03
  • The reason that command doesn't work is because using --domains by itself doesn't turn --span-hosts on. Adding --span-hosts would've solved the problem. :| – Parsa Oct 19 '14 at 01:19

4 Answers

21
wget --recursive --level=inf --page-requisites --convert-links --html-extension \
     --span-hosts=domainA,domainB url-on-domainA

UPDATE: I remember the command above worked for me in the past (that was in 2010, and I was using GNU Tools for Windows back then); however, when I tried to use it again recently, I had to modify it as follows:

wget --recursive --level=inf --page-requisites --convert-links \
     --adjust-extension --span-hosts --domains=domainA,domainB domainA

The shorthand version for it would be: wget -rEpkH -l inf -D domainA,domainB domainA (-D takes the domain list as its own argument, so it can't be bundled in the middle of the flag group).

Breakdown of the flags used (a combined example follows this list)
  • -r = --recursive
  • -l <depth> = --level=<depth> (maximum recursion depth. 0 or inf mean unlimited recursion)
  • -E = --adjust-extension (append the appropriate .html or .css extension to downloaded files that lack it)
  • -p = --page-requisites (download all the files necessary to display a page, e.g., images and stylesheets)
  • -K = --backup-converted (save a backup of the original file with a .orig extension before conversion)
  • -k = --convert-links (converts links to make them suitable for local viewing)
  • -H = --span-hosts (span to any host; allow downloading from hosts other than the one in the original URL)
  • -D <domain-list> = --domains=<domain-list> (limit spanning to the specified domains)
  • -np = --no-parent (do not ascend to the parent directory; only follow links at or below the starting directory)
  • -U <agent-string> = --user-agent=<agent-string>
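
For instance, a single combined command using the short flags above might look like this (example.com and cdn.example.net are placeholder domains, and the -K backup and -U user-agent settings are purely illustrative, not part of the answer above):

wget -r -l inf -E -p -k -K -H -np -U "Mozilla/5.0" \
     -D example.com,cdn.example.net https://example.com/
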
Reference: the GNU Wget manual (https://www.gnu.org/software/wget/manual/wget.html)
Parsa
  • I get: wget: --span-hosts: Invalid boolean 'domainA,domainB'; use 'on' or 'off'. After changing it to on, it does not work. – Matthew Flaschen Feb 14 '14 at 01:26
  • @MatthewFlaschen What I've written here worked for me. Could you provide the arguments you've used? – Parsa Feb 26 '14 at 02:04
  • I don't have the exact command I ran before. However, I have the same problem with:

    wget --recursive --level=inf --page-requisites --convert-links --html-extension --span-hosts=example.org,iana.org example.org

    I'm using GNU Wget 1.13.4 on Debian.

    – Matthew Flaschen Feb 28 '14 at 05:42
  • 3
    Try --span-hosts --domains=example.org,iana.org - I think --span-hosts needs to be a boolean, and then you use --domains to specify which hosts to span. – Eric Mill Oct 18 '14 at 20:47
  • Konklone, --span-hosts is a boolean from 1.12 onward; I didn't know that. @MatthewFlaschen, I updated the answer. By the way, the old form will still work on 1.11 and earlier, if you're using GNU Tools for Windows. – Parsa Oct 19 '14 at 01:11
  • Amazing answer, clear, with short version, everything explained, before and after, maintained. Wow! Thank you – Tomas Votruba Jul 09 '19 at 13:43
1

wget --recursive --level=inf --page-requisites --convert-links --html-extension -rH -DdomainA,domainB domainA

mnml
  • This partly works. However, for some reason, it doesn't seem to work if the URL (at the end) is a redirect. Also, it downloads links too, not just page requisites. Also, -r and --recursive are the same. – Matthew Flaschen Feb 14 '14 at 01:44
0
wget --page-requisites --convert-links --adjust-extension --span-hosts --domains domainA,domainB domainA

You might need to ignore robots.txt (note, this may be a violation of some terms of service, and you should download the minimum required). See https://www.gnu.org/software/wget/manual/wget.html#Robot-Exclusion .
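
If you do decide to ignore it, here is a minimal sketch reusing the command above (-e robots=off is wget's documented way to switch off robot exclusion; the --wait=1 delay is just an illustrative politeness setting):

wget -e robots=off --wait=1 --page-requisites --convert-links --adjust-extension \
     --span-hosts --domains=domainA,domainB domainA
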

-1

Consider using HTTrack. It has more options than wget for crawling content on other domains. Using wget with --span-hosts, --domains and --accept was insufficient for my needs, but HTTrack did the job. I remember that setting a limit on redirections into other domains helped a lot.
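
A rough sketch of what that might look like with HTTrack (the ./mirror output directory, the filter patterns, and the external-links depth value are assumptions to adapt; check httrack --help for the exact option names in your version):

httrack "http://www.domainA/" -O ./mirror "+*.domainA/*" "+*.domainB/*" --ext-depth=1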