23

How do you use wget to download an entire site (domain A) when its resources are on another domain (domain B)?
I've tried:
wget -r --level=inf -p -k -E --domains=domainA,domainB http://www.domainA

Parsa
  • Wow! No one after all this time? – Parsa Oct 01 '10 at 08:03
  • The reason that command doesn't work is because using --domains by itself doesn't turn --span-hosts on. Adding --span-hosts would've solved the problem. :| – Parsa Oct 19 '14 at 01:19

4 Answers

21
wget --recursive --level=inf --page-requisites --convert-links --html-extension \
     --span-hosts=domainA,domainB url-on-domainA

UPDATE: I remember the command above worked for me in the past (that was in 2010, and I was using GNU Tools for Windows back then); however, when I tried to use it again recently, I had to modify it as follows:

wget --recursive --level=inf --page-requisites --convert-links \
     --adjust-extension --span-hosts --domains=domainA,domainB domainA

The shorthand version for it would be: wget -rEpkH -l inf -D domainA,domainB domainA (-D takes the domain list as its own argument, so it can't be bundled in the middle of the flag group).

Breakdown of the flags used (a combined example follows this list)
  • -r = --recursive
  • -l <depth> = --level=<depth> (maximum recursion depth. 0 or inf mean unlimited recursion)
  • -E = --adjust-extension (append the appropriate .html or .css extension to downloaded files that lack it)
  • -p = --page-requisites (download all the files necessary to display a page, e.g., images and stylesheets)
  • -K = --backup-converted (save a backup of the original file with a .orig extension before conversion)
  • -k = --convert-links (converts links to make them suitable for local viewing)
  • -H = --span-hosts (span to any host; allow downloading from hosts other than the one in the original URL)
  • -D <domain-list> = --domains=<domain-list> (limit spanning to the specified domains)
  • -np = --no-parent (do not ascend to the parent directory; only follow links at or below the starting directory)
  • -U <agent-string> = --user-agent=<agent-string>
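
For instance, a single combined command using the short flags above might look like this (example.com and cdn.example.net are placeholder domains, and the -K backup and -U user-agent settings are purely illustrative, not part of the answer above):

wget -r -l inf -E -p -k -K -H -np -U "Mozilla/5.0" \
     -D example.com,cdn.example.net https://example.com/
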
Reference: the GNU Wget manual (https://www.gnu.org/software/wget/manual/wget.html)
Parsa
  • I get: wget: --span-hosts: Invalid boolean 'domainA,domainB'; use 'on' or 'off'. After changing it to on, it does not work. – Matthew Flaschen Feb 14 '14 at 01:26
  • @MatthewFlaschen What I've written here worked for me. Could you provide the arguments you've used? – Parsa Feb 26 '14 at 02:04
  • I don't have the exact command I ran before. However, I have the same problem with:

    wget --recursive --level=inf --page-requisites --convert-links --html-extension --span-hosts=example.org,iana.org example.org

    I'm using GNU Wget 1.13.4 on Debian.

    – Matthew Flaschen Feb 28 '14 at 05:42
  • 3
    Try --span-hosts --domains=example.org,iana.org - I think --span-hosts needs to be a boolean, and then you use --domains to specify which hosts to span. – Eric Mill Oct 18 '14 at 20:47
  • Konklone, --span-hosts is a boolean from 1.12 onward; I didn't know that. @MatthewFlaschen, I updated the answer. By the way, the old form will still work on 1.11 and earlier, if you're using GNU Tools for Windows. – Parsa Oct 19 '14 at 01:11
  • Amazing answer, clear, with short version, everything explained, before and after, maintained. Wow! Thank you – Tomas Votruba Jul 09 '19 at 13:43
1

wget --recursive --level=inf --page-requisites --convert-links --html-extension -rH -DdomainA,domainB domainA

mnml
  • This partly works. However, for some reason, it doesn't seem to work if the URL (at the end) is a redirect. Also, it downloads links too, not just page requisites. Also, -r and --recursive are the same. – Matthew Flaschen Feb 14 '14 at 01:44
0
wget --page-requisites --convert-links --adjust-extension --span-hosts --domains domainA,domainB domainA

You might need to ignore robots.txt (note, this may be a violation of some terms of service, and you should download the minimum required). See https://www.gnu.org/software/wget/manual/wget.html#Robot-Exclusion .
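
If you do decide to ignore it, here is a minimal sketch reusing the command above (-e robots=off is wget's documented way to switch off robot exclusion; the --wait=1 delay is just an illustrative politeness setting):

wget -e robots=off --wait=1 --page-requisites --convert-links --adjust-extension \
     --span-hosts --domains=domainA,domainB domainA
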

-1

Consider using HTTrack. It has more options than wget for crawling content on other domains. Using wget with --span-hosts, --domains and --accept was insufficient for my needs, but HTTrack did the job. I remember that setting a limit on redirections into other domains helped a lot.
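
A rough sketch of what that might look like with HTTrack (the ./mirror output directory, the filter patterns, and the external-links depth value are assumptions to adapt; check httrack --help for the exact option names in your version):

httrack "http://www.domainA/" -O ./mirror "+*.domainA/*" "+*.domainB/*" --ext-depth=1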