How do you use wget to download an entire site (domain A) when its resources are on another domain, (domain B)?
I've tried:
wget -r --level=inf -p -k -E --domains=domainA,domainB http://www.domainA
4 Answers
wget --recursive --level=inf --page-requisites --convert-links --html-extension \
--span-hosts=domainA,domainB url-on-domainA
UPDATE: I remember the command above worked for me in the past (that was in 2010, and I was using GNU Tools for Windows back then); however, when I tried to use it again recently, I had to modify it as follows:
wget --recursive --level=inf --page-requisites --convert-links \
--adjust-extension --span-hosts --domains=domainA,domainB domainA
The shorthand version would be: wget -rEpkH -l inf -D domainA,domainB domainA (note that -D takes an argument, so it cannot sit in the middle of a flag cluster).
Breakdown of the flags used
- -r = --recursive
- -l <depth> = --level=<depth> (maximum recursion depth; 0 or inf means unlimited recursion)
- -E = --adjust-extension (add appropriate extensions to files that have been converted to HTML or CSS)
- -p = --page-requisites (download all the files necessary to display a page, e.g., images and stylesheets)
- -K = --backup-converted (save a backup of the original file with a .orig extension before conversion)
- -k = --convert-links (convert links to make them suitable for local viewing)
- -H = --span-hosts (span to any host; allow downloading from hosts different from the one in the original URL)
- -D <domain-list> = --domains=<domain-list> (limit spanning to the specified domains)
- -np = --no-parent (do not visit links that are not under the same directory as the current one)
- -U <agent-string> = --user-agent=<agent-string> (identify as agent-string to the server instead of the default)
Reference
- GNU Wget Manual: https://www.gnu.org/software/wget/manual/wget.html
- I get: wget: --span-hosts: Invalid boolean `domainA,domainB'; use `on' or `off'. After changing it to on, it does not work. – Matthew Flaschen Feb 14 '14 at 01:26
- @MatthewFlaschen What I've written here worked for me. Could you provide the arguments you've used? – Parsa Feb 26 '14 at 02:04
- I don't have the exact command I ran before. However, I have the same problem with: wget --recursive --level=inf --page-requisites --convert-links --html-extension --span-hosts=example.org,iana.org example.org. I'm using GNU Wget 1.13.4 on Debian. – Matthew Flaschen Feb 28 '14 at 05:42
- Try --span-hosts --domains=example.org,iana.org. I think --span-hosts needs to be a boolean, and then you use --domains to specify which hosts to span. – Eric Mill Oct 18 '14 at 20:47
- Konklone, --span-hosts is a boolean from 1.12 onward; I didn't know that. @MatthewFlaschen, I updated the answer. By the way, the old form will still work on 1.11 and earlier, if you're using GNU Tools for Windows. – Parsa Oct 19 '14 at 01:11
- Amazing answer: clear, with a short version, everything explained, before and after, maintained. Wow! Thank you. – Tomas Votruba Jul 09 '19 at 13:43
wget --recursive --level=inf --page-requisites --convert-links --html-extension -rH -DdomainA,domainB domainA
- This partly works. However, for some reason, it doesn't seem to work if the URL (at the end) is a redirect. Also, it downloads links too, not just page requisites. Also, -r and --recursive are the same. – Matthew Flaschen Feb 14 '14 at 01:44
wget --page-requisites --convert-links --adjust-extension --span-hosts --domains domainA,domainB domainA
You might need to ignore robots.txt (note, this may be a violation of some terms of service, and you should download the minimum required). See https://www.gnu.org/software/wget/manual/wget.html#Robot-Exclusion .
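If you do decide to bypass robot exclusion, wget accepts .wgetrc-style commands on the command line via -e, so robots=off can be passed without editing any config file. A hedged sketch (the domain names are placeholders, and --wait throttles requests so the mirror stays polite):

```shell
# Sketch only: bypass robots.txt for this run and wait 1 second
# between requests. example.com / cdn.example.com are placeholders.
wget --recursive --level=inf --page-requisites --convert-links \
     --adjust-extension --span-hosts --domains=example.com,cdn.example.com \
     -e robots=off --wait=1 http://example.com/
```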
Consider using HTTrack. It has more options than wget when crawling content on other domains. Using wget with --span-hosts, --domains, and --accept was insufficient for my needs, but HTTrack did the job. I remember that setting a limit on redirections to other domains helped a lot.
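For comparison, a minimal HTTrack invocation might look like the sketch below. The flag spellings here are assumptions from memory, so verify them against httrack --help: -O sets the output directory, the quoted "+" patterns are scan filters that allow the extra domain, and -n fetches non-HTML files near a page (roughly wget's page requisites).

```shell
# Sketch only; flags assumed, verify with `httrack --help`.
# Mirror example.com while allowing resources hosted on cdn.example.com.
httrack "http://example.com/" -O ./mirror \
        "+*.example.com/*" "+cdn.example.com/*" -n
```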
- --domains by itself doesn't turn --span-hosts on. Adding --span-hosts would've solved the problem. :| – Parsa Oct 19 '14 at 01:19