271

Unfortunately, our hosting provider experienced 100% data loss, so I've lost all content for two hosted blog websites: codinghorror.com and blog.stackoverflow.com.

(Yes, yes, I absolutely should have done complete offsite backups. Unfortunately, all my backups were on the server itself. So save the lecture; you're 100% absolutely right, but that doesn't help me at the moment. Let's stay focused on the question here!)

I am beginning the slow, painful process of recovering the website from web crawler caches.

There are a few automated tools for recovering a website from internet web spider (Yahoo, Bing, Google, etc.) caches, such as Warrick, but I had poor results using it:

  • My IP address was quickly banned from Google for using it
  • I got lots of 500 and 503 errors and "waiting 5 minutes…" messages
  • Ultimately, I can recover the text content faster by hand

I've had much better luck working from a list of all the blog posts, clicking through to the Google cache for each one and saving the page as HTML. There are a lot of blog posts, but not an unmanageable number, and I figure I deserve some self-flagellation for not having a better backup strategy. Anyway, the important thing is that this works: I can reliably get the text of the pages out of the Internet caches, and based on what I've done so far I'm confident I can recover all the lost blog post text and comments.

However, the images that go with each blog post are proving…more difficult.

Any general tips for recovering website pages from Internet caches, and in particular, places to recover archived images from website pages?

(And, again, please, no backup lectures. You're totally, completely, utterly right! But being right isn't solving my immediate problem… Unless you have a time machine…)

Simon Hayter
  • 32,999
  • 7
  • 59
  • 119
Jeff Atwood
  • 13,932
  • 18
  • 64
  • 79
  • 4
    This will be a nice test to see if images do live forever on the internet. – rick schott Dec 11 '09 at 21:09
  • 101
    When somebody like Jeff Atwood himself can lose two entire websites in one fell swoop... Well. I'm going to review my own backup procedures, for one :P –  Dec 11 '09 at 21:10
  • 243
    @Phoshi: Jeff has some good articles on Coding Horror on backup. You should give them a quick read. –  Dec 11 '09 at 21:31
  • 36
    joshhunt wins one (1) internet. This offer may not be combined with other offers, exchanged, or substituted. No rainchecks. – Adam Davis Dec 11 '09 at 21:50
  • 7
    @joshhunt: epic! –  Dec 11 '09 at 22:05
  • 5
    I have to ask: Peak (I believe the Trilogy host?) isn't going to be susceptible to anything like this, right? –  Dec 11 '09 at 22:55
  • 30
    The lengths some people will go to, to earn rep on SU... –  Dec 12 '09 at 00:11
  • 4
    Crowd-sourced backup retrieval. Nice... – Luke Dec 12 '09 at 00:38
  • 2
    In Google Reader I have 495 posts going all the way back to March 5, 2007. As others have said, no images though –  Dec 12 '09 at 06:03
  • 1
    I had to recover my wife's company site once from the Google cache as well. The hosting plan did include "nightly back-ups", but failed to deliver on that point when required. Lesson learned. As for recovery of the pictures, you can get quite a few from an image restricted query over archive.org http://web.archive.org/web/sa_re_im_/http://codinghorror.com/ –  Dec 12 '09 at 13:48
  • Comment markup seems to mangle the query url. I have it in an answer below. –  Dec 12 '09 at 13:56
  • 2
    Offtopic but important and related question: does Jeff have offsite backups for the stackoverflow/serverfault/superuser websites? (I should check blog.stackoverflow.com, he probably posted about it there. Oh, wait...) –  Dec 12 '09 at 14:25
  • @CesarB: If all goes wrong, they still have the data dump. – Macha Dec 12 '09 at 14:51
  • @Macha: yes, but the data dump does not have non-public data, the loss of which could be a bit harder to recover from (there probably is a post at blog.stackoverflow.com listing the database tables which are not in the public data dump. Oh, wait...). Not to mention the site's source code (though this last one is probably already "offsite" at least at the developer's machines, since AFAIK it is in a compiled language. I should look for a post at either blog.stackoverflow.com or codinghorror.com which tells which language it was written in...). –  Dec 12 '09 at 21:14
  • 3
    Are you going to use this as a good excuse to lose the post on NP Complete? Sorry, just had to... –  Dec 13 '09 at 04:15
  • 28
    Please don't refer to what you did as "backups" - if those files are on the same server, they're in no way "backups." –  Dec 14 '09 at 22:35
  • Wait, I have Time Machine on my Mac! Does that count?? :-) (Ironically, in this case, having "Time Machine(TM)" actively backing up to an external drive WOULD have saved you!) –  Dec 16 '09 at 17:03
  • 1
    Time machine you say? How about a way-back machine? http://www.waybackmachine.org/ –  Dec 17 '09 at 15:38
  • 1
    Jeff, since comments are disabled on CH now, I'll comment here. I got news of your lost sites on SO, and I remember looking forward with great interest in how you responded to all of it. It was nice to see that your response was mature and humble. Thank you. – John Dec 17 '09 at 22:12
  • 3
    This is why I'm a stickler for the old-fashioned "write it on my computer, then FTP it to the web server". If the server goes down I have all my pages on my computer and vice-versa. – DisgruntledGoat Dec 24 '09 at 23:50
  • 2
    I'm missing how you actually solved this or any kind of follow-up. Extra points for pointing out what you used to begin doing at least one offsite backup. –  Jun 15 '11 at 12:29
  • I created a service http://recovermywebsite.com just because I experienced losing my site... It is in its very early alpha/beta stage, so don't expect too much of it :) Also, it's for retrieving the HTML; for now it doesn't retrieve the images automatically. – Dofs Jan 05 '12 at 14:21
  • 1
    @Jeff Atwood - I remember this debacle but I don't remember how you actually resolved it. Maybe you should add some comments here to tell us what was fruitful and what wasn't. –  May 14 '12 at 13:01
  • Damn! Even I lost it. –  Dec 12 '12 at 17:38
  • It is always advisable to download the website files and take a backup of the database as an SQL file at least once a week. We cannot always trust the hosts. – shasi kanth Feb 12 '14 at 09:27
  • 1
    Who was hosting your website? – Zerium Apr 17 '14 at 00:06
  • 1
    What an awful hosting provider. – Online User Nov 19 '14 at 23:00
  • I sincerely hope that hosting provider is out of business now. – Hashim Aziz Sep 06 '19 at 22:45
  • 1
    Related: https://stackoverflow.blog/2009/12/13/blog-outage-backup-policies/ – corn on the cob Nov 04 '20 at 17:38

42 Answers

225

Here's my wild stab in the dark: configure your web server to return 304 for every image request, then crowd-source the recovery by posting a list of URLs somewhere and asking on the podcast for all your readers to load each URL and harvest any images that load from their local caches. (This can only work after you restore the HTML pages themselves, complete with the <img ...> tags, which your question seems to imply that you will be able to do.)

This is basically a fancy way of saying, "get it from your readers' web browser caches." You have many readers and podcast listeners, so you can effectively mobilize a large number of people who are likely to have viewed your web site recently. But manually finding and extracting images from various web browsers' caches is difficult, and the entire approach works best if it's easy enough that many people will try it and be successful. Thus the 304 approach. All it requires of readers is that they click on a series of links and drag off any images that do load in their web browser (or right-click and save-as, etc.) and then email them to you or upload them to a central location you set up, or whatever. The main drawback of this approach is that web browser caches don't go back that far in time. But it only takes one reader who happened to load a post from 2006 in the past few days to rescue even a very old image. With a big enough audience, anything is possible.
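A minimal sketch of the blanket-304 idea using only Python's standard library (this is my illustration, not anything from the answer itself; the port and the extension list are arbitrary). Note that it only helps visitors whose browsers still hold a cached copy and come back to revalidate it:

# Answer every image request with "304 Not Modified" so browsers fall back to
# whatever copy is still sitting in their local cache.
from http.server import BaseHTTPRequestHandler, HTTPServer

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif')

class NotModifiedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.lower().endswith(IMAGE_EXTENSIONS):
            self.send_response(304)   # "your cached copy is still current"
            self.end_headers()
        else:
            # A real deployment would serve the recovered HTML pages here;
            # this sketch only cares about the image paths.
            self.send_response(404)
            self.end_headers()

if __name__ == '__main__':
    HTTPServer(('', 8080), NotModifiedHandler).serve_forever()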

John Siracusa
  • 1,481
  • 1
  • 8
  • 4
  • 53
    +1 for the most creative approach. Could actually work since CH has so many readers. –  Dec 11 '09 at 22:17
  • 17
    implemented here? http://www.diovo.com/2009/12/getting-cached-images-in-your-website-from-the-visitors/ – Jeff Atwood Dec 14 '09 at 21:18
  • 3
    I think you could crawl your static files for the image tags and copy all of those into one giant page of images, instead of having everybody click each link. The diovo.com implementation looks very impressive, hope it works out for you. –  Dec 15 '09 at 06:00
  • OMG! Very nice analysis. – Soner Gönül Aug 07 '11 at 14:41
  • 4
    In fact, you should be able to retrieve images using canvas and send them home by AJAX. – Tomáš Zato May 20 '14 at 17:24
  • @JeffAtwood chance you remember what that link was doing? It 404s now :( – Mike Ciffone Nov 28 '21 at 03:58
67

Some of us follow you with an RSS reader and don't clear caches. I have blog posts that appear to go back to 2006. No images, from what I can see, but might be better than what you're doing now.

retracile
  • 571
  • 3
  • 4
  • +1 definitely. Google Reader doesn't, but I bet a desktop-based one would. –  Dec 11 '09 at 21:01
  • 3
    You could also ask people to check their browser caches. Those who view Coding Horror retro-style might have some of the images cached. –  Dec 11 '09 at 21:04
  • I've got blog posts back to 2005 in GReader, but unfortunately, they don't have images, and they won't let me just export those as a series of pages... I could email them to you though, Jeff... – Glen Solsberry Dec 11 '09 at 21:08
  • Yeah, there was an implied "I'll send you what I have if you ask for it." in my answer as well. –  Dec 11 '09 at 21:09
  • At least the RSS will make it easier to import. –  Dec 11 '09 at 21:12
  • Don't forget that typing site:codinghorror.com into the Google also coughs up every blog post he ever made. – George Stocker Dec 11 '09 at 21:13
  • 4
    Too many RSS readers assume images will never die. I know mine does :( –  Dec 11 '09 at 21:14
63

(1) Extract a list of the filenames of all missing images from the HTML backups. You'll be left with something like:

  • stay-puft-marshmallow-man.jpg
  • internet-properties-dialog.png
  • yahoo-homepage-small.png
  • password-show-animated.gif
  • tivo2.jpg
  • michael-abrash-graphics-program

(2) Do a Google Image Search for those filenames. It seems like MANY of them have been, um, "mirrored" by other bloggers and are ripe for the taking because they have the same filename.

(3) You could do this in an automated fashion if it proves successful for, say, 10+ images; a rough sketch of automating step (1) follows below.
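A rough sketch of automating step (1) with Python's standard library; the recovered_html directory name and the assumption that the backups are plain .html files are mine, not the answer's:

# Walk the recovered HTML backups and collect the filename of every <img src=...>.
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

class ImgCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.filenames = set()

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src') or ''
            name = os.path.basename(urlparse(src).path)
            if name:
                self.filenames.add(name)

collector = ImgCollector()
for root, _dirs, files in os.walk('recovered_html'):
    for f in files:
        if f.endswith(('.html', '.htm')):
            with open(os.path.join(root, f), errors='ignore') as fh:
                collector.feed(fh.read())

# One filename per line, ready to paste into Google Image Search for step (2).
print('\n'.join(sorted(collector.filenames)))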

51

By going to Google Image search and typing site:codinghorror.com you can at least find the thumbnailed versions of all of your images. No, it doesn't necessarily help, but it gives you a starting point for retrieving those thousands of images.

Codinghorror images

It looks like Google stores a larger thumbnail in some cases:

Google vs. Bing

Google is on the left, Bing on the right.

George Stocker
  • 561
  • 6
  • 9
  • 2
    yeah, worst case, we'll have to scale up the thumbnails from Google. I hear Bing stores larger thumbnails, though? – Jeff Atwood Dec 11 '09 at 20:59
  • I don't know; I'm not a bing sort of guy. I don't even know if they do Image search like Google does. I'll find out and update said post. – George Stocker Dec 11 '09 at 21:03
  • 19
    I don't know if this is you. But Imageshack seems to have many of your blog images. http://profile.imageshack.us/user/codinghorror – Nick Berardi Dec 11 '09 at 21:04
  • They seem to have what looks like 456 images that are full size. This might be the best bet for recovering everything. Maybe they can even provide you a dump. – Nick Berardi Dec 11 '09 at 21:09
  • 29
    Use the Google thumbnails as a start, then use http://www.tineye.com/ to see if anyone is hosting a copy. – sep332 Dec 11 '09 at 21:48
41

Sorry to hear about the blogs. Not going to lecture. But I did find what appear to be your images on ImageShack. Are they really yours, or has somebody been keeping a copy of them around?

http://profile.imageshack.us/user/codinghorror

They seem to have what looks like 456 images that are full size. This might be the best bet for recovering everything. Maybe they can even provide you a dump.

Nick Berardi
  • 556
  • 4
  • 7
39

Jeff, I have written something for you here

In short what I propose you do is:

  1. Configure the web server to return 304 for every image request. A 304 means the file has not been modified, so the browser will serve it from its own cache if it is present there. (credit: this SuperUser answer)

  2. In every page in the website, add a small script to capture the image data and send it to the server.

  3. Save the image data on the server (a rough sketch of this server-side piece follows below).

  4. Voila!

You can get the scripts from the given link.
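The real scripts are at the link above; purely to illustrate step 3, here is a hypothetical sketch of the server-side receiver written with Flask. The /recover-image route and the filename/data fields are invented for this example, not taken from the linked post:

# Accept base64 image data posted by the client-side capture script and save it.
import base64
import os
from flask import Flask, request

app = Flask(__name__)
SAVE_DIR = 'recovered_images'
os.makedirs(SAVE_DIR, exist_ok=True)

@app.route('/recover-image', methods=['POST'])
def recover_image():
    payload = request.get_json(force=True)
    # Keep only the basename so a hostile path can't escape SAVE_DIR.
    name = os.path.basename(payload['filename'])
    data = base64.b64decode(payload['data'].split(',')[-1])  # drop any data: prefix
    path = os.path.join(SAVE_DIR, name)
    if not os.path.exists(path):          # first contribution wins
        with open(path, 'wb') as fh:
            fh.write(data)
    return 'ok'

if __name__ == '__main__':
    app.run(port=8080)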

29

Try this query on the Wayback Machine:

http://web.archive.org/web/*sa_re_im_/http://codinghorror.com/*

This will get you all the images from codinghorror.com archived by archive.org. It returns 3878 images, some of which are duplicates. It will not be complete, but it's a good start nonetheless.

For the remaining images, you can take the thumbnails from a search engine cache and then do a reverse lookup using them at http://www.tineye.com/. You give it the thumbnail image, and it will give you a preview and a pointer to closely matching images found on the web.
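If you would rather pull the archived images down in bulk than click through that listing, here is a rough sketch that assumes the Wayback Machine's current CDX index API and its id_ flag for fetching a capture's original bytes (neither is described in this answer, and the parameters may change):

# List every capture of codinghorror.com via the CDX index, keep the image
# mimetypes, and download each one's original bytes.
import json
import os
import time
import urllib.parse
import urllib.request

CDX = 'http://web.archive.org/cdx/search/cdx?' + urllib.parse.urlencode({
    'url': 'codinghorror.com/*',
    'output': 'json',
    'collapse': 'urlkey',                 # one capture per distinct URL
})

rows = json.loads(urllib.request.urlopen(CDX).read().decode())
header, captures = rows[0], rows[1:]
idx = {name: i for i, name in enumerate(header)}

os.makedirs('wayback_images', exist_ok=True)
for row in captures:
    if not row[idx['mimetype']].startswith('image/'):
        continue
    ts, original = row[idx['timestamp']], row[idx['original']]
    snapshot = f'http://web.archive.org/web/{ts}id_/{original}'
    name = os.path.basename(urllib.parse.urlparse(original).path) or 'index'
    try:
        urllib.request.urlretrieve(snapshot, os.path.join('wayback_images', name))
    except Exception as exc:              # some captures will simply be gone
        print('skipped', original, exc)
    time.sleep(1)                         # be polite to archive.org

Throttle it and expect gaps; archive.org only has what its crawler happened to fetch.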

27

+1 on the dd recommendation if (1) the raw disk is available somewhere; and (2) the images were simple files. Then you can use a forensic 'data-carving' tool to (for example) pull out all credible byte ranges that appear to be JPEGs/PNGs/GIFs. I've recovered 95%+ of the photos from a wiped iPhone this way.

The open-source tool 'foremost' and its successor 'scalpel' can be used for this:

http://foremost.sourceforge.net/

http://www.digitalforensicssolutions.com/Scalpel/

27

Luckily, future generations will be ok.

Even with only some of this big rock, scientists/linguists figured out a lot.

Rosetta Stone

If a few pictures are missing, leave it to someone to figure out in a couple thousand years.

Hopefully, you're laughing a little. :)

22

You could always try archive.org, as well. Use the wayback machine. I've used this to recover images from my websites.

  • 3
    Doesn't seem to have much of a cache for CodingHorror, at least. I do see images for blog.stackoverflow though. –  Dec 11 '09 at 21:00
  • I rebuilt a website using the Internet Wayback Machine once, but I've tried a few times since and it really doesn't archive very many sites... – djangofan Dec 11 '09 at 21:35
  • Looks like it goes back to 2004 here http://web.archive.org/web/*/http://www.codinghorror.com –  Dec 11 '09 at 23:41
  • Thank goodness it didn’t have a robots.txt file huh? :) – Synetech Dec 12 '09 at 20:06
15

So, absolute worst case, you can't recover a thing. Damn.

Try grabbing the thumbnailed Google versions and putting them through TinEye, the reverse-image search engine. Hopefully it will find any duplicates or rehosts people have made.

15

It is a long shot, but you could consider:

  • Posting the exact list of pictures you are missing
  • Crowd-sourcing the retrieval process through all your readers' internet caches.

For instance, see the Nirsoft Mozilla Cache Viewer:

(screenshot: NirSoft MozillaCacheView; source: nirsoft.net)

It can quickly dig up any "blog.stackoverflow.com" picture one might still have through a simple command line:

MozillaCacheView.exe -folder "C:\Documents and Settings\Administrator\Local Settings\Application Data\Mozilla\Firefox\Profiles\acf2c3u2.default\Cache" 
/copycache "http://blog.stackoverflow.com" "image" /CopyFilesFolder "c:\temp\blogso" /UseWebSiteDirStructure 0

Note: they have the same cache explorer for Chrome.

(screenshot: the equivalent Chrome cache viewer; source: nirsoft.net)

(I must have 15 days' worth of blog.stackoverflow.com pictures in it)

There are similar viewers for Internet Explorer and Opera as well.


Then update the public list to reflect what the readers report finding in their cache.

Glorfindel
  • 385
  • 2
  • 7
  • 12
13

In the past I've used http://www.archive.org/ to pull up cached images. It's kind of hit or miss but it has worked for me.
Also, when trying to recover stock photos that I've used on an old site, www.tineye.com is great when I only have the thumbnails and I need the full size images.

I hope this helps you. Good Luck.

11

This is probably not the easiest or most foolproof solution, but services like Evernote typically save both the text and images when pages are stored inside the application - maybe some helpful readers who saved your articles could send the images back to you?

11

I've had great experiences with archive.org. Even if you aren't able to extract all of your blog posts from the site, they keep periodical snapshots:

(screenshot: archive.org's periodical snapshots of the site)

This way you can check out each page and see the blog posts you made. With the names of all the posts you can easily find them in Google's cache if archive.org doesn't have it. Archive tries to keep images, Google cache will have images, and I haven't emptied my cache recently so I can help you with the more recent blog posts :)

Glorfindel
  • 385
  • 2
  • 7
  • 12
8

Have you tried your own local browser cache? Pretty good chance some of the more recent stuff is still there. https://lifehacker.com/resurrect-images-from-my-web-browser-cache-33300382

(Or you could compile a list of all missing images and everyone could check their cache to see if we can fill in the blanks)

Mike Ciffone
  • 6,360
  • 5
  • 37
8

A suggestion for the future: I use Windows Live Writer for blogging and it saves local copies of posts on my machine, in addition to publishing them out to the blog.

Matt Sherman
  • 453
  • 2
  • 7
7

I suggest the combination of archive.org and a request anonymizer like Tor. I suggest using an anonymizer because each of your requests will then have a random IP and location, so you can avoid getting banned by archive.org (the way Google banned you) for an unusually high number of requests.

Good Luck, there are a lot of gems in that blog.

  • Given that Jeff wants to make a donation to archive.org, abusing an anonymizer might not be absolutely unacceptable. But I still want to give you a kick for that. :-| –  Dec 15 '09 at 07:14
7

About five years ago, an early incarnation of an external hard drive on which I was storing all my digital photos failed badly. I made an image of the hard drive using dd and wrote a rudimentary tool to recover anything that looked like a JPEG image. Got most of my photos out of that.

So, the question is, can you get a copy of the virtual machine disk image which held the images?
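For what it's worth, a naive version of such a carving tool is only a few lines of Python (this is a sketch of the general technique, not the original script): scan the raw dd image for JPEG start-of-image and end-of-image markers and dump each contiguous range. Fragmented files won't survive this, but contiguous ones usually do.

# Naive JPEG carving: dump every SOI..EOI byte range found in a raw disk image.
import sys

SOI, EOI = b'\xff\xd8\xff', b'\xff\xd9'

def carve(path):
    data = open(path, 'rb').read()        # fine for small images; use mmap for huge ones
    count, pos = 0, 0
    while True:
        start = data.find(SOI, pos)
        if start == -1:
            break
        end = data.find(EOI, start)
        if end == -1:
            break
        with open(f'carved_{count:05d}.jpg', 'wb') as out:
            out.write(data[start:end + 2])
        count += 1
        pos = end + 2
    print(f'carved {count} candidate JPEGs')

if __name__ == '__main__':
    carve(sys.argv[1])                    # e.g. python carve.py disk.img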

Sinan Ünür
  • 161
  • 1
  • 4
7

The web archive caches the images. It's under heavy load right now, but you should be OK up to 2008 or so.

http://web.archive.org/web/20080618014552rn%5F2/www.codinghorror.com/blog/

6

The wayback machine will have some. Google cache and similar caches will have some.

One of the most effective things you'll be able to do is to email the original posters, asking for help.

I do actually have some infrastructural recommendations for after this is all cleaned up. The fundamental problem isn't actually backups; it's lack of site replication and lack of auditing. If you email me at the address in the private email field later, when you're back on your feet, I'd love to discuss the matter with you.

6

If your images were stored on an external service such as Flickr or a CDN (as mentioned in one of your podcasts), you may still have the image resources there.

Some of the images could be found by searching Google Images and clicking "Find similar images"; maybe there are copies on other sites.

splattne
  • 225
  • 1
  • 7
5

archive.org sometimes hides images. Get each URL manually (or write a short script) and query archive.org for each one like this:

string.Format("GET /*/{0}", nextUri)

Of course that's going to be quite a pain to search through.

I might have some in my browser cache. If I do I'll host them somewhere.

4

If you're hoping to try to scrape users' caches, you may want to set the server to respond 304 Not Modified to all conditional-GET ('If-Modified-Since' or 'If-None-Match') requests, which browsers use to revalidate their cached material.

If your initial caching headers on static content like images were pretty liberal -- allowing things to be cached for days or months -- you could keep getting revalidation requests for a while. Set a cookie on those requests, and appeal to those users to run a script against their cache to extract the images they still have.

Beware, though: the moment you start putting up any textual content with inline resources that aren't yet present, you could be wiping out those cached versions as revalidators hit 404s.
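A small sketch of that refinement (standard-library Python; the cookie name is made up for illustration): answer 304 only when the browser actually sent a conditional request, and tag those visitors so you can appeal to them afterwards.

# Serve 304 only to revalidation requests; anything else gets a 404 for now.
from http.server import BaseHTTPRequestHandler, HTTPServer

class RevalidationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        conditional = ('If-Modified-Since' in self.headers or
                       'If-None-Match' in self.headers)
        if conditional:
            self.send_response(304)
            # Mark this visitor as someone whose cache is worth asking about.
            self.send_header('Set-Cookie', 'has_cached_images=1; Path=/')
            self.end_headers()
        else:
            # Unconditional request: nothing cached on their end, and a 304
            # here would just render as a broken image.
            self.send_response(404)
            self.end_headers()

if __name__ == '__main__':
    HTTPServer(('', 8080), RevalidationHandler).serve_forever()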

4

You could use TinEye to find duplicates of your images by searching with the thumbnails from the Google cache. This will only help with images you've taken from other sites, though.

4

I've managed to recover these files from my Safari cache on Snow Leopard:

bad-code-offset-back.jpg
bad-code-offset-front.jpg
code-whitespace-invisible.png
code-whitespace-visible.png
coding-horror-official-logo-small.png
coding-horror-text.png
codinghorror-search-logo1.png
crucial-ssd-128gb-ct128m225.jpg
google-microformat-results-forum.png
google-microformat-results-review.png
kraken-cthulhu.jpg
mail.png
powered-by-crystaltech-web-hosting.png
ssd-vs-magnetic-graph.png

If anyone else wants to try, I've written a Python script to extract them to ~/codinghorror/filename, which I've put online here.

I hope this helps.

4

At the risk of pointing out the obvious, try mining your own computer's backups for the images. I know my backup strategy is haphazard enough that I have multiple copies of a lot of files hanging around on external drives, burned discs, and in zip/tar files. Good luck!

lo_fye
  • 141
  • 2
3

Did you get a chance to see if your hosting provider has any backups at all (even older versions)?

  • it does not look good... their backup program was unable to back up the virtual machine hard drive files, so there are no backups. – Jeff Atwood Dec 11 '09 at 21:03
2

How much is this data worth to you? If it's worth a significant sum (thousands of dollars) then consider asking your hosting provider for the hard drive used to store the data for your website (in the case of data loss due to hardware failure). You can then take the drive to ontrack or some other data recovery service to see what you can get off the drive. This might be tricky to negotiate due to the possibility of other people's unrecovered data on the drive as well, but if you really care about it you can probably work it out.

2

Very sorry to hear this; I am very annoyed for you, and at the timing: a couple of weeks ago I wanted an offline copy of a few of your posts and ran HTTrack on your entire site, but I had to go out and stopped it.

If the host is half decent (and I am guessing you are a good customer), I would ask them to either send you the hard drives (they really should be using RAID) or do some recovery themselves.

While this may not be a fast process, I did this with one host for a client and was able to recover entire databases intact (basically, the host attempted an upgrade of the control panel they were using and messed it up, but nothing was overwritten).

Whatever happens - Good luck from all your fans on the SO sites!

wilhil
  • 321
  • 1
  • 6
2

As for your images: ask Sun Microsystems to give them back to you; they have made "an entire internet backup"... in a shipping container.

"The Internet Archive offers long-term digital preservation to the ephemeral Internet," said Brewster Kahle, founder, the Internet Archive organization. "As more of the world's most valuable information moves online and data grows exponentially, the Internet Archive will serve as a living history to ensure future generations can access and continue to preserve these important documents over time."

Founded in 1996 by Brewster Kahle, the Internet Archive is a non-profit organization that has built a library of Internet sites and other cultural artifacts in digital form that include moving images, live audio, audio and text formats. The Archive offers free access to researchers, historians, scholars, and the general public; and also features "The Wayback Machine" -- a digital time capsule that allows users to see archived versions of Web pages across time. At the end of 2008, the Internet Archive housed over three petabytes of information, which is roughly equivalent to about 150 times the information contained in the Library of Congress. Going forward, the Archive is expected to grow at approximately 100 terabytes a month.

(photo of the shipping-container data center; source: gawker.com)

more here and here

Glorfindel
  • 385
  • 2
  • 7
  • 12
2

This is my Python script; it will scrape the Google cache and download the content of your website, and it runs without trouble through 503, 504, and 404 errors (Google blocks IPs that send many requests): https://gist.github.com/3787790

PhamThang
  • 1
  • 2
1

Have you tried doing a Google Image search, with the syntax site:codinghorror.com?

1

I can read old posts relating to your horror in my Google Reader account. Maybe that helps.

1

I was going to suggest Warrick because it was written by one of my CS professors. I'm sorry to hear that you had a bad experience with it. Maybe you can at least send him a note with some bug reports.

1

I do have full text entries for Codinghorror in my RSS reader back to June 30, 2009, if that will help at all. E-mail me at jake (at) orty (dot) com. I'll see if I can get them dumped out of Newsgator Inbox in any sort of usable format. I might have them back further (I'll need to dig up my archived PST files). Can't help with images, but it's a start (shrug).

(Nevermind: Looks like you have plenty more options above than I could provide. Sorry about the noise, feel free to flag to delete.)

0

Maybe you could crowd-source it asking us to look in our browser caches. I generally read Coding Horror via Google Reader, so my Firefox cache doesn't seem to have anything from codinghorror.com in it.

Others can look in their own Firefox cache by browsing to about:cache?device=disk.

0

Just another shot at retrieving the content.

I was subscribed using FeedBurner, so I might have some archives in my mail! You can ask others, who might be able to forward you those posts.

0

This happened to me once and I had to rebuild my WordPress blog. I was able to recover all of the text from search engine caches just like you are doing. However, when you recreate the posts you can really screw up your inbound links if you don't give them the original permalinks. Images weren't much of a problem for me because I tend to store them locally.

0

Just automate grabbing the individual Google page cache files.

Here's a Ruby script I used in the past.

http://pastie.org/739757

My script doesn't appear to have any sleeps. I didn't get IP banned for some reason, but I'd recommend adding one.
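For anyone adapting the same idea, here is a generic Python sketch of the "add a sleep" advice (it is not the Ruby script behind the pastie link): read the cache URLs from a file, fetch each one with a fixed delay, and back off when the search engine starts throttling.

# Throttled fetcher: one cache URL per line in urls.txt, saved as numbered files.
import time
import urllib.error
import urllib.request

DELAY, BACKOFF = 10, 300                  # seconds between requests / after a 503

with open('urls.txt') as fh:
    urls = [line.strip() for line in fh if line.strip()]

for i, url in enumerate(urls):
    try:
        html = urllib.request.urlopen(url, timeout=30).read()
        with open(f'cached_{i:04d}.html', 'wb') as out:
            out.write(html)
    except urllib.error.HTTPError as err:
        if err.code in (429, 503):        # throttled: wait much longer, then move on
            print('throttled, sleeping', BACKOFF, 'seconds')
            time.sleep(BACKOFF)
        else:
            print('skipped', url, err.code)
    time.sleep(DELAY)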

0

You could try to get the broken HDD from the hosting company and give it to an HDD recovery service; I think you could find one. At the very least, the backup images could probably be restored from there. Also, this disk could be part of some mirror/RAID system, in which case there may be a mirror copy somewhere.

0

Most solutions use a combination of blog reader assistance, archive.org, and Google caching. Consider turning this data crisis into a blog recovery tool specification. Several features listed in the question and answers look ready to automate, given the knowledge an owner has of their own site.

  1. Restore pages from archive.org, Google cache, or local caches using a web spider that avoids bannable techniques
  2. Check local caches, Google Image Search, and ImageShack for matching file names
  3. After initial recovery, make a list of the site's missing images and other URLs (e.g., return a 304 code for images)
  4. Add upload or contribution form for readers who have cached versions
  5. Site owner previews and validates contributions
  6. Resubmit recovered pages to search engines, if desired

Owners that derive a lot of value from quick recovery might offer a bounty for missing files or other outside assistance.