Recover sites from Google Cache

Date June 10, 2009

Over the weekend, VaServ and its subsidiary companies were hacked, resulting in many machines losing data.  Lots of clients were left with virtual servers that had been completely wiped.  Many of those folks were also lacking working backups… myself included, for this particular machine.

Wondering how to proceed, I began searching Google and other engines for cached versions of my site.  I was in luck!  Almost all of my content had been indexed and cached by Google!  The only question was how to efficiently import that data back into my WordPress installation and get up and running again.

Enter Warrick.  Warrick is a utility written in Perl that scrapes cached content from Google, Yahoo, Live Search, and the Internet Archive.  Whatever it finds is downloaded and saved as static HTML.

Sadly, none of the images on my site were present in any of the available caches.  It had only been up a few months and was indexed well, but the images just hadn’t been saved anywhere else.  Luckily, I still had local copies of the theme and images used in posts; I was able to upload that content again.

Also note that using Yahoo as a cache source appears to be broken right now.  Warrick was receiving nothing but 500 errors when trying to retrieve content via Yahoo’s cache.  Specify which cache sources to use like this:

warrick.pl -r -wr g,ia http://example.com/

My technique was to set up a site with just the static content that Warrick was able to retrieve.  Then, I recreated the WordPress installation in another location, uploaded the media, and began copying and pasting content back into WordPress.  When I was satisfied, I replaced the static site with my newly recreated WordPress installation.
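If copying and pasting from a browser gets tedious, the recovered pages can be boiled down to plain text first.  What follows is only a rough sketch in Python, not part of Warrick and not what I actually ran: it assumes the recovered files end in .html and are UTF-8 encoded, and the names PageText, extract, and extract_all are made up for illustration.  It pulls the page title and visible text out of each file using only the standard library.

```python
from html.parser import HTMLParser
from pathlib import Path


class PageText(HTMLParser):
    """Collect the <title> and the visible text of one HTML page."""

    SKIP = {"script", "style"}  # content of these tags is not visible text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html):
    """Return (title, text) for one HTML document given as a string."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), "\n".join(parser.chunks)


def extract_all(root):
    """Yield (path, title, text) for every .html file under root.

    Assumption: recovered pages use the .html extension and are UTF-8;
    undecodable bytes are replaced rather than raising an error.
    """
    for path in sorted(Path(root).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        title, text = extract(html)
        yield path, title, text
```

Looping over extract_all("recovered-site") (a hypothetical directory name) and printing each title and text gives something to paste into the WordPress editor.  The text will still include menus, sidebars, and footers from the theme, so expect to trim each post by hand.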

The whole process took only about an hour.  Luckily, I had the theme and images saved elsewhere; otherwise, I would have been out of luck.

If you’ve lost your data to a server crash or a hack, all is not lost!  Good luck!

See Warrick’s download and information page for all the available options.

2 Responses to “Recover sites from Google Cache”

  1. Colin said:

    Surely a better tip o’ the day would be “KEEP A F***ING BACKUP!”

    Then, when your site is hacked, you just upload it again and don’t bother with these scripts.

  2. canon5dshooter said:

    I can empathize with you. I had a blog that I was working on constantly and I kept putting off the crucial backup. I was then hacked from a machine in China and lost my wordpress installation. I am still rebuilding the blog as I was very busy when the site was brought down. Had Google not cached my content I would have been SOL.

    I now keep a backup but I am thankful for Google cache!
