An Epic Fail In Webmastering

Published on June 10, 2011 at midnight by XC

This post is going to be riddled with fail, most of it mine. The short version: my site has been down for over two weeks, and it took me a week and a half to even notice. That, of course, describes fail number one and fail number two, but there’s plenty more fail where that came from!

It all began when I started shifting around some of my own accounts in order to free up some IP addresses. Some of the domains had expired, some did not need their own IP, and then there was this site, echoreply.us, which had its own IP for reasons I could not figure out. So I moved it to the main system IP, went to bed and forgot all about it.

What I completely forgot was that I had a 4 1/2-year-old trial SSL certificate installed, and the stupid dolt that lives on my server (we’ll just say its name rhymes with zeb ghost scavenger) happily let me move the site to the main shared IP, even though the server name vhost was already using it. That’s right, anyone who visited ‘/’ on this site for nearly two weeks got redirected to the Apache success page.
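A status-only uptime check would likely not have caught this, since a default Apache page is still a perfectly happy response. Here is a minimal sketch of a check that looks at the page content instead. It assumes there is some marker string that only the real front page contains; the function names and the marker are made up for illustration, not anything that was actually running on this server.

```python
import urllib.request


def looks_like_my_site(body: str, marker: str) -> bool:
    """True if the page body contains text that only the real site would serve."""
    return marker in body


def check(url: str, marker: str, timeout: float = 10.0) -> bool:
    """Fetch url and confirm the marker is present.

    Any network or HTTP error counts as "down" (URLError and HTTPError
    are both subclasses of OSError).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return False
    return looks_like_my_site(body, marker)
```

Run from cron on some other machine, something like `check("http://echoreply.us/", "<a string from the real front page>")` returning `False` would be the cue to send an e-mail. The important design choice is comparing content rather than trusting a successful response.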

To fix it, I just removed the account and was ready to restore from a day-old backup, when I realized... oh crap, the backups are corrupt. I think we’re somewhere near fail number 11 at this point; I completely lost count.

Thankfully, I store most parts of the site under version control. I was able to retrieve it from my build bot installation and restore the database from a week-old copy that I received via e-mail. A little finesse got everything back without disrupting anyone else (too much). Now it was time to figure out why backups bailed for this, and only this, account.

The last fail? I forgot I had rsync’ed a whole slew of Xen templates to this domain as a go-between from home for some stuff I was doing remotely on a bunch of Xen nodes. All 12 GB of them. A painfully slow rm -rf later, I’m back down to a manageable 400 MB.
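For anyone hunting a similar surprise, here is a rough sketch of how to see which directory under an account is eating the space. It is just an illustration (the function name is invented, and plain `du` does the same job), and it counts apparent file sizes rather than actual disk blocks.

```python
import os


def dir_sizes(root: str) -> list[tuple[int, str]]:
    """Return (total_bytes, path) for each immediate subdirectory of root, largest first."""
    results = []
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue  # only look at real subdirectories
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    # lstat so a symlink counts as itself, not its target
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        results.append((total, entry.path))
    return sorted(results, reverse=True)
```

Pointed at the account's home directory, the first entry in the returned list is the biggest offender, which in a case like this one would presumably be wherever the templates landed.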

This series of unfortunate fails has been brought to you by: lack of sleep for 4 1/2 weeks. Human beings have been, and always will be, a single point of failure.