Earlier this week we had what could have been a catastrophic infrastructure failure - our ElasticSearch cluster went offline. Given that this powers the core functional element of our entire business, this was not a good thing to wake up to.
As I started investigating what had gone wrong, I could practically hear the sniggers from the cloud service naysayers I have met over the years (and I do still meet them) enjoying a moment of schadenfreude as I found myself (and my company) in the hands of a third party, beyond my control. Because of course we don't host ElasticSearch ourselves - we outsourced that to someone else - as an add-on to our Heroku application(s). And given that we weren't paying for the premium service, I couldn't even raise them on email - as they were still in bed.
This is precisely the situation that those who still run and manage their own infrastructure point to when pushed. It is at times like these, they say, that you need to have total control over your own platform. I totally disagree with that, of course - it's precisely at times like these that I don't want to be personally responsible for fixing the problem - I want someone who knows what they are doing to be on the hook. I would much rather have Amazon's engineers fixing my infrastructure than anyone I could reasonably afford.
It was whilst pondering this that I had a brainwave. If setting up an add-on was that simple, why not just set up another, with another provider? The connection from our application to the search cluster is a single URL config setting. Our index is small enough for us to be able to rebuild and push it over the wire within the timeout window, so I tried the following:
$ heroku addons:add searchbox
$ heroku config:set HAYSTACK_URL=`heroku config:get SEARCHBOX_URL`
$ heroku run ./manage.py rebuild_index
And with that, the site was back online. We had essentially hot-swapped the core component behind our site in under 60 seconds.
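If you wanted to keep this swap on hand for next time, the three commands could be wrapped in a small fail-fast function. This is only a sketch of how that might look, not what I actually ran: the function name `swap_search_provider` and its two parameters are hypothetical, and the defaults simply mirror the add-on name and config variable used above. The one thing it adds is a guard, so that HAYSTACK_URL is never overwritten with an empty value if the new add-on has not published its URL yet.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the three manual commands in one fail-fast function.
# swap_search_provider and its parameters are hypothetical names; the
# defaults match the add-on and config variable used in the commands above.

swap_search_provider() {
  local addon="${1:-searchbox}"          # add-on to provision
  local addon_var="${2:-SEARCHBOX_URL}"  # config var that add-on exposes
  local new_url

  # Provision the replacement add-on; stop if provisioning fails.
  heroku addons:add "$addon" || return 1

  # Read the URL published by the new add-on; refuse to continue if empty.
  new_url="$(heroku config:get "$addon_var")" || return 1
  if [ -z "$new_url" ]; then
    echo "No value found for $addon_var; leaving HAYSTACK_URL untouched" >&2
    return 1
  fi

  # Point the application at the new cluster, then rebuild the index there.
  heroku config:set "HAYSTACK_URL=$new_url" || return 1
  heroku run ./manage.py rebuild_index
}
```

Calling `swap_search_provider` with no arguments would repeat the steps above; passing a different add-on name and config variable would, in principle, do the same for another provider.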