Our Django app runs on Heroku, and we lean on pip for our package management. One of pip's install options is the '--editable' flag, which allows you to install a package directly from a source repo.
Each time we deploy to Heroku, the platform runs the following command:
pip install -r requirements.txt
This command installs everything listed in the requirements file, which can include packages loaded directly from a git repo. This is a snippet from our current requirements.txt:
(I've redacted the name of the library that I'm about to talk about, as I don't want to single them out specifically.)
We deployed a very small bug fix yesterday, which should have had no material effect whatsoever on the site. The tests all passed, the site ran fine locally, and the deployment went through without error. And then I logged into the site. Or tried to. The site just hung, and eventually (after 30s) Heroku's routing infrastructure cut the connection and logged an 'H12' timeout error. The error appeared at first to be inconsistent (some pages worked fine, others did not), but once a bad page was requested, the entire site would hang, for all users, until restarted. This was in every sense a critical error.
Fortunately, Heroku has a rollback command that simply repoints to the previous application 'slug'. That worked fine, so the site was only offline for a few minutes (and on a weekend, when traffic is very low).
It took about 4 hours to track the bug down. There was nothing in the logs at all to indicate what the problem might be, or even where it might appear. After carpet-bombing the application with debug statements, I eventually traced it to a template tag in an included library (hence every page that relied on that tag caused the site to crash).
The library didn't log anything, which made it almost impossible to work out what the issue might be. When I looked at the source on GitHub, however, I noticed that a commit had been pushed yesterday that included some threading code. I still have no idea what the bug is (I've raised an issue with the repo owner), but I do know that our problem was caused by referencing the repo version in our requirements file: when I deployed a new version of our app, the buggy version of the library was installed along with it.
The fix is simple: remove the --editable reference from our requirements file, and pin the library to a version that we know is good.
The lessons learned from this:
- Do not use --editable package references in production deployments
- Always have a local development environment that mirrors production
- If we'd been using closed-source we'd have been f*ked
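The first lesson above is easy to forget, so it may be worth checking for it automatically before each deploy. What follows is only a rough sketch, not something we actually run: the function name find_unpinned, the sample package names and versions, and the example.com URL are all made up for illustration. It flags any requirements line that is editable or that is not pinned to an exact version with '=='.

```python
import re

# Matches an exact pin such as "Django==1.4.3" (optionally with extras).
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;]+")


def find_unpinned(requirements_text):
    """Return (line_number, line) pairs that are editable or lack an '==' pin."""
    problems = []
    for number, raw in enumerate(requirements_text.splitlines(), start=1):
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#"):
            continue  # blank line or full-line comment
        if line.startswith(("-e ", "--editable")):
            problems.append((number, line))  # editable reference
        elif line.startswith("-"):
            continue  # some other pip option (e.g. -r), not a package
        elif not PINNED.match(line):
            problems.append((number, line))  # no exact version given
    return problems


if __name__ == "__main__":
    # Hypothetical requirements content, for illustration only.
    sample = """\
Django==1.4.3
-e git+https://example.com/someone/some-library.git#egg=some-library
requests
"""
    for number, line in find_unpinned(sample):
        print(f"line {number}: {line}")
```

Run against the real requirements.txt as part of the test suite, a check like this would fail the build whenever an editable or unpinned reference creeps back in.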
(PS As a personal favour: if you're releasing libraries, please add copious amounts of logging statements, the more the merrier. You can always turn verbose logging off, but you can't turn it on if it doesn't exist.)
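To make that request concrete, here is roughly what I have in mind, using Python's standard logging module. This is a sketch under assumptions: the logger name some_library.templatetags and the function render_widget are invented stand-ins, not taken from the library in question. The library logs at debug level and attaches a NullHandler, so it stays silent until the application asks for more.

```python
import logging

# Library side: a module-level logger with a NullHandler, so the library
# emits nothing unless the application configures logging itself.
logger = logging.getLogger("some_library.templatetags")  # hypothetical name
logger.addHandler(logging.NullHandler())


def render_widget(name):
    """Hypothetical stand-in for the work a template tag might do."""
    logger.debug("render_widget called with name=%r", name)
    result = f"<span>{name}</span>"
    logger.debug("render_widget finished for name=%r", name)
    return result


if __name__ == "__main__":
    # Application side: keep everything else quiet, but turn on verbose
    # logging for this one library while hunting a bug.
    logging.basicConfig(level=logging.WARNING)
    logging.getLogger("some_library").setLevel(logging.DEBUG)
    print(render_widget("hello"))
```

With statements like these in place, switching one logger to DEBUG would have shown which call never returned, instead of four hours of adding print statements by hand.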