Follow-up on today's Heroku incident
In the past week, two incidents on Heroku's side (this one on July 23 and this one today) caused errors on our side. Cron To Go failed to execute jobs for over 2 hours today and for almost an hour and a half on July 23, and the dashboard wasn't available.
The issues on Heroku's side caused Heroku's authentication APIs to time out. This affected not only Cron To Go but also all of Heroku's UI (including Heroku's support platform, which requires a sign-in) and other add-ons.
We've started our own investigation into what we can improve on our end so that, the next time this happens, our service level doesn't degrade this badly, if at all possible.
First and foremost, we want you to know that we report all issues as soon as we're aware of them, and we urge you to subscribe to our status page as soon as you can. Transparency is one of our core values, but it's up to you to opt in to status updates. You may notice that we actually opened incidents before Heroku did, when we noticed high job failure rates, but we were under the impression that we would simply have to wait for Heroku to find the culprit.
You may also want to follow us on Twitter, where our status updates are now posted as well. We'll also make sure the status itself is visible on our websites, not just as a link, in case the app is not available.
Today's incident was longer and caused even more job failures. We contacted Heroku and hope to hear from them soon so we can better understand what was going on. Our engineers dug deeper into our code and found that we refresh the authentication tokens with every call we make, which made us more vulnerable to this kind of error. Since Heroku's tokens last 8 hours, we're already working on changing this so that, in the future, your jobs will be less likely to be affected. We'll make sure to thoroughly test this change and deploy it as soon as we can.
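To illustrate the general idea behind this change, here is a minimal sketch of token caching in Python. It is not our production code: the `CachedToken` class, the `fetch` callback, and the refresh margin are all hypothetical names and values chosen for the example, and the only figure taken from the above is the 8-hour token lifetime. The sketch reuses a token until it is close to expiry, and keeps using a still-valid token if a refresh attempt fails.

```python
import time
from typing import Callable, Optional


class CachedToken:
    """Reuse a token until it nears expiry instead of refreshing on every call.

    `fetch` is a hypothetical callback that requests a fresh token from the
    authentication API; it is an assumption made for this sketch.
    """

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 8 * 60 * 60,      # token lifetime (8 hours)
        refresh_margin_seconds: float = 300,   # assumed: refresh 5 minutes early
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._margin = refresh_margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Fast path: the cached token is still comfortably within its lifetime.
        if self._token is not None and now < self._expires_at - self._margin:
            return self._token
        try:
            token = self._fetch()
        except Exception:
            # The authentication API is failing. If the cached token has not
            # actually expired yet, keep using it rather than failing the job.
            if self._token is not None and now < self._expires_at:
                return self._token
            raise
        self._token = token
        self._expires_at = now + self._ttl
        return token
```

With this approach, a short authentication outage only affects calls made when no unexpired token is cached, rather than every call.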
We apologize for the inconvenience this outage has caused. We will continue to learn and improve our products to further increase the reliability of both our service and yours.
Crazy Ant Labs team
Edit - 2020-07-28 - After thorough testing, we've deployed the changes to the way we use Heroku's authentication APIs. We've been monitoring them for quite some time, and they work well.
Edit - 2020-07-30 - There have been a few more (smaller-scale) authentication issues on Heroku's side, and we're pleased to see that no jobs have been affected.