All services appear to be fully operational again. We have also put steps in place to prevent this from happening again and to make recovery from a similar event much faster.
Posted May 03, 2021 - 10:05 PDT
We have managed to scale our impacted services in parallel rather than in sequence. As a result, we expect to be fully back online in around 30 minutes (allowing a grace period for K8s to scale, etc.).
As always, we appreciate your patience. We will post another update in around 15 minutes.
Posted May 03, 2021 - 09:49 PDT
We have identified a hard upper bound of around six and a half hours for all data to be back online. We are working to reduce this time, and have established the following:
- There is no data loss.
- Each pod that was accidentally shut down will reclaim all of its data in around 7 minutes.
- If we can figure out how to sidestep our scaling, we should be able to get back up within 10 minutes (this is what we are currently investigating).
Posted May 03, 2021 - 09:35 PDT
Due to a bad migration, we are experiencing extreme delays in serving historical data. This has no impact on ingestion, but checks and many dashboards will be severely impacted.
Please note: No data has been lost.
We will report back soon with an estimated time to recovery.
Posted May 03, 2021 - 08:43 PDT
This incident affected: Metrics Ingestion and GraphQL API.