Studio partial availability (high error rates & latency)
Incident Report for Apollo Graph, Inc.
Resolved
After some time monitoring, we are confident that we are back online and humming along happily. Thank you for your patience today (and for the last few days!)

We are hard at work over at Apollo HQ on resiliency efforts to ensure uptime and reliability for all your GraphQL tooling needs. We know that these outages have been disruptive to your workflows and frustrating to your day-to-day, and we are doing all we can to apply what we have learned from the recent incidents and prioritize those resiliency efforts.

Consider this incident officially resolved, and please write in to our support portal if you notice anything awry.
Posted Sep 22, 2021 - 21:33 UTC
Monitoring
Our latency has decreased back to palatable levels after the successful decommissioning of the errant node. We are continuing to monitor the situation and are hopeful that this bodes well for a full return to service availability. We will update this page with any changes that we observe. At this time, all features should be fully available, with the exception of experimental performance alerts.
Posted Sep 22, 2021 - 20:47 UTC
Update
Currently, the overall impact is largely limited to heightened latency of our CLI and all API requests that involve querying for historic usage information (e.g. clients page, fields page). There are a number of features that are experiencing higher error rates as well, though at this time, you should be able to load Apollo Studio and execute rover and apollo CLI requests and have them complete (eventually).

We have rolled out a variety of changes and performed several experiments to narrow down the root cause. Our working theory on the remaining high-latency issues is that a particular node within our timeseries database is in the critical path for the majority of requests and is acting as a bottleneck. We are in the process of decommissioning that node and spreading its responsibilities, and we hope that this change will bring latency back to palatable levels.
Posted Sep 22, 2021 - 20:02 UTC
Investigating
We are continuing the investigation. We appreciate your patience during this time and apologize for any inconvenience.
Posted Sep 22, 2021 - 19:15 UTC
Identified
We are currently rolling out an update to our query layer that will mitigate many of the availability problems that users are seeing when using Apollo Studio. As of now, the effects of the incident have been contained to the web interface's uptime as well as CLI responses, as our internal APIs are exhibiting high error rates and long latencies. Our metrics ingestion and processing have not been affected during this time.
Posted Sep 22, 2021 - 18:11 UTC
Update
Performance alerts are currently disabled while we continue to investigate the situation.
Posted Sep 22, 2021 - 16:40 UTC
Update
We are continuing efforts to investigate degraded performance for the performance alerts feature, the operations view, and the checks feature. We appreciate your patience at this time.
Posted Sep 22, 2021 - 16:31 UTC
Update
We are continuing to investigate this issue.
Posted Sep 22, 2021 - 16:20 UTC
Investigating
We are aware that the experimental performance alerts feature is exhibiting partial uptime, and users can expect some notifications to be delayed or dropped. We are continuing our investigation.
Posted Sep 22, 2021 - 16:08 UTC
This incident affected: GraphQL API, Notifications, and Studio UI.