Studio partial availability (high error rates & latency)

Resolved·Partial outage

After some time monitoring, we are confident that we are back online and humming along happily. Thank you for your patience today (and for the last few days!)

We are hard at work at Apollo HQ on resiliency efforts to ensure uptime and reliability for all your GraphQL tooling needs. We know that these outages have been disruptive to your workflows and frustrating to your day-to-day, and we are doing all we can to apply what we have learned from the recent incidents to prioritize those efforts.

Consider this incident officially resolved, and please write in to our support portal if you notice anything awry.

Wed, Sep 22, 2021, 09:33 PM

Affected components

Sep 22, 2021, 04:08 PM to 08:47 PM

GraphQL API

Notifications

Studio UI

Updates

Resolved

After some time monitoring, we are confident that we are back online and humming along happily. Thank you for your patience today (and for the last few days!)

Consider this incident officially resolved, and please write in to our support portal if you notice anything awry.

Wed, Sep 22, 2021, 09:33 PM

Monitoring

Our latency has decreased back to palatable levels after the successful decommissioning of the errant node. We are continuing to monitor the situation and are hopeful that this bodes well for a full return to service availability. We will update this page with any changes that we observe. At this time, all features should be fully available, with the exception of experimental performance alerts.

Wed, Sep 22, 2021, 08:47 PM (46 minutes earlier)

Investigating

Currently, the overall impact is largely limited to heightened latency for our CLI and for all API requests that query historical usage information (e.g. clients page, fields page). A number of features are experiencing higher error rates as well, though at this time, you should be able to load Apollo Studio and execute rover and apollo CLI requests and have them complete (eventually).

We have rolled out a variety of changes and performed several experiments to narrow down the root cause. Our working theory on the remaining high-latency issues is that a particular node within our timeseries database is in the critical path for the majority of requests and is acting as a bottleneck. We are in the process of decommissioning that node and spreading its responsibilities, and we hope that this change will bring latency back to palatable levels.

Wed, Sep 22, 2021, 08:02 PM (44 minutes earlier)

Investigating

We are continuing the investigation. We appreciate your patience during this time and apologize for any inconvenience.

Wed, Sep 22, 2021, 07:15 PM (47 minutes earlier)

Identified

We are currently rolling out an update to our query layer that will mitigate many of the availability problems that users are seeing when using Apollo Studio. As of now, the effects of the incident have been contained to the web interface's uptime as well as CLI responses, as our internal APIs are exhibiting high error rates and long latencies. Our metrics ingestion and processing have not been affected during this time.

Wed, Sep 22, 2021, 06:11 PM (1 hour earlier)

Investigating

Performance alerts are currently disabled while we continue to investigate the situation.

Wed, Sep 22, 2021, 04:40 PM (1 hour earlier)

Investigating

We are continuing efforts to investigate degraded performance for the performance alerts feature, the operations view, and the checks feature. We appreciate your patience at this time.

Wed, Sep 22, 2021, 04:31 PM

Investigating

We are continuing to investigate this issue.

Wed, Sep 22, 2021, 04:20 PM (11 minutes earlier)

Investigating

We are aware that the experimental performance alerts feature is exhibiting partial uptime, and users can expect some notifications to be delayed or dropped. We are continuing our investigation.

Wed, Sep 22, 2021, 04:08 PM (11 minutes earlier)

Apollo Graph, Inc.