Metrics ingestion outage

Incident Report for Apollo Graph, Inc.

Resolved

UPDATE (11/3/2022): Because of how traces and stats are processed, servers that were reporting usage to Apollo during this incident may now be in a state where reporting raw traces to Apollo Studio is disabled. If you are not seeing traces in Apollo Studio and have traces enabled for a percentage of requests, we recommend restarting any such servers to regain trace-reporting functionality. Reporting of aggregate statistics is unaffected.

---

As of the most recent update, metrics ingestion has returned to a stable state and new metrics are being processed.

Overall, the result of this incident is that between 9:30am and 1pm Pacific, metrics could not be collected and have been permanently lost. Users will see a gap in data for this period in graphs within Studio as well as in any other integrated analytics tools (e.g. Datadog). Further, depending on their configuration, some users may have experienced issues with buffering pending metrics reports during this outage.
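For anyone annotating or filtering the gap in a tool that stores timestamps in UTC, the window above can be converted as in the following minimal sketch. It assumes the window falls on Oct 29, 2022 (the date of the updates in this report) and that Pacific time was on daylight time (UTC-7) that day; the names `PDT`, `gap_start`, `gap_end`, and `in_gap` are illustrative only, and the bounds are the approximate times stated above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset (Pacific Daylight Time) for Oct 29, 2022.
PDT = timezone(timedelta(hours=-7), "PDT")

# Approximate bounds of the data gap, as stated in this report.
gap_start = datetime(2022, 10, 29, 9, 30, tzinfo=PDT)
gap_end = datetime(2022, 10, 29, 13, 0, tzinfo=PDT)

# The same instants expressed in UTC.
gap_start_utc = gap_start.astimezone(timezone.utc)
gap_end_utc = gap_end.astimezone(timezone.utc)


def in_gap(ts: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the gap."""
    return gap_start <= ts < gap_end


print(gap_start_utc.isoformat(), gap_end_utc.isoformat())
```

Data points whose timestamps fall inside this window are expected to be missing rather than zero, so alerts or averages computed over it may need to treat the period as having no data.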

The root cause of this outage has been identified as a large-scale event, along with several additional factors that significantly delayed our ability to restore service. Our initial notes from this incident already point to improvements we can make on our side, and we will conduct a thorough postmortem to ensure that such an incident does not occur again. Thank you to everyone for your patience and support. We are committed to meeting high standards of reliability and excellence, and to using what we learned from this incident to improve.

Posted Oct 29, 2022 - 21:07 UTC

Investigating

We're observing increased latency in our GraphQL API and are currently investigating. Our metrics ingestion endpoint should be operating normally, and we're continuing to monitor its success rate.

Posted Oct 29, 2022 - 20:41 UTC

Update

We have released a change so that our metrics ingestion endpoint no longer serves rate-limit responses. We are continuing to monitor traffic to ensure that the success rate for this endpoint remains close to 100%, and we expect service to return to normal.

Posted Oct 29, 2022 - 20:29 UTC

Monitoring

We have successfully rolled out an update so that metrics submissions now either succeed (200) or fail with a rate-limit status code (429). For users who have been experiencing issues in their infrastructure related to buffering these reports, we recommend restarting any servers that are emitting these metrics. We are currently accepting 50% of traffic and are gradually increasing the volume of metrics traffic that we accept to ensure continued service.
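As an illustration of how a reporting client might cope with a mix of 200 and 429 responses without unbounded memory growth, here is a minimal sketch. It is not Apollo's reporting implementation: `BoundedReportBuffer`, `backoff_seconds`, and the `send` callback are hypothetical names, and the policy shown (drop the oldest report when the buffer is full, stop flushing on the first non-200, back off exponentially up to a cap) is only one possible choice.

```python
from collections import deque


class BoundedReportBuffer:
    """Hypothetical buffer of pending reports with a fixed capacity.

    When the buffer is full, adding a report drops the oldest one,
    so memory use stays bounded during a long outage.
    """

    def __init__(self, max_reports: int):
        self._pending = deque(maxlen=max_reports)
        self.dropped = 0

    def __len__(self) -> int:
        return len(self._pending)

    def add(self, report) -> None:
        # deque(maxlen=...) discards the oldest item silently; count it.
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1
        self._pending.append(report)

    def flush(self, send) -> int:
        """Send reports oldest first; stop at the first non-200 status.

        `send` is a caller-supplied function that takes a report and
        returns an HTTP status code. Returns the number of reports sent.
        """
        sent = 0
        while self._pending:
            status = send(self._pending[0])
            if status != 200:
                break  # e.g. 429: keep the report and retry later
            self._pending.popleft()
            sent += 1
        return sent


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff delay for the given retry attempt, capped."""
    return min(cap, base * (2 ** attempt))
```

A caller would invoke `flush` periodically and, when it stops early, wait `backoff_seconds(attempt)` before the next try. Dropping the oldest reports trades completeness of historical data for stability of the reporting server.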

We understand that this incident caused a variety of issues for our customers, and we are already collecting action items and follow-ups. Thank you for your patience as we resolve the underlying issues.

Posted Oct 29, 2022 - 20:19 UTC

Update

We implemented a fix for Studio UI, and latency is trending back down to healthy levels. Studio UI should be operational again, and we'll continue to monitor its state.

Posted Oct 29, 2022 - 19:59 UTC

Update

Latency for our API has drastically increased, and Studio UI is taking a long time to load. We are continuing to investigate and appreciate your continued patience.

Posted Oct 29, 2022 - 19:39 UTC

Update

We are continuing to work on a fix for this issue.

Posted Oct 29, 2022 - 19:36 UTC

Update

We're seeing increased latency from our GraphQL API, which is causing Studio to be slow to load.

Posted Oct 29, 2022 - 19:34 UTC

Update

We are in the middle of restarting some of our systems; you may experience 429s instead of 5XXs while this is occurring.

Posted Oct 29, 2022 - 19:14 UTC

Update

For most of the incident so far, our reporting endpoint has not been accepting stats and traces. Latency for our GraphQL API has also increased significantly.

Posted Oct 29, 2022 - 19:09 UTC

Identified

We've determined the cause of the issue and are working on a fix.

Posted Oct 29, 2022 - 18:11 UTC

Investigating

Our API endpoint for reporting metrics and some parts of our GraphQL endpoint are experiencing increased latency and errors. We are currently investigating.

Posted Oct 29, 2022 - 16:47 UTC

This incident affected: Metrics Ingestion and Studio UI.