Major Outage across ChatGPT and API
Incident Report for OpenAI
Postmortem

On November 8th, between 5:42 AM and 7:16 AM PT, a large portion of requests to OpenAI failed with 502 or 503 error codes. All models and API endpoints saw significant failures over the course of the event.

The outage occurred because routing layer nodes hit their memory limits and failed readiness checks. Eventually, enough nodes in the service became unavailable that there was not sufficient capacity to serve the incoming traffic, and the service was unable to recover. That morning, there were dramatically more completions than on any prior day, which we believe was the tipping point for the service.

The issue was mitigated through a combination of limiting incoming traffic, redeploying the service en masse, and then slowly ramping traffic back up.
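The report does not describe how traffic was limited and ramped back up, so the following is only a minimal sketch of one common approach: admit a fixed fraction of requests and raise that fraction in steps until it reaches 100%. The names `rampFraction` and `admit`, and the linear ramp shape, are assumptions made for illustration and are not taken from the incident itself.

```go
package main

import (
	"fmt"
	"math"
)

// rampFraction returns the share of traffic to admit at a given step of a
// linear ramp from startFraction up to 1.0 over totalSteps steps.
// (Hypothetical helper; the real ramp schedule is not described in the report.)
func rampFraction(step, totalSteps int, startFraction float64) float64 {
	if totalSteps <= 0 || step >= totalSteps {
		return 1.0
	}
	if step < 0 {
		step = 0
	}
	return startFraction + (1.0-startFraction)*float64(step)/float64(totalSteps)
}

// admit decides deterministically whether the n-th request (counting from 0)
// is let through at the given fraction. Over any run of requests it admits
// roughly fraction*count of them, spread evenly rather than in bursts.
func admit(n int, fraction float64) bool {
	return math.Floor(float64(n+1)*fraction) > math.Floor(float64(n)*fraction)
}

func main() {
	// Ramp from 50% to 100% of traffic in four steps.
	for step := 0; step <= 4; step++ {
		f := rampFraction(step, 4, 0.5)
		admitted := 0
		for n := 0; n < 100; n++ {
			if admit(n, f) {
				admitted++
			}
		}
		fmt.Printf("step %d: fraction %.3f, admitted %d of 100\n", step, f, admitted)
	}
}
```

Requests that are not admitted would be rejected early (for example with a 429 or 503) so that recovering nodes are not pushed back over their limits while capacity is restored.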

As part of the incident response, we have already implemented the following measures:

  • The underlying cause of the chronic memory issues was that we were continually allocating a new responseBuffer in a loop rather than reusing it. This caused the garbage collector (GC) to fall behind, particularly under high request volume. The fix was to pre-allocate the buffer and reuse it. Since shipping the fix, we have seen a 3x improvement in both memory and CPU usage.
  • We have adjusted the configured memory limits to an appropriate level and the service now has significant available headroom.
  • We have implemented a series of rate limit controls to shed load more gracefully.
  • As an additional precaution, we increased the service’s capacity.
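The buffer fix described in the first item above can be illustrated with a minimal sketch. The report does not name the service's language or show its code, so Go is an assumption here, as are the function name `handleAll`, the `emit` callback, and the 64 KiB initial capacity; only the identifier `responseBuffer` comes from the report. The point is simply that the buffer is allocated once outside the loop and reset between requests, instead of being allocated anew on every iteration.

```go
package main

import (
	"bytes"
	"fmt"
)

// handleAll builds a response for each request in one shared buffer.
// The buffer is allocated once, before the loop, and Reset between
// requests so its backing array is reused rather than reallocated,
// which leaves far less garbage for the collector to clean up.
func handleAll(requests []string, emit func([]byte)) {
	// Hypothetical initial capacity; a real service would size this
	// from observed response sizes.
	responseBuffer := bytes.NewBuffer(make([]byte, 0, 64*1024))

	for _, req := range requests {
		responseBuffer.Reset() // drops old contents, keeps the capacity
		responseBuffer.WriteString("echo: ")
		responseBuffer.WriteString(req)
		// The slice passed to emit is only valid until the next Reset.
		emit(responseBuffer.Bytes())
	}
}

func main() {
	var out []string
	handleAll([]string{"a", "b"}, func(b []byte) {
		out = append(out, string(b)) // copy before the buffer is reused
	})
	fmt.Println(out)
}
```

The pattern that caused the problem would instead create the buffer inside the `for` loop, producing a fresh allocation per request. When a buffer must be shared across goroutines, `sync.Pool` is a common alternative to a single pre-allocated buffer.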

Additionally, we will be implementing the following changes to prevent future incidents altogether:

  • Implement a series of alerting changes to catch the underlying memory behavior before it escalates into potential service issues.
  • In the past, it was not possible to enable auto scaling on this service because adding capacity to it could negatively impact an upstream service. That underlying issue has been rectified, and we will configure auto scaling for this service.

We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Posted Nov 15, 2023 - 17:35 PST

Resolved
Between 5:42 AM and 7:16 AM PT, we saw errors impacting all services. We identified the problem and implemented a fix. We are now seeing normal responses from our services.
Posted Nov 08, 2023 - 07:46 PST
Monitoring
A fix has been implemented and we are gradually seeing the services recover. We are currently monitoring the situation.
Posted Nov 08, 2023 - 07:33 PST
Identified
We've identified an issue resulting in high error rates across the API and ChatGPT, and we are working on remediation.
Posted Nov 08, 2023 - 06:50 PST
Update
We are seeing high error rates across the API and ChatGPT and are actively investigating possible causes.
Posted Nov 08, 2023 - 06:49 PST
Update
We are continuing to investigate this issue.
Posted Nov 08, 2023 - 06:02 PST
Investigating
We are currently investigating this issue.
Posted Nov 08, 2023 - 05:54 PST
This incident affected: API and ChatGPT.