Elevated Error Rate for ChatGPT and API
Incident Report for OpenAI
Postmortem

Summary

On November 25, 2024, starting at 10:15am PT, a large portion of both API and ChatGPT traffic failed with timeouts and 503 error codes.

ChatGPT saw degraded performance for Paid and Enterprise from 10:15am to 12:00pm, peaking at a 13% error rate for Paid and a 23% error rate for Enterprise. Degraded performance for Free continued until 1:20pm. Search in ChatGPT was also disabled between 11:15am and 12:05pm.

API impact was most pronounced across the gpt-4-turbo-preview, gpt-4-0125-preview, and text-embedding-3-large models, where clients saw increased latency and error rates from 10:15am until as late as 1pm.

Root cause & mitigation

A global change to Kubernetes namespace labels inadvertently triggered a recomputation of metadata by our networking layer. This change rolled out to our small clusters safely but overwhelmed the control plane in three of our largest GPU clusters, leading to cascading failures, significant latency, and elevated error rates across multiple OpenAI products.

We mitigated impact by manually moving workloads out of the impacted clusters, a process made more difficult by the unresponsive control plane in these large clusters.

Once we identified the root cause, we worked with our cloud provider to scale up the number of control plane nodes while simultaneously scaling down the number of data plane nodes. This allowed the control plane to recover, and we have since moved some major workloads back into these clusters with additional protections.

Prevention

In the short term, we've already:

  • Locked namespace changes to ensure that we don't see a recurrence of this incident
  • Frozen certain types of infrastructure deploys 

We're continuing to work on:

  • Allowlisting identities to prevent a recurrence of this class of incidents
  • Improvements to our GPU load balancing logic to ensure that traffic gets automatically re-routed more effectively when possible
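
As an illustration of the second item, a health-aware routing policy might look like the following minimal sketch. It is not the production load balancing logic: the `Cluster` fields, the 5% error threshold, and the function names are all assumptions made for the example. The idea is simply that clusters reporting a high error rate are skipped, and traffic is spread across the rest in proportion to spare capacity.

```python
import random
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    free_capacity: int  # hypothetical unit: spare request slots
    error_rate: float   # fraction of recent requests that failed


def routable(clusters, max_error_rate=0.05):
    """Return the clusters considered healthy enough to receive traffic."""
    return [
        c for c in clusters
        if c.error_rate <= max_error_rate and c.free_capacity > 0
    ]


def pick_cluster(clusters, rng=random, max_error_rate=0.05):
    """Pick a healthy cluster at random, weighted by its free capacity."""
    candidates = routable(clusters, max_error_rate)
    if not candidates:
        # Nothing healthy: surface the problem rather than route blindly.
        raise RuntimeError("no healthy cluster available")
    weights = [c.free_capacity for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With a policy of this shape, a cluster whose error rate rises above the threshold stops receiving new requests without anyone having to move workloads by hand, provided the health signal itself does not depend on the failing control plane.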

We're also investing in several critical improvements across our infrastructure – specifically, improving change management for fleet-wide infrastructure changes.
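
One common shape for this kind of change management is a staged rollout with a health gate, sketched below under stated assumptions. The `Cluster` type, the `apply_change` and `is_healthy` callbacks, and the identity allowlist are hypothetical stand-ins, not a description of the actual tooling. The change is applied to one cluster at a time, smallest first, and the rollout halts as soon as a cluster fails its health check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    node_count: int


def staged_rollout(clusters, apply_change, is_healthy, requester, allowlist):
    """Apply a change cluster by cluster, smallest first, halting on failure.

    Returns (succeeded, halted_at): the names of clusters that stayed healthy
    after the change, and the name of the cluster that failed (or None).
    """
    if requester not in allowlist:
        raise PermissionError(f"{requester} may not make fleet-wide changes")

    succeeded = []
    for cluster in sorted(clusters, key=lambda c: c.node_count):
        apply_change(cluster)
        if not is_healthy(cluster):
            # Stop here so that larger clusters are never touched.
            return succeeded, cluster.name
        succeeded.append(cluster.name)
    return succeeded, None
```

A gate like this would not by itself keep a change that is safe on small clusters from hurting the first large one, but it would limit the damage to a single cluster rather than several at once.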

We know that our customers rely on OpenAI to be available, and that extended outages like these are especially damaging. We're investing heavily in these areas and will continue to improve our service reliability in the coming days and months.

Posted Dec 06, 2024 - 14:31 PST

Resolved
This issue has now been resolved.
Starting at 10:20am PT, customers experienced elevated errors on ChatGPT and API.
ChatGPT was mostly recovered by 11:55am PT, with some free plan customers continuing to experience issues until 1:20pm PT.
API performance was recovered for most customers by 1:30pm PT, with a smaller number of customers continuing to experience issues until 3:45pm PT.
Posted Nov 25, 2024 - 16:15 PST
Monitoring
We have implemented a fix for all API models with the exception of 'gpt-4-1106-preview', which we are continuing to work on. We are continuing to monitor performance across all APIs as well as ChatGPT, and will post an update as soon as we are able.
Posted Nov 25, 2024 - 15:03 PST
Update
We have resolved the issues affecting ChatGPT. Some GPT-4 class models (excluding 4o class) accessed via the API may continue to experience elevated errors. We are continuing to work towards resolution, and will provide an update as soon as we are able.
Posted Nov 25, 2024 - 14:27 PST
Update
We are continuing to work towards resolving the issue, and will provide an update as soon as possible.
Posted Nov 25, 2024 - 13:21 PST
Update
We are continuing to work towards resolving the issue, and will provide an update as soon as possible.
Posted Nov 25, 2024 - 12:18 PST
Update
We have identified that this issue may also cause elevated errors in the API. We are continuing to work towards implementing a fix.
Posted Nov 25, 2024 - 11:33 PST
Identified
We have identified the root cause of this issue, and are currently working to implement a fix.
Posted Nov 25, 2024 - 11:11 PST
Investigating
We are currently experiencing elevated error rates for ChatGPT and are investigating.
Posted Nov 25, 2024 - 11:03 PST
This incident affected: API and ChatGPT.