On November 25, 2024, starting at 10:15am PT, a large portion of both API and ChatGPT traffic failed with timeouts and 503 errors.
ChatGPT saw degraded performance for Paid and Enterprise users from 10:15am to 12pm, peaking at a 13% error rate for Paid and a 23% error rate for Enterprise. Degraded performance for Free users continued until 1:20pm. Search in ChatGPT was also disabled between 11:15am and 12:05pm.
API impact was most pronounced for the gpt-4-turbo-preview, gpt-4-0125-preview, and text-embedding-3-large models, where clients saw increased latency and error rates from 10:15am until as late as 1pm.
A global change to Kubernetes namespace labels inadvertently triggered a recomputation of metadata by our networking layer. This change rolled out to our small clusters safely but overwhelmed the control plane in three of our largest GPU clusters, leading to cascading failures, significant latency, and elevated error rates across multiple OpenAI products.
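To make the failure mode concrete, the sketch below is a minimal illustration only, not our actual networking component: a controller that watches Namespace objects and recomputes per-namespace networking metadata whenever a namespace changes. When a label is applied fleet-wide, every namespace fires an event at roughly the same time, and each handler issues LIST calls against the API server, producing the kind of sudden burst that can overwhelm a Kubernetes control plane.

```python
# Illustrative sketch only (not our production code): a controller that
# watches Namespace objects and recomputes networking metadata whenever a
# namespace is modified. A fleet-wide label update makes every namespace
# emit a MODIFIED event at roughly the same time, and each handler issues
# LIST calls against the API server.
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
core = client.CoreV1Api()


def recompute_networking_metadata(namespace: str) -> None:
    # Hypothetical recomputation: re-list pods and endpoints in the namespace
    # to rebuild routing/service-discovery state. Cheap for one namespace,
    # expensive when thousands of namespaces do it simultaneously.
    core.list_namespaced_pod(namespace)
    core.list_namespaced_endpoints(namespace)


def main() -> None:
    w = watch.Watch()
    for event in w.stream(core.list_namespace):
        if event["type"] == "MODIFIED":
            recompute_networking_metadata(event["object"].metadata.name)


if __name__ == "__main__":
    main()
```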
We mitigated the impact by manually moving workloads out of the affected clusters, a task made more complex by the unresponsive control planes in those large clusters. A rough sketch of this kind of move follows below.
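The sketch below illustrates one way such a manual move can be done; the cluster contexts, namespace, and deployment name are hypothetical, and the real process involved far more than a single deployment. It scales a deployment down in the impacted cluster and up in a healthy one, using short request timeouts and retries because the impacted control plane was responding slowly.

```python
# Illustrative sketch only: shifting a workload out of an impacted cluster by
# scaling its Deployment down there and up in a healthy cluster. The cluster
# contexts, namespace, and deployment name are hypothetical.
import time

from kubernetes import client, config


def scale_deployment(context: str, namespace: str, name: str, replicas: int) -> None:
    # Short request timeouts plus exponential backoff, because the impacted
    # cluster's API server was slow or unresponsive during the incident.
    api = client.AppsV1Api(api_client=config.new_client_from_config(context=context))
    body = {"spec": {"replicas": replicas}}
    for attempt in range(5):
        try:
            api.patch_namespaced_deployment_scale(
                name, namespace, body, _request_timeout=10
            )
            return
        except Exception:  # timeouts / 5xx from an overloaded control plane
            time.sleep(2 ** attempt)
    raise RuntimeError(f"could not scale {name} in {context}")


# Drain the impacted cluster and absorb the traffic elsewhere.
scale_deployment("impacted-gpu-cluster", "inference", "model-server", 0)
scale_deployment("healthy-gpu-cluster", "inference", "model-server", 200)
```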
Once we identified the root cause, we worked with our cloud provider to scale up the number of control plane nodes while simultaneously scaling down the number of data plane nodes. This allowed the control plane to recover, and we have since moved some major workloads back into these clusters with additional protections.
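For the data plane side, the sketch below illustrates why removing worker nodes relieves a struggling control plane: fewer nodes means fewer kubelet heartbeats, watches, and endpoints pressing on the API servers. The actual scale-down was performed by resizing node pools with our cloud provider; the batch size and node label here are assumptions for illustration only.

```python
# Illustrative sketch only: cordoning and draining data plane (worker) nodes
# in small batches so the remaining control plane capacity isn't hit with a
# new spike of rescheduling work all at once. Batch size and label selector
# are hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

BATCH_SIZE = 25  # drain a small batch at a time to avoid a fresh load spike


def cordon(node_name: str) -> None:
    # Mark the node unschedulable so no new pods land on it.
    core.patch_node(node_name, {"spec": {"unschedulable": True}})


def drain(node_name: str) -> None:
    # Delete the node's pods so they reschedule onto the remaining nodes.
    # A real drain would use the Eviction API, respect PodDisruptionBudgets,
    # and skip DaemonSet pods; this is simplified.
    pods = core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}"
    )
    for pod in pods.items:
        core.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)


workers = core.list_node(label_selector="node-role.kubernetes.io/worker").items
for node in workers[:BATCH_SIZE]:
    cordon(node.metadata.name)
    drain(node.metadata.name)
```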
In the short term, we've already:
We're continuing to work on:
We're also investing in several critical improvements across our infrastructure, most notably stronger change management for fleet-wide infrastructure changes.
We know that our customers rely on OpenAI to be available, and that extended outages like these are especially damaging. We're investing heavily in these areas and will continue to improve our service reliability in the coming days and months.