This post-mortem details an incident that occurred on December 11, 2024, where all OpenAI services experienced significant downtime. The issue stemmed from a new telemetry service deployment that unintentionally overwhelmed the Kubernetes control plane, causing cascading failures across critical systems. In this post, we break down the root cause, outline the steps taken for remediation, and share the measures we are implementing to prevent similar incidents in the future.
Between 3:16 PM and 7:38 PM PST on December 11, 2024, all OpenAI services experienced significant degradation or complete unavailability. The outage was the result of an internal change to roll out new telemetry across our fleet; it was not caused by a security incident or a recent launch. Degradation began across all products at 3:16 PM PST.
OpenAI operates hundreds of Kubernetes clusters globally. Kubernetes has a control plane, responsible for cluster administration, and a data plane, where we actually serve workloads such as model inference.
As part of a push to improve reliability across the organization, we’ve been working to improve our cluster-wide observability tooling to strengthen visibility into the state of our systems. At 3:12 PM PST, we deployed a new telemetry service to collect detailed Kubernetes control plane metrics.
Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters. This issue was most pronounced in our largest clusters, so our testing didn’t catch it – and DNS caching made the issue far less visible until the rollouts had begun fleet-wide.
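To make the failure mode concrete, here is a minimal illustrative sketch in Go using the standard client-go library. It is not the actual telemetry code; the NODE_NAME environment variable (assumed to be injected via the pod spec) and the specific calls are hypothetical. It contrasts an unscoped, cluster-wide pod list issued from every node, whose aggregate cost grows with cluster size, with a node-scoped equivalent.

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Runs on every node, e.g. as part of a (hypothetical) telemetry DaemonSet.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Expensive: every node asks the API server for every pod in the cluster,
	// so total API cost grows with (nodes x pods) as the cluster scales.
	all, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("cluster-wide pods:", len(all.Items))

	// Cheaper: scope the query to the local node so the cost per node stays
	// roughly constant regardless of cluster size.
	node := os.Getenv("NODE_NAME")
	local, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("local pods:", len(local.Items))
}
```

The exact API operations in the incident may have differed; the point is the shape of the load: per-node queries whose cost is proportional to the size of the cluster.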
The Kubernetes data plane can operate largely independently of the control plane, but DNS relies on the control plane – services don’t know how to contact one another without the Kubernetes control plane.
In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
The change was tested in a staging cluster, where no issues were observed. The impact was specific to clusters exceeding a certain size, and our DNS cache on each node delayed visible failures long enough for the rollout to continue.
Our main reliability concern prior to deployment was the resource consumption of the new telemetry service. Before deploying, we evaluated resource utilization metrics (CPU and memory) in all clusters to ensure the deployment wouldn't disrupt running services. While resource requests were tuned on a per-cluster basis, no precautions were taken to assess Kubernetes API server load. The rollout process monitored service health but lacked sufficient monitoring of cluster health, in particular the health of the Kubernetes control plane.
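As an illustration of the kind of check that was missing, the following sketch gates a rollout on API server load. It assumes control plane metrics are scraped into a Prometheus instance at a placeholder address; apiserver_current_inflight_requests is a standard kube-apiserver metric, while the Prometheus address and threshold below are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// promVector mirrors the relevant part of the Prometheus HTTP API response.
type promVector struct {
	Data struct {
		Result []struct {
			Value [2]interface{} `json:"value"` // [timestamp, "value-as-string"]
		} `json:"result"`
	} `json:"data"`
}

// apiServerOverloaded queries Prometheus (address is an assumption) for
// in-flight requests on the kube-apiserver and compares against a threshold.
func apiServerOverloaded(promAddr string, threshold float64) (bool, error) {
	q := url.QueryEscape(`sum(apiserver_current_inflight_requests)`)
	resp, err := http.Get(promAddr + "/api/v1/query?query=" + q)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var v promVector
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return false, err
	}
	for _, r := range v.Data.Result {
		var val float64
		fmt.Sscan(r.Value[1].(string), &val)
		if val > threshold {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	overloaded, err := apiServerOverloaded("http://prometheus.example:9090", 400) // illustrative threshold
	if err != nil || overloaded {
		fmt.Println("halt rollout: control plane unhealthy or unknown")
		return
	}
	fmt.Println("control plane healthy: continue rollout")
}
```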
The Kubernetes data plane (responsible for handling user requests) is designed to operate independently of the control plane. However, the Kubernetes API server is required for DNS resolution, which is a critical dependency for many of our services.
DNS caching mitigated the impact temporarily by providing stale but functional DNS records. As cached records expired over the following 20 minutes, however, services began failing because they depend on real-time DNS resolution. This timing was critical: it delayed the visibility of the issue and allowed the rollout to continue before the full scope of the problem was understood. Once the DNS caches were empty, every lookup hit the DNS servers directly, multiplying the load on them, adding further strain to the control plane, and complicating immediate mitigation.
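To make that timing dynamic concrete, here is a small self-contained sketch of a TTL cache in front of a resolver. It is purely illustrative, not our resolver; the service name, address, and 20-minute TTL are stand-ins. Lookups keep succeeding from cache while the upstream is down, then fail together once entries expire.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type cachedRecord struct {
	addrs   []string
	expires time.Time
}

// cachingResolver illustrates the dynamic only: cached answers mask an
// unreachable upstream until their TTLs run out.
type cachingResolver struct {
	ttl      time.Duration
	cache    map[string]cachedRecord
	upstream func(name string) ([]string, error)
}

func (r *cachingResolver) Lookup(name string, now time.Time) ([]string, error) {
	if rec, ok := r.cache[name]; ok && now.Before(rec.expires) {
		return rec.addrs, nil // stale-but-functional answer
	}
	addrs, err := r.upstream(name)
	if err != nil {
		return nil, err // cache empty or expired and upstream down: resolution fails
	}
	r.cache[name] = cachedRecord{addrs: addrs, expires: now.Add(r.ttl)}
	return addrs, nil
}

func main() {
	r := &cachingResolver{
		ttl:   20 * time.Minute, // illustrative TTL
		cache: map[string]cachedRecord{},
		upstream: func(string) ([]string, error) {
			return nil, errors.New("control-plane-backed DNS unavailable")
		},
	}
	// Pretend this entry was cached just before the control plane went down.
	start := time.Now()
	r.cache["inference.internal"] = cachedRecord{addrs: []string{"10.0.0.7"}, expires: start.Add(r.ttl)}

	for _, offset := range []time.Duration{1 * time.Minute, 19 * time.Minute, 21 * time.Minute} {
		addrs, err := r.Lookup("inference.internal", start.Add(offset))
		fmt.Printf("t+%v: addrs=%v err=%v\n", offset, addrs, err)
	}
}
```

At 1 and 19 minutes the cached answer still works; at 21 minutes resolution fails outright, and every subsequent lookup lands on the already-struggling servers.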
Monitoring a deployment and reverting an offending change is generally straightforward, and we have tools to detect and roll back bad deployments. In this case our detection worked: we identified the issue a few minutes before customers started seeing impact. But the fix required removing the offending service, and to do that we needed access to the Kubernetes control plane, which we could not get because of the elevated load on the Kubernetes API servers.
We identified the issue within minutes and immediately spun up multiple workstreams to explore different ways to bring our clusters back online quickly:
1. Scaling down cluster size, which reduced the aggregate Kubernetes API load.
2. Blocking network access to the Kubernetes admin APIs, which stopped new expensive requests and gave the API servers time to recover.
3. Scaling up the Kubernetes API servers, which added capacity to work through pending requests.
By pursuing all three in parallel, we eventually restored enough control to remove the offending service.
Once we regained access to some Kubernetes control planes, we began to see immediate recovery. Where possible, we shifted traffic to healthy clusters while working on remediation for the others. Some clusters remained in a degraded state as many services tried to download resources simultaneously, saturating resource limits and requiring additional manual intervention.
This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways. Namely:
1. Our testing didn't catch the impact the change had on the Kubernetes control plane, because the problem only appeared in clusters beyond a certain size.
2. DNS caching delayed visibility, allowing the rollout to continue before the full extent of the problem was understood.
3. Remediation was slow because the fix required access to the Kubernetes control plane, which was unreachable under the elevated API server load.
To prevent similar incidents, we are implementing the following measures:
We're continuing our work on phased rollouts with better monitoring for all infrastructure changes, so that any failure has limited impact and is detected early. Going forward, all infrastructure-related configuration changes will follow a robust phased rollout process, with continuous monitoring that confirms both the service workloads and the clusters (including the Kubernetes control plane) are healthy.
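One possible shape for such a rollout, sketched below with hypothetical cohort names and a placeholder health check: deploy to progressively larger cohorts and refuse to advance unless both the workload and the cluster look healthy.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// healthCheck stands in for real checks of both service metrics and cluster
// health (for example, API server latency and error rates); the signature is assumed.
type healthCheck func(cohort string) error

// phasedRollout deploys to one cohort at a time, waits a soak period, and
// halts the rollout the moment any check fails.
func phasedRollout(cohorts []string, deploy func(string) error, healthy healthCheck, soak time.Duration) error {
	for _, c := range cohorts {
		if err := deploy(c); err != nil {
			return fmt.Errorf("deploy to %s failed: %w", c, err)
		}
		time.Sleep(soak) // let metrics accumulate before judging the cohort
		if err := healthy(c); err != nil {
			return fmt.Errorf("halting rollout after %s: %w", c, err)
		}
		fmt.Println("cohort healthy, continuing:", c)
	}
	return nil
}

func main() {
	// Cohort names are illustrative, smallest blast radius first.
	cohorts := []string{"staging", "small-prod-cluster", "large-prod-cluster"}
	err := phasedRollout(cohorts,
		func(c string) error { fmt.Println("deploying to", c); return nil },
		func(c string) error {
			if c == "large-prod-cluster" {
				return errors.New("apiserver latency above threshold") // simulated failure
			}
			return nil
		},
		100*time.Millisecond, // soak period shortened for the example
	)
	fmt.Println("result:", err)
}
```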
The Kubernetes data plane should be able to survive without the control plane for longer, and we’ll run tests that explicitly exercise this scenario. We’ll also run tests to intentionally roll out bad changes to ensure that our systems detect and roll back appropriately.
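A sketch of what one such test might assert, with hypothetical endpoints and timings: while the control plane's health endpoint is failing, requests served purely by the data plane should keep succeeding.

```go
package chaos

import (
	"net/http"
	"testing"
	"time"
)

// Both endpoints are assumptions for this sketch: one is served purely by the
// data plane (an already-running workload), the other is the cluster's API server.
const (
	dataPlaneURL    = "https://inference.example.internal/healthz"
	controlPlaneURL = "https://kube-apiserver.example.internal/readyz"
)

// TestDataPlaneSurvivesControlPlaneOutage assumes the test harness has already
// injected the fault (for example, by blocking network access to the API server).
func TestDataPlaneSurvivesControlPlaneOutage(t *testing.T) {
	client := &http.Client{Timeout: 5 * time.Second}

	// Precondition: the control plane really is unreachable or unhealthy.
	if resp, err := client.Get(controlPlaneURL); err == nil {
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			t.Skip("control plane is healthy; fault was not injected")
		}
	}

	// The data plane should keep serving for the whole outage window.
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		resp, err := client.Get(dataPlaneURL)
		if err != nil {
			t.Fatalf("data plane request failed during control plane outage: %v", err)
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			t.Fatalf("data plane returned %d during control plane outage", resp.StatusCode)
		}
		time.Sleep(10 * time.Second)
	}
}
```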
We do not yet have a mechanism that guarantees access to the API server when the data plane is putting too much pressure on the control plane. We'll implement break-glass mechanisms to ensure that engineers can access the Kubernetes API servers under any circumstances.
Our dependency on Kubernetes DNS for service discovery creates a link between our Kubernetes data plane and control plane. We are investing in systems to decouple the data plane from the control plane so that the control plane is not load-bearing for mission-critical services and product workloads.
We'll implement improved caching and dynamic rate limiters for the resources that are necessary for cluster startup, and we'll run regular exercises in which we rapidly replace an entire cluster, with an explicit goal of fast and correct startup.
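As a sketch of the rate-limiting piece, here is a minimal example using the golang.org/x/time/rate package; the fetch function, resource names, and limits are all illustrative rather than our actual implementation. The idea is to cap how fast newly started services pull startup resources and to adjust the cap at runtime as the cluster recovers.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// fetchResource stands in for whatever a service downloads at startup
// (images, model weights, configuration); it is a placeholder.
func fetchResource(name string) { fmt.Println("fetched", name) }

func main() {
	// Allow 10 fetches per second with small bursts; numbers are illustrative.
	limiter := rate.NewLimiter(rate.Limit(10), 20)
	ctx := context.Background()

	resources := []string{"config", "model-shard-0", "model-shard-1"}
	for _, r := range resources {
		// Wait blocks until the limiter permits another fetch, smoothing the
		// thundering herd when many services start at once.
		if err := limiter.Wait(ctx); err != nil {
			panic(err)
		}
		fetchResource(r)
	}

	// The "dynamic" part: loosen or tighten the limit at runtime, for example
	// based on how loaded the shared backend looks.
	limiter.SetLimit(rate.Limit(50))
	limiter.SetBurst(100)
}
```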
We apologize for the impact that this incident caused to all of our customers – from ChatGPT users to developers to businesses who rely on OpenAI products. We’ve fallen short of our own expectations. We recognize that it is critical to provide a highly reliable service to all of you, and will be prioritizing the preventative measures outlined above to continue improving reliability. Thank you for your patience during this disruption.