This post-mortem details an incident that occurred on December 11, 2024, where all OpenAI services experienced significant downtime. The issue stemmed from a new telemetry service deployment that unintentionally overwhelmed the Kubernetes control plane, causing cascading failures across critical systems. In this post, we break down the root cause, outline the steps taken for remediation, and share the measures we are implementing to prevent similar incidents in the future.
Between 3:16 PM and 7:38 PM PST on December 11, 2024, all OpenAI services experienced significant degradation or complete unavailability. The outage was the result of an internal change to roll out new telemetry across our fleet; it was not caused by a security incident or a recent launch. Degradation began across all products at 3:16 PM PST.
OpenAI operates hundreds of Kubernetes clusters globally. Kubernetes has a control plane, responsible for cluster administration, and a data plane, where we actually serve workloads such as model inference.
As part of a push to improve reliability across the organization, we’ve been working to improve our cluster-wide observability tooling to strengthen visibility into the state of our systems. At 3:12 PM PST, we deployed a new telemetry service to collect detailed Kubernetes control plane metrics.
Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations whose cost scaled with the size of the cluster. With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters. This issue was most pronounced in our largest clusters, so our testing didn’t catch it – and DNS caching made the issue far less visible until the rollouts had begun fleet-wide.
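To make the failure mode concrete, here is a minimal illustrative sketch in Go using the standard client-go library. It is not the actual telemetry code; the NODE_NAME environment variable (assumed to be injected via the pod spec) and the specific calls are hypothetical. It contrasts an unscoped, cluster-wide pod list issued from every node, whose aggregate cost grows with cluster size, with a node-scoped equivalent.

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Runs on every node, e.g. as part of a (hypothetical) telemetry DaemonSet.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Expensive: every node asks the API server for every pod in the cluster,
	// so total API cost grows with (nodes x pods) as the cluster scales.
	all, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("cluster-wide pods:", len(all.Items))

	// Cheaper: scope the query to the local node so the cost per node stays
	// roughly constant regardless of cluster size.
	node := os.Getenv("NODE_NAME")
	local, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("local pods:", len(local.Items))
}
```

The exact API operations in the incident may have differed; the point is the shape of the load: per-node queries whose cost is proportional to the size of the cluster.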
The Kubernetes data plane can operate largely independently of the control plane, but DNS relies on the control plane – services don’t know how to contact one another without the Kubernetes control plane.
In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
The change was tested in a staging cluster, where no issues were observed. The impact was specific to clusters exceeding a certain size, and our DNS cache on each node delayed visible failures long enough for the rollout to continue.
Our main reliability concern prior to deployment was the resource consumption of the new telemetry service. Before deploying, we evaluated resource utilization metrics (CPU and memory) in all clusters to ensure the deployment wouldn't disrupt running services. While resource requests were tuned on a per-cluster basis, no precautions were taken to assess Kubernetes API server load. The rollout process monitored service health but lacked sufficient monitoring of cluster health, in particular the health of the Kubernetes control plane.
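As an illustration of the kind of check that was missing, the following sketch gates a rollout on API server load. It assumes control plane metrics are scraped into a Prometheus instance at a placeholder address; apiserver_current_inflight_requests is a standard kube-apiserver metric, while the Prometheus address and threshold below are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// promVector mirrors the relevant part of the Prometheus HTTP API response.
type promVector struct {
	Data struct {
		Result []struct {
			Value [2]interface{} `json:"value"` // [timestamp, "value-as-string"]
		} `json:"result"`
	} `json:"data"`
}

// apiServerOverloaded queries Prometheus (address is an assumption) for
// in-flight requests on the kube-apiserver and compares against a threshold.
func apiServerOverloaded(promAddr string, threshold float64) (bool, error) {
	q := url.QueryEscape(`sum(apiserver_current_inflight_requests)`)
	resp, err := http.Get(promAddr + "/api/v1/query?query=" + q)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var v promVector
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return false, err
	}
	for _, r := range v.Data.Result {
		var val float64
		fmt.Sscan(r.Value[1].(string), &val)
		if val > threshold {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	overloaded, err := apiServerOverloaded("http://prometheus.example:9090", 400) // illustrative threshold
	if err != nil || overloaded {
		fmt.Println("halt rollout: control plane unhealthy or unknown")
		return
	}
	fmt.Println("control plane healthy: continue rollout")
}
```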
The Kubernetes data plane (responsible for handling user requests) is designed to operate independently of the control plane. However, the Kubernetes API server is required for DNS resolution, which is a critical dependency for many of our services.
DNS caching mitigated the impact temporarily by providing stale but functional DNS records. As cached records expired over the following 20 minutes, however, services began failing because they depend on real-time DNS resolution. This timing was critical: it delayed the visibility of the issue and allowed the rollout to continue before the full scope of the problem was understood. Once the DNS caches were empty, every lookup hit the DNS servers directly, multiplying the load on them, adding further strain to the control plane, and complicating immediate mitigation.
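To make that timing dynamic concrete, here is a small self-contained sketch of a TTL cache in front of a resolver. It is purely illustrative, not our resolver; the service name, address, and 20-minute TTL are stand-ins. Lookups keep succeeding from cache while the upstream is down, then fail together once entries expire.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type cachedRecord struct {
	addrs   []string
	expires time.Time
}

// cachingResolver illustrates the dynamic only: cached answers mask an
// unreachable upstream until their TTLs run out.
type cachingResolver struct {
	ttl      time.Duration
	cache    map[string]cachedRecord
	upstream func(name string) ([]string, error)
}

func (r *cachingResolver) Lookup(name string, now time.Time) ([]string, error) {
	if rec, ok := r.cache[name]; ok && now.Before(rec.expires) {
		return rec.addrs, nil // stale-but-functional answer
	}
	addrs, err := r.upstream(name)
	if err != nil {
		return nil, err // cache empty or expired and upstream down: resolution fails
	}
	r.cache[name] = cachedRecord{addrs: addrs, expires: now.Add(r.ttl)}
	return addrs, nil
}

func main() {
	r := &cachingResolver{
		ttl:   20 * time.Minute, // illustrative TTL
		cache: map[string]cachedRecord{},
		upstream: func(string) ([]string, error) {
			return nil, errors.New("control-plane-backed DNS unavailable")
		},
	}
	// Pretend this entry was cached just before the control plane went down.
	start := time.Now()
	r.cache["inference.internal"] = cachedRecord{addrs: []string{"10.0.0.7"}, expires: start.Add(r.ttl)}

	for _, offset := range []time.Duration{1 * time.Minute, 19 * time.Minute, 21 * time.Minute} {
		addrs, err := r.Lookup("inference.internal", start.Add(offset))
		fmt.Printf("t+%v: addrs=%v err=%v\n", offset, addrs, err)
	}
}
```

At 1 and 19 minutes the cached answer still works; at 21 minutes resolution fails outright, and every subsequent lookup lands on the already-struggling servers.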
Monitoring a deployment and reverting an offending change is generally straightforward, and we have tools to detect and roll back bad deployments. In this case our detection worked: we identified the issue a few minutes before customers started seeing impact. But the fix required removing the offending service, and to do that we needed access to the Kubernetes control plane, which we could not get because of the elevated load on the Kubernetes API servers.
We identified the issue within minutes and immediately spun up multiple workstreams to explore different ways to bring our clusters back online quickly:
1. Scaling down cluster size, which reduced the aggregate Kubernetes API load.
2. Blocking network access to the Kubernetes admin APIs, which stopped new expensive requests and gave the API servers time to recover.
3. Scaling up the Kubernetes API servers, which added capacity to work through pending requests.
By pursuing all three in parallel, we eventually restored enough control to remove the offending service.
Once we regained access to some Kubernetes control planes, we began to see immediate recovery. Where possible, we shifted traffic to healthy clusters while working on remediation for the others. Some clusters remained in a degraded state as many services tried to download resources simultaneously, saturating resource limits and requiring additional manual intervention.
This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways. Namely:
1. Our testing didn't catch the impact the change had on the Kubernetes control plane, because the problem only appeared in clusters beyond a certain size.
2. DNS caching delayed visibility, allowing the rollout to continue before the full extent of the problem was understood.
3. Remediation was slow because the fix required access to the Kubernetes control plane, which was unreachable under the elevated API server load.
To prevent similar incidents, we are implementing the following measures:
We're continuing our work on phased rollouts with better monitoring for all infrastructure changes, so that any failure has limited impact and is detected early. Going forward, all infrastructure-related configuration changes will follow a robust phased rollout process, with continuous monitoring that confirms both the service workloads and the clusters (including the Kubernetes control plane) are healthy.
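One possible shape for such a rollout, sketched below with hypothetical cohort names and a placeholder health check: deploy to progressively larger cohorts and refuse to advance unless both the workload and the cluster look healthy.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// healthCheck stands in for real checks of both service metrics and cluster
// health (for example, API server latency and error rates); the signature is assumed.
type healthCheck func(cohort string) error

// phasedRollout deploys to one cohort at a time, waits a soak period, and
// halts the rollout the moment any check fails.
func phasedRollout(cohorts []string, deploy func(string) error, healthy healthCheck, soak time.Duration) error {
	for _, c := range cohorts {
		if err := deploy(c); err != nil {
			return fmt.Errorf("deploy to %s failed: %w", c, err)
		}
		time.Sleep(soak) // let metrics accumulate before judging the cohort
		if err := healthy(c); err != nil {
			return fmt.Errorf("halting rollout after %s: %w", c, err)
		}
		fmt.Println("cohort healthy, continuing:", c)
	}
	return nil
}

func main() {
	// Cohort names are illustrative, smallest blast radius first.
	cohorts := []string{"staging", "small-prod-cluster", "large-prod-cluster"}
	err := phasedRollout(cohorts,
		func(c string) error { fmt.Println("deploying to", c); return nil },
		func(c string) error {
			if c == "large-prod-cluster" {
				return errors.New("apiserver latency above threshold") // simulated failure
			}
			return nil
		},
		100*time.Millisecond, // soak period shortened for the example
	)
	fmt.Println("result:", err)
}
```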
The Kubernetes data plane should be able to survive without the control plane for longer, and we’ll run tests that explicitly exercise this scenario. We’ll also run tests to intentionally roll out bad changes to ensure that our systems detect and roll back appropriately.
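A sketch of what one such test might assert, with hypothetical endpoints and timings: while the control plane's health endpoint is failing, requests served purely by the data plane should keep succeeding.

```go
package chaos

import (
	"net/http"
	"testing"
	"time"
)

// Both endpoints are assumptions for this sketch: one is served purely by the
// data plane (an already-running workload), the other is the cluster's API server.
const (
	dataPlaneURL    = "https://inference.example.internal/healthz"
	controlPlaneURL = "https://kube-apiserver.example.internal/readyz"
)

// TestDataPlaneSurvivesControlPlaneOutage assumes the test harness has already
// injected the fault (for example, by blocking network access to the API server).
func TestDataPlaneSurvivesControlPlaneOutage(t *testing.T) {
	client := &http.Client{Timeout: 5 * time.Second}

	// Precondition: the control plane really is unreachable or unhealthy.
	if resp, err := client.Get(controlPlaneURL); err == nil {
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			t.Skip("control plane is healthy; fault was not injected")
		}
	}

	// The data plane should keep serving for the whole outage window.
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		resp, err := client.Get(dataPlaneURL)
		if err != nil {
			t.Fatalf("data plane request failed during control plane outage: %v", err)
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			t.Fatalf("data plane returned %d during control plane outage", resp.StatusCode)
		}
		time.Sleep(10 * time.Second)
	}
}
```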
We do not yet have a mechanism that guarantees access to the API server when the data plane is putting too much pressure on the control plane. We'll implement break-glass mechanisms to ensure that engineers can access the Kubernetes API servers under any circumstances.
Our dependency on Kubernetes DNS for service discovery creates a link between our Kubernetes data plane and control plane. We are investing in systems to decouple the data plane from the control plane so that the control plane is not load-bearing for mission-critical services and product workloads.
We'll implement improved caching and dynamic rate limiters for the resources that are necessary for cluster startup, and we'll run regular exercises in which we rapidly replace an entire cluster, with an explicit goal of fast and correct startup.
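As a sketch of the rate-limiting piece, here is a minimal example using the golang.org/x/time/rate package; the fetch function, resource names, and limits are all illustrative rather than our actual implementation. The idea is to cap how fast newly started services pull startup resources and to adjust the cap at runtime as the cluster recovers.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// fetchResource stands in for whatever a service downloads at startup
// (images, model weights, configuration); it is a placeholder.
func fetchResource(name string) { fmt.Println("fetched", name) }

func main() {
	// Allow 10 fetches per second with small bursts; numbers are illustrative.
	limiter := rate.NewLimiter(rate.Limit(10), 20)
	ctx := context.Background()

	resources := []string{"config", "model-shard-0", "model-shard-1"}
	for _, r := range resources {
		// Wait blocks until the limiter permits another fetch, smoothing the
		// thundering herd when many services start at once.
		if err := limiter.Wait(ctx); err != nil {
			panic(err)
		}
		fetchResource(r)
	}

	// The "dynamic" part: loosen or tighten the limit at runtime, for example
	// based on how loaded the shared backend looks.
	limiter.SetLimit(rate.Limit(50))
	limiter.SetBurst(100)
}
```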
We apologize for the impact that this incident caused to all of our customers – from ChatGPT users to developers to businesses who rely on OpenAI products. We’ve fallen short of our own expectations. We recognize that it is critical to provide a highly reliable service to all of you, and will be prioritizing the preventative measures outlined above to continue improving reliability. Thank you for your patience during this disruption.