On November 25, 2024, starting at 10:15am PT, a large portion of both API and ChatGPT traffic failed with timeouts and 503 errors.
ChatGPT saw degraded performance for Paid and Enterprise users from 10:15am to 12pm, peaking at a 13% error rate for Paid and a 23% error rate for Enterprise. Degraded performance for Free users continued until 1:20pm. Search in ChatGPT was also disabled between 11:15am and 12:05pm.
API impact was most pronounced for the gpt-4-turbo-preview, gpt-4-0125-preview, and text-embedding-3-large models, where clients saw increased latency and error rates from 10:15am until as late as 1pm.
A global change to Kubernetes namespace labels inadvertently triggered a recomputation of metadata by our networking layer. This change rolled out to our small clusters safely but overwhelmed the control plane in three of our largest GPU clusters, leading to cascading failures, significant latency, and elevated error rates across multiple OpenAI products.
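To make the failure mode concrete, the sketch below is a minimal illustration only, not our actual networking component: a controller that watches Namespace objects and recomputes per-namespace networking metadata whenever a namespace changes. When a label is applied fleet-wide, every namespace fires an event at roughly the same time, and each handler issues LIST calls against the API server, producing the kind of sudden burst that can overwhelm a Kubernetes control plane.

```python
# Illustrative sketch only (not our production code): a controller that
# watches Namespace objects and recomputes networking metadata whenever a
# namespace is modified. A fleet-wide label update makes every namespace
# emit a MODIFIED event at roughly the same time, and each handler issues
# LIST calls against the API server.
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
core = client.CoreV1Api()


def recompute_networking_metadata(namespace: str) -> None:
    # Hypothetical recomputation: re-list pods and endpoints in the namespace
    # to rebuild routing/service-discovery state. Cheap for one namespace,
    # expensive when thousands of namespaces do it simultaneously.
    core.list_namespaced_pod(namespace)
    core.list_namespaced_endpoints(namespace)


def main() -> None:
    w = watch.Watch()
    for event in w.stream(core.list_namespace):
        if event["type"] == "MODIFIED":
            recompute_networking_metadata(event["object"].metadata.name)


if __name__ == "__main__":
    main()
```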
We mitigated the impact by manually moving workloads out of the affected clusters, a task made more complex by the unresponsive control planes in those large clusters. A rough sketch of this kind of move follows below.
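The sketch below illustrates one way such a manual move can be done; the cluster contexts, namespace, and deployment name are hypothetical, and the real process involved far more than a single deployment. It scales a deployment down in the impacted cluster and up in a healthy one, using short request timeouts and retries because the impacted control plane was responding slowly.

```python
# Illustrative sketch only: shifting a workload out of an impacted cluster by
# scaling its Deployment down there and up in a healthy cluster. The cluster
# contexts, namespace, and deployment name are hypothetical.
import time

from kubernetes import client, config


def scale_deployment(context: str, namespace: str, name: str, replicas: int) -> None:
    # Short request timeouts plus exponential backoff, because the impacted
    # cluster's API server was slow or unresponsive during the incident.
    api = client.AppsV1Api(api_client=config.new_client_from_config(context=context))
    body = {"spec": {"replicas": replicas}}
    for attempt in range(5):
        try:
            api.patch_namespaced_deployment_scale(
                name, namespace, body, _request_timeout=10
            )
            return
        except Exception:  # timeouts / 5xx from an overloaded control plane
            time.sleep(2 ** attempt)
    raise RuntimeError(f"could not scale {name} in {context}")


# Drain the impacted cluster and absorb the traffic elsewhere.
scale_deployment("impacted-gpu-cluster", "inference", "model-server", 0)
scale_deployment("healthy-gpu-cluster", "inference", "model-server", 200)
```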
Once we identified the root cause, we worked with our cloud provider to scale up the number of control plane nodes while simultaneously scaling down the number of data plane nodes. This allowed the control plane to recover, and we have since moved some major workloads back into these clusters with additional protections.
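For the data plane side, the sketch below illustrates why removing worker nodes relieves a struggling control plane: fewer nodes means fewer kubelet heartbeats, watches, and endpoints pressing on the API servers. The actual scale-down was performed by resizing node pools with our cloud provider; the batch size and node label here are assumptions for illustration only.

```python
# Illustrative sketch only: cordoning and draining data plane (worker) nodes
# in small batches so the remaining control plane capacity isn't hit with a
# new spike of rescheduling work all at once. Batch size and label selector
# are hypothetical.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

BATCH_SIZE = 25  # drain a small batch at a time to avoid a fresh load spike


def cordon(node_name: str) -> None:
    # Mark the node unschedulable so no new pods land on it.
    core.patch_node(node_name, {"spec": {"unschedulable": True}})


def drain(node_name: str) -> None:
    # Delete the node's pods so they reschedule onto the remaining nodes.
    # A real drain would use the Eviction API, respect PodDisruptionBudgets,
    # and skip DaemonSet pods; this is simplified.
    pods = core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}"
    )
    for pod in pods.items:
        core.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)


workers = core.list_node(label_selector="node-role.kubernetes.io/worker").items
for node in workers[:BATCH_SIZE]:
    cordon(node.metadata.name)
    drain(node.metadata.name)
```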
In the short term, we've already:
We're continuing to work on:
We're also investing in several critical improvements across our infrastructure, most notably stronger change management for fleet-wide infrastructure changes.
We know that our customers rely on OpenAI to be available, and that extended outages like these are especially damaging. We're investing heavily in these areas and will continue to improve our service reliability in the coming days and months.