OpenAI

Partial outage for GPT-4o free users
Updates

Write-up published

Resolved

On May 21, 2024, from 11:25 am PT to 12:26 pm PT, a large portion of requests to the ChatGPT gpt-4o free tier failed with 5xx status codes.

At 11:30 am PT, engineers noticed that the routing configuration for gpt-4o no longer had any backing services and took immediate action to reapply the expected configuration to the system. First, rate limits for free tier traffic were decreased to avoid overwhelming the few instances that would initially be backing gpt-4o. Traffic was then gradually dialed back up as the system performing the reconfiguration added more of the backing services to the routing configuration. By design, this configuration process ramps up traffic in stages and takes several minutes between steps to configure more capacity for a service, which is why the incident lasted much longer than the time it took to begin executing the mitigation.
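The relationship between restored capacity and admitted traffic described above can be illustrated with a minimal sketch. It is not the actual mechanism used during the incident: the function name `allowed_rate`, the `headroom` factor, and the proportional scaling rule are all assumptions chosen for illustration.

```python
def allowed_rate(full_rate: float, ready: int, total: int, headroom: float = 0.8) -> float:
    """Return a rate limit scaled to the fraction of backends currently routable.

    Hypothetical helper: `full_rate` is the normal limit, `ready` is the number
    of backends already re-added to routing, `total` is the expected fleet size,
    and `headroom` keeps the admitted load below the restored capacity.
    """
    if total <= 0 or ready <= 0:
        # Nothing is backing the service, so admit no traffic.
        return 0.0
    fraction = min(ready, total) / total
    return full_rate * fraction * headroom


# Each step admits more traffic as more backends are re-added to routing.
for ready in (0, 2, 5, 10):
    print(ready, allowed_rate(1000.0, ready, 10))
```

Under these assumptions the limit rises only as fast as backends return to the routing configuration, which is one reason a staged recovery takes longer than the initial corrective action.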

The root cause was later determined to be a bug in a new code path that enables draining all workloads in a cluster, an operation that was being attempted for the first time.

An error in the logic led to a misconfiguration that resulted in 100% loss of the services that back gpt-4o free tier, spanning multiple clusters, instead of just impacting a single cluster. This meant there were no backing services configured to answer requests for gpt-4o free tier traffic, in line with the symptoms observed by the triaging engineer.

The core bug causing the incident has been fixed.

Further hardening is under way to introduce inertia into the cluster drain process, so as to avoid the same level of catastrophic loss, and to warn operators before large actions are taken.
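One way to picture this kind of guard is a pre-flight check that refuses a drain if it would remove too large a share of any service's backends. The sketch below is an assumption-laden illustration, not the hardening that was actually implemented: `check_drain`, `DrainTooLargeError`, the `max_loss` threshold, and the `(service, cluster)` data model are all hypothetical.

```python
from collections import Counter


class DrainTooLargeError(Exception):
    """Raised when a drain would remove too much of a service's capacity."""


def check_drain(backends, clusters_to_drain, max_loss=0.5):
    """Return the per-service loss fraction, or raise if any exceeds max_loss.

    `backends` is a list of (service, cluster) pairs, one per backing instance.
    """
    total = Counter(service for service, _ in backends)
    lost = Counter(
        service for service, cluster in backends if cluster in clusters_to_drain
    )
    loss = {service: lost[service] / total[service] for service in total}
    offenders = {s: f for s, f in loss.items() if f > max_loss}
    if offenders:
        raise DrainTooLargeError(f"drain exceeds {max_loss:.0%} loss for: {offenders}")
    return loss


# Hypothetical fleet: one service with two backends in each of three clusters.
fleet = [("gpt-4o-free", c) for c in ("a", "a", "b", "b", "c", "c")]
print(check_drain(fleet, {"a"}))  # a single-cluster drain passes the check
```

With this fleet, draining only cluster `a` removes a third of the backends and is allowed, while a plan that touches all three clusters would raise before any change is made.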

The systems are also being adapted to better explain specifically what actions will be performed when enacting such large operations.
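A dry-run summary is one common way to make the effect of a large operation explicit before it runs. The following is a minimal sketch under assumed names: `describe_drain` and the `(service, cluster)` pair model are hypothetical and do not reflect the actual tooling.

```python
def describe_drain(backends, clusters_to_drain):
    """Return human-readable lines describing what a drain would remove.

    `backends` is a list of (service, cluster) pairs, one per backing instance.
    Nothing is modified; this only reports the planned effect.
    """
    lines = []
    for service in sorted({s for s, _ in backends}):
        total = sum(1 for s, _ in backends if s == service)
        removed = sum(
            1 for s, c in backends if s == service and c in clusters_to_drain
        )
        lines.append(
            f"{service}: remove {removed} of {total} backends, {total - removed} remain"
        )
    return lines


# Hypothetical fleet with one backend in each of two clusters.
for line in describe_drain([("gpt-4o-free", "a"), ("gpt-4o-free", "b")], {"a"}):
    print(line)
```

A summary like this would make a plan that leaves zero backends remaining for a service visible to the operator before the operation is confirmed.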

Additionally, the team is making it easier to undo changes quickly; the difficulty of doing so prevented us from reverting the issue sooner.

We know that outages to the ChatGPT service affect our customers. While we came up short here, we are committed to preventing such incidents in the future.

Wed, May 29, 2024, 04:24 PM

Resolved

Between 11:25AM PDT and 12:26PM PDT today, we saw elevated error rates for free users utilizing the GPT-4o model on ChatGPT. We rolled out a fix, and have observed performance returning to expected levels. This incident is now resolved.

Tue, May 21, 2024, 07:48 PM (1 week earlier)

Monitoring

We have deployed a fix for the issue, and we are monitoring the results.

Tue, May 21, 2024, 07:30 PM (18 minutes earlier)

Identified

We have identified the issue causing the errors for free users. We are currently in the process of implementing a fix.

Tue, May 21, 2024, 06:50 PM (39 minutes earlier)

Investigating

We are currently investigating errors for free users utilizing GPT-4o.

Tue, May 21, 2024, 06:40 PM
