Partial outage for GPT-4o free users
Incident Report for OpenAI
Postmortem

On May 21, 2024, from 11:25 am PT to 12:26 pm PT, a large portion of ChatGPT free tier requests to gpt-4o failed with 5xx status codes.

At 11:30 am PT, engineers noticed that the routing configuration for gpt-4o no longer had any backing services and took immediate action to reapply the expected configuration. First, rate limits for free tier traffic were lowered so that the few instances initially backing gpt-4o would not be overwhelmed. Traffic was then gradually dialed back up as the system performing the re-configuration added more of the backing services to the routing configuration. By design, this configuration process adds capacity for a service in stages and takes several minutes between steps, which is why the incident lasted much longer than the time it took to begin the mitigation.
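
As an illustration of the recovery approach described above, the following is a minimal sketch of tying the admitted share of traffic to the share of backing services that have been restored. It is not the actual system: the function name `admitted_fraction`, its parameters, and the `headroom` safety margin are all hypothetical.

```python
def admitted_fraction(configured: int, expected: int, headroom: float = 0.8) -> float:
    """Return the share of normal traffic to admit (0.0 to 1.0).

    Hypothetical sketch: `configured` is the number of backing services
    currently in the routing configuration, `expected` is the number at
    full capacity, and `headroom` keeps load below restored capacity
    while the ramp is still in progress.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if not 0.0 < headroom <= 1.0:
        raise ValueError("headroom must be in (0, 1]")

    # Clamp so that a bad count can never yield a negative or >100% share.
    restored = min(max(configured, 0), expected) / expected

    # Once everything is back, lift the limit entirely.
    if restored >= 1.0:
        return 1.0
    return round(restored * headroom, 4)


if __name__ == "__main__":
    # Each stage of the re-configuration adds more backing services.
    for configured in (0, 10, 25, 40):
        print(configured, admitted_fraction(configured, expected=40))
```

With no backing services configured the sketch admits no traffic at all, and the limit rises only as each stage of the re-configuration completes.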

The root cause was later determined to be a bug in a new code path for draining all workloads in a cluster, an operation that was being attempted for the first time.

An error in the logic led to a misconfiguration that resulted in 100% loss of the services that back gpt-4o free tier, spanning multiple clusters, instead of just impacting a single cluster. This meant there were no backing services configured to answer requests for gpt-4o free tier traffic, in line with the symptoms observed by the triaging engineer.

The core bug causing the incident has been fixed.

Further hardening is underway to introduce inertia into the cluster drain process, so as to avoid the same level of catastrophic loss, and to warn operators before large actions are taken.
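
One possible shape for such a safeguard is sketched below: a check that refuses a drain if it would remove more than a set share of any service's backing capacity, unless the operator explicitly overrides it. This is an assumption about how a guard of this kind could look, not a description of the actual hardening work. The inventory layout, `check_drain`, `DrainRefused`, and the 50% threshold are hypothetical.

```python
class DrainRefused(Exception):
    """Raised when a drain would remove too much of a service's capacity."""


def check_drain(inventory, cluster, max_loss=0.5, force=False):
    """Return the fraction of backends each service would lose.

    Hypothetical sketch. `inventory` maps a service name to a mapping of
    cluster name -> number of backends, for example
    {"svc-a": {"c1": 2, "c2": 8}}. The drain is refused when any service
    would lose more than `max_loss` of its backends, unless `force` is set.
    """
    losses = {}
    for service, per_cluster in inventory.items():
        total = sum(per_cluster.values())
        if total == 0:
            continue  # nothing to lose for this service
        losses[service] = per_cluster.get(cluster, 0) / total

    too_large = {s: f for s, f in losses.items() if f > max_loss}
    if too_large and not force:
        details = ", ".join(f"{s} would lose {f:.0%}" for s, f in sorted(too_large.items()))
        raise DrainRefused(f"draining {cluster!r} refused: {details}")
    return losses
```

Requiring an explicit `force` for large losses is one way to add inertia: routine drains proceed unchanged, while a drain that would take a service to zero capacity needs a deliberate second step.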

The systems are also being adapted to better explain specifically what actions will be performed when enacting such large operations.
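
A dry-run summary is one common way to provide this kind of explanation. The sketch below, which is illustrative only, lists for each service how many backends a drain would remove and how many would remain, before anything is changed. The function `describe_drain` and the inventory layout are hypothetical.

```python
def describe_drain(inventory, cluster):
    """Return human-readable lines describing what a drain would do.

    Hypothetical sketch. `inventory` maps a service name to a mapping of
    cluster name -> number of backends. Nothing is modified; the result
    is meant to be shown to the operator for confirmation.
    """
    lines = []
    for service in sorted(inventory):
        per_cluster = inventory[service]
        total = sum(per_cluster.values())
        removed = per_cluster.get(cluster, 0)
        if removed == 0:
            continue  # this service is unaffected by the drain
        remaining = total - removed
        lines.append(
            f"{service}: remove {removed} of {total} backends "
            f"({removed / total:.0%}), {remaining} remain"
        )
    return lines


if __name__ == "__main__":
    example = {"svc-a": {"c1": 2, "c2": 8}, "svc-b": {"c2": 4}}
    for line in describe_drain(example, "c1"):
        print(line)
```

A line reporting that 100% of a service's backends would be removed, with 0 remaining, makes the scope of the operation visible before it is carried out.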

Additionally, the team is making it easier to undo changes quickly; the difficulty of doing so prevented us from reverting the issue sooner.

We know that outages to the ChatGPT service affect our customers. While we came up short here, we are committed to preventing such incidents in the future.

Posted May 29, 2024 - 09:24 PDT

Resolved
Between 11:25AM PDT and 12:26PM PDT today, we saw elevated error rates for free users utilizing the GPT-4o model on ChatGPT. We rolled out a fix, and have observed performance returning to expected levels. This incident is now resolved.
Posted May 21, 2024 - 12:48 PDT
Monitoring
We have deployed a fix for the issue, and we are monitoring the results.
Posted May 21, 2024 - 12:30 PDT
Identified
We have identified the issue causing the errors for free users. We are currently in the process of implementing a fix.
Posted May 21, 2024 - 11:50 PDT
Investigating
We are currently investigating errors for free users utilizing GPT-4o.
Posted May 21, 2024 - 11:40 PDT
This incident affected: ChatGPT.