Elevated errors for ChatGPT

Write-up

On February 19, 2025, from 9:48 AM to 11:19 AM PT, ChatGPT experienced a service degradation that caused a significant increase in failed conversation attempts. Many users received blank responses.

The root cause was a misconfigured internal experiment that unintentionally triggered a surge in traffic, overwhelming our inference infrastructure. The added load saturated compute resources, so requests failed before a response could be generated.

After identifying the root cause, we temporarily shed load from free-tier users to stabilize the system. As capacity was restored, paid users gradually recovered, and full service was restored by 11:19 AM PT.
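
To illustrate the general idea of shedding load by tier, here is a minimal sketch. It is not our production logic: the tier names, the utilization thresholds, and the `should_admit` function are all assumptions made for the example. Requests from lower-priority tiers are rejected first as utilization climbs, which leaves headroom for the remaining traffic.

```python
# Hypothetical per-tier thresholds: a tier is shed once utilization
# reaches its threshold. The values are illustrative only.
SHED_THRESHOLDS = {"free": 0.80, "paid": 0.95}


def should_admit(tier: str, utilization: float) -> bool:
    """Return True if a request from `tier` is admitted.

    `utilization` is the fraction of capacity in use (0.0 to 1.0).
    Unknown tiers are treated like the most restricted known tier.
    """
    strictest = min(SHED_THRESHOLDS.values())
    threshold = SHED_THRESHOLDS.get(tier, strictest)
    return utilization < threshold


if __name__ == "__main__":
    for tier in ("free", "paid"):
        for utilization in (0.50, 0.90, 0.97):
            print(tier, utilization, should_admit(tier, utilization))
```

In this sketch, at 90% utilization free-tier requests are rejected while paid requests are still admitted; both are rejected at 97%.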

As part of the incident response, we are working on the following changes to prevent similar incidents in the future:

Stronger Safeguards: Building better protections around experiment changes and configurations by moving from a uniform approval process to a risk-based model, so that experiments roll out more safely.
Faster Root Cause Identification: Automating notifications for relevant changes and experiments so that the root cause of an increase in failures can be identified more quickly.
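
The two changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of our tooling: the risk criteria, the approval counts, the 30-minute window, and the names `required_approvals` and `recent_changes` are all hypothetical.

```python
from datetime import datetime, timedelta


def required_approvals(traffic_fraction: float, touches_inference: bool) -> int:
    """Risk-based approval: more exposure requires more reviewers.

    Both criteria and the 10% cutoff are assumptions for this example.
    """
    broad = traffic_fraction >= 0.10
    if touches_inference and broad:
        return 2  # high risk
    if touches_inference or broad:
        return 1  # medium risk
    return 0  # low risk


def recent_changes(changes, alert_time, window=timedelta(minutes=30)):
    """Return names of changes deployed within `window` before the alert.

    `changes` is a list of (name, deployed_at) pairs. Results are ordered
    newest first, since the most recent change is often the first suspect.
    """
    start = alert_time - window
    hits = [(t, name) for name, t in changes if start <= t <= alert_time]
    return [name for _, name in sorted(hits, reverse=True)]


if __name__ == "__main__":
    # Made-up change log, unrelated to the incident timeline.
    log = [
        ("exp-a", datetime(2025, 1, 1, 9, 0)),
        ("exp-b", datetime(2025, 1, 1, 9, 40)),
        ("cfg-c", datetime(2025, 1, 1, 9, 50)),
    ]
    print(required_approvals(0.25, touches_inference=True))
    print(recent_changes(log, datetime(2025, 1, 1, 10, 0)))
```

A notification system built along these lines would attach the list returned by `recent_changes` to an error-rate alert, so responders start with the changes most likely to be involved.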

We understand the impact of extended service disruptions and are committed to making our infrastructure more resilient. We appreciate your patience and trust as we continue to improve the reliability of our service.

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.
