On February 3, 2026, beginning at 12:12 PM PST, OpenAI experienced a significant service disruption that impacted login functionality and ChatGPT conversation availability. Users encountered elevated error rates when attempting to log in or create accounts, and availability dropped across ChatGPT plan types. The issue was caused by a configuration change that introduced an unexpected data type in a critical execution path. Increased retry traffic during the disruption amplified load on downstream systems, which slowed recovery in one region. The issue was mitigated for most regions by 1:03 PM PST, with full recovery completed at 3:05 PM PST.
Elevated error rates for ChatGPT conversations across all plan types.
Elevated error rates across authentication and related services.
Login success rates dropped significantly at peak impact.
Account creation success rates were reduced.
Some regions experienced prolonged recovery due to configuration propagation delays.
Availability varied by user tier, with Free and Go tiers experiencing the most significant impact during peak disruption.
The incident was triggered by the rollout of a configuration change that introduced a new caching path for feature gate and configuration evaluation in production. The change caused validation failures in a critical execution path for services that depend on this configuration, driving up error rates.
Although the change had been deployed to staging and tested internally without issue, the production rollout exposed an unexpected behavior difference that was not detected in pre-production environments.
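The failure mode described above, an unexpected data type slipping through to a critical execution path, is the kind of issue that schema validation at the configuration boundary can catch. The sketch below is illustrative only; all names, keys, and types are assumptions, not OpenAI's actual configuration schema.

```python
# Hypothetical sketch: validate feature-gate configuration values against an
# expected schema before they reach the serving path. Keys and types are
# illustrative, not OpenAI's real configuration.
EXPECTED_TYPES = {
    "cache_ttl_seconds": int,
    "caching_enabled": bool,
    "rollout_percentage": float,
}

def validate_config(config: dict) -> list[str]:
    """Return a list of validation errors; an empty list means safe to apply."""
    errors = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in config:
            errors.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            errors.append(
                f"{key}: expected {expected.__name__}, "
                f"got {type(config[key]).__name__}"
            )
    return errors

# A string where an int is expected -- the kind of unexpected data type that
# staging can miss when test fixtures do not mirror production payloads.
bad = {"cache_ttl_seconds": "300", "caching_enabled": True, "rollout_percentage": 1.0}
print(validate_config(bad))  # → ['cache_ttl_seconds: expected int, got str']
```

Rejecting a malformed payload at distribution time fails one rollout rather than every service that consumes the configuration.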
During the outage, elevated login failures triggered increased retry behavior from clients. The retry traffic amplified load on downstream services and temporarily degraded configuration distribution; this feedback loop slowed recovery in one region.
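Capped exponential backoff with jitter is a standard client-side technique for keeping retry traffic from amplifying a partial outage like this. The sketch below is a generic illustration under assumed parameters, not OpenAI's client behavior.

```python
# Sketch: capped exponential backoff with full jitter. Spreading retries out
# randomly prevents clients from hammering a degraded service in synchronized
# waves. Parameter values are illustrative assumptions.
import random

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Compute a jittered delay (in seconds) for each retry attempt."""
    delays = []
    for attempt in range(attempts):
        # Exponential growth, capped so late retries do not wait unboundedly.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: draw uniformly from [0, ceiling] to desynchronize clients.
        delays.append(random.uniform(0, ceiling))
    return delays

print(backoff_delays(5))
```

Without the jitter term, clients that failed at the same moment retry at the same moment, reproducing the original traffic spike on every retry cycle.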
The incident was mitigated by disabling the feature gate responsible for the incorrect configuration behavior. Once it was disabled, services began recovering immediately, at around 1:01 PM PST. One region experienced delayed configuration propagation, and full recovery completed at around 3:05 PM PST.
We also removed lagging components from a dependent storage system to reduce latency and restore stability.
We are implementing the following improvements to prevent similar incidents in the future:
Improving validation for configuration and feature flag APIs to reject unexpected data types before they reach production.
Enhancing safeguards on production rollouts.
Improving alerting and observability around configuration health and dependency services.
Introducing circuit breaker protections to prevent retry amplification during partial outages.
Strengthening internal rollout playbooks for high-blast-radius systems.
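The circuit-breaker item above refers to a well-known pattern: after a burst of consecutive failures, callers fail fast instead of retrying, shedding load from the struggling dependency until a cooldown elapses. A minimal sketch, with assumed thresholds and no claim to match OpenAI's implementation:

```python
# Minimal circuit-breaker sketch. After `failure_threshold` consecutive
# failures the breaker opens and rejects calls until `cooldown_s` elapses,
# then allows a probe request through (half-open). Illustrative only.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        """Return False while the breaker is open and the cooldown has not elapsed."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            # Half-open: reset and let one probe request test the dependency.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
```

During an incident like this one, an open breaker converts a retry storm into cheap fast failures, protecting downstream configuration and authentication services while they recover.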
We sincerely apologize for the disruption and are committed to continuously bolstering the reliability of our services.