Elevated Error Rate on ChatGPT

Write-up

Summary

On July 15, 2025, ChatGPT experienced reduced availability between 7:43 PM and 8:38 PM PDT due to an invalid configuration value used by a core service. This issue caused new server instances to fail when restarting, resulting in elevated error rates for a small portion of ChatGPT users. We resolved the issue by restoring the correct configuration and forcing an immediate resynchronization of affected services. We are implementing stronger safeguards around critical configuration updates, splitting configuration shared between services, and improving deployment safeguards to prevent similar issues.

Impact

Elevated error rates in ChatGPT conversations for a small percentage of users for approximately 55 minutes, from 7:43 PM to 8:38 PM PDT.

Root Cause

An internal engineer applied a configuration change with an invalid value. The configuration value was read by multiple services, which expanded the blast radius. The invalid value propagated quickly, causing pods to enter crash loops and leading to elevated error rates for a subset of ChatGPT users.

Mitigation

The misconfiguration was identified and reverted to the correct value at 7:12 PM PDT.
A rolling restart of affected pods was initiated to accelerate recovery, with full availability restored by 8:38 PM PDT.

Prevention

We are taking the following steps to prevent a recurrence:

Future critical configuration changes for the service will roll out to a single cluster first with automated health guards that halt further rollout if availability degrades.
We will migrate to service-specific configuration values to minimize cross-service dependency risks.
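
For illustration only, the single-cluster-first rollout with an automated health guard could be sketched roughly as follows. This is a minimal sketch and not the actual deployment tooling: the function name `staged_rollout`, the callback parameters, the 99.9% availability threshold, and the choice to roll back clusters that already received the change are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    clusters: Sequence[str],
    apply_change: Callable[[str], None],
    availability: Callable[[str], float],
    rollback: Callable[[str], None],
    min_availability: float = 0.999,  # hypothetical threshold
) -> List[str]:
    """Apply a change one cluster at a time, starting with the first one.

    After each cluster is updated, its availability is checked. If it falls
    below the threshold, the rollout halts, every cluster updated so far is
    rolled back (an assumption of this sketch), and an empty list is returned.
    Otherwise the list of updated clusters is returned.
    """
    updated: List[str] = []
    for cluster in clusters:
        apply_change(cluster)
        updated.append(cluster)
        if availability(cluster) < min_availability:
            # Health guard tripped: stop here and undo in reverse order.
            for done in reversed(updated):
                rollback(done)
            return []
    return updated


if __name__ == "__main__":
    # Hypothetical cluster names and availability readings.
    readings = {"canary": 0.95, "prod-a": 1.0, "prod-b": 1.0}
    applied: List[str] = []
    reverted: List[str] = []
    result = staged_rollout(
        ["canary", "prod-a", "prod-b"],
        applied.append,
        lambda name: readings[name],
        reverted.append,
    )
    print(result, applied, reverted)
```

In this sketch the first entry in `clusters` plays the role of the single cluster that receives the change first, so an invalid value that degrades availability there never reaches the remaining clusters.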

We sincerely apologize for the disruption and are improving our safeguards to prevent this from happening in the future.