Summary:
Between 2:10 PM and 3:05 PM PDT on July 9, 2025, a faulty code change was deployed to a control plane cluster that is primarily responsible for managing low priority workloads. However, this cluster was also managing some production inference workloads, leading to degraded performance for certain API model endpoints and reduced availability for ChatGPT search.
Impact:
API success rates for the gpt-4.1-mini model were degraded for ~20 minutes
ChatGPT search experienced a significant increase in error rates for ~35 minutes
Mitigation and Prevention
We quickly rolled back the faulty code change to restore normal operations. We’ve since implemented safeguards and are continuing to make infrastructure improvements to prevent this from happening in the future.
Completed actions:
Added safeguards to prevent production inference workloads from being managed by the control plane cluster.
Added strengthened testing mechanisms for deploying high-impact control plane changes.
In-progress actions:
Migrating all existing production inference workloads to be managed by the production control plane cluster.
Implement a circuit breaker mechanism designed to prevent our services from large-scale unintended changes.
We sincerely apologize for the disruption and are committed to improving our safeguards to prevent a recurrence.