On March 4, 2026, between approximately 4:26 PM and 4:56 PM PST, OpenAI experienced elevated API error rates and latency across several models. The incident primarily affected API traffic, and some Codex traffic was also impacted. The issue was caused by an internal scheduling system executing a large batch of queued infrastructure actions simultaneously, which temporarily reduced available model capacity.
During this window, overall API success rates temporarily dropped, and some requests experienced increased latency or failures. Impact varied by model, with some models experiencing higher error rates than others. The disruption lasted about 30 minutes before service was restored.
Affected models included several text and code models: gpt-4.1, gpt-4.1-mini, gpt-4o-mini, gpt-5-series models, OpenAI o3, and Codex models. gpt-4o-mini was the most affected major model.
The issue occurred due to an interaction between automated infrastructure management systems responsible for allocating compute capacity to models.
A protective circuit breaker had previously paused certain scheduling operations. During this pause, automated capacity management systems continued to queue adjustments based on outdated system state. When the circuit breaker was later released, the queued changes were executed simultaneously.
This sudden batch of capacity reassignments caused multiple inference engines to be temporarily removed from service at the same time, leading to reduced available capacity and elevated error rates.
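The failure mode described above can be illustrated with a simplified sketch. This is a hypothetical model, not OpenAI's actual scheduler: the class names, the engine counts, and the drain API are all invented for illustration. It shows how actions queued against stale state during a circuit-breaker pause, when released naively, remove a large fraction of capacity at once.

```python
# Hypothetical sketch of the failure mode: while a circuit breaker pauses
# execution, capacity actions keep queuing; on release they all run at once.
from collections import deque

class Scheduler:
    def __init__(self, total_engines):
        self.in_service = total_engines
        self.paused = False          # circuit breaker state
        self.queue = deque()         # actions deferred while paused

    def request_drain(self, engines):
        """Request that `engines` be taken out of service for reassignment."""
        if self.paused:
            self.queue.append(engines)   # adjustments accumulate against stale state
        else:
            self.in_service -= engines

    def release_breaker(self):
        # Naive release: every queued action executes in one batch,
        # so many inference engines leave service simultaneously.
        self.paused = False
        while self.queue:
            self.in_service -= self.queue.popleft()

sched = Scheduler(total_engines=100)
sched.paused = True
for _ in range(6):                 # adjustments queued during the pause
    sched.request_drain(10)
sched.release_breaker()
print(sched.in_service)            # 40 -- a sudden 60% capacity drop
```

Individually, each queued action is safe; the incident arose because nothing bounded how many could execute together after an extended pause.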
Engineers quickly identified the capacity imbalance and manually reassigned infrastructure resources to affected models. As engines were brought back online and reassigned, service recovered and error rates returned to normal by 4:56 PM PST.
We are implementing several improvements to reduce the likelihood of similar incidents:
- Improving safeguards in automated capacity management systems to better detect and respond to state drift.
- Enhancing observability for circuit breakers and scheduling systems so their status and effects are easier to identify.
- Adding safeguards designed to ensure queued actions cannot execute in large batches after extended pauses.
- Improving monitoring and alerting around capacity reconciliation issues.
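One of the safeguards above, preventing queued actions from executing in large batches, can be sketched as a capped drain loop. This is an illustrative assumption, not the actual fix: the cap value, function name, and queue representation are invented. The idea is that after a pause, only a bounded amount of capacity may leave service per reconciliation cycle, with the remainder deferred.

```python
# Hypothetical sketch of one safeguard: drain queued capacity actions
# gradually, capping how many engines can leave service per cycle.
from collections import deque

MAX_DRAIN_PER_CYCLE = 10   # assumed cap on engines removed per cycle

def drain_safely(queue: deque, in_service: int) -> int:
    """Execute queued drain actions up to the per-cycle cap;
    leftover actions stay queued for the next cycle."""
    removed = 0
    while queue and removed + queue[0] <= MAX_DRAIN_PER_CYCLE:
        removed += queue.popleft()
    return in_service - removed

queue = deque([10, 10, 10, 10, 10, 10])   # 60 engines' worth of queued actions
capacity = drain_safely(queue, in_service=100)
print(capacity)      # 90 -- only one cycle's worth executed
print(len(queue))    # 5 actions remain for later cycles
```

Spreading the drain across cycles gives reconciliation and monitoring systems time to detect an imbalance before it becomes user-visible.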
We apologize for the disruption and are continuing to strengthen the resilience and visibility of our infrastructure systems.