Degraded API performance
Incident Report for OpenAI
Postmortem

Last week, a cascading set of failures, partially resulting from historically high load and unexpected upstream interruptions, led to degraded performance on our API. Not all customers were affected, but some observed increased latencies when making requests for completions, in some cases leading to timeouts. Some customers, particularly those using fine-tuned models, also observed HTTP 429 errors with a message that the requested model was still being loaded. In some instances, requests were dropped entirely with HTTP 503 errors.
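
As general guidance for clients that encountered these responses, both the 429 "model is still being loaded" errors and the 503 errors can be treated as retryable, with a backoff between attempts. The sketch below illustrates one way to do this against the completions endpoint using the requests library; the API key environment variable, model name, and backoff parameters are illustrative assumptions, not a prescribed integration.

import os
import time

import requests

API_URL = "https://api.openai.com/v1/completions"
API_KEY = os.environ["OPENAI_API_KEY"]  # assumes the key is set in the environment


def complete_with_retries(payload, max_retries=5, base_delay=1.0, timeout=30):
    """POST a completion request, retrying 429/503 responses and timeouts with exponential backoff."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(
                API_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=payload,
                timeout=timeout,
            )
        except requests.Timeout:
            # Treat a client-side timeout like a retryable failure.
            pass
        else:
            if resp.status_code == 200:
                return resp.json()
            if resp.status_code not in (429, 503):
                # Anything other than 429/503 is not retried here.
                resp.raise_for_status()
        # Exponential backoff before the next attempt: 1s, 2s, 4s, ...
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Request failed after {max_retries} attempts")


# Example call with a hypothetical fine-tuned model name; substitute your own.
result = complete_with_retries(
    {"model": "curie:ft-your-org-2022-05-18", "prompt": "Hello", "max_tokens": 5}
)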

We have taken immediate steps to resolve these issues. We have also made a number of investments, and prioritized further ones, to ensure we do not encounter these issues again even as our request volume continues to increase. We have fixed several newly identified bugs in our system, made ourselves more resilient to upstream failures from our cloud provider, and improved parts of our stack whose capacity was historically fixed so that they now scale with increased load.

Latency and reliability are the highest priorities for our team. We deeply apologize for the interruptions and degradation of service.

Posted May 25, 2022 - 10:15 PDT

Resolved
This incident has been resolved.
Posted May 18, 2022 - 15:14 PDT
Monitoring
Our fix in production appears to have resolved the issue. We are continuing to monitor.
Posted May 18, 2022 - 14:44 PDT
Identified
We have identified an issue affecting some customers. We are working on remediation.
Posted May 18, 2022 - 14:30 PDT
This incident affected: API.