On November 8th, between 5:42AM and 7:16AM PT, a large portion of requests to OpenAI failed with 502 or 503 error codes. All models and API endpoints saw significant failure rates over the course of the event.
The outage occurred because routing layer nodes hit memory limits and failed readiness checks. Eventually, enough nodes in the service became unavailable that the remaining capacity could not serve the incoming traffic, and the service was unable to recover. That morning saw dramatically more completions traffic than any prior day, which we believe pushed the service past its tipping point.
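To make the failure mode concrete, the following is a minimal toy model of that cascade, not our actual routing layer; all names and numbers (node counts, per-node capacity, request rates) are hypothetical and chosen only to show how readiness-check failures under memory pressure can snowball once aggregate capacity is exceeded.

```python
# Toy model of a readiness-check cascade; all values are hypothetical.

def simulate(total_nodes: int, per_node_capacity_rps: float, incoming_rps: float) -> int:
    """Return how many routing nodes remain in rotation once the system settles."""
    healthy = total_nodes
    while healthy > 0:
        per_node_load = incoming_rps / healthy
        if per_node_load <= per_node_capacity_rps:
            return healthy  # remaining nodes can absorb the load
        # Overloaded nodes hit their memory limit, fail readiness checks, and are
        # removed from the pool, concentrating the same traffic on fewer nodes.
        healthy -= 1
    return 0  # no capacity left; the service cannot recover without intervention

print(simulate(100, 120, 11_000))  # below aggregate capacity: settles with nodes still healthy
print(simulate(100, 120, 13_000))  # spike above aggregate capacity: cascades to 0 healthy nodes
```

The second call illustrates why the service could not self-heal: once incoming traffic exceeds what the full fleet can absorb, every node that drops out only increases the load on the survivors.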
The issue was mitigated through a combination of limiting incoming traffic, redeploying the service en masse, and then slowly ramping traffic back up.
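A sketch of what the slow ramp-up step can look like is shown below. The `set_admitted_fraction` and `current_error_rate` hooks are hypothetical stand-ins for a traffic-limiting layer and service metrics, not our actual tooling; the thresholds and hold times are illustrative.

```python
# Illustrative traffic ramp-up: admit traffic in steps and back off if errors return.
import time

RAMP_STEPS = [0.1, 0.25, 0.5, 0.75, 1.0]   # fraction of normal traffic admitted
ERROR_RATE_THRESHOLD = 0.01                # back off if the 5xx rate exceeds 1%

def set_admitted_fraction(fraction: float) -> None:
    """Hypothetical hook into the traffic-limiting layer."""
    print(f"admitting {fraction:.0%} of incoming traffic")

def current_error_rate() -> float:
    """Hypothetical hook into service metrics (fraction of 502/503 responses)."""
    return 0.0

for fraction in RAMP_STEPS:
    set_admitted_fraction(fraction)
    time.sleep(300)  # let the redeployed nodes warm up and stabilize at this level
    if current_error_rate() > ERROR_RATE_THRESHOLD:
        set_admitted_fraction(fraction / 2)  # back off and hold before retrying
        break
```

The key design choice is that each step gives the freshly redeployed nodes time to stabilize before more load is admitted, so a recurrence of the memory-pressure cascade is caught early rather than at full traffic.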
As part of the incident response, we have already implemented the following measures:
Additionally, we will be implementing the following changes to prevent future incidents altogether:
We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.