Major Outage across ChatGPT and API

Write-up

On November 8th between 5:42AM - 7:16AM PT a large portion of requests to OpenAI failed with 502 or 503 error codes. All models and API endpoints saw significant failures over the course of the event.

The outage occurred due to routing layer nodes hitting memory limits and failing readiness checks. Eventually, a sufficient number of nodes in the service became unavailable, that there wasn’t sufficient capacity to serve the incoming traffic and the service was unable to recover. That morning, there were dramatically more completions than any prior day, which we believe was the sufficient tipping point for the service.

The issue was mitigated through a combination of limiting incoming traffic, redeploying the service en masse and then slowly ramping the traffic back up.

‌

As part of the incident response, we have already implemented the following measures:

‌

The underlying issue which led to the chronic memory issues, was that we were continually allocating a new responseBuffer in a loop rather than reusing it. This was causing GC to get behind, particularly when there were lots of requests. The fix was to pre-allocate the buffer and reuse it. Since shipping, we have seen a 3X improvement in both memory and CPU usage.
We have adjusted the configured memory limits to an appropriate level and the service now has significant available headroom.
We have implemented a series of rate limit controls to more gracefully load shed traffic.
As an additional precaution, we increased the service’s capacity.

‌

Additionally, we will be implementing the following changes to prevent future incidents altogether:

Implement a series of alerting changes to catch the underlying memory behavior before they escalate into potential service issues.
It was not possible to enable auto scaling on this service in the past as adding capacity to it could negatively impact an upstream service. The underlying issue has been rectified and we will configure auto scaling for this service.

We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.