Major Outage across ChatGPT and API

Resolved·Full outage

We’ve published a write-up of this incidentRead the write-up

Read it here

Affected components

Nov 8, 2023, 01:54 PM

03:46 PM

Updates

Write-up published

Read it here

Resolved

On November 8th between 5:42AM - 7:16AM PT a large portion of requests to OpenAI failed with 502 or 503 error codes. All models and API endpoints saw significant failures over the course of the event.

The outage occurred due to routing layer nodes hitting memory limits and failing readiness checks. Eventually, a sufficient number of nodes in the service became unavailable, that there wasn’t sufficient capacity to serve the incoming traffic and the service was unable to recover. That morning, there were dramatically more completions than any prior day, which we believe was the sufficient tipping point for the service.

The issue was mitigated through a combination of limiting incoming traffic, redeploying the service en masse and then slowly ramping the traffic back up.

‌

As part of the incident response, we have already implemented the following measures:

‌

The underlying issue which led to the chronic memory issues, was that we were continually allocating a new responseBuffer in a loop rather than reusing it. This was causing GC to get behind, particularly when there were lots of requests. The fix was to pre-allocate the buffer and reuse it. Since shipping, we have seen a 3X improvement in both memory and CPU usage.
We have adjusted the configured memory limits to an appropriate level and the service now has significant available headroom.
We have implemented a series of rate limit controls to more gracefully load shed traffic.
As an additional precaution, we increased the service’s capacity.

‌

Additionally, we will be implementing the following changes to prevent future incidents altogether:

Implement a series of alerting changes to catch the underlying memory behavior before they escalate into potential service issues.
It was not possible to enable auto scaling on this service in the past as adding capacity to it could negatively impact an upstream service. The underlying issue has been rectified and we will configure auto scaling for this service.

We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Thu, Nov 16, 2023, 01:35 AM

Resolved

Between 5:42AM - 7:16AM PT we saw errors impacting all services. We identified the problem and implemented a fix. We are now seeing normal responses from our services.

Wed, Nov 8, 2023, 03:46 PM(1 week earlier)

Monitoring

A fix has been implemented and we are gradually seeing the services recover. We are currently monitoring the situation.

Wed, Nov 8, 2023, 03:33 PM(12 minutes earlier)

Identified

We've identified an issue resulting in high error rates across the API and ChatGPT, and we are working on remediation.

Wed, Nov 8, 2023, 02:50 PM(43 minutes earlier)

Investigating

We are seeing high error rates across the API and ChatGPT and are actively investigating possible causes.

Wed, Nov 8, 2023, 02:49 PM

Investigating

We are continuing to investigate this issue.

Wed, Nov 8, 2023, 02:02 PM(47 minutes earlier)

Investigating

We are currently investigating this issue.

Wed, Nov 8, 2023, 01:54 PM

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.