Elevated API Errors

Resolved · Partial outage

We’ve published a write-up of this incident.

Affected components

Oct 19, 2023, 11:51 PM

Oct 20, 2023, 03:37 AM

Updates

Write-up published

Read it here

Resolved

On October 19, 2023, from 4:46pm to 9:15pm PT, a large portion of requests to OpenAI failed with 500 or 502 error codes. Failures peaked at 60% for GPT-3.5-Turbo API traffic and 70% for GPT-4 API traffic, with similar failure rates for all other API endpoints. A fraction of ChatGPT Enterprise traffic was also affected.

The root cause was our Redis authentication cache becoming largely unresponsive when shards failed while the cache was already under high load. Because our authentication system was not resilient to such a cache failure, a significant fraction of authentication requests timed out. Because the authentication system is keyed by user, some users were heavily affected while others were not affected at all: some API customers saw all requests fail, while others saw only a small percentage fail.
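
To illustrate why a user-keyed cache can produce very uneven impact across customers, here is a minimal sketch. It assumes each user is mapped to a cache shard by hashing a user identifier; the shard count, the set of failed shards, and all function names are hypothetical and are not taken from the actual system.

```python
import hashlib

NUM_SHARDS = 8          # hypothetical shard count
FAILED_SHARDS = {2, 5}  # hypothetical set of unresponsive shards


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user to a shard with a stable hash (an assumed scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def lookup_fails(user_id: str, failed_shards=frozenset(FAILED_SHARDS)) -> bool:
    """A lookup fails whenever the user's shard is unresponsive."""
    return shard_for(user_id) in failed_shards


def user_failure_rate(user_id: str, requests: int = 100) -> float:
    """Fraction of a single user's requests that fail authentication."""
    failures = sum(lookup_fails(user_id) for _ in range(requests))
    return failures / requests
```

In this simplified model every user sees either 0% or 100% failures, because the same user always lands on the same shard. Shards that are slow rather than fully down, or customers whose traffic spans several keys, would produce the intermediate failure rates described above.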

After identifying the root cause, we attempted to scale up the existing Redis cluster and separately provisioned a new, larger Redis cluster with higher resource limits. However, fully restoring the service took longer than we anticipated. While the service improved periodically over the course of the outage, the issue was not fully mitigated until 9:15pm PT.

As part of the incident response, we have already implemented the following measures:

Scaled up the Redis authentication cache to be able to handle traffic even in the presence of multiple shard failures
Modified our cache miss and failure backoff policies to decrease the rate of requests to the cache when it is struggling to serve requests
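
One common way to reduce request pressure on a struggling cache is capped exponential backoff with jitter. The sketch below is illustrative only: this report does not describe the actual policy, and `backoff_delay`, `base`, and `cap` are hypothetical names and values.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a randomized wait (in seconds) before retry number `attempt`.

    The upper bound doubles with each consecutive failure and is capped,
    so clients send fewer requests the longer the cache keeps failing.
    """
    rng = rng or random.Random()
    # Clamp the exponent so very large attempt counts cannot overflow.
    upper = min(cap, base * (2 ** min(attempt, 32)))
    # "Full jitter": pick uniformly in [0, upper] to spread retries out.
    return rng.uniform(0.0, upper)
```

The jitter matters as much as the exponential growth: without it, clients that failed together would retry together and hit the cache in synchronized bursts.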

Additionally, we will be implementing the following changes to prevent future incidents altogether:

Improve monitoring and alerting for our caching and authentication infrastructure
Harden our authentication system against degradations in caching services, so that issues with our authentication caches don’t take down all traffic
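
As a rough illustration of what hardening against cache degradation can look like, the sketch below treats a cache error or timeout as a cache miss and falls back to a slower source of truth. The names `authenticate`, `cache_get`, and `store_get` are hypothetical; this report does not describe the actual design.

```python
from typing import Callable, Optional


def authenticate(token: str,
                 cache_get: Callable[[str], Optional[str]],
                 store_get: Callable[[str], Optional[str]]) -> Optional[str]:
    """Resolve a token to a user ID, tolerating cache failures.

    `cache_get` and `store_get` are hypothetical lookups; each returns
    None when the token is unknown. `cache_get` may raise on failure.
    """
    try:
        user_id = cache_get(token)
        if user_id is not None:
            return user_id
    except Exception:
        # Degraded cache: treat as a miss instead of failing the request.
        pass
    # Cache miss or cache failure: consult the source of truth.
    return store_get(token)
```

A fallback like this shifts load onto the backing store, so in practice it would likely need to be paired with rate limiting or backoff to avoid moving the overload somewhere else.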

We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Fri, Oct 27, 2023, 10:53 PM

Resolved

This incident has been resolved.

Fri, Oct 20, 2023, 04:29 AM (1 week earlier)

Monitoring

We have recovered service, and error rates have returned to normal. We are continuing to monitor the issue.

Fri, Oct 20, 2023, 03:37 AM (52 minutes earlier)

Identified

The underlying issue affecting a subset of our API customers (and ChatGPT by extension) is with an upstream part of our API authentication stack. We're simultaneously scaling up the underlying cache resource we found to be near capacity, and working on an alternate system that does not use the cache. We will post an update as soon as we have one, and are sorry for the trouble this is causing you.

Fri, Oct 20, 2023, 02:27 AM (1 hour earlier)

Identified

One fix was implemented, but we are seeing ongoing issues. A subset of API requests and ChatGPT users is impacted. We are continuing to investigate.

Fri, Oct 20, 2023, 01:19 AM (1 hour earlier)

Identified

We are continuing to work on the fix for this issue.

Fri, Oct 20, 2023, 12:49 AM (30 minutes earlier)

Identified

The issue has been identified and a fix is being implemented now.

Fri, Oct 20, 2023, 12:12 AM (36 minutes earlier)

Investigating

ChatGPT is also impacted by this outage for a subset of users. We are continuing to investigate.

Fri, Oct 20, 2023, 12:06 AM

Investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Thu, Oct 19, 2023, 11:51 PM (15 minutes earlier)

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.