Elevated API Errors
Incident Report for OpenAI
Postmortem

On October 19, 2023, from 4:46pm to 9:15pm PT, a large portion of requests to OpenAI failed with 500 or 502 error codes. This peaked at a 60% failure rate for GPT-3.5-Turbo API traffic, a 70% failure rate for GPT-4 API traffic, and similar failure rates for all other API endpoints. A fraction of ChatGPT Enterprise traffic was also affected.

The root cause was identified as our Redis authentication cache becoming largely unresponsive due to shards failing while the cache was already under high load. Because our authentication system was not resilient to such a cache failure, this led to a significant fraction of timeouts when authenticating. Due to the user-keyed nature of the authentication system, some users were heavily affected while others were not at all – some API customers saw all requests fail, while others only saw a small percentage fail.
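
To illustrate why a user-keyed cache produces such uneven impact, here is a minimal sketch. It assumes a simple hash-based mapping from user key to shard; the shard count, the function names, and the mapping itself are hypothetical and are not a description of our production system. In this simplified model a user whose key maps to a failed shard sees every lookup fail, while other users see none fail; real-world impact also depends on retries, timeouts, and load on the surviving shards.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, for illustration only


def shard_for(user_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user key to a shard with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(user_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def failure_rate_by_user(user_keys, failed_shards, num_shards=NUM_SHARDS):
    """Return 1.0 for users whose shard is down and 0.0 for everyone else."""
    return {
        key: 1.0 if shard_for(key, num_shards) in failed_shards else 0.0
        for key in user_keys
    }
```

Calling `failure_rate_by_user(["alice", "bob"], failed_shards={3})` returns a per-user rate that depends only on which shard each key hashes to, which is the uneven pattern described above.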

After identifying the root cause of the outage, we took immediate action by attempting to scale up the existing Redis cluster, as well as separately provisioning a new, larger Redis cluster with higher resource limits. However, fully restoring the service took longer than we anticipated. While the service improved periodically over the course of the outage, the issue was only fully mitigated by 9:15pm.

As part of the incident response, we have already implemented the following measures:

  • Scaled up the Redis authentication cache to be able to handle traffic even in the presence of multiple shard failures
  • Modified our cache miss and failure backoff policies to decrease the rate of requests to the cache when it is struggling to serve requests
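
As an illustration of the second measure, the following is a minimal sketch of one common backoff policy: capped exponential backoff with full jitter. It is an assumption chosen for illustration, not our actual policy; the function names, the delay values, and the use of `ConnectionError` to signal a struggling cache are all hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=0.05, cap=5.0, rng=random.random):
    """Delay in seconds before the next retry: a random value in
    [0, min(cap, base * 2**attempt)), so callers spread out their retries."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def get_with_backoff(fetch, max_attempts=4, sleep=time.sleep, rng=random.random):
    """Call fetch(), retrying on ConnectionError with growing, jittered delays.

    The last failure is re-raised so the caller can fall back or fail fast.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

Growing the delay after each failure reduces the request rate against a cache that is already struggling, and the random jitter keeps many clients from retrying at the same instant.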

Additionally, we will be implementing the following changes to prevent future incidents altogether:

  • Improve monitoring and alerting for our caching and authentication infrastructure
  • Harden our authentication system against degradations in caching services, so that issues with our authentication caches don’t take down all traffic
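
One way to picture the second change is an authentication path that treats the cache as an optimization rather than a hard dependency. The sketch below is hypothetical: `CacheUnavailable`, `cache_get`, and `store_get` are illustrative names, and falling back to an authoritative store is an assumed design, not a description of the planned implementation.

```python
class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client on timeout or shard failure."""


def authenticate(token, cache_get, store_get):
    """Resolve a token to a principal, surviving a cache outage.

    cache_get(token) returns a principal, returns None on a miss, or raises
    CacheUnavailable. store_get(token) queries the authoritative store and
    returns a principal or None.
    """
    try:
        principal = cache_get(token)
        if principal is not None:
            return principal  # fast path: cache hit
    except CacheUnavailable:
        pass  # degraded cache: fall through instead of failing the request
    # Cache miss or cache outage: ask the source of truth directly.
    return store_get(token)
```

In a design like this, the authoritative store needs enough headroom, or its own rate limiting, to absorb the extra traffic while the cache is degraded.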

We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Posted Oct 27, 2023 - 15:53 PDT

Resolved
This incident has been resolved.
Posted Oct 19, 2023 - 21:29 PDT
Monitoring
We have recovered service, and error rates have returned to normal. We are continuing to monitor the issue.
Posted Oct 19, 2023 - 20:37 PDT
Update
The underlying issue affecting a subset of our API customers (and ChatGPT by extension) is with an upstream part of our API authentication stack. We're simultaneously scaling up the underlying cache resource we found to be near capacity, and working on an alternate system that does not use the cache. We will keep you posted as soon as we have an update, and are sorry for the trouble this is causing you.
Posted Oct 19, 2023 - 19:27 PDT
Update
One fix was implemented, but we are seeing ongoing issues. A subset of API requests and ChatGPT users is impacted. We are continuing to investigate.
Posted Oct 19, 2023 - 18:19 PDT
Update
We are continuing to work on the fix for this issue.
Posted Oct 19, 2023 - 17:49 PDT
Identified
The issue has been identified and a fix is being implemented now.
Posted Oct 19, 2023 - 17:12 PDT
Update
ChatGPT is also impacted by this outage for a subset of users. We are continuing to investigate.
Posted Oct 19, 2023 - 17:06 PDT
Investigating
We're experiencing an elevated level of API errors and are currently looking into the issue.
Posted Oct 19, 2023 - 16:51 PDT
This incident affected: API.