On October 19, 2023, from 4:46pm to 9:15pm PT, a large portion of requests to OpenAI failed with 500 or 502 error codes. Failure rates peaked at 60% for GPT-3.5-Turbo API traffic and 70% for GPT-4 API traffic, with similar rates across all other API endpoints. A fraction of ChatGPT Enterprise traffic was also affected.
The root cause was our Redis authentication cache becoming largely unresponsive after shards failed while the cache was already under high load. Because our authentication system was not resilient to this kind of cache failure, a significant fraction of authentication requests timed out. And because the cache is keyed by user, the impact was uneven: some API customers saw all of their requests fail, while others saw only a small percentage fail.
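To illustrate the failure mode, here is a minimal, hypothetical sketch (Python with redis-py) of a user-keyed auth cache lookup that fails fast and falls back to the primary auth store when a cache shard is unresponsive. The hostname, timeouts, and the `check_primary_auth_store` helper are illustrative assumptions, not our actual implementation.

```python
import redis

CACHE = redis.Redis(
    host="auth-cache.internal",   # assumed hostname, for illustration only
    socket_timeout=0.05,          # fail fast instead of hanging on a dead shard
    socket_connect_timeout=0.05,
)

def check_primary_auth_store(api_key: str) -> bool:
    """Placeholder for a lookup against the primary auth database."""
    raise NotImplementedError

def authenticate(api_key: str) -> bool:
    # User-keyed cache: a single failed shard affects a fixed subset of users,
    # which is why some customers saw every request fail while others saw few.
    cache_key = f"auth:{api_key}"
    try:
        cached = CACHE.get(cache_key)
        if cached is not None:
            return cached == b"valid"
    except redis.RedisError:
        pass  # cache is down or slow; fall through to the source of truth

    valid = check_primary_auth_store(api_key)
    try:
        CACHE.set(cache_key, b"valid" if valid else b"invalid", ex=300)
    except redis.RedisError:
        pass  # best-effort write-back; never let the cache block the request
    return valid
```

Without the short socket timeouts and the fallback path shown above, an unresponsive shard turns every cache lookup for the affected users into a hung request, which surfaces as authentication timeouts and 5xx errors.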
After identifying the root cause of the outage, we took immediate action: we attempted to scale up the existing Redis cluster and, in parallel, provisioned a new, larger Redis cluster with higher resource limits. Even so, fully restoring the service took longer than we anticipated. While service improved intermittently over the course of the outage, the issue was not fully mitigated until 9:15pm PT.
As part of the incident response, we have already implemented the following measures:
Additionally, we will be implementing the following changes to prevent similar incidents altogether:
We know that extended API outages affect our customers’ products and businesses, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.