On April 10th, between 11:00am and 1:43pm PST, a large portion of requests to OpenAI failed with 500 and 503 error codes. ChatGPT users saw significant failures over the course of the event.
The outage occurred because a control plane service for one of our ChatGPT clusters degraded due to memory exhaustion, which caused cascading failures across multiple services. The team shifted traffic away from the affected cluster to a healthy cluster, scaling the healthy cluster up in the process. The scale-up caused the second cluster to fail as well: the increase in pods generated more requests to the control plane service that cluster relies on.
We applied rate limiting to reduce inbound traffic volume, which led to many users receiving a misleading error message.
The issue was mitigated by increasing the available memory on the Kubernetes control plane and rebalancing traffic.
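For readers unfamiliar with the rate limiting mentioned above, the following is a minimal illustrative sketch of a token-bucket limiter that returns an explicit "temporarily limited" message when it rejects a request. It is not the implementation used during the incident. The names `TokenBucket` and `handle_request`, the 429 status code, and the message text are all assumptions made for this example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Hypothetical limiter: `rate` tokens per second, bursts up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = field(init=False)
    last_refill: float = field(init=False)

    def __post_init__(self):
        # Start with a full bucket so an initial burst is allowed.
        self.tokens = self.capacity
        self.last_refill = time.monotonic()

    def allow(self, now=None):
        """Refill based on elapsed time, then try to spend one token."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_request(bucket, now=None):
    """Return (status, message); the status and wording are assumptions."""
    if bucket.allow(now):
        return 200, "ok"
    # Say plainly that traffic is being limited, so the rejection
    # is not mistaken for a problem with the user's own account.
    return 429, "The service is temporarily limiting traffic. Please retry shortly."
```

The point of the sketch is the rejection path: when a limiter is applied as an emergency measure, the message returned to users should describe the temporary limit rather than reuse wording intended for a different condition.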
As part of the incident response, we have already implemented the following measures:
Additionally, we will be implementing the following changes to prevent future incidents altogether:
We know that extended outages to ChatGPT affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.