On August 16, 2024, from 11:38 AM to 1:16 PM PT, an incident degraded the reliability of the OpenAI primary API. It reduced success rates for ChatGPT conversations and affected login and account creation. The incident occurred in two waves, lasting 44 and 15 minutes respectively.
The root cause was a combination of factors. Scheduled maintenance and an upgrade to the ingress of the OpenAI user-facing clusters introduced a networking control plane regression, which caused a short-lived data plane outage. As a result of the momentary loss of connectivity, a set of services became unhealthy and were automatically restarted. The restarts, however, took much longer than expected because the services, all starting up at once, overwhelmed a backend persistence store with a heavy first-start query. The backend persistence store required additional time to catch up and recover.
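The restart pattern described above, where many instances issue the same expensive query at the same moment, can be illustrated with a minimal Python sketch of one widely used mitigation: spacing out the first-start query with exponential backoff and full jitter. This is a generic illustration and not a description of the affected services or of the changes made in response to this incident. The names `backoff_delay` and `run_first_start_query`, the default timings, and the use of `ConnectionError` to signal an overloaded store are all assumptions chosen for the example.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng or random  # fall back to the module-level generator
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


def run_first_start_query(query: Callable[[], T], max_attempts: int = 5,
                          sleep: Callable[[float], None] = time.sleep,
                          rng: Optional[random.Random] = None) -> T:
    """Run a heavy startup query (hypothetical) without querying in lockstep.

    A randomized wait precedes every attempt, including the first, so that
    instances restarted at the same moment spread their load over time.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        sleep(backoff_delay(attempt, rng=rng))
        try:
            return query()
        except ConnectionError:
            # Assumed failure signal for an overloaded store; re-raise once
            # the retry budget is exhausted.
            if attempt == max_attempts - 1:
                raise
    raise RuntimeError("unreachable")
```

In this sketch the `sleep` and `rng` parameters exist only so the behavior can be exercised deterministically; a real service would more likely pair jitter with a concurrency limit on the store side.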
As part of the incident response, we have already implemented the following measures:
Additionally, we will be implementing the following changes to prevent future incidents altogether:
We are continuing to improve our infrastructure to ensure greater resilience and faster recovery in the event of future incidents.
We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.