On November 21st, between 1:48 and 5:21 PT, a large portion of requests to OpenAI failed. The failure rate peaked at 75% on paid and free ChatGPT.
This outage was triggered by maintenance on a PostgresDB service used by OpenAI’s Primary API service, which caused both read replicas to fail at the same time. Although only the read replicas for this database were affected, they are load-bearing, so every request path that depends on them failed. The Primary API service provides user authentication and service-to-service authentication across the entire OpenAI stack.
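To illustrate what "load-bearing" read replicas mean in practice, here is a minimal sketch of a read router. It is not OpenAI's implementation: `Node`, `ReadRouter`, `ReplicaUnavailable`, and the `fallback_to_primary` flag are hypothetical names invented for this example. With no fallback, reads fail as soon as every replica is down, even though the primary is healthy.

```python
from dataclasses import dataclass
from typing import List


class ReplicaUnavailable(Exception):
    """Raised when a database node cannot serve a query (hypothetical)."""


@dataclass
class Node:
    """Stand-in for a database node; a real one would hold a connection."""
    name: str
    healthy: bool = True

    def query(self, sql: str) -> str:
        if not self.healthy:
            raise ReplicaUnavailable(self.name)
        return f"{self.name}: result of {sql!r}"


class ReadRouter:
    """Sends reads to replicas, optionally falling back to the primary."""

    def __init__(self, primary: Node, replicas: List[Node],
                 fallback_to_primary: bool = False) -> None:
        self.primary = primary
        self.replicas = replicas
        self.fallback_to_primary = fallback_to_primary

    def read(self, sql: str) -> str:
        # Try each replica in order and skip the ones that are down.
        for replica in self.replicas:
            try:
                return replica.query(sql)
            except ReplicaUnavailable:
                continue
        # Every replica failed. Without a fallback, the read path is down.
        if self.fallback_to_primary:
            return self.primary.query(sql)
        raise ReplicaUnavailable("no read replica available")
```

With both replicas marked unhealthy, `ReadRouter(primary, replicas).read(...)` raises `ReplicaUnavailable`, while the same router built with `fallback_to_primary=True` is served by the primary. Whether such a fallback is safe depends on whether the primary can absorb the read load, which this sketch does not model.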
After multiple unsuccessful attempts to revive either of the failed replicas and to bring up replacements in other regions, the issue was ultimately mitigated by working with our service provider to bring two new replicas into service in the same region.
As part of the incident response, we have already implemented the following measures:
Additionally, we will be implementing the following changes to prevent similar incidents in the future:
Fundamentally, OpenAI needs to ensure that maintenance windows do not result in widespread service outages. To that end, we will:
We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.