On June 3, 2024, at 11:49 PM PDT, ChatGPT experienced a significant outage affecting all user tiers (paid, enterprise, free, anonymous).
By 4:10 AM PDT, service was fully restored.
A second phase of the outage began a few hours later, at 7:14 AM PDT on June 4, again impacting the same user cohorts.
Service was restored for a second time at 10:07 AM PDT.
The outage occurred because a database that ChatGPT depends on became unavailable. The cause was traffic surges initiated by the connection pooling service, together with the way that service was configured.
The team initially attempted several mitigations, including restarting the primary server and assessing failover options to other replicas. Despite these attempts, the primary database remained unreachable. We eventually blocked all traffic to ChatGPT to remove all load from the DB, which allowed us to promote a secondary target to be the new primary and begin redirecting traffic to it. Re-ramping incoming traffic concluded at 10:07 AM PDT, at which time all services were recovered.
As part of the incident response, we have already implemented the following measures:
Tuned the number of connections the pooling service makes to the DB backend.
Increased timeouts on connections made to the DB to avoid deadlocks.
Implemented exponential backoff, gradually increasing the wait time between subsequent retry attempts for DB connection failures.
Modified our load shedding tooling so that services can degrade more gracefully.
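To illustrate the exponential backoff measure listed above, here is a minimal Python sketch. It is not the code used in production. The function names, the default delays, and the random jitter are all assumptions made for illustration. Jitter in particular is not mentioned in this report; it is a common companion technique that keeps many clients from retrying at the same instant.

```python
import random
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=5):
    """Return the maximum wait before each retry: base * factor**n, capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


def connect_with_backoff(connect, attempts=5, base=0.1, factor=2.0, cap=5.0,
                         sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds, waiting longer after each failure.

    `connect` is a hypothetical callable that raises ConnectionError on failure.
    `sleep` and `rng` are injectable so the behavior can be tested without waiting.
    """
    delays = backoff_delays(base, factor, cap, attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                # Out of retries: surface the failure to the caller.
                raise
            # Assumed addition: sleep a random fraction of the delay (jitter)
            # so that clients do not all retry in lockstep.
            sleep(delay * rng())
```

With the defaults shown, the waits grow from at most 0.1 s to at most 1.6 s across the four retries, so a struggling database sees progressively less reconnect traffic instead of a constant stream of immediate retries.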
Additionally, we will be implementing the following changes to prevent future incidents of this type altogether:
Re-architect the DB design to increase its redundancy.
Improve our ability to load shed at the DB layer (in addition to the clients).
Expand the load testing and benchmarking we do for the backend layer.
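As a rough illustration of what load shedding in front of a database can look like, the sketch below caps the number of in-flight requests and rejects the excess immediately instead of queueing it. This is one possible approach, not a description of the planned design. The `LoadShedder` class, the `Overloaded` exception, and the `max_in_flight` parameter are hypothetical names.

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Raised when a request is shed instead of being sent to the database."""


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests; reject the rest."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def admit(self):
        # Non-blocking acquire: fail fast rather than queue behind a slow DB.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many in-flight database requests")
        try:
            yield
        finally:
            # Free the slot whether the wrapped work succeeded or failed.
            self._slots.release()
```

A caller would wrap each database call in `with shedder.admit():` and turn `Overloaded` into a fast error response. Rejecting early keeps excess requests from piling up on a database that is already saturated.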