On June 3, 2024, at 11:49 PM PDT, ChatGPT experienced a significant outage affecting all user tiers (paid, enterprise, free, anonymous).
By 4:10 AM PDT on June 4, service was fully restored.
A second phase of the outage began a few hours later, at 7:14 AM PDT on June 4, again impacting the same user cohorts.
Service was restored for a second time at 10:07 AM PDT.
The outage occurred because a database that ChatGPT depends on became unavailable. The cause was traffic surges initiated by the connection pooling service, combined with the way that service was configured.
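This report does not describe the pooling service's actual configuration, so the following is only a generic illustration. It shows one common way for database clients to avoid reconnecting in synchronized bursts: capped exponential backoff with full jitter. The function name `backoff_delays` and its default values are hypothetical and are not taken from the incident.

```python
import random


def backoff_delays(attempts, base=0.1, cap=10.0, rng=None):
    """Return one randomized wait time (in seconds) per reconnect attempt.

    Hypothetical helper: the base and cap defaults are illustrative only.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so that many clients
        # retrying at once do not all reconnect at the same instant.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Spreading retries over a random interval is meant to smooth the load that a recovering database sees; whether it would have applied here depends on details the report does not give.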
The team initially attempted several mitigations, including restarting the primary server and assessing failover options to other replicas. Despite these attempts, the primary database remained unreachable. We eventually blocked all traffic to ChatGPT to remove all load from the database, and were then able to promote a secondary target to be the new primary and begin redirecting traffic to it. Re-ramping of incoming traffic concluded at 10:07 AM PDT, at which time all services were recovered.
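The recovery described above ended with a gradual re-ramp of incoming traffic. As a minimal sketch of how such a ramp can be expressed, assuming requests carry a stable identifier, the following admits a configurable percentage of callers. The names `admit` and `RAMP_STEPS` and the step values are hypothetical; the report does not say how the ramp was actually implemented.

```python
import hashlib

# Hypothetical ramp schedule, in percent of traffic admitted.
RAMP_STEPS = [0, 1, 5, 25, 50, 100]


def admit(request_id: str, ramp_percent: int) -> bool:
    """Return True if this request falls inside the current ramp percentage."""
    if not 0 <= ramp_percent <= 100:
        raise ValueError("ramp_percent must be between 0 and 100")
    # Hash the identifier so each caller maps to a stable bucket in 0..99.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # A caller admitted at one step stays admitted at every higher step.
    return bucket < ramp_percent
```

Setting the percentage to 0 corresponds to blocking all traffic, and raising it step by step lets operators watch the new primary's load before admitting everyone.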
As part of the incident response, we have already implemented the following measures:
Additionally, we will be implementing the following changes to prevent future incidents of this type altogether: