Between March 30 and April 1, OpenAI experienced intermittent service degradation that affected new user signups, logins for some users, and, in certain cases, the availability of ChatGPT. The disruption followed an unprecedented surge in demand after a new product feature launched, which placed unexpected load on our core account services and associated infrastructure.
Impact
Many users were unable to create accounts or log in, particularly during peak traffic periods.
ChatGPT was intermittently unavailable because it depends on the degraded account services.
Temporary geo-specific restrictions were applied to reduce load and maintain overall system stability.
Contributing Factors
Several factors contributed to the incident:
Although the dependent systems had been extensively load tested, the surge in new user signups and logins was large enough to saturate a key backend database.
Some components relied on reads immediately reflecting preceding writes, which made them more sensitive to delays caused by system overload.
Retry behaviors in client applications unintentionally amplified traffic during peak periods.
The architecture in place at the time did not scale adequately to handle the sharp growth in usage.
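As a general illustration of the retry amplification described above, the sketch below shows one common way to keep client retries from piling onto an overloaded service: exponential backoff with full jitter and a cap on attempts. It is a minimal sketch, not the client code involved in this incident. The names `TransientError` and `call_with_retries` and all parameter values are assumptions chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


def call_with_retries(request, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call request() and retry on TransientError with full-jitter backoff.

    The wait before retry number n is drawn uniformly from
    [0, min(cap, base * 2**n)), so clients that fail at the same moment
    spread their retries out instead of returning in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientError:
            # Give up after the final attempt instead of retrying forever.
            if attempt == max_attempts - 1:
                raise
            sleep(rng() * min(cap, base * 2 ** attempt))
```

The bound on attempts matters as much as the jitter: a client that retries indefinitely keeps adding load for as long as the service is degraded.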
Remediation and Next Steps
We took immediate actions to restore service, including:
Scaling up backend infrastructure to handle increased load.
Reducing unnecessary operations and optimizing high-impact queries.
Applying rate limits and temporarily restricting traffic from select regions to protect core systems.
Collaborating with infrastructure partners to address and resolve underlying performance issues.
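Rate limiting of the kind mentioned above is often implemented with a token bucket. The sketch below shows the general technique only. It is not the limiter used during this incident, and the class name `TokenBucket` and its parameters are assumptions for illustration.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity` while capping the sustained
    rate at `rate` requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits the budget."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per client, region, or endpoint and reject or queue requests for which `allow()` returns `False`.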
To prevent a recurrence, we are:
1. Redesigning key systems to better support elastic scaling and high availability under load.
2. Reducing dependencies between core features and signup/login flows.
3. Implementing additional safeguards and automated mechanisms to detect and respond to traffic spikes more effectively.
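One simple way to detect a traffic spike automatically is to compare each interval's request count against a smoothed baseline. The sketch below is an illustrative assumption about how such a check could look, not a description of the safeguards being built. The class name `SpikeDetector`, the smoothing factor, and the threshold are all hypothetical.

```python
class SpikeDetector:
    """Flag an interval whose count exceeds `threshold` times an
    exponentially weighted moving average of earlier intervals."""

    def __init__(self, alpha=0.2, threshold=3.0):
        self.alpha = alpha          # weight given to the newest interval
        self.threshold = threshold  # multiple of baseline that counts as a spike
        self.baseline = None

    def observe(self, count):
        """Record one interval's request count and return True on a spike."""
        if self.baseline is None:
            # The first observation only seeds the baseline.
            self.baseline = float(count)
            return False
        is_spike = count > self.threshold * self.baseline
        if not is_spike:
            # Update the baseline from normal traffic only, so a spike
            # does not raise the bar for detecting the next one.
            self.baseline = self.alpha * count + (1 - self.alpha) * self.baseline
        return is_spike


detector = SpikeDetector()
flags = [detector.observe(c) for c in [100, 110, 95, 105, 400, 100]]
```

In this example only the interval with 400 requests is flagged. A production detector would also need to handle sustained growth, where the higher level eventually becomes the new normal.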
We know that extended outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.