Starting at 10:40am on December 26th, 2024, multiple OpenAI products saw degraded availability. ChatGPT, Sora video creation, and many APIs (agents, realtime speech, batch, DALL-E) saw > 90% error rates during the incident. The text completions API was unaffected. All systems fully recovered by 3:11 PM except ChatGPT, which fully recovered by 6:20 PM.
The root cause was a power failure in a cloud provider data center which impacted critical services such as databases in that region for an extended period.
OpenAI’s databases are globally replicated but region-wide failover currently requires manual intervention from the hosting cloud provider. We were able to work with the cloud provider to fail over some databases to other regions but our scale elongated the mitigation time. We kicked off several workstreams to explore workarounds; final recovery only came when the cloud provider fully recovered the region.
In the coming weeks, we will embark on a major infrastructure initiative to ensure our systems are resilient to an extended outage in any region of any of our cloud providers by adding a layer of indirection under our control in between our applications and our cloud databases. This will allow significantly faster failover.
We know that extended outages can impact your products, businesses, and lives. We will prioritize the preventative measures outlined above to continue improving our reliability.