API & ChatGPT Performance Degradation
Incident Report for OpenAI
Postmortem

On December 4th, 2024, from 15:48 PT to 15:52 PT, 100% of API requests returned HTTP 530 errors because of a misconfiguration in the global load balancer. The configuration was corrected and connectivity was restored by 15:52 PT.

Later that day, from 16:07 PT to 17:37 PT, an upgrade of the DNS cache system surfaced a different issue: the OpenAI identity system lost connectivity to the DNS cache. As a result, user requests appeared to stall for 30 seconds. After the 30-second DNS lookup timeout, the identity system fell back to a cached IP and completed requests successfully. This added latency led to 45% of API requests returning 499 errors, which appeared as client-side cancellations. In the ChatGPT interfaces, the latency showed up as a 30-second wait, after which requests completed.

The first wave of this incident, caused by a global load balancer misconfiguration, was mitigated by applying the correct configuration.

The second wave was mitigated by temporarily switching to an alternative system.

  • To safeguard against global load balancer (Cloudflare) misconfigurations, we worked with Cloudflare to investigate the issue and implemented a change that prevents a single pool from affecting all API traffic. The default and fallback pools are now distinct, and this change has already been rolled out.
  • Our identity systems' integration with, and reliance on, the DNS cache was found to be unnecessary and has been removed.
  • We are continuing to improve our chaos test system, which will aim to surface DNS cache dependency issues in a controlled way.
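
The first item above, keeping the default and fallback pools distinct, can be sketched as a pre-deployment configuration check. This is an illustrative sketch only and does not use Cloudflare's API: `LoadBalancerConfig`, `validate`, `pick_pool`, and the pool names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadBalancerConfig:
    """Hypothetical, simplified load balancer configuration."""
    default_pool: str
    fallback_pool: str


def validate(config):
    """Return a list of problems found in the configuration (empty if none)."""
    errors = []
    if not config.default_pool:
        errors.append("default pool must be set")
    if not config.fallback_pool:
        errors.append("fallback pool must be set")
    if config.default_pool and config.default_pool == config.fallback_pool:
        # If both roles point at one pool, a fault in that pool affects all traffic.
        errors.append("default and fallback pools must be distinct")
    return errors


def pick_pool(config, healthy_pools):
    """Prefer the default pool; use the fallback only if the default is unhealthy."""
    if config.default_pool in healthy_pools:
        return config.default_pool
    if config.fallback_pool in healthy_pools:
        return config.fallback_pool
    return None  # no healthy pool available
```

A check like `validate` could run before a configuration change is applied, so that a configuration sharing one pool across both roles is rejected before it reaches production traffic.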

We know that outages affect our customers' products and businesses. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Posted Dec 13, 2024 - 15:48 PST

Resolved
This incident has been resolved.
Posted Dec 04, 2024 - 18:21 PST
Update
We are continuing to monitor for any further issues. Please contact support via help.openai.com if any issues persist with ChatGPT or the API.
Posted Dec 04, 2024 - 18:00 PST
Update
The issue has been mitigated and we are continuing to monitor and verify that the entire system is returning to full operation.
Posted Dec 04, 2024 - 17:49 PST
Update
The issue has reappeared and may be affecting both API and ChatGPT. We are investigating the issue.
Posted Dec 04, 2024 - 16:29 PST
Update
We are continuing to monitor for any further issues.
Posted Dec 04, 2024 - 16:10 PST
Monitoring
We experienced a brief period of degraded API performance from approximately 3:45 PM to 3:50 PM PT. Performance has stabilized and we are currently monitoring. We will post an update once we have confirmed the issue has been fully resolved.
Posted Dec 04, 2024 - 16:08 PST
This incident affected: API and ChatGPT.