On December 4th, 2024 from 15:48 PT to 15:52 PT, 100% of API requests experienced HTTP 530 errors due to a misconfiguration in the global load balancer. This was promptly corrected and connectivity restored by 15:52 PT.
Later that day, from 16:07 PT to 17:37 PT, a different issue was surfaced by an upgrade of the DNS cache system. Consequently the OpenAI identity system lost connectivity to the DNS cache. As a result, user requests appeared to stall for 30 seconds. After the 30 second DNS lookup timeout the Identity system fell back to a cached IP and completed requests successfully. These elevated latencies led to 45% of API requests experiencing 499 errors, which appeared as client-side cancellations. In the ChatGPT interfaces this latency manifested itself as a 30 second "wait time" after which requests were completed.
The first wave of this incident, caused by a global load balancer misconfiguration, was mitigated by applying the correct configuration.
The second wave was mitigated by temporarily switching to an alternative system.
We know that outages affect our customers' products and businesses. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.