API & ChatGPT Performance Degradation

Resolved · Full outage

We’ve published a write-up of this incident.

Dec 5, 2024, 12:08 AM to 02:00 AM

Updates

Write-up published

Read it here

Resolved

On December 4, 2024, from 15:48 PT to 15:52 PT, 100% of API requests failed with HTTP 530 errors because of a misconfiguration in the global load balancer. The configuration was corrected promptly and connectivity was restored by 15:52 PT.

Later that day, from 16:07 PT to 17:37 PT, an upgrade of the DNS cache system surfaced a different issue: the OpenAI identity system lost connectivity to the DNS cache. As a result, user requests appeared to stall for 30 seconds. After the 30-second DNS lookup timeout, the identity system fell back to a cached IP and completed requests successfully. The elevated latency led to 45% of API requests returning 499 errors, which appeared as client-side cancellations. In the ChatGPT interfaces, the latency showed up as a 30-second wait, after which requests completed.
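The timeout-then-fallback behavior described above can be illustrated with a minimal Python sketch. It is not OpenAI's implementation: the function name `resolve_with_fallback`, the in-memory `cache` dictionary, and the injectable `resolver` parameter are all assumptions made for illustration. It shows why the stall lasts as long as the lookup timeout: the cached IP is only used once the lookup has timed out or failed.

```python
import concurrent.futures
import socket
from typing import Callable, Dict


def resolve_with_fallback(
    hostname: str,
    cache: Dict[str, str],
    timeout_s: float = 30.0,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Resolve hostname; on timeout or lookup error, return the last known IP.

    Hypothetical helper: `cache` maps hostnames to the most recent
    successfully resolved IP address.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolver, hostname)
    try:
        # The caller is blocked here for up to timeout_s seconds.
        ip = future.result(timeout=timeout_s)
        cache[hostname] = ip
        return ip
    except (concurrent.futures.TimeoutError, OSError):
        # Lookup timed out or failed: fall back to the cached IP if we have one.
        if hostname in cache:
            return cache[hostname]
        raise
    finally:
        # Do not wait for a stuck lookup thread to finish.
        pool.shutdown(wait=False)
```

With `timeout_s=30.0`, every request that needs a lookup waits the full 30 seconds before the fallback is used, which matches the stall described above. A shorter timeout would shorten the stall, at the cost of falling back to possibly stale addresses more often.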

The first wave of this incident, caused by a global load balancer misconfiguration, was mitigated by applying the correct configuration.

The second wave was mitigated by temporarily switching to an alternative system.

To safeguard against global load balancer (Cloudflare) misconfigurations, we worked with Cloudflare to investigate the issue and implemented a change that prevents a single pool from impacting all API traffic. The default and fallback pools are now distinct, and this change has already been rolled out.
The identity system's dependency on the DNS cache was found to be unnecessary and has been removed.
We are continuing to improve our chaos test system, which will aim to surface DNS cache dependency issues in a controlled way.
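
The requirement that the default and fallback pools be distinct can be expressed as a configuration check. The sketch below is a generic Python illustration, not Cloudflare's API or the actual configuration. The names `LoadBalancerConfig`, `validate`, and `select_pool`, and the pool names in the usage note, are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class LoadBalancerConfig:
    """Hypothetical load balancer settings, for illustration only."""
    default_pool: str
    fallback_pool: str


def validate(config: LoadBalancerConfig) -> None:
    """Reject configurations where one pool serves both roles."""
    if config.default_pool == config.fallback_pool:
        raise ValueError(
            "default and fallback pools must be distinct, "
            f"got {config.default_pool!r} for both"
        )


def select_pool(config: LoadBalancerConfig, healthy_pools: Set[str]) -> str:
    """Use the default pool when healthy, otherwise the fallback pool."""
    validate(config)
    if config.default_pool in healthy_pools:
        return config.default_pool
    return config.fallback_pool
```

For example, `select_pool(LoadBalancerConfig("api-primary", "api-secondary"), {"api-secondary"})` returns `"api-secondary"`, while a configuration that names the same pool twice is rejected before any traffic decision is made.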

We know that outages affect our customers' products and businesses. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Fri, Dec 13, 2024, 11:42 PM

Resolved

This incident has been resolved.

Thu, Dec 5, 2024, 02:21 AM (1 week earlier)

Monitoring

We are continuing to monitor for any further issues. Please contact support via help.openai.com if any issues persist with ChatGPT or the API.

Thu, Dec 5, 2024, 02:00 AM (21 minutes earlier)

Monitoring

The issue has been mitigated and we are continuing to monitor and verify that the entire system is returning to full operation.

Thu, Dec 5, 2024, 01:49 AM (11 minutes earlier)

Monitoring

The issue has reappeared and may be affecting both API and ChatGPT. We are investigating the issue.

Thu, Dec 5, 2024, 12:29 AM (1 hour earlier)

Monitoring

We are continuing to monitor for any further issues.

Thu, Dec 5, 2024, 12:10 AM (19 minutes earlier)

Monitoring

We experienced a brief period of degraded API performance from approximately 3:45 PM to 3:50 PM PT. Performance has stabilized and we are currently monitoring. We will post an update once we have confirmed the issue has been fully resolved.

Thu, Dec 5, 2024, 12:08 AM
