OpenAI

Elevated error rates for ChatGPT and Platform API
Updates

Write-up published

Resolved

On August 16, 2024, from 11:38 AM to 1:16 PM PT, a significant issue degraded the reliability of the primary OpenAI API. The incident reduced success rates for ChatGPT conversations and affected login and account creation. It occurred in two waves, lasting 44 and 15 minutes respectively.

The root cause was a combination of factors. Scheduled maintenance and an upgrade to the ingress of the OpenAI user-facing clusters introduced a networking control plane regression, which manifested as a short-lived data plane outage. Because of the momentary loss of connectivity, a set of services became unhealthy and were automatically restarted. The restarts, however, took much longer than expected because the services starting up overwhelmed a backend persistence store with a heavy first-start query. The backend persistence store then required additional time to catch up and recover.
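
As a rough illustration of why simultaneous restarts can take far longer than a single restart, the sketch below models a store that can run only a fixed number of heavy queries at once. The function name, its parameters, and every number are hypothetical; the write-up does not state the store's capacity, the cost of the query, or how many services restarted.

```python
import math


def restart_duration_s(n_services: int, query_cost_s: float, store_concurrency: int) -> float:
    """Estimate the time until the last restarting service finishes its first-start query.

    Simplified model (an assumption, not the real system): the store runs at most
    `store_concurrency` heavy queries in parallel, each taking `query_cost_s` seconds,
    and every restarting service issues exactly one such query before it is healthy.
    """
    # Queries are served in "waves" of at most store_concurrency at a time.
    waves = math.ceil(n_services / store_concurrency)
    return waves * query_cost_s


# Hypothetical numbers: a 2-second query and a store that handles 10 at a time.
single_restart = restart_duration_s(1, 2.0, 10)
mass_restart = restart_duration_s(500, 2.0, 10)
```

With these made-up inputs, one service is healthy after 2 seconds, while 500 services restarting together need 100 seconds before the last one is healthy. A real store under overload often degrades further (timeouts, retries), which this model deliberately ignores.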

As part of the incident response, we have already implemented the following measures:

  • We mitigated the networking control plane regression and validated that control plane restarts do not interfere with the clusters' data plane.

  • We implemented software changes to improve services' start time and remove the first-start query, alleviating pressure on the persistence layer and speeding up service starts and restarts.

  • We deployed configuration changes to optimize the networking control plane's effect on the clusters' ability to handle traffic.

  • We removed the expensive database query from critical startup paths.

  • We implemented additional monitoring and alerting for networking control-plane-related issues.
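
One common way to keep an expensive query off the startup path is to defer it until the data is first needed. The sketch below shows that general technique only; it is not a description of the actual change, and `LazyValue`, `Service`, and `expensive_query` are hypothetical names.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyValue(Generic[T]):
    """Runs `loader` at most once, on first access rather than at construction."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._loaded = False
        self._value: Optional[T] = None

    def get(self) -> T:
        # Double-checked locking so concurrent first accesses trigger one load.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]


class Service:
    """Hypothetical service that becomes ready without touching the store."""

    def __init__(self, expensive_query: Callable[[], dict]) -> None:
        # The query is registered here but not executed during startup.
        self.catalog = LazyValue(expensive_query)
        self.ready = True


query_calls = []


def expensive_query() -> dict:
    # Stand-in for the heavy first-start query (hypothetical).
    query_calls.append(1)
    return {"rows": 3}


service = Service(expensive_query)
calls_after_startup = len(query_calls)   # startup issued no query
first = service.catalog.get()            # first use triggers the query
second = service.catalog.get()           # later uses reuse the cached result
```

The trade-off is that the first request pays the query cost instead of the startup sequence, so a mass restart no longer sends every service's query to the store at the same moment.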

Additionally, we will be implementing the following changes to prevent future incidents altogether:

  • We are introducing staged rollouts for infrastructure changes with longer soak time to ensure regressions are caught as early as possible and affect as few systems and users as possible.
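
A staged rollout with a soak period can be sketched as a loop that deploys to one stage, waits, checks health, and halts at the first failure. This is a minimal sketch under stated assumptions: `staged_rollout`, the stage names, and the `deploy` and `healthy` callbacks are all hypothetical and do not reflect any specific tooling.

```python
import time
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    soak_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Deploy stage by stage; return the stages that passed their soak check.

    The rollout stops at the first stage that is unhealthy after soaking,
    so later stages never receive the change.
    """
    completed: List[str] = []
    for stage in stages:
        deploy(stage)
        sleep(soak_seconds)  # let the change run under real traffic
        if not healthy(stage):
            break  # halt: do not proceed to wider stages
        completed.append(stage)
    return completed


# Usage with hypothetical stages; sleeping is stubbed out for the example.
deployed: List[str] = []
result = staged_rollout(
    stages=["canary", "region-a", "region-b"],
    deploy=deployed.append,
    healthy=lambda stage: stage != "region-a",  # simulate a regression in region-a
    soak_seconds=3600,
    sleep=lambda seconds: None,
)
```

In this example the regression surfaces in `region-a`, so `region-b` is never touched. A longer soak time widens the window in which a slow-to-appear regression can be caught before it spreads.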

  • We are auditing our systems for other slow queries that may affect service start time.

We are continuing to improve our infrastructure to ensure greater resilience and faster recovery in the event of future incidents.

We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Fri, Aug 23, 2024, 08:03 PM
6d earlier...

Resolved

From 11:38am PT to 12:33pm PT and then from 1:02pm PT to 1:16pm PT, some users saw elevated errors across ChatGPT and Platform API services.

Engineers have now completed a series of updates to ensure this issue doesn't recur.

ChatGPT and Platform API services are fully operational. This issue has been fully resolved.

Sat, Aug 17, 2024, 12:35 AM
1h earlier...

Monitoring

Engineers are continuing to make updates to ensure the issue does not reoccur. No impact is expected to ChatGPT and API services.

Fri, Aug 16, 2024, 10:56 PM
1h earlier...

Monitoring

ChatGPT and API services continue to be operational. Engineers are making some additional updates to ensure the issue does not reoccur.

Fri, Aug 16, 2024, 09:21 PM
1h earlier...

Monitoring

Engineers have taken measures to mitigate the elevated errors and are monitoring for additional changes.

Fri, Aug 16, 2024, 07:28 PM
34m earlier...

Investigating

We are continuing to investigate this issue.

Fri, Aug 16, 2024, 06:54 PM

Investigating

We are currently investigating elevated error rates on ChatGPT and Platform API.

Fri, Aug 16, 2024, 06:53 PM
Availability metrics are reported at an aggregate level across all tiers, models, and error types. Individual customer availability may vary depending on their subscription tier: PAYG, Scale-Tier, or Reserved Capacity, as well as the specific model and API features in use.