Elevated error rates for ChatGPT and Platform API
Incident Report for OpenAI
Postmortem

On August 16, 2024, from 11:38 AM to 1:16 PM PT, a significant issue reduced the reliability of the primary OpenAI API, resulting in degraded service for users. The incident lowered success rates for ChatGPT conversations and affected login and account creation. It occurred in two waves, lasting 44 and 15 minutes, respectively.

The root cause was a combination of factors. Scheduled maintenance and an upgrade to the ingress of the OpenAI user-facing clusters introduced a networking control plane regression, which manifested as a short-lived data plane outage. As a result of the momentary loss of connectivity, a set of services became unhealthy and were automatically restarted. The restarts, however, took much longer than expected because the starting services overwhelmed a backend persistence store with a heavy first-start query. The backend persistence store then required additional time to catch up and recover.

As part of the incident response, we have already implemented the following measures:

  1. Mitigated the networking control plane regression and validated that control plane restarts do not interfere with the clusters' data plane.
  2. Implemented software changes to improve service start time and remove the first-start query, alleviating pressure on the persistence layer and speeding up service starts and restarts.
  3. Deployed configuration changes to optimize the networking control plane's effect on the clusters' ability to handle traffic.
  4. Removed the expensive database query from critical startup paths.
  5. Implemented additional monitoring and alerting for networking control plane issues.
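
As an illustration only, the idea of keeping an expensive query off the startup path can be sketched as below. This is not OpenAI's implementation: `LazyStartupData`, `startup_delay`, and `expensive_query` are hypothetical names, and the jittered delay is an assumed complementary technique that the report does not mention. The sketch shows a service that starts without touching the persistence store and runs the heavy query at most once, on first use.

```python
import random
import threading


class LazyStartupData:
    """Defers an expensive load until first use instead of running it at start.

    Hypothetical helper for illustration; not taken from the incident report.
    """

    def __init__(self, loader):
        self._loader = loader  # stand-in for the heavy first-start query
        self._lock = threading.Lock()
        self._loaded = False
        self._value = None

    def get(self):
        # Double-checked locking: only one thread ever runs the loader.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value


def startup_delay(max_jitter_seconds, rng=random):
    """Returns a random delay so restarted replicas do not all query at once.

    Assumed extra mitigation (jitter); the report does not describe it.
    """
    return rng.uniform(0.0, max_jitter_seconds)


# Demo: count how many times the "query" actually runs.
calls = []


def expensive_query():
    calls.append(1)
    return {"config": "loaded"}


data = LazyStartupData(expensive_query)
started_without_query = len(calls) == 0  # "startup" did not touch the store
first = data.get()   # first use triggers the query
second = data.get()  # later uses reuse the cached result
```

The trade-off in a design like this is that the first request pays the query cost instead of the startup path, so it suits data that is not needed to pass health checks.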

Additionally, we will be implementing the following changes to prevent future incidents altogether:

  1. We are introducing staged rollouts for infrastructure changes with longer soak time to ensure regressions are caught as early as possible and affect as few systems and users as possible.
  2. We are auditing our systems for other slow queries that may affect service start time.
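
A staged rollout with soak time can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a description of OpenAI's tooling: the function `staged_rollout`, the stage names, and the `apply_change`, `is_healthy`, and `rollback` callbacks are all hypothetical. It applies a change one stage at a time, waits for a soak period, checks health, and rolls back everything applied so far if a stage looks unhealthy, so later stages are never touched.

```python
import time


def staged_rollout(stages, apply_change, is_healthy, rollback,
                   soak_seconds, sleep=time.sleep):
    """Applies a change stage by stage, soaking and checking health each time.

    Returns (succeeded, attempted_stages). All callbacks are hypothetical
    hooks supplied by the caller.
    """
    attempted = []
    for stage in stages:
        apply_change(stage)
        attempted.append(stage)
        sleep(soak_seconds)  # soak: give regressions time to surface
        if not is_healthy(stage):
            # Undo in reverse order and stop before touching later stages.
            for done in reversed(attempted):
                rollback(done)
            return False, attempted
    return True, attempted


# Demo with a fake sleep and a health check that fails on the second stage.
stages = ["canary", "region-a", "region-b"]
applied, rolled_back = [], []
ok, attempted = staged_rollout(
    stages,
    apply_change=applied.append,
    is_healthy=lambda stage: stage != "region-a",
    rollback=rolled_back.append,
    soak_seconds=600,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
```

In the demo, the regression is caught at the second stage, so the last stage never receives the change.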

We are continuing to improve our infrastructure to ensure greater resilience and faster recovery in the event of future incidents.

We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Posted Aug 23, 2024 - 14:27 PDT

Resolved
From 11:38am PT to 12:33pm PT and then from 1:02pm PT to 1:16pm PT, some users saw elevated errors across ChatGPT and Platform API services.

Engineers have now completed a series of updates to ensure this issue doesn't recur.

ChatGPT and Platform API services are fully operational. This issue has been fully resolved.
Posted Aug 16, 2024 - 17:35 PDT
Update
Engineers are continuing to make updates to ensure the issue does not reoccur. No impact is expected to ChatGPT and API services.
Posted Aug 16, 2024 - 15:56 PDT
Update
ChatGPT and API services continue to be operational. Engineers are making some additional updates to ensure the issue does not reoccur.
Posted Aug 16, 2024 - 14:21 PDT
Monitoring
Engineers have taken measures to mitigate the elevated errors and are monitoring for additional changes.
Posted Aug 16, 2024 - 12:28 PDT
Update
We are continuing to investigate this issue.
Posted Aug 16, 2024 - 11:54 PDT
Investigating
We are currently investigating elevated error rates on ChatGPT and Platform API.
Posted Aug 16, 2024 - 11:53 PDT
This incident affected: API and ChatGPT.