Elevated errors in ChatGPT
Incident Report for OpenAI
Postmortem

On April 10th, between 11:00am and 1:43pm PDT, a large portion of requests to OpenAI failed with 500 and 503 error codes. ChatGPT users saw significant failures over the course of the event.

The outage occurred because a control plane service for one of our ChatGPT clusters degraded as a result of memory exhaustion, which caused cascading failures across multiple services. The team shifted traffic away from the problem cluster to a healthy cluster, scaling that cluster up in the process. This caused the second cluster to fail as well, because the additional pods generated more requests to the control plane service that the second cluster relies on.

We applied rate limiting to reduce inbound traffic volume, which led to many users receiving a misleading error message.
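To illustrate the kind of rate limiting described above, here is a minimal token-bucket sketch that returns an explicit message when a request is shed for load reasons. It is not the implementation used during the incident: the `TokenBucket` class, the `handle_request` function, the rate and capacity values, and the message wording are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Hypothetical token bucket: refills `rate` tokens/second up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = field(init=False)
    updated: float = field(init=False)

    def __post_init__(self):
        # Start full so an initial burst up to `capacity` is allowed.
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def allow(self, now=None):
        """Return True if one request may proceed at time `now`."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_request(bucket, now=None):
    """Return (status, message); the message states why the request was shed."""
    if bucket.allow(now):
        return 200, "OK"
    # Assumed wording: make clear this is service-side load shedding,
    # not a problem with the user's account or usage.
    return 429, ("We are temporarily limiting traffic because of a service "
                 "issue. This is not caused by your usage. Please retry shortly.")
```

The design point is that the message attached to a load-shedding switch should describe the actual cause, so users are not told that they personally exceeded a limit.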

The issue was mitigated by increasing the available memory on the Kubernetes control plane and rebalancing traffic.

As part of the incident response, we have already implemented the following measures:

  • Adjusted the available memory for control plane services.
  • Modified the services which make the majority of the requests to the control plane to reduce the volume of calls that they make.
  • Implemented improved monitoring for the control plane and relevant other symptoms that we observed.
  • Modified the messages that users receive when using one of our rate limit configuration switches.
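One common way to reduce the volume of calls a service makes to a control plane is to cache responses locally with a jittered expiry. The sketch below shows that general technique only; the report does not say which approach was actually taken. `CachedControlPlaneClient`, its `fetch` callable, and the TTL and jitter values are hypothetical.

```python
import random
import time


class CachedControlPlaneClient:
    """Hypothetical wrapper that caches control plane lookups for a short TTL."""

    def __init__(self, fetch, ttl_seconds=30.0, jitter=0.2, clock=time.monotonic):
        self._fetch = fetch        # callable that performs the real request
        self._ttl = ttl_seconds    # base cache lifetime (assumed value)
        self._jitter = jitter      # fraction by which each TTL may vary
        self._clock = clock
        self._cache = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]        # served locally, no control plane call
        value = self._fetch(key)
        # Randomize expiry so that many pods started at the same moment
        # do not all refresh at the same instant.
        ttl = self._ttl * (1.0 + random.uniform(-self._jitter, self._jitter))
        self._cache[key] = (now + ttl, value)
        return value
```

With a cache like this, scaling a cluster up adds far fewer control plane requests per pod, at the cost of serving data that may be up to one TTL old.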

Additionally, we will be implementing the following changes to prevent future incidents altogether:

  • More thorough monitoring of the control plane services provided by our infrastructure provider.

We know that extended outages of ChatGPT affect our customers’ products and businesses, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Posted Apr 17, 2024 - 17:01 PDT

Resolved
At this time our deployed fix has mitigated the elevated errors in ChatGPT. We are continuing to investigate the factors which led to this incident.

This outage started at ~11:00am and ended at 1:43pm PDT.
Posted Apr 10, 2024 - 13:56 PDT
Monitoring
Errors are continuing to resolve as a result of the fix we recently deployed. We are actively monitoring traffic returning to normal in ChatGPT.
Posted Apr 10, 2024 - 13:49 PDT
Identified
We have identified the source of the elevated errors in ChatGPT and are working through a fix at this time.
Posted Apr 10, 2024 - 13:39 PDT
Update
Some users may also be seeing unexpected error messages. We continue to actively investigate the elevated errors in ChatGPT.
Posted Apr 10, 2024 - 13:10 PDT
Update
We are continuing to investigate this issue of elevated errors in ChatGPT.
Posted Apr 10, 2024 - 12:45 PDT
Update
We are still investigating the elevated errors in ChatGPT.
Posted Apr 10, 2024 - 12:03 PDT
Update
We continue to investigate elevated errors in ChatGPT.
Posted Apr 10, 2024 - 11:39 PDT
Investigating
We are currently investigating elevated errors in ChatGPT.
Posted Apr 10, 2024 - 10:56 PDT
This incident affected: ChatGPT.