Elevated errors in ChatGPT

Resolved·Partial outage

We’ve published a write-up of this incident.

Affected components

Apr 10, 2024, 05:56 PM

08:56 PM

Updates

Write-up published

Read it here

Resolved

On April 10th, between 11:00am and 1:43pm PT, a large portion of requests to OpenAI failed with 500 and 503 error codes. ChatGPT users saw significant failures over the course of the event.

The outage occurred when a control plane service for one of our ChatGPT clusters degraded as a result of memory exhaustion, which caused cascading failures across multiple services. The team shifted traffic away from the problem cluster to a healthy cluster, scaling that cluster up in the process. The second cluster then failed as well, because the increase in pods generated more requests to the control plane service that it relies on.
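
The failure of the second cluster can be illustrated with a toy model. The sketch below assumes, purely for illustration, that every pod generates a fixed rate of control plane requests and that each control plane has a fixed request capacity. The class, the function names, and all numbers are hypothetical and are not taken from the incident.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical cluster: a pod count plus the capacity of its control plane."""
    name: str
    pods: int
    control_plane_capacity_rps: float  # assumed requests/sec the control plane can serve


# Assumed average rate of control plane requests generated by a single pod.
REQUESTS_PER_POD_RPS = 0.5


def control_plane_load(cluster: Cluster) -> float:
    """Estimated control plane request rate, proportional to the pod count."""
    return cluster.pods * REQUESTS_PER_POD_RPS


def is_overloaded(cluster: Cluster) -> bool:
    """True when the estimated load exceeds the control plane's capacity."""
    return control_plane_load(cluster) > cluster.control_plane_capacity_rps


def shift_traffic(source: Cluster, target: Cluster) -> None:
    """Move all work from source to target by scaling target up by source's pods."""
    target.pods += source.pods
    source.pods = 0
```

With two clusters of 1,000 pods each and an assumed capacity of 600 requests per second, neither control plane is overloaded at first (500 requests per second each). Calling `shift_traffic` to move everything onto one cluster doubles its load to 1,000 requests per second, which exceeds the assumed capacity.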

We applied rate limiting to reduce inbound traffic volume, which led to many users receiving a misleading error message.
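
As a rough illustration of rate limiting paired with an accurate user-facing message, here is a minimal token-bucket sketch. It is not the mechanism used during the incident, which the write-up does not describe. The class name, the handler, and the message text are hypothetical; 429 is the standard HTTP status code for "Too Many Requests".

```python
import time
from typing import Callable, Tuple


class TokenBucket:
    """Minimal token-bucket limiter (illustrative only, not production code)."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = float(burst)    # maximum tokens that can accumulate
        self.tokens = float(burst)      # start with a full bucket
        self.clock = clock              # injectable clock, useful for testing
        self.last = clock()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_request(bucket: TokenBucket) -> Tuple[int, str]:
    """Return (status, message); the rejection message states the real cause."""
    if bucket.allow():
        return 200, "OK"
    # Hypothetical wording: say that traffic is being limited and that a retry
    # should help, rather than implying the user did something wrong.
    return 429, "The service is temporarily limiting traffic. Please retry shortly."
```

The design point is that the rejection path returns a message describing the actual condition, so users are not misled about why the request failed.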


The issue was mitigated by increasing the available memory on the Kubernetes control plane and rebalancing traffic.

As part of the incident response, we have already implemented the following measures:

Adjusted the available memory for control plane services.
Modified the services that make the majority of requests to the control plane so that they make fewer calls.
Implemented improved monitoring for the control plane and for the other relevant symptoms that we observed.
Modified the messages that users receive when one of our rate limit configuration switches is active.
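
One common way to reduce the volume of calls a service makes to a backend such as a control plane is to cache recent responses for a short time. The write-up does not say which technique was used, so the sketch below is only an assumption-laden example: `TTLCache` and its `fetch` callback are hypothetical names, and the callback stands in for whatever call the service would otherwise repeat.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Serve repeated lookups from memory until an entry is older than ttl_seconds."""

    def __init__(self, fetch: Callable[[Hashable], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch              # hypothetical expensive call to the backend
        self.ttl = ttl_seconds
        self.clock = clock              # injectable clock, useful for testing
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]             # fresh enough: no backend call
        value = self.fetch(key)         # missing or stale: one backend call
        self._entries[key] = (now, value)
        return value
```

The trade-off is staleness: a longer TTL saves more calls but lets callers see older data, so the value has to be chosen per use case.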


Additionally, we will be implementing the following changes to prevent future incidents altogether:

More thorough monitoring of the control plane services provided by our infrastructure provider.
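
Since the root cause was memory exhaustion, one simple form such monitoring could take is a threshold check on memory utilization. The sketch below is an assumption for illustration only: the function name, the alert levels, and the 75% and 90% thresholds are hypothetical and do not describe the monitoring that was actually deployed.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def memory_alert_level(used_bytes: int, limit_bytes: int,
                       warn_ratio: float = 0.75,
                       critical_ratio: float = 0.90) -> AlertLevel:
    """Classify memory utilization against assumed warning/critical thresholds."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = used_bytes / limit_bytes
    if ratio >= critical_ratio:
        return AlertLevel.CRITICAL
    if ratio >= warn_ratio:
        return AlertLevel.WARNING
    return AlertLevel.OK
```

Alerting at a warning level well below the limit is meant to leave time to add memory or shed load before the service degrades.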


We know that extended outages to ChatGPT affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.

Wed, Apr 17, 2024, 11:59 PM

Resolved

At this time our deployed fix has mitigated the elevated errors in ChatGPT. We are continuing to investigate the factors which led to this incident.

This outage started at ~11:00am and ended at 1:43pm PT.

Wed, Apr 10, 2024, 08:56 PM (1 week earlier)

Monitoring

Errors are continuing to resolve as a result of the fix we recently deployed. We are actively monitoring traffic returning to normal in ChatGPT.

Wed, Apr 10, 2024, 08:49 PM

Identified

We have identified the source of the elevated errors in ChatGPT and are working through a fix at this time.

Wed, Apr 10, 2024, 08:39 PM (10 minutes earlier)

Investigating

Some users may also be seeing unexpected error messages. We continue to actively investigate the elevated errors in ChatGPT.

Wed, Apr 10, 2024, 08:10 PM (29 minutes earlier)

Investigating

We are continuing to investigate this issue of elevated errors in ChatGPT.

Wed, Apr 10, 2024, 07:45 PM (24 minutes earlier)

Investigating

We are still investigating the elevated errors in ChatGPT.

Wed, Apr 10, 2024, 07:03 PM (41 minutes earlier)

Investigating

We continue to investigate elevated errors in ChatGPT.

Wed, Apr 10, 2024, 06:39 PM (24 minutes earlier)

Investigating

We are currently investigating elevated errors in ChatGPT.

Wed, Apr 10, 2024, 05:56 PM (43 minutes earlier)

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.
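
To see why an aggregate figure can differ from what an individual customer experiences, consider the small sketch below. It assumes availability is computed as successful requests divided by total requests, which this page does not state explicitly. The segment names and counts are made up for illustration.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int]  # (successful requests, total requests)


def availability(successes: int, total: int) -> float:
    """Fraction of requests that succeeded."""
    if total == 0:
        return 1.0  # assumption: a segment with no traffic counts as available
    return successes / total


def aggregate_availability(by_segment: Dict[str, Counts]) -> float:
    """Availability over all segments combined, weighted by request volume."""
    ok = sum(successes for successes, _ in by_segment.values())
    total = sum(total for _, total in by_segment.values())
    return availability(ok, total)
```

For example, with hypothetical counts of 900 successes out of 1,000 requests in one segment and 9,900 out of 10,000 in another, the aggregate is 10,800 / 11,000, or about 98.2%, even though the first segment saw only 90%.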