Elevated errors on ChatGPT

Resolved

We’ve published a write-up of this incident.

Read it here

Affected components

No components marked as affected

Updates

Write-up published

Read it here

Resolved

On June 17th, 2024, from 11:39 am to 2:02 pm PT, ChatGPT experienced an elevated error rate, with the majority of requests failing at one point.

The incident involved three main issues:

An inference engine issue prompted the initial rollback.
A series of cascading issues in our event bus infrastructure resulted in IO-blocking across the ChatGPT service, which prevented requests from completing.
A bug caused ChatGPT users to receive empty completion responses.

During the initial rollback, an event publishing flow degraded unexpectedly. Due to recent infrastructure changes and the deployment, an abnormally high number of requests reached a schema service, which increased its latency. A third-party library executing these requests used IO-blocking behavior that caused processes to stall, so ChatGPT requests timed out and returned 504 errors. We rolled forward to mitigate this, after which ChatGPT requests no longer returned 504s.
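
To illustrate the failure mode described above, the following is a minimal sketch, not the actual ChatGPT code. It assumes an asyncio-based service on Python 3.9 or later. The names fetch_schema_blocking, publish_event, and heartbeat are hypothetical, and time.sleep stands in for a slow schema service. The blocking call runs in a worker thread and the wait is bounded by a timeout, so a slow dependency delays or skips event publishing rather than stalling every request on the event loop.

```python
import asyncio
import time


def fetch_schema_blocking(delay: float) -> dict:
    """Hypothetical stand-in for a library call that blocks on network IO."""
    time.sleep(delay)  # simulates a slow schema service
    return {"event": "message_created", "version": 1}


async def publish_event(payload: dict, delay: float, timeout: float) -> str:
    """Publish an event without stalling the event loop.

    The blocking fetch runs in a worker thread and the wait is bounded,
    so a slow schema service degrades publishing, not the whole request.
    """
    try:
        schema = await asyncio.wait_for(
            asyncio.to_thread(fetch_schema_blocking, delay), timeout
        )
    except asyncio.TimeoutError:
        return "skipped"  # e.g. drop or queue the event; the request continues
    return f"published:{schema['event']}:{payload['id']}"


async def heartbeat(ticks: list, interval: float, count: int) -> None:
    """Records ticks; it only makes progress while the loop is not blocked."""
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())


async def demo() -> tuple:
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks, 0.01, 5))
    fast = await publish_event({"id": 1}, delay=0.0, timeout=0.5)
    slow = await publish_event({"id": 2}, delay=0.3, timeout=0.05)
    await beat
    return fast, slow, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Calling fetch_schema_blocking directly inside a coroutine would instead freeze the heartbeat, and every other in-flight request, for the full duration of the call.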

We then noticed that conversation responses appeared successful but were usually empty. We identified this as a regression caused by recent code changes, fixed it, and rolled forward again to mitigate it.

As part of the incident response, we have already implemented the following measures:

Removed the IO-blocking behavior that occurred during event publishing.
Added caching to the schema service.
Implemented additional monitoring for the schema service.
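
As a rough illustration of how caching reduces load on a schema service, here is a small time-to-live cache with hit and miss counters that a monitor could export. This is a sketch under stated assumptions, not the implementation used in the incident response: the class TTLSchemaCache, the fetch callback, and the 60-second TTL are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TTLSchemaCache:
    """Caches schema lookups for `ttl` seconds (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch    # the expensive call to the schema service
        self._ttl = ttl
        self._clock = clock    # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, dict]] = {}
        self.hits = 0          # simple counters a monitor could export
        self.misses = 0

    def get(self, name: str) -> dict:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        # Missing or expired: fetch once and remember the result.
        self.misses += 1
        schema = self._fetch(name)
        self._entries[name] = (now, schema)
        return schema


# Usage with a fake fetcher and a fake clock.
calls = []
fake_now = [0.0]


def fake_fetch(name: str) -> dict:
    calls.append(name)
    return {"name": name}


cache = TTLSchemaCache(fake_fetch, ttl=60.0, clock=lambda: fake_now[0])
cache.get("message_created")
cache.get("message_created")   # served from the cache
fake_now[0] = 61.0
cache.get("message_created")   # expired, fetched again
```

With a cache like this, a burst of publishers asking for the same schema results in one upstream request per TTL window instead of one per event.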

Additionally, we will be implementing the following changes to prevent future incidents:

Reduce the environment mismatch between testing and production.
Add a monitor for shortened ChatGPT responses.
Reduce the time it takes to deploy a revert.
Remove the dependency on the schema service.
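
One possible shape for a monitor of shortened responses is a sliding window over recent completions that alerts when too many are empty or very short. The sketch below is illustrative only: the class ShortResponseMonitor and its window size, minimum length, and alert ratio are assumptions, not values from the incident.

```python
from collections import deque


class ShortResponseMonitor:
    """Tracks the share of short responses over the last `window` requests."""

    def __init__(self, window: int, min_chars: int, max_short_ratio: float) -> None:
        self._recent = deque(maxlen=window)  # True means the response was short
        self._min_chars = min_chars
        self._max_short_ratio = max_short_ratio

    def record(self, response_text: str) -> None:
        """Record one completed response, successful status or not."""
        self._recent.append(len(response_text.strip()) < self._min_chars)

    def short_ratio(self) -> float:
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def should_alert(self) -> bool:
        # Wait for a full window so a few early samples cannot trigger alerts.
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and self.short_ratio() > self._max_short_ratio


# Usage: healthy traffic followed by a run of empty completions.
monitor = ShortResponseMonitor(window=10, min_chars=1, max_short_ratio=0.5)
for _ in range(10):
    monitor.record("Here is the answer to your question.")
healthy_alert = monitor.should_alert()
for _ in range(6):
    monitor.record("")
degraded_alert = monitor.should_alert()
```

A check like this looks at response content rather than status codes, which matters when responses report success but arrive empty.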

Tue, Jun 25, 2024, 01:05 AM

Resolved

ChatGPT experienced an elevated error rate from 11:20 am PT to 1:55 pm PT. This is now resolved.

Mon, Jun 17, 2024, 09:02 PM (1 week earlier)

Investigating

We are currently investigating elevated error rates impacting ChatGPT.

Mon, Jun 17, 2024, 06:39 PM (2 hours earlier)

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.