Elevated errors on ChatGPT
Incident Report for OpenAI
Postmortem

On June 17th, 2024, from 11:39 am to 2:02 pm PT, ChatGPT experienced an elevated error rate, with the majority of requests failing at one point.

The incident involved three main issues:

  • An inference engine issue prompted the initial rollback.
  • A series of cascading issues in our event bus infrastructure caused IO-blocking across the ChatGPT service, which prevented requests from completing.
  • A bug caused ChatGPT users to receive empty completion responses.

During the initial rollback, an event publishing flow degraded unexpectedly. Because of recent infrastructure changes combined with the deployment, we experienced an abnormally high number of requests to a schema service, which led to increased latency. A third-party library executing these requests used IO-blocking calls that caused processes to stall, so ChatGPT requests timed out and returned 504 errors. We rolled forward to mitigate this, and ChatGPT requests stopped returning 504s.
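To illustrate the failure mode described above, here is a minimal sketch of how one blocking call can stall every other request sharing a process. It assumes an asyncio-style event loop, which this report does not confirm. The names fetch_schema_blocking, publish_blocking, and publish_offloaded are hypothetical and do not come from the affected service or library.

```python
import asyncio
import time


def fetch_schema_blocking(delay: float) -> str:
    # Hypothetical stand-in for a synchronous schema lookup made by a library.
    time.sleep(delay)
    return "schema-v1"


async def heartbeat(ticks: list, interval: float, count: int) -> None:
    # Stand-in for other requests that share the same event loop.
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.monotonic())


async def publish_blocking(delay: float) -> str:
    # Calls the blocking function directly, so the whole loop stalls.
    return fetch_schema_blocking(delay)


async def publish_offloaded(delay: float) -> str:
    # Runs the blocking call in a worker thread so the loop keeps serving.
    return await asyncio.to_thread(fetch_schema_blocking, delay)


async def max_gap(publish, delay: float = 0.3) -> float:
    # Returns the longest pause between heartbeat ticks while publishing.
    ticks = [time.monotonic()]
    hb = asyncio.create_task(heartbeat(ticks, 0.02, 10))
    await asyncio.sleep(0)  # give the heartbeat task a chance to start
    await publish(delay)
    await hb
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print("blocking gap: ", asyncio.run(max_gap(publish_blocking)))
    print("offloaded gap:", asyncio.run(max_gap(publish_offloaded)))
```

In the blocking variant the heartbeat pauses for roughly the full lookup delay, which is the same shape as requests stalling until they time out. Offloading the call to a thread is only one possible remedy and is not necessarily the fix that was applied.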

We then noticed that conversation responses appeared to be successful but were usually empty. We identified this as a regression caused by recent code changes. We fixed the regression and rolled forward again to mitigate it.

As part of the incident response, we have already implemented the following measures:

  • Removed the IO-blocking behavior that occurred during event publishing.
  • Added caching to the schema service.
  • Implemented additional monitoring for the schema service.
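As a rough illustration of the caching measure, the sketch below puts a small time-to-live cache in front of a schema lookup so that repeated requests for the same schema do not reach the service. The class TTLSchemaCache, its parameters, and the fetch callable are assumptions made for this example. The report does not describe how the actual cache is implemented.

```python
import time
from typing import Callable, Dict, Tuple


class TTLSchemaCache:
    """Caches schema lookups for `ttl` seconds (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # the expensive call to the schema service
        self._ttl = ttl          # how long a cached entry stays fresh
        self._clock = clock      # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, subject: str) -> str:
        now = self._clock()
        hit = self._entries.get(subject)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]        # fresh entry: no request to the service
        value = self._fetch(subject)
        self._entries[subject] = (now, value)
        return value
```

With a cache like this, a burst of identical lookups results in one upstream request per TTL window instead of one per event.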

Additionally, we will be implementing the following changes to help prevent similar incidents in the future:

  • Reduce environment mismatch between testing and production.
  • Add a monitor for shortened ChatGPT responses.
  • Improve revert deploy time.
  • Remove the dependency on the schema service.
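The monitor for shortened responses could take many forms. One minimal sketch, under assumed thresholds, tracks the share of empty or very short completions over a sliding window of recent responses. EmptyResponseMonitor and its default values are hypothetical and do not describe the monitoring that was or will be deployed.

```python
from collections import deque


class EmptyResponseMonitor:
    """Tracks the share of too-short completions in a sliding window."""

    def __init__(self, window: int = 100, threshold: float = 0.2,
                 min_chars: int = 1) -> None:
        self._recent = deque(maxlen=window)  # True means "too short"
        self._threshold = threshold          # alert at or above this rate
        self._min_chars = min_chars          # shorter than this counts as short

    def record(self, completion_text: str) -> None:
        # Whitespace-only completions are treated as empty.
        self._recent.append(len(completion_text.strip()) < self._min_chars)

    def short_rate(self) -> float:
        if not self._recent:
            return 0.0
        return sum(self._recent) / len(self._recent)

    def should_alert(self) -> bool:
        # Wait for a full window so a few early samples cannot trigger alerts.
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and self.short_rate() >= self._threshold
```

A check like this targets responses that return a success status but carry no content, which error-rate monitoring alone would not flag.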

Posted Jun 24, 2024 - 18:05 PDT

Resolved
ChatGPT experienced an elevated error rate from 11:20am PT to 1:55pm PT. This is now resolved.
Posted Jun 17, 2024 - 14:02 PDT
Investigating
We are currently investigating elevated error rates impacting ChatGPT.
Posted Jun 17, 2024 - 11:39 PDT
This incident affected: ChatGPT.