On June 17, 2024, from 11:39 am to 2:02 pm PT, ChatGPT experienced an elevated error rate; at the peak, the majority of requests failed.
The incident involved three main issues:
During the initial rollback, an event publishing flow degraded unexpectedly. Recent infrastructure changes, combined with the deployment, produced an abnormally high number of requests to a schema service, which led to increased latencies. A third-party library executing these requests performed blocking I/O, which caused processes to stall, so ChatGPT requests timed out and returned 504 errors. We rolled forward to mitigate this, after which ChatGPT requests no longer returned 504s.
We then noticed that conversation responses appeared to succeed but were usually empty. We identified this as a regression caused by recent code changes. We fixed the regression and rolled forward again to mitigate it.
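The stall described in the first issue above, where a blocking call to a slow dependency holds up request handling, can be illustrated with a minimal sketch. It is not the actual ChatGPT code: the names `get_schema` and `slow_fetch_schema`, the in-memory cache, and the timeout value are all assumptions made for illustration. The idea shown is to bound how long a caller waits on a blocking lookup and to fall back to the last known value when the dependency is slow.

```python
import concurrent.futures
import time

# Hypothetical worker pool and cache; neither comes from the incident report.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
_schema_cache = {}


def slow_fetch_schema(name, delay_s):
    """Stand-in for a blocking call to a schema service (assumed behavior)."""
    time.sleep(delay_s)
    return {"name": name, "version": 1}


def get_schema(name, fetch, timeout_s=0.1):
    """Run the blocking fetch in a worker thread and wait at most timeout_s.

    On timeout, return the last cached schema for this name (or None) so the
    caller is not stalled for as long as the dependency is slow.
    """
    future = _executor.submit(fetch, name)
    try:
        schema = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker thread keeps running in the background; only the
        # caller's wait is bounded here.
        return _schema_cache.get(name)
    _schema_cache[name] = schema
    return schema


fast = get_schema("event", lambda n: slow_fetch_schema(n, 0.0))
stale = get_schema("event", lambda n: slow_fetch_schema(n, 0.5))
missing = get_schema("other", lambda n: slow_fetch_schema(n, 0.5))
```

In this sketch the second call returns the cached schema instead of waiting for the slow fetch, and the third returns `None` because nothing has been cached for that name yet. A real service would also need to cap the number of in-flight fetches, since timed-out calls still occupy worker threads until they finish.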
As part of the incident response, we have already implemented the following measures:
Additionally, we will implement the following changes to prevent similar incidents in the future:
Reduce the environment mismatch between testing and production.
Add a monitor for shortened ChatGPT responses.
Reduce the time it takes to deploy a revert.
Remove the dependency on the schema service.
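A monitor for shortened responses, as listed above, could take many forms. The following is a minimal sketch under stated assumptions, not a description of the monitor that was or will be built. The class name `ShortResponseMonitor`, the window size, and the thresholds are hypothetical. It tracks a sliding window of successful responses and flags when the share of empty or near-empty ones exceeds a threshold, which is the failure mode where requests appear to succeed but return no content.

```python
from collections import deque


class ShortResponseMonitor:
    """Sliding-window check for responses that succeed but are too short.

    All parameters are illustrative assumptions, not values from the report.
    """

    def __init__(self, window=100, min_chars=1, max_short_fraction=0.2):
        self.window = deque(maxlen=window)  # True means the response was short
        self.min_chars = min_chars
        self.max_short_fraction = max_short_fraction

    def record(self, response_text, status_code=200):
        # Only successful responses are counted; errors are assumed to be
        # covered by separate error-rate monitoring.
        if status_code != 200:
            return
        self.window.append(len(response_text.strip()) < self.min_chars)

    def short_fraction(self):
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def should_alert(self):
        # Wait for a full window so a few early samples do not trigger alerts.
        window_full = len(self.window) == self.window.maxlen
        return window_full and self.short_fraction() > self.max_short_fraction


monitor = ShortResponseMonitor(window=10, max_short_fraction=0.2)
for text in ["hello"] * 8 + ["", ""]:
    monitor.record(text)
alert_at_threshold = monitor.should_alert()

monitor.record("")  # the oldest "hello" drops out; 3 of 10 are now empty
alert_above_threshold = monitor.should_alert()
```

The alert fires only once the short fraction is strictly above the threshold. In practice the threshold would need tuning against the normal rate of legitimately short responses.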