Elevated errors affecting API and ChatGPT
Incident Report for OpenAI
Postmortem

On January 29, 2024, from 4:19 pm to 4:34 pm PST, a significant portion of requests to our system experienced a high failure rate and increased latency. The issue was triggered by a critical error in our routing mechanism.

We were immediately alerted to the failures and rapidly identified the root cause as an interaction between our ChatGPT service and the downstream services, which was triggered by a new release of our ChatGPT backend.

After identifying the root cause, we rolled back both the backend and the ChatGPT services to their previous stable versions, which restored normal service operations.

The detailed incident timeline is as follows:

  • 2:33 pm, 29th Jan 2024: Culprit PR deployed to the backend service.
  • 4:17 pm, 29th Jan 2024: New ChatGPT deployment. The issue in the downstream service was triggered only once ChatGPT was deployed.
  • 4:19 pm, 29th Jan 2024: Impact becomes apparent in monitoring systems, and alerts fire, notifying engineers.
  • 4:22 pm, 29th Jan 2024: Root cause identified.
  • 4:28 pm, 29th Jan 2024: Rollback mitigation started.
  • 4:34 pm, 29th Jan 2024: Traffic fully recovered.
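
The intervals implied by the timeline above can be worked out with a short script. This is only an illustrative sketch: the dictionary keys and the helper names (`to_minutes`, `minutes_between`) are made up for this example, and the timestamps are copied from the timeline as written.

```python
# Timestamps copied from the timeline above (all on 29 Jan 2024).
# The key names are hypothetical labels chosen for this sketch.
TIMELINE = {
    "backend_deploy": "2:33 pm",
    "chatgpt_deploy": "4:17 pm",
    "alerts_fired": "4:19 pm",
    "root_cause_identified": "4:22 pm",
    "rollback_started": "4:28 pm",
    "recovered": "4:34 pm",
}


def to_minutes(clock: str) -> int:
    """Convert a 12-hour clock string such as '4:19 pm' to minutes since midnight."""
    time_part, meridiem = clock.split()
    hours, minutes = (int(piece) for piece in time_part.split(":"))
    hours = hours % 12 + (12 if meridiem.lower() == "pm" else 0)
    return hours * 60 + minutes


def minutes_between(start: str, end: str) -> int:
    """Minutes elapsed between two named timeline events."""
    return to_minutes(TIMELINE[end]) - to_minutes(TIMELINE[start])


if __name__ == "__main__":
    print("Culprit change latent for:", minutes_between("backend_deploy", "chatgpt_deploy"), "min")
    print("Trigger to alert:", minutes_between("chatgpt_deploy", "alerts_fired"), "min")
    print("Alert to root cause:", minutes_between("alerts_fired", "root_cause_identified"), "min")
    print("Root cause to rollback start:", minutes_between("root_cause_identified", "rollback_started"), "min")
    print("Alert to full recovery:", minutes_between("alerts_fired", "recovered"), "min")
```

Read this way, the culprit change sat in the backend for 104 minutes before the ChatGPT deployment triggered it, and the customer-facing impact lasted 15 minutes from the first alert to full recovery.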

As part of the incident response, we are taking immediate measures to prevent similar issues in the future, including:

  1. Evaluating dependency enhancements - strengthening our downstream system to prevent issues caused by dependencies between services.
  2. Adjusting the code rollout strategy for the downstream service so that changes are deployed in phases.
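
As a rough illustration of what a phased rollout can look like in general, the sketch below advances a new version through increasing traffic percentages and stops as soon as the observed error rate exceeds a threshold. It does not describe the actual deployment tooling behind this incident: the stage percentages, the threshold, and every name in it (`phased_rollout`, `RolloutResult`, `error_rate_at`) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool            # True if every stage passed its health check
    last_healthy_percent: int  # Highest traffic share that passed (0 if none did)


def phased_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> RolloutResult:
    """Advance through traffic percentages, halting at the first unhealthy stage.

    `error_rate_at` is a hypothetical hook that returns the error rate observed
    after shifting the given percentage of traffic to the new version.
    """
    last_healthy = 0
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            # Stop here; the caller would roll traffic back to the old version.
            return RolloutResult(completed=False, last_healthy_percent=last_healthy)
        last_healthy = percent
    return RolloutResult(completed=True, last_healthy_percent=last_healthy)


if __name__ == "__main__":
    # Simulated regression that only appears once 10% or more of traffic is shifted.
    def simulated_error_rate(percent: int) -> float:
        return 0.20 if percent >= 10 else 0.001

    print(phased_rollout([1, 10, 50, 100], simulated_error_rate))
```

The point of the staged approach is that a regression surfacing at a small traffic share limits how many requests are affected before the rollout is halted.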

We apologize for any inconvenience caused by this incident and are committed to ensuring the reliability and availability of our services.

Posted Feb 05, 2024 - 08:47 PST

Resolved
This incident has been resolved.
Posted Jan 29, 2024 - 16:49 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 29, 2024 - 16:37 PST
Update
We are continuing to investigate this issue.
Posted Jan 29, 2024 - 16:30 PST
Update
We are continuing to investigate this issue.
Posted Jan 29, 2024 - 16:29 PST
Investigating
We are currently investigating this issue.
Posted Jan 29, 2024 - 16:26 PST
This incident affected: API and ChatGPT.