On January 29, 2024, from 4:19 pm to 4:34 pm, a significant portion of requests to our system experienced a high failure rate and increased latency. The issue was triggered by a critical error in our routing mechanism.
We were immediately alerted to the failures and rapidly identified the root cause as an interaction between our ChatGPT service and the downstream services, which was triggered by a new release of our ChatGPT backend.
After identifying the root cause of the issue, we took action by rolling back both the backend and the ChatGPT services to their previous stable versions. This action restored normal service operations.
The detailed incident timeline is as below:
As part of the incident response, we are taking immediate measures to prevent similar issues in the future, including:
We apologize for any inconvenience caused by this incident and are committed to ensuring the reliability and availability of our services.