Elevated errors affecting API and ChatGPT

Resolved · Full outage

We’ve published a write-up of this incident.

Affected components

Jan 30, 2024, 12:26 AM to 12:49 AM

Updates

Write-up published

Read it here

Resolved

On January 29, 2024, from 4:19 pm to 4:34 pm, a significant portion of requests to our system failed or saw increased latency. The issue was triggered by a critical error in our routing mechanism.

We were immediately alerted to the failures and rapidly identified the root cause as an interaction between our ChatGPT service and the downstream services, which was triggered by a new release of our ChatGPT backend.

After identifying the root cause, we rolled back both the backend and the ChatGPT services to their previous stable versions, which restored normal service operations.

The detailed incident timeline is as follows:

2:33 pm, 29th Jan 2024: Culprit PR deployed to the backend service.
4:17 pm, 29th Jan 2024: New ChatGPT deployment goes out. The issue in the downstream service was triggered only once ChatGPT was deployed.
4:19 pm, 29th Jan 2024: Impact becomes apparent in monitoring systems, and alerts fire to notify engineers.
4:22 pm, 29th Jan 2024: Root cause identified.
4:28 pm, 29th Jan 2024: Rollback mitigation started.
4:34 pm, 29th Jan 2024: Traffic fully recovered.
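
The intervals implied by the timeline above can be worked out with a short script. This is only an illustrative sketch: the event labels and the `minutes_between` helper are hypothetical names chosen for this example, and the clock times are copied from the timeline entries.

```python
from datetime import datetime

# Clock times copied from the incident timeline (all on 29th Jan 2024).
# The dictionary keys are hypothetical labels chosen for this example.
TIMELINE = {
    "backend_deploy": "2:33 pm",
    "chatgpt_deploy": "4:17 pm",
    "alerts_fired": "4:19 pm",
    "root_cause_identified": "4:22 pm",
    "rollback_started": "4:28 pm",
    "traffic_recovered": "4:34 pm",
}


def parse_clock(clock: str) -> datetime:
    """Parse a 12-hour clock time; all events fall on the same day."""
    return datetime.strptime(clock, "%I:%M %p")


def minutes_between(start_event: str, end_event: str) -> int:
    """Return the whole minutes elapsed between two timeline events."""
    delta = parse_clock(TIMELINE[end_event]) - parse_clock(TIMELINE[start_event])
    return int(delta.total_seconds() // 60)
```

For example, `minutes_between("alerts_fired", "traffic_recovered")` returns 15, matching the 4:19 pm to 4:34 pm impact window, and `minutes_between("chatgpt_deploy", "alerts_fired")` returns 2.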

As part of the incident response, we are taking immediate measures to prevent similar issues in the future, including:

Evaluating dependency enhancements: strengthening our downstream system to prevent issues caused by dependencies between services.
Adjusting the code rollout strategy for the downstream service so that changes are deployed in phases.
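
As a rough illustration of what a phased rollout can look like in general, the sketch below deploys to increasing shares of traffic and rolls back at the first unhealthy stage. It is not a description of the actual deployment tooling: the function name, the stage percentages, and the `deploy_to`, `is_healthy`, and `rollback` callbacks are all assumptions made for this example.

```python
from typing import Callable, Sequence


def phased_rollout(
    stages: Sequence[int],
    deploy_to: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy stage by stage; roll back and stop at the first unhealthy stage.

    stages     -- increasing traffic percentages, e.g. [1, 10, 50, 100]
    deploy_to  -- hypothetical hook that shifts the given percentage of traffic
    is_healthy -- hypothetical hook that checks error rate and latency
    rollback   -- hypothetical hook that restores the previous stable version
    """
    for percent in stages:
        deploy_to(percent)
        if not is_healthy(percent):
            # Stop early so that only part of the traffic sees the bad release.
            rollback()
            return False
    return True
```

With stages such as `[1, 10, 50, 100]`, a release that only misbehaves under broader exposure would be caught and rolled back before it reaches all traffic.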

We apologize for any inconvenience caused by this incident and are committed to ensuring the reliability and availability of our services.

Mon, Feb 5, 2024, 04:47 PM

Resolved

This incident has been resolved.

Tue, Jan 30, 2024, 12:49 AM (6 days earlier)

Monitoring

A fix has been implemented and we are monitoring the results.

Tue, Jan 30, 2024, 12:37 AM (12 minutes earlier)

Investigating

We are continuing to investigate this issue.

Tue, Jan 30, 2024, 12:30 AM

Investigating

We are continuing to investigate this issue.

Tue, Jan 30, 2024, 12:29 AM

Investigating

We are currently investigating this issue.

Tue, Jan 30, 2024, 12:26 AM

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.