DALL·E Web Interface Incident

Resolved

We’ve published a write-up of this incident.

Affected components

No components marked as affected

Updates

Write-up published

Read it here

Resolved

The hosts serving DALL·E's Web Experience and the text-curie-001 API went offline during a planned Kubernetes version upgrade. As nodes were cycled for the upgrade, they failed to rejoin our Kubernetes cluster because a GPU diagnostics command exceeded a timeout. Both the timeout and the boot script are managed by our service provider, so we do not control them. We did not anticipate the failure because this behavior is unique to a particular node type in a particular region.

text-curie-001 was quickly moved to an unaffected node, and its service was restored.

Because of the size of DALL·E's infrastructure and our limited capacity, moving DALL·E to healthy nodes was not an option. The resulting loss of capacity degraded DALL·E service: the request queue grew so long that most requests timed out before their images could be served.

During this incident, we introduced several levers for graceful load shedding when DALL·E receives more requests than it can support. Implementing one of these levers required a database migration. Because of unexpected row locks, the migration stalled and had to be rolled back and retried. During this time we were unable to serve DALL·E, which lengthened our recovery time.

Moving forward, we are implementing additional levers for load shedding and investigating other ways to serve more requests within our capacity constraints. One such lever is rejecting all inbound requests once the request queue grows beyond a certain length, since those requests would time out before returning anyway. Additionally, we are reconfiguring our nodes to give us full control over boot-up scripts, and adding new procedures to check for unexpected inconsistencies before full node cycles.
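As a rough illustration of the queue-length lever described above, the following Python sketch rejects a request at admission time when its estimated wait would already exceed the client timeout. It is not our production implementation: the class name `SheddingQueue`, its parameters, and the simple wait estimate (queue length divided by worker count, times average service time) are all assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SheddingQueue:
    """Hypothetical FIFO queue that sheds requests it could not serve in time."""

    avg_service_seconds: float      # assumed average time to serve one request
    workers: int                    # assumed number of requests served in parallel
    request_timeout_seconds: float  # assumed timeout after which a request fails
    _items: deque = field(default_factory=deque, init=False)

    def estimated_wait_seconds(self) -> float:
        # Queued requests are drained `workers` at a time.
        return len(self._items) / self.workers * self.avg_service_seconds

    def try_enqueue(self, request_id: str) -> bool:
        """Return True if the request was queued, False if it was shed."""
        expected_total = self.estimated_wait_seconds() + self.avg_service_seconds
        if expected_total > self.request_timeout_seconds:
            # The request would time out before returning, so reject it now
            # instead of letting it occupy a queue slot.
            return False
        self._items.append(request_id)
        return True


if __name__ == "__main__":
    queue = SheddingQueue(avg_service_seconds=10.0, workers=2,
                          request_timeout_seconds=30.0)
    for i in range(7):
        accepted = queue.try_enqueue(f"request-{i}")
        print(f"request-{i}:", "queued" if accepted else "rejected")
```

Rejecting early keeps the queue short enough that the requests already admitted can still complete, rather than letting every request wait and time out together.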

Tue, Mar 21, 2023, 08:29 PM

Resolved

The incident has been mitigated. We are continuing to investigate more ways to add capacity to DALL·E.

Mon, Mar 20, 2023, 11:49 PM (20 hours earlier)

Monitoring

We are continuing to investigate this issue.

Mon, Mar 20, 2023, 11:07 PM (42 minutes earlier)

Identified

The migration rollback was successful.

Mon, Mar 20, 2023, 09:47 PM (1 hour earlier)

Identified

We are rolling back the database migration.

Mon, Mar 20, 2023, 09:38 PM

Identified

There is currently a Labs outage due to a failed database migration.

Mon, Mar 20, 2023, 09:37 PM

Monitoring

We are continuing to investigate this issue.

Mon, Mar 20, 2023, 09:35 PM

Monitoring

We are allowing some free Labs traffic while we investigate an issue with end-to-end request latency that is causing a backlog of unprocessed requests.

Mon, Mar 20, 2023, 08:16 PM (1 hour earlier)

Monitoring

We are gradually restoring service to free traffic.

Mon, Mar 20, 2023, 07:54 PM (21 minutes earlier)

Identified

Paid Labs traffic has been restored, and we will soon begin gradually restoring free traffic.

Mon, Mar 20, 2023, 07:01 PM (52 minutes earlier)

Identified

We are adding additional capacity in other regions while investigating the underlying capacity failure.

Mon, Mar 20, 2023, 06:01 PM (1 hour earlier)

Investigating

We're investigating an unintentional reduction in available capacity.

Mon, Mar 20, 2023, 05:19 PM (42 minutes earlier)

Investigating

We are currently investigating.

Mon, Mar 20, 2023, 05:14 PM

Availability metrics are reported at an aggregate level across all tiers, models, and error types. An individual customer's availability may vary depending on their subscription tier and on the specific model and API features in use.