DALL·E Web Interface Incident
Incident Report for OpenAI
Postmortem

The hosts that were serving DALL·E's Web Experience and the text-curie-001 API went offline because they failed to properly rejoin our Kubernetes cluster. The nodes could not rejoin because a particular GPU diagnostics command run at boot exceeded its timeout. We do not control this timeout or the boot script, since both are managed by our service provider. This was not anticipated, since the behavior is unique to a particular node type in a particular region. The nodes were being cycled as part of a planned Kubernetes version upgrade.
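For illustration only (the real boot script is provider-managed and not visible to us), the sketch below shows what a boot-time GPU diagnostics step run under an explicit timeout might look like; the diagnostics command, flags, and timeout value are assumptions, not the provider's configuration.

```python
# Hypothetical sketch; the actual boot script is managed by our provider, and
# the diagnostics command and timeout here are assumptions for illustration.
import subprocess
import sys

GPU_DIAGNOSTICS_CMD = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"]
DIAGNOSTICS_TIMEOUT_S = 120  # assumed value; in the incident the real timeout was exceeded

def run_gpu_diagnostics() -> bool:
    """Run the diagnostics command under an explicit timeout so a slow command
    fails the step visibly instead of silently blocking the node from joining."""
    try:
        subprocess.run(GPU_DIAGNOSTICS_CMD, check=True, timeout=DIAGNOSTICS_TIMEOUT_S)
        return True
    except subprocess.TimeoutExpired:
        print("GPU diagnostics exceeded timeout; node will not join the cluster", file=sys.stderr)
        return False
    except subprocess.CalledProcessError as exc:
        print(f"GPU diagnostics failed with exit code {exc.returncode}", file=sys.stderr)
        return False

if __name__ == "__main__":
    sys.exit(0 if run_gpu_diagnostics() else 1)
```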

text-curie-001 was quickly moved to an unaffected node and service was restored.

Due to the size of DALL·E's infrastructure and limited capacity, moving to healthy nodes was not an option. The resulting decrease in capacity degraded DALL·E service, as the request queue grew long enough that most requests timed out before image generations could be served.

During this incident, we introduced several levers for graceful load shedding in situations where DALL·E receives more requests than it can support. Implementing one of these levers required a database migration. That migration stalled on unexpected row locks and had to be rolled back and retried. During this period we were unable to serve DALL·E at all, which further extended our recovery time.
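For context on the row-lock failure mode, here is a minimal, hypothetical sketch of running a schema change with an explicit lock timeout so it fails fast and can be retried, rather than stalling behind other transactions. It assumes a PostgreSQL-style database reached via psycopg2; the DSN, table, and column names are invented and do not reflect our actual schema or migration tooling.

```python
# Hypothetical sketch only; assumes a PostgreSQL database accessed via psycopg2.
# The table and column names are invented for illustration.
import time
import psycopg2
from psycopg2 import errors

MIGRATION_SQL = "ALTER TABLE generation_requests ADD COLUMN shed_priority INTEGER DEFAULT 0"

def run_migration_with_retries(dsn: str, attempts: int = 5) -> None:
    """Apply the migration, failing fast on conflicting locks and retrying with backoff."""
    for attempt in range(1, attempts + 1):
        conn = psycopg2.connect(dsn)
        try:
            with conn, conn.cursor() as cur:
                # Give up quickly if another transaction holds a conflicting lock,
                # instead of stalling the migration (and the service) indefinitely.
                cur.execute("SET LOCAL lock_timeout = '5s'")
                cur.execute(MIGRATION_SQL)
            return  # committed successfully
        except errors.LockNotAvailable:
            # The `with conn` block has already rolled back; wait, then retry.
            time.sleep(2 ** attempt)
        finally:
            conn.close()
    raise RuntimeError("migration could not acquire the required locks after retries")
```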

Moving forward, we are implementing additional load-shedding levers and investigating alternative ways to serve more requests within our capacity constraints. One such lever rejects inbound requests once the request queue grows beyond a length at which a new request would certainly time out before it could be served. Additionally, we are reconfiguring our nodes to give us full control over boot-up scripts, and adding procedures to check for unexpected inconsistencies before full node cycles.
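As a rough illustration of the queue-length lever described above (not the production implementation), the sketch below rejects a new request whenever the estimated time to drain the queue already exceeds the client timeout. The timeout, per-image generation time, and worker count are placeholder assumptions.

```python
# Illustrative sketch of queue-length-based load shedding; the thresholds and
# service-time estimates are placeholders, not DALL·E's actual configuration.
from dataclasses import dataclass

@dataclass
class ShedConfig:
    request_timeout_s: float = 60.0   # assumed client-side request timeout
    avg_generation_s: float = 5.0     # assumed average time to serve one generation
    workers: int = 8                  # assumed number of parallel generation workers

def should_shed(queue_length: int, cfg: ShedConfig) -> bool:
    """Reject a new request if it would time out before reaching the front of the queue."""
    estimated_wait_s = (queue_length / cfg.workers) * cfg.avg_generation_s
    return estimated_wait_s >= cfg.request_timeout_s

# Example: with ~100 queued requests and the placeholder figures above, the
# estimated wait (~62s) exceeds the 60s timeout, so the new request is rejected.
if __name__ == "__main__":
    print(should_shed(queue_length=100, cfg=ShedConfig()))  # True -> shed the request
```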

Posted Mar 21, 2023 - 13:29 PDT

Resolved
The incident has been mitigated. We are continuing to investigate more ways to add capacity to DALL·E.
Posted Mar 20, 2023 - 16:49 PDT
Monitoring
We are continuing to investigate this issue.
Posted Mar 20, 2023 - 16:07 PDT
Update
The migration rollback was successful.
Posted Mar 20, 2023 - 14:47 PDT
Update
We are rolling back the database migration.
Posted Mar 20, 2023 - 14:38 PDT
Identified
There is currently a Labs outage due to a failed database migration.
Posted Mar 20, 2023 - 14:37 PDT
Update
We are continuing to investigate this issue.
Posted Mar 20, 2023 - 14:35 PDT
Update
We are allowing some free Labs traffic while we investigate an issue with end-to-end request latency that is causing a backlog of requests.
Posted Mar 20, 2023 - 13:16 PDT
Monitoring
We are gradually restoring service to free traffic.
Posted Mar 20, 2023 - 12:54 PDT
Update
Paid Labs traffic has been restored, and we will soon begin gradually restoring free traffic.
Posted Mar 20, 2023 - 12:01 PDT
Identified
We are adding additional capacity in other regions while investigating the underlying capacity failure.
Posted Mar 20, 2023 - 11:01 PDT
Update
We're investigating an unintentional reduction in available capacity.
Posted Mar 20, 2023 - 10:19 PDT
Investigating
We are currently investigating.
Posted Mar 20, 2023 - 10:14 PDT
This incident affected: Labs.