Write-up published
Resolved
The hosts which were serving DALL·E's Web Experience and the text-curie-001 API went offline. This was due to hosts not properly joining our Kubernetes cluster. The nodes didn't re-join the cluster due to timing issues of a particular GPU diagnostics command that exceeded a timeout. We do not have control over this timeout or boot script since this is managed by our service provider. This was not anticipated since this behavior is unique to a particular node type in a particular region. The nodes were being cycled as part of a planned Kubernetes version upgrade.
text-curie-001 was quickly moved to an unaffected node and service restored.
Due to the size of DALL·E's infrastructure and limited capacity, moving to healthy nodes was not an option. The resulting decrease in capacity degraded DALL·E service, as the request queue grew long enough that most requests timed out before image generations could be served.
During this incident, we introduced several levers for graceful load shedding in events where DALL·E receives more requests than it can support. To implement one of these levers, we ran a database migration. This migration stalled, had to be rolled back, and then retried due to unexpected row locks. During this time we were unable to serve DALL·E and this issue exacerbated our recovery time.
Moving forward, we are implementing additional levers for load shedding and investigating alternative means of serving greater numbers of requests, given capacity constraints. One such lever is rejecting all inbound requests when the request queue grows beyond a certain length, if the request would certainly time out before returning anyway. Additionally, we are reconfiguring our nodes to give us full control over boot-up scripts and adding new procedures to check for unexpected inconsistencies before full node cycles.
Resolved
The incident has been mitigated. We are continuing to investigate more ways to add capacity to DALL·E.
Monitoring
We are continuing to investigate this issue.
Identified
The migration rollback was successful.
Identified
We are rolling back the database migration.
Identified
There is currently a Labs outage due to a failed database migration.
Monitoring
We are continuing to investigate this issue.
Monitoring
We are allowing some free Labs traffic while we investigate an issue with end to end request latency causing a backlog of requests to be processed.
Monitoring
We are gradually restoring service to free traffic.
Identified
Paid labs traffic has been restored and we will soon begin gradually restoring free traffic.
Identified
We are adding additional capacity in other regions while investigating the underlying capacity failure.
Investigating
We're investigating an unintentional reduction in available capacity.
Investigating
We are currently investigating.