Degraded performance on the DALL·E Web Interface due to backlog of requests

Resolved·Full outage

We’ve published a write-up of this incident.

Affected components

Mar 18, 2023, 09:22 PM

10:59 PM

Labs

Updates

Write-up published

Read it here

Resolved

On Saturday, March 18th and Sunday, March 19th, the DALL·E Web Experience suffered a severe degradation. From 6:20 am PDT to 11:20 am PDT, only 10% of DALL·E requests were successful. From 11:20 am to 2:40 pm PDT, only 25% of requests were successful. There was additional degradation Saturday night, with a 50% success rate at 11 pm and 0% at 7 am Sunday. Service was restored around 11 am on Sunday, March 19th.

DALL·E image generation tasks used to be canceled after five minutes. On Friday, March 17th, the engineering team reduced that timeout to one minute. This was done to make better use of finite capacity, under the assumption that the vast majority of people would not wait five minutes for their image generation. In anticipation of the effects this would have, we downsized the DALL·E fleet to free up capacity for other critical projects. On March 18, steadily increasing traffic caused a backup of requests that was severe enough to reach this one-minute threshold. As a result, most requests ended up getting canceled.
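
To illustrate the dynamic described above, here is a toy simulation of a first-in, first-out queue with a cancellation timeout. It is a minimal sketch, not the actual DALL·E system: the function name, the per-second rates, and the rule that canceled requests cost no worker time are all assumptions made for illustration. It shows that once load exceeds capacity and the wait reaches the timeout, the success rate settles near capacity divided by load.

```python
from collections import deque


def simulate_queue(arrivals_per_s, capacity_per_s, timeout_s, duration_s):
    """Return the fraction of finished requests that succeeded.

    Hypothetical model: requests arrive at a fixed rate, a monitor cancels
    any request that has waited timeout_s or longer, and workers then serve
    up to capacity_per_s requests from the head of the queue each second.
    """
    waiting = deque()  # arrival time of each queued request, oldest first
    succeeded = 0
    canceled = 0
    for now in range(duration_s):
        for _ in range(arrivals_per_s):
            waiting.append(now)
        # Cancel requests that have already waited past the timeout.
        while waiting and now - waiting[0] >= timeout_s:
            waiting.popleft()
            canceled += 1
        # Serve as many of the remaining requests as capacity allows.
        for _ in range(min(capacity_per_s, len(waiting))):
            waiting.popleft()
            succeeded += 1
    finished = succeeded + canceled
    return succeeded / finished if finished else 1.0


# Assumed numbers: load is five times the available capacity.
overloaded = simulate_queue(arrivals_per_s=10, capacity_per_s=2,
                            timeout_s=60, duration_s=600)
# Assumed numbers: capacity matches load, so nothing waits long enough to time out.
healthy = simulate_queue(arrivals_per_s=10, capacity_per_s=10,
                         timeout_s=60, duration_s=600)
print(f"overloaded: {overloaded:.0%} success, healthy: {healthy:.0%} success")
```

In this model a shorter timeout does not add capacity; it only bounds how long a request waits before it is dropped, which is why downsizing the fleet at the same time left little headroom.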

We fixed this by reallocating capacity to DALL·E and changing the queuing and retry logic.

When attempting to scale up workers and increase capacity, the application hit database connection limits. The team fixed this by adding connection pooling, which enabled the application to scale up and work through the backlog of requests. However, the addition of database connection pooling introduced a further bug that prolonged the incident. This bug prevented our health monitor from observing the state of the queue and enforcing timeouts, causing a backlog of requests until the bug was identified and fixed Sunday morning.
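
As a rough illustration of why pooling helps here, the sketch below lets eight worker threads share two database connections instead of opening one each. It is not the team's implementation: the `ConnectionPool` class, the `open_connection` and `handle_task` helpers, the pool size, and the use of in-memory SQLite are all assumptions chosen so the example runs with only the Python standard library.

```python
import queue
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool: at most `size` connections ever exist."""

    def __init__(self, size, factory):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout_s=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout_s)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


opened = []  # every connection the factory created, for inspection


def open_connection():
    # check_same_thread=False lets the pool hand a connection to any thread;
    # the pool guarantees only one thread uses it at a time.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    opened.append(conn)
    return conn


pool = ConnectionPool(size=2, factory=open_connection)


def handle_task(task_id):
    with pool.connection() as conn:
        return conn.execute("SELECT ?", (task_id,)).fetchone()[0]


# Eight workers, but the database only ever sees two connections.
with ThreadPoolExecutor(max_workers=8) as workers:
    results = list(workers.map(handle_task, range(8)))

print(results, len(opened))
```

The number of workers can then grow independently of the database's connection limit, at the cost of workers briefly waiting for a free connection.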

As further remediation, we are improving the observability of our queues and adding alerting redundancy. We have also changed our response codes to return 429s, which better signal when capacity is exhausted and help relieve the queues, and we are working on better separating and prioritizing traffic so that the service degrades more gracefully.
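
The 429 change can be pictured as an admission check in front of the queue. The following is a minimal sketch under assumed names and thresholds (`admit`, `max_queue_depth`, and the 30-second `Retry-After` value are hypothetical and do not come from the write-up): when the queue is already full, the request is rejected immediately with 429 Too Many Requests rather than being queued only to time out later.

```python
from http import HTTPStatus


def admit(queue_depth, max_queue_depth, retry_after_s=30):
    """Decide whether to enqueue a request or reject it up front.

    Returns an (HTTP status, headers) pair. All thresholds are
    illustrative assumptions, not values used by the real service.
    """
    if queue_depth >= max_queue_depth:
        # Tell the client capacity is exhausted and when to try again.
        headers = {"Retry-After": str(retry_after_s)}
        return HTTPStatus.TOO_MANY_REQUESTS, headers
    return HTTPStatus.ACCEPTED, {}


status, headers = admit(queue_depth=500, max_queue_depth=500)
print(int(status), headers)

status_ok, headers_ok = admit(queue_depth=10, max_queue_depth=500)
print(int(status_ok), headers_ok)
```

Rejecting early keeps the queue short enough that admitted requests can finish within the timeout, and gives clients a clear signal to back off instead of retrying into a backlog.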

Wed, Mar 22, 2023, 12:20 AM

Resolved

Successful requests and latency are back to pre-incident levels.

Sat, Mar 18, 2023, 10:59 PM(3 days earlier)

Monitoring

We've added additional capacity to Labs to help with demand.

Sat, Mar 18, 2023, 10:54 PM

Identified

Due to high load relative to currently available capacity, users will likely see timeouts on requests. We are currently investigating ways to mitigate this.

Sat, Mar 18, 2023, 10:18 PM(35 minutes earlier)

Investigating

We are continuing to investigate this issue.

Sat, Mar 18, 2023, 10:08 PM(10 minutes earlier)

Investigating

We are still investigating this issue.

Sat, Mar 18, 2023, 10:05 PM

Investigating

We are continuing to investigate this issue.

Sat, Mar 18, 2023, 09:22 PM(42 minutes earlier)

Investigating

We are currently investigating this issue.

Sat, Mar 18, 2023, 06:33 PM(2 hours earlier)

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.