On Saturday, March 18th, and Sunday, March 19th, the DALL·E Web Experience suffered a severe degradation. From 6:20 am PDT to 11:20 am PDT, only 10% of DALL·E requests were successful. From 11:20 am to 2:40 pm PDT, only 25% of requests were successful. There was additional degradation Saturday night, with a 50% success rate at 11 pm and 0% at 7 am Sunday. Service was restored around 11 am on Sunday, March 19th.
DALL·E image generation tasks used to be canceled after five minutes. On Friday, March 17th, the engineering team reduced that timeout to one minute. The change was intended to improve service under finite capacity, on the assumption that the vast majority of people would not wait five minutes for an image. Anticipating the effect of the shorter timeout, we downsized the DALL·E fleet to free up capacity for other critical projects. On March 18th, steadily increasing traffic caused a backlog of requests severe enough that requests reached the one-minute threshold. As a result, most requests were canceled.
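To illustrate the mechanism, here is a toy simulation of a first-in, first-out queue in which a request is canceled once it has waited past a timeout. It is only a sketch under simplifying assumptions: the function name `simulate`, the rates, and the one-second time step are all hypothetical and are not taken from the DALL·E system.

```python
from collections import deque


def simulate(arrivals_per_sec, served_per_sec, timeout_sec, duration_sec):
    """Return the fraction of resolved requests that were served.

    Toy model with hypothetical numbers: requests arrive at a fixed rate,
    workers serve a fixed number per second from the front of the queue,
    and anything that has waited timeout_sec or longer is canceled.
    """
    queue = deque()  # enqueue time of each waiting request, oldest first
    served = 0
    canceled = 0
    for now in range(duration_sec):
        # New requests join the back of the queue.
        for _ in range(arrivals_per_sec):
            queue.append(now)
        # Cancel requests that have already waited past the timeout.
        while queue and now - queue[0] >= timeout_sec:
            queue.popleft()
            canceled += 1
        # Serve as many of the oldest remaining requests as capacity allows.
        for _ in range(min(served_per_sec, len(queue))):
            queue.popleft()
            served += 1
    resolved = served + canceled
    return served / resolved if resolved else 1.0


if __name__ == "__main__":
    # Capacity above demand: the queue never builds up.
    print(simulate(arrivals_per_sec=5, served_per_sec=10,
                   timeout_sec=60, duration_sec=600))
    # Demand at twice capacity: waits grow until they hit the timeout.
    print(simulate(arrivals_per_sec=10, served_per_sec=5,
                   timeout_sec=60, duration_sec=6000))
```

In this model, once the wait reaches the timeout, the success rate settles near the ratio of capacity to demand, so shrinking the fleet while traffic grows lowers the success rate directly.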
The cancellations were fixed by reallocating capacity to DALL·E and changing the queuing and retry logic.
When the team attempted to scale up workers and increase capacity, the application hit database connection limits. The team fixed this by adding connection pooling, which allowed the application to scale up and work through the backlog of requests. However, the connection pooling change introduced a bug that prolonged the incident: it prevented our health monitor from observing the state of the queue and enforcing timeouts, which caused requests to back up until the bug was identified and fixed on Sunday morning.
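As a rough sketch of what connection pooling means here, the example below shares a fixed number of database connections among many workers instead of opening one connection per worker. It is an assumption-laden illustration: the `ConnectionPool` class is hypothetical, and SQLite stands in for the real database, which this report does not name.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool; SQLite is only a stand-in database."""

    def __init__(self, size, database=":memory:"):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a connection be handed to
            # whichever worker thread borrows it next.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout,
        # so the total number of open connections never exceeds `size`.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection, even if the caller raised.
            self._idle.put(conn)


if __name__ == "__main__":
    pool = ConnectionPool(size=2)
    with pool.connection() as conn:
        print(conn.execute("SELECT 1").fetchone())
```

With a pool like this, adding workers no longer adds database connections, so the worker count can grow past the database's connection limit. The trade-off is that every component that needs the database, including any monitoring process, competes for the same bounded set of connections.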
As further remediation, we are improving the observability of our queues to add alerting redundancy. We have also changed our responses to return HTTP 429 status codes when we are at capacity, which signals the condition more clearly and relieves pressure on the queues. We are also working on better separating and prioritizing traffic so that the service degrades more gracefully.
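The idea behind returning 429s can be sketched as an admission check that rejects new work once the queue is full, rather than accepting requests that are likely to time out. This is a minimal sketch, not the production logic: the `admit` function, the depth limit, and the `Retry-After` value are all hypothetical.

```python
from http import HTTPStatus


def admit(queue_depth, max_depth, retry_after_sec=30):
    """Decide whether to enqueue a request or reject it up front.

    Hypothetical policy: reject with 429 Too Many Requests when the queue
    is at its limit, and tell the client when to retry.
    """
    if queue_depth >= max_depth:
        headers = {"Retry-After": str(retry_after_sec)}
        return HTTPStatus.TOO_MANY_REQUESTS, headers
    # Below the limit, the request is accepted for asynchronous processing.
    return HTTPStatus.ACCEPTED, {}


if __name__ == "__main__":
    status, headers = admit(queue_depth=100, max_depth=100)
    print(int(status), headers)
```

Rejecting early keeps the queue short enough that admitted requests can finish inside the timeout, and it gives clients an explicit signal to back off instead of retrying immediately.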