On June 23, 2022 18:56 PDT (June 24 01:56 UTC), the following models went offline:
Codex, Embeddings, and Legacy GPT models were completely unavailable for about an hour. Fine-tuned Curie was failing for most requests for 5 hours. Fine-tuned Davinci was failing for most requests for 12 hours.
The main cause was that one of our three production clusters went completely offline. Once that cluster went down, we needed to re-allocate models to our remaining two clusters. The downed cluster took about 10½ hours to fully restore due to several unexpected events detailed below. While most models recovered before the cluster was restored, Fine-tuned Davinci could not fully recover until the cluster was back online.
Fine-tuned Curie was timing out for most requests until 00:10 PDT (5 hours). Fine-tuned Davinci was timing out for most requests until 06:40 PDT (12 hours). There were several reasons why fine-tuned models took so long to restore.
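The rounded durations above can be checked against the timestamps in this report. The following is a minimal sketch using only the times stated here; the helper names (`outage_duration`, `format_duration`) are illustrative and not part of any tooling described in this report.

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset is enough for this single night.
PDT = timezone(timedelta(hours=-7), "PDT")

# Timestamps taken from this report.
outage_start = datetime(2022, 6, 23, 18, 56, tzinfo=PDT)
recovered = {
    "fine-tuned Curie": datetime(2022, 6, 24, 0, 10, tzinfo=PDT),
    "fine-tuned Davinci": datetime(2022, 6, 24, 6, 40, tzinfo=PDT),
}


def outage_duration(start, end):
    """Elapsed time between the start of the outage and recovery."""
    return end - start


def format_duration(delta):
    """Render a timedelta as hours and minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


durations = {
    model: outage_duration(outage_start, end) for model, end in recovered.items()
}

for model, delta in durations.items():
    print(f"{model}: {format_duration(delta)}")
# fine-tuned Curie: 5h 14m
# fine-tuned Davinci: 11h 44m
```

These exact figures are consistent with the rounded "5 hours" and "12 hours" quoted above.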
While we normally keep healthy infrastructure headroom, these margins were exhausted by the loss of an entire cluster. For us, GPUs are the most constrained resource. Due to fundamental supply chain constraints, we are unable to provision more GPUs with only hours of notice. While we were able to make maximal use of our remaining fleet, we could not fully restore service until our downed cluster came back online.
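To illustrate why normal headroom can disappear when a whole cluster is lost, here is a small sketch of the arithmetic. All numbers are hypothetical: this report does not state cluster sizes or utilization, and the function name `shortfall_after_loss` is made up for the example.

```python
def shortfall_after_loss(cluster_gpus, demand_gpus, lost_cluster):
    """GPUs of demand that cannot be placed once one cluster is gone.

    cluster_gpus: GPU count per cluster (hypothetical values).
    demand_gpus:  GPUs needed to serve current traffic.
    lost_cluster: index of the cluster that went offline.
    """
    remaining = sum(
        gpus for index, gpus in enumerate(cluster_gpus) if index != lost_cluster
    )
    return max(0, demand_gpus - remaining)


# Hypothetical fleet: three equal clusters of 100 GPUs each.
clusters = [100, 100, 100]

# At 80% fleet utilization (240 GPUs of demand) there is 20% headroom
# while all three clusters are healthy...
print(shortfall_after_loss(clusters, demand_gpus=240, lost_cluster=0))  # 40

# ...but surviving the loss of one of three equal clusters without a
# shortfall requires demand at or below two thirds of the fleet.
print(shortfall_after_loss(clusters, demand_gpus=200, lost_cluster=0))  # 0
```

Under these assumed numbers, a fleet with 20% headroom still ends up 40 GPUs short after losing one cluster, which matches the qualitative situation described above.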
Each fine-tuned model is very large, and we have many fine-tuned models across all of our customers. Normally, we only load a few at a time from Azure Blob Storage. When we restore all fine-tuned services from scratch, we are massively bottlenecked by Azure Blob Storage bandwidth. This was made worse by cross-cluster data transfer. We normally have many layers of caching to alleviate this; however, the nature of this move invalidated all of them.
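A back-of-the-envelope estimate shows how a cold restore becomes bandwidth-bound once caches are invalidated. This is only a sketch: the model count, model size, and bandwidth below are hypothetical placeholders, not figures from this incident, and `cold_restore_seconds` is an illustrative name.

```python
def cold_restore_seconds(num_models, model_size_gb, bandwidth_gbps, cache_hit_ratio=0.0):
    """Estimate the time to load every model over a shared storage link.

    num_models:      number of fine-tuned models to reload (hypothetical).
    model_size_gb:   size of each model in gigabytes (hypothetical).
    bandwidth_gbps:  aggregate storage bandwidth in gigabits per second.
    cache_hit_ratio: fraction of bytes served from caches instead of storage.
    """
    bytes_from_storage_gb = num_models * model_size_gb * (1.0 - cache_hit_ratio)
    gigabits = bytes_from_storage_gb * 8  # gigabytes to gigabits
    return gigabits / bandwidth_gbps


# Hypothetical inputs: 1,000 models of 50 GB each over a 40 Gbit/s link.
cold = cold_restore_seconds(1000, 50, 40)                        # caches invalidated
warm = cold_restore_seconds(1000, 50, 40, cache_hit_ratio=0.9)   # caches mostly warm

print(f"cold restore: {cold / 3600:.1f} hours")
print(f"warm restore: {warm / 3600:.1f} hours")
```

With these assumed inputs, losing the caches turns a restore measured in minutes into one measured in hours, before accounting for any extra cost of cross-cluster transfer.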
The cluster went down due to a combination of human error, unforeseen holes in our safeguards, and missing information about the impact of destructive operations. It took us until about 05:30 PDT (10½ hours) to restore this cluster. Our clusters are managed with Terraform and Kubernetes via Azure Kubernetes Service (AKS). Normally, cluster creation is fairly quick, but several unexpected issues made this restore take much longer.
During cluster setup we lost connection to AKS. After working with Azure support, we rebuilt our VPN link as a mitigation. Later, we unexpectedly lost connectivity again, this time to Azure Private Link, and worked with Azure support again to build workarounds. We do not yet have a root cause for these issues and are actively working with Azure on causes and remediations.
A Kubernetes upgrade was expected to alleviate some of these network problems. After performing the upgrade, we discovered that our ingress was incompatible with the new Kubernetes version. Fixing the ingress problems was not feasible given the time constraints, and we discovered that AKS does not allow Kubernetes downgrades.
Finally, we tore down the cluster and rebuilt it from scratch. This was successful, and we began moving traffic back into this cluster between 05:30 and 06:30 PDT.
We are reprioritizing other product and engineering efforts to immediately dedicate significant resources to the following:
Several remediations are underway to prevent this type of accident from happening again, make recovery faster, and improve the overall performance of our infrastructure.
We expect most of these remediations to be implemented within days and have reprioritized other efforts to immediately work on these improvements.