Write-up published
Resolved
On June 23, 2022 18:56 PDT (June 24 01:56 UTC), the following models went offline:
Some fine-tuned models: curie and davinci
Some Codex beta models: code-davinci-001, code-cushman-001
Some legacy GPT models: curie, babbage, and ada
All Embeddings models
Codex, Embeddings, and legacy GPT models were completely unavailable for about an hour. Fine-tuned Curie was failing for most requests for about 5 hours, and Fine-tuned Davinci for about 12 hours.
The main cause was that one of our three production clusters went completely offline. Once that cluster went down, we needed to re-allocate models to our remaining two clusters. The downed cluster took 10 hours to fully restore due to several unexpected events detailed below. While most models recovered before then, Fine-tuned Davinci was not able to fully recover until the cluster was back online.
Fine-tuned Curie was timing out for most requests until 00:10 PDT (5 hours). Fine-tuned Davinci was timing out for most requests until 06:40 PDT (12 hours). There were several reasons why fine-tuned models took so long to restore.
While we normally keep healthy infrastructure headroom, these margins were exhausted by the loss of an entire cluster. For us, GPUs are the most constrained resource. Due to fundamental supply chain constraints, we are unable to provision more GPUs on a few hours' notice. While we made maximal use of our remaining fleet, we could not fully restore service until the downed cluster came back online.
Each fine-tuned model is very large. We have many fine-tuned models across all of our customers, and normally we only load a few at a time from Azure Blob Storage. When we restore all fine-tuned services from scratch, we are severely bottlenecked by Azure Blob Storage bandwidth. This was made worse by cross-cluster data transfer. We normally have many layers of caching to alleviate this; however, this move invalidated all of those caches.
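To make the bottleneck concrete, here is a minimal cache-aside sketch in Python. The cache directory and the fetch_weights_from_blob_storage helper are hypothetical stand-ins, not our actual loading code: the point is simply that a warm cache serves weights from local disk, while a cold cache sends every load to Azure Blob Storage, which is exactly where the bandwidth bottleneck appeared once the caches were invalidated.

```python
import os

CACHE_DIR = "/var/cache/model-weights"      # hypothetical local cache location

def fetch_weights_from_blob_storage(model_id: str) -> bytes:
    # Hypothetical stand-in for the slow, bandwidth-limited blob download.
    return b"\x00" * 1024

def load_weights(model_id: str) -> bytes:
    cache_path = os.path.join(CACHE_DIR, f"{model_id}.bin")
    if os.path.exists(cache_path):          # warm cache: cheap local read
        with open(cache_path, "rb") as f:
            return f.read()
    weights = fetch_weights_from_blob_storage(model_id)   # cold cache: hits storage
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_path, "wb") as f:       # repopulate the cache for next time
        f.write(weights)
    return weights
```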
The cluster went down due to a combination of human error, unforeseen holes in our safeguards, and subtly missing information about the impacts of destructive operations. It took us until about 05:30 PDT (10½ hours) to restore this cluster. Our clusters are managed with Terraform and Kubernetes via Azure Kubernetes Service (AKS). Normally cluster creation is fairly quick, but several unexpected issues made this restore take much longer.
During cluster setup we lost connection to AKS. After working with Azure support, we rebuilt our VPN link to mitigate. We later unexpectedly lost connectivity again, this time to Azure Private Link, and worked with Azure support to build workarounds. We do not yet have a root cause for these issues and are actively working with Azure on causes and remediations.
We expected a Kubernetes upgrade to alleviate some of these network problems. After performing the upgrade, we discovered our ingress configuration was incompatible with the new Kubernetes version. Fixing these ingress problems was not feasible given the time constraints, and we discovered that AKS does not allow Kubernetes downgrades.
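As an illustration of the trap here (not a description of our actual tooling): because AKS cannot downgrade, a pre-upgrade compatibility check along the following lines can block an upgrade whose target version is newer than what the ingress stack has been validated against. The version constants are hypothetical; the sketch assumes the official kubernetes Python client and a reachable kubeconfig.

```python
from kubernetes import client, config

# Hypothetical: highest Kubernetes minor version the ingress stack has been
# validated against.
MAX_SUPPORTED_MINOR = 21

def check_upgrade_target(target_minor: int) -> None:
    config.load_kube_config()                 # use the current kube context
    info = client.VersionApi().get_code()     # server version, e.g. major="1", minor="21"
    current_minor = int(info.minor.rstrip("+"))
    print(f"cluster is at 1.{current_minor}, upgrade target is 1.{target_minor}")
    if target_minor > MAX_SUPPORTED_MINOR:
        raise RuntimeError(
            f"ingress stack not validated past 1.{MAX_SUPPORTED_MINOR}; "
            "AKS cannot downgrade, so stop before upgrading"
        )

if __name__ == "__main__":
    check_upgrade_target(target_minor=22)
```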
We finally tore down and rebuilt the cluster again from scratch. This was successful and we began moving traffic back into this cluster between 05:30 and 06:30 PDT.
We are actively underway with several remediations to prevent this type of incident from happening again, make recovery faster, and improve the overall performance of our infrastructure. We are reprioritizing other product and engineering efforts to immediately dedicate significant resources to the following:
Adding checks and locks to prevent this and similar prod-critical pieces of infrastructure from being inadvertently modified. While we had some of these protections already in place, they were not comprehensive enough to cover the cluster failure mode that happened here.
Adding better visibility into operations being performed and ensuring parts of our system are more lexicographically distinct.
Continuing to work with Azure on AKS network connectivity issues to ensure this does not disrupt new cluster operations in the future.
Updating our playbooks with these new cluster failure modes and modifying new cluster and infrastructure setup steps.
Performing a systematic upgrade of Kubernetes and setting aside time to ensure we don't fall behind on versions in the future.
Sharding our fine-tuned models across multiple clusters and working towards more automatic cross-cluster replication strategies (see the illustrative sketch below).
Distributing and replicating Azure Blob Storage to ensure fine-tuned models can load faster and more reliably.
Finishing the rollout of a node-to-node caching solution that can alleviate Azure Blob Storage bottlenecks.
Modifying the way we store model weights to be more efficient and faster to load.
We expect most of these remediations to be implemented within days.
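As an illustration of the sharding direction mentioned above (a generic technique sketch, not our production placement logic), rendezvous hashing gives each fine-tuned model a stable primary cluster plus a deterministic fallback, so losing one cluster only moves the models that lived there. The cluster names and replication factor below are hypothetical.

```python
import hashlib

CLUSTERS = ["cluster-a", "cluster-b", "cluster-c"]   # hypothetical cluster names

def _score(model_id: str, cluster: str) -> int:
    # Stable pseudo-random weight for this (model, cluster) pair.
    digest = hashlib.sha256(f"{model_id}:{cluster}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def placements(model_id: str, replicas: int = 2) -> list[str]:
    """Clusters that should hold this model's weights, best match first."""
    ranked = sorted(CLUSTERS, key=lambda c: _score(model_id, c), reverse=True)
    return ranked[:replicas]

# Losing the primary cluster only reroutes the models whose primary it was;
# every other model's placement stays the same.
print(placements("ft-curie-customer-1234"))
```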
Resolved
All models are operational. Thank you for your patience.
Investigating
Babbage is now stable. We're investigating one or two remaining issues with our fine-tuned curie models and a few lesser-used engines.
Investigating
At this time davinci fine-tuned models should be back to normal. We're investigating an issue with our babbage engine.
Investigating
We have restored our original cluster and are shifting traffic back to it. As of this post, davinci fine-tuned models should be normalizing in latency and error rates.
Investigating
Davinci fine-tuned models are coming back up but are seeing increased latency. We are continuing to work to resolve this outage.
Investigating
We do not have a resolution on this incident but we are working with our upstream partners for support. Users of davinci fine-tuned models are still advised to use text-davinci-002 for the time being.
Investigating
Fine-tuned Davinci model inference is still degraded. We are exploring alternate theories as to what is causing very high latency on these models. Given the root causes that have already been ruled out, this unfortunately indicates that a much more extensive investigation will be needed to fully remediate fine-tuned Davinci model performance.
We suggest using the text-davinci-002 model as a temporary backup while we work to restore fine-tuned Davinci. The text-davinci-002 model is fully operational and can approach the capability of fine-tuned Davinci models for many applications.
All other public production models are operating nominally and we have restored the original cluster that had an outage.
Monitoring
Fine-tuned curie model inference has returned to normal.
Fine-tuned davinci model inference is still in a degraded state.
Monitoring
We are seeing error rates drop on curie fine-tuned models as well as davinci fine-tuned models. We're actively monitoring the situation.
Identified
We are continuing to address health issues with fine-tuned curie and fine-tuned davinci models.
In addition to the aforementioned model-loading issues, we are experiencing capacity limits while we restore the cluster that went down.
All other models are operational.
Monitoring
We believe we have found a stable arrangement of our infrastructure. All models are responding to requests; however, fine-tuned davinci and fine-tuned curie have elevated rates of 429s and 499s.
The fine-tuned davinci and fine-tuned curie model errors are due to customer model weights taking a long time to load. Normally these weights are heavily cached; however, the cluster rearrangements mean those caches need to be rebuilt. The sudden influx of requests to rebuild those caches is causing slowdowns upstream at our storage accounts. We expect error rates to steadily decline, though this may take longer than normal due to these bottlenecks.
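For readers curious about this "sudden influx" failure mode, here is a generic single-flight (request-coalescing) sketch in Python. It illustrates the mitigation pattern rather than our serving code, and download_weights is a hypothetical stand-in for the slow blob download: when many requests arrive for the same cold model, only one of them hits the storage account and the rest wait for that result.

```python
import threading
from concurrent.futures import Future

_inflight: dict[str, Future] = {}   # model_id -> in-progress download
_lock = threading.Lock()

def download_weights(model_id: str) -> bytes:
    # Hypothetical stand-in for the slow Azure Blob Storage download.
    return b"\x00" * 1024

def get_weights(model_id: str) -> bytes:
    with _lock:
        fut = _inflight.get(model_id)
        leader = fut is None
        if leader:                      # first caller becomes the downloader
            fut = Future()
            _inflight[model_id] = fut
    if leader:
        try:
            fut.set_result(download_weights(model_id))
        except Exception as exc:        # propagate failures to all waiters
            fut.set_exception(exc)
        finally:
            with _lock:
                _inflight.pop(model_id, None)
    return fut.result()                 # waiters block until the leader finishes
```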
Identified
We are continuing to move infrastructure around in our operational clusters to ensure all models are performing optimally with the resources we have. We are much closer to a stable configuration, but are still re-allocating resources to better bring down error rates.
Some fine-tuned curie models are currently the most heavily affected as we continue to move resources around.
Identified
We have now moved all models from the broken cluster to new clusters; however, we are still suffering from some warmup and capacity issues.
Fine-tuned davinci and curie models are warming up. Their performance should improve over time and the rates of 429s and 499s should steadily decrease.
We're also experiencing capacity issues with the Codex davinci and cushman engines. We are actively working to fix these; until then, those engines will have degraded performance.
Identified
One of our clusters has suffered a major communication outage within Kubernetes. This has affected the models hosted in that cluster.
This includes the following models:
Inference for fine-tuned davinci and curie models
Codex: code-davinci-001, and code-cushman-001
Legacy curie, babbage, and ada
Embeddings models
We are actively working to migrate most of these models to a functioning cluster. Affected models should be coming online as this happens.
Due to capacity constraints, we unfortunately expect to see some temporary performance and latency degradations in other models as we move infrastructure around.
Investigating
We are currently in a state of degraded performance for most engines. We are still working to recover.
Investigating
We know the source of the outage and are working to mitigate.
Investigating
One of our clusters has had an outage affecting some engines. We are investigating.