At 4:53pm Pacific on 2023-01-25, one of our GPU clusters suffered a major failure, resulting in a full outage of the text-davinci-003 model. 80% of capacity was back online in approximately one hour, and capacity was fully stabilized in approximately four hours.
The problem was caused by a configuration change to "CNI" - the Container Network Interface plugins that provide network connectivity to containers running on our clusters. Given the high demand for our services, we go to great lengths to use all available GPU capacity. This means supporting a variety of hardware and networking configurations, which in turn require different CNI configurations. We fully tested this CNI configuration change in a staging environment prior to deployment, but unfortunately our staging environment lacked one particular hardware variation that exists only in production. The CNI change was incompatible with those servers and caused their workloads to lose network connectivity.
Engineers immediately identified the problem as a loss of network connectivity, but it took nearly an hour to identify CNI as the cause. The CNI change had been deployed to other clusters over the preceding 24 hours, so it had been deemed safe. The problem arose only once the change was deployed to a cluster with different hardware. Once the cause was identified, a fix was put in place and immediately restored 80% of traffic.
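The gap described above, where a change is deemed safe because it ran on clusters with different hardware, can be illustrated with a small pre-rollout check. This is a minimal sketch for illustration only, not a description of the tooling involved in this incident. It assumes each cluster can be labeled with a set of hardware profile names; the cluster names, profile names, and the `unvalidated_profiles` helper are all hypothetical.

```python
from typing import Dict, Set


def unvalidated_profiles(
    validated_clusters: Dict[str, Set[str]],
    target_profiles: Set[str],
) -> Set[str]:
    """Return hardware profiles on the target cluster that no
    already-validated cluster has exercised."""
    covered: Set[str] = set()
    for profiles in validated_clusters.values():
        covered |= profiles
    return target_profiles - covered


# Hypothetical inventory: clusters where the change already ran,
# mapped to the hardware profiles present on each.
validated = {
    "cluster-a": {"gpu-type-1"},
    "cluster-b": {"gpu-type-1", "gpu-type-2"},
}

# Hypothetical target cluster containing a profile seen nowhere else.
target = {"gpu-type-1", "gpu-type-3"}

missing = unvalidated_profiles(validated, target)
if missing:
    print(f"Hold rollout: change is untested on {sorted(missing)}")
else:
    print("All hardware profiles on the target cluster are covered")
```

Under these assumptions, earlier successful rollouts count as evidence only for the hardware they actually touched, so a cluster with an unseen profile is flagged before the change reaches it.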
Although the network and servers were back up and running, problems related to model deployment continued to affect our reliability over the subsequent 3 hours. Some issues were due to known bottlenecks in our network and storage infrastructure, which limit how many models we can load back onto GPUs at a time. Others were due to misbehaving hardware that needed to be identified and removed from operation. We are actively working on addressing those limitations this quarter.
We are addressing several action items as a result of this incident. Staging environments will include a representative of every type of hardware configuration, so that they better reflect production. The health of our CNI system itself will be directly monitored, with alerts, so that failures are more readily visible. We are fixing the bottlenecks in network and storage infrastructure that constrain our ability to deploy many replicas of a model quickly.
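To make the monitoring action item concrete, the following is a minimal sketch of what a direct connectivity health check could look like. It is an assumption-laden illustration, not our actual monitoring: the `probe` callable, the node names, and the 5% alert threshold are hypothetical, and a real probe would test connectivity from inside a container on each node.

```python
from typing import Callable, List


def failing_nodes(nodes: List[str], probe: Callable[[str], bool]) -> List[str]:
    """Return the nodes whose connectivity probe reports failure."""
    return [node for node in nodes if not probe(node)]


def should_alert(
    nodes: List[str],
    probe: Callable[[str], bool],
    max_failure_ratio: float = 0.05,  # hypothetical threshold
) -> bool:
    """Alert when the share of nodes failing the probe exceeds the threshold."""
    if not nodes:
        return False
    return len(failing_nodes(nodes, probe)) / len(nodes) > max_failure_ratio


# Hypothetical probe results standing in for a real in-container network check.
fake_results = {"node-1": True, "node-2": False, "node-3": False, "node-4": False}


def fake_probe(node: str) -> bool:
    return fake_results[node]


nodes = list(fake_results)
print("failing:", failing_nodes(nodes, fake_probe))
print("alert:", should_alert(nodes, fake_probe))
```

Keeping the probe injectable means the same alerting logic can be exercised with canned results, as here, or wired to a real connectivity test.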