At 9:58 am PST on Thursday, November 17, 2022, we experienced an outage affecting the serving of the text-davinci-002 model to our API and Playground customers. The outage lasted 57 minutes. No other models were affected.
Engineers had been conducting a routine upgrade of one of our clusters when we started seeing a low level of internal network connectivity problems. We traced the networking issues to increased scale: a larger number of connections was leading toward "SNAT Port Exhaustion." Engineering moved quickly to add network capacity. The configuration change that added this capacity overrode the in-flight upgrade of the cluster, causing the cluster to roll back unexpectedly. The rollback disrupted the workloads powering the text-davinci-002 model.
The problem was identified in less than 2 minutes. It took approximately 30 minutes to restore capacity on the cluster, and an additional 25 minutes for workloads to start operating normally.
This outage is regrettable. We will be addressing gaps in our change control process to ensure that conflicts between two in-flight changes do not result in unexpected outages. We are also adding cluster redundancy so that a single cluster outage has far less impact on our services overall. Both of these changes were already planned and are actively being implemented.
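As an illustration only, the kind of change-control guard described above could be sketched as a registry that refuses to start a second change on a cluster while another is still in flight. This is not our actual tooling: `ChangeRegistry`, `ChangeConflictError`, and the cluster and change names are hypothetical, chosen just to show the idea.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ChangeConflictError(Exception):
    """Raised when a change targets a cluster that already has one in flight."""


@dataclass
class ChangeRegistry:
    """Hypothetical in-memory record of in-flight changes, keyed by cluster."""

    # Maps cluster name -> description of the change currently in flight.
    in_flight: dict[str, str] = field(default_factory=dict)

    def begin(self, cluster: str, description: str) -> None:
        """Register a change, refusing if the cluster already has one in flight."""
        current = self.in_flight.get(cluster)
        if current is not None:
            raise ChangeConflictError(
                f"{cluster}: cannot start {description!r} "
                f"while {current!r} is in flight"
            )
        self.in_flight[cluster] = description

    def finish(self, cluster: str) -> None:
        """Mark the cluster's change as complete so new changes may begin."""
        self.in_flight.pop(cluster, None)


# Usage: the second change on the same cluster is rejected instead of
# silently overriding the first (names below are made up for the example).
registry = ChangeRegistry()
registry.begin("cluster-a", "routine cluster upgrade")
try:
    registry.begin("cluster-a", "add network capacity")
except ChangeConflictError as err:
    print(err)
```

A real system would need the registry to be shared and durable (for example, backed by a database or a lock service) rather than held in one process's memory; the sketch only shows the conflict check itself.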