Outage for text-davinci-002
Incident Report for OpenAI
Postmortem

At 9:58am PST on Thursday, November 17, 2022, we experienced an outage in serving the text-davinci-002 model to our API and Playground customers. The outage lasted 57 minutes. No other models were affected during this outage.

Engineers had been conducting a routine upgrade of one of our clusters when we started seeing a low level of internal network connectivity problems. We traced the networking issues to increased scale: a larger number of connections was pushing the cluster toward "SNAT Port Exhaustion." Engineering moved quickly to add network capacity. The configuration change to upgrade the network overrode the in-flight upgrade of the cluster, causing the cluster to roll back unexpectedly. This disrupted the workloads powering the text-davinci-002 model.
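To illustrate why a growing connection count leads toward SNAT port exhaustion, here is a minimal sketch in Python. It is not our tooling. It uses a simplified model in which every outbound connection consumes one source port from a fixed pool per SNAT IP address. The function names, the 64,000 ports-per-IP figure, and the 80% warning threshold are assumptions chosen for illustration. Real limits depend on the network provider and on how ports are allocated per destination.

```python
def snat_utilization(active_connections: int, snat_ips: int,
                     ports_per_ip: int = 64_000) -> float:
    """Return the fraction of the SNAT port pool in use (1.0 means exhausted).

    ports_per_ip is an illustrative assumption, not a provider-specific limit.
    """
    if snat_ips <= 0 or ports_per_ip <= 0:
        raise ValueError("snat_ips and ports_per_ip must be positive")
    capacity = snat_ips * ports_per_ip
    return active_connections / capacity


def is_near_exhaustion(active_connections: int, snat_ips: int,
                       ports_per_ip: int = 64_000,
                       threshold: float = 0.8) -> bool:
    """Flag the pool once utilization reaches the (assumed) warning threshold."""
    return snat_utilization(active_connections, snat_ips, ports_per_ip) >= threshold


if __name__ == "__main__":
    # Hypothetical numbers: 60,000 connections behind one SNAT IP is close to
    # the limit, while the same load spread over two IPs is not.
    print(is_near_exhaustion(60_000, snat_ips=1))
    print(is_near_exhaustion(60_000, snat_ips=2))
```

In this model, adding network capacity corresponds to raising `snat_ips`, which enlarges the pool and lowers utilization for the same number of connections.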

The problem was identified in less than 2 minutes. It took approximately 30 minutes to restore capacity on the cluster, and an additional 25 minutes for workloads to start operating normally.

This outage is regrettable. We will be addressing gaps in our change control process to ensure that conflicts between two in-flight changes do not result in unexpected outages. We are also adding cluster redundancy so that a single cluster outage has far less impact on our services overall. Both of these changes were already planned and are actively being implemented.
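One way to picture a change control safeguard of this kind is a registry that refuses to start a second change on a resource while another is still in flight. The Python sketch below is only an illustration under that assumption. `ChangeRegistry`, `ChangeConflictError`, and the resource and change names are hypothetical and do not describe our actual process or systems.

```python
class ChangeConflictError(Exception):
    """Raised when a change targets a resource that already has one in flight."""


class ChangeRegistry:
    """Hypothetical in-memory record of in-flight changes, keyed by resource."""

    def __init__(self) -> None:
        self._in_flight: dict[str, str] = {}

    def begin(self, resource: str, change_id: str) -> None:
        """Register a change, rejecting it if a different change is in flight."""
        existing = self._in_flight.get(resource)
        if existing is not None and existing != change_id:
            raise ChangeConflictError(
                f"{resource!r} already has in-flight change {existing!r}"
            )
        self._in_flight[resource] = change_id

    def finish(self, resource: str, change_id: str) -> None:
        """Mark a change as complete so later changes may proceed."""
        if self._in_flight.get(resource) == change_id:
            del self._in_flight[resource]


if __name__ == "__main__":
    registry = ChangeRegistry()
    registry.begin("cluster-a", "cluster-upgrade")
    try:
        # A second change on the same cluster is rejected until the first ends.
        registry.begin("cluster-a", "network-capacity")
    except ChangeConflictError as err:
        print(f"blocked: {err}")
    registry.finish("cluster-a", "cluster-upgrade")
    registry.begin("cluster-a", "network-capacity")  # now allowed
```

A production version would need shared, durable state and a way to handle abandoned changes. The sketch only shows the conflict check itself.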

Posted Nov 17, 2022 - 16:45 PST

Resolved
This incident has been resolved.
Posted Nov 17, 2022 - 11:41 PST
Update
We are continuing to monitor for any further issues.
Posted Nov 17, 2022 - 11:33 PST
Monitoring
Traffic should be back now. We are continuing to monitor the situation.
Posted Nov 17, 2022 - 10:59 PST
Update
We are continuing to work on a fix for this issue.
Posted Nov 17, 2022 - 10:58 PST
Identified
We've identified an issue causing loss of traffic for text-davinci-002, and are currently working on a resolution.
Posted Nov 17, 2022 - 09:57 PST
This incident affected: API.