OpenAI

Outage for text-davinci-002
Updates

Write-up published


Resolved

At 9:58 AM PST on Thursday, November 17, 2022, we experienced an outage in serving the text-davinci-002 model to our API and Playground customers. The outage lasted 57 minutes. No other models were affected.

Engineers had been conducting a routine upgrade of one of our clusters when we started seeing a low level of internal network connectivity problems. We traced these networking issues to increased scale: a larger number of connections was pushing the cluster toward "SNAT port exhaustion." Engineering moved quickly to add network capacity. The configuration change that upgraded the network overrode the in-flight cluster upgrade, causing the cluster to unexpectedly roll back. This disrupted the workloads powering the text-davinci-002 model.

The problem was identified in less than 2 minutes. It took approximately 30 minutes to restore capacity on the cluster, and an additional 25 minutes for workloads to start operating normally.
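As a quick consistency check on the figures above, the three phases can be summed and placed on a clock. This is only a sketch: it treats "less than 2 minutes" and the two approximate durations as exact values and assumes the phases ran back to back, which the report does not state explicitly.

```python
from datetime import datetime, timedelta, timezone

# PST is UTC-8; a fixed offset is used because the report states PST explicitly.
PST = timezone(timedelta(hours=-8), "PST")

# Outage start time as given in the report.
start = datetime(2022, 11, 17, 9, 58, tzinfo=PST)

# Phase durations from the report, treated as exact for this estimate.
phases = [
    ("problem identified", timedelta(minutes=2)),
    ("cluster capacity restored", timedelta(minutes=30)),
    ("workloads operating normally", timedelta(minutes=25)),
]


def build_timeline(start, phases):
    """Return (label, cumulative end time) pairs for consecutive phases."""
    timeline = []
    current = start
    for label, duration in phases:
        current += duration
        timeline.append((label, current))
    return timeline


timeline = build_timeline(start, phases)
total = timeline[-1][1] - start

for label, end in timeline:
    print(f"{end:%I:%M %p} PST  {label}")
print(f"total minutes: {total.total_seconds() / 60:.0f}")
```

Under these assumptions the phases add up to the 57 minutes stated for the outage.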

This outage is regrettable. We will be addressing gaps in our change control process to ensure that conflicts between two in-flight changes do not result in unexpected outages. We are also adding cluster redundancy so that a single cluster outage has far less overall impact on our services. Both of these changes were already planned and are actively being implemented.
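The report does not describe how the change control process will be changed. Purely as an illustration of the general idea, one way to keep two in-flight changes from colliding is to register each change against its target and reject a second change until the first completes. Every name below (`ChangeRegistry`, `ChangeConflictError`, `cluster-a`, the change identifiers) is hypothetical and is not taken from the report.

```python
class ChangeConflictError(Exception):
    """Raised when a target already has a change in flight."""


class ChangeRegistry:
    """Hypothetical in-memory record of in-flight changes, keyed by target."""

    def __init__(self):
        self._in_flight = {}  # target name -> id of the change in progress

    def begin(self, target, change_id):
        # Refuse to start a change while another one is still running.
        existing = self._in_flight.get(target)
        if existing is not None:
            raise ChangeConflictError(
                f"{change_id!r} conflicts with in-flight change "
                f"{existing!r} on {target!r}"
            )
        self._in_flight[target] = change_id

    def finish(self, target, change_id):
        # Only the change that holds the target may release it.
        if self._in_flight.get(target) != change_id:
            raise KeyError(f"{change_id!r} is not in flight on {target!r}")
        del self._in_flight[target]


registry = ChangeRegistry()
registry.begin("cluster-a", "cluster-upgrade")

try:
    # A second change against the same target is rejected, not applied.
    registry.begin("cluster-a", "network-capacity-change")
    blocked = False
except ChangeConflictError as err:
    blocked = True
    print(f"blocked: {err}")

# Once the first change completes, the second can proceed.
registry.finish("cluster-a", "cluster-upgrade")
registry.begin("cluster-a", "network-capacity-change")
```

This sketch is single-process and not thread-safe; a real change control system would need shared, durable state and a policy for urgent changes.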

Fri, Nov 18, 2022, 12:45 AM

Resolved

This incident has been resolved.

Thu, Nov 17, 2022, 07:41 PM (5 hours earlier)

Monitoring

We are continuing to monitor for any further issues.

Thu, Nov 17, 2022, 07:33 PM

Monitoring

Traffic should be back now. We are continuing to monitor the situation.

Thu, Nov 17, 2022, 06:59 PM (33 minutes earlier)

Identified

We are continuing to work on a fix for this issue.

Thu, Nov 17, 2022, 06:58 PM

Identified

We've identified an issue causing loss of traffic for text-davinci-002 and are currently working on a resolution.

Thu, Nov 17, 2022, 05:57 PM (1 hour earlier)

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.