Multiple engines are down

Resolved · Full outage

We’ve published a write-up of this incident.

Affected components

Jan 26, 2023, 01:13 AM to 05:42 AM

Updates

Write-up published

Read it here

Resolved

At 4:53pm Pacific on 2023-01-25 we had a major outage on one of our GPU clusters, resulting in a full outage of the text-davinci-003 model. 80% of capacity was back online in approximately one hour. Capacity was fully stabilized in approximately four hours.

The problem was caused by a change to the configuration of our CNI (Container Network Interface) plugins, which provide network connectivity to containers running on our clusters. Given the high demand for our services, we go to great lengths to utilize any and all GPU capacity available. This involves supporting a variety of hardware and networking configurations, which in turn require different CNI configurations. We fully tested this CNI configuration change in a staging environment prior to deployment, but unfortunately our staging environment lacked one hardware variant that exists only in production. The CNI change was incompatible with those servers and caused their workloads to lose network connectivity.

Engineers immediately identified the problem as a loss of network connectivity, but it took nearly an hour to identify CNI as the cause. The CNI change had been deployed to other clusters over the preceding 24 hours and had therefore been deemed safe. The problem arose only once the change was deployed to a cluster with different hardware. Once the cause was identified, a fix was put in place, which immediately restored 80% of traffic.

Although the network and servers were back up and running, problems related to model deployment continued to affect our reliability over the subsequent three hours. Some issues were due to known bottlenecks in our network and storage infrastructure, which limit how many models we can load back onto GPUs at a time. Others were due to misbehaving hardware that needed to be identified and removed from operation. We are actively working on addressing those limitations this quarter.

We are addressing several action items as a result of this incident. Staging environments will include all possible types of hardware configurations so that they better represent production. The health of our CNI system itself will be directly monitored, with alerting, so that failures are more readily visible. We are fixing the bottlenecks in network and storage infrastructure that constrain our ability to deploy many replicas of a model quickly.
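
The staging-coverage action item can be illustrated with a minimal sketch. It is not our actual tooling: the `NodeProfile` fields, the inventory lists, and labels such as `gpu-a` or `cni-1` are all hypothetical placeholders. The idea is simply to compare the set of hardware variants seen in production against those present in staging and report any variant a change could not have been tested on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeProfile:
    """Hypothetical description of one hardware/network variant."""
    gpu_model: str
    nic_type: str
    cni_profile: str


def missing_from_staging(production, staging):
    """Return production variants that have no matching staging node."""
    gaps = set(production) - set(staging)
    # Sort so the report is stable from run to run.
    return sorted(gaps, key=lambda p: (p.gpu_model, p.nic_type, p.cni_profile))


# Illustrative inventories only; real data would come from cluster metadata.
production = [
    NodeProfile("gpu-a", "nic-x", "cni-1"),
    NodeProfile("gpu-a", "nic-x", "cni-1"),  # duplicates collapse into one variant
    NodeProfile("gpu-b", "nic-y", "cni-2"),
]
staging = [
    NodeProfile("gpu-a", "nic-x", "cni-1"),
]

gaps = missing_from_staging(production, staging)
for gap in gaps:
    print(f"not covered in staging: {gap}")
```

A check like this could gate a rollout: if `gaps` is non-empty, a change validated in staging has not been exercised on every production variant.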

Fri, Jan 27, 2023, 08:08 PM

Resolved

This incident has been resolved.

Thu, Jan 26, 2023, 05:50 AM (1 day earlier)

Monitoring

We are continuing to monitor for any further issues.

Thu, Jan 26, 2023, 05:42 AM

Monitoring

The rollout of the fix continues. Many engines, including text-davinci-003, appear to be operating normally again.

Thu, Jan 26, 2023, 03:51 AM (1 hour earlier)

Monitoring

A fix has been implemented and we are monitoring the results.

Thu, Jan 26, 2023, 03:00 AM (51 minutes earlier)

Identified

The issue has been identified and a fix is being implemented.

Thu, Jan 26, 2023, 02:06 AM (53 minutes earlier)

Investigating

We are continuing to investigate this issue.

Thu, Jan 26, 2023, 01:22 AM (44 minutes earlier)

Investigating

We are currently investigating this issue.

Thu, Jan 26, 2023, 01:13 AM