Error responses for curie fine-tuned models

Resolved·Partial outage

We’ve published a write-up of this incident.

Affected components

Jul 23, 2022, 06:42 AM

07:08 AM

Updates

Write-up published

Read it here

Resolved

On Jul 22, 2022, starting at 11:10 PM, some fine-tuned curie models suffered an outage lasting around 50 minutes. The outage affected the infrastructure serving more recently fine-tuned curie models (older models run on separate infrastructure and were unaffected). Overall, this represented 80% of requests to curie fine-tuned models.

The cause was a failure of the pub/sub system we use to queue requests. Even though the specific pods responsible for pub/sub were brought back up, there were cascading failures that prevented traffic from recovering.

The mitigation involved standing up new infrastructure to process these requests. It was brought up and began serving production traffic at around midnight, at which point failures stopped. Because new systems have startup costs and cold caches, latency remained elevated for around 10 more minutes.

Remediations:

Since this incident, we have begun sharding traffic for fine-tuned models across different regions to add redundancy in case of an outage.
We are in the process of re-architecting our model runners to be more resilient to pub/sub failures.
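
As a rough illustration of the first remediation, the sketch below shows one common way to shard requests across regions with failover. It is not a description of the actual implementation: the region names, the `pick_region` helper, and the example model ID are all hypothetical, and the set of healthy regions is assumed to come from some external health check.

```python
import hashlib

# Hypothetical region names; the write-up does not say which regions are used.
REGIONS = ["region-a", "region-b", "region-c"]


def pick_region(model_id, healthy, regions=REGIONS):
    """Return the region that should serve model_id, skipping unhealthy ones."""
    if not regions:
        raise ValueError("no regions configured")
    # Hash the model ID so the same model consistently prefers the same region.
    digest = hashlib.sha256(model_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(regions)
    # Walk the list from the preferred region until a healthy one is found.
    for offset in range(len(regions)):
        candidate = regions[(start + offset) % len(regions)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy region available")


# Assumed model ID format, for illustration only.
primary = pick_region("curie:ft-example", healthy=set(REGIONS))
fallback = pick_region("curie:ft-example", healthy=set(REGIONS) - {primary})
```

With every region healthy, a given model always maps to the same region. If that region is marked unhealthy, its traffic moves to the next region in the list, so one regional outage no longer takes the model offline.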

Tue, Jul 26, 2022, 04:54 PM

Resolved

This incident has been resolved.

Sat, Jul 23, 2022, 07:15 AM (3 days earlier)

Investigating

Latency should be back to normal. Service is healthy.

Sat, Jul 23, 2022, 07:08 AM

Investigating

We have identified the source of the issue and have applied mitigations to restore service. However, latency is still affected during warmup.

Sat, Jul 23, 2022, 06:58 AM

Investigating

We are continuing to investigate this issue.

Sat, Jul 23, 2022, 06:42 AM (16 minutes earlier)

Investigating

We have been alerted to an issue with curie fine-tuned models. We are investigating.

Sat, Jul 23, 2022, 06:42 AM

Availability metrics are reported at an aggregate level across all tiers, models and error types. Individual customer availability may vary depending on their subscription tier as well as the specific model and API features in use.