OpenAI

Write-up
Error responses for curie fine-tuned models

On July 22, 2022, starting at 11:10pm, some fine-tuned curie models suffered an outage lasting around 50 minutes. The outage affected the infrastructure serving more recently fine-tuned curie models (older models run on separate infrastructure and were unaffected). Overall, this represented 80% of requests to curie fine-tuned models.

The cause was a failure of the pub/sub system we use to queue requests. Even though the specific pods responsible for pub/sub were brought back up, cascading failures prevented traffic from recovering.

The mitigation involved standing up new infrastructure to process these requests. It was brought up and put into production traffic at around midnight, at which point failures stopped. Because new systems have startup costs and cold caches, elevated latency was still observed for around 10 minutes.

Remediations:

  • Since this incident, we have begun sharding traffic for fine-tuned models across different regions to add redundancy in case of an outage.

  • We are in the process of re-architecting our model runners to be more resilient to pub/sub failures.
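The two remediations above boil down to a familiar pattern: keep redundant queue backends in separate regions and fail over between them instead of depending on a single pub/sub deployment. A minimal sketch of that idea, using a hypothetical in-memory `QueueBackend` stand-in (the names `publish_with_fallback`, `PubSubUnavailable`, and the region labels are illustrative, not OpenAI's actual internals):

```python
import time


class PubSubUnavailable(Exception):
    """Raised when a pub/sub backend cannot accept a request."""


class QueueBackend:
    """Toy in-memory stand-in for a regional pub/sub queue (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.messages = []

    def publish(self, message):
        if not self.healthy:
            raise PubSubUnavailable(self.name)
        self.messages.append(message)
        return self.name


def publish_with_fallback(backends, message, retries_per_backend=2):
    """Try each backend in order, retrying transient failures a few
    times before falling through to the next region."""
    for backend in backends:
        for _ in range(retries_per_backend):
            try:
                return backend.publish(message)
            except PubSubUnavailable:
                time.sleep(0)  # placeholder for real backoff
    raise PubSubUnavailable("all backends exhausted")


# Simulate the incident: the primary region's pub/sub is down,
# but a redundant region absorbs the traffic.
primary = QueueBackend("us-east", healthy=False)
secondary = QueueBackend("us-west", healthy=True)
served_by = publish_with_fallback([primary, secondary], {"prompt": "hello"})
```

With the primary unhealthy, `publish_with_fallback` exhausts its retries there and the request lands on the secondary region, so `served_by` is `"us-west"`.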

Availability metrics are reported at an aggregate level across all tiers, models, and error types. Individual customer availability may vary depending on subscription tier as well as the specific model and API features in use.