On July 22, 2022, starting at 11:10pm, some fine-tuned curie models suffered an outage lasting around 50 minutes. The outage affected the infrastructure serving more recently fine-tuned curie models (older models run on separate infrastructure and were unaffected). Overall, the affected infrastructure served 80% of requests to fine-tuned curie models.
The cause was a failure of the pub/sub system we use to queue requests. Although the specific pods responsible for pub/sub were brought back up, cascading failures prevented traffic from recovering.
The mitigation involved standing up new infrastructure to process these requests. It was brought up and began serving production traffic at around midnight, at which point failures stopped. Because new systems have startup costs and cold caches, latency remained elevated for around 10 minutes afterward.
Remediations: