Error responses for curie fine-tuned models
Incident Report for OpenAI
Postmortem

On Jul 22, 2022, starting at 11:10pm PDT, some fine-tuned curie models suffered an outage lasting around 50 minutes. The outage affected the infrastructure serving more recently fine-tuned curie models (older models run on separate infrastructure and were unaffected). Overall, this represented 80% of requests to curie fine-tuned models.

The cause was a failure of the pub/sub system we use to queue requests. Although the specific pods responsible for pub/sub were brought back up, cascading failures prevented traffic from recovering.

The mitigation involved standing up new infrastructure to process these requests. It was brought up and put into production traffic at around midnight, at which point failures stopped. Because new systems have startup costs and cold caches, elevated latency was still observed for around 10 minutes.

Remediations:

  • Since this incident, we have begun sharding traffic for fine-tuned models across different regions to add redundancy in case of an outage.
  • We are re-architecting our model runners to be more resilient to pub/sub failures.
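
The region-sharding remediation can be sketched as a deterministic model-to-region mapping with failover to the next healthy region; the region names, hashing scheme, and function below are illustrative assumptions, not the production design:

```python
import hashlib

# Hypothetical region list for sharding fine-tuned model traffic.
REGIONS = ["us-east", "us-west", "eu-west"]

def pick_region(model_id: str, healthy: set) -> str:
    """Deterministically shard a model to a primary region, falling back
    to the next healthy region if the primary's infrastructure is down."""
    start = int(hashlib.sha256(model_id.encode()).hexdigest(), 16) % len(REGIONS)
    for offset in range(len(REGIONS)):
        region = REGIONS[(start + offset) % len(REGIONS)]
        if region in healthy:
            return region
    raise RuntimeError("no healthy region available")
```

Because the primary region is a pure function of the model ID, an outage in one region only reroutes the models sharded there, rather than all fine-tuned traffic.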
Posted Jul 26, 2022 - 09:54 PDT

Resolved
This incident has been resolved.
Posted Jul 23, 2022 - 00:15 PDT
Update
Latency should be back to normal. Service is healthy.
Posted Jul 23, 2022 - 00:08 PDT
Update
We have identified the source of the issue and applied mitigations to restore service. However, latency remains elevated while the new infrastructure warms up.
Posted Jul 22, 2022 - 23:58 PDT
Update
We are continuing to investigate this issue.
Posted Jul 22, 2022 - 23:42 PDT
Investigating
We have been alerted to an issue with curie fine-tuned models. We are investigating.
Posted Jul 22, 2022 - 23:42 PDT
This incident affected: API.