Increased latencies, failures in text-davinci-002
Incident Report for OpenAI
Postmortem

An internal configuration error caused text-davinci-002 and code-davinci-002 to receive unanticipated load starting at 21:03 UTC on 2022-09-06. The sudden increase in workload led to increased failure rates and longer response times, negatively impacting customer experience. We reverted the misconfiguration and rebalanced the load.

To prevent this from happening again, we are working on improving the test coverage for regressions in this area and have improved the rollout logic and alerting to catch issues earlier.

Posted Sep 07, 2022 - 17:54 PDT

Resolved
The system continues to operate as expected. We are marking this incident as resolved.
Posted Sep 06, 2022 - 18:40 PDT
Monitoring
Latencies appear to have returned to normal. We will continue to monitor.
Posted Sep 06, 2022 - 16:23 PDT
Identified
Our mitigation rollout appears to be working: latencies are returning to normal and failure rates have dropped.
Posted Sep 06, 2022 - 16:10 PDT
Investigating
We have been experiencing increased latencies across all models since 2pm, leading to increased load and errors in text-davinci-002 by around 3pm. We have a potential mitigation that we are going to try.
Posted Sep 06, 2022 - 14:00 PDT
This incident affected: API.