ChatGPT Outage
Incident Report for OpenAI
Postmortem

At 10:55am on March 15th, a deploy of the ChatGPT service cut over to a new Redis cluster for additional capacity. The new cluster had a misconfigured maximum connections setting. The misconfiguration went undetected in staging, canary, and production dual-write testing because none of those phases had reached the connection threshold. Rolling back to redirect traffic to the original Redis cluster restored service.
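
A connection limit that is never reached in testing can still be checked against expected production load before cutover. The following is a minimal sketch of such a headroom check, not the tooling used in this incident. The function name, the 80% utilization threshold, and all numbers are hypothetical, and it assumes the "maximum connections setting" corresponds to a single numeric limit (in open-source Redis, the `maxclients` directive).

```python
from dataclasses import dataclass


@dataclass
class HeadroomReport:
    limit: int            # configured connection limit (e.g. Redis `maxclients`)
    expected_peak: int    # peak concurrent connections expected for this environment
    utilization: float    # expected_peak / limit
    ok: bool              # True if utilization stays within the allowed fraction


def check_connection_headroom(limit: int, expected_peak: int,
                              max_utilization: float = 0.8) -> HeadroomReport:
    """Compare a configured connection limit against an expected peak.

    `max_utilization` is a hypothetical safety threshold: the check fails
    when the expected peak would use more than this fraction of the limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    utilization = expected_peak / limit
    return HeadroomReport(limit, expected_peak, utilization,
                          utilization <= max_utilization)


# Hypothetical numbers: the same limit looks fine under staging load
# but is far too low for production load.
staging = check_connection_headroom(limit=1_000, expected_peak=200)
production = check_connection_headroom(limit=1_000, expected_peak=5_000)
print(staging.ok, production.ok)  # prints: True False
```

The point of the sketch is that the check uses the projected production peak rather than the load observed in the test environment, so a limit that is too low fails even when test traffic never approaches it.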

An internal audit concluded that this setting was the only relevant configuration difference between the two Redis clusters. Moving forward, we’re increasing our telemetry on Redis and auditing all relevant configuration templates to ensure no other updates are needed.
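
An audit of this kind can be approximated by diffing the effective configuration of the two clusters. The sketch below is illustrative only and is not the audit process used here. The function name, the `ignore` parameter, and the sample values are hypothetical. The input dictionaries are assumed to hold each cluster's settings as key/value strings, for example as returned by `CONFIG GET *` (redis-py exposes this as `config_get()`; some managed Redis services disable the command).

```python
from __future__ import annotations


def diff_configs(old: dict[str, str], new: dict[str, str],
                 ignore: frozenset[str] = frozenset()) -> dict[str, tuple]:
    """Return {setting: (old_value, new_value)} for every setting that differs.

    A setting present on only one side is reported with None for the other.
    `ignore` lists settings expected to differ (hypothetical; e.g. ports).
    """
    diffs = {}
    for key in sorted(old.keys() | new.keys()):
        if key in ignore:
            continue
        old_value, new_value = old.get(key), new.get(key)
        if old_value != new_value:
            diffs[key] = (old_value, new_value)
    return diffs


# Hypothetical values for illustration; not the actual cluster settings.
old_cluster = {"maxclients": "10000", "timeout": "0",
               "maxmemory-policy": "noeviction"}
new_cluster = {"maxclients": "1000", "timeout": "0",
               "maxmemory-policy": "noeviction"}

print(diff_configs(old_cluster, new_cluster))
# prints: {'maxclients': ('10000', '1000')}
```

Running a diff like this as a gate before cutover, rather than after an incident, would surface a differing limit regardless of whether test traffic ever reaches it.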

Posted Mar 16, 2023 - 18:50 PDT

Resolved
This incident has been resolved.
Posted Mar 15, 2023 - 11:18 PDT
Monitoring
The fix has been deployed. We are no longer seeing elevated error rates.
Posted Mar 15, 2023 - 11:14 PDT
Investigating
We are currently investigating this issue.
Posted Mar 15, 2023 - 10:55 PDT
This incident affected: ChatGPT.