At 10:55am on March 15th, a deploy of the ChatGPT service cut over to a new Redis cluster for additional capacity. The new cluster had a misconfigured maximum connections setting. We missed this in staging, canary, and production dual writing because none of those stages had reached the connection threshold. Rolling back to redirect traffic to the original Redis cluster restored service.
An internal audit concluded that this setting was the only relevant configuration difference between the two Redis clusters. Moving forward, we’re increasing our telemetry on Redis and auditing all relevant configuration templates to ensure no other updates are needed.
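
As an illustration only, the kind of check described above might look like the following sketch. It is not the tooling used in the audit. It assumes each cluster's configuration is available as a plain dictionary (for example, as collected from Redis's `CONFIG GET`), and that the setting in question corresponds to Redis's `maxclients` directive, which the write-up does not state. The function names and all values are hypothetical.

```python
from typing import Dict, Optional, Tuple


def diff_configs(
    old: Dict[str, str], new: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every key whose value differs between two config snapshots.

    Each entry maps the key to (old_value, new_value); a missing key is None.
    """
    keys = set(old) | set(new)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }


def connection_utilization(connected_clients: int, maxclients: int) -> float:
    """Fraction of the connection limit currently in use."""
    if maxclients <= 0:
        raise ValueError("maxclients must be positive")
    return connected_clients / maxclients


# Hypothetical snapshots; the real values are not given in this report.
original_cluster = {"maxclients": "10000", "timeout": "0"}
new_cluster = {"maxclients": "1000", "timeout": "0"}

differences = diff_configs(original_cluster, new_cluster)
for key, (old_value, new_value) in differences.items():
    print(f"{key}: original={old_value} new={new_value}")
```

A comparison like this catches a mismatch before cutover, independent of load. Alerting on `connection_utilization` (fed by the `connected_clients` field of Redis's `INFO` output) would cover the case where a limit is only reached under full production traffic.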