Full outage on all API endpoints
Incident Report for OpenAI
At 3:28pm PDT on Mar 16, 2023, a feature flag caused unexpected failures on a critical service. Despite a steady rollout from 1% -> 50% -> 100%, the failure was not identified until critical systems went offline. As a result, APIs across all of OpenAI’s products experienced 7 minutes of downtime before the cause was identified, reverted, and service was restored. We will be increasing the observability of such changes to ensure we catch these problems before they fully roll out, and make root causes clearer to reduce recovery time.
Posted Mar 16, 2023 - 15:30 PDT