Starting at 11:15pm PST on Feb 20, 2023, we suffered a major outage across all endpoints and models of our service. Traffic to /v1/completions was restored by 2:05am PST on Feb 21. Ongoing database instabilities left some non-Completions services degraded until 1:30pm PST on Feb 21. The root cause was an unexpected database failure with compounding effects that delayed full recovery.
Our primary Postgres database was scheduled for routine automated maintenance by our cloud provider. No impact was expected during this window because the primary would fail over to a hot standby and continue serving traffic, and this failover ran as expected. However, additional "read replica" databases were also scheduled for maintenance in the same window but had no corresponding failover functionality, and they unexpectedly failed to come back after their maintenance. The result was that we were able to keep parts of the site up and running, but did not have enough database capacity to serve all traffic.
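For context, the sketch below illustrates this kind of read/write split, assuming a psycopg2-based client; the DSNs, hostnames, and fallback behavior are hypothetical and not our actual configuration. Writes go to the primary, which has a provider-managed hot standby, while reads fan out across replicas, so losing the replicas removes most read capacity even while the primary stays healthy.

```python
# Illustrative read/write split; all DSNs and hostnames below are hypothetical.
import random
import psycopg2

PRIMARY_DSN = "postgresql://app@primary.internal/api"   # provider fails this over to a hot standby
REPLICA_DSNS = [
    "postgresql://app@replica-1.internal/api",           # read replicas have no automatic failover
    "postgresql://app@replica-2.internal/api",
]

def connect_for_write():
    # Writes always target the primary; maintenance is transparent because the
    # cloud provider promotes the hot standby.
    return psycopg2.connect(PRIMARY_DSN, connect_timeout=5)

def connect_for_read():
    # Reads are spread across the replicas. If none of them come back after
    # maintenance, the primary alone cannot absorb the full read load.
    for dsn in random.sample(REPLICA_DSNS, len(REPLICA_DSNS)):
        try:
            return psycopg2.connect(dsn, connect_timeout=5)
        except psycopg2.OperationalError:
            continue
    return psycopg2.connect(PRIMARY_DSN, connect_timeout=5)  # last-resort fallback
```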
Engineers immediately started two streams of work: creating new read replicas, and rebalancing the available database connections toward the most critical API endpoints, prioritizing Completions and authentication traffic. This restored traffic to the main /v1/completions endpoints by 2:05am PST on Feb 21. Recovery time was compounded by existing read replicas unexpectedly getting stuck in a recovery loop and being unable to reconnect to the primary database. It was further compounded by delays in spinning up new replicas caused by unexpected dependencies on non-on-call staff, in the middle of the night on a holiday weekend and during a holiday Monday on-call shift change.
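To make the recovery-loop symptom concrete, here is a minimal health check of the kind one might run against a replica; it is a sketch, not our actual tooling, and assumes psycopg2 plus standard Postgres system views. A replica stuck in a recovery loop reports pg_is_in_recovery() as true but never shows an active WAL receiver connection back to the primary.

```python
# Hypothetical replica health check; thresholds and connection details are illustrative.
import psycopg2

def replica_is_streaming(dsn: str, max_lag_seconds: float = 30.0) -> bool:
    """Return True if the host is a replica actively streaming WAL from the primary."""
    with psycopg2.connect(dsn, connect_timeout=5) as conn:
        with conn.cursor() as cur:
            # A healthy read replica reports that it is in recovery mode.
            cur.execute("SELECT pg_is_in_recovery()")
            (in_recovery,) = cur.fetchone()
            if not in_recovery:
                return False
            # pg_stat_wal_receiver is empty when the replica has no connection to the
            # primary, e.g. when it is stuck in a recovery loop after maintenance.
            cur.execute("SELECT count(*) FROM pg_stat_wal_receiver")
            (receivers,) = cur.fetchone()
            if receivers == 0:
                return False
            # Replication lag: seconds since the last transaction was replayed.
            cur.execute(
                "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)"
            )
            (lag,) = cur.fetchone()
            return float(lag) <= max_lag_seconds
```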
Throughout the night we continued to see database instabilities; however, isolation procedures kept /v1/completions online. Unfortunately, due to the traffic volume of ChatGPT and DALL·E, those products remained degraded by the database instabilities.
We use PgBouncer to pool database connections. Unfortunately, in the post-recovery database configuration we identified previously unknown slow queries hogging the pools and preventing other queries from running. Database instabilities were further aggravated by new read replicas inadvertently bypassing PgBouncer and exceeding database connection limits. This caused an additional brief outage from 10:43am to 11:04am PST.
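As an illustration of this pool-exhaustion failure mode, the sketch below shows how one might surface long-running queries from pg_stat_activity and bound them with a server-side statement_timeout. The role name and threshold are hypothetical, and traffic to the new replicas would also need to go through the PgBouncer pooler rather than directly to Postgres to stay within connection limits.

```python
# Hypothetical pool-exhaustion diagnostics; the role name and thresholds are illustrative.
import psycopg2

def find_slow_queries(dsn: str, min_seconds: int = 30):
    """List active queries running long enough to monopolize pooled connections."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
                FROM pg_stat_activity
                WHERE state = 'active'
                  AND now() - query_start > %s * interval '1 second'
                ORDER BY runtime DESC
                """,
                (min_seconds,),
            )
            return cur.fetchall()

def apply_statement_timeout(dsn: str) -> None:
    # A server-side statement timeout bounds how long any single query can hold a
    # pooled connection; the role name and value here are hypothetical.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("ALTER ROLE app_readonly SET statement_timeout = '30s'")
```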
By 1:30pm PST on Feb 21, read replicas were fully online and ChatGPT and DALL·E returned to fully operational status.
We are immediately executing on several action items as a result of this outage:
Longer term, we are reviewing our overall database strategy and planning for solutions that are more resilient to individual server failures.