Outage on all models
Incident Report for OpenAI
Postmortem

Starting at 11:15 pm PST on Feb 20, 2023, we suffered a major outage across all endpoints and models of our service. /v1/completions traffic was restored by 2:05 am PST on Feb 21. Ongoing database instabilities left some non-Completions services degraded until 1:30 pm PST on Feb 21. The root cause was an unexpected database failure, with compounding effects delaying full recovery.

Our primary Postgres database was scheduled for routine automated maintenance by our cloud provider. During this window, no impact was expected because the primary would fail over to a hot standby, which would continue to serve traffic. This ran as expected. However, additional "read replica" databases were also scheduled for maintenance in the same time window but had no corresponding fail-over functionality. The read replicas unexpectedly failed to come back after their maintenance window. As a result, we were able to keep parts of the site up and running, but did not have enough database capacity to service all traffic.

Engineers immediately started two streams of work. One was to create new read replicas. The other was to rebalance the available database connections towards the most critical API endpoints. Completions and authentication traffic were prioritized. This restored traffic to the main /v1/completions endpoints by 2:05 am PT on Feb 21. The length of the outage was compounded by existing read replicas unexpectedly getting stuck in a recovery loop and being unable to reconnect to the primary database. It was further compounded by delays in spinning up new replicas, caused by unexpected dependencies on non-on-call staff in the middle of the night on a holiday weekend and during a holiday Monday on-call shift change.
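The rebalancing described above can be pictured as dividing a fixed connection budget by endpoint priority. The following is only an illustrative sketch: the function name, the endpoint labels, the priority values, and the numbers are all hypothetical, and the report does not describe how the rebalancing was actually carried out.

```python
from typing import Dict, List, Tuple


def allocate_connections(
    total: int, demands: List[Tuple[str, int, int]]
) -> Dict[str, int]:
    """Grant connections in priority order (lower number = more critical).

    Each demand is (endpoint_name, priority, requested_connections).
    Endpoints with equal priority are served in the order they were listed.
    """
    remaining = total
    grants: Dict[str, int] = {}
    # sorted() is stable, so ties keep their original relative order.
    for name, _priority, requested in sorted(demands, key=lambda d: d[1]):
        granted = min(requested, remaining)
        grants[name] = granted
        remaining -= granted
    return grants


# Hypothetical demand figures, chosen only to exceed the available budget.
demands = [
    ("chat", 2, 80),
    ("completions", 0, 60),
    ("auth", 0, 20),
    ("images", 2, 40),
]
grants = allocate_connections(100, demands)
```

In this toy scenario the two priority-0 endpoints receive everything they ask for, and the lower-priority endpoints share whatever is left, which mirrors the trade-off of keeping critical traffic online while other services stay degraded.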

Throughout the night we continued to see database instabilities; however, isolation procedures were able to keep /v1/completions online. Unfortunately, due to the volume of ChatGPT and DALL·E traffic, these products remained degraded by the ongoing database instabilities.

We use PgBouncer to pool database connections. Unfortunately, in the post-recovery database configuration we identified previously unknown slow queries hogging the pools and preventing other queries from running. Database instabilities were further exacerbated by new read replicas inadvertently bypassing PgBouncer and exceeding database connection limits. This caused an additional brief outage from 10:43 am to 11:04 am PT.
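To illustrate why a few slow queries can starve a connection pool, here is a deliberately simplified model. It is not PgBouncer's implementation, and the function name, pool size, and durations are made up for the example. It assumes all queries arrive at once and are served first-in, first-out, and it shows how a per-query time limit (for example, something like Postgres's `statement_timeout`) bounds the wait for everything queued behind the slow queries. The report does not say which settings were actually tuned.

```python
import heapq
from typing import List, Optional


def queue_waits(
    pool_size: int, durations: List[float], timeout: Optional[float] = None
) -> List[float]:
    """Return how long each query waits for a free pooled connection.

    Toy model: every query arrives at t=0 and is served FIFO. If `timeout`
    is set, a query is cancelled after that many seconds, which frees its
    connection early.
    """
    # Min-heap of the times at which each pooled connection becomes free.
    free_at = [0.0] * pool_size
    heapq.heapify(free_at)
    waits: List[float] = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available connection
        waits.append(start)
        held = duration if timeout is None else min(duration, timeout)
        heapq.heappush(free_at, start + held)
    return waits


# Two slow queries arrive ahead of two fast ones on a pool of size 2.
durations = [30.0, 30.0, 0.1, 0.1]
no_limit = queue_waits(2, durations)
with_limit = queue_waits(2, durations, timeout=1.0)
```

Without a limit, the fast queries in this model wait the full 30 seconds behind the slow ones; with a 1-second limit they wait 1 second. The cost is that the slow queries are cancelled rather than completed.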

By 1:30 pm PT, read replicas were fully online and ChatGPT and DALL·E had returned to fully operational status.

We are immediately executing on several action items as a result of this outage:

  • We are working with our cloud provider to better prepare for how future maintenance will affect read replicas, in order to minimize impact.
  • Whenever possible, we will adjust maintenance windows to occur during normal working hours so that more staff are readily available.
  • We are adjusting caches to persist with longer TTLs if the database is unavailable, allowing some critical endpoints to continue functioning for longer.
  • PgBouncer configuration is being tuned to account for slow queries.
  • We are moving our on-call rotation to Tuesdays to avoid scheduling confusion around 3-day holiday weekends.
  • We are reviewing our on-call access policies and escalation channels to ensure that on-call staff have knowledge of, and access to, all dependencies needed to remediate an outage.
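As an illustration of the cache change listed above, the sketch below serves a cached value past its normal TTL when the backing store cannot be reached. It is a minimal example under stated assumptions, not a description of our implementation: the class name, the `loader` callback, the TTL values, and the injected clock are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Cache that falls back to stale entries when the loader fails."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl: float,
        stale_ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader        # e.g. a database lookup (assumed)
        self._ttl = ttl              # normal freshness window, in seconds
        self._stale_ttl = stale_ttl  # extended window used only on failure
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # still fresh
        try:
            value = self._loader(key)
        except Exception:
            # Loader failed: serve the old value if it is within stale_ttl.
            if entry is not None and now - entry[1] < self._stale_ttl:
                return entry[0]
            raise
        self._entries[key] = (value, now)
        return value


# Simulated clock and database so the example runs without real services.
now = [0.0]
db_up = [True]


def load_key(key: str) -> str:
    if not db_up[0]:
        raise ConnectionError("database unavailable")
    return f"record-for-{key}"


cache = StaleOnErrorCache(load_key, ttl=60, stale_ttl=3600, clock=lambda: now[0])
first = cache.get("api-key-1")          # loaded while the database is up
db_up[0] = False
now[0] = 120.0                          # past ttl, within stale_ttl
during_outage = cache.get("api-key-1")  # served from the stale entry
```

The trade-off is that callers may see outdated data during an outage, so this pattern fits best for values that change rarely. A production version would likely catch only connectivity errors rather than every exception.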

Longer term, we are reviewing our overall database strategy and planning towards solutions that are more resilient to individual server failures.

Posted Feb 23, 2023 - 16:41 PST

Resolved
This incident has been resolved.
Posted Feb 21, 2023 - 04:09 PST
Update
Service to ChatGPT has now been restored. There are still low but elevated error rates across some of our model completion endpoints that we are addressing now.
Posted Feb 21, 2023 - 04:03 PST
Update
Models and playground are available again, and we will continue to monitor. ChatGPT is continuing to have issues that we are investigating.
Posted Feb 21, 2023 - 03:06 PST
Monitoring
Service is now beginning to recover; we are continuing to monitor.
Posted Feb 21, 2023 - 02:04 PST
Update
We are continuing to work to resolve the underlying issue, and are investigating alternatives to recover service faster.
Posted Feb 21, 2023 - 01:55 PST
Update
We are continuing to work on recovering services.
Posted Feb 21, 2023 - 00:52 PST
Identified
We have identified the root cause and are working to recover service.
Posted Feb 20, 2023 - 23:59 PST
Investigating
We are currently investigating an outage affecting all models, including ChatGPT and Playground, beginning around 11:05 pm Pacific.
Posted Feb 20, 2023 - 23:40 PST
This incident affected: API, ChatGPT, and Playground.