Starting at 11:15pm PST on Feb 20, 2023, we suffered a major outage across all endpoints and models of our service. Traffic to /v1/completions was restored by 2:05am PST on Feb 21. Ongoing database instabilities left some non-Completions services degraded until 1:30pm PST on Feb 21. The root cause was an unexpected database failure with compounding effects that delayed full recovery.
Our primary Postgres database was scheduled for routine automated maintenance by our cloud provider. No impact was expected during this window because the primary would fail over to a hot standby and continue serving traffic, and this failover ran as expected. However, additional "read replica" databases were also scheduled for maintenance in the same window but had no corresponding failover functionality, and they unexpectedly failed to come back after their maintenance. The result was that we were able to keep parts of the site up and running, but did not have enough database capacity to serve all traffic.
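For context, the sketch below illustrates this kind of read/write split, assuming a psycopg2-based client; the DSNs, hostnames, and fallback behavior are hypothetical and not our actual configuration. Writes go to the primary, which has a provider-managed hot standby, while reads fan out across replicas, so losing the replicas removes most read capacity even while the primary stays healthy.

```python
# Illustrative read/write split; all DSNs and hostnames below are hypothetical.
import random
import psycopg2

PRIMARY_DSN = "postgresql://app@primary.internal/api"   # provider fails this over to a hot standby
REPLICA_DSNS = [
    "postgresql://app@replica-1.internal/api",           # read replicas have no automatic failover
    "postgresql://app@replica-2.internal/api",
]

def connect_for_write():
    # Writes always target the primary; maintenance is transparent because the
    # cloud provider promotes the hot standby.
    return psycopg2.connect(PRIMARY_DSN, connect_timeout=5)

def connect_for_read():
    # Reads are spread across the replicas. If none of them come back after
    # maintenance, the primary alone cannot absorb the full read load.
    for dsn in random.sample(REPLICA_DSNS, len(REPLICA_DSNS)):
        try:
            return psycopg2.connect(dsn, connect_timeout=5)
        except psycopg2.OperationalError:
            continue
    return psycopg2.connect(PRIMARY_DSN, connect_timeout=5)  # last-resort fallback
```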
Engineers immediately started two streams of work: creating new read replicas, and rebalancing the available database connections toward the most critical API endpoints, prioritizing Completions and authentication traffic. This restored traffic to the main /v1/completions endpoints by 2:05am PST on Feb 21. Recovery time was compounded by existing read replicas unexpectedly getting stuck in a recovery loop and being unable to reconnect to the primary database. It was further compounded by delays in spinning up new replicas caused by unexpected dependencies on non-on-call staff, in the middle of the night on a holiday weekend and during a holiday Monday on-call shift change.
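To make the recovery-loop symptom concrete, here is a minimal health check of the kind one might run against a replica; it is a sketch, not our actual tooling, and assumes psycopg2 plus standard Postgres system views. A replica stuck in a recovery loop reports pg_is_in_recovery() as true but never shows an active WAL receiver connection back to the primary.

```python
# Hypothetical replica health check; thresholds and connection details are illustrative.
import psycopg2

def replica_is_streaming(dsn: str, max_lag_seconds: float = 30.0) -> bool:
    """Return True if the host is a replica actively streaming WAL from the primary."""
    with psycopg2.connect(dsn, connect_timeout=5) as conn:
        with conn.cursor() as cur:
            # A healthy read replica reports that it is in recovery mode.
            cur.execute("SELECT pg_is_in_recovery()")
            (in_recovery,) = cur.fetchone()
            if not in_recovery:
                return False
            # pg_stat_wal_receiver is empty when the replica has no connection to the
            # primary, e.g. when it is stuck in a recovery loop after maintenance.
            cur.execute("SELECT count(*) FROM pg_stat_wal_receiver")
            (receivers,) = cur.fetchone()
            if receivers == 0:
                return False
            # Replication lag: seconds since the last transaction was replayed.
            cur.execute(
                "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)"
            )
            (lag,) = cur.fetchone()
            return float(lag) <= max_lag_seconds
```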
Throughout the night we continued to see database instabilities; however, isolation procedures kept /v1/completions online. Unfortunately, due to the traffic volume of ChatGPT and DALL·E, those products remained degraded by the database instabilities.
We use PgBouncer to pool database connections. Unfortunately, in the post-recovery database configuration we identified previously unknown slow queries hogging the pools and preventing other queries from running. Database instabilities were further aggravated by new read replicas inadvertently bypassing PgBouncer and exceeding database connection limits. This caused an additional brief outage from 10:43am to 11:04am PST.
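As an illustration of this pool-exhaustion failure mode, the sketch below shows how one might surface long-running queries from pg_stat_activity and bound them with a server-side statement_timeout. The role name and threshold are hypothetical, and traffic to the new replicas would also need to go through the PgBouncer pooler rather than directly to Postgres to stay within connection limits.

```python
# Hypothetical pool-exhaustion diagnostics; the role name and thresholds are illustrative.
import psycopg2

def find_slow_queries(dsn: str, min_seconds: int = 30):
    """List active queries running long enough to monopolize pooled connections."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
                FROM pg_stat_activity
                WHERE state = 'active'
                  AND now() - query_start > %s * interval '1 second'
                ORDER BY runtime DESC
                """,
                (min_seconds,),
            )
            return cur.fetchall()

def apply_statement_timeout(dsn: str) -> None:
    # A server-side statement timeout bounds how long any single query can hold a
    # pooled connection; the role name and value here are hypothetical.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("ALTER ROLE app_readonly SET statement_timeout = '30s'")
```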
By 1:30pm PST on Feb 21, read replicas were fully online and ChatGPT and DALL·E returned to fully operational status.
We are immediately executing on several action items as a result of this outage:
Longer term, we are reviewing our overall database strategy and planning for solutions that are more resilient to individual server failures.