At 9:05 AM PDT on Friday, August 19, we experienced a full outage on our publicly exposed load balancer, knocking out traffic to all models and the Playground. All customers making API requests at the time were affected. The network outage lasted approximately one hour. Once network traffic returned, some customers continued to see elevated errors when making requests to some models and to our Moderation API, and when changing their billing settings. These cascading issues were fully resolved within two hours.
Engineers quickly identified that the problem was related to our public load balancer, yet all visible configuration and monitoring indicated that it was operating correctly. We escalated to our cloud provider for help with the investigation, and they later determined that an unrelated change to our network configuration had broken the public load balancer in a way that was not visible to us.
Approximately half of traffic was restored an hour after the incident began. Unfortunately, after such a long outage, internal automation further hindered our ability to serve the full load. Over the following two hours, engineers worked manually to bring all systems back online.
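A common failure mode in this class of incident is that automation and clients all retry at once the moment traffic returns, overwhelming backends that are still recovering. One standard mitigation is exponential backoff with full jitter, sketched below; this is an illustrative example, not our actual automation, and the function names are hypothetical:

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Exponential backoff with full jitter: attempt i waits a random
    interval in [0, min(cap, base * 2**i)], so many clients recovering
    at once do not retry in lockstep."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]


def call_with_retries(request, attempts=6):
    """Retry a flaky call, sleeping a jittered, exponentially growing
    delay between attempts instead of hammering a recovering backend."""
    last_exc = None
    for delay in backoff_delays(attempts=attempts):
        try:
            return request()
        except ConnectionError as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

The jitter matters as much as the exponential growth: without it, every caller that failed at the same moment retries at the same moment, recreating the overload in synchronized waves.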
This multi-hour outage went on for far too long. To enable faster recovery in the future, we are making changes to improve observability and strengthen our change control processes.
In the course of the investigation, engineers identified the underlying issue in how network configuration changes are propagated that caused the load balancer to break unexpectedly, and have reproduced the issue in a test environment. A fix is expected soon; in the meantime, we are able to reliably mitigate the bug.