Outage on All Endpoints
Incident Report for OpenAI
Postmortem

At 9:05 AM PDT on Friday, August 19, we experienced a full outage on our publicly exposed load balancer, knocking out traffic to all models and the Playground. All customers making API requests at this time were affected by this outage. The network outage lasted approximately one hour. Upon network traffic returning, some customers continued to see elevated errors when making requests to some models and to our Moderation API, and when making changes to their billing settings. Within two hours these cascading issues were fully resolved.

Engineers quickly identified that the problem was related to our public load balancer. All visible configuration and monitoring indicated that the load balancer was operating correctly. We escalated to our cloud provider to help with the investigation, who later determined that an unrelated change to our network configuration broke our public load balancer in a way that was not visible to us.

Approximately half of traffic was restored an hour after the incident began. But unfortunately after such a long period of outage, internal automation then further hindered our ability to serve the full load. Over the subsequent two hours, engineers manually worked to get all systems back online.

This multi-hour outage went on for far too long. To enable faster recovery times in the future, we're implementing changes to further increase observability and increase robustness of our change control processes.

In the course of investigation, engineers have identified the underlying issue in how network configurations get propagated that caused the load balancer to unexpectedly break, and have been able to reproduce the issue in a test environment. A fix is expected soon, and in the meantime we are able to reliably mitigate the bug.

Posted Aug 25, 2022 - 14:07 PDT

Resolved
This issue has been resolved. We apologize again for the inconvenience. We have begun a postmortem into this outage and will share it out as soon as we're able.
Posted Aug 19, 2022 - 12:15 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 19, 2022 - 10:56 PDT
Investigating
Network traffic is flowing again but many of our models are failing. We are investigating.
Posted Aug 19, 2022 - 10:20 PDT
Monitoring
We are seeing external traffic reach our servers again. We are monitoring.
Posted Aug 19, 2022 - 10:08 PDT
Update
We are experiencing a critical issue in our networking stack preventing external connections to our system. We are continuing to investigate. We appreciate your patience and apologize for the inconvenience.
Posted Aug 19, 2022 - 09:52 PDT
Investigating
We are investigating a major outage affecting all models and the OpenAI Playground
Posted Aug 19, 2022 - 09:20 PDT
This incident affected: API and Playground Site.