At 9:05 AM PDT on Friday, August 19, we experienced a full outage on our publicly exposed load balancer, knocking out traffic to all models and the Playground. All customers making API requests at the time were affected. The network outage lasted approximately one hour. Once network traffic returned, some customers continued to see elevated errors when making requests to some models and to our Moderation API, and when changing their billing settings. These cascading issues were fully resolved within two hours.
Engineers quickly identified that the problem was related to our public load balancer, yet all visible configuration and monitoring indicated that it was operating correctly. We escalated to our cloud provider for help with the investigation, and they later determined that an unrelated change to our network configuration had broken the public load balancer in a way that was not visible to us.
Approximately half of traffic was restored an hour after the incident began. Unfortunately, after such a long outage, internal automation further hindered our ability to serve the full load. Over the following two hours, engineers worked manually to bring all systems back online.
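A common failure mode in this class of incident is that automation and clients all retry at once the moment traffic returns, overwhelming backends that are still recovering. One standard mitigation is exponential backoff with full jitter, sketched below; this is an illustrative example, not our actual automation, and the function names are hypothetical:

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Exponential backoff with full jitter: attempt i waits a random
    interval in [0, min(cap, base * 2**i)], so many clients recovering
    at once do not retry in lockstep."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]


def call_with_retries(request, attempts=6):
    """Retry a flaky call, sleeping a jittered, exponentially growing
    delay between attempts instead of hammering a recovering backend."""
    last_exc = None
    for delay in backoff_delays(attempts=attempts):
        try:
            return request()
        except ConnectionError as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

The jitter matters as much as the exponential growth: without it, every caller that failed at the same moment retries at the same moment, recreating the overload in synchronized waves.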
This multi-hour outage went on for far too long. To enable faster recovery in the future, we are making changes to improve observability and strengthen our change control processes.
In the course of the investigation, engineers identified the underlying issue in how network configuration changes are propagated that caused the load balancer to break unexpectedly, and have reproduced the issue in a test environment. A fix is expected soon; in the meantime, we are able to reliably mitigate the bug.