The API was partially degraded between 6:52 pm and 8:19 pm PDT on 2023-03-18. A few percent of requests to GPT 3.5 Turbo and text-davinci-003 returned 500s. Other models, particularly the Moderation API, also experienced a lower rate of 500s and 499 induced timeouts. The cause was that some of our hosts unexpectedly encountered network connectivity issues, which manifested as failures to resolve DNS and failures in internal node diagnostic services. We believe these network connectivity issues were triggered by sudden changes in external traffic patterns that overwhelmed the network stacks of some hosts.
Once we identified the root cause, we were able to adjust to the new traffic patterns and take unavailable nodes out of circulation. Our response was slowed because the same issue took down our primary monitoring service, and because some of the affected nodes had unrelated hardware problems.
We have added better observability around DNS failures and network traffic. We are working with our cloud provider to better understand this pattern of unexpected host failure, and we are putting services in place to better handle sudden changes in external traffic.
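As an illustration of what host-level DNS observability can look like, here is a minimal sketch of a resolution probe. It is not our actual tooling: the names `probe_dns`, `DnsProbeResult`, and `failure_rate` are hypothetical, and the sketch assumes that timing a standard-library lookup and recording failures is enough to surface the kind of DNS failure described above.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class DnsProbeResult:
    """Outcome of one resolution attempt (hypothetical structure)."""
    hostname: str
    ok: bool
    latency_s: float
    error: Optional[str] = None


def probe_dns(hostname: str,
              resolver: Callable = socket.getaddrinfo) -> DnsProbeResult:
    """Try to resolve `hostname` and record success, latency, and any error.

    `resolver` defaults to socket.getaddrinfo; it is injectable so the
    probe can be exercised without network access.
    """
    start = time.monotonic()
    try:
        resolver(hostname, None)
        return DnsProbeResult(hostname, True, time.monotonic() - start)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return DnsProbeResult(hostname, False,
                              time.monotonic() - start, str(exc))


def failure_rate(results: Iterable[DnsProbeResult]) -> float:
    """Fraction of probes that failed; 0.0 when there are no results."""
    results = list(results)
    if not results:
        return 0.0
    return sum(1 for r in results if not r.ok) / len(results)
```

A probe like this could run periodically on each host and export `failure_rate` to a monitoring system that does not itself depend on the host's DNS path, so that a DNS failure does not also hide the alert.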