Last week, a cascading set of failures, driven in part by historically high load and unexpected upstream interruptions, led to degraded performance on our API. Not all customers were affected, but some observed increased latencies when making completion requests, in some cases leading to timeouts. Some customers, particularly those using fine-tuned models, also observed HTTP 429 errors with a message that the requested model was still being loaded. And in some instances, requests were dropped with HTTP 503 errors.
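For callers, transient errors like these are typically best handled with retries and exponential backoff rather than failing immediately. Below is a minimal sketch of that pattern; the `make_request` callable and its `(status, body)` return shape are hypothetical stand-ins for whatever HTTP client you use, not part of our API.

```python
import random
import time

def with_retries(make_request, max_attempts=5, base_delay=1.0):
    """Retry a request on transient HTTP 429/503 errors, backing off
    exponentially with jitter between attempts.

    `make_request` is a zero-argument callable returning a
    (status_code, body) tuple -- a placeholder for a real HTTP call.
    """
    for attempt in range(max_attempts):
        status, body = make_request()
        if status not in (429, 503):
            return status, body
        # Exponential backoff with jitter: 1x, 2x, 4x, ... the base
        # delay, randomized to avoid synchronized retry storms.
        delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
        time.sleep(delay)
    return status, body

# Usage: simulate a request that hits 429 twice, then succeeds.
responses = iter([(429, "model loading"), (429, "model loading"), (200, "ok")])
status, body = with_retries(lambda: next(responses), base_delay=0.01)
```

The jitter matters: if many clients retry on the same fixed schedule after an outage, their retries arrive in synchronized waves and can prolong the overload.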
We have taken immediate steps to resolve these issues. We have also made new investments, and prioritized existing ones, to keep these issues from recurring even as our request volume continues to grow. We have fixed several newly identified bugs in our system, made ourselves more resilient to upstream failures from our cloud provider, and improved the scaling of historically fixed capacity in our stack to adapt to increased load.
Latency and reliability are our team's highest priorities. We deeply apologize for the interruptions and degraded service.