Between 10:15 AM and 1:00 PM on March 25, traffic was inadvertently routed to a compute node that was unhealthy due to underlying operating system issues. The node's incorrect state triggered our rate-limiting protections, causing a spike in HTTP 429 responses, and the global error rate reached as high as 20% during this period.
To mitigate the issue quickly, we temporarily disabled rate limits for all customers. After identifying the root cause, we restored the affected compute node to a healthy state, deployed a fix so that the rate-limiting service continues to function correctly even when a compute node becomes unhealthy, and then re-enabled rate limiting.
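To illustrate the general shape of such a guard, here is a minimal sketch of a rate-limit decision that ignores usage reports from unhealthy or stale nodes and fails open when no trustworthy data is available. All names, fields, and thresholds below are illustrative assumptions for explanation only, not our actual implementation.

```python
import time
from dataclasses import dataclass


@dataclass
class NodeReport:
    """Hypothetical per-node usage report consumed by the rate limiter."""
    node_id: str
    requests_in_window: int
    healthy: bool
    reported_at: float  # Unix timestamp of the node's last report

MAX_REPORT_AGE_SECONDS = 30   # assumption: reports older than this are ignored
WINDOW_LIMIT = 10_000         # assumption: illustrative per-window request limit


def should_rate_limit(reports: list[NodeReport], now: float | None = None) -> bool:
    """Decide whether to return HTTP 429, discarding unreliable node data.

    Only healthy nodes with recent reports contribute to the usage count,
    so a single unhealthy node cannot inflate the total and trigger
    spurious rate limiting.
    """
    now = time.time() if now is None else now
    trusted = [
        r for r in reports
        if r.healthy and (now - r.reported_at) <= MAX_REPORT_AGE_SECONDS
    ]
    if not trusted:
        # Fail open: with no trustworthy data, prefer serving traffic
        # over returning spurious 429s.
        return False
    total = sum(r.requests_in_window for r in trusted)
    return total > WINDOW_LIMIT
```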
We know that extended API outages affect our customers' products and businesses, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.