Image Generation API Errors

Write-up

Root Cause
A code update introduced a new variable format for part of our image-generation pipeline. One service began using the new format while its peer services still expected the old one. The mismatch triggered validation failures, and a portion of image-generation calls returned HTTP 500 errors.
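
To illustrate this class of failure, the sketch below uses an invented field: an image size sent either as a "WIDTHxHEIGHT" string (standing in for the old format) or as a width/height mapping (standing in for the new one). The field, both formats, and the names `Size`, `parse_size_strict`, and `parse_size_tolerant` are assumptions made for illustration, not the actual pipeline code. A strict parser that knows only the old format rejects the new one, while a tolerant parser accepts both during a staged rollout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    width: int
    height: int


def parse_size_strict(value):
    """Accept only the (hypothetical) old format: a 'WIDTHxHEIGHT' string."""
    if not isinstance(value, str):
        raise ValueError(f"expected 'WxH' string, got {type(value).__name__}")
    width, sep, height = value.partition("x")
    if not sep:
        raise ValueError(f"expected 'WxH' string, got {value!r}")
    return Size(int(width), int(height))


def parse_size_tolerant(value):
    """Accept both the (hypothetical) new mapping format and the old string format."""
    if isinstance(value, dict):
        return Size(int(value["width"]), int(value["height"]))
    return parse_size_strict(value)


old_payload = "1024x1024"                      # what an un-updated peer sends
new_payload = {"width": 1024, "height": 1024}  # what the updated service sends

# The tolerant parser reads either format identically.
print(parse_size_tolerant(old_payload) == parse_size_tolerant(new_payload))

# The strict parser fails on the new format, which an API layer
# would typically surface to callers as a server error.
try:
    parse_size_strict(new_payload)
except ValueError as exc:
    print(f"validation failure: {exc}")
```

Having consumers accept both formats before any producer starts emitting the new one is one common way to keep a staged rollout from depending on deployment order.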

Incident Timeline (PDT, May 14, 2025)

10:10 AM First component rolled out to production.
10:30 AM Engineers received the alert and began investigating.
10:45 AM Rollback of both components to the previous stable release began.
10:50 AM Error rates returned to normal; incident declared resolved.
11:00 AM Post-incident review began.

Monitoring Gaps

Alerting was insufficient to detect schema or format mismatches between dependent services.
Dashboards grouped multiple routes together, which delayed pinpointing the exact failing endpoint.
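
The effect of grouping routes can be shown with a small sketch. The route names and request counts below are made up for illustration, and `error_rates_by_route` and `aggregate_error_rate` are hypothetical helpers, not part of our monitoring stack. A single aggregate 5xx rate can look mild while one route is failing heavily.

```python
from collections import defaultdict


def aggregate_error_rate(requests):
    """Fraction of all requests that returned a 5xx status."""
    if not requests:
        return 0.0
    failed = sum(1 for _, status in requests if 500 <= status <= 599)
    return failed / len(requests)


def error_rates_by_route(requests):
    """Fraction of requests per route that returned a 5xx status."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for route, status in requests:
        totals[route] += 1
        if 500 <= status <= 599:
            errors[route] += 1
    return {route: errors[route] / totals[route] for route in totals}


# Hypothetical traffic: one busy healthy route, one smaller failing route.
sample = (
    [("/images/edit", 200)] * 90
    + [("/images/generate", 200)] * 5
    + [("/images/generate", 500)] * 5
)

print(aggregate_error_rate(sample))   # 0.05
print(error_rates_by_route(sample))   # {'/images/edit': 0.0, '/images/generate': 0.5}
```

In this made-up sample the combined rate is 5%, while the per-route view shows half of the requests to one route failing.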

Since this incident, we have been closing the following monitoring gaps:

More coherent monitoring and alerting across related services.
More detailed instructions on how to quickly locate the root cause.
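
As one possible shape for a cross-service check, the sketch below compares the format version each service emits with the versions its downstream consumers accept. The service names, version numbers, and the `find_format_mismatches` helper are all hypothetical; this is an illustration of the idea under those assumptions, not a description of the checks we deployed.

```python
def find_format_mismatches(produces, accepts, dependencies):
    """Return (producer, consumer, version) triples where the consumer
    does not accept the format version the producer emits."""
    mismatches = []
    for producer, consumer in dependencies:
        version = produces[producer]
        if version not in accepts[consumer]:
            mismatches.append((producer, consumer, version))
    return mismatches


# Hypothetical state partway through a rollout: the first service
# already emits version 2, but one consumer still accepts only version 1.
produces = {"prompt-service": 2}
accepts = {"render-service": {1}, "safety-service": {1, 2}}
dependencies = [
    ("prompt-service", "render-service"),
    ("prompt-service", "safety-service"),
]

for producer, consumer, version in find_format_mismatches(produces, accepts, dependencies):
    print(f"ALERT: {consumer} does not accept format v{version} from {producer}")
```

A check like this could run before a rollout or periodically against live service metadata, so that a mismatch raises an alert naming the affected pair of services rather than appearing only as a rise in HTTP 500 errors.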

Availability metrics are reported at an aggregate level across all tiers, models, and error types. Availability for an individual customer may vary depending on their subscription tier and on the specific model and API features in use.