Root Cause
A code update introduced a new variable format for part of our image-generation pipeline. One service began using the new format while its peers still expected the old one. The mismatch triggered validation failures, and a portion of image-generation calls returned HTTP 500 errors.
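One common way to avoid this class of failure is to have each consumer accept both formats while a rollout is in progress. The sketch below is illustrative only: this report does not describe the actual field or its formats, so the `size` encodings, `ImageSize`, and `parse_size` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ImageSize:
    """Hypothetical normalized form shared by every service."""
    width: int
    height: int


def parse_size(value: Any) -> ImageSize:
    """Accept both the (assumed) old and new encodings of one field."""
    # Assumed old format: a string such as "1024x768".
    if isinstance(value, str):
        width_text, sep, height_text = value.partition("x")
        if not sep:
            raise ValueError(f"unrecognized size string: {value!r}")
        return ImageSize(int(width_text), int(height_text))
    # Assumed new format: a mapping such as {"width": 1024, "height": 768}.
    if isinstance(value, dict):
        return ImageSize(int(value["width"]), int(value["height"]))
    # Anything else is rejected explicitly instead of failing deep in the pipeline.
    raise ValueError(f"unrecognized size format: {type(value).__name__}")
```

With a reader like this deployed everywhere first, a producer can switch to the new format later without its peers rejecting the payload.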
Incident Timeline (PDT, May 14, 2025)
10:10 AM First component rolled out to production.
10:30 AM Engineers received the alert and began investigating.
10:45 AM Rollback of both components to the previous stable release began.
10:50 AM Error rates returned to normal; incident declared resolved.
11:00 AM Post-incident review began.
Monitoring Gaps
Alerting was insufficient to detect schema or format mismatches between dependent services.
Dashboards grouped multiple routes together, which delayed identification of the exact failing endpoint.
Following this incident, we are closing these monitoring gaps through the following improvements:
More coherent monitoring and alerting across related services.
More detailed instructions on how to quickly locate the root cause.
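As an illustration of the per-route visibility described above, error rates can be computed for each route separately, so that one failing endpoint is not averaged away by healthy ones. This is a minimal sketch, not our monitoring implementation: the `(route, status)` input shape, the function names, and the threshold are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def error_rate_by_route(requests: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return the fraction of 5xx responses for each route."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for route, status in requests:
        totals[route] += 1
        if status >= 500:
            errors[route] += 1
    return {route: errors[route] / totals[route] for route in totals}


def routes_over_threshold(rates: Dict[str, float], threshold: float) -> List[str]:
    """List the routes whose error rate exceeds the (assumed) alert threshold."""
    return sorted(route for route, rate in rates.items() if rate > threshold)
```

Given a window of request records, `routes_over_threshold(error_rate_by_route(records), 0.05)` would name the specific endpoints to investigate; the 5% threshold here is a placeholder.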