Summary
On June 9 at 11:36 PM PDT, a routine update to the host operating system on our cloud-hosted GPU servers caused a significant number of GPU nodes to lose network connectivity, reducing the capacity available to our services. As a result, ChatGPT users experienced error rates peaking at ~35%, while API users experienced error rates peaking at ~25%. The highest impact occurred between 2:00 AM and 8:00 AM PDT on June 10. Engineering teams were alerted immediately and worked diligently to restore service.
The service was restored by re-imaging affected nodes, halting the background update mechanisms, and taking additional mitigation steps to bring systems back online. While we were near full system recovery by 8:00 AM PDT, all affected systems were fully restored by 3:00 PM PDT on June 10. We have addressed the underlying causes and implemented additional safeguards and tooling to prevent a recurrence. We sincerely apologize for the disruption and thank you for your patience while we resolved the issue.
Timeline
(Time in PDT)
Jun 9
11:36 PM: Incident begins. Alerts were triggered across multiple services as underlying infrastructure began losing network connectivity, prompting immediate escalation to the engineering team.
Jun 10
01:45 AM: Error rates remain at ~5% while capacity is rebalanced.
02:00 AM: Engineering team starts re-imaging affected VMs.
06:49 AM: Recovery automation deployed to restart remaining impacted nodes.
08:00 AM – 03:00 PM: Capacity gradually restored; remaining cleanup performed.
09:40 AM: All major API models fully operational.
12:30 PM: API fully recovered.
03:00 PM: Incident marked as mitigated. Services fully restored.
Impact
ChatGPT: Error rates peaked at ~35% between 2:00 AM and 8:00 AM PDT on June 10. Full recovery was completed at 3:00 PM PDT.
API: Availability dropped to ~75% at the peak of the incident. Major models served by the API were fully operational by 9:40 AM PDT on June 10, with complete recovery across all models achieved by 12:30 PM PDT.
Root Cause
The root cause was an unintended side effect of a routine update to the host operating system. A daily scheduled system update inadvertently restarted the network management service (systemd-networkd) on affected nodes, which conflicted with a networking agent that we run on production nodes. The conflict removed all routes from the impacted nodes, leaving them without network connectivity.
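To make the symptom concrete, the sketch below shows the kind of lightweight node-level check that surfaces this failure mode: it inspects the kernel routing table and raises an alert when the default route is gone. This is a hypothetical illustration rather than the monitoring we actually run, and it assumes a Linux host with the iproute2 ip tool available.

```python
#!/usr/bin/env python3
"""Hypothetical node-level check: flag a node whose default route has been
removed, the symptom described in the root cause above."""
import subprocess
import sys


def default_route_present() -> bool:
    # `ip route show default` prints nothing when no default route exists.
    result = subprocess.run(
        ["ip", "route", "show", "default"],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


if __name__ == "__main__":
    if default_route_present():
        print("OK: default route present")
        sys.exit(0)
    print("CRITICAL: no default route; node has likely lost network connectivity")
    sys.exit(2)
```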
Mitigation and Recovery Efforts
To address the immediate impact, our engineering teams initiated a large-scale re-imaging of the affected GPU nodes. The absence of break-glass tooling to rapidly restore network connectivity on affected nodes, together with the additional steps required to bring those nodes back online, extended the overall recovery timeline.
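The sketch below is a minimal, hypothetical example of the kind of break-glass helper referenced above: it re-installs a previously recorded known-good default route if that route has disappeared. The gateway and interface values are placeholders, and a production tool would also need an out-of-band delivery path (for example, being pre-installed or reachable via a serial console), since a node in this state cannot be reached over the network.

```python
#!/usr/bin/env python3
"""Hypothetical break-glass helper: restore a previously recorded default route
if it has been removed. The values below are placeholders, not production
settings."""
import subprocess

KNOWN_GOOD_GATEWAY = "10.0.0.1"  # placeholder: recorded known-good gateway
KNOWN_GOOD_DEVICE = "eth0"       # placeholder: recorded uplink interface


def has_default_route() -> bool:
    out = subprocess.run(
        ["ip", "route", "show", "default"],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(out.strip())


def restore_default_route() -> None:
    # `ip route replace` installs the route whether or not one already exists.
    subprocess.run(
        ["ip", "route", "replace", "default",
         "via", KNOWN_GOOD_GATEWAY, "dev", KNOWN_GOOD_DEVICE],
        check=True,
    )


if __name__ == "__main__":
    if has_default_route():
        print("Default route present; nothing to do.")
    else:
        restore_default_route()
        print(f"Restored default route via {KNOWN_GOOD_GATEWAY} on {KNOWN_GOOD_DEVICE}.")
```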
Preventive Actions
To prevent a recurrence, we have completed several preventive measures, and others are actively underway:
Completed actions:
Disabled automatic daily updates on GPU VMs so that host updates are applied only in a controlled, scheduled manner (an illustrative per-node check is sketched after this list).
Updated system configurations to prevent conflicts between systemd-networkd and the networking agent we run on production nodes.
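The sketch below illustrates the kind of per-node check that can verify the first completed action. It is a hypothetical example: it assumes Ubuntu-style hosts where unattended upgrades are driven by the apt-daily timer units, and those unit names may not match our actual fleet.

```python
#!/usr/bin/env python3
"""Hypothetical per-node audit: confirm that automatic daily update timers are
disabled. Assumes Ubuntu-style hosts using apt-daily timer units."""
import subprocess
import sys

# Assumed unit names for illustration; a real fleet may use different timers.
UPDATE_TIMERS = ["apt-daily.timer", "apt-daily-upgrade.timer"]


def timer_state(timer: str) -> str:
    # `systemctl is-enabled` prints a state such as enabled, disabled, or masked.
    result = subprocess.run(
        ["systemctl", "is-enabled", timer],
        capture_output=True, text=True,
    )
    return result.stdout.strip() or "not-found"


if __name__ == "__main__":
    still_active = []
    for timer in UPDATE_TIMERS:
        state = timer_state(timer)
        print(f"{timer}: {state}")
        if state not in ("disabled", "masked", "not-found"):
            still_active.append(timer)
    if still_active:
        print("FAIL: automatic update timers still active:", ", ".join(still_active))
        sys.exit(1)
    print("OK: no automatic daily update timers enabled")
```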
Ongoing initiatives:
We have initiated an audit of VM configurations across our fleet to identify and mitigate similar risks (a sketch of how such an audit can be fanned out across hosts follows this list).
We are prioritizing improvements in recovery speed, particularly for critical infrastructure components such as GPU VMs and clusters.
We are planning regular disaster recovery drills to improve response effectiveness and minimize future disruptions.
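For the fleet-wide audit mentioned above, the sketch below shows one way to fan a per-node check out over SSH and summarize which hosts still need attention. The host names, SSH access, and timer unit names are placeholders and assumptions for illustration only, not a description of our actual tooling.

```python
#!/usr/bin/env python3
"""Hypothetical fleet audit runner: execute a per-node check over SSH and
report hosts that still have automatic daily update timers enabled.
Host names, SSH access, and timer unit names are placeholders."""
import subprocess

HOSTS = ["gpu-node-001", "gpu-node-002"]  # placeholder inventory
TIMERS = ["apt-daily.timer", "apt-daily-upgrade.timer"]  # assumed unit names
REMOTE_CHECK = "systemctl is-enabled " + " ".join(TIMERS)


def node_is_compliant(host: str) -> bool:
    """True when every checked timer reports disabled or masked on the host."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, REMOTE_CHECK],
            capture_output=True, text=True, timeout=30,
        )
    except subprocess.TimeoutExpired:
        return False  # treat unreachable hosts as needing attention
    states = result.stdout.split()
    # Expect one state per timer; anything else (missing unit, SSH failure)
    # is conservatively flagged for follow-up rather than silently passed.
    return len(states) == len(TIMERS) and all(
        s in ("disabled", "masked") for s in states
    )


if __name__ == "__main__":
    failing = [host for host in HOSTS if not node_is_compliant(host)]
    if failing:
        print("Hosts needing attention:", ", ".join(failing))
    else:
        print("All audited hosts have automatic daily updates disabled.")
```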
We sincerely apologize for the disruption and the impact that this incident caused to all of our customers. We are committed to strengthening our infrastructure to prevent similar incidents.