Cloudflare outage should not have happened, and they seem to be missing the point on how to avoid it in the future
Recorded: Nov. 27, 2025, 1:02 a.m.
Cloudflare's recent global outage, while seemingly a complex incident, reveals a fundamental issue: a disconnect between application logic and database design. As Eduardo Bellani details, the root cause was not a single catastrophic failure but a confluence of factors centered on an unanticipated interaction within Cloudflare's systems. The outage underscores a broader challenge for large, complex organizations: guaranteeing logical consistency and correctness as data structures and processing requirements evolve.

The core of the problem resides in the Bot Management feature-file generation logic. This component relied on a database query that lacked a critical constraint, specifically a database-name filter. That omission allowed the query to retrieve column metadata from the r0 database as well, once explicit user grants were gradually rolled out. The resulting duplication roughly doubled the number of rows returned, pushed the generated feature file past a hard limit in the software consuming it, and triggered a crash loop across Cloudflare's core systems (a sketch of the query pattern follows below). Bellani presents this as a classic "database/application mismatch," a recurring pattern in large-scale systems.

Bellani's analysis highlights the inadequacy of Cloudflare's response, which consists of standard mitigations: hardening ingestion of Cloudflare-generated configuration files as if they were user input, enabling global kill switches, and managing error reports. These are reasonable steps, he argues, but they treat symptoms rather than the underlying logical inconsistency, so nothing in them prevents an analogous failure from recurring. He contends that Cloudflare incorrectly equated physical replication with the absence of a single point of failure, overlooking that a logical single point of failure can exist even in a highly replicated environment. This misreading is compounded by Cloudflare's earlier shift from PostgreSQL to ClickHouse, driven by data-processing speed but without comparable attention to logical data integrity.

Preventing such incidents, Bellani emphasizes, requires proactive analytical design rather than reactive responses. He advocates two foundational practices: complete normalization of the database, eliminating nullable fields (see the schema sketch below), and formally verified application code, drawing on Chapman et al. (2024) on co-developing programs and their proofs of correctness (see the Lean sketch below). While FAANG-style companies may resist adopting formal methods, Bellani contends that for their most critical systems this is the only viable path to eliminating failures by design, rather than merely reducing their likelihood. He closes on a familiar note: internet users at large would appreciate the benefits of such rigor, caveat emptor notwithstanding. The case underscores a crucial lesson: architecture and design choices in complex systems have a profound impact on their resilience and stability.
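To make the missing constraint concrete, here is a minimal sketch of the query pattern described above. The table and database names follow Cloudflare's public postmortem; the constants, helper code, and column names are illustrative, not Cloudflare's actual code.

```python
# Sketch of the failure mode described above. Table and database names
# follow Cloudflare's public postmortem; everything else is illustrative.

# The metadata query, before and after the missing constraint.
# ClickHouse's system.columns holds one row per (database, table, column),
# so granting users visibility into the underlying r0 database made
# every column of the table appear twice.
BEFORE = """
    SELECT name, type
    FROM system.columns
    WHERE table = 'http_requests_features'
    ORDER BY name
"""

AFTER = """
    SELECT name, type
    FROM system.columns
    WHERE table = 'http_requests_features'
      AND database = 'default'  -- the missing predicate
    ORDER BY name
"""

# A tiny in-memory model of the same effect (hypothetical column names).
system_columns = [
    ("default", "http_requests_features", "bot_score"),
    ("default", "http_requests_features", "ja4"),
    ("r0", "http_requests_features", "bot_score"),  # visible after the grants
    ("r0", "http_requests_features", "ja4"),
]

unfiltered = [col for (db, tbl, col) in system_columns
              if tbl == "http_requests_features"]
filtered = [col for (db, tbl, col) in system_columns
            if tbl == "http_requests_features" and db == "default"]

print(unfiltered)  # ['bot_score', 'ja4', 'bot_score', 'ja4'] -> file doubles
print(filtered)    # ['bot_score', 'ja4']                     -> stable
```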
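Bellani's "no nullable fields" prescription can also be made concrete. The following runnable sketch uses a hypothetical schema; the point is the pattern: an optional fact gets its own table instead of a NULLable column, so "unknown" is modeled by the absence of a row rather than by a sentinel every reader must remember to handle.

```python
# Minimal runnable sketch of the "no nullable fields" rule, on a
# hypothetical schema. Instead of a NULLable bot_score column on
# request, the optional fact lives in its own fully-constrained table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Anti-pattern (not created here): request(id, bot_score REAL NULL),
    -- where NULL means "no score yet".

    CREATE TABLE request (
        id INTEGER PRIMARY KEY
    );

    -- Normalized alternative: a score row exists if and only if a
    -- score was computed; no column is ever NULL.
    CREATE TABLE request_bot_score (
        request_id INTEGER NOT NULL PRIMARY KEY REFERENCES request(id),
        bot_score  REAL    NOT NULL
    );
""")
conn.execute("INSERT INTO request (id) VALUES (1), (2)")
conn.execute("INSERT INTO request_bot_score VALUES (1, 0.97)")

# "Which requests have no score yet?" is answered by absence of a row,
# not by a NULL sentinel.
unscored = conn.execute("""
    SELECT r.id
    FROM request AS r
    WHERE NOT EXISTS (
        SELECT 1 FROM request_bot_score AS s WHERE s.request_id = r.id
    )
""").fetchall()
print(unscored)  # [(2,)]
```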
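On the formal-methods point: Chapman et al. (2024) work in SPARK/Ada; as a language-neutral illustration of co-developing a program with its proof, here is a small Lean 4 sketch with hypothetical names. It clamps a feature list to a hard limit and carries a machine-checked guarantee that the limit holds, the kind of invariant whose silent violation crashed the Bot Management module.

```lean
-- Hypothetical sketch: co-developing a small program with its proof,
-- in the spirit (not the toolchain) of Chapman et al. (2024).

-- Keep at most `limit` features; the consumer preallocates for `limit`.
def clampFeatures (limit : Nat) (features : List String) : List String :=
  features.take limit

-- Machine-checked guarantee: the output can never exceed the limit,
-- so the "file too large, so panic" failure mode is ruled out by
-- construction rather than discovered at runtime.
theorem clampFeatures_length_le (limit : Nat) (features : List String) :
    (clampFeatures limit features).length ≤ limit := by
  simp only [clampFeatures, List.length_take]
  omega
```

Silently truncating is of course its own design decision; the point is that whatever invariant the consumer relies on can be stated and proved alongside the code instead of being assumed.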