
Cloudflare outage should not have happened

Recorded: Nov. 27, 2025, 1:02 a.m.

Cloudflare outage should not have happened, and they seem to be missing the point on how to avoid it in the future

November 26, 2025
by Eduardo Bellani


Yet again, another global IT outage happened (déjà vu strikes again in our industry). This time at Cloudflare (Prince 2025), again taking down large swaths of the internet with it (Booth 2025).
And yes, like my previous analyses of the GCP and CrowdStrike outages, this post critiques Cloudflare's root cause analysis (RCA), which, despite providing a great overview of what happened, misses the real lesson.
Here's the key section of their RCA:

Unfortunately, there were assumptions made in the past, that the list of
columns returned by a query like this would only include the “default”
database:
SELECT
name,
type
FROM system.columns
WHERE
table = 'http_requests_features'
order by name;
Note how the query does not filter for the database name. With us
gradually rolling out the explicit grants to users of a given ClickHouse
cluster, after the change at 11:05 the query above started returning
“duplicates” of columns because those were for underlying tables stored
in the r0 database.
This, unfortunately, was the type of query that was performed by the Bot
Management feature file generation logic to construct each input
“feature” for the file mentioned at the beginning of this section.
The query above would return a table of columns like the one displayed
(simplified example):

However, as part of the additional permissions that were granted to the
user, the response now contained all the metadata of the r0 schema
effectively more than doubling the rows in the response ultimately
affecting the number of rows (i.e. features) in the final file output.
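
To see where the duplication comes from, imagine adding the database column to that same query (an illustrative sketch, not a query from the RCA):

SELECT
    database,
    name,
    type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

Before the permission change this would have returned only rows with database = 'default'; after it, each feature column also shows up for the underlying tables in r0, which is the "more than doubling" of rows described above.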

A central database query didn't have the right constraints to express the business rules. Not only did it miss the database name, it clearly needed a DISTINCT and a LIMIT, since these seem to be crucial business rules.
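
For illustration, a minimal sketch of what a more constrained version of the query could look like; the explicit database filter, the DISTINCT, and the LIMIT value are assumptions about the business rules, not Cloudflare's published fix:

SELECT DISTINCT
    name,
    type
FROM system.columns
WHERE
    database = 'default'                  -- only the intended database
    AND table = 'http_requests_features'
ORDER BY name
LIMIT 200;                                -- hypothetical cap on the number of features

The exact limit doesn't matter; what matters is that the rules (one database, unique feature names, a bounded feature count) are stated in the query itself instead of being assumed by the code that consumes its output.
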
So, new underlying security work manifested the (unintended) potential already there in the query. Since this was by definition unintended, the application code didn't expect the value it received and reacted poorly. This caused a crash loop across seemingly all of Cloudflare's core systems. The bug wasn't caught during rollout because the faulty code path required data that was assumed to be impossible to generate.
Sound familiar? It should. Any senior engineer has seen this pattern before. This is a classic database/application mismatch. With this in mind, let's review how Cloudflare plans to prevent this from happening again:

Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
Enabling more global kill switches for features
Eliminating the ability for core dumps or other error reports to overwhelm system resources
Reviewing failure modes for error conditions across all core proxy modules

These are all solid, reasonable steps. But here's the problem: they already do most of this, and the outage happened anyway.
Why? Because they seem to mistake physical replication for not having a single point of failure. This confuses the physical layer with the logical layer. One can have a logical single point of failure without having any physical one, which was the case in this situation.
I base this claim on their choice of abandoning PostgreSQL and adopting ClickHouse (Bocharov 2018). That whole post is a great overview of processing data fast, without a single line on how to guarantee its logical correctness/consistency in the face of change.
They are treating a logical problem as if it were a physical problem.
I’ll repeat the same advice I offered in my previous article on GCP’s outage:
The real cause
These kinds of outages stem from the uncontrolled interaction between
application logic and database schema. You can’t reliably catch that
with more tests or rollouts or flags. You prevent it by
construction—through analytical design.

1. No nullable fields (see the schema sketch after this list).
2. (As a corollary of 1) full normalization of the database (The principles of database design, or, the Truth is out there).
3. Formally verified application code (Chapman et al. 2024).
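
To make prevention by construction concrete, here is a toy, PostgreSQL-style schema sketch (table and column names are hypothetical, not Cloudflare's actual schema) in which those rules are declared in the database itself, so a duplicate or missing value is rejected at write time rather than discovered later by application code:

-- Every column NOT NULL; uniqueness and the allowed database are declared as constraints.
CREATE TABLE bot_management_feature (
    database_name text NOT NULL,
    table_name    text NOT NULL,
    column_name   text NOT NULL,
    column_type   text NOT NULL,
    CONSTRAINT feature_pk PRIMARY KEY (database_name, table_name, column_name),
    CONSTRAINT only_default_db CHECK (database_name = 'default')
);

Under a schema like this, the duplicated r0 metadata is not just unlikely; it is unrepresentable.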

Conclusion
FAANG-style companies are unlikely to adopt formal methods or relational
rigor wholesale. But for their most critical systems, they should. It’s
the only way to make failures like this impossible by design, rather
than just less likely.
The internet would thank them. (Cloud users too—caveat emptor.)
References

Bocharov, Alex. 2018. “HTTP Analytics for 6M Requests per Second Using ClickHouse.” https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/.
Booth, Robert. 2025. “What Is Cloudflare — and Why Did Its Outage Take Down So Many Websites?” https://www.theguardian.com/technology/2025/nov/18/what-is-cloudflare-and-why-did-its-outage-take-down-so-many-websites.
Chapman, Roderick, Claire Dross, Stuart Matthews, and Yannick Moy. 2024. “Co-Developing Programs and Their Proof of Correctness.” Commun. ACM 67 (3): 84–94. https://doi.org/10.1145/3624728.
Prince, Matthew. 2025. “Cloudflare Outage on November 18, 2025.” https://blog.cloudflare.com/18-november-2025-outage/.

Figure 1: The Cluny library was one of the richest and most important in France and Europe. In 1790, during the French Revolution, the abbey was sacked and mostly destroyed, with only a small part surviving.

Feel free to send me an email: ebellani -at- gmail -dot- com

PGP Key Fingerprint: 48C50C6F1139C5160AA0DC2BC54D00BC4DF7CA7C

Summarized

Cloudflare’s recent global outage, while seemingly a complex incident, reveals a fundamental issue stemming from a disconnect between application logic and database design. As detailed by Eduardo Bellani, the root cause wasn’t a single catastrophic failure but a confluence of factors centered on an unanticipated interaction within Cloudflare’s systems. The outage underscores a broader challenge faced by large, complex organizations: the difficulty of guaranteeing logical consistency and correctness as data structures and processing requirements evolve.

The core of the problem resides in the Bot Management feature file generation logic. This component relied on a database query that lacked critical constraints, most notably a filter on the database name. Once explicit user grants were gradually rolled out, the query began returning metadata from the r0 database as well. The resulting duplication of column data, more than doubling the rows returned, produced a feature file the application code did not expect and triggered a crash loop across Cloudflare’s core systems. Bellani illustrates the classic “database/application mismatch,” a recurring pattern in large-scale systems.

Bellani’s analysis then turns to Cloudflare’s planned remediations: hardening ingestion of configuration files, enabling global kill switches, and limiting the impact of error reporting. These are reasonable steps, he argues, but Cloudflare already does most of this and the outage happened anyway, which suggests the underlying logical problem is not being addressed. In his view, Cloudflare mistakes physical replication for the absence of a single point of failure, overlooking that a logical single point of failure can exist even in a highly replicated environment. He points to Cloudflare’s shift from PostgreSQL to ClickHouse, motivated by data processing speed but with little attention to logical correctness and consistency, as evidence of this misreading.

Bellani emphasizes that preventing such outages requires proactive analytical design rather than reactive mitigation. He advocates eliminating nullable fields, fully normalizing the database, and formally verifying application code, drawing on the work of Chapman et al. (2024) on co-developing programs and their proofs of correctness. While acknowledging that FAANG-style companies are unlikely to adopt formal methods and relational rigor wholesale, he contends that for their most critical systems this is the only way to make failures like this impossible by design rather than merely less likely. He concludes that internet users, and cloud customers (caveat emptor), would thank them. The case underscores a crucial lesson: the architecture and design choices within complex systems have a profound impact on their resilience and stability.