Finding all regex matches has always been O(n²)
Recorded: March 24, 2026, 2:23 a.m.
## Original
finding all regex matches has always been O(n²). even in the engines built to prevent it - ian erik varatalu, blog, March 16, 2026

every regex engine that promises linear time breaks that promise the moment you ask for all matches. the problem has been there since the 70s, hiding in the iteration loop.

search a document for a pattern and it takes a second. search one a hundred times larger and it doesn't take a hundred seconds - it can take almost three hours. every regex engine, in every language, has had this problem since the 1970s, and nobody fixed it. as one engine's documentation puts it plainly:

> the worst case time complexity for iterators is O(m * n²). [...] if both patterns and haystacks are untrusted and you're iterating over all matches, you're susceptible to worst case quadratic time complexity. There is no way to avoid this. One possible way to mitigate this is to [...] immediately stop as soon as a match has been found. Enabling this mode will thus restore the worst case O(m * n) time complexity bound, but at the cost of different semantics.

the mechanism is simple. take the pattern `.*a|b` and a haystack of n b's. at each position, the engine tries `.*a` first: scan the entire remaining haystack looking for an a, find none, fail. then the `b` branch matches a single character. advance one position, repeat. that's n + (n-1) + (n-2) + ... = O(n²) work to report n single-character matches. a textbook triangular sum.

Russ Cox described this exact problem back in 2009, noting that even the original awk by Aho himself used the naive quadratic "loop around a DFA" for leftmost-longest matching. BurntSushi's rebar benchmark suite confirms it empirically across RE2, Go, and rust: the throughput halves when the input doubles. as he put it: "even for automata oriented engines, it provokes a case that is unavoidably O(m * n²)".
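the triangular sum is easy to reproduce. a quick sketch in python, whose `re` module is a backtracking engine, used here only to show the n single-character matches; `scan_work` is a toy cost model I'm assuming for illustration, not any engine's real accounting:

```python
import re

# the engine must report n single-character matches on a haystack of b's:
# at every position the '.*a' branch fails after scanning the whole tail
assert re.findall(r'.*a|b', 'b' * 5) == ['b', 'b', 'b', 'b', 'b']

# toy cost model: position i pays for the n - i characters the failed
# '.*a' branch scans before the 'b' branch matches one character
def scan_work(n):
    return sum(n - i for i in range(n))  # n + (n-1) + ... + 1

assert scan_work(10) == 55                  # the triangular sum n(n+1)/2
assert scan_work(1000) == 1000 * 1001 // 2  # grows as O(n^2)
```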
minimatch is npm's own glob-matching library, written by npm's creator. it converts globs to JavaScript regexes and has been hit by five separate ReDoS CVEs, all caused by the same root issue: backtracking. it gets 350 million downloads a week. the library's readme now warns in bold that "if you create a system where you take user input, and use that input as the source of a Regular Expression pattern [...] you will be pwned", and states that future ReDoS reports will be considered "working as intended."

the quadratic all-matches problem is more subtle. it affects even the engines specifically built to avoid backtracking. it won't kill your browser, but it will still quietly turn a one-second search into a three-hour one.

Aho-Corasick, by contrast, finds every occurrence of a set of fixed strings in one linear pass, using a trie with failure links. scanning "ushers" with a pattern set containing "he" and "she": u stays at root, s enters S, h enters SH, e enters SHE, match "she". then the failure link jumps to HE, match "he". two overlapping matches found in one pass. on patterns that produce many matches - log parsing, data extraction, search-and-replace across large files - the difference between O(n) and O(n²) is the difference between "instant" and "why is this taking so long".

| input size | normal | hardened | speedup w/ hardened |
|---|---|---|---|
| 1,000 | 0.7ms | 28us | 25x |
| 5,000 | 18ms | 146us | 123x |
| 10,000 | 73ms | 303us | 241x |
| 50,000 | 1.8s | 1.6ms | 1,125x |

normal mode goes quadratic; hardened stays linear. so why not make hardened the default? i went back and forth on this.

| pattern | normal | hardened | ratio |
|---|---|---|---|
| `[A-Z][a-z]+` | 2.2ms | 6.5ms | 3.0x slower |
| `[A-Za-z]{8,13}` | 1.7ms | 7.6ms | 4.4x slower |
| `\w{3,8}` | 2.6ms | 22ms | 8.7x slower |
| `\d+` | 1.3ms | 5.2ms | 3.9x slower |
| `[A-Z]{2,}` | 0.7ms | 4.7ms | 6.7x slower |

patterns with lookarounds are currently rejected in hardened mode. there's no theoretical barrier, but the implementation needs some work.

| benchmark | resharp | resharp (hardened) | rust/regex | aho-corasick |
|---|---|---|---|---|
| dictionary, 2663 words (~15 matches) | 627 MiB/s | 163 MiB/s (3.85x) | 535 MiB/s (1.17x) | 155 MiB/s (4.05x) |

how is this possible when RE# is doing more work - two passes instead of one? it comes down to cache behavior.
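the single-pass trie walk above can be sketched in a few lines of python. this is a minimal, illustrative Aho-Corasick with hypothetical helper names, assuming the classic textbook pattern set {he, she, his, hers} - not RE#'s or any production implementation:

```python
from collections import deque

def build_automaton(patterns):
    # trie as parallel arrays: goto transitions, failure links, outputs
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    # BFS to set failure links; depth-1 states fall back to the root
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]   # inherit matches ending here
    return goto, fail, out

def find_all(text, patterns):
    goto, fail, out = build_automaton(patterns)
    s, matches = 0, []
    for i, ch in enumerate(text):      # single left-to-right pass
        while s and ch not in goto[s]:
            s = fail[s]                # failure-link jump, e.g. SHE -> HE
        s = goto[s].get(ch, 0)
        for p in out[s]:
            matches.append((i - len(p) + 1, p))
    return matches
```

on `find_all("ushers", ["he", "she", "his", "hers"])` this reports the overlapping matches (1, "she"), (2, "he"), and (2, "hers") in one pass over the input.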
Aho-Corasick builds the full automaton upfront - for 2663 words that's a large DFA with many states and unpredictable jumps between them, leading to cache misses and branch mispredictions. rust regex uses a single lazily-compiled DFA, which helps, but the state space for a large alternation is still substantial. RE#'s derivative-based DFAs are lazily built and more compact - the two automata (forward and reverse) each have far fewer states than the equivalent full trie or NFA-based DFA, so transitions hit warm cache lines more often.

| benchmark | resharp | rust/regex |
|---|---|---|
| literal, english | 34.8 GB/s | 33.7 GB/s |
| literal, case-insensitive english | 17.1 GB/s | 9.7 GB/s |
| literal, russian | 40.2 GB/s | 33.5 GB/s |
| literal, case-insensitive russian | 10.3 GB/s | 7.7 GB/s |
| literal, chinese | 22.4 GB/s | 44.0 GB/s |
| literal alternation, english | 12.2 GB/s | 12.5 GB/s |
| literal alternation, case-insensitive english | 4.9 GB/s | 2.5 GB/s |
| literal alternation, russian | 3.3 GB/s | 5.9 GB/s |

RE# does very well here now - most numbers are within the noise threshold of rust/regex. the few differences come down to byte frequency tables and algorithmic choices in the skip loop. for context, a DFA by itself gets you somewhere near 1 GB/s; CPU vector intrinsics can opportunistically push that to 40+ GB/s on patterns where most of the input can be skipped.

can it match on a stream? any pattern + leftmost-longest semantics = no. this isn't an engine limitation - it's inherent to the semantics. if you ask for the longest match on an infinite stream, the answer might be "keep going forever." you might think leftmost-greedy avoids this since it works left-to-right, but it doesn't - `.*a|b` on a stream of b's has the same problem: the `.*a` branch keeps scanning forward looking for a last a that may never come. the API doesn't expose a streaming interface yet - `find_all` takes `&[u8]` - but chunked streaming is on the list.

no capture groups - RE# returns match boundaries only, not sub-group captures. this isn't impossible - captures are a post-match operation that can be layered on top.
the reason is we haven't found the right way to do it yet. with intersection and complement, every subexpression would naively become a capture group - `(a.*&.*b)` has two implicit groups, and complement creates more. in traditional regex, `(?:...)` exists to opt out of capturing, but the more i think about it the more `?:` feels like a historical mistake - it makes the default behavior (capturing) the one that opts you into a much slower algorithm, even when you don't need it. i'd rather get the design right than ship something awkward.

no lazy quantifiers - `.*?` isn't supported. RE# uses leftmost-longest (POSIX) semantics, which is the mathematically unambiguous interpretation. lazy quantifiers are a backtracking concept that doesn't translate to this model. capture groups may come eventually, but lazy quantifiers are a deliberate architectural choice.

if you need captures today, use regex. if you need the properties RE# offers (boolean operators, lookarounds, true-linear all-matches, POSIX semantics), these limitations are unlikely to matter.

```shell
# list all files both containing serde and async
```
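the semantic difference is easy to see with python's `re`, which is a backtracking, leftmost-first engine. the `posix_match` helper below is a naive, illustrative emulation of leftmost-longest, not an RE# API:

```python
import re

# leftmost-first (backtracking engines): the first alternative that
# matches wins, even when a later alternative would match more
assert re.match(r'a|ab', 'ab').group() == 'a'

# leftmost-longest (POSIX, what RE# uses): the longest match at the
# leftmost start wins; naively emulated here by trying prefixes,
# longest first
def posix_match(pattern, text):
    for end in range(len(text), -1, -1):
        if re.fullmatch(pattern, text[:end]):
            return text[:end]
    return None

assert posix_match(r'a|ab', 'ab') == 'ab'

# lazy quantifiers only make sense in the backtracking model: greedy
# and lazy '.*' deliberately pick different spans
text = 'say "hi" and "bye"'
assert re.search(r'".*"', text).group() == '"hi" and "bye"'   # greedy
assert re.search(r'".*?"', text).group() == '"hi"'            # lazy
```

this is also why leftmost-longest cannot commit on a stream: until the input ends, a longer alternative may still complete.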
## Summarized

Finding all regular expression matches has historically carried O(n²) time complexity, even within regex engines designed to mitigate this issue. The limitation stems from the iteration loop itself, originating in the 1970s and persisting across language implementations: seeking all matches within a document can require rescanning up to n characters at each of up to n match positions, yielding a quadratic total.

Ian Erik Varatalu highlights that this issue remains present despite engines like RE2, Go's `regexp`, and Rust's `regex` crate, which provide linear-time performance for *single* matches. These engines achieve that guarantee by answering "match or no match", while the questions that dominate practical use, namely *where* matches occur, their *length*, and their *quantity*, fall outside the classical theorems; the theory was kept simple at the expense of practical regex usage. Backtracking, the default mechanism in many engines, is worse still and can produce exponential time, the noted "heat death of the universe" scenario. Thompson's 1968 NFA construction eliminates backtracking and enables linear-time matching for a single match, yet it has remained largely unutilized in practice because backtracking stayed the default.

Minimatch, npm's glob-matching library created by npm's founder, exemplifies the danger, exhibiting repeated ReDoS (Regular Expression Denial of Service) vulnerabilities directly attributable to backtracking. This has prompted its maintainers to explicitly warn against using user-supplied input as the source of a regex pattern, recognizing the potential for exploitable vulnerabilities.
A contrasting approach emerges with Aho-Corasick, developed in 1975, which finds all occurrences of multiple fixed strings in a single linear-time pass. Using a trie with failure links, it avoids the quadratic cost of per-position rescanning, but it is inherently limited to literal strings, not regular expressions.

Hyperscan and Vectorscan introduce another solution, earliest-match semantics, which delivers linear-time all-matches performance. They avoid the quadratic trap by reporting a match the moment the DFA enters a match state, a design optimized for use cases like network intrusion detection. The cost is a semantic shift: reported matches can differ from user expectations when patterns use greedy quantifiers (e.g., `a+`). Russ Cox's 2009 observation underscored this fundamental trade-off. REmatch (VLDB 2023) takes a different tack, enumerating every valid (start, end) span for a pattern, including all overlapping and nested matches; since there can be Θ(n²) such spans, this comprehensiveness is inherently quadratic in the worst case.

RE#, designed by Ian Erik Varatalu, achieves linear-time all-matches through two passes: the first runs a reverse DFA to mark potential match starting positions, and the second resolves match endpoints from those confirmed starts, eliminating the redundant rescans that cause the quadratic blowup. Its "hardened mode" is a key component, providing a safety net against malicious or poorly crafted patterns.
Hardened mode simplifies the forward scan, maintaining the linear time bound even when confronted with adversarial inputs, which is crucial when user-supplied patterns are employed. RE#'s performance is further enhanced through skip acceleration, leveraging CPU vector intrinsics for efficient byte search and range-based character-class scanning, pushing throughput toward 40+ GB/s. The API does not yet expose a streaming interface, though chunked streaming is planned; streaming all-matches is fundamentally constrained anyway, because RE# operates under leftmost-longest semantics by design, omitting lazy quantifiers as a deliberate architectural choice. A related tool, re, builds upon RE#'s functionality, delivering multi-term boolean search with scoping.

The ultimate takeaway is that the O(n²) complexity of finding all regex matches in traditional engines is deeply rooted in the nature of the iteration loop and in theoretical foundations that stop at single matches. Solutions like Aho-Corasick and newer engines like Hyperscan and RE# demonstrate that linear-time all-matches performance is possible, but only through deliberate design choices and trade-offs, such as earliest-match semantics or two-pass scanning. Ian Erik Varatalu's work underscores the importance of weighing the practical implications of regex design and the pitfalls lurking within the seemingly straightforward task of matching text.
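The two-pass idea summarized above can be made concrete on the pathological pattern from the original post, `.*a|b`. The sketch below is a toy specialized to that one pattern, not RE#'s actual algorithm: a naive iterator rescans the tail at every position, while a reverse precomputation pass lets the forward pass resolve each match in constant time.

```python
def matches_naive(haystack):
    # leftmost-longest '.*a|b' by per-position rescan: O(n^2) overall
    matches, i = [], 0
    while i < len(haystack):
        j = haystack.rfind('a', i)   # scans the whole tail every time
        if j != -1:
            matches.append((i, j + 1)); i = j + 1
        elif haystack[i] == 'b':
            matches.append((i, i + 1)); i += 1
        else:
            i += 1
    return matches

def matches_two_pass(haystack):
    # pass 1 (reverse): one O(n) sweep records, for each position, the
    # index of the last 'a' at or after it
    nxt, last = [-1] * len(haystack), -1
    for i in range(len(haystack) - 1, -1, -1):
        if last == -1 and haystack[i] == 'a':
            last = i
        nxt[i] = last
    # pass 2 (forward): each match now resolves in O(1), O(n) overall
    matches, i = [], 0
    while i < len(haystack):
        j = nxt[i]
        if j != -1:
            matches.append((i, j + 1)); i = j + 1
        elif haystack[i] == 'b':
            matches.append((i, i + 1)); i += 1
        else:
            i += 1
    return matches
```

Both functions report the same spans; only the work differs. On a haystack of n b's the naive version pays the triangular-sum rescans, while the two-pass version reads each byte a constant number of times.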