Increasing the performance of the WebAssembly Text Format parser by 350%
Recorded: Jan. 20, 2026, 10:03 a.m.
Original:
How did I improve the performance of WAT parser? (Pig Fang, 2025/10/28, #Rust #WebAssembly)

The WAT (WebAssembly Text Format) parser in wasm-language-tools v0.5 and earlier was not fast enough. Recently I rewrote the parser from scratch, and performance improved by 350% in the benchmark. Let me share how I optimized it.

¶Use a hand-written parser

The old parser was written with winnow, a parser combinator library. While it is easy to build a parser with parser combinators, the result is generally slower than a hand-written parser, so the first step was to write the parser by hand. A hand-written parser is not only faster but also allows for more optimizations in the future.

¶Clone well-known green tokens and green nodes

There are many parentheses and keywords in WAT, and these tokens and nodes should not be created again and again while parsing. Looking into the implementation of rowan::GreenToken and rowan::GreenNode, there is an Arc inside, so we can prepare these well-known tokens and nodes in advance, put them into LazyLock statics, and clone them when needed.

¶Keyword matching

There are many keywords in WAT, such as module, func, param, result, etc. When recognizing keywords in the lexer, we do not capture a word and then check it by comparing strings. Instead, we check the prefix of the source code in bytes: `self.input.as_bytes().starts_with(keyword.as_bytes())`. However, a word like "function" starts with "func" but is not a keyword, so we must also check that the next character is not an identifier character.

¶Use get_unchecked to create tokens

Except for strings and comments, all other kinds of tokens are plain ASCII strings. For these tokens we can use get_unchecked to avoid the unnecessary UTF-8 boundary check that get performs.

¶Use our own Token type

The lexer produces tokens in our own Token type instead of rowan::GreenToken, because creating a rowan::GreenToken is much more expensive, and we should create it only when needed. The Token type is simple: it carries only the token kind and the matched text slice. In the benchmark, the test WAT module contains items such as `(rec (type $r (sub $t (struct (field (ref $r))))))` and `(func (export "f32.max_finite") (result i32) (i32.reinterpret_f32 (f32.const 0x1.fffffep+127)))`, and the new parser measures `parser/new time: [13.004 µs 13.120 µs 13.299 µs]`.
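The full `Token` definition is truncated above. Based on the summary below (a `kind` field plus a `text` slice, converted into `rowan::NodeOrToken` only when the token is attached to the tree), a minimal sketch could look like the following; the `SyntaxKind` variants and the `repr(u16)` mapping are assumptions for illustration, not the crate's actual code.

```rust
use rowan::{GreenNode, GreenToken, NodeOrToken};

// Stand-in for the crate's own token kind enum; the real set of
// variants in wasm-language-tools is larger.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u16)]
enum SyntaxKind {
    LParen,
    RParen,
    Keyword,
    Ident,
    Whitespace,
}

// The lexer's lightweight token: a kind plus a slice borrowed from the
// source text, with no allocation at all.
#[derive(Clone, Copy, Debug)]
struct Token<'s> {
    kind: SyntaxKind,
    text: &'s str,
}

// Only when a token actually becomes part of the tree is the heavier
// rowan::GreenToken allocated.
impl<'s> From<Token<'s>> for NodeOrToken<GreenNode, GreenToken> {
    fn from(token: Token<'s>) -> Self {
        NodeOrToken::Token(GreenToken::new(
            rowan::SyntaxKind(token.kind as u16),
            token.text,
        ))
    }
}
```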
Summarized:

The WAT (WebAssembly Text Format) parser in wasm-language-tools v0.5 and earlier had significant performance limitations, prompting Pig Fang to rewrite the parsing logic from scratch. The original implementation relied on winnow, a parser combinator library, which is convenient for rapid development but introduces overhead through its layers of abstraction. To address this, Fang adopted a hand-written parser, which gives direct control over the parsing process and enables more granular optimizations. The shift not only improved speed but also laid a foundation for future enhancements, since manual parsing avoids the indirections inherent in parser combinators. The result was dramatic: the rewritten parser benchmarks 350% faster than its predecessor. The optimizations span several layers, from token and node management to memory allocation strategy, each contributing to the overall gain.

A first optimization targeted the handling of well-known tokens and nodes, which are ubiquitous in WAT because of its fully parenthesized syntax. Parentheses, keywords, and other fixed elements were frequent sources of redundant allocation. Because `rowan::GreenToken` and `rowan::GreenNode` internally use `Arc` (a reference-counted pointer), Fang prebuilt these tokens and nodes and stored them in `LazyLock` statics for thread-safe, one-time initialization, cloning them during parsing instead of recreating them. Cloning only bumps a reference count, so keywords such as `module` and `func`, which appear constantly in WAT files, are reused without rebuilding their data structures. This sharply reduced the number of dynamic allocations, a major bottleneck in high-performance parsing.

Another key change refined keyword matching in the lexer. A naive lexer captures an entire word and then compares it against keyword strings, paying for the string handling on every token. Fang instead checks, at the byte level, whether a prefix of the remaining source matches the keyword: `self.input.as_bytes().starts_with(keyword.as_bytes())`. This avoids constructing intermediate strings, but raises a correctness question: `func` is a prefix of `function`, which is not a WAT keyword. The lexer therefore also verifies that the character following the matched prefix is not an identifier character, preserving correctness while keeping the cheap byte-level comparison.
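A sketch of that whole-word keyword check; `matches_keyword` and `is_id_char` are hypothetical helpers standing in for whatever the real lexer uses, and the identifier test is simplified.

```rust
/// True if `input` starts with `keyword` as a whole word, i.e. the match
/// is not merely the prefix of a longer identifier such as "function".
fn matches_keyword(input: &str, keyword: &str) -> bool {
    let bytes = input.as_bytes();
    if !bytes.starts_with(keyword.as_bytes()) {
        return false;
    }
    // The byte right after the match must not continue an identifier,
    // otherwise "function" would be lexed as the keyword "func".
    match bytes.get(keyword.len()) {
        Some(&b) => !is_id_char(b),
        None => true,
    }
}

// Hypothetical identifier-character test (simplified); WAT identifiers
// also permit a number of punctuation characters.
fn is_id_char(b: u8) -> bool {
    b.is_ascii_alphanumeric() || matches!(b, b'_' | b'.' | b'$')
}
```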
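Likewise, the green-token caching described earlier might look roughly like this sketch; the raw `rowan::SyntaxKind` values are placeholders, and the real crate defines its own kind enum and caches well-known nodes as well as tokens.

```rust
use std::sync::LazyLock;
use rowan::{GreenToken, SyntaxKind};

// Illustrative raw kind values; wasm-language-tools maps its own enum
// onto rowan::SyntaxKind instead.
const L_PAREN: SyntaxKind = SyntaxKind(0);
const KEYWORD: SyntaxKind = SyntaxKind(1);

// Well-known tokens are built exactly once, on first use.
static L_PAREN_TOKEN: LazyLock<GreenToken> =
    LazyLock::new(|| GreenToken::new(L_PAREN, "("));
static KW_MODULE: LazyLock<GreenToken> =
    LazyLock::new(|| GreenToken::new(KEYWORD, "module"));

fn l_paren() -> GreenToken {
    // Cloning a GreenToken only bumps the Arc's reference count,
    // so no new allocation happens per "(" in the source.
    L_PAREN_TOKEN.clone()
}

fn kw_module() -> GreenToken {
    KW_MODULE.clone()
}
```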
Token creation itself was the next target. Except for strings and comments, WAT tokens are pure ASCII, so their text never needs UTF-8 validation. By using `get_unchecked` instead of the standard `get` method when slicing these tokens out of the source, the parser skips the UTF-8 boundary check that `get` performs on every call. Skipping the check is risky in general, but it is sound here because offsets computed over ASCII-only token text always fall on character boundaries. The saving per token is small but adds up across large WAT files, which are dominated by parentheses, keywords, and numbers.

A more significant departure from the original implementation is a custom `Token` type in place of `rowan::GreenToken`. The old parser created `rowan::GreenToken` instances directly, which is comparatively expensive because of the reference-counted allocation behind each one. Fang's lexer instead emits a lightweight `Token` struct with two fields: `kind`, an enum identifying the token's syntactic role (keyword, parenthesis, and so on), and `text`, a string slice borrowed from the source. This avoids allocation and copying in the lexer entirely. To keep the rest of the pipeline working, Fang implemented `From<Token<'_>> for rowan::NodeOrToken<rowan::GreenNode, rowan::GreenToken>`, so a `Token` is converted into the heavier rowan representation only at the moment it is attached to the syntax tree. This hybrid approach balances the performance gain with compatibility with existing rowan-based tooling.

Memory allocation during tree construction was the final focus, particularly when building `rowan::GreenNode` objects. The previous implementation collected the children of every node into its own `Vec`, causing frequent allocations and deallocations. The new parser instead accumulates child nodes and tokens in a single shared `Vec`. The idea resembles `rowan::GreenNodeBuilder`, but the implementation is simpler: instead of an explicit stack of in-progress nodes, the parser records the starting index of each node in the shared `Vec` as a plain `usize`. When a node is complete, `drain` pulls its children out of the buffer as an iterator, which feeds directly into the construction of the final `rowan::GreenNode`. The same `Vec` is reused for the whole parse, so per-node allocations disappear, and the manual index bookkeeping costs almost nothing because it is just `usize` arithmetic.

The benchmark results confirm the effect. The test case is a sizable WAT module containing multiple functions, globals, and type definitions. The old parser averaged roughly 59.5 microseconds per iteration; the rewritten parser comes in around 13.1 microseconds, a speedup of about 59.5 / 13.1 ≈ 4.5x, which matches the headline figure of a 350% improvement. The small percentage changes criterion reports alongside each benchmark (for example, the 1.6% to 2.2% improvement noted for the old parser) describe run-to-run variation against that benchmark's own previous baseline, not the old-versus-new comparison.
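A sketch of the `get_unchecked` slicing discussed above; the function name is illustrative, and the `debug_assert!` simply restates the ASCII-only precondition that makes the unchecked slice sound.

```rust
/// Slice the text of a token spanning `start..end`, skipping the UTF-8
/// boundary check that `str::get` would repeat on every call.
///
/// This is only sound for the ASCII-only token kinds (everything except
/// strings and comments): the lexer advanced over ASCII bytes, so both
/// offsets are guaranteed to be character boundaries.
fn ascii_token_text(input: &str, start: usize, end: usize) -> &str {
    debug_assert!(input.is_char_boundary(start) && input.is_char_boundary(end));
    // SAFETY: `start..end` covers ASCII-only token text, see above.
    unsafe { input.get_unchecked(start..end) }
}
```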
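And a sketch of the shared-buffer node builder, assuming `usize` checkpoints and `drain` as described in the summary; the names and structure are illustrative rather than the crate's actual code.

```rust
use rowan::{GreenNode, GreenToken, NodeOrToken, SyntaxKind};

type GreenElement = NodeOrToken<GreenNode, GreenToken>;

/// One buffer of children, shared by every node built during the parse.
struct TreeBuilder {
    children: Vec<GreenElement>,
}

impl TreeBuilder {
    /// Record where the children of the node we are about to parse start.
    fn start_node(&mut self) -> usize {
        self.children.len()
    }

    /// Append a token; this is the point where a GreenToken is created
    /// (or a cached well-known one is cloned).
    fn push_token(&mut self, kind: SyntaxKind, text: &str) {
        self.children.push(NodeOrToken::Token(GreenToken::new(kind, text)));
    }

    /// Close the node opened at `checkpoint`: drain its children out of the
    /// shared buffer straight into a new GreenNode, then push that node back
    /// as a child of the enclosing node. No per-node Vec is allocated.
    fn finish_node(&mut self, checkpoint: usize, kind: SyntaxKind) {
        let node = GreenNode::new(kind, self.children.drain(checkpoint..));
        self.children.push(NodeOrToken::Node(node));
    }
}
```

A caller would take a checkpoint with `start_node`, push tokens and nested nodes while parsing, and then call `finish_node` with that checkpoint and the node's kind.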
The overall picture is clear: the rewritten parser delivers a substantial performance boost, validating the optimizations.

Beyond the technical details, the work illustrates broader lessons in performance engineering. The move from a parser combinator library to a hand-written parser shows the trade-off between development speed and runtime efficiency: for performance-critical systems, a manual implementation can outperform convenient abstractions. The emphasis on minimizing allocations and on precomputed data structures underscores how much memory behavior matters on hot paths. The shared `Vec` and the custom token type are examples of domain-specific optimization, where exploiting a property of the problem (here, the ASCII-only nature of most WAT tokens) yields outsized gains.

For developers working on similar projects, the approach offers a workable blueprint for balancing readability, maintainability, and performance. None of the techniques is specific to WAT: isolating concrete bottlenecks in token handling, memory management, and lexical analysis, and validating each change with benchmarks, applies to any system where parsing speed matters. The end result is a WAT parser that meets the demands of modern WebAssembly tooling and serves as an instructive example of performance-driven design, in which careful attention to implementation details turned a sluggish parser into a fast one.