# Poking holes into bytecode with peephole optimisations

Jan 14, 2026 · 1766 Words · 9 Minute read

Tags: Pldev, Rust

This article highlights the first optimizations I made while redesigning and semi-porting the runtime from C to Rust. These aren't benchmarked or verified, since the virtual machine is currently under construction and will probably be finished this week.

At a high level, purple-garden currently works like this, with `2+3*4-1` as an exemplary input:

```text
2+3*4-1
  |
  +- Tokenizer
  |
  ]: Token(2) Token(+) Token(3)
  ]: Token(*)
  ]: Token(4) Token(-) Token(1)
  |
   \
    +- Parsing (Tokens -> Abstract Syntax Tree)
    |
    ]: (Asteriks
    ]:   (Plus
    ]:     Integer("2")
    ]:     Integer("3")
    ]:   )
    ]:   (Minus
    ]:     Integer("4")
    ]:     Integer("1")
    ]:   )
    ]: )
    |
    |
<planned section start>
     \
      +- Planned IR and Optimisation Boundary
      |
     / \
    |   +- JIT Compiler (IR -> x86/ARM)
    |        ^
    |         \
    |          \
    |           \
    |            \
    |             \
<planned section end>        |Calls
    |                        |JIT'ed
     \                       |functions
      +- Compiler (AST/IR -> bytecode)    |
      |                                   /
      ]: __entry:                        /
      ]:   load_imm r0, #2              |
      ]:   load_imm r1, #3              |
      ]:   add r2, r0, r1               |
      ]:   load_imm r1, #4              |
      ]:   load_imm r0, #1              |
      ]:   sub r3, r1, r0               |
      ]:   mul r0, r2, r3               |
      |                                 |
       \                                |
        +- Peephole Optimiser           |
        |                               |
        ]: __entry:                     |
        ]:   load_imm r2, #5            |
        ]:   load_imm r3, #3            |
        ]:   mul r0, r2, r3             |
        |                               /
         \                             /
          +- Baseline interpreter ----+
          |
          ] [vm][0000] LoadImm { dst: 2, value: 5 }
          ] [vm][0001] LoadImm { dst: 3, value: 3 }
          ] [vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }
```

## Peephole Optimizations

Peephole optimisations are, as the name implies, optimisations performed on a small section of a larger input. For a virtual machine like purple-garden this means using a window of size 3 (anything larger is no longer local[^1]; it is subject to IR optimisation, not peephole, and will already have happened before peephole optimisation is reached in the compilation pipeline) and merging operators or rewriting redundant and useless operations.

So to summarize:

- peephole optimisations are done on the bytecode in a local setting (non-global)
- in this case they are fallback optimisations for things the IR optimisation rewriting missed
- they are local, single-pass, and meant only to catch what the IR opt didn't

## Purple garden implementation

For purple-garden specifics and questions regarding the runtime, do consult:

- purple-garden GitHub
- The Manchester Garbage Collector and purple-garden's runtime
- Redesign and Semi-port to Rust #15

Peephole optimisations in purple-garden are intentionally kept single pass to keep the startup time cost as low as possible and to move heavy optimisation into the IR. A single pass cannot apply optimisations that only become possible as the result of a previous optimisation; this is mitigated by peephole optimisations being merely the fallback for the preceding IR optimisation pipeline.
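The snippets below pattern-match on a register-machine `Op` type that this article never shows in full. For context, here is a minimal sketch of the shape the following code assumes; the variant and field names are taken from the snippets, everything else (derives, integer widths) is an assumption about, not a copy of, the real purple-garden definition:

```rust
/// Sketch of the bytecode instruction set assumed by the snippets
/// below; the real purple-garden `Op` has more variants, this only
/// covers the ones appearing in this article.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    /// no operation; stands in for instructions the optimiser removed
    Nop,
    /// load a constant into a register
    LoadImm { dst: usize, value: i64 },
    /// copy the value of src into dst
    Mov { dst: usize, src: usize },
    Add { dst: usize, lhs: usize, rhs: usize },
    Sub { dst: usize, lhs: usize, rhs: usize },
    Mul { dst: usize, lhs: usize, rhs: usize },
    Div { dst: usize, lhs: usize, rhs: usize },
}
```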
McKeeman "Peephole Optimization"] 9/// (https://dl.acm.org/doi/epdf/10.1145/364995.365000) 10pub fn bc(bc: &mut Vec<Op>) { 11 for i in 0..=bc.len().saturating_sub(WINDOW_SIZE) { 12 let window = &mut bc[i..i + WINDOW_SIZE]; 13 bc::const_binary(window); 14 bc::self_move(window); 15 } 16 17 bc.retain(|op| !matches!(op, Op::Nop)) 18}Since optimisations can both rewrite, replace and remove instructions, the Op::Nop encodes removal by being the replacement for instructions that were optimised away. These are then removed from the bytecode list after all optimisations are applied.self_move“Self move” is a mov instruction having equivalent dst and src:ASM1__entry: 2 load_imm r0 #5 3 load_imm r1 #5 4 mov r1 r1 ; <-- basically NOP 5 add r1 r0If this pattern is encountered, the VM would waste processing power on running a NOP instruction, to combat this, they are removed:ASM1__entry: 2 load_imm r0 #5 3 load_imm r1 #5 4 add r1 r0This is achieved by iterating the window and replacing self movs:RUST 1/// self_move removes patterns conforming to 2/// 3/// Mov { dst: x, src: x }, 4/// 5/// where both dst == src 6pub fn self_move(window: &mut [Op]) { 7 for op in window.iter_mut() { 8 if let Op::Mov { dst, src } = op { 9 if dst == src { 10 *op = Op::Nop; 11 opt_trace!("self_move", "removed self_moving Mov"); 12 } 13 } 14 } 15}const_binaryThis optimisation refers to a binary instruction and both lhs and rhs are created via LoadImm directly in the window beforehand:ASM1__entry: 2 load_imm r0, #2 3 load_imm r1, #3 4 add r2, r0, r1 ; <-- this is a compile time known result := 2+3=5 5 load_imm r1, #4 6 load_imm r0, #1 7 sub r3, r1, r0 ; <-- this is a compile time known result := 4-1=3 8 mul r0, r2, r3In this example add, sub and mul are constant. The optimisation itself applies to all arithmetics. Thus after optimising:ASM1__entry: 2 load_imm r2, #5 3 load_imm r3, #3 4 mul r0, r2, r3This looks like the optimiser pass missed the load_imm, load_imm, mul pattern, but this implementation is non recursive, recursive constant folding is subject to IR optimisation, which comes before peephole optimisation, thus this case will not happen once the IR and the IR optimiser is done. Overflow is currently handled silently using wrapping arithmetic; this could and probably will be changed to trigger a compile-time error in the future.RUST 1/// const_binary fuses 2/// 3/// LoadImm{ dst: a, value: x }, 4/// LoadImm{ dst: b, value: y }, 5/// bin { dst, lhs: a, rhs: b } 6/// 7/// into 8/// 9/// LoadImm { dst, value: x bin y } 10/// 11/// where bin := Add | Sub | Mul | Div 12pub fn const_binary(window: &mut [Op]) { 13 let [ 14 Op::LoadImm { dst: a, value: x }, 15 Op::LoadImm { dst: b, value: y }, 16 op, 17 ] = window 18 else { 19 return; 20 }; 21 22 let (dst, result) = match *op { 23 Op::Add { dst, lhs, rhs } 24 if lhs == *a && rhs == *b => (dst, x.wrapping_add(*y)), 25 Op::Sub { dst, lhs, rhs } 26 if lhs == *a && rhs == *b => (dst, x.wrapping_sub(*y)), 27 Op::Mul { dst, lhs, rhs } 28 if lhs == *a && rhs == *b => (dst, x.wrapping_mul(*y)), 29 Op::Div { dst, lhs, rhs } 30 if lhs == *a && 31 rhs == *b && *y != 0 => (dst, x.wrapping_div(*y)), 32 _ => return, 33 }; 34 35 window[0] = Op::LoadImm { dst, value: result }; 36 window[1] = Op::Nop; 37 window[2] = Op::Nop; 38 39 opt_trace!("const_binary", "fused a constant binary op"); 40}ObservabilityRUST1macro_rules! opt_trace { 2 ($optimisation:literal, $text:literal) => { 3 #[cfg(feature = "trace")] 4 println!("[opt::{}]: {}", $optimisation, $text); 5 }; 6}The opt_trace! 
## Integration and flag guard

Since startup time is one of the most important parts of a fast runtime (for me!), peephole optimisation is guarded behind `-O1` at this time:

```rust
// all error handling is hidden
fn main() {
    // [...] tokenizing, parsing, further setup

    let mut cc = cc::Cc::new();
    cc.compile(&ast);

    if args.opt >= 1 {
        opt::bc(&mut cc.buf);
    }

    let mut vm = cc.finalize();

    // [...] other flags and running the VM
}
```

Running something like `2+3*4-1` through the full pipeline:

```shell
cargo run \
    # compile with tracing prints included
    --features trace \
    -- \
    # args for purple-garden

    # print the abstract syntax tree
    --ast \
    # disassemble the produced bytecode
    --disassemble \
    # set optimisation level
    -O1 \
    # specify input, if not set just
    # pass a file to execute in as an argument
    -r="2+3*4-1"
```

```text
(Asteriks
  (Plus
    Integer("2")
    Integer("3")
  )
  (Minus
    Integer("4")
    Integer("1")
  )
)
Cc::cc(Asteriks)
Cc::cc(Plus)
Cc::cc(Integer("2"))
RegisterAllocator::alloc(r0)
Cc::cc(Integer("3"))
RegisterAllocator::alloc(r1)
RegisterAllocator::alloc(r2)
RegisterAllocator::free(r0)
RegisterAllocator::free(r1)
Cc::cc(Minus)
Cc::cc(Integer("4"))
RegisterAllocator::alloc(r1)
Cc::cc(Integer("1"))
RegisterAllocator::alloc(r0)
RegisterAllocator::alloc(r3)
RegisterAllocator::free(r1)
RegisterAllocator::free(r0)
RegisterAllocator::alloc(r0)
RegisterAllocator::free(r2)
RegisterAllocator::free(r3)
[opt::const_binary]: fused a constant binary op
[opt::const_binary]: fused a constant binary op
__entry:
    load_imm r2, #5
    load_imm r3, #3
    mul r0, r2, r3
[vm][0000] LoadImm { dst: 2, value: 5 }
[vm][0001] LoadImm { dst: 3, value: 3 }
[vm][0002] Mul { dst: 0, lhs: 2, rhs: 3 }
```
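To connect the `[vm]` lines above to actual execution: after optimisation the baseline interpreter only has to walk three instructions. The following is a minimal sketch of such a dispatch loop over the `Op` sketch from earlier; the register file size and the runtime arithmetic semantics are assumptions, not the real purple-garden interpreter:

```rust
/// Minimal sketch of a baseline dispatch loop; register count and
/// runtime arithmetic semantics are assumptions for illustration.
fn run(bc: &[Op]) -> Vec<i64> {
    let mut regs = vec![0i64; 16];
    for op in bc {
        match *op {
            Op::Nop => {}
            Op::LoadImm { dst, value } => regs[dst] = value,
            Op::Mov { dst, src } => regs[dst] = regs[src],
            Op::Add { dst, lhs, rhs } => regs[dst] = regs[lhs].wrapping_add(regs[rhs]),
            Op::Sub { dst, lhs, rhs } => regs[dst] = regs[lhs].wrapping_sub(regs[rhs]),
            Op::Mul { dst, lhs, rhs } => regs[dst] = regs[lhs].wrapping_mul(regs[rhs]),
            Op::Div { dst, lhs, rhs } => regs[dst] = regs[lhs].wrapping_div(regs[rhs]),
        }
    }
    regs
}
```

For the optimised program this executes `LoadImm` (r2 ← 5), `LoadImm` (r3 ← 3) and `Mul` (r0 ← 5 * 3 = 15).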
## Testing

To ensure correctness for the expected optimisation cases, tests are necessary. These tests validate pattern rewriting, not full program semantic equivalence.

```rust
#[cfg(test)]
mod bc {
    use crate::op::Op;

    #[test]
    fn self_move() {
        let mut bc = vec![
            Op::Mov { src: 64, dst: 64 },
            Op::Mov { src: 64, dst: 64 },
            Op::Mov { src: 64, dst: 64 },
        ];
        crate::opt::bc::self_move(&mut bc);
        assert_eq!(bc, vec![Op::Nop, Op::Nop, Op::Nop])
    }

    #[test]
    fn const_binary() {
        let mut bc = vec![
            Op::LoadImm { dst: 0, value: 1 },
            Op::LoadImm { dst: 1, value: 2 },
            Op::Add { dst: 0, lhs: 0, rhs: 1 },
            Op::LoadImm { dst: 0, value: 1 },
            Op::LoadImm { dst: 1, value: 2 },
            Op::Sub { dst: 0, lhs: 0, rhs: 1 },
            Op::LoadImm { dst: 0, value: 3 },
            Op::LoadImm { dst: 1, value: 5 },
            Op::Mul { dst: 0, lhs: 0, rhs: 1 },
            Op::LoadImm { dst: 0, value: 64 },
            Op::LoadImm { dst: 1, value: 8 },
            Op::Div { dst: 0, lhs: 0, rhs: 1 },
        ];

        for i in 0..=bc.len().saturating_sub(3) {
            crate::opt::bc::const_binary(&mut bc[i..i + 3]);
        }

        bc.retain(|op| *op != Op::Nop);
        assert_eq!(
            bc,
            vec![
                Op::LoadImm { dst: 0, value: 3 },
                Op::LoadImm { dst: 0, value: -1 },
                Op::LoadImm { dst: 0, value: 15 },
                Op::LoadImm { dst: 0, value: 8 },
            ]
        )
    }
}
```

[^1]: fight me on this one, I make the rules, if WINDOW_SIZE > 3, I say that's no longer local :^)
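One case the suite above leaves untested is the `*y != 0` guard in const_binary. A hypothetical additional test, sketched here rather than taken from the repository, would assert that a constant division by zero is left alone for the runtime to handle:

```rust
#[test]
fn const_binary_div_by_zero_is_not_folded() {
    // hypothetical test: the `*y != 0` guard must leave a constant
    // division by zero untouched instead of folding it
    let mut bc = vec![
        Op::LoadImm { dst: 0, value: 1 },
        Op::LoadImm { dst: 1, value: 0 },
        Op::Div { dst: 0, lhs: 0, rhs: 1 },
    ];
    let before = bc.clone();
    crate::opt::bc::const_binary(&mut bc[0..3]);
    // nothing fused, nothing replaced with Nop
    assert_eq!(bc, before);
}
```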