How ZJIT removes redundant object loads and stores | Rails at Scale
2026-03-18 • Jacob Denbeaux
## Intro

Since the post at the end of last year, ZJIT has grown and changed in some exciting ways. This is the story of how a new, self-contained optimization pass causes ZJIT performance to surpass YJIT on an interesting microbenchmark. It has been 10 months since ZJIT was merged into Ruby, and we’re now beginning to see the design differences between YJIT and ZJIT manifest themselves in performance divergences. In this post, we will explore the details of one new optimization in ZJIT called load-store optimization. This implementation is part of ZJIT’s optimizer in HIR. Recall that the structure of ZJIT looks roughly like the following.

    flowchart LR
      A(["Ruby"])
      A --> B(["YARV"])
      B --> C(["HIR"])
      C --> D(["LIR"])
      D --> E(["Assembly"])
This post will focus on optimization passes in HIR, or “High-level” Intermediate Representation. At the HIR level, we have two capabilities that are distinct from other compilation stages. Our optimizations in HIR typically utilize the benefits of our SSA representation in addition to the HIR instruction effect system. These are the current analysis passes in ZJIT without load-store optimization, as well as the order in which the passes are executed.

    run_pass!(type_specialize);
    run_pass!(inline);
    run_pass!(optimize_getivar);
    run_pass!(optimize_c_calls);
    run_pass!(fold_constants);
    run_pass!(clean_cfg);
    run_pass!(remove_redundant_patch_points);
    run_pass!(eliminate_dead_code);
Here’s where load-store optimization gets added.

      run_pass!(type_specialize);
      run_pass!(inline);
      run_pass!(optimize_getivar);
      run_pass!(optimize_c_calls);
    + run_pass!(optimize_load_store);
      run_pass!(fold_constants);
      run_pass!(clean_cfg);
      run_pass!(remove_redundant_patch_points);
      run_pass!(eliminate_dead_code);
## Overview

Ruby is an object-oriented programming language, so CRuby needs to have some notion of object loads, modifications, and stores. In fact, this is a topic already covered by another Rails at Scale blog post. The shape system provides performance improvements in CRuby (both interpreter and JIT), but there is still plenty of opportunity to improve JIT performance. Sometimes optimizing interpreter opcodes one at a time leaves repeated loads or stores that can be cleaned up with a program analysis optimization pass. Before getting into the weeds about this pass, let’s talk performance.

## Results

The setivar benchmark for ZJIT changed dramatically on 2026-03-06. This is when load-store optimization landed in ZJIT. At the time of this writing, ZJIT takes an average of 2ms per iteration on this benchmark, while YJIT takes an average of 5ms.

[Graph: ZJIT (yellow) and YJIT (green) shown as "times faster than interpreter" (blue).]

You can see the moment where load-store optimization is implemented and ZJIT overtakes YJIT. This is the second time that ZJIT has clearly surpassed YJIT. The first example is here. At a high level, this means that ZJIT is over twice as fast as YJIT for repeated instance variable assignment, and more than 25 times faster than the interpreter!

## A Troubling Development

However, there’s an important question we have to address - why should an optimization pass for object loads and stores have anything to do with instance variable assignment? It turns out that ZJIT’s High-level Intermediate Representation (HIR) uses LoadField and StoreField instructions both for object instance variables and for object shapes. We’re going to have to dig deeper into CRuby shapes and ZJIT HIR internals in order to make sense of this.

## Background

So far, we’ve learned that HIR has LoadField and StoreField instructions.
We’ve claimed that they are multi-purpose and that the performance wins come from optimizing object shapes, but that they can also apply to object instance variables. Because the algorithm works just as well for both situations, the rest of this post will focus on object instance variables. This allows us to demonstrate concepts in pure Ruby to make things more approachable.

### Example

Let’s start with a simple example we can all agree on. Clearly this code snippet has a double store, and we can safely remove one of the @a = value calls.

    class C
      def initialize
        value = 1
        @a = value
        @a = value
      end
    end

Here’s the same code snippet with an example of the call we remove. Here, we have elided a redundant StoreField instruction.

      class C
        def initialize
          value = 1
          @a = value
    -     @a = value
        end
      end
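### When should we remove LoadField and StoreField instructions?

The HIR code snippets will come later. For now, we only need to know the mapping between Ruby and HIR for instance variable loads and stores. The operand order below follows the HIR dumps shown later in this post: the object comes first, then the field name with its offset.

| Ruby | HIR |
| --- | --- |
| `@var = value` | `StoreField obj, :@var@offset, value` |
| `@var` | `LoadField obj, :@var@offset` |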
Note: In a class’s initialize method, instance variable operations are likely to cause LoadField and StoreField instructions due to shape transitions. Outside of an initialize method, the loads and stores are more likely to be related to the instance variables themselves. We decided that more complicated Ruby code snippets would clarify the kind of LoadField or StoreField but overly clutter the code snippets in this post.
### Cases

Let’s consider every edge case for our algorithm through short Ruby snippets to illustrate scenarios where we can and cannot elide LoadField or StoreField HIR instructions.
Note: The following examples could replace the value variable with the constant 1, but in ZJIT this could cause other optimizations such as constant folding to interfere with our load-store demonstrations. We will use these more complex code snippets in case the reader wants to follow along with a compiler explorer.
#### Redundant Store

    class C
      def initialize
        value = 1
        @a = value
        # This store is redundant and should be elided in HIR
        @a = value
      end
    end

#### Redundant Load

    class C
      def initialize
        value = 1
        @a = value
        # We already know that this load is `value` and should be replaced
        @a
      end
    end
#### Redundant Store with Aliasing

    class C
      attr_accessor :a

      def initialize(value)
        @a = value
      end
    end

    class D
      attr_accessor :a

      def initialize(value)
        @a = value
      end
    end

    def multi_object_test
      x = C.new(1)
      y = D.new(1)
      new_x_val = 2
      new_y_val = 3
      x.a = new_x_val
      y.a = new_y_val
      # We would like to elide this (but currently do not)
      x.a = new_x_val
    end
With variables pointing to distinct objects, we could elide the second store to object x. This is not currently implemented, but is a possible improvement with a technique called type-based alias analysis.

#### Required Store with Aliasing

    class C
      attr_accessor :a

      def initialize(value)
        @a = value
      end
    end

    def multi_object_test
      x = C.new(1)
      y = x
      new_x_val = 2
      new_y_val = 3
      x.a = new_x_val
      y.a = new_y_val
      # We should not elide the second `x.a` assignment because the `y.a` assignment modifies `x`
      # The `x.a` store after this comment is no longer redundant
      x.a = new_x_val
    end
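To make the aliasing hazard concrete, here is a small runnable sketch in plain Ruby. The class `Box` and the two helper methods are hypothetical names invented for this illustration. They mimic by hand what an incorrect elision would do, and say nothing about how ZJIT itself is implemented.

```ruby
# Hypothetical class for illustration; it mirrors class C from the example.
class Box
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

# The original sequence: store to x, store through y, store to x again.
def with_second_store(x, y)
  x.a = 2
  y.a = 3
  x.a = 2
  x.a
end

# The same sequence with the final store removed by hand.
def without_second_store(x, y)
  x.a = 2
  y.a = 3
  x.a
end

# Distinct objects: removing the final store does not change the result.
puts with_second_store(Box.new(1), Box.new(1))    # prints 2
puts without_second_store(Box.new(1), Box.new(1)) # prints 2

# Aliased objects: removing the final store changes the result.
shared = Box.new(1)
puts with_second_store(shared, shared)    # prints 2
puts without_second_store(shared, shared) # prints 3
```

The two versions agree only when `x` and `y` are different objects, which is exactly the property an optimizer would have to prove before removing the store.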
With multiple variables aliasing the same object, we cannot elide the second store to x. While technically we could elide y.a = new_y_val and the initial y = x assignment, these improvements are out of scope for this post. The key point here is that aliasing needs to be considered. If we assume that y and x reference different objects and elide the second x.a = new_x_val call, we alter program behavior.

#### Required Store with Effects

    def scary_method(obj)
      obj.a = "We have modified the object. The second store is no longer redundant"
    end

    class C
      attr_accessor :a

      def initialize(value)
        @a = value
      end
    end

    def effectful_operations_between_stores_test
      x = C.new(1)
      x.a = 5
      scary_method(x)
      # We want to elide this but `scary_method` can modify `x`
      x.a = 5
    end
In this case, the second store looks redundant, but it might not be. An arbitrary Ruby method (or C call, or some HIR instructions) could modify the x object and break the assumptions we can make about its state. In such cases, we cannot perform load-store optimization.

## The Algorithm

### Key Idea

With these cases, we have covered everything needed to implement our load-store optimization algorithm. The algorithm is a lightweight abstract interpretation over objects. This approach allows us to minimize the computation required to perform our optimization pass while ensuring soundness. In layperson’s terms, this means that every load we replace and every store we eliminate will not change program behavior, but that we will potentially miss some loads or stores that could be eliminated.

### Tricky Details

#### Basic Blocks

Our load-store optimization pass scans through basic blocks, searches for redundant loads and stores, and updates the HIR instructions accordingly. Unnecessary StoreField operations are elided, and unnecessary LoadField operations are replaced with the instruction already holding the value. While one key benefit of ZJIT is that it can optimize entire methods, load-store optimization is (for now) block-local only.

#### LoadField and StoreField Distinctions

So far, we’ve talked about elision and instruction removal. We can get away with deleting StoreField instructions because no other instructions point to StoreField instructions. Conversely, LoadField instructions do have dependencies and are referenced by other instructions. These references need to be fixed up. Each reference to an elided LoadField gets replaced with the cached value that the load would have produced.

#### The WriteBarrier Instruction

ZJIT has WriteBarrier instructions to support garbage collection. These can also modify objects and act similarly to stores. We need to handle this case in our algorithm.
#### Pointer Intricacies

The pseudo-code we are about to introduce uses the term “offset” to denote the number of bytes from the object’s base address in memory. We use this to detect redundant loads and stores, as well as to invalidate cache entries when we see effectful instructions and write barriers. However, it is not immediately obvious that simply checking offsets would be enough. How can we be sure that the memory regions we are tracking remain untouched by some other instruction? Fortunately, HIR instructions always point to the base of an object and use offsets that are in bounds of the object. If we have two offsets that are not equal, they cannot reference the same region of memory. If the offsets are equal, then object aliasing must be considered.

### Algorithm Sketch

Here’s the pseudo-code for a given basic block. The cache is created once per block, and a store that is kept also records its value in the cache so that later loads and stores can be matched against it.

    initialize an empty cache as a hashmap
    for each HIR instruction in the basic block
      if instruction is `LoadField`
        check if the (object, offset) pair is in the cache
          if so, delete instruction and replace references to it with the cached value
          else, cache the loaded value with the (object, offset) pair as a key
      else if instruction is `StoreField`
        check if the object, offset, and value triple is in the cache
          if so, delete the instruction
          else, remove each cache entry with the same offset (the flags field) to avoid aliasing issues,
                then cache the stored value with the (object, offset) pair as a key
      else if instruction is `WriteBarrier`
        # This instruction is needed for the garbage collector and is complex
        # It works similarly to `StoreField` in practice
        # This instruction is never removed but the cache cleaning is still needed
        remove each cache entry with the same offset to avoid aliasing issues
      else if instruction can modify objects
        flush the cache
      else
        continue
    return the pruned HIR instructions
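As a rough illustration of the sketch above, here is a toy version in plain Ruby. Everything in it is an assumption made for this example: the `Struct` instruction types, the `optimize_block` name, the `Call` stand-in for any instruction that can modify objects, and `ASSUMED_BARRIER_OFFSET`, the slot a write barrier is imagined to touch. ZJIT's real pass operates on HIR and differs in detail.

```ruby
# Toy instructions. These Struct names are hypothetical stand-ins,
# not ZJIT's real data structures.
LoadField    = Struct.new(:id, :obj, :offset)
StoreField   = Struct.new(:obj, :offset, :value)
WriteBarrier = Struct.new(:obj, :value)
Call         = Struct.new(:id, :args) # stands in for anything that may write memory
Return       = Struct.new(:value)

# Assumption for this sketch only: a write barrier may touch this one slot.
ASSUMED_BARRIER_OFFSET = 0

def optimize_block(insns)
  cache = {}    # [obj, offset] => value known to live in that slot
  replaced = {} # id of an elided load => value that stands in for it
  resolve = ->(v) { replaced.fetch(v, v) }
  out = []

  insns.each do |insn|
    case insn
    when LoadField
      key = [resolve.call(insn.obj), insn.offset]
      if cache.key?(key)
        # Redundant load: drop it and point later uses at the cached value.
        replaced[insn.id] = cache[key]
      else
        cache[key] = insn.id
        out << LoadField.new(insn.id, key[0], insn.offset)
      end
    when StoreField
      obj = resolve.call(insn.obj)
      value = resolve.call(insn.value)
      key = [obj, insn.offset]
      # Redundant store: the slot is already known to hold this value.
      next if cache.key?(key) && cache[key] == value
      # Another name may alias obj, so forget every entry at this offset.
      cache.delete_if { |(_, off), _| off == insn.offset }
      cache[key] = value
      out << StoreField.new(obj, insn.offset, value)
    when WriteBarrier
      # Never removed; only the entries it might touch are forgotten.
      cache.delete_if { |(_, off), _| off == ASSUMED_BARRIER_OFFSET }
      out << WriteBarrier.new(resolve.call(insn.obj), resolve.call(insn.value))
    when Call
      # Unknown writes: nothing in the cache can be trusted afterwards.
      cache.clear
      out << Call.new(insn.id, insn.args.map(&resolve))
    when Return
      out << Return.new(resolve.call(insn.value))
    end
  end
  out
end

block = [
  StoreField.new(:self, 0x10, :v13),
  LoadField.new(:v40, :self, 0x10),  # redundant: replaced by :v13
  StoreField.new(:self, 0x10, :v13), # redundant: elided
  Call.new(:v50, [:self]),
  StoreField.new(:self, 0x10, :v13), # kept: the call may have written the slot
  Return.new(:v40),                  # rewritten to return :v13
]
optimize_block(block).each { |insn| p insn }
```

Keying the cache on the (object, offset) pair while invalidating by offset alone is what keeps the toy version conservative: a store through any name forgets every entry that could alias it, at the cost of missing some removable stores.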
### Source Code

The source at the time of this writing can be found here.

## HIR Improvements

After the optimization, here are examples of how the HIR changes. This is the new HIR for our first redundant load example.

      fn initialize@../scripts/double_load.rb:3:
      bb1():
        EntryPoint interpreter
        v1:BasicObject = LoadSelf
        v2:NilClass = Const Value(nil)
        Jump bb3(v1, v2)
      bb2():
        EntryPoint JIT(0)
        v5:BasicObject = LoadArg :self@0
        v6:NilClass = Const Value(nil)
        Jump bb3(v5, v6)
      bb3(v8:BasicObject, v9:NilClass):
        v13:Fixnum[1] = Const Value(1)
        PatchPoint SingleRactorMode
        v30:HeapBasicObject = GuardType v8, HeapBasicObject
        v31:CShape = LoadField v30, :_shape_id@0x4
        v32:CShape[0x80000] = GuardBitEquals v31, CShape(0x80000)
        StoreField v30, :@a@0x10, v13
        WriteBarrier v30, v13
        v35:CShape[0x80008] = Const CShape(0x80008)
        StoreField v30, :_shape_id@0x4, v35
    -   v20:HeapBasicObject = RefineType v8, HeapBasicObject
        PatchPoint SingleRactorMode
    -   v38:CShape = LoadField v20, :_shape_id@0x4
    -   v39:CShape[0x80008] = GuardBitEquals v38, CShape(0x80008)
    -   v40:BasicObject = LoadField v20, :@a@0x10
        CheckInterrupts
    -   Return v40
    +   Return v13

This is the new HIR for our first redundant store example.

      bb1():
        EntryPoint interpreter
        v1:BasicObject = LoadSelf
        v2:NilClass = Const Value(nil)
        Jump bb3(v1, v2)
      bb2():
        EntryPoint JIT(0)
        v5:BasicObject = LoadArg :self@0
        v6:NilClass = Const Value(nil)
        Jump bb3(v5, v6)
      bb3(v8:BasicObject, v9:NilClass):
        v13:Fixnum[1] = Const Value(1)
        PatchPoint SingleRactorMode
        v35:HeapBasicObject = GuardType v8, HeapBasicObject
        v36:CShape = LoadField v35, :_shape_id@0x4
        v37:CShape[0x80000] = GuardBitEquals v36, CShape(0x80000)
        StoreField v35, :@a@0x10, v13
        WriteBarrier v35, v13
        v40:CShape[0x80008] = Const CShape(0x80008)
        StoreField v35, :_shape_id@0x4, v40
        v20:HeapBasicObject = RefineType v8, HeapBasicObject
        PatchPoint NoEPEscape(initialize)
        PatchPoint SingleRactorMode
    -   v43:CShape = LoadField v20, :_shape_id@0x4
    -   v44:CShape[0x80008] = GuardBitEquals v43, CShape(0x80008)
    -   StoreField v20, :@a@0x10, v13
        WriteBarrier v20, v13
        CheckInterrupts
        Return v13
And that’s load-store optimization!

## Design Discussion

You may notice that our optimization is pruning the graph of loads and stores on an object. We are solving a very similar problem to the SSA form baked into the HIR. While it would be great to have “more SSA” at the object level, this comes at a cost. Computing SSA at this level could necessitate structural changes to HIR and make things less ergonomic or more confusing in regions of the codebase outside of load-store optimization. In fact, this question of “more SSA” is a complex design decision and contentious topic with a rich history in compilers such as V8 or Jikes RVM. So far, we’ve decided to use a lightweight SSA representation in ZJIT that causes us to work a bit harder for certain optimization passes, yielding subtle design simplifications across the rest of HIR.

## Future Work

There’s still a lot of exciting work to be done and there are improvements to be made before we hit diminishing returns. Dead store elimination utilizes many of the same ideas and could help improve object initialization performance. We could implement type-based alias analysis, though this requires care, as type confusion bugs are quite dangerous in JIT compilers. See section 4.1 in the phrack article for further details.

## Conclusion

Thanks for reading the first post about ZJIT’s optimizer. We have lots more to come, so stay tuned.
The Ruby and Rails Infrastructure team at Shopify exists to help ensure that Ruby and Rails are 100-year tools that will continue to merit being our toolchain of choice.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
## How ZJIT Removes Redundant Object Loads and Stores
This post details the implementation of a new optimization pass within ZJIT, a Just-In-Time (JIT) compiler for Ruby, focused on eliminating redundant object loads and stores. Developed by Jacob Denbeaux, this optimization significantly improves performance, particularly in scenarios involving repeated instance variable assignments. The core of ZJIT’s approach lies in its High-Level Intermediate Representation (HIR), a structure that allows targeted optimizations at a level beyond the interpreter.
ZJIT's HIR pipeline consists of several passes, including type specialization, inlining, and constant folding. Load-store optimization adds a new pass, `optimize_load_store`, which runs after `optimize_c_calls` and before `fold_constants`. The sequence is now: `type_specialize`, `inline`, `optimize_getivar`, `optimize_c_calls`, `optimize_load_store`, `fold_constants`, `clean_cfg`, `remove_redundant_patch_points`, `eliminate_dead_code`. With this pass in place, ZJIT overtakes YJIT on the setivar microbenchmark.
At the heart of the optimization is the recognition that Ruby objects have inherent loads and stores associated with both instance variables and their shapes. The `optimize_load_store` pass specifically targets these redundant operations within the HIR. This pass works by identifying and eliminating redundant LoadField and StoreField instructions, crucial for improving performance when dealing with repetitive object assignments.
The effect shows up in the ‘setivar’ benchmark. Load-store optimization landed on 2026-03-06, and at the time of writing ZJIT averages 2ms per iteration on this benchmark while YJIT averages 5ms. That makes ZJIT over twice as fast as YJIT for repeated instance variable assignment, and more than 25 times faster than the interpreter.
A key factor leading to this success is the HIR’s use of LoadField and StoreField instructions, which can be applied to both object shapes and instance variables. This design choice enables the optimization pass to intelligently eliminate redundancies across these two types of operations. This also demonstrates that the algorithms utilized for reducing loads and stores across shapes and instance variables are fundamentally the same.
However, a question arises: why would an optimization pass for object loads and stores have anything to do with instance variable assignment? The answer lies in ZJIT's HIR, which uses LoadField and StoreField instructions for both object shapes and instance variables, so a single pass removes redundant operations of both kinds.
To illustrate this concept, the document provides an example of a simple Ruby class ‘C’ with a redundant store instruction. The HIR representation shows how the optimization pass can eliminate this redundant instruction, improving efficiency. The core logic relies on the identification of LoadField and StoreField instructions and elimination of redundant instances.
The post outlines a series of cases showing where the load-store optimization can be applied and where it must be avoided. These cases cover redundant stores and loads, aliasing between variables, and effectful operations between stores. Correctness comes from being conservative: the pass is a lightweight abstract interpretation that is sound, meaning it never changes program behavior, but it may miss some loads or stores that could be eliminated.
The implementation of the load-store optimization pass leverages a basic block-based approach, scanning the HIR for redundant loads and stores. The algorithm uses a hashmap to track and update these instructions, reflecting a relatively lightweight approach that doesn’t introduce significant structural changes to the HIR. This approach is crucial for maintaining the ergonomic nature of the HIR.
Furthermore, the algorithm accounts for WriteBarrier instructions, which support garbage collection and can modify objects much like stores. They are never removed, but they invalidate the affected cache entries to avoid aliasing problems. More sophisticated analyses, such as type-based alias analysis, are not yet implemented.
Looking ahead, future work may include dead store elimination and type-based alias analysis. The post also notes that adopting “more SSA” at the object level is a complex design decision that could require structural changes to the rest of HIR.