ZJIT removes redundant object loads and stores

Recorded: March 21, 2026, 10 p.m.

How ZJIT removes redundant object loads and stores | Rails at Scale

2026-03-18

Jacob Denbeaux

Intro
Since the post at the end of last year, ZJIT has grown and
changed in some exciting ways. This is the story of how a new, self-contained
optimization pass causes ZJIT performance to surpass YJIT on an interesting
microbenchmark. It has been 10 months since ZJIT was merged
into Ruby, and we’re now beginning to see the design differences between YJIT
and ZJIT manifest themselves in performance divergences. In this post, we will
explore the details of one new optimization in ZJIT called load-store
optimization. This implementation is part of ZJIT’s optimizer in HIR. Recall
that the structure of ZJIT looks roughly like the following.
flowchart LR
A(["Ruby"])
A --> B(["YARV"])
B --> C(["HIR"])
C --> D(["LIR"])
D --> E(["Assembly"])

This post will focus on optimization passes in HIR, or “High-level” Intermediate
Representation. At the HIR level, we have two capabilities that are distinct
from other compilation stages. Our optimizations in HIR typically utilize the
benefits of our SSA representation in addition to the HIR
instruction effect system.
These are the current optimization passes in ZJIT without load-store optimization,
listed in the order in which they are executed.
run_pass!(type_specialize);
run_pass!(inline);
run_pass!(optimize_getivar);
run_pass!(optimize_c_calls);
run_pass!(fold_constants);
run_pass!(clean_cfg);
run_pass!(remove_redundant_patch_points);
run_pass!(eliminate_dead_code);

Here’s where load-store optimization gets added.
run_pass!(type_specialize);
run_pass!(inline);
run_pass!(optimize_getivar);
run_pass!(optimize_c_calls);
+ run_pass!(optimize_load_store);
run_pass!(fold_constants);
run_pass!(clean_cfg);
run_pass!(remove_redundant_patch_points);
run_pass!(eliminate_dead_code);

Overview
Ruby is an object-oriented programming language, so CRuby needs to have some
notion of object loads, modifications, and stores. In fact, this is a topic
already covered by another Rails at Scale blog post. The shape
system provides performance improvements in CRuby (both interpreter and JIT),
but there is still plenty of opportunity to improve JIT performance. Sometimes
optimizing interpreter opcodes one at a time leaves repeated loads or stores
that can be cleaned up with a program analysis optimization pass. Before getting
into the weeds about this pass, let’s talk performance.
Results
The setivar benchmark for ZJIT changes dramatically on
2026-03-06. This is when load-store optimization landed in ZJIT. At the time of
this writing, ZJIT takes an average of 2ms per iteration on this benchmark,
while YJIT takes an average of 5ms.
This graph shows ZJIT (yellow) and YJIT (green) as "times faster than interpreter" (blue). You can see the moment where load-store optimization is implemented and ZJIT overtakes YJIT.
This is the second time that ZJIT has clearly surpassed YJIT. The first example
is here.
At a high level, this means that ZJIT is over twice as fast as YJIT for repeated
instance variable assignment, and more than 25 times faster than the
interpreter!
A Troubling Development
However, there’s an important question we have to address - why should an
optimization pass for object loads and stores have anything to do with instance
variable assignment? It turns out that ZJIT’s High-level Intermediate Representation
(HIR) uses LoadField and StoreField instructions both for object
instance variables and for object shapes. We’re going to have to dig deeper
into CRuby shapes and ZJIT HIR internals in order to make sense of this.
Background
So far, we’ve learned that HIR has LoadField and StoreField instructions.
We’ve claimed that they are multi-purpose and that the performance wins come
from optimizing object shapes, but that they can also apply to object instance
variables. Because the algorithm works just as well for both situations, the
rest of this post will focus on object instance variables. This allows us to
demonstrate concepts in pure Ruby to make things more approachable.
Example
Let’s start with a simple example we can all agree on. Clearly this code
snippet has a double store, and we can safely remove one of the @a = value
assignments.
class C
  def initialize
    value = 1
    @a = value
    @a = value
  end
end

Here’s the same code snippet with the assignment we remove marked. Here, we
have elided a redundant StoreField instruction.
  class C
    def initialize
      value = 1
      @a = value
-     @a = value
    end
  end

When should we remove LoadField and StoreField instructions? The HIR code
snippets will come later. For now, we only need to know the mapping between Ruby
and HIR for instance variable loads and stores.

| Ruby | HIR |
| --- | --- |
| @var = value | StoreField var, @obj@offset, value |
| @var | LoadField var, @obj@offset |

Note: In a class’s initialize method, instance variable operations are
likely to cause LoadField and StoreField instructions due to shape
transitions. Outside of an initialize method, the loads and stores are more
likely to be related to the instance variables themselves. We decided that
more complicated Ruby code snippets would clarify the kind of LoadField or
StoreField but overly clutter the code snippets in this post.

Cases
Let’s consider every edge case for our algorithm through short Ruby snippets
to illustrate scenarios where we can and cannot elide LoadField or
StoreField HIR instructions.

Note: The following examples could replace the value variable with the
constant 1, but in ZJIT this could cause other optimizations such as
constant folding to interfere with our load-store demonstrations. We will use
these more complex code snippets in case the reader wants to follow along with
a compiler explorer.

Redundant Store
class C
  def initialize
    value = 1
    @a = value
    # This store is redundant and should be elided in HIR
    @a = value
  end
end

Redundant Load
class C
  def initialize
    value = 1
    @a = value
    # We already know that this load is `value` and should be replaced
    @a
  end
end

Redundant Store with Aliasing
class C
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

class D
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

def multi_object_test
  x = C.new(1)
  y = D.new(1)
  new_x_val = 2
  new_y_val = 3
  x.a = new_x_val
  y.a = new_y_val
  # We would like to elide this (but currently do not)
  x.a = new_x_val
end

With variables pointing to distinct objects, we could elide the second store to
object x. This is not currently implemented, but is a possible improvement
with a technique called type-based alias analysis.
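As a rough illustration of the idea behind type-based alias analysis, the sketch below answers "could these two values be the same object?" from type information alone. It is not ZJIT code: `may_alias?` and the `types` table (mapping a value name to its exact class, when known) are hypothetical names introduced here, and the sketch assumes that two values with different exact classes can never be the same object.

```ruby
# Hypothetical helper, not part of ZJIT.
# `types` maps a value name to its exact class name when known.
def may_alias?(a, b, types)
  return true if a == b                # same name: trivially the same object
  type_a = types[a]
  type_b = types[b]
  # Unknown type information: stay conservative and assume aliasing.
  return true if type_a.nil? || type_b.nil?
  # Assumption: values with different exact classes are different objects.
  type_a == type_b
end

types = { x: "C", y: "D" }
puts may_alias?(:x, :y, types)  # false: a store to y.a cannot change x.a
puts may_alias?(:x, :z, types)  # true: nothing is known about z
```

With an answer of `false` for `x` and `y`, the store to `y.a` would not need to invalidate what is known about `x.a`, which is what would allow the final store to be elided.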
Required Store with Aliasing
class C
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

def multi_object_test
  x = C.new(1)
  y = x
  new_x_val = 2
  new_y_val = 3
  x.a = new_x_val
  y.a = new_y_val
  # We should not elide the second `x.a` assignment because the `y.a` assignment modifies `x`
  # The `x.a` store after this comment is no longer redundant
  x.a = new_x_val
end

With multiple variables aliasing the same object, we cannot elide
the second store to x. While technically we could elide y.a = new_y_val and
the initial y = x assignment, these improvements are out of scope for this
post. The key point here is that aliasing needs to be considered. If we assume
that y and x reference different objects and elide the second
x.a = new_x_val assignment, we alter program behavior.
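The behavior change can be observed directly in plain Ruby. The sketch below is illustrative only: the method names `with_second_store` and `with_second_store_elided` are made up here, and the second method hand-simulates what an unsound elision would produce.

```ruby
class C
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

# The original program: all three stores execute.
def with_second_store
  x = C.new(1)
  y = x      # y aliases x
  x.a = 2
  y.a = 3    # overwrites x.a through the alias
  x.a = 2    # required: restores the value
  x.a
end

# Hand-written simulation of wrongly eliding the final store.
def with_second_store_elided
  x = C.new(1)
  y = x
  x.a = 2
  y.a = 3
  x.a
end

puts with_second_store         # 2
puts with_second_store_elided  # 3
```

The two methods return different values, so the final store is not redundant once `y` may alias `x`.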
Required Store with Effects
def scary_method(obj)
  obj.a = "We have modified the object. The second store is no longer redundant"
end

class C
  attr_accessor :a

  def initialize(value)
    @a = value
  end
end

def effectful_operations_between_stores_test
  x = C.new(1)
  x.a = 5
  scary_method(x)
  # We want to elide this but `scary_method` can modify `x`
  x.a = 5
end

In this case, the second store looks redundant, but it might not be. An
arbitrary Ruby method (or C call, or some HIR instructions) could modify the x
object and break the assumptions we can make about its state.
In such cases, we cannot perform load-store optimization.
The Algorithm
Key Idea
With these cases, we have covered everything needed to implement our load-store
optimization algorithm. The algorithm is a lightweight
abstract interpretation over objects. This approach allows us to
minimize the computation required to perform our optimization pass while
ensuring soundness. In layperson’s terms, this means that every load we replace
and every store we eliminate will not change program behavior, but that we will
potentially miss some loads or stores that could be eliminated.
Tricky Details
Basic Blocks
Our load-store optimization pass scans through basic blocks, searches for
redundant loads and stores, and updates the HIR instructions accordingly.
Unnecessary StoreField operations are elided, and unnecessary LoadField
operations are replaced with the instruction already holding the value. While
one key benefit of ZJIT is that it can optimize entire methods, load-store
optimization is (for now) block-local only.
LoadField and StoreField Distinctions
So far, we’ve talked about elision and instruction removal. We can get away with
deleting StoreField instructions because no other instructions point to
StoreField instructions. Conversely, LoadField instructions do have
dependencies and are referenced by other instructions. These references need to
be fixed up. Each reference to LoadField gets replaced with the cached value
that was the target of a load.
The WriteBarrier Instruction
ZJIT has WriteBarrier instructions to support garbage collection. These also
can modify objects and act similarly to stores. We need to handle this case in
our algorithm.
Pointer Intricacies
The pseudo-code we are about to introduce uses the term “offset” to denote the
number of bytes from the object’s base address in memory. We use this to
detect redundant loads and stores, as well as to invalidate cache entries when
we encounter effectful instructions and write barriers. However, it is not immediately obvious that
simply checking offsets would be enough. How can we be sure that the memory
regions we are tracking remain untouched by some other instruction? Fortunately,
HIR instructions always point to the base of an object and use offsets that
are in bounds of the object. If we have two offsets that are not equal, they
cannot reference the same region of memory. If the offsets are equal, then
object aliasing must be considered.
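The offset rule can be written as a small predicate. This is a sketch of the reasoning in the paragraph above, not ZJIT's implementation: `may_clobber?` is a hypothetical name, and objects are represented by the names of the HIR values that hold them.

```ruby
# Hypothetical helper: could a store to (store_obj, store_offset) change the
# value we have cached for (cached_obj, cached_offset)?
def may_clobber?(cached_obj, cached_offset, store_obj, store_offset)
  # Unequal offsets cannot reference the same region of memory,
  # whichever objects are involved.
  return false if cached_offset != store_offset
  # Equal offsets: the two names might refer to the same object,
  # so conservatively assume the cached value is no longer valid.
  true
end

puts may_clobber?(:v30, 0x10, :v30, 0x4)   # false: different fields
puts may_clobber?(:v30, 0x10, :v20, 0x10)  # true: v20 and v30 may alias
```

Note that the second argument pair never needs to prove that the objects differ. Without alias information, the equal-offset case is always treated as a possible overwrite.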
Algorithm Sketch
Here’s the pseudo-code for a given basic block.
initialize an empty cache as a hashmap keyed by (object, offset)

for each HIR instruction in the basic block
    if instruction is `LoadField`
        check if the (object, offset) pair is in the cache
        if so, delete instruction and replace references to it with the cached value
        else, cache the loaded value with the (object, offset) pair as a key

    else if instruction is `StoreField`
        check if the cache already maps the (object, offset) pair to the stored value
        if so, delete the instruction
        else, remove each cache entry with the same offset (the flags field) to avoid aliasing issues,
              then cache the stored value with the (object, offset) pair as a key

    else if instruction is `WriteBarrier`
        # This instruction is needed for the garbage collector and is complex
        # It works similarly to `StoreField` in practice
        # This instruction is never removed but the cache cleaning is still needed
        remove each cache entry with the same offset to avoid aliasing issues

    else if instruction can modify objects
        flush the cache

    else
        continue

return the pruned HIR instructions
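The steps above can be sketched as a small runnable Ruby model. This is a toy under stated assumptions, not ZJIT's Rust implementation: the `Insn` struct, the operation names (`:load_field`, `:store_field`, `:write_barrier`, `:call`, `:return`), and the `replaced` table are all hypothetical. In particular, the toy write barrier carries an explicit `offset` naming the field it may touch, and `:call` stands in for any instruction that can modify objects.

```ruby
# Toy instruction. `id` names the value an instruction defines (nil if none).
Insn = Struct.new(:op, :id, :obj, :offset, :val, keyword_init: true)

def optimize_load_store(block)
  cache = {}     # [obj, offset] => name of the value known to live there
  replaced = {}  # eliminated load id => name of the value that replaces it
  out = []

  block.each do |insn|
    # Fix up operands that refer to a load we already eliminated.
    insn = insn.dup
    insn.obj = replaced.fetch(insn.obj, insn.obj)
    insn.val = replaced.fetch(insn.val, insn.val)
    key = [insn.obj, insn.offset]

    case insn.op
    when :load_field
      if cache.key?(key)
        replaced[insn.id] = cache[key]  # redundant load: drop it
        next
      end
      cache[key] = insn.id
    when :store_field
      next if cache[key] == insn.val    # redundant store: drop it
      # Any object may alias this one, so forget every entry at this offset.
      cache.delete_if { |(_, off), _| off == insn.offset }
      cache[key] = insn.val
    when :write_barrier
      cache.delete_if { |(_, off), _| off == insn.offset }
    when :call
      cache.clear                       # may modify any object
    end
    out << insn
  end
  out
end

block = [
  Insn.new(op: :store_field, obj: :v30, offset: 0x10, val: :v13),
  Insn.new(op: :load_field, id: :v40, obj: :v30, offset: 0x10),
  Insn.new(op: :return, val: :v40),
]
optimize_load_store(block).each { |i| puts "#{i.op} #{i.val}" }
# store_field v13
# return v13
```

The load disappears and the return now refers to the stored value directly, mirroring the shape of the redundant load case. Keeping the cache in a plain hash and flushing it on anything effectful trades missed opportunities for simplicity, which matches the conservative stance described earlier.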

Source Code
The source at the time of this writing can be found here.
HIR Improvements
After the optimization, here are examples of how the HIR changes.
This is the new HIR for our first redundant load example.
fn initialize@../scripts/double_load.rb:3:
bb1():
EntryPoint interpreter
v1:BasicObject = LoadSelf
v2:NilClass = Const Value(nil)
Jump bb3(v1, v2)
bb2():
EntryPoint JIT(0)
v5:BasicObject = LoadArg :self@0
v6:NilClass = Const Value(nil)
Jump bb3(v5, v6)
bb3(v8:BasicObject, v9:NilClass):
v13:Fixnum[1] = Const Value(1)
PatchPoint SingleRactorMode
v30:HeapBasicObject = GuardType v8, HeapBasicObject
v31:CShape = LoadField v30, :_shape_id@0x4
v32:CShape[0x80000] = GuardBitEquals v31, CShape(0x80000)
StoreField v30, :@a@0x10, v13
WriteBarrier v30, v13
v35:CShape[0x80008] = Const CShape(0x80008)
StoreField v30, :_shape_id@0x4, v35
- v20:HeapBasicObject = RefineType v8, HeapBasicObject
PatchPoint SingleRactorMode
- v38:CShape = LoadField v20, :_shape_id@0x4
- v39:CShape[0x80008] = GuardBitEquals v38, CShape(0x80008)
- v40:BasicObject = LoadField v20, :@a@0x10
CheckInterrupts
- Return v40
+ Return v13

This is the new HIR for our first redundant store example.
bb1():
EntryPoint interpreter
v1:BasicObject = LoadSelf
v2:NilClass = Const Value(nil)
Jump bb3(v1, v2)
bb2():
EntryPoint JIT(0)
v5:BasicObject = LoadArg :self@0
v6:NilClass = Const Value(nil)
Jump bb3(v5, v6)
bb3(v8:BasicObject, v9:NilClass):
v13:Fixnum[1] = Const Value(1)
PatchPoint SingleRactorMode
v35:HeapBasicObject = GuardType v8, HeapBasicObject
v36:CShape = LoadField v35, :_shape_id@0x4
v37:CShape[0x80000] = GuardBitEquals v36, CShape(0x80000)
StoreField v35, :@a@0x10, v13
WriteBarrier v35, v13
v40:CShape[0x80008] = Const CShape(0x80008)
StoreField v35, :_shape_id@0x4, v40
v20:HeapBasicObject = RefineType v8, HeapBasicObject
PatchPoint NoEPEscape(initialize)
PatchPoint SingleRactorMode
- v43:CShape = LoadField v20, :_shape_id@0x4
- v44:CShape[0x80008] = GuardBitEquals v43, CShape(0x80008)
- StoreField v20, :@a@0x10, v13
WriteBarrier v20, v13
CheckInterrupts
Return v13

And that’s load-store optimization!
Design Discussion
You may notice that our optimization is pruning the graph of loads and stores
on an object. We are solving a very similar problem to the SSA form baked into
the HIR. While it would be great to have “more SSA” at the object level, this
comes at a cost. Computing SSA at this level could necessitate structural
changes to HIR and make things less ergonomic or more confusing in regions of
the codebase outside of load-store optimization. In fact, this question of “more
SSA” is a complex design decision and contentious topic with a
rich history in compilers such as V8 or Jikes
RVM. So far, we’ve decided to use a lightweight SSA representation in ZJIT that
causes us to work a bit harder for certain optimization passes, yielding subtle
design simplifications across the rest of HIR.
Future Work
There’s still a lot of exciting work to be done and there are improvements to
be made before we hit diminishing returns. Dead store elimination utilizes many
of the same ideas and could help improve object initialization performance. We
could implement type-based alias analysis, though this
requires care, as type confusion bugs are quite
dangerous in JIT compilers. See section 4.1 in the Phrack article for further
details.
Conclusion
Thanks for reading the first post about ZJIT’s optimizer. We have lots more to
come, so stay tuned.

Shopify Engineering

The Ruby and Rails Infrastructure team at Shopify exists to help ensure that Ruby and Rails are 100-year tools that will continue to merit being our toolchain of choice.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

## How ZJIT Removes Redundant Object Loads and Stores

This post details the implementation of a new optimization pass within ZJIT, a Just-In-Time (JIT) compiler for Ruby, focused on eliminating redundant object loads and stores. As described by Jacob Denbeaux, this optimization significantly improves performance, particularly in scenarios involving repeated instance variable assignments. The pass lives in ZJIT’s High-level Intermediate Representation (HIR), where optimizations can draw on the SSA representation and the HIR instruction effect system.

ZJIT's HIR pipeline consists of several passes, including type specialization, inlining, and various optimizations. Load-store optimization adds a new pass, `optimize_load_store`, placed directly after `optimize_c_calls`, so the sequence is now: `type_specialize`, `inline`, `optimize_getivar`, `optimize_c_calls`, `optimize_load_store`, `fold_constants`, `clean_cfg`, `remove_redundant_patch_points`, and `eliminate_dead_code`. On the setivar microbenchmark, this addition lets ZJIT surpass YJIT.

At the heart of the optimization is the recognition that Ruby objects have inherent loads and stores associated with both instance variables and their shapes. The `optimize_load_store` pass specifically targets these redundant operations within the HIR. This pass works by identifying and eliminating redundant LoadField and StoreField instructions, crucial for improving performance when dealing with repetitive object assignments.

The effect of this optimization shows up in the ‘setivar’ benchmark, whose results change dramatically on 2026-03-06, when load-store optimization landed. At the time of writing, ZJIT averages 2ms per iteration on this benchmark while YJIT averages 5ms. That makes ZJIT over twice as fast as YJIT for repeated instance variable assignment and more than 25 times faster than the interpreter.

A key factor leading to this success is the HIR’s use of LoadField and StoreField instructions, which can be applied to both object shapes and instance variables. This design choice enables the optimization pass to intelligently eliminate redundancies across these two types of operations. This also demonstrates that the algorithms utilized for reducing loads and stores across shapes and instance variables are fundamentally the same.

However, a critical question arises: why would an optimization pass for object loads and stores have anything to do with instance variable assignment? The answer lies in ZJIT's HIR, which uses LoadField and StoreField instructions for both object shapes and instance variables, so a single pass removes redundant operations in both areas.

To illustrate this concept, the document provides an example of a simple Ruby class ‘C’ with a redundant store instruction. The HIR representation shows how the optimization pass can eliminate this redundant instruction, improving efficiency. The core logic relies on the identification of LoadField and StoreField instructions and elimination of redundant instances.

The post outlines a series of cases showing where the load-store optimization can be applied and where it must be avoided: redundant stores and loads, aliasing between variables, and effectful operations between stores. These edge cases require careful consideration to ensure the optimization maintains program correctness. The algorithm itself is a lightweight abstract interpretation over objects: it is sound, so every replaced load and eliminated store preserves program behavior, though some eliminable loads and stores may be missed.

The implementation of the load-store optimization pass leverages a basic block-based approach, scanning the HIR for redundant loads and stores. The algorithm uses a hashmap to track and update these instructions, reflecting a relatively lightweight approach that doesn’t introduce significant structural changes to the HIR. This approach is crucial for maintaining the ergonomic nature of the HIR.

Furthermore, the algorithm incorporates WriteBarrier instructions to handle garbage collection and ensures the cache is appropriately cleaned within the HIR to avoid aliasing problems. It’s important to note that more sophisticated analyses are not yet implemented, and the path toward more robust performance gains remains.

Looking ahead, future development efforts may include dead store elimination and type-based alias analysis, the latter requiring care because type confusion bugs are dangerous in JIT compilers. The post also acknowledges that adopting “more SSA” at the object level is a contentious design decision that could necessitate structural changes to HIR.