Two studies in compiler optimisations · hmpc
© 2026 Henrique Cabral
Two studies in compiler optimisations 20 Mar 2026
Table of contents
- Introduction
- Case 1: Modular increment
  - The scenario
  - Peephole optimisations
  - Loop invariants
  - All roads
- Case 2: Endianness conversion
  - The scenario
  - Instruction selection
  - Helping the compiler
  - Ordering, redux
  - Shooting ourselves in the foot
- Conclusion
- Appendix: other resources
Introduction

While many performance-oriented programmers are intimately acquainted with the almost preternatural ability of modern compilers to optimise their code, and many of us have spent countless hours on Compiler Explorer examining the differences between the Assembly generated by different versions of gcc and clang, most have likely not looked under the hood to see how the magic happens. It is a testament to their quality that most of us simply treat compilers as black boxes: more or less readable code goes in, fast binaries come out. Sometimes, however, seemingly innocuous changes—perhaps even meant to help the compiler—can cause surprising performance issues which we are hard-pressed to explain without a deeper understanding of the underlying machinery.

In this post we’ll dive into the implementation of some of the LLVM1 optimisation passes using two simple examples which, nonetheless, will help us pull back the veil on the complexity involved in producing highly-optimised code. We will see how small source changes can trigger different paths in the compiler’s internal processing with unexpected consequences, demonstrating how achieving high performance can be as much an art as it is a science for both compiler developers and users. I have also included a few exercises for those interested in getting their hands dirty, but they are not required to follow along with the main text.

I use LLVM 22.1.0 as the reference implementation throughout this post. The examples are written in (very basic) C++23 and target x86-64, and the Assembly code uses Intel syntax. Prior knowledge of LLVM IR is not required but it can be helpful (I recommend A Gentle Introduction to LLVM IR).
Case 1: Modular increment

The scenario

Consider the following C++ function to get the next index into an array or vector of elements accessed in a round-robin fashion, with cur being the current index and count the number of elements:

unsigned next_naive(unsigned cur, unsigned count) {
    return (cur + 1) % count;
}
As written, this code requires an expensive 32-bit division instruction (6 cycles with 12-cycle latency on an Intel Ice Lake core, worse in previous generations). There are, of course, numerous tricks to replace divisions by constants with cheaper arithmetic operations—powers of two being the best-known case—but since here count is a dynamic runtime value the compiler cannot help us:

next_naive(unsigned int, unsigned int):
        lea     eax, [rdi + 1]
        xor     edx, edx
        div     esi
        mov     eax, edx
        ret
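As an aside, the best-known special case mentioned here, a count that is a compile-time power of two, lets us replace the modulo with a bitwise AND. A minimal sketch (the function name is mine, not from the post):

```cpp
#include <cassert>

// When count is a power of two, (cur + 1) % count reduces to a bitwise AND
// with count - 1, replacing the division with a single-cycle operation.
unsigned next_pow2(unsigned cur, unsigned count) {
    assert(count != 0 && (count & (count - 1)) == 0); // power of two only
    return (cur + 1) & (count - 1);
}
```

The compiler performs this strength reduction automatically whenever count is a constant power of two; the sketch just makes the transformation explicit.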
However, we know something the compiler doesn’t: because cur is an index, it must always be strictly smaller than count. We can take advantage of this to replace the division with a much cheaper conditional move by writing:

unsigned next_cmov(unsigned cur, unsigned count) {
    auto n = cur + 1;
    return n == count ? 0 : n;
}
next_cmov(unsigned int, unsigned int):
        lea     ecx, [rdi + 1]
        xor     eax, eax
        cmp     ecx, esi
        cmovne  eax, ecx
        ret
Could the compiler have done this transformation on its own? Let’s try out the new C++23 assume attribute2:

unsigned next_assume(unsigned cur, unsigned count) {
    [[assume(cur < count)]];
    return (cur + 1) % count;
}
next_assume(unsigned int, unsigned int):
        lea     ecx, [rdi + 1]
        xor     eax, eax
        cmp     ecx, esi
        cmovne  eax, ecx
        ret
Exactly the same code as for next_cmov()! So how did clang figure it out?

Peephole optimisations

When trying to understand where a certain optimisation comes from (or where it might be missing), the first step is to identify the pass where it occurs. Currently, the most convenient way to investigate the effects of the different LLVM optimisation passes is to use Compiler Explorer’s Opt Pipeline view:
(You can get the same information locally by passing the -mllvm -print-changed options to clang, but without the nice highlighted diffs.)

When looking through the evolution of the LLVM IR for next_assume(), we quickly find that the transformation from a urem instruction (representing the modulo operation) to an icmp+select pair (which is eventually translated into an x86 cmp+cmov) happens in InstCombine:

  %cmp = icmp ult i32 %cur, %count
  call void @llvm.assume(i1 %cmp)
  %add = add i32 %cur, 1
- %rem = urem i32 %add, %count
+ %0 = icmp eq i32 %add, %count
+ %rem = select i1 %0, i32 0, i32 %add
  ret i32 %rem
InstCombine is responsible for many of the peephole optimisations that do not modify the control flow graph, such as converting multiplications by powers of two into left shifts. It proceeds by visiting all the instructions in a function and matching them to one of many hundreds of patterns using a generic pattern matching mechanism. The pass actually runs multiple times during compilation3, since other passes typically expose new optimisation opportunities by performing different simplifications, unrolling loops, etc.

In this case the main transformation logic is quite simple and self-contained. If we look through the lib/Transforms/InstCombine directory in the LLVM repository, we’ll see that the code is neatly organised into different files for the different categories of instructions matched by the pass, each file containing visitor methods for the opcodes of interest. One of these files is InstCombineMulDivRem.cpp, where we can find the relatively short visitURem() method and, at the end of it:4

// For "(X + 1) % Op1" and if (X u< Op1) => (X + 1) == Op1 ? 0 : X + 1 .
if (match(Op0, m_Add(m_Value(X), m_One()))) {
    Value *Val = simplifyICmpInst(ICmpInst::ICMP_ULT, X, Op1,
                                  SQ.getWithInstruction(&I));
    if (Val && match(Val, m_One())) {
        Value *FrozenOp0 = Op0;
        if (!isGuaranteedNotToBeUndef(Op0))
            FrozenOp0 = Builder.CreateFreeze(Op0, Op0->getName() + ".frozen");
        Value *Cmp = Builder.CreateICmpEQ(FrozenOp0, Op1);
        return createSelectInstWithUnknownProfile(
            Cmp, ConstantInt::getNullValue(Ty), FrozenOp0);
    }
}
As the comment preceding the block describes, this snippet matches exactly the pattern we used in next_assume() and generates the same optimisation we wrote by hand in next_cmov() for cases where it can prove that the pre-increment value is less than the modulus.5 To make this guarantee, the compiler cleverly reuses the same analysis used to fold the result of icmp instructions, namely the simplifyICmpInst() function. Scrolling down its implementation you can see the numerous patterns it matches; of particular interest to us here is the simplifyICmpWithDominatingAssume helper, which checks all the assumptions applicable to either side of the comparison (in the form of calls to the llvm.assume() intrinsic) to see whether they imply our condition of interest.6 This, finally, explains how the information we encoded in our assume attribute enabled the compiler to remove that pesky division.

Looking through the implementation of simplifyICmpInst() also shows us other options to communicate this precondition. For example, this version of our function is optimised in a similar way when assertions are enabled:

unsigned next_assert(unsigned cur, unsigned count) {
    assert(cur < count);
    return (cur + 1) % count;
}
Here, the simplification analysis uses the conditional hidden in the assert() macro to figure out that we only execute the urem instruction if cur < count (see the isImpliedByDomCondition() method). This can have the unexpected consequence of making builds with assertions enabled faster in some cases; we can restore performance by replacing disabled assertions with assume attributes, which have no runtime impact.
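One way to combine the two approaches is a small wrapper macro that checks the condition in debug builds and degrades to an optimiser hint when assertions are disabled. This is a sketch of my own (the macro name is hypothetical, not part of the post or any standard library):

```cpp
#include <cassert>

// Hypothetical helper: in debug builds the condition is checked at runtime;
// in release builds (NDEBUG) it becomes a C++23 assumption with no runtime cost.
#ifdef NDEBUG
  #define CHECKED_ASSUME(cond) [[assume(cond)]]
#else
  #define CHECKED_ASSUME(cond) assert(cond)
#endif

unsigned next_checked(unsigned cur, unsigned count) {
    CHECKED_ASSUME(cur < count);
    return (cur + 1) % count;
}
```

Note that an assumption whose condition is false is undefined behaviour, so this trades the safety net of the assert for performance in release builds.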
Exercise 1: If we try to be clever and rewrite our function using bitwise operations to avoid the potential branch, InstCombine sees right through our trickery and produces the exact same output:

unsigned next_bitwise(unsigned cur, unsigned count) {
    auto n = cur + 1;
    return n & -(n != count);
}

  %add = add i32 %cur, 1
- %cmp = icmp ne i32 %add, %count
- %conv = zext i1 %cmp to i32
- %sub = sub nsw i32 0, %conv
- %and = and i32 %add, %sub
+ %cmp.not = icmp eq i32 %add, %count
+ %and = select i1 %cmp.not, i32 0, i32 %add
  ret i32 %and
Trace the execution of the pass to figure out what patterns it matches to do this. Start with the visitor for the sub instruction. (rr can be very useful for this.)
Loop invariants

So far all the examples we’ve seen required us to explicitly indicate, one way or another, what the relationship was between cur and count so the code could be optimised accordingly. In other cases this relationship might be implicit, such as when iterating through the modulo values in a loop:

void iterate_1(unsigned upto, unsigned count) {
    for (unsigned i = 0, r = 0; i < upto; ++i, r = (r + 1) % count)
        // force the compiler to calculate r instead of discarding the whole loop
        asm("" :: "r" (r));
}
If we look at the generated Assembly, it doesn’t seem like the compiler is smart enough to deduce the optimisation on its own:

iterate_1(unsigned int, unsigned int):
        test    edi, edi
        je      .LBB4_3
        xor     edx, edx
.LBB4_2:
        inc     edx
        mov     eax, edx
        xor     edx, edx
        div     esi
        dec     edi
        jne     .LBB4_2
.LBB4_3:
        ret
As is often the case, though, the compiler has simply noticed something we overlooked: our optimisation relies on having r < count, but if count is zero this condition cannot be met.7 It’s instructive to see exactly how the optimiser reaches this conclusion. If we look at the IR at the input to the first InstCombine pass, we will find that the (X + 1) % Op1 pattern from visitURem() now matches X to the following phi instruction8:

%r.0 = phi i32 [ 0, %entry ], [ %rem, %for.body ]
When finding a phi instruction, simplifyICmpInst() checks that the comparison can be simplified for all possible input values to the phi (in this case, zero and the last value calculated for r) and, if so, that it produces the same result for all of them. With our code above there is no way to determine whether 0 < count and the simplification fails. If we cover our bases by adding an if (count == 0) return; statement before the loop, we can close the gap and get the expected result:

.LBB4_2:
        inc     ecx
        cmp     ecx, esi
        cmove   ecx, eax
        dec     edi
        jne     .LBB4_2
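For reference, the guarded version can be sketched as follows; I return the final remainder instead of using the asm sink so the result is observable (that change is mine, not from the post):

```cpp
// iterate_1 with the zero guard described above; without it the compiler
// cannot prove that r < count holds on every path through the loop.
unsigned iterate_1_guarded(unsigned upto, unsigned count) {
    if (count == 0)
        return 0;
    unsigned r = 0;
    for (unsigned i = 0; i < upto; ++i)
        r = (r + 1) % count; // foldable to cmp + cmov thanks to the guard
    return r;
}
```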
But that doesn’t explain how the compiler determined that %rem < %count in the second input to the phi, when we no longer have an explicit assumption or condition to that effect. That case is handled by yet another pattern deep in the bowels of the instruction folding routines, which specifically handles comparisons between remainders of a given modulus and the modulus itself:

// icmp pred (urem X, Y), Y
if (match(LBO, m_URem(m_Value(), m_Specific(RHS)))) {
    switch (Pred) {
    // ...
    case ICmpInst::ICMP_NE:
    case ICmpInst::ICMP_ULT:
    case ICmpInst::ICMP_ULE:
        return getTrue(ITy);
    }
}
Yes, it’s patterns all the way down.
Exercise 2: If we instead change the loop condition to i < std::min(upto, count) the compiler is no longer able to replace the urem instruction. Starting from the isImpliedByDomCondition() method mentioned earlier, try to figure out where you could add support to see through min/max idioms when evaluating implied conditions, and whether you could reuse any existing analysis code. What if we wanted to remove the now-redundant remainder entirely?
All roads

It may seem unnecessary to define an extra auxiliary variable to hold the remainder when we could simply calculate it from the loop counter itself, even if the generated code is efficient. Let’s get rid of it:

void iterate_2(unsigned upto, unsigned count) {
    for (unsigned i = 0; i < upto; ++i)
        asm("" :: "r" (i % count));
}
iterate_2(unsigned int, unsigned int):
        test    edi, edi
        je      .LBB5_3
        xor     eax, eax
        xor     ecx, ecx
        xor     edx, edx
.LBB5_2:
        inc     ecx
        cmp     ecx, esi
        cmove   ecx, eax
        inc     edx
        cmp     edi, edx
        jne     .LBB5_2
.LBB5_3:
        ret
It looks like the compiler recognises the same opportunity here: the division is still replaced with a nice conditional move, using a separate register to hold the remainder. But this doesn’t look like the work of the InstCombine code we saw earlier; and why are we incrementing the loop counter, which requires an extra comparison, rather than decrementing it and using the zero flag, as before?

This time, if we inspect the results of the different optimisation passes, we’ll see that the transformation comes not from InstCombine but from another pass, CodeGenPrepare, which runs shortly before instruction selection but after the loop strength reduction (LSR) responsible for converting the counter increment into a decrement. InstCombine can no longer match the pattern it expects, since it does not hold that i < count; and, because i is used inside the body of the loop to calculate the remainder, LSR can no longer replace it with a down counter. CodeGenPrepare then acts as a sort of last resort, identifying a specific pattern which it can transform into almost-optimal code. This illustrates two points which we’ll explore in more depth in the next section:
- Compiler passes (and their code) have a fair amount of redundancy as an engineering compromise to maximise coverage.
- The order of passes, including when and where they are repeated, can have a significant influence on the final result.
Exercise 3: LLVM ships its optimiser, aptly named opt, as a modular binary separate from the compiler front-end. Save our implementation of iterate_2() above in an iterate.cpp file and try running the following command using your local clang install:

clang++ -O2 -S -emit-llvm -fno-discard-value-names -o- iterate.cpp \
  | opt -S -passes="require<profile-summary>,loop-reduce,codegenprepare" -
You should obtain output similar to what you can observe after the CodeGenPrepare pass in the Compiler Explorer Opt Pipeline view. What happens if you add another loop-reduce pass at the end?

Exercise 4: If we make the same change to the loop condition in iterate_2() as in Exercise 2, the remainder instruction is eliminated from the generated code. What pass does this and why didn’t it work before?
Case 2: Endianness conversion

The scenario

Most binary formats, whether for storage, IPC, or networking, specify a particular byte order for numeric fields; for example, SQLite database files use big-endian fields, while Protobuf i32 and i64 fields are little-endian. The customary way to write host endianness-agnostic decoding functions is to explicitly join the individual input bytes. For 32-bit values that looks something like (assuming a 32-bit int so we don’t need casts):

uint32_t load_le32(const uint8_t* data) {
    return data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
}

uint32_t load_be32(const uint8_t* data) {
    return data[3] | (data[2] << 8) | (data[1] << 16) | (data[0] << 24);
}
On x86, which supports unaligned loads from memory, this generates optimal Assembly at -O1 and above:

load_le32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        ret

load_be32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        bswap   eax
        ret
Some developers, suffering from chronic DRYness and faced with the prospect of writing these functions and their 64-bit equivalents for the umpteenth time, or perhaps working with integer sizes other than 4 and 8 bytes, would prefer to write a single generic version instead:

static uint64_t load(const uint8_t* data, size_t sz, bool be) {
    uint64_t val = 0;

    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);

    return val;
}

uint32_t load_le32(const uint8_t* data) { return load(data, 4, false); }
uint32_t load_be32(const uint8_t* data) { return load(data, 4, true); }
uint64_t load_le64(const uint8_t* data) { return load(data, 8, false); }
uint64_t load_be64(const uint8_t* data) { return load(data, 8, true); }
Code reviewers might then reasonably worry about whether there are any performance regressions compared to the more traditional functions. Let’s see what happens when we compile:

load_le32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        ret

load_be32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        bswap   eax
        ret

load_le64(unsigned char const*):
        mov     rax, qword ptr [rdi]
        ret

load_be64(unsigned char const*):
        mov     rax, qword ptr [rdi]
        bswap   rax
        ret
It looks like our lazy (or perhaps far-sighted) developer is in the clear—at least if they don’t need to use gcc which, at least as of version 15.2, cannot optimise this code. Let’s examine the progression of the optimisation pipeline again to gain some insight into how the compiler pulls this off.

Instruction selection

Given what we’ve seen in the first half of the post, it may be somewhat surprising to find through our trusty Opt Pipeline Viewer that it takes all the way until the instruction selection phase (also known as ISel) for the consecutive loads to be folded into single 32- or 64-bit loads. We’ll see in the next section how we can enable earlier simplification in some cases, but let’s take some time to understand what ISel is and how it implements folding.

Instruction selection is the first phase of the compiler back-end, which runs after all the generic optimisation passes (in LLVM this distinction is also reflected in the separation between the opt and llc binaries). Its purpose is to convert the LLVM IR produced by the optimiser into Machine IR (MIR), a target-specific representation of the instruction stream suitable for downstream passes such as scheduling or register allocation. Most LLVM targets, including x86, use SelectionDAG as their instruction selection framework. We won’t go into the weeds of SelectionDAG here9; for our purposes, it suffices to know that it includes its own peephole optimiser called DAGCombiner, analogous to InstCombine, which can capture simplification opportunities exposed by code generation or specific to that target.

Much like InstCombine visits each instruction in a basic block, DAGCombiner visits each node in the DAG built from those same instructions. It’s here, among the many transformations performed by the or visitor, that we find the MatchLoadCombine() method:

/// Match a pattern where a wide type scalar value is loaded by several narrow
/// loads and combined by shifts and ors. Fold it into a single load or a load
/// and a BSWAP if the targets supports it.
This is, of course, the exact pattern we see in the LLVM IR at the input to SelectionDAG, after load() is inlined, constant arguments propagated, and the loops fully unrolled. What’s interesting about this method, though, is that it doesn’t perform a simple pattern match as we’ve seen before in InstCombine, but instead tracks the origin of every single byte at the output of the or instruction recursively through the DAG (using the auxiliary calculateByteProvider() function), not just from loads but also through bitwise operations, byte swaps, and zero/sign extensions. This makes it powerful enough to also handle expressions like:

static uint64_t load_alt(const uint8_t* data, size_t sz, bool be) {
    uint64_t val = 0;

    for (size_t i = 0; i < sz; ++i)
        val = (val << 8) | data[be ? i : sz - i - 1];

    return val;
}
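As a sanity check, it is easy to verify that load_alt() agrees with the original load() for both byte orders; a quick comparison harness of my own (both function bodies are taken from the post):

```cpp
#include <cstddef>
#include <cstdint>

// Original formulation: shift each byte into its final position and OR it in.
static uint64_t load(const uint8_t* data, size_t sz, bool be) {
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

// Alternative formulation: shift the accumulator and OR in the next byte,
// reading the input in the opposite order.
static uint64_t load_alt(const uint8_t* data, size_t sz, bool be) {
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val = (val << 8) | data[be ? i : sz - i - 1];
    return val;
}
```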
In this case, however, we find that only the 32-bit functions are fully optimised. Although we have exactly the same number of LLVM IR instructions as before, this new version requires recursing deeper into the DAG to uncover the source of the bytes loaded earliest (since we do two operations on the output variable per iteration, instead of one), hitting the hard-coded limit of ten levels of recursion. This is a good example of how having a mental map of the compiler’s internal processing, as well as sticking to more common patterns, can help us avoid such pathological cases.10

Helping the compiler

Some seasoned C++ developers may have instinctively objected to this example. After all, if we only have a few possible argument values known at compile-time, why wouldn’t we just template our function rather than relying on the compiler to propagate the constants?

template <size_t sz, bool be>
static uint64_t load(const uint8_t* data) {
    uint64_t val = 0;

    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);

    return val;
}

uint32_t load_le32(const uint8_t* data) { return load<4, false>(data); }
uint32_t load_be32(const uint8_t* data) { return load<4, true>(data); }
uint64_t load_le64(const uint8_t* data) { return load<8, false>(data); }
uint64_t load_be64(const uint8_t* data) { return load<8, true>(data); }
The output is unchanged, but this time the IR for the little-endian functions is optimised into a single load long before we reach the instruction selection stage, in a pass called AggressiveInstCombine. Like InstCombine, this pass performs a variety of peephole optimisations, but it’s focused on those which tend to be less commonly needed or take longer to run, which is why it runs only once at -O2 (and not at all at -O1). One of these optimisations is implemented in the foldConsecutiveLoads() function, which matches patterns of the form “bitwise-or of (optionally) left-shifted, zero-extended, consecutive loads”, such as the ones we find in our functions. Since it relies on simple pattern matching, AggressiveInstCombine is not as powerful as DAGCombiner; in particular, it cannot optimise the fold used in load_alt() above even for 32 bits. It also has no impact on our big-endian loads because, as a middle-end optimisation pass (unlike DAGCombiner, which is part of the compiler back-end), it tries to produce target-independent IR as much as possible. Folding a big-endian load on a little-endian target, and vice-versa, would require adding a target-specific byte-swap intrinsic which would complicate downstream analysis and optimisation passes.
Exercise 5: Template the load_alt() function as in this section and examine the resulting output. Does it match our earlier results? What does this tell us about the interaction between templating and the optimisation pipeline?

Exercise 6: Instead of loading bytes in a different order depending on endianness, we can also fix the order and optionally use the __builtin_bswap64() intrinsic on the return from load(). How does this affect the ability of AggressiveInstCombine to optimise the code?
Ordering, redux

Why didn’t AggressiveInstCombine come into play in our initial non-templated version? If we inspect the optimisation pipeline, we’ll find that, in that earlier version, the pass ran after load() was inlined but, crucially, before the loop was unrolled, so there was no pattern to match yet. Function templates, on the other hand, are instantiated separately for each unique combination of parameters, so by the time they are inlined the loops have already been fully unrolled (since the iteration count is known) and can be matched by foldConsecutiveLoads(). We can obtain the same result with the initial code by manually running:

clang++ -S -emit-llvm -O2 -o- endianness.cpp \
  | opt -S -passes='aggressive-instcombine' - \
  | llc --x86-asm-syntax=intel
This also exemplifies an important principle of function inlining in LLVM (I’m not sure about other compilers): whenever possible, callees are simplified before callers in order to improve the accuracy of the inlining cost heuristics. The more information is available to the optimiser before inlining, the better it can make those decisions. In general, in fact, the earlier in the pipeline we can enable specific simplifications, the more optimised the final result will be (see the exercises for more examples of this).
Exercise 7: If we comment out all of the specialised functions but load_le32() we get similar results as when we template load(), i.e. AggressiveInstCombine successfully folds the consecutive loads. What optimisation pass enables this? How can we restructure performance-critical code to take advantage of it?

Exercise 8: If we now keep only load_le64() but replace load() with load_alt() above, the DAGCombiner is able to fully optimise the resulting LLVM IR. Compare with our earlier results to understand why this is the case.
Shooting ourselves in the foot

If you’ve read this far you may be under the impression that, while they may not be magic, optimisers are at the very least superhuman. Let’s quickly dispel that notion by looking at two examples where the obvious thing to do turns out not to be so obvious for the compiler. We’ll start by adding a small (and pretty useless) function to our initial example:

uint32_t load_le32(const uint8_t* data) { return load(data, 4, false); }

uint32_t load_add_le32(const uint8_t* data) { return load_le32(data) + data[0]; }
Take a moment before continuing to consider how you would expect this function to be compiled. Two loads and an add? Perhaps a load, zero extension of the low byte, and an add? Now compare with the actual result:

load_add_le32(unsigned char const*):
        movzx   ecx, byte ptr [rdi]
        movzx   edx, word ptr [rdi + 1]
        shl     edx, 8
        movzx   eax, byte ptr [rdi + 3]
        shl     eax, 24
        or      eax, edx
        or      eax, ecx
        add     eax, ecx
        ret
Huh. What happened here? Remember that the loads aren’t folded together until we reach the ISel phase, so by the time load_le32() is inlined it’s not yet fully optimised. It turns out DAGCombiner, when tracking the origin of each byte in the load, specifically rules out bytes which are used elsewhere in the function to prevent introducing duplicate loads. This can be sensible in general (even cache hits have a cost, after all), but it’s clearly suboptimal in this case.

Can we make use of AggressiveInstCombine to “hide” the duplicate load from DAGCombiner? The answer turns out to be yes, but we must be careful. This, for example, won’t work:

uint32_t load_add_le32(const uint8_t* data) { return load<4, false>(data) + data[0]; }
Although AggressiveInstCombine does run after load<4, false>() is inlined here, it also refuses to create duplicate loads11. As the saying goes, however, there’s no problem that cannot be solved with another layer of indirection: if we instead call the function template from load_le32(), we can get AggressiveInstCombine to optimise the unrolled loop before it’s inlined in load_add_le32(), producing the desired result (the additional cleverness to avoid a second load comes courtesy of the GVN pass).
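Concretely, the layered arrangement described above can be sketched as follows (function names follow the post's examples; whether the duplicate byte load is actually removed may vary by compiler version):

```cpp
#include <cstddef>
#include <cstdint>

template <size_t sz, bool be>
static uint64_t load(const uint8_t* data) {
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

// The extra layer of indirection: the loads inside load_le32() can be folded
// before load_le32() is itself inlined into load_add_le32().
uint32_t load_le32(const uint8_t* data) { return load<4, false>(data); }

uint32_t load_add_le32(const uint8_t* data) {
    return load_le32(data) + data[0];
}
```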
Exercise 9: What happens if we use val & 0xFF instead of data[0] in the examples above, and why? Can you make a small change to make this work regardless of which function we call?12
For our second and final act, let’s reconsider what we’re optimising for in the first place. We have been focusing on performance in the sense of execution time by using the -O<n> flags. But what about optimising for size with -Os? In this case, they happen to coincide: surely you cannot get any faster or smaller than a single 32- or 64-bit load. But does the compiler agree? Unfortunately, it does not. With the curious exception of load_be32() (or, more precisely, load<4, true>()), it has decided that unrolling the loops would go against its mandate to minimise code size, making it impossible for downstream passes to fold the loads. The solution here is to nudge it politely using #pragma unroll before the loop, allowing AggressiveInstCombine or DAGCombiner to do their job properly. (You can also avoid pragmas by writing a recursive function template if that’s more to your taste.)

Conclusion

What can we take away from this somewhat meandering exploration? Perhaps the overarching theme is: get familiar with your compiler and what it can and cannot do for you. More actionable advice:
- Use language facilities to make your intent clear. If you know something at compile-time, all else being equal, you can help the optimiser by stating it explicitly through e.g. C++ templates or Rust macros. In some cases it may even be preferable to choose at runtime from among a small set of specialised functions, instead of using a single generic function which cannot be optimised.
- Maximise the potential for early optimisation through things like templating (again), auxiliary variables, explicit constraints, or specialised functions. Having at least a rough idea of the different optimisation passes and their order can help with this. The earlier the compiler can optimise something, the greater the opportunity for beneficial knock-on effects.
- Stick to well-known patterns. The more common a construction is, the more likely that someone else already did the hard work of making the compiler recognise and optimise it. Simple code beats clever nine times out of ten.
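As an illustration of the first point, choosing at runtime from a small set of specialised instantiations keeps each path fully optimisable. A sketch under my own naming, not from the post:

```cpp
#include <cstddef>
#include <cstdint>

template <size_t sz, bool be>
static uint64_t load(const uint8_t* data) {
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

// Runtime dispatch over a small set of specialised functions: each
// instantiation can compile down to a single (possibly byte-swapped) load.
uint64_t load_any(const uint8_t* data, size_t sz, bool be) {
    if (sz == 4)
        return be ? load<4, true>(data) : load<4, false>(data);
    if (sz == 8)
        return be ? load<8, true>(data) : load<8, false>(data);
    return 0; // unsupported size; real code might assert or fall back
}
```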
Appendix: other resources Haven’t had enough yet? Here are some helpful resources if you’re interested in diving deeper into LLVM in general and the optimiser in particular:
- Official LLVM documentation, in particular:
  - LLVM Language Reference Manual
  - LLVM Programmer’s Manual
  - The LLVM Target-Independent Code Generator
  - LLVM Loop Terminology (and Canonical Forms)
- Nikita Popov’s blog, in particular:
  - LLVM: The middle-end optimization pipeline
  - LLVM: Canonicalization and target-independence
  - LLVM: Scalar evolution
- A whirlwind tour of the LLVM optimizer (YouTube)
I’m using LLVM rather than gcc mainly because gcc does not perform the optimisations I analyse here, but also because LLVM provides better introspection tooling around its internal representation. Personally, I also find the LLVM codebase better documented and easier to navigate and understand. ↩
We could’ve also used __builtin_assume() or __builtin_unreachable() (which also works with gcc) for the same effect; all of these are translated into llvm.assume() intrinsics. ↩
Eight times by my count, at both -O1 and -O2. You can pass the -mllvm -print-pipeline-passes arguments to clang to see exactly which passes it will run. ↩
When reading through the LLVM code it’s helpful to keep in mind the following:
One important aspect of LLVM is that there is no distinction between an SSA variable and the operation that produces it. Because of this, any reference to the value produced by an instruction (or the value available as an incoming argument, for example) is represented as a direct pointer to the instance of the class that represents this value.
↩
There is some additional logic to make the optimisation work correctly for undefined values; you can read more about this in the LLVM documentation. ↩
It is enlightening, although quite repetitive, to go down the rabbit hole of the implementation for the isImpliedCondition() method to see just how many different cases it has to handle. ↩
Since taking the remainder of a division by zero is undefined behaviour, the compiler would be well within its rights to assume that count != 0 and proceed with the transformation. At the moment, at least, LLVM is not quite so ruthless. ↩
If you’re not familiar with SSA form, a phi node is a way to merge the different execution paths introduced by conditionals or loops; the output of a phi instruction is the value labelled by the basic block from which the current execution has proceeded. In this case that value is zero at the entrance to the loop and %rem after the first iteration. ↩
I recommend watching A Beginners’ Guide to SelectionDAG (YouTube) if you want to learn more. ↩
Hardware engineers used to writing HDL for FPGAs might find this familiar. Toolchains like AMD Vivado have strict requirements on what kinds of constructs they are able to recognise and synthesise as expected (a process called inference long before the term was taken over by machine learning), e.g. into state machines or hard IP blocks such as RAMs. ↩
Before PR #176101 it refused to fold loads even if only the final result was used multiple times, but that change was not included in the LLVM 22.x release branch. ↩
Hint: the iterate_1() function from the first half of the post uses a similar trick. ↩ |
The author includes supplemental exercises encouraging readers to experiment with the code and explore the debugging tools available, although those are not required to follow along with main text. The discussion highlights the importance of understanding the compiler’s behavior and how different code constructs influence the optimization process. It emphasizes how code should be structured to allow efficient processing by the compiler, including making the code intent explicitly clear. The author further distinguishes two separate optimization strategies: that of using a template to make our code more robust and efficient, and the alternative of using a function with side-effects to achieve the same result which might result in increased code size.
The key takeaway is that compiler optimization is not simply a matter of “letting the compiler do its thing”. Instead, it’s a collaborative effort between the developer and the compiler, where the developer’s understanding of the compiler’s capabilities—and providing the compiler with the necessary information—is critical to achieving optimal performance. The exploration of the different stages of an LLVM optimization pass, such as the InstCombine and Instruction Selection passes, provides a deeper insight into the mechanics of the compiler’s process.
Finally, the author makes a brief but pointed observation regarding the potential for suboptimal results when relying solely on generic code without explicitly stating the assumptions or limiting the scope of the operations. This reinforces the need for developers to consider the potential impacts of their code on the compiler’s optimization process and to provide sufficient constraints to guide the compiler effectively. |