Two studies in compiler optimisations · hmpc
© 2026 Henrique Cabral
Two studies in compiler optimisations 20 Mar 2026
Table of contents
- Introduction
- Case 1: Modular increment
  - The scenario
  - Peephole optimisations
  - Loop invariants
  - All roads
- Case 2: Endianness conversion
  - The scenario
  - Instruction selection
  - Helping the compiler
  - Ordering, redux
  - Shooting ourselves in the foot
- Conclusion
- Appendix: other resources
Introduction

While many performance-oriented programmers are intimately acquainted with the almost preternatural ability of modern compilers to optimise their code, and many of us have spent countless hours on Compiler Explorer examining the differences between the Assembly generated by different versions of gcc and clang, most have likely not looked under the hood to see how the magic happens. It is a testament to their quality that most of us simply treat compilers as black boxes: more or less readable code goes in, fast binaries come out. Sometimes, however, seemingly innocuous changes—perhaps even meant to help the compiler—can cause surprising performance issues which we are hard-pressed to explain without a deeper understanding of the underlying machinery.

In this post we’ll dive into the implementation of some of the LLVM1 optimisation passes using two simple examples which, nonetheless, will help us pull back the veil on the complexity involved in producing highly-optimised code. We will see how small source changes can trigger different paths in the compiler’s internal processing with unexpected consequences, demonstrating how achieving high performance can be as much an art as it is a science for both compiler developers and users. I have also included a few exercises for those interested in getting their hands dirty, but they are not required to follow along with the main text.

I use LLVM 22.1.0 as the reference implementation throughout this post. The examples are written in (very basic) C++23 and target x86-64, and the Assembly code uses Intel syntax. Prior knowledge of LLVM IR is not required but it can be helpful (I recommend A Gentle Introduction to LLVM IR).
Case 1: Modular increment

The scenario

Consider the following C++ function to get the next index into an array or vector of elements accessed in a round-robin fashion, with cur being the current index and count the number of elements:

unsigned next_naive(unsigned cur, unsigned count) {
    return (cur + 1) % count;
}
As written, this code requires an expensive 32-bit division instruction (6 cycles with 12-cycle latency on an Intel Ice Lake core, worse in previous generations). There are, of course, numerous tricks to replace divisions by constants with cheaper arithmetic operations—powers of two being the best-known case—but since here count is a dynamic runtime value the compiler cannot help us:

next_naive(unsigned int, unsigned int):
        lea     eax, [rdi + 1]
        xor     edx, edx
        div     esi
        mov     eax, edx
        ret
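As an aside, the best-known special case mentioned here, a count that is a compile-time power of two, lets us replace the modulo with a bitwise AND. A minimal sketch (the function name is mine, not from the post):

```cpp
#include <cassert>

// When count is a power of two, (cur + 1) % count reduces to a bitwise AND
// with count - 1, replacing the division with a single-cycle operation.
unsigned next_pow2(unsigned cur, unsigned count) {
    assert(count != 0 && (count & (count - 1)) == 0); // power of two only
    return (cur + 1) & (count - 1);
}
```

The compiler performs this strength reduction automatically whenever count is a constant power of two; the sketch just makes the transformation explicit.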
However, we know something the compiler doesn’t: because cur is an index, it must always be strictly smaller than count. We can take advantage of this to replace the division with a much cheaper conditional move by writing:

unsigned next_cmov(unsigned cur, unsigned count) {
    auto n = cur + 1;
    return n == count ? 0 : n;
}
next_cmov(unsigned int, unsigned int):
        lea     ecx, [rdi + 1]
        xor     eax, eax
        cmp     ecx, esi
        cmovne  eax, ecx
        ret
Could the compiler have done this transformation on its own? Let’s try out the new C++23 assume attribute2:

unsigned next_assume(unsigned cur, unsigned count) {
    [[assume(cur < count)]];
    return (cur + 1) % count;
}
next_assume(unsigned int, unsigned int):
        lea     ecx, [rdi + 1]
        xor     eax, eax
        cmp     ecx, esi
        cmovne  eax, ecx
        ret
Exactly the same code as for next_cmov()! So how did clang figure it out?

Peephole optimisations

When trying to understand where a certain optimisation comes from (or where it might be missing), the first step is to identify the pass where it occurs. Currently, the most convenient way to investigate the effects of the different LLVM optimisation passes is to use Compiler Explorer’s Opt Pipeline view:
(You can get the same information locally by passing the -mllvm -print-changed options to clang, but without the nice highlighted diffs.)

When looking through the evolution of the LLVM IR for next_assume(), we quickly find that the transformation from a urem instruction (representing the modulo operation) to an icmp+select pair (which is eventually translated into an x86 cmp+cmov) happens in InstCombine:

  %cmp = icmp ult i32 %cur, %count
  call void @llvm.assume(i1 %cmp)
  %add = add i32 %cur, 1
- %rem = urem i32 %add, %count
+ %0 = icmp eq i32 %add, %count
+ %rem = select i1 %0, i32 0, i32 %add
  ret i32 %rem
InstCombine is responsible for many of the peephole optimisations that do not modify the control flow graph, such as converting multiplications by powers of two into left shifts. It proceeds by visiting all the instructions in a function and matching them to one of many hundreds of patterns using a generic pattern matching mechanism. The pass actually runs multiple times during compilation3, since other passes typically expose new optimisation opportunities by performing different simplifications, unrolling loops, etc.

In this case the main transformation logic is quite simple and self-contained. If we look through the lib/Transforms/InstCombine directory in the LLVM repository, we’ll see that the code is neatly organised into different files for the different categories of instructions matched by the pass, each file containing visitor methods for the opcodes of interest. One of these files is InstCombineMulDivRem.cpp, where we can find the relatively short visitURem() method and, at the end of it:4

// For "(X + 1) % Op1" and if (X u< Op1) => (X + 1) == Op1 ? 0 : X + 1 .
if (match(Op0, m_Add(m_Value(X), m_One()))) {
    Value *Val = simplifyICmpInst(ICmpInst::ICMP_ULT, X, Op1,
                                  SQ.getWithInstruction(&I));
    if (Val && match(Val, m_One())) {
        Value *FrozenOp0 = Op0;
        if (!isGuaranteedNotToBeUndef(Op0))
            FrozenOp0 = Builder.CreateFreeze(Op0, Op0->getName() + ".frozen");
        Value *Cmp = Builder.CreateICmpEQ(FrozenOp0, Op1);
        return createSelectInstWithUnknownProfile(
            Cmp, ConstantInt::getNullValue(Ty), FrozenOp0);
    }
}
As the comment preceding the block describes, this snippet matches exactly the pattern we used in next_assume() and generates the same optimisation we wrote by hand in next_cmov() for cases where it can prove that the pre-increment value is less than the modulus.5 To make this guarantee, the compiler cleverly reuses the same analysis used to fold the result of icmp instructions, namely the simplifyICmpInst() function. Scrolling down its implementation you can see the numerous patterns it matches; of particular interest to us here is the simplifyICmpWithDominatingAssume helper, which checks all the assumptions applicable to either side of the comparison (in the form of calls to the llvm.assume() intrinsic) to see whether they imply our condition of interest.6 This, finally, explains how the information we encoded in our assume attribute enabled the compiler to remove that pesky division.

Looking through the implementation of simplifyICmpInst() also shows us other options to communicate this precondition. For example, this version of our function is optimised in a similar way when assertions are enabled:

unsigned next_assert(unsigned cur, unsigned count) {
    assert(cur < count);
    return (cur + 1) % count;
}
Here, the simplification analysis uses the conditional hidden in the assert() macro to figure out that we only execute the urem instruction if cur < count (see the isImpliedByDomCondition() method). This can have the unexpected consequence of making builds with assertions enabled faster in some cases; we can restore performance by replacing disabled assertions with assume attributes, which have no runtime impact.
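One way to combine the two approaches is a small wrapper macro that checks the condition in debug builds and degrades to an optimiser hint when assertions are disabled. This is a sketch of my own (the macro name is hypothetical, not part of the post or any standard library):

```cpp
#include <cassert>

// Hypothetical helper: in debug builds the condition is checked at runtime;
// in release builds (NDEBUG) it becomes a C++23 assumption with no runtime cost.
#ifdef NDEBUG
  #define CHECKED_ASSUME(cond) [[assume(cond)]]
#else
  #define CHECKED_ASSUME(cond) assert(cond)
#endif

unsigned next_checked(unsigned cur, unsigned count) {
    CHECKED_ASSUME(cur < count);
    return (cur + 1) % count;
}
```

Note that an assumption whose condition is false is undefined behaviour, so this trades the safety net of the assert for performance in release builds.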
Exercise 1: If we try to be clever and rewrite our function using bitwise operations to avoid the potential branch, InstCombine sees right through our trickery and produces the exact same output:

unsigned next_bitwise(unsigned cur, unsigned count) {
    auto n = cur + 1;
    return n & -(n != count);
}

  %add = add i32 %cur, 1
- %cmp = icmp ne i32 %add, %count
- %conv = zext i1 %cmp to i32
- %sub = sub nsw i32 0, %conv
- %and = and i32 %add, %sub
+ %cmp.not = icmp eq i32 %add, %count
+ %and = select i1 %cmp.not, i32 0, i32 %add
  ret i32 %and
Trace the execution of the pass to figure out what patterns it matches to do this. Start with the visitor for the sub instruction. (rr can be very useful for this.)
Loop invariants

So far all the examples we’ve seen required us to explicitly indicate, one way or another, what the relationship was between cur and count so the code could be optimised accordingly. In other cases this relationship might be implicit, such as when iterating through the modulo values in a loop:

void iterate_1(unsigned upto, unsigned count) {
    for (unsigned i = 0, r = 0; i < upto; ++i, r = (r + 1) % count)
        // force the compiler to calculate r instead of discarding the whole loop
        asm("" :: "r" (r));
}
If we look at the generated Assembly, it doesn’t seem like the compiler is smart enough to deduce the optimisation on its own:

iterate_1(unsigned int, unsigned int):
        test    edi, edi
        je      .LBB4_3
        xor     edx, edx
.LBB4_2:
        inc     edx
        mov     eax, edx
        xor     edx, edx
        div     esi
        dec     edi
        jne     .LBB4_2
.LBB4_3:
        ret
As is often the case, though, the compiler has simply noticed something we overlooked: our optimisation relies on having r < count, but if count is zero this condition cannot be met.7 It’s instructive to see exactly how the optimiser reaches this conclusion. If we look at the IR at the input to the first InstCombine pass, we will find that the (X + 1) % Op1 pattern from visitURem() now matches X to the following phi instruction8:

%r.0 = phi i32 [ 0, %entry ], [ %rem, %for.body ]
When finding a phi instruction, simplifyICmpInst() checks that the comparison can be simplified for all possible input values to the phi (in this case, zero and the last value calculated for r) and, if so, that it produces the same result for all of them. With our code above there is no way to determine whether 0 < count and the simplification fails. If we cover our bases by adding an if (count == 0) return; statement before the loop, we can close the gap and get the expected result:

.LBB4_2:
        inc     ecx
        cmp     ecx, esi
        cmove   ecx, eax
        dec     edi
        jne     .LBB4_2
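For reference, the guarded version can be sketched as follows; I return the final remainder instead of using the asm sink so the result is observable (that change is mine, not from the post):

```cpp
// iterate_1 with the zero guard described above; without it the compiler
// cannot prove that r < count holds on every path through the loop.
unsigned iterate_1_guarded(unsigned upto, unsigned count) {
    if (count == 0)
        return 0;
    unsigned r = 0;
    for (unsigned i = 0; i < upto; ++i)
        r = (r + 1) % count; // foldable to cmp + cmov thanks to the guard
    return r;
}
```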
But that doesn’t explain how the compiler determined that %rem < %count in the second input to the phi, when we no longer have an explicit assumption or condition to that effect. That case is handled by yet another pattern deep in the bowels of the instruction folding routines, which specifically handles comparisons between remainders of a given modulus and the modulus itself:

// icmp pred (urem X, Y), Y
if (match(LBO, m_URem(m_Value(), m_Specific(RHS)))) {
    switch (Pred) {
    // ...
    case ICmpInst::ICMP_NE:
    case ICmpInst::ICMP_ULT:
    case ICmpInst::ICMP_ULE:
        return getTrue(ITy);
    }
}
Yes, it’s patterns all the way down.
Exercise 2: If we instead change the loop condition to i < std::min(upto, count) the compiler is no longer able to replace the urem instruction. Starting from the isImpliedByDomCondition() method mentioned earlier, try to figure out where you could add support to see through min/max idioms when evaluating implied conditions, and whether you could reuse any existing analysis code. What if we wanted to remove the now-redundant remainder entirely?
All roads

It may seem unnecessary to define an extra auxiliary variable to hold the remainder when we could simply calculate it from the loop counter itself, even if the generated code is efficient. Let’s get rid of it:

void iterate_2(unsigned upto, unsigned count) {
    for (unsigned i = 0; i < upto; ++i)
        asm("" :: "r" (i % count));
}
iterate_2(unsigned int, unsigned int):
        test    edi, edi
        je      .LBB5_3
        xor     eax, eax
        xor     ecx, ecx
        xor     edx, edx
.LBB5_2:
        inc     ecx
        cmp     ecx, esi
        cmove   ecx, eax
        inc     edx
        cmp     edi, edx
        jne     .LBB5_2
.LBB5_3:
        ret
It looks like the compiler recognises the same opportunity here: the division is still replaced with a nice conditional move, using a separate register to hold the remainder. But this doesn’t look like the work of the InstCombine code we saw earlier; and why are we incrementing the loop counter, which requires an extra comparison, rather than decrementing it and using the zero flag, as before?

This time, if we inspect the results of the different optimisation passes, we’ll see that the transformation comes not from InstCombine but from another pass, CodeGenPrepare, which runs shortly before instruction selection but after the loop strength reduction (LSR) responsible for converting the counter increment into a decrement. InstCombine can no longer match the pattern it expects, since it does not hold that i < count; and, because i is used inside the body of the loop to calculate the remainder, LSR can no longer replace it with a down counter. CodeGenPrepare then acts as a sort of last resort, identifying a specific pattern which it can transform into almost-optimal code. This illustrates two points which we’ll explore in more depth in the next section:
- Compiler passes (and their code) have a fair amount of redundancy as an engineering compromise to maximise coverage.
- The order of passes, including when and where they are repeated, can have a significant influence on the final result.
Exercise 3: LLVM ships its optimiser, aptly named opt, as a modular binary separate from the compiler front-end. Save our implementation of iterate_2() above in an iterate.cpp file and try running the following command using your local clang install:

clang++ -O2 -S -emit-llvm -fno-discard-value-names -o- iterate.cpp \
  | opt -S -passes="require<profile-summary>,loop-reduce,codegenprepare" -
You should obtain output similar to what you can observe after the CodeGenPrepare pass in the Compiler Explorer Opt Pipeline view. What happens if you add another loop-reduce pass at the end?

Exercise 4: If we make the same change to the loop condition in iterate_2() as in Exercise 2, the remainder instruction is eliminated from the generated code. What pass does this and why didn’t it work before?
Case 2: Endianness conversion

The scenario

Most binary formats, whether for storage, IPC, or networking, specify a particular byte order for numeric fields; for example, SQLite database files use big-endian fields, while Protobuf i32 and i64 fields are little-endian. The customary way to write host endianness-agnostic decoding functions is to explicitly join the individual input bytes. For 32-bit values that looks something like (assuming a 32-bit int so we don’t need casts):

uint32_t load_le32(const uint8_t* data) {
    return data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
}

uint32_t load_be32(const uint8_t* data) {
    return data[3] | (data[2] << 8) | (data[1] << 16) | (data[0] << 24);
}
On x86, which supports unaligned loads from memory, this generates optimal Assembly at -O1 and above:

load_le32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        ret

load_be32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        bswap   eax
        ret
Some developers, suffering from chronic DRYness and faced with the prospect of writing these functions and their 64-bit equivalents for the umpteenth time, or perhaps working with integer sizes other than 4 and 8 bytes, would prefer to write a single generic version instead:

static uint64_t load(const uint8_t* data, size_t sz, bool be) {
    uint64_t val = 0;

    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);

    return val;
}

uint32_t load_le32(const uint8_t* data) { return load(data, 4, false); }
uint32_t load_be32(const uint8_t* data) { return load(data, 4, true); }
uint64_t load_le64(const uint8_t* data) { return load(data, 8, false); }
uint64_t load_be64(const uint8_t* data) { return load(data, 8, true); }
Code reviewers might then reasonably worry about whether there are any performance regressions compared to the more traditional functions. Let’s see what happens when we compile:

load_le32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        ret

load_be32(unsigned char const*):
        mov     eax, dword ptr [rdi]
        bswap   eax
        ret

load_le64(unsigned char const*):
        mov     rax, qword ptr [rdi]
        ret

load_be64(unsigned char const*):
        mov     rax, qword ptr [rdi]
        bswap   rax
        ret
It looks like our lazy (or perhaps far-sighted) developer is in the clear—at least if they don’t need to use gcc which, at least as of version 15.2, cannot optimise this code. Let’s examine the progression of the optimisation pipeline again to gain some insight into how the compiler pulls this off.

Instruction selection

Given what we’ve seen in the first half of the post, it may be somewhat surprising to find through our trusty Opt Pipeline Viewer that it takes all the way until the instruction selection phase (also known as ISel) for the consecutive loads to be folded into single 32- or 64-bit loads. We’ll see in the next section how we can enable earlier simplification in some cases, but let’s take some time to understand what ISel is and how it implements folding.

Instruction selection is the first phase of the compiler back-end, which runs after all the generic optimisation passes (in LLVM this distinction is also reflected in the separation between the opt and llc binaries). Its purpose is to convert the LLVM IR produced by the optimiser into Machine IR (MIR), a target-specific representation of the instruction stream suitable for downstream passes such as scheduling or register allocation. Most LLVM targets, including x86, use SelectionDAG as their instruction selection framework. We won’t go into the weeds of SelectionDAG here9; for our purposes, it suffices to know that it includes its own peephole optimiser called DAGCombiner, analogous to InstCombine, which can capture simplification opportunities exposed by code generation or specific to that target.

Much like InstCombine visits each instruction in a basic block, DAGCombiner visits each node in the DAG built from those same instructions. It’s here, among the many transformations performed by the or visitor, that we find the MatchLoadCombine() method:

/// Match a pattern where a wide type scalar value is loaded by several narrow
/// loads and combined by shifts and ors. Fold it into a single load or a load
/// and a BSWAP if the targets supports it.
This is, of course, the exact pattern we see in the LLVM IR at the input to SelectionDAG, after load() is inlined, constant arguments propagated, and the loops fully unrolled. What’s interesting about this method, though, is that it doesn’t perform a simple pattern match as we’ve seen before in InstCombine, but instead tracks the origin of every single byte at the output of the or instruction recursively through the DAG (using the auxiliary calculateByteProvider() function), not just from loads but also through bitwise operations, byte swaps, and zero/sign extensions. This makes it powerful enough to also handle expressions like:

static uint64_t load_alt(const uint8_t* data, size_t sz, bool be) {
    uint64_t val = 0;

    for (size_t i = 0; i < sz; ++i)
        val = (val << 8) | data[be ? i : sz - i - 1];

    return val;
}
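As a sanity check, it is easy to verify that load_alt() agrees with the original load() for both byte orders; a quick comparison harness of my own (both function bodies are taken from the post):

```cpp
#include <cstddef>
#include <cstdint>

// Original formulation: shift each byte into its final position and OR it in.
static uint64_t load(const uint8_t* data, size_t sz, bool be) {
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

// Alternative formulation: shift the accumulator and OR in the next byte,
// reading the input in the opposite order.
static uint64_t load_alt(const uint8_t* data, size_t sz, bool be) {
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val = (val << 8) | data[be ? i : sz - i - 1];
    return val;
}
```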
In this case, however, we find that only the 32-bit functions are fully optimised. Although we have exactly the same number of LLVM IR instructions as before, this new version requires recursing deeper into the DAG to uncover the source of the bytes loaded earliest (since we do two operations on the output variable per iteration, instead of one), hitting the hard-coded limit of ten levels of recursion. This is a good example of how having a mental map of the compiler’s internal processing, as well as sticking to more common patterns, can help us avoid such pathological cases.10

Helping the compiler

Some seasoned C++ developers may have instinctively objected to this example. After all, if we only have a few possible argument values known at compile-time, why wouldn’t we just template our function rather than relying on the compiler to propagate the constants?

template <size_t sz, bool be>
static uint64_t load(const uint8_t* data) {
    uint64_t val = 0;

    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);

    return val;
}

uint32_t load_le32(const uint8_t* data) { return load<4, false>(data); }
uint32_t load_be32(const uint8_t* data) { return load<4, true>(data); }
uint64_t load_le64(const uint8_t* data) { return load<8, false>(data); }
uint64_t load_be64(const uint8_t* data) { return load<8, true>(data); }
The output is unchanged, but this time the IR for the little-endian functions is optimised into a single load long before we reach the instruction selection stage, in a pass called AggressiveInstCombine. Like InstCombine, this pass performs a variety of peephole optimisations, but it’s focused on those which tend to be less commonly needed or take longer to run, which is why it runs only once at -O2 (and not at all at -O1). One of these optimisations is implemented in the foldConsecutiveLoads() function, which matches patterns of the form “bitwise-or of (optionally) left-shifted, zero-extended, consecutive loads”, such as the ones we find in our functions. Since it relies on simple pattern matching, AggressiveInstCombine is not as powerful as DAGCombiner; in particular, it cannot optimise the fold used in load_alt() above even for 32 bits. It also has no impact on our big-endian loads because, as a middle-end optimisation pass (unlike DAGCombiner, which is part of the compiler back-end), it tries to produce target-independent IR as much as possible. Folding a big-endian load on a little-endian target, and vice-versa, would require adding a target-specific byte-swap intrinsic which would complicate downstream analysis and optimisation passes.
Exercise 5: Template the load_alt() function as in this section and examine the resulting output. Does it match our earlier results? What does this tell us about the interaction between templating and the optimisation pipeline?

Exercise 6: Instead of loading bytes in a different order depending on endianness, we can also fix the order and optionally use the __builtin_bswap64() intrinsic on the return from load(). How does this affect the ability of AggressiveInstCombine to optimise the code?
Ordering, redux

Why didn’t AggressiveInstCombine come into play in our initial non-templated version? If we inspect the optimisation pipeline, we’ll find that, in that earlier version, the pass ran after load() was inlined but, crucially, before the loop was unrolled, so there was no pattern to match yet. Function templates, on the other hand, are instantiated separately for each unique combination of parameters, so by the time they are inlined the loops have already been fully unrolled (since the iteration count is known) and can be matched by foldConsecutiveLoads(). We can obtain the same result with the initial code by manually running:

clang++ -S -emit-llvm -O2 -o- endianness.cpp \
  | opt -S -passes='aggressive-instcombine' - \
  | llc --x86-asm-syntax=intel
This also exemplifies an important principle of function inlining in LLVM (I’m not sure about other compilers): whenever possible, callees are simplified before callers in order to improve the accuracy of the inlining cost heuristics. The more information is available to the optimiser before inlining, the better it can make those decisions. In general, in fact, the earlier in the pipeline we can enable specific simplifications, the more optimised the final result will be (see the exercises for more examples of this).
Exercise 7: If we comment out all of the specialised functions but load_le32() we get similar results as when we template load(), i.e. AggressiveInstCombine successfully folds the consecutive loads. What optimisation pass enables this? How can we restructure performance-critical code to take advantage of it?

Exercise 8: If we now keep only load_le64() but replace load() with load_alt() above, the DAGCombiner is able to fully optimise the resulting LLVM IR. Compare with our earlier results to understand why this is the case.
Shooting ourselves in the foot

If you’ve read this far you may be under the impression that, while they may not be magic, optimisers are at the very least superhuman. Let’s quickly dispel that notion by looking at two examples where the obvious thing to do turns out not to be so obvious for the compiler. We’ll start by adding a small (and pretty useless) function to our initial example:

uint32_t load_le32(const uint8_t* data) { return load(data, 4, false); }

uint32_t load_add_le32(const uint8_t* data) { return load_le32(data) + data[0]; }
Take a moment before continuing to consider how you would expect this function to be compiled. Two loads and an add? Perhaps a load, zero extension of the low byte, and an add? Now compare with the actual result:

load_add_le32(unsigned char const*):
        movzx   ecx, byte ptr [rdi]
        movzx   edx, word ptr [rdi + 1]
        shl     edx, 8
        movzx   eax, byte ptr [rdi + 3]
        shl     eax, 24
        or      eax, edx
        or      eax, ecx
        add     eax, ecx
        ret
Huh. What happened here? Remember that the loads aren’t folded together until we reach the ISel phase, so by the time load_le32() is inlined it’s not yet fully optimised. It turns out DAGCombiner, when tracking the origin of each byte in the load, specifically rules out bytes which are used elsewhere in the function to prevent introducing duplicate loads. This can be sensible in general (even cache hits have a cost, after all), but it’s clearly suboptimal in this case.

Can we make use of AggressiveInstCombine to “hide” the duplicate load from DAGCombiner? The answer turns out to be yes, but we must be careful. This, for example, won’t work:

uint32_t load_add_le32(const uint8_t* data) { return load<4, false>(data) + data[0]; }
Although AggressiveInstCombine does run after load<4, false>() is inlined here, it also refuses to create duplicate loads11. As the saying goes, however, there’s no problem that cannot be solved with another layer of indirection: if we instead call the function template from load_le32(), we can get AggressiveInstCombine to optimise the unrolled loop before it’s inlined in load_add_le32(), producing the desired result (the additional cleverness to avoid a second load comes courtesy of the GVN pass).
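Concretely, the layered arrangement described above can be sketched as follows (function names follow the post's examples; whether the duplicate byte load is actually removed may vary by compiler version):

```cpp
#include <cstddef>
#include <cstdint>

template <size_t sz, bool be>
static uint64_t load(const uint8_t* data) {
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

// The extra layer of indirection: the loads inside load_le32() can be folded
// before load_le32() is itself inlined into load_add_le32().
uint32_t load_le32(const uint8_t* data) { return load<4, false>(data); }

uint32_t load_add_le32(const uint8_t* data) {
    return load_le32(data) + data[0];
}
```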
Exercise 9: What happens if we use val & 0xFF instead of data[0] in the examples above, and why? Can you make a small change to make this work regardless of which function we call?12
For our second and final act, let’s reconsider what we’re optimising for in the first place. We have been focusing on performance in the sense of execution time by using the -O<n> flags. But what about optimising for size with -Os? In this case, they happen to coincide: surely you cannot get any faster or smaller than a single 32- or 64-bit load. But does the compiler agree? Unfortunately, it does not. With the curious exception of load_be32() (or, more precisely, load<4, true>()), it has decided that unrolling the loops would go against its mandate to minimise code size, making it impossible for downstream passes to fold the loads. The solution here is to nudge it politely using #pragma unroll before the loop, allowing AggressiveInstCombine or DAGCombiner to do their job properly. (You can also avoid pragmas by writing a recursive function template if that’s more to your taste.)

Conclusion

What can we take away from this somewhat meandering exploration? Perhaps the overarching theme is: get familiar with your compiler and what it can and cannot do for you. More actionable advice:
- Use language facilities to make your intent clear. If you know something at compile-time, all else being equal, you can help the optimiser by stating it explicitly through e.g. C++ templates or Rust macros. In some cases it may even be preferable to choose at runtime from among a small set of specialised functions, instead of using a single generic function which cannot be optimised.
- Maximise the potential for early optimisation through things like templating (again), auxiliary variables, explicit constraints, or specialised functions. Having at least a rough idea of the different optimisation passes and their order can help with this. The earlier the compiler can optimise something, the greater the opportunity for beneficial knock-on effects.
- Stick to well-known patterns. The more common a construction is, the more likely that someone else already did the hard work of making the compiler recognise and optimise it. Simple code beats clever nine times out of ten.
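As an illustration of the first point, choosing at runtime from a small set of specialised instantiations keeps each path fully optimisable. A sketch under my own naming, not from the post:

```cpp
#include <cstddef>
#include <cstdint>

template <size_t sz, bool be>
static uint64_t load(const uint8_t* data) {
    uint64_t val = 0;
    for (size_t i = 0; i < sz; ++i)
        val |= static_cast<uint64_t>(data[be ? sz - i - 1 : i]) << (8 * i);
    return val;
}

// Runtime dispatch over a small set of specialised functions: each
// instantiation can compile down to a single (possibly byte-swapped) load.
uint64_t load_any(const uint8_t* data, size_t sz, bool be) {
    if (sz == 4)
        return be ? load<4, true>(data) : load<4, false>(data);
    if (sz == 8)
        return be ? load<8, true>(data) : load<8, false>(data);
    return 0; // unsupported size; real code might assert or fall back
}
```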
Appendix: other resources Haven’t had enough yet? Here are some helpful resources if you’re interested in diving deeper into LLVM in general and the optimiser in particular:
- Official LLVM documentation, in particular:
  - LLVM Language Reference Manual
  - LLVM Programmer’s Manual
  - The LLVM Target-Independent Code Generator
  - LLVM Loop Terminology (and Canonical Forms)
- Nikita Popov’s blog, in particular:
  - LLVM: The middle-end optimization pipeline
  - LLVM: Canonicalization and target-independence
  - LLVM: Scalar evolution
- A whirlwind tour of the LLVM optimizer (YouTube)
I’m using LLVM rather than gcc mainly because gcc does not perform the optimisations I analyse here, but also because LLVM provides better introspection tooling around its internal representation. Personally, I also find the LLVM codebase better documented and easier to navigate and understand. ↩
We could’ve also used __builtin_assume() or __builtin_unreachable() (which also works with gcc) for the same effect; all of these are translated into llvm.assume() intrinsics. ↩
Eight times by my count, at both -O1 and -O2. You can pass the -mllvm -print-pipeline-passes arguments to clang to see exactly which passes it will run. ↩
When reading through the LLVM code it’s helpful to keep in mind the following:
One important aspect of LLVM is that there is no distinction between an SSA variable and the operation that produces it. Because of this, any reference to the value produced by an instruction (or the value available as an incoming argument, for example) is represented as a direct pointer to the instance of the class that represents this value.
↩
There is some additional logic to make the optimisation work correctly for undefined values; you can read more about this in the LLVM documentation. ↩
It is enlightening, although quite repetitive, to go down the rabbit hole of the implementation for the isImpliedCondition() method to see just how many different cases it has to handle. ↩
Since taking the remainder of a division by zero is undefined behaviour, the compiler would be well within its rights to assume that count != 0 and proceed with the transformation. At the moment, at least, LLVM is not quite so ruthless. ↩
If you’re not familiar with SSA form, a phi node is a way to merge the different execution paths introduced by conditionals or loops; the output of a phi instruction is the value labelled by the basic block from which the current execution has proceeded. In this case that value is zero at the entrance to the loop and %rem after the first iteration. ↩
I recommend watching A Beginners’ Guide to SelectionDAG (YouTube) if you want to learn more. ↩
Hardware engineers used to writing HDL for FPGAs might find this familiar. Toolchains like AMD Vivado have strict requirements on what kinds of constructs they are able to recognise and synthesise as expected (a process called inference long before the term was taken over by machine learning), e.g. into state machines or hard IP blocks such as RAMs. ↩
Before PR #176101 it refused to fold loads even if only the final result was used multiple times, but that change was not included in the LLVM 22.x release branch. ↩
Hint: the iterate_1() function from the first half of the post uses a similar trick. ↩ |
The author includes supplemental exercises encouraging readers to experiment with the code and explore the debugging tools available, although those are not required to follow along with main text. The discussion highlights the importance of understanding the compiler’s behavior and how different code constructs influence the optimization process. It emphasizes how code should be structured to allow efficient processing by the compiler, including making the code intent explicitly clear. The author further distinguishes two separate optimization strategies: that of using a template to make our code more robust and efficient, and the alternative of using a function with side-effects to achieve the same result which might result in increased code size.
The key takeaway is that compiler optimization is not simply a matter of “letting the compiler do its thing”. Instead, it’s a collaborative effort between the developer and the compiler, where the developer’s understanding of the compiler’s capabilities—and providing the compiler with the necessary information—is critical to achieving optimal performance. The exploration of the different stages of an LLVM optimization pass, such as the InstCombine and Instruction Selection passes, provides a deeper insight into the mechanics of the compiler’s process.
Finally, the author makes a brief but pointed observation regarding the potential for suboptimal results when relying solely on generic code without explicitly stating the assumptions or limiting the scope of the operations. This reinforces the need for developers to consider the potential impacts of their code on the compiler’s optimization process and to provide sufficient constraints to guide the compiler effectively. |