Quantization from the ground up

Mar 25, 2026 • Sam Rose • 6,658 words • AI

Sam Rose is a Senior Developer Educator at ngrok, focusing on creating content that helps developers get the most out of ngrok.

Qwen-3-Coder-Next is an 80 billion parameter model, 159.4GB in size. That's roughly how much RAM you would need to run it, and that's before thinking about long context windows. This is not considered a big model. Rumor has it that frontier models have over 1 trillion parameters, which would require at least 2TB of RAM. The last time I saw that much RAM in one machine was never. But what if I told you we can make LLMs 4x smaller and 2x faster, enough to run very capable models on your laptop, all while losing only 5-10% accuracy? That's the magic of quantization. In this post, you are going to learn:
How a model's parameters make it so big
How floating point precision works and how models sacrifice it
How to compress floats using quantization
How to measure model quality loss after quantization
If you already know what parameters are and how floats are stored, feel free to skip straight to quantization.

What makes large language models so large?

Parameters, also called "weights," are the majority of what an LLM is when it's in memory or on disk. In my prompt caching post I wrote that LLMs are an "enormous graph of billions of carefully arranged operations." What do those graphs look like? Let's start with the simplest example: 1 input, 1 parameter, 1 output.

[Interactive diagram: input 2.0, parameter 0.5, output 1.0]

It doesn't look like much, but this is the fundamental building block of modern AI. It takes the input of 2.0, multiplies it by the parameter 0.5, and gets the output 1.0. LLMs, though, are much bigger. They have billions of these parameters in practice. One of the ways they get so big is that they have "layers." Here's how that looks.

[Interactive diagram: input 2.0 feeding a layer of two nodes, both showing 1.0, combining into an output of 1.5]

These two nodes in the middle are a layer. They both show 1.0 because both connections have a parameter of 0.5, and so they are the result of 2.0 * 0.5. Every connection between two nodes gets a parameter, so we have 4 in total above. When 2 connections end at the same node, the values are added together. So to get the output of 1.5 we add together 1.0 * 1.0 and 0.5 * 1.0. Play with the slider next to the input below to get a feel for how it affects the output.

[Interactive diagram: a 6-parameter network whose output updates as you drag the input slider]

This is still only 6 parameters; we're a long way from the billions seen in modern LLMs. The example below has 2 inputs, 3 layers, and 2 outputs. In total it has 64 parameters. Hover or tap a node to see its parameters.

[Interactive diagram: a network with 2 inputs, 3 layers, and 2 outputs — 64 parameters in total]

Modern LLMs have hundreds of thousands of inputs and outputs.
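None of this requires any machinery beyond multiplication and addition. Here's a minimal sketch in JavaScript of how a layer like the ones above computes its values. `layerForward` is a name I've made up for illustration, not something from a real framework:

```javascript
// Forward pass through one dense layer: each output node sums
// input * parameter over every incoming connection.
function layerForward(inputs, weights) {
  // weights[i][j] is the parameter on the connection
  // from input i to output node j.
  const nodeCount = weights[0].length;
  const outputs = new Array(nodeCount).fill(0);
  for (let i = 0; i < inputs.length; i++) {
    for (let j = 0; j < nodeCount; j++) {
      outputs[j] += inputs[i] * weights[i][j];
    }
  }
  return outputs;
}

// The two-node layer example: input 2.0, both connections 0.5.
const hidden = layerForward([2.0], [[0.5, 0.5]]); // [1.0, 1.0]
// Combine the layer into one output with parameters 1.0 and 0.5.
const output = layerForward(hidden, [[1.0], [0.5]]); // [1.5]
```

Modern LLMs run this same multiply-and-add pattern, just at a vastly larger scale.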
They have many dozens of layers, each with thousands of nodes, all densely connected together. This all multiplies out to billions, sometimes trillions, of parameters.

How do computers store numbers?

Computers work in 1s and 0s, called "bits." Here's what a whole number, or "integer," looks like when stored as bits. Drag the slider to change the value. You can also tap on individual bits to flip them.

[Interactive diagram: an unsigned 8-bit integer. Each bit represents a power of 2, and the set bits sum to the value, e.g. 0 1 0 0 0 0 0 0 = 64.]

Integers are nice to work with because they are discrete. Between 1 and 3 there is exactly one number: 2. This is great for computers: they can represent discrete values no problem. It gets trickier when you start thinking about decimal places. How many decimal numbers are there between 1 and 3? There are an infinite number of them. This is not good, because computers can't represent an infinite number of things. So computers compromise. They promise to be accurate up to so many significant figures, and anything after that is best-effort. For example, 32-bit floating point numbers span the range ±3.40×10^38 with 7 significant figures of accuracy. They do this by dividing the 32 bits into 3 parts: 1 sign bit, 8 exponent bits, and 23 significand bits. More exponent bits result in a larger range, while more significand bits result in more significant figures of accuracy. Play with the example below. Sliding left and right explores the full range of values. The plus and minus buttons at the top jump to the next higher or lower representable number. The 7 significant figures that are promised to be accurate are underlined. Press reset to return to the default value of 1.5.
[Interactive diagram: "float32" — the 32-bit pattern for 1.5: sign bit 0, exponent bits 01111111, significand bit 1 followed by 22 zeroes. Increment and decrement buttons step to the next representable value, and the 7 accurate significant figures are underlined.]

Whenever you press plus, it takes you to the next highest representable value. Pay attention to the digit after the underlined digits: it will sometimes skip over a number. This is the precision compromise in action. Pressing plus when you're on a very small value moves you forward a small amount. Pressing it when you're on a very large value moves you forward a larger amount. The size of the jump changes depending on where you are in the range; the values are not evenly distributed. To illustrate this point further, below is a histogram. Each bar shows you a slice of the 32-bit floating point range. There are over 2 billion unique values that can be represented between -0.5 and 0.5.

[Interactive chart: "Distribution of 32-bit float values" — a histogram of representable float32 values between -10 and 10. Most finite values cluster close to zero.]

A lot of the representable 32-bit floats are small values. This is fantastic for LLMs, because parameters also tend to be small. Small parameters have been found to result in models that generalise better to problems they haven't seen before, so models are rewarded during training for making parameters small. Below is another histogram, created by downloading 6 popular open source models and counting their parameter values.
Almost all parameters are very close to 0.

[Interactive chart: "Distribution of model parameter values" — a line chart comparing parameter distributions for six models. Most parameter values cluster near zero.]

So most model parameters sit in the range of floats that can be most precisely represented. There are a very small number of outliers, though. I'll come back to that later.

Can we use smaller floats?

Do language models actually need 32-bit floats? They don't need a wide range, as you can see from the parameter distribution histogram above, and do they really need 7 significant figures of accuracy? The answer is no: LLMs work just fine with smaller, less accurate floats. Below is an example of a 16-bit float. It works just like the 32-bit float, except it only has 5 exponent bits and 10 significand bits. It can represent 3 significant figures of precision, and has a range of ±65504. It also takes up half as much RAM and disk as a 32-bit float.

[Interactive diagram: "float16" — the 16-bit pattern for 1.5: sign bit 0, exponent bits 01111, significand bit 1 followed by 9 zeroes.]

We can mix and match the number of exponent and significand bits to get different precision/range tradeoffs. For example, the Google Brain team created the bfloat16 format, which has 8 exponent bits but only 7 significand bits. This gives it a very wide range, but only 2 significant figures of precision.
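A nice consequence of bfloat16 sharing float32's sign and exponent layout is that converting to it is, roughly, just keeping the top 16 bits of the float32 bit pattern. Here's a sketch in JavaScript using typed arrays — `toBfloat16` is a hypothetical helper, and it truncates where real converters usually round to nearest:

```javascript
// View a float32's raw bits and truncate to bfloat16 precision by
// keeping only the top 16 bits (sign + 8 exponent + 7 significand).
function toBfloat16(x) {
  const f32 = new Float32Array(1);
  const u32 = new Uint32Array(f32.buffer); // same bytes, viewed as bits
  f32[0] = x;
  // Zero out the low 16 significand bits. Real converters typically
  // round-to-nearest-even instead of truncating like this.
  u32[0] = u32[0] & 0xffff0000;
  return f32[0];
}

toBfloat16(1.5);    // 1.5 — exactly representable, survives intact
toBfloat16(0.2001); // ≈ 0.2 — only ~2-3 significant figures survive
```

This is one reason bfloat16 is cheap to support in hardware: the range is identical to float32, only precision is dropped.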
[Interactive diagram: "bfloat16" — the 16-bit pattern for 1.5: sign bit 0, exponent bits 01111111, significand bit 1 followed by 6 zeroes.]

Google found that 2 significant figures is sufficient for creating LLMs, and having this extremely wide range means they don't have to worry about any calculations overflowing, which can happen in larger LLMs using smaller floats. Some more extreme examples that are seen less often are float8 and float4. Below are just example configurations of these floats; in the wild, people will mix and match the number of exponent and significand bits to suit their needs.

[Interactive diagram: "float8" — an 8-bit float with 1 sign bit, 4 exponent bits, and 3 significand bits, range ±240. Shown: 1.5 as sign 0, exponent 0111, significand 100.]

[Interactive diagram: "float4" — a 4-bit float with 1 sign bit, 2 exponent bits, and 1 significand bit, range ±3. Shown: 1.5 as sign 0, exponent 01, significand 1.]

Another way to visualise the accuracy of these formats is to see how well they approximate a sine wave. Use the zoom buttons underneath the graph below to see how they differ.
[Interactive chart: "Different floats approximating sine" — an ideal sine wave compared with float32, float16, bfloat16, float8, and float4 approximations. Lower-precision formats look more step-like and drift further from the smooth curve.]

Let's talk about how we can use this knowledge to make models smaller.

What is quantization?

Quantization is the process of taking values from a large range and packing them into a smaller range. It is a form of lossy compression. When we convert between, e.g., a float16 and a float8, we tend to round to the closest representable value. We're taking values in the float16 range and mapping them to the nearest float8 value. This is called "round-to-nearest" and is one of many types of quantization. Slide the vertical bar along the number lines below. As you move the bar, the closest representable value for each float size is shown.

[Interactive diagram: parallel number lines, one per float size, with a draggable vertical bar showing the closest representable value in each format.]

This is a very simple way to take a value in one range and represent it inside a smaller range. However, it's not a good idea to do this for LLMs. Take a look at the small model below. See how the parameters and output change as you round from bfloat16 to float8 and float4.

[Interactive diagram: "Quantization by rounding" — a small model with parameters -0.89, 0.16, 0.08, -0.13, 0.16, -0.54, toggled between bfloat16, float8, and float4. In bfloat16, input 2.00 gives output 0.20.]

Rounding to float8 isn't too bad, but rounding to float4 completely breaks the model. This is because some of the parameters are now 0.
Because there is no path from input to output that doesn't multiply by 0, the output is always 0. This happens because we're being inefficient about how we squish values into the 4 bits we have available. A float4 goes from -3 to 3, but our parameters go from -0.89 to 0.16.

[Number line: the float4 range -3 to 3, with the parameter range -0.89 to 0.16 occupying only a small slice of it.]

On top of that, float4 can also represent Infinity and NaN. These aren't useful to us when quantizing. Let's look at how we can more efficiently use the 16 values a 4-bit number gives us.

Symmetric quantization

Instead of going from -3 to 3 like float4 does, what if we used a tighter range that better fits our data? One of the ways we can do this is by scaling our data into a new range. For example, if I had data in the range of -14 to 14, and I wanted to fit it into -7 to 7, I could divide by 2.

[Number line: values from -14 to 14 mapped onto the integer range -7 to 7 by dividing by 2.]

Then to get back to my original value from the scaled value, all I need to do is multiply it by 2. Any odd value would become unrepresentable in this scheme, as it would have to be rounded to the nearest integer. For example, 5 / 2 = 2.5 would round up to 3 and then dequantize to 3 * 2 = 6. This is what makes quantization lossy.

[Number line: 5 maps to 3 after rounding, which dequantizes back to 6 instead of 5.]

We can apply this process to any set of values; we just need to find the right scaling factor. We do this by finding the largest absolute value in our dataset and dividing it by the largest value in our quantized range. For our parameters that would be 0.89 / 7. Here's what the code would look like in JavaScript.

```
function quantize({ values, bits }) {
  // Assuming:
  //   values = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54]
  //   bits = 4
  const vmax = Math.max(...values.map(Math.abs)); // 0.89
  const qmax = 2 ** (bits - 1) - 1; // 7
  const scale = vmax / qmax; // 0.12714285714285714
  return {
    values: values.map((v) => Math.round(v / scale)),
    scale,
  };
}
```
```
function dequantize({ values, scale }) {
  return values.map((v) => v * scale);
}
```

The code below applies this quantize function to our parameter values from the small model we saw above.

```
const values = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54];
const quantized = quantize({ values, bits: 4 });
// { values: [-7, 1, 1, -1, 1, -4], scale: 0.12714285714285714 }
```

Our small floats have been mapped into the quantized integer range. Here's how that mapping looks visually:

[Number line: the original values between -0.89 and 0.89 mapped onto the integers -7 to 7.]

And then we can dequantize the values, multiplying them by the scale we got back from quantize, to get back to the original floating point range:

```
const dequantized = dequantize(quantized);
// [-0.89, 0.1271, 0.1271, -0.1271, 0.1271, -0.5086]
```

And again, how that looks visually:

[Number line: the quantized integers mapped back to -0.89, 0.1271, 0.1271, -0.1271, 0.1271, -0.5086.]

Comparing those values to the originals, we can see that they're mostly pretty close!

Original | Quantized | Delta | Delta %
-0.89 | -0.89 | 0 | 0.0%
0.16 | 0.1271 | -0.0329 | -20.6%
0.08 | 0.1271 | 0.0471 | +58.9%
-0.13 | -0.1271 | 0.0029 | -2.2%
0.16 | 0.1271 | -0.0329 | -20.6%
-0.54 | -0.5086 | 0.0314 | -5.8%
Average error: 18.0%

We've made our model 4x smaller, 16-bit to 4-bit, and on average the dequantized values are off by about 18%. Not bad! Let's have a look at how our small model performs using symmetric quantization. Play with the slider, flip between bfloat16 and quantized 4-bit, and note the differences.

[Interactive diagram: "Symmetrically quantized model" — the small model toggled between bfloat16 and quantized 4-bit parameters. In bfloat16, input 2.00 gives output 0.20.]

The quantized 4-bit model has the final output off by about 30%.
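The 18% average error figure is simple to compute yourself. A sketch, where `averageError` is a made-up helper name:

```javascript
// Average absolute relative error between original values and their
// quantized-then-dequantized counterparts.
function averageError(original, dequantized) {
  const relErrors = original.map(
    (v, i) => Math.abs((dequantized[i] - v) / v)
  );
  return relErrors.reduce((a, b) => a + b, 0) / original.length;
}

const original = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54];
const roundTripped = [-0.89, 0.1271, 0.1271, -0.1271, 0.1271, -0.5086];
averageError(original, roundTripped); // ≈ 0.18, i.e. about 18%
```

Note that per-value error and end-to-end output error differ, because individual errors can cancel out or compound as values flow through the network.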
This is a huge improvement over the always-zero behaviour of rounded float4, especially when you remember that this quantized model requires 4x less RAM than the original. But... we can still do better.

Asymmetric quantization

Symmetric quantization is called that because 0 is always in the middle. We pick a scale and squish our values around 0. Let's take a closer look at our symmetrically quantized values.

[Number line: the symmetric mapping again — values between -0.89 and 0.16 spread over the full range -0.89 to 0.89.]

There's a lot of wasted space on the positive side because our maximum value is 0.16, whereas our minimum value is -0.89. Because of symmetry, the positive range goes up to 0.89, leaving a large gap between 0.16 and 0.89. We aren't efficiently using the space. The idea behind asymmetric quantization is to fix this by allowing uneven ranges. Instead of squishing around 0, asymmetric quantization squishes around the midpoint of your data, and works out a "zero point" to offset by during dequantization.

```
function quantize({ values, bits }) {
  // Assuming:
  //   values = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54]
  //   bits = 4
  const vmax = Math.max(...values); // 0.16
  const vmin = Math.min(...values); // -0.89
  const qmax = 2 ** (bits - 1) - 1; // 7
  const qmin = -(2 ** (bits - 1)); // -8
  const scale = (vmax - vmin) / (qmax - qmin); // 0.07
  const zero = qmin - Math.round(vmin / scale); // 5
  return {
    values: values.map((x) => Math.round(x / scale + zero)),
    scale,
    zero,
  };
}
```

Let's run our familiar parameters through this and see what we get.

```
const values = [-0.89, 0.16, 0.08, -0.13, 0.16, -0.54];
const quantized = quantize({ values, bits: 4 });
// { values: [-8, 7, 6, 3, 7, -3], scale: 0.07, zero: 5 }
```

Here's how the mapping looks visually. Our new asymmetric quantize function has decided that 5 should be the zero point.
[Number line: the asymmetric mapping — values between -0.89 and 0.16 spread over the full integer range -8 to 7.]

The dequantize function is only slightly more complicated than the symmetric version, this time requiring a subtraction as well as a multiplication.

```
function dequantize({ values, scale, zero }) {
  return values.map((x) => scale * (x - zero));
}
```
```
dequantize(quantized);
// [-0.91, 0.14, 0.07, -0.14, 0.14, -0.56]
```

How does this compare to the average error with symmetric quantization?

Original | Quantized | Delta | Delta %
-0.89 | -0.91 | -0.02 | +2.2%
0.16 | 0.14 | -0.02 | -12.5%
0.08 | 0.07 | -0.01 | -12.5%
-0.13 | -0.14 | -0.01 | +7.7%
0.16 | 0.14 | -0.02 | -12.5%
-0.54 | -0.56 | -0.02 | +3.7%
Average error: 8.5%

Much better! 10% less error for the same number of bits. Let's see how it does on our model.

[Interactive diagram: "Asymmetrically quantized model" — the small model toggled between bfloat16 and asymmetric quantized 4-bit parameters. In bfloat16, input 2.00 gives output 0.20.]

The final output is still off by around 10%, but that's a really nice improvement over symmetric quantization. This is what's happening to the parameters of models when they're quantized down to sizes that are possible to run on your laptop. Instead of floats, small integers are what get stored and loaded into memory. When the time comes to use the quantized values, to generate an answer to a question for example, the values are dequantized on the fly. You might think this sounds slower, but we'll see later on that this actually ends up being faster as well as smaller.

How is quantization applied in practice?

Are people taking their LLMs with hundreds of billions of parameters, finding the largest and smallest of all of them, and quantizing the model in one go? No. I hinted at the reason for this earlier on. Let's take another look at that graph of parameters from 6 different open weight models, except this time let's look at the long tail of outlier values.

[Interactive chart: "Outlier parameters" — a line chart comparing the outlier tails of the six models' parameter distributions. Outliers are rare and only appear in a small number of bins.]

All of the models have a small number of outlier parameters: ones that are much larger or smaller than most others.
Outliers are really bad for quantization. Look what happens when we try to quantize our parameters from earlier with a single outlier of 10 added to them:

[Number line: with an outlier of 10, the values from -0.89 to 10 squished into the integer range -8 to 7; the original parameters all collapse into one or two buckets.]

Original | Quantized | Delta | Delta %
-0.89 | 0.726 | 1.616 | -181.6%
0.16 | 0 | -0.16 | -100.0%
0.08 | 0 | -0.08 | -100.0%
-0.13 | 0 | 0.13 | -100.0%
0.16 | 0 | -0.16 | -100.0%
-0.54 | 0.726 | 1.266 | -234.4%
10 | 10.164 | 0.164 | +1.6%
Average error: 116.8%

Everything gets squished into a small number of buckets and the average error goes through the roof. If we quantized the entire model in one go, we'd destroy it. What's done in practice is quantization in blocks, usually around 32-256 parameters at a time. This way, the impact of outliers is contained. To dequantize, we need to save the scale value for symmetric quantization, and the scale plus zero point for asymmetric. These get stored alongside each block and are considered overhead. Choosing a larger block size reduces this overhead, but larger blocks have a wider range of values on average, increasing error. It's a trade-off.

"Why do these outliers exist?" It's weird, isn't it? I got fascinated by these values a while back, and if you want to learn more about them I recommend reading:
This paper by Apple
This post by Tim Dettmers
tl;dr: no one conclusively knows, but a small fraction of these outliers are very important to model quality. Removing even a single "super weight," as Apple calls them, can cause the model to output complete gibberish. Given their importance, real-world quantization schemes sometimes do extra work to preserve these outliers. They might do this by not quantizing them at all, or by saving their location and value into a separate table, then removing them so that their block isn't destroyed. During dequantization this table is consulted and the outliers are restored.

How much does quantization affect model accuracy?

In this section I'm going to show you a number of ways quality loss in LLMs can be measured. All of these measures have pros and cons. If you're evaluating quantized models for a critical use-case you have, nothing beats creating your own benchmark for the specific task you're asking the model to perform. Disclaimers aside, all of the following tests were performed against the Qwen3.5 9B model, and I've put details about all of the commands I ran in the appendix at the end of this post.

Perplexity

What LLMs are doing under the hood is creating probability distributions of what the likely next "token" is for a given prompt. For example, if I prompt Qwen3.5 9B with "The answer to 2 + 2 is", it gives me these probabilities for what it thinks the next token should be:

Token | Probability
4 | 92.29%
5 | 3.23%
3 | 1.15%
1 | 0.90%
2 | 0.85%
...and many more less likely options.

It's given a high probability to the token 4, which makes sense given that's the correct answer. Concerning that it'll say 5 3% of the time, but I digress. The idea behind "perplexity" as a measurement is to collapse these probability distributions down into a single number that's easy to reason about. Calculating perplexity involves a little bit of math, but I promise it's not too bad.
For the single prediction above, "The answer to 2 + 2 is", we take the probability of the correct token, 4, and we do this:

```
const pCorrect = 0.9229; // 92.29%, probability for `4`
const perplexity = Math.exp(-Math.log(pCorrect)); // 1.08
```

Lower scores are better, and the way you're supposed to read this is "the model considers there to be ~1.08 plausible tokens that complete this prompt." The lower the perplexity, the higher the probability the model gave to the correct token. When I give Qwen3.5 9B the prompt "And then I", the possibilities are more spread out.

Token | Probability
have | 3.02%
was | 3.00%
realized | 2.98%
found | 2.73%
'm | 2.73%
got | 2.63%
will | 2.54%
'll | 2.43%
thought | 2.38%
saw | 2.33%
went | 2.16%
...and many more somewhat equally likely options.

If we assume the correct next token is "was", the perplexity calculation for this probability distribution becomes:

```
const pCorrect = 0.03; // 3%, probability for `was`
const perplexity = Math.exp(-Math.log(pCorrect)); // 33.33
```

Much higher, but then again: what would you have predicted comes after "And then I"? It's a far more ambiguous prompt, and it makes sense for models to be less confident predicting it. I used llama.cpp's llama-perplexity tool to measure the perplexity of Qwen3.5 9B at different quantization levels. The way it works is you give it a reference text, it takes a sliding window of tokens over that reference text as the prompt, and it uses the next token in the reference text to know what the correct token should be. To illustrate this sliding window idea, below you can see the prompt tokens and the correct token. Use the forward and backward buttons to move the sliding window.

[Interactive diagram: "Sliding prompt window" over the text "Help, I'm stuck in a test dataset used for measuring the perplexity of quantized models." — e.g. with the window covering "Help, I'm stuck in a test dataset", the correct next token is "used".]
At each step, the -Math.log(pCorrect) is accumulated, and at the end we Math.exp the average. If we moved this sliding window across the whole reference text and collected the correct token probabilities at each step into probs, the end calculation would look like this:

```
let total = 0;
for (const prob of probs) {
  total += -Math.log(prob);
}
const perplexity = Math.exp(total / probs.length);
```

The llama.cpp project likes to use wikitext-2's test dataset as the reference text, so I did the same. It's just the contents of the Wikipedia page on Robert Boulter; I have no idea why. I wonder if he knows he's used to benchmark quantized LLM quality... Anyway, let's take a look at the results.

Format | Perplexity
bfloat16 | 8.186
8-bit symmetric | 8.193 (+0.1%)
4-bit asymmetric | 8.563 (+4.6%)
4-bit symmetric | 8.71 (+6.4%)
2-bit asymmetric | 66.1 (+707.5%)

We see almost no change with 8-bit symmetric, small degradation in the 4-bit variants, and then almost complete collapse in the 2-bit variant. Quantization has caused the model to become less confident, considering a wider selection of tokens on average. Crucially, perplexity only considered the correct token in its calculations. The probability of all of the other tokens is not used. As such, perplexity doesn't capture the full picture of how quantization has affected a model.

KL divergence

Short for "Kullback-Leibler divergence," this is a measurement that tells us how well 2 probability distributions overlap. Play with the slider below. Try to get the KL divergence number to 0.

[Interactive chart: two bell-shaped distributions with the same width overlaid — the original stays centered while a slider shifts the quantized one left and right, updating the KL divergence readout.]
The only time that KL divergence is 0 is when the 2 distributions exactly overlap. The further apart they are, the higher the KL divergence. It's not just horizontal skew that increases KL divergence; any sort of non-overlapping will do it.

[Interactive chart: two bell-shaped distributions centered at the same position, with the quantized one squashed to 0.75 times the original height, making it shorter and wider. The KL divergence reads 0.069.]

This measurement can be applied to the token probability distributions output by an LLM. The distribution below shows the probability of each digit from 0 to 9 following the prompt "The answer to 2 + 2 is". Toggle between different levels of quantization to see how it changes, along with the KL divergence score at the top.
[Interactive chart: next-token probabilities for digits 0-9, original vs quantized, switchable between 8-bit, 4-bit (asym), 4-bit (sym), and 2-bit. At 8-bit the two distributions are nearly identical (e.g. token 4: original 92.3%, 8-bit 92%) and the KL divergence reads 0.000.]

There's sadly no intuitive way to think about the KL divergence score other than "higher is worse." There's not even a natural maximum; we can't say, for example, "KL divergence is always between 0 and 1." It differs based on properties of the model. For that reason, it's only valid to compare the score between quantizations of the same model. The llama-perplexity tool can also be used to measure KL divergence by passing a --kl-divergence flag, details in the appendix. I used wikitext-2 as the reference text again.

Format | Mean KL divergence
8-bit symmetric | 0.0008
4-bit asymmetric | 0.0593
4-bit symmetric | 0.0675
2-bit asymmetric | 2.1447

While KL divergence has the downsides of being difficult to reason about intuitively, and only being comparable between quantizations of the same model, one of the benefits it has over perplexity is that it considers the entire probability distribution of each prediction. Perplexity only cares about the probability of the correct token. The probability of every other token could change, but if the correct one stayed the same, the perplexity wouldn't change. With KL divergence, if the entire distribution changed, the score would be higher. KL divergence captures a fuller picture of how quantization has changed the model's behavior.
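If you want to compute it yourself: for discrete distributions, KL divergence is the sum of p(i) * log(p(i) / q(i)) over every token i, where p is the original model's distribution and q is the quantized model's. A minimal sketch, using made-up example distributions rather than real model outputs:

```javascript
// KL divergence between two discrete probability distributions.
// p and q are arrays of probabilities that each sum to 1, and every
// entry of q is assumed to be non-zero wherever p is non-zero.
function klDivergence(p, q) {
  let total = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] > 0) {
      total += p[i] * Math.log(p[i] / q[i]);
    }
  }
  return total;
}

// Identical distributions diverge by exactly 0...
klDivergence([0.9, 0.1], [0.9, 0.1]); // 0

// ...and the divergence grows as the distributions drift apart.
klDivergence([0.9, 0.1], [0.6, 0.4]); // ≈ 0.226
```

Note the asymmetry: klDivergence(p, q) is generally not equal to klDivergence(q, p), which is why tools conventionally measure the quantized model against the original, not the other way around.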
Benchmarking

I wrote a whole post on LLM benchmarking, and one of the ways people measure the impact of quantization is to compare a model's score on some benchmarks before and after quantization. For this post I decided to run the GPQA Diamond benchmark. I wrote about this benchmark if you're interested in the full details, but for this post it's enough to know that it's a set of 198 very hard multiple choice questions in biology, chemistry, and physics. Each question has 4 answers to choose from, so random guessing should score 25% on average. Details on how I ran this benchmark are in the appendix.

Format              Correct answer   Incorrect answer   No answer
bfloat16            66.7%            33.3%              0%
8-bit symmetric     73.2%            26.8%              0%
4-bit asymmetric    62.6%            36.4%              1%
4-bit symmetric     66.2%            29.3%              4.5%
2-bit asymmetric    1%               2%                 97%

If you're confused by the results, don't worry. So was I. The 8-bit quantization scores higher than the unquantized model in its original bfloat16 form. It's hard to say exactly why this happens; it's possible that it is simply luck. With multiple choice there's always that 25% chance of being right by accident, and with only 198 questions the standard error on these scores is around 3 percentage points, so a gap of a few points may just be noise. What I'm taking away from these scores is that the 8- and 4-bit quantizations perform well, but the 2-bit quantization has fallen off a cliff. No answer was found for 97% of the questions, which could mean the model got stuck in a loop or didn't understand what it was being asked to do.

Just talk to it

The last test is the easiest but least rigorous: just talk to it! I asked the same question at all of the different quantization levels to see how they would respond. The question was: "What is the capital city of England, UK?"

Format              Answer
Original bfloat16   The capital city of England and the United Kingdom is London.
8-bit symmetric     The capital city of England and the United Kingdom is London.
4-bit asymmetric    The capital city of England is London. It is also the capital city of the entire United Kingdom.
4-bit symmetric     The capital city of England is London.
2-bit asymmetric    <no answer>

All correct except for the 2-bit quantization, which refused to answer almost every time I asked it. The "reasoning trace" got a little bit unhinged:

    The capital city of England is London. The capital city of England is London. The capital city of England is London. The capital city of England is London. The capital city of England is London. The capital city of England is London. The capital city of England is London. The capital city of England is London. The capital city of London The capital city of London The capital city of London The capital city of London The capital city of London The capital city of London The capital city of London The capital city of London

But then it would go on to produce an empty response. I think it's fair to say that 2-bit quantization is too much information loss for Qwen3.5 9B, and it's not a useful model at that level of compression. Clearly this is not a thorough or scientific test, but it can be useful to help put all of the other scores into perspective.

How much does quantization affect model speed?

The last thing I want to talk about is model speed. Smaller quantizations tend to also produce faster models, mostly because there is less data to move around inside your GPU. The llama.cpp project comes to the rescue again here with its llama-bench tool, which I ran both on a MacBook Pro M1 Max and a rented H100 SXM GPU from Runpod. Performance figures are given in tokens per second, i.e. how fast the model generates responses.

Format              M1 Max    H100
bfloat16            19.45     106.85
8-bit symmetric     32.36     141.61
4-bit asymmetric    43.32     175.70
4-bit symmetric     46.05     177.06
2-bit asymmetric    40.25     166.90
(Unit: tokens per second)

There's a big difference in performance between the original bfloat16 and the 8- and 4-bit quantizations. It's not obvious to me why the 2-bit is slower than 4-bit, though.
I would have expected this to be faster. If you know why this is, please reach out and let me know!

Conclusion

The main thing I want you to take away from this post is that quantized models are pretty good, actually. When I first pitched the idea of writing this post, I didn't know anything about how quantization worked. I had assumed that model quality would degrade linearly as you compress it: start at bfloat16, and an 8-bit quantization of that would be half as good; a 4-bit quantization would then be half as good as the 8-bit version, and so on. That doesn't appear to be true. 16-bit to 8-bit carries almost no quality penalty. 16-bit to 4-bit is more noticeable, but it's certainly not a quarter as good as the original; closer to 90%, depending on how you want to measure it. So don't be afraid to run local models that are quantized. There's a quality cliff at some point, but I've given you the tools to identify where it is. Provided you haven't fallen off that cliff, quantized models work well, and you shouldn't shy away from trying them just because they're compressed!

As you experiment with quantized local models, consider giving ngrok's AI gateway a shot. Use it to route LLM requests to these local models, whether they're running on your laptop or on a rented GPU in the cloud.

Further reading

This post has focused entirely on what's called "post-training quantization," or PTQ. Some models these days go through "quantization-aware training," or QAT, which introduces quantization during pre-training and helps the model learn parameters that will quantize well. I also only covered a couple of quantization methods; there are a lot of more complex alternatives with different trade-offs. If you're interested, I recommend looking up AWQ and GPTQ. Quantization is also only one method of reducing the size of LLMs; there's also parameter pruning and distillation.
Efficient Large Language Models: A Survey is a little old now (May 2024) but it gives a good overview of the methods being pursued to make LLMs more efficient.

Appendix

Installing llama.cpp

    brew install llama.cpp

Downloading Qwen 3.5 9B

    llama-server -hf unsloth/Qwen3.5-9B-GGUF:BF16 --port 8000

This downloads the bfloat16 version of Qwen 3.5 9B, then runs an OpenAI-compatible server on port 8000 backed by the model.

Running the GPQA benchmark

You can clone the official GPQA repository and unzip the question set like this:

    git clone git@github.com:idavidrein/gpqa.git
    cd gpqa
    uv venv --python 3.9
    uv pip install -r requirements.txt
    unzip -P deserted-untie-orchid dataset.zip

But I found that it doesn't let you run the benchmark against an arbitrary local model. I made some modifications (sorry they're a bit noisy, my editor changed a bunch of quotes and stuff) to allow me to run:

    OPENAI_BASE_URL=http://localhost:8000/v1 uv run python baselines/run_baseline.py main \
        --model_name qwen3.5-9b-bf16 \
        --data_filename dataset/gpqa_diamond.csv \
        --prompt_type zero_shot \
        --max-tokens 30000 \
        --verbose

I ended up running the benchmark against a llama-server that allowed for 4096 reasoning tokens:

    llama-server -hf unsloth/Qwen3.5-9B-GGUF:BF16 --reasoning-budget 4096

Quantizing the model

    cd ~/Library/Caches/llama.cpp
    llama-quantize unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-BF16.gguf unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-Q8_0.gguf Q8_0

There's a lot going on there, I'll break it down:
~/Library/Caches/llama.cpp: the directory where llama.cpp stores models it has downloaded.

unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-BF16.gguf: the filename given to the model we downloaded when we ran llama-server -hf unsloth/Qwen3.5-9B-GGUF:BF16 earlier.

unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-Q8_0.gguf: the filename I want to give to my newly quantized model.

Q8_0: the quantization format. llama.cpp has a special format for these strings: "Q8" means quantize to 8-bit, and "_0" means symmetric quantization.
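To make "symmetric quantization" concrete, here's a simplified sketch of the idea. Note that llama.cpp's real Q8_0 format works block-wise, with a separate scale for each small block of values; the per-tensor scope and function names here are my own simplification, not llama.cpp's implementation:

```python
def quantize_symmetric_8bit(values):
    """Map floats to int8 codes using a single scale. Zero maps exactly
    to 0, which is what makes the scheme 'symmetric'."""
    scale = max(abs(v) for v in values) / 127  # largest magnitude -> +/-127
    return [round(v / scale) for v in values], scale

def dequantize(ints, scale):
    """Recover approximate floats from the quantized ints."""
    return [i * scale for i in ints]

weights = [0.52, -1.3, 0.007, 2.54, -0.9]
ints, scale = quantize_symmetric_8bit(weights)
recovered = dequantize(ints, scale)
# Each recovered value is close to, but not exactly, the original:
# the quantization error is at most half a step (scale / 2).
```

Storing one int8 per parameter plus a scale is what gets the file to roughly half the size of the bfloat16 original.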
That command only took a minute on my laptop, and the resulting file is just over half the size of the original. Here are the format strings to use for each of the quantization levels I used throughout the post:

Format              String
8-bit symmetric     Q8_0
4-bit asymmetric    Q4_1
4-bit symmetric     Q4_0
2-bit asymmetric    Q2_K

llama.cpp doesn't seem to offer an 8-bit asymmetric or 2-bit symmetric option as far as I can tell. Also, I know that the 2-bit quantization I'm using is different to the rest. I watched the fantastic reverse-engineering GGUF video from Julia Turc, but the choice was between a K-quant or an I-quant; there's no legacy quant available for 2-bit. So I compromised. I also suspect this has something to do with why my llama-bench results for 2-bit were weird.

Measuring perplexity and KL divergence

I used the get-wikitext-2.sh script from the llama.cpp repo to download the wikitext-2 test dataset, then ran:

    llama-perplexity -hf unsloth/Qwen3.5-9B-GGUF:BF16 -f wikitext-2-raw/wiki.test.raw -c 512 --kl-divergence-base ~/reference.kld

This saves the KL divergence data you'll need when measuring the quantized versions of the model. Use it like this:

    llama-perplexity -m path/to/quantized/model.gguf -f wikitext-2-raw/wiki.test.raw -c 512 --kl-divergence-base ~/reference.kld --kl-divergence

Talking to models locally

    llama-cli -hf unsloth/Qwen3.5-9B-GGUF:BF16

Measuring performance

    llama-bench -hf unsloth/Qwen3.5-9B-GGUF:BF16
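For contrast with the symmetric scheme, here's a simplified sketch of the idea behind an asymmetric format like Q4_1: a scale plus a zero point lets the 16 available 4-bit codes cover exactly the min-to-max range of the data, even when that range isn't centered on zero. This is my own illustration, not llama.cpp's block-wise implementation:

```python
def quantize_asymmetric_4bit(values):
    """Map floats to unsigned 4-bit codes (0..15) using a scale and a
    zero point, so the codes span exactly [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15            # 16 codes -> 15 steps between them
    zero_point = lo                   # code 0 represents lo exactly
    codes = [round((v - zero_point) / scale) for v in values]
    return codes, scale, zero_point

def dequantize(codes, scale, zero_point):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale + zero_point for c in codes]

# A skewed set of values, like the lopsided parameter
# distributions seen in real models:
weights = [-0.1, 0.2, 0.5, 0.9, 1.4]
codes, scale, zero_point = quantize_asymmetric_4bit(weights)
recovered = dequantize(codes, scale, zero_point)
```

A symmetric 4-bit scheme would have to cover -1.4 to 1.4 here to represent the largest magnitude, wasting codes on negative values that never occur; the zero point shifts the range to hug the actual data.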