
Floor and Ceil versus Denormals on CPU and GPU

Recorded: May 30, 2026, 10:01 a.m.


Sat, 23 May 2026

Recently, I dove deep into floating-point numbers and their behavior. This topic has haunted me in my programming practice ever since I created the Floating-Point Formats Cheatsheet back in 2013. I also released a comprehensive article, The Secrets of Floating-Point Numbers, in 2024.
This time, I would like to focus on one specific question:
What is the result of floor(-1.175493930432748e-38) ?
Note: Hexadecimal value of our input number is 0x807FFFFD.
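As a quick sanity check of that bit pattern, here is a minimal C++ sketch that reinterprets 0x807FFFFD as a 32-bit float and splits it into its sign, exponent, and mantissa fields. The helper names (FloatFromBits, SignBit, ExponentField, MantissaField, PrintFields) are made up for illustration, and the code assumes float is an IEEE 754 binary32.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical helper: reinterpret a 32-bit pattern as a float (assumes IEEE 754 binary32).
inline float FloatFromBits(std::uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// binary32 layout: 1 sign bit, 8 exponent bits, 23 mantissa bits.
inline std::uint32_t SignBit(std::uint32_t bits)       { return bits >> 31; }
inline std::uint32_t ExponentField(std::uint32_t bits) { return (bits >> 23) & 0xFFu; }
inline std::uint32_t MantissaField(std::uint32_t bits) { return bits & 0x007FFFFFu; }

// Print the decoded value and the three fields of a bit pattern.
inline void PrintFields(std::uint32_t bits)
{
    std::printf("value=%.15e sign=%u exponent=%u mantissa=0x%06X\n",
                static_cast<double>(FloatFromBits(bits)),
                static_cast<unsigned>(SignBit(bits)),
                static_cast<unsigned>(ExponentField(bits)),
                static_cast<unsigned>(MantissaField(bits)));
}
```

Calling PrintFields(0x807FFFFDu) should report a set sign bit, an exponent field of 0, and a mantissa of 0x7FFFFD, which is the layout discussed later in the Denormals section.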
Floor, ceil, trunc, round
To recap, floor, ceil, trunc, and round are functions available in the standard libraries of C and C++, as well as in the shading languages HLSL and GLSL. Each of them transforms a floating-point number into an integral floating-point value, but using different rounding rules.
Note that this is not about converting from float to int. The result of these functions is still a float, just one with only an integral part. When the input is already integral, the value is returned as-is. Otherwise, it gets "snapped" to the nearest integer in a specific direction:

floor - rounding "down" i.e., towards -infinity.
ceil - rounding "up" i.e., towards +infinity.
trunc - rounding towards zero, which we can also explain as truncating the fractional part.
round - rounding to the nearest integer up or down, depending on which one is closer.

Examples:
Note: IEEE 754 floating-point numbers distinguish between positive and negative zero, so some results below are -0.0 rather than 0.0 to visualize this distinction.
floor( 5.7) = 5.0 ceil( 5.7) = 6.0 trunc( 5.7) = 5.0 round( 5.7) = 6.0
floor( 0.2) = 0.0 ceil( 0.2) = 1.0 trunc( 0.2) = 0.0 round( 0.2) = 0.0
floor(-0.2) = -1.0 ceil(-0.2) = -0.0 trunc(-0.2) = -0.0 round(-0.2) = -0.0
floor(-5.7) = -6.0 ceil(-5.7) = -5.0 trunc(-5.7) = -5.0 round(-5.7) = -6.0
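The table above can be reproduced on the CPU with the standard C++ functions. The following is a small sketch, not the author's test program; PrintRow, PrintTable, and IsNegativeZero are hypothetical helper names. Since -0.0 compares equal to 0.0, the sign of a zero result is checked with std::signbit.

```cpp
#include <cmath>
#include <cstdio>

// -0.0f == 0.0f is true, so the sign of a zero has to be inspected explicitly.
inline bool IsNegativeZero(float x)
{
    return x == 0.0f && std::signbit(x);
}

// Print one row: the input followed by floor, ceil, trunc, and round of it.
inline void PrintRow(float x)
{
    std::printf("x=%+.1f floor=%+.1f ceil=%+.1f trunc=%+.1f round=%+.1f\n",
                x, std::floor(x), std::ceil(x), std::trunc(x), std::round(x));
}

// Print the four rows used in the examples.
inline void PrintTable()
{
    const float inputs[] = { 5.7f, 0.2f, -0.2f, -5.7f };
    for (float x : inputs)
        PrintRow(x);
}
```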

When talking about round, there is also the question of what happens when the input is exactly halfway between two integers, as in round(2.5). Various programming languages define it differently:

The standard C/C++ function rounds halfway cases away from zero, so round(2.5) = 3.0, round(-3.5) = -4.0.
The HLSL function rounds halfway cases towards the nearest even number, so round(2.5) = 2.0, round(-3.5) = -4.0.
GLSL leaves the behavior of the function round implementation-dependent for halfway cases, while also offering the function roundEven that rounds towards the nearest even number.
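Both halfway rules can be tried out in C++. In this sketch, std::round gives the round-half-away-from-zero behavior, and std::nearbyint is used as a stand-in for round-half-to-even. That substitution is an assumption: std::nearbyint follows the current floating-point rounding mode, which is round-to-nearest-even by default, so the code below is only valid if nobody has changed that mode. The wrapper names are hypothetical.

```cpp
#include <cmath>

// Halfway cases go away from zero, as in standard C/C++ round.
inline float RoundHalfAwayFromZero(float x)
{
    return std::round(x);
}

// Halfway cases go to the nearest even integer.
// Assumption: the default FE_TONEAREST rounding mode is in effect,
// because std::nearbyint rounds according to the current mode.
inline float RoundHalfToEven(float x)
{
    return std::nearbyint(x);
}
```

The two functions differ only on exact halfway inputs whose round-away result is odd, such as 2.5. For -3.5, both rules give -4.0.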

Knowing all this, we can answer our main question:
According to mathematical rules, floor(-1.175493930432748e-38) = -1.0, because the number is between -1 and 0.
Denormals
However, those of you who know more about the structure of floating-point numbers may notice that our input value is a subnormal. Subnormal numbers, also called denormalized numbers or denormals, are values so small (so close to 0) that they use a special representation where the implicit leading 1 bit is no longer assumed. Their exponent field is 0 and their mantissa is nonzero. The minimum positive normalized value representable by 32-bit floats is about 1.18 * 10^-38, while the minimum positive denormalized value is about 1.4 * 10^-45. The magnitude of our number falls between the two.
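To make the definition concrete, the sketch below tests for a subnormal in two ways: by inspecting the bits (exponent field equal to 0, mantissa not equal to 0) and by asking the standard library through std::fpclassify. BitsFromFloat and IsSubnormalByBits are hypothetical names, and the code assumes float is an IEEE 754 binary32 and that the CPU is not running in a mode that treats denormals as zero.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: copy the raw bits of a float into a 32-bit integer.
inline std::uint32_t BitsFromFloat(float x)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// Subnormal: exponent field is 0 and mantissa is nonzero (the sign can be either).
inline bool IsSubnormalByBits(float x)
{
    const std::uint32_t bits = BitsFromFloat(x);
    return (bits & 0x7F800000u) == 0u && (bits & 0x007FFFFFu) != 0u;
}

// The same question answered by the standard library.
inline bool IsSubnormalByClassify(float x)
{
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Boundaries of the subnormal range for float.
inline float MinNormal()    { return std::numeric_limits<float>::min(); }        // about 1.18e-38
inline float MinSubnormal() { return std::numeric_limits<float>::denorm_min(); } // about 1.4e-45
```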
We wouldn't need to care about denormals if not for the fact that:

some platforms preserve them (processing the values they represent),
while others "flush" them to zero (treating them as if the number was exactly 0).

This is the problem I stumbled upon recently. In most cases, it doesn't matter. For example, when rendering graphics, replacing such a small number with 0 would make no visible difference in the results. After applying functions such as floor and ceil, however, the difference is significant:
If the platform preserves denormals:

floor(-1.175493930432748e-38) = -1.0 ceil(-1.175493930432748e-38) = -0.0
floor( 1.175493930432748e-38) = 0.0 ceil( 1.175493930432748e-38) = 1.0

If the platform flushes denormals to 0:

floor(-1.175493930432748e-38) = -0.0 ceil(-1.175493930432748e-38) = -0.0
floor( 1.175493930432748e-38) = 0.0 ceil( 1.175493930432748e-38) = 0.0
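Both tables can be imitated on a CPU that preserves denormals by flushing the input in software before calling floor or ceil. This is only an emulation under that assumption, not how any GPU implements it. FlushDenormalInput and the four wrappers are hypothetical names.

```cpp
#include <cmath>

// Hypothetical emulation of "treat denormal inputs as zero":
// a subnormal becomes a zero with the same sign, anything else passes through.
inline float FlushDenormalInput(float x)
{
    return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}

// Behavior of a platform that preserves denormals
// (assumes the host CPU itself preserves them).
inline float FloorPreserving(float x) { return std::floor(x); }
inline float CeilPreserving(float x)  { return std::ceil(x); }

// Behavior of a platform that flushes denormal inputs to zero.
inline float FloorFlushing(float x) { return std::floor(FlushDenormalInput(x)); }
inline float CeilFlushing(float x)  { return std::ceil(FlushDenormalInput(x)); }
```

Only the two cases that round away from zero change: floor of a negative denormal and ceil of a positive one. The other two results are a zero either way.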

The behavior of a specific platform may depend on many factors, such as the flags used when compiling our source code, as well as floating-point modes controlled at runtime. It may be an unexpected source of nondeterminism between CPU and GPU, as well as between GPU vendors.
I've performed a few tests. Here are my results:

An x86 64-bit CPU (AMD Ryzen 7 7800X3D, but I don't expect differences between AMD and Intel here) on Windows, executing C++ code compiled using Visual Studio 2022, appeared to preserve denormals when doing floor and ceil. I've tested the following options, with no change in the results:

Both Release and Debug configurations (with and without compiler optimizations)
With Floating Point Model parameter set to /fp:precise, /fp:strict, /fp:fast
With and without Enable Intrinsic Functions parameter /Oi
With Enable Enhanced Instruction Set parameter unset or set to /arch:SSE2, /arch:AVX2

A GPU from Nvidia (GeForce RTX 4090 - Ada architecture), executing a Direct3D 12 program with HLSL code compiled using the modern DXC shader compiler, flushes denormals. I've tested the following options, with no change in the results:

With and without DXC parameter -Gis (Force IEEE strictness)
With and without DXC parameter -denorm preserve

A GPU from Intel (Arc B580 - Xe2-HPG architecture) executing the same shader flushes denormals by default. However, using -denorm preserve makes it preserve denormals.
A GPU from AMD (Radeon RX 6800 XT - RDNA2 architecture) behaves like Intel: it also flushes denormals by default, and using -denorm preserve makes it preserve denormals.

This is not the first time we can see Nvidia taking shortcuts to achieve maximum performance of their GPUs 😉 We could see another example in the article "Mipmap selection in too much detail" by Pema Malling.
Note: Architectures may support two related modes: flushing denormal results to zero and treating denormal inputs as zero. The behavior observed here suggests the latter.
UPDATE: As Pete Cawley commented on X, the DirectX Specification, chapter 3.1.3.2 "Complete Listing of Deviations or Additional Requirements vs. IEEE-754", actually requires GPUs to flush denorms on input and output of any floating-point operation.
Deterministic solution
If, for any reason, you need an implementation of these functions that behaves consistently and deterministically and that preserves denormals across CPUs and all GPUs, you can use the following HLSL code implementing custom floor and ceil using simple bit tricks:
float DeterministicFloor(float x)
{
    if ((asuint(x) & 0xFF800000u) == 0x80000000u && // sign = 1, exponent = 0
        (asuint(x) & 0x007FFFFFu) != 0)            // mantissa != 0
    {
        return -1.0f;
    }
    return floor(x);
}

float DeterministicCeil(float x)
{
    if ((asuint(x) & 0xFF800000u) == 0 &&  // sign = 0, exponent = 0
        (asuint(x) & 0x007FFFFFu) != 0)    // mantissa != 0
    {
        return 1.0f;
    }
    return ceil(x);
}
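For cross-checking GPU results on the CPU, the same bit tests can be ported to C++. This is a sketch of such a port rather than code from the article: AsUint is a hypothetical stand-in for HLSL asuint, implemented with std::memcpy, and float is assumed to be an IEEE 754 binary32.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for HLSL asuint: copy the float's bits into a uint32_t.
inline std::uint32_t AsUint(float x)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

inline float DeterministicFloor(float x)
{
    const std::uint32_t bits = AsUint(x);
    if ((bits & 0xFF800000u) == 0x80000000u && // sign = 1, exponent = 0
        (bits & 0x007FFFFFu) != 0u)            // mantissa != 0
    {
        return -1.0f; // negative denormal: lies strictly between -1 and 0
    }
    return std::floor(x);
}

inline float DeterministicCeil(float x)
{
    const std::uint32_t bits = AsUint(x);
    if ((bits & 0xFF800000u) == 0u && // sign = 0, exponent = 0
        (bits & 0x007FFFFFu) != 0u)   // mantissa != 0
    {
        return 1.0f; // positive denormal: lies strictly between 0 and 1
    }
    return std::ceil(x);
}
```

The remaining two cases (floor of a positive denormal, ceil of a negative one) need no special handling, because they produce a zero of the same sign whether the denormal is preserved or flushed.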

#math #gpu

The discussion centers on the behavior of the floor and ceil functions when their input is a denormal number, on both Central Processing Units (CPUs) and Graphics Processing Units (GPUs). The initial context establishes the definitions of floor, ceil, trunc, and round, noting that these functions transform a floating-point number into an integral floating-point value by rounding towards negative infinity (floor), towards positive infinity (ceil), towards zero (trunc), or to the nearest integer (round). The text differentiates the behavior of the round function for halfway cases across programming languages, noting that standard C/C++ rounds away from zero, HLSL rounds to the nearest even number, and GLSL leaves the behavior implementation-dependent.

A critical point introduced is the issue of denormal numbers, or subnormal numbers, which are values extremely close to zero that use a special representation where the implicit leading one bit is no longer assumed. These numbers have an exponent field of zero and a magnitude smaller than the minimum positive normalized value. The core problem arises because some platforms preserve these denormal values while others flush them to zero, which leads to inconsistent results when applying floor or ceil to very small inputs.

The text demonstrates that the outcome of floor and ceil operations on a denormal input depends entirely on which platform's floating-point handling policy is followed. If a platform preserves denormals, the results of floor and ceil for small numbers are distinct from when the platform flushes denormals to zero. For instance, the floor of a small negative denormal may yield a result of -1.0 if denormals are preserved, versus -0.0 if they are flushed to zero.

Testing across different hardware revealed this disparity in behavior. On an x86 64-bit CPU, C++ code compiled with Visual Studio 2022 appeared to preserve denormals regardless of the compilation flags tested. The GPUs, running Direct3D 12 with HLSL compiled by DXC, showed vendor-specific handling: the Nvidia GPU flushed denormals regardless of the -Gis and -denorm preserve options, whereas the Intel and AMD GPUs flushed them by default but preserved them when -denorm preserve was used. This suggests a source of nondeterminism between CPU and GPU and between different GPU vendors.

The text notes that the DirectX Specification requires GPUs to flush denormals on both input and output of floating-point operations. To achieve a consistent, deterministic implementation of floor and ceil that preserves denormals across CPUs and all GPUs, the author proposes custom HLSL functions that operate on the underlying bit representation of the number. They check the sign, exponent, and mantissa bits directly and return -1.0 (for floor of a negative denormal) or 1.0 (for ceil of a positive denormal) before falling back to the built-in function, so the result no longer depends on whether the hardware preserves or flushes denormals.