Frama-C news and ideas

To content | To menu | To search | Frama-C Home

Rounding float to nearest integer, part 3

Two earlier posts showed two different approaches in order to round a float to the nearest integer. The first was to truncate to integer after having added the right quantity (either 0.5 if the programmer is willing to take care of a few dangerous inputs beforehand, or the predecessor of 0.5 so as to have fewer dangerous inputs to watch for).

The second approach was to mess with the representation of the float input, trying to recognize where the bits for the fractional part were, deciding whether they represented less or more than one half, and either zeroing them (in the first case), or sending the float up to the nearest integer (in the second case) which was simple for complicated reasons.

Variations on the first method

Several persons have suggested smart variations on the first theme, included here for the sake of completeness. The first suggestion is as follows (remembering that the input f is assumed to be positive, and ignoring overflow issues for simplicity):

float myround(float f)
  float candidate = (float) (unsigned int) f;
  if (f - candidate <= 0.5) return candidate;
  return candidate + 1.0f;

Other suggestions were to use modff(), that separates a floating-point number into its integral and fractional components, or fmodf(f, 1.0f), that computes the remainder of f in the division by 1.

These three solutions work better than adding 0.5 for a reason that is simple if one only looks at it superficially: floating-point numbers are denser around zero. Adding 0.5 takes us away from zero, whereas operations f - candidate, modff(f, iptr) and fmodf(f, 1.0) take us closer to zero, in a range where the answer can be exactly represented, so it is. (Note: this is a super-superficial explanation.)

A third method

General idea: the power of two giveth, and the power of two taketh away

The third and generally most efficient method for rounding f to the nearest integer is to take advantage of this marvelous rounding machine that is IEEE 754 arithmetic. But for this to work, the exact right machine is needed, that is, a C compiler that implements strict IEEE 754 arithmetic and rounds each operation to the precision of the type. If you are using GCC, consider using options -msse2 -mfpmath=sse.

We already noticed that single-precision floats between 2^23 and 2^24 are all the integers in this range. If we add some quantity to f so that the result ends up in this range, wouldn't it follow that the result obtained will be rounded to the integer? And it would be rounded in round-to-nearest. Exactly what we are looking for:

  f                 f + 8388608.0f

0.0f                8388608.0f
0.1f                8388608.0f
0.5f                8388608.0f
0.9f                8388609.0f
1.0f                8388609.0f
1.1f                8388609.0f
1.5f                8388610.0f
1.9f                8388610.0f
2.0f                8388610.0f
2.1f                8388610.0f

The rounding part goes well, but now we are stuck with large numbers far from the input and from the expected output. Let us try to get back close to zero by subtracting 8388608.0f again:

  f                 f + 8388608.0f             f + 8388608.0f - 8388608.0f

0.0f                8388608.0f                              0.0f
0.1f                8388608.0f                              0.0f
0.5f                8388608.0f                              0.0f
0.9f                8388609.0f                              1.0f
1.0f                8388609.0f                              1.0f
1.1f                8388609.0f                              1.0f
1.5f                8388610.0f                              2.0f
1.9f                8388610.0f                              2.0f
2.0f                8388610.0f                              2.0f
2.1f                8388610.0f                              2.0f

It works! The subtraction is exact, for the same kind of reason that was informally sketched for f - candidate. Adding 8388608.0f causes the result to be rounded to the unit, and then subtracting it is exact, producing a float that is exactly the original rounded to the nearest integer.

For these inputs anyway. For very large inputs, the situation is different.

Very large inputs: absorption

  f                 f + 8388608.0f             f + 8388608.0f - 8388608.0f

1e28f                  1e28f                               1e28f
1e29f                  1e29f                               1e29f
1e30f                  1e30f                               1e30f
1e31f                  1e31f                               1e31f

When f is large enough, adding 8388608.0f to it does nothing, and then subtracting 8388608.0f from it does nothing again. This is good news, because we are dealing with very large single-precision floats that are already integers, and can be returned directly as the result of our function myround().

In fact, since we entirely avoided converting to a range-challenged integer type, and since adding 8388608.0f to FLT_MAX does not make it overflow (we have been assuming the FPU was in round-to-nearest mode all this time, remember?), we could even caress the dream of a straightforward myround() with a single execution path. Small floats rounded to the nearest integer and taken back near zero where they belong, large floats returned unchanged by the addition and the subtraction of a comparatively small quantity (with respect to them).

Dreams crushed

Unfortunately, although adding and subtracting 2^23 almost always does what we expect (it does for inputs up to 2^23 and above 2^47), there is a range of values for which it does not work. An example:

  f                 f + 8388608.0f             f + 8388608.0f - 8388608.0f

8388609.0f          16777216.0f                        8388608.0f

In order for function myround() to work correctly for all inputs, it still needs a conditional. The simplest is to put aside inputs larger than 2^23 that are all integers, and to use the addition-subtraction trick for the others:

float myround(float f)
  if (f >= 0x1.0p23)
    return f;
  return f + 0x1.0p23f - 0x1.0p23f;

The function above, in round-to-nearest mode, satisfies the contract we initially set out to fulfill. Interestingly, if the rounding mode is other than round-to-nearest, then it still rounds to a nearby integer, but according to the FPU rounding mode. This is a consequence of the fact that the only inexact operation is the addition. The subtraction, being exact, is not affected by the rounding mode.

For instance, if the FPU is set to round downwards and the argument f is 0.9f, then f + 8388608.0f produces 8388608.0f, and f + 8388608.0f - 8388608.0f produces zero.


This post concludes the “rounding float to nearest integer” series. The method highlighted in this third post is actually the method generally used for function rintf(), because the floating-point addition has the effect of setting the “inexact” FPU flag when it is inexact, which is exactly when the function returns an output other than its input, which is when rintf() is specified as setting the “inexact” flag.

Function nearbyintf() is specified as not touching the FPU flags and would typically be implemented with the method from the second post.


1. On Saturday, May 4 2013, 18:01 by John Regehr

My preferred method for converting a float to an int is called cvttsd2siq.

2. On Saturday, May 4 2013, 18:10 by pascal

Hello, John.

cvttsd2siq is the instruction to convert a double. The single-precision version would be cvttss2siq. And they truncate. Funnily, they truncate because cast truncates in C. The 8087 instruction set converted according to FPU rounding mode, and it was a pain to implement C's cast from float to int with it. Intel avoided the mistake when designing SSE2, but then according to this Hacker News comment they introduced conversion according to rounding mode back in SSE4:

Since we are on the subject, two amusing facts:

cvttss2siq is the instruction to which a smart compiler translates the function below:

unsigned int f(float x)


  return x;


Because it can! The conversion to (32-bit) int would rely on the parsimonious cvttss2si instead.

On the other hand, to convert to unsigned long long, there is no single instruction that does the trick: 

unsigned long long f(float x)
  return x;
It compiles to:
~ $ gcc -O -S tr.c
~ $ cat tr.s
    .long    1593835520 ## float 9.223372e+18
    pushq    %rbp
    movq    %rsp, %rbp
    movss    LCPI1_0(%rip), %xmm1
    movaps    %xmm0, %xmm2
    subss    %xmm1, %xmm2
    cvttss2siq    %xmm2, %rax
    movabsq    $-9223372036854775808, %rcx
    xorq    %rax, %rcx
    ucomiss    %xmm1, %xmm0
    cvttss2siq    %xmm0, %rax
    cmovaeq    %rcx, %rax
    popq    %rbp
I actually re-discovered and implemented this algorithm by hand at a time I needed it (in OCaml, towards multi-precision integers that only needed to be as large as unsigned long longs) before recognizing it later in my compiler's output.
One last funny fact is how C compilers sometimes try to recognize patterns to map them to efficient instructions. The blog The Shape of Code contains an example with the compilation of a multiplication by 6 or somesuch, that a programmer too smart for eir own good wrote as 4 * x + 2 * x, and that the compiler reverse-translates to a single multiplication.
Well, the compiler will not be able to do that with (int) (f + 0.5f), since it does not always produce the same result as an instruction to round to the nearest.
3. On Saturday, May 4 2013, 18:47 by John Regehr

I can't read anything about FP accuracy without going back to this bit of black humor: