This post is the first in a series on the difficult task of rounding a floating-point number to an integer. Laugh not!
The easiest-looking questions can hide unforeseen difficulties, and the most widely accepted solutions can be wrong.

## Problem

Consider the task of rounding a `float`

to the nearest integer. The answer is expected as a `float`

, same as the input. In other words, we are looking at the work done by standard C99 function `nearbyintf()`

when the rounding mode is the default round-to-nearest.

For the sake of simplicity, in this series of posts, we assume that the argument is positive and we allow the function to round any which way if the float argument is exactly in-between two integers. In other words, we are looking at the ACSL specification below.

/*@ requires 0 ≤ f ≤ FLT_MAX ;
ensures -0.5 ≤ \result - f ≤ 0.5 ;
ensures \exists integer n; \result == n;
*/
float myround(float f);

In the second `ensures`

clause, `integer`

is an ACSL type (think of it as a super-long `long long`

). The formula `\exists integer n; \result == n`

simply means that `\result`

, the `float`

returned by function `myround()`

, is a mathematical integer.

## Via truncation

A first idea is to convert the argument `f`

to `unsigned int`

, and then again to `float`

, since that's the expected type for the result:

float myround(float f)
{
return (float) (unsigned int) f;
}

### Obvious overflow issue

One does not need Frama-C's value analysis to spot the very first issue, an overflow for large `float`

arguments, but it's there, so we might as well use it:

$ frama-c -pp-annot -val r.c -lib-entry -main myround
...
r.c:9:[kernel] warning: overflow in conversion of f ([-0. .. 3.40282346639e+38])
from floating-point to integer. assert -1 < f < 4294967296;

This overflow can be fixed by testing for large arguments. Large floats are all integers, so the function can simply return `f`

in this case.

float myround(float f)
{
if (f >= UINT_MAX) return f;
return (float) (unsigned int) f;
}

### Obvious rounding issue

The cast from `float`

to `unsigned int`

does not round to the nearest integer. It “truncates”, that is, it rounds towards zero. And if you already know this, you probably know too the universally used trick to obtain the nearest integer instead of the immediately smaller one, adding 0.5:

float myround(float f)
{
if (f >= UINT_MAX) return f;
return (float) (unsigned int) (f + 0.5f);
}

**This universally used trick is wrong.**

### An issue when the ULP of the argument is exactly one

The Unit in the Last Place, or ULP for short, of a floating-point number is its distance to the floats immediately above and immediately below it. For large enough floats, this distance is one. In that range, floats behave as integers.

There is an ambiguity in the above definition for powers of two: the distances to the float immediately above and the float immediately below are not the same. If you know of a usual convention for which one is called the ULP of a power of two, please leave a note in the comments.

int main()
{
f = 8388609.0f;
printf("%f -> %f\n", f, myround(f));
}

With a strict IEEE 754 compiler, the simple test above produces the result below:

8388609.000000 -> 8388610.000000

The value passed as argument is obviously representable as a float, since that's the type of `f`

. However, the mathematical sum `f + 0.5`

does not have to be representable as a float. In the worst case for us, when the argument is in a range of floats separated by exactly one, the floating-point sum `f + 0.5`

falls exactly in-between the two representable floats `f`

and `f + 1`

. Half the time, it is rounded to the latter, although `f`

was already an integer and was the correct answer for function `myround()`

. This causes bugs as the one displayed above.

The range of floating-point numbers spaced every 1.0 goes from 2^23 to 2^24. Half these 2^23 values, that is, nearly 4 millions of them, exhibit the problem.

Since the 0.5 trick is nearly universally accepted as the solution to implement rounding to nearest from truncation, this bug is likely to be found in lots of places. Nicolas Cellier identified it in Squeak. He offered a solution, too: switch the FPU to round-downward for the time of the addition `f + 0.5`

. But let us not fix the problem just yet, there is another interesting input for the function as it currently stands.

### An issue when the argument is exactly the predecessor of 0.5f

int main()
{
f = 0.49999997f;
printf("%.9f -> %.9f\n", f, myround(f));
}

This test produces the output `0.499999970 -> 1.000000000`

, although the input `0.49999997`

is clearly closer to `0`

than to `1`

.

Again, the issue is that the floating-point addition is not exact. The argument `0.49999997f`

is the last `float`

of the interval `[0.25 … 0.5)`

. The mathematical result of `f + 0.5`

falls exactly midway between the last float of the interval `[0.5 … 1.0)`

and `1.0`

. The rule that ties must be rounded to the even choice means that `1.0`

is chosen.

### A function that works

The overflow issue and the first non-obvious issue (when ulp(f)=1) can be fixed by the same test: as soon as the ULP of the argument is one, the argument is an integer and can be returned as-is.

The second non-obvious issue, with input `0.49999997f`

, can be fixed similarly.

float myround(float f)
{
if (f >= 0x1.0p23) return f;
if (f <= 0.5) return 0;
return (float) (unsigned int) (f + 0.5f);
}

### A better function that works

Changing the FPU rounding mode, the suggestion in the Squeak bug report, is slightly unpalatable for such a simple function, but it suggests to add the predecessor of `0.5f`

instead of `0.5f`

to avoid the sum rounding up when it shouldn't:

float myround(float f)
{
if (f >= 0x1.0p23) return f;
return (float) (unsigned int) (f + 0.49999997f);
}

It turns out that this works, too. It solves the problem with the input `0.49999997f`

without making the function fail its specification for other inputs.

## To be continued

The next post will approach the same question from a different angle. It should not be without its difficulties either.