## Rounding float to nearest integer, part 3

By pascal on Saturday, May 4 2013, 13:50 - Permalink

Two earlier posts showed two different approaches in order to round a float to the nearest integer. The first was to truncate to integer after having added the right quantity (either 0.5 if the programmer is willing to take care of a few dangerous inputs beforehand, or the predecessor of 0.5 so as to have fewer dangerous inputs to watch for).

The second approach was to mess with the representation of the `float`

input, trying to recognize where the bits for the fractional part were, deciding whether they represented less or more than one half, and either zeroing them (in the first case), or sending the float up to the nearest integer (in the second case) which was simple for complicated reasons.

## Variations on the first method

Several persons have suggested smart variations on the first theme, included here for the sake of completeness.
The first suggestion is as follows (remembering that the input `f`

is assumed to be positive, and ignoring overflow issues for simplicity):

float myround(float f) { float candidate = (float) (unsigned int) f; if (f - candidate <= 0.5) return candidate; return candidate + 1.0f; }

Other suggestions were to use `modff()`

, that separates a floating-point number into its integral and fractional components, or `fmodf(f, 1.0f)`

, that computes the remainder of `f`

in the division by 1.

These three solutions work better than adding 0.5 for a reason that is simple if one only looks at it superficially: floating-point numbers are denser around zero. Adding 0.5 takes us away from zero, whereas operations `f - candidate`

, `modff(f, iptr)`

and `fmodf(f, 1.0)`

take us closer to zero, in a range where the answer can be exactly represented, so it is. (Note: this is a super-superficial explanation.)

## A third method

### General idea: the power of two giveth, and the power of two taketh away

The third and generally most efficient method for rounding `f`

to the nearest integer is to take advantage of this marvelous rounding machine that is IEEE 754 arithmetic. But for this to work, the exact right machine is needed, that is, a C compiler that implements strict IEEE 754 arithmetic and rounds each operation to the precision of the type. If you are using GCC, consider using options `-msse2 -mfpmath=sse`

.

We already noticed that single-precision floats between 2^23 and 2^24 are all the integers in this range. If we add some quantity to `f`

so that the result ends up in this range, wouldn't it follow that the result obtained will be rounded to the integer? And it would be rounded in round-to-nearest. Exactly what we are looking for:

f f + 8388608.0f _____________________________ 0.0f 8388608.0f 0.1f 8388608.0f 0.5f 8388608.0f 0.9f 8388609.0f 1.0f 8388609.0f 1.1f 8388609.0f 1.5f 8388610.0f 1.9f 8388610.0f 2.0f 8388610.0f 2.1f 8388610.0f

The rounding part goes well, but now we are stuck with large numbers far from the input and from the expected output. Let us try to get back close to zero by subtracting `8388608.0f`

again:

f f + 8388608.0f f + 8388608.0f - 8388608.0f ____________________________________________________________________ 0.0f 8388608.0f 0.0f 0.1f 8388608.0f 0.0f 0.5f 8388608.0f 0.0f 0.9f 8388609.0f 1.0f 1.0f 8388609.0f 1.0f 1.1f 8388609.0f 1.0f 1.5f 8388610.0f 2.0f 1.9f 8388610.0f 2.0f 2.0f 8388610.0f 2.0f 2.1f 8388610.0f 2.0f

It works! The subtraction is exact, for the same kind of reason that was informally sketched for `f - candidate`

. Adding `8388608.0f`

causes the result to be rounded to the unit, and then subtracting it is exact, producing a `float`

that is exactly the original rounded to the nearest integer.

For these inputs anyway. For very large inputs, the situation is different.

### Very large inputs: absorption

f f + 8388608.0f f + 8388608.0f - 8388608.0f ____________________________________________________________________ 1e28f 1e28f 1e28f 1e29f 1e29f 1e29f 1e30f 1e30f 1e30f 1e31f 1e31f 1e31f

When `f`

is large enough, adding `8388608.0f`

to it does nothing, and then subtracting `8388608.0f`

from it does nothing again. This is good news, because we are dealing with very large single-precision floats that are already integers, and can be returned directly as the result of our function `myround()`

.

In fact, since we entirely avoided converting to a range-challenged integer type, and since adding `8388608.0f`

to `FLT_MAX`

does not make it overflow (we have been assuming the FPU was in round-to-nearest mode all this time, remember?), we could even caress the dream of a straightforward `myround()`

with a single execution path. Small floats rounded to the nearest integer and taken back near zero where they belong, large floats returned unchanged by the addition and the subtraction of a comparatively small quantity (with respect to them).

### Dreams crushed

Unfortunately, although adding and subtracting 2^23 almost always does what we expect (it does for inputs up to 2^23 and above 2^47), there is a range of values for which it does not work. An example:

f f + 8388608.0f f + 8388608.0f - 8388608.0f ____________________________________________________________________ 8388609.0f 16777216.0f 8388608.0f

In order for function `myround()`

to work correctly for all inputs, it still needs a conditional. The simplest is to put aside inputs larger than 2^23 that are all integers, and to use the addition-subtraction trick for the others:

float myround(float f) { if (f >= 0x1.0p23) return f; return f + 0x1.0p23f - 0x1.0p23f; }

The function above, in round-to-nearest mode, satisfies the contract we initially set out to fulfill. Interestingly, if the rounding mode is other than round-to-nearest, then it still rounds to a nearby integer, but according to the FPU rounding mode. This is a consequence of the fact that the only inexact operation is the addition. The subtraction, being exact, is not affected by the rounding mode.

For instance, if the FPU is set to round downwards and the argument `f`

is `0.9f`

, then `f + 8388608.0f`

produces `8388608.0f`

, and `f + 8388608.0f - 8388608.0f`

produces zero.

## Conclusion

This post concludes the “rounding float to nearest integer” series. The method highlighted in this third post is actually the method generally used for function `rintf()`

, because the floating-point addition has the effect of setting the “inexact” FPU flag when it is inexact, which is exactly when the function returns an output other than its input, which is when `rintf()`

is specified as setting the “inexact” flag.

Function `nearbyintf()`

is specified as not touching the FPU flags and would typically be implemented with the method from the second post.

## Comments

My preferred method for converting a float to an int is called cvttsd2siq.

Hello, John.

cvttsd2siq is the instruction to convert a double. The single-precision version would be cvttss2siq. And they truncate. Funnily, they truncate because cast truncates in C. The 8087 instruction set converted according to FPU rounding mode, and it was a pain to implement C's cast from float to int with it. Intel avoided the mistake when designing SSE2, but then according to this Hacker News comment they introduced conversion according to rounding mode back in SSE4: https://news.ycombinator.com/item?id=5648045

Since we are on the subject, two amusing facts:

cvttss2siq is the instruction to which a smart compiler translates the function below:

unsigned int f(float x)

{

return x;

}

Because it can! The conversion to (32-bit) int would rely on the parsimonious cvttss2si instead.

On the other hand, to convert to unsigned long long, there is no single instruction that does the trick:

I can't read anything about FP accuracy without going back to this bit of black humor:

http://gcc.gnu.org/bugzilla/show_bu...