## Harder than it looks: rounding float to nearest integer, part 1

By pascal on Thursday, May 2 2013, 18:14 - Permalink

This post is the first in a series on the difficult task of rounding a floating-point number to an integer. Laugh not! The easiest-looking questions can hide unforeseen difficulties, and the most widely accepted solutions can be wrong.

## Problem

Consider the task of rounding a `float`

to the nearest integer. The answer is expected as a `float`

, same as the input. In other words, we are looking at the work done by standard C99 function `nearbyintf()`

when the rounding mode is the default round-to-nearest.

For the sake of simplicity, in this series of posts, we assume that the argument is positive and we allow the function to round any which way if the float argument is exactly in-between two integers. In other words, we are looking at the ACSL specification below.

/*@ requires 0 <= f <= FLT_MAX ; ensures -0.5 <= \result - f <= 0.5 ; ensures \exists integer n; \result == n; */ float myround(float f);

In the second `ensures`

clause, `integer`

is an ACSL type (think of it as a super-long `long long`

). The formula `\exists integer n; \result == n`

simply means that `\result`

, the `float`

returned by function `myround()`

, is a mathematical integer.

## Via truncation

A first idea is to convert the argument `f`

to `unsigned int`

, and then again to `float`

, since that's the expected type for the result:

float myround(float f) { return (float) (unsigned int) f; }

### Obvious overflow issue

One does not need Frama-C's value analysis to spot the very first issue, an overflow for large `float`

arguments, but it's there, so we might as well use it:

$ frama-c -pp-annot -val r.c -lib-entry -main myround ... r.c:9:[kernel] warning: overflow in conversion of f ([-0. .. 3.40282346639e+38]) from floating-point to integer. assert -1 < f < 4294967296;

This overflow can be fixed by testing for large arguments. Large floats are all integers, so the function can simply return `f`

in this case.

float myround(float f) { if (f >= UINT_MAX) return f; return (float) (unsigned int) f; }

### Obvious rounding issue

The cast from `float`

to `unsigned int`

does not round to the nearest integer. It “truncates”, that is, it rounds towards zero. And if you already know this, you probably know too the universally used trick to obtain the nearest integer instead of the immediately smaller one, adding 0.5:

float myround(float f) { if (f >= UINT_MAX) return f; return (float) (unsigned int) (f + 0.5f); }

**This universally used trick is wrong.**

### An issue when the ULP of the argument is exactly one

The Unit in the Last Place, or ULP for short, of a floating-point number is its distance to the floats immediately above and immediately below it. For large enough floats, this distance is one. In that range, floats behave as integers.

There is an ambiguity in the above definition for powers of two: the distances to the float immediately above and the float immediately below are not the same. If you know of a usual convention for which one is called the ULP of a power of two, please leave a note in the comments.

int main() { f = 8388609.0f; printf("%f -> %f\n", f, myround(f)); }

With a strict IEEE 754 compiler, the simple test above produces the result below:

8388609.000000 -> 8388610.000000

The value passed as argument is obviously representable as a float, since that's the type of `f`

. However, the mathematical sum `f + 0.5`

does not have to be representable as a float. In the worst case for us, when the argument is in a range of floats separated by exactly one, the floating-point sum `f + 0.5`

falls exactly in-between the two representable floats `f`

and `f + 1`

. Half the time, it is rounded to the latter, although `f`

was already an integer and was the correct answer for function `myround()`

. This causes bugs as the one displayed above.

The range of floating-point numbers spaced every 1.0 goes from 2^23 to 2^24. Half these 2^23 values, that is, nearly 4 millions of them, exhibit the problem.

Since the 0.5 trick is nearly universally accepted as the solution to implement rounding to nearest from truncation, this bug is likely to be found in lots of places. Nicolas Cellier identified it in Squeak. He offered a solution, too: switch the FPU to round-downward for the time of the addition `f + 0.5`

. But let us not fix the problem just yet, there is another interesting input for the function as it currently stands.

### An issue when the argument is exactly the predecessor of 0.5f

int main() { f = 0.49999997f; printf("%.9f -> %.9f\n", f, myround(f)); }

This test produces the output `0.499999970 -> 1.000000000`

, although the input `0.49999997`

is clearly closer to `0`

than to `1`

.

Again, the issue is that the floating-point addition is not exact. The argument `0.49999997f`

is the last `float`

of the interval `[0.25 … 0.5)`

. The mathematical result of `f + 0.5`

falls exactly midway between the last float of the interval `[0.5 … 1.0)`

and `1.0`

. The rule that ties must be rounded to the even choice means that `1.0`

is chosen.

### A function that works

The overflow issue and the first non-obvious issue (when ulp(f)=1) can be fixed by the same test: as soon as the ULP of the argument is one, the argument is an integer and can be returned as-is.

The second non-obvious issue, with input `0.49999997f`

, can be fixed similarly.

float myround(float f) { if (f >= 0x1.0p23) return f; if (f <= 0.5) return 0; return (float) (unsigned int) (f + 0.5f); }

### A better function that works

Changing the FPU rounding mode, the suggestion in the Squeak bug report, is slightly unpalatable for such a simple function, but it suggests to add the predecessor of `0.5f`

instead of `0.5f`

to avoid the sum rounding up when it shouldn't:

float myround(float f) { if (f >= 0x1.0p23) return f; return (float) (unsigned int) (f + 0.49999997f); }

It turns out that this works, too. It solves the problem with the input `0.49999997f`

without making the function fail its specification for other inputs.

## To be continued

The next post will approach the same question from a different angle. It should not be without its difficulties either.

## Comments

This is definitely harder than it seems that it should be. I have fought this battle with several languages including C. This is a great share.

Even the last myround function does not work for all inputs. myround(0.5f) gives 0.0f here (GCC 4.2, Mac OS X). Also negative numbers break. The tests should be on the absolute of f, also needs to subtract 0.5/0.49999997f when negative. See https://gist.github.com/sheijk/4e8b...

Is this really a problem? Given that a floating point number is not an exact representation, and can't even represent all integer values in its range, does it even mean anything to talk about rounding to the nearest integer in the general case?

E.g.

float f = 123456789;

int i = (int) f; // 123456792

I'd say most applications that use this sort of rounding implicitly restrict values to a range where it makes sense (with double-precision it will work for values with fewer than 15 digits).

Would this work?

float RoundedFloat (float rawFloat){

if(rawFloat - rawFloat.floor >= 0.5){

return rawFloat.floor+1;

}else{

return rawfloat.floor;

}

}

@shjk

The contract says that in case of ties, the function is allowed to return any of the floats within 0.5 distance. Always rounding numbers with 0.5 fractional part up is not such a good solution anyway, it introduces a statistical bias. The widely used solution is to round ties to the nearest even number. Function myround() is not trying to implement this rule because we specified that for ties, any of the two nearest numbers were fine, but if it were, myround(0.5) -> 0 would be the correct answer.

Also, negative numbers are not handled because there was plenty to say while only focusing on positive numbers. It would make each example one line longer to handle negative numbers, and that would only show that -0.49999997f and a bunch of numbers around -8388609.000000 need to be handled with care.

The information about ties and the information that negative numbers can be ignored is all contained in the contract:

/*@ requires 0 <= f <= FLT_MAX ;

ensures -0.5 <= \result - f <= 0.5 ;

ensures \exists integer n; \result == n;

*/

float myround(float f);

If you need a real function that handles all cases, do not base one on this article. Use nearbyintf(). The function in this article is for showing the subtleties of floating-point computations, not necessarily for including in your programs.

@shjk

I used exhaustive testing with nearbyintf() as reference function to test the various proposals in this post:

union { float f; unsigned int u; } u;

for (u.f = 0.0; u.f < +1.0/0.0; u.u++)

{

float m = myround(u.f);

float ref = nearbyintf(u.f);

if (m != ref && u.f - ref != m - u.f)

printf("XXX %a %a %a\n", u.f, m, ref);

}

I think this works:

float RoundedFloat (float rawFloat){

if(rawFloat - rawFloat.floor >= 0.5){

return rawFloat.floor+1;

}else{

return rawFloat.floor;

}

}

Edit: Here's an actual tested implementation:

#include <math.h>

#include <iostream>

using namespace std;

float RoundedFloat(float rawFloat){

if(rawFloat - floor(rawFloat) >= 0.5){

return floor(rawFloat)+1;

}else{

return floor(rawFloat);

}

}

int main(){

float funnyNum=0.4999995f;

cout<<funnyNum<<endl;

cout<<RoundedFloat(funnyNum)<<endl;

return 0;

}

Which produces:

cos@DEADBEEF:~/floatRound$ g++ RoundFloat.cpp -lm

cos@DEADBEEF:~/floatRound$ ./a.out

0.499999

0

cos@DEADBEEF:~/floatRound$

The interesting thing is I couldn't use 0.49999997 as it gets rounded to 0.5 the second it gets placed in a float. 0.4999995 was the largest number I tested that didn't automatically get rounded up to 0.5 before it even got in the variable.

I think the obvious lesson we should take away from this is that if you need more precision than that you should use double precision floats.

interestingly python doesn't seem to suffer from the same issue until much later:

>>> def RoundFloat(num):

if num-math.floor(num) >= 0.5:

return math.floor(num)+1

else:

return math.floor(num)

>>> import math

>>> RoundFloat (1.49999999999999999999999)

2.0

>>> RoundFloat (1.49999999999999)

1.0

>>> RoundFloat (1.499999999999999999)

2.0

>>> RoundFloat (1.4999999999999999)

2.0

>>> RoundFloat (1.499999999999999)

1.0

>>>

@Joe

It is a good point you make. In many usages, floating-point do not represent exact values, and in that case it may not make sense to round a number in a range where the ULP is, say, 16, to the nearest integer.

But this is only one possible use of floating-point numbers.

This is even worse for periodic trigonometric functions:

in the same way, it does not make sense for most people that function sin() is accurate for an argument such as 1.0e21. But the standard makes no special provision for these functions, so math libraries have to implement costly argument reduction that few programmers will ever use. This does not hurt programmers who only call sin() with arguments between -π and π very much: it only means that the library contains a thousand decimals for π that they never use.

The specification we set out to satisfy was to return an integer (represented as a float) that was 0.5 away from the float argument. We could loosen the specification, but that would only make it more complex. The simple way out is to implement a function that works for all positive arguments.

@Collin

> The interesting thing is I couldn't use 0.49999997 as it gets rounded to 0.5 the second it gets placed in a float.

Are you sure that it is not just the pretty-printing that is showing 0.49999997f as 0.5? The last float before 0.5 is exactly 0.4999999701976776123046875, or in hexadecimal 0x1.fffffep-1. I hear that C++ does not accept hexadecimal for floating-point numbers, an unfortunate oversight, but perhaps your compiler will accept it as an extension.

Regarding double precision, that's true, but the same issues found here happen with different numbers (following the same idea) in double-precision. However, if the floating-point sum is computed at a higher precision than the argument and result types, for instance 80-bit if the argument and results are doubles, then the problem with adding 0.5 disappears.

float myround(float f) {

return (float) ( (unsigned) ( double(f) + 0.5f ) );

}

@r-lyeh

Exactly. For rounding a double in the same conditions, if there is a wider long double type available, one can also do:

double myround(double f) {

return (double) ( (unsigned long long) (f + 0.5L) );

}

But for a function to round a long double to the nearest integer, we have to get back to hackish solution(s), another one of which is reserved for the next post.

I'd avoid this sort of conversion if at all possible.

> Is this really a problem?

Sometimes, for some people, yes. Floating-point is complicated, and adding in the complexity of round-to-nearest-integer functions that don't actually work just layers on additional complexity -- that is unnecessary.

Note that one of the failure cases from the naive method is 0.49999997, which is likely to be in the active range for most users.

Note that the recommended trick is float specific -- it will fail on doubles. That also suggests that it may fail if intermediate calculations are done at double precision, which is entirely IEEE compliant.

I believe you missed a simplification. The bad rounding on numbers that are already integers only happens above 2^23, and since all floats above 2^23 are already integers the large-number check can just be moved down. Then the constant 0.5 can safely be used.

Simple. And, it becomes easier to make the code work with doubles -- just compare against 2^53 and cast to uint64 -- 0.5 can still be used, whereas 0.49999997 would not have worked.

Relevant series of posts:

http://randomascii.wordpress.com/ca...

@Bruce

The recommended trick fails for the conversion of doubles if you think of it as “add 0.49999997f”, but not if you think of it as “add the appropriate predecessor of 0.5”. The double type has the same issue with adding 0.5, the problematic input is the same (the predecessor of 0.5, in this case 0.49999999999999994) and the same fix of adding the predecessor of 0.5 instead so that for the problematic input, the result of the addition is the predecessor of 1.0.

You are right that the code (unsigned int) (f + 0.49999997f) assumes that the C compiler computes exactly at the precision of the types. The C standard (standardized by ISO) allows this, rather than the IEEE 754 standard for which this would be such a heresy that they did not even think of forbidding it.

(unsigned int) (float) (f + 0.49999997f) would be more portable, and since computing a single basic operation at precision higher than double before rounding to single is harmless (1), it would work with the very large majority of compilers (C doesn't even mandate IEEE 754 floating-point, so it is impossible to claim it would work with all of them).

(1) “When is double rounding innocuous?”, Samuel A. Figueroa

@Bruce

Hmm, no, I did not miss this optimization, this is what section “A function that works” is all about. Is the use of the hexadecimal floating-point constant 0x1.0p23 confusing? It's a convenient input format introduced in C99 but sorely missing from C++.

I never simply use a cast to convert floats to ints, since years, maybe decades ago, I found that can produce rare errors. I don't know if that was an issue with CPUs back in the day or Windows floating point package, but I have tracked bugs down to using a cast at least twice in my life, seeing the actual fail under a debugger. I am talking here about a bug that may show up in an app that was released and has been in the field for several months. Now I don't recall if that was a situation where a bad value exceeded the range of an integer, and I never considered that. As posted above, quite often we have some range checks in our code and don't expect huge values in the final sum.

Many years ago a wise programmer would never use a cast to convert from float to integer. Instead he would do something like this...

if(fMyFloat >= 0)

{

IntValue = (int)floor(fMyFloat + 0.5);

}else

{

IntValue = (int)ceil(fMyFloat - 0.5);

}

In my audio applications, when converting double precision floating point samples to their final integer form, after checking for clipping I use truncation to convert double precision samples to shorts...

short sample = (short)_copysign(floor(fabs(dRawSample)), dRawSample);

This seems to result in the least amount of harmonic distortion.

We may pick up certain habits early in our career as a programmer and never consider them again. Your discussion has made me examine my habits, or even superstitions that may have arisen from experiences gained decades ago.

The varying accuracy is also affected by the number format. This includes whether the flotaing format is hex, octal or binary. It may in some circumstances be affected by whether or not the number system in storage is big endian or little endian as that may change the algorithms used to convert the decimal entered into the internal storage format.

Therefore, there is not likely to be a single means to convert from float to integer. For the combinations of 3 for the format, and 2 for the endian, and maybe 3 for the conversion from decimal entry to internal representation, you end up with 24 possibilities, some of which may overlap. Add the variation of internal storage size, as 32,48, 64, 80, 96, and 128, and possibly 256 bit sizes, and you get 168 different combinations.

Is it getting to be fun yet? Then add the various "arbitrary length" math models and it gets even more wonderful.

Anyone care to try to work them all out? And by the way, an integer always is considered to be accurate +/- 1 anyway, so why is this even a problem? You have to consider the loss of precision between integers anyway.

Use Fixed Point for all the calculations. This will solve all the problems!

You might enjoy seeing the anatomy of a floating-point number

and some programming tricks in intpow.c

(at www.civilized.com/files/intpow.c)