Frama-C news and ideas - Tag - floating-point2019-02-26T08:53:17+01:00Florent Kirchnerurn:md5:b42a078de1ba22947c760c9e93aba4d2DotclearThe problem with differential testing is that at least one of the compilers must get it righturn:md5:44aa942a3f57fe3b180812009faad36d2013-09-25T20:22:00+01:002013-09-26T05:54:08+01:00pascalfloating-point <p>A long time ago, John Regehr wrote a blog post about <a href="http://blog.regehr.org/archives/558">a 3-3 split vote</a> that occurred while he was finding bugs in C compilers through differential testing. John could have included Frama-C's value analysis in his set of C implementations, and then the vote would have been 4-3 for the correct interpretation (Frama-C's value analysis predicts the correct value on the particular C program that was the subject of the post). But self-congratulatory remarks are not the subject of today's post. Non-split votes in differential testing where all compilers get it wrong are.</p>
<h2>A simple program to find double-rounding examples</h2>
<p>The program below looks for examples of harmful double-rounding in floating-point multiplication. Harmful double-rounding occurs when the result of the multiplication of two <code>double</code> operands differs between the double-precision multiplication (the result is rounded directly to what fits the <code>double</code> format) and the extended-double multiplication (the mathematical result of multiplying two <code>double</code> numbers may not be representable exactly even with extended-double precision, so it is rounded to extended-double, and then rounded again to <code>double</code>, which changes the result).</p>
<pre>
$ cat dr.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
#include <limits.h>
int main(){
printf("%d %a %La\n", FLT_EVAL_METHOD, DBL_MAX, LDBL_MAX);
while(1){
double d1 = ((unsigned long)rand()<<32) +
((unsigned long)rand()<<16) + rand() ;
double d2 = ((unsigned long)rand()<<32) +
((unsigned long)rand()<<16) + rand() ;
long double ld1 = d1;
long double ld2 = d2;
if (d1 * d2 != (double)(ld1 * ld2))
printf("%a*%a=%a but (double)((long double) %a * %a))=%a\n",
d1, d2, d1*d2,
d1, d2, (double)(ld1 * ld2));
}
}
</pre>
<p>The program is platform-dependent, but if it starts printing something like below, then a long list of double-rounding examples should immediately follow:</p>
<pre>
0 0x1.fffffffffffffp+1023 0xf.fffffffffffffffp+16380
</pre>
<h2>Results</h2>
<p>In my case, what happened was:</p>
<pre>
$ gcc -v
Using built-in specs.
Target: i686-apple-darwin11
...
gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
$ gcc -std=c99 -O2 -Wall dr.c && ./a.out
0 0x1.fffffffffffffp+1023 0xf.fffffffffffffffp+16380
^C
</pre>
<p>I immediately blamed myself for miscalculating the probability of easily finding such examples, getting a conversion wrong, or following <code>while (1)</code> with a semicolon. But it turned out I had not done any of those things. I turned to Clang for a second opinion:</p>
<pre>
$ clang -v
Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.4.0
Thread model: posix
$ clang -std=c99 -O2 -Wall dr.c && ./a.out
0 0x1.fffffffffffffp+1023 0xf.fffffffffffffffp+16380
^C
</pre>
<h2>Conclusion</h2>
<p>It became clear what had happened when looking at the assembly code:</p>
<pre>
$ clang -std=c99 -O2 -Wall -S dr.c && cat dr.s
...
mulsd %xmm4, %xmm5
ucomisd %xmm5, %xmm5
jnp LBB0_1
...
</pre>
<p>Clang had compiled the test for deciding whether to call <code>printf()</code> into <code>if (xmm5 != xmm5)</code> for some register <code>xmm5</code>.</p>
<pre>
$ gcc -std=c99 -O2 -Wall -S dr.c && cat dr.s
...
mulsd %xmm1, %xmm2
ucomisd %xmm2, %xmm2
jnp LBB1_1
...
</pre>
<p>And GCC had done the same. Although, to be fair, the two compilers appear to be using LLVM as back-end, so this could be the result of a single bug. But this would remove all the salt of the anecdote, so let us hope it isn't.</p>
<p><br /></p>
<p>It is high time that someone used fuzz-testing to debug floating-point arithmetic in compilers. Hopefully one compiler will get it right sometimes and we can work from there.</p>Exact case management in floating-point library functionsurn:md5:49ef47dac2eef65f1dd76ae92c48fe412013-08-05T17:07:00+01:002013-08-05T17:07:00+01:00pascalfloating-point <p>The <a href="http://lipforge.ens-lyon.fr/frs/download.php/153/crlibm-1.0beta3.pdf">documentation</a> of the correctly-rounded <a href="http://lipforge.ens-lyon.fr/www/crlibm/">CRlibm</a> floating-point library states, for the difficult <code>pow()</code> function (p. 159):</p>
<blockquote><p>Directed rounding requires additional work, in particular in subnormal handling and in exact case management. There are more exact cases in directed rounding modes, therefore the performance should also be inferior.</p></blockquote>
<p>The phrase “exact case” refers to inputs that need to be treated specially because no number of “Ziv iterations”, at increasing precisions, can ever resolve which way the rounding should go.</p>
<p><br /></p>
<p>Quiz: Isn't an exact case an exact case independently of the rounding mode? How can exact cases vary with the rounding mode?</p>
<p><br /></p>
<p>If you can answer the above quiz without having to <a href="https://en.wikipedia.org/wiki/Rubber_duck_debugging">rubber duck</a> through the entire question on an internet forum, you have me beat.</p>More on the precise analysis of C programs for FLT_EVAL_METHOD==2urn:md5:814c1dbea81893b174ef2b7d6a617bfe2013-07-24T00:33:00+01:002014-01-01T17:15:24+00:00pascalfacetious-colleaguesfloating-pointFLT_EVAL_METHOD <h2>Introduction</h2>
<p>It started innocently enough. My colleagues were talking of supporting target compilers with excess floating-point precision.
<a href="http://blog.frama-c.com/index.php?post/2013/07/06/On-the-precise-analysis-of-C-programs-for-FLT_EVAL_METHOD-2">We saw</a> that if analyzing programs destined to be compiled with strict IEEE 754 compilers was a lovely Spring day at the beach, analyzing for compilers that allow excess precision was Normandy in June, 1944. But we had not seen anything, yet.</p>
<h2>The notion of compile-time computation depends on the optimization level</h2>
<p>One first obvious problem was that of constant expressions that were evaluated at compile-time following rules that differed from run-time ones.
And who is to say what is evaluated at compile-time and at run-time? Why, it even depends, for one same compiler, on the optimization level:</p>
<pre>
#include <stdio.h>
int r1, r2;
double ten = 10.0;
int main(int c, char **v)
{
ten = 10.0;
r1 = 0.1 == (1.0 / ten);
r2 = 0.1 == (1.0 / 10.0);
printf("r1=%d r2=%d\n", r1, r2);
}
</pre>
<p>Note how, compared to last time, we make the vicious one-line change of assigning variable <code>ten</code> again inside function <code>main()</code>.</p>
<pre>
$ gcc -v
Target: x86_64-linux-gnu
…
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1)
$ gcc -mno-sse2 -mfpmath=387 -std=c99 -O2 s.c && ./a.out
r1=1 r2=1
$ gcc -mno-sse2 -mfpmath=387 -std=c99 s.c && ./a.out
r1=0 r2=1
</pre>
<p>So the problem is not just that the static analyzer must be able to recognize the computations that are done at compile-time. A precise static analyzer that went this path would in addition have to model each of the tens of optimization flags of the target compiler and their effects on the definition of constant expression.</p>
<p><br /></p>
<p>Fortunately for us, after <a href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323">many, varied complaints</a> from GCC users, Joseph S. Myers decided that 387 floating-point math in GCC was at least going to be <a href="http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00105.html">predictable</a>. That would not solve all the issues that had been marked as duplicates of the infamous bug 323 over its lifetime, but it would answer the valid ones.</p>
<h2>A ray of hope</h2>
<p>Joseph S. Myers provided a reasonable interpretation of the effects of
FLT_EVAL_METHOD in the C99 standard. The comparatively old compiler
we used in the previous post and in the first section of this one does not contain the patch from that
discussion, but recent compilers do. The most recent GCC I have available is
SVN snapshot 172652 from 2011. It includes the patch. With this version of
GCC, we compile and execute the test program below.</p>
<pre>
#include <stdio.h>
int r1, r2, r3, r4, r5, r6, r7;
double ten = 10.0;
int main(int c, char **v)
{
r1 = 0.1 == (1.0 / ten);
r2 = 0.1 == (1.0 / 10.0);
r3 = 0.1 == (double) (1.0 / ten);
r4 = 0.1 == (double) (1.0 / 10.0);
ten = 10.0;
r5 = 0.1 == (1.0 / ten);
r6 = 0.1 == (double) (1.0 / ten);
r7 = ((double) 0.1) == (1.0 / 10.0);
printf("r1=%d r2=%d r3=%d r4=%d r5=%d r6=%d r7=%d\n", r1, r2, r3, r4, r5, r6, r7);
}
</pre>
<p>We obtain the following results, different from the results of the earlier
version of GCC, but independent of the optimization level and understandable
(all computations are done with FLT_EVAL_METHOD==2 semantics):</p>
<pre>
$ ./gcc-172652/bin/gcc -mno-sse2 -mfpmath=387 -std=c99 t.c && ./a.out
r1=1 r2=1 r3=0 r4=0 r5=1 r6=0 r7=0
</pre>
<p>As per the C99 standard, the choice was made to give the literal <code>0.1</code> the value of <code>0.1L</code>. I am happy to report that this simple explanation for the values of <code>r2</code> and <code>r7</code> can be inferred directly from the assembly code. Indeed, the corresponding constant is declared in assembly as:</p>
<pre>
.LC1:
.long 3435973837
.long 3435973836
.long 16379
.long 0
</pre>
<p><br /></p>
<p><strong>Quiz:</strong> why is it obvious in the above assembly code for a <code>long double</code> constant that the compiler used the <code>long double</code> approximation for <code>0.1</code> instead of the <code>double</code> one?</p>
<p><br /></p>
<p>As described, the semantics of C programs compiled with FLT_EVAL_METHOD==2 are just as
deterministic as if they were compiled with FLT_EVAL_METHOD==0. They
give different results from the latter, but always the same
ones, regardless of optimization level, interference from unrelated
statements, and even regardless of the particular compiler generating
code with FLT_EVAL_METHOD==2. In the discussion that followed between
Joseph Myers and Ian Lance Taylor, this is called “predictable semantics”
and it is a boon to anyone who whishes to tell what a program ought
to do when executed (including but not limited to precise static analyzers).</p>
<h2>Implementation detail: source-to-source transformation or architecture option?</h2>
<p>Now that at least one C compiler can be said to have predictable behavior
with respect to excess precision, the question arises of supporting
FLT_EVAL_METHOD==2 in Frama-C. This could be one more of the architecture-dependent parameters such as the size of type <code>int</code> and the endianness.</p>
<p>The rules are subtle, however, and rather than letting each Frama-C plug-in implement them and get them slightly wrong, it would be less error-prone to implement
these rules once and for all as a source-to-source translation from a program
with FLT_EVAL_METHOD==2 semantics to a program that when compiled or
analyzed with FLT_EVAL_METHOD==0 semantics, computes the same thing as the
first one.</p>
<h3>The destination of the transformation can be a Frama-C AST</h3>
<p>A translated program giving, when compiled with strict IEEE 754 semantics,
the FLT_EVAL_METHOD==2 semantics of an existing program
can be represented as an AST in Frama-C. Here is how the translation would work on an example:</p>
<pre>
double interpol(double u1, double u2, double u3)
{
return u2 * (1.0 - u1) + u1 * u3;
}
</pre>
<p>Function <code>interpol()</code> above can be compiled with either <code>FLT_EVAL_METHOD==0</code> or
with <code>FLT_EVAL_METHOD==2</code>. In the second case, it actually appears to have slightly better properties than in the first case, but the differences are minor.</p>
<p>A source-to-source translation could transform the function into that below:</p>
<pre>
double interpol_80(double u1, double u2, double u3)
{
return u2 * (1.0L - (long double)u1) + u1 * (long double)u3;
}
</pre>
<p>This transformed function, <code>interpol_80()</code>, when compiled or analyzed with <code>FLT_EVAL_METHOD==0</code>, behaves exactly identical to function <code>interpol()</code> compiled or analyzed
with <code>FLT_EVAL_METHOD==2</code>. I made an effort here to insert only the minimum number of explicit conversions but a Frama-C transformation plug-in would not need to be so punctilious.</p>
<h3>The source of the transformation cannot be a Frama-C AST</h3>
<p>There is however a problem with the implementation of the transformation as a traditional Frama-C transformation plug-in.
It turns out that the translation cannot use the normalized Frama-C AST as source. Indeed, if we use a Frama-C command to print the AST of the previous example in textual form:</p>
<pre>
~ $ frama-c -print -kernel-debug 1 t.c
…
/* Generated by Frama-C */
int main(int c, char **v)
{
/* Locals: __retres */
int __retres;
/* sid:18 */
r1 = 0.1 == 1.0 / ten;
/* sid:19 */
r2 = 0.1 == 1.0 / 10.0;
/* sid:20 */
r3 = 0.1 == 1.0 / ten;
/* sid:21 */
r4 = 0.1 == 1.0 / 10.0;
…
}
</pre>
<p>Explicit casts to a type that an expression already has, such as the casts to <code>double</code> in the assignments to variables <code>r3</code> and <code>r4</code>, are erased by the Frama-C front-end as part of its normalization. For us, this will not do: these casts, although they convert a <code>double</code> expression to <code>double</code>, change the meaning of the program, as shown by the differences between the values of <code>r1</code> and <code>r3</code> and respectively or <code>r2</code> and <code>r4</code> when one executes our example.</p>
<p>This setback would not be insurmountable but it means complications. It also implies that FLT_EVAL_METHOD==2 semantics cannot be implemented by individual plug-ins, which looked like a possible alternative.</p>
<p><br /></p>
<p>To conclude this section on a positive note, if the goal is to analyze a C program destined to be compiled to the
thirty-year old 8087 instructions with a recent GCC compiler, we can
build the version of Frama-C that will produce results precise to the last bit.
The amount of work is not inconsiderable, but it is possible.</p>
<h2>But wait!</h2>
<p>But what about a recent version of Clang? Let us see, using the
same C program as previously:</p>
<pre>
#include <stdio.h>
int r1, r2, r3, r4, r5, r6, r7;
double ten = 10.0;
int main(int c, char **v)
{
r1 = 0.1 == (1.0 / ten);
r2 = 0.1 == (1.0 / 10.0);
r3 = 0.1 == (double) (1.0 / ten);
r4 = 0.1 == (double) (1.0 / 10.0);
ten = 10.0;
r5 = 0.1 == (1.0 / ten);
r6 = 0.1 == (double) (1.0 / ten);
r7 = ((double) 0.1) == (1.0 / 10.0);
printf("r1=%d r2=%d r3=%d r4=%d r5=%d r6=%d r7=%d\n", r1, r2, r3, r4, r5, r6, r7);
}
</pre>
<pre>
$ clang -v
Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)
$ clang -mno-sse2 -std=c99 t.c && ./a.out
r1=0 r2=1 r3=0 r4=1 r5=1 r6=0 r7=1
</pre>
<p>Oh no! Everything is to be done again… Some expressions are evaluated
as compile-time with different results than the run-time ones, as shown
by the difference between <code>r1</code> and <code>r2</code>. The explicit cast to <code>double</code>
does not seem to have an effect for <code>r3</code> and <code>r4</code> as compared to <code>r1</code> and
<code>r2</code>. This is different from Joseph Myers's interpretation, but if it is because floating-point expressions are always converted to their nominal types before being compared, it may be a good astonishment-lowering move. The value of <code>r5</code> differs from that of <code>r1</code>,
pointing to a non-obvious demarcation line between compile-time
evaluation and run-time evaluation. And the values of <code>r5</code> and <code>r6</code> differ,
meaning that
our interpretation “explicit casts to <code>double</code> have no effects”
based on the comparison of the values of <code>r1</code> and <code>r3</code> on the one hand
and r2 and r4 on the other hand, is wrong or that some other
compilation pass can interfere.</p>
<p>What a mess! There is no way a precise static analyzer can be made for this recent version of Clang (with these unfashionable options). Plus the results depend on optimizations:</p>
<pre>
$ clang -mno-sse2 -std=c99 -O2 t.c && ./a.out
r1=0 r2=1 r3=0 r4=1 r5=1 r6=1 r7=1
</pre>
<h2>FLT_EVAL_METHOD is not ready for precise static analysis</h2>
<p>In conclusion, it would be possible, and only quite a lot of hard work, to make a precise static analyzer for programs destined to be compiled to x87 instructions by a modern GCC. But for most other compilers, even including recent ones, it is simply impossible: the compiler gives floating-point operations a meaning that only it knows.</p>
<p>This is the sort of problem we tackled in the <a href="http://hisseo.saclay.inria.fr">Hisseo project</a> mentioned last time. One of the solutions we researched was “Just do not make a <strong>precise</strong> static analyzer”, and another was “Just analyze the generated assembly code where the meaning of floating-point operations has been fixed”. A couple of years later, the third solution, “Just use a proper compiler”, is looking better and better. It could even be a <a href="http://compcert.inria.fr">certified one</a>, although it does not have to. Both Clang and GCC, when targeting the SSE2 instruction set, give perfect FLT_EVAL_METHOD==0 results. We should all enjoy this period of temporary sanity until x86-64 processors all sport a <a href="http://en.wikipedia.org/wiki/FMA_instruction_set">fused-multiply-add instruction</a>.</p>
<blockquote><p>Two things I should point out as this conclusion's conclusion. First, with the introduction of SSE2, the IA-32 platform (and its x86-64 cousin) has gone from the worst platform still in existence for predictable floating-point results to the best. It has correctly rounded operations for the standard single-precision and double-precision formats, and it retains hardware support for an often convenient extended precision format. Second, the fused-multiply-add instruction is a great addition to the instruction set, and I for one cannot wait until I get my paws on a processor that supports it, but it is going to be misused by compilers to compile source-level multiplications and additions. Compilers have not become wiser. The SSE2 instruction set has only made it more costly for them to do the wrong thing than to do the right one. They will break predictability again as soon as the opportunity comes, and the opportunity is already in Intel and AMD's product pipelines.</p></blockquote>The word “binade” now has its Wikipedia page…urn:md5:df20655928028acf720041e89e549a072013-07-19T19:44:00+01:002013-07-19T19:44:00+01:00pascalfloating-pointlink <p>… but that's only because I <a href="http://en.wikipedia.org/wiki/Binade">created it</a>.</p>
<p>If you are more familiar than me with Wikipedia etiquette, feel free to adjust, edit, or delete this page. Also, although a Wikipedia account is necessary to create a page, I think it is not required for editing, so you can add to the story too (but if you do not have an account you are perhaps no more familiar than me with Wikipedia etiquette).</p>On the precise analysis of C programs for FLT_EVAL_METHOD==2urn:md5:9662a58de0dc1572e7a2976588ac0a6e2013-07-06T22:07:00+01:002013-07-06T23:20:29+01:00pascalfacetious-colleaguesfloating-pointFLT_EVAL_METHOD <p>There has been talk recently amongst my colleagues of Frama-C-wide support for compilation platforms that define <code>FLT_EVAL_METHOD</code> as 2. Remember that this compiler-set value, introduced in C99, means that all floating-point computations in the C program are made with <code>long double</code> precision, even if the type of the expressions they correspond to is <code>float</code> or <code>double</code>. This post is a reminder, to the attention of these colleagues and myself, of pitfalls to be anticipated in this endeavor.</p>
<p><br /></p>
<p>We are talking of C programs like the one below.</p>
<pre>
#include <stdio.h>
int r1;
double ten = 10.0;
int main(int c, char **v)
{
r1 = 0.1 == (1.0 / ten);
printf("r1=%d\n", r1);
}
</pre>
<p>With a C99 compilation platform that defines <code>FLT_EVAL_METHOD</code> as 0, this program prints "r1=1", but with a compilation platform that sets <code>FLT_EVAL_METHOD</code> to 2, it prints “r1=0”.</p>
<p>Although we are discussing non-strictly-IEEE 754 compilers, we are assuming IEEE 754-like floating-point: we're not in 1980 any more.
Also, we are assuming that <code>long double</code> has more precision than <code>double</code>, because the opposite situation would make any discussion specifically about <code>FLT_EVAL_METHOD == 2</code> quite moot. In fact, we are precisely thinking of compilation platforms where <code>float</code> is IEEE 754 single-precision (now called binary32), <code>double</code> is IEEE 754 double-precision (binary64), and <code>long double</code> is the 8087 80-bit double-extended format.</p>
<p><br /></p>
<p>Let us find ourselves a compiler with the right properties :</p>
<pre>
$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
…
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1)
$ gcc -mfpmath=387 -std=c99 t.c && ./a.out
r1=0
</pre>
<p>Good! (it seems)</p>
<blockquote><p>The test program sets <code>r1</code> to 0 because the left-hand side <code>0.1</code> of the equality test is the double-precision constant 0.1, whereas the right-hand side is the double-extended precision result of the division of 1 by 10. The two differ because 0.1 cannot be represented exactly in binary floating-point, so the <code>long double</code> representation is closer to the mathematical value and thus different from the <code>double</code> representation. We can make sure this is the right explanation by changing the expression for <code>r1</code> to <code>0.1L == (1.0 / ten)</code>, in which the division is typed as <code>double</code> but computed as <code>long double</code>, then promoted to <code>long double</code> in order to be compared to <code>0.1L</code>, the <code>long double</code> representation of the mathematical constant <code>0.1</code>. This change causes <code>r1</code> to receive the value 1 with our test compiler, whereas the change would make <code>r1</code> receive 0 if the program was compiled with a strict IEEE 754 C compiler.</p></blockquote>
<h2>Pitfall 1: Constant expressions</h2>
<p>Let us test the augmented program below:</p>
<pre>
#include <stdio.h>
int r1, r2;
double ten = 10.0;
int main(int c, char **v)
{
r1 = 0.1 == (1.0 / ten);
r2 = 0.1 == (1.0 / 10.0);
printf("r1=%d r2=%d\n", r1, r2);
}
</pre>
<p>In our first setback, the program prints “r1=0 r2=1”. The assignment to <code>r2</code> has been compiled into a straight constant-to-register move, based on a constant evaluation algorithm that does not obey the same rules that execution does. If we are to write a <strong>precise</strong> static analyzer that corresponds to this GCC-4.4.3, this issue is going to seriously complicate our task. We will have to delineate a notion of “constant expressions” that the analyzer with evaluate with the same rules as GCC's rules for evaluating constant expressions, and then implement GCC's semantics for run-time evaluation of floating-point expressions for non-constant expressions. And our notion of “constant expression” will have to exactly match GCC's notion of “constant expression”, lest our analyzer be unsound.</p>
<h2>Clarification: What is a “precise” static analyzer?</h2>
<p>This is as good a time as any to point out that Frama-C's value analysis plug-in, for instance, is already able to analyze programs destined to be compiled with <code>FLT_EVAL_METHOD</code> as 2. By default, the value analysis plug-in assumes IEEE 754 and <code>FLT_EVAL_METHOD == 0</code>:</p>
<pre>
$ frama-c -val t.c
…
t.c:9:[kernel] warning: Floating-point constant 0.1 is not represented exactly.
Will use 0x1.999999999999ap-4.
See documentation for option -warn-decimal-float
…
[value] Values at end of function main:
r1 ∈ {1}
r2 ∈ {1}
</pre>
<p><br /></p>
<p>The possibility of <code>FLT_EVAL_METHOD</code> being set to 2 is captured by the option <code>-all-rounding-modes</code>:</p>
<pre>
$ frama-c -val -all-rounding-modes t.c
…
t.c:9:[kernel] warning: Floating-point constant 0.1 is not represented exactly.
Will use 0x1.999999999999ap-4.
See documentation for option -warn-decimal-float
…
[value] Values at end of function main:
r1 ∈ {0; 1}
r2 ∈ {0; 1}
</pre>
<p>The sets of values predicted for variables <code>r1</code> and <code>r2</code> at the end of <code>main()</code> each contain the value given by the program as compiled by GCC-4.4.3, but these sets are not precise. If the program then went on to divide <code>r1</code> by <code>r2</code>, Frama-C's value analysis would warn about a possible division by zero, whereas we know that with our compiler, the division is safe. The warning would be a false positive.</p>
<p>We are talking here about making a static analyzer with the ability to conclude that <code>r1</code> is 0 and <code>r2</code> is 1 because we told it that we are targeting a compiler that makes it so.</p>
<p><br /></p>
<blockquote><p>The above example command-lines are for Frama-C's value analysis, but during her PhD, Thi Minh Tuyen Nguyen has shown that the same kind of approach could be applied to source-level Hoare-style verification of floating-point C programs. The relevant articles can be found in the <a href="http://hisseo.saclay.inria.fr/documents.html">results of the Hisseo project</a>.</p></blockquote>
<h2>To be continued</h2>
<p>In the next post, we will find more pitfalls, revisit a post by Joseph S. Myers in the GCC mailing list, and conclude that implementing a precise static analyzer for this sort of compilation platform is a lot of work.</p>Contrarianismurn:md5:6f50f6e52fb1173e7c0185f9b9615f5c2013-05-14T22:08:00+01:002013-05-14T22:08:00+01:00pascalfloating-point <p>If I told you that when <code>n</code> is a positive power of two and <code>d</code> an arbitrary number, both represented as <code>double</code>, the condition <code>(n - 1) * d + d == n * d</code> in strictly-IEEE-754-implementing C is always true, would you start looking for a counter-example, or start looking for a convincing argument that this property may hold?</p>
<p>If you started looking for counter-examples, would you start with the vicious values? Trying to see if <code>NaN</code> or <code>+inf</code> can be interpreted as “a positive power of two” or “an arbitrary number” represented “as <code>double</code>”? A subnormal value for <code>d</code>? A subnormal value such that <code>n*d</code> is normal? A subnormal value such that <code>(n - 1) * d</code> is subnormal and <code>n * d</code> is normal?</p>
<p>Or would you try your luck with ordinary values such as <code>0.1</code> for <code>d</code> and <code>4</code> for <code>n</code>?</p>
<p><br /></p>
<p>This post is based on a remark by Stephen Canon. Also, I have discovered a truly remarkable proof of the property which this quick post is too small to contain.</p>Big round numbers, and a book reviewurn:md5:7c4e26a5b123edb26b6e792ed66f20b02013-05-11T12:19:00+01:002013-07-02T16:47:52+01:00pascalbig-round-numbersfloating-point <p>Nearly 15 months ago, according to a past article, this blog celebrated its 15-month anniversary, and celebrated with the announcement of minor milestones having been reached: 100 articles and 50 comments.</p>
<p>Fifteen months after that, the current count is nearly 200 articles and 200 comments. Also, the blog managed to get 100 subscribers in Google's centralized service for never having to mark the same post as read twice, Reader. This was a <a href="http://lifehacker.com/5990456/google-reader-is-getting-shut-down-here-are-the-best-alternatives">close call</a>.</p>
<p><br /></p>
<p>A lot of recent posts have been related to floating-point arithmetic. I would like to reassure everyone that this was only a fluke. Floating-point correctness became one of the Frama-C tenets with our involvement in two collaborative projects, U3CAT and Hisseo, now both completed. Very recently, something must have clicked for me and I became quite engrossed by the subject.</p>
<p><br /></p>
<p>As a result of this recent passion, in the last few days, I started reading the “Handbook of Floating-Point Arithmetic”, by Jean-Michel Muller et al. This book is both thick and dense, but fortunately well organized, so that it is easy to skip over sections you do not feel concerned with, such as decimal floating-point or hardware implementation details. This book is an amazing overview. It contains cristal-clear explanations of floating-point idioms that I was until then painstakingly reverse-engineering from library code. Fifteen years from now, people will say, “… and you should read the Handbook of Floating-Point Arithmetic. It is a bit dated now, but it is still the best reference, just complete it with this one and that one”, just like people might say now about Aho, Sethi and Ullman's Dragon book for compilation.</p>
<p>Except that right now, the book is current. The references to hardware are references to hardware that you might still have, or certainly remember having had. The open questions are still open. If you were offered the chance to read the Dragon book when it came out and was all shiny and new, would you pass? If not, and if there is the slightest chance that you might hold an interest in the mysteries of floating-point computation in the foreseeable future, read this book now, for the bragging rights.</p>
<p>In addition, the book goes down to the lowest levels of detail, with occasional snippets of programs to make it clear what is meant. The snippets are C code, and irreproachable C code.</p>A 63-bit floating-point type for 64-bit OCamlurn:md5:5df03d33ded2ac9d5ce692c2ed9097b02013-05-09T21:39:00+01:002013-05-11T18:33:18+01:00pascalfloating-pointOCaml <h2>The OCaml runtime</h2>
<p>The OCaml runtime allows polymorphism through the uniform representation of types. Every OCaml value is represented as a single word, so that it is possible to have a single implementation for, say, “list of things”, with functions to access (e.g. <code>List.length</code>) and build (e.g. <code>List.map</code>) these lists that work just the same whether they are lists of ints, of floats, or of lists of sets of integers.</p>
<p>Anything that does not fit in in a word is allocated in a block in the heap. The word representing this data is then a pointer to the block. Since the heap contains only blocks of words, all these pointers are aligned: their few least significants bits are always unset.</p>
<p>Argumentless constructors (like this: <code>type fruit = Apple | Orange | Banana</code>) and integers do not represent so much information that they need to be allocated in the heap. Their representation is <em>unboxed</em>. The data is directly inside the word that would otherwise have been a pointer. So while a list of lists is actually a list of pointers, a list of ints contains the ints with one less indirection. The functions accessing and building lists do not notice because ints and pointers have the same size.</p>
<p>Still, the Garbage Collector needs to be able to recognize pointers from integers. A pointer points to a well-formed block in the heap that is by definition alive (since it is being visited by the GC) and should be marked so. An integer can have any value and could, if precautions were not taken, accidentally look like a pointer. This could cause dead blocks to look alive, but much worse, it would also cause the GC to change bits in what it thinks is the header of a live block, when it is actually following an integer that looks like a pointer and messing up user data.</p>
<p>This is why unboxed integers provide 31 bits (for 32-bit OCaml) or 63 bits (for 64-bit OCaml) to the OCaml programmer. In the representation, behind the scenes, the least significant bit of a word containing an integer is always set, to distinguish it from a pointer. 31- or 63-bit integers are rather unusual, so anyone who uses OCaml at all knows this. What users of OCaml do not usually know is why there isn't a 63-bit unboxed float type for 64-bit OCaml.</p>
<h2>There is no unboxed 63-bit floating-point type in OCaml</h2>
<p>And the answer to this last question is that there is no particular reason one shouldn't have a 63-bit unboxed float type in OCaml. Defining one only requires carefully answering two more intricately related questions:</p>
<ul>
<li>What 63-bit floating-point format should be used?</li>
<li>How will the OCaml interpreter compute values in this format?</li>
</ul>
<p>In 1990, when 64-bit computers were few, Xavier Leroy <a href="http://gallium.inria.fr/~xleroy/publi/ZINC.pdf">decided</a> that in his (then future) Caml-light system, the type for floating-point would be 64-bit double precision. The double precision floating-point format did not come close to fitting in the then-prevalent 32-bit word:</p>
<blockquote><p>Floating-point numbers are allocated in the heap as unstructured blocks of length one, two or three words, depending on the possibilities of the hardware and on the required precision. An unboxed representation is possible, using the 10 suffix for instance, but this gives only 30 bits to represent floating-point numbers. Such a format lacks precision, and does not correspond to any standard format, so it involves fairly brutal truncations. Good old 64-bit, IEEE-standard floating point numbers seem more useful, even if they have to be allocated.</p></blockquote>
<p>First, a remark: it is not necessary to distinguish floats from ints: that is what the static type-system is for. From the point of view of the GC they are all non-pointers, and that's the only important thing. So if we decide to unbox floats, we can take advantage of the same representation as for integers, a word with the least significant bit set. And nowadays even the proverbial grandmother has a 64-bit computer to read e-mail on, hence the temptation to unbox floats.</p>
<p>Second, the reticence to truncate the mantissa of any existing format remains well-founded. Suppose that we defined a format with 51 explicit mantissa bits as opposed to double-precision's 52. We could use the double-precision hardware for computations and then round to 51 bits of mantissa, but the sizes are so close that this would introduced plenty of <em>double rounding errors</em>, where the result is less precise than if it had been rounded directly to 51 bits.
As someone who has to deal with the consequences of hardware computing 64-bit mantissas that are then rounded a second time to 52-bit, I feel dirty just imagining this possibility. If we went for 1 sign bit, 11 exponent bits, 51 explicit mantissa bits, we would have to use software emulation to round directly to the correct result.</p>
<p>This post is about another idea to take advantage of the double-precision hardware to implement a 63-bit floating-point type.</p>
<h2>A truncationless 63-bit floating-point format</h2>
<h3>Borrow a bit from the exponent</h3>
<p>Taking one of the bits from the 11 reserved for the exponent in the IEEE 754 double-precision format does not have such noticeable consequences. At the top of the scale, it is easy to map values above a threshold to infinity. This does not involve double-rouding error. At the bottom of the scale, things are more complicated. The very smallest floating-point numbers of a proper floating-point format, called subnormals, have an effective precision of less than the nominal 52 bits. Computing with full-range double-precision and then rounding to reduced-range 63-bit means that the result of a computation can be computed as a normal double-precision number with 52-bit mantissa, say 1.324867e-168, and then rounded to the narrower effective precision of a 63-bit subnormal float.</p>
<blockquote><p>Incidentally, this sort of issue is the sort that remains even after you have configured an x87 to use only the 53 or 24 mantissa bits that make sense to compute with the precision of double- or single-precision. Only the range of the mantissa is reduced, not that of the exponent, so numbers that would be subnormals in the targeted type are normal when represented in a x87 register. You could hope to fix them after each computation with an option such as GCC's -ffloat-store, but then they are double-rounded. The first rounding is at 53 or 24 bits, and the second to the effective precision of the subnormal.</p></blockquote>
<h3>Double-rounding, Never!</h3>
<p>But since overflows are much easier to handle, we can cheat. In order to make sure that subnormal results are rounded directly to the effective precision, we can bias the computations so that if the result is going to be a 63-bit subnormal, the double-precision operation produces a subnormal result already.</p>
<p>In practice, this means that when the OCaml program is adding numbers 1.00000000001e-152 and -1.0e-152, we do not show these numbers to the double-precision hardware. What we show to the hardware instead is these numbers multiplied by 2^-512, so that if the result need to be subnormal in the 63-bit format, and in this example, it needs, then a subnormal double-precision will be computed with the same number of bits of precision.</p>
<p>In fact, we can maintain this “store numbers as 2^-512 times their intended value” convention all the time, and only come out of it at the time of calling library functions such as <code>printf()</code>.</p>
<p>For multiplication of two operands represented as 2^-512 times their real value, one of the arguments needs to be unbiased (or rebiased: if you have a trick to remember which is which, please share) before the hardware multiplication, by multiplying it by 2^512.</p>
<p>For division the result must be rebiased after it is computed.</p>
<p>The implementation of the correctly-rounded function <code>sqrt()</code> for 63-bit floats is left as an exercise to the reader.</p>
<h3>Implementation</h3>
<p>A quick and dirty implementation, only tested as much as shown, is available from <a href="http://ideone.com/Ev5uIP">ideone</a>. Now I would love for someone who actually uses floating-point in OCaml to finish integrating this in the OCaml runtime and do some benchmarks. Not that I expect it will be very fast: the 63-bit representation involves a lot of bit-shuffling, and OCaml uses its own tricks, such as unboxing floats inside arrays, so that it will be hard to compete.</p>
<h2>Credits</h2>
<p>I should note that I have been reading a report on implementing a perfect emulation of IEEE 754 double-precision using x87 hardware, and that the idea presented here was likely to be contained there. Google, which is prompt to point to the wrong definition of FLT_EPSILON, has been no help in finding this report again.</p>Definition of FLT_EPSILONurn:md5:0d5b474b51e35b8edf00b23674a654c32013-05-09T11:12:00+01:002013-05-09T11:29:55+01:00pascalfloating-pointrant <h2>Correct and wrong definitions for the constant FLT_EPSILON</h2>
<p>If I google “FLT_EPSILON”, the topmost result is <a href="http://www.rowleydownload.co.uk/avr/documentation/index.htm?http://www.rowleydownload.co.uk/avr/documentation/FLT_EPSILON.htm">this page</a> with this definition:</p>
<pre>
FLT_EPSILON the minimum positive number such that 1.0 + FLT_EPSILON != 1.0.
</pre>
<p>No, no, no, no, no.</p>
<p>I don't know where this definition originates from, but it is obviously from some sort of standard C library, and it is wrong, wrong, wrong, wrong, wrong.
The definition of the C99 standard is:</p>
<blockquote><p>the difference between 1 and the least value greater than 1 that is representable in the given floating point type, b^(1−p)</p></blockquote>
<p>The GNU C library gets it right:</p>
<blockquote><p>FLT_EPSILON: This is the difference between 1 and the smallest floating point number of type float that is greater than 1.</p></blockquote>
<h2>The difference</h2>
<p>On any usual architecture, with the correct definition, FLT_EPSILON is <code>0x0.000002p0</code>, the difference between <code>0x1.000000p0</code> and the smallest float above it, <code>0x1.000002p0</code>.</p>
<p><br /></p>
<p>The notation <code>0x1.000002p0</code> is a convenient <a href="https://blogs.oracle.com/darcy/entry/hexadecimal_floating_point_literals">hexadecimal input format</a>, introduced in C99, for floating-point numbers. The last digit is a <code>2</code> where one might have expected a <code>1</code> because single-precision floats have 23 explicit bits of mantissa, and 23 is not a multiple of 4. So the <code>2</code> in <code>0x1.000002p0</code> represents the last bit that can be set in a single-precision floating-point number in the interval [1…2).</p>
<p><br /></p>
<p>If one adds FLT_EPSILON to <code>1.0f</code>, one does obtain <code>0x1.000002p0</code>. But is it the smallest <code>float</code> with this property?</p>
<pre>
#include <stdio.h>
void pr_candidate(float f)
{
printf("candidate: %.6a\tcandidate+1.0f: %.6a\n", f, 1.0f + f);
}
int main(){
pr_candidate(0x0.000002p0);
pr_candidate(0x0.000001fffffep0);
pr_candidate(0x0.0000018p0);
pr_candidate(0x0.000001000002p0);
pr_candidate(0x0.000001p0);
}
</pre>
<p>This program, compiled and executed, produces:</p>
<pre>
candidate: 0x1.000000p-23 candidate+1.0f: 0x1.000002p+0
candidate: 0x1.fffffep-24 candidate+1.0f: 0x1.000002p+0
candidate: 0x1.800000p-24 candidate+1.0f: 0x1.000002p+0
candidate: 0x1.000002p-24 candidate+1.0f: 0x1.000002p+0
candidate: 0x1.000000p-24 candidate+1.0f: 0x1.000000p+0
</pre>
<p>No, <code>0x0.000002p0</code> is not the smallest number that, added to <code>1.0f</code>, causes the result to be above <code>1.0f</code>. This honor goes to <code>0x0.000001000002p0</code>, the smallest float above half FLT_EPSILON.</p>
<p>Exactly half FLT_EPSILON, the number <code>0x0.000001p0</code> or <code>0x1.0p-24</code> as you might prefer to call it, causes the result of the addition to be exactly midway between <code>1.0f</code> and its successor. The rule says that the “even” one has to be picked in this case. The “even” one is <code>1.0f</code>.</p>
<h2>Conclusion</h2>
<p>Fortunately, in the file that initiated this rant, the value for FLT_EPSILON is correct:</p>
<pre>
#define FLT_EPSILON 1.19209290E-07F // decimal constant
</pre>
<p>This is the decimal representation of <code>0x0.000002p0</code>. Code compiled against this header will work. It is only the comment that's wrong.</p>Rounding float to nearest integer, part 3urn:md5:881af5956ef5ee684d223242cda4cf402013-05-04T13:50:00+01:002013-05-05T09:17:11+01:00pascalfloating-point <p>Two earlier posts showed two different approaches in order to round a float to the nearest integer. The <a href="http://blog.frama-c.com/index.php?post/2013/05/02/nearbyintf1">first</a> was to truncate to integer after having added the right quantity (either 0.5 if the programmer is willing to take care of a few dangerous inputs beforehand, or the predecessor of 0.5 so as to have fewer dangerous inputs to watch for).</p>
<p>The <a href="http://blog.frama-c.com/index.php?post/2013/05/03/nearbyintf2">second approach</a> was to mess with the representation of the <code>float</code> input, trying to recognize where the bits for the fractional part were, deciding whether they represented less or more than one half, and either zeroing them (in the first case), or sending the float up to the nearest integer (in the second case) which was simple for complicated reasons.</p>
<h2>Variations on the first method</h2>
<p>Several persons have suggested smart variations on the first theme, included here for the sake of completeness.
The first suggestion is as follows (remembering that the input <code>f</code> is assumed to be positive, and ignoring overflow issues for simplicity):</p>
<pre>
float myround(float f)
{
float candidate = (float) (unsigned int) f;
if (f - candidate <= 0.5) return candidate;
return candidate + 1.0f;
}
</pre>
<p>Other suggestions were to use <code>modff()</code>, that separates a floating-point number into its integral and fractional components, or <code>fmodf(f, 1.0f)</code>, that computes the remainder of <code>f</code> in the division by 1.</p>
<p><br /></p>
<p>These three solutions work better than adding 0.5 for a reason that is simple if one only looks at it superficially: floating-point numbers are denser around zero. Adding 0.5 takes us away from zero, whereas operations <code>f - candidate</code>, <code>modff(f, iptr)</code> and <code>fmodf(f, 1.0)</code> take us closer to zero, in a range where the answer can be exactly represented, so it is. (Note: this is a super-superficial explanation.)</p>
<h2>A third method</h2>
<h3>General idea: the power of two giveth, and the power of two taketh away</h3>
<p>The third and generally most efficient method for rounding <code>f</code> to the nearest integer is to take advantage of this marvelous rounding machine that is IEEE 754 arithmetic. But for this to work, the exact right machine is needed, that is, a C compiler that implements strict IEEE 754 arithmetic and rounds each operation to the precision of the type. If you are using GCC, consider using options <code>-msse2 -mfpmath=sse</code>.</p>
<p>We already noticed that single-precision floats between 2^23 and 2^24 are all the integers in this range. If we add some quantity to <code>f</code> so that the result ends up in this range, wouldn't it follow that the result obtained will be rounded to the integer? And it would be rounded in round-to-nearest. Exactly what we are looking for:</p>
<pre>
f f + 8388608.0f
_____________________________
0.0f 8388608.0f
0.1f 8388608.0f
0.5f 8388608.0f
0.9f 8388609.0f
1.0f 8388609.0f
1.1f 8388609.0f
1.5f 8388610.0f
1.9f 8388610.0f
2.0f 8388610.0f
2.1f 8388610.0f
</pre>
<p>The rounding part goes well, but now we are stuck with large numbers far from the input and from the expected output. Let us try to get back close to zero by subtracting <code>8388608.0f</code> again:</p>
<pre>
f f + 8388608.0f f + 8388608.0f - 8388608.0f
____________________________________________________________________
0.0f 8388608.0f 0.0f
0.1f 8388608.0f 0.0f
0.5f 8388608.0f 0.0f
0.9f 8388609.0f 1.0f
1.0f 8388609.0f 1.0f
1.1f 8388609.0f 1.0f
1.5f 8388610.0f 2.0f
1.9f 8388610.0f 2.0f
2.0f 8388610.0f 2.0f
2.1f 8388610.0f 2.0f
</pre>
<p>It works! The subtraction is exact, for the same kind of reason that was informally sketched for <code>f - candidate</code>. Adding <code>8388608.0f</code> causes the result to be rounded to the unit, and then subtracting it is exact, producing a <code>float</code> that is exactly the original rounded to the nearest integer.</p>
<p>For these inputs anyway. For very large inputs, the situation is different.</p>
<h3>Very large inputs: absorption</h3>
<pre>
f f + 8388608.0f f + 8388608.0f - 8388608.0f
____________________________________________________________________
1e28f 1e28f 1e28f
1e29f 1e29f 1e29f
1e30f 1e30f 1e30f
1e31f 1e31f 1e31f
</pre>
<p>When <code>f</code> is large enough, adding <code>8388608.0f</code> to it does nothing, and then subtracting <code>8388608.0f</code> from it does nothing again. This is good news, because we are dealing with very large single-precision floats that are already integers, and can be returned directly as the result of our function <code>myround()</code>.</p>
<p><br /></p>
<p>In fact, since we entirely avoided converting to a range-challenged integer type, and since adding <code>8388608.0f</code> to <code>FLT_MAX</code> does not make it overflow (we have been assuming the FPU was in round-to-nearest mode all this time, remember?), we could even caress the dream of a straightforward <code>myround()</code> with a single execution path. Small floats rounded to the nearest integer and taken back near zero where they belong, large floats returned unchanged by the addition and the subtraction of a comparatively small quantity (with respect to them).</p>
<h3>Dreams crushed</h3>
<p>Unfortunately, although adding and subtracting 2^23 almost always does what we expect (it does for inputs up to 2^23 and above 2^47), there is a range of values for which it does not work. An example:</p>
<pre>
f f + 8388608.0f f + 8388608.0f - 8388608.0f
____________________________________________________________________
8388609.0f 16777216.0f 8388608.0f
</pre>
<p>In order for function <code>myround()</code> to work correctly for all inputs, it still needs a conditional. The simplest is to put aside inputs larger than 2^23 that are all integers, and to use the addition-subtraction trick for the others:</p>
<pre>
float myround(float f)
{
if (f >= 0x1.0p23)
return f;
return f + 0x1.0p23f - 0x1.0p23f;
}
</pre>
<p>The function above, in round-to-nearest mode, satisfies the contract we initially set out to fulfill. Interestingly, if the rounding mode is other than round-to-nearest, then it still rounds to a nearby integer, but according to the FPU rounding mode. This is a consequence of the fact that the only inexact operation is the addition. The subtraction, being exact, is not affected by the rounding mode.</p>
<p>For instance, if the FPU is set to round downwards and the argument <code>f</code> is <code>0.9f</code>, then <code>f + 8388608.0f</code> produces <code>8388608.0f</code>, and <code>f + 8388608.0f - 8388608.0f</code> produces zero.</p>
<h2>Conclusion</h2>
<p>This post concludes the “rounding float to nearest integer” series. The method highlighted in this third post is actually the method generally used for function <code>rintf()</code>, because the floating-point addition has the effect of setting the “inexact” FPU flag when it is inexact, which is exactly when the function returns an output other than its input, which is when <code>rintf()</code> is specified as setting the “inexact” flag.</p>
<p>Function <code>nearbyintf()</code> is specified as not touching the FPU flags and would typically be implemented with the method from the second post.</p>Rounding float to nearest integer, part 2urn:md5:007ef0aa0b3d87aa6e600d0ecdc166372013-05-03T17:07:00+01:002013-07-15T15:25:44+01:00pascalfloating-point <p>The <a href="http://blog.frama-c.com/index.php?post/2013/05/02/nearbyintf1">previous post</a> offered to round a positive float to the nearest integer, represented as a float, through a conversion back and forth to 32-bit unsigned int. There was also the promise of at least another method. Thanks to reader feedback, there will be two. What was intended to be the second post in the series is hereby relegated to third post.</p>
<h2>Rounding through bit-twiddling</h2>
<p>Several readers seemed disappointed that the implementation proposed in the last post was not accessing the bits of float <code>f</code> directly. This is possible, of course:</p>
<pre>
assert (sizeof(unsigned int) == sizeof(float));
unsigned int u;
memcpy(&u, &f, sizeof(float));
</pre>
<p>In the previous post I forgot to say that we were assuming 32-bit unsigned ints. From now on we are in addition assuming that floats and unsigned ints have the same endianness, so that it is convenient to work on the bit representation of one by using the other.</p>
<p>Let us special-case the inputs that can mapped to zero or one immediately. We are going to need it. We could do the comparisons to 0.5 and 1.5 on <code>u</code>, because positive floats increase with their unsigned integer representation, but there is no reason to: it is more readable to work on <code>f</code>:</p>
<pre>
if (f <= 0.5) return 0.;
if (f <= 1.5) return 1.;
</pre>
<p>Now, to business. The actual exponent of <code>f</code> is:</p>
<pre>
int exp = ((u>>23) & 255) - 127;
</pre>
<p>The explicit bits of <code>f</code>'s significand are <code>u & 0x7fffff</code>, but
there is not need to take them out: we will manipulate them directly inside <code>u</code>. Actually, at one point we will cheat and manipulate a bit of the exponent at the same time, but it will all be for the best.</p>
<p>A hypothetical significand for the number 1, aligned with the existing significand for <code>f</code>, would be <code>1U << (23 - exp)</code>. But this is hypothetical, because <code>23 - exp</code> can be negative. If this happens, it means that <code>f</code> is in a range where all floating-point numbers are integers.</p>
<pre>
if (23 - exp < 0) return f;
unsigned int one = 1U << (23 - exp);
</pre>
<p>You may have noticed that since we special-cased the inputs below <code>1.5</code>, variable <code>one</code> may be up to <code>1 << 23</code> and almost, but not quite align with the explicit bits of <code>f</code>'s significand. Let us make a note of this for later. For now, we are interested in the bits that represent the fractional part of <code>f</code>, and these are always:</p>
<pre>
unsigned int mask = one - 1;
unsigned int frac = u & mask;
</pre>
<p>If these bits represent less than one half, the function must round down. If this is the case, we can zero all the bits that represent the fractional part of <code>f</code> to obtain the integer immediately below <code>f</code>.</p>
<pre>
if (frac <= one / 2)
{
u &= ~mask;
float r;
memcpy(&r, &u, sizeof(float));
return r;
}
</pre>
<p>And we are left with the difficult exercise of finding the integer immediately above <code>f</code>. If this computation stays in the same binade, this means finding the multiple of <code>one</code> immediately above <code>u</code>.</p>
<blockquote><p>“binade” is not a word, according to my dictionary. It should be one. It designates a range of floating-point numbers such as [0.25 … 0.5) or [0.5 … 1.0). I needed it in the last post, but I made do without it. I shouldn't have. Having words to designate things is the most important wossname towards clear thinking.</p></blockquote>
<p>And if the computation does not stay in the same binade, such as 3.75 rounding up to 4.0? Well, in this case it seems we again only need to find the multiple of <code>one</code> immediately above <code>u</code>, which is in this case the power of two immediately above <code>f</code>, and more to the point, the number the function must return.</p>
<pre>
u = (u + mask) & ~mask;
float r;
memcpy(&r, &u, sizeof(float));
return r;
</pre>
<p><br /></p>
<p>To summarize, a function for rounding a float to the nearest integer by bit-twiddling is as follows. I am not sure what is so interesting about that. I like the function in the previous post or the function in the next post better.</p>
<pre>
float myround(float f)
{
assert (sizeof(unsigned int) == sizeof(float));
unsigned int u;
memcpy(&u, &f, sizeof(float));
if (f <= 0.5) return 0.;
if (f <= 1.5) return 1.;
int exp = ((u>>23) & 255) - 127;
if (23 - exp < 0) return f;
unsigned int one = 1U << (23 - exp);
unsigned int mask = one - 1;
unsigned int frac = u & mask;
if (frac <= one / 2)
u &= ~mask;
else
u = (u + mask) & ~mask;
float r;
memcpy(&r, &u, sizeof(float));
return r;
}
</pre>
<h2>To be continued again</h2>
<p>The only salient point in the method in this post is how we pretend not to notice when significand arithmetic overflows over the exponent, for inputs between 1.5 and 2.0, 3.5 and 4.0, and so on. The method in next post will be so much more fun than this.</p>Harder than it looks: rounding float to nearest integer, part 1urn:md5:6d4a8b9339faffc87a54da8f0dac81f92013-05-02T18:14:00+01:002014-06-20T12:08:06+01:00pascalfloating-point <p>This post is the first in a series on the difficult task of rounding a floating-point number to an integer. Laugh not!
The easiest-looking questions can hide unforeseen difficulties, and the most widely accepted solutions can be wrong.</p>
<h2>Problem</h2>
<p>Consider the task of rounding a <code>float</code> to the nearest integer. The answer is expected as a <code>float</code>, same as the input. In other words, we are looking at the work done by standard C99 function <code>nearbyintf()</code> when the rounding mode is the default round-to-nearest.</p>
<p>For the sake of simplicity, in this series of posts, we assume that the argument is positive and we allow the function to round any which way if the float argument is exactly in-between two integers. In other words, we are looking at the ACSL specification below.</p>
<pre>
/*@ requires 0 ≤ f ≤ FLT_MAX ;
ensures -0.5 ≤ \result - f ≤ 0.5 ;
ensures \exists integer n; \result == n;
*/
float myround(float f);
</pre>
<p>In the second <code>ensures</code> clause, <code>integer</code> is an ACSL type (think of it as a super-long <code>long long</code>). The formula <code>\exists integer n; \result == n</code> simply means that <code>\result</code>, the <code>float</code> returned by function <code>myround()</code>, is a mathematical integer.</p>
<h2>Via truncation</h2>
<p>A first idea is to convert the argument <code>f</code> to <code>unsigned int</code>, and then again to <code>float</code>, since that's the expected type for the result:</p>
<pre>
float myround(float f)
{
return (float) (unsigned int) f;
}
</pre>
<h3>Obvious overflow issue</h3>
<p>One does not need Frama-C's value analysis to spot the very first issue, an overflow for large <code>float</code> arguments, but it's there, so we might as well use it:</p>
<pre>
$ frama-c -pp-annot -val r.c -lib-entry -main myround
...
r.c:9:[kernel] warning: overflow in conversion of f ([-0. .. 3.40282346639e+38])
from floating-point to integer. assert -1 < f < 4294967296;
</pre>
<p>This overflow can be fixed by testing for large arguments. Large floats are all integers, so the function can simply return <code>f</code> in this case.</p>
<pre>
float myround(float f)
{
if (f >= UINT_MAX) return f;
return (float) (unsigned int) f;
}
</pre>
<h3>Obvious rounding issue</h3>
<p>The cast from <code>float</code> to <code>unsigned int</code> does not round to the nearest integer. It “truncates”, that is, it rounds towards zero. And if you already know this, you probably know too the universally used trick to obtain the nearest integer instead of the immediately smaller one, adding 0.5:</p>
<pre>
float myround(float f)
{
if (f >= UINT_MAX) return f;
return (float) (unsigned int) (f + 0.5f);
}
</pre>
<p><strong>This universally used trick is wrong.</strong></p>
<h3>An issue when the ULP of the argument is exactly one</h3>
<p>The Unit in the Last Place, or ULP for short, of a floating-point number is its distance to the floats immediately above and immediately below it. For large enough floats, this distance is one. In that range, floats behave as integers.</p>
<blockquote><p>There is an ambiguity in the above definition for powers of two: the distances to the float immediately above and the float immediately below are not the same. If you know of a usual convention for which one is called the ULP of a power of two, please leave a note in the comments.</p></blockquote>
<pre>
int main()
{
f = 8388609.0f;
printf("%f -> %f\n", f, myround(f));
}
</pre>
<p>With a strict IEEE 754 compiler, the simple test above produces the result below:</p>
<pre>
8388609.000000 -> 8388610.000000
</pre>
<p>The value passed as argument is obviously representable as a float, since that's the type of <code>f</code>. However, the mathematical sum <code>f + 0.5</code> does not have to be representable as a float. In the worst case for us, when the argument is in a range of floats separated by exactly one, the floating-point sum <code>f + 0.5</code> falls exactly in-between the two representable floats <code>f</code> and <code>f + 1</code>. Half the time, it is rounded to the latter, although <code>f</code> was already an integer and was the correct answer for function <code>myround()</code>. This causes bugs as the one displayed above.</p>
<p>The range of floating-point numbers spaced every 1.0 goes from 2^23 to 2^24. Half these 2^23 values, that is, nearly 4 millions of them, exhibit the problem.</p>
<p>Since the 0.5 trick is nearly universally accepted as the solution to implement rounding to nearest from truncation, this bug is likely to be found in lots of places. Nicolas Cellier <a href="http://bugs.squeak.org/view.php?id=7134">identified it in Squeak</a>. He offered a solution, too: switch the FPU to round-downward for the time of the addition <code>f + 0.5</code>. But let us not fix the problem just yet, there is another interesting input for the function as it currently stands.</p>
<h3>An issue when the argument is exactly the predecessor of 0.5f</h3>
<pre>
int main()
{
f = 0.49999997f;
printf("%.9f -> %.9f\n", f, myround(f));
}
</pre>
<p>This test produces the output <code>0.499999970 -> 1.000000000</code>, although the input <code>0.49999997</code> is clearly closer to <code>0</code> than to <code>1</code>.</p>
<p>Again, the issue is that the floating-point addition is not exact. The argument <code>0.49999997f</code> is the last <code>float</code> of the interval <code>[0.25 … 0.5)</code>. The mathematical result of <code>f + 0.5</code> falls exactly midway between the last float of the interval <code>[0.5 … 1.0)</code> and <code>1.0</code>. The rule that ties must be rounded to the even choice means that <code>1.0</code> is chosen.</p>
<h3>A function that works</h3>
<p>The overflow issue and the first non-obvious issue (when ulp(f)=1) can be fixed by the same test: as soon as the ULP of the argument is one, the argument is an integer and can be returned as-is.</p>
<p>The second non-obvious issue, with input <code>0.49999997f</code>, can be fixed similarly.</p>
<pre>
float myround(float f)
{
if (f >= 0x1.0p23) return f;
if (f <= 0.5) return 0;
return (float) (unsigned int) (f + 0.5f);
}
</pre>
<h3>A better function that works</h3>
<p>Changing the FPU rounding mode, the suggestion in the Squeak bug report, is slightly unpalatable for such a simple function, but it suggests to add the predecessor of <code>0.5f</code> instead of <code>0.5f</code> to avoid the sum rounding up when it shouldn't:</p>
<pre>
float myround(float f)
{
if (f >= 0x1.0p23) return f;
return (float) (unsigned int) (f + 0.49999997f);
}
</pre>
<p>It turns out that this works, too. It solves the problem with the input <code>0.49999997f</code> without making the function fail its specification for other inputs.</p>
<h2>To be continued</h2>
<p>The next post will approach the same question from a different angle. It should not be without its difficulties either.</p>A conversionless conversion functionurn:md5:7787287f561bf85432182854de9ed6dd2013-05-01T13:11:00+01:002013-05-31T15:31:44+01:00pascalfloating-pointrant <h2>A rant about programming interview questions</h2>
<p>Software development is a peculiar field. An applicant for a more traditionally artistic position would bring a portfolio to eir job interview: a selection of creations ey deems representative of eir work and wants to be judged by.
But in the field of software development, candidates are often asked to solve a programming problem live, the equivalent of telling a candidate for a photography job “these are nice photographs you have brought, but could you take a picture for me, right now, with this unfamiliar equipment?”</p>
<p>Lots have already been written on programming job interview questions. Googling “fizzbuzz” alone reveals plenty of positions taken, reacted to, and counter-argumented. I do not intend to add to the edifice. Taking a step back, however, I notice that many of these posts do not tackle the question of why what works for the traditional arts should not be appropriate for the art of programming.</p>
<p><br /></p>
<p>What I intend to discuss is the poor quality of “do this simple task, but avoid this simple construct that makes it trivial” interview questions. I hate those. Everytime I hear a new one it seems to reach a new high in sheer stupidity. The questions are usually very poorly specified, too. One such question might be to convert a floating-point value to integer without using a cast. Is <code>floor()</code> or <code>ceil()</code> allowed? Are other library functions than these ones that solve the problem too directly allowed? May I use a union to access the bits of the floating-point representation? Or <code>memcpy()</code>?</p>
<p>Well, I have solved this particular question once and for all. The conversion function makes up the second part of this post. It uses only floating-point computations, no tricks. Now, one just needs to learn it and to regurgitate it as appropriate at interview time (there is no way one can write a working version of this program on blackboard, too). Who is hiring?</p>
<h2>A program</h2>
<p>The function below requires a strict IEEE 754 implementation. If your GCC is generating x87 instructions, options <code>-msse2 -mfpmath=sse</code> should prevent it from doing so and allow you to run the program:</p>
<pre>
#include <math.h>
#include <float.h>
#include <stdio.h>
#include <limits.h>
/*@ requires 0 <= f < ULLONG_MAX + 1 ; */
unsigned long long dbl2ulonglong(double f)
{
if (f < 1) return 0;
unsigned long long l = 0;
for (double coef = 3; coef != 0x1.0p53; coef = 2 * coef - 1, l >>= 1)
{
double t = coef * f;
double o = f - t + t - f;
if (o != 0)
{
l |= 0x8000000000000000ULL;
f -= fabs(o);
}
}
l |= 0x8000000000000000ULL;
for ( ; f != 0x1.0p63; f *= 2) l>>=1;
return l;
}
int main()
{
double f = 123456.;
unsigned long long l = dbl2ulonglong(f);
printf("result:%.18g %llu\n", f, l);
f = ULLONG_MAX * 0.99;
l = dbl2ulonglong(f);
printf("result:%.18g %llu\n", f, l);
printf("rounding:%llu %llu %llu %llu\n",
dbl2ulonglong(DBL_MIN),
dbl2ulonglong(1.4),
dbl2ulonglong(1.5),
dbl2ulonglong(1.6));
return 0;
}
</pre>
<p>The expected result is:</p>
<pre>
result:123456 123456
result:1.82622766329724559e+19 18262276632972455936
rounding:0 1 1 1
</pre>
<p>Incidentally, does anyone know how to write a correctness proof for this function? A formal proof would be nice, but just an informal explanation of how one convinces oneself that <code>o = f - t + t - f</code> is the right formula would already be something.</p>Non-expert floating-point-using developers need accurate floating-point libraries the mosturn:md5:2a426dbde80764133967bd5342dc60922013-04-06T22:20:00+01:002013-04-07T15:06:49+01:00pascalfloating-pointlink <h2>Quotes on the Internet</h2>
<p>In 2012, Lukas Mathis took a quote out of the context of a blog post by Marco Arment and ran with it. The result was a though-provoking essay. Key quote:</p>
<blockquote><p>This is a sentiment you often hear from people: casual users only need «entry-level» devices. Even casual users themselves perpetuate it: «Oh, I’m not doing much on my computer, so I always just go with the cheapest option.» And then they buy a horrid, underpowered netbook, find out that it has a tiny screen, is incredibly slow, the keyboard sucks, and they either never actually use it, or eventually come to the conclusion that they just hate computers.</p>
<p>
In reality, it’s exactly backwards: <strong>proficient users can deal with a crappy computer, but casual users need as good a computer as possible</strong>.</p></blockquote>
<p>Lukas fully develops the idea in his post, <a href="http://ignorethecode.net/blog/2012/11/04/crappy_computers/">Crappy Computers</a>. Go ahead and read it now if you haven't already. This blog will still be here when you come back.</p>
<h2>Floating-point libraries</h2>
<p>The idea expressed by Lukas Mathis applies identically in a much more specific setting: developing with floating-point computations. The developers who most need accurate floating-point libraries are those who least care about floating-point. These developers will themselves tell you that it is all the same to them. They do not know what an ULP (“unit in the last place”) is, so what difference is it to them if they get two of them as error where they could have had one or half of one?</p>
<p>In this, they are just as wrong as the casual computer users who pick horrid netbooks for themselves.</p>
<h3>Floating-point-wise, programming environments are not born equal</h3>
<p>All recent processors for desktop computers provide basic operations +, -, *, / and square root for IEEE 754 single- and double-precision floating-point numbers. Each operation has its assembly instruction, and since the assembly instruction is the fastest way to implement the operation, compilers have no opportunity to mess things up in a misguided attempt at optimizing for speed.</p>
<p>Who am I kidding? Of course compilers have plenty of opportunities to mess things up.</p>
<ol>
<li>It may seem to a compiler that a compile-time computation is even faster than the assembly instruction provided by the processor, so that if the program computes <code>x / 10.0</code>, the compiler may compute <code>1 / 10.0</code> at compile-time and generate assembly code that multiplies <code>x</code> by this constant instead. This transformation causes the result to be less accurate in some rare cases.</li>
<li>Or a compiler may simplify source-code expressions as if floating-point operations were associative when they aren't. It may for instance optimize a carefully crafted floating-point expression such as <code>a + b - a - b</code> into <code>0</code>.</li>
</ol>
<p>Nevertheless, there has been much progress recently in standard compliance for compilers' implementations of floating-point. Overall, for programs that only use the basic operators, the situation has never been better.</p>
<p><br /></p>
<p>The situation is not as bright-looking when it comes to mathematical libraries. These libraries provide conversion to and from decimal, and transcendental elementary functions implemented on top of the basic operations. They are typically part of the operating system. Implementations vary wildly in quality from an operating system to the next.</p>
<p><br /></p>
<p>Expert developers know exactly what compromise between accuracy and speed they need, and they typically use their own functions instead of the operating system's. By way of illustration, a famous super-fast pretty-accurate implementation of the <a href="http://en.wikipedia.org/wiki/Fast_inverse_square_root">inverse square root</a> function is used in Quake III and has been much <a href="http://blog.quenta.org/2012/09/0x5f3759df.html">analyzed</a>.</p>
<p><br /></p>
<p>The casual developer of floating-point programs, on the other hand, will certainly use the functions provided by the system. Some of <a href="http://en.wikipedia.org/wiki/Spivak_pronoun">eir</a> expectations may be naive, or altogether impossible to reconcile with the constraints imposed by the IEEE 754 standard. Other expectations may be common sense, such as a <code>sin()</code> function that does not <a href="http://blog.frama-c.com/index.php?post/2011/09/14/Linux-and-floating-point%3a-nearly-there">return</a> <code>-2.76</code>.</p>
<p>For such a developer, the mathematical libraries should strive to be as accommodating and standard-compliant as possible, because ey needs it, regardless of what ey thinks.</p>
<h2>An example</h2>
<p>To illustrate, I have written a string-to-long conversion function. It could have been an entry in John Regehr's contest, but since the deadline is passed, I have allowed the function to expect only positive numbers and to fail miserably when the input is ill-formed.</p>
<p>The function looks like this:</p>
<pre>
long str2long(char *p)
{
size_t l = strlen(p);
long acc = 0;
for (size_t i=0; i<l; i++)
{
int digit = p[i] - '0';
long pow10 = pow(10, l - 1U - i);
acc += digit * pow10;;
}
return acc;
}
</pre>
<p>Neat, huh?</p>
<p><br /></p>
<p>I tested this function more than the <a href="http://blog.frama-c.com/index.php?post/2013/03/20/str2long">last function</a>. This time I compiled it and invoked it on a few strings:</p>
<pre>
printf("%ld %ld %ld %ld\n",
str2long("0"),
str2long("123"),
str2long("999"),
str2long("123456789123456789"));
</pre>
<p>You can <a href="http://blog.frama-c.com/public/float_str2long.c">download the entire C code</a> for yourself. If you run it, you should get:</p>
<pre>
0 123 999 123456789123456789
</pre>
<p>I wrote my function to work for all well-formed inputs that fit in a <code>long</code> (but I only tested it for four values, so do not embed it in your space shuttle, please). Some of the reasons why I expect it to work are implicit: for one, powers of ten up to 10^22 <a href="http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/">are exactly representable as double-precision floating-point numbers</a>. Also, I happen to know that on the system I use, the mathematical library is one of the best available.</p>
<p><br /></p>
<p>I am not, in fact, a floating-point expert. I could be completely illiterate with respect to floating-point and have written the exact same function. In fact, this happened to StackOverflow user1257. (I am not saying that StackOverflow user1257 is illiterate with respect to floating-point, either. Ey wrote a function similar to mine, after all.)</p>
<p><strong>User1257's function returned <code>122</code> when applied to the string <code>"123"</code> !!!</strong></p>
<p>This was so troubling that user1257 <a href="http://stackoverflow.com/q/15851636/139746">suspected a compiler bug</a>. The reality is more likely that on eir computer, the statement <code>long pow10 = pow(10, 2);</code> sets variable <code>pow10</code> to <code>99</code>. The function <code>pow()</code> only needs to be inaccurate by 1ULP for this result to come up because of C's truncation behavior (towards zero) when converting from floating-point to integer.</p>
<h2>Conclusion</h2>
<p>My <code>str2long()</code> function would fail just the same if it was run in user1257's compilation environment. I still think that my function is correct and that I should be able to expect results to the ULP from the math library's <code>pow()</code> function. A floating-point expert would never even encounter the issue at all. I might be able to diagnose it and to cope. But the floating-point beginner simply needs an environment in which <code>long pow10 = pow(10, 2);</code> sets <code>pow10</code> to <code>100</code>.</p>
<p>If you program, and if you use floating-point at all, beware of relying on the math library equivalent of a crappy netbook.</p>Correct rounding or mathematically-correct rounding?urn:md5:ca815bf0722f8d0f1175729e89819ca72013-03-03T22:35:00+00:002013-03-04T12:16:48+00:00pascalfloating-pointlinkrant <p><a href="http://lipforge.ens-lyon.fr/www/crlibm/">CRlibm</a> is a high-quality library of floating-point elementary functions. We used it as reference a long time ago in this blog while looking at lesser elementary function implementations and the even lesser properties we could verify about them.</p>
<h2>A bold choice</h2>
<p>The CRlibm <a href="http://ftp.nluug.nl/pub/os/BSD/FreeBSD/distfiles/crlibm/crlibm-1.0beta3.pdf">documentation</a> contains this snippet:</p>
<blockquote><p>[…] it may happen that the requirement of correct rounding conflicts with a basic mathematical property of the function, such as its domain and range. A typical example is the arctangent of a very large number which, rounded up, will be a number larger than π/2 (fortunately, ◦(π/2) < π/2). The policy that will be implemented in crlibm will be</p>
<p>
• to give priority to the mathematical property in round to nearest mode (so as not to hurt the innocent user who may expect such a property to be respected), and</p>
<p>
• to give priority to correct rounding in the directed rounding modes, in order to provide trustful bounds to interval arithmetic.</p></blockquote>
<p>The choice for directed rounding modes is obviously right. I am concerned about the choice made for round-to-nearest.
The documentation states the dilemma very well. One can imagine slightly out of range values causing out-of-bound indexes during table look-ups and worse things.</p>
<p><br /></p>
<p>I seldom reason about floating-point programs. I work on static analysis and am only concerned about floating-point inasmuch as it is a requirement for writing a static analyzer correct for programs that include floating-point computations.</p>
<p>However, when I do reason about floating-point programs, I am more often compounding approximations, starting from the base assumption that <strong>a correctly rounded function returns a result within 1/2ulp of the mathematical result</strong> than I am assuming that atan(x) ≤ π/2. The choice the CRlibm implementors made means that suddenly, the reasoning I often make is wrong. The value of <code>atan(x)</code> in the program may not be 1/2ulp from the real arctangent of the same <code>x</code>. It can be more when <code>x</code> is very large and mathematical-correctness overrode correct rounding.</p>
<blockquote><p>Truck drivers fall asleep at the wheel when they face long, dull stretches of straight empty roads. Similarly, it is good to have another special case to consider when reasoning about floating-point computations. With only infinites and denormals to worry about, it can get, you know, a bit too easy.</p></blockquote>
<h2>Oh well, it's only π/2</h2>
<p>In this section I rhetorically assume that it is only π/2 for which there is a problem. The CRlibm documentation reminds us that in the case of double precision, we were lucky. Or perhaps it isn't luck, and the IEEE 754 committee took the desirableness of the property (double)π/2 < π/2 into account when it chose the number of bits in the significand of the double-precision format.</p>
<p><br /></p>
<p>How lucky (or careful) have we been, exactly? Let us test it with the program below — assuming my compilation platform works as intended.</p>
<pre>
#include <stdio.h>
#define PI(S) 3.1415926535897932384626433832795028841971693993751##S
float f = PI(f);
double d = PI();
long double ld = PI(L);
int main(){
printf(" 3.14159265358979323846264338327950288419716939937510\n");
printf("f %.50f\n", f);
printf("d %.50f\n", d);
printf("ld %.50Lf\n",ld);
}
</pre>
<p>The result of compiling and executing the program is, for me:</p>
<pre>
3.14159265358979323846264338327950288419716939937510
f 3.14159274101257324218750000000000000000000000000000
d 3.14159265358979311599796346854418516159057617187500
ld 3.14159265358979323851280895940618620443274267017841
</pre>
<p>As you can see, the nearest single-precision float to π is above π, as is the nearest 80-bit long double. The same goes for π/2 because the floating-point representations for π and π/2 only differ in the exponent. Consequently, the issue raised by the CRlibm implementors will come up for both functions <code>atanf()</code> and <code>atanl()</code>, when it is time to get them done. We were not very lucky after all (or careful when defining the IEEE 754 standard).</p>
<h2>A subjective notion</h2>
<p>But what exactly is the informal “mathematical correctness” notion that this post is predicated upon? Yes, the “innocent user” may expect mathematical properties to be respected as much as possible, but there are plenty of mathematical properties! Let us enumerate some more:</p>
<p><br /></p>
<p>If <code>x ≤ 1</code> in a program, then <code>exp(x)</code> should always be lower than the mathematical constant e.</p>
<p>So far so good. The above is a good rule for an exponential implementation to respect. We are making progress.</p>
<p><br /></p>
<p>Here is another property:</p>
<p>If <code>x ≥ 1</code> in a program, then <code>exp(x)</code> should always be greater than the mathematical constant e.</p>
<p><br /></p>
<p>We are decidedly unlucky today, because at most one of these is going to be true of any floating-point function <code>exp()</code>. The programmatic value <code>exp(1)</code> must be either above or below the mathematical constant e (it is never equal to it because the mathematical constant e does not have a finite representation in binary).</p>
<h2>Why does it matter anyway?</h2>
<p>Let us revisit the argument:</p>
<blockquote><p>to give priority to the mathematical property in round to nearest mode (so as not to hurt the innocent user who may expect such a property to be respected)</p></blockquote>
<p>I alluded to a possible problem with a programmer computing an array index from <code>atanf(x)</code> under the assumption that it is always lower than π/2. But how exactly would an innocent user even notice that <code>atanf(1e30)</code> is not lower than π/2? The value π/2 cannot exist in eir program any more than e. The user might innocently write an assertion like:</p>
<pre>
assert(atanf(x)<=(3.1415926535897932f/2.0f));
</pre>
<p>This assertion will never trigger! The function <code>atanf()</code> will indeed return at most the single-precision float <code>3.1415926535897932f/2.0f</code>. It does not matter that this number is actually slightly larger than π/2. For all intents and purposes, in the twisted world of single-precision floating-point, this number is π/2.</p>
<h2>Conclusion</h2>
<p>There are other scenarios in which the innocent user might genuinely have an unpleasant surprise. The result of a computation may be converted to decimal for humans to read and the user may be surprised to see a value outside the range ey expected. But this user would have the wrong expectations, just as if ey expected <code>10.0 * atan(x)</code> to always be less than 5π. Plenty of these users and developers can be found. But my opinion, for what it is worth, is that by making special cases you are not helping these users, only feeding their delusions.</p>
<p>The correct way to set expectations regarding the results of a floating-point program is numerical analysis. Numerical analysis is hard. Special cases such as the authors of CRlibm threaten to implement only seem to make it harder.</p>Solution to yesterday's quizurn:md5:a24a51ecc801a7fef8b14965582067ae2012-11-29T23:02:00+00:002012-12-01T12:00:52+00:00pascalanonymous-arraysc99floating-point <p>Yesterday's quiz was about the expression <code>*(char*)(float[]){x*x} - 63</code> (for big-endian architectures) or <code>*(3+(char*)(float[]){x*x}) - 63</code> (for little-endian ones). This post provides an explanation.</p>
<p><br /></p>
<p>First, let us try the function on a few values:</p>
<pre>
int main(){
for (unsigned int i=0; i<=20; i++)
printf("l(%2u)=%d\n", i, l(i));
}
</pre>
<p>This may provide the beginning of a hint:</p>
<pre>
l( 0)=-63
l( 1)=0
l( 2)=1
l( 3)=2
l( 4)=2
l( 5)=2
l( 6)=3
l( 7)=3
l( 8)=3
l( 9)=3
l(10)=3
l(11)=3
l(12)=4
l(13)=4
l(14)=4
l(15)=4
l(16)=4
l(17)=4
l(18)=4
l(19)=4
l(20)=4
</pre>
<p><br /></p>
<p>The construct <code>(float[]){…}</code> is C99's syntax for anonymous arrays, a kickass programming technique. This is an <a href="http://www.run.montefiore.ulg.ac.be/~martin/resources/kung-f00.html">unabated</a> quote.</p>
<p><br /></p>
<p>In the case at hand, the construct converts to float the contents of the braces and puts the result in memory. The function puts the float in memory in order to read its most significant byte. That's <code>*(char*)…</code> on a big-endian architecture, and <code>*(3+(char*)…)</code> on a little-endian one.</p>
<p><br /></p>
<p>One reason to read a single char is to circumvent <a href="http://stackoverflow.com/q/98650/139746">strict aliasing rules</a>—which do not apply to type <code>char</code>. A simpler version of the same function would have been <code>(*(int*)(float[]){x} >> 23) - 127</code>, but that version would break strict aliasing rules. Also, it would be too obvious.</p>
<p><br /></p>
<p>The most significant bits of a single-precision IEEE 754 floating-point representation are, in order, one sign bit and eight exponent bits. By reading the most significant byte, we get most of the exponent, but one bit is lost. To compensate for this, the operation is applied to <code>x*x</code>, whose exponent is double the exponent of <code>x</code>.</p>
<p><br /></p>
<p>In conclusion, yesterday's one-liner returns an integer approximation of the base-2 logarithm of a reasonably small <code>unsigned int x</code>. On a typical 32-bit architecture, it is exact for powers of two up to <code>2¹⁵</code>. If <code>x</code> is zero, the function returns its best approximation of <code>-∞</code>, that is, <code>-63</code>.</p>C99 quizurn:md5:2b7cacb88a08732757c35aa87e7c02992012-11-28T23:44:00+00:002012-11-29T00:15:11+00:00pascalc99floating-point <p>Here is a convenient one-liner:</p>
<pre>
int l(unsigned int x)
{
return *(char*)(float[]){x*x} - 63;
}
</pre>
<p>What does it do on my faithful PowerMac?</p>
<p><br /></p>
<p>The Intel version is not as nice. That's progress for you: <code>*(3+(char*)(float<a href="http://blog.frama-c.com/index.php?post/2012/11/28/"></a>){x*x}) - 63</code></p>Funny floating-point bugs in Frama-C Oxygen's front-endurn:md5:09a58ffcd3cd920e411d4f46440067e22012-11-19T21:54:00+00:002012-11-21T15:39:30+00:00pascalfloating-pointOCamloxygen <p>In a previous <a href="http://blog.frama-c.com/index.php?post/2011/11/18/Analyzing-single-precision-floating-point-constants">post</a>, almost exactly one year ago, before Frama-C Oxygen was released, I mentioned that the then future release would incorporate a custom decimal-to-binary floating-point conversion function. The reason was that the system's <code>strtof()</code> and <code>strtod()</code> functions could not be trusted.</p>
<p>This custom conversion function is written in OCaml. It can be found in src/kernel/floating_point.ml in the now available Oxygen <a href="http://frama-c.com/download.html">source tree</a>. This post is about a couple of funny bugs the function has.</p>
<h2>History</h2>
<p>There had been arguments about the inclusion of the “correct parsing of decimal constants” feature in Frama-C's front-end, and about the best way to implement it. My colleague Claude Marché was in favor of using the reputable MPFR library. I thought that such an additional dependency was unacceptable and I was against the feature's inclusion as long as I could not see a way to implement it that did not involve such a dependency.</p>
<p>When I presented my dependency-avoiding solution to Claude, he said: “But how do you know your function works? Did you prove it?”. To which I replied that no, I had not proved it, but I had thoroughly tested it. I had written a quick generator of difficult-to-convert decimal numbers, and I was rather happy with the confidence it gave me.</p>
<blockquote><p>My generator allowed me to find, and fix, a double rounding issue when the number to convert was a denormal: in this case the number would first be rounded at 52 significant digits and then at the lower number of significant digits implied by its denormal status.</p></blockquote>
<p><br /></p>
<p>I owe Claude a beer, though, because there were two bugs in my function. The bugs were not found by my random testing but would indeed have been found by formal verification. If you want to identify them yourself, stop reading now and start hunting, because the bugs are explained below.</p>
<h2>Stop reading here for a difficult debugging exercise</h2>
<p>Say that the number being parsed is of the form <em>numopt1</em>.<em>numopt2</em>E<em>num3</em> where <em>numopt1</em> and <em>numopt2</em> expand to optional strings of digits, and <em>num3</em> expands to a mandatory string of digits.</p>
<p>The sequences of digits <em>numopt1</em> and <em>numopt2</em> can be long. The string <em>numopt2</em> in particular should not be parsed as an integer, because the leading zeroes it may have are significant.</p>
<p>At this point of the parsing of the input program, we have already ensured that <em>num3</em> was a string of digits with an optional sign character at the beginning. In these conditions, it is tempting to simply call the OCaml function <code>int_of_string</code>.
The function <code>int_of_string</code> may still fail and raise an exception if the string represents a number too large to be represented as a 31- or 63-bit OCaml <code>int</code>.</p>
<p>This is easy to fix: if the program contains a literal like <code>1.234E9999999999999999999</code>, causing <code>int_of_string</code> to fail when parsing the exponent, return infinity. A vicious programmer might have written <code>0.000…01E9999999999999999999</code>, but this programmer's hard-drive is not large enough to contain all the digits that would prevent infinity to be the correct answer.</p>
<p>Similarly, if <code>int_of_string</code> chokes because the programmer wrote <code>1.234E-9999999999999999999</code>, the function can safely return <code>0.0</code> which, for the same reason, is always the correctly rounded floating-point representation.</p>
<p><br /></p>
<p>Or so it would seem. The above logic is implemented in function <code>parse_float</code> inside Frama-C Oxygen, and this is where the bugs are.</p>
<h2>Stop reading here for an easy debugging exercise</h2>
<p>During a code review, my colleague Boris Yakobowski and I found that Oxygen had the following unwanted behaviors. Read on for the solution.</p>
<h3>Exhibit one: spurious warning</h3>
<pre>
double d = 0.0E-9999999999999999999;
</pre>
<p>For the program above, a warning is emitted whereas the constant is, in fact, exactly represented. The only issue here is the spurious warning:</p>
<pre>
$ frama-c e1.c
[kernel] preprocessing with "gcc -C -E -I. e1.c"
e1.c:1:[kernel] warning: Floating-point constant 0.0E-9999999999999999999
is not represented exactly.
Will use 0x0.0000000000000p-1022.
See documentation for option -warn-decimal-float
</pre>
<h3>Exhibit two: confusion between zero and infinity</h3>
<pre>
double d = 0.0E9999999999999999999;
</pre>
<p>This bug is more serious:</p>
<pre>
$ frama-c e2.c
[kernel] preprocessing with "gcc -C -E -I. e2.c"
e2.c:1:[kernel] warning: Floating-point constant 0.0E9999999999999999999
is not represented exactly.
Will use inf.
See documentation for option -warn-decimal-float
</pre>
<p>These two related bugs are fixed in the development version of Frama-C.</p>About the rounding error in these Patriot missilesurn:md5:c5bed41cecd714ccd0aaa9283f52686d2012-11-18T22:59:00+00:002013-01-22T12:35:57+00:00pascalfloating-pointlinkrant <h2>An old rant: misusing Ariane 5 in motivation slides</h2>
<p>I was lucky to be an intern and then a PhD student at INRIA, while it was still called “INRIA” (it is now called “Inria”). This was about when researchers at INRIA and elsewhere were taken to task to understand the unfortunate software failure of the Ariane 5 maiden flight. So I heard the story from people I respect and who knew at least a little bit about the subject.</p>
<p>Ever since, it has been irking me when this example is taken for the purpose of motivating some formal technique or other. Unless you attend this sort of CS conference, you might not believe the non-solutions that get motivated as getting us all closer to a world in which Ariane 5 rockets do not explode.</p>
<p><br /></p>
<p>What irks me is this: even if the technique being motivated were comparably expensive to traditional testing, requiring comparable or a little less time for comparable or a little more confidence, that would not guarantee it would have been used to validate the component. The problem with Ariane 5 was not a failure of traditional tests. The problem was that, because of constraints of time and money, traditional tests were <strong>not applied</strong> to the component that eventually failed.</p>
<blockquote><p>If your method is not cheaper and faster than not doing the tests, do not imply that it would have saved Ariane 5. It might have been omitted too.</p></blockquote>
<p><br /></p>
<p>The report is <a href="http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html">online</a>. Key quote: “no test was performed to verify that the SRI would behave correctly when being subjected to the count-down and flight time sequence and the trajectory of Ariane 5.”</p>
<h2>A new rant: the Patriot missile rounding error</h2>
<p>If you attend that sort of conference, you have also heard about this other spectacular software failure, the <a href="http://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran">Patriot</a> missile rounding error. This bug too is used to justify new techniques.</p>
<p>This one did not irk me until today. I did not happen to be in the right place to get the inside scoop by osmosis. Or at the wrong time.</p>
<p>I vaguely knew that it had something to do with a clock inside the Patriot missile measuring tenths of seconds and the constant 0.1 not being representable in binary. When researchers tell the story on a motivation slide, it sounds like a stupid mistake.</p>
<blockquote><p>There is nothing reprehensible in calculating with constants that are not represented finitely in binary. Other computations inside the missile surely involved the number π, which is not finitely representable in binary either. The designer of the system simply must understand what <a href="http://en.wikipedia.org/wiki/Spivak_pronoun">ey</a> is doing. It can get a little tricky, especially when the software is evolved. Let me tell you about the single-precision <code>cosf()</code> function from the Glibc library in another rant.</p></blockquote>
<p><br /></p>
<p>Similarly with the Ariane 5 case, a rather good-looking <a href="http://mate.uprh.edu/~pnm/notas4061/patriot.htm">summary</a> of the issue is available.
Assuming that this summary is correct, and it certainly looks more plausible than the rash explanations you get at motivation-slide time, the next time I hear a researcher use the Patriot missile example to motivate eir talk, I will ask the questions that follow.</p>
<ol>
<li>When are you adapting your system to ad-hoc, 24-bit fixed-point computations (not whether it is theoretically possible but when you are doing it)?</li>
<li>When are you adapting your system to ad-hoc, non-IEEE 754, 48-bit floating-point computations?</li>
<li>Will your system then detect the drift between the same computation having gone one path and the other?</li>
</ol>
<p><br /></p>
<p>If your system is only practical in an alternate universe in which the Patriot missile software is cleanly implemented with good old IEEE 754 double-precision values: sorry, but in that universe, the Patriot missile does not exhibit the problem you are claiming to solve.</p>
<p><br /></p>
<p>Thanks to Martin Jambon for proof-reading this post.</p>Short, difficult programsurn:md5:d8e5e604a9daa4457788d786d2d3bdc32012-11-02T11:56:00+00:002013-01-25T12:45:07+00:00pascalbenchmarksfloating-point <p>When researchers start claiming that they have a sound and complete analyzer for predicting whether a program statement is reachable, it is time to build a database of interesting programs.</p>
<h2>Goldbach's conjecture</h2>
<p>My long-time favorite is a C program that verifies <a href="http://en.wikipedia.org/wiki/Goldbach's_conjecture">Goldbach's conjecture</a> (actual program left as an exercise to the reader).</p>
<p>If the conjecture is true, the program never terminates (ergo a statement placed after it is unreachable). If the conjecture is false, the program terminates and moves on to the following statement. The program can be implemented using <code>unsigned long long</code> integers that should be good for counter-examples up to 18446744073709551615. It can alternately use dynamic allocation and multi-precision integers, in which case, depending whether your precise definition of “analyzing a C program” includes out-of-memory behaviors, you could claim that the reachability of the statement after the counter-example-finding loop is equivalent to the resolution of “one of the oldest and best-known unsolved problems in number theory and in all of mathematics”.</p>
<h2>Easier programs than that</h2>
<p>No-one expects the resolution of Goldbach's conjecture to come from a program analyzer. This example is too good, because it is out of reach for everyone—it has eluded our best mathematicians for centuries. What I am looking for in this post is easier programs, where the solution is just in reach. Examples that it would genuinely be informative to run these sound, complete analyzers on. If they were made available.</p>
<h3>With integers</h3>
<p>For 32-bit ints and 64-bit long longs, I know that the label <code>L</code> is unreachable in the program below, but does your average sound and complete program analyzer do?</p>
<pre>
/*@ requires 2 <= x ;
requires 2 <= y ; */
void f(unsigned int x, unsigned int y)
{
if (x * (unsigned long long)y == 17)
L: return;
}
</pre>
<p>The key is that with the aforementioned platform hypotheses, the multiplication does not overflow. Label <code>L</code> being reached would mean that the program has identified divisors of the prime number <code>17</code>, which we don't expect it to.</p>
<p><br /></p>
<p>In the program below, the multiplication can overflow, and not having tried it, I have truly no idea whether the label <code>L</code> is reachable. I expect it is, statistically, but if you told me that it is unreachable because of a deep mathematical property of computations modulo a power of two, I would not be shocked.</p>
<pre>
/*@ requires 2 <= x ;
requires 2 <= y ; */
void f(unsigned long long x, unsigned long long y)
{
if (x * y == 17)
L: return;
}
</pre>
<h3>With floating-point</h3>
<p>Floating-point is fun. Label <code>L</code> in the following program is unreachable:</p>
<pre>
/*@ requires 10000000. <= x <= 200000000. ;
requires 10000000. <= y <= 200000000. ; */
void sub(float x, float y)
{
if (x - y == 0.09375f)
L: return;
}
</pre>
<p>Frama-C's value analysis, and most other analyzers, will not tell you that label <code>L</code> is unreachable. It definitely looks reachable, The difference between <code>x</code> and <code>y</code> can be zero, and it can be <code>1.0</code>. It looks like it could be <code>0.09375</code> but it cannot: the subtracted numbers are too large for the difference, if non-zero, to be smaller than <code>1.0</code>.</p>
<p><br /></p>
<p>So the subtlety in the example above is the magnitude of the arguments. What about smaller arguments then?</p>
<pre>
/*@ requires 1.0 <= x <= 200000000. ;
requires 1.0 <= y <= 200000000. ; */
void sub(float x, float y)
{
if (x - y == 0.09375f)
L: return;
}
</pre>
<p>This time the label <code>L</code> is easily reachable, for instance with the inputs <code>2.09375</code> for <code>x</code> and <code>2.0</code> for <code>y</code>.</p>
<p><br /></p>
<p>What about this third program?</p>
<pre>
/*@ requires 1.0 <= x <= 200000000. ;
requires 1.0 <= y <= 200000000. ; */
void sub(float x, float y)
{
if (x - y == 0.1f)
L: return;
}
</pre>
<p>The target difference <code>0.1f</code> is larger than <code>0.09375f</code>, so it should be easier to obtain as the difference of two floats from the prescribed range, right? In fact, it isn't. The numbers <code>0.099999904632568359375</code> and <code>0.10000002384185791015625</code> can be obtained as values of <code>x-y</code> in the above program, for the former as the result of picking <code>1.099999904632568359375</code> for <code>x</code> and <code>1.0</code> for <code>y</code>. The value <code>0.1f</code>, on the other hand, cannot be obtained as the subtraction of two floats above <code>1.0</code>, because some bits are set in its significand that cannot be set by subtracting floats that large.</p>
<h2>Conclusion</h2>
<p>I expect it will be some time before automatic analyzers can soundly and completely decide, in all two-line programs, whether the second line is reachable. Frama-C's value analysis does not detect unreachability in any of the programs above, for instance (the value analysis is sound, so it soundly detects that <code>L</code> may be reachable when it is). Please leave your own difficult two-line programs in the comments.</p>
<p><br /></p>
<p>The floating-point examples in this post owe quite a bit to my colleague Bruno Marre explaining his article “Improving the Floating Point Addition and Subtraction Constraints” to me over lunch.</p>