Frama-C news and ideas


Friday, July 19 2013

The word “binade” now has its Wikipedia page…

… but that's only because I created it.

If you are more familiar with Wikipedia etiquette than I am, feel free to adjust, edit, or delete this page. Also, although a Wikipedia account is necessary to create a page, I think one is not required for editing, so you can add to the story too (but if you do not have an account, you are perhaps no more familiar with Wikipedia etiquette than I am).

Thursday, July 11 2013

Arithmetic overflows in Fluorine

There is a C quiz in the middle of this post, lost in the middle of all the reminiscing.

A history of arithmetic overflows in Frama-C

From the very beginning in 2005, until after the first Open Source release in 2008, Frama-C's value analysis assumed that all arithmetic overflows produced two's complement results. This seemed the right thing to do at the time.

Then an option was introduced to emit alarms on undefined overflows. John Regehr suggested it after testing one of the Open Source versions. The option was turned off by default. If a value analysis user turned the option on, any undefined arithmetic overflow in the program would be detected and reported as a possible error, with the same gravity as dereferencing NULL or accessing an array out of its bounds.

Later, a helpful reminder was added to the value analysis' output: in the default behavior of not emitting alarms, an informative message was emitted instead—if such an overflow was detected—about two's complement being assumed.

There was one last change in the latest release, Fluorine. Actually, two changes: the name of the option for emitting alarms on undefined overflows changed, and its default value changed. The setting is now to emit alarms by default; it can be changed to not emit them, for instance if the target code is destined to be compiled with gcc -fno-strict-overflow -fwrapv, in which case all overflows happening during execution can be expected to produce two's complement results.


One aspect remains unchanged in the above evolution: the discussion only applies to undefined overflows.

The philosophy was always to analyze programs as they were written, and not to force any change of habit on software developers. The initial choice not to warn about overflows was because we knew there were so many of these—most of them intentional—that we would be deluging the user with what would feel like a flood of false positives.

The gradual shift towards more arithmetic overflow awareness is a consequence of the fact that in C, some arithmetic overflows are undefined behavior. Compilers display increasing sophistication when optimizing the defined behaviors to the detriment of the predictability of undefined ones. To make a long story short, the “overflows produce 2's complement results” heuristic was wrong for some programs as compiled by some optimizing compilers.
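
As a minimal sketch (not from the original discussion), consider a check written in terms of wrap-around; a compiler that assumes signed overflow never happens is allowed to compile it to return 0 unconditionally:

#include <limits.h>
#include <stdio.h>

/* Sketch only: this test relies on INT_MAX + 1 wrapping to INT_MIN.
   Since signed overflow is undefined, a compiler may assume that
   x + 1 < x is always false and drop the check entirely. */
int increment_would_overflow(int x)
{
  return x + 1 < x;   /* undefined behavior when x == INT_MAX */
}

int main(void)
{
  printf("%d\n", increment_would_overflow(INT_MAX));  /* 1 or 0, depending on the compiler */
  return 0;
}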

In keeping with the same philosophy, “overflows” that are defined according to the C99 standard have always been treated by the value analysis plug-in with the semantics mandated by the standard. Those overflows that the standard says must have “implementation-defined” results are treated with the semantics that the overwhelming majority of compilation platforms give them (and it remains possible to model other architectures as the need arises).

A quiz

Other static analyzers may also warn for arithmetic overflows, but the philosophy can be different. The philosophy may for instance be that any overflow, regardless of whether it is defined or implementation-defined according to the letter of the standard, might be unintentional and should be brought to the attention of the developer.

In the few examples below, the goal is to predict whether Frama-C's value analysis with its new default setting in Fluorine would emit an alarm for an overflow. For extra credit, a secondary goal is to predict whether another static analyzer that warns for all overflows would warn. We assume a 32-bit int and a 16-bit short, same as (almost) everybody has.

int a = 50000;
int b = 50000;
int r = a * b;

unsigned int a = 50000;
unsigned int b = 50000;
unsigned int r = a * b;

int a = 50000;
int b = 50000;
unsigned int r = a * b;

short a = 30000;
short b = 30000;
short r = a * b;

unsigned short a = 50000;
unsigned short b = 50000;
unsigned int r = a * b;

Answers

int a = 50000;
int b = 50000;
int r = a * b;

There is an overflow in this snippet (mathematically, 50000 * 50000 is 2500000000, which does not fit in an int). This overflow is undefined, so the value analysis warns about it.


unsigned int a = 50000;
unsigned int b = 50000;
unsigned int r = a * b;

The multiplication is an unsigned int multiplication, and when the mathematical result of unsigned operations is out of range, the C99 standard mandates that overflows wrap around. Technically, the C99 standard says “A computation involving unsigned operands can never overflow, …” (6.2.5:9) but we are using the word “overflow” with the same meaning as everyone outside the C standardization committee including Wikipedia editors.

To sum up, in the C99 standard, overflows in signed arithmetic are undefined and there are no overflows in unsigned arithmetic (meaning that unsigned overflows wrap around).


int a = 50000;
int b = 50000;
unsigned int r = a * b;

The multiplication is again a signed multiplication. It does not matter that the result is destined to an unsigned int variable because in C, types are inferred bottom-up. So the value analysis warns about an undefined overflow in the multiplication here.


short a = 30000;
short b = 30000;
short r = a * b;

There is no overflow here in the multiplication because the last line behaves as short r = (short) ((int) a * (int) b);. The justification for this behavior can be found in clause 6.3.1 of the C99 standard about conversions and promotions (the general idea is that arithmetic never operates on types smaller than int or unsigned int. Smaller types are implicitly promoted before arithmetic takes place). The product 900000000 does fit in the type int of the multiplication. But then there is a conversion when the int result is assigned to the short variable r. This conversion is implementation-defined, so the value analysis does not warn about it, but another static analyzer may choose to warn about this conversion.
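
A minimal program (a sketch, assuming a typical platform where the implementation-defined conversion wraps modulo 2^16) makes the two steps visible:

#include <stdio.h>

int main(void)
{
  short a = 30000;
  short b = 30000;
  short r = a * b;               /* behaves as (short) ((int) a * (int) b) */
  printf("%d %d\n", a * b, r);   /* typically prints: 900000000 -5888 */
  return 0;
}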


unsigned short a = 50000;
unsigned short b = 50000;
unsigned int r = a * b;

Perhaps contrary to expectations, there is an undefined overflow in the multiplication a * b in this example. Right in the middle of the aforementioned 6.3.1 clause in the C99 standard, on the subject of the promotion of operands with smaller types than int, the following sentence can be found:

If an int can represent all values of the original type, the value is converted to an int; otherwise, it is converted to an unsigned int.

All values of a 16-bit unsigned short fit in a 32-bit int, so the third line is interpreted as unsigned int r = (unsigned int) ((int) a * (int) b);.


Incidentally, things would be completely different in this last example if int and short were the same size, say if int was 16-bit. In this case the third line would be equivalent to unsigned int r = (unsigned int) a * (unsigned int) b; and would only contain an unsigned, innocuous overflow.

Wrapping up

In Fluorine, the option to activate or deactivate the emission of these undefined arithmetic overflow alarms is called -warn-signed-overflow (the opposite version for no alarms being -no-warn-signed-overflow). I felt that providing this piece of information earlier would have rendered the quiz too easy.
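
As a quick illustration (assuming the first quiz snippet is saved, wrapped in a main function, in a file quiz.c), the Fluorine default and the opposite setting correspond to the command lines below; the first one emits an alarm for the undefined overflow, the second one assumes two's complement wrap-around:

$ frama-c -val quiz.c
$ frama-c -val -no-warn-signed-overflow quiz.c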

Although Frama-C's value analysis adheres to the semantics of C and only warns for undefined overflows, it is possible to use Frama-C to check for the other kinds of overflows by using another plug-in, Rte, to automatically annotate the target C program with ACSL assertions that express the conditions for overflows. Note that the post describing that technique pre-dates the Fluorine release and is written in terms of the old options.

Saturday, July 6 2013

On the precise analysis of C programs for FLT_EVAL_METHOD==2

There has been talk recently amongst my colleagues of Frama-C-wide support for compilation platforms that define FLT_EVAL_METHOD as 2. Remember that this compiler-set value, introduced in C99, means that all floating-point computations in the C program are made with long double precision, even if the type of the expressions they correspond to is float or double. This post is a reminder, to the attention of these colleagues and myself, of pitfalls to be anticipated in this endeavor.


We are talking of C programs like the one below.

#include <stdio.h>

int r1;

double ten = 10.0;

int main(int c, char **v)
{
  r1 = 0.1 == (1.0 / ten);
  printf("r1=%d\n", r1);
}

With a C99 compilation platform that defines FLT_EVAL_METHOD as 0, this program prints “r1=1”, but with a compilation platform that sets FLT_EVAL_METHOD to 2, it prints “r1=0”.

Although we are discussing non-strictly-IEEE 754 compilers, we are assuming IEEE 754-like floating-point: we're not in 1980 any more. Also, we are assuming that long double has more precision than double, because the opposite situation would make any discussion specifically about FLT_EVAL_METHOD == 2 quite moot. In fact, we are precisely thinking of compilation platforms where float is IEEE 754 single-precision (now called binary32), double is IEEE 754 double-precision (binary64), and long double is the 8087 80-bit double-extended format.


Let us find ourselves a compiler with the right properties:

$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
…
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) 
$ gcc -mfpmath=387 -std=c99 t.c && ./a.out 
r1=0

Good! (it seems)

The test program sets r1 to 0 because the left-hand side 0.1 of the equality test is the double-precision constant 0.1, whereas the right-hand side is the double-extended precision result of the division of 1 by 10. The two differ because 0.1 cannot be represented exactly in binary floating-point, so the long double representation is closer to the mathematical value and thus different from the double representation. We can make sure this is the right explanation by changing the expression for r1 to 0.1L == (1.0 / ten), in which the division is typed as double but computed as long double, then promoted to long double in order to be compared to 0.1L, the long double representation of the mathematical constant 0.1. This change causes r1 to receive the value 1 with our test compiler, whereas the change would make r1 receive 0 if the program was compiled with a strict IEEE 754 C compiler.
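
For reference, here is the modified program (a minimal variant; the only change from the test program above is the constant 0.1L):

#include <stdio.h>

int r1;

double ten = 10.0;

int main(int c, char **v)
{
  r1 = 0.1L == (1.0 / ten);  /* division typed double but computed as long double,
                                then compared to the long double constant 0.1L */
  printf("r1=%d\n", r1);
}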

Pitfall 1: Constant expressions

Let us test the augmented program below:

#include <stdio.h>

int r1, r2;

double ten = 10.0;

int main(int c, char **v)
{
  r1 = 0.1 == (1.0 / ten);
  r2 = 0.1 == (1.0 / 10.0);
  printf("r1=%d r2=%d\n", r1, r2);
}

In our first setback, the program prints “r1=0 r2=1”. The assignment to r2 has been compiled into a straight constant-to-register move, based on a constant evaluation algorithm that does not obey the same rules as execution does. If we are to write a precise static analyzer that corresponds to this GCC-4.4.3, this issue is going to complicate our task seriously. We will have to delineate a notion of “constant expressions” that the analyzer evaluates with the same rules GCC uses for constant expressions, and then implement GCC's semantics for run-time evaluation of floating-point expressions for the non-constant ones. And our notion of “constant expression” will have to match GCC's notion exactly, lest our analyzer be unsound.

Clarification: What is a “precise” static analyzer?

This is as good a time as any to point out that Frama-C's value analysis plug-in, for instance, is already able to analyze programs destined to be compiled with FLT_EVAL_METHOD as 2. By default, the value analysis plug-in assumes IEEE 754 and FLT_EVAL_METHOD == 0:

$ frama-c -val t.c
…
t.c:9:[kernel] warning: Floating-point constant 0.1 is not represented exactly.
 Will use 0x1.999999999999ap-4.
 See documentation for option -warn-decimal-float
…
[value] Values at end of function main:
          r1 ∈ {1}
          r2 ∈ {1}


The possibility of FLT_EVAL_METHOD being set to 2 is captured by the option -all-rounding-modes:

$ frama-c -val -all-rounding-modes t.c
…
t.c:9:[kernel] warning: Floating-point constant 0.1 is not represented exactly.
 Will use 0x1.999999999999ap-4.
 See documentation for option -warn-decimal-float
…
[value] Values at end of function main:
          r1 ∈ {0; 1}
          r2 ∈ {0; 1}

The sets of values predicted for variables r1 and r2 at the end of main() each contain the value given by the program as compiled by GCC-4.4.3, but these sets are not precise. If the program then went on to divide r1 by r2, Frama-C's value analysis would warn about a possible division by zero, whereas we know that with our compiler, the division is safe. The warning would be a false positive.

We are talking here about making a static analyzer with the ability to conclude that r1 is 0 and r2 is 1 because we told it that we are targeting a compiler that makes it so.


The above example command-lines are for Frama-C's value analysis, but during her PhD, Thi Minh Tuyen Nguyen has shown that the same kind of approach could be applied to source-level Hoare-style verification of floating-point C programs. The relevant articles can be found in the results of the Hisseo project.

To be continued

In the next post, we will find more pitfalls, revisit a post by Joseph S. Myers in the GCC mailing list, and conclude that implementing a precise static analyzer for this sort of compilation platform is a lot of work.

Thursday, June 20 2013

On the prototype of recv()

Typing man recv on my system, I get:

ssize_t
recv(int socket, void *buffer, size_t length, int flags);

The man page contains this additional bit of specification: “These calls return the number of bytes received, or -1 if an error occurred.”


The type ssize_t is defined as a signed variant of size_t. “Is that not what ptrdiff_t is for?”, you may ask. Jonathan Leffler was wondering about this very question during the last days of 2011. I am unsure of the answer's conclusiveness: on 32-bit platforms that let malloc() create objects of more than 2GiB, ptrdiff_t is generally 32-bit wide, although it should be wider by the same reasoning applied in the accepted answer and in some of its comments.


That function recv() may return -1 in case of error is probably the reason for the result of recv() being of the signed type ssize_t. Fair enough. Now let us write a bit of formal specification to go with recv()'s prototype:

   /*@ … ensures -1 <= \result <= length ; */

The unimpeachable logic here is that since function recv() is well-written not to overflow the buffer passed to it, the number of bytes received can only be lower than the number of bytes it was allowed to write. Or, naturally, it can be -1 in case of error.


Question: what is strange about the above post-condition?

Wednesday, June 19 2013

Microsoft's bug bounty program

I like Robert Graham's analysis on Microsoft's new bug bounty program.


I would never have thought of selling vulnerabilities to the NSA (but then, I am not American and not a security researcher). Does the NSA not employ qualified people to look for vulnerabilities as their day job? Is that not like trying to sell a loaf of bread to a company whose business is to make bread?


Sometimes, you have a really good loaf of bread, but still… Regardless of whether the NSA already owns your particular loaf of bread, and independently of the payment-by-carrot-or-stick discussion, you are a competitor, not a provider.

Monday, May 20 2013

Attack by Compiler

The title of this post, “Attack by Compiler”, has been at the back of my mind for several weeks. It started with a comment by jduck on a post earlier this year. The post's topic, the practical undefinedness of reading from uninitialized memory, and jduck's comment, awakened memories from a 2008 incident with the random number generator in OpenSSL.

As I am writing this, if I google “attack by compiler”, the first page of results includes the classic essay Reflections on Trusting Trust by Ken Thompson, Wikipedia's definition of a backdoor in computing, an article by David A. Wheeler for countering the attack described by Thompson, a commentary by Bruce Schneier on Wheeler's article, and a Linux Journal article by David Maynor on the practicality of the attack described by Thompson on the widespread GNU/Linux platform.


This post is about a slightly different idea.

Initial conditions: trustworthy compiler, peer-reviewed changes

Suppose that we start with a trustworthy compiler, widely distributed in both source and binary form. Some people are in the position to make changes to the compiler's source code and could, like Thompson in his essay, attempt to insert a backdoor in the compiler itself.

But now, each change is scrutinized by innumerable witnesses from the Open-Source community. I say “witnesses”, in fact they are mostly students, but we will probably be forced to assume that they can't all be mischievous.

The attackers could try to obfuscate the backdoor as they insert it, but the problem is that some of the witnesses are bound to possess this character trait that the less they understand a change, the more they investigate it. Furthermore, once these witnesses have uncovered the truth, loss of credibility will ensue for the person who tried to sneak the backdoor in. This person will lose eir commit privilege to the compiler's sources, and people will recover untainted compilers in source and binary form from their archives. This kind of approach is risky and may only result in a temporary advantage—which may still be enough.

The underhanded approach

The 2013 edition of the Underhanded C Contest is under way. The contest defines underhanded code as:

code that is as readable, clear, innocent and straightforward as possible, and yet [fails] to perform at its apparent function

Underhandedness is exactly what an attacker with commit access to the source code of the widely used compiler should aim for. If the commit is underhanded enough, the committer may not only enjoy full deniability, but ey may even get the incriminating change to stay in later versions of the compiler, as a “good” change. This implies that all affected security-sensitive applications, like the “login” program in Thompson's essay, must be updated to work around the now-official backdoor in the compiler. In this scenario, even after the attack has been discovered, any time someone unknowingly compiles an old version of “login” with a recent compiler, it is another win for the attacker.


Fortunately, we agree with Scott Craver that the C programming language is a very good context to be underhanded in, and a C compiler is even better. How about the following ideas?

  1. making pseudo-random number generators that rely on uninitialized memory less random, in the hope that this will result in weaker cryptographic keys for those who do not know about the flaw;
  2. optimizing a NULL test out of kernel code when it is one of several defenses that need to be bypassed;
  3. optimizing buffer + len >= buffer_end || buffer + len < buffer overflow tests out from application code (see the sketch after this list), so that buffer overflows do take place in code that is guarded thus;
  4. optimizing source code that was written to take constant time into binary code that reveals secrets by terminating early.
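
Here is a minimal sketch of the kind of test meant in idea (3); the function and variable names are made up, but the condition is the one quoted above:

#include <stddef.h>

/* Sketch: the second comparison is intended to catch the case where
   buffer + len wraps around. Forming a pointer that far past the end
   of the object is undefined, so a compiler may assume the comparison
   is always false and remove it, letting the overflow through. */
int request_fits(const char *buffer, const char *buffer_end, size_t len)
{
  if (buffer + len >= buffer_end || buffer + len < buffer)
    return 0;   /* reject the request */
  return 1;     /* deemed safe to write len bytes into buffer */
}

int main(void)
{
  char buf[16];
  return !request_fits(buf, buf + sizeof buf, 8);
}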

I am not being very original. According to this post by Xi Wang, idea (1) is only waiting for someone to give the compiler one last well-calibrated shove. The NULL test optimization was already implemented in the compiler when it was needed for a famous Linux kernel exploit. The interesting scenario would have been if someone had found that the code in tun_chr_poll() was almost exploitable and had submitted the GCC optimization to activate the problem, but it did not happen in this order. Idea (3) really happened.

Idea (4) has not been exploited that I know of, but it is only a matter of time. If I google for “constant-time memcmp”, I may find an implementation such as the following:

int memcmp_nta(const void *cs, const void *ct, size_t count)
{
  const unsigned char *su1, *su2;
  int res = 0;

  for (su1 = cs, su2 = ct; 0 < count; ++su1, ++su2, count--)
  res |= (*su1 ^ *su2);

  return res;
}

Nothing in the C language definition forces a compiler to compile the above function into a function that does not return as soon as variable res contains (unsigned char)-1, not to mention the possibilities if the compiler first inlines the function in a site where its result is only compared to 0, and then optimizes the code for early termination. If I were trying to sneak in a compiler change that defeats the purpose of this memcmp_nta() function, I would bundle it with auto-vectorization improvements. It is a fashionable topic, and quite exciting if one does not care about non-functional properties such as execution time.

Conclusion

The impracticability of the described attack is counter-balanced by some unusual benefits: at the time of the attack, someone may already have audited the pseudo-random number generator or function memcmp_nta() we used as examples. The audit may have considered both source and generated assembly code, and involved actual tests, at a time when the code was “safely” compiled. But the auditor is not going to come back to the review again and again each time a new compiler comes out. Like Monty Python's Spanish Inquisition, nobody expects the compiler-generated backdoor.


Three of my four examples involve undefined behavior. More generally, my examples all involve unwarranted programmer expectations about C idioms. This is the key to plausibly deniable compiler changes that reveal latent security flaws. What other unwarranted expectations should we take advantage of for an “attack by compiler”?

Tuesday, May 14 2013

Contrarianism

If I told you that when n is a positive power of two and d an arbitrary number, both represented as double, the condition (n - 1) * d + d == n * d in strictly-IEEE-754-implementing C is always true, would you start looking for a counter-example, or start looking for a convincing argument that this property may hold?

If you started looking for counter-examples, would you start with the vicious values? Trying to see if NaN or +inf can be interpreted as “a positive power of two” or “an arbitrary number” represented “as double”? A subnormal value for d? A subnormal value such that n*d is normal? A subnormal value such that (n - 1) * d is subnormal and n * d is normal?

Or would you try your luck with ordinary values such as 0.1 for d and 4 for n?


This post is based on a remark by Stephen Canon. Also, I have discovered a truly remarkable proof of the property which this quick post is too small to contain.

Saturday, May 11 2013

Big round numbers, and a book review

Nearly 15 months ago, according to a past article, this blog celebrated its 15-month anniversary, and celebrated with the announcement of minor milestones having been reached: 100 articles and 50 comments.

Fifteen months after that, the current count is nearly 200 articles and 200 comments. Also, the blog managed to get 100 subscribers in Google's centralized service for never having to mark the same post as read twice, Reader. This was a close call.


A lot of recent posts have been related to floating-point arithmetic. I would like to reassure everyone that this was only a fluke. Floating-point correctness became one of the Frama-C tenets with our involvement in two collaborative projects, U3CAT and Hisseo, now both completed. Very recently, something must have clicked for me and I became quite engrossed by the subject.


As a result of this recent passion, in the last few days, I started reading the “Handbook of Floating-Point Arithmetic”, by Jean-Michel Muller et al. This book is both thick and dense, but fortunately well organized, so that it is easy to skip over sections you do not feel concerned with, such as decimal floating-point or hardware implementation details. This book is an amazing overview. It contains crystal-clear explanations of floating-point idioms that I was until then painstakingly reverse-engineering from library code. Fifteen years from now, people will say, “… and you should read the Handbook of Floating-Point Arithmetic. It is a bit dated now, but it is still the best reference, just complete it with this one and that one”, just like people might say now about Aho, Sethi and Ullman's Dragon book about compilation.

Except that right now, the book is current. The references to hardware are references to hardware that you might still have, or certainly remember having had. The open questions are still open. If you were offered the chance to read the Dragon book when it came out and was all shiny and new, would you pass? If not, and if there is the slightest chance that you might hold an interest in the mysteries of floating-point computation in the foreseeable future, read this book now, for the bragging rights.

In addition, the book goes down to the lowest levels of detail, with occasional snippets of programs to make it clear what is meant. The snippets are C code, and irreproachable C code.

Thursday, May 9 2013

A 63-bit floating-point type for 64-bit OCaml

The OCaml runtime

The OCaml runtime allows polymorphism through the uniform representation of types. Every OCaml value is represented as a single word, so that it is possible to have a single implementation for, say, “list of things”, with functions to access (e.g. List.length) and build (e.g. List.map) these lists that work just the same whether they are lists of ints, of floats, or of lists of sets of integers.

Anything that does not fit in a word is allocated in a block in the heap. The word representing this data is then a pointer to the block. Since the heap contains only blocks of words, all these pointers are aligned: their few least significant bits are always unset.

Argumentless constructors (like this: type fruit = Apple | Orange | Banana) and integers do not represent so much information that they need to be allocated in the heap. Their representation is unboxed. The data is directly inside the word that would otherwise have been a pointer. So while a list of lists is actually a list of pointers, a list of ints contains the ints with one less indirection. The functions accessing and building lists do not notice because ints and pointers have the same size.

Still, the Garbage Collector needs to be able to tell pointers from integers. A pointer points to a well-formed block in the heap that is by definition alive (since it is being visited by the GC) and should be marked as such. An integer can have any value and could, if precautions were not taken, accidentally look like a pointer. This could cause dead blocks to look alive, but much worse, it would also cause the GC to change bits in what it thinks is the header of a live block, when it is actually following an integer that merely looks like a pointer, thus messing up user data.

This is why unboxed integers provide 31 bits (for 32-bit OCaml) or 63 bits (for 64-bit OCaml) to the OCaml programmer. In the representation, behind the scenes, the least significant bit of a word containing an integer is always set, to distinguish it from a pointer. 31- or 63-bit integers are rather unusual, so anyone who uses OCaml at all knows this. What users of OCaml do not usually know is why there isn't a 63-bit unboxed float type for 64-bit OCaml.
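
To make the convention concrete, here is a small C sketch; the macros mirror the Is_long/Val_long/Long_val definitions found in the OCaml runtime's headers, redefined locally for illustration:

#include <stdio.h>
#include <stdint.h>

typedef intptr_t value;                      /* one machine word */

#define Is_long(v)   (((v) & 1) != 0)        /* unboxed integer      */
#define Is_block(v)  (((v) & 1) == 0)        /* aligned heap pointer */
#define Val_long(n)  ((value) (((intptr_t) (n) << 1) + 1))
#define Long_val(v)  ((v) >> 1)

int main(void)
{
  value x = Val_long(42);
  printf("Is_long: %d, Long_val: %ld\n", Is_long(x), (long) Long_val(x));  /* 1, 42 */
  return 0;
}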

There is no unboxed 63-bit floating-point type in OCaml

And the answer to this last question is that there is no particular reason one shouldn't have a 63-bit unboxed float type in OCaml. Defining one only requires carefully answering two further, intricately related questions:

  • What 63-bit floating-point format should be used?
  • How will the OCaml interpreter compute values in this format?

In 1990, when 64-bit computers were few, Xavier Leroy decided that in his (then future) Caml-light system, the type for floating-point would be 64-bit double precision. The double precision floating-point format did not come close to fitting in the then-prevalent 32-bit word:

Floating-point numbers are allocated in the heap as unstructured blocks of length one, two or three words, depending on the possibilities of the hardware and on the required precision. An unboxed representation is possible, using the 10 suffix for instance, but this gives only 30 bits to represent floating-point numbers. Such a format lacks precision, and does not correspond to any standard format, so it involves fairly brutal truncations. Good old 64-bit, IEEE-standard floating point numbers seem more useful, even if they have to be allocated.

First, a remark: it is not necessary to distinguish floats from ints: that is what the static type-system is for. From the point of view of the GC they are all non-pointers, and that's the only important thing. So if we decide to unbox floats, we can take advantage of the same representation as for integers, a word with the least significant bit set. And nowadays even the proverbial grandmother has a 64-bit computer to read e-mail on, hence the temptation to unbox floats.

Second, the reluctance to truncate the mantissa of any existing format remains well-founded. Suppose that we defined a format with 51 explicit mantissa bits as opposed to double-precision's 52. We could use the double-precision hardware for computations and then round to 51 bits of mantissa, but the sizes are so close that this would introduce plenty of double rounding errors, where the result is less precise than if it had been rounded directly to 51 bits. As someone who has to deal with the consequences of hardware computing 64-bit mantissas that are then rounded a second time to 52 bits, I feel dirty just imagining this possibility. If we went for 1 sign bit, 11 exponent bits, and 51 explicit mantissa bits, we would have to use software emulation to round directly to the correct result.

This post is about another idea to take advantage of the double-precision hardware to implement a 63-bit floating-point type.

A truncationless 63-bit floating-point format

Borrow a bit from the exponent

Taking one of the bits from the 11 reserved for the exponent in the IEEE 754 double-precision format does not have such noticeable consequences. At the top of the scale, it is easy to map values above a threshold to infinity. This does not involve any double-rounding error. At the bottom of the scale, things are more complicated. The very smallest floating-point numbers of a proper floating-point format, called subnormals, have an effective precision of less than the nominal 52 bits. Computing with full-range double-precision and then rounding to the reduced-range 63-bit format means that the result of a computation can be computed as a normal double-precision number with a 52-bit mantissa, say 1.324867e-168, and then rounded to the narrower effective precision of a 63-bit subnormal float.

Incidentally, this sort of issue is the sort that remains even after you have configured an x87 to use only the 53 or 24 mantissa bits that make sense when computing with double- or single-precision. Only the width of the mantissa is reduced, not the range of the exponent, so numbers that would be subnormal in the targeted type are normal when represented in an x87 register. You could hope to fix them after each computation with an option such as GCC's -ffloat-store, but then they are double-rounded. The first rounding is to 53 or 24 bits, and the second to the effective precision of the subnormal.

Double-rounding, Never!

But since overflows are much easier to handle, we can cheat. In order to make sure that subnormal results are rounded directly to the effective precision, we can bias the computations so that if the result is going to be a 63-bit subnormal, the double-precision operation produces a subnormal result already.

In practice, this means that when the OCaml program is adding the numbers 1.00000000001e-152 and -1.0e-152, we do not show these numbers to the double-precision hardware. What we show to the hardware instead is these numbers multiplied by 2^-512, so that if the result needs to be subnormal in the 63-bit format (and in this example it does), then a subnormal double-precision number is computed with the same number of bits of precision.

In fact, we can maintain this “store numbers as 2^-512 times their intended value” convention all the time, and only come out of it at the time of calling library functions such as printf().

For multiplication of two operands represented as 2^-512 times their real value, one of the arguments needs to be unbiased (or rebiased: if you have a trick to remember which is which, please share) before the hardware multiplication, by multiplying it by 2^512.

For division the result must be rebiased after it is computed.
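
Here is a minimal sketch in C of these conventions, assuming IEEE 754 double-precision hardware; the function names are mine, for illustration only:

#include <stdio.h>

static const double BIAS   = 0x1p-512;  /* 2^-512: biased = 2^-512 times the real value */
static const double UNBIAS = 0x1p512;   /* 2^512 */

double to_biased(double x)   { return x * BIAS; }
double from_biased(double x) { return x * UNBIAS; }

/* Addition of two biased operands needs no adjustment. */
double add_biased(double a, double b) { return a + b; }

/* Multiplication: rebias one operand before the hardware multiplication. */
double mul_biased(double a, double b) { return (a * UNBIAS) * b; }

/* Division: the quotient comes out unbiased; rebias the result. */
double div_biased(double a, double b) { return (a / b) * BIAS; }

int main(void)
{
  double a = to_biased(3.0), b = to_biased(7.0);
  printf("%g %g %g\n",
         from_biased(add_biased(a, b)),   /* 10 */
         from_biased(mul_biased(a, b)),   /* 21 */
         from_biased(div_biased(a, b)));  /* 0.428571… */
  return 0;
}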

The implementation of the correctly-rounded function sqrt() for 63-bit floats is left as an exercise to the reader.

Implementation

A quick and dirty implementation, only tested as much as shown, is available from ideone. Now I would love for someone who actually uses floating-point in OCaml to finish integrating this in the OCaml runtime and do some benchmarks. Not that I expect it will be very fast: the 63-bit representation involves a lot of bit-shuffling, and OCaml uses its own tricks, such as unboxing floats inside arrays, so that it will be hard to compete.

Credits

I should note that I have been reading a report on implementing a perfect emulation of IEEE 754 double-precision using x87 hardware, and that the idea presented here was likely contained there. Google, which is quick to point to the wrong definition of FLT_EPSILON, has been no help in finding this report again.

Definition of FLT_EPSILON

Correct and wrong definitions for the constant FLT_EPSILON

If I google “FLT_EPSILON”, the topmost result is this page with this definition:

FLT_EPSILON the minimum positive number such that 1.0 + FLT_EPSILON != 1.0.

No, no, no, no, no.

I don't know where this definition originates from, but it is obviously from some sort of standard C library, and it is wrong, wrong, wrong, wrong, wrong. The definition in the C99 standard is:

the difference between 1 and the least value greater than 1 that is representable in the given floating point type, b^(1−p)

The GNU C library gets it right:

FLT_EPSILON: This is the difference between 1 and the smallest floating point number of type float that is greater than 1.

The difference

On any usual architecture, with the correct definition, FLT_EPSILON is 0x0.000002p0, the difference between 0x1.000000p0 and the smallest float above it, 0x1.000002p0.


The notation 0x1.000002p0 is a convenient hexadecimal input format, introduced in C99, for floating-point numbers. The last digit is a 2 where one might have expected a 1 because single-precision floats have 23 explicit bits of mantissa, and 23 is not a multiple of 4. So the 2 in 0x1.000002p0 represents the last bit that can be set in a single-precision floating-point number in the interval [1…2).


If one adds FLT_EPSILON to 1.0f, one does obtain 0x1.000002p0. But is it the smallest float with this property?

#include <stdio.h>

void pr_candidate(float f)
{
  printf("candidate: %.6a\tcandidate+1.0f: %.6a\n", f, 1.0f + f); 
}

int main(){
  pr_candidate(0x0.000002p0);
  pr_candidate(0x0.000001fffffep0);
  pr_candidate(0x0.0000018p0);
  pr_candidate(0x0.000001000002p0);
  pr_candidate(0x0.000001p0);
}

This program, compiled and executed, produces:

candidate: 0x1.000000p-23	candidate+1.0f: 0x1.000002p+0
candidate: 0x1.fffffep-24	candidate+1.0f: 0x1.000002p+0
candidate: 0x1.800000p-24	candidate+1.0f: 0x1.000002p+0
candidate: 0x1.000002p-24	candidate+1.0f: 0x1.000002p+0
candidate: 0x1.000000p-24	candidate+1.0f: 0x1.000000p+0

No, 0x0.000002p0 is not the smallest number that, added to 1.0f, causes the result to be above 1.0f. This honor goes to 0x0.000001000002p0, the smallest float above half FLT_EPSILON.

Exactly half FLT_EPSILON, the number 0x0.000001p0 or 0x1.0p-24 as you might prefer to call it, causes the result of the addition to be exactly midway between 1.0f and its successor. The rule says that the “even” one has to be picked in this case. The “even” one is 1.0f.

Conclusion

Fortunately, in the file that initiated this rant, the value for FLT_EPSILON is correct:

#define FLT_EPSILON 1.19209290E-07F // decimal constant

This is the decimal representation of 0x0.000002p0. Code compiled against this header will work. It is only the comment that's wrong.
