Frama-C news and ideas


Tag - OCaml


Friday, October 18 2013

OCaml's option -compact can optimize for code size and speed

The OCaml compiler can generate native code for various architectures. One of the few code generation options is named -compact:

$ ocamlopt -help
Usage: ocamlopt <options> <files>
Options are:
  ...
  -compact  Optimize code size rather than speed
  ...

One of the effects, perhaps the only effect(?), of this option is on the code generated for allocation. In OCaml, a small block typically gets allocated from the minor heap. When all goes well, allocating in the minor heap means simply decrementing the pointer that separates the already-allocated top of the minor heap from the still-available bottom of the minor heap. Eventually the minor heap becomes full; what happens then is complicated. But the fast path is basically a subtraction and a conditional branch (usually not taken) to the complicated stuff.

With option -compact, an allocation simply calls a routine that does all the above. Without it, the subtraction and conditional jump are inlined at the site where the allocation is requested. OCaml being a typical functional language, there are plenty of these. Inlining a couple of instructions makes the code bigger, but when execution stays on the fast path, which is often, a routine call is avoided completely.

A small example

Consider the one-line OCaml function let f i = Some i. This function takes a value i and returns a Some memory cell that points to i. I intended i to be an integer, but I inadvertently wrote code that works for any type of i. This was possible because in OCaml, all types of values share a uniform representation, as discussed in the introduction of a previous post.

A Some cell occupies 16 bytes (yes, this is huge; blame 64-bit computing). Here is the fast code generated by ocamlopt -S t.ml, where L102 is the label of the seldom-called code that manages the case when the minor heap is full:

	...
	subq	$16, %r15
	movq	_caml_young_limit@GOTPCREL(%rip), %rax
	cmpq	(%rax), %r15
	jb	L102
	...

The -compact option causes more compact code to be emitted for the allocation. When our function f is compiled with ocamlopt -S -compact t.ml, the sequence generated for the allocation is much shorter indeed. There is even a specialized allocation function for allocating 16-byte blocks, so no arguments need to be set up:

	...
	call	_caml_alloc1
	...

Effects of option -compact in the large

When the development version of Frama-C is compiled with option -compact, the binary takes 14.4MB.

Without option -compact, the Frama-C binary weighs in at 15.4MB.


One additional megabyte of code is not going to hurt us at this point, right? It is not as if the Frama-C binary had to fit exactly in a box of ten floppy disks. As long as the larger code is faster, it is worth it…

$ time  bin/toplevel.lean_and_mean tests/test/adpcm.c -sparecode > /dev/null

user	0m0.843s

$ time  bin/toplevel.big_and_slow tests/test/adpcm.c -sparecode > /dev/null

user	0m0.856s

… but it turns out that the larger code isn't faster. In the case of Frama-C, and of the computer I am typing on, at least, the code generated with option -compact is smaller and faster.

Conclusion

It is not easy to predict performance on modern out-of-order architectures, but if I had to hazard an explanation, I would say that, besides the obvious cache concerns, having many scattered predictable conditional branches may be a losing proposition compared to a single conditional branch visited from many places with call and ret. Every conditional branch that isn't in the branch prediction buffer is at risk of being mispredicted. In addition, each entry taken up by a branch that has recently been visited is an entry that isn't available for branches translated from conditionals in the source code.

In contrast, the execution of the call and ret instructions has been optimized over the years. The target of the ret instruction is predicted with a Return Stack Buffer that has a conceptually simpler job than the Branch Predictor (although it is clearly possible to write code for which the ret instructions cause a stall, which could re-establish an advantage for the partially inlined version).


The conclusion is that everyone who uses OCaml should try the native compiler's option -compact, and that perhaps the default should be for this option to be on.

Thursday, May 9 2013

A 63-bit floating-point type for 64-bit OCaml

The OCaml runtime

The OCaml runtime allows polymorphism through the uniform representation of types. Every OCaml value is represented as a single word, so that it is possible to have a single implementation for, say, “list of things”, with functions to access (e.g. List.length) and build (e.g. List.map) these lists that work just the same whether they are lists of ints, of floats, or of lists of sets of integers.

Anything that does not fit in a word is allocated in a block in the heap. The word representing this data is then a pointer to the block. Since the heap contains only blocks of words, all these pointers are aligned: their few least significant bits are always unset.

Argumentless constructors (like this: type fruit = Apple | Orange | Banana) and integers do not represent so much information that they need to be allocated in the heap. Their representation is unboxed. The data is directly inside the word that would otherwise have been a pointer. So while a list of lists is actually a list of pointers, a list of ints contains the ints with one less indirection. The functions accessing and building lists do not notice because ints and pointers have the same size.

Still, the Garbage Collector needs to be able to distinguish pointers from integers. A pointer points to a well-formed block in the heap that is by definition alive (since it is being visited by the GC) and should be marked so. An integer can have any value and could, if precautions were not taken, accidentally look like a pointer. This could cause dead blocks to look alive, but much worse, it would also cause the GC to change bits in what it thinks is the header of a live block, when it is actually following an integer that merely looks like a pointer, thereby messing up user data.

This is why unboxed integers provide 31 bits (for 32-bit OCaml) or 63 bits (for 64-bit OCaml) to the OCaml programmer. In the representation, behind the scenes, the least significant bit of a word containing an integer is always set, to distinguish it from a pointer. 31- or 63-bit integers are rather unusual, so anyone who uses OCaml at all knows this. What users of OCaml do not usually know is why there isn't a 63-bit unboxed float type for 64-bit OCaml.

There is no unboxed 63-bit floating-point type in OCaml

And the answer to this last question is that there is no particular reason one shouldn't have a 63-bit unboxed float type in OCaml. Defining one only requires carefully answering two intricately related questions:

  • What 63-bit floating-point format should be used?
  • How will the OCaml interpreter compute values in this format?

In 1990, when 64-bit computers were few, Xavier Leroy decided that in his (then future) Caml-light system, the type for floating-point would be 64-bit double precision. The double precision floating-point format did not come close to fitting in the then-prevalent 32-bit word:

Floating-point numbers are allocated in the heap as unstructured blocks of length one, two or three words, depending on the possibilities of the hardware and on the required precision. An unboxed representation is possible, using the 10 suffix for instance, but this gives only 30 bits to represent floating-point numbers. Such a format lacks precision, and does not correspond to any standard format, so it involves fairly brutal truncations. Good old 64-bit, IEEE-standard floating point numbers seem more useful, even if they have to be allocated.

First, a remark: it is not necessary to distinguish floats from ints: that is what the static type-system is for. From the point of view of the GC they are all non-pointers, and that's the only important thing. So if we decide to unbox floats, we can take advantage of the same representation as for integers, a word with the least significant bit set. And nowadays even the proverbial grandmother has a 64-bit computer to read e-mail on, hence the temptation to unbox floats.

Second, the reticence to truncate the mantissa of any existing format remains well-founded. Suppose that we defined a format with 51 explicit mantissa bits as opposed to double-precision's 52. We could use the double-precision hardware for computations and then round to 51 bits of mantissa, but the sizes are so close that this would introduce plenty of double rounding errors, where the result is less precise than if it had been rounded directly to 51 bits. As someone who has to deal with the consequences of hardware computing 64-bit mantissas that are then rounded a second time to 52 bits, I feel dirty just imagining this possibility. If we went for 1 sign bit, 11 exponent bits and 51 explicit mantissa bits, we would have to use software emulation to round directly to the correct result.

This post is about another idea to take advantage of the double-precision hardware to implement a 63-bit floating-point type.

A truncationless 63-bit floating-point format

Borrow a bit from the exponent

Taking one of the bits from the 11 reserved for the exponent in the IEEE 754 double-precision format does not have such noticeable consequences. At the top of the scale, it is easy to map values above a threshold to infinity. This does not involve any double-rounding error. At the bottom of the scale, things are more complicated. The very smallest floating-point numbers of a proper floating-point format, called subnormals, have an effective precision of less than the nominal 52 bits. Computing with full-range double-precision and then rounding to the reduced-range 63-bit format means that a result can first be rounded as a normal double-precision number with a 52-bit mantissa, say 1.324867e-168, and only then be rounded again to the narrower effective precision of a 63-bit subnormal float: a double rounding.

Incidentally, this is exactly the sort of issue that remains even after you have configured an x87 to use only the 53 or 24 mantissa bits of double or single precision. Only the precision of the mantissa is reduced, not the range of the exponent, so numbers that would be subnormal in the targeted type are still normal when represented in an x87 register. You could hope to fix them after each computation with an option such as GCC's -ffloat-store, but then they are double-rounded: first to 53 or 24 bits, and then to the effective precision of the subnormal.

Double-rounding, Never!

But since overflows are much easier to handle, we can cheat. In order to make sure that subnormal results are rounded directly to the effective precision, we can bias the computations so that if the result is going to be a 63-bit subnormal, the double-precision operation produces a subnormal result already.

In practice, this means that when the OCaml program is adding the numbers 1.00000000001e-152 and -1.0e-152, we do not show these numbers to the double-precision hardware. What we show to the hardware instead is these numbers multiplied by 2^-512, so that if the result needs to be subnormal in the 63-bit format (and in this example it does), then a subnormal double-precision result is computed with the same number of bits of precision.

In fact, we can maintain this “store numbers as 2^-512 times their intended value” convention all the time, and only come out of it at the time of calling library functions such as printf().

For multiplication of two operands represented as 2^-512 times their real value, one of the arguments needs to be unbiased (or rebiased: if you have a trick to remember which is which, please share) before the hardware multiplication, by multiplying it by 2^512.

For division the result must be rebiased after it is computed.

The implementation of the correctly-rounded function sqrt() for 63-bit floats is left as an exercise to the reader.

Implementation

A quick and dirty implementation, only tested as much as shown, is available from ideone. Now I would love for someone who actually uses floating-point in OCaml to finish integrating this in the OCaml runtime and do some benchmarks. Not that I expect it will be very fast: the 63-bit representation involves a lot of bit-shuffling, and OCaml uses its own tricks, such as unboxing floats inside arrays, so that it will be hard to compete.

Credits

I should note that I once read a report on implementing a perfect emulation of IEEE 754 double-precision using x87 hardware, and that the idea presented here was likely already contained there. Google, quick as it is to point to the wrong definition of FLT_EPSILON, has been no help in finding this report again.

Tuesday, February 26 2013

Portable modulo signed arithmetic operations

In a recent post, John Regehr reports on undefined behaviors in Coq! In Coq! Coq! Exclamation point!

Coq is a proof assistant. It looks over your shoulder and tells you what remains to be done while you are elaborating a mathematical proof. Coq has been used to check that the 4-color theorem really holds. It would be supremely ironic if an undefined behavior in a bit of C code in the OCaml implementation used to compile Coq caused it to accept as valid some incorrect proofs, for instance in the formally verified (in Coq) compiler CompCert.

Your mission, should you decide to accept it

Where are these undefined behaviors? They are in several places, but some of them are in the implementation of the bytecode arithmetic operations. In OCaml, basic operations such as addition and multiplication are specified as returning the mathematical result modulo 2^31, 2^32, 2^63 or 2^64, depending on the compilation platform and the integer type.

The problem here is not limited to OCaml. The Java Virtual Machine too specifies that arithmetic operations on its 32-bit integer type return results modulo 2^32. Someone wishing to implement a JVM in portable C would face the same issues.

In fact, this is a big enough issue that we should collectively try to work out a solution now.

A safer signed addition instruction

Let us assume a 32-bit platform. In this case, values of the unboxed OCaml type int range between -2^30 and 2^30-1 (31 bits of information). They are stored at run-time in the 31 most significant bits of a machine word, with the least significant bit set to one to distinguish it from a pointer. This representation is taken into account when applying arithmetic operations to machine words representing unboxed integers.

Here is an example, in file interp.c in the OCaml implementation. We are in the middle of the large switch that decodes bytecode instructions:

    Instruct(ADDINT):
      accu = (value)((intnat) accu + (intnat) *sp++ - 1); Next;

The - 1 in this expression compensates for the fact that the least significant bit is set in both operands, and we want it to be set in the result (and no carry to be propagated). But as written, there is an undefined overflow in the C implementation of the bytecode interpreter when the unboxed integers in accu and *sp total more than 2^30-1. In order to avoid a signed overflow here, the instruction ADDINT could be written:

      accu = (value)((uintnat) accu + (uintnat) *sp++ - 1U); Next;

This solution takes advantage of the fact that unlike signed addition, the behavior of unsigned addition when the result overflows the destination type is defined by the C standard. The solution above does make assumptions about the conversion from a signed to an unsigned type and vice versa, but these are implementation-defined (and we are only targeting implementations that define these conversions as 2's-complement).

No overhead is involved in this approach: a half-decent C compiler should generate the same instruction sequence for the new version as it already generates for the old version.

Multiplication

You may be vaguely worried about multiplication. For instance, you may remember that while, in a typical instruction set, there is only one addition instruction that works for both signed and unsigned operands, there are dedicated signed multiplication and unsigned multiplication instructions. These are IMUL and MUL in the x86 family of processors.

Fear not! There are two multiplication instructions only because, on x86 and many other processors, 32-bit multiplication returns a 64-bit result. If you want all 64 bits of the result to be what you expect, you need to choose the instruction that interprets the operands' highest bit correctly. But if, on the other hand, you only care for the lowest 32 bits of the result (that is, you only want a result modulo 2^32), then it does not matter which multiplication you use. Think of the signed interpretation as (x + k1 * 2^32) * (y + k2 * 2^32) for unknown k1 and k2: the extra terms only affect bits 32 and above.

And this is our case. Thus, in interp.c, the implementation for bytecode instruction MULINT could be changed from:

Instruct(MULINT):
  accu = Val_long(Long_val(accu) * Long_val(*sp++)); Next;

to:

Instruct(MULINT):
  accu = Val_long((intnat)((uintnat)Long_val(accu) * (uintnat)Long_val(*sp++))); Next;

Conclusion

As pointed out by John Regehr in the comments, the two fixes above are appropriate because the OCaml language definition specifies wrap-around behavior for its (signed) arithmetic operations. One should not assume that all the warnings emitted by IOC have a similar cause. Some may be deeper bugs requiring the logic to be fixed, instead of conversions to unsigned types that only ensure wrap-around behavior.

Issues of the nature discussed here should be in the bytecode version of the OCaml compiler only. By contrast, the native compiler should translate OCaml-level multiplication directly to assembly multiplication, bypassing the portable^Wwoefully underspecified assembly that is the C language.

Programs generated by the native compiler do interface with bits of the OCaml runtime written in C, but this runtime is small.

So actually, there is not much cause for worry. If you are using a C compiler or a static analysis platform that involves OCaml code on a modern computer, it is probably already compiled with the native OCaml compiler.

Monday, November 19 2012

Funny floating-point bugs in Frama-C Oxygen's front-end

In a previous post, almost exactly one year ago, before Frama-C Oxygen was released, I mentioned that the then future release would incorporate a custom decimal-to-binary floating-point conversion function. The reason was that the system's strtof() and strtod() functions could not be trusted.

This custom conversion function is written in OCaml. It can be found in src/kernel/floating_point.ml in the now available Oxygen source tree. This post is about a couple of funny bugs the function has.

History

There had been arguments about the inclusion of the “correct parsing of decimal constants” feature in Frama-C's front-end, and about the best way to implement it. My colleague Claude Marché was in favor of using the reputable MPFR library. I thought that such an additional dependency was unacceptable and I was against the feature's inclusion as long as I could not see a way to implement it that did not involve such a dependency.

When I presented my dependency-avoiding solution to Claude, he said: “But how do you know your function works? Did you prove it?”. To which I replied that no, I had not proved it, but I had thoroughly tested it. I had written a quick generator of difficult-to-convert decimal numbers, and I was rather happy with the confidence it gave me.

My generator allowed me to find, and fix, a double rounding issue when the number to convert was a denormal: in this case the number would first be rounded to 52 significant bits and then to the lower number of significant bits implied by its denormal status.


I owe Claude a beer, though, because there were two bugs in my function. The bugs were not found by my random testing but would indeed have been found by formal verification. If you want to identify them yourself, stop reading now and start hunting, because the bugs are explained below.

Stop reading here for a difficult debugging exercise

Say that the number being parsed is of the form numopt1.numopt2Enum3 where numopt1 and numopt2 expand to optional strings of digits, and num3 expands to a mandatory string of digits.

The sequences of digits numopt1 and numopt2 can be long. The string numopt2 in particular should not be parsed as an integer, because the leading zeroes it may have are significant.

At this point of the parsing of the input program, we have already ensured that num3 was a string of digits with an optional sign character at the beginning. In these conditions, it is tempting to simply call the OCaml function int_of_string. The function int_of_string may still fail and raise an exception if the string represents a number too large to be represented as a 31- or 63-bit OCaml int.

This is easy to fix: if the program contains a literal like 1.234E9999999999999999999, causing int_of_string to fail when parsing the exponent, return infinity. A vicious programmer might have written 0.000…01E9999999999999999999, but this programmer's hard drive is not large enough to contain all the digits that would prevent infinity from being the correct answer.

Similarly, if int_of_string chokes because the programmer wrote 1.234E-9999999999999999999, the function can safely return 0.0 which, for the same reason, is always the correctly rounded floating-point representation.


Or so it would seem. The above logic is implemented in function parse_float inside Frama-C Oxygen, and this is where the bugs are.

Stop reading here for an easy debugging exercise

During a code review, my colleague Boris Yakobowski and I found that Oxygen had the following unwanted behaviors. Read on for the solution.

Exhibit one: spurious warning

double d = 0.0E-9999999999999999999;

For the program above, a warning is emitted whereas the constant is, in fact, exactly represented. The only issue here is the spurious warning:

$ frama-c e1.c
[kernel] preprocessing with "gcc -C -E -I.  e1.c"
e1.c:1:[kernel] warning: Floating-point constant 0.0E-9999999999999999999
is not represented exactly.
Will use 0x0.0000000000000p-1022.
See documentation for option -warn-decimal-float

Exhibit two: confusion between zero and infinity

double d = 0.0E9999999999999999999;

This bug is more serious:

$ frama-c e2.c
[kernel] preprocessing with "gcc -C -E -I.  e2.c"
e2.c:1:[kernel] warning: Floating-point constant 0.0E9999999999999999999
is not represented exactly.
Will use inf.
See documentation for option -warn-decimal-float

These two related bugs are fixed in the development version of Frama-C.

Wednesday, September 19 2012

Never forget to sanitize your input

This post is a follow-up to this one.

A facetious colleague pointed out to me that the print_stmt function used to display the CFG in the post mentioned above behaves incorrectly on code that includes string constants, such as the one below:

void f(const char *);

void test(int c) {
  if ("A random string"[c]=='s') f("Hello, world"); else f("Goodbye");
}

Namely, poor dot is a little bit confused by all these double quotes occurring in labels that are themselves delimited by double quotes. It manages to output something, but the result is not really satisfying: cfg_string_cst.png

Thus, it looks as though the first improvement to our little script will be to have a more robust pretty-printing function for statements. Basically, we have to escape those pesky double quotes that might appear when pretty-printing a C expression. Fortunately, OCaml's Printf (hence Format) module does know how to do this, with the "%S" directive. There's one drawback, though: this directive will also put (unescaped) double quotes at the beginning and end of the result, which we'll have to get rid of. In the end, the escaping function is the following:

let protect pretty chan v =
  let s = Pretty_utils.sfprintf "%a" pretty v in
  let s = Pretty_utils.sfprintf "%S" s in
  Format.pp_print_string chan (String.sub s 1 (String.length s - 2))

First, we create a string containing the unescaped output. Pretty_utils.sfprintf is a Frama-C function analogous to Format.sprintf, except that 1) it can be fed the same pretty-printing functions as fprintf when there is a custom "%a" directive and 2) each call uses its own internal buffer, avoiding subtle flushing issues. Second, we escape all the inner double quotes. Finally, we output the result, except for the first and last characters, which by definition of "%S" are double quotes.

We can now use protect in print_stmt each time there might be an issue (the complete file is available here):

 let print_stmt out =
  function
-   | Instr i -> !Ast_printer.d_instr out i
+   | Instr i -> protect !Ast_printer.d_instr out i
...
-   | If (e,_,_,_) -> Format.fprintf out "if %a" !Ast_printer.d_exp e
-   | Switch(e,_,_,_) -> Format.fprintf out "switch %a" !Ast_printer.d_exp e
+   | If (e,_,_,_) -> Format.fprintf out "if %a" (protect !Ast_printer.d_exp) e
+   | Switch(e,_,_,_) ->
+       Format.fprintf out "switch %a" 
+         (protect !Ast_printer.d_exp) e

Using this new version of the script, frama-c -load-script cfg_print_escape.ml test.c && dot -Tpng -O cfg.out produces a better result: cfg_string_cst_correct.png

Tuesday, May 22 2012

Iterating over the AST

Context

A facetious colleague who claims that he would be better off writing his thesis, but keeps committing Coq and OCaml files to Frama-C's repository, asked me the following question: Is there a function in Frama-C's kernel that can fold[1] a function over all types that appear in a C program?

As it turns out, nothing is readily available for this exact purpose, but another facetious colleague could have come up with a terse answer: There's a visitor for that! Now, we can be a bit more helpful and actually write that visitor. Our end goal is to come up with a function

fold_typ: (Cil_types.typ -> 'a -> 'a) -> 'a -> 'a

so that fold_typ f init will be f t_1 (f t_2 (... (f t_n init)...)), where the t_i are all the C types appearing in a given C program (in no particular order).

Frama-C's visitors

The visitor pattern is a well-known object-oriented design pattern that can be used to perform a given action over all nodes of a complex data structure (in our case the Abstract Syntax Tree, or AST for short, of a C program). Frama-C provides a generic visitor mechanism, built upon CIL's visitor, whose entry points can be found in the aptly named src/kernel/visitor.mli file. It is also documented in section 5.14 of the developer manual. In summary, you first define a class that inherits from the generic visitor and overrides the methods corresponding to the nodes you're interested in (in our case, this will be vtype, for visiting uses of Cil_types.typ). Then you apply an object of this class to the function from the Visitor module that corresponds to the part of the AST that you want to inspect (in our case, this will be the whole AST).

Basic version

The standard visitor does not provide anything to return a value outside of the AST. In fact, all the entry points in Visitor return a node of the same type as the node at which the visit starts (for visitors that don't perform code transformation, this is in fact physically the same node). But if you accept to sneak a little bit of imperative code into your development (and by using Frama-C you've already accepted that), there is an easy way out: pass the visitor a reference that it can update, and just read the final value that the reference holds after the visit. The visitor then looks like the following:

class fold_typ_basic f acc =
object
  inherit Visitor.frama_c_inplace
  method vtype ty = acc:= f ty !acc; Cil.DoChildren
end

And that's it. Each time the visitor sees a type, it will apply f to ty and the result of the previous computations, stored in acc. fold_typ then just needs to call the visitor over the whole AST and give it a reference initialized with init:

let fold_typ f init =
  let racc = ref init in
  let vis = new fold_typ_basic f racc in
  Visitor.visitFramacFileSameGlobals vis (Ast.get());
  !racc

Don't do the same work twice

This first version, that is barely more than 10 LoC, works, but we can do a little better. Indeed, f will be called each time a type is encountered in the AST. In most cases, we want to call f once for any given type. This can be done quite simply by memoizing in an instance variable of our visitor the set of types encountered thus far. Frama-C's Cil_datatype module (cil/src/cil_datatype.mli) provides all the needed functions for that:

class fold_typ f acc =
object
  inherit Visitor.frama_c_inplace
  val mutable known_types = Cil_datatype.Typ.Set.empty
  method vtype ty =
    if Cil_datatype.Typ.Set.mem ty known_types then Cil.DoChildren
    else begin
      known_types <- Cil_datatype.Typ.Set.add ty known_types;
      acc:= f ty !acc;
      Cil.DoChildren
    end
end

Testing the infrastructure

It is now time to test whether everything works smoothly. The following function will print the name and size of the type that has the biggest size in the analyzed program.

let test () =
  let f ty (maxty,maxsize as acc) =
    try
      let size = Cil.sizeOf_int ty in
      if size > maxsize then (ty,size) else acc
    with Cil.SizeOfError _ -> acc
  in
  let (ty,size) = fold_typ f (Cil.voidType,0) in
  Format.printf "Biggest type is %a,@ with size %d@." 
    !Ast_printer.d_type ty size

Since it is only a quick test, we don't do anything special if Cil complains that it cannot compute the size of a given type: we just stick to the maximal value computed so far.

File fold_typ.ml provides the code for the visitor and the test function. frama-c -load-script fold_typ.ml file.c should output something like

[kernel] preprocessing with "gcc -C -E -I.  file.c"
Biggest type is struct Ts,
with size 44

Exercises

  1. How can you use the fold_typ class to define an iter_typ function that applies a function f returning unit to each type of the AST (val iter_typ: (Cil_types.typ -> unit) -> unit)?
  2. Writing fold_typ is a bit overkill if you're going to apply it once in your plug-in. Write a specialized visitor that will do the same thing as the test function above.

Notes

[1] For those who are not familiar with functional programming: http://en.wikipedia.org/wiki/Fold_%...

Saturday, January 7 2012

Making OCaml native code 0.5% shorter on Mac OS X

Mac OS X and assembly labels

A few months ago, I was moving things around in Zarith. It's a good way to relax, not unlike gardening. And then I noticed something strange.

  • On Mac OS X, a label in assembly code is marked as local by prefixing it with "L", unlike another common convention of using ".".
  • On Mac OS X, the assembler has some sort of strange limitation that makes it generate long variants of jump instructions if the destination label is not local, even if the label is in the same compilation unit (and its relative position known).


Here is a hand-written assembly snippet to show this:

	testl	%eax, %eax
	jne	L6
	testl	%ebx, %ebx
	jne	.L7
	movl	$0, %ecx
L6:
	movl	$0, %eax
.L7:
	movl	$0, %ebx

The two above facts together mean that on Mac OS X, the snippet is compiled into object code that can then be disassembled as:

0x0000000000000057 <main+24>:	test   %eax,%eax
0x0000000000000059 <main+26>:	jne    0x68 <main+41>
0x000000000000005b <main+28>:	test   %ebx,%ebx
0x000000000000005d <main+30>:	jne    0x63 <main+36>
0x0000000000000063 <main+36>:	mov    $0x0,%ecx
0x0000000000000068 <main+41>:	mov    $0x0,%eax
0x000000000000006d <.L7+0>:	mov    $0x0,%ebx

You may notice that since .L7 is not a local label, gdb considers that it may be the name you want to see at address 0x6d. This is just a heuristic of little importance. The second conditional jump at <main+30> appears to be going to <main+36>, but this is just because we are looking at an unlinked object file. The destination has been left blank in the object file, and since it is expressed as a relative offset, the default value 0 makes it look like the destination is the instruction that immediately follows. More to the point, the first conditional jump at <main+26> occupies two bytes, because the assembler sees that the destination is close and that the relative offset fits into one byte, whereas the second conditional jump at <main+30> occupies 6 bytes, leaving room for a 4-byte encoding of the target address.

Enter OCaml. The plot thickens.

Antoine Miné and Xavier Leroy then respectively contributed the following additional facts:

  • OCaml generates labels intended to be local with a ".L" prefix on Mac OS X. This at first sight seems inefficient, since it leads the assembler to use the long encoding of jumps all the time, even when the destination is nearby.
  • But in fact, Mac OS X's linker complains when you have subtracted local labels in the file being linked, so that if you intend to subtract some of the labels you are generating, you shouldn't make them local anyway.

Subtracting addresses indicated by local labels is something that OCaml does in the process of generating meta-data accompanying the code in the assembly file. Thus, on Mac OS X, OCaml is prevented from using proper local labels with an "L" prefix.


Mac OS X's compilation chain is, of course, being obtuse. It could generate the short jump variants when the non-local destination label happens to be known and nearby. It could as well compute whatever subtractions between local labels occur in an assembly file while it is being assembled, instead of leaving them for later and then complaining that it's impossible. The usual GNU compilation suite on a modern Linux distribution gets both of these features right, and Mac OS X's gets both of them wrong, leaving the OCaml native compiler no choice but to generate the inefficiently compiled non-local labels. Mac OS X deserves all the blame here.

Solution: a hack

Are we destined to keep ugly 6-byte jumps in our OCaml-generated native code on Mac OS X then? No, because I made a hack. In Frama-C's Makefile, I changed the rule to compile .ml files into the .cmx, .o, ..., natively compiled versions thus:

--- share/Makefile.common	(revision 16792)
+++ share/Makefile.common	(working copy)
@@ -306,7 +306,10 @@
 
 %.cmx: %.ml
 	$(PRINT_OCAMLOPT) $@
-	$(OCAMLOPT) -c $(OFLAGS) $<
+	$(OCAMLOPT) -S -c $(OFLAGS) $<
+	sed -f /Users/pascal/ppc/sed_asm \
+            < $(patsubst %.ml,%.s,$<) > $(patsubst %.ml,%.1.S,$<)
+	gcc -c $(patsubst %.ml,%.1.S,$<) -o $(patsubst %.ml,%.o,$<)
 
 # .o are generated together with .cmx, but %.o %.cmx: %.ml only confuses
 # make when computing dependencies...

This uses ocamlopt's -S option to generate the assembly file from the .ml source code file. Then, a sed script is applied to the assembly file to modify it a little. And finally, the modified assembly file is compiled (gcc can be used for this).

The sed commands to transform the assembly file are these:

s/^[.]\(L[0-9]*\):/.\1: \1:/g
s/\([[:space:]]*j.*[[:space:]]*\)[.]\(L[0-9]*\)$/\1\2/g

The first command transforms all label declarations (e.g. .L100:) into a double declaration .L100: L100:. The two labels thus indicate the same location, but the second one is local whereas the first one isn't.

The second command transforms labels when used inside jump instructions, so that jne .L100 is transformed into jne L100. Crucially, it does not transform labels elsewhere, for instance when referenced in the meta-data that the OCaml compiler generates, where the compiler may have to subtract one label's address from another's.
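
The combined effect of the two commands can be checked on a three-line toy snippet (labels and instructions made up for the occasion): the declaration is doubled, the jump's operand loses its dot, and the label subtraction in the data directive is left alone.

```shell
# Toy assembly snippet: a label declaration, a jump to it, and a
# metadata directive that subtracts labels.
printf '.L100:\n\tjne .L100\n\t.quad .L101-.L100\n' > snippet.s
# Apply the two sed commands from the post.
sed -e 's/^[.]\(L[0-9]*\):/.\1: \1:/g' \
    -e 's/\([[:space:]]*j.*[[:space:]]*\)[.]\(L[0-9]*\)$/\1\2/g' \
    snippet.s
```

This prints `.L100: L100:`, then `jne L100`, then the untouched `.quad .L101-.L100`.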

Results

On Mac OS X, the described trick makes OCaml-generated native code smaller:

-rwxr-xr-x  1 pascal  staff  11479984 Jan  6 23:24 bin/toplevel.old
-rwxr-xr-x  1 pascal  staff  11414512 Jan  7 00:12 bin/toplevel.opt

The nearly 65000 bytes of difference between the two versions represent the accumulation of all the inefficiently assembled jumps in a large piece of software such as Frama-C.

Conclusion

If you play with the above compilation rule and sed script, do not expect much in terms of speed improvements: changing an inefficient encoding into an efficient one of the same instruction helps the processor, but only marginally. I guess it would be measurable, but not with my usual protocol of launching three of each and keeping the median measurement. I would have to learn about confidence intervals, which does not sound fun (not like gardening at all). Instead, I will avoid making the claim that this hack improves execution speed, and I will just postpone a bit more the moment I have to learn about statistics.

Friday, November 18 2011

Analyzing single-precision floating-point constants

The previous post on this blog points out how subtle just floating-point constants can be. In a previous previous post, exactly one year ago, I was already alluding to these difficulties in the context of Frama-C's front-end.

Programming a code analyzer that can answer question 5

How should the program below be parsed to have a chance to be analyzed correctly?

int main(void){
  float f1 = 1.01161128282547f;
  float f2 = 1.01161128282547;
  return f1 == f2;
}

In order to avoid the very double-rounding that could make variable f1 appear equal to f2, the method I delineated one year ago was to call strtof() to parse the single-precision constant. An analyzer that carelessly parses the initializer of f1 with a call to double-precision function strtod() would wrongly conclude that both variables are double-rounded and therefore equal. In OCaml, it is easy to slip because by default, strtod() is the only interfaced conversion function.
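
The double rounding is easy to reproduce from within OCaml, which helps when testing an analyzer's parsing code. In this sketch, Int32.bits_of_float (which rounds its double argument to the nearest single-precision float) simulates the double-to-float step:

```ocaml
let d = 1.01161128282547          (* OCaml parses this as a double  *)
let bits = Int32.bits_of_float d  (* round the double to a float    *)
let f_via_double = Int32.float_of_bits bits
let f_above = Int32.float_of_bits (Int32.add bits 1l)
let () =
  (* the double-rounded result is the float below the original number;
     a correct strtof() returns the float just above it *)
  Printf.printf "via double: %.9g\njust above: %.9g\n" f_via_double f_above
```

This should print the same pair 1.01161122 / 1.01161134 as the C program below.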

I have now tried the "interface and call strtof()" method, and in this post, I report on the lesson I learned. It fits in one sentence.

Calling strtof() from OCaml in order to avoid double rounding issues only works if the function strtof() is correct on the host system in the first place.


The diagram and the program explaining the problem are almost identical to those of the previous post's question 5:

#include <stdlib.h>
#include <stdio.h>
int main(void){
  float f1 = strtof("1.01161128282547", NULL);
  float f2 = strtod("1.01161128282547", NULL);
  printf("f1:%.9g\nf2:%.9g\n", f1, f2);
}
       ~1.01161122                    ~1.01161128                     ~1.01161134
            +------------------------------+------------------------------+
           f2                                ^                            f1
                                       original number

This program, when executed, should display:

$ ./a.out 
f1:1.01161134
f2:1.01161122

It is expected for the program to print 1.01161122 for f2, as the string in the program is converted to a double near 1.01161128, and then to a float (the float near 1.01161122 is chosen). For variable f1, calling function strtof() to convert the string directly to a float should result in the float near 1.01161134 being computed. Indeed, this is what the program does on Mac OS X. However:

$ ./a.out 
f1:1.01161122
f2:1.01161122
$ uname -a
Linux 2.6.34.6-xxxx-grs-ipv6-64 #3 SMP Fri Sep 17 16:06:38 UTC 2010 x86_64 GNU/Linux

This is a bug in Glibc.

Note that the standard only requires that the conversion functions strtod() and strtof(), and the conversions done by the C compiler when parsing, produce a result within one ULP of the exact value, in an implementation-defined manner. However, I do not think that "trying to return the nearest float, but failing in some difficult cases and returning the second nearest" qualifies as "implementation-defined". There's an implementation all right, but it's not much of a definition.


An analyzer works on the source code. It is natural, during the elaboration of the Abstract Syntax Tree, to convert strings from the program's source code into floating-point numbers, and the natural way to do this is to carefully call the host's function strtof() on single-precision constants and strtod() on double-precision constants. Unfortunately, if the analyzer is running on GNU/Linux, all this care is for nothing: the AST will contain wrong floating-point values (wrong in the sense of being different from the values the compiler will use for the same constants), and therefore, the analyzer will, for instance, wrongly conclude that the program in question 5 of the last post returns 1.

Mitigation technique

A simple technique to mitigate this issue is to analyze the program from question 5 less precisely, taking into account rounding issues, so as not to say anything false. This is what the value analysis in Nitrogen does:

$ ~/frama-c-Nitrogen-20111001/bin/toplevel.opt -val q5.c
[...]
          f1 ∈ [1.01161122322 .. 1.01161134243]
          f2 ∈ [1.01161122322 .. 1.01161134243]
          __retres ∈ {0; 1}

The sets computed for f1 and f2 are both over-approximated but correct. As a result of the approximation, the analyzer is unsure whether the program returns 0 or 1. This is a bit unsatisfactory, because the analysis is imprecise on a simple and deterministic program. But at least it is sound.

The solution: write your own strtof()

According to this post on the Exploring Binary blog, it is easy to write your own correctly rounded decimal-to-binary conversion function as long as you do not worry too much about performance. In the context of static analysis, parsing time is small with respect to the time taken by subsequent processing, and it is reassuring to know that the values in the AST are the correct values, regardless of bugs in the host's libraries. Following this principle, I wrote Frama-C's own Caml conversion function based on the outline there. Preliminary tests indicate that this function is adequate for this usage. It will in all likelihood be used for both single- and double-precision constants in the next Frama-C release.


This means that the value analysis in the Oxygen release will be able to give precise and correct answers to all questions in the floating-point quiz from last post. No "maybe it is and maybe it ain't" answers as Nitrogen gives on this quiz, and no incorrect answers like very old Frama-C versions gave. Here is the development version answering question 5:

$ bin/toplevel.opt -val q5.c
[...]
warning: Floating-point constant 1.01161128282547f is not represented exactly.
  Will use 0x1.02f8f60000000p0. See documentation for option -warn-decimal-float
[...]
          f1 ∈ 1.01161134243
          f2 ∈ 1.01161122322
          __retres ∈ {0}

Small approximations soon snowball into large ones, so I think it is important to be able to give the most precise value to the simplest expressions of the C language, constants.

Should we worry about being too precise?

The problem encountered with strtof() also highlights the relative fuzziness of the standard. If Frama-C is too precise, it may omit the actual behavior of the program as compiled by the compiler. This would be unfortunate. Recent versions of GCC go to great lengths to offer correct rounding to nearest (using the MPFR library). Another compiler may itself use strtof() for parsing. To cover both possibilities, that the program is compiled with GCC or with this hypothetical other compiler, Nitrogen's imprecise handling of single-precision constants was better.


I would say that a Frama-C plug-in may wish to offer both modes through a command-line option. The value analysis probably will, eventually. Regardless, it is also important that Frama-C always provides consistent results regardless of the host platform, so it was necessary to implement a correctly rounded string -> float function, even if we then add and subtract an ULP here and there in order to capture all possible compilations.

Conclusion

In the introduction, I self-centeredly linked to two earlier posts from this blog. The post from one year ago showed one difficulty as an example of why single-precision floating-point was not supported precisely in Frama-C in general. The post from last week showed several contrived examples that can only be handled with a precise understanding of floating-point —including but not limited to the issue described one year ago— in the form of a quiz. What I find interesting is that there were unforeseen difficulties in the implementation of even the foreseen difficulty of getting the right value for a single-precision constant. And, to reiterate, the current development version of the value analysis correctly answers all five questions in the quiz.


This post was improved by suggestions from Fabrice Derepas and Florent Kirchner.

Saturday, October 29 2011

A portable OCaml function to print floats

For printing the results of automated Frama-C tests, we need printing functions with the following properties:

  1. they should print all the information available (doing otherwise might mask an existing bug);
  2. they should produce results readable enough for a person to validate the result of each test once;
  3. and their output should not differ depending on the execution platform.


For tests that manipulate floating-point data, we need a floating-point printing function with the above properties. If you start with the OCaml function Format.printf and print enough digits to make condition 1 true, then a number of issues arise with condition 3. For a while, we used hexadecimal in all test outputs. For instance, the value for d in the program d = 3.14; was printed as d ∈ 0x1.91eb851eb851fp1. In this notation, the digits after 1. are in base 16, and p1 means "multiplied by 2^1", by similarity with e1 meaning "multiplied by 10^1". Users were not fond of this notation. I don't know why. It is accepted for input by C99 compilers and can be printed with the %a format code in printf(). It is easy to see at a glance that d is between 3 and 4, closer to 3, because 0x1.8p1 is 3, and 0x2.0p1 is 4.
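
As an aside for readers replaying this in a modern OCaml (4.03 or later, so more recent than this post): the %h format prints this hexadecimal notation directly.

```ocaml
(* %h prints a float in C99-style hexadecimal notation; on a typical
   build this prints 0x1.91eb851eb851fp+1, i.e. the notation above
   with an explicit '+' on the exponent *)
let () = Printf.printf "%h\n" 3.14
```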


In order to improve condition 2, I wrote my own floating-point pretty-printing function. The goal here is not to have something very sophisticated, but to make sure we do not miss a bug in a test output for the silly reason of having had ten fingers for a long time. For the same program d = 3.14;, the new decimal pretty-printing function produces d ∈ 3.1400000000000001, which is what could be expected considering that the number 3.14 cannot be represented exactly as a double. More important is that if the program is modified to read d = 0x1.91eb851eb8520p1;, changing only the least significant bit in the floating-point representation of d, the output becomes d ∈ 3.1400000000000005, showing that there is a difference.


I have recently improved my improvement, so I thought I might as well provide the function here (but you may have seen a previous version through other channels). The function I wrote only uses 64-bit ints and double-precision floats, so that it would be straightforward to translate to most other programming languages.

It works by first isolating the sign, exponent and mantissa of the floating-point number to print. The mantissa ranges over 0 .. 0xFFFFFFFFFFFFF, and we would like a sequence of decimal digits that ranges over 0 .. 9999999999999998 instead. For this purpose, the mantissa is bitwise or-ed into the representation of the double-precision float 1.0, giving a result that ranges over 1.0 .. 2.0 (excluded). The basic idea is then to subtract 1.0, to multiply by 1e16, to convert to an integer and to print with width 16 and with leading zeroes (the %016Ld format).
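
The bitwise-or trick in isolation, as a quick sketch: grafting the mantissa bits of a number onto the representation of 1.0 yields a float in [1.0, 2.0).

```ocaml
let double_mask = 0xFFFFFFFFFFFFFL  (* low 52 bits: the mantissa *)
(* extract the mantissa of 3.14 and graft it onto 1.0's representation *)
let man = Int64.logand (Int64.bits_of_float 3.14) double_mask
let d = Int64.float_of_bits (Int64.logor 0x3ff0000000000000L man)
(* 3.14 has binary exponent 1, so d is exactly 3.14 /. 2.0 *)
let () = assert (1.0 <= d && d < 2.0)
```

The rest of the function is then only concerned with turning d into decimal digits.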


The function is at the bottom of this post. The recent improvement is this test:

      let d, re =
	if d >= 1.5
	then d -. 1.5, 5000000000000000L
	else d -. 1.0, 0L
      in
      ...

Without these lines, the function was a bit unsatisfactory when the fractional part was above 0.5. The printed decimal number was a strictly increasing function of the actual floating-point number, which was enough to ensure condition 1, but some numbers that were exactly representable both in the double-precision format and as decimal numbers were printed with a decimal representation other than the one representing the number exactly. It was bugging me, although I doubt it ever bugged anyone else. This is fixed by the above code, which regains a vital bit of mantissa when the fractional part of the number to print is greater than 0.5. Even with this fix, the function is not correctly rounded, but it doesn't claim to be. It is only the closest we have bothered to come to satisfying conditions 1-3 simultaneously.


let double_norm = Int64.shift_left 1L 52
let double_mask = Int64.pred double_norm

let pretty_normal ~use_hex fmt f =
  let i = Int64.bits_of_float f in
  let s = 0L <> (Int64.logand Int64.min_int i) in
  let i = Int64.logand Int64.max_int i in
  let exp = Int64.to_int (Int64.shift_right_logical i 52) in
  let man = Int64.logand i double_mask in
  let s = if s then "-" else "" in
  let firstdigit, exp =
    if exp <> 0
    then 1, (exp - 1023)
    else 0, -1022
  in
  if not use_hex
  then begin
      let firstdigit, man, exp =
	if 0 <= exp && exp <= 12
	then begin
	    Int64.to_int
	      (Int64.shift_right_logical
		  (Int64.logor man double_norm)
		  (52 - exp)),
            Int64.logand (Int64.shift_left man exp) double_mask,
            0
	  end
	else firstdigit, man, exp
      in
      let d =
	Int64.float_of_bits
	  (Int64.logor 0x3ff0000000000000L man)
      in
      let d, re =
	if d >= 1.5
	then d -. 1.5, 5000000000000000L
	else d -. 1.0, 0L
      in
      let d = d *. 1e16 in
      let decdigits = Int64.add re (Int64.of_float d) in
      if exp = 0
      then
	Format.fprintf fmt "%s%d.%016Ld"
	  s
	  firstdigit
	  decdigits
      else
	Format.fprintf fmt "%s%d.%016Ld*2^%d"
	  s
	  firstdigit
	  decdigits
	  exp
    end
  else
    Format.fprintf fmt "%s0x%d.%013Lxp%d"
      s
      firstdigit
      man
      exp

Monday, October 17 2011

Features in Frama-C Nitrogen, part 1

Here is a series of posts that highlight interesting features in the recently released Frama-C Nitrogen 20111001. There is new functionality in Nitrogen that we hope will entice both existing and prospective users. Still other improvements will only matter to existing users. I will start off with two items from this second category.

Nitrogen compiles with OCaml 3.10.2.

The only non-optional dependency for compiling Nitrogen is an OCaml compiler. But which one? Despite a now long history, the OCaml language is still evolving relatively rapidly. We have never been happier with our choice of an implementation language: it is evolving in a direction that suits us, and it is evolving conservatively (compare Perl 6, PHP 5, Python 3, …). On the other hand, Frama-C is large and contains tricky code; the plug-in architecture with dynamic loading in general on the one hand, and the hash-consing programming technique on the other hand, are only two examples. It is large enough and tricky enough to reveal many of the subtle differences between any two versions of the OCaml compiler.

Nitrogen compiles with the latest version of OCaml, of course. That's 3.12.1 at the time of this writing. We already know that it won't compile as-is with the future OCaml 3.13 (a patch will be released in due time). Similarly, support for older versions has to be gained, version by version. If you have only written medium-sized OCaml programs, you could easily underestimate the difficulty of this. I was lucky enough not to have to deal with it much during this cycle, but some of my colleagues had to. It always is a struggle. Sometimes the equivalent of #define/#if constructs from C pre-processing would help, but this is not idiomatic in OCaml. Again, the only non-optional dependency for compiling Nitrogen is an OCaml compiler, so we won't use fine solutions such as Cppo.


For Windows and Mac OS X, OCaml is not part of the operating system, so we ship the version we like together with the binary package (if we make one). With Unix, the issues are different: there are too many flavors for us to distribute binaries, but there are efficient package systems and distributions to painlessly bring in required dependencies. Often, Frama-C itself is packaged in binary or source form as part of these distributions, thanks to the work of many volunteers. It may take some time for packages to be created for Nitrogen, and some users do not want to wait. Linus Token, for instance, may rely on a computer he bought two years ago. Frama-C works fine on two-year-old computers, as seen in the next section. Linus installed his Unix distribution of choice when he bought his computer, and now he expects Frama-C's sources to compile with the OCaml version from his distribution (OCaml programmers can be expected to have the latest OCaml compiler installed from its own sources, but Linus is not an OCaml programmer). The Unix distribution installed two years ago was on average 3 months old at that time, and it may have been frozen for stability 3 months before being released. For Linus, Frama-C has to compile with a 2.5-year-old compiler. And then there are industrial users who like to trail a little bit on the Linux front, but at the same time want all the latest Frama-C features. For Nitrogen, we chose to retain compatibility with OCaml 3.10.2, released in February 2008, and all OCaml versions released since.

The value analysis is up to four times faster on realistic examples

The second Nitrogen feature I want to highlight today is the value analysis' speed. Here is a quick and dirty comparison for programs that could already be analyzed with Carbon. There are new alarms and precision improvements in Nitrogen, but I made sure that in this comparison, the two versions were doing the same amount of work.

For this comparison, I did not cherry-pick benchmarks. I looked for programs of varying sizes in the archives of the blog, and used the three that came first, having decided in advance that I wouldn't dismiss any results I didn't like. Each analysis was run three times, and the median time was kept. This is on a Core 2 mobile processor, and the frequency is pinned at 1.5GHz through CoolBook. In plain English, the timings are for an oldish but not antiquated notebook.

The options I used were:

-slevel 1000 -val count.c
-slevel 10000 -val -no-results -cpp-command "gcc -C -E -m32 -Dprintf=Frama_C_show_each" sha1.c
-val -slevel 1000 -slevel-function main:0 *.c -cpp-command "gcc -m32 -C -E -I. "

The programs count.c and sha1.c came from this post. For sha1.c I had to disable the endianness detection code, implemented with a switch, to put the Carbon and Nitrogen versions on an equal footing. With the Carbon version, there used to be a difficult choice to make in the presence of switch, remember? I could also have used option -simplify-cfg for the Carbon execution, but then the analyzers would not have been analyzing exactly the same program. The Skein-256 example is the verification that an arbitrary number of calls to Skein_256_Update(..., 80) never cause a run-time error, using Josh Ko's instrumentation.

                             count.c     sha1.c      Skein-256

Carbon                       0m0.491s    0m2.862s    1m10.201s
Nitrogen without Zarith      0m0.410s    0m1.724s    0m37.782s
Nitrogen with Zarith         0m0.313s    0m0.962s    0m16.700s

Total speed-up                 1.5          3            4

How did the value analysis improve so much overall? As exemplified by the timings with and without Zarith, this is the result of many small enhancements and refinements at all levels.

Conclusion

As promised, the two features described here are only worth noting for faithful users. It only matters that Frama-C compiles with OCaml 3.10.2 if you intend to compile it at all, and you only care that it is much faster than before if you were already using it. Even so, some of the existing users may not notice them. This is the kind of feature that I like, because it does not complicate the user interface —the value analysis benchmarks above use the same options and produce the same results— and improves the software nonetheless. Existing users and people who try Frama-C for the first time with Nitrogen will both have a better time of it thanks to the effort spent on the two points described in this post, and on tens of others, big and small, some of which I hope will receive their own minute of spotlight in this series. I have been forced to allude to one other small improvement, a better treatment of switch in the value analysis, that I like even better because it removes the need to learn one option.
