Frama-C news and ideas


Static analysis tools comparisons

It is very easy to make snap judgments about other static analyzers when you are yourself working on one. You do not even need to act in bad faith: you just try the other tools on the programs on which you worked hard to get interesting results, that is, difficult examples on which even your own tool does not give the best result. You intend to give the other tool a passing grade if it produces a result as approximate as yours, and to acknowledge that it's not as bad as you thought if it gives a better result.


The flaw in this line of reasoning is that even if you were honest enough to pick an example on which it was possible to improve on your own tool's result, you picked one on which your tool produces a result at all. The example is automatically in the language subset that your analyzer handles, and probably in the language subset it targets. The comparison is still heavily biased in favor of your tool, whether you realize it or not. This is how you may end up thinking that Frama-C's value analysis only manipulates intervals: by only trying it on programs for which it displays intervals (programs, say, translated to C from this toy language).

Astrée

It is more instructive to take examples from other tools' websites, the examples that they use to advertise, and to see what your own tool can do with them. Let us do this in this post.

Feeling lucky?

Here is one:

[Image: astree.png, the example program from Astrée's website]

Let's try it in Frama-C's value analysis:

$ bin/toplevel.opt -val astree.c 
[kernel] preprocessing with "gcc -C -E -I.  astree.c"
astree.c:4: Floating-point constant 1.0e38 is not represented exactly.
     Will use 0x1.2ced32a16a1b1p126. See documentation for option -warn-decimal-float
...
astree.c:4: accessing uninitialized left-value: assert \initialized(&x);
astree.c:4: completely undefined value in {{ x -> {0} }} (size:<32>).
...
[value] Values at end of function main:
          NON TERMINATING FUNCTION

The webpage adds: "Astrée proves that the above program is free of run-time errors when running on a machine with floats on 32 bits." That's reassuring. If we had not read that, we might have feared that it contained undefined behavior due to the use of the uninitialized variable x (and perhaps others, but really, there is such a big problem with x that the value analysis did not look any further). "Undefined behavior" means that anything can happen, and you should consider yourself lucky if it's a run-time error. With an uninitialized float variable, you just might be lucky (and get a run-time error, instead of the proverbial demons flying out of your nose): the compiler may produce code that reads bits from the stack that happen to represent a signaling NaN, and then applies a floating-point operation to it, causing a run-time error.
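For illustration only (this is not the program from Astrée's website), here is a minimal sketch of the hazard described above: reading an uninitialized automatic float is undefined behavior, and the garbage bits on the stack may or may not happen to encode a signaling NaN.

#include <stdio.h>

int main(void)
{
  float x;             /* never initialized: its bits are whatever was left on the stack */
  float y = x + 1.0f;  /* undefined behavior: any value may come out, and on a platform
                          where floating-point exceptions trap, a signaling NaN in x
                          could even cause a run-time error */
  printf("%a\n", y);
  return 0;
}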

Miscommunication

To be fair, the release notes for the latest version of Astrée said that it could now emit alarms for uninitialized variables. I look forward to an example illustrating that. Perhaps that example will demonstrate an unsoundness in Frama-C's value analysis, who knows?

Also, the sentence "Astrée proves that the above program is free of run-time errors…" really means "Astrée does not find any of the run-time errors that it detects in the above program", and both its designers and its users at Airbus Operations S.A. are aware of this meaning. It's just that the somewhat pompous formulation is a little bit unfortunate on this example.


Coincidentally, I sometimes notice the same phrase used, about "Frama-C" this time, by makers of comparable tools, as below:

Frama-C "proves" [something ridiculous] about [this program].

It's usually just another instance of someone running other tools on the examples eir tool is good at, of course, and might more objectively be phrased as:

An external plug-in of Frama-C, separately distributed, written by a PhD student so brilliant that he was able to produce useful software during his PhD in addition to making several solid contributions to the science of program analysis, "proves" [something ridiculous] about [a program that's out of scope for this plug-in].

It doesn't have the same ring to it. I see why toolmakers prefer to peddle their own imperfect creations using the first sentence.

Polyspace

So many rhetorical questions

Polyspace's website has this example:

[Image: polyspace.png, the example program from Polyspace's website]

On the webpage, the explanation of why this example is interesting is a little confusing:

Now, rounding is not the same when casting a constant to a float, or a constant to a double:

* floats are rounded to the nearest lower value;

* doubles are rounded to the nearest higher value;

Now, are they, really? Floating-point has an aura of arbitrariness about it, but that would be a little bit over the top, wouldn't it? Isn't it just that the number 3.40282347e+38, when rounded to float in round-to-nearest mode, rounds downwards, and when rounded to double in round-to-nearest mode, rounds upwards?
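Incidentally, this is easy to confirm with a compiler before asking any analyzer. Here is a quick sketch; the two hexadecimal values it prints agree with the ones Frama-C reports below:

#include <stdio.h>

int main(void)
{
  /* The same decimal constant, rounded once to each format in round-to-nearest:
     it lands just below as a float and just above as a double. */
  printf("as float : %a\n", 3.40282347e+38f); /* 0x1.fffffep+127 */
  printf("as double: %a\n", 3.40282347e+38);  /* 0x1.fffffe091ff3dp+127 */
  return 0;
}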

Letting the compiler be the judge

According to the colored comments in the program, Polyspace guarantees that there is an overflow. Well, let's try it then:

$ bin/toplevel.opt -val ps_ovfl.c -warn-decimal-float all -float-hex
...
ps_ovfl.c:4: Floating-point constant 3.40282347e+38f is not represented exactly.
     Will use 0x1.fffffe0000000p127
ps_ovfl.c:5: Floating-point constant 3.40282347e+38 is not represented exactly.
     Will use 0x1.fffffe091ff3dp127
...
[value] Values at end of function main:
          x ∈ 0x1.fffffe0000000p127
          y ∈ 0x1.fffffe0000000p127

Frama-C's value analysis does not agree about this example. Ah, but this is about floating-point subtleties. It's not a cut-and-dried case of an uninitialized variable. Perhaps Polyspace is right. However, we may remember that in floating-point, an overflow is not a run-time error: it produces an infinity, a special kind of value that some numerical algorithms handle correctly but most don't. It makes sense for a static analyzer to treat infinities as errors (in fact, the value analysis does too), but if a static analyzer says that an overflow certainly occurs (red alarm), and we are unsure of this diagnostic, we can run the program and see what happens. It's not undefined in the C sense.

#include <stdio.h>

void main(void)
{
    float x, y;
    x = 3.40282347e+38f; // is green
    y = (float) 3.40282347e+38; // OVFL red
    printf("%a %a %d", x, y, x == y);
}
$ gcc ps_ovfl_modified.c 
...
~/ppc $ ./a.out 
0x1.fffffep+127 0x1.fffffep+127 1

No, no overflow, it seems. Polyspace's red alarm led us to expect the output 0x1.fffffep+127 inf 0, but that's not what happens.

An explanation of why Polyspace's makers think there should be an overflow

The explanation on the webpage continues:

In the case of the second assignment, the value is cast to a double first - by your compiler, using a temporary variable D1 -, then into a float - another temporary variable -, because of the cast. Float value is greater than MAXFLOAT, so the check is red.

Unfortunately, this is not how round-to-nearest works. When a double, here 0x1.fffffe091ff3dp127, falls between two floats, here 0x1.fffffep127 (MAXFLOAT) and +infinity, the nearest one should be picked.

A philosophical conundrum

Which is nearest between a finite float and an infinite one? I would like to recommend this question for the next philosophy A-levels (or equivalent worldwide). In the world of IEEE 754, the answer is terribly pragmatic. Infinity starts at the first number in the floating-point format that cannot be represented because its exponent would be too high. For single precision, this number is 0x1.0p128. The midpoint between this unrepresentable number and the last finite float 0x1.fffffep127 is 0x1.ffffffp127, and therefore, in round-to-nearest, all numbers above 0x1.ffffffp127 round up to +inf, whereas numbers below 0x1.ffffffp127 round down to 0x1.fffffep127.

Experimentation

The number 0x1.ffffffp127 is representable as a double, so we can obtain a decimal approximation of it with a simple C program:

#include <stdio.h>
int main(void){
  printf("%.16e\n", 0x1.ffffffp127);
}

$ gcc limit.c
$ ./a.out
3.4028235677973366e+38

My claim is that regardless of what Polyspace's website says, and although the number 3.402823567797336e+38 is much larger than the number in their example, you can round it to a double, and then to a float, and you will still get the finite 0x1.fffffep127 (MAXFLOAT). Conversely, the number 3.402823567797337e+38 is on the other side of the fence, and rounds to +inf.

We can try it out:

#include <stdio.h>
int main(void){
  printf("%a %a\n", (float) 3.402823567797336e+38,  (float) 3.402823567797337e+38);
}

$ gcc experiment1.c
$ ./a.out
0x1.fffffep+127 inf


What about the number 3.4028235677973366e+38? It is even more amusing. It is below the fence, and converted directly to float, it gives 0x1.fffffep127. However, converted to double, it rounds to 0x1.ffffffp127. Then, when rounding to float, the "even" rule for picking between two equidistant floats results in 0x1.0p128 being picked, and the final float result is therefore +inf.

#include <stdio.h>
int main(void){
  printf("%a %a\n", 3.4028235677973366e+38f, (float) 3.4028235677973366e+38);
}

$ gcc experiment2.c
$ ./a.out
0x1.fffffep+127 inf

This is a case of double rounding, just like in question 5 here.

But really, Polyspace wouldn't be that wrong, would it?

I'm afraid it would. Unlike with the webpage on Astrée's site, I cannot find any plausible explanation for the misunderstanding here. The example is wrong, and the explanations that follow make it worse. It could be the case that Polyspace gives results that cover all possible floating-point rounding modes, just like the value analysis option -all-rounding-modes does. There would be an overflow then (in round-upwards mode). Unfortunately, Polyspace would then have to emit an orange, since in round-to-nearest and in round-downwards, there is no overflow. It would have to produce results similar to those below, predicting that either the overflow or the normal termination of the program can happen.

$ bin/toplevel.opt -val ps_ovfl.c -all-rounding-modes -float-hex
...
ps_ovfl.c:5:[kernel] warning: overflow in float (0x1.fffffe091ff3dp127).
                  assert -0x1.fffffe0000000p127f ≤ (float)3.40282347e+38 ∧
                  (float)3.40282347e+38 ≤ 0x1.fffffe0000000p127f;
...
[value] Values at end of function main:
          x ∈ 0x1.fffffe0000000p127
          y ∈ 0x1.fffffe0000000p127
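As a side note, the dependence on the rounding mode can also be observed at run time with fesetround. This is only a sketch: it assumes the compiler performs the double-to-float conversion at run time (the volatile read is there to prevent constant folding) and honors the dynamic rounding mode, which may require a flag such as GCC's -frounding-math.

#include <stdio.h>
#include <fenv.h>

int main(void)
{
  volatile double d = 3.40282347e+38; /* the nearest double, slightly above MAXFLOAT */

  fesetround(FE_TONEAREST);
  printf("to nearest: %a\n", (double)(float)d); /* 0x1.fffffep+127, no overflow */
  fesetround(FE_UPWARD);
  printf("upward    : %a\n", (double)(float)d); /* inf: overflow in round-upwards mode */
  fesetround(FE_DOWNWARD);
  printf("downward  : %a\n", (double)(float)d); /* 0x1.fffffep+127, no overflow */
  fesetround(FE_TONEAREST);

  return 0;
}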

Conclusion

In conclusion, we should be careful about the way we express ourselves: static analysis has been oversold enough already. We would do well to learn the rules we pretend to enforce, and to avoid picking on other toolmakers, a practice that is detrimental to everyone and that I do not completely avoid in this post. In my defense, I only used each tool's own advertised examples, and I think what they show is that there is still some major overselling going on, despite how detrimental this has already been in the past. I would tell you about examples on PVS-Studio's website, but at least that tool has different goals and does not use the verb "prove" in the sense "my imperfect implementation based on an idealized formalization computes that…" all the time.

To end on a lighter note, here is a funny picture to meditate on.

Comments

1. On Monday, November 28 2011, 15:19 by Sylvain

What you demonstrate in this post is that the field of static error-detection for embedded C (as opposed to security-oriented static analysis, Cf. the CERT test suite) is incredibly immature from a customer point of view.

There is currently no expert-checked independent assessment suite that covers all run-time errors and undefined behavior of the C programming language, with the help of which people could assess a tool. And of course no regularly updated published comparison.

All that is left is a few outdated, anecdotal comparison papers, and assertions from vendors, mainly marketing-driven.

Given what is at stake, one would think that such an assessment framework should already exist. As developing and maintaining such a framework is demanding in resources and expertise, maybe customers of such tools could syndicate and promote the elaboration of an objective assessment tool, open to public scrutiny, that would serve as a basis for tool certification. Some research projects already led to results that can serve as a basis for such a project, so work does not even have to start from zero!

This would no doubt drive progress in the field, don't you think?

BTW, do you have any idea how old the article that the picture is taken from is? It is so wonderful on so many levels!

2. On Monday, November 28 2011, 19:32 by pascal

Hello, Sylvain.

> What you demonstrate in this post is that the field of static error-detection for embedded C (as opposed to security-oriented static analysis, Cf. the CERT test suite) is incredibly immature from a customer point of view.

> There is currently no expert-checked independent assessment suite that covers all run-time errors and undefined behavior of the C programming language, with the help of which people could assess a tool. And of course no regularly updated published comparison.

I am sorry, but where can I find this "CERT test suite"? Googling for it does turn up some pages where it is spoken about, but not the page where it is hosted.

I think you are over-estimating the number of users of these tools. If there were thousands of users, some experts would appear and make serious comparisons, because they could hope to generate enough interest to recoup their expenses. Instead, there are tens of users and they are not going to collectively fund this kind of comparison, either directly or indirectly.

I am not a specialist, and I haven't even seen the "CERT test suite", but I think you are also over-estimating the quality of information about security analysis tools. There are plenty of design tradeoffs there too, and even if a tool claims that it can theoretically detect this or that category of security flaw, that doesn't mean that it does so with 100% recall and 100% precision. Actually, it cannot do it with 100% recall and 100% precision, if you allow me to identify computers with Turing machines for this discussion.

> All that is left is a few outdated, anecdotal comparison papers, and assertions from vendors, mainly marketing-driven.

> Given what is at stake, one would think that such an assessment framework should already exist. As developing and maintaining such a framework is demanding in resources and expertise, maybe customers of such tools could syndicate and promote the elaboration of an objective assessment tool, open to public scrutiny, that would serve as a basis for tool certification. Some research projects already led to results that can serve as a basis for such a project, so work does not even have to start from zero!

Sure. Just multiply the price you think each customer would be willing to pay for the study by the probable number of customers I indicated, and you'll get the budget for the study. If you find qualified experts who would do it, please let me know, I have a better-paying job for them (assuming they are qualified).

Of course, embedded code verification doesn't lend itself to easy comparisons any better than security auditing does. It's just that the market is smaller. Astrée does way, way better than both PolySpace and Frama-C's value analysis for airplane control software: that's what it was designed for. PolySpace is the only one of the three to handle Ada code. And Frama-C has its own merits (it knows what an uninitialized variable is and how to round between MAXFLOAT and +inf, among other qualities). There is so much any analysis tool could possibly do, and there are so few tools that even try to be sound.

Seriously, though, if the risk-free, manager-friendly, pick-the-one-with-the-most-checkmarks comparative study you desire were already available, then you would be a laggard for only considering these tools now. The reason there is an opportunity in trying these tools now is that everyone is not yet doing it (but early adopters have done their job: there are some elements to base your choices on). Sure, the tools are not for everyone yet. If they were, you wouldn't be reading this blog.

3. On Tuesday, November 29 2011, 11:51 by Sylvain

> CERT test suite

I must apologize: I mixed things up. I meant the "NIST SAMATE Reference Dataset Project".

"The purpose of the SAMATE Reference Dataset (SRD) is to provide users, researchers, and software security assurance tool developers with a set of known security flaws".
It can be found here: http://samate.nist.gov/SRD/

> risk-free, manager-friendly, pick-the-one-with-the-most-checkmarks comparative study

But that's not what I was suggesting!

I meant a set of qualified reference test files along with the results a tool is expected to provide, organized as an open-source project hosted on, e.g., SourceForge, open to contributions from "experts" and researchers. Or alternatively maintained by an institution (Cf. the NIST) or a consortium.
The onus would then be on a putative user's QA department to use this set to run any comparison it wishes.

The effort would be pooled, instead of each QA department doing it again and again. C and its weaknesses do not change all the time, so the gain would be cumulative.

> The reason there is an opportunity in trying these tools now is that everyone is not yet doing it.

Point taken. We are speaking of market spaces with few big players. In such a case there is obviously a tension between cooperation à la free software and trying to gain a competitive edge, and it may very well be that the advantages of the latter outweigh the former - this precludes openness and cooperation.
So I guess my suggestion will be moot unless the user base expands into the rest of the industry (which I very much hope it does).

As for not reading this blog, are you sure you are not underestimating both your entertaining writing style and the interest of the subjects you write on?

4. On Thursday, December 1 2011, 01:09 by Derek Jones

Sylvain,

In theory your idea is a good one. In practice users are not willing to pay what it costs to create and maintain any worthwhile collection of tests. This observation is based on my experience with compiler validation (see shape-of-code.coding-guidelines.com/2011/09/01/c-compiler-validation-is-21-today/ for a discussion of C validation) where users are rarely willing to invest in having in-house compiler test suites to check the compilers they are using.

5. On Wednesday, December 14 2011, 11:48 by David Monniaux

It is true indeed that static analyzers may be based on incorrect semantics (or correct semantics... but just not the one used by the compiler) or overlook certain kinds of undefined behaviors. This is why in 2012-2015, the VERASCO project, led by Xavier Leroy, will, among other things, implement a Coq-certified static analyzer and hook it into CompCert. This is a joint project of INRIA-Rocquencourt, INRIA-Saclay, Univ. Rennes-1, VERIMAG and Airbus (Airbus will try to make such advanced techniques palatable for civil aviation authorities).