Frama-C news and ideas

Monday, April 30 2012

Benchmarking static analyzers

A meta-article on benchmarking

Some colleagues and I are about to submit this article on the subject of benchmarking static analyzers. Rather than another benchmark, it is a list of general recommendations and pitfalls for anyone benchmarking static analyzers. The article is illustrated with C program constructs because that's what we are familiar with, but these examples are only intended to show how complicated static analysis can be in general. Had we focused on another language, the underlying ideas would have remained the same: different users and toolmakers have different views of what static analysis is about. Even if they agreed on what static analysis is about, there would still be no perfect oracle to tell what the right answer to a particular testcase is.

While discussing this article as it was being written, colleagues raised several valid points. Unfortunately, the page limit for the category in which we wish to submit is 4 pages, so I have to address those points here. Consider this post a sort of addendum to the article.

Why should the C standard be the ultimate authority?

The short, facetious answer to this question is that the C standard is the democracy of C semantics: it is the worst form of reference document we have, except for all the others.


Which version of the C standard?

This is not a big problem in practice. Everyone seems to focus on what compilers implement, for obvious reasons, and what compilers implement is C90. In Frama-C, we refer to C99 when it is convenient. For instance, in C99, division is defined in more detail than in C90. C90 does not specify whether division rounds towards minus infinity (Euclidean division with a positive remainder) or towards zero (all processors implement division like this, because it means that the sign of the result can be computed independently from the magnitudes of the operands, saving transistors). C99 specifies that division rounds towards zero. More information is good, so Frama-C follows C99 here. But when C90 and C99 are mutually exclusive, and compilers all implement C90, we follow the compilers. In this post about subtle typing issues, we side with gcc, predicting 2147483648 for the second example, not with gcc -std=c99, which compiles the program as displaying -2147483648. (See the following post for the explanations behind these differences.)
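A small illustration (mine, not taken from the article) of the difference: C99 pins down both results below, whereas C90 alone would also allow the quotient to round towards minus infinity.

#include <stdio.h>

int main(void)
{
  /* C99 guarantees truncation towards zero: this prints "-3 -1".
     Under C90 alone, an implementation could print "-4 1" instead. */
  printf("%d %d\n", (-7) / 2, (-7) % 2);
  return 0;
}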


What if most implementations agree with each other but disagree with the standard?

There are a couple of more interesting examples that we intended for the article, to illustrate this very dilemma. One is about pointer comparisons. Technically, the comparison &a < &b is undefined behavior. The same reasoning that, in the article, justifies that analyzers do not have to worry what happens in the program after an illegal pointer access could be applied to this expression.

The point here is not just that most analyzers do not care about this issue, and simply treat the comparison as returning 0 or 1. The point is that there are good reasons for an analyzer to let this kind of comparison slip: it happens naturally when a user implementation of memmove() is passed &a as source and &b as destination. An analyzer that lets this construct slip may be an analyzer that received more testing and for which usability studies revealed that users were too confused when &a < &b was treated as if it halted execution. Since there are arguments both ways, it is difficult for organizers to arbitrarily decide what an analyzer should do when confronted with this expression. And since the construct does not seem too fundamental, the simple escape that we recommend is to omit it from the benchmark: it shouldn't occur in any testcase, neither as the goal of the testcase (the undefined behavior to be detected) nor as an incidental construct that would bias testcases intended for another criterion.
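Here is a sketch (mine, not from the article) of the kind of user memmove() in which the comparison arises naturally; my_memmove is a hypothetical name:

#include <stddef.h>

void *my_memmove(void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  if (d < s)                      /* the pointer comparison in question */
    while (n--) *d++ = *s++;      /* copy forwards */
  else
    while (n--) d[n] = s[n];      /* copy backwards */
  return dst;
}

A call such as my_memmove(&b, &a, sizeof a) then compares &a and &b even though they point to unrelated objects.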


For the sake of exhaustiveness, this can be contrasted with the condition &a == &b + 1. You may expect this to be undefined too, but for some reason it is only unspecified. The implementation may return 0 or 1, and it need not return the same value every time. Derek Jones would point out here that an implementation could rely on just-in-time compilation, so that the value of &a == &b + 1 could change from one evaluation to the next, for the same expression referring to the same variables a and b. Anyway, it is not undefined, so if the standard were the absolute reference, static analyzers would not be allowed to consider that execution gets stuck on evaluating &a == &b + 1.
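A minimal illustration (mine, not from the article) of the contrast:

#include <stdio.h>

int main(void)
{
  int a, b;
  /* Unspecified result: this prints 1 if b happens to be laid out
     immediately before a in memory, and 0 otherwise. Unlike &a < &b,
     evaluating the comparison is not undefined behavior. */
  printf("%d\n", &a == &b + 1);
  return 0;
}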

In Frama-C's value analysis, we do not see much reason to treat equality and inequality differently, but we recognize that both stopping execution and continuing with results {0; 1} can have their uses, so we offer an option to choose between these two behaviors. This series of posts already covers this kind of nuance.


Frama-C's value analysis detects, warns about, and considers execution stuck on uninitialized variables. But by design, it avoids warning (and considering execution stuck) on the statement y = x; or *dst = *src;. This is a compromise, made necessary by the need to analyze user implementations of memcpy(), which may use exactly that second statement to copy uninitialized data when copying a struct or union.
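A minimal sketch (mine, not from the article) of the situation that motivates the compromise; copy_s stands in for a user memcpy() specialized to one struct type:

struct s { int a; int b; };

void copy_s(struct s *dst, const struct s *src)
{
  *dst = *src;      /* the second statement discussed above */
}

int main(void)
{
  struct s x, y;
  x.a = 1;          /* x.b is deliberately left uninitialized */
  copy_s(&y, &x);
  return y.a;       /* only the initialized member is ever read */
}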


One long-standing bug in Frama-C's front-end is that it erases a possible bug during parsing and normalization. According to the letter of the standard, addressee().a is dangerous. However, this misfeature of the C99 standard has quite a history (in the circle of people who care about this sort of thing), as illustrated by the numerous comments on the linked CERT page. A static analyzer could omit this bug (and consider that execution continues) as a conscious design decision (for instance if it specializes in a compiler that documents that this construct is safe). Another static analyzer could treat it as undefined and consider that execution gets stuck. Again, the construct is not very interesting in itself, so it should not be too restrictive to omit it from the explicit goals of the benchmark, and from the incidental features that might perturb other tests.
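The following illustration is hypothetical: the definition of addressee() below is mine, assuming that it returns a structure whose member a is an array, which is the situation in which the letter of the standard makes the access dangerous (the returned object is a temporary whose lifetime C99 leaves unclear):

struct msg { char a[16]; };

struct msg addressee(void)
{
  struct msg m = { "someone" };
  return m;
}

int main(void)
{
  return addressee().a[0];   /* the construct discussed above */
}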


Finally, there are areas where the standard under-specifies things. One egregious example is floating-point. David Monniaux's report is still the reference I recommend for these floating-point issues.


All but the last of these examples are cases of “Don't do that, then”: the issue can simply be avoided by not using any of the ambiguous constructs for which it is unclear whether an analyzer should be allowed to consider them blocking. Also, all the issues in these first examples disappear when toolmakers are allowed to participate: when their analyzer scores badly on a test because it is better than the others at detecting incidental issues, they will be quick to point out the problem.

For the last example, floating-point, I do not have a solution to offer. Some real-life software issues come specifically from the liberties that the standard allows and from the fact that compilers take advantage of them, so it is a worthy goal for a static analyzer to try to model these liberties, and a worthy goal for a benchmark to measure how well the static analyzer is doing. But floating-point in C is a mess, and since one of the issues is that compilation is under-specified, it does not suffice to use one (or even several) compilations as a reference.


What about implementation-defined behavior?

This is a non-issue, really. All real C programs invoke implementation-defined behavior. As an example, the line int x = 38000; invokes implementation-defined behavior. Therefore, all static analyzers that aim to be useful for real programs assume some implementation choices, and they all tend to assume the compilation choices of the dominant platform, so implementation-defined behavior causes very little trouble in practice.
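To make the example concrete (the elaboration is mine): int is only guaranteed to represent values up to 32767, so on an implementation where int is 16 bits wide, the constant 38000 has type long and its conversion to int produces an implementation-defined result.

#include <stdio.h>

int main(void)
{
  /* Implementation-defined on a 16-bit-int platform (C99 6.3.1.3);
     simply 38000 on the dominant 32-bit platforms. */
  int x = 38000;
  printf("%d\n", x);
  return 0;
}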

Acknowledgments

Radu Grigore provided many ideas that did not fit in the article, and is directly responsible for the existence of this blog post.

Wednesday, January 11 2012

Static analysis benchmarks

Reminiscing

The first benchmark Frama-C's value analysis was ever evaluated on was the Verisec suite, described in "A buffer overflow benchmark for software model checkers". My colleague Géraud Canet was doing the work, hence the facetious-colleagues tag of this post. In retrospect, Verisec was a solid piece of work. Being naive when it comes to statistics, I didn't recognize it at the time, but when I later examined other benchmarks, I developed an appreciation for this one. It does a lot of things right.

  • It contains good examples as well as bad. A static analyzer is for separating good programs from bad ones. It should be obvious that you cannot evaluate how well a static analyzer does using only bad programs.
  • The article's title says it's for "Software Model Checkers" but it does a good job of being accessible to a wide range of tools with varying abilities. Some good/bad program pairs are available in "no pointer, only arrays" variants and in "small arrays only" variants, providing more comparison points between different techniques. This may help discover that some technique detects a wrong behavior with small arrays but does not scale to larger ones, whereas another one does not detect them at all.
  • The entries in the benchmark are moderate-sized snippets of code with unknown inputs. If, by contrast, most or all of the examples had fixed inputs, and assuming that these examples did not get stuck in an infinite loop, then a detection strategy based on executing the program concretely, step by step, and detecting wrong behaviors on the fly would sail through the benchmark. The benchmark would accurately capture the fact that, for such cases, an analyzer can easily be made correct and complete. What it would miss is that users turn to static analysis tools to detect wrong behaviors that can occur for only some inputs among many, not just to detect that wrong behaviors occur when known (dangerous) inputs are used. A sketch of the difference follows this list.
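
Here is that sketch (mine, not taken from Verisec):

#include <string.h>

/* With an unknown argument, predicting whether the overflow below can
   happen requires reasoning about all possible inputs. */
void process(const char *input)
{
  char buf[8];
  strcpy(buf, input);   /* overflows whenever strlen(input) >= 8 */
}

int main(void)
{
  /* With a fixed input like this one, a tool that simply executes the
     program concretely and watches for wrong behaviors on the fly would
     detect the overflow just as well. */
  process("definitely longer than eight characters");
  return 0;
}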


Géraud was using the Verisec benchmark right, too. He spent enough time on each case to uncover some interesting facts. One such fact was a definite bug in one of Verisec's "good" cases, where the targeted bug was supposed to have been fixed. The explanation was this: there were two bugs in the real-life snippet that had been used as the "bad" case, and only one had hitherto been noticed. The known bug had been fixed in the corresponding "good" case in the benchmark and in real life. The unknown bug had not, as the song of La Palisse might have said.

When a static analysis benchmark is done right, as the Verisec benchmark is, it tries to measure the ability of tools to solve a hard problem: predicting whether something happens at run-time in a non-trivial program with unknown inputs. Furthermore, the reason there is a need for static analyzers and for a benchmark in the first place is that humans are not good at recognizing issues in code either. So, if the benchmark is representative of the problems a static analyzer should solve, there exists no perfect oracle that can help decide what the correct answer is for each snippet, and the benchmark can be expected to contain a few bugs classified as "good" code.

Is safety static analysis incredibly immature?

In a recent post I compared Frama-C's value analysis to PolySpace and Astrée. It was a single-point comparison in each case, each time using an example chosen by the people who sell the analyzer, an example used on the analyzer's website to demonstrate the analyzer's precision. In that post's comments, Sylvain regretted "that the field of static error-detection for embedded C (as opposed to security-oriented static analysis, Cf. the [NIST SAMATE] test suite) is incredibly immature from a customer point of view".

His conclusion is the opposite of the idea I was trying to express there. I was trying to say that embedded safety static analysis was mature, and that all that remained to be done to convince people was to stop making unrealistic claims and derisive comparisons. Talk about scoring own goals. This blog post is my penitence for not having been clear the first time.


The field of software security is orders of magnitude more complex and disorganized than that of embedded safety. It deals with programs that are larger and use more difficult constructs (e.g. dynamic allocation). The properties that need to be verified are numerous and widely dispersed:

  • There are the undefined behaviors that cause something to happen that was not intended by the programmer. These can breach any of a system's confidentiality, integrity or availability. They are the same undefined behaviors that Polyspace, Astrée or Frama-C's value analysis detect, give or take that a previous version of Astrée did not detect uninitialized variables, or that the Nitrogen version of Frama-C's value analysis (initially intended for embedded code) does not include the modeling of free() I have been working on recently, and therefore does not detect undefined behaviors related to free().
  • There are security issues in code without undefined behaviors, such as the hashing denial of service attack everyone is talking about these days.
  • And as a last illustration of the variety of security properties, timing information leaks, the detection of which I recently discussed, also count among security concerns.

As a digression, we are not worried about the hashing denial of service attack in Frama-C, but the counter-measures in OCaml 3.13 (which allows hash functions to be seeded) are going to be a bit of a bother for Frama-C developers. Let the one among you who has never iterated over a hash table while displaying the results of a regression test cast the first stone.


The point is that software security people need a suite like NIST's to describe by example what their field is about (more than 45 000 C/C++ cases to date). The field of embedded safety does not have a similar suite because it does not need one. AbsInt, the company that sells Astrée, did not wait for me to (gently) make fun of their website before they implemented uninitialized variable detection: they did it of their own volition because they knew Astrée should do it, and they had already released a new version with that feature at the time my colleague Florent Kirchner tried their example and noticed the uninitialized variable.

A non-benchmark

Also, I think it's a slippery slope to call what NIST has a "test suite", because before long someone will call it a "benchmark", and NIST's suite is no benchmark. Many of the examples have definite inputs, making them unrepresentative — specifically, making them too easy to handle with techniques that do not scale to the billions of possible inputs that must be handled routinely in real life. There do not seem to be nearly enough "good" examples in the suite: the few I looked at were all "bad" examples, and based on this sample, an analyzer that always answered "there is an issue in the target code" would get a very good mark indeed if this were a benchmark.


Worse, in some cases, the "bad" line to find is extremely benign and there are more serious issues in the case. Consider entry 1737, one of the few I looked at:

...
void test( char * str )
{
  char * buf;
  char * newbuf;
  char * oldbufaddress;

  buf = malloc( MAXSIZE );
  if ( !buf )
    return;
  strncpy( buf, str, MAXSIZE );
  buf[MAXSIZE - 1] = '\0';
  printf( "original buffer content is : %s\n", buf );
  oldbufaddress = buf;
  buf = realloc( buf, 1024 ); /*BAD */
  printf( "realloced buffer content is : %s\n", buf );
  printf( "original buffer address content is : %s\n", oldbufaddress );
  free( buf );
}

int main( int argc, char * * argv )
{
  char * userstr;

  if ( argc > 1 )
  {
    userstr = argv[1];
    test( userstr );
  }
...

With proper modeling of printf(), free(), and realloc(), such as the modeling I am currently working on, Frama-C's value analysis would warn about both printf() calls. The warning for the first call is there because realloc() can fail and return NULL (passing NULL to printf() invokes undefined behavior). In this case, there famously also exists a memory leak problem, as the initial block is not freed when realloc() fails, but a memory leak is not an undefined behavior. The warning for the second call is there because oldbufaddress becomes indeterminate if realloc() moves the block.


The above is not what the author of the test case wants to hear. What the author of the test case wants to hear, clarified by the CWE-244 reference, is that the realloc() call is dangerous because if realloc() moves the block, it may leave the confidential information that was contained in the block lying on the heap. This is fine, if the case is intended as a starting point for discussion (although it would be better without an undefined behavior unrelated to the specific issue being tested).

One problem is that there is no indication in the program that the block contains anything secret. A static analyzer that warns for this program should warn for any use of realloc(), since this is how realloc() is intended to be used (it is intended to preserve the contents of the block being resized). An optimal analyzer for this issue is grep realloc *.c.

It does not hurt to zero the secret data before you free() a block, and to avoid realloc(), in a "belt and suspenders" approach to security; but if you use Frama-C's value analysis properly to verify that the program does not have undefined behaviors, you do not have to do this, and the code will still be safe from heap inspection vulnerabilities. Perhaps using the value analysis "properly" is inapplicable in your case; but you should at least be aware of the design choices here. Case 1737, by itself, just seems to pick one specific circumvention method for a general issue, and then confuses the resolution of the general issue with the application of that specific method. The map is not the territory.
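As an illustration of the "belt and suspenders" approach (this sketch is mine; grow_secret is a hypothetical name, and a real implementation would need a compiler-proof way of clearing memory, since a plain memset() of a block about to be freed may be optimized away):

#include <stdlib.h>
#include <string.h>

/* Resize a buffer holding secret data without leaving a stale copy on
   the heap: allocate, copy, clear the old block, then free it. */
void *grow_secret(void *secret, size_t oldsize, size_t newsize)
{
  void *fresh = malloc(newsize);
  if (!fresh)
    return NULL;                  /* the caller still owns the old block */
  memcpy(fresh, secret, oldsize < newsize ? oldsize : newsize);
  memset(secret, 0, oldsize);     /* see the caveat above */
  free(secret);
  return fresh;
}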


Were this case used as a benchmark, it would very much remind me of primary school tests written by teachers without any deep understanding of the subject they were teaching, where you had to guess what was in the teacher's mind at each unclear question. Here, for instance, the value analysis warns that the program may use a dangling pointer in the second printf() — intended to symbolize the information leak. If the program instead called malloc() and tried to use the contents of the new block without having initialized it, then the value analysis would warn about that. Generally speaking, any way I know of for the contents of the initial block to be read from the heap is flagged at some point as an undefined behavior, but this is not what case 1737 tests, because case 1737 tests "analyzer warns about realloc()", and the correct answer is to warn about realloc().

Conclusion

Coming back to the question of which field is immature from a customer point of view: if your interest is in software security, NIST's suite can certainly be used as a basis to understand what kind of static analyzer you may be after. You can try candidate static analyzers on cases that you have selected as most representative of your own problems. But it won't save you from reading each candidate's documentation, understanding how each of them works, and carefully examining their results. Any mechanical interpretation of the results will lead you to pick grep realloc without even understanding case 1737 and its limitations.


The issue with comparing embedded safety static analyzers is that the design space is too large, and the design space for security static analysis certainly isn't any smaller. And if someone ever boasts that their security-oriented software analyzer covers 95% of the NIST SAMATE "benchmark", they are probably selling snake oil. Just saying.

Friday, November 25 2011

Static analysis tools comparisons

It is very easy to make snap judgments about other static analyzers when you are yourself working on one. You do not even need to act in bad faith: you just try the other tools on the programs that you worked hard to give interesting results on — difficult examples, on which even your own tool does not give the best result. You intend to give the other tool a passing grade if it produces a result as approximate as yours, and to acknowledge that it's not as bad as you thought if it gives a better result.


The flaw in this line of reasoning is that even if you were honest enough to pick an example on which it was possible to improve on your own tool's result, you picked one on which your own tool at least produces a result. The example is automatically in the language subset that your analyzer handles, probably in the language subset it targets. The comparison is still heavily biased in favor of your tool, whether you realize it or not. This is how you may end up thinking that Frama-C's value analysis only manipulates intervals: by only trying it on programs for which it displays intervals (programs, say, translated to C from this toy language).

Astrée

It is more instructive to take examples from other tools' websites, the examples that they use to advertise, and to see what your own tool can do with them. Let us do this in this post.

Feeling lucky?

Here is one:

[Image astree.png: the example program from Astrée's website]
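The program itself is not reproduced here; the reconstruction below is mine, inferred only from the analysis output that follows, and the actual example on Astrée's website may differ:

/* Hypothetical reconstruction (not the actual astree.c): a float is read
   before ever being initialized, on the same line as the constant 1.0e38
   reported in the output below. */
int main(void)
{
  float x, y;
  y = x + 1.0e38;   /* x is uninitialized here */
  return 0;
}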

Let's try it in Frama-C's value analysis:

$ bin/toplevel.opt -val astree.c 
[kernel] preprocessing with "gcc -C -E -I.  astree.c"
astree.c:4: Floating-point constant 1.0e38 is not represented exactly.
     Will use 0x1.2ced32a16a1b1p126. See documentation for option -warn-decimal-float
...
astree.c:4: accessing uninitialized left-value: assert \initialized(&x);
astree.c:4: completely undefined value in {{ x -> {0} }} (size:<32>).
...
[value] Values at end of function main:
          NON TERMINATING FUNCTION

The webpage adds: "Astrée proves that the above program is free of run-time errors when running on a machine with floats on 32 bits." That's reassuring. If we had not read that, we might have feared that it contained undefined behavior due to the use of uninitialized variable x (and perhaps others, but really, there is such a big problem with x that the value analysis did not look any further). "Undefined behavior" means that anything can happen, and you should consider yourself lucky if it's a run-time error. With an uninitialized float variable, you just might be lucky (and get a run-time error, instead of the proverbial demons flying out of your nose): the compiler may produce code that reads bits from the stack that happen to represent a signaling NaN, and then applies a floating-point operation to it, causing a run-time error.

Miscommunication

To be fair, the release notes for the latest version of Astrée said that it could now emit alarms for uninitialized variables. I look forward to an example illustrating that. Perhaps that example will demonstrate an unsoundness in Frama-C's value analysis, who knows?

Also, the sentence "Astrée proves that the above program is free of run-time errors…" really means "Astrée does not find any of the run-time errors that it detects in the above program", and both its designers and its users at Airbus Operations S.A. are aware of this meaning. It's just that the somewhat pompous formulation is a little bit unfortunate on this example.


Coincidentally, I sometimes notice the same phrase used, about "Frama-C" this time, by makers of comparable tools, as below:

Frama-C "proves" [something ridiculous] about [this program].

It's usually just another instance of someone running other tools on the examples their own tool is good at, of course, and might more objectively be phrased as:

An external plug-in of Frama-C, separately distributed, written by a PhD student so brilliant that he was able to produce useful software during his PhD in addition to making several solid contributions to the science of program analysis, "proves" [something ridiculous] about [a program that's out of scope for this plug-in].

It doesn't have the same ring to it. I see why toolmakers prefer to peddle their own imperfect creations using the first sentence.

Polyspace

So many rhetorical questions

Polyspace's website has this example:

[Image polyspace.png: the example program from Polyspace's website]
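Again, the screenshot is not reproduced; the reconstruction of ps_ovfl.c below is mine, inferred from the modified version further down, and the program in Polyspace's screenshot may differ in its details:

/* Hypothetical reconstruction of ps_ovfl.c */
void main(void)
{
    float x, y;
    x = 3.40282347e+38f;         // flagged green by Polyspace
    y = (float) 3.40282347e+38;  // flagged as a red OVFL by Polyspace
}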

On the webpage, the explanation of why this example is interesting is a little confusing:

Now, rounding is not the same when casting a constant to a float, or a constant to a double:

* floats are rounded to the nearest lower value;

* doubles are rounded to the nearest higher value;

Now, are they, really? Floating-point has an aura of arbitrariness about it, but that would be a little bit over the top, wouldn't it? Isn't it just that the number 3.40282347e+38, when rounded to float in round-to-nearest mode, rounds downwards, and when rounded to double in round-to-nearest mode, rounds upwards?

Letting the compiler be the judge

According to the colored comments in the program, Polyspace guarantees that there is an overflow. Well, let's try it then:

$ bin/toplevel.opt -val ps_ovfl.c -warn-decimal-float all -float-hex
...
ps_ovfl.c:4: Floating-point constant 3.40282347e+38f is not represented exactly.
     Will use 0x1.fffffe0000000p127
ps_ovfl.c:5: Floating-point constant 3.40282347e+38 is not represented exactly.
     Will use 0x1.fffffe091ff3dp127
...
[value] Values at end of function main:
          x ∈ 0x1.fffffe0000000p127
          y ∈ 0x1.fffffe0000000p127

Frama-C's value analysis does not agree about the example. Ah, but this is about floating-point subtleties. It's not a cut-and-dried case of an uninitialized variable. Perhaps Polyspace is right. However, we may remember that in floating-point, an overflow is not a run-time error: it produces an infinity, a special kind of value that some numerical algorithms handle correctly but most don't. It makes sense for a static analyzer to treat infinities as errors (in fact, the value analysis does too), but if a static analyzer says that an overflow certainly occurs (red alarm), and we are unsure of this diagnostic, we can run the program and see what happens. It's not undefined in the C sense.

#include <stdio.h>

void main(void)
{
    float x, y;
    x = 3.40282347e+38f; // is green
    y = (float) 3.40282347e+38; // OVFL red
    printf("%a %a %d", x, y, x == y);
}
$ gcc ps_ovfl_modified.c 
...
~/ppc $ ./a.out 
0x1.fffffep+127 0x1.fffffep+127 1

No, no overflow, it seems. Polyspace's red alarm led us to expect the output 0x1.fffffep+127 inf 0, but that's not what happens.

An explanation why Polyspace makers think there should be an overflow

The explanation on the webpage continues:

In the case of the second assignment, the value is cast to a double first - by your compiler, using a temporary variable D1 -, then into a float - another temporary variable -, because of the cast. Float value is greater than MAXFLOAT, so the check is red.

Unfortunately, this is not how round-to-nearest works. When a double, here 0x1.fffffe091ff3dp127, falls between two floats, here 0x1.fffffep127 (MAXFLOAT) and +infinity, the nearest one should be picked.

A philosophical conundrum

Which is nearest between a finite float and an infinite one? I would like to recommend this question for the next philosophy A-levels (or equivalent worldwide). In the world of IEEE 754, the answer is terribly pragmatic. Infinity starts at the first number that cannot be represented in the floating-point format because its exponent would be too high. For single-precision, this number is 0x1.0p128. The midpoint between this unrepresentable number and the last finite float 0x1.fffffep127 is 0x1.ffffffp127, and therefore, in round-to-nearest, all numbers above 0x1.ffffffp127 round up to +inf, whereas numbers below 0x1.ffffffp127 round down to 0x1.fffffep127.

Experimentation

The number 0x1.ffffffp127 is representable as a double, so that we can obtain a decimal approximation of it with a simple C program:

#include <stdio.h>
int main(void){
  printf("%.16e\n", 0x1.ffffffp127);
}

$ gcc limit.c
$ ./a.out
3.4028235677973366e+38

My claim is that regardless of what Polyspace's website says, and although the number 3.402823567797336e+38 is much larger than the number in their example, you can round it to a double, and then to a float, and you will still get the finite 0x1.fffffep127 (MAXFLOAT). Conversely, the number 3.402823567797337e+38 is on the other side of the fence, and rounds to +inf.

We can try it out:

#include <stdio.h>
int main(void){
  printf("%a %a\n", (float) 3.402823567797336e+38,  (float) 3.402823567797337e+38);
}

$ gcc experiment1.c
$ ./a.out
0x1.fffffep+127 inf


What about the number 3.4028235677973366e+38? It is even more amusing. It is below the fence, and converted directly to float, it gives 0x1.fffffep127. However, converted to double, it rounds to 0x1.ffffffp127. Then, when rounding to float, the "even" rule for picking between two equidistant floats results in 0x1.0p128 being picked, and the final float result is therefore +inf.

#include <stdio.h>
int main(void){
  printf("%a %a\n", 3.4028235677973366e+38f, (float) 3.4028235677973366e+38);
}

$ gcc experiment2.c
$ ./a.out
0x1.fffffep+127 inf

This is a case of double rounding, just like in question 5 here.

But really, Polyspace wouldn't be that wrong, would it?

I'm afraid it would. Unlike for the webpage on Astrée's site, I cannot find any plausible explanation for the misunderstanding here. The example is wrong, and the explanations that follow make it worse. It could be the case that Polyspace gives results that cover all possible floating-point rounding modes, just like the value analysis option -all-rounding-modes does. There would be an overflow then (in round-upwards mode). Unfortunately, Polyspace would then have to emit an orange, since in round-to-nearest and in round-downwards there is no overflow. It would have to produce results similar to those below, predicting that either the overflow or the normal termination of the program can happen.

$ bin/toplevel.opt -val ps_ovfl.c -all-rounding-modes -float-hex
...
ps_ovfl.c:5:[kernel] warning: overflow in float (0x1.fffffe091ff3dp127).
                  assert -0x1.fffffe0000000p127f ≤ (float)3.40282347e+38 ∧
                  (float)3.40282347e+38 ≤ 0x1.fffffe0000000p127f;
...
[value] Values at end of function main:
          x ∈ 0x1.fffffe0000000p127
          y ∈ 0x1.fffffe0000000p127

Conclusion

In conclusion, we should be careful about the way we express ourselves: static analysis has been oversold enough already. We would do well to learn the rules we pretend to enforce, and we should avoid picking on other toolmakers, something that is detrimental to everyone and that I do not completely avoid in this post. Still, each example here came from the tool's own website, which I think is a sign that there is still some major overselling going on, despite how detrimental overselling has already been in the past. I would tell you about the examples on PVS-Studio's website, but at least it has different goals and does not use the verb "prove" in the sense "my imperfect implementation based on an idealized formalization computes that…" all the time.

To end on a lighter note, here is a funny picture to meditate on.
