The first benchmark Frama-C's value analysis was ever evaluated on was the Verisec suite, described in "A buffer overflow benchmark for software model checkers". My colleague Géraud Canet was doing the work, thence the facetious-colleagues tag of this post. In retrospect, Verisec was a solid piece of work. Being naive when it comes to statistics, I didn't recognize it at the time, but when I later examined other benchmarks, I developed an appreciation for this one. It does a lot of things right.
- It contains good examples as well as bad. A static analyzer is for separating good programs from bad ones. It should be obvious that you cannot evaluate how well a static analyzer does using only bad programs.
- The article's title says it's for "Software Model Checkers" but it does a good job of being accessible to a wide range of tools with varying abilities. Some good/bad program pairs are available in "no pointer, only arrays" variants and in "small arrays only" variants, providing more comparison points between different techniques. This may help discover that some technique detects a wrong behavior with small arrays but does not scale to larger ones, whereas another one does not detect them at all.
- The entries in the benchmark are moderate-sized snippets of code with unknown inputs. If, by contrast, in most or all of the examples, the inputs were fixed, and assuming that these examples did not escape in an infinite loop, then a detection strategy based on the concrete execution of the program step by step and detection of wrong behaviors on the fly would sail through the benchmark. The benchmark would accurately capture the fact that for such cases, an analyzer can easily be made correct and complete. What would be missing would be that users of static analysis tools turn to them to detect wrong behaviors that can occur with some inputs among many, and not just to detect that there are some wrong behaviors when known (dangerous) inputs are used.
Géraud was using the Verisec benchmark right, too. He spent enough time on each case to uncover some interesting facts. One uncovered interesting fact was a definite bug in one of Verisec's "good" cases, where the targeted bug was supposed to be fixed. The explanation was this: there were two bugs in the real-life snippet that had been used as "bad" case, and only one had hitherto been noticed. The known bug had been fixed in the corresponding "good" case in the benchmark and in real life. The unknown bug had not, as the song of La Palisse might have said.
When a static analysis benchmark is done right, as the Verisec benchmark is, it tries to measure the ability of tools to solve a hard problem: predicting whether something happens at run-time in a non-trivial program with unknown inputs. Furthermore, the reason there is a need for static analyzers and for a benchmark is that humans are not good either at recognizing issues in code. So, if the benchmark is representative of the problems a static analyzer should solve, there exists no perfect oracle that can help decide what the correct answers are on each snippet, and it can be expected to contain a few bugs classified as "good" code.
Is safety static analysis incredibly immature?
In a recent post I compared Frama-C's value analysis to PolySpace and Astrée. It was a single-point comparison in each case, each time using an example chosen by the people who sell the analyzer, an example used on the analyzer's website to demonstrate the analyzer's precision. In that post's comments, Sylvain regretted "that the field of static error-detection for embedded C (as opposed to security-oriented static analysis, Cf. the [NIST SAMATE] test suite) is incredibly immature from a customer point of view".
His conclusion is at the opposite of the idea I was trying to express there. I was trying to say that embedded safety static analysis was mature and that all that remained to do to convince was to stop making unrealistic claims and derisive comparisons. Talk about scoring own goals. This blog post is my penitence for not having been clear the first time.
The field of software security is orders of magnitude more complex and disorganized than that of embedded safety. It deals with programs that are larger and use more difficult constructs (e.g. dynamic allocation). The properties that need to be verified are numerous and widely dispersed:
- There are the undefined behaviors that cause something to happen that was not intended by the programmer. These can breach any of a system's confidentiality, integrity or availability. They are the same undefined behaviors that Polyspace, Astrée or Frama-C's value analysis detect, give or take that a previous version of Astrée did not detect uninitialized variables or that the Nitrogen version of Frama-C's value analysis (initially intended for embedded code) does not include the modelization of
free() I have been working on recently — and therefore does not detect undefined behaviors related to
- There are security issues in code without undefined behaviors, such as the hashing denial of service attack everyone is talking about these days.
- And as a last illustration of the variety of security properties, timing information leaks, the detection of which I recently discussed, belong to security considerations.
As a digression, we are not worried about the hashing denial of service attack in Frama-C, but the counter-measures in OCaml 3.13 (that allows to seed hash functions) are going to be a bit of a bother for Frama-C developers. He that among you never iterated on a hash-table when displaying the results of a regression test, let him first cast a stone.
The point is that software security people need a suite like NIST's to describe by example what their field is about (more than 45 000 C/C++ cases to date). The field of embedded safety does not have a similar suite because it does not need one. AbsInt, the company that sells Astrée, did not wait for me to (gently) make fun of their website before they implemented uninitialized variable detection: they did it of their own volition because they knew Astrée should do it, and they had already released a new version with that feature at the time my colleague Florent Kirchner tried their example and noticed the uninitialized variable.
Also, I think it's a slippery slope to call what NIST has a "test suite", because before long someone will call it a "benchmark", and NIST's suite is no benchmark. Many of the examples have definite inputs, making them unrepresentative — specifically, making them too easy to handle with techniques that do not scale to the billions of possible inputs that must be handled routinely in real life. There does not seem to be nearly enough "good" examples in the suite: the few I looked at were all "bad" examples, and based on this sample, an analyzer that always answered "there is an issue in the target code" would get a very good mark indeed if this was a benchmark.
Worse, in some cases, the "bad" line to find is extremely benign and there are more serious issues in the case. Consider entry 1737, one of the few I looked at:
void test( char * str )
char * buf;
char * newbuf;
char * oldbufaddress;
buf = malloc( MAXSIZE );
if ( !buf )
strncpy( buf, str, MAXSIZE );
buf[MAXSIZE - 1] = '\0';
printf( "original buffer content is : %s\n", buf );
oldbufaddress = buf;
buf = realloc( buf, 1024 ); /*BAD */
printf( "realloced buffer content is : %s\n", buf );
printf( "original buffer address content is : %s\n", oldbufaddress );
free( buf );
int main( int argc, char * * argv )
char * userstr;
if ( argc > 1 )
userstr = argv;
test( userstr );
With the proper modelization for
realloc(), such as I am currently working on, Frama-C's value analysis would warn about both
printf() calls. The warning for the first call is because
realloc() can fail and return
printf() invokes undefined behavior). In this case, there famously also exists a memory leak problem, as the initial block is not freed, but a memory leak is not an undefined behavior. The warning for the second call is because
oldbufaddress becomes indeterminate if
realloc() moves the block.
The above is not what the author of the test case wants to hear. What the author of the test case wants to hear, clarified by the CWE-244 reference, is that the
realloc() call is dangerous because if
realloc() moves the block, it may leave the confidential information that was contained in the block lying on the heap. This is fine, if the case is intended as a starting point for discussion (although it would be better without an undefined behavior unrelated to the specific issue being tested).
One problem is that there is no indication in the program that the block contains anything secret. A static analyzer that warns for this program should warn for any use of
realloc(), since this is how
realloc() is intended to be used (it is intended to preserve the contents of the block being resized). An optimal analyzer for this issue is
grep realloc *.c.
It does not hurt to zero the secret data before you
free() a block, and avoid
realloc(), in a "belt and suspenders" approach to security, but if you use Frama-C's value analysis properly to verify that the program does not have undefined behaviors, you do not have to do this, and the code will still be safe from heap inspection vulnerabilities. Perhaps using the value analysis "properly" is inapplicable in your case; but you should at least be aware of the design choices here. Case 1737, by itself, just seems to pick one specific circumvention method for a general issue, and then confuses the resolution of the general issue with the application the specific method. The map is not the territory.
Were this case used as a benchmark, it would very much remind me of primary school tests written by teachers without any deep understanding of the matter they were teaching, and where you had to guess what was in these teachers' minds at each of their unclear questions.
Here, for instance, the value analysis warns that the program may use a dangling pointer in the second
printf() — intended to symbolize the information leak. If the program instead called
malloc() and tried to use the contents of the new block without having initialized it, then the value analysis would warn about that. Generally speaking, any way that I know of the contents of the initial block can be read from the heap is flagged at some point as an undefined behavior, but this is not what case 1737 tests, because case 1737 tests "analyzer warns about
realloc()", and the correct answer is to warn about
Coming back to the question of what field is immature from a customer point of view, if your interest is in software security, NIST's suite certainly can be used as a basis to understand what kind of static analyzer you may be after. You can try candidate static analyzers on cases that you have selected as most representative of your own problems. But it won't save you from reading each candidate's documentation, understanding how each of them work, and carefully examining their results. Any mechanical interpretation of the results will lead you to pick
grep realloc without even understanding case 1737 and its limitations.
The issue for comparing embedded safety static analyzers is that the design space is too large, and the design space for security static analysis certainly isn't any smaller. And if someone ever boasts that their security-oriented software analyzer covers 95% of the NIST SAMATE "benchmark", they are probably selling snake oil. Just saying.