Frama-C news and ideas


Static analysis benchmarks

Reminiscing

The first benchmark Frama-C's value analysis was ever evaluated on was the Verisec suite, described in "A buffer overflow benchmark for software model checkers". My colleague Géraud Canet was doing the work, hence the facetious-colleagues tag of this post. In retrospect, Verisec was a solid piece of work. Being naive when it comes to statistics, I didn't recognize it at the time, but when I later examined other benchmarks, I developed an appreciation for this one. It does a lot of things right.

  • It contains good examples as well as bad. A static analyzer is for separating good programs from bad ones. It should be obvious that you cannot evaluate how well a static analyzer does using only bad programs.
  • The article's title says it's for "Software Model Checkers", but it does a good job of being accessible to a wide range of tools with varying abilities. Some good/bad program pairs are available in "no pointer, only arrays" variants and in "small arrays only" variants, providing more comparison points between different techniques. This may help discover, say, that one technique detects a wrong behavior when the arrays are small but does not scale to larger ones, whereas another does not detect it at all.
  • The entries in the benchmark are moderate-sized snippets of code with unknown inputs. If, by contrast, most or all of the examples had fixed inputs, and assuming that these examples did not escape into an infinite loop, then a detection strategy based on executing the program concretely, step by step, and detecting wrong behaviors on the fly would sail through the benchmark. The benchmark would accurately capture the fact that for such cases, an analyzer can easily be made correct and complete. What would be missing is that users of static analysis tools turn to them to detect wrong behaviors that can occur with some inputs among many, not just to detect that there are wrong behaviors when known (dangerous) inputs are used (a minimal illustration of this distinction follows the list).
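
To make the distinction concrete, here is a minimal sketch (invented for this post, not taken from Verisec; the function name and array size are arbitrary). When the argument of lookup() comes from an unknown input, an analyzer has to reason about all its possible values; when the input is fixed, as in the main() below, simply executing the program step by step is enough to stumble on the wrong behavior.

#include <stdio.h>

#define LEN 8

/* "Bad" variant with an unknown input: whether the access is out of
   bounds depends on n, so an analyzer must consider every possible
   value that n may take at run-time. */
int lookup(int n)
{
  int table[LEN] = { 0, 1, 2, 3, 4, 5, 6, 7 };
  return table[n];   /* out of bounds when n < 0 or n >= LEN */
}

int main(void)
{
  /* With a fixed input such as this one, concrete execution,
     step by step, already exhibits the wrong behavior. */
  printf("%d\n", lookup(9));
  return 0;
}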


Géraud was using the Verisec benchmark right, too. He spent enough time on each case to uncover some interesting facts. One such fact was a definite bug in one of Verisec's "good" cases, where the targeted bug was supposed to be fixed. The explanation was this: there were two bugs in the real-life snippet that had been used as the "bad" case, and only one had hitherto been noticed. The known bug had been fixed in the corresponding "good" case in the benchmark and in real life. The unknown bug had not, as the song of La Palisse might have said.

When a static analysis benchmark is done right, as the Verisec benchmark is, it tries to measure the ability of tools to solve a hard problem: predicting whether something happens at run-time in a non-trivial program with unknown inputs. Furthermore, the reason there is a need for static analyzers, and for a benchmark, is that humans are not good at recognizing issues in code either. So, if the benchmark is representative of the problems a static analyzer should solve, there exists no perfect oracle that can help decide what the correct answers are on each snippet, and the benchmark can be expected to contain a few bugs classified as "good" code.

Is safety static analysis incredibly immature?

In a recent post I compared Frama-C's value analysis to PolySpace and Astrée. It was a single-point comparison in each case, each time using an example chosen by the people who sell the analyzer, an example used on the analyzer's website to demonstrate the analyzer's precision. In that post's comments, Sylvain regretted "that the field of static error-detection for embedded C (as opposed to security-oriented static analysis, Cf. the [NIST SAMATE] test suite) is incredibly immature from a customer point of view".

His conclusion is the opposite of the idea I was trying to express there. I was trying to say that embedded safety static analysis was mature, and that all that remained to be done to convince people was to stop making unrealistic claims and derisive comparisons. Talk about scoring own goals. This blog post is my penitence for not having been clear the first time.


The field of software security is orders of magnitude more complex and disorganized than that of embedded safety. It deals with programs that are larger and use more difficult constructs (e.g. dynamic allocation). The properties that need to be verified are numerous and widely dispersed:

  • There are the undefined behaviors that cause something to happen that was not intended by the programmer. These can breach any of a system's confidentiality, integrity or availability. They are the same undefined behaviors that Polyspace, Astrée or Frama-C's value analysis detect, give or take that a previous version of Astrée did not detect uninitialized variables, or that the Nitrogen version of Frama-C's value analysis (initially intended for embedded code) does not include the modelization of free() I have been working on recently, and therefore does not detect undefined behaviors related to free() (a small example of such a defect appears after this list).
  • There are security issues in code without undefined behaviors, such as the hashing denial of service attack everyone is talking about these days.
  • And, as a last illustration of the variety of security properties, there are timing information leaks, the detection of which I recently discussed.
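
As an illustration of the kind of defect that a modelization of free() makes detectable, consider the following minimal sketch (invented for this post, not taken from any suite). Both the read through the dangling pointer and the second call to free() are undefined behaviors that an analyzer tracking the allocation status of blocks can flag.

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
  int *p = malloc(sizeof *p);
  if (!p)
    return 1;
  *p = 42;
  free(p);
  printf("%d\n", *p);   /* undefined: read through a dangling pointer */
  free(p);              /* undefined: double free */
  return 0;
}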

As a digression, we are not worried about the hashing denial of service attack in Frama-C, but the counter-measures in OCaml 3.13 (which allows hash functions to be seeded) are going to be a bit of a bother for Frama-C developers. He that among you never iterated on a hash-table when displaying the results of a regression test, let him first cast a stone.


The point is that software security people need a suite like NIST's to describe by example what their field is about (more than 45 000 C/C++ cases to date). The field of embedded safety does not have a similar suite because it does not need one. AbsInt, the company that sells Astrée, did not wait for me to (gently) make fun of their website before implementing uninitialized variable detection: they did it of their own volition because they knew Astrée should do it, and they had already released a new version with that feature by the time my colleague Florent Kirchner tried their example and noticed the uninitialized variable.

A non-benchmark

Also, I think it's a slippery slope to call what NIST has a "test suite", because before long someone will call it a "benchmark", and NIST's suite is no benchmark. Many of the examples have definite inputs, which makes them unrepresentative: specifically, it makes them too easy to handle with techniques that do not scale to the billions of possible inputs that must be handled routinely in real life. There do not seem to be nearly enough "good" examples in the suite: the few I looked at were all "bad" examples, and based on this sample, an analyzer that always answered "there is an issue in the target code" would get a very good mark indeed if this were a benchmark.


Worse, in some cases, the "bad" line to find is extremely benign and there are more serious issues in the case. Consider entry 1737, one of the few I looked at:

...
void test( char * str )
{
  char * buf;
  char * newbuf;
  char * oldbufaddress;

  buf = malloc( MAXSIZE );
  if ( !buf )
    return;
  strncpy( buf, str, MAXSIZE );
  buf[MAXSIZE - 1] = '\0';
  printf( "original buffer content is : %s\n", buf );
  oldbufaddress = buf;
  buf = realloc( buf, 1024 ); /*BAD */
  printf( "realloced buffer content is : %s\n", buf );
  printf( "original buffer address content is : %s\n", oldbufaddress );
  free( buf );
}

int main( int argc, char * * argv )
{
  char * userstr;

  if ( argc > 1 )
  {
    userstr = argv[1];
    test( userstr );
  }
...

With proper modelizations for printf(), free(), and realloc(), such as the ones I am currently working on, Frama-C's value analysis would warn about both printf() calls. The warning for the first call is because realloc() can fail and return NULL (passing NULL to printf() invokes undefined behavior). In this case, there famously also exists a memory leak problem, as the initial block is not freed, but a memory leak is not an undefined behavior. The warning for the second call is because oldbufaddress becomes indeterminate if realloc() moves the block.
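
For comparison, here is a sketch of how the reallocation in case 1737 could be written to avoid the two undefined behaviors that the value analysis flags. It keeps the structure of the original test() but checks realloc()'s result and stops using the old address; the value of MAXSIZE is a placeholder, since the real definition is in the elided preamble of the case. Note that this sketch does nothing about the heap-inspection concern the case is actually about.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXSIZE 40   /* placeholder: the real value is in the elided preamble of case 1737 */

void test( char * str )
{
  char * buf;
  char * newbuf;

  buf = malloc( MAXSIZE );
  if ( !buf )
    return;
  strncpy( buf, str, MAXSIZE );
  buf[MAXSIZE - 1] = '\0';
  printf( "original buffer content is : %s\n", buf );
  newbuf = realloc( buf, 1024 );
  if ( !newbuf )            /* realloc() may fail and return NULL */
  {
    free( buf );            /* the original block is then still allocated */
    return;
  }
  buf = newbuf;             /* the old address must not be kept or dereferenced */
  printf( "realloced buffer content is : %s\n", buf );
  free( buf );
}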


The above is not what the author of the test case wants to hear. What the author of the test case wants to hear, clarified by the CWE-244 reference, is that the realloc() call is dangerous because if realloc() moves the block, it may leave the confidential information that was contained in the block lying on the heap. This is fine, if the case is intended as a starting point for discussion (although it would be better without an undefined behavior unrelated to the specific issue being tested).

One problem is that there is no indication in the program that the block contains anything secret. A static analyzer that warns for this program should warn for any use of realloc(), since this is how realloc() is intended to be used (it is intended to preserve the contents of the block being resized). An optimal analyzer for this issue is grep realloc *.c.

It does not hurt to zero the secret data before you free() a block, and to avoid realloc(), in a "belt and suspenders" approach to security, but if you use Frama-C's value analysis properly to verify that the program does not have undefined behaviors, you do not have to do this, and the code will still be safe from heap inspection vulnerabilities. Perhaps using the value analysis "properly" is inapplicable in your case; but you should at least be aware of the design choices here. Case 1737, by itself, just seems to pick one specific circumvention method for a general issue, and then confuses the resolution of the general issue with the application of that specific method. The map is not the territory.
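
For readers who do want the belt-and-suspenders approach, here is a minimal sketch (invented for this post; the function name is made up). It grows a buffer holding sensitive data without realloc() and scrubs the old block before releasing it. Be aware that a compiler may optimize away a plain memset() on a block that is about to be freed, so a real implementation would use memset_s() or a platform-specific equivalent.

#include <stdlib.h>
#include <string.h>

/* Grow a buffer that holds sensitive data from oldsize to newsize
   bytes without using realloc(): allocate, copy, scrub, then free. */
char * grow_secret_buffer( char * oldbuf, size_t oldsize, size_t newsize )
{
  char * newbuf = malloc( newsize );
  if ( !newbuf )
    return NULL;                  /* caller still owns (and should scrub) oldbuf */
  memcpy( newbuf, oldbuf, oldsize < newsize ? oldsize : newsize );
  memset( oldbuf, 0, oldsize );   /* best effort: may be elided by the compiler, see above */
  free( oldbuf );
  return newbuf;
}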


Were this case used as a benchmark, it would very much remind me of primary school tests written by teachers without any deep understanding of the matter they were teaching, where you had to guess what was in the teacher's mind at each unclear question. Here, for instance, the value analysis warns that the program may use a dangling pointer in the second printf(), which is intended to symbolize the information leak. If the program instead called malloc() and tried to use the contents of the new block without having initialized it, then the value analysis would warn about that. Generally speaking, every way I know of by which the contents of the initial block can be read from the heap is flagged at some point as an undefined behavior, but this is not what case 1737 tests, because case 1737 tests "analyzer warns about realloc()", and the correct answer is to warn about realloc().

Conclusion

Coming back to the question of which field is immature from a customer point of view: if your interest is in software security, NIST's suite certainly can be used as a basis to understand what kind of static analyzer you may be after. You can try candidate static analyzers on cases that you have selected as most representative of your own problems. But it won't save you from reading each candidate's documentation, understanding how each of them works, and carefully examining their results. Any mechanical interpretation of the results will lead you to pick grep realloc without even understanding case 1737 and its limitations.


The issue for comparing embedded safety static analyzers is that the design space is too large, and the design space for security static analysis certainly isn't any smaller. And if someone ever boasts that their security-oriented software analyzer covers 95% of the NIST SAMATE "benchmark", they are probably selling snake oil. Just saying.

Comments

1. On Wednesday, January 11 2012, 18:56 by John Regehr

I'd compare static analysis with automobile technology in the early part of the 20th century, maybe the 1920s. Lots of products are available and they do in fact work, but there's a very long way to go before they're reliable and efficient.

2. On Wednesday, January 11 2012, 18:58 by Chucky

Hey Pascal,

Although I'm very much not a static tool person, I have a few small points about NIST's Juliet test suite:

The suite contains good and bad code for every test. The files are organized with OMITGOOD and OMITBAD #ifndef sections that can be used to compile an only good, only bad, or mixed test case.
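
To picture that layout, a test file looks roughly like this (a schematic sketch, not an actual Juliet case; the function names and the flaw are invented):

#include <stdio.h>

#ifndef OMITBAD
/* Flawed variant: compiled unless OMITBAD is defined. */
static void example_bad(void)
{
  char buf[8];
  printf("%s\n", buf);        /* flaw: buf is used uninitialized */
}
#endif

#ifndef OMITGOOD
/* Fixed variant: compiled unless OMITGOOD is defined. */
static void example_good(void)
{
  char buf[8] = "ok";
  printf("%s\n", buf);
}
#endif

int main(void)
{
#ifndef OMITBAD
  example_bad();
#endif
#ifndef OMITGOOD
  example_good();
#endif
  return 0;
}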

Also, the majority of the tests are machine generated from a core "problem". Given a specific problem, there are something like 50 or so variations. Each variation uses a different kind of flow. My understanding is that tools can use this to figure out exactly what kind of paths they catch or miss. At least one of the variations includes "random" behavior to simulate inputs (variation 12). This variation is included for most of the tests. It's probably the one you're most interested in if you want to thwart dynamic tools or make sure your static tool is not missing bugs.

3. On Thursday, January 12 2012, 11:09 by pascal

@Chucky

You are right, downloading the Juliet suite from http://samate.nist.gov/SRD/testCases/suites/Juliet-2010-12.c.cpp.zip is much better than browsing at http://samate.nist.gov/SRD/view.php . I thought the two were equivalent because the numbers 45309 and 14184 from the top of http://samate.nist.gov/SRD/testsuite.php?login=Guest more or less added up to the number of the last test case on http://samate.nist.gov/SRD/view.php , but they are not the same.

4. On Thursday, January 12 2012, 12:08 by Sylvain

"if you use Frama-C's value analysis properly to verify that the program does not have undefined behaviors, you do not have to do this".

This is alas not true in general, but only as long as _all_ possibly executed code is free of uninitialized access. That condition may not be guaranteed: consider "exec(...)" on a compromised program (a case hinted at in the CWE description). When your program needs system calls that may execute possibly untrusted code (which covers a lot of useful programs), this CWE makes sense, value analysis or not.

I believe that in this case the primary school teacher does indeed have understanding of the matter...

5. On Thursday, January 12 2012, 14:49 by pascal

@Sylvain

Using the value analysis “properly”, the word used in the body of this post, includes making sure that you correctly modelize all system calls and generally calls to functions that are not provided as source code. I hope that section 2.3.3 and chapter 8 in the manual make it clear that it is the user's responsibility, even if the unavailable functions are standardized as part of C99 or POSIX, to make sure that they are modelized correctly (this may include consulting with Frama-C's authors). If your target program uses exec() to launch a program not written in C, or not itself verified not to read from uninitialized memory, or that you cannot guarantee to be the intended program, then the “Perhaps using the value analysis "properly" is inapplicable in your case” from the body of the article applies. That's fine. No-one ever said that Frama-C solved every software engineering problem, and if someone else claims their analyzer does, they are probably selling more snake oil.
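
To give an idea of what such a modelization can look like, here is a minimal sketch of a stub for a hypothetical library function read_sensor() whose source is not available to the analysis. It uses the Frama_C_interval() builtin distributed with the value analysis to state that the result may be any value in a range; the function name and the range are invented for the example, and a real stub must be checked against the manual to make sure it over-approximates what the actual function can do.

/* Prototype of a builtin provided with the value analysis (see the
   manual for the exact header/file to use in your version). */
int Frama_C_interval(int min, int max);

/* Stub standing in for a library function whose code is unavailable.
   It tells the analysis that the result may be any value in 0..1023,
   a range that must over-approximate what the real function can return. */
int read_sensor(void)
{
  return Frama_C_interval(0, 1023);
}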

This does not change anything about my arguments:

- the test case 1737 does not state in any way that the contents of the block being reallocated are secret, implying that a static analyzer that handles this case correctly is exactly a static analyzer that warns for every use of realloc().

- the test case 1737 contains an undefined behavior when realloc() returns NULL. This will complicate the interpretation of any analyzer's results on it. Considering the number of test cases in the SRD, if the SRD is used as a benchmark, very little time will be spent checking each analyzer's results, and this will cause some analyzers' answers to be mis-categorized. As I said in the first section, the value analysis finds a bug in a case from the Verisec benchmark, and this could easily be chalked up as an imprecision on its part (diminishing its "precision" score) since the case is in the "good" category; the benchmark's authors, as well as everyone who ever used it to test their analyzer, consider this case "good". The only problem is that there is a real bug in it. This case should instead be used to diminish the "recall" score of every analyzer that remains silent on it.

To reiterate, a proper benchmark will inevitably contain a few mis-categorized cases. This is unavoidable if the benchmark is measuring the right thing, because the actual objective being measured is finding bugs in real software with unknown inputs, an arbitrarily hard problem for which there exists no oracle. The least benchmark authors can do is make it easy to interpret results, so that analyzer authors can debug the benchmark. This means:

1- having a notation for secrets if appropriate (otherwise "grep realloc" will get the best score, and if "grep realloc" really was that useful, everyone would be using it. Heck, if it was that useful, OpenBSD would warn when a program uses realloc(), the same way it currently warns if a program uses strcpy()). A sketch of what such a notation might look like appears after point 2-.

2- in each test case, avoiding undefined behaviors that are unrelated to the flaw whose detection is being tested
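
Regarding point 1-, here is a sketch of what a notation for secrets might look like in a case such as 1737. The annotation is entirely hypothetical (no existing tool is claimed to understand it); the point is only that without some such marker, nothing distinguishes this buffer from any other heap-allocated buffer.

#include <stdlib.h>
#include <string.h>

/* Hypothetical annotation: expands to nothing and only serves to tell
   readers and tools that the annotated buffer holds confidential data. */
#define BENCHMARK_SECRET

void handle_password( const char * password, size_t len )
{
  BENCHMARK_SECRET char * buf = malloc( len + 1 );
  if ( !buf )
    return;
  memcpy( buf, password, len );
  buf[len] = '\0';
  /* ... use the password ... */
  memset( buf, 0, len + 1 );   /* scrub before releasing, since buf is marked secret */
  free( buf );
}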