Frama-C news and ideas


Tag - benchmarks


Thursday, November 15 2018

Frama-C/Eva in SATE VI with Juliet 1.3

Frama-C chose to participate in the SATE VI Ockham Sound Analysis Criteria. With the upcoming Frama-C 18 (Argon) release, results can be reproduced using the open source version of the Eva plug-in. This post briefly presents SATE, the Juliet test suite, and the results obtained with Frama-C 18. We then describe the Frama-C SATE VI repository on GitHub, created to allow anyone to reproduce the results obtained with Frama-C/Eva. We also explain the patches that were applied to the test suite and our choice of parametrization options for Eva.

SATE VI and Juliet 1.3

SATE (Static Analysis Tool Exposition), organized by NIST, is a comparative study of static analysis tools: toolmakers volunteer to participate by running their tools on the sets provided by NIST, and then by submitting their analysis scripts and results. The NIST SAMATE team then compiles a dense report, comparing and evaluating all participating tools.

For the analysis of C programs, there are two tracks: one in which all kinds of tools participate, called the Classic track, and another dedicated to sound analyzers, in the sense of the Ockham Sound Analysis Criteria.

For the Ockham track in SATE VI, we chose the Juliet 1.3 test suite for C/C++, which contains over 64,000 test cases. Some are written in C++, but most are written in C. Test suites are organized in directories related to CWEs (Common Weakness Enumerations): each directory has code samples containing violations of the CWE. For each test case, there are two variants: one with good code (which does not trigger the CWE), and another with bad code (which triggers the CWE in question). In our tests using Juliet, Frama-C is supposed to find all occurrences of bad code; it may, however, report some good code as bad.
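As a hypothetical illustration of this structure (names and weakness are made up, not taken from Juliet), a test case pairs the two variants along these lines:

/* Hypothetical sketch in the style of a Juliet buffer-overflow test;
   not an actual file from the suite. */
static void bad(void)
{
    int buf[10];
    buf[10] = 1;   /* bad: out-of-bounds write triggers the weakness */
}

static void good(void)
{
    int buf[10];
    buf[9] = 1;    /* good: last valid index, no weakness triggered */
}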

CWEs handled by Frama-C/Eva

For the analysis with Frama-C/Eva, we chose the subset of CWEs in Juliet 1.3 for C/C++ which are handled by the analyzer. The full list follows.

CWEs in Juliet 1.3 handled by Frama-C/Eva (18 - Argon)
  • CWE121 - Stack Based Buffer Overflow
  • CWE122 - Heap Based Buffer Overflow
  • CWE123 - Write What Where Condition
  • CWE124 - Buffer Underwrite
  • CWE126 - Buffer Overread
  • CWE127 - Buffer Underread
  • CWE190 - Integer Overflow
  • CWE191 - Integer Underflow
  • CWE194 - Unexpected Sign Extension
  • CWE195 - Signed to Unsigned Conversion Error
  • CWE196 - Unsigned to Signed Conversion Error
  • CWE197 - Numeric Truncation Error
  • CWE369 - Divide by Zero
  • CWE415 - Double Free
  • CWE416 - Use After Free
  • CWE457 - Use of Uninitialized Variable
  • CWE464 - Addition of Data Structure Sentinel
  • CWE469 - Use of Pointer Subtraction to Determine Size
  • CWE475 - Undefined Behavior for Input to API
  • CWE562 - Return of Stack Variable Address
  • CWE587 - Assignment of Fixed Address to Pointer
  • CWE588 - Attempt to Access Child of Non Structure Pointer
  • CWE590 - Free Memory Not on Heap
  • CWE617 - Reachable Assertion
  • CWE665 - Improper Initialization
  • CWE690 - NULL Deref From Return
  • CWE758 - Undefined Behavior
  • CWE835 - Infinite Loop
  • CWE843 - Type Confusion

Note that some CWEs are very broad, such as CWE758 - Undefined Behavior. We do not claim Frama-C/Eva handles all cases which might be categorized under such CWEs; however, those considered in the Juliet 1.3 test suite have all their bad code samples correctly labeled.

Some examples of CWEs which are not considered by Frama-C/Eva include CWE426 - Untrusted Search Path, CWE90 - LDAP Injection and CWE780 - Use of RSA Algorithm Without OAEP. All three of them constitute functional-level weaknesses which do not cause runtime errors.

Some of the CWEs not currently considered by Frama-C/Eva could be handled via specific function annotations, but this would constitute a very ad hoc solution. Other plugins in the Frama-C platform might be better suited for these CWEs; e.g., using the WP plugin with function annotations, or the Secure Flow plugin with trusted and untrusted sinks and sources, might allow identifying the cases with API violations leading to weaknesses.

Results

The following table summarizes the results obtained with Frama-C 18 (Argon). The script that produces this table is presented in a later section.

total imprecise      :    751
total non-terminating:      3
total ok             :  44668
total timeout        :      0
total unsound        :      0
precision (excluding non-terminating): 0.983465

Each test case is labeled as one of the following cases:

  • ok: correctly labeled, i.e. either a good test without alarms, or a bad test with alarms;
  • imprecise: a good test with a false alarm;
  • unsound: a bad test without an alarm;
  • non-terminating: the program does not terminate; this should never happen, except in the case of CWE835 - Infinite Loop;
  • timeout: a test which did not finish within the allotted time (by default, 30 seconds). This never happens in our benchmarking rig, but slower or overloaded machines may see this happen. If so, then a larger timeout should be given and the test should be re-run.

The precision (98.3%) displayed in the last line is the number of ok tests divided by the total number of tests (ok + imprecise + unsound).
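Plugging in the numbers from the table above (the 3 non-terminating tests are excluded from the total):

44668 / (44668 + 751 + 0) = 44668 / 45419 ≈ 0.983465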

The improvements between Frama-C 17 (Chlorine) and 18 (Argon) are noteworthy, especially the fact that the memcpy and memset builtins are now released in the open source version. Other improvements in precision also contributed to removing some spurious warnings and to simplifying the analysis script.

The remaining imprecisions arise from a few distinct situations, most of which occur several times because the same patterns are reused across test cases. One source of imprecision is the snprintf function, whose specification is not precise enough to ensure that the output string is null-terminated; this leads to false alarms when that string is used later in the code. Another case is related to CWE369 - Divide by Zero, in which some tests use socket-related functions whose specifications do not allow for precise results in every case.

The Frama-C SATE VI Repository

The Frama-C SATE VI repository on GitHub allows anyone to run Frama-C on the Juliet 1.3 test suite for C/C++, to reproduce the results or to try different approaches, e.g. parametrizing the analysis differently, or trying a new abstract domain.

Software requirements are: Frama-C 18.0 (Argon), GNU Make (>= 4.0), the GNU parallel tool, Bash (>= 4.0) and wget (to download the Juliet archive from NIST's website).

The repository is structured in the following way:

  • A GNUmakefile allows bootstrapping the repository, by downloading the Juliet 1.3 for C/C++ archive from NIST's website and then decompressing it (Juliet 1.3 is not distributed in our repository). It also applies some patches to Juliet's sources, described in a later section;
  • the fc directory contains Frama-C scripts used to run the analysis and automatically evaluate its results. The main scripts are:
    • analyze.sh: sets the default analysis parameters; these differ from Frama-C's defaults in several ways, being stricter in some cases but less strict in others;
    • evaluate_case.sh: performs automatic evaluation of good and bad cases into the categories presented before;
  • the fc/patches directory contains some patches applied to Juliet 1.3 to fix known issues in a few tests. It also changes the way non-deterministic random numbers are used in some tests (details are presented later);
  • the C directory contains the extracted Juliet 1.3 archive, with its scripts and test cases;
  • finally, the oracles directory contains a versioned copy of the oracles produced by the analysis with Frama-C/Eva; this allows comparing different parametrizations, and following the evolution of the platform.

Reproducing the results

The overall process to use the repository is the following:

git clone https://github.com/Frama-C/SATE-VI.git
cd SATE-VI
make
cd C/testcases
make

The last make step takes a very long time (2+ hours, depending on your CPU and number of cores).

Note that, if Frama-C is not installed in the PATH, you can use the fc/frama-c-path.sh script to specify the directory where Frama-C is installed. This is especially useful to compare different versions of Frama-C.

After running make inside the C/testcases directory, a small summary with the number of tests performed in each subdirectory will be presented in textual form:

CWE121_Stack_Based_Buffer_Overflow/s01: 876 tests
    876  ok
CWE121_Stack_Based_Buffer_Overflow/s02: 1020 tests
   1020  ok
CWE121_Stack_Based_Buffer_Overflow/s03: 1020 tests
     34  imprecise
    986  ok
...

Running ./summarize_results.sh will also output this result. The stat_results.py script will produce an overall summary across all subdirectories, as presented in the Results section.

Re-running individual tests

Inside each directory in C/testcases/, it is possible to run a single test by taking the file name, removing its extension, and adding _good.res or _bad.res: e.g. if there is a file file.c inside directory CWE000/s01, then doing cd CWE000/s01 and make file_good.res will run the test and produce three files:

  • file_good: output of the analysis;
  • file_good.csv: CSV report of warnings produced during the analysis;
  • file_good.res: result of evaluate_case.sh applied to the test case report;

Note that for tests consisting of multiple files (each ending with a, b, c, etc.), the test case name is their prefix without the letter, e.g. for the test comprised of files file_42a.c and file_42b.c, the command is make file_42_good.res (or bad).

Finally, by replacing .res with .gui in the make command, it is possible to open the Frama-C GUI on the test case, to inspect it.

Patches applied to Juliet 1.3

In the Frama-C SATE VI repository, there are two sets of patches applied to Juliet 1.3 for C/C++:

fix-tests.patch

These are fixes in a small number of tests, due to two patterns that were found to lead to undefined behavior.

The first fix consists in replacing an overflow check which accidentally introduces a signed overflow:

 /* FIX: Add a check to prevent an overflow from occurring */
-if (abs((long)data) <= (long)sqrt((double)CHAR_MAX))
+if (data <= LONG_MAX && labs(data) <= (long)sqrt((double)CHAR_MAX))
 {

data has type unsigned int. In the original code, the call abs((long)data) is equivalent to abs((int)(long)data), since abs takes an int argument. If int and long have the same width (e.g. on 32-bit x86), then data can be larger than LONG_MAX (and also INT_MAX, since they coincide), leading to a signed overflow in the call, and thus to undefined behavior. The patched version first checks that data is small enough, and then uses labs to avoid downcasting data.
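A standalone sketch of the pattern, assuming an ILP32 target where int and long are both 32 bits (the value is illustrative):

#include <stdlib.h>

int main(void)
{
    unsigned int data = 2147483648u; /* INT_MAX + 1; larger than LONG_MAX on ILP32 */
    /* On such a target, the conversion (long)data yields an
       implementation-defined value, typically INT_MIN, and
       abs(INT_MIN) is undefined behavior. */
    long r = abs((long)data);
    return (int)(r % 128);
}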

The second fix concerns a set of files which had a local variable declared in a scope smaller than intended:

 static void goodG2B()
 {
     void * data;
+    int intBuffer = 8;
     /* Initialize data */
     data = NULL;
     {
     /* FIX: Point data to an int */
-        int intBuffer = 8;
         data = &intBuffer;
     }
     CWE843_Type_Confusion__short_68_goodG2BData = data;

In this set of test cases, intBuffer (and similar variables) was declared inside a block, and the value was later used outside of the scope of the variable, leading to undefined behavior. The fix consists in declaring the variable in the main scope of the function.

improve-testcasesupport.patch

This concerns mainly the RAND32 macro, used in Juliet to produce non-deterministic numbers. This macro is used in some test cases to assign values to variables of different types. The result is downcast to char or short whenever necessary. However, this downcast leads to warnings due to usage of option -warn-signed-downcast. This option is enabled in the analyze.sh script since some CWEs are related to it.

The patches applied to the test cases replace calls to (char)RAND32() and (short)RAND32() with RAND8() and RAND16(), two new macros defined much like the existing RAND32() and RAND64() macros, but in a way that prevents the overflows that trigger downcast warnings. To preserve existing behavior, these macros expand to RAND32() when not running Frama-C.
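The exact definitions are in the patch; below is a plausible sketch of the idea, with the masking strategy and the preprocessor guard as assumptions:

/* Hypothetical sketch of the patched macros; see
   improve-testcasesupport.patch for the real definitions. */
#ifdef __FRAMAC__
#  define RAND8()  ((char)(RAND32() & 0x7f))    /* always fits in a char */
#  define RAND16() ((short)(RAND32() & 0x7fff)) /* always fits in a short */
#else
#  define RAND8()  RAND32()  /* preserve existing behavior outside Frama-C */
#  define RAND16() RAND32()
#endif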

Finally, minor patches were applied to pretty-printing functions in io.c, such as:

void printHexCharLine (char charHex)
 {
-    printf("%02x\n", charHex);
+    printf("%02hhx\n", charHex);
 }

The printf length modifier for a variable of type char is hh, although it is very rarely used in practice. Still, the Variadic plug-in warns about the mismatch, so fixing the test case support file prevents spurious warnings.

Note that all of these patches are intended to improve the precision of the analysis, but do not affect its correctness: without them, Frama-C/Eva produces several spurious warnings (and, in a few cases, such warnings lead to non-termination, when the only possible behavior is undefined), but the results remain sound; bad cases are still reported as such.

Rationale for analysis options

The analyze.sh script contains a series of non-default options used in the analysis. We present here the rationale for choosing these options during the analysis of Juliet 1.3. Similar options are grouped together for presentation purposes.

-no-autoload-plugins -load-module from,inout,report,scope,eva,variadic

These options are present in most of Frama-C's test suites. The -no-autoload-plugins option disables the automatic loading of all plug-ins, which speeds up the initialization of Frama-C. We then load only the plug-ins needed for Juliet: eva and its derived plug-ins (from, inout, scope), report for the CSV report, and variadic to handle the variadic functions present in almost every test (e.g. printf). Overall, this is an optimization for running thousands of tests; when running only a few dozen tests, the difference is negligible.

-kernel-warn-key parser:decimal-float=inactive \
-kernel-warn-key typing:no-proto=inactive \
-kernel-warn-key typing:implicit-conv-void-ptr=inactive \
-eva-warn-key locals-escaping=inactive

These options disable warnings which are not directly related to runtime errors, but are enabled by default because they often indicate code quality issues in real code bases. The first three warnings could be avoided by patching the code (e.g. adding prototypes, adding explicit casts, and avoiding inexact floating-point constants). The last one is related to addresses of variables remaining accessible outside their scope (in Eva, their value becomes ESCAPINGADDR); it is emitted because it is often easier to identify the moment when the variable goes out of scope than the later time at which it may be used. This is not undefined behavior unless the variables are actually used, which does not happen in the good test cases. In any case, when the values are accessed (as in the bad test cases), a different warning is emitted.

-eva-msg-key=-initial-state \
-eva-no-show-progress \
-eva-print-callstacks \

These are all options related to minimizing the default verbosity of the analysis, except for the last, which adds callstacks to the output: when there are warnings, it is better to know where they come from.

-slevel 300 \

Some amount of slevel is useful to improve the precision of the analysis. The chosen value is arbitrary, and varies according to the characteristics of the code under analysis. In the case of Juliet 1.3, this value is a good compromise between precision and speed of the analysis.

-eva-builtin alloca:Frama_C_vla_alloc_by_stack \

The alloca function (which is not in the C standard) is used in several test cases. In all of them, it behaves just like the allocation of a VLA (variable length array), that is, the variables are not used outside of the scope in which alloca was called [1]. It is therefore safe to associate the vla_alloc builtin to calls to the alloca function in Juliet 1.3.

[1] The main difference between alloca and the allocation of a VLA is that the latter is deallocated at the end of the scope in which it was created, while alloca memory blocks are deallocated at the end of the function in which they are called.
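For illustration, the pattern in question looks roughly like the following sketch (not an actual Juliet file); here the two allocation forms are interchangeable:

#include <alloca.h>  /* alloca is non-standard; this header is common on POSIX systems */

void f(int n)
{
    char *buf = alloca(n);  /* deallocated when f returns */
    /* char vla[n];            a VLA here would be deallocated at the end
                               of the enclosing scope, which in this
                               pattern is the same point, since the scope
                               is the whole function */
    if (n > 0)
        buf[0] = 0;
}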

-warn-special-float none \

By default, Frama-C considers non-finite floating-point values as forbidden, emitting warnings when they may arise, even though this does not constitute an undefined behavior. This option allows non-finite floating-point values, thus avoiding spurious warnings.

-warn-signed-downcast \
-warn-unsigned-overflow \

Signed downcasts (possible loss of information when converting a larger type to a smaller one) and unsigned overflows, by default, do not constitute undefined behavior; however, it is a good practice to avoid them, and some specific CWEs do require treating them as undesired. Another option that could have been activated is -eva-symbolic-locations-domain, but in Juliet 1.3 it does not make any difference, neither in the oracles nor in the overall analysis time.

-eva-warn-copy-indeterminate=-@all \

By default, copying a possibly indeterminate value (e.g. an uninitialized variable) triggers a warning in Eva, even if the value is copied in a standards-compliant way (e.g. via unsigned char pointers). This helps detect problems where the copy happens, rather than where the value is later used. However, the use of such values already triggers warnings anyway, so we can afford to be less strict here, avoiding spurious warnings in good test cases where the copied values are never used.
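A sketch of the kind of good-case code this concerns (illustrative, not an actual test):

struct pair { int a; int b; };

void copy_partially_initialized(void)
{
    struct pair src, dst;
    src.a = 1;      /* src.b is deliberately left uninitialized */
    dst = src;      /* copies the indeterminate src.b: warned about by
                       default, tolerated with -eva-warn-copy-indeterminate=-@all */
    int x = dst.a;  /* only the initialized field is ever read */
    (void)x;
}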

-eva-equality-domain \
-eva-sign-domain \

These two abstract domains are efficient enough to be used in most analyses, and improve the precision of the results at a minimal cost. The equality domain in particular is useful in some tests which require relational information, while the sign domain is well suited for testing divide-by-zero errors. The equality domain eliminates false alarms in 184 tests, and the sign domain removes 308 false alarms.

-add-symbolic-path TESTCASESUPPORT_DIR:$TESTCASESUPPORT_DIR

This is a technicality, but a useful one for scripts and versioning: by default, Frama-C uses relative paths when they are "sufficiently close" to the current working directory, and absolute paths otherwise (e.g. when there is a .. in the path). To avoid versioning such paths (and also to create test oracles), Frama-C has a mechanism to replace specific paths with user-defined strings. The most common use case is Frama-C's share directory, written as FRAMAC_SHARE in reports and in the GUI. In this case, the actual path to the C/testcasesupport directory is replaced with the string TESTCASESUPPORT_DIR. This ensures every user gets the same oracles for the CSV files mentioned below.

-report-no-proven -report-csv <file>.csv

These options are not present in analyze.sh but in evaluate_case.sh, where the output filename is available. They invoke the Report plug-in to generate a CSV file containing the list of properties in the program, along with their statuses. Option -report-no-proven keeps only the properties left Unknown or Invalid, which are the ones of interest when looking for undefined behaviors.

Conclusion

The NIST SAMATE Juliet 1.3 test suite for C/C++ provides an excellent set of well-defined programs, as well as programs with isolated bugs, which are very useful for automated testing and comparative evaluation.

Frama-C/Eva can handle several of the CWEs present in the test suite, and recent improvements in the analyzer allow it to obtain a very high precision. The test cases are small, which does not make them representative of large case studies, but they allowed the identification of limitations in the analyzer and offered opportunities for improvement.

The analysis with Frama-C/Eva identified very few issues with the code, which, given the size of the test suite, is an indicator of its high quality.

Overall, participation in the Ockham track enabled Frama-C/Eva to improve its analysis, and offered a very extensive set of test cases for regression testing and quantitative experimentation with different parametrizations. For instance, the sign domain, developed originally as part of a tutorial for Eva, proved itself very useful to improve the precision of several tests related to CWE369 - Divide by Zero, with a minimal runtime cost. The large-scale testing afforded by Juliet enables such kinds of evaluation.

We would like to thank NIST for preparing Juliet and organizing SATE, and for their commitment to the exposition of innovative tools for program analysis and verification.

Friday, November 2 2012

Short, difficult programs

When researchers start claiming that they have a sound and complete analyzer for predicting whether a program statement is reachable, it is time to build a database of interesting programs.

Goldbach's conjecture

My long-time favorite is a C program that verifies Goldbach's conjecture (actual program left as an exercise to the reader).

If the conjecture is true, the program never terminates (ergo a statement placed after it is unreachable). If the conjecture is false, the program terminates and moves on to the following statement. The program can be implemented using unsigned long long integers that should be good for counter-examples up to 18446744073709551615. It can alternately use dynamic allocation and multi-precision integers, in which case, depending on whether your precise definition of “analyzing a C program” includes out-of-memory behaviors, you could claim that the reachability of the statement after the counter-example-finding loop is equivalent to the resolution of “one of the oldest and best-known unsolved problems in number theory and in all of mathematics”.
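For reference, one possible shape of the exercise, as a naive and unoptimized sketch using unsigned long long:

#include <stdio.h>

/* Naive sketch: searches for an even number that is not the sum of two
   primes. The loop terminates if and only if a counter-example to
   Goldbach's conjecture exists below ULLONG_MAX. (For n close to
   ULLONG_MAX the d * d test may wrap, costing time but not correctness.) */
static int is_prime(unsigned long long n)
{
    if (n < 2) return 0;
    for (unsigned long long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

int main(void)
{
    for (unsigned long long n = 4; ; n += 2) {
        int found = 0;
        for (unsigned long long p = 2; p <= n / 2; p++)
            if (is_prime(p) && is_prime(n - p)) { found = 1; break; }
        if (!found) {
            printf("%llu is a counter-example\n", n);
            break;  /* this line is reachable iff the conjecture is false */
        }
    }
    return 0;
}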

Easier programs than that

No-one expects the resolution of Goldbach's conjecture to come from a program analyzer. This example is too good, because it is out of reach for everyone—it has eluded our best mathematicians for centuries. What I am looking for in this post is easier programs, where the solution is just in reach. Examples that it would genuinely be informative to run these sound, complete analyzers on. If they were made available.

With integers

For 32-bit ints and 64-bit long longs, I know that the label L is unreachable in the program below, but does your average sound and complete program analyzer know it?

/*@ requires 2 <= x ;
    requires 2 <= y ; */
void f(unsigned int x, unsigned int y)
{
  if (x * (unsigned long long)y == 17)
    L: return;
}

The key is that with the aforementioned platform hypotheses, the multiplication does not overflow. Label L being reached would mean that the program has identified divisors of the prime number 17, which we don't expect it to.


In the program below, the multiplication can overflow, and not having tried it, I have truly no idea whether the label L is reachable. I expect it is, statistically, but if you told me that it is unreachable because of a deep mathematical property of computations modulo a power of two, I would not be shocked.

/*@ requires 2 <= x ;
    requires 2 <= y ; */
void f(unsigned long long x, unsigned long long y)
{
  if (x * y == 17)
    L: return;
}

With floating-point

Floating-point is fun. Label L in the following program is unreachable:

/*@ requires 10000000. <= x <= 200000000. ;
    requires 10000000. <= y <= 200000000. ; */
void sub(float x, float y)
{
  if (x - y == 0.09375f)
    L: return;
}

Frama-C's value analysis, and most other analyzers, will not tell you that label L is unreachable. It definitely looks reachable: the difference between x and y can be zero, and it can be 1.0. It looks like it could be 0.09375, but it cannot: the subtracted numbers are too large for the difference, if non-zero, to be smaller than 1.0. Indeed, every float in the prescribed range lies above 2^23, where consecutive single-precision floats are at least 1.0 apart; x and y are therefore both integers, and so is the computed difference x - y.


So the subtlety in the example above is the magnitude of the arguments. What about smaller arguments then?

/*@ requires 1.0 <= x <= 200000000. ;
    requires 1.0 <= y <= 200000000. ; */
void sub(float x, float y)
{
  if (x - y == 0.09375f)
    L: return;
}

This time the label L is easily reachable, for instance with the inputs 2.09375 for x and 2.0 for y.


What about this third program?

/*@ requires 1.0 <= x <= 200000000. ;
    requires 1.0 <= y <= 200000000. ; */
void sub(float x, float y)
{
  if (x - y == 0.1f)
    L: return;
}

The target difference 0.1f is larger than 0.09375f, so it should be easier to obtain as the difference of two floats from the prescribed range, right? In fact, it isn't. The numbers 0.099999904632568359375 and 0.10000002384185791015625 can be obtained as values of x-y in the above program; the former, for instance, results from picking 1.099999904632568359375 for x and 1.0 for y. The value 0.1f, on the other hand, cannot be obtained as the subtraction of two floats above 1.0, because some bits are set in its significand that cannot be set by subtracting floats that large.
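The claim is easy to check empirically (a quick experiment, not a proof):

#include <stdio.h>

int main(void)
{
    float x = 1.099999904632568359375f, y = 1.0f;
    printf("%.25f\n", (double)(x - y));           /* 0.0999999046325683593750000... */
    printf("%d\n", x - y == 0.1f);                /* 0: the difference misses 0.1f  */
    printf("%d\n", 2.09375f - 2.0f == 0.09375f);  /* 1: 0.09375f is reached         */
    return 0;
}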

Conclusion

I expect it will be some time before automatic analyzers can soundly and completely decide, in all two-line programs, whether the second line is reachable. Frama-C's value analysis does not detect unreachability in any of the programs above, for instance (the value analysis is sound, so it soundly detects that L may be reachable when it is). Please leave your own difficult two-line programs in the comments.


The floating-point examples in this post owe quite a bit to my colleague Bruno Marre explaining his article “Improving the Floating Point Addition and Subtraction Constraints” to me over lunch.

Tuesday, October 16 2012

Exact Gap Computation for Code Coverage Metrics in ISO-C

Comparing static analysis tools is (still) difficult

Earlier this year (2012), some of my colleagues and I took the opportunity to point out that, as a research community, we are not doing a satisfactory job of comparing static analysis tools. This article and blog post were concerned with independent third-party comparisons. Another context in which each of us can be led to do comparisons, and to try to be as objective as possible, is the “related work” section of any traditional, non-benchmark article.

“Related work” sections

Dirk Richter and Christian Berg have designed a technique that, apparently, it makes some sense to compare to what Frama-C's value analysis allows. Here is their summary of their comparison between their tool and the value analysis:

No sophisticated examples can be handled by Frama-C’s value analysis. Some of the examples tested even cause runtime-errors in Frama-C itself, thus it is not reliable.

Richter and Berg's technique appears to tackle the problem of estimating how much code must actually be covered for all the code that can be covered to have been covered.

Frankly, it seems to me that the researchers should be comparing their tool to existing test-generation tools. They are much closer in objectives, and in the means deployed to reach these objectives. One such tool that I know of because it is developed locally is PathCrawler. I am sure there are plenty of others. I am not sure why they spend a paragraph on Frama-C's value analysis when it is absolutely not documented as trying to do this.

Possible comparisons

However, I am interested in tool comparisons, so this is a good opportunity to understand how these comparisons work and how we can make them better.


First, the claim “No sophisticated examples can be handled by Frama-C’s value analysis” is interesting, because of course what really kills advanced techniques in practice is that they do not scale to even moderate-size programs. The authors point out that their own work may only be “practical” for embedded systems. Coincidentally, Frama-C's value analysis was designed to work well on embedded code, too, since we too felt this was a domain where practicality was within reach. This makes me curious as to what embedded code examples the authors experimented on before reaching their conclusion. If these examples reveal weaknesses in Frama-C's value analysis on embedded code, we would definitely like to hear about them.

I would summarize this into one additional item that we did not list in our “Benchmarking static analyzers” article (although it was implicit in other recommendations we did list): when doing comparisons, conditions should be made explicit and input programs should be made available. Here, this would mean providing the Frama-C version and command line in addition to the test programs themselves.

For such programs and command lines that “cause runtime-errors in Frama-C itself”, a good place to provide them is our Bug Tracking System. Instructions for reporting a bug are displayed on crashes. Otherwise, a tarball with all the embedded C code examples the authors used would be fine. A URL could be provided in the article.


Second, the authors claim that Frama-C is not reliable. This makes me want to investigate because, precisely, in 2012 we published a peer-reviewed article on the subject “Frama-C is reliable, here is how we used random program generation to make it so”. I am eager to get my paws on the authors' implementation so as to evaluate its reliability, since this is supposed to be one aspect where it does better. Unfortunately, the implementation does not appear to be available yet.


Third, it is always a difficult exercise to compare one's own tool to a different, unfamiliar one. One risks ending up with a slightly unfair conclusion, even if one tries one's hardest to be objective. I was remarking about the difficulty of finding “fair” inputs for the comparison of static analyzers in another previous post. The difficulty for static analyzers is that inputs are programs: programming languages are so expressive that two programs, even in the same language, can differ widely.

There, as a remedy, I suggested tool authors first try their own tool on the programs that the other analyzer works well on. For someone working on another technique and unfamiliar with the strengths and weaknesses of Frama-C's value analysis, Csmith-generated examples are one obvious possibility. Another possibility would be to use the programs in the 2012 RERS challenge. These programs contain assert(0); calls and, for each, an objective is to tell whether it is reachable. This seems to me to be the same question as whether a test suite needs to cover the assert(0); call.

The good news is that both on Csmith-generated programs and in the RERS challenge, Frama-C's value analysis can be made to be both sound and complete, just like Richter and Berg claim their technique is. It should then be simple and informative to compare the results of their tool to the results obtained with the value analysis. Below are our solutions for problems 1-9.

Reachable assertions in problems 1-9 of the RERS competition

Below is a list of reachable lines for each problem. The assert(0) calls not listed are unreachable.

Problem1.c

error_20: assert(0);
error_47: assert(0);
error_32: assert(0);
error_37: assert(0);
error_56: assert(0);
error_33: assert(0);
error_57: assert(0);
error_50: assert(0);
error_35: assert(0);
error_15: assert(0);
error_38: assert(0);
error_21: assert(0);
error_44: assert(0);
globalError: assert(0);

Problem2.c

error_50: assert(0);
error_45: assert(0);
error_59: assert(0);
globalError: assert(0);
error_43: assert(0);
error_13: assert(0);
error_16: assert(0);
error_44: assert(0);

Problem3.c

error_45: assert(0);
error_35: assert(0);
error_52: assert(0);
error_39: assert(0);
error_9: assert(0);
error_37: assert(0);
error_43: assert(0);
error_31: assert(0);
error_28: assert(0);
error_27: assert(0);
error_50: assert(0);
error_13: assert(0);
error_26: assert(0);
globalError: assert(0);

Problem4.c

error_12: assert(0);
error_19: assert(0);
error_31: assert(0);
error_39: assert(0);
error_52: assert(0);
error_6: assert(0);
error_58: assert(0);
error_40: assert(0);
error_4: assert(0);
error_38: assert(0);
error_45: assert(0);
error_11: assert(0);
error_26: assert(0);
globalError: assert(0);
error_9: assert(0);
error_17: assert(0);
error_32: assert(0);
error_35: assert(0);
error_55: assert(0);
error_36: assert(0);
error_14: assert(0);
error_18: assert(0);
error_13: assert(0);
error_15: assert(0);
error_27: assert(0);

Problem5.c

error_0: assert(0);
error_38: assert(0);
error_57: assert(0);
error_55: assert(0);
error_58: assert(0);
error_32: assert(0);
error_13: assert(0);
error_51: assert(0);
error_33: assert(0);
error_48: assert(0);
error_18: assert(0);
error_39: assert(0);
error_1: assert(0);
error_41: assert(0);
error_37: assert(0);
globalError: assert(0);
error_11: assert(0);
error_26: assert(0);
error_15: assert(0);
error_40: assert(0);
error_36: assert(0);
error_44: assert(0);
error_30: assert(0);
error_47: assert(0);
error_24: assert(0);

Problem6.c

error_12: assert(0);
error_21: assert(0);
error_11: assert(0);
error_44: assert(0);
error_1: assert(0);
error_36: assert(0);
error_0: assert(0);
error_2: assert(0);
error_38: assert(0);
error_48: assert(0);
error_37: assert(0);
error_4: assert(0);
error_59: assert(0);
error_10: assert(0);
error_20: assert(0);
error_5: assert(0);
error_15: assert(0);
error_27: assert(0);
error_33: assert(0);
error_9: assert(0);
error_29: assert(0);
error_47: assert(0);
error_56: assert(0);
error_24: assert(0);
error_58: assert(0);
globalError: assert(0);

Problem7.c

error_58: assert(0);
error_47: assert(0);
error_5: assert(0);
error_48: assert(0);
error_19: assert(0);
error_39: assert(0);
error_36: assert(0);
error_40: assert(0);
error_35: assert(0);
error_31: assert(0);
error_9: assert(0);
error_42: assert(0);
error_7: assert(0);
globalError: assert(0);
error_11: assert(0);
error_20: assert(0);
error_44: assert(0);
error_46: assert(0);
error_18: assert(0);
error_6: assert(0);
error_23: assert(0);
error_30: assert(0);
error_3: assert(0);
error_37: assert(0);
error_15: assert(0);

Problem8.c

error_48: assert(0);
error_2: assert(0);
error_49: assert(0);
error_15: assert(0);
error_7: assert(0);
error_55: assert(0);
error_51: assert(0);
error_50: assert(0);
error_43: assert(0);
error_10: assert(0);
error_29: assert(0);
error_24: assert(0);
error_1: assert(0);
error_26: assert(0);
error_6: assert(0);
error_5: assert(0);
error_46: assert(0);
error_13: assert(0);
error_4: assert(0);
error_37: assert(0);
globalError: assert(0);
error_34: assert(0);
error_25: assert(0);
error_28: assert(0);
error_59: assert(0);

Problem9.c

error_46: assert(0);
error_57: assert(0);
error_36: assert(0);
error_19: assert(0);
error_6: assert(0);
error_10: assert(0);
error_34: assert(0);
error_15: assert(0);
error_32: assert(0);
error_41: assert(0);
error_11: assert(0);
error_35: assert(0);
error_2: assert(0);
error_20: assert(0);
error_3: assert(0);
globalError: assert(0);
error_44: assert(0);
error_38: assert(0);
error_51: assert(0);
error_54: assert(0);
error_56: assert(0);
error_53: assert(0);
error_47: assert(0);
error_59: assert(0);
error_8: assert(0);

Friday, August 24 2012

On writing a dedicated model-checker for the RERS competition

In recent posts, I have shown that Frama-C's value analysis could answer many reachability questions, and some questions that weren't originally phrased as reachability questions, about the programs in the RERS competition.

If you are vaguely familiar with the internals of Frama-C's value analysis, and if you tried analyzing some of the competition programs yourself, you may have noticed that these analyses use only a small fraction of the plug-in's capabilities. The analyzer is only ever propagating abstract states that correspond to singletons. It does juggle with many program states, but the programs here have small states that are encoded in just a few variables (the analyzer would have been able to manipulate the states encoded on a much larger number of variables and would efficiently share in memory the values common to several of the explored states). There are no bit-fields, no structs with padding (both of which might make identical states look different if carelessly handled). The programs obviously do not execute any undefined behavior for lack of any operation that might introduce them. There is a single outermost loop. There is no arithmetic at all.

In favor of the general verifier

A specialized verifier designed for just these programs would have a huge opportunity to do a better job on them. On the other hand, the advantage of working on a more general verifier is that it is useful for more tasks. This makes it possible to spend more time improving it. Some of the improvements enhance the analysis of many programs, including the specific programs built only from assignments, equality tests and conditionals considered in the competition. Some of these improvements are too sophisticated to justify implementing in a verifier that only handles programs with assignments, equality tests and conditionals, because such a verifier will never be usable to find that the SHA-3 candidate Skein's reference implementation does not exhibit undefined behavior, that AES may be susceptible to timing attacks (but Skein isn't), where a couple of bugs are in an old undocumented piece of control software, that there is a subtle memory overflow in compression library QuickLZ, or that units of code have the data and control flows mandated by specification.

What a dedicated verifier might look like

In the particular case of these programs, the same propagation done by the value analysis could be done in C, by a program that would incorporate the transition function directly and execute it as compiled code. This would be much more efficient than what the value analysis does, and in this particular case, it would give the same results. From experiments interpreting Csmith programs, the value analysis slowdown with respect to compiled code can be expected to be of the order of 10000.

Accumulating reachable states

In order to reproduce what Frama-C's value analysis does, a dedicated verifier would need to store states that have already been visited, and to be able to recognize, after applying the transition function once more, whether the obtained state was one that was already visited.

In this particular case, this could be done in constant time with a hashset. Note, however, that it is only possible to compute a hash in this specialized case because all states are “singleton” states. If some states represented several values at once, e.g. a1 ∈ {1; 2} a2 ∈ {0; 1}, the good criterion would then be whether the newly computed state is included in one of the previously propagated states. Testing inclusion in one of the previous states cannot be done in constant time with a hashset (hashsets only allow you to test whether the new state is equal to an existing state, and you need to be careful to use compatible equality and hash functions).
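A sketch of such a hash for the singleton case, using the FNV-1a scheme (the choice of hash, and the representation of a state as a padding-free vector of ints, are assumptions):

#include <stdint.h>
#include <stddef.h>

typedef struct { int v[30]; } state;  /* hypothetical: one int per program variable */

/* FNV-1a over the state's bytes. Valid only because every stored state
   is a concrete vector of values (and the struct has no padding), so
   equal states have equal bytes. */
static uint64_t hash_state(const state *s)
{
    const unsigned char *p = (const unsigned char *)s;
    uint64_t h = 14695981039346656037ull;  /* FNV offset basis */
    for (size_t i = 0; i < sizeof *s; i++) {
        h ^= p[i];
        h *= 1099511628211ull;             /* FNV prime */
    }
    return h;
}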

Frama-C's value analysis uses a data structure that is generally as efficient as hashsets when the manipulated states are singletons, and that often remains efficient when testing inclusion in one another of states that are not singletons.

Storing states that remain to be propagated

A small thing: the propagation algorithm also requires a workqueue. Any propagation order will do (Frama-C's value analysis propagation order is not specified either), but since C comes with so few data structures, I just thought I would mention it. For a dedicated verifier in C, one would need to find some linked list implementation or write their own. Frama-C's value analysis may interpret C slower than the code can be executed once compiled, but it comes with ready-to-use data structures for the reached set and for the workqueue. One need not even know they are there to use the analyzer.
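Putting the pieces together, a dedicated verifier might reduce to a loop of the following shape. This is a sketch only: step() stands for the compiled transition function, the state layout and the input range 1..6 are assumptions about the competition programs, and bounded arrays stand in for the reached set and the workqueue.

#include <string.h>

#define NVARS 30
#define MAX_STATES (1 << 20)  /* assumed upper bound on reachable states */

typedef struct { int v[NVARS]; } state;

state step(state s, int input);  /* the transition function, compiled from the program */

static state reached[MAX_STATES];  /* reached set */
static state queue[MAX_STATES];    /* workqueue */
static int nreached, qhead, qtail;

static int seen(const state *s)
{
    /* Linear scan for brevity; a hashset, as discussed above, makes
       this test constant time for singleton states. */
    for (int i = 0; i < nreached; i++)
        if (memcmp(&reached[i], s, sizeof *s) == 0) return 1;
    return 0;
}

void explore(state init)
{
    reached[nreached++] = init;
    queue[qtail++] = init;
    while (qhead < qtail) {          /* any propagation order will do */
        state s = queue[qhead++];
        for (int input = 1; input <= 6; input++) {
            state t = step(s, input);
            if (!seen(&t)) {
                reached[nreached++] = t;
                queue[qtail++] = t;
            }
        }
    }
}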

Specific drawbacks

Obviously, this ad-hoc verifier could be expected to be much faster than the value analysis. This makes the experiment tempting. What prevents me from starting work on it is the lack of generality of the resulting tool. Some examples follow.

Undefined behavior

Suppose that you had implemented such a specialized verifier for the competition's program, and that you were looking for more transition functions written in C that the same verifier could apply to. You would certainly find some, but would you in general be certain that the transition function never exhibits undefined behavior (not for the first transition, and not for the transition from any reachable state to a successor)? If one isn't certain that the transition function does not cause undefined behavior, from a formal verification point of view, the tool is worthless. An undefined behavior can cause anything to happen. In particular, an out-of-bounds memory access in the transition function can mess up the verifier's reached set or workqueue, and compromise the results of the verification.


Any automatic verifier is susceptible to the caveat that a bug in the verifier can compromise results. This is different: your implementation could be flawless, but it would still be susceptible to a bug in the transition function, a bug in the system being verified.

Of course, you could, as a preliminary step in your verification, check that the transition function does not have any undefined behavior for any input state. If you find that there are a lot of different undefined behaviors in C and that it's a bother to detect them all, we have a tool that we have been working on. It also answers reachability questions.

General interference between unrelated traces

Even if you have only “frank” run-time errors to fear—undefined behavior that compilers kindly translate to run-time errors, at least when optimizations are disabled—, a run-time error in one transition will still interrupt the entire dedicated analyzer. You will notice when this happens, and you can add a guard to the particular division by zero or NULL dereference, but this iterative process may end up taking as long as the value analysis for the same result. The value analysis, when encountering division by zero or NULL dereference, makes a note of it, considers the trace as terminated, and goes on to the propagation of the next trace. The end result, a list of run-time errors to worry about and a reachable states set, is the same, but it is obtained in a single pass.

There is also the issue of non-termination of the transition function. The value analysis detects cases of non-termination more or less quickly; again, when the infinite loop is detected it simply goes on to the next trace. With an ad-hoc verifier, if you expect the verification to take days, it may take days before you realize that the verifier is executing an infinite loop in its third considered transition.

Conclusion

In summary, considering the general C verification framework we already have, it looks like it wouldn't be a good investment of our time to develop a dedicated fast verifier for the competition—although it would provide an insightful datapoint if someone did.

Perhaps participating in the meeting will convince us that Event Condition Action (ECA) systems are more generally useful than our current experience has led us to believe. We could work on a verifier for them if there is a need. There is not much code to reuse from a general-purpose C abstract interpreter. There would be ideas to reuse, certainly. I will come to one of them in the next post.

Thursday, August 23 2012

Technical interlude

The current series of posts is pretty technical, huh? Here is a refreshing diversion in the form of a technical post that also touches on existential questions. What is this blog for? We don't know.

Best solutions, or solutions anyone could have found

If you have been following the current series of posts, you may have noticed that a recurring dilemma is whether the goal is to demonstrate the capabilities of the software tool under consideration (Frama-C's value analysis plug-in) when pushed to its limits by its authors, or to demonstrate what anyone could do without being a creator of the tool, taking advantage of Linux, Mac OS X and sometimes Windows packages for the software, and, most importantly, having only the available documentation to understand what the software makes possible.


The general goal of the blog is to empower users. In fact, much of the aforementioned documentation is available as posts in the blog itself. These posts sometimes describe techniques that were undocumented until then, but describe them so that they wouldn't be undocumented any more. Some of the posts even describe techniques that weren't undocumented at the time of writing. As my colleague Jean-Jacques Lévy would say, “repetition is the essence of teaching”.


However, I do not expect that most software tool competitions are about measuring the availability and learnability of tools. For instance, the SV-COMP competition explicitly forbids anyone other than the authors of a tool from participating with it:

A person (participant) was qualified as competition contributor for a competition candidate if the person was a contributing designer/developer of the submitted competition candidate (witnessed by occurrence of the person’s name on the tool’s project web page, a tool paper, or in the revision logs).

Ostensibly, the tools are the participants in the competition. Involving the authors is only a means to make the competition as fair as possible for the tools. Some colleagues and I have expressed the opinion that, for static analysis, at this time, this was the right way:

In general, the best way to make sure that a benchmark does not misrepresent the strengths and weaknesses of a tool is to include the toolmakers in the process.


Some competitions on the other hand acknowledge that a participant is a person(s)+tool hybrid, and welcome entries from users that did not create the tools they used. This is perfectly fine, especially for the very interactive tools involved in the cited competition, as long as it is understood that it is not a tool alone, but a hybrid participant, that is being evaluated in such a competition. Some of the participating tools participated in several teams, associated to several users (although the winning teams did tend to be composed of toolmakers using their respective tools).


It is not clear to me where on this axis the RERS competition falls. It is a “free-style” competition. The entries are largely evaluated on results, with much freedom with respect to the way these were obtained. This definitely can also be fun(*).

The competition certainly does not require participating tools to be packaged and documented in such exquisite detail that anyone could have computed the same results. As such, the best description to write is the description of what the authors of the value analysis can do with it with all the shortcuts they trust themselves to take. If some of the questions can be solved efficiently using undocumented features, so be it. It is the free-style style.

But the mission of the blog remains to document, and showing what we have done without explaining how to reproduce it sounds like boasting. I am torn.


(*) An untranslatable French idiom, “bon public”, literally “good audience”, used in an adjective position, conveys a number of subtle, often humorous nuances. I think this applies to me and software competitions. They just all seem fun to me.

A necessary built-in

A value analysis built-in whose usefulness became apparent in the last post, had it been available, is one that halts the analyzer.

Several analysis log snippets in the previous post show a ^C as last line, to indicate that the log already contains the answer to the question being investigated, and that the user watching the analysis go by can freely interrupt it then.

I cheated when editing the post, by the way. I am not fast enough to interrupt the analysis just after the first conclusive log message. But I did interrupt these analyses, which otherwise might have gone on for a long time for no purpose.

I thought I would enjoy my colleagues' horrified reaction so I told everyone about my intention of writing a value analysis built-in, like some already exist for memcpy() or memset(), for halting the analyzer. Not halting the currently considered execution, mind you. There is /*@ assert \false; */ for that, and if you need this functionality available as a C function, you can put /*@ assert \false; */ inside your own function, like I did for assert() earlier when prepping the competition programs for Frama-C.


Clearly, such a built-in does not have a contract, its effects are meta-effects that transcend the framework (in what would, for these colleagues, be the bad sense of the word).

A time-saver

Today I set out to write this built-in, because I started to get serious about the competition and it would be convenient to just instrument the program with:

if ( /* current state identical to predecessor state */ )
{
  Frama_C_show_each_cycle_found(1);
  Frama_C_halt_analysis();
}

instead of having an operator, or even a shell script, watch the log for the Frama_C_show_each_cycle_found message and interrupt Frama-C then.


Soon after opening the file and finding my place, I realized that this feature was already available. I am telling you this because this blog is for the benefit of transparency, for documenting things and for empowering users, but mostly because I want to see my colleagues' horrified reaction when they read about this trick. Actually, I would not use it if I were you. It might cease to work in a future version.

if ( /* current state identical to predecessor state */ )
{
  Frama_C_show_each_cycle_found(1);
  Frama_C_cos(0.0, 0.0);
}

You see, Frama-C value analysis built-ins, such as Frama_C_cos(), do not have an intrinsic type. You can include a header that will give them a type if you wish:

double Frama_C_cos(double x);

Otherwise, the old C89 rules for missing function prototypes apply. During parsing, a type is inferred from the types of the arguments at the call site.

Only if and when the analysis actually reaches the Frama_C_cos(0.0, 0.0) call does the analyzer realize that it is passed arguments that are incompatible with the OCaml function that implements the built-in. An exception is raised. Since there is no reason to try to recover from a built-in misuse, this exception is caught only at the highest level. It stops the entire analyzer (it marks the entire analysis results as unreliable, as it should, but of course it does not retroactively unemit the Frama_C_show_each_cycle_found message in the log that says a conclusion has been reached).


The result looks like this, this time without cheating:

[value] Called Frama_C_show_each_cycle_found({1})
...
[value] computing for function any_int <- main.
        Called from Problem6_terminateU_pred.c:9516.
[value] Done for function any_int
[value] Semantic level unrolling superposing up to 1400 states
[value] user error: Invalid argument for Frama_C_cos function
[value] user error: Degeneration occured:
        results are not correct for lines of code
        that can be reached from the degeneration point.
[kernel] Plug-in value aborted because of invalid user input.

This is convenient for timing the analyzer as it tackles some of the harder problems. In the case above, the time was 5m35s answering a liveness question about Problem 6. It would have been a pain to watch the log for that long, fingers poised to interrupt the analyzer.


That's all for this interlude. Expect exciting timings in the next post. There may also be talk of undocumented features that transcend the problem, in the good sense of the word.

Wednesday, August 22 2012

Proving (F oW) true or false: alternative method

In the context of the RERS 2012 competition, the previous post points out that value analysis option -obviously-terminates can be used to prove the termination of an instrumented program, created such that its termination implies that the property (F oW) is true of the original competition program. Unfortunately, this only works in one direction. If the property (F oW) does not hold of the original competition program, this approach cannot be used to prove that. The previous post continued to offer a method, involving a counter, that works in both directions.

Preliminary remark regarding the counter method

The second method proposed in the previous post requires a state count to use as bound. I suggested using the number of reachable states of the original (uninstrumented) program, obtained by running Frama-C's value analysis as a sound and complete abstract interpreter.

I would like to point out that instead of the state count of the original program, it is possible to use the state count of the program instrumented to stop when W is emitted. This number is by definition lower. On the other hand, the number of states of the uninstrumented program can be computed once and for all, and then be used for all formulas of the form (F oW) that apply to that program in the competition. In fact we can obtain that number for free when we check the reachability of assert(0); statements for the competition. So there is a trade-off there.

Applying the method of last post

Let us now consider Problem3.c (small-hard). One of the properties to evaluate is (F oY). The first method described in the previous post does not conclude rapidly, so there is reason to suspect that perhaps the property is false for this program. But if this is going to be our answer in the competition, we had better establish that the property is definitely false, that is, there exists an infinite execution along which Y is not emitted.

Instrumenting with a counter

To establish that the property is false using a counter, we instrument Problem3.c thus:

--- Problem3.c	2012-08-08 10:49:53.000000000 +0200
+++ Problem3_terminate_at_Y_with_counter.c	2012-08-22 13:58:30.000000000 +0200
@@ -1,9 +1,8 @@
 void assert(int x)
 {
   if (!x)
     {
-      Frama_C_show_each_assert_reached();
       /*@ assert \false; */
     }
 }
 
@@ -1649,15 +1648,16 @@
 	    } 
 	    if((((((((!(a26==1)&&(a6==1))&&!(a27==1))&&(a12==1))&&(a3==0))&&(a11==1))&&(a5==1))&&(a18==1))){
 	    	error_2: assert(0);
 	    } 
-	    return -2; 
+	    /*@ assert \false ; */
 	}
 
 int main()
 {
     // default output
     int output = -1;
+    int counter = 0;
 
     // main i/o-loop
     while(1)
     {
@@ -1668,6 +1668,10 @@
 
         // operate eca engine
         output = calculate_output(input);
 
+	/*@ assert output != y ; */
+
+	counter++;
+	Frama_C_show_each_counter(counter);
     }
 }

Obtaining a bound

In order to obtain a bound, we analyze the original program, using the value analysis as a model-checker:

$ frama-c -val -slevel 100000000 Problem3.c
...
[value] Semantic level unrolling superposing up to 200 states
...

This message means that somewhere between 200 and 300 states were attached to a single statement during the analysis of the program. That statement is probably the one just after a new input has been picked non-deterministically between 1 and 6, making the number six times higher than the number of states that interests us. The abstract interpreter also distinguishes program states that differ only in the variable output, again making the bound displayed in the “superposing up to” messages higher than the number we mean. But let us not try to compensate for that, just to be on the safe side.


I kid, I kid. Using the “superposing up to” message is not rigorous, since this message was not designed for that (and the message is not documented at all, I think). As a Frama-C developer with knowledge of value analysis internals, I am tempted to use 300 as a safe bound, but in order to obtain a rigorous number of states, anyone can insert the following statement in the while loop of the program:

Frama_C_show_each_state( a1, a4, a0, a15, a29, a10, a16, a22, a2, a17, a25, a7, a14, a19, a20, a8, a23, a21, a24, a13, a9, a28, a26, a6, a27, a12, a3, a11, a5, a18 );

The messages printed by the call correspond exactly to the states of the program; once filtered, sorted and de-duplicated, the list is exactly 31 entries long. This method provides an exact number, and it uses only documented value analysis features. The list is provided at the end of the post for the doubters.
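Concretely, assuming the log was saved to log.txt, the filtering can be done with standard tools; since each call is echoed on its own line (as in the log excerpts elsewhere in this post), grepping on the statement name is enough:

$ grep Frama_C_show_each_state log.txt | sort -u | wc -l
31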

Looking for an execution with more than 31 transitions

Knowing that we are looking for an execution where counter goes higher than 31, we can now launch the analysis of the version instrumented with counter:

$ frama-c -val -slevel 100000000 -no-val-show-progress Problem3_terminate_at_Y_with_counter.c
...
[value] Called Frama_C_show_each_counter({1})
[value] Called Frama_C_show_each_counter({1})
[value] Called Frama_C_show_each_counter({1})
[value] Called Frama_C_show_each_counter({1})
[value] Called Frama_C_show_each_counter({1})
Problem3_terminate_at_Y_with_counter.c:5:[value] Assertion got status invalid (stopping propagation).
[value] Called Frama_C_show_each_counter({2})
...
[value] Called Frama_C_show_each_counter({31})
...
[value] Called Frama_C_show_each_counter({31})
[value] Called Frama_C_show_each_counter({32})
...
[value] Called Frama_C_show_each_counter({32})
^C

The counter has exceeded 31, the number of reachable states: by the pigeonhole principle, some state repeats along one of these executions. This means that there is at least one infinite execution in the instrumented program, which means that the original program can execute infinitely without emitting Y.

A method not involving a transition counter

An irritating feature of the previously described method is that it does not immediately detect infinite executions that are right under its nose. There could exist, for a certain input, a transition from the initial state to itself, and the analyzer would still have to explore all other transitions, breadth-first, down to a depth of 31, before displaying the Called Frama_C_show_each_counter({32}) message that confirms the existence of an infinite execution.

A method to look for this cycle of length one would be to instrument the program differently. This is what I describe now.

Add variables to the program for the description of the predecessor state

This means transforming the declarations

	int a1 = 1;
	int a4 = 1;
...

into:

	int a1 = 1, p1 = -1;
	int a4 = 1, p4 = -1;
...

This can be done with a regexp.
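
For instance, here is a sed sketch, assuming every relevant declaration matches the exact pattern int aN = c; with a non-negative integer constant c (the output file name is arbitrary):

$ sed -E 's/int a([0-9]+) = ([0-9]+);/int a\1 = \2, p\1 = -1;/' Problem3.c > Problem3_with_predecessors.c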

Check whether a newly reached state coincides with its predecessor state

At the end of a cycle, we compare the new state with the predecessor state. Again, code doing this can be generated with a regexp:

        if ((a1 == p1) &&
            (a4 == p4) &&
...
            (a18 == p18))
          Frama_C_show_each_cycle_found(1);

The above causes a message to be emitted by the analyzer as soon as a cycle is found.

Update the predecessor state

After checking whether the new state coincides with the predecessor state, it is time to update the predecessor state. Yada regexp yada:

        p1 = a1;
        p4 = a4;
...
        p18 = a18;

And presto: we detect all cycles. All cycles of length one, that is.

Maybe update the predecessor state

Although it would work great on the example that motivated this new method, detecting only cycles of length one is a little bit too specific. Let us fix that.

At the end of a cycle, instead of updating the predecessor state as above, let us maybe update it as below:

    if (any_int())
    {
        p1 = a1;
        p4 = a4;
...
        p18 = a18;
    }

With the above non-deterministic update of the predecessor state, for each vector a1, a4, …, a18 of variables propagated by the value analysis, the variables p1, p4, …, p18 contain a nondeterministic superposition of all its predecessors. Think of it as programming a quantum computer, and if you do think of it this way, remember that the variables p1, p4, …, p18 are entangled: when observed, they contain matching values from one specific predecessor. They are entangled because the conditional above either updates the variables p1, p4, … all at once or leaves them all unchanged.


On the other hand, different predecessors do not get mixed together because... well, because the analyzer remains complete on this program even after being instrumented thus, for reasons I still do not want to document. Nevertheless, the important part of the argument is that a sound and complete analyzer would find the Frama_C_show_each_cycle_found(1); statement above reachable if and only if, in the program instrumented to stop at Y, a state is identical to one of its predecessors. Hopefully it is possible to convince oneself of that much without being a value analysis expert, although I admit that programming nondeterministic machines is an unusual exercise.


Using Frama-C's value analysis as a sound and complete abstract interpreter, the statement Frama_C_show_each_cycle_found(1); is reached, and a message displayed, exactly when a state is reached that is identical to one of its predecessors. This happens if and only if there is a cycle in the transition graph of the program instrumented to stop at Y. In terms of the original program, this corresponds to (F oY) being false.

Putting the new method to the test

Running the value analysis on the program instrumented as described:

$ frama-c -val -slevel 100000000 -no-val-show-progress Problem3_terminate_at_Y_with_predecessor.c 
...
[value] Semantic level unrolling superposing up to 100 states
[value] Semantic level unrolling superposing up to 200 states
[value] Called Frama_C_show_each_cycle_found({1})
^C

Incidentally, on this example, this new method is much faster than the method involving a counter. It is not clear that this should always be the case, and on difficult examples, the two methods can perhaps be tried in parallel.

Annex: an exhaustive list of states for Problem 3

Below is the list of unique states displayed when the statement

Frama_C_show_each_state( a1, a4, a0, a15, a29, a10, a16, a22, a2, a17, a25, a7, a14, a19, a20, a8, a23, a21, a24, a13, a9, a28, a26, a6, a27, a12, a3, a11, a5, a18 );

is inserted in the while loop of Problem3.c.

[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{0},{0},{0},{1},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{0},{0},{0},{2},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{0},{0},{1},{0},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{0},{1},{0},{0},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{0},{1},{0},{1},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{0},{1},{0},{2},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{1},{0},{1},{1},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{1},{0},{1},{2},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{1},{1},{0},{1},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{1},{1},{0},{2},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{0},{1},{1},{1},{2},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{0},{0},{0},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{0},{0},{1},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{0},{0},{1},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{0},{1},{0},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{0},{1},{2},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{1},{0},{0},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{1},{0},{1},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{1},{0},{2},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{1},{1},{1},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{0},{1},{1},{2},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{0},{0},{1},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{0},{0},{2},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{0},{0},{2},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{0},{1},{2},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{1},{0},{1},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{1},{0},{2},{0},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{1},{0},{2},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{1},{1},{0},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{1},{1},{1},{1},{1},{1})
[value] Called Frama_C_show_each_state({1},{1},{1},{1},{1},{1},{1},{1},{0},{0}, {0},{0},{0},{0},{0},{0},{0},{1},{1},{1}, {1},{1},{1},{1},{1},{1},{2},{0},{1},{1})


This post benefited from discussions with Sébastien Bardin, Sven Mattsen, Benjamin Monate and Boris Yakobowski.

Tuesday, August 21 2012

Participating in the RERS 2012 competition: liveness questions

This third installment discusses the diagnosis of liveness properties of particular C programs, using Frama-C's value analysis, for the RERS 2012 competition.


[Note added in September 2012: the diagnostic found for the example in this post disagrees with the diagnostic found later and sent to the competition organizers. I am unable to reproduce the result obtained here. Perhaps it was obtained with a version of Frama-C that had a bug.]

Liveness properties 101

Here is another extract of two properties found in the property list attached to Problem 1 in the competition:

(G ! oW)
output W does never occur
----------------------------------------

(F oW)
output W occurs eventually

To tell you how unaccustomed to temporal properties I am, I first read the second formula as a negation of the first one. There are redundancies in the lists of formulas to evaluate, and I thought this was one more of them. It is of course not the case: (F oW) is not the negation of (G ! oW).

(G ! oW) says that there is no execution path that leads to W being output; the negation of that is that there is at least one such execution path. By contrast, (F oW) expresses that all execution paths have a W being output at some point, a much stronger property. Evaluated on a single infinite execution, the two formulas would indeed be each other's negations; evaluated on a program, that is, quantified over all its executions, they are not.


Indeed, (F oW), expressing that a state in which W is output is eventually reached, is a typical liveness property. Checking this kind of property is at odds with traditional abstract interpretation practices, where the analysis is considered done when a fixpoint has been reached: the fixpoint does not tell whether all executions eventually leave the loop. It only tells what is guaranteed for those executions that leave the loop.

Semi-solution: a simple program transformation

Please bear with me and consider the simple modification of Problem1.c below, applied on top of the previous instrumentation:

--- Problem1.c	2012-08-08 10:15:00.000000000 +0200
+++ Problem1_terminate_at_W.c	2012-08-21 17:00:38.000000000 +0200
@@ -1,9 +1,8 @@
 void assert(int x)
 {
   if (!x)
     {
-      Frama_C_show_each_assert_reached();
       /*@ assert \false; */
     }
 }
 
@@ -594,6 +593,8 @@
 
         // operate eca engine
         output = calculate_output(input);
 
+	// continue only if output not W yet
+	/*@ assert output != w ; */
     }
 }

We remove the user message in our assert() function because we want to temporarily forget about assertions. Besides, by instrumenting the main loop, we make the program stop when the output is W.

You should be able to convince yourself that the modified program terminates if and only if the original program does not admit any infinite execution that never outputs W (for any infinite sequence of values assigned to variable input).


Earlier posts (1, 2) in this blog pointed out how Frama-C's value analysis could be made to forget that it was an abstract interpreter, and verify the termination of programs.

Application

Since everything is ready for this particular property of Problem1.c, let us run the verification. The command to prove that the program terminates is as follows:

$ frama-c -val -obviously-terminates -no-val-show-progress Problem1_terminate_at_W.c
...
[value] Semantic level unrolling superposing up to 400 states
[value] done for function main
[value] ====== VALUES COMPUTED ======

We should have used a timeout, because the analysis fails to terminate when the analyzer is unable to prove that the target program terminates. Nevertheless, we were lucky: the target program terminates, the analyzer is able to verify as much, and we can conclude that there are no infinite executions of the original Problem1.c that fail to emit W.
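
A possible precaution for next time (the one-hour budget is arbitrary; timeout is the coreutils utility, which kills the command when the budget is exhausted):

$ timeout 3600 frama-c -val -obviously-terminates -no-val-show-progress Problem1_terminate_at_W.c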

Assertions and consequences on liveness properties

The previous section only tells us that there is no infinite execution that does not emit W. Depending on the organizers' intentions, this may or may not answer the question of whether (F oW) holds.

It depends on whether you count an execution that reaches one of the assert(0); calls in the program as, well, an execution. If it counts as an execution to which the formula should apply, then an execution that does not emit W and terminates on such an assert(0); certainly does not eventually output W.

There is an argument to say that it shouldn't count, however: although it is impossible to understand what these programs are supposed to do, assertions usually characterize illegal behaviors. Perhaps the LTL formulas are to be evaluated only on the legal, infinite behaviors of the system?

In the latter case, we are done: (F oW) holds because W is eventually emitted in all legal executions. In the former case, the additional question is whether an original assert(0); is reached before W is output, which is a simple reachability question on the instrumented Problem1_terminate_at_W.c. The answer is then that (F oW) does not hold, because there are plenty of assert(0); calls that are reached without W having been emitted.


We e-mailed the organizers and they said it was the simpler case: only infinite executions need be taken into account for evaluating LTL formulas, and interrupted executions can be ignored. Our answer for the formula (F oW) of Problem 1 in the competition is therefore that the property holds.

Proving that something does not eventually happen

In the first part of the post, we set out to prove that the liveness property (F oW) was true. Verifying that the property is false requires a different approach. With the toolset provided by Frama-C, I only see the scheme below. We reduce the problem to the seemingly harder but actually easier problem of computing the maximum number of iterations before W is emitted.

  1. Obtain the number N of reachable states for the system by running the value analysis on the original program.
  2. Instrument the program, as previously, making execution terminate when W is emitted, and also maintain a transition counter, incremented at each iteration of the loop (a sketch follows the list).
  3. Using sound and complete propagation on the instrumented program, either reach a fixpoint (with a finite bound computed for the counter) or find an execution with N+1 transitions that does not emit W. The former case means that the program eventually emits W. The latter means that, in the transition graph, there is a reachable cycle along which W is not emitted, and therefore an infinite execution that does not output W.
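
Here is a sketch of the instrumented main loop for steps 2 and 3, in the style of the diffs above (any_int() and calculate_output() are as in the prepared programs; w stands for the code of output W, and N must be replaced by the concrete state count obtained in step 1):

    int counter = 0;
    while (1)
    {
        int input = any_int();
        /*@ assert input == 1 || input == 2 || input == 3 ||
                   input == 4 || input == 5 || input == 6 ; */
        output = calculate_output(input);
        // stop propagation along executions that emit W
        /*@ assert output != w ; */
        counter++;
        // a violation here exhibits an execution of N+1 transitions
        // that never emits W, hence a reachable cycle without W
        /*@ assert counter <= N ; */
    }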


This technique likely gets pretty heavy in practice. Adding a counter that may go up to N+1 to a system that already has N states can probably cause the new system to have on the order of N² states. This segues into at least one, perhaps two more blog posts, to answer the questions “what is the value analysis actually doing when it is applied on these questions?”, “how much work would it be to do the same thing closer to the metal, executing the C function as compiled instead of laboriously interpreting it?”, and “why don't we make a C model checker while we are at it?”

Monday, August 20 2012

Participating in the RERS 2012 competition: reachability questions

This post is a follow-up to this one. It begins to describe our participation, using Frama-C's value analysis, in the RERS 2012 competition.

In theory

The competition features nine target programs of increasing difficulty. Some of the questions to be answered about the target programs are reachability questions. This is the case for all calls to assert(0); in the program: for each call, the question is exactly whether it can be reached in an execution.

In addition, some LTL formulas are offered about each target program. The goal here is to individually diagnose each LTL formula as true or false when applied to the program. Some of the LTL formulas actually express reachability properties. Consider the following property, from the list that applies to problem 1 (small-easy):

(G ! oU)
output U does never occur

This formula holds if a program state in which U is output is never reached. It suffices to instrument the program a little bit to recognize when this happens, and the problem is reduced to a reachability problem, only in reverse: the property holds if the states are unreachable, and does not hold if one of the states can be reached.
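
As a sketch (assuming, as in the prepared programs below, that outputs are encoded as integers and that a constant u holds the code of output U), the instrumentation can be a single test after the output is computed:

        // operate eca engine
        output = calculate_output(input);
        // the analyzer prints a message each time it propagates a state
        // in which U is output; with a complete analysis, the absence of
        // the message proves that no such state is reachable
        if (output == u)
            Frama_C_show_each_U_reached();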

In practice

The good news is that the value analysis is, at its core, designed to answer reachability questions. More specifically, it is designed to answer unreachability questions: it computes supersets of the values of variables that can reach each point in the target program, and it warns if dangerous instructions can be reached. It can guarantee that an assert(0); call is not reached.


Preparing the program for analysis as described in the manual and many times on this blog may result for instance in the following changes: removing the headers, defining a function assert(), removing the printf() calls so that one does not need to provide the function.

--- Problem1.c~	2012-07-31 01:32:42.000000000 +0200
+++ Problem1.c	2012-08-08 10:15:00.000000000 +0200
@@ -1,5 +1,12 @@
-#include <stdio.h> 
-#include <assert.h>
+void assert(int x)
+{
+  if (!x)
+    {
+      Frama_C_show_each_assert_reached();
+      /*@ assert \false; */
+    }
+}
+
 
 	// inputs
 	int a= 1;
@@ -582,14 +589,11 @@
     {
         // read input
         int input;
-        scanf("%d", &input);        
+	input = any_int();
+/*@ assert input == 1 || input == 2 || input == 3 || input == 4 || input == 5 || input == 6 ; */
 
         // operate eca engine
         output = calculate_output(input);
 
-        if(output == -2)
-        	fprintf(stderr, "Invalid input: %d\n", input);
-        else if(output != -1)
-			printf("%d\n", output);
     }
 }

We assume that an execution that reaches an assert(0); is interrupted, and that input at each cycle is a number between 1 and 6. This corresponds to our understanding of the description of the programs in the competition.


As previously hinted, the value analysis is designed to answer reachability questions negatively only. It is intended to guarantee that dangerous operations are unreachable (i.e. the program is safe). In doing so, it computes whether any statement is reachable, even non-dangerous ones, and with what variable values each of these statements is reached.

In static analysis, this is called “soundness”. The value analysis does not in general guarantee that any statement is reachable. However, for the particular program above, instrumented in this particular way, with the right command-line options, the value analysis is “complete” in addition to being sound: in addition to only diagnosing as unreachable statements that are actually unreachable, it only diagnoses as reachable statements that are actually reachable.

Being complete in addition to being sound means that the value analysis is able to answer the question (G ! oU) negatively as well as positively. If the value analysis was only sound, it could only guarantee that a particular assert(0); call is not reached, or that U is never output. Being complete means that it can in addition guarantee that an assert(0); call is reached in some execution or that U is output in some execution.

More formulas to evaluate

There are more complex LTL formulas in the list that express reachability properties, sometimes modulo trivial instrumentation. Below are two of them:

(! oZ WU (oZ WU (! oZ WU (oZ WU (G ! oZ)))))
output Z occurs at most twice
----------------------------------------

(! oU WU oZ)
output U does never occur before output Z

The second formula expresses that a state where U is emitted without Z ever having been emitted is not reached. To verify this property, one simply needs to add to the target program a boolean variable that remembers whether Z has been emitted before.
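
A sketch of that instrumentation (a hypothetical fragment; z and u stand for the integer codes of outputs Z and U):

    int z_emitted = 0;   // set once Z has been output at least once
    while (1)
    {
...
        output = calculate_output(input);
        if (output == z)
            z_emitted = 1;
        // violated exactly when U is output without Z ever having been output
        /*@ assert output != u || z_emitted == 1 ; */
    }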

The first is a complex LTL formula with a simple English interpretation. The program can be instrumented with a counter for the number of times Z is output, and the property to check is whether a state where Z has been emitted three times is reached.
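
Similarly, a sketch for the first formula (same conventions):

    int z_count = 0;     // number of times Z has been output so far
    while (1)
    {
...
        output = calculate_output(input);
        if (output == z)
            z_count++;
        // a reachable violation means Z can be emitted a third time
        /*@ assert z_count <= 2 ; */
    }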

Do not make completeness claims about Frama-C's value analysis at home

You may have noticed that the value analysis' manual rarely offers precision guarantees, and in particular never describes the conditions in which it is complete.

Therefore, an independent participant in the competition who wasn't a value analysis developer would not be able to assert that the value analysis is behaving completely. Such a participant could only hope to answer reachability questions negatively, finding that (G ! oU) is true (when it is actually true) or finding that a particular assertion is never reached (when it is actually unreachable). This is the direction that matters when trying to establish that error states are unreachable, to verify the safety of software. The competition organizers implicitly acknowledge that this kind of property can be difficult to establish otherwise when they say, about weighting answers:

The weighting scheme gives you a way to express your personal confidence in your results. e.g. if you have found a feasable path to some error, this is safe, and you should weight it 10. Lifeness properties or stating that certain errors are non-feasible is of course more risky.

Here we are using a sound analyzer to establish that certain errors are non-feasible, so in our case there is little risk involved in attributing the maximum weight to these answers. There is in fact more risk when answering that some execution path leads to some error since the value analysis, even when it happens to be complete, is not designed to produce a trace that we could double-check. But we are confident for these errors too because of all the Csmith testing that we applied earlier in the year.


And by the way, this blog post does not count as documentation: it is still undocumented in what circumstances Frama-C's value analysis is complete. The documentation is already long enough describing how to use the value analysis soundly, and soundness is its defining property. There is no point in enumerating the very specific circumstances in which it is complete. The conditions are unlikely to occur simultaneously in real life, and if you think that this competition contradicts this statement, have another look at the shape of the competition's target programs.

Monday, August 6 2012

Understand LTL? Join us!

Here is yet another software verification competition. If you are a specialist in the verification of temporal properties, and you have been regretting not snatching that easy Xbox 360 (with Kinect!) in 2011*, this is your chance to make up for it!


We have the reachability part of the problems nearly done** and we are looking for someone who can reduce (at least some of) the temporal properties in the challenge to something that can be verified with the value analysis. We expect that part to be expensive, so if you know how to do it, we should get together and start on this as soon as possible.


(*) I am not saying that a deep understanding of temporal properties would have helped you much in 2011, but it cannot hurt to be able to think clearly about these things.

(**) No, really, 95%. We're almost there.

Friday, July 27 2012

Oxygen is stricter about types and why you should get used to it

I have just sent a list of change wishes (1, 2) to a static analysis competition mailing-list, and that reminded me of a blog post I had to write on the strictness of the type-checker in the upcoming Frama-C release Oxygen. This is the blog post.

This post is not about uninitialized variables

The static analysis competition in question evaluates each participating tool on many testcases. A good number of these testcases are rejected by the current Frama-C. This is why I was writing to the mailing-list: Frama-C cannot participate in the competition if the testcases are not changed (Frama-C does not have to participate, of course. It's just that it looks fun and we would probably like to be in it).

In fact, many of the testcases were already problematic for Frama-C version 20111001 (Nitrogen), the version in the current Debian, Ubuntu, FreeBSD, NetBSD and other Unix distributions. Indeed, a lot of the testcases rely on uninitialized variables for entropy, which this post by Xi Wang and this post by Mark Shroyer show is wrong. Instead of the problem that is supposed to be detected (or not), Nitrogen detects the uninitialized use. I covered this already; this blog post is not about uninitialized variables (keep reading!).

This post is about C type-checking

While trying to get a list of uninitialized-variable-using testcases, I realized that something had changed since my last evaluation of the competition's benchmarks. Many of them were now rejected at type-checking!

The new problem is that many testcases in the benchmarks call functions without having provided a prototype, and some ultimately define the called function with a type incompatible with the type inferred at the call site. Frama-C used to be lenient about typing issues, but after fixing one soundness bug too many caused by this leniency, we have decided to throw in the towel. Virgile described one of the typing issues for which Frama-C used to be too lenient. It was not the only one.
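
To make the issue concrete, here is a hypothetical two-file example of the kind of mismatch involved (not taken from the competition's benchmarks):

    /* caller.c: no prototype for f is in scope, so the call below is
       type-checked as if f had been declared int f(int). */
    int main(void)
    {
        return f(1);
    }

    /* callee.c: the actual definition is incompatible with the type
       inferred at the call site; calling f through the implicit
       declaration is undefined behavior. */
    double f(double x)
    {
        return x;
    }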

This is an unusual position to take. In the article A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World, the makers of a commercially successful bug-finding tool state:

Our product's requirements roughly form a "least common denominator" set needed by any tool that uses non-trivial analysis to check large amounts of code across many organizations; the tool must find and parse the code, [...]

The above is wise: to be maximally useful, the static analyzer should choke on as few idiosyncrasies as possible. The analyzed code can be assumed to compile with some compiler, and to go through some tests (there must still be a few of these, right?). Why warn when a function is called without a prototype, if the compiler accepted it? Why reject when the function's implementation has a type that is incompatible with the type that was inferred at the call site? This is probably not the issue the user is looking for (if it was, the user would have fixed it when eir compiler warned about it).

Oxygen is strict and that's it

Well, our answer is that we tried and we found that it was too much work to try to be sound and precise with these constraints, as exemplified by Virgile's blog post. Show us a tool that accepts your weakly typed program, and we will show you a tool that probably isn't sound and precise at the same time (we have kept the examples that led to the decision. These examples demonstrate real issues masked by lenient typing. If you think you have found a sound analyzer that is at the same time conveniently permissive on typing, it will be our pleasure to provide the examples for you to try).


We hope that Frama-C will still be useful with these new restrictions on typing. Fortunately for us, there are more real worlds than the expression “the Real World” in the cited article's title might lead you to think (and this quip should not be taken as a reproach towards the authors of “A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World”, a very useful document with an appropriately chosen title considering its intended public).
