Frama-C news and ideas


Monday, September 2 2013

The case for formal verification of existing software

Perry E. Metzger takes a look at formal verification. This is good stuff; there is a lot to agree with here.

However, agreeing with Perry's post alone would not make a very interesting counterpoint. If agreeing were the only thing I intended to do, I might not even have written this post. Instead I intend to add, and I apologize in advance for the predictability of my views, that while creating formally verified software from scratch is useful, verifying existing software is useful too.

Yes, formally verified software written from scratch can now be large enough in scope to be a compiler or a microkernel, but when verifying existing software, we can tackle the problem from the other end: we can pick any useful software component, and verify that. We can pick a software component so useful that it is already used by millions. If we succeed in verifying it, we have put formally verified software in the hands of millions of satisfied users. Transparently.


Take the example of the SSL implementation I am taking a couple of weeks to finish massaging through Frama-C. It is not as wide in scope as Quark, seL4 or CompCert. Neither am I aiming for the same kind of functional correctness as these projects do: I am only verifying the absence of undefined behaviors in the component, and verifying the functional dependencies of the cryptographic primitives (*).

But PolarSSL is useful. Plus, by its very nature, it is exposed to security attacks (SSL is the protocol behind HTTPS). And the former three examples are full-blown research projects, in contrast to my single person-month effort.

The bonus is that the formally verified PolarSSL can be compiled and embedded with all the same compilers and into all the same firmwares as the earlier non-verified version. It is the same as the non-verified version except maybe for a couple of bugfixes, and the confidence that for an identified usage pattern, no more undefined behavior bugs will ever need to be fixed.

All in all, the formal verification of existing code, despite its differences from its “from scratch” counterpart, has also become a force to be reckoned with.


(*) translation: I use Frama-C option -deps and I compare the result to what I expected
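For readers who have not used that option, here is a toy illustration of the kind of property being checked (the function below is made up; the real cryptographic primitives are of course more involved):

/* The expected functional dependency of this toy "primitive" is that
   each byte of out is computed from the corresponding bytes of a and b,
   and from nothing else (in particular not from some global or secret
   that should not influence it). The -deps option computes the actual
   dependencies so that they can be compared with this expectation. */
void xor_block(unsigned char *out, const unsigned char *a,
               const unsigned char *b, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    out[i] = a[i] ^ b[i];
}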


Acknowledgement: I got the link to Perry E. Metzger's post through Toby Murray.

Thursday, July 11 2013

Arithmetic overflows in Fluorine

There is a C quiz in the middle of this post, lost in the middle of all the reminiscing.

A history of arithmetic overflows in Frama-C

From the very beginnings in 2005, until after the first Open Source release in 2008, Frama-C's value analysis was assuming that all arithmetic overflows produced two's complement results. This seemed the right thing to do at the time.

Then an option was introduced to emit alarms on undefined overflows. John Regehr suggested it after testing one of the Open Source versions. The option was turned off by default. If a value analysis user turned the option on, any undefined arithmetic overflow in the program would be detected and reported as a possible error, with the same gravity as dereferencing NULL or accessing an array out of its bounds.

Later, a helpful reminder was added to the value analysis' output: in the default behavior of not emitting alarms, an informative message was emitted instead—if such an overflow was detected—about two's complement being assumed.

There was one last change in the last release, Fluorine. Actually, two changes: the name of the option for emitting alarms on undefined overflows changed, and the default value changed. The setting is now to emit alarms by default, and can be changed to not emitting them, for instance if the target code is destined to be compiled with gcc -fno-strict-overflow -fwrapv, in which case all overflows happening during execution can be expected to produce two's complement results.


One aspect remains unchanged in the above evolution: the discussion only applies to undefined overflows.

The philosophy was always to analyze programs as they were written, and not to force any change of habit on software developers. The initial choice not to warn about overflows was because we knew there were so many of these—most of them intentional—that we would be deluging the user with what would feel like a flood of false positives.

The gradual shift towards more arithmetic overflow awareness is a consequence of the fact that in C, some arithmetic overflows are undefined behavior. Compilers display increasing sophistication when optimizing the defined behaviors to the detriment of the predictability of undefined ones. To make a long story short, the “overflows produce 2's complement results” heuristic was wrong for some programs as compiled by some optimizing compilers.

In keeping with the same philosophy, “overflows” that are defined according to the C99 standard have always been treated by the value analysis plug-in with the semantics mandated by the standard. Those overflows that the standard says must have “implementation-defined” results are treated with the semantics that the overwhelming majority of compilation platforms give them (and it remains possible to model other architectures as the need arises).
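To give one example of such an implementation-defined result (this snippet is added here for illustration and is not part of the original discussion):

/* Converting an out-of-range value to a signed integer type yields an
   implementation-defined result (C99 6.3.1.3). On the usual two's
   complement platforms with a 32-bit int, the result is
   3000000000 - 2^32 = -1294967296, and that is the value the value
   analysis computes, without emitting an alarm. */
int x = (int) 3000000000u;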

A quiz

Other static analyzers may also warn for arithmetic overflows, but the philosophy can be different. The philosophy may for instance be that any overflow, regardless of whether it is defined or implementation-defined according to the letter of the standard, might be unintentional and should be brought to the attention of the developer.

In the few examples below, the goal is to predict whether Frama-C's value analysis with its new default setting in Fluorine would emit an alarm for an overflow. For extra credit, a secondary goal is to predict whether another static analyzer that warns for all overflows would warn. We assume a 32-bit int and a 16-bit short, same as (almost) everybody has.

int a = 50000;
int b = 50000;
int r = a * b;

unsigned int a = 50000;
unsigned int b = 50000;
unsigned int r = a * b;

int a = 50000;
int b = 50000;
unsigned int r = a * b;

short a = 30000;
short b = 30000;
short r = a * b;

unsigned short a = 50000;
unsigned short b = 50000;
unsigned int r = a * b;

Answers

int a = 50000;
int b = 50000;
int r = a * b;

There is an overflow in this snippet (mathematically, 50000 * 50000 is 2500000000, which does not fit in an int). This overflow is undefined, so the value analysis warns about it.


unsigned int a = 50000;
unsigned int b = 50000;
unsigned int r = a * b;

The multiplication is an unsigned int multiplication, and when the mathematical result of unsigned operations is out of range, the C99 standard mandates that overflows wrap around. Technically, the C99 standard says “A computation involving unsigned operands can never overflow, …” (6.2.5:9) but we are using the word “overflow” with the same meaning as everyone outside the C standardization committee including Wikipedia editors.

To sum up, in the C99 standard, overflows in signed arithmetic are undefined and there are no overflows in unsigned arithmetic (meaning that unsigned overflows wrap around).
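As an added illustration of that wrap-around, with operands large enough that the mathematical result actually exceeds the unsigned range:

unsigned int u = 100000u * 100000u;
/* mathematically 10000000000; with a 32-bit unsigned int the result
   wraps around modulo 2^32 and u is 1410065408. This is defined
   behavior, so the value analysis emits no alarm, although an analyzer
   that warns for all overflows would flag it. */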


int a = 50000;
int b = 50000;
unsigned int r = a * b;

The multiplication is again a signed multiplication. It does not matter that the result is destined for an unsigned int variable, because in C, types are inferred bottom-up. So the value analysis warns about an undefined overflow in the multiplication here.


short a = 30000;
short b = 30000;
short r = a * b;

There is no overflow here in the multiplication because the last line behaves as short r = (short) ((int) a * (int) b);. The justification for this behavior can be found in clause 6.3.1 of the C99 standard about conversions and promotions (the general idea is that arithmetic never operates on types smaller than int or unsigned int. Smaller types are implicitly promoted before arithmetic takes place). The product 900000000 does fit in the type int of the multiplication. But then there is a conversion when the int result is assigned to the short variable r. This conversion is implementation-defined, so the value analysis does not warn about it, but another static analyzer may choose to warn about this conversion.


unsigned short a = 50000;
unsigned short b = 50000;
unsigned int r = a * b;

Perhaps contrary to expectations, there is an undefined overflow in the multiplication a * b in this example. Right in the middle of the aforementioned 6.3.1 clause in the C99 standard, on the subject of the promotion of operands with smaller types than int, the following sentence can be found:

If an int can represent all values of the original type, the value is converted to an int; otherwise, it is converted to an unsigned int.

All values of a 16-bit unsigned short fit a 32-bit int, so the third line is interpreted as unsigned int r = (unsigned int) ((int) a * (int) b);.


Incidentally, things would be completely different in this last example if int and short were the same size, say if int was 16-bit. In this case the third line would be equivalent to unsigned int r = (unsigned int) a * (unsigned int) b; and would only contain an unsigned, innocuous overflow.

Wrapping up

In Fluorine, the option to activate or deactivate the emission of these undefined arithmetic overflow alarms is called -warn-signed-overflow (the opposite version for no alarms being -no-warn-signed-overflow). I felt that providing this piece of information earlier would have rendered the quiz too easy.
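As an illustration, here is what the two settings look like on the command line (the file name is hypothetical; the options are the ones named above, combined with the -val option used elsewhere on this blog):

$ frama-c -val -warn-signed-overflow quiz.c      # Fluorine default: alarms on undefined overflows
$ frama-c -val -no-warn-signed-overflow quiz.c   # assume 2's complement wrap-around instead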

Although Frama-C's value analysis adheres to the semantics of C and only warns for undefined overflows, it is possible to use Frama-C to check for the other kinds of overflows by using another plug-in, Rte, to automatically annotate the target C program with ACSL assertions that express the conditions for overflows. Note that that post pre-dates the Fluorine release and is written in terms of the old options.

Friday, August 24 2012

On writing a dedicated model-checker for the RERS competition

In recent posts, I have shown that Frama-C's value analysis could answer many reachability questions, and some questions that weren't originally phrased as reachability questions, about the programs in the RERS competition.

If you are vaguely familiar with the internals of Frama-C's value analysis, and if you tried analyzing some of the competition programs yourself, you may have noticed that these analyses use only a small fraction of the plug-in's capabilities. The analyzer is only ever propagating abstract states that correspond to singletons. It does juggle with many program states, but the programs here have small states that are encoded in just a few variables (the analyzer would have been able to manipulate the states encoded on a much larger number of variables and would efficiently share in memory the values common to several of the explored states). There are no bit-fields, no structs with padding (both of which might make identical states look different if carelessly handled). The programs obviously do not execute any undefined behavior for lack of any operation that might introduce them. There is a single outermost loop. There is no arithmetic at all.

In favor of the general verifier

A specialized verifier that was designed for just these programs would have a huge opportunity to do a better job on them. On the other hand, the advantage of working on a more general verifier is that it is useful for more tasks. This makes it possible to spend more time improving it. Some of the improvements enhance the analysis of many programs, including the specific programs built only from assignments, equality tests and conditionals considered in the competition. Some of these improvements are too sophisticated to justify implementing in a verifier that only handles programs with assignments, equality tests and conditionals, because such a verifier will never be usable to find that the SHA-3 candidate Skein's reference implementation does not exhibit undefined behavior, that AES may be susceptible to timing attacks (but Skein isn't), where a couple of bugs are in an old undocumented piece of control software, that there is a subtle memory overflow in compression library QuickLZ, or that units of code have the data and control flows mandated by specification.

What a dedicated verifier might look like

In the particular case of these programs, the same propagation done by the value analysis could be done in C, by a program that would incorporate the transition function directly and execute it as compiled code. This would be much more efficient than what the value analysis does, and in this particular case, it would give the same results. From experiments interpreting Csmith programs, the value analysis slowdown with respect to compiled code can be expected to be of the order of 10000.

Accumulating reachable states

In order to reproduce what Frama-C's value analysis does, a dedicated verifier would need to store states that have already been visited, and to be able to recognize, after applying the transition function once more, whether the obtained state was one that was already visited.

In this particular case, this could be done in constant time with a hashset. Note, however, that it is only possible to compute a hash in this specialized case because all states are “singleton” states. If some states represented several values at once, e.g. a1 ∈ {1; 2} a2 ∈ {0; 1}, the good criterion would then be whether the newly computed state is included in one of the previously propagated states. Testing inclusion in one of the previous states cannot be done in constant time with a hashset (hashsets only allow you to test whether the new state is equal to an existing state, and you need to be careful to use compatible equality and hash functions).

Frama-C's value analysis uses a data structure that is generally as efficient as hashsets when the manipulated states are singletons, and that often remains efficient when testing inclusion in one another of states that are not singletons.

Storing states that remain to be propagated

A small thing: the propagation algorithm also requires a workqueue. Any propagation order will do (Frama-C's value analysis propagation order is not specified either), but since C comes with so few data structures, I just thought I would mention it. For a dedicated verifier in C, one would need to find some linked list implementation or write eir own. Frama-C's value analysis may interpret C slower than the code can be executed once compiled, but it comes with ready-to-use data structures for the reached set and for the workqueue. One need not even know they are there to use the analyzer.
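For concreteness, here is a minimal sketch of what the core of such a dedicated verifier could look like in C. Everything in it is an assumption made for the sake of illustration: the state layout, the input alphabet and the transition function are invented, not taken from an actual competition program.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NVARS 4                 /* hypothetical number of state variables */
#define NBUCKETS 65536

typedef struct state { int v[NVARS]; } state;
typedef struct node { state s; struct node *next; } node;

static node *buckets[NBUCKETS]; /* reached set: a chained hashset */
static node *workqueue;         /* states waiting to be propagated */

static unsigned hash(const state *s)
{
  unsigned h = 2166136261u;     /* FNV-1a over the state's bytes */
  const unsigned char *p = (const unsigned char *)s->v;
  for (size_t i = 0; i < sizeof s->v; i++)
    h = (h ^ p[i]) * 16777619u;
  return h % NBUCKETS;
}

/* Return 1 if the state was already reached; otherwise record it and
   return 0. Constant time on average, which only works because every
   state is a singleton, as discussed above. */
static int seen(const state *s)
{
  unsigned h = hash(s);
  for (node *n = buckets[h]; n; n = n->next)
    if (memcmp(&n->s, s, sizeof *s) == 0)
      return 1;
  node *fresh = malloc(sizeof *fresh);
  fresh->s = *s;
  fresh->next = buckets[h];
  buckets[h] = fresh;
  return 0;
}

static void enqueue(const state *s)
{
  node *fresh = malloc(sizeof *fresh);
  fresh->s = *s;
  fresh->next = workqueue;
  workqueue = fresh;
}

/* Invented transition function, standing in for the competition
   program's transition function compiled to native code. */
static state transition(state s, int input)
{
  s.v[0] = (s.v[0] + input) % 7;
  if (s.v[0] == 3) s.v[1] = 1;
  return s;
}

int main(void)
{
  state init = {{0}};
  seen(&init);
  enqueue(&init);
  while (workqueue) {
    node *cur = workqueue;
    workqueue = cur->next;
    for (int input = 1; input <= 5; input++) {  /* hypothetical input alphabet */
      state next = transition(cur->s, input);
      if (!seen(&next))
        enqueue(&next);                         /* propagate new states only */
    }
    free(cur);
  }
  printf("exploration finished\n");
  return 0;
}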

Specific drawbacks

Obviously, this ad-hoc verifier could be expected to be much faster than the value analysis. This makes the experiment tempting. What prevents me from starting work on it is the lack of generality of the resulting tool. Some examples follow.

Undefined behavior

Suppose that you had implemented such a specialized verifier for the competition's programs, and that you were looking for more transition functions written in C to which the same verifier could be applied. You would certainly find some, but would you in general be certain that the transition function never exhibits undefined behavior (not for the first transition, and not for the transition from any reachable state to a successor)? If one isn't certain that the transition function does not cause undefined behavior, from a formal verification point of view, the tool is worthless. An undefined behavior can cause anything to happen. In particular, an out-of-bounds memory access in the transition function can mess up the verifier's reached set or workqueue, and compromise the results of the verification.
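A contrived illustration of the danger (the layout-dependent effect is of course not guaranteed; the point is only that nothing rules it out):

/* If the transition function indexes out of bounds, it can scribble
   over the verifier's own bookkeeping. Whether reached_count is what
   actually gets overwritten depends on how the compiler lays out these
   objects; undefined behavior gives no guarantee either way. */
static int state_vars[4];   /* the system's state */
static int reached_count;   /* the verifier's bookkeeping */

void transition_step(int input)
{
  state_vars[input] = 1;    /* off-by-one when input is 4 */
}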


Any automatic verifier is susceptible to the caveat that a bug in the verifier can compromise results. This is different: your implementation could be flawless, but it would still be susceptible to a bug in the transition function, a bug in the system being verified.

Of course, you could, as a preliminary step in your verification, check that the transition function does not have any undefined behavior for any input state. If you find that there are a lot of different undefined behaviors in C and that it's a bother to detect them all, we have a tool that we have been working on. It also answers reachability questions.

General interference between unrelated traces

Even if you have only “frank” run-time errors to fear—undefined behavior that compilers kindly translate to run-time errors, at least when optimizations are disabled—, a run-time error in one transition will still interrupt the entire dedicated analyzer. You will notice when this happens, and you can add a guard to the particular division by zero or NULL dereference, but this iterative process may end up taking as long as the value analysis for the same result. The value analysis, when encountering division by zero or NULL dereference, makes a note of it, considers the trace as terminated, and goes on to the propagation of the next trace. The end result, a list of run-time errors to worry about and a reachable states set, is the same, but it is obtained in a single pass.

There is also the issue of non-termination of the transition function. The value analysis detects cases of non-termination more or less quickly; again, when the infinite loop is detected it simply goes on to the next trace. With an ad-hoc verifier, if you expect the verification to take days, it may take days before you realize that the verifier is executing an infinite loop in its third considered transition.

Conclusion

In summary, considering the general C verification framework we already have, it looks like it wouldn't be a good investment of our time to develop a dedicated fast verifier for the competition—although it would provide an insightful datapoint if someone did.

Perhaps participating in the meeting will convince us that Event Condition Action (ECA) systems are more generally useful than our current experience has led us to believe. We could work on a verifier for them if there is a need. There is not much code to reuse from a general-purpose C abstract interpreter. There would be ideas to reuse, certainly. I will come to one of them in the next post.

Friday, August 3 2012

assume and assert

The previous post links to a message from Michał Moskal highlighting ACSL constructs that the VCC developers at Microsoft Research had either regretted the absence of or found superfluous while re-designing their own annotation language for VCC. In that e-mail, the third item in the “missing” list was:

ACSL is surprisingly missing /*@ assume ... */, we've found it tremendously useful when debugging specifications.

The feedback VCC developers sent us caused us to review our design choices and nudge ACSL a bit. Not having an assume annotation, however, was a conscious decision that remained. I am going to give an explanation that makes it look as if the exclusion is part of a larger, well thought-out plan. I did not think of this explanation at the time, and it only came to me recently while talking about something else.

I am not sure what this says about the plan. At some level the plan was there all along, but for some reason we failed to explain it well, perhaps especially in these circumstances when it would have been useful.

The assume annotation in VCC

Design-by-contract is also sometimes called “assume-guarantee reasoning” (if you know the precise difference between these two, please leave a comment. The only reason I see for separate denominations is that “Design by Contract” seems to be a registered trademark of Eiffel Software in the United States).

This is not the “assume” we are looking for. In ACSL, and now in VCC, the expectations of a function are introduced with the requires keyword. The keyword assume could have been used for that, but this is not the choice that was made, and Michał's e-mail is definitely not saying that ACSL is missing a keyword to introduce the precondition of a function.

Instead, in VCC, assume is the dual of assert. The assert annotation is attached to a specific point in a program and introduces a property that is supposed to hold (that the verifier should endeavor to verify) there:

  stuff;

/*@ assert property ; */

// the verifier tries to prove that "stuff" makes "property" hold.

The assume keyword tells the verifier to take it as granted that a property holds somewhere, and to use that for reasoning about the rest of the program:

/*@ assume property ; */

// "property" is useful to understand what "stuff" does.

  stuff;


Beware that ACSL, since it does not use assume this way, reserves the keyword assumes for something else entirely. When the rest of this post says that ACSL does not have assume, it means that ACSL does not have a VCC-like assume.

Keyword minimalism

It is a worthy goal to minimize the number of keywords forced on you by a BISL—Behavioral Interface Specification Language (the acronym is catching on already). One could argue that the functionality of both assume and assert above can be captured with a single keyword. The verifier, when encountering that keyword, should first try to establish the property as a consequence of the code above. Then, regardless of whether it was able to establish it, it should incorporate it in its hypotheses for the study of the code below it.

This is exactly how Frama-C approaches the issue. Plug-ins that understand ACSL at all both try to prove such properties and then use them as assumptions. The properties are introduced with the assert keyword which, in ACSL, conveys both meanings. This may mean that a bit of redundant work is done, but the resulting annotation language is so beautiful, like a smartphone with only two hardware buttons, that the sacrifice is worth it.


Well, this explanation is not actually quite right. Please read on.

Plug-in collaboration and property status management

There is a deeper explanation than just “each plug-in can do the work both ways and that will save a keyword”. This other explanation is specific to what Frama-C is. Or rather what Frama-C intends to be, for it is a long-term effort.


Frama-C is—or intends to be—a collaborative verification framework. Frama-C makes it convenient to seamlessly subject one same annotated program to different and complementary analysis techniques. In particular, it was always the intention that whichever property was assumed by a plug-in could be proved by another. For the Frama-C platform to be usable this way, it is necessary, but not sufficient, to have a single annotation language with a formal semantics that all plug-ins agree on, to express properties like \valid(p).

It is also necessary to use a single keyword for assumptions and goals, because without that, the user would need to go to the source code and change assume to assert or vice-versa when switching from one plug-in to another. In Frama-C, a plug-in's assumption is another plug-in's goal.
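A toy illustration of this single-keyword approach (the function and the property are made up):

int clamp_nonnegative(int x)
{
  int r = x;
  if (r < 0) r = 0;
  /*@ assert r >= 0; */  /* a goal for the plug-in asked to prove it, and a
                            hypothesis for the next plug-in that analyzes
                            the code below */
  return r;
}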


The Frama-C kernel tracks each program property and its verification status. A property that was only ever assumed, never proved, is an assumption of the whole verification. A property that was assumed by a plug-in and proved by another is just like an intermediate lemma in a mathematical proof.

The devil is in the details, of course, and the details are in the article “Combining Analyses for C Program Verification” (FMICS2012) by my colleagues Loïc Correnson and Julien Signoles.


Trying to prove a goal can still be expensive and unnecessary. There will eventually be a way to mark a property as “assumed” in the Frama-C kernel for efficiency reasons. We are still thinking about the best way to do that.


Thanks to Julien Signoles, Florent Kirchner, Virgile Prevosto, Sébastien Bardin and Michał Moskal for their remarks on earlier iterations of this blog post.

Friday, July 27 2012

Oxygen is stricter about types and why you should get used to it

I have just sent a list of changewishes (1, 2) to a static analysis competition mailing-list, and that reminded me of a blog post I had to write on the strictness of the type-checker in upcoming Frama-C release Oxygen. This is the blog post.

This post is not about uninitialized variables

The static analysis competition in question evaluates each participating tool on many testcases. A good number of these testcases are rejected by the current Frama-C. This is why I was writing to the mailing-list: Frama-C cannot participate in the competition if the testcases are not changed (Frama-C does not have to participate, of course. It's just that it looks fun and we would probably like to be in it).

In fact, many of the testcases were already problematic for Frama-C version 20111001 (Nitrogen), the version in the current Debian, Ubuntu, FreeBSD, NetBSD and other Unix distributions. Indeed, a lot of the testcases rely on uninitialized variables for entropy, which this post by Xi Wang and this post by Mark Shroyer show is wrong. Instead of the problem that is supposed to be detected (or not), Nitrogen detects the uninitialized use. I covered this already; this blog post is not about uninitialized variables (keep reading!).

This post is about C type-checking

While trying to get a list of uninitialized-variable-using testcases, I realized that something had changed since my last evaluation of the competition's benchmarks. Many of them were now rejected at type-checking!

The new problem is, many testcases in the benchmarks call functions without having provided a prototype, and some ultimately define the called function with a type incompatible with the type inferred at the call site. Frama-C used to be lenient about typing issues, but after fixing one soundness bug too many that was caused by the leniency, we have decided to throw in the towel. Virgile described one of the typing issues for which Frama-C used to be too lenient. It was not the only one.
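To make the issue concrete, here is a made-up two-file example of the kind of mismatch in question (it is not taken from the benchmarks). A compiler typically accepts each file, perhaps with a warning about the implicit declaration, the linker does not check types, and the call is undefined behavior; Oxygen rejects the combination at type-checking.

/* caller.c */
int main(void)
{
  return f(42);          /* no prototype in scope: "int f(int)" is inferred */
}

/* callee.c */
double f(double x)       /* incompatible with the type inferred at the call */
{
  return x + 1.0;
}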

This is an unusual position to take. In the article A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World, the makers of a commercially successful bug-finding tool state:

Our product's requirements roughly form a "least common denominator" set needed by any tool that uses non-trivial analysis to check large amounts of code across many organizations; the tool must find and parse the code, [...]

The above is wise: to be maximally useful, the static analyzer should choke on as few idiosyncrasies as possible. The analyzed code can be assumed to compile with some compiler, and to go through some tests (there must still be a few of these, right?). Why warn when a function is called without a prototype, if the compiler accepted it? Why reject when the function's implementation has a type that is incompatible with the type that was inferred at the call site? This is probably not the issue the user is looking for (if it was, the user would have fixed it when eir compiler warned about it).

Oxygen is strict and that's it

Well, our answer is that we tried and we found that it was too much work to try to be sound and precise with these constraints, as exemplified by Virgile's blog post. Show us a tool that accepts your weakly typed program, and we will show you a tool that probably isn't sound and precise at the same time (we have kept the examples that led to the decision. These examples demonstrate real issues masked by lenient typing. If you think you have found a sound analyzer that is at the same time conveniently permissive on typing, it will be our pleasure to provide the examples for you to try).


We hope that Frama-C will still be useful with these new restrictions on typing. Fortunately for us, there are more real worlds than the expression “the Real World” in the cited article's title might lead you to think (and this quip should not be taken as a reproach towards the authors of “A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World”, a very useful document with an appropriately chosen title considering its intended public).

Tuesday, July 17 2012

Code smells smell

This blog post is on the subject of “code smells” and what is wrong with them. I initially wanted to go with the title “Code smells are crap”. That made it clearer that the title was grammatically a sentence, with “code smells” a nominal group followed by a verb. But there seems to be something in this blog's content management system that censors four-letter words in the title of articles, like some sort of administrator or something[1], er, friendly moderator[2].


At the time of this writing, Wikipedia defines a code smell as “any symptom in the source code of a program that possibly indicates a deeper problem”. I do not challenge this definition. Indeed, to the best of my knowledge this is exactly the meaning many researchers (and engineers) working in software engineering intend when they talk of “code smells”. I think, however, that there is something fundamentally wrong, not with the definition, but with scientific use of any notion accurately described by this definition. The issue I see, in so many words, is that it is not falsifiable.

Say that you build a tool that detects “code smells”. You might even call your tool a “static analyzer”, and there might be so many of you and of your tools that “static analysis” is understood by the majority as the craft of building code smell detectors. The tool flags as smelly a diligent implementation of a subtle but efficient computation taken from The Art of Computer Programming. This deserves the name “false positive”, right? Surely your tool does not pretend to judge Knuth's carefully crafted, state-of-the-art algorithms? Ah, but no, it is not a false positive, because the notion is built right in the definition that a code smell only “possibly indicates a deeper problem”. Whatever the tool found is a code smell by virtue of having been flagged by the tool. It was not a problem, but it was a code smell because it could have indicated a problem.


As scientists, what the four-letter word are we supposed to do with such a notion?

Notes

[1] This post was crying so loud for moderation that it would have been a shame not to grant it

[2] a special kind of facetious colleague

Saturday, January 14 2012

Public announcements

Yes, I have read that static analysis post

If you are one of the three persons in the world who have not sent me a link to John Carmack's Christmas post on static analysis, go ahead and read it, it might interest you. You can stop sending me that link now.

The post is very much a text version of the second half of his August QuakeCon keynote, which I prefer, if only, as someone who sometimes has to speak in public, for the amazing delivery. Here's some commentary on the PC-lint part of the post. I was myself going to comment on the "pursuing a Space Shuttle style code development process for game development", because the tools can be the same, or at least take advantage of the same techniques, and just be employed differently to answer different needs. I thought that if I only had time to put my arguments in order, I might convincingly argue this, and rebound on John Regehr's post by characterizing semantic static analyzers as limiting the programmer to the dialect of correct programs. But thinking about it, I realize that I know nothing about game development and that it would be presumptuous to make strong claims there. In the case of Frama-C, and Frama-C is not the only analyzer intended for "Space Shuttle style code" and quality objectives, the total feedback we have is still limited enough to nearly double in one day, and this feedback is still all about Space Shuttle style development and verification. And that day was yesterday.

U3CAT meeting

Yesterday was the day members of the national research project U3CAT gathered to tell each other about recent developments. Several presentations were on experiments done at Atos Origin SA and CS. You can't expect them to reveal too much about the actual code, but yes, it was Space Shuttle-like, and the results are very satisfying.

One of the points that emerged was that we need to invent a word for alarms that are eventually categorized as innocuous, but for which the information to classify them so was not in the source code provided to the analyzer. One actual example yesterday was the division of two values that came from sensors, where Frama-C's value analysis warned that the result of the division might overflow the 16-bit int it was then stored into. Checking with the developers, it turned out that specifications ensure that the values from the sensors will be such that the conversion is safe. Calling this a "false alarm" would be unfair to the analyzer, since the information was not in the source code. I suggest to call this kind of alarm a "verification point". And so the bottom line of that particular experiment on a few thousand lines of C was a couple of true alarms, a great many verification points, and no false alarms. Apparently both the true alarms (that lead to code changes) and the verification points (that did not lead to code changes) were well received by the developers.
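Here is a hypothetical reconstruction of that situation (the real code is not public, so the names and the exact computation are invented):

/* Without the specification-level ranges of sensor_a and sensor_b, the
   value analysis reports that the result of the division may not fit in
   the 16-bit int it is stored into. The sensor specifications, which are
   not in the source code, guarantee that it does fit: a "verification
   point" rather than a false alarm. */
short scaled(int sensor_a, int sensor_b)
{
  return sensor_a / sensor_b;
}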

The thing about Atos Origin and CS is that they are service providers. They will develop and/or verify software the way you contract them to, and while they may use Frama-C a little bit for software quality even if it's not in the contract, they can't use it very much then. By not making Frama-C verification part of the contract, you are effectively tying their hands. If, on the other hand, you contract them to use it as part of the development and/or verification, they already have experts who can do it efficiently (and one thing I am sure both these companies are good at is internally routing tasks to employees who have the corresponding competences).

Conclusion

One wish I have for 2012 is to spend at least some of my time working on making Frama-C useful for non-Shuttle-like code. On the other hand, if you are going to work on Shuttle-like code in 2012, you don't need to wait. Yesterday's meeting was another confirmation that for that purpose, Frama-C is ready to be employed productively now.

Wednesday, January 11 2012

Static analysis benchmarks

Reminiscing

The first benchmark Frama-C's value analysis was ever evaluated on was the Verisec suite, described in "A buffer overflow benchmark for software model checkers". My colleague Géraud Canet was doing the work, hence the facetious-colleagues tag of this post. In retrospect, Verisec was a solid piece of work. Being naive when it comes to statistics, I didn't recognize it at the time, but when I later examined other benchmarks, I developed an appreciation for this one. It does a lot of things right.

  • It contains good examples as well as bad. A static analyzer is for separating good programs from bad ones. It should be obvious that you cannot evaluate how well a static analyzer does using only bad programs.
  • The article's title says it's for "Software Model Checkers" but it does a good job of being accessible to a wide range of tools with varying abilities. Some good/bad program pairs are available in "no pointer, only arrays" variants and in "small arrays only" variants, providing more comparison points between different techniques. This may help discover that some technique detects a wrong behavior with small arrays but does not scale to larger ones, whereas another one does not detect them at all.
  • The entries in the benchmark are moderate-sized snippets of code with unknown inputs. If, by contrast, in most or all of the examples, the inputs were fixed, and assuming that these examples did not escape in an infinite loop, then a detection strategy based on the concrete execution of the program step by step and detection of wrong behaviors on the fly would sail through the benchmark. The benchmark would accurately capture the fact that for such cases, an analyzer can easily be made correct and complete. What would be missing would be that users of static analysis tools turn to them to detect wrong behaviors that can occur with some inputs among many, and not just to detect that there are some wrong behaviors when known (dangerous) inputs are used.


Géraud was using the Verisec benchmark right, too. He spent enough time on each case to uncover some interesting facts. One uncovered interesting fact was a definite bug in one of Verisec's "good" cases, where the targeted bug was supposed to be fixed. The explanation was this: there were two bugs in the real-life snippet that had been used as "bad" case, and only one had hitherto been noticed. The known bug had been fixed in the corresponding "good" case in the benchmark and in real life. The unknown bug had not, as the song of La Palisse might have said.

When a static analysis benchmark is done right, as the Verisec benchmark is, it tries to measure the ability of tools to solve a hard problem: predicting whether something happens at run-time in a non-trivial program with unknown inputs. Furthermore, the reason there is a need for static analyzers and for a benchmark is that humans are not good either at recognizing issues in code. So, if the benchmark is representative of the problems a static analyzer should solve, there exists no perfect oracle that can help decide what the correct answers are on each snippet, and it can be expected to contain a few bugs classified as "good" code.

Is safety static analysis incredibly immature?

In a recent post I compared Frama-C's value analysis to PolySpace and Astrée. It was a single-point comparison in each case, each time using an example chosen by the people who sell the analyzer, an example used on the analyzer's website to demonstrate the analyzer's precision. In that post's comments, Sylvain regretted "that the field of static error-detection for embedded C (as opposed to security-oriented static analysis, Cf. the [NIST SAMATE] test suite) is incredibly immature from a customer point of view".

His conclusion is the opposite of the idea I was trying to express there. I was trying to say that embedded safety static analysis was mature and that all that remained to be done to convince was to stop making unrealistic claims and derisive comparisons. Talk about scoring own goals. This blog post is my penitence for not having been clear the first time.


The field of software security is orders of magnitude more complex and disorganized than that of embedded safety. It deals with programs that are larger and use more difficult constructs (e.g. dynamic allocation). The properties that need to be verified are numerous and widely dispersed:

  • There are the undefined behaviors that cause something to happen that was not intended by the programmer. These can breach any of a system's confidentiality, integrity or availability. They are the same undefined behaviors that Polyspace, Astrée or Frama-C's value analysis detect, give or take that a previous version of Astrée did not detect uninitialized variables or that the Nitrogen version of Frama-C's value analysis (initially intended for embedded code) does not include the modelization of free() I have been working on recently — and therefore does not detect undefined behaviors related to free().
  • There are security issues in code without undefined behaviors, such as the hashing denial of service attack everyone is talking about these days.
  • And as a last illustration of the variety of security properties, timing information leaks, the detection of which I recently discussed, belong to security considerations.

As a digression, we are not worried about the hashing denial of service attack in Frama-C, but the counter-measures in OCaml 3.13 (which allows hash functions to be seeded) are going to be a bit of a bother for Frama-C developers. He that among you never iterated on a hash-table when displaying the results of a regression test, let him first cast a stone.


The point is that software security people need a suite like NIST's to describe by example what their field is about (more than 45 000 C/C++ cases to date). The field of embedded safety does not have a similar suite because it does not need one. AbsInt, the company that sells Astrée, did not wait for me to (gently) make fun of their website before they implemented uninitialized variable detection: they did it of their own volition because they knew Astrée should do it, and they had already released a new version with that feature at the time my colleague Florent Kirchner tried their example and noticed the uninitialized variable.

A non-benchmark

Also, I think it's a slippery slope to call what NIST has a "test suite", because before long someone will call it a "benchmark", and NIST's suite is no benchmark. Many of the examples have definite inputs, making them unrepresentative — specifically, making them too easy to handle with techniques that do not scale to the billions of possible inputs that must be handled routinely in real life. There does not seem to be nearly enough "good" examples in the suite: the few I looked at were all "bad" examples, and based on this sample, an analyzer that always answered "there is an issue in the target code" would get a very good mark indeed if this was a benchmark.


Worse, in some cases, the "bad" line to find is extremely benign and there are more serious issues in the case. Consider entry 1737, one of the few I looked at:

...
void test( char * str )
{
  char * buf;
  char * newbuf;
  char * oldbufaddress;

  buf = malloc( MAXSIZE );
  if ( !buf )
    return;
  strncpy( buf, str, MAXSIZE );
  buf[MAXSIZE - 1] = '\0';
  printf( "original buffer content is : %s\n", buf );
  oldbufaddress = buf;
  buf = realloc( buf, 1024 ); /*BAD */
  printf( "realloced buffer content is : %s\n", buf );
  printf( "original buffer address content is : %s\n", oldbufaddress );
  free( buf );
}

int main( int argc, char * * argv )
{
  char * userstr;

  if ( argc > 1 )
  {
    userstr = argv[1];
    test( userstr );
  }
...

With the proper modelization for printf(), free(), and realloc(), such as I am currently working on, Frama-C's value analysis would warn about both printf() calls. The warning for the first call is because realloc() can fail and return NULL (passing NULL to printf() invokes undefined behavior). In this case, there famously also exists a memory leak problem, as the initial block is not freed, but a memory leak is not an undefined behavior. The warning for the second call is because oldbufaddress becomes indeterminate if realloc() moves the block.


The above is not what the author of the test case wants to hear. What the author of the test case wants to hear, clarified by the CWE-244 reference, is that the realloc() call is dangerous because if realloc() moves the block, it may leave the confidential information that was contained in the block lying on the heap. This is fine, if the case is intended as a starting point for discussion (although it would be better without an undefined behavior unrelated to the specific issue being tested).

One problem is that there is no indication in the program that the block contains anything secret. A static analyzer that warns for this program should warn for any use of realloc(), since this is how realloc() is intended to be used (it is intended to preserve the contents of the block being resized). An optimal analyzer for this issue is grep realloc *.c.

It does not hurt to zero the secret data before you free() a block, and avoid realloc(), in a "belt and suspenders" approach to security, but if you use Frama-C's value analysis properly to verify that the program does not have undefined behaviors, you do not have to do this, and the code will still be safe from heap inspection vulnerabilities. Perhaps using the value analysis "properly" is inapplicable in your case; but you should at least be aware of the design choices here. Case 1737, by itself, just seems to pick one specific circumvention method for a general issue, and then confuses the resolution of the general issue with the application of the specific method. The map is not the territory.
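For completeness, here is a minimal sketch of that "belt and suspenders" approach (the function name and its contract are made up). Note the well-known caveat that an optimizing compiler may remove a memset() whose target is about to be freed; a hardened implementation would use a scrubbing primitive the compiler cannot elide.

#include <stdlib.h>
#include <string.h>

/* Grow a buffer holding secret data without leaving a stale copy of the
   secret on the heap: copy into a fresh block, scrub the old one, free it. */
char *grow_secret(char *buf, size_t oldsize, size_t newsize)
{
  char *newbuf = malloc(newsize);
  if (!newbuf)
    return NULL;
  memcpy(newbuf, buf, oldsize < newsize ? oldsize : newsize);
  memset(buf, 0, oldsize);   /* scrub the secret before releasing the block */
  free(buf);
  return newbuf;
}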


Were this case used as a benchmark, it would very much remind me of primary school tests written by teachers without any deep understanding of the matter they were teaching, and where you had to guess what was in these teachers' minds at each of their unclear questions. Here, for instance, the value analysis warns that the program may use a dangling pointer in the second printf() — intended to symbolize the information leak. If the program instead called malloc() and tried to use the contents of the new block without having initialized it, then the value analysis would warn about that. Generally speaking, any way that I know of the contents of the initial block can be read from the heap is flagged at some point as an undefined behavior, but this is not what case 1737 tests, because case 1737 tests "analyzer warns about realloc()", and the correct answer is to warn about realloc().

Conclusion

Coming back to the question of what field is immature from a customer point of view, if your interest is in software security, NIST's suite certainly can be used as a basis to understand what kind of static analyzer you may be after. You can try candidate static analyzers on cases that you have selected as most representative of your own problems. But it won't save you from reading each candidate's documentation, understanding how each of them work, and carefully examining their results. Any mechanical interpretation of the results will lead you to pick grep realloc without even understanding case 1737 and its limitations.


The issue for comparing embedded safety static analyzers is that the design space is too large, and the design space for security static analysis certainly isn't any smaller. And if someone ever boasts that their security-oriented software analyzer covers 95% of the NIST SAMATE "benchmark", they are probably selling snake oil. Just saying.

Friday, November 25 2011

Static analysis tools comparisons

It is very easy to make snap judgments about other static analyzers when you are yourself working on one. You do not even need to act in bad faith: you just try the other tools on the programs that you worked hard to give interesting results on — difficult examples, on which even your own tool does not give the best result. You intend to give the other tool a passing grade if it produces a result as approximated as yours, and to acknowledge that it's not as bad as you thought if it gives a better result.


The flaw in this line of reasoning is that even if you were honest enough to pick an example on which it was possible to improve on your own tool's result, you picked one on which your tool begins to produce a result. The example is automatically in the language subset that your analyzer handles, probably in the language subset it targets. The comparison is still heavily biased in favor of your tool, whether you realize it or not. This is how you may end up thinking that Frama-C's value analysis only manipulates intervals: by only trying it on programs for which it displays intervals (programs, say, translated to C from this toy language).

Astrée

It is more instructive to take examples from other tools' websites, the examples that they use to advertise, and to see what your own tool can do with them. Let us do this in this post.

Feeling lucky?

Here is one:

[astree.png: the example C program from the Astrée website]

Let's try it in Frama-C's value analysis:

$ bin/toplevel.opt -val astree.c 
[kernel] preprocessing with "gcc -C -E -I.  astree.c"
astree.c:4: Floating-point constant 1.0e38 is not represented exactly.
     Will use 0x1.2ced32a16a1b1p126. See documentation for option -warn-decimal-float
...
astree.c:4: accessing uninitialized left-value: assert \initialized(&x);
astree.c:4: completely undefined value in {{ x -> {0} }} (size:<32>).
...
[value] Values at end of function main:
          NON TERMINATING FUNCTION

The webpage adds: "Astrée proves that the above program is free of run-time errors when running on a machine with floats on 32 bits." That's reassuring. If we had not read that, we might have feared that it contained undefined behavior due to the use of uninitialized variable x (and perhaps others, but really, there is such a big problem with x that the value analysis did not look any further). "Undefined behavior" means that anything can happen, and you should consider yourself lucky if it's a run-time error. With an uninitialized float variable, you just might be lucky (and get a run-time error, instead of the proverbial demons flying out of your nose): the compiler may produce code that reads bits from the stack that happen to represent a signaling NaN, and then applies a floating-point operation to it, causing a run-time error.

Miscommunication

To be fair, the release notes for the latest version of Astrée said that it could now emit alarms for uninitialized variables. I look forward to an example illustrating that. Perhaps that example will demonstrate an unsoundness in Frama-C's value analysis, who knows?

Also, the sentence "Astrée proves that the above program is free of run-time errors…" really means "Astrée does not find any of the run-time errors that it detects in the above program", and both its designers and its users at Airbus Operations S.A. are aware of this meaning. It's just that the somewhat pompous formulation is a little bit unfortunate on this example.


Coincidentally, I sometimes notice the same phrase used, about "Frama-C" this time, by makers of comparable tools, as below:

Frama-C "proves" [something ridiculous] about [this program].

It's usually just another instance of someone running other tools on the examples eir tool is good at, of course, and might more objectively be phrased as:

An external plug-in of Frama-C, separately distributed, written by a PhD student so brilliant that he was able to produce useful software during his PhD in addition to making several solid contributions to the science of program analysis, "proves" [something ridiculous] about [a program that's out of scope for this plug-in].

It doesn't have the same ring to it. I see why toolmakers prefer to peddle their own imperfect creations using the first sentence.

Polyspace

So many rhetorical questions

Polyspace's website has this example:

[polyspace.png: the example C program from the Polyspace website]

On the webpage, the explanation of why this example is interesting is a little confusing:

Now, rounding is not the same when casting a constant to a float, or a constant to a double:

* floats are rounded to the nearest lower value;

* doubles are rounded to the nearest higher value;

Now, are they, really? Floating-point has an aura of arbitrariness about it, but that would be a little bit over the top, wouldn't it? Isn't it just that the number 3.40282347e+38, when rounded to float in round-to-nearest mode, rounds downwards, and when rounded to double in round-to-nearest mode, rounds upwards?

Letting the compiler be the judge

According to the colored comments in the program, Polyspace guarantees that there is an overflow. Well, let's try it then:

$ bin/toplevel.opt -val ps_ovfl.c -warn-decimal-float all -float-hex
...
ps_ovfl.c:4: Floating-point constant 3.40282347e+38f is not represented exactly.
     Will use 0x1.fffffe0000000p127
ps_ovfl.c:5: Floating-point constant 3.40282347e+38 is not represented exactly.
     Will use 0x1.fffffe091ff3dp127
...
[value] Values at end of function main:
          x ∈ 0x1.fffffe0000000p127
          y ∈ 0x1.fffffe0000000p127

Frama-C's value analysis does not agree about the example. Ah, but this is about floating-point subtleties. It's not a cut-and-dried case of uninitialized variable. Perhaps Polyspace is right. However, we may remember that in floating-point, an overflow is not a run-time error: it produces an infinity, a special kind of value that some numerical algorithms handle correctly but most don't. It makes sense for a static analyzer to treat infinities as errors (in fact, the value analysis does too), but if a static analyzer says that an overflow certainly occurs (red alarm), and we are unsure of this diagnostic, we can run the program and see what happens. It's not undefined in the C sense.

#include <stdio.h>

void main(void)
{
    float x, y;
    x = 3.40282347e+38f; // is green
    y = (float) 3.40282347e+38; // OVFL red
    printf("%a %a %d", x, y, x == y);
}
$ gcc ps_ovfl_modified.c 
...
~/ppc $ ./a.out 
0x1.fffffep+127 0x1.fffffep+127 1

No, no overflow, it seems. Polyspace's red alarm led us to expect the output 0x1.fffffep+127 inf 0, but that's not what happens.

An explanation why Polyspace makers think there should be an overflow

The explanation on the webpage continues:

In the case of the second assignment, the value is cast to a double first - by your compiler, using a temporary variable D1 -, then into a float - another temporary variable -, because of the cast. Float value is greater than MAXFLOAT, so the check is red.

Unfortunately, this is not how round-to-nearest works. When a double, here 0x1.fffffe091ff3dp127, falls between two floats, here 0x1.fffffep127 (MAXFLOAT) and +infinity, the nearest one should be picked.

A philosophical conundrum

Which is nearest between a finite float and an infinite one? I would like to recommend this question for the next philosophy A-levels (or equivalent worldwide). In the world of IEEE 754, the answer is terribly pragmatic. Infinity starts at the first number in floating-point format that can't be represented for having too high an exponent. For single-precision, this number is 0x1.0p128. The midpoint between this unrepresentable number and the last finite float 0x1.fffffep127 is 0x1.ffffffp127, and therefore, in round-to-nearest, all numbers above 0x1.ffffffp127 round up to +inf, whereas numbers below 0x1.ffffffp127 round down to 0x1.fffffep127.

Experimentation

The number 0x1.ffffffp127 is representable as a double, so that we can obtain a decimal approximation of it with a simple C program:

#include <stdio.h>
int main(void){
  printf("%.16e\n", 0x1.ffffffp127);
}

$ gcc limit.c
$ ./a.out
3.4028235677973366e+38

My claim is that regardless of what Polyspace's website says, and although the number 3.402823567797336e+38 is much larger than the number in their example, you can round it to a double, and then to a float, and you will still get the finite 0x1.fffffep127 (MAXFLOAT). Conversely, the number 3.402823567797337e+38 is on the other side of the fence, and rounds to +inf.

We can try it out:

#include <stdio.h>
int main(void){
  printf("%a %a\n", (float) 3.402823567797336e+38,  (float) 3.402823567797337e+38);
}

$ gcc experiment1.c
$ ./a.out
0x1.fffffep+127 inf


What about the number 3.4028235677973366e+38? It is even more amusing. It is below the fence, and converted directly to float, it gives 0x1.fffffep127. However, converted to double, it rounds to 0x1.ffffffp127. Then, when rounding to float, the "even" rule for picking between two equidistant floats results in 0x1.0p128 being picked, and the final float result is therefore +inf.

#include <stdio.h>
int main(void){
  printf("%a %a\n", 3.4028235677973366e+38f, (float) 3.402823567797366e+38);
}

$ gcc experiment2.c
$ ./a.out
0x1.fffffep+127 inf

This is a case of double rounding, just like in question 5 here.

But really, Polyspace wouldn't be that wrong, would it?

I'm afraid it would. Unlike the webpage on Astrée's site, I cannot find any plausible explanation for the misunderstanding. The example is wrong and the explanations that follow make it worse. It could be the case that Polyspace gives results that cover all possible floating-point rounding modes, just like the value analysis option -all-rounding-modes does. There would be an overflow then (in round-upwards mode). Unfortunately, Polyspace would have to emit an orange, since in round-to-nearest and in round-downwards, there is no overflow. It would have to produce results similar to those below, predicting that either the overflow or the normal termination of the program can happen.

$ bin/toplevel.opt -val ps_ovfl.c -all-rounding-modes -float-hex
...
ps_ovfl.c:5:[kernel] warning: overflow in float (0x1.fffffe091ff3dp127).
                  assert -0x1.fffffe0000000p127f ≤ (float)3.40282347e+38 ∧
                  (float)3.40282347e+38 ≤ 0x1.fffffe0000000p127f;
...
[value] Values at end of function main:
          x ∈ 0x1.fffffe0000000p127
          y ∈ 0x1.fffffe0000000p127

Conclusion

In conclusion, we should be careful about the way we express ourselves: static analysis has been oversold enough already. We would do well to learn the rules we pretend to enforce, and we should avoid picking on other toolmakers, something which is detrimental for all and that I do not completely avoid in this post. However, I use each tool's own example, which I think is a good sign that there is still some major overselling going on, despite how detrimental this has already been in the past. I would tell you about examples on PVS-Studio's website, but at least it has different goals and does not use the verb "prove" in the sense "my imperfect implementation based on an idealized formalization computes that…" all the time.

To end on a lighter note, here is a funny picture to meditate on.

Monday, October 17 2011

Features in Frama-C Nitrogen, part 1

Here is a series of posts that highlight interesting features in the recently released Frama-C Nitrogen 20111001. There is new functionality in Nitrogen that we hope will entice both existing and prospective users. Other improvements will only have meaning for existing users. I will start off with two items from this second category.

Nitrogen compiles with OCaml 3.10.2.

The only non-optional dependency for compiling Nitrogen is an OCaml compiler. But which one? Despite its now long history, the OCaml language is still evolving relatively rapidly. We have never been happier with our choice of implementation language: it is evolving in a direction that suits us, and it is evolving conservatively (compare Perl 6, PHP 5, Python 3, …). On the other hand, Frama-C is large and contains tricky code; the plug-in architecture with dynamic loading in general on the one hand, and the hash-consing programming technique on the other, are only two examples. It is large enough and tricky enough to reveal many of the subtle differences between any two versions of the OCaml compiler.

Nitrogen compiles with the latest version of OCaml, of course. That's 3.12.1 at the time of this writing. We already know that it won't compile as-is with the future OCaml 3.13 (a patch will be released in due time). Similarly, support for older versions has to be gained, version by version. If you have only written medium-sized OCaml programs, you could easily underestimate the difficulty of this. I was lucky enough not to have to deal with it much during this cycle, but some of my colleagues had to. It always is a struggle. Sometimes the equivalent of #define/#if constructs from C pre-processing would help, but this is not idiomatic in OCaml. Again, the only non-optional dependency for compiling Nitrogen is an OCaml compiler, so we won't use fine solutions such as Cppo.


For Windows and Mac OS X, OCaml is not part of the operating system, so we ship the version we like together with the binary package (if we make one). With Unix, the issues are different: there are too many flavors for us to distribute binaries, but there are efficient package systems and distributions to painlessly bring in required dependencies. Often, Frama-C itself is packaged in binary or source form as part of these distributions, thanks to the work of many volunteers. It may take some time for packages to be created for Nitrogen, and some users do not want to wait. Linus Token, for instance, may rely on a computer he bought two years ago. Frama-C works fine on two-year-old computers, as seen in the next section. Linus installed his Unix distribution of choice when he bought his computer, and now he expects Frama-C's sources to compile with the OCaml version from his distribution (OCaml programmers can be expected to have the latest OCaml compiler installed from its own sources, but Linus is not an OCaml programmer). The Unix distribution installed two years ago was on average 3 months old at that time, and it may have been frozen for stability 3 months before being released. For Linus, Frama-C has to compile with a 2.5-year-old compiler. And then there are industrial users who like to trail a little bit on the Linux front, but at the same time want all the latest Frama-C features. For Nitrogen, we chose to retain compatibility with OCaml 3.10.2, released in February 2008, and with all OCaml versions released since.

The value analysis is up to four times faster on realistic examples

The second Nitrogen feature I want to highlight today is the value analysis' speed. Here is a quick and dirty comparison for programs that could already be analyzed with Carbon. There are new alarms and precision improvements in Nitrogen, but I made sure that in this comparison the two versions were doing the same amount of work.

For this comparison, I did not cherry-pick benchmarks. I looked for programs of varying sizes in the archives of the blog, and used the three that came first, having decided in advance that I wouldn't dismiss any results I didn't like. Each analysis was run three times, and the median time was kept. This is on a Core 2 mobile processor, and the frequency is pinned at 1.5GHz through CoolBook. In plain English, the timings are for an oldish but not antiquated notebook.

The options I used were:

-slevel 1000 -val count.c
-slevel 10000 -val -no-results -cpp-command "gcc -C -E -m32 -Dprintf=Frama_C_show_each" sha1.c
-val -slevel 1000 -slevel-function main:0 *.c -cpp-command "gcc -m32 -C -E -I. "

The programs count.c and sha1.c came from this post. For sha1.c, I had to disable the endianness detection code, implemented with a switch, to put the Carbon and Nitrogen versions on an equal footing. With the Carbon version, there used to be a difficult choice to make in the presence of a switch, remember? I could also have used option -simplify-cfg for the Carbon execution, but then the analyzers would not have been analyzing exactly the same program. The Skein-256 example is the verification that an arbitrary number of calls to Skein_256_Update(..., 80) never cause a run-time error, using Josh Ko's instrumentation.

                             count.c     sha1.c      Skein-256

Carbon                       0m0.491s    0m2.862s    1m10.201s
Nitrogen without Zarith      0m0.410s    0m1.724s    0m37.782s
Nitrogen with Zarith         0m0.313s    0m0.962s    0m16.700s

Total speed-up                 1.5          3            4

How did the value analysis improve so much overall? As exemplified by the timings with and without Zarith, this is the result of many small enhancements and refinements at all levels.

Conclusion

As promised, the two features described here are only worth noting for faithful users. It only matters that Frama-C compiles with OCaml 3.10.2 if you intend to compile it at all, and you only care that it is much faster than before if you were already using it. Even so, some existing users may not notice them. These are the kind of features I like, because they do not complicate the user interface (the value analysis benchmarks above use the same options and produce the same results) and improve the software nonetheless. Existing users and people who try Frama-C for the first time with Nitrogen will both have a better time of it thanks to the effort spent on the two points described in this post, and on tens of others, big and small, some of which I hope will receive their own minute of spotlight in this series. I have already been forced to allude to one other small improvement, a better treatment of switch in the value analysis, which I like even better because it removes the need to learn an option.
