← Sign extension A conversionless conversion function →

Of compiler warnings discussions

Pascal Cuoq - 26th Apr 2013

A discussion I often have addresses the question of whether a compiler¹ can warn for all possible illegal actions a program could take at run-time within a specific, well-understood family² .

(1) or some other piece of software that receives a program as input and, in finite time, does something with it, for example a static analyzer

(2) for example the family of out-of-bounds memory accesses

For some reason my opinion differs significantly from the consensus opinion, but since this is my^H^H the Frama-C blog, I thought I would nevertheless put it here for exposure.

The Halting problem is only relevant as a metaphor

The Halting problem is defined for Turing machines. A computer is not a Turing machine. A Turing machine has an infinite ribbon whereas a real computer has finite memory.

Imagine a real physical computer running a program. Imagine that you sit at the keyboard typing new inputs indefinitely swapping the keyboard for a new one when wear and tear make it unusable putting the computer on a battery and moving it to another planet when the Sun explodes rigging your will to ensure your descendants will carry on the typing this sort of thing.

Seems like a glimpse of infinity right? But the mathematical abstraction for this computer is not a Turing machine it is a boring Mealy machine (an automaton with inputs and outputs). A Turing machine has an infinite ribbon it can write to and read from again. The above set-up does not. It is a Mealy machine with 256^(10^9) states (assuming a 1TB hard-drive).

A sound analyzer is not impossible

A sound analyzer for out-of-bounds accesses is an analyzer that warns for every out-of-bounds access. Making one is easier than it seems. There is always the solution of warning for every dereferencing construct in the program (in C these would be * [] and ->).

A useful sound analyzer is not impossible

That last analyzer although sound was pretty useless was it not? General theoretical results on the undecidability of deciding of the run-time behavior of Turing machine would predict —if allowed to be intrapolated to computers and programs— that it is impossible to build a sound and complete analyzer.

A complete analyzer for out-of-bounds accesses warns only for accesses that can really be out-of-bounds for some program inputs.

Indeed a sound and complete analyzer is difficult. It is impossible for Turing machines and it is hard even for mere computers and programs.

Luckily the theoretical difficulty is only in reference to arbitrary programs. It is theoretically difficult to make the promise to warn for all out-of-bounds accesses and to warn only for real out-of-bound accesses for arbitrary programs but an analyzer does not have to do that to be useful. It only needs to deal with programs people are actually interested in (an excessively small subset of all programs).

And it does not even need to be sound and complete to be useful. An analyzer that warns for all out-of-bounds accesses with only a few false positives is still useful (and still sound. “Complete” is what it is not).

Conversely an analyzer that warns only for actual out-of-bounds accesses and only omits a few is still useful and complete although it is not sound. But I have not spent the last 8 years working on such an analyzer so it is not my job to advertise one. Whoever wrote one can tout it in its own blog.

This blog contains examples of analyses made using a sound analyzer. The analyzer does not emit so many false positives that it is unusable. The analyses involve a variety of real programs: QuickLZ (there were a few false positives but a bug was found and was fixed) the reference implementation for Skein (no bug was found but there were no false positives) Zlib (in this ongoing analysis there are plenty of false positives but one extremely minor issue has already been found) …

But this cannot work in presence of dynamically loaded libraries!

Well yes in order to do “program analysis” the program must be available. It is in the name.

Speaking of binary-only libraries or programs the counter-arguments I have written above apply the same to the analysis of a program in binary form. Analyzing a binary program is significantly harder than analyzing source code but a binary analyzer does not have to be unsound it does not have to be incomplete and it does not have to decide the Halting problem. The examples I have provided are for analyses of source code because that's what I work on but look for the blog of a binary static analyzer and there is no theoretical reason you won't be pleasantly surprised.

But this cannot work if there are inputs from the user / from a file!

Of course it can. The more you know and don't tell to the analyzer the more its warnings may subjectively feel like false positives to you but the analyzer can always assume the worst about any inputs:

#define EOF (-1) 
/*@ ensures \esult == EOF || 0 <= \esult <= 255 ; */ 
int getchar(void); 
int main(int c  char **v) 
{ 
  int a[5]  i; 
  i = getchar(); 
  a[i] = 6; 
}

$ frama-c -pp-annot t.c -val 
... 
t.c:10:[kernel] warning: accessing out of bounds index [-1..255]. assert 0 ≤ i < 5;

The first four lines would not have to be written everytime. They model a function from the standard library and would only need to be updated when the specification of the standard C library changes.

Speaking of which these four lines are not a very good modelisation of function getchar(). It is possible to do better than this and the above is only an example simplified for clarity but best not re-used.

Conclusion

In conclusion it is possible to make sound static analyzers because theoretical undecidability results are about Turing machines because people are not interested in arbitrary programs and because a couple false positives once in a while do not necessarily make a sound analyzer unusable. The theoretical counter-arguments apply to binary code to concurrent programs and when reversing the use of the “sound” and “complete” adjective but I only have examples for the static analysis of sequential C source code.