Automatic testcase reduction for compilers: the story so far
A previous post linked to a discussion of manual testcase reduction when the program being debugged is a compiler (and the input demonstrating the faulty behavior is thus a C program). Since then, quite a few related events took place:
In a comment to the linked post, someone mentioned automatic testcase reduction with “delta debugging”, and Bruce Dawson, the author of the original post, replied:
Given that the compiler sometimes changed what register it was using I’m not sure that the effort of writing an accurate test for this bug would be worthwhile. It really didn’t take me much time to reduce this bug — half an hour perhaps? Once you don’t need to link the code you can slash-and-burn at a fearsome rate, and my brain is more flexible at recognizing legitimate variations of the bug than any script would be.
And I further commented that on the other hand, a generic “accurate [automatic] test” only needs to be written once, and can serve multiple times. This post intends to show how right Bruce is: subtle pitfalls lurk when writing the acceptance test for automatically reducing a C program. Some of the dangers are theoretical enough when the program being debugged is a compiler, but become more annoying in practice when it is, say, a static analysis framework.
Meanwhile, John Regehr wrote a blog post of his own about the automatic reduction of testcases for compilers. The remarks in his post are quite orthogonal to the remarks in this one. My remarks are in the context of reducing testcases for Frama-C, which is a bit of a misappropriation, and reveals specific issues in the same way that using Csmith to find bugs in Frama-C provided some new insights.
The general idea explained on an example
Consider the example of a file from the Linux kernel that Frama-C's front-end chokes on. The original file causing the problem is 1.5MiB large after pre-processing and we would like automatic reduction to produce a smaller file that exhibits the same problem, so that we can show it to our Frama-C front-end expert.
The symptom of the problem is the error message below:
include/linux/delay.h:39:[kernel] failure: invalid implicit conversion from void to int
We can build a script that tests for the presence of this line in Frama-C's output. This script will then tell the reducer whether the latter is still on the right path or accidentally erased the interesting behavior.
A first obvious remark is that the test should not check for the presence of “include/linux/delay.h:39:[kernel] failure: invalid implicit conversion from void to int” exactly. If it does, then a reduced program can only pass the test by making it look as if the problem is at that file and line, which will prevent removal of
#line file directives in the testcase, and also hinder the removal of irrelevant lines that only seem useful because they happen to make the error message come from the expected line.
Instead, the test should be only the presence of “failure: invalid implicit conversion from void to int” in the output of Frama-C. We think we will be happy with any testcase that produces this output, so we should configure the reducer with a test that detects this output and this output only. This idea will be a recurring theme in this post.
C-Reduce is the best reducer of C programs out there. C-Reduce takes as input a C program to reduce and a script implementing the interestingness test. A pre-condition is that the initial program has to be interesting. In exchange, C-Reduce produces the smallest interesting program that it can, removing and moving around bits of the original program and checking that the result remains interesting. When applied with the aforementioned 1.5MiB file from the Linux kernel and with the interestingness test
frama-c t.i | grep "failure: invalid implicit conversion…", C-Reduce produces the minimal file below:
0 ? 0 * 0 ? : 0 : 0;
This superficially looks like something that we could show to our Frama-C front-end expert, although we will generate a better reduced program later in the post. If we make the mistake of grepping for
"delay.h:39:[kernel] failure: invalid implicit conversion…" instead, C-Reduce still does a good job, preserving only a single additional line that happens to cause the warning to be localized where the interestingness test expects it:
0 ? 0 * 0 ? : 0 : 0;
C-Reduce's only quirk may be that it sometimes reduces too much. In both reduced programs above, the function
fn1() is implicitly typed as returning
int according to now obsolescent C90 rules. In addition, the syntax
e1 ? : e2 is not standard C but a GCC extension that happens to be accepted by the Frama-C front-end.
We have been lucky
We have in fact been lucky in the above reduction: the reduced program looks like something that should be accepted by Frama-C (despite the
int ellipsis and the strange
? : syntax), and instead of accepting it, Frama-C emits an error message. This is exactly what we were hoping to obtain from the reduction process.
Consider the eventuality of the reduced program being an invalid program, incorrectly converting a
void expression to
int. Frama-C would be behaving correctly by emitting its warning on the reduced program. C-Reduce would have behaved according to specification, but we would be left without anything to show our front-end expert. We would have been had.
This is not a problem in C-Reduce, but a problem with us not using it properly. The reason the original 1.5MiB program is interesting is that it is accepted by GCC and rejected by Frama-C. This is the criterion the interestingness test should implement. When the interestingness test passed to C-Reduce implements an approximation of the notion of interestingness we really desire, we have only ourselves to blame if C-Reduce turns up an invalid reduced file that Frama-C correctly rejects.
Stop reducing so much, C-Reduce!
We do not want the Frama-C front-end expert we report the bug to to be sidetracked by considerations on obsolete syntax or strange extensions. The best testcase to show em is a short but perfectly standard C program. We can make this preference part of our interestingness test.
The first idea may be to reject any file that is not strictly C99-compliant. That would be a file that
gcc -std=c99 -pedantic accepts to compile, I think. However, the unreduced testcase here is a file from the Linux kernel. The unreduced testcase does not pass this test.
Delta debugging is for preserving interestingness along size-reduction steps, not for making interestingness appear. Therefore, we cannot demand a reduced file that passes
gcc -std=c99 -pedantic if the original file to start from does not pass it either.
Note: techniques borrowed from C-Reduce and from genetic programming might make interestingness appear out of uninteresting C files. Structural coverage of the program under test, beyond the ability to crash it, may make a useful, non-boolean fitness function. Or better, for a chosen assertion inside the program under test, coverage of the statements that lead to the assertion. This process would be different from delta debugging but might generate files that reveal bugs in static analysis frameworks. I hope someone will look into it someday.
However, simply compiling the tentative reduced file we already have with well-chosen compiler options reveals that GCC can already tell this file is not perfect. It emits the warnings snippets
forbids omitting the middle term (regarding the syntax extension for
? :) and
warning: return type defaults to (regarding implicit return types of functions). Our original file, although it originates from the Linux kernel, is not so ugly that it causes these warnings. Consequently, we can test for these warnings in an interestingness test that accepts the original file and will guide C-Reduce towards a clean reduced file.
If we do this, we obtain:
/* */ void
0 ? 0 * 0 ? 0 : 0 : 0;
The Frama-C issue at hand was reported using the reduced C program as evidence, and was promptly fixed by Virgile Prevosto, Frama-C front-end expert extraordinaire.
C-Reduce is now parallel
The latest version of C-Reduce can explore several possible reductions in parallel. This is particularly interesting when the program under test takes a long time. A bug deep into the value analysis plug-in may take minutes to occur in the original program, and C-Reduce typically launches thousands of instances in the process of reducing a large program to a small one.
However, it does not pay to run too many instances of memory-bandwidth-limited programs in parallel. The computer I ran this example on has 4 cores and two execution threads by core (it relies on hyper-threading). The system thus sees 8 processors. By default, C-Reduce launches up to 8 instances in parallel. Is this the best choice?
If I use option C-Reduce's option
-n to choose different levels of parallelism instead, I obtain, for this particular memory-intensive(*) program being debugged (Frama-C) and this particular initial C program, the following timings:
# instances Time (s)
For this particular use, C-Reduce's default of using all available execution threads is faster than using only one execution thread. It is also slower than all other parallelization choices. The fastest reduction is obtained for 3 parallel instances, and it is almost twice faster than the non-parallel version.
There is some randomness to the reduction algorithm, however, and the experiment should be repeated with varying initial programs before conclusions are drawn.
(*) Frama-C can sometimes use as much as half the memory a modern web navigator typically uses. Imagine!
Making each instance terminate earlier
Even with an optimal choice of the degree of parallelism, automatic reduction can be time-consuming. A sub-optimal interestingness test for a bug that takes several minutes to appear in Frama-C can make the entire reduction take days, or exhaust the operator's patience altogether. This is only machine time we are talking about, but sometimes a human is waiting on the result of the reduction. In this section, we point out two useful tricks to bear in mind, especially when the program under test can take a long time for some inputs. The fact that the program under test takes a long time can be the problem being investigated, but does not have to.
First trick: using grep option
Often, as soon as a particular diagnostic message has been emitted, it is possible to conclude that the testcase is interesting. In the example above, as soon as the message “failure: invalid implicit conversion…” has been seen in Frama-C's output for a C program that GCC otherwise accepts, the input C program is known to be interesting.
By default, a shell script such as
frama-c … test.c | grep "interesting message" allows Frama-C to finish its analysis before returning a boolean value. A simple improvement is to use grep's
-l option, that causes
grep to exit just after the matching line has been seen.
Contrast these two Unix sessions:
$ cat | grep interesting ; echo $?
nothing to see here
this is interesting
this is interesting ← this matching line is repeated by grep
one more line
two more lines
three more lines
cat is taking a long time
0 ← error code of grep, printed by echo, indicating an interesting line was seen
In the above session, the
grep command was still scanning its input after an interesting line had been seen, until I indicated that the “analysis” was finished with control-D. Below, I repeat the experiment, this time using
grep -l to detect the keyword “interesting”.
$ cat | grep -l interesting ; echo $?
nothing to see here
this is interesting
(standard input) ← grep indicates a matching line has been seen on stdin
one more line
0 ← error code of grep, printed by echo, indicating an interesting line was seen
Thanks to the
-l option, the
cat command was terminated abruptly before it had finished, when emitting the line immediately after the interesting line (
cat was terminated for trying to write to a closed file descriptor). Actually, this is a slight chink in the armor: the program whose output is piped into
grep needs to write one more line after the interesting line in order to be terminated. If the interesting behavior is Frama-C taking a long time for some inputs, and the interesting log line is one that says “warning: things are going to take an infinite time from now on”, and nothing is ever printed after that line, Frama-C will not be terminated. My Unix-fu is too weak for me to offer a solution where the program under test is terminated as soon as possible, but the next subsection offers a palliative anyway.
Second trick: using a timeout
The initial input program may fail (i.e. “be interesting”) within, say, 5 minutes. This does not prevent slight variations of the program to take forever to fail, or to take forever to succeed. In each case, we do not want the reduction process to waste days of computations testing a single variation. We want the reduction process instead to try as many variations as possible quickly in order to converge to a short one as soon as possible. The strength of delta debugging is in iteration.
A simple idea here is to define the interestingness property as “the C program causes the interesting message to be emitted by Frama-C within 10 minutes”. Any program that has already been analyzed for 11 minutes is uninteresting by this definition. This property can be implemented simply with the Unix command
Strictly speaking, the interestingness property should be completely reproducible, whereas this one is not. However, C-Reduce still works very well in practice with such a limit in place in the interestingness test.
To limit the risk of breaking internal C-Reduce assertions, the time the initial program takes to be interesting should be multiplied by a safety factor. Because C-Reduce launches several instances in parallel, each instance runs a bit slower than if it was alone, and execution times can vary a bit. I have never had any trouble using as the timeout twice the time the initial program takes to fail.
Automating the manual
Let us take a step back and look at what we have been doing above. We started with one obvious interesting property, and we progressively refined it in order to delineate the really interesting programs. We initially thought we were interested in C programs that cause Frama-C to emit the message “include/linux/delay.h:39:[kernel] failure: invalid implicit conversion from void to int”, but it eventually turned out that we really wanted Frama-C to emit “failure: invalid implicit conversion from void to int” within 10 minutes for a C program that GCC compiles without emitting warnings about
? : syntax extensions or implicit return types.
We have been obtaining the additional sub-properties in a trial-and-error fashion: we did not know that we wanted any of them until C-Reduce produced a program that did not have it. Still, I believe that:
- These additional sub-properties would always come from the same repertoire in actual C-Reduce practice. It is possible to build a large catalog of these desirable properties so that the end user does not need to re-discover each of them.
- Before the reduction per se even starts, the original program should be tested for all the above niceness properties. All the niceness properties the original program has should be preserved in the reduced program by making them part of the interestingness test. That is, if the original program does not warn about implicit return types, the reduced program should not warn about implicit return types. On the other hand, if the original program does warn about implicit return types, then the bug being investigated may be about implicit return types, so it is not possible to require the reduced program to avoid the warning.
- Still, it would be a useful feature if the reducer could detect that a niceness property has appeared during reduction, and strove to preserve it from there on. This was perhaps easy to implement in the early non-parallel version of C-Reduce: it looks like it could be done entirely by side-effects in the interestingness test. Since, C-Reduce has been parallelized, which greatly helps with reduction speed but means that implementing this feature becomes more complicated, if it is at all desirable.
This post collects a number of useful tricks for using C-Reduce, in particular when reducing bugs for Frama-C. C-Reduce is already immensely useful as it is, but this post also indicates some future work directions to make its use more automatic and more convenient yet.