Frama-C news and ideas

To content | To menu | To search | Frama-C Home

Thursday, January 24 2013

Customers, customers, customers

The recent posts on extremely minor undefined behaviors in zlib neatly tie in with a discussion on John Regehr's blog about the future-proofitude of C and C++.


Another insightful post in this regard is this one by Derek Jones. Derek claims that the situation is different for proprietary compilers with paying customers. The argument rings true to me. The only proprietary compiler I know that competes with GCC or Clang in terms of aggressive optimization (at the cost of breakage of existing code) is the Intel C++ compiler. But that is not a counter-example: I did not pay for icc, nor do I know anyone who pays for it in order to use it in production.

Wednesday, January 16 2013

Bad zlib, bad! No compare pointer!

In a previous post, we remarked that the decompression function of zlib, for some inputs, computes an invalid pointer. But at least it neither dereferences it nor compares it to another pointer.

Or does it?

Recipe for an invalid pointer comparison

Instrument

Take an ordinary zlib library version 1.2.7, and instrument it according to the diff below.

$ diff -u zlib-1.2.7/inffast.c zlib-modified/inffast.c
--- zlib-1.2.7/inffast.c	2010-04-19 06:16:23.000000000 +0200
+++ zlib-modified/inffast.c	2013-01-16 23:37:55.000000000 +0100
@@ -64,6 +64,9 @@
       requires strm->avail_out >= 258 for each loop to avoid checking for
       output space.
  */
+
+int printf(const char *fmt, ...);
+
 void ZLIB_INTERNAL inflate_fast(strm, start)
 z_streamp strm;
 unsigned start;         /* inflate()'s starting value for strm->avail_out */
@@ -316,6 +319,7 @@
     strm->next_in = in + OFF;
     strm->next_out = out + OFF;
     strm->avail_in = (unsigned)(in < last ? 5 + (last - in) : 5 - (in - last));
+    printf("out=%p, end=%p\n", out, end);
     strm->avail_out = (unsigned)(out < end ?
                                  257 + (end - out) : 257 - (out - end));
     state->hold = hold;

This instrumentation causes two pointers, out and end, that are compared at that point of the library, to be printed. Apart from that, it does not change the behavior of the library. The library behaves the same when the printf() call is not there, albeit less observably.

Following the documentation, write an ordinary main()

Prepare a main() function that uncompresses a buffer deflated_buffer more or less according to the official tutorial:

main(){
  unsigned char out_buffer[CHUNK_OUT];

  printf("out_buffer: %p\n", out_buffer);

  /* allocate inflate state */
  ...

  ret = inflateInit(&my_strm);

  if (ret != Z_OK)
    return ret;

  my_strm.next_in = deflated_buffer;
  my_strm.avail_in = 40;
  my_strm.next_out = out_buffer;
  my_strm.avail_out = CHUNK_OUT;
	
  ret = inflate(&my_strm, Z_FINISH);

  ...
}

You can download the file zlib_UB.c.

Secret ingredient

Season with a specially crafted deflated buffer:

unsigned char deflated_buffer[40] = {
  120,
  156,
  67,
  84,
...

Test

Having built the instrumented zlib, you can compile and link the main() function:

$ gcc -I Downloads/zlib-modified/ zlib_UB.c Downloads/zlib-modified/libz.a
$ ./a.out 
out_buffer: 0x7fff5bd9fa34
out=0x7fff5bd9fa33, end=0x7fff5bd9fa5e

Although we followed the documentation when writing the main() function that calls it, zlib appears to be, at inffast.c:320, comparing a pointer out that points before out_buffer to a pointer end that points inside out_buffer. In fact, the pointer out has been computed as an offset of out_buffer, namely, out_buffer - 1. Zlib is not supposed to do this, for two reasons:

  1. The compiler could place out_buffer at address 0, or at the beginning of a segment in a segmented architecture. Then out_buffer - 1 would evaluate to 0xffffffffffff and appear to be larger than end, or cause a hardware exception. To be fair, in general the same value still ends up being stored in strm->avail_out, because both branches of the expression (out < end ? 257 + (end - out) : 257 - (out - end)) compute the same thing.
  2. Even without the array out_buffer being placed at address 0—which is rather unlikely—the compiler could suddenly become too smart for its own good and generate code that treats the condition out < end as false in this case. This has happened before. Key quote: “Some versions of gcc may silently discard certain checks for overflow. Applications compiled with these versions of gcc may be vulnerable to buffer overflows.” In this case, the invalid pointer out_buffer - 1 would be interfering with a compiler optimization, so anything might happen. Since it is strm->avail_out the statement is computing, a buffer overflow might result.

Fix

This problem and the previously described one have the same cause; they stem from the following optimization in inffast.c:

#ifdef POSTINC
#  define OFF 0
#  define PUP(a) *(a)++
#else
#  define OFF 1
#  define PUP(a) *++(a)
#endif

Defining POSTINC makes both the minor issues go away.

Conclusion

The defect identified here is related to the previous one, and like it, it is probably innocuous. Nevertheless, we initially set out to formally verify zlib, so we should stick to it: a formally verified zlib wouldn't compute negative offsets of arrays, much less compare them to valid offsets of the same array. So far, I have not been able to confirm any of the potential issues listed by Frama-C's value analysis in zlib with POSTINC defined, so that version of the library may reveal itself to be the formally verified version we are after.


This post was proofread by Fabrice Derepas and Olivier Gay.

Monday, January 14 2013

Why verify zlib?

As an interlude in the zlib verification thread, this post asks two questions. Is there any chance of finding a bug in zlib, and does it matter?

Could there be a bug in zlib?

It is not entirely impossible. The previous post in this blog pointed to a relatively minor issue in the zlib source code, and the next post will show a more serious (albeit still minor) one.

In July of 2005, a buffer overflow in zlib was found by Tavis Ormandy. Key quote: “An attacker could construct a malformed data stream, embedding it within network communication or an application file format, potentially resulting in the execution of arbitrary code when decoded by the application using the zlib library.”

If you have been following the zlib thread, this is exactly the context we are investigating: decompression of a maliciously crafted stream.

Does it matter?

In 2005, people did think it mattered. The CVE summary points out that zlib is involved when decompressing a PNG image (most users think of PNG images as innocuous objects that they should be allowed to look at regardless of origin without compromising the host computer). Other examples are the “Windows binary of Virtual Floppy Drive 2.1” and “Official Windows binaries of curl”. I do not think that list is exhaustive. Does the HTTP/1.1 protocol not allow for zlib-compressed data to be exchanged between client and server? A malicious client could cause a buffer overflow on the server with a simple request, and/or a malicious server could cause a buffer overflow in the client by replying to a request with specially crafted pretend-compressed data.

And a benchmark for the future

I am making a deliberate effort not to look at that bug now, but it goes without saying that when the verification of zlib is done, I will check that the same method find the bug that was in the pre-July-2005 version.

Thursday, January 10 2013

Code review finds minor issue in Zlib

In an article about comparing static analyzers (long-time readers of the blog, do not follow the link. It is still the same old article)…

Where was I? Ah, yes. One reason why it is extremely tricky to compare static analyzers is that a static analyzer is for identifying undefined behavior in C programs, but while some families of undefined behavior are extremely dangerous, C has many families of undefined behaviors. Some of them have been rather harmless until now, and programmers voluntarily use constructs that cause them. One example, briefly alluded to in a certain article, is the computation of an invalid offset of a pointer.

It is not alright for a C program to compute an invalid offset as in char t[50]; ...(t-1).... This is technically undefined behavior. Unfortunately, it is not alright for a C static analyzer to warn for this either, because the programmer is likely doing it on purpose. With some compilation options, the QuickLZ compression library does it (on purpose).

Frama-C's value analysis choice is to remain silent for the computation of t-1 but to warn if it is ultimately compared to another pointer—it could, for instance, equal &c against the programmer's expectations. Or, of course, the value analysis warns if the invalid pointer is dereferenced.

Et tu, zlib?

Here is the beginning of file inffast.c in library zlib:

#ifdef POSTINC
...
#else
#  define OFF 1
#  define PUP(a) *++(a)
#endif

When you see *++p in a C program, you can expect an invalid offset computation nearby. Indeed:

    in = strm->next_in - OFF;
    ...
    out = strm->next_out - OFF;

Zlib's documentation allows the user of the library to pass in strm->next_in and strm->next_out pointers to the beginning of allocated blocks. This can cause a minor undefined behavior for some inputs.

A “bug” not found by static analysis

Nothing would be easier than implementing the detection of this undefined behavior in Frama-C, but seeing more and more code that invoke it does not convince me that programmers want to know about it. Anyway, I just stumbled on macro PUP as I was analyzing zlib.

Post-scriptum

A way to exhibit the undefined behavior is to use the input vector in a future post together with a more stringent instrumentation:

--- zlib-1.2.7/inffast.c	2010-04-19 06:16:23.000000000 +0200
+++ zlib-modified/inffast.c	2013-01-21 00:56:25.000000000 +0100
@@ -64,6 +64,9 @@
       requires strm->avail_out >= 258 for each loop to avoid checking for
       output space.
  */
+
+int printf(const char *fmt, ...);
+
 void ZLIB_INTERNAL inflate_fast(strm, start)
 z_streamp strm;
 unsigned start;         /* inflate()'s starting value for strm->avail_out */
@@ -98,6 +101,7 @@
     state = (struct inflate_state FAR *)strm->state;
     in = strm->next_in - OFF;
     last = in + (strm->avail_in - 5);
+    printf("strm->next_out=%p and zlib is about to subtract one from it\n", strm->next_out);
     out = strm->next_out - OFF;
     beg = out - (start - strm->avail_out);
     end = out + (strm->avail_out - 257);
@@ -316,6 +320,7 @@
     strm->next_in = in + OFF;
     strm->next_out = out + OFF;
     strm->avail_in = (unsigned)(in < last ? 5 + (last - in) : 5 - (in - last));
+    printf("out=%p, end=%p\n", out, end);
     strm->avail_out = (unsigned)(out < end ?
                                  257 + (end - out) : 257 - (out - end));
     state->hold = hold;

When I compile and execute I get:

$ gcc -I Downloads/zlib-modified/ zlib_UB.c Downloads/zlib-modified/libz.a
$ ./a.out 
out_buffer: 0x7fff59e83a34
strm->next_out=0x7fff59e83a34 and zlib is about to subtract one from it
out=0x7fff59e83a33, end=0x7fff59e83a5e

Tuesday, December 18 2012

zlib progress: one comma misused

A few days ago, I announced that the world had been using an unverified zlib library for too long, and that we were going to fix this. This post is the first progress report. I have found a harmless undefined behavior in zlib and I have learnt something about the comma operator. This post only covers the latter aspect.

The comma operator

The comma operator can be used to build an expression e1, e2. When e1, e2 is evaluated, first e1 is evaluated (for its result and side-effects), then the result is discarded and e2 is evaluated.

The comma counts as a sequence point. Sequence points are the points in a C program in-between which it is illegal to both read and write the same memory location; for instance x++ + x is undefined behavior because there is no sequence point between the post-incrementation x++ and the use of the value of x. Coming back to the comma operator, since it counts as a sequence point, I thought this might allow writing (x++, y) + (y++, x), but I am not so sure about that now.

Easing the value analysis' job on zlib

Formally verifying such a library as zlib is difficult. Indeed, otherwise, considering its usefulness and fame, if it was easy, someone would already have advertized that they had formally verified it.

One verification plan that I wanted to try out on this library is the model-checking approach that we first experimented with last summer.

The words “model-checking” may evoke different things to different people. In this context, they mean the propagation of only unabstracted known-to-be-feasible states, so that if a division by zero is reached in the program, we know for sure that this dangerous division can be reached for some inputs. In other words, we are talking here about a technique to avoid false positives, in addition to avoiding false negatives as we always do.

There is a catch: the propagation of unabstracted states does not really scale to a codebase the size and complexity of zlib. My plan was to ease the job of Frama-C's value analysis by carefully re-introducing a little bit of abstraction. I intended to modify the zlib library so that the modified version has all the same opportunities to do something wrong with pointers as the original version, but the modified version always computes zero as output sequence. This would make many memory states that would otherwise have been different become identical, something that the value analysis efficiently recognizes and takes advantage of.

A beginner's mistake

And this is how I found myself modifying *output_pointer++ = expression_with_side_effects; so that it would only write 0 to the output buffer, but otherwise have all the same side-effects as the original. I chose to replace it with the statement below. And I was pretty pleased with myself.

  *output_pointer++ = expression_with_side_effects, 0;


Unfortunately, after one additional cycle of verification, it turned out that in the C syntax, assignment has higher precedence than comma. My modified statement was parsed as:

  (*output_pointer++ = expression_with_side_effects), 0;

It behaved exactly as the original statement in assigning non-zero values to the output buffer. I had instead intended:

  *output_pointer++ = (expression_with_side_effects, 0);

Thursday, December 6 2012

Formally verifying zlib

In a blog post earlier this year, John Regehr wonders when software verification will finally matter. He means “formal verification”, I am pretty sure. “Verification” is what practitioners of, say, the software development V-cycle have been doing for decades, and it has kept us safe for that long—at least, when it mattered most.

Formal verification on the other hand, despite having been used here and there, is not yet the big deal that it could be. You'd better read the post, I fear I may not be doing it justice.


In that post, John writes:

For example, I’d be happy to drop a proved-correct gzip into my Linux machine as long as it was respectably fast.


Okay, John, let us do this, and see how far we can go.


We'll start with zlib, that we are going to approach like we did QuickLZ, by verifying that the decompression feature is resilient to malicious inputs. I have described the first potential issue on this code review site. If you have an opinion on this potential issue, you can contribute by expressing it there (registering is a one-click process), or here in the comments if you prefer.


Edited to add: too late, Andrew Šveikauskas has already determined that our first alarm was a false positive, as you can see on codereview.stackexchange.com. Ah well… It would have been nice if out first alarm were a true positive, but that was unlikely.

In case you regret missing out on the fun, here is another one: when reaching line inffast.c:269, where from is computed as out - dist to be accessed next, what prevents variable out to point one byte into the output buffer and dist to be 2, or 3, or 19? I expect the library must guard against this in the code above, but this is the end of a long day and I did not find where. Also, this would typically be the kind of relational information that the value analysis fails to remember, so I am not promising there is a bug to find here.