Frama-C news and ideas

Tag - unspecified behavior

Wednesday, March 13 2013

Reading indeterminate contents might as well be undefined

Warning: on a punctiliousness scale ranging from zero to ten, this post is a good nine-and-a-half. There was no tag for that, so I tagged it both “C99” and “C11”. The faithful reader will know what to expect. There is a bit of C90, too.

To summarize, it may appear that according to the letter of modern C standards, it is only dangerous to use uninitialized variables, instead of very dangerous. Nevertheless, this post shows that it does not matter what the standards say: compilers will bite you even when you are arguably right.

Some context in the form of a link

In 2012, Xi Wang wrote a nice blog post showing it is not a good idea to use an uninitialized variable as a source of additional entropy when trying to create a random seed.

“Xoring an uninitialized variable with whatever other source of entropy you already have cannot hurt”, the conventional thinking goes. Conventional thinking is wrong. Your typical modern compiler deletes the code that gathers the original entropy, since it is only going to be xored with an uninitialized variable. Hence the title of Xi Wang's blog post, More Randomness or Less.
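
To make the discussion concrete, here is a minimal sketch of the pattern being discussed. The function name make_seed and the use of time() and getpid() as the “other sources of entropy” are invented details for illustration only; the point is that every other contribution to the seed ends up xored with the uninitialized junk.

#include <time.h>
#include <unistd.h>

unsigned int make_seed(void)
{
  unsigned int junk;   /* deliberately left uninitialized */
  unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)getpid();
  return seed ^ junk;  /* a modern compiler may discard the whole
                          computation and return anything it likes */
}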

In C90, “indeterminate” was simple

In the nineties, real men were real men, C standards were short, and reading indeterminate contents (such as uninitialized variables) was listed in the very definition of “undefined behavior”:

1.6 DEFINITIONS OF TERMS

Unspecified behavior — behavior, for a correct program construct and correct data, for which the Standard imposes no requirements.

Undefined behavior — behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately-valued objects, for which the Standard imposes no requirements.

“Undefined behavior” means the compiler can do what it wants, so the behavior noticed by Xi Wang can in no way be held against a C90 compiler.

In 1999, C standards became more complicated

The C99 standard does not directly list “reading indeterminate contents” as undefined behavior. Instead, it defines indeterminate contents as “either an unspecified value or a trap representation”. Reading a trap representation causes undefined behavior (6.2.6.1:5). The nuance here is that the type unsigned char is guaranteed not to have any trap representations (and thus can always be used to read indeterminate contents).
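
To make the nuance concrete, here is a sketch of the reasoning that the C99 wording appears to allow; this is an illustration of the argument, not a recommendation:

#include <stdio.h>

int main(void)
{
  unsigned char c;  /* indeterminate contents */
  int i;            /* indeterminate contents */

  /* C99: unsigned char has no trap representations, so reading c
     merely yields an unspecified value. */
  printf("c: %d\n", (int)c);

  /* C99: reading i is undefined behavior only if int has trap
     representations on this platform; otherwise it too yields an
     unspecified value. (C11 changes this, as discussed below.) */
  printf("i: %d\n", i);
  return 0;
}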

Less randomness: the simplified version

“But my compilation platform does not have trap representations for type int, either, therefore I can use an uninitialized int variable and expect an unspecified value (a much better prospect than undefined behavior)”, one may think. This line of reasoning is attractive. It could even explain the behavior shown in Xi Wang's blog post and reproduced in simplified form below:

$ cat i.c
int f(int x)
{
  int u;
  return u ^ x;
}
$ gcc -O2 -std=c99 -S -fomit-frame-pointer i.c
$ cat i.s
…
_f:
Leh_func_begin1:
	ret
…

On this 64-bit platform, the argument x passed to f() is in register %edi, and the result of f() is expected in register %eax. Thus, by executing instruction ret directly, function f() is not even giving us back the entropy we provided it. It is instead giving us the current contents of %eax, which may not be random at all.

(Giving us back the entropy we passed to it would have been mov %edi, %eax followed by ret, a longer sequence.)

One may argue that the compiler has only opportunistically chosen the most convenient value for variable u, that is, x xored with the current contents of %eax, so that u xored with x is just the current contents of register %eax. This fits the interpretation of “unspecified value” for C99's definition of “indeterminate contents”. It is a good argument, but just wait until you have seen the next example.

The next example

#include <stdio.h>

int main(int c, char **v)
{
  unsigned int j;
  if (c==4) 
    j = 1; 
  else
    j *= 2;
  printf("j:%u ",j);
  printf("c:%d\n",c);
}

Unless exactly three command-line arguments are passed to the program (so that c is 4), it should display an unspecified even number for j, right?

$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
…
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) 
$ gcc -O2 t.c
$ ./a.out 
j:1 c:1

GCC version 4.4.3 has decided that since the “else” branch was reading from an uninitialized variable j, only the “then” branch was worth compiling. This is acceptable if reading uninitialized variable j is undefined behavior, but not if it is unspecified behavior. Let us insist:

$ gcc -Wall -O2 -std=c99 t.c
$ ./a.out 
j:1 c:1

Although we are explicitly asking GCC to follow the C99 standard, the program is not printing for variable j the unspecified even value we are entitled to.

(In passing, a proper static analyzer would know that if it is going to show variable j as containing 1, it might as well show c as containing 4. Also, a proper static analyzer would remind you that your program must, in essence, only be used with three command-line arguments. The reason compilers do not do this is covered elsewhere.)

Between 1999 and 2011, C standards did not get shorter

In 2007, Rich Peterson, working at HP, was disappointed to find that the “Not a Thing” (NaT) value that registers can have on the Itanium architecture could not be used to implement an uninitialized unsigned char.

One thing led to another, and the C11 standard was amended with the phrase “If the lvalue designates an object of automatic storage duration that could have been declared with register storage class (never had its address taken), and that object is uninitialized (not declared with an initializer, and no assignment to it has been performed prior to the use), the behavior is undefined.”

That would have been my reaction too, if I were paid by the word. Anyway, this additional sentence re-introduces undefined behavior where there was none in C99.


In the example above, the address of j was never taken, so maybe that's GCC's excuse. Let us check:

#include <stdio.h>

int main(int c, char **v)
{
  unsigned int j;
  unsigned int *p = &j;
  if (c==4) 
    j = 1; 
  else
    j *= 2;
  printf("j:%u ",j);
  printf("c:%d\n",c);
}

$ gcc -O2 t.c
$ ./a.out 
j:1 c:1

No, GCC is still acting as if j *= 2; were undefined.

Conclusion

I am not saying that this is not a bug in GCC. Perhaps it was fixed in later versions (in fact, that version does not accept -std=c11, so it must be rather old). My thesis is that you might as well avoid reading from uninitialized variables as if it were undefined behavior, because otherwise compilers will bite you. This statement holds even if what we have witnessed here is a bug in GCC version 4.4.3.


Also, if this is a bug in GCC 4.4.3, this is the first time I have identified a bug in GCC without the assistance of a random program generator. In other words, compiler bugs are rare, but they become surprisingly common if you stick to a strict interpretation of a necessarily ambiguous standard. And speaking of Csmith (the random program generator in question), if there is indeed a GCC bug here, it cannot be detected with Csmith, which does not generate programs like mine.

Saturday, June 4 2011

When is it valid to compare two pointers in C?

This post is about the circumstances in which the value analysis considers comparing pointers to be safe, and those in which it considers the comparison dangerous and emits an alarm. The alarm, an enigmatic assert \pointer_comparable(…, …);, uses an unaxiomatized ACSL predicate. If you use the value analysis regularly, you have probably already encountered it (but perhaps only as a false alarm caused by earlier approximations). This post informally describes the rules applied by the value analysis. It is by these rules that false alarms should be sorted from true ones, although a specific programmer may wish the rules were either stricter or a bit more relaxed.

If it were axiomatized, the predicate \pointer_comparable would have to capture all the conditions expressed in this post.

Note: this discussion takes place in the usual working conditions of the value analysis: we know and assume a bit about the architecture: size of integer and pointer types, NULL is zero, …

Pointers

The expression &a < &b is dangerous. It clearly can evaluate to 0 or 1 depending on compiler choices the programmer has no control over. I would have guessed that this constituted unspecified behavior (i.e. the expression evaluates to either 0 or 1). In fact, according to paragraph 6.5.8.5 in the C99 standard, it is undefined behavior (anything can happen, including a segmentation fault or the program displaying aberrant behavior from that point onwards).

Expression &a + 1 == &b is similarly dangerous: variable b may have been placed just after a in memory, or it may not. The comparison may just, in this case, be unspecified: this is my reading of paragraph 6.5.9.6.
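
The following comment-only snippet summarizes these first cases (the variable names are arbitrary):

int a, b;
int t[10];

/* &a <  &b      : undefined behavior (6.5.8.5); a and b are not parts
                   of the same object or array                          */
/* &t[2] < &t[5] : well defined, both pointers point into the same
                   array; always evaluates to 1                         */
/* &a + 1 == &b  : defined, but the result is unspecified (6.5.9.6);
                   b may or may not have been placed immediately
                   after a in memory                                    */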

Pointers converted to integers

The value analysis employs the same rule for comparisons between expressions that have pointer types and for comparisons between expressions that have unsigned integer types but happen to contain addresses. Signed integer inequalities between values that happen to be addresses should be considered dangerous more often, since variables can be placed at will in memory by the compiler. Let me clarify what I mean with an example:

int t[10];
int main(void){
  return (unsigned int)t <= (unsigned int)(t+8);
}

In Frama-C's default target architecture, an unsigned int can contain a pointer to int. The comparison above is deemed safe, and always evaluates to 1 (true), same as the pointer comparison t <= (t+8).

Compare to this slightly different example:

int t[10];
int main(void){
  return (int)t <= (int)(t+8);
}

The comparison above is dangerous and should be flagged as such by the value analysis, because the array t could have been placed by the compiler at address 0x7ffffffa. That t could, by sheer bad luck, be placed at this address means that (int)(t+8) could be negative while (int)t is positive in the value analysis' permissive model of memory and values. The result of this comparison depending in this way on compiler choices is undesirable, hence the warning.

You can force the value analysis to take into account the possibility that the dangerous comparison (int)t <= (int)(t+8) returns 0 (false) with option -undefined-pointer-comparison-propagate-all. This is exactly what this option is for.
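
For reference, assuming the snippet above is saved in a file named ptr.c (a hypothetical name) and that the value analysis is invoked with the -val option, as was the case for Frama-C versions contemporary with this post, the invocation would look something like:

$ frama-c -val -undefined-pointer-comparison-propagate-all ptr.c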


So much for signed integer inequalities: you can really never do anything safe with them, other than trivialities such as (int)&a <= (int)&a. Here is another example, back with unsigned integers:

int t[10];
int main(void){
  return 0 < (unsigned int)t;
}

In the value analysis' memory model, this is a contrived way to test whether t is different from NULL, and it is therefore always true. This comparison is considered safe.

int t[10];
int main(void){
  return t-1 <= t;
}

This comparison is dangerous: address 0 is reserved for NULL, but nothing prevents the compiler from placing t at address 2. With that placement, t-1 wraps around to a very large value, and the comparison, assumed to be implemented with an unsigned comparison assembly instruction, returns false. It returns true for most other placements of t; since the result depends on compiler choices, the comparison is unsafe.

Technically, computing t - 1 is undefined behavior in C99. The value analysis waits until this value is used in a comparison before warning about it. This is only one of the various places where it is more lenient than the standard. The rationale is that the value analysis is intended for the verification of existing programs. Existing programs do this (you wouldn't believe how often!). It would be very easy to warn at the slightest sign that the program is computing t-1, and this would in fact make a lot of things simpler (including this blog post). We are waiting for an industrial customer to ask for this specific feature, and we anticipate their asking, once it's done, how they can disable it locally.


int t[10];
int main(void){
  return 2 <= (unsigned int)t;
}

The comparison above, and 2 == (unsigned int)t, are both unsafe: they usually return 1 and 0 respectively, but for some rare compilations they might return 0 and 1 at run-time.

String literals

The comparison "tutu" == "tutu" (or, if you prefer, the comparison s == t when s and t have been defined as char *s = "tutu"; char *t = "tutu";) is dangerous: the compiler may place the anonymous string constants anywhere in memory, and factor them at its option. Again, in this case, my reading of the standard is that this comparison is only unspecified, but the development version of the value analysis emits an alarm for it nevertheless.

char *s = "tutu"; char *t = s; s == t is not dangerous: it should always evaluate to 1.

"hello" == "world" is not dangerous: the compiler has no opportunity to factor the strings, and the comparison predictably evaluates to 0. "hello" == "hello world" is not dangerous either: the compiler cannot factor the two strings, because the first one must be zero-terminated. The terminating zero clashes with the space in the second string.

"hello" == "hello\000world" is dangerous: the compiler can factor the strings, making the comparison return 1. It may also not factor them, making the comparison return 0.

Finally, consider the snippet below:

  char *s = "oh, hello";
  char *s4 = s+4;
  char *t = "hello\000hello";

In this context, s4 == t + 6 is harmless: although s4 points to the string "hello", the compiler was constrained by the requirement to place "oh, " before it and could not place it at t + 6. This comparison always returns 0. The comparison s4 == t, on the other hand, is dangerous, because the compiler could have factored the tail of s with the head of t.
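
A possible layout, shown as a comment, makes the reasoning visible (this is only one of the many placements the compiler may choose):

/*
   s  -> "oh, hello"       s4 = s + 4 points to the trailing "hello"
   t  -> "hello\0hello"    t + 6 points to the second "hello"

   s4 == t + 6 : always 0. Whatever the layout, the bytes before s4
                 spell "oh, " while the bytes before t + 6 spell
                 "hello\0", so the two addresses cannot coincide.
   s4 == t     : dangerous. The tail of s may have been factored with
                 the head of t, in which case the comparison
                 evaluates to 1.
*/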

State of value analysis implementation

The rules described in this post are only implemented approximately in Carbon. They will work exactly as described here in the next Frama-C release.

Saturday, December 4 2010

Unspecified behaviors and derived analyses, part 2

Context

This post is a sequel and conclusion to this remark.

Example of derived analysis: slicing

When writing a Frama-C plug-in to assist in reverse-engineering source code, it does not really make sense to expect the user to check the alarms that are emitted by the value analysis. Consider for instance Frama-C's slicing plug-in. This plug-in produces a simplified version of a program. It is often applied to large unfamiliar codebases; if the user is at the point where he needs a slicer to make sense of the codebase, it's probably a bad time to require him to check alarms on the original unsliced version.

The slicer and other code comprehension plug-ins work around this problem by defining the results they provide as "valid for well-defined executions". In the case of the slicer, this is really the only definition that makes sense. Consider the following code snippet:

  x = a;
  y = *p;
  x = x+1;
  // slice for the value of x here.

This piece of program is begging for its second line to be removed, but if p can be the NULL pointer, the sliced program behaves differently from the original: the original program exits abruptly on most architectures, whereas the sliced version computes the value of x.
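
For well-defined executions, the slicer is therefore entitled to produce something like the version below (the comment is mine):

  x = a;
  x = x+1;
  // the statement y = *p; has been sliced away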

It is fine to ignore alarms in this context, but the user of a code comprehension plug-in based on the value analysis should study the categorization of alarms in the value analysis' reference manual with particular care. Because the value analysis is more aggressive in trying to extract precise information from the program than other analyses, the user is more likely to observe incorrect results if there is a misunderstanding between him and the tool about what assumptions should be made.

Example of benign alarm: testing p<=q

Everybody agrees that accessing an invalid pointer is an unwanted behavior, but what about comparing two pointers with <= in an undefined manner or assuming that a signed overflow wraps around in 2's complement representation? Function memmove, for instance, typically does the former when applied to two addresses with different bases.
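
To see how a perfectly ordinary function ends up comparing pointers with different bases, consider the simplified memmove-like function below. This is only a sketch written for this post, not the implementation of any particular C library:

#include <stddef.h>

void *my_memmove(void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  if (d <= s) {                 /* undefined when dst and src point into
                                   different objects (6.5.8.5), yet every
                                   real implementation does this */
    while (n--) *d++ = *s++;    /* copy forwards */
  } else {
    d += n;
    s += n;
    while (n--) *--d = *--s;    /* copy backwards */
  }
  return dst;
}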

Currently, if there appears to be an undefined pointer comparison, the value analysis propagates a state that contains all possible results for the comparison and for the pointers. This blog post was just trying to scare you. It is possible to take advantage of the value analysis for program comprehension, and all existing program comprehension tools need to make assumptions about undefined behaviors. Most tools do not tell you if they had to make assumptions or not. The value analysis does: each alarm, in general, is also an assumption. Still, as implementation progresses and the value analysis becomes able to extract more information from the alarms it emits, one or several options to configure it either not to emit some alarms, or not to make the corresponding assumptions, will certainly become necessary.

Sunday, November 7 2010

Unspecified behaviors and derived analyses

Prologue

The C standards specify a minimum set of things that all C compilers must agree on. For the sake of efficiency, many syntactically correct constructs are left without an unambiguous meaning.

The worst way for a construct not to have an unambiguous meaning is to be undefined behavior. An example of undefined behavior is an invalid pointer access. Not all platforms have an MMU to catch it, and even when they do, the MMU does not catch all invalid memory uses, so accessing an invalid pointer is not guaranteed to crash the program cleanly. A program that does so may crash cleanly, or continue executing in an inconsistent state. The program may crash a little bit later because of the inconsistency, or it may appear to work fine and return a wrong result, or it may reformat your hard drive. All these behaviors and others are encompassed by "undefined behavior".

Compilers and compilation-platform-specific static analysis

Recently, C compilers have started to take advantage of undefinedness in the C99 standard for the sake of aggressive optimization. They apply program transformations that are only valid in the case of defined behaviors, and leave the program doing strange, unpredictable things in other cases. John Regehr has a very good explanation of C's undefined behaviors that spans three blog posts.

There are other, more benign ways for a construct not to have an unambiguous meaning, though. Implementation-defined behaviors must be documented and consistent for a given compiler. In the case of unspecified behaviors, one of several plausible alternatives must happen. In Frama-C, we have reclassified some unspecified, and even some undefined, behaviors as implementation-defined. For instance, the value analysis assumes by default that signed arithmetic overflows have been put there on purpose by the programmer and that he intended 2's complement modulo results. In this default configuration, the value analysis does not warn about these overflows and propagates 2's complement modulo results. John Regehr convinced us, however, that this was not satisfactory, so there is now an option -val-signed-overflow-alarms for when the programmer is not supposed to rely on this behavior (signed arithmetic overflows are technically undefined behavior). The Jessie plug-in allows roughly the same distinction, with the defaults reversed (by default, you get alarms for integer overflows). In short, a Frama-C analysis session is parameterized by a description of the compilation platform that includes genuine or reclassified implementation-defined choices.
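
As a concrete illustration of the kind of code this default accommodates, here is a small invented function:

/* With the default settings described above, the value analysis assumes
   the programmer intended 2's complement wrap-around for a + b and does
   not emit an alarm. With -val-signed-overflow-alarms, the possible
   overflow becomes an alarm that must be verified or inspected. */
int average(int a, int b)
{
  return (a + b) / 2;  /* a + b may exceed INT_MAX: technically
                          undefined behavior */
}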

That still leaves, of course, some unspecified and undefined behaviors that the value analysis or Jessie plug-ins do indeed warn about, and whose absence they prove when they do not warn about them. Each construct that may cause some kind of unspecifiedness or undefinedness is the object of a warning containing the word assert in the value analysis log. Terms and conditions apply.

This post is about the one use case where the user does not have to worry about the alarms. So if you were tired of the ongoing thread about the verification of Skein-256, good news! This is something else.

Derived analyses

Well, I say "one" use case, but it's really a family of use cases. I am talking about all the analyses that help the user comprehend the code, and that therefore cannot mandate that all the alarms be eliminated through the kind of laborious inspection and refinement illustrated in the Skein verification thread. Instead, the values, and therefore the results of the analyses that rely on them, are guaranteed to apply to well-defined executions only (executions that do not falsify any of the assertions that have been emitted). As long as this is kept in mind, it may not even be necessary to look at these alarms. In fact, it is standard for high-level analyzers of C programs to assume the absence of undefined behaviors, or to apply only to executions without undefined behaviors. Fast alias analyses intended to scale to millions of lines of C, for instance, assume that you do not transmit an address to pointer p by writing it in *(&q+1), where q is the variable placed just below p in memory. These analyses, and any other analyses that rely on them, do not tell you whether you should worry about this happening. If you should worry, they do not give you a list of dangerous constructs in the target program. The value analysis provides such a list of alarms, but you can ignore it if you wish, and then you get results with similar caveats.

Well, almost. We are finally arriving at the core of the problem.

What is a well-defined execution?

The value analysis is more precise and more aggressive than the typical large-scale alias analysis in its handling of what look like possible quirks in the analyzed program. The reason the -val-signed-overflow-alarms option mentioned above is not enabled by default is that several test programs became meaningless with it. The analysis was in effect saying "look, this program is sure to produce an overflow here at the third line, there is no point in going further until this has been resolved", and refused to analyze the rest of the program. The typical large-scale context-insensitive path-insensitive alias analysis does not have this problem, but only because it is too imprecise to have it. It is still making assumptions, such as that you do not access p when you read or write *(&q+1).
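
For reference, the quirk alluded to looks something like the sketch below. The names and the assumed memory layout are hypothetical, and the access is of course undefined; it is exactly the kind of construct a fast alias analysis assumes away:

int *q, *p;          /* suppose the compiler happens to place p just
                        after q in memory */

void sneaky(int *addr)
{
  *(&q + 1) = addr;  /* "transmits addr to p" only under that layout
                        assumption; undefined behavior in any case */
}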

So does this mean that all high-level analyses of C programs are meaningless, unless the absence of undefined behaviors in the target program is also verified(1)? In the style of Dan Simmons' Hyperion cycle, we are going to leave this crucial question for another volume, that is, a future blog post.

(1) which it rarely is in practice. As pointed out earlier, most of them do not even tell you whether you should worry and what to worry about.