proj-plbook-plChUndefinedBehavior

Table of Contents for Programming Languages: a survey

Chapter : Undefined Behavior

Often abbreviated UB.

http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html goes into detail about how allowing undefined behavior (or at least undefined value return) makes C faster (and conversely, how safe languages must be either more restrictive, or slower):

Use of an uninitialized variable: no need to autoinitialize

Signed integer overflow: can optimize eg "X*2/2" to "X"; " While these may seem trivial, these sorts of things are commonly exposed by inlining and macro expansion.". Another example: "for (i = 0; i <= N; ++i) { ... }"; can assume that the loop will iterate exactly n+1 times (as opposed to i maybe wrapping if N is large)

Oversized Shift Amounts: different platform do different things in this case, so leaving this an undefined result lets the compiler just let the platform decide, rather than putting in extra instructions

Dereferences of Wild Pointers and Out of Bounds Array Accesses: "To eliminate this source of undefined behavior, array accesses would have to each be range checked, and the ABI would have to be changed to make sure that range information follows around any pointers that could be subject to pointer arithmetic"

Dereferencing a NULL Pointer: "dereferencing wild pointers and the use of NULL as a sentinel. NULL pointer dereferences being undefined enables a broad range of optimizations: in contrast, Java makes it invalid for the compiler to move a side-effecting operation across any object pointer dereference that cannot be proven by the optimizer to be non-null. This significantly punishes scheduling and other optimizations. In C-based languages, NULL being undefined enables a large number of simple scalar optimizations that are exposed as a result of macro expansion and inlining."

Violating Type Rules: It is undefined behavior to cast an int* to a float* and dereference it (accessing the "int" as if it were a "float"). C requires that these sorts of type conversions happen through memcpy: using pointer casts is not correct and undefined behavior results.... (eg an example is given where a loop that zeros memory is replaced by a call to a platform primitive for this purpose)

https://kukuruku.co/post/undefined-behavior-and-fermats-last-theorem/ gives some examples where C/C++'s undefined behavior rules lead to paradoxical results. For example, in the following code there is an off-by-one mistake in the loop condition; the <= should be <, because table[4] doesn't exist. In the case that the loop reaches the fifth iteration (i == 4), this code would perform an access to table[4], which is undefined behavior. Therefore, in the case that the loop reaches the fifth iteration, the compiler is allowed to do anything (because the compiler is allowed to do anything in the case of undefined behavior:

int table[4]; bool exists_in_table(int v) { for (int i = 0; i <= 4; i++) { if (table[i] == v) return true; } return false; }

So, one thing that the compiler could choose to do in the case that the fifth iteration is accessed is 'return true', in which case it would be returning true in all cases. Therefore, the compiler is allowed to optimize this code to:

int table[4]; bool exists_in_table(int v) { return true; }

" John Regehr provides the following selection of examples when undefined behavior led to significant consequences:

    Using uninitialized data as an extra source of randomness caused a compiler to delete an entire seed computation. [http://kqueue.org/blog/2012/06/25/more-randomness-or-less/]
    A compiler can evaluate (x+1)>x to both 0 and 1 in the same program, compiled at the same optimization level, for the same value of x. [http://www.cs.utah.edu/~regehr/papers/overflow12.pdf]
    An infinite loop such as a counterexample search (where no counterexample exists) can be terminated by the compiler, permitting, for example, Fermat’s Last Theorem to be “disproved.” [http://blog.regehr.org/archives/161]
    An undefined shift operation caused a compiler to delete an important safety check in Google’s Native Client. [http://code.google.com/p/nativeclient/issues/detail?id=245]
    A reference to a possibly-null pointer in the Linux kernel caused the compiler to remove a subsequent null check, creating an exploitable vulnerability. [https://isc.sans.edu/diary.html?storyid=6820]
    A compiler can run the code inside if (p) {… } and also inside if (!p) {… } when p is not initialized. [http://markshroyer.com/2012/06/c-both-true-and-false/]
    A division instruction (that potentially crashes the process) can be moved in front of code that has externally visible side effects. [http://blog.regehr.org/archives/232]

For example, this code:

int a; void bar (void) { setlinebuf(stdout); printf ("hello!\n"); } void foo3 (unsigned y, unsigned z) { bar(); a = y%z; } int main (void) { foo3(1,0); return 0; }

will have time to print the message before SIGFPE, if compiled without optimizations, and it will crash right away in case we enable optimization. The program “knows in advance” that it is destined to crash with SIGFPE, and does not even bother printing the message. Chen provides a similar example, but with SIGSEGV.

In 2012, Regehr announced a contest for the craziest consequence of undefined behavior. One of the winners took advantage of the fact that the usage of a pointer after being passed to realloc(), is undefined behavior. His program prints different values for the same pointer:

  1. include <stdio.h>
  2. include <stdlib.h> int main() { int *p = (int*)malloc(sizeof(int)); int *q = (int*)realloc(p, sizeof(int));

$ clang -O realloc.c; ./a.out

1 2

...

But nothing can compare with Regehr’s example: the disprove of Fermat’s Last Theorem by the compiler... " -- [1]

In many cases (and in even more cases before 2011, when the C 1999 (C99) standard was in effect [2]) the C standard allows the compiler to assume that almost every loop terminates [3]. In 2010, Regehr ran the following program, which terminates if and only if it can find a counterexample to Fermat's last theorem (this theorem was proven true in 1994):

int fermat (void) { const int MAX = 1000; int a=1,b=1,c=1; while (1) { if (((a*a*a) == ((b*b*b)+(c*c*c)))) return 1; a++; if (a>MAX) { a=1; b++; } if (b>MAX) { b=1; c++; } if (c>MAX) { c=1; } } return 0; }

  1. include <stdio.h> int main (void) { if (fermat()) { printf ("Fermat's Last Theorem has been disproved.\n"); } else { printf ("Fermat's Last Theorem has not been disproved.\n"); } return 0; }

regehr@john-home:~$ icc fermat2.c -o fermat2 regehr@john-home:~$ ./fermat2 Fermat's Last Theorem has been disproved. regehr@john-home:~$ suncc -O fermat2.c -o fermat2 "fermat2.c", line 20: warning: statement not reached regehr@john-home:~$ ./fermat2 Fermat's Last Theorem has been disproved.

It appears that the program has found a counterexample to Fermat's last theorem! So, what is the counterexample; "what a,b,c values has it found? Regehr added printing of the found values..."; the revised program hung in an infinite loop.

What happened was that the compiler deduced that the only way out of the while(1) loop in finite time was the 'return 1'. There are no side effects in this loop (before the printing of the found values was added). According to some interpretations of the C 1999 standard ('C99'), the compiler is allowed to assume that this loop terminates. Therefore, the compiler optimized the loop to just "return 1".

When the printing of the found values was added, the loop now had side-effects, and so according to the standard the compiler is no longer allowed to optimize away the loop; so now the program ran as expected (another way to block the compiler from optimizing away the loop would have been to access volatile variables within the loop).

A list of some other undefined behaviors in C++ is found at https://stackoverflow.com/questions/367633/what-are-all-the-common-undefined-behaviours-that-a-c-programmer-should-know-a

Proposals to deal with UB

Platform-specific UB

"Both of these examples highlight a fundamental problem with “platform-specific UB”: any out-of-bounds write could potentially modify any other variable (at least any variable that has an address in memory). This can make even the most basic parts of high-quality code generation, such as register allocation, tricky or impossible: a variable that has its address taken has to be re-loaded from that same address any time an out-of-bounds write might have happened, since that write might just have hit the right address to change this variable’s value. This applies even if the address has not yet been leaked to the outside world, as the first example shows. This is probably why there is hardly any compiler that follows the platform-specific interpretation of UB." -- [4]

C N2769

Discussions about undefined behavior in specific language implementations

Undefined behavior links