proj-plbook-plChCLang

Table of Contents for Programming Languages: a survey

C

Because it is so well-known and well-liked, C gets its own chapter.

Perhaps the most successful language ever.

"Think of C as sort of a plain-spoken grandfather who grew up trapping beavers and served in several wars but can still do 50 pullups."

Some people consider it as a thin portable layer over assembly. However this view is disputed (e.g. https://news.ycombinator.com/item?id=6415240 ). One clear difference between C and assembly is that C is in the structured programming paradigm, and does not provide 'GOTO' (it does provide a facility called 'longjmp' but this is limited to jumps into functions for which there is still a valid stack frame; in addition some commentators assert that longjmp is often not implemented correctly in various C implementations (todo cite)).

One thing that C provides that assembly language often does not is a choice of conventions for string and multidimensional array representation.

Attributes:

Pros:

Cons:

Tutorials:

Best practices:

Respected exemplar code:

Books:

People: Dennis Ritchie

Popularity:

Features:

Size:

Libraries:

Misc:

Variants of C that are simpler, with small compilers:

and/or safer:

    switch statements must be structured as in MISRA-C; unstructured "switch", as in Duff's device, is not supported. ((note: i think this means (a) each 'case' in a switch statement must be followed by its own 'break' and (b) all 'case's must be at the same scope (unlike Duff's device)))
    longjmp and setjmp are not guaranteed to work.
    Variable-length array types are not supported. 

Consequently, CompCert? supports all of the MISRA-C 2004 subset of C, plus many features excluded by MISRA (such as recursive functions and dynamic heap memory allocation). ... Some other unsupported constructs, such as variable-arity function declarations, are rejected. "

and/or subset:

Opinions on C

C0

"C0 is a Pascal-like subset of C, similar to MISRA-C [133]. It excludes the features of C that make full verification problematic: pointer arithmetic, unions, pointer casts. It is the intermediate language used in the Verisoft project for pervasive formal verification of computer systems." -- Yannick Moy. Automatic Modular Static Safety Checking for C Programs. PhD thesis, Université Paris-Sud, January 2009, section 2.1, page 42

http://reports-archive.adm.cs.cmu.edu/anon/2010/CMU-CS-10-145.pdf

HNC

macro language for C:

CIL

http://cil-project.github.io/cil/doc/html/cil/

https://en.wikipedia.org/wiki/George_Necula#C_Intermediate_Language


Checked C

http://research.microsoft.com/en-us/projects/checkedc/

C with language extensions to explicitly specify pointer/array bounds. "For an existing C program to be correct, there has to be an understanding on the part of the programmer as to why pointers and array indices stay in range. The goals of the design are to let the programmer write this knowledge down, to formalize existing practices, such as array pointer parameters being paired with length parameters, and to check this information." There are three ways provided to secure pointers; raw pointer can be declared, which cannot be used in pointer arithmetic; static checking, where bounds expressions in a limited language are declared which can be checked at compile time; and dynamic pointers, which are dynamically checked at runtime. There is also still raw unchecked pointers which can do pointer arithmetic, to allow for backwards compatibility. The extensions are only for checking bounds; the language report gives examples of other areas which are not dealt with: incorrect deallocation of memory still in use; incorrect type-unsafe pointer casts; concurrency races.

Chapter 2 of [86], starting on page 14, summarizes the language extensions' features.

The following operations are supported on pointers (the ops in this list which take two pointers require that the pointers be to objects of the same type; and some of these are not supported on checked raw pointers, which don't support pointer arithmetic):

Either pointers or their values can be const; eg you can declare a const pointer to a mutable value, or, by contrast, a mutable pointer to a const value.

Variables which carry their bounds with them at runtime carry both a lower and an upper bound (unlike some competing projects, in which the bounds carry counts of elements).

Bounds can be declared at variable declarations, in which case they must be true for the entire lifetime of the variable, or at assignments, in which case they are propagated via dataflow. Bounds at variable declarations might involve multiple variables; sometimes you have to update these variables, but since you can only atomically update one of them at a time, there may be a period of time when they are 'out of sync' and the bounds condition is not true, even though it will again become true after all are updated; this is dealt with via 'bundled blocks' of variable updates, within which the bounds expression may be false. The syntax "x : count(e1)" declares that only memory that is at or above x and below x + e1 can be accessed through x ([87] section 3.1), and the syntax "x : bounds(e1, e2)" declares that memory that is at or above e1 and below e2 can be accessed through x.

Bounds-checking expressions must be 'non-modifying expressions' ie no side effects (i think?).

Note that for the bounds on a pointer to be considered to hold, the pointer must either be null or have "been derived via a sequence of operations from a pointer to some object obj with bounds(low, high)" such that the bounds on this pointer are within the range low, high (for the implications of this, see [88] section 3.1 PDF page 35, and section 11.5 PDF page 135, and section 11.6 PDF page 136).

The language provides 'checked scopes', in which only checked pointers are allowed "to prevent inadvertent use of unchecked type". "Another extension is a refinement of C’s notion of undefined behavior. We wish to maintain bounds safety even in the face of integer overflow, which is a widespread possibility in C programs. This requires a more nuanced requirement on the result of overflow than “the result is undefined."

" If a compiler cannot prove that a new pointer type value is non-null before a pointer arithmetic operation, it must insert a check. Similarly, if a compiler cannot prove that a pointer arithmetic expression for a new pointer type cannot overflow, it must insert a check. This may slow a typical program down by a slight amount. ... In our experience working on an operating system written in managed code that required these checks, the slowdown was a few percent, even when all arithmetic instructions were checked for overflow too. The cost of checks was low because it is often the case that checks are redundant or can be combined into adjacent machine instructions....The checks are also inexpensive on superscalar processors: they can be placed so that they are statically predicted by processors to never fail. "

Pointer bounds follow alignment constraints. This slightly complicates the language definition but allows a runtime speedup of bounds checking.

Section 2.4 of [89], starting on PDF page 19, talks about some semantics stuff and undefined/null/overflow/error handling.

Section 2.8 of [90], starting on PDF page 30, has very interesting lists of undefined and unspecified behaviors in C, what changes are made in checkedc, and what the resulting formal aim is. In sum, some examples of undefined behavior in C are division by 0, signed integer overflow, out-of-bounds pointer arithmetic, pointer arithmetic between two pointers that don't point to the same object, and (sometimes) expressions with nested assignments (because their side effects aren't always ordered, therefore the result of the expression can be undefined). Unspecified behavior is not totally undefined, but is defined to be one of many options; examples include expressions with side effects. There is also implementation-defined behavior, including ranges of values for ints, floats, and pointers; data layout (the sizes of types, padding, alignment), and some aspects of pointer conversion. "It is difficult to reason about the correctness of programs that have expressions with undefined behavior. One has to make sure that a program never arrives at a point where behavior is undefined. In practice, this would mean proving that signed integer overflow can never occur, for example. For unspecified behavior, one has to reason about all possible behaviors, such as all possible orders of evaluation. For implementation-defined behavior, one has to reason about the implementation-specific behavior or have reasoning that is independent of the details." Changes CheckedC? makes include semantically treating pointers as addresses in memory, mandating the result of divide-by-0 and signed integer overflow to be either some legal value or a runtime error, and mandating that undefined expressions be rejected at translation-time. They note that other sources of undefined behavior still exist, including: unchecked code that dereferences out-of-bounds pointers; invalid typecasting; use-after-free (accessing deallocated memory). They note that CheckedC? does not purport to save you from invalid bounds when your program contains these other errors.

Section 2.7 of [91] on page 22 explains the reasoning for having assert-like dynamic bounds checks, rather than simply using ordinary if/thens in programs and returning error code if the bounds checks don't pass, citing O'Donell and Sebor. In sum, handling errors leads to more awkward and inefficient code, which is annoying especially when, assuming the programmer is correct, the error should be impossible. They recognize that if the assertion is ever actually involked then there's a problem, but they say that this is the same problem that C programs already have with null pointer dereferences. My note: this is interesting to me because it stands in contrast to the 'null pointers are a billion dollar mistake' narrative that says that this sort of thing is bad design. My note: it seems to me that they are implying that assertions (and assertion-like bounds checks) should be used when the assertion cannot fail unless there is a LOCAL mistake within the current scope (or within the current module); for argument checking however (checking the validity of arguments passed into the module by an external caller), ordinary control flow (or exceptions, etc) should be used; in other words, YOU can assert that YOUR code is 100% correct, but never assert that SOMEONE ELSE's code will be correct.

Section 2.7 also gives an example where a dynamic check can actually increase runtime speed by leading the compiler to deduce that other more expensive checks are not needed.

Lessons learned (from page 5 of [92]:

Chapters 4 and 8 of [93], starting on PDF pages 57 and 107, presents an inference procedure that a compiler can use to statically check and infer bounds and to reason about simple program invariants. Note that invertible functions appear to be important. The syntax "where i >= 0;" is used to 'assert' that i >= 0, and the syntax "where buf : bounds(tmp, end);" is used to 'assert' that array_ptr 'buf' has lower bound equal to array_ptr 'tmp', and upper bound equal to array_ptr 'end'.

Chapter 11 of [94] lists rationale for design choices. The syntax "ptr<const unsigned int> g" was chosen over "const unsigned int ptr" because the latter is more difficult to visually parse. Bounds declaration are permitted at assignments (rather than only at variable declaration) because when the bounds of a pointer need to change, that would mean making the programmer replacing one variable with many different variables, which was deemed too annoying. They consider adding a new bounds syntax, 'object_bounds', whose semantics are that the bounds are always valid, without any exception for null pointers; this would remove the need for null pointer checks (note: i think this is like Haskell Maybe); however they felt that "This risks dividing C functions into those that can take null pointer arguments and those that only take non-null pointer arguments. In our experience, requirements that pointers not be null tend to propagate virally throughout code bases. If an entire code base can be converted, this can work well. Otherwise, it results in messy dynamic checks for non-nullness in code. Now, it is possible with lightweight invariants and dynamic check that such checks would not be messy. The existing practice in C is that null pointers are widely used interchangeably with non-null pointers. This introduces uncertainty about the usefulness of this feature for C code bases. The runtime benefit of eliminating non-null checks in pointer arithmetic is likely small. Following the principal of minimality, we have rejected the design choice of adding object bounds for now".

Chapter 9 of [95] describes and contrasts related work.


in progress:

Retrospectives:

types:

Standards:

Practices

"By default you should compile with optimization on. Forcing the compiler to work harder helps it find more potential problems." -- http://www.slideshare.net/olvemaudal/deep-c

Gotchas

    unsigned int   ui_one       =  1 ;
    signed   int   i_one        =  1 ;
    signed   short s_minus_one  = -1 ;
	
    if( s_minus_one > ui_one)
		printf("-1 > 1 \n");
	
    if( s_minus_one < i_one) 
		printf("-1 < 1 \n");

    #./run
    #
    # -1 > 1 
    # -1 < 1 

-- [103]

explanation: not sure, but i think what is happening here is that the compiler first checks if a promotion is neccesary. In the case of s_minus_one > ui_one, s_minus_one is promoted to an int. So now we are comparing a (signed) int to an unsigned int. Since the ranges of 'int' and 'unsigned int' are disjoint, there is no obvious choice of which one to convert to what; in this case, both are converted to 'unsigned int'. This is done by repeatedly adding 2^32 to the (signed) int until it is in the range of 'unsigned int' (that is, until it is positive). This creates a large positive number, which is greater than 1. In the case of s_minus_one < i_one, s_minus_one is promoted to an int. Now both arguments are of the same type, so the comparison is done, and -1 < 1, so it succeeds.


    extern void foo(void);
    void (*f)();
	
    f = &foo;   // Valid 
    f = foo;    // Valid too ! (syntactic sugar)
	
    f();        // Calls f
    (*f)();     // Calls f too ! (syntactic sugar)

-- [104]

	int array[] = {0,1,2,3,4} ;
	int* pointer = array;
	
	if (sizeof array == sizeof pointer)
		printf("This will never be printed !!");
	
	if (sizeof(int*) == sizeof &array[0])
		printf("This will be printed !!\n");
	
	if (&array[2] - &array[0] == 8)
		printf("This will never be printed either, result is 2 not 8!!");

explanation: In the first case, sizeof array is the size of the entire array. In the second case, int* is a pointer, and &array[0] is a pointer, so their sizeofs are equal (the size of a pointer). In the third case, &array[2] - &array[0] = &*(array+2) - &*(array+0) = (array+2) - (array+0) = 2 - 0. [105].

Note that sizeof(&array[0]) != sizeof(array) (although they 'decay' to the same thing, for the purpose of sizeof the first is the sizeof a pointer and the second is the sizeof the array). [106]

Note that '&array' and '&array[0]' are of a different types, and behave differently in pointer arithmetic, although in a sense both refer to the memory location at the beginning of the array; for &array, the implied offset is the size of the array, but for &array[0] the implied offset is the size of the array's elements. [107].

Relevant parts of C standard: C89 3.3.3.4

"The sizeof operator... When applied to an operand that has array type, the result is the total number of bytes in the array."

C89 3.2.2.1

"Except when it is the operand of the sizeof operator or the unary & operator, or is a character string literal used to initialize an array of character type, or is a wide string literal used to initialize an array with element type compatible with wchar_t, an lvalue that has type ``array of type is converted to an expression that has type ``pointer to type that points to the initial member of the array object and is not an lvalue."

[108]

[109]

and little-known features:

Internals and implementations

Core data structures: todo

Number representations

Integers

Floating points todo

array representation

variable-length lists: todo

multidimensional arrays: todo

limits on sizes of the above

string representation

The famous null-terminated "C string".

Representation of structures with fields

todo

misc details

http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html goes into detail about how allowing undefined behavior (or at least undefined value return) makes C faster:

Use of an uninitialized variable: no need to autoinitialize

Signed integer overflow: can optimize eg "X*2/2" to "X"; " While these may seem trivial, these sorts of things are commonly exposed by inlining and macro expansion.". Another example: "for (i = 0; i <= N; ++i) { ... }"; can assume that the loop will iterate exactly n+1 times (as opposed to i maybe wrapping if N is large)

Oversized Shift Amounts: different platform do different things in this case, so leaving this an undefined result lets the compiler just let the platform decide, rather than putting in extra instructions

Dereferences of Wild Pointers and Out of Bounds Array Accesses: "To eliminate this source of undefined behavior, array accesses would have to each be range checked, and the ABI would have to be changed to make sure that range information follows around any pointers that could be subject to pointer arithmetic"

Dereferencing a NULL Pointer: "dereferencing wild pointers and the use of NULL as a sentinel. NULL pointer dereferences being undefined enables a broad range of optimizations: in contrast, Java makes it invalid for the compiler to move a side-effecting operation across any object pointer dereference that cannot be proven by the optimizer to be non-null. This significantly punishes scheduling and other optimizations. In C-based languages, NULL being undefined enables a large number of simple scalar optimizations that are exposed as a result of macro expansion and inlining."

Violating Type Rules: It is undefined behavior to cast an int* to a float* and dereference it (accessing the "int" as if it were a "float"). C requires that these sorts of type conversions happen through memcpy: using pointer casts is not correct and undefined behavior results.... (eg an example is given where a loop that zeros memory is replaced by a call to a platform primitive for this purpose)

Efforts for a 'safer C' dialect, and/or to see how people's practical use of C in edge cases differs from what the standard says:

Extensions

C links