Table of Contents for Programming Languages: a survey
C
Because it is so well-known and well-liked, C gets its own chapter.
Perhaps the most successful language ever.
"Think of C as sort of a plain-spoken grandfather who grew up trapping beavers and served in several wars but can still do 50 pullups."
Some people consider it as a thin portable layer over assembly. However this view is disputed (e.g. https://news.ycombinator.com/item?id=6415240 ). One clear difference between C and assembly is that C is in the structured programming paradigm, and does not provide 'GOTO' (it does provide a facility called 'longjmp' but this is limited to jumps into functions for which there is still a valid stack frame; in addition some commentators assert that longjmp is often not implemented correctly in various C implementations (todo cite)).
One thing that C provides that assembly language often does not is a choice of conventions for string and multidimensional array representation.
Attributes:
Pros:
Cons:
Tutorials:
Best practices and recommended libraries etc:
Respected exemplar code:
Books:
Standards (drafts):
People: Dennis Ritchie
Popularity:
Features:
Size:
Libraries:
Misc:
Variants of C/variant implementations that are simpler, with small compilers:
- http://bellard.org/otcc/ (compiler fits in 2048 bytes of C source excluding the ';', '{', '}' and space characters; see also http://bellard.org/tcc/, a regular C compiler (tcc is ~45kloc [73]))
- https://github.com/rswier/c4 (~500 LOC; is still self-hosting; no structs; C in four functions)
- http://en.wikipedia.org/wiki/Small-C (~3000 LOC [74])
- "an integer/character subset of the C language" -- "A Small-C Compiler" book, preface, [75]
- " In May of 1980 Dr. Dobb's Journal ran an article entitled "A Small C Compiler for the 8080s" in which Ron Cain presented a small compiler for a subset of the C language. The most interesting feature of the compiler besides its small size was the language in which it was written--the one it compiled. It was a self-compiler! (Although this is commonplace today, it was a fairly novel idea at the time.) With a simple, one-pass algorithm, his compiler generated assembly language for the 8080 processor. Being small, however, it had its limitations. It recognized only characters, integers, and single dimension arrays of either type. The only loop controlling device was the while statement. There were no Boolean operators, so the bitwise logical operators & (AND) and
(OR) were used instead. But even with these limitations, it was a very capable language and a delight to use, especially compared to assembly language. Recognizing the need for improvements, Ron encouraged me to produce a second version, and in December of 1982 it also appeared in Dr. Dobb's Journal. The new compiler augmented Small C with (1) code optimizing, (2) data initializing, (3) conditional compiling, (4) the extern storage class, (5) the for, do/while, switch, and goto statements, (6) combination assignment operators, (7) Boolean operators, (8) the one's complement operator, (9) block local variables, and (10) various other features. Then in 1984 Ernest Payne and I developed and published a CP/M compatible run-time library for the compiler. It consisted of over 80 functions and included most of those in the UNIX C Standard I/O Library--the ones that pertained to the CP/M environment. This became version 2.1 and the subject of The Small C Handbook." -- "A Small-C Compiler" book, introduction, [76] |
- http://www.drdobbs.com/cpp/building-your-own-c-interpreter/184408184
- https://github.com/zsaleeba/picoc (~3500-4500 LOC) (not self-hosting)
- https://github.com/alexfru/SmallerC (~10K LOC compiler [77])
- https://github.com/kungfooman/EiC-C-Interpreter (~15K LOC; see also [78] [79] [80])
- SubC
- Cerebus which also has Cerebus Ail and Cerebus Core
- Interpreted C for Arduino
- SCC, simple C Compiler
- cc500 "CC500: a tiny self-hosting C compiler" (~750 LOC)
- https://github.com/andrewchambers/c "about 5k-6k LOC and self hosting...1:1 dumb translation of code to assembly"
- https://github.com/yui0/catc (Tiny C compiler)
- https://github.com/jserv/amacc
- https://github.com/mras0/scc (hobby project)
- "A non-exhaustive list of the current limitations of SCC: Arbitrary limits to e.g. the number of struct members (you can to some degree change these); Very limited error checking/information (see the debugging section for help); Integers are limited to 16-bits (and no support for long); Tiny memory model (i.e. all code and data must live within a single 64K segment). CS=DS=ES=SS is assumed. (See SIM86 for an example of how to work around this limitation); Only basic support for initializers (rules are complex, but if it's not in the test suite it's not supported...); No floating point support; No bitfields; Uncountably many bugs :);"
- https://en.wikipedia.org/wiki/Small-C
- https://zserge.com/posts/cucu-part1/
- Small C for I386 (IA-32)
- Selfie, "a tiny self-compiling compiler for a subset of C, a tiny self-executing MIPS emulator, and a tiny self-hosting MIPS hypervisor, all in a single 7kLoC file". HN discussion. Paper. (via [81])
- "Tiny C expression compiler Written in Forth based on tinyc.c by marc feeley" (via [82])
- http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complements/tinyc.c
- https://github.com/rui314/8cc https://github.com/rui314/9cc "C compilers by Rui Ueyama blog" (via [83])
- https://github.com/Fedjmike/mini-c/blob/master/cc.c "10 hour self hosting c compiler" (via [84])
- neatcc "neatcc implements a large subset of ansi C but lacks features like floating point types, non-integer parameters and struct bitfields."
- BDS C
- https://github.com/ras52/bootstrap/tree/master/stage-5 (a "C-like language")
- https://github.com/cksystemsteaching/selfie/blob/master/README.md C Star (from the Selfie project)
- https://github.com/cksystemsteaching/selfie/blob/master/semantics.md
- https://github.com/cksystemsteaching/selfie/blob/master/grammar.md
- "C* is a tiny subset of C designed to trade-off code size, sim-plicity, and readability of selfie while providing interestingchallenges for compilation and educational opportunitiesfor extensions. In particular, C* is an attempt at identifyinga subset in which a self-referential system like selfie canbe implemented with the least amount of code that is stillsimple and easily readable. It took around two years of selfiedevelopment work to settle on its design. During develop-ment we were driven by the observation that whenever weadded a language feature we needed code to compile it butcould potentially save or improve code elsewhere includingthe emulator and hypervisor when using that feature. Thismeans that if we now add or remove a feature the code islikely to either grow, get more complicated, or less readable.For example, adding structs requires quite a bit of compilercode but would not save much code elsewhere while remov-ing string literals would save and simplify compiler code butmake the rest of the code much less readable. In short, wewere doing programming language design driven by self-referentiality.The grammar of C* is LL(1) and has six keywords (int,while,if,else,return,void). C* features five statements(assignment, while loop, if-then-else, procedure call, and re-turn) as well as five built-in functions that are sufficient tobootstrap selfie (exit,malloc,open,read,write). Notably,there is nofreeand noclose. Procedures can have argu-ments, local variables, and return values. There are standardarithmetic operators (+,-,*,/,%) and comparison operators(==,!=,<,<=,>,>=) but no bitwise and no Boolean operators.Integer, character, and string literals are supported.There are only two data types in C*, namely signed in-teger (int) and pointer to signed integer (int*). In partic-ular, there are no composite data types for two importantreasons. Firstly, showing that something like selfie can beimplemented without composite data types has a profoundeffect on students. It makes students realize that data typesin programming languages are not needed for expressive-ness which is something that comes as a surprise to manystudents. Secondly, not supporting composite data types inC* allows me to ask students to implement their own (usu-ally, arrays and structs) which in my experience appears tomake students truly understand the concept and motivationof data types.The only way to access dynamically allocated memory inC* is through the dereferencing operator*which gave C*its name. Dereferencing and pointer arithmetic provide fullcontrol over memory access and therefore play a key role insystems programming and in particular in selfie. The basicprinciple that is so important for students to understand isthat an integer in selfie can also be an address and vice versa. In other words, main memory is not just storage, it is alwaysalso an address space that is in the same value domain asthe storage modulo word alignment. In my experience, C*helps students develop awareness and understanding of theduality of address and storage, in addition to the duality ofcode and data" -- [85]
- https://c9x.me/qcc/ ~20K
- https://github.com/rui314/chibicc
- Clean C: "The Clean C dialect is that subset of ANSI C that is compatible with the C++ language" [86]
- cproc is a C11 compiler using QBE as a backend.
- cproc seems to intend to support almost all of C11 but here's the current status: " What's missing Digraph and trigraph sequences (6.4.6p3 and 5.2.1.1, will not be implemented). Wide string literals (#35). Variable-length arrays (#1). volatile-qualified types (#7). _Thread_local storage-class specifier (#5). long double type (#3). Inline assembly (#5). Preprocessor (#6). Generation of position independent code (i.e. shared libraries, modules, PIEs). "
- starc from the starc Selfie project
- http://perso.b2b2c.ca/~sarrazip/dev/cmoc.html
- https://vgel.me/posts/c500/
- the cc_* variants from https://github.com/oriansj/stage0-posix, eg https://github.com/oriansj/stage0-posix-amd64/blob/master/cc_amd64.M1
and/or safer:
switch statements must be structured as in MISRA-C; unstructured "switch", as in Duff's device, is not supported. ((note: i think this means (a) each 'case' in a switch statement must be followed by its own 'break' and (b) all 'case's must be at the same scope (unlike Duff's device)))
longjmp and setjmp are not guaranteed to work.
Variable-length array types are not supported.
Consequently, CompCert? supports all of the MISRA-C 2004 subset of C, plus many features excluded by MISRA (such as recursive functions and dynamic heap memory allocation). ... Some other unsupported constructs, such as variable-arity function declarations, are rejected. "
and/or subset:
- MISRA-C:2004 - Guidelines for the use of the C language in critical systems. Technical report, October 2004.
- SEI CERT C
- Clight
Other variants:
- https://cohost.org/jckarter/post/2955755-the-lost-language-ex
- Underscores in numeric literals
- Labeled arguments
- Case ranges
- Nested functions
- "nonescaping closures...allows local function references to be used as first-class values, though their lifetime doesn't extend past when the surrounding function returns. Nested functions can even goto back into their parent function, allowing for nonlocal exits to break out of nested functions like Smalltalk blocks, allowing control flow-like functions to be built using them."
- Generators
- "How does all this work in plain C without a fancy runtime? High C's generators act as relatively straightforward syntax sugar over the nested function feature."
- https://lobste.rs/s/zku7zr/lost_language_extensions_metaware_s_high
- " Pascal’s nested functions actually did/do work as (non-escaping) closures. The function-pointer type is fat, carrying a context pointer as well as the code address. Regular function pointers ignore the context, but for a nested function the context points back to the outer function’s stack frame. I don’t know if this was true of all Pascals, but IIRC both HP’s and Apple’s worked this way. ... Apple’s blocks extension made block pointers a single word, but that pointed to a struct that had an invoke field containing the function pointer. This was nicer in the type system because you knew whether it was safe to capture (blocks must be explicitly ‘copied’, which either promoted them to the heap or incremented their reference count). Making things that are global and always safe and things that are on the stack and require promotion to copy the same type is a big source of bugs (the fact that C/C++ don’t make this distinction for arbitrary pointers has caused a lot of security problems and is probably the biggest single win for Rust). As I recall, blocks didn’t properly support attribute cleanup, so was impossible to use safely in C for any non-trivial captures (fine for C++ or Objective-C objects with destructors), though I think that was fixed a couple of years ago. " -- https://lobste.rs/s/zku7zr/lost_language_extensions_metaware_s_high#c_mq07iw
Opinions on C
- Rob Pike lists as 'good' "well understood and has aged surprisingly gracefully", 'bad' no garbage collection, no string handling, 'ugly': The preprocessor, conditional compilation [87]
- https://gist.github.com/alexpana/91493a72b374651ab9ae complains about:
- "C++ allows changing the access modifier of a virtual function in the derived class" including allowing the derived class to hide as 'private' things that were public in the superclass
- "Operator over-overloading: One of the increment operators takes a dummy int parameter in order to allow overloading. Can you tell which without googling? (hint: its postfix)."
- "Exception unspecifiers: C++ has two types of exception specifiers: throw() and nothrow. The first is deprecated (because 'we screwed up, sorry, let's forget about this terrible mess'). The second one guarantees it's contract by terminating the application when violated. That's because functions declared nothrow can still call legacy functions that don't declare anything (backwards compatibility and such), so the compiler can't validate your claim, choosing to explode in your face at runtime."
- "Without adding another layer of indirection via the PIMPL idiom, changes to the internal structure of the class propagate to all it's users, making compilation painfully / unecessary slow. Granted, the C++ object model requires this when the modified class is used as a value field inside another class, but C++ has no way of distinguishing between use cases. This is what you get for using a text preprocessor for handling dependencies."
- "Binary incompatibility: Linkers are not standardised"
- "RValue References: C++ has value semantics, which means it generates a copy constructor to deep copy value fields (convenient huh?). While the standard committe thought it was pretty cool, they eventually realised how many times objects were unecessary or accidentally copied behind the scenes (return values from functions, array operations, etc). So they invented 'move semantics', 'move constructors', and a new synthax for overloading on ephemeral 'rvalue references': &&."
- Name-manging is non-standard. Name-mangling is needed for binary compatibility with C (C++ started out as a C preprocessor) b/c "C doesn't allow function overloading and doesn't have namespaces"
- "Inheritances: Since there are multiple access modifiers (public, protected, private), why not multiple inheritance(s)? There's even a virtual inheritance! While their meaning is beyond the scope of this rant, suffice it to say there are aspects that make them non trivial to understand / use. Scott Meyers wrote in his famous Effective C++ book: "protected inheritance is something whose meaning eludes me to this day". And he's a C++ guru."
- "I've commented several times before that I consider "C" and "C with analysis backing" to be in practice two different languages. SQLite is the latter: https://sqlite.org/testing.html Writing SQLite in "plain C" without all that would, well, simply not work. I agree that "C with analysis backing" is the best language for SQLite right now. However, it should not be used as an example of "How C is a great language for programming in" unless you are also planning on skillfully using a very significant subset of the tools listed in that document. SQLite doesn't prove C is great; it demonstrates how much additional scaffolding is necessary to wrap around C to get that level of quality, and given the quantity and diversity of tools we are talking about, it is not particularly complimentary to "plain C"." [88]
- "My favorite aspect of programming in C is how little it gets in the way of the programmer. Once you become familiar with the language there is not much to lookup in the docs because it's such a small language. I might end up writing more verbose code, but it's usually very clear to me. You still can write spaghetti code but that's besides the point." [89]
- "I mostly agree but I think there are too many weird UB and corner cases to qualify as a small language. I guess C compiled with -fwrapv -fno-strict-overflow -fno-strict-aliasing and a few others might come close to what you're describing. I think C's apparent simplicity lulls me into complacency at times, I delude myself into thinking that I'm coding into some kind of macro assembly and I think I know what the resulting machine code will look like. And then some super weird optimization or UB kicks in and nothing makes sense anymore, because I stopped playing by the rules and I triggered the footgun. Just look at the number of bug reports on the GCC bugtracker for code that at a glance ought to work and it turns out that it's actually not a bug, the code just triggered a subtle UB and the compiler ran away with it and generated code that ate your cat." [90]
- "I often find it amusing how C advocates complain about the complexity of C++, and then proceed to implement the same complex functionality in even more brittle ways. Language features I have seen C developers emulate poorly in C: constepxr..., virtual functions (with function pointers), templates (with macros), exceptions (with setjmp/longjmp)." [91]
- C's Biggest Mistake by Walter Bright (the mistake is casting arrays to pointers rather than having a separate array type that stores the array length)
- https://accu.org/content/conf2009/AndreiAlexandrescu_iterators-must-go.pdf
- "Well designed.. I'm not so sure, once you're used to something you're 'blind' to its defaults.. Big: signed/unsigned automatic conversion? Minor: conflating unsigned and modulo arithmetic (why not allowing signed modulo int?)? Huge: no way to pass an array but just a pointer? ... " [92]
- "Arrays are almost but not quite the same as pointers. You can pass functions but not return them. There's a random grab-bag of control flow keywords, too many operators with their own precedence rules, and too many keywords given over to a smorgasbord of different integer types with arbitrary rules for which expressions magically upconvert to other ones." "The examples I gave have a very small language where you really can't remove anything, whereas in C quite a lot of the language is rarely-used, redundant, or bodged: the comma operator surprises people, for and while do overlapping things, braces are mandatory for some constructs but not for others, null and void* are horrible special cases." "Standard libraries are a different matter, but I'm not too impressed by C there either; it's not truly minimal, but it doesn't cover enough to let you write cross-platform code either. Threading is not part of the pre-99 language spec, and so you're completely reliant on the platform to specify how threads work with... everything. Networking isn't specified. GUI is still completely platform-dependent. The C library only seems like a baseline because of the dominance of unix and C (e.g. most platforms will support BSD-style sockets these days)." "I'm actually most impressed by the Java standard library; it's not pretty, but 20+ years on you can still write useful cross-platform applications using only the Java 1.0 standard library. But really the right approach is what Rust and Haskell are doing: keep the actual standard library very small, but also distribute a "platform" that bundles together a useful baseline set of userspace libraries (that is, libraries that are just ordinary code written in the language)." [93] and [94] "...we ended up with about 50 millions (and counting) different meaning for "static" for instance. I like C but its simplicity is almost by accident more than by design. It's pretty far from "perfect" in my book." "There are so many weird features about the language that can bite you ass for no good reason. Why don't switch() break by default since it's what you want it to do the overwhelming majority of time (answer: if you generate the corresponding assembly jump table by hand "fall through" is the easiest and simplest case, so they probably kept it that way)." "Why do we have this weird incestuous relationship between pointers and arrays? It might seem elegant at first (an array is a pointer to the first element or something like that) but actually it breaks down all over the place and can create some nasty unexpected behavior." "Why do we need both . and -> ? The compiler is always able to know which one makes sense from the type of the variable anyway." "String handling is a nightmare due to the choice of using NUL-terminated strings and string.h being so barebones that you could reimplement most of it under an hour." "Some of the operator precedences make little sense." "Writing hygienic macros is an art more than a science which usually requires compiler extensions for anything non-trivial (lest you end up with a macro that evaluates its parameters more than once)." "Aliasing was very poorly handled in earlier standards and they attempted to correct that in more modern revisions while still striving to let old code build correctly and run fast. So you have some weird rules like "char can alias with everything" for instance. Good luck explaining why that makes sense to a newbie without going through 30+ years of history." "The comma operator." "Undefined function parameter evaluation order." [95]
- vs C++: "A lot of C programmers complain about the wildness of C++, but frankly, often complex C programs are that close to be their "own languages" with custom object systems, Macros and what not." [96]
- "I have a huge problem with C because the same characters and keywords are reused in multiple different similar but different ways and it adds cognitive load to things that should be simple (eg declarations) -- leaving me less able to devote my limited mental faculties to the problem I am actually trying to solve. (Not singling out C, just picking a well known example.)" Tloewald
- "I would say it’s very hard to write cross platform C, but it definitely exists. sqlite and Lua are pretty good examples from what I can see." [97]
- "I appreciate the original simplicity of K & R, "The C Programming Language", 2nd Edition, and the relatively simple semantics of ANSI C89/ISO C90 compared to C99 and later. You don't need complex parsing methods for ANSI C89/ISO C90 and you do not need the "lexer hack" to handle the typedef-name versus other "ordinary identifier" ambiguity." [98]
- C’s Biggest Mistake by Walter Bright (creator of the D programming language). The alleged mistake is conflating pointers with arrays. He recommends adding a new syntax for passing arrays in function calls as fat pointers (which retain their length): void foo(char a[..]).
- VLA are... problematic (variable length arrays)
- https://blog.joren.ga/vla-bad
- http://www-cs-students.stanford.edu/~blynn/c/intro.html
- http://www-cs-students.stanford.edu/~blynn/c/cruft.html and http://www-cs-students.stanford.edu/~blynn/c/wishlist.html
- https://danluu.com/boring-languages/ contains a list of important software written in C
- in replies to this comment people list some disadvantages and gotchas about C: https://lobste.rs/s/goq4d8/announcing_hare_programming_language#c_843xlm
- "C offers abstractions which it does not in fact support: Arrays remain without index checking, data types without consistency check, pointers are merely addresses where addition and subtraction are applicable. One might have classified C as being somewhere between misleading and even dangerous." -- Niklaus Wirth, A Brief History of Software Engineering
Opinionated comparisons
- vs Rust: https://pngquant.org/rust.html
- https://lobste.rs/s/1wrjdy/rewriting_libimagequant_rust_for on C interop: " Although people like to bash C++, there isn't any programming language with an ABI as part of the language standard, not even C. The C ABI that so many people adhere to, only exists in OS written either in C or C++ (via extern "C"). By the historical accident that all commercial major OSes are mostly written in C, developers that aren't language lawyers tend to think C ABI is somehow defined in some standard. " -- [99] "...the C ABI / calling convention is OS- and platform-specific. (Which does lead to practical problems on modern systems when you assume the ABI behaves similarly on different platforms; a good example is https://github.com/rust-lang/rust/pull/22862 , caused by 64-bit iOS treating varargs differently from OS X on x86-64 or even Linux on aarch64.)" [100]
- https://c3.handmade.network/blog/p/8486-the_case_against_a_c_alternative mentions Zig, Odin, Jai, also eC and C3, as C replacements, and D, Rust, Nim, Crystal, Beef, Carbon as C++ replacements
- vs C3 (and Zig, Jai, Odin): https://www.c3-lang.org/compare/
- vs Odin, Zig, Jai, C3: https://www.reddit.com/r/C_Programming/comments/nqkn93/comment/h0bvlap/?utm_source=reddit&utm_medium=web2x&context=3
C0
"C0 is a Pascal-like subset of C, similar to MISRA-C [133]. It excludes the features of C that make full verification problematic: pointer arithmetic, unions, pointer casts. It is the intermediate language used in the Verisoft project for pervasive formal verification of computer systems." -- Yannick Moy. Automatic Modular Static Safety Checking for C Programs. PhD thesis, Université Paris-Sud, January 2009, section 2.1, page 42
- Dirk Leinenbach, Wolfgang Paul, and Elena Petrova. Towards the formal verification of a C0 compiler: Code generation and implementation correctnes. In SEFM ’05: Proceedings of the Third IEEE International Conference on Software Engineering and Formal Methods , pages 2–12, Washington, DC, USA, 2005. IEEE Computer Society.
http://reports-archive.adm.cs.cmu.edu/anon/2010/CMU-CS-10-145.pdf
http://c0.typesafety.net/tutorial/From-C0-to-C.html
HNC
macro language for C:
CIL
http://cil-project.github.io/cil/doc/html/cil/
https://en.wikipedia.org/wiki/George_Necula#C_Intermediate_Language
Checked C
http://research.microsoft.com/en-us/projects/checkedc/
C with language extensions to explicitly specify pointer/array bounds. "For an existing C program to be correct, there has to be an understanding on the part of the programmer as to why pointers and array indices stay in range. The goals of the design are to let the programmer write this knowledge down, to formalize existing practices, such as array pointer parameters being paired with length parameters, and to check this information." There are three ways provided to secure pointers; raw pointer can be declared, which cannot be used in pointer arithmetic; static checking, where bounds expressions in a limited language are declared which can be checked at compile time; and dynamic pointers, which are dynamically checked at runtime. There is also still raw unchecked pointers which can do pointer arithmetic, to allow for backwards compatibility. The extensions are only for checking bounds; the language report gives examples of other areas which are not dealt with: incorrect deallocation of memory still in use; incorrect type-unsafe pointer casts; concurrency races.
Chapter 2 of [101], starting on page 14, summarizes the language extensions' features.
The following operations are supported on pointers (the ops in this list which take two pointers require that the pointers be to objects of the same type; and some of these are not supported on checked raw pointers, which don't support pointer arithmetic):
- dereference/indirection (the '*' operator in C) (from the pointer, you can access the value it points to)
- array reference ([]) (syntatic sugar for adding an offset to a pointer, and then dereferencing; from the pointer, you can access the value at an offset)
- assignment (a = b): two pointers can be assigned
- pointer + int (which implies pointer - int)
- equality comparison (==)
- relational comparison (<=)
- pointer - pointer
- typecast (when typecasting, there are three options for bounds checking: either the bounds safety can be checked at compile time, or they are checked at runtime, or there is no checking and the programmer is trusted)
Either pointers or their values can be const; eg you can declare a const pointer to a mutable value, or, by contrast, a mutable pointer to a const value.
Variables which carry their bounds with them at runtime carry both a lower and an upper bound (unlike some competing projects, in which the bounds carry counts of elements).
Bounds can be declared at variable declarations, in which case they must be true for the entire lifetime of the variable, or at assignments, in which case they are propagated via dataflow. Bounds at variable declarations might involve multiple variables; sometimes you have to update these variables, but since you can only atomically update one of them at a time, there may be a period of time when they are 'out of sync' and the bounds condition is not true, even though it will again become true after all are updated; this is dealt with via 'bundled blocks' of variable updates, within which the bounds expression may be false. The syntax "x : count(e1)" declares that only memory that is at or above x and below x + e1 can be accessed through x ([102] section 3.1), and the syntax "x : bounds(e1, e2)" declares that memory that is at or above e1 and below e2 can be accessed through x.
Bounds-checking expressions must be 'non-modifying expressions' ie no side effects (i think?).
Note that for the bounds on a pointer to be considered to hold, the pointer must either be null or have "been derived via a sequence of operations from a pointer to some object obj with bounds(low, high)" such that the bounds on this pointer are within the range low, high (for the implications of this, see [103] section 3.1 PDF page 35, and section 11.5 PDF page 135, and section 11.6 PDF page 136).
The language provides 'checked scopes', in which only checked pointers are allowed "to prevent inadvertent use of unchecked type". "Another extension is a refinement of C’s notion of undefined behavior. We wish to maintain bounds safety even in the face of integer overflow, which is a widespread possibility in C programs. This requires a more nuanced requirement on the result of overflow than “the result is undefined."
" If a compiler cannot prove that a new pointer type value is non-null before a pointer arithmetic operation, it must insert a check. Similarly, if a compiler cannot prove that a pointer arithmetic expression for a new pointer type cannot overflow, it must insert a check. This may slow a typical program down by a slight amount. ... In our experience working on an operating system written in managed code that required these checks, the slowdown was a few percent, even when all arithmetic instructions were checked for overflow too. The cost of checks was low because it is often the case that checks are redundant or can be combined into adjacent machine instructions....The checks are also inexpensive on superscalar processors: they can be placed so that they are statically predicted by processors to never fail. "
Pointer bounds follow alignment constraints. This slightly complicates the language definition but allows a runtime speedup of bounds checking.
Section 2.4 of [104], starting on PDF page 19, talks about some semantics stuff and undefined/null/overflow/error handling.
Section 2.8 of [105], starting on PDF page 30, has very interesting lists of undefined and unspecified behaviors in C, what changes are made in checkedc, and what the resulting formal aim is. In sum, some examples of undefined behavior in C are division by 0, signed integer overflow, out-of-bounds pointer arithmetic, pointer arithmetic between two pointers that don't point to the same object, and (sometimes) expressions with nested assignments (because their side effects aren't always ordered, therefore the result of the expression can be undefined). Unspecified behavior is not totally undefined, but is defined to be one of many options; examples include expressions with side effects. There is also implementation-defined behavior, including ranges of values for ints, floats, and pointers; data layout (the sizes of types, padding, alignment), and some aspects of pointer conversion. "It is difficult to reason about the correctness of programs that have expressions with undefined behavior. One has to make sure that a program never arrives at a point where behavior is undefined. In practice, this would mean proving that signed integer overflow can never occur, for example. For unspecified behavior, one has to reason about all possible behaviors, such as all possible orders of evaluation. For implementation-defined behavior, one has to reason about the implementation-specific behavior or have reasoning that is independent of the details." Changes CheckedC? makes include semantically treating pointers as addresses in memory, mandating the result of divide-by-0 and signed integer overflow to be either some legal value or a runtime error, and mandating that undefined expressions be rejected at translation-time. They note that other sources of undefined behavior still exist, including: unchecked code that dereferences out-of-bounds pointers; invalid typecasting; use-after-free (accessing deallocated memory). They note that CheckedC? does not purport to save you from invalid bounds when your program contains these other errors.
Section 2.7 of [106] on page 22 explains the reasoning for having assert-like dynamic bounds checks, rather than simply using ordinary if/thens in programs and returning error code if the bounds checks don't pass, citing O'Donell and Sebor. In sum, handling errors leads to more awkward and inefficient code, which is annoying especially when, assuming the programmer is correct, the error should be impossible. They recognize that if the assertion is ever actually involked then there's a problem, but they say that this is the same problem that C programs already have with null pointer dereferences. My note: this is interesting to me because it stands in contrast to the 'null pointers are a billion dollar mistake' narrative that says that this sort of thing is bad design. My note: it seems to me that they are implying that assertions (and assertion-like bounds checks) should be used when the assertion cannot fail unless there is a LOCAL mistake within the current scope (or within the current module); for argument checking however (checking the validity of arguments passed into the module by an external caller), ordinary control flow (or exceptions, etc) should be used; in other words, YOU can assert that YOUR code is 100% correct, but never assert that SOMEONE ELSE's code will be correct.
Section 2.7 also gives an example where a dynamic check can actually increase runtime speed by leading the compiler to deduce that other more expensive checks are not needed.
Lessons learned (from page 5 of [107]:
- it's important to have a succinct syntax for declaring bounds, yet one that permits rich expressions
- declaring bounds only at declarations is usually sufficient
- pointer casts are common but usually existing bounds could be easily modified to be appropriate for the new type
- many external libraries specify data layout, and one must interoperate with them
- "signed integer overflow was a pervasive possibility, which raised questions about the meaning of bounds declarations that used signed integers"
Chapters 4 and 8 of [108], starting on PDF pages 57 and 107, presents an inference procedure that a compiler can use to statically check and infer bounds and to reason about simple program invariants. Note that invertible functions appear to be important. The syntax "where i >= 0;" is used to 'assert' that i >= 0, and the syntax "where buf : bounds(tmp, end);" is used to 'assert' that array_ptr 'buf' has lower bound equal to array_ptr 'tmp', and upper bound equal to array_ptr 'end'.
Chapter 11 of [109] lists rationale for design choices. The syntax "ptr<const unsigned int> g" was chosen over "const unsigned int ptr" because the latter is more difficult to visually parse. Bounds declaration are permitted at assignments (rather than only at variable declaration) because when the bounds of a pointer need to change, that would mean making the programmer replacing one variable with many different variables, which was deemed too annoying. They consider adding a new bounds syntax, 'object_bounds', whose semantics are that the bounds are always valid, without any exception for null pointers; this would remove the need for null pointer checks (note: i think this is like Haskell Maybe); however they felt that "This risks dividing C functions into those that can take null pointer arguments and those that only take non-null pointer arguments. In our experience, requirements that pointers not be null tend to propagate virally throughout code bases. If an entire code base can be converted, this can work well. Otherwise, it results in messy dynamic checks for non-nullness in code. Now, it is possible with lightweight invariants and dynamic check that such checks would not be messy. The existing practice in C is that null pointers are widely used interchangeably with non-null pointers. This introduces uncertainty about the usefulness of this feature for C code bases. The runtime benefit of eliminating non-null checks in pointer arithmetic is likely small. Following the principal of minimality, we have rejected the design choice of adding object bounds for now".
Chapter 9 of [110] describes and contrasts related work.
A Formal Model of Checked C
Frama-C
https://frama-c.com/
"Frama-C is an extensible and collaborative platform dedicated to source-code analysis of C software."
in progress:
Retrospectives:
types:
Standards:
Practices
"By default you should compile with optimization on. Forcing the compiler to work harder helps it find more potential problems." -- http://www.slideshare.net/olvemaudal/deep-c
Gotchas
- The goal of the Underhanded C Contest is to write code that appears to be readable, clear, innocent and straightforward, yet fails to perform its apparent function.
- '==' tests, but '=' assigns, even within expressions that you might think are not mutating anything.
- an especially pernicious place for this mistake to be made is within an assert statement; eg if you are supposed to have "assert(answer == 42);" but instead you accidentally put "assert(answer = 42);", then in debug builds not only will this condition not be checked, but in fact when the condition is wrong, it will be corrected; but then in production builds if you turn off asserts, hidden bugs will be uncovered; [111]
- this might be considered a special case of a more general gotcha; executing the expression within an assert statement can mutate things; which might be considered a special case of an even more general gotcha; the syntax is such that you can quickly skim over what you expect to be a side-effectless expression, and not really that it has side-effects. In gcc, the '-Wparentheses' warning sometimes catches this (see [112]).
- x<=y<=z does not mean "(x <= y) && (y <= z)" but rather "(x<=y ? 1 : 0) <= z". In gcc, the '-Wparentheses' warning catches this (see [113]).
- dangling else: in the following, to which 'if' does the 'else' belong to? {if (a) if (b) foo (); else bar ();} (the innermost one, but this isn't apparent). In gcc, the '-Wparentheses' warning catches this (see [114]).
- Struct/union and typedef struct/union are different name spaces in C (not in C++) [115]
- Line continuations can break a single token across two lines, making it hard to search in a text editor [116]
- 4["string"], and sizeof ('a') != sizeof (char) [117]
- Avoid realloc(p, 0) due to standards breakage
unsigned int ui_one = 1 ;
signed int i_one = 1 ;
signed short s_minus_one = -1 ;
if( s_minus_one > ui_one)
printf("-1 > 1 \n");
if( s_minus_one < i_one)
printf("-1 < 1 \n");
#./run
#
# -1 > 1
# -1 < 1
-- [118]
explanation: not sure, but i think what is happening here is that the compiler first checks if a promotion is neccesary. In the case of s_minus_one > ui_one, s_minus_one is promoted to an int. So now we are comparing a (signed) int to an unsigned int. Since the ranges of 'int' and 'unsigned int' are disjoint, there is no obvious choice of which one to convert to what; in this case, both are converted to 'unsigned int'. This is done by repeatedly adding 2^32 to the (signed) int until it is in the range of 'unsigned int' (that is, until it is positive). This creates a large positive number, which is greater than 1. In the case of s_minus_one < i_one, s_minus_one is promoted to an int. Now both arguments are of the same type, so the comparison is done, and -1 < 1, so it succeeds.
extern void foo(void);
void (*f)();
f = &foo; // Valid
f = foo; // Valid too ! (syntactic sugar)
f(); // Calls f
(*f)(); // Calls f too ! (syntactic sugar)
-- [119]
- array/pointer decaying special cases
int array[] = {0,1,2,3,4} ;
int* pointer = array;
if (sizeof array == sizeof pointer)
printf("This will never be printed !!");
if (sizeof(int*) == sizeof &array[0])
printf("This will be printed !!\n");
if (&array[2] - &array[0] == 8)
printf("This will never be printed either, result is 2 not 8!!");
explanation: In the first case, sizeof array is the size of the entire array. In the second case, int* is a pointer, and &array[0] is a pointer, so their sizeofs are equal (the size of a pointer). In the third case, &array[2] - &array[0] = &*(array+2) - &*(array+0) = (array+2) - (array+0) = 2 - 0. [120].
Note that sizeof(&array[0]) != sizeof(array) (although they 'decay' to the same thing, for the purpose of sizeof the first is the sizeof a pointer and the second is the sizeof the array). [121]
Note that '&array' and '&array[0]' are of a different types, and behave differently in pointer arithmetic, although in a sense both refer to the memory location at the beginning of the array; for &array, the implied offset is the size of the array, but for &array[0] the implied offset is the size of the array's elements. [122].
Relevant parts of C standard: C89 3.3.3.4
"The sizeof operator... When applied to an operand that has array type, the result is the total number of bytes in the array."
C89 3.2.2.1
"Except when it is the operand of the sizeof operator or the unary & operator, or is a character string literal used to initialize an array of character type, or is a wide string literal used to initialize an array with element type compatible with wchar_t, an lvalue that has type ``array of type is converted to an expression that has type ``pointer to type that points to the initial member of the array object and is not an lvalue."
[123]
[124]
- "...incorrect conclusions like "int addition wraps around on overflow" (mentioned in the article), "null pointer dereferences are guaranteed to crash the program", "casting from a float pointer to an int pointer and dereferencing it is OK to do low-level tricks"..." [125]
- Structure padding is implementation-dependent, so the sizeof a struct is implementation-dependent [126]
- promotion rules [127]
- sizeof(short int) is never larger than sizeof(int), but they might be the same size [128]
- sizeof(char) is implementation-dependent, so if you do arithmetic on a char, whether or not it overflows might be implementation-dependent in some cases, for example, if you multiply [129]
- sizeof(int) is implementation-dependent, so if you do arithmetic on an int, whether or not it overflows might be implementation-dependent in some cases, for example, if you bit-shift [130]
- order of expression evaluation is implementation-dependent, so if you mutate state within an expression and then access that state elsewhere in the same expression, you can't know if the access will take place before or after the mutation, which may make the expression result implementation-dependent [131]
- https://wordsandbuttons.online/so_you_think_you_know_c.html
- "If you use C as an assembler macro language, you aren't actually writing C. You're likely to get burned someday, unless you compile at -O0." -- Patrick Walton (pcwalton), major Rust contributor
- Writing Portable C. By David Chisnall
- The New Features of C11 discusses some gotchas in the Concurrency section
- "...even seasoned C programmers regularly rely on architecture-specific assumptions when writing ostensibly cross-platform code: assumptions about the atomicity of reads and writes, operation ordering, coherence and visibility in self-modifying code, the safety and performance of unaligned accesses, and so forth." [132]
- "There’s lots of stuff out there that just isn’t supported. Try running something written in “100% portable ANSI C”: on a system whose encoding isn’t based on ASCII. on a system that uses floats based on something other than IEEE-794. on a system where pointers don’t fit in any integer type. on a Harvard architecture (C99 defined some stuff to make this more doable, but…). on a system where CHAR_BITS isn’t 8. " [133]
- https://belaycpp.com/2021/08/31/yet-another-reason-to-not-use-printf-or-write-c-code-in-general/
- example of C macro difficulties: https://news.ycombinator.com/item?id=15783885
- https://jeang3nie.codeberg.page/case-for-modern-language-pt1/
and little-known features:
Internals and implementations
Core data structures: todo
Number representations
Integers
Floating points todo
array representation
variable-length lists: todo
multidimensional arrays: todo
limits on sizes of the above
string representation
The famous null-terminated "C string".
Representation of structures with fields
todo
undefined behavior
https://news.quelsolaar.com/2020/03/16/how-one-word-broke-c/ (but see https://news.ycombinator.com/item?id=22590261 )
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html goes into detail about how allowing undefined behavior (or at least undefined value return) makes C faster:
Use of an uninitialized variable: no need to autoinitialize
Signed integer overflow: can optimize eg "X*2/2" to "X"; " While these may seem trivial, these sorts of things are commonly exposed by inlining and macro expansion.". Another example: "for (i = 0; i <= N; ++i) { ... }"; can assume that the loop will iterate exactly n+1 times (as opposed to i maybe wrapping if N is large)
Oversized Shift Amounts: different platform do different things in this case, so leaving this an undefined result lets the compiler just let the platform decide, rather than putting in extra instructions
Dereferences of Wild Pointers and Out of Bounds Array Accesses: "To eliminate this source of undefined behavior, array accesses would have to each be range checked, and the ABI would have to be changed to make sure that range information follows around any pointers that could be subject to pointer arithmetic"
Dereferencing a NULL Pointer: "dereferencing wild pointers and the use of NULL as a sentinel. NULL pointer dereferences being undefined enables a broad range of optimizations: in contrast, Java makes it invalid for the compiler to move a side-effecting operation across any object pointer dereference that cannot be proven by the optimizer to be non-null. This significantly punishes scheduling and other optimizations. In C-based languages, NULL being undefined enables a large number of simple scalar optimizations that are exposed as a result of macro expansion and inlining."
Violating Type Rules: It is undefined behavior to cast an int* to a float* and dereference it (accessing the "int" as if it were a "float"). C requires that these sorts of type conversions happen through memcpy: using pointer casts is not correct and undefined behavior results.... (eg an example is given where a loop that zeros memory is replaced by a call to a platform primitive for this purpose)
misc details
Efforts for a 'safer C' dialect, and/or to see how people's practical use of C in edge cases differs from what the standard says:
Versions
Extensions and improvement proposals
Implementations
- Clang
- gcc
- Clue is an ANSI C compiler (C89, some C99) that targets high-level languages such as Lua, Javascript or Perl (and some low-level ones). It supports the entire C language, including pointer arithmetic, and can be used to run arbitrary pure-C programs.
- https://github.com/thradams/cake
Implementation tools
C links