proj-oot-ootNotes32


https://github.com/8l/cc500

" cc500

CC500: a tiny self-hosting C compiler

http://homepage.ntlworld.com/edmund.grimley-evans/cc500/


Introduction

I wrote this tiny compiler, which translates a subset of C into x86 machine code, for fun. It has no use, unless it counts as educational. I called it CC500 because I initially guessed it would take about 500 lines. It turned out to be about 600 lines even without the comments and blank lines. With the comments and blank lines it has about 750 lines. It could be made shorter, but I wanted the code to be clear and simple rather than obfuscated.

Download

cc500.c - just the code

Compilation:

    gcc cc500.c -o cc500_1

Self-compilation:

    ./cc500_1 < cc500.c > cc500_2
    chmod +x cc500_2

How it works

The compiler reads the whole program in C from stdin and writes an ELF binary for Linux/x86 to stdout. The ELF binary is somewhat non-standard as it consists of a single section that is both writable and executable. It seems to work for me on a real Linux box, but it might cause problems on an emulator.

No libraries are used. Tiny, machine-code implementations of exit(), getchar(), malloc() and putchar() are included in the binary that is generated. There is no free() as malloc() is just sbrk(), really, implemented with two calls to brk().
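To make the malloc-is-just-sbrk remark concrete, here is a minimal sketch of the idea in portable C (my illustration, not CC500's actual emitted code; the generated binary uses the raw Linux brk syscall, which the libc wrappers below only approximate):

    /* Bump allocator in the spirit of CC500's built-in malloc():
       no free list, no free(), just push the program break forward. */
    #include <unistd.h>

    static char *heap_end = 0;

    void *my_malloc(unsigned long n) {
        char *p;
        if (heap_end == 0)
            heap_end = sbrk(0);         /* find the current break */
        p = heap_end;
        if (brk(heap_end + n) != 0)     /* extend the break by n bytes */
            return 0;
        heap_end = heap_end + n;
        return p;
    }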

There is almost no error-checking. With some invalid C programs the compiler calls exit(1) instead of generating a binary, but in most cases it generates a binary that crashes when you run it.

There is no preprocessor. The lexer looks for tokens that match one of the following Perl regular expressions: /[a-z0-9_]+/, /[<=>|&!]+/, /'[^']*'/, /"[^"]*"/, /./. Traditional C-style comments are handled, but not C++-style comments.

The symbol table is implemented with a single array of char. The type of each symbol is recorded as 'D' (defined global), 'U' (undefined global), 'L' (local variable) or 'A' (argument of a function) - see sym_get_value(). There is also an address recorded for each symbol, but no type, as there is no type-checking and all arrays are assumed to be arrays of char. The address of an 'L' or 'A' is its position on the stack. The address of a 'D' is the position of that function or global variable in the binary. The address of a 'U' is also a pointer into the binary, but a pointer to where the symbol was last used and the head of a linked list of all the forward references to that symbol.
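The linked-list trick for 'U' symbols is worth spelling out. A hedged sketch (invented names, not CC500's exact code): each use site of an undefined symbol holds the position of the previous use, and when the definition finally arrives the chain is walked and every hole is patched:

    #include <string.h>

    static char code[65536];   /* the output binary being built */

    static int  get32(int pos)        { int v; memcpy(&v, code + pos, 4); return v; }
    static void put32(int pos, int v) { memcpy(code + pos, &v, 4); }

    /* Called when a 'U' symbol becomes 'D' at address 'target':
       walk the chain of forward references and patch each one. */
    void backpatch(int head, int target) {
        while (head != 0) {
            int next = get32(head);   /* each hole stores the previous use */
            put32(head, target);
            head = next;
        }
    }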

Scoping rules for local variables are implemented correctly as it is easy to do so: at the end of a compound statement the pointer to the end of the symbol table is restored to what it was at the start of the compound statement - see the assignment to table_pos in statement().
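In code, the scoping trick is just a save and restore around the block (a sketch using the table_pos variable the text mentions):

    extern int table_pos;   /* cursor into the symbol-table char array */

    void compound_statement(void) {
        int saved = table_pos;  /* entering '{': remember the table end */
        /* ... parse the block's declarations and statements ... */
        table_pos = saved;      /* leaving '}': discard this block's locals */
    }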

I started off writing CC500 without local variables, but it felt very unnatural. Nested scopes are an essential part of C and it is not easy to write a recursive-descent parser without automatic variables.

It is assumed that all variables, arguments and return values are char * or int, though the compiler does not even parse the types, let alone record them in the symbol table. The functions that implement the recursive-descent parser return 1, 2 or 3, corresponding to a "char lval", "int lval" or "other" - see promote().

Each parsing function is preceded by a comment that documents the syntactic structures recognised, in an obvious notation. In some cases the code accepts additional, incorrect structures in addition to what is documented in the comment. Only a tiny subset of C's operators are accepted: just the ones I needed while writing the compiler. There are no unary operators: instead of the indirection operator * you must use x[y], where it is assumed that x is char * and y is int. The only statements recognised are compound statements, expressions, if, while and return. Local variable declarations, with initialisers, are treated as a kind of statement. (If you write something like if (x) int y; the compiler will not complain, but the binary will crash.)
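A short program in (roughly) the accepted dialect illustrates the restrictions; note x[y] instead of unary *, and the absence of #include, since CC500 has no preprocessor and provides malloc() itself (with a standard compiler you would add #include <stdlib.h>):

    int main() {
        char *buf;
        int i;
        buf = malloc(10);
        i = 0;
        while (i < 10) {
            buf[i] = 'A' + i;   /* x[y], where x is char* and y is int */
            i = i + 1;
        }
        if (buf[0] == 'A') {
            return 0;
        }
        return 1;
    }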

There are no arbitrary limits on the length of symbols, number of symbols and the like: all three buffers (token, table and code) are supposed to be automatically extended as required by calls to my_realloc().

Extending it

Some simple and obvious enhancements would be to print an appropriate message when an error is detected, check for undefined globals, and allow main() to be anywhere in the file.

It would be easy to extend the compiler to handle the remaining operators and statement types of C. To implement break and continue you would probably add additional arguments to the parsing functions that record the address and stack height of the current loop. The easiest way to implement switch statements might be to add an argument that records the location of the previous case label in the current switch statement; a case label would then be translated with a comparison and a branch forwards to the next case label. In general, C is quite amenable to direct translation into machine code without going through an IR (intermediate representation).
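A hedged sketch of that suggestion (all names invented): thread the loop's start address and a chain of pending break jumps through the parsing functions, reusing the same linked-list patching trick as for 'U' symbols:

    int  accept(char *kw);         /* consume keyword if it is next */
    void emit_jump(int target);    /* emit a jump to a known address */
    int  emit_jump_hole(int prev); /* emit a jump with a hole; store prev
                                      in the hole and return its position */

    void statement(int loop_start, int *break_chain) {
        if (accept("continue")) {
            emit_jump(loop_start);             /* backward edge: target known */
        } else if (accept("break")) {
            *break_chain = emit_jump_hole(*break_chain); /* patched at loop end */
        }
        /* ... other statement kinds recurse with the same two arguments ... */
    }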

Extending the type system would be a bit more challenging. You could correctly implement scalars and pointers, with type checking, by adding a single integer field to the symbol table to record the type. However, if you wanted to implement structs, or function pointers with proper type checking, you would need a recursive data structure to represent the types. It is interesting, but not really surprising, that you start to need structs in the implementation language at roughly the same point that you start to implement them. However, C's types get interesting even without structs, unions and typedefs. For example, void (*(*f())(void (*)(void)))(void); declares a function that returns a pointer to a function that takes a pointer to a function as an argument and returns another pointer to a function. I must admit to finding that part of C's syntax rather difficult to parse mentally. I'm glad we have computers to do it for us. "
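For what it's worth, a typedef untangles that declaration (my illustration, not from the original page):

    typedef void vfunc(void);    /* the function type void (void) */
    vfunc *(*f(void))(vfunc *);  /* f: function returning a pointer to a
                                    function that takes a vfunc* and
                                    returns a vfunc* */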

---

from https://github.com/8l/scc

Deviations from standard C

This compiler aims to be fully compatible with the C99 standard, but it has some differences:

C99 allows you to define type names of function types and write something like:

int f(int (int));

Accepting function types in typenames (or abstract declarators) makes the grammar ambiguous because it is impossible to differentiate between:

        (int (f))  -> function returning int with one parameter of type f
                      (when f is a typedef name in scope)
        (int (f))  -> integer variable f (when f is an ordinary identifier)

Function type names are stupid: they are used as aliases for function pointer types, yet inconsistently so. Something like sizeof(int (int)) is not allowed (here it would have to be understood as the size of a function), while f(int (int)) is allowed, because there it is understood as a parameter of function pointer type.
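A small illustration of the ambiguity (mine, not from the SCC sources): whether int (f) names a function type or declares a variable depends entirely on whether f is a typedef name in scope:

    /* Case 1: f is a typedef name, so "int (f)" is a function type
       and the parameter is adjusted to a function pointer. */
    typedef float f;
    int g(int (f));   /* g takes an int (*)(float) */

    /* Case 2: x is an ordinary identifier, so "int (x)" just declares
       a parameter named x of type int. */
    int h(int (x));   /* h takes an int named x */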

C89 allows the definition of variables with incomplete type that have external linkage and file scope. The type of the variable is the composition of all the definitions found in the file. The exact rules are a bit complex (3.7.2), and SCC currently ignores them: it does not allow any definition of a variable with incomplete type.

If you don't believe me try this code:

struct foo x;

struct foo { int i; };

---

https://groups.google.com/forum/m/#!msg/boring-crypto/48qa1kWignU/o8GGp2K1DAAJ

D. J. Bernstein As a boring platform for the portable parts of boring crypto software, I'd like to see a free C compiler that clearly defines, and permanently commits to, carefully designed semantics for everything that's labeled "undefined" or "unspecified" or "implementation-defined" in the C "standard".

Pretty much every real-world C program is "undefined" according to the C "standard", and new compiler "optimizations" often produce new security holes in the resulting object code, as illustrated by

   https://lwn.net/Articles/342330/
   https://kb.isc.org/article/AA-01167

...e.g., having variables automatically initialized to 0

I'm told that the Plan 9 compiler tries to be somewhat predictable---

   http://article.gmane.org/gmane.os.plan9.general/76989

---but it still leaves many things unspecified, and I'm under the impression that it has its own ABI. There's a "Fail-Safe C" compiler that sounds like it defines much more, including safe semantics for buffer overflows---

   https://staff.aist.go.jp/y.oiwa/FailSafeC/index-en.html

---but it's clear that this has its own ABI. There are various projects such as

   http://www.doc.ic.ac.uk/~awl03/projects/miro/

that add more useful semantics to some parts of C _without_ changing the ABI, but the same compilers will happily screw up other parts of C. There's a "Friendly C" proposal that targets other parts of C more relevant to core crypto---

   http://blog.regehr.org/archives/1180

---but the big picture is still tons of misguided flexibility for the compiler writer.

Imagine, for example, making gcc more boring by picking some version of gcc from a few years ago and specifying that code compiled with boringcc will behave in the same way as code compiled with that version of gcc.

It's crystal clear that, for example, almost the entire code base is compatible with signed addition defining overflow the same way that every modern CPU handles it. Until quite recently, gcc didn't know any other way to compile code!
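As a concrete illustration (mine, not from the thread): the behavior "every modern CPU" implements is two's-complement wraparound, which gcc and clang today guarantee only under -fwrapv:

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        int x = INT_MAX;
        /* Defined to print INT_MIN under -fwrapv; undefined behavior
           per the C standard without it. */
        printf("%d\n", x + 1);
        return 0;
    }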

Within the cloud of compiler behaviors that are compatible with the existing code base, boringcc can make choices that are simpler and safer than what gcc has ever done. For example, it's obviously a win for boringcc to initialize everything to 0, rather than defining initial values in terms of previous stack frames etc.

12/21/15 D. J. Bernstein

drc...@google.com writes:
> How do you feel about defining signed integer overflow as trapping,
> with unsigned retaining the ring semantics? Or is that too far from
> C-classic?

This is clearly too far from what programmers expect and rely on: it would break tons of existing C code.

If you're designing a new language for general-purpose usage, you should provide auto-resizing bigints, not fake "integers" with sharp edges. But my goal here isn't to design a new language; it's to provide predictable behavior for the existing C code base.

See also - OpenWatcom's "Safer C Library" (https://web.archive.org/web/20150503192244/http://openwatcom.org/index.php/Safer_C_Library) and Cyclone (though it's more of a whole new language -- http://cyclone.thelanguage.org/).

D. J. Bernstein

xua...@gmail.com writes:
> Since accessing an array out of bounds is undefined behavior in C,
> wouldn't boringC need to keep track of array length at runtime?

Here's a definition that would be moderately helpful for security: any read out of bounds produces 0; any write out of bounds is ignored.

In a new language this would be a poor choice compared to auto-resizing arrays, for the same reasons that 32-bit integers are a poor choice compared to bigints. However, in a language without a clean exception mechanism, it's not clear how to handle auto-resizing failures.

There are other ways to define everything without the "out of bounds" concept: for example, one can define where variables are stored in a flat address space, and then treat all pointer operations the same way the hardware does. But the definition above is simpler and more useful.
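A minimal sketch of the proposed semantics (my illustration, invented names), using an explicit fat-pointer struct:

    #include <stddef.h>

    struct fatptr { char *base; size_t len; size_t off; };

    /* Out-of-bounds reads produce 0. */
    char fp_read(struct fatptr p) {
        return (p.off < p.len) ? p.base[p.off] : 0;
    }

    /* Out-of-bounds writes are ignored. */
    void fp_write(struct fatptr p, char v) {
        if (p.off < p.len) p.base[p.off] = v;
    }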

> Doesn't this imply that it must pass array length information when
> calling a function that takes an array parameter?

No. Conceptually, what one wants is a "fat pointer" that knows the start and end of the array that it came from, but this can be physically stored as...

This is implemented in, e.g., the MIRO project that I linked to before.

---Dan

It would be nice if it were possible to include separation of the control flow stack from the local data stack, as featured in SafeStack. Segmenting control flow information from local data just makes sense.

http://clang.llvm.org/docs/SafeStack.html

SML has array-out-of-bounds exceptions (as e.g. Java does), which could lead to the same vulnerabilities as in 1). SML relies on garbage collection, which could lead to timing side-channels. SML makes no guarantees on the memory layout of structures/types (alignment), another possible (timing) side-channel. SML signatures can result in different runtime behaviors, e.g. signature INTEGER with one implementation using fixed precision and one using arbitrary precision.

1) http://armoredbarista.blogspot.de/2014/04/easter-hack-even-more-critical-bugs-in.html

andrew0...@gmail.com writes:

Wouldn't it be better/simpler to add a switch to clang or GCC that emits warnings (or errors) for any "undefined" or "unspecified" or "implementation-defined" construct in the compiled code? This assumes that someone knows what all these things are, which I'm not sure anyone does. ... Certainly ending up with portable, boring...code is better than not, right?

But most real-world security vulnerabilities in C and C++ code today are not caused or affected by compiler optimizations. The ones that are, such as those cited in the first post, are especially insidious, but they are few. (Fewer if you stick in -fwrapv -fno-delete-null-pointer-checks.)

the set of undefined behaviors is somewhat arbitrary (e.g. most unsigned integers in typical programs are not expected to overflow, but compilers can only optimize for the signed case), ...

[Brief digression: a reasonably large fraction of vulnerabilities are caused by integer wrapping, such as malloc(count * sizeof(type)). So even if trapping on signed overflow creates breakage in some existing programs, this must be balanced against the possibility of it mitigating actual vulnerabilities. But of course, size_t is unsigned, so it isn't likely to make that much difference, especially in modern programs. See also grsecurity's Size Overflow compiler plugin.]
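The malloc(count * sizeof(type)) wrap in the digression is easy to guard against (my sketch, not from the thread):

    #include <stdint.h>
    #include <stdlib.h>

    /* Allocate count elements of 'size' bytes each, refusing requests
       whose total byte count would wrap around SIZE_MAX. */
    void *alloc_array(size_t count, size_t size) {
        if (size != 0 && count > SIZE_MAX / size)
            return NULL;            /* count * size would wrap */
        return malloc(count * size);
    }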

It seems like there are two very different ideas being grouped together here.

The first is changing the compiler to not make /optimizations/ that take advantage of undefined behavior (so the assembly is "what you'd expect", nothing like null checks being eliminated or signed overflow generating value range assumptions that turn out to be false, seemingly causing logic to fall apart, etc.), and maybe emit some sequences to fix CPU inconsistencies like shift count overflow.

The other idea is to make the language actually fully memory safe by having the compiler generate extra checking code, removing undefined behaviors such as the behavior of use after free, misindexing arrays, etc. (I'm not sure I'd call that "boring", since it's an area of active research and a practical implementation would probably require a decent amount of innovation, but that's just semantics.)

This is, of course, possible - some existing programs will break, but not too badly - but whereas the previous idea had a small performance cost, this has an enormous one. There's no getting around it. From the links in the original post:

Fail-Safe C: "Some benchmark results show that the execution time are around 3 to 5 times of the original, natively-compiled programs, in avarage"

MIRO: the report includes various benchmark results, which look generally similar or worse - in a worst case, gzip took 3881 times(!) as long to compress a 26 MB file. Also not fully memory safe.

Here is another one with "108% performance overhead on average": https://www.cis.upenn.edu/~milom/papers/santosh_nagarakatte_phd.pdf

Especially in modern C++ programs, a large fraction of vulnerabilities seems to be caused by use after free. (C++ significantly mitigates array-bounds issues compared to typical C by using containers like std::vector and std::map rather than directly touching pointers, though of course not completely; you can do similar things in C, but few do.) Use after free can be mostly eliminated just by using a conservative garbage collector, with much lower overhead. But of course this does not provide complete safety.

The reliable way to stop buffer overflows is to have bounds-checking enforced by the compiler. ... Obviously a new language with automatic bigints would get this particular calculation right. Obviously a new language with overflow exceptions, while less satisfactory in general, would have the same effect for this particular pattern...

It's worth mentioning that there were originally two reasons for leaving things undefined in the definition. The first was performance, as many have mentioned. Dropping checks for array boundaries falls in this category. The second reason was that different processors do different things. This, for example, is why overflow of unsigned values is defined but overflow of signed values is not. Here's a critical point: many of the systems that behaved differently just aren't around any more. For example, I don't know of any 1's complement systems outside of museums.

This opens an opportunity to deprecate behavior which was undefined to support architectures that are no longer available.

Even later to the discussion, but we have come up with a "Lenient C" dialect and implemented it in Safe Sulong, a system for executing LLVM-based languages on the JVM: http://ssw.jku.at/General/Staff/ManuelRigger/ManLang17.pdf

---

...

Meanwhile, both Go and Java use precise GC: the former mark and sweep, the latter copying. Copying is *definitely* out for C as it would silently break things like pointer comparisons and hash tables, but precise mark-and-sweep might not totally be... if we could accurately predict what (struct) types allocations were being used for.

At least two recent papers opt for "lock and key" and justify it in part as being more useful for debugging than GC, as mistakes in the original use of malloc/free can be detected rather than merely worked around...

But GC is the only approach that need not add any overhead to dereferences (at least reads, depending on concurrency)...

It is also probably more robust in the presence of threading, most notably because a thread could be in the middle of accessing a pointer when another frees it (and could be paused indefinitely in that state by the kernel). For loads this could be mostly rectified (at the cost of confusing compiler optimizations) by loading the pointer before the lock value, and inserting appropriate barriers on architectures where this is necessary for ordering; but that doesn't work for stores, so I think this would require some sort of RCU-like thread progress checking or else thread PC inspection before allocations could actually be reused, though I might be missing something simpler.

Also, if improperly synchronized accesses (including code that assumes individual stores to pointer-typed variables are atomic) are as common in legacy C code as I think they are, since the key and pointer value cannot be updated together atomically, a thread that observed only one being changed could crash spuriously. (It might be possible to check for this in the crash handler and fix up execution, but at the cost of significantly complicating the CFG. Ew.)

A downside to GC is that it is somewhat more dangerous if applications obfuscate pointers to save memory: GC missing a pointer may lead to use after free, while a lock-and-key approach is more likely to just crash on a pointer whose metadata has gotten lost in transit. I suppose a primarily GC based approach could do some sort of type check upon casting from integer to pointer, if a type could be identified with some dumb heuristics, and complain at compile time otherwise. Both cases require modifying the original source to properly rectify, so it is not like this complaint represents a reduced "just works" factor.

It's apparently necessary to store bounds information with pointers rather than with their referents because temporarily out-of-bounds pointers are common (including but not limited to the standards-legal one-past-the-end, of course). Aside from representing overhead in all cases, this too is tricky to make safe in the presence of threading: I think you would need to load the base and size in the metadata table atomically, and since x86 can't even do 16 byte atomic loads properly (unless you use MPX?), that means stuffing them into a single int64 somehow. I guess you could have a 32-bit address space (ew). And that doesn't even solve the spurious panic issue.
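A sketch of the "stuff them into a single int64" idea (mine; C11 atomics, with the 32-bit base and size the text muses about):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Packed bounds metadata: high 32 bits = base, low 32 bits = size.
       A single atomic 64-bit load can never observe torn bounds. */
    typedef _Atomic uint64_t meta_t;

    void meta_store(meta_t *m, uint32_t base, uint32_t size) {
        atomic_store(m, ((uint64_t)base << 32) | size);
    }

    void meta_load(meta_t *m, uint32_t *base, uint32_t *size) {
        uint64_t v = atomic_load(m);  /* one indivisible read */
        *base = (uint32_t)(v >> 32);
        *size = (uint32_t)v;
    }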

But if we could guarantee that some pointer variables/fields were never used for indexing, there would at least be no need to manage this information for them.

In this and the previous case, a little more performance could be eked out if we could essentially identify the subset of C code that acts like Go or Java code - no pointer arithmetic and no casts. This can be done statically - I think with reasonable recall with flow-insensitive analyses (more robust with some sensitivity), but only with whole program optimization, which generally has a lot of arguments against it. And it would be fairly fragile/unpredictable as a single unanalyzable cast (or memcpy, etc.) to a popular type could deoptimize all the code that uses that type.

Or maybe you would rather eat the performance for the sake of predictability. Though even then, I think it will be necessary to do some ugly-feeling things to avoid unsafety in edge cases, such as the issue I mentioned above with pointers stored as integers.

---

Personally I think we did ourselves a lot of harm by abandoning Pascal in favor of C. I don't recall any undefined behavior in Pascal, and I was a big fan of its features, modularity, compile-time and runtime checks, ability to directly inline assembly code, fast single-pass compilation, and good quality of generated code.

Ada...a language that is as fast, portable, low level and mature as C, without the undefined behavior, but with bounds checks and range validation checks.

It also has nice features such as modules, generics, ranges (which is ideal for limiting values), sub-typing, contracts, parallel processing, a nice bitchy compiler that doesn't allow you to write nasty code (unsafe code is clearly marked), and a large existing Ada code-base.

---

https://runtimeverification.com/match/1.0-SNAPSHOT/docs/

RV-Match is a program analysis environment based on the K semantic framework. The formal semantics of a target programming language is given as input, and RV-Match provides a series of formal analysis tools specialized for that language, such as a symbolic execution engine, a semantic debugger, a systematic checker for undesired behaviors (model-checker), and even a fully fledged deductive program verifier. RV-Match is currently instantiated and optimized to work with the publicly available C11 semantics that rigorously formalizes the current ISO C11 Standard, in the form of a tool called kcc, an ISO C11-compliant compiler for C. ... Unlike modern optimizing compilers, whose goal is to make binaries that are as small and as fast as possible at the expense of compiling programs that may be semantically incorrect, kcc instead aims at mathematically rigorous dynamic checking of program compliance with the ISO C11 Standard. The ISO C11 Standard includes a number of categories of behavior that are subject to change on the whim of the compiler developer.

https://runtimeverification.com/match/1.0-SNAPSHOT/docs/runningexamples/

Running Examples

    Unsequenced Side Effects
    Buffer Overflows and Underflows
    ...
    Implementation-Defined Behavior
    Out of Lifetime Access
    Overflow and Optimizations
    Warnings, or Lint-Style Issues
    And Many More

Unsequenced Side Effects

Consider the following three-line 1-unsequenced-side-effect.c program:

    int main() {
        int x = 0;
        return (x = 1) + (x = 2);
    }

If you compile it using clang, you get a warning about the undefined behavior, but it returns 3 like one might expect. A developer might choose to ignore this warning, not understanding its significance. However, then one day you decide to switch compilers and run this program with gcc. Suddenly, you get the entirely unexpected value of 4. How is this possible? Well, compilers are allowed to optimize however they choose as long as they do not affect the behavior of well-defined programs. So what gcc has chosen to do is to hoist the side effects out of the addition expression so that they happen before. Thus this program becomes:

    int x = 0;
    x = 1;
    x = 2;
    return x + x;

Buffer Overflows and Underflows

Implementation-Defined Behavior

An example of implementation-defined behavior is the conversion to a type that cannot store a specified value, thus triggering a loss of precision.

Out of Lifetime Access

Consider the program 7-out-of-lifetime.c:

    int *foo(int x) {
        int z = x;
        int *y = &z;
        return y;
    }

    int main() {
        int *x = foo(0);
        foo(255);
        return *x;
    }

When this example is run with gcc -O3, the value stored in z by the first call can still be seen when the program exits. However, if optimizations are disabled or a different compiler is used, the layout of variables on the stack may change, causing the call foo(255) to overwrite the same location in memory that was used for the integer z in the first call to foo. Because the variable z's lifetime has ended when foo returns, it is undefined behavior to have returned a pointer to it from the function.

Overflow and Optimizations

    ...
    /* check for overflow */
    if (size > size + 2) return;
    ...

Note the line annotated "check for overflow". A user, on seeing this program, may assume that this check correctly handles the case where size+2 overflows, and that the program will therefore exit normally after returning from the function without calling malloc. Indeed, if you compile this program with gcc using no flags, that is exactly what happens.

However, if someday you decide to change your compiler flags and build with -O3, suddenly this program will segfault. Why? Overflow of a signed integer is considered an exceptional condition, and according to the ISO C11 standard, this behavior is undefined. gcc is free to assume that undefined behavior can never occur and to optimize accordingly, causing the line if (size > size+2) to be replaced in the resulting binary with if (0).
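For contrast, a well-defined version of the same check (my sketch) tests against INT_MAX before adding, so the optimizer has nothing to remove:

    #include <limits.h>

    /* Returns 1 iff size + 2 would not overflow. Nothing here
       overflows, so the check survives -O3 intact. */
    int can_add_2(int size) {
        return size <= INT_MAX - 2;
    }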

Other undefined behaviors which have been discovered frequently in real code bases include use of functions that have not been declared, creation of incorrectly aligned pointers, comparison between pointers to different memory objects, and incorrect use of dynamic format strings to printf.

If you want to see all the different types of errors or warnings that kcc can report, see its complete list of error codes as well as the corresponding canonical programs illustrating them.

---

[1]

Little is a statically typed, C-like scripting language.

Features

    Familiar C-like Syntax
    Structs, lists, arrays, hashes
    Perl regexp: buf =~ /.*foo/, I/O: while (buf = <>)
    No memory management (reference counted)
    Compiles to Tcl byte codes, Little can call Tcl, Tcl can call Little
    Full access to Tcl runtime and libraries
    Full access to Tk graphical toolkits

    /* trivial grep implementation */
    int main(string argv[])
    {
        string buf, regexp;
        int ret = 1;	// not found is default

        unless (regexp = argv[1]) die("usage: grep regexp [files]");
        undef(argv[1]);	// left shift down the args
        /*
         * Example perl goodness: iterating through files and regexps
         */
        while (buf = <>) {
            if (buf =~ /${regexp}/) {
                puts(buf);
                ret = 0;
            }
        }
        return (ret);
    }

---

[2]

http://c2lang.org/site/introduction/changes_from_c/

" no header files

This is good, yes. However, it's because they got rid of the c preprocessor, which is arguably a good move

    built-in primitive types

We have these in c! ;o

Have you people never heard of stdint.h and stdbool.h?

    uniform type (definition) syntax

Yes, you have slightly better syntax for arrays and much better syntax for function pointers.

    attributes

Whoop-de-doo, you've added standardized pragmas

    member access always through '.', not sometimes '->' (philosophy: remove clutter / reduce change effort)

Hooray! I've always wanted this

    All global variables are initialized by default

Again, very good. For practical programming, you always want this

    incremental arrays

This marks a crucial deviation from c. In c, when you say that something should happen, that's what happens, and no more. In this case, what happens when you perform this operation on this builtin type? Who can say!

(this is one of the few points directly against c2)

    bitoffsets

Another...neat thing

    Struct-functions

What's the difference between point.add(3) and add(point, 3)? (Spoiler: nothing.) "

---

[–]progforsatan666

When I started writing go I was like "damn this is neat". In no time all I could see in my code was this snippet.

...

"if err != nil { return nil, err } "

---

logic

maybe read about this misc logic language sometime

https://github.com/enkiv2/mycroft

---

"starting off with functions, conditions, i/o, and simple mathematics is generally sufficient to start writing real code, and it’s generally possible to get a simple language from zero to capable of hosting a fibbonacci sequence function in an afternoon.)" [3]