proj-oot-ootNotes32

https://github.com/8l/cc500

" cc500

CC500: a tiny self-hosting C compiler

http://homepage.ntlworld.com/edmund.grimley-evans/cc500/

CC500: a tiny self-hosting C compiler

Introduction

I wrote this tiny compiler, which translates a subset of C into x86 machine code, for fun. It has no use, unless it counts as educational. I called it CC500 because I initially guessed it would take about 500 lines. It turned out to be about 600 lines even without the comments and blank lines. With the comments and blank lines it has about 750 lines. It could be made shorter, but I wanted the code to be clear and simple rather than obfuscated.

Download

cc500.c - just the code

Compilation:

    gcc cc500.c -o cc500_1

Self-compilation:

    ./cc500_1 < cc500.c > cc500_2
    chmod +x cc500_2

How it works

The compiler reads the whole program in C from stdin and writes an ELF binary for Linux/x86 to stdout. The ELF binary is somewhat non-standard as it consists of a single section that is both writable and executable. It seems to work for me on a real Linux box, but it might cause problems on an emulator.

No libraries are used. Tiny, machine-code implementations of exit(), getchar(), malloc() and putchar() are included in the binary that is generated. There is no free() as malloc() is just sbrk(), really, implemented with two calls to brk().

There is almost no error-checking. With some invalid C programs the compiler calls exit(1) instead of generating a binary, but in most cases it generates a binary that crashes when you run it.

There is no preprocessor. The lexer looks for tokens that match one of the following Perl regular expressions: /[a-z0-9_]+/, /[<=>|&!]+/, /'[^']*'/, /"[^"]*"/, /./. Traditional C-style comments are handled, but not C++-style comments.

The symbol table is implemented with a single array of char. The type of each symbol is recorded as 'D' (defined global), 'U' (undefined global), 'L' (local variable) or 'A' (argument of a function) - see sym_get_value(). There is also an address recorded for each symbol, but no type, as there is no type-checking and all arrays are assumed to be arrays of char. The address of an 'L' or 'A' is its position on the stack. The address of a 'D' is the position of that function or global variable in the binary. The address of a 'U' is also a pointer into the binary, but a pointer to where the symbol was last used and the head of a linked list of all the forward references to that symbol.
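A rough sketch of that layout (illustrative only, not cc500's actual code; names like sym_define are made up here):

    #include <string.h>

    /* Illustrative only: a flat char buffer of name-plus-tag records with a
     * parallel array of addresses; cc500's real representation differs. */
    static char table[4096];
    static int  address[256];
    static int  table_pos;   /* bytes used in table[]; restored on scope exit */
    static int  num_syms;

    static void sym_define(char *name, char tag, int addr) {
        int n = strlen(name);
        memcpy(table + table_pos, name, n);
        table[table_pos + n] = tag;            /* 'D', 'U', 'L' or 'A' */
        table[table_pos + n + 1] = '\0';
        table_pos = table_pos + n + 2;
        address[num_syms] = addr;
        num_syms = num_syms + 1;
    }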

Scoping rules for local variables are implemented correctly as it is easy to do so: at the end of a compound statement the pointer to the end of the symbol table is restored to what it was at the start of the compound statement - see the assignment to table_pos in statement().

I started off writing CC500 without local variables, but it felt very unnatural. Nested scopes are an essential part of C and it is not easy to write a recursive-descent parser without automatic variables.

It is assumed that all variables, arguments and return values are char * or int, though the compiler does not even parse the types, let alone record them in the symbol table. The functions that implement the recursive-descent parser return 1, 2 or 3, corresponding to a "char lval", "int lval" or "other" - see promote().

Each parsing function is preceded by a comment that documents the syntactic structures recognised, in an obvious notation. In some cases the code accepts additional, incorrect structures in addition to what is documented in the comment. Only a tiny subset of C's operators are accepted: just the ones I needed while writing the compiler. There are no unary operators: instead of the indirection operator * you must use x[y], where it is assumed that x is char * and y is int. The only statements recognised are compound statements, expressions, if, while and return. Local variable declarations, with initialisers, are treated as a kind of statement. (If you write something like if (x) int y; the compiler will not complain, but the binary will crash.)
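To give a feel for the accepted subset, code in roughly this style is what the restrictions above imply (an illustration only, not a program verified against cc500 itself):

    /* no unary '*', so x[y] is used for indirection; the only statements are
     * blocks, expressions, if, while, return and declarations with initialisers */
    int count_zeros(char *buf, int n)
    {
        int i = 0;
        int count = 0;
        while (i < n) {
            if (buf[i] == 0) {
                count = count + 1;
            }
            i = i + 1;
        }
        return count;
    }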

There are no arbitrary limits on the length of symbols, the number of symbols, and the like: all three buffers (token, table and code) are supposed to be automatically extended as required by calls to my_realloc().

Extending it

Some simple and obvious enhancements would be to print an appropriate message when an error is detected, check for undefined globals, and allow main() to be anywhere in the file.

It would be easy to extend the compiler to handle the remaining operators and statement types of C. To implement break and continue you would probably add additional arguments to the parsing functions that record the address and stack height of the current loop. The easiest way to implement switch statements might be to add an argument that records the location of the previous case label in the current switch statement; a case label would then be translated with a comparison and a branch forwards to the next case label. In general, C is quite amenable to direct translation into machine code without going through an IR (intermediate representation).

Extending the type system would be a bit more challenging. You could correctly implement scalars and pointers, with type checking, by adding a single integer field to the symbol table to record the type. However, if you wanted to implement structs, or function pointers with proper type checking, you would need a recursive data structure to represent the types. It is interesting, but not really surprising, that you start to need structs in the implementation language at roughly the same point that you start to implement them. However, C's types get interesting even without structs, unions and typedefs. For example, void (*(*f())(void (*)(void)))(void); declares a function that returns a pointer to a function that takes a pointer to a function as an argument and returns another pointer to a function. I must admit to finding that part of C's syntax rather difficult to parse mentally. I'm glad we have computers to do it for us. "
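For reference, that declaration can be unpacked with typedefs (a sketch with made-up type names):

    /* same meaning, built up with typedefs */
    typedef void vfn(void);    /* function taking void, returning void    */
    typedef vfn *pvfn;         /* pointer to such a function              */
    typedef pvfn gfn(pvfn);    /* function taking and returning a pvfn    */
    gfn *f();                  /* f returns a pointer to such a function  */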

---

from https://github.com/8l/scc

Derivations from standard C

This compiler aims to be fully compatible with the C99 standard, but it has some differences:

C99 allows you to define type names of function types and write something like:

int f(int (int));

Accepting function types in typenames (or abstract declarators) makes the grammar ambiguous because it is impossible to differentiate between:

        (int (f))  -> function returning int with one parameter of type f
        (int (f))  -> integer variable f

Function type names are stupid: they are used only as an alias of function pointer types. It is also inconsistent that something like sizeof(int (int)) is not allowed (because there it would have to be understood as the size of a function), while f(int (int)) is allowed because it is understood as a parameter of function pointer type.

C89 allows the definition of variables with incomplete type that have external linkage and file scope. The type of the variable is the composition of all the definitions found in the file. The exact rules are a bit complex (3.7.2), and SCC ignores them at this moment: it does not allow any definition of variables with incomplete type.

If you don't believe me try this code:

struct foo x;

struct foo { int i; };

---

https://groups.google.com/forum/m/#!msg/boring-crypto/48qa1kWignU/o8GGp2K1DAAJ

D. J. Bernstein As a boring platform for the portable parts of boring crypto software, I'd like to see a free C compiler that clearly defines, and permanently commits to, carefully designed semantics for everything that's labeled "undefined" or "unspecified" or "implementation-defined" in the C "standard".

Pretty much every real-world C program is "undefined" according to the C "standard", and new compiler "optimizations" often produce new security holes in the resulting object code, as illustrated by

   https://lwn.net/Articles/342330/
   https://kb.isc.org/article/AA-01167
 e.g., having variables automatically initialized to 0

I'm told that the Plan 9 compiler tries to be somewhat predictable---

   http://article.gmane.org/gmane.os.plan9.general/76989

---but it still leaves many things unspecified, and I'm under the impression that it has its own ABI. There's a "Fail-Safe C" compiler that sounds like it defines much more, including safe semantics for buffer overflows---

   https://staff.aist.go.jp/y.oiwa/FailSafeC/index-en.html

---but it's clear that this has its own ABI. There are various projects such as

   http://www.doc.ic.ac.uk/~awl03/projects/miro/

that add more useful semantics to some parts of C _without_ changing the ABI, but the same compilers will happily screw up other parts of C. There's a "Friendly C" proposal that targets other parts of C more relevant to core crypto---

   http://blog.regehr.org/archives/1180

---but the big picture is still tons of misguided flexibility for the compiler writer.

Imagine, for example, making gcc more boring by picking some version of gcc from a few years ago and specifying that code compiled with boringcc will behave in the same way as code compiled with that version of gcc.

It's crystal clear that, for example, almost the entire code base is compatible with signed addition defining overflow the same way that every modern CPU handles it. Until quite recently, gcc didn't know any other way to compile code!

Within the cloud of compiler behaviors that are compatible with the existing code base, boringcc can make choices that are simpler and safer than what gcc has ever done. For example, it's obviously a win for boringcc to initialize everything to 0, rather than defining initial values in terms of previous stack frames etc.

12/21/15 D. J. Bernstein

drc...@google.com writes:
> How do you feel about defining signed integer overflow as trapping,
> with unsigned retaining the ring semantics? Or is that too far from
> C-classic?

This is clearly too far from what programmers expect and rely on: it would break tons of existing C code.

If you're designing a new language for general-purpose usage, you should provide auto-resizing bigints, not fake "integers" with sharp edges. But my goal here isn't to design a new language; it's to provide predictable behavior for the existing C code base.

See also - OpenWatcom's "Safer C Library" (https://web.archive.org/web/20150503192244/http://openwatcom.org/index.php/Safer_C_Library) and Cyclone (though it's more of a whole new language -- http://cyclone.thelanguage.org/).

D. J. Bernstein

xua...@gmail.com writes:
> Since accessing an array out of bounds is undefined behavior in C,
> wouldn't boringC need to keep track of array length at runtime?

Here's a definition that would be moderately helpful for security: any read out of bounds produces 0; any write out of bounds is ignored.

In a new language this would be a poor choice compared to auto-resizing arrays, for the same reasons that 32-bit integers are a poor choice compared to bigints. However, in a language without a clean exception mechanism, it's not clear how to handle auto-resizing failures.

There are other ways to define everything without the "out of bounds" concept: for example, one can define where variables are stored in a flat address space, and then treat all pointer operations the same way the hardware does. But the definition above is simpler and more useful.
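A minimal sketch of those semantics as checked accessors (hypothetical helper functions, not part of any actual compiler):

    #include <stddef.h>

    /* "any read out of bounds produces 0; any write out of bounds is ignored" */
    char checked_read(char *base, size_t len, size_t i) {
        return (i < len) ? base[i] : 0;
    }

    void checked_write(char *base, size_t len, size_t i, char v) {
        if (i < len)
            base[i] = v;
    }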

> Doesn't this imply that it must pass array length information when
> calling a function that takes an array parameter?

No. Conceptually, what one wants is a "fat pointer" that knows the start and end of the array that it came from, but this can be physically stored as ...

This is implemented in, e.g., the MIRO project that I linked to before.

---Dan

It would be nice if it were possible to include separation of the control flow stack from the local data stack, as featured in SafeStack. Segmenting control flow information from local data just makes sense.

http://clang.llvm.org/docs/SafeStack.html

SML has array-out-of-bounds exceptions (as e.g. Java), which could lead to the same vulnerabilities as in 1). SML relies on garbage collection, which could lead to timing side-channels. SML makes no guarantees on the memory layout of structures/types (alignment), another possible (timing) side-channel. SML signatures can result in different runtime behaviors, e.g. signature INTEGER with one implementation using fixed precision and one using arbitrary precision.

1) http://armoredbarista.blogspot.de/2014/04/easter-hack-even-more-critical-bugs-in.html

andrew0...@gmail.com Wouldn't it be better/simpler to add a switch to clang or GCC that emits warnings (or errors) for any "undefined" or "unspecified" or "implementation-defined" construct in the compiled code? This assumes that someone knows what all these things are, which I'm not sure anyone does. ... Certainly ending up with portable, boring...code is better than not, right?

But most real-world security vulnerabilities in C and C++ code today are not caused or affected by compiler optimizations. The ones that are, such as those cited in the first post, are especially insidious, but they are few. (Fewer if you stick in -fwrapv -fno-delete-null-pointer-checks.)

the set of undefined behaviors is somewhat arbitrary (e.g. most unsigned integers in typical programs are not expected to overflow, but compilers can only optimize for the signed case),

[Brief digression: a reasonably large fraction of vulnerabilities are caused by integer wrapping, such as malloc(count * sizeof(type)). So even if trapping on signed overflow creates breakage in some existing programs, this must be balanced against the possibility of it mitigating actual vulnerabilities. But of course, size_t is unsigned, so it isn't likely to make that much difference, especially in modern programs. See also grsecurity's Size Overflow compiler plugin.]
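For reference, the wrap the bracketed note refers to, and one defined way to guard against it (a sketch, not from the thread):

    #include <stdint.h>
    #include <stdlib.h>

    /* count * elem_size can wrap around SIZE_MAX, silently allocating a
     * too-small buffer; checking before multiplying avoids that. */
    void *alloc_array(size_t count, size_t elem_size) {
        if (elem_size != 0 && count > SIZE_MAX / elem_size)
            return NULL;                /* would overflow: refuse to allocate */
        return malloc(count * elem_size);
    }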

It seems like there are two very different ideas being grouped together here.

The first is changing the compiler to not make /optimizations/ that take advantage of undefined behavior (so the assembly is "what you'd expect", nothing like null checks being eliminated or signed overflow generating value range assumptions that turn out to be false, seemingly causing logic to fall apart, etc.), and maybe emit some sequences to fix CPU inconsistencies like shift count overflow.

The other idea is to make the language actually fully memory safe by having the compiler generate extra checking code, removing undefined behaviors such as the behavior of use after free, misindexing arrays, etc. (I'm not sure I'd call that "boring", since it's an area of active research and a practical implementation would probably require a decent amount of innovation, but that's just semantics.)

This is, of course, possible - some existing programs will break, but not too badly - but whereas the previous idea had a small performance cost, this has an enormous one. There's no getting around it. From the links in the original post:

Fail-Safe C: "Some benchmark results show that the execution time are around 3 to 5 times of the original, natively-compiled programs, in avarage"

MIRO: the report includes various benchmark results, which look generally similar or worse - in a worst case, gzip took 3881 times(!) as long to compress a 26 MB file. Also not fully memory safe.

Here is another one with "108% performance overhead on average": https://www.cis.upenn.edu/~milom/papers/santosh_nagarakatte_phd.pdf

Especially in modern C++ programs - which significantly mitigate array bounds checking issues compared to typical C ones by using containers like std::vector and std::map rather than directly touching pointers, although of course not completely, and you can do similar things in C (but few do) - a large fraction of vulnerabilities seems to be caused by use after free. This can be mostly eliminated just by using a conservative garbage collector, with much lower overhead. But of course this does not provide complete safety.

The reliable way to stop buffer overflows is to have bounds-checking enforced by the compiler. ... Obviously a new language with automatic bigints would get this particular calculation right. Obviously a new language with overflow exceptions, while less satisfactory in general, would have the same effect for this particular pattern

It's worth mentioning that there were originally two reasons for leaving things undefined in the definition. The first was performance, as many have mentioned. Dropping checks for array boundaries falls in this category. The second reason was that different processors do different things. This, for example, is why overflow of unsigned values is defined but overflow of signed values is not. Here's a critical point: many of the systems that behaved differently just aren't around any more. For example, I don't know of any 1's complement systems outside of museums.

This opens an opportunity to deprecate behavior which was undefined to support architectures that are no longer available.

Even later to the discussion, but we have come up with a "Lenient C" dialect and implemented it in Safe Sulong, a system for executing LLVM-based languages on the JVM: http://ssw.jku.at/General/Staff/ManuelRigger/ManLang17.pdf

---

...

Meanwhile, both Go and Java use precise GC: the former mark and sweep, the latter copying. Copying is *definitely* out for C as it would silently break things like pointer comparisons and hash tables, but precise mark-and-sweep might not totally be... if we could accurately predict what (struct) types allocations were being used for.

At least two recent papers opt for "lock and key" and justify it in part as being more useful for debugging than GC, as mistakes in the original use of malloc/free can be detected rather than merely worked around...

But GC is the only approach that need not add any overhead to dereferences (at least reads, depending on concurrency)...

It is also probably more robust in the presence of threading. Most notably because a thread could be in the middle of accessing a pointer when another frees it (and could be paused indefinitely in that state by the kernel). For loads this could be mostly rectified (at the cost of confusing compiler optimizations) by loading the pointer before the lock value, and inserting appropriate barriers on architectures where this is necessary for ordering; but that doesn't work for stores, so I think this would require some sort of RCU-like thread progress checking or else thread PC inspection before allocations could actually be reused, though I might be missing something simpler. Also, if improperly synchronized accesses (including code that assumes individual stores to pointer-typed variables are atomic) are as common in legacy C code as I think they are, since the key and pointer value cannot be updated together atomically, a thread that observed only one being changed could crash spuriously. (It might be possible to check for this in the crash handler and fix up execution, but at the cost of significantly complicating the CFG. Ew.)

A downside to GC is that it is somewhat more dangerous if applications obfuscate pointers to save memory: GC missing a pointer may lead to use after free, while a lock-and-key approach is more likely to just crash on a pointer whose metadata has gotten lost in transit. I suppose a primarily GC based approach could do some sort of type check upon casting from integer to pointer, if a type could be identified with some dumb heuristics, and complain at compile time otherwise. Both cases require modifying the original source to properly rectify, so it is not like this complaint represents a reduced "just works" factor.

It's apparently necessary to store bounds information with pointers rather than with their referents because temporarily out-of-bounds pointers are common (including but not limited to the standards-legal one-past-the-end, of course). Aside from representing overhead in all cases, this too is tricky to make safe in the presence of threading: I think you would need to load the base and size in the metadata table atomically, and since x86 can't even do 16 byte atomic loads properly (unless you use MPX?), that means stuffing them into a single int64 somehow. I guess you could have a 32-bit address space (ew). And that doesn't even solve the spurious panic issue.
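A sketch of that representation (illustrative layout, not any particular system's ABI):

    #include <stddef.h>

    /* Bounds travel with the pointer, so a temporarily out-of-bounds pointer
     * is harmless as long as it is not dereferenced. */
    typedef struct {
        char  *ptr;    /* current position, possibly out of bounds */
        char  *base;   /* start of the underlying object           */
        size_t len;    /* size of the underlying object in bytes   */
    } fatptr;

    static int fat_in_bounds(fatptr p) {
        return p.ptr >= p.base && p.ptr < p.base + p.len;
    }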

But if we could guarantee that some pointer variables/fields were never used for indexing, there would at least be no need to manage this information for them.

In this and the previous case, a little more performance could be eked out if we could essentially identify the subset of C code that acts like Go or Java code - no pointer arithmetic and no casts. This can be done statically - I think with reasonable recall with flow-insensitive analyses (more robust with some sensitivity), but only with whole program optimization, which generally has a lot of arguments against it. And it would be fairly fragile/unpredictable as a single unanalyzable cast (or memcpy, etc.) to a popular type could deoptimize all the code that uses that type.

Or maybe you would rather eat the performance for the sake of predictability. Though even then, I think it will be necessary to do some ugly-feeling things to avoid unsafety in edge cases, such as the issue I mentioned above with pointers stored as integers.

---

Personally I think we did ourselves a lot of harm by abandoning Pascal in favor of C. I don't recall any undefined behavior in Pascal, and I was a big fan of its features, modularity, compile-time and runtime checks, ability to directly inline assembly code, fast single-pass compilation, and good quality of generated code.

Ada...a language that is as fast, portable, low level and mature as C, without the undefined behavior, but with bounds checks and range validation checks.

It also has nice features such as modules, generics, ranges (which is ideal for limiting values), sub-typing, contracts, parallel processing, a nice bitchy compiler that doesn't allow you to write nasty code (unsafe code is clearly marked), and a large existing Ada code-base.

---

https://runtimeverification.com/match/1.0-SNAPSHOT/docs/

RV-Match is a program analysis environment based on the K semantic framework. The formal semantics of a target programming language is given as input, and RV-Match provides a series of formal analysis tools specialized for that language, such as a symbolic execution engine, a semantic debugger, a systematic checker for undesired behaviors (model-checker), and even a fully fledged deductive program verifier. RV-Match is currently instantiated and optimized to work with the publicly available C11 semantics that rigorously formalizes the current ISO C11 Standard, in the form of a tool called kcc, an ISO C11-compliant compiler for C....Unlike modern optimizing compilers, whose goal is to make binaries that are as small and as fast as possible at the expense of compiling programs that may be semantically incorrect, kcc instead aims at mathematically rigorous dynamic checking of program compliance with the ISO C11 Standard. The ISO C11 Standard includes a number of categories of behavior that are subject to change on the whim of the compiler developer

https://runtimeverification.com/match/1.0-SNAPSHOT/docs/runningexamples/

Running Examples

    Unsequenced Side Effects
    Buffer Overflows and Underflows
    ...
    Implementation-Defined Behavior
    Out of Lifetime Access
    Overflow and Optimizations
    Warnings, or Lint-Style Issues
    And Many More

Unsequenced Side Effects

Consider the following three-line 1-unsequenced-side-effect.c program:

    int main() {
        int x = 0;
        return (x = 1) + (x = 2);
    }

If you compile it using clang, you get a warning about the undefined behavior, but it returns 3 like one might expect. A developer might choose to ignore this warning, not understanding its significance. However, then one day you decide to switch compilers and run this program with gcc. Suddenly, you get the entirely unexpected value of 4. How is this possible? Well, compilers are allowed to optimize however they choose as long as they do not affect the behavior of well-defined programs. So what gcc has chosen to do is to hoist the side effects out of the addition expression so that they happen before. Thus this program becomes:

    int x = 0;
    x = 1;
    x = 2;
    return x + x;

Buffer Overflows and Underflows

Implementation-Defined Behavior

An example of implementation-defined behavior is the conversion to a type that cannot store a specified value, thus triggering a loss of precision.

Out of Lifetime Access

Consider the program 7-out-of-lifetime.c:

    int *foo(int x) {
        int z = x;
        int *y = &z;
        return y;
    }

    int main() {
        int *x = foo(0);
        foo(255);
        return *x;
    }

When this example is run with gcc -O3, the value stored in z can be seen when the program exits. However, if optimizations are disabled or if a different compiler is used, the layout of variables on the stack may change. This causes the call foo(255) to overwrite the same location in memory that was used for the integer z in the first call to foo. Because the variable z‘s lifetime has ended when foo returns, it is undefined for us to have returned a pointer to it from the function.
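For contrast, a variant with defined behavior (a sketch; heap allocation is just one of several ways to extend the object's lifetime):

    #include <stdlib.h>

    /* The pointed-to object now lives until free(), so dereferencing the
     * returned pointer after foo() returns is well defined. */
    int *foo(int x) {
        int *y = malloc(sizeof *y);
        if (y) *y = x;
        return y;
    }

    int main() {
        int *x = foo(0);
        int r = x ? *x : 1;
        free(x);
        return r;
    }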

Overflow and Optimizations

    ...
    /* check for overflow */
    if (size > size + 2) return;
    ...

Note the line annotated “check for overflow”. A user, on seeing this program, may assume that this check correctly handles the case where size+2 overflows, and that this program will therefore exit normally after returning from the function without calling malloc. Indeed, if you compile this program with gcc using no flags, that is exactly what happens.

However, if at a later date you decide to change your compiler flags and build with -O3, suddenly this program will segfault. Why? Well, overflow on a signed integer is considered an exceptional condition, and according to the ISO C11 standard, this behavior is undefined. gcc is free to assume that undefined behavior can never occur and optimize accordingly, thus causing the line if (size > size+2) to be replaced in the resulting binary with if (0).
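A well-defined way to write the same check, for comparison (a sketch, not taken from the kcc docs):

    #include <limits.h>

    /* true iff size + 2 would overflow a signed int; computing this
     * involves no undefined behavior */
    int add2_would_overflow(int size) {
        return size > INT_MAX - 2;
    }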

Other undefined behaviors which have been discovered frequently in real code bases include use of functions that have not been declared, creation of incorrectly aligned pointers, comparison between pointers to different memory objects, and incorrect use of dynamic format strings to printf.

If you want to see all the different types of errors or warnings that kcc can report, see its complete list of error codes as well as the corresponding canonical programs illustrating them.

---

[1]

Little is a statically typed, C-like scripting language. Features:

    Familiar C-like Syntax
    Structs, lists, arrays, hashes
    Perl regexp: buf =~ /.*foo/, I/O: while (buf = <>)
    No memory management (reference counted)
    Compiles to Tcl byte codes, Little can call Tcl, Tcl can call Little
    Full access to Tcl runtime and libraries
    Full access to Tk graphical toolkits

/* trivial grep implementation */
int main(string argv[])
{
    string buf, regexp;
    int ret = 1;	// not found is default

    unless (regexp = argv[1]) die("usage: grep regexp [files]");
    undef(argv[1]);	// left shift down the args
    /*
     * Example perl goodness, the iterate through files and regexp
     */
    while (buf = <>) {
	if (buf =~ /${regexp}/) {
		puts(buf);
		ret = 0;
	}
    }
    return (ret);
}

---

[2]

http://c2lang.org/site/introduction/changes_from_c/

" no header files

This is good, yes. However, it's because they got rid of the c preprocessor, which is arguably a good move

    built-in primitive types

We have these in c! ;o

Have you people never heard of stdint.h and stdbool.h?

    uniform type (definition) syntax

Yes, you have slightly better syntax for arrays and much better syntax for function pointers.

    attributes

Whoop-de-doo, you've added standardized pragmas

    member access always through '.', not sometimes '->' philosophy: remove clutter/reduce change effort

Hooray! I've always wanted this

    All global variables are initialized by default

Again, very good. For practical programming, you always want this

    incremental arrays

This marks a crucial deviation from c. In c, when you say that something should happen, that's what happens, and no more. In this case, what happens when you perform this operation on this builtin type? Who can say!

(this is one of the few points directly against c2)

    bitoffsets

Another...neat thing

    Struct-functions

What's the difference between point.add(3), and add(point, 3)? (Spoiler: nothing).

---

[–]progforsatan666 85 points 3 months ago

When I started writing go I was like "damn this is neat". In no time all I could see in my code was this snippet.

...

"if err != nil { return nil, err } "

---

logic

mb read about this misc logic language sometimes

https://github.com/enkiv2/mycroft

---

"starting off with functions, conditions, i/o, and simple mathematics is generally sufficient to start writing real code, and it’s generally possible to get a simple language from zero to capable of hosting a fibbonacci sequence function in an afternoon.)" [3]

---

"consider perl in the text-processing space  -- although, because of accidents of history, it took a positon better served by icon "

---

" in Python not only are dictionaries separate from objects, but there’s also a distinction between old-style and new-style objects, even though implementing objects and classes as a special case of dictionaries as Javascript and Lua do would make more sense). "

---

" "Also is his explanation of monads as 'functors that can flatten' a simplification for the purposes of teaching, or is that more or less what they are?"

A little of both. Technically it is correct, but the "flattening" in question applies to many things that most programmers wouldn't consider "flattening". For instance, consider monadic IO as Haskell uses. There is a way in which you can consider the execution of an IO value as "flattening" it, and it corresponds to the mathematical term, but it's not what most people have in mind. There's more to "flattening" than "simplifying data structures in some manner"; it doesn't even always involve what we'd traditionally think of as data structures at all, such as, again, IO.

Personally I think it is an actively unhelpful metaphor for these reasons, as it is very prone to leading to false understanding, but YMMV.

reply "

---

jerf 1 hour ago [-]

The weakest definition of functional is that functions are a first-class object that can be passed around. This is the one that C conforms to in that old sense. This is a dead definition because almost everything in modern use conforms to this definition, and definitions are only useful to the extent they create distinct categories. Because almost every modern language has this, it can be difficult to imagine a language in which this is not true. But, yes, once upon a time, languages did not in general have a data type that could contain a function that you could call. (This is before my time, but I caught the tail end of these arguments.)

This was also one of the reasons that assembly programmers were always banging on about the power of assembly back in the day. Nowadays the only remnant of that argument is the claim that you can write more optimal assembly than the compiler. But back in the day, assembly programmers enjoyed the ability to have a pointer to a function and jump to it and/or call it (it's a bit fuzzier in assembler) and people using high-level languages were definitely getting a "weaker" experience. Today we expect our high level languages to also provide us significant power advantages over assembler. (Of course you can do anything in assembler, but not as concisely necessarily.)

When I got into computing, the definition of "functional" that excluded C included having "closures". This is a function pointer + an environment for the function to run in. C only has the function pointer; you don't get an environment. It is more convenient than nothing, but vastly less useful than a full closure. (You can do them manually, but they become problematic fast.)
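For concreteness, a hand-rolled "closure" in C is a function pointer plus an explicitly carried environment (illustrative names):

    #include <stdio.h>

    struct adder_env { int amount; };   /* the manually captured environment */

    static int add_amount(struct adder_env *env, int x) {
        return x + env->amount;
    }

    int main(void) {
        struct adder_env env = { 5 };
        int (*fn)(struct adder_env *, int) = add_amount;
        printf("%d\n", fn(&env, 10));   /* prints 15 */
        return 0;
    }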

Stepping up from there, you got languages that generally permitted an imperative style of programming, but "preferred" what we would today call a functional style, when you use map, filter, and such to perform operations. These languages loved them some linked lists; linked lists everywhere. With their own special names like "cons lists". They also tended to be garbage collected, which for a while was a "functional programming" thing, but is now also simply an accepted thing that a language may be, regardless of how "functional" it is.

This definition is still in some usage today, though some improvement in understanding the structure of the relevant code ("iteration" as a concept you can abstract, rather than accidentally conflating "a linked list" with "the thing you can iterate on") and the fact that hardware is getting ever-more grumpy about non-locality have erased the linked list obsession. You can either have a "functional language" like Lisp, or you can program in a "functional style" in a multi-paradigm language like Javascript. In the latter case, you can do a lot of work with the functional paradigm, but technically you always end back up at structured programming with some flavor of OO, which is the dominant language paradigm. (Languages can be multi-paradigm, but there is always a dominant paradigm, one that when the paradigms conflict, is the winner. And my personal default definition of OO includes Javascript, considering the prototype system a detail relative to the fact you still have "collections of data with associated methods".) People who say that "Javascript is a functional language" mean this definition.

Finally, there's the Haskell definition. Here, immutability is the default, and possibly the only option. Type systems are very strong, able to express types like "a block of code that does not do any IO" that other languages can not express, or can only do very laboriously. You get "functor" and "monad" and such being not just demonstrated on a one-off basis, but being the foundational abstractions of libraries and entire applications. People argue over how much category theory you have to know to practically use these languages. F#, O'Caml, and Haskell live here. Haskell is as far as you can currently go in this direction and get a language usable for practical tasks, work that you can build a business on.

(As an interesting side bar, I think Erlang made an error here, although a perfectly understandable one. When it was written, one of the reasons immutability was favored at the academic level was that it helped write multi-core code. At the time, only big industry and academia had multi-core systems. But you only really need isolation between threads. Immutability is one way to achieve this, but you can also make it so that it is impossible to communicate "references" between processes/threads, so everything is a copy. Within an Erlang process there's no reason not to allow one to "assign" to existing variables. But at the time, "access control" and "immutable" were sort of conflated together. Rust is the first big language that seems to be levering those concepts apart in a really systematic way.)

However, the spectrum keeps going from here. Past Haskell there are functional languages that get really mathematical, and are focused on proving code, creating even more elaborate type systems such as dependent types ("this function takes a string whose length is a prime number", to give a silly example), and constraining the abstractions even further for things like total functional programming, which is one of the most interesting possible ways to limit programming so that it is not Turing Complete, but can still do real work. Here you can get such exotica as ways of using the type system to separate out what code is total, and what is not, in various principled ways. One of the common "I've been in this industry for 20 years and it sucks and here's what we all need to do to fix it" posts is to extol the virtues of one or more of these things. However, while there has been some interesting progress on many of these fronts in the past couple of decades, they remain impractical.

reply

---

" If you take a language like Java and an IDE like Eclipse or Intellij IDEA, it is trivial to find from where a particular piece of code is called from. Whereas in JavaScript?, particularly in huge codebases, you can't easily tell how a piece of code ends up getting called. You will need to run the code, make some educated guesses and put breakpoints or console.log statements to verify that yes, this particular line does end up getting called on this particular action.

Refactoring is also a joy in a language like Java. You can easily modify a method signature and the IDE will take care of updating all the places this method is called from. Now imagine you add a new argument to your JavaScript function and want to update all the places where the function is called from. "

---

lots of good ideas here:

[4]

" ... Virgil contains familiar control structures such as if, for, while, and do...while. Like C, Virgil uses curly braces { } to represent blocks of code in loops, branches, method bodies, etc.

However, Virgil makes some slight adjustments to certain syntactic categories. For example, the notation for declaring methods, fields, and locals differs slightly in that types are placed after the name of variables, rather than before. Also, the syntax and semantics of switch cases differs slightly, as cases no longer fall through by default, but instead multiple case values can be grouped for the same case body.

    method declaration:              method myMethod(a: type1): type2
    field declaration:               field myField: type1 = . . .;
    local declaration:               local myLocal: type1 = . . .;
    switch cases don't fall through: switch ( e ) { case (0) stmt; case (1, 2) { . . . } }

Basic Types, Arrays, and Strings

local f: int = 42; local g: char = 'A'; local h: boolean = true; local i: int[] = { 0, 1, 7 }; local j: char[] = "Xanadu";

Virgil provides three basic primitive types. These primitive types are value types, and quantities of these types are never passed by reference.

    int - a signed, 32-bit integer type with arithmetic operations
    char - an 8-bit quantity for representing ASCII characters
    boolean - a true / false type for representing conditions

Additionally, Virgil provides the array type constructor [] that can construct the array type T[] from any type T. ... unlike Java, Virgil arrays are not objects, and are not covariantly typed [1].

Components

component MyComponent {
    field value: int;
    method reset(v: int): int {
        local old = value;
        value = v;
        return old;
    }
}

A Virgil component is a collection of fields and methods that form a cohesive unit of functionality. Each component is instantiated only once, and its non-private fields and methods are accessible from other code in the program by using the component's name. This provides for global state and behavior in programs, but encapsulated and scoped within modules...

Classes and Inheritance

Virgil is a class-based language that is most closely related to Java, C++, and C#. Like Java, Virgil provides single inheritance between classes only, with all methods virtual by default, except those declared private. Objects are always passed by reference, and never by value. However, like C++, Virgil has no universal super-class akin to java.lang.Object from which classes inherit by default. But Virgil differs from C++ in two important ways: it is strongly typed, which forces dynamic type tests for explicit casts, and it does not provide pointer types.

...

Compile-time Initialization

The most significant feature of Virgil that is not in other mainstream languages is the concept of initialization time. To avoid the need for a large runtime system that dynamically manages heap memory and performs garbage collection, Virgil does not allow applications to allocate memory from the heap at runtime. Instead, the Virgil compiler allows the application to run initialization routines at compilation time, while the program is being compiled. ... When the application's initialization routines terminate... The reachable heap is then compiled directly into the program binary and is immediately available to the program at runtime.

...

Delegates

A delegate is a first-class value that represents a reference to a method. A delegate in Virgil may be bound to a component method or to an instance method of a particular object; either kind can be used interchangeably, provided the argument and return types match. Delegate types in Virgil are declared using the function type constructor. For example, function(int): int represents the type of a function that takes a single integer as an argument and returns an integer. ... Unlike C#, the use of delegate values, either creation or application, does not require allocating memory from the heap. Delegates are implemented as a tuple of a pointer to the bound object and a pointer to the delegate's code.
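In C terms, that representation is roughly a two-word pair (a sketch of the idea, not Virgil's actual ABI):

    /* a "delegate" as (receiver, code), where the code takes the receiver
     * explicitly; a null receiver can model a component method */
    typedef int (*delegate_fn)(void *self, int arg);

    typedef struct {
        void       *self;
        delegate_fn code;
    } delegate;

    static int delegate_call(delegate d, int arg) {
        return d.code(d.self, arg);
    }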

...

Hardware Access and Interrupts

Programmers can write interrupt handlers for device and microcontroller interrupts in Virgil and connect them directly to the hardware interrupt routine. This is done through a master program declaration, which attaches methods declared in the program to entrypoints corresponding to the main entrypoint and hardware interrupts of the device.

program P {
    entrypoint "main" = Blink.entry;
    entrypoint "timer0_interrupt" = Blink.interrupt;
}
component Blink {
    method entry() { device.TIMER = 0b1001; }
    method interrupt() { device.PIN0 ^= 0b0001; }
}
"

---

" There is another point. Java had a number of other glaring faults and omissions in 1995, and we fixed those before releasing it in 1996. (Because we fixed them, you may never have known what we fixed.) And we had finite resources, facing an apparently finite window or opportunity. We had to make hard choices about what to fix before releasing it. Method call overload resolution was a mess; we fixed it. Generic types were debated, but didn't make it in. Unicode support was added; tail calls were not. And so on. "

---

Tarq0n 1 day ago [-]

The main dissimilarity between python and lisp is that you can't use code as data in python, so there's no macros. In most other ways they're pretty similar.

reply

SeanLuke 1 day ago [-]

Serious lambdas and closures? Sophisticated array handling? A full-featured numerical tower? Conditions? Python's namespaces seem primitive compared to Lisp. Etc.

reply

bogomipz 1 day ago [-]

>"A full-featured numerical tower"

What is a numerical tower?

reply

dkersten 1 day ago [-]

https://en.m.wikipedia.org/wiki/Numerical_tower

reply

---

 "Python supports all of Lisp's essential features except macros, and you don't miss macros all that much because it does have eval, and operator overloading, and regular expression parsing, so some--but not all--of the use cases for macros are covered. "

" our first attempt at writing Java versions was largely unsuccesful. Java was too verbose, and the differences between the pseudocode in the book and the Java code was too large. I looked around for a language that was closer to the pseudocode in the book, and discovered Python was the closest "

" The Python code I developed looks much more like the (independently developed) pseudo-code in the book than does the Lisp code. This is important, because some students were complaining that they had a hard time seeing how the pseudo-code in the book mapped into the online Lisp code (even though it seemed obvious to Lisp programmers).

The two main drawbacks of Python from my point of view are (1) there is very little compile-time error analysis and type declaration, even less than Lisp, and (2) execution time is much slower than Lisp "

" Python can be seen as either a practical (better libraries) version of Scheme, or as a cleaned-up (no $@&% characters) version of Perl. "

" Python has the philosophy of making sensible compromises that make the easy things very easy, and don't preclude too many hard things. In my opinion it does a very good job. The easy things are easy, the harder things are progressively harder, and you tend not to notice the inconsistencies. Lisp has the philosophy of making fewer compromises: of providing a very powerful and totally consistent core. This can make Lisp harder to learn because you operate at a higher level of abstraction right from the start and because you need to understand what you're doing, rather than just relying on what feels or looks nice. But it also means that in Lisp it is easier to add levels of abstraction and complexity; Lisp is optimized to make the very hard things not too hard, while Python is optimized to make medium hard things easier. "

" Here I've taken a blurb from Python.org and created two vesions of it: one for Python in blue italics and one for Lisp in green bold. The bulk of the blurb, common to both languages, is in black.

    Python/Lisp is an interpreted and compiled, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python/Lisp's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python/Lisp supports modules and packages, which encourages program modularity and code reuse. The Python/Lisp interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed. Often, programmers fall in love with Python/Lisp because of the increased productivity it provides. Since there is no separate compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python/Lisp programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn't catch the exception, the interpreter prints a stack trace. A source level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python/Lisp itself, testifying to Python/Lisp's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective."

some of the entries in the page's section 'Gotchas for Lisp Programmers in Python' are good too.

also, the table 'Key Features Lisp Features Python Features' is very very good; i already added a link to it to ootSyntaxNotes7.

[5]

---

devit 1 day ago [-]

This is the best way to categorize programming languages, but applying it only to relatively unpopular languages seems a strange choice.

I'd categorize the popular ones like this:

reply

---

triska 13 hours ago [-]

I often use the declarative programming language Prolog to solve dynamic programming tasks, because it is easy to type and helps you to declaratively express what a solution looks like. After you have expressed what must be the case in a solution, you can easily enable memoization on top of the existing code.

For example, here is how you can use Prolog to solve the task from the article. The following Prolog predicate is true iff a given runway (represented as a list of "t" and "f" elements) is safe with a given speed:

    :- use_module(library(clpfd)).
    safe_runway(0, [t|_]).
    safe_runway(Speed0, Rs) :-
            Speed0 #> 0,
            Rs = [t|_],
            (   Speed = Speed0
            ;   Speed #= Speed0 - 1
            ;   Speed #= Speed0 + 1
            ),
            length(Prefix, Speed),
            append(Prefix, Rest, Rs),
            safe_runway(Speed, Rest).

Sample query and answer:

    ?- safe_runway(4, [t,f,t,t,t,f,t,t,f,t,t]).
    true .

One interesting aspect of this solution is that we can generalize this query, and also use the same program to answer the question: Which speeds are actually safe for a given runway?

For example:

    ?- safe_runway(Speed, [t,f,t,t,t,f,t,t,f,t,t]).
    Speed = 0 ;
    Speed = 2 ;
    etc.

To enable memoization for this task, you only have to use your Prolog system's tabling mechanism. For example, in SWI-Prolog, you turn this into a dynamic programming solution by adding the directive

    :- table safe_runway/2.

This makes the Prolog engine automatically remember and recall solutions it has already computed.

reply

---