Table of Contents for Programming Languages: a survey

Part V: Implementation of a programming language

This is not a book about how to implement a programming language. However, even at the earliest stages of language design, it is valuable to have some idea of how certain choices in the design of the language could make the implementation much less efficient or much more difficult to write. For this purpose, we'll provide a cursory overview of implementation topics, touching as we go on some of the language design choices that may have a large impact.

Chapter : the general pipeline

parsing (and lexing)
parse tree -> AST
variable binding?
what else here?
typechecking?
basic blocks, control flow graph (also explain call stack, and possibly scope/block stack, at some point)
- see also https://en.wikipedia.org/wiki/Extended_basic_block
what else here?
runtime

when do we convert to normal forms?

todo

" The first big phase of the compilation pipeline is parsing. You need to take your input and turn it into a tree. So you go through preprocessing, lexical analysis (aka tokenization), and then syntax analysis and IR generation. Lexical analysis is usually done with regexps. Syntax analysis is usually done with grammars. You can use recursive descent (most common), or a parser generator (common for smaller languages), or with fancier algorithms that are correspondingly slower to execute. But the output of this pipeline stage is usually a parse tree of some sort.

The next big phase is Type Checking. ...

The third camp, who tends to be the most isolated, is the code generation camp. Code generation is pretty straightforward, assuming you know enough recursion to realize your grandparents weren't Adam and Eve. So I'm really talking about Optimization...

-- http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html
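
As a rough illustration of the first phase described in the quote above, here is a minimal sketch of regexp-based lexing followed by recursive descent parsing. The token set, the grammar and the names `tokenize` and `parse` are assumptions made up for this example; they are not taken from any particular compiler.

```python
import re

# Hypothetical token set for a tiny arithmetic language.
TOKEN_RE = re.compile(r"\s*(?:(?P<num>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[-+*/()]))")

def tokenize(text):
    """Lexical analysis: turn a string into a list of (kind, value) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        kind = m.lastgroup                  # name of the alternative that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

def parse(tokens):
    """Syntax analysis by recursive descent; returns a nested-tuple tree.

    Grammar assumed for this sketch:
        expr   := term (('+' | '-') term)*
        term   := factor (('*' | '/') factor)*
        factor := num | name | '(' expr ')'
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("eof", None)

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def factor():
        kind, value = advance()
        if kind == "num":
            return ("num", int(value))
        if kind == "name":
            return ("name", value)
        if (kind, value) == ("op", "("):
            node = expr()
            if advance() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")

    def binary(operand, ops):
        # Left-associative chain of binary operators at one precedence level.
        node = operand()
        while peek() in [("op", o) for o in ops]:
            _, op = advance()
            node = (op, node, operand())
        return node

    def term():
        return binary(factor, "*/")

    def expr():
        return binary(term, "+-")

    tree = expr()
    if peek() != ("eof", None):
        raise SyntaxError("trailing input")
    return tree

tree = parse(tokenize("(a + b) * 2"))
print(tree)
```

Each grammar rule becomes one function, which is a large part of why recursive descent is popular for hand-written parsers.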

modified from Python's docs:

" The usual steps for compilation are:

    Parse source code into a parse tree
    Transform parse tree into an Abstract Syntax Tree
    Transform AST into a Control Flow Graph
    Emit bytecode based on the Control Flow Graph" -- https://docs.python.org/devguide/compiler.html#abstract
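
The steps listed above can be poked at from within Python itself, using only the standard `ast` and `dis` modules. This is just a sketch for exploration: the parse tree and control flow graph stay internal to CPython, so only the AST and the final bytecode are visible, and the exact instructions printed vary between Python versions. The source string and the variable names are made up for the example.

```python
import ast
import dis

source = "x = (1 + 2) * y\n"

# Steps 1-2: ast.parse returns the AST; the concrete parse tree is not exposed.
tree = ast.parse(source)
print(ast.dump(tree.body[0].value))

# Steps 3-4: compile() does the remaining work internally and returns a code object.
code = compile(tree, "<demo>", "exec")
for instr in dis.get_instructions(code):
    print(instr.opname, instr.argrepr)

# The resulting code object can be executed like any other.
namespace = {"y": 10}
exec(code, namespace)
print(namespace["x"])
```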

discussions on the benefits of the introduction of a 'Core' or 'middle' language in between the HLL AST and a lower-level, LLVM-ish language:

https://news.ycombinator.com/item?id=11581863
- this thread is discussing the following Rust-specific link, which also makes some general points near the beginning: http://blog.rust-lang.org/2016/04/19/MIR.html

some advantages of a core language:

the core language can allow powerful but unsafe constructs that are not allowed in the HLL (eg for Rust: GOTO, downcasting, and changing potentially shared variables under conditions where they have not actually been shared; [1])
the core language has fewer primitives than the HLL so it is easier to work with than the HLL AST; but compared to a LLL (low-level language) like LLVM, less high-level intent has been lost, so it is easier to write some optimizations, or other analysis like borrow checking, on the core language than on the LLL
more caching for incremental re-compilation within compilation units
focuses the designer's mind on what is 'core' for the language

Links:

todo: somewhere put my observations on LLLs and target languages:

'target' languages include both 'Core languages' and LLLs

goal for:

HLL: usability
core: easy compilation from HLL->core; easy manipulation by compiler transformations (implies simplicity). Unlike HLL, verbosity is fine. Unlike LLL, machine sympathy is not necessary.
LLL: efficiency (for actual assembly languages) OR ease of transformation to backend (and similar enough to actual assembly such that stuff which looks efficient in the LLL remains so after compilation to assembly) (for eg LLVM)

Core languages tend to be similar to the HLL, except that they:

are de-sugared (eg by applying various local 'desugar' transformations)
have fewer primitives than the HLL (eg de-sugaring various control flow constructs to GOTO)
in safe or prescriptive languages, sometimes have primitives that are more powerful than the HLL's (eg in Rust the MIR core language has GOTO but Rust does not)
still retain function calling
have explicit types
have a bunch of other things made explicit, so that the core language is way too verbose for human use, but with a similar control flow to the original HLL
like the HLL but unlike the LLL, are still nonlinear (parenthesized subexpressions are still present)

LLLs and 'assembly' languages are more similar to each other than you might expect.

LLL general properties:

linear, rather than AST (each instruction is at an address; you cannot give a parenthesized subexpression as the argument to a primitive instruction)
no type polymorphism (would lead to inefficient dispatch)
fixed sizes for things (eg almost always a max bit width of addresses, often the max # of arguments that a primitive instruction can take)
designed along with, and constrained by, an instruction encoding (eg # of registers and addressing modes may be small so as to permit a dense instruction encoding)
designed along with, and constrained by, internal representations for primitive data types
often takes multiple instructions to do one 'primitive' thing such as a function call (eg first you must marshal the function arguments into their proper place, then you call the function)
often LLLs are in a 'canonical form' such as SSA, CPS (eg LLVM is SSA), or if not, often the compiler chooses to transform to a canonical form before the LLL
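
To make the 'linear, rather than AST' property above concrete, here is a minimal sketch that flattens a nested expression into a list of three-address instructions, where every operand is a plain name or constant rather than a parenthesized subexpression. It borrows Python's own `ast` module as the front end; the function name `linearize` and the temporaries `t1`, `t2`, ... are invented for this example.

```python
import ast
import itertools

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def linearize(expression):
    """Flatten a nested arithmetic expression into three-address instructions."""
    code = []
    temps = (f"t{i}" for i in itertools.count(1))

    def visit(node):
        # Returns the name (or literal) that holds the value of this subtree.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left = visit(node.left)
            right = visit(node.right)
            dest = next(temps)
            code.append((dest, OPS[type(node.op)], left, right))
            return dest
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    result = visit(ast.parse(expression, mode="eval").body)
    return code, result

code, result = linearize("(a + b) * (c + d)")
for dest, op, left, right in code:
    print(f"{dest} = {left} {op} {right}")
```

Since each temporary is assigned exactly once, the output of this particular sketch also happens to satisfy the single-assignment property discussed in the section on normal forms.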

Main families of LLLs:

imperative
- stack vs. register machines
  - stack machines (eg JVM, Python's, mb Forth)
  - register machines
- 3-operand vs 2- or 1-operand register machines
- various calling conventions
- addressing modes
  - LOAD/STORE vs addressing modes vs LOAD/STORE with constant argument but self-modifying code
  - explicit constant instructions (eg ADDC) vs constant addressing mode vs LOADK
- are registers and memory locations actual memory locations upon which address arithmetic can be done (like in assembly), or just variable identifiers (like in Lua)?
functional
- eg https://en.wikipedia.org/wiki/SECD_machine
combinator
- eg Nock

Also, instruction encodings might be fixed-width or variable (most are variable).

Primitive operations in LLLs: not as much variance as you might expect. Generally:

most everyone has basic arithmetic; addition, subtraction, maybe multiplication, maybe division, various shifts
some sort of GOTO
some sort of conditional local branch or skip; possibly with a TEST or COND instruction separated from the branch/skip, possibly not
possibly bitwise test/set
possibly bitwise boolean ops
possibly some composite data structures; some variance here but not a ton
- cons cells
- fixed-size vectors
- variable-length arrays
- associative arrays
- strings

Links regarding stack machine vs register machines:

"Interpreters for virtual stack machines are easier to build than interpreters for register or memory-to-memory machines; the logic for handling memory address modes is in just one place rather than repeated in many instructions." -- [2]
[3]
Registers vs stacks for interpreter design says "it's easier to generate code for a stack machine" but goes on to say that register machines may be more efficient.
Register machines may be faster, according to Virtual Machine Showdown: Stack Versus Registers by Yunhe Shi, David Gregg, Andrew Beatty, M. Anton Ertl.
- (cites The case for virtual register machines)
https://en.wikipedia.org/wiki/Stack_machine ([4])
A Performance Survey on Stack-based and Register-based Virtual Machines
- discussion: https://news.ycombinator.com/item?id=13154111
Why have a stack? by Eric Lippert
https://web.archive.org/web/20200610223140/http://home.pipeline.com/~hbaker1/ForthStack.html

---

Someone's response to the question "Does anyone know if there is research into the effects of register counts on VMs?":

" Someone 1819 days ago [-]

If you can map all VM registers to CPU registers, the interpreter will be way simpler.

If you have more VM registers than CPU registers, you have to write code to load and store your VM registers.

If you have more CPU registers than VM registers, you will have to write code to undo some of the register spilling done by the compiler (if you want your VM to use the hardware to the fullest.)

So, the optimum (from a viewpoint of 'makes implementation easier') depends on the target architecture.

Of course, a generic VM cannot know how many registers (and what kinds; uniform register files are not universally present) the target CPU will have.

That is a reason that choosing either zero (i.e. a stack machine) or infinitely many (i.e. SSA) are popular choices for VMs: for the first, you know for sure your target will not have fewer registers than your VM, for the second, that it will not have more.

If you choose any other number of VM registers, a fast generic VM will have to handle both cases.

Alternatively, of course, one can target a specific architecture, and have the VM be fairly close to the target; Google's NaCl is (was? I think they changed things a bit) an extreme example. I have not checked the code, but this, I think, is similar. " -- https://news.ycombinator.com/item?id=2930109

---

Chapter : modularity

whole program analysis

If modules are separately compiled and allow types to remain completely abstract, i.e. when a module is compiled it is possible to not know what the in-memory representation of values in the module will be, then a host of optimizations cannot be performed. Alternatives:

Acceptance: produce fully generic code and accept the fact that the program will execute slowly. This is especially a problem if the language allows operator overloading on primitive operations, and even more so if it allows redefining standard operators on primitive types; e.g. even adding two integers will be slowed down if the compiler did not know at compile time that integers were involved, or did not know whether '+' had been overridden for integers.
produce bytecode, and either have a bytecode interpreter at runtime (slow execution), or compile the bytecode at runtime (slow startup) or have a just-in-time compiler (complex)
do the optimizations at link time instead of compile time; by this time the compiler (linker) has all the information and can tell what the representation will be for each value. But if the language supports dynamic linking, then 'link time' does not really occur until runtime, so this reduces to the former case.
give up on separate compilation; at compile time the compiler requires the source code of all libraries. This is impractical because some users may want to use closed-source libraries created by others, and it is also slow because libraries must be recompiled each time they are used.
create something like header files that can be distributed separately from library source code, such that by looking at all of the headers for all of the libraries, enough information is provided that the compiler can determine the in-memory representation of all values in the program at compile time. In some languages, such as C++, the header files require processing (macro preprocessing in the case of C++) and the same header file may be reloaded multiple times during one compilation, slowing down compilation significantly. The presence of macros in the header files also adds complexity and makes it more difficult to write third-party tools and to interoperate with other languages (cite https://news.ycombinator.com/item?id=6273739 ). Language designers using headers should consider (a) omitting macros, and (b) ensuring that each header file needs to be read at most once during any single compilation, and compiler writers should consider producing binary compiled header files as an intermediate output to save time during later recompilation.

see section "compilation model" in "Retrospective Thoughts on BitC" http://www.coyotos.org/pipermail/bitc-dev/2012-March/003300.html

https://news.ycombinator.com/item?id=5422094

ml functors

intermediate representations (IRs)

" Some common intermediate representations:

General forms of intermediate representations (IR):
Graphical IR (parse tree, abstract syntax trees, DAG. . . )
Linear IR (ie., non graphical)
Three Address Code (TAC): instructions of the form “result=op1 operator op2”
Static single assignment (SSA) form: each variable is assigned once
Continuation-passing style (CPS): general form of IR for functional languages
Control flow graph

Examples:

Java bytecode (executed on the Java Virtual Machine)
LLVM (Low Level Virtual Machine): SSA and TAC based
C is used in several compilers as an intermediate representation (Lisp, Haskell, Cython. . . )
Microsoft’s Common Intermediate Language (CIL)
GNU Compiler Collection (GCC) uses several intermediate representations:
- Abstract syntax trees
- GENERIC (tree-based)
- GIMPLE (SSA form)
- Register Transfer Language (RTL, inspired by lisp lists) "

-- [5]

[6] also describes and motivates SSA and CPS in more detail. It defines SSA and explains why you need phi in SSA, and refers to an algorithm to determine where to place the phis.

Removing and Restoring Control Flow with the Value State Dependence Graph, thesis by James Stanier, Foundations of Software Systems School of Informatics University of Sussex . http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf , Chapter 2, Intermediate Representations in Compilers: A Survey

Abstract machines and the compilers that love/hate them

Introducing structured control flow

Some target languages, such as WASM and SPIR-V, require some amount of structure in control flow. If the source language doesn't provide this structure, it must be introduced.

Links:

Chapter : possible compiler safety checks beyond typechecking

check to see if there are any reads of uninitialized variables; this could be implemented as types
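
A minimal sketch of such a check, written as a 'definite assignment' analysis over a toy statement language. The tuple-based statement encoding and the name `check_block` are assumptions for this example only. A variable counts as initialized after an if/else only if both branches assign it, which is why the two sets are intersected at the join.

```python
def check_block(stmts, assigned, errors):
    """Return the set of variables definitely assigned after running stmts.

    Statements (hypothetical encoding):
        ("assign", target, [variables read])
        ("if", [variables read by the condition], then_block, else_block)
    Reads of possibly uninitialized variables are appended to errors.
    """
    assigned = set(assigned)
    for stmt in stmts:
        if stmt[0] == "assign":
            _, target, uses = stmt
            errors.extend(u for u in uses if u not in assigned)
            assigned.add(target)
        elif stmt[0] == "if":
            _, uses, then_block, else_block = stmt
            errors.extend(u for u in uses if u not in assigned)
            after_then = check_block(then_block, assigned, errors)
            after_else = check_block(else_block, assigned, errors)
            assigned = after_then & after_else   # must be set on both paths
        else:
            raise ValueError(f"unknown statement {stmt[0]!r}")
    return assigned

program = [
    ("assign", "x", []),              # x = 1
    ("if", ["x"],                     # if x < 5:
        [("assign", "v", [])],        #     v = 1
        []),                          # (no else branch)
    ("assign", "w", ["v"]),           # w = v + 1   <- v may be unset here
]
errors = []
check_block(program, set(), errors)
print(errors)
```

The check is conservative: it reports `v` even if the condition happens to be true on every actual run.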

Chapter : linker

issue in C:

According to one of the comments on http://www.slideshare.net/olvemaudal/deep-c (I didn't check this), if you declare your own printf in C with the wrong signature, it will still be linked to the printf in the standard library, but will crash at runtime; e.g. "void printf( int x, int y); main() {int a=42, b=99; printf( a, b);}" will apparently crash.

-- A new programming language might want to throw a compile-time error in such a case (as C++ apparently does, according to the slides).

Links:

http://www.lurklurk.org/linkers/linkers.html

Chapter : concurrency implementation

atomics

SIMD, MIMD

GPU

Useful algorithms and data structures

Efficient operations on long strings: http://en.wikipedia.org/wiki/Rope_%28data_structure%29

Chapter: designing and implementing a virtual machine

"Instructions should not bind together operations which an optimizing compiler might otherwise choose to separate in order to produce a more efficient program."

Brian Case

Chapter: ?? where to put these

Chapter: implementing functional languages

Links:

http://stackoverflow.com/questions/10992852/what-are-some-obvious-optimizations-for-a-virtual-machine-implementing-a-functio?rq=1
https://www.amazon.com/Virtual-Machines-Iain-D-Craig/dp/1852339691?ie=UTF8&tag=stackoverfl08-20
Implementing functional languages: a tutorial, Simon Peyton Jones and David Lester. Published by Prentice Hall, 1992.
Practical Foundations for Programming Languages, Robert Harper

todo: what else?

Chapter : Normal forms

"Basic block": "a portion of the code within a program with only one entry point and only one exit point." (wikipedia)

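One common way to form basic blocks from a linear instruction list is to find the 'leaders' (the first instruction, every jump target, and every instruction following a jump, branch or return) and cut the list at each leader. Below is a minimal sketch of that idea; the tuple-based instruction format is invented for this example, and every label is conservatively treated as a possible jump target.

```python
def basic_blocks(instrs):
    """Split a linear instruction list into basic blocks (lists of instructions)."""
    if not instrs:
        return []
    leaders = {0}                       # the first instruction starts a block
    for i, ins in enumerate(instrs):
        if ins[0] == "label":
            leaders.add(i)              # a potential jump target starts a block
        elif ins[0] in ("jump", "branch", "ret") and i + 1 < len(instrs):
            leaders.add(i + 1)          # so does whatever follows a control transfer
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [instrs[a:b] for a, b in zip(starts, ends)]

program = [
    ("op", "i = 0"),
    ("label", "loop"),
    ("branch", "i >= n", "done"),
    ("op", "i = i + 1"),
    ("jump", "loop"),
    ("label", "done"),
    ("ret",),
]
blocks = basic_blocks(program)
for number, block in enumerate(blocks):
    print(number, block)
```

Within each resulting block, control can only enter at the first instruction and leave at the last, matching the definition quoted above.
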
Defn in lambda vs here

SSA

"in CPS, as in SSA or ANF, expressions don't have subexpressions." -- http://wingolog.org/archives/2014/05/18/effects-analysis-in-guile

"If an intermediate language is in SSA form, then every variable has a single definition site. Single-assignment is a static property because in a general control flow graph the assignment might be in a loop. SSA form was developed by Wegman, Zadeck, Alpern, and Rosen [4, 37] for efficient computation of data flow problems, such as global value numbering and detecting equality of variables." -- https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf , Chapter 3

For example (examples taken from https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf ), if the source code is:

  x = a + b;
  y = x + 1;
  x = a + 1;

then this is equivalent to:

  x1 = a + b; 
  y1 = x1 + 1; 
  x2 = a + 1;
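
For straight-line code like this, the renaming can be done in one pass with a version counter per variable. The following is a minimal sketch under simplifying assumptions: expressions are plain strings, identifiers are found with a regular expression, and names never assigned in the fragment (such as a and b) are treated as inputs and left unchanged. The function name `rename_straight_line` is made up for this example.

```python
import re
from collections import defaultdict

def rename_straight_line(assignments):
    """Rename a list of (target, expression) pairs into SSA form."""
    version = defaultdict(int)

    def current(name):
        n = version[name]
        return f"{name}{n}" if n else name    # unassigned names are inputs

    out = []
    for target, expr in assignments:
        # Uses refer to the latest version defined so far.
        renamed = re.sub(r"\b[A-Za-z_]\w*", lambda m: current(m.group()), expr)
        # Each assignment creates a fresh version of its target.
        version[target] += 1
        out.append((f"{target}{version[target]}", renamed))
    return out

ssa = rename_straight_line([("x", "a + b"), ("y", "x + 1"), ("x", "a + 1")])
for target, expr in ssa:
    print(f"{target} = {expr};")
```

This only works because there is no branching; once control flow edges join, renaming alone is not enough.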

When two control flow edges join, we have to do something else. Consider

 if x < 5 then v = 1 else v = 2;
 w = v + 1;

In this example, v has two possible definition sites, violating SSA, but we don't know statically which one will be used. As a control flow graph, we have a diamond:

       (if x < 5)
       /        \
      /          \
 (v1 = 1)      (v2 = 2)
      \          /
       \        /
     (w = v? + 1)

SSA deals with this by introducing a "magic" operator, Phi. Phi is only allowed to occur at the beginning of a basic block. Its semantics are that Phi "knows", when it executes, from which basic block control was most recently passed to it; it is given multiple arguments, and it simply returns one of them, choosing which one based on the basic block from which control arrived. For example, the above diamond can be replaced by:

       (if x < 5)
       /        \
      /          \
 (v1 = 1)      (v2 = 2)
      \          /
       \        /
   (v3 = Phi(v1,v2))
     (w = v3 + 1)

here, Phi will return v1 if control came to it by the left path, or it will return v2 if control came to it by the right path. Meaning that v3 will end up being equal to v1 if the left path is taken, and it will end up being equal to v2 if the right path is taken. Note that the SSA property is preserved; each variable is assigned to exactly once.
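
The 'magic' in Phi becomes less mysterious in a toy interpreter that remembers which block it just came from. This is only a sketch: the dictionary-based block format and the name `run` are assumptions for illustration, not how any real compiler represents SSA (real compilers normally eliminate Phi before code generation rather than interpreting it).

```python
def run(blocks, entry, env):
    """Interpret a tiny SSA control flow graph; returns the final environment."""
    env = dict(env)
    prev, label = None, entry
    while label is not None:
        block = blocks[label]
        # Phi nodes sit at the top of the block and select by predecessor.
        # All of them read the old environment before any of them writes.
        selected = {dest: env[choices[prev]] for dest, choices in block.get("phis", [])}
        env.update(selected)
        for dest, fn, args in block.get("body", []):
            env[dest] = fn(*(env[a] for a in args))
        prev, label = label, block["next"](env)
    return env

# The diamond from the text: v3 = Phi(v1, v2); w = v3 + 1
blocks = {
    "entry": {"next": lambda env: "left" if env["x"] < 5 else "right"},
    "left":  {"body": [("v1", lambda: 1, [])], "next": lambda env: "join"},
    "right": {"body": [("v2", lambda: 2, [])], "next": lambda env: "join"},
    "join":  {"phis": [("v3", {"left": "v1", "right": "v2"})],
              "body": [("w", lambda v: v + 1, ["v3"])],
              "next": lambda env: None},
}
print(run(blocks, "entry", {"x": 3})["w"])
print(run(blocks, "entry", {"x": 7})["w"])
```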

"SSA form is typically used as intermediate representation for imperative languages. The functional programming community prefers the λ-calculus and continuations as intermediate language. Andrew Appel pointed out the close relationship between the two representations in his article ”SSA is Functional Programming”" -- https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf

AliceML uses a weakening of SSA called Executable SSA Form, which works only on acyclic (control?) graphs, and which does not require Phi nodes. See https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf Chapter 3, section 3.1.1. One disadvantage of that is that "the removal of Phi-functions in the abstract code might artificially extend the liveness of a variable across branches. Suppose a left branch sets a variable at the beginning and the right branch synchronizes this variable at the end. Then, the variable is declared to be live over the whole right branch. This increases register pressure and decreases code quality for register-poor architectures."

A good, simple description of phi nodes: https://capra.cs.cornell.edu/bril/lang/ssa.html

" A program is in SSA form if: 1. each definition has a distinct name 2. each use refers to a single definition ... Main interest: allows to implement several code optimizations " -- http://www.montefiore.ulg.ac.be/~geurts/Cours/compil/2014/05-intermediatecode-2014-2015.pdf

" Transforming a 3-address representation into stack is easier than a stack one into 3-address.

Your sequence should be the following:

    Form basic blocks
    Perform an SSA-transform
    Build expression trees within the basic blocks
    Perform a register scheduling (and phi- removal simultaneously) to allocate local variables for the registers not eliminated by the previous step
    Emit a JVM code - registers goes into variables, expression trees are trivially expanded into stack operations

(answered Dec 8 '11 at 8:18 by SK-logic)

Wow! Thanks this is just what I was looking for. Questions: do I need to do the SSA transform within each basic block or across the whole procedure? Do you have any pointers to tutorials, textbooks or other resources? – akbertram Dec 8 '11 at 9:09

SSA transform is always procedure-wide: en.wikipedia.org/wiki/Single_static_assignment You'll just need to find a dominance frontier for each basic block where you're assigning a variable (with multiple assignment locations), insert the phi nodes there and then get rid of the redundant phis (n.b.: some may have circular dependencies). – SK-logic Dec 8 '11 at 9:13

@akbertram, LLVM can be a useful source of inspiration here, you can safely model your intermediate representation after it. Some important design decisions from there: do not allow to assign one register to another, and do not allow to assign a constant to a register, always substitute it in place instead. – SK-logic Dec 8 '11 at 9:16

The expression tree looks a lot like the original AST -- is it worth making a round-trip through a TAC-like IR or does another sort of IR make sense when the only target is the JVM? – akbertram Dec 8 '11 at 9:17

@akbertram, if you can stuff your JVM compilation before making TAC - then yes, you do not need it, a direct stack code generation is easier. Otherwise, if you can only stack on top of the existing compiler infrastructure, you'll need to re-construct the expression trees. Funny bit is that the JIT will do it again too. – SK-logic Dec 8 '11 at 9:22 " -- [7]

Alternatives to Phi nodes in SSA:

"Instead of using phi nodes, MLIR uses a functional form of SSA where terminators pass values into block arguments defined by the successor block." [8]. See also [9]

Links:

CPS

See e.g.

CPS vs. SSA: http://wingolog.org/archives/2011/07/12/static-single-assignment-for-functional-programmers

"in CPS, as in SSA or ANF, expressions don't have subexpressions." -- http://wingolog.org/archives/2014/05/18/effects-analysis-in-guile

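To see what 'expressions don't have subexpressions' looks like in CPS, compare a direct-style function with a hand-converted CPS version. This is a minimal sketch in Python; the helper names `add_k` and `mul_k` are invented, and real CPS-based compilers work on an IR rather than on source-level lambdas (Python also lacks tail-call elimination, so this style would not scale to deep call chains).

```python
def add_k(a, b, k):
    """Add, then pass the result to the continuation k instead of returning it."""
    return k(a + b)

def mul_k(a, b, k):
    """Multiply, then pass the result to the continuation k."""
    return k(a * b)

def direct(a, b, c, d):
    # Direct style: the operands of * are nested subexpressions.
    return (a + b) * (c + d)

def cps(a, b, c, d, k):
    # CPS: every operand is a plain variable, every intermediate result
    # gets a name (t1, t2), and the evaluation order is explicit.
    return add_k(a, b, lambda t1:
           add_k(c, d, lambda t2:
           mul_k(t1, t2, k)))

print(direct(1, 2, 3, 4))
print(cps(1, 2, 3, 4, lambda result: result))
```

Note how the continuation parameters t1 and t2 play the same role as the single-assignment temporaries in SSA or three-address code.
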
" There are a couple of things going on with CPS as implemented in this style of compilers that make it particularly nice for writing optimization passes:

1) There are only function calls ("throws") and no returns. Though there's no theoretical reduction in the stuff you have to track (since what is a direct-style return is now a throw to the rest of the program starting at the return point), there's a huge reduction in the complexity of your optimizations because you're not tracking the extra thing. For some examples, you can look in the Manticore codebase where even simple optimizations like contraction and let-floating are implemented in both our early direct-style IR (BOM) and our CPS-style IR (CPS). The latter is gobs more readable, and there's no way at all I'd be willing to port most of the harder optimizations like reflow-informed higher-order inlining or useless variable elimination to BOM.

2) IRs are cheap in languages like lisp and ML. So you write optimizations as a tree-to-tree transformation (micro-optimization passes). This style makes it much easier to enforce invariants. If you look at the internals of most compilers written in C++, you'll see far fewer copies of the IR made and a whole bunch of staged state in the object that's only valid at certain points in the compilation process (e.g., symbols fully resolved only after phase 2b, but possibly invalid for a short time during optimization 3d unless you call function F3d_resolve_symbol...). Just CPS'ing the representation of a C++ compiler without also making it easy to write your optimization passes as efficient tree-to-tree transformations will not buy you much, IMO. " -- [10]

" So the lineage of the CPS-as-compiler-IR thesis goes from Steele's Rabbit compiler through T's Orbit to SML/NJ. At which point Sabry & Felleisen at Rice published a series of very heavy-duty papers dumping on CPS as a representation and proposing an alternate called A-Normal Form. ANF has been the fashionable representation for about ten years now; CPS is out of favor. This thread then sort of jumps tracks over to the CMU ML community, where it picks up the important typed-intermediate-language track and heads to Cornell, and Yale, but I'm not going to follow that now. " -- http://www.paulgraham.com/thist.html

ANF

"in CPS, as in SSA or ANF, expressions don't have subexpressions." -- http://wingolog.org/archives/2014/05/18/effects-analysis-in-guile

http://lambda-the-ultimate.org/node/1617#comment-19700

http://lambda-the-ultimate.org/node/69

Guarded SSA

see Optimizing compilers for structured programming languages by Marc Michael Brandis

Egraphs

https://github.com/cfallin/rfcs/blob/cranelift-egraphs/accepted/cranelift-egraph.md

Chapter : Performance optimizations

Many large books have been filled with the details of writing efficient compilers and interpreters, so this chapter will only provide an overview of selected techniques.

Inlining

Calling a function typically involves manipulating the stack: upon entry, the return address and the frame pointer register must be pushed onto the stack, the arguments must be pushed onto the stack, the top-of-stack pointer register must be adjusted, and the frame pointer register must be adjusted to point to the new current stack position.

You can save this overhead by inlining, i.e. replacing the call with a copy of the callee's body, at the cost of larger code.

Stack machines and stack items in registers

Some languages present a computation model in which there is only a stack, no registers. In this case, assuming that the underlying hardware has registers, it may speed things up to use some of the registers to hold the top few stack positions.
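
A minimal sketch of the idea, with a toy stack-machine interpreter that keeps the top stack position in a local variable (standing in for a hardware register) and only spills it to the in-memory stack when something else is pushed. The opcode names and the function `run_stack_program` are made up for this example, and in Python the 'register' is of course just an ordinary local, so this shows the structure rather than any actual speedup.

```python
def run_stack_program(program, variables):
    """Interpret a tiny stack-machine program, caching the top of stack."""
    stack = []      # holds everything below the top element
    tos = None      # cached top-of-stack value
    depth = 0       # logical stack depth, including the cached element
    for op, arg in program:
        if op == "push":
            if depth > 0:
                stack.append(tos)      # spill the old top to memory
            tos = variables[arg]
            depth += 1
        elif op in ("add", "mul"):
            left = stack.pop()         # one stack-memory read instead of two
            tos = left + tos if op == "add" else left * tos
            depth -= 1
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return tos

# X = (A + B) * (C + D) in postfix order
program = [
    ("push", "A"), ("push", "B"), ("add", None),
    ("push", "C"), ("push", "D"), ("add", None),
    ("mul", None),
]
x = run_stack_program(program, {"A": 1, "B": 2, "C": 3, "D": 4})
print(x)
```

Caching more than one stack position follows the same pattern but needs more bookkeeping about which cached slots are currently valid.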

Tagged data

Startup time

Dynamic loading: Languages with dynamic loading can experience slow startup times if they:

Load a large amount of data (cite http://web.archive.org/web/20071011183655/http://java.sun.com/developer/technicalArticles/javase/consumerjre/#Quickstarter , https://en.wikipedia.org/wiki/Java_performance#Startup_time )
Search through a large number of directory paths to look for files (cite https://news.ycombinator.com/item?id=5422094 , https://news.ycombinator.com/item?id=5422138 , https://news.ycombinator.com/item?id=5425365 )

Effects analysis

Common subexpression elimination (CSE)

Flow analysis

http://wingolog.org/archives/2014/07/01/flow-analysis-in-guile

Links:

Chapter :

canonical impl vs std

The dismaying history of languages without a canonical implementation

Languages without a canonical implementation tend to:

fragment into many incompatible variant languages, each of which claims heritage from the parent, but each of which is non-conformant to the standard, and/or
eventually find themselves in a situation where one implementation 'wins' and becomes the canonical implementation, after which that implementation breaks conformance with the standard and (for the most part) drives the standard, rather than the other way around

Scheme started out as an implementation, was described in a series of published papers (the Lambda Papers), turned into a standard, however

Haskell started out as a standard (todo confirm), then many implementations sprang up, then one won out (GHC). As of this writing, GHC has stopped maintaining conformance with the language standard (except while running in a special mode that is not interoperable with most libraries in the wild; see [12] ), and discussions regarding changing the language appear to revolve around changing GHC first, with updates to bring the standard into conformance with GHC as an afterthought [13].

self-hosting

benefits, costs (lots of popular languages don't)

portability

kernel approach

cross-compiling when self-hosting

There is always the 'chain solution': Every self-hosting language had to be bootstrapped from another language at the beginning. And each subsequent self-hosting version of a self-hosting language had to be compilable by the previous version. So, to bootstrap the language onto a new platform, one could simply re-bootstrap that same early version of the language, and then re-compile each version of the language's compiler in sequence (see eg [14]). Disadvantages to this include: (a) if you are worried about Thompson 'trusting trust' attacks (see section below), then you must audit the source code of each version of the language implementation in this chain; and (b) this seems like a lot more compilation steps than should be necessary.

Another option is for the language to deliberately restrict its own implementation to using code compatible with an earlier version of itself (eg Golang [15]). This removes some of the benefits of a self-hosting language in terms of using the compiler as a way for the language designers to eat their own dogfood.

standards bodies

various stories of standards processes and advice on what to do if you find yourself involved in a standards process:

http://www.nhplace.com/kent/Papers/cl-untold-story.html . some advice from that:
- write a charter defining scope and the broad criteria that will be used to make decisions about the language
- for each proposal, fill out forms. The forms should have the problem description and the proposed solution (and versioning stuff). The problem description should be agreed upon (rough consensus) by everyone dealing with the problem, but each faction may write their own proposed solution.
- editing a standards document is about trust. The editor needs to be trusted by all to keep their partisan views out of the editing (e.g. to edit ONLY as prompted by technical votes taken by the committee, never to make even small material changes to the standard which were not the result of a technical vote). even "if a certain aspect of the text of CLTL was meaningless or ambiguous or useless because of how it was written, my job as editor was to make the text clear enough that the meaningless, ambiguous, or useless nature was more transparent. I would sometimes say the job was to transform text from “implicitly vague” to “explicitly vague.”"
(didn't i read another good blog post somewhere about this?)

Thompson 'trusting trust' attacks

Ken Thompson wrote a popular essay in which he pointed out that a compiler could be subverted to introduce attacker-desired behavior into the programs it compiles; if the subversion is clever, then such a compiler, when compiling itself from source, would continually re-introduce the subversion.

Links:

Chapter: Case studies

https://blog.mozilla.org/luke/2014/01/14/asm-js-aot-compilation-and-startup-performance/

Chapter: Stacks

" 1.4 WHY ARE STACKS USED IN COMPUTERS?

Both hardware and software stacks have been used to support four major computing areas in computing requirements: expression evaluation, subroutine return address storage, dynamically allocated local variable storage, and subroutine parameter passing.

1.4.1 Expression evaluation stack

Expression evaluation stacks were the first kind of stacks to be widely supported by special hardware. As a compiler interprets an arithmetic expression, it must keep track of intermediate stages and precedence of operations using an evaluation stack. In the case of an interpreted language, two stacks are kept. One stack contains the pending operations that await completion of higher precedence operations. The other stack contains the intermediate inputs that are associated with the pending operations. In a compiled language, the compiler keeps track of the pending operations during its instruction generation, and the hardware uses a single expression evaluation stack to hold intermediate results.

To see why stacks are well suited to expression evaluation, consider how the following arithmetic expression would be computed:

X = (A + B) * (C + D)

First, A and B would be added together. Then, this intermediate results must be saved somewhere. Let us say that it is pushed onto the expression evaluation stack. Next, C and D are added and the result is also pushed onto the expression evaluation stack. Finally, the top two stack elements (A+B and C+D) are multiplied and the result is stored in X. The expression evaluation stack provides automatic management of intermediate results of expressions, and allows as many levels of precedence in the expression as there are available stack elements. Those readers who have used Hewlett Packard calculators, which use Reverse Polish Notation, have direct experience with an expression evaluation stack.

The use of an expression evaluation stack is so basic to the evaluation of expressions that even register-based machine compilers often allocate registers as if they formed an expression evaluation stack.

1.4.2 The return address stack

With the introduction of recursion as a desirable language feature in the late 1950s, a means of storing the return address of a subroutine in dynamically allocated storage was required. The problem was that a common method for storing subroutine return addresses in non-recursive languages like FORTRAN was to allocate a space within the body of the subroutine for saving the return address. This, of course, prevented a subroutine from directly or indirectly calling itself, since the previously saved return address would be lost.

The solution to the recursion problem is to use a stack for storing the subroutine return address. As each subroutine is called the machine saves the return address of the calling program on a stack. This ensures that subroutine returns are processed in the reverse order of subroutine calls, which is the desired operation. Since new elements are allocated on the stack automatically at each subroutine call, recursive routines may call themselves without any problems.

Modern machines usually have some sort of hardware support for a return address stack. In conventional machines, this support is often a stack pointer register and instructions for performing subroutine calls and subroutine returns. This return address stack is usually kept in an otherwise unused portion of program memory.

1.4.3 The local variable stack

Another problem that arises when using recursion, and especially when also allowing reentrancy (the possibility of multiple uses of the same code by different threads of control) is the management of local variables. Once again, in older languages like FORTRAN, management of information for a subroutine was handled simply by allocating storage assigned permanently to the subroutine code. This kind of statically allocated storage is fine for programs which are neither reentrant nor recursive.

However, as soon as it is possible for a subroutine to be used by multiple threads of control simultaneously or to be recursively called, statically defined local variables within the procedure become almost impossible to maintain properly. The values of the variables for one thread of execution can be easily corrupted by another competing thread. The solution that is most frequently used is to allocate the space on a local variable stack. New blocks of memory are allocated on the local variable stack with each subroutine call, creating working storage for the subroutine. Even if only registers are used to hold temporary values within the subroutine, a local variable stack of some sort is required to save register values of the calling routine before they are destroyed.

The local variable stack not only allows reentrancy and recursion, but it can also save memory. In subroutines with statically allocated local variables, the variables take up space whether the subroutine is active or not. With a local variable stack, space on the stack is reused as subroutines are called and the stack depth increases and decreases.

1.4.4 The parameter stack

The final common use for a stack in computing is as a subroutine parameter stack. Whenever a subroutine is called it must usually be given a set of parameters upon which to act. Those parameters may be passed by placing values in registers, which has the disadvantage of limiting the possible number of parameters. The parameters may also be passed by copying them or pointers to them into a list in the calling routine's memory. In this case, reentrancy and recursion may not be possible. The most flexible method is to simply copy the elements onto a parameter stack before performing a procedure call. The parameter stack allows both recursion and reentrancy in programs.

1.4.5 Combination stacks

Real machines combine the various stack types. It is common in register-based machines to see the local variable stack, parameter stack, and return address stack combined into a single stack of activation records, or "frames." In these machines, expression evaluation stacks are eliminated by the compiler, and instead registers are allocated to perform expression evaluation.

The approach taken by the stack machines described later in this book is to have separate hardware expression evaluation and return stacks. The expression evaluation stacks are also used for parameter passing and local variable storage. Sometimes, especially when conventional languages such as C or Pascal are being executed, a frame pointer register is used to store local variables in an area of program memory. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec1_4.html
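As a small illustration of the expression evaluation stack described in the excerpt above, here is a minimal Python sketch that evaluates X = (A + B) * (C + D) in postfix (Reverse Polish) form. The function name eval_rpn, the OPS table, and the sample values bound to A through D are assumptions made for this example only.

```python
import operator

# Binary operators supported by this sketch (an arbitrary small set).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_rpn(tokens, env):
    """Evaluate a postfix token list using a single expression evaluation stack."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            # Operands were pushed left to right, so the right one is on top.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            # Anything that is not an operator is treated as a variable name.
            stack.append(env[tok])
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack.pop()

# (A + B) * (C + D) written in postfix order.
env = {"A": 1, "B": 2, "C": 3, "D": 4}
x = eval_rpn(["A", "B", "+", "C", "D", "+", "*"], env)
print(x)  # 21
```

The two intermediate sums sit on the stack until the multiply consumes them, which is the "automatic management of intermediate results" the excerpt refers to.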

" ...nesting of tasks or threads. The task and its creator share the stack frames that existed at the time of task creation, but not the creator's subsequent frames nor the task's own frames. This was supported by a cactus stack, whose layout diagram resembled the trunk and arms of a Saguaro cactus. Each task had its own memory segment holding its stack and the frames that it owns. The base of this stack is linked to the middle of its creator's stack. " -- https://en.wikipedia.org/wiki/Stack_machine (see also https://en.wikipedia.org/wiki/Spaghetti_stack , another name for the same concept)

" Use in programming language runtimes

The term spaghetti stack is closely associated with implementations of programming languages that support continuations. Spaghetti stacks are used to implement the actual run-time stack containing variable bindings and other environmental features. When continuations must be supported, a function's local variables cannot be destroyed when that function returns: a saved continuation may later re-enter into that function, and will expect not only the variables there to be intact, but it will also expect the entire stack to be present so the function is able to return again. To resolve this problem, stack frames can be dynamically allocated in a spaghetti stack structure, and simply left behind to be garbage collected when no continuations refer to them any longer. This type of structure also solves both the upward and downward funarg problems, so first-class lexical closures are readily implemented in that substrate also. " -- https://en.wikipedia.org/wiki/Spaghetti_stack
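A rough sketch of the idea, assuming frames are ordinary heap objects linked by parent pointers (the class name Frame and its lookup method are hypothetical, not taken from any particular runtime): two child frames can share one parent frame, and a frame stays alive for as long as anything still refers to it.

```python
class Frame:
    """A heap-allocated activation record with a pointer to its parent frame."""

    def __init__(self, parent=None, **bindings):
        self.parent = parent
        self.vars = dict(bindings)

    def lookup(self, name):
        # Walk towards the root until some frame binds the name.
        frame = self
        while frame is not None:
            if name in frame.vars:
                return frame.vars[name]
            frame = frame.parent
        raise NameError(name)

# Two tasks (or two saved continuations) branch off the same creator frame.
root = Frame(x=1)
task_a = Frame(root, y=2)
task_b = Frame(root, y=3)

print(task_a.lookup("x"), task_a.lookup("y"))  # 1 2
print(task_b.lookup("x"), task_b.lookup("y"))  # 1 3
```

Because nothing is popped explicitly, reclaiming frames is left to the garbage collector, as the excerpt describes.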

Some VMs with stack-based (as opposed to register-based) instructions have a value stack (for the VM instructions), a call stack, and a block stack (to keep track of eg that you are in a FOR loop nested inside another FOR loop; a BREAK would pop this stack); for example, Python [16].

Stacks usually grow down

In many architectures the stack grows downwards [17] [18] [19], eg x86, MOS 6502, PDP-11, most MIPS ABIs [20] [21] [22].

In early memory-constrained devices, there is often just a section for program code, a section for data, and a stack; the code is placed near location 0, followed by the data section, whereas the stack is placed in high memory (near the largest addresses) and grows down [23].

Memory-constrained devices often don't have a heap that grows (only a fixed-size data section, which can be thought of as a fixed-size heap), which means that often the only memory section with a dynamically changing size is the stack. [24]

This arrangement would also work if the fixed-size data were placed in high memory and the stack were placed in low memory and grew up. This is seen sometimes (eg HP PA-RISC, Multics [25]) but less frequently.

One reason given for the stack being placed in high memory and growing downwards is that if the ISA makes it more efficient to add only UNSIGNED (always positive) offsets to memory addresses, then, since a common calculation is to add an offset to the stack pointer to locate something in the stack, this will be most efficient if that offset is always positive, that is, if the stack pointer points to the lowest memory location in the stack, which occurs if the stack grows downwards [26] [27]. In addition, alignment calculations are simpler; eg "If you place a local variable on the stack which must be placed on a 4-byte boundary, you can simply subtract the size of the object from the stack pointer, and then zero out the two lower bits to get a properly aligned address. If the stack grows upwards, ensuring alignment becomes a bit trickier." [28].
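The alignment arithmetic in that quotation can be sketched as follows. This is only an illustration using plain integers as addresses; the function names and the example addresses are made up for this sketch.

```python
def alloc_down(sp, size, align=4):
    """Downward-growing stack: subtract the size, then clear the low bits.

    Returns the new stack pointer, which is also the object's address.
    `align` must be a power of two.
    """
    return (sp - size) & ~(align - 1)

def alloc_up(sp, size, align=4):
    """Upward-growing stack: round the address up first, then add the size.

    Returns (object_address, new_stack_pointer).
    """
    addr = (sp + align - 1) & ~(align - 1)
    return addr, addr + size

print(alloc_down(1000, 6))  # 992
print(alloc_up(1001, 6))    # (1004, 1010)
```

In the downward case a single mask both allocates and aligns; in the upward case the rounding has to happen before the object is placed, and the new stack pointer is computed separately.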

Stacks: more topics

Elsewhere, we treat (todo):

segmented stacks, in which the stack may be 'segmented' into multiple discontiguous memory regions, allowing expansion of the stack via additional incremental memory allocations if it outgrows its initial allocation
spaghetti stacks or saguaro (cactus) stacks
- https://en.wikipedia.org/wiki/Parent_pointer_tree
- c2.com/cgi/wiki?SpaghettiStack
separate stacks for calls, data, blocks, etc
- eg SECD machine has data stack (S) and call stack (D), i think, todo
- eg Python's call stack, value stack, block stack [29]
- eg Forth's data stack vs return stack (call stack) [30]

by construct

short-circuit boolean expressions

Links:

A No-Frills Introduction to Lua 5.1 VM Instructions section 'Relational and Logic Instructions', page 35

regexes

There are three basic ways to implement regular expressions:

simulate the NFA specified by the regular expression directly. Worst-case running time is O(m*n), where m is the length of the regular expression and n is the length of the string being matched.
compile the NFA to a DFA and run that. Worst-case running time is O(2^m) to compile and then O(n) to match.
match the pattern against the input string by backtracking. Worst-case running time in n is O(2^n) (todo: what about m?) but this can usually be avoided by taking care not to use certain kinds of regexes (eg [31]). This is the only choice that can implement certain common regular expression extensions such as backreferences.

It is also possible to try to use one of these techniques and then fall back to another one in certain cases, or to partially combine the first two techniques via simulating the NFA but with caching [32].

grep, awk, tcl use combinations of the first two techniques. Perl, PCRE, Python, Ruby, and Java use the third technique [33].
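To make the first technique concrete, here is a minimal sketch of direct NFA simulation for a deliberately tiny regex dialect: literal characters, '.' for any character, and a postfix '*' on a single character, matched against the whole string. The dialect and the function names (parse, closure, matches) are assumptions for this example, not how any of the tools above are implemented. The point is that the matcher tracks a set of pattern positions rather than backtracking, so the set never holds more than m+1 states.

```python
def parse(pattern):
    """Split a pattern into (char, starred) tokens."""
    tokens = []
    for ch in pattern:
        if ch == "*":
            if not tokens or tokens[-1][1]:
                raise ValueError("'*' must follow a single character")
            tokens[-1] = (tokens[-1][0], True)
        else:
            tokens.append((ch, False))
    return tokens

def closure(states, tokens):
    """Add every state reachable without consuming input (skipping starred tokens)."""
    result = set()
    for s in states:
        result.add(s)
        while s < len(tokens) and tokens[s][1]:
            s += 1
            result.add(s)
    return result

def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    tokens = parse(pattern)
    states = closure({0}, tokens)
    for ch in text:
        nxt = set()
        for s in states:
            if s < len(tokens) and tokens[s][0] in (ch, "."):
                # A starred token may match again; a plain one advances.
                nxt.add(s if tokens[s][1] else s + 1)
        states = closure(nxt, tokens)
        if not states:
            return False
    return len(tokens) in states

print(matches("a*b", "aaab"))           # True
print(matches("a*a*a*b", "a" * 40))     # False, and it returns quickly
```

A backtracking matcher given the second pattern and input may try many ways of splitting the run of a's among the three starred tokens before giving up; the state-set simulation simply advances once per input character.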

A popular C library to implement (extended) regular expressions is PCRE.

Links:

https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
https://swtch.com/~rsc/regexp/regexp1.html
http://www.pcre.org/
examples of the sorts of regular expressions that can cause worst-case behavior with the backtracking implementation: http://www.regular-expressions.info/catastrophic.html
regular expression builder: https://regex101.com/

for loops

Links:

A No-Frills Introduction to Lua 5.1 VM Instructions section 'Loop Instructions', page 42

misc

eta expansion: turning methods into functions

http://gleichmann.wordpress.com/2011/01/09/functional-scala-turning-methods-into-functions/

note: i think eta is more than that:

SL Peyton Jones, 'Compiling Haskell by program transformation: a report from the trenches' describes it as follows: when you have something of the form f = \x -> blah in \y -> blah2, and you notice that f is always applied to two arguments, you transform it to f = \x y -> blah3.

section 4.4 of this paper mentions eta abstraction:

Lenient evaluation is neither strict nor lazy citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.137.9885 by G Tremblay - 2000

4.4. Combinator parsing

An interesting approach to building parsers in functional languages, described in [29,30], is called combinator parsing. In this approach, a parser is defined as a function which, from a string, produces a list of possible results together with the unconsumed part of the input; various combining forms (higher-order functions = combinators) are then used to combine parsers in ways which mimic the different grammar constructs, e.g., sequencing, choice, repetition. For example, suppose we have some parsers p1 and p2 which recognize, respectively, the non-terminals t1 and t2. Then, the parser that recognizes t1 | t2 could be defined as follows (‘alt’ denotes the infix form of the alt function):

    > (p1 ‘alt’ p2) inp = p1 inp ++ p2 inp

Similarly, the parser that recognizes t1 t2 (sequencing) could be defined as follows:

    > (p1 ‘then’ p2) inp = [((v1, v2), out2) | (v1, out1) ← p1 inp; (v2, out2) ← p2 out1]

Here, laziness is required because of the way repetition is handled: given a parser p that recognizes the non-terminal t, t* would be recognized by many p defined as follows [30]:

    > many p = ((p ‘then’ many p) ‘using’ cons)
    >          ‘alt’
    >          (succeed [])

In function many, one of the subexpression appearing as an argument expression is a call made to many p; this is bound to create an infinite loop in a language not using call-by-need, which is effectively what happens when such combinator parsers are translated directly into a strict language such as SML or a non-strict language such as Id. Instead, in a non-lazy language, eta-abstraction would need to be used or explicit continuations would have to be manipulated in order to handle backtracking [31].

-- G. Tremblay, Computer Languages 26 (2000) 43–66

Links:

SL Peyton Jones, Compiling Haskell by program transformation: a report from the trenches
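As a sketch of the eta-abstraction remark in the combinator parsing excerpt, here is a hypothetical translation into Python, which is a strict language. The combinator names follow the excerpt; char is an extra helper added for this example. Writing many(p) as a direct call to alt(using(then(p, many(p)), cons), succeed([])) would recurse forever, because many(p) is evaluated as an argument before any input is seen. Wrapping the body in `lambda inp: ...` (eta-abstraction) delays the recursive call until the parser is actually applied.

```python
def succeed(value):
    """Parser that consumes nothing and yields `value`."""
    return lambda inp: [(value, inp)]

def char(c):
    """Parser that recognises the single character `c` (helper for this sketch)."""
    return lambda inp: [(c, inp[1:])] if inp[:1] == c else []

def alt(p1, p2):
    """Choice: all results of p1 followed by all results of p2."""
    return lambda inp: p1(inp) + p2(inp)

def then(p1, p2):
    """Sequencing: run p2 on whatever each result of p1 left unconsumed."""
    return lambda inp: [((v1, v2), out2)
                        for (v1, out1) in p1(inp)
                        for (v2, out2) in p2(out1)]

def using(p, f):
    """Apply `f` to every value produced by p."""
    return lambda inp: [(f(v), out) for (v, out) in p(inp)]

def cons(pair):
    head, tail = pair
    return [head] + tail

def many(p):
    # Eta-abstraction: many(p) inside the body is only evaluated once the
    # returned function is applied to some input.
    return lambda inp: alt(using(then(p, many(p)), cons), succeed([]))(inp)

print(many(char("a"))("aab")[0])  # (['a', 'a'], 'b')
```

The parser returns every possible parse, longest first; the recursion terminates because each successful application of p consumes input.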

todo

https://en.wikipedia.org/wiki/Threaded_code

" Separating the data and return stacks in a machine eliminates a great deal of stack management code, substantially reducing the size of the threaded code. The dual-stack principle was originated three times independently: for Burroughs large systems, Forth and PostScript?, and is used in some Java virtual machines.

Three registers are often present in a threaded virtual machine. Another one exists for passing data between subroutines ('words'). These are:

    ip or i (instruction pointer) of the virtual machine (not to be confused with the program counter of the underlying hardware implementing the VM)
    w (work pointer)
    rp or r (return stack pointer)
    sp or s (parameter stack pointer for passing parameters between words)

Often, threaded virtual machines such as implementations of Forth have a simple virtual machine at heart, consisting of three primitives. Those are:

    nest, also called docol
    unnest, or semi_s (;s)
    next

In an indirect-threaded virtual machine, the one given here, the operations are:

    next: *ip++ -> w ; jmp w++
    nest: ip -> *rp++ ; w -> ip ; next
    unnest: *--rp -> ip ; next

This is perhaps the simplest and fastest interpreter or virtual machine. " -- https://en.wikipedia.org/wiki/Threaded_code

(note: 'nest' is CALL, and 'unnest' is RETURN (or EXIT))
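The following is a toy Python model of the nest / unnest / next cycle with separate data and return stacks. It is not real indirect threading (there are no code-field addresses and no w register); the dictionary layout, the function run, and the sample words are all assumptions made for illustration.

```python
def run(words, entry):
    """Execute the colon definition `entry` and return the final data stack.

    `words` maps a name either to a Python callable (a primitive) or to a
    list of items (a colon definition). Integer items are pushed as literals.
    """
    data = []        # parameter (data) stack
    rstack = []      # return stack of (thread, ip) pairs
    thread, ip = words[entry], 0
    while True:
        if ip == len(thread):
            # unnest: end of a definition, resume the caller if there is one.
            if not rstack:
                return data
            thread, ip = rstack.pop()
            continue
        # next: fetch the item under ip and advance ip.
        item = thread[ip]
        ip += 1
        if isinstance(item, int):
            data.append(item)
            continue
        word = words[item]
        if callable(word):
            word(data)                    # primitive: run it directly
        else:
            rstack.append((thread, ip))   # nest: save where to resume
            thread, ip = word, 0

def dup(stack):
    stack.append(stack[-1])

def mul(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)

WORDS = {
    "dup": dup,
    "*": mul,
    "square": ["dup", "*"],
    "main": [3, "square", "square"],
}
print(run(WORDS, "main"))  # [81]
```

Because return addresses live on their own stack, the primitives can leave arguments and results on the data stack without any frame bookkeeping.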

AST walking (tree walking) vs tree grammars:

http://www.antlr3.org/pipermail/antlr-interest/2011-February/040686.html

Translators Should Use Tree Grammars : http://web.archive.org/web/20121114144133/http://www.antlr.org/article/1100569809276/use.tree.grammars.tml

Manual Tree Walking Is Better Than Tree Grammars: http://www.antlr2.org/article/1170602723163/treewalkers.html

conversion between languages: http://web.archive.org/web/20090611075009/http://jazillian.com/howC.html

http://web.archive.org/web/20070210070048/http://www.jazillian.com/how.html

http://www.nicklib.com/application/3001

todo:

it's interesting to me how some languages go from a source language to an assembly-language-like IR or VM (eg Python, Lua, Java, Smalltalk, Clang/LLVM), and other languages go from a source language to a non-assembly-like 'core language' and from there to either a C-like language or a VM (eg Haskell->GHC Core -> STG -> C--, Hoon -> Hoon 'non-synthetic glyphs' -> Nock, Shen -> KLambda). Note that (a) most languages' assembly-like language or VM or C-like language is Turing/von Neumann/imperative/linear, but Nock is not; (b) probably many languages that i am saying go straight to an assembly-like language have some sort of desugaring step, but to me desugaring a few constructs in a complex language is distinct from defining a small core language, although i agree that the distinction there is subjective; (c) there are also some language implementations that pass thru a 'generic AST' stage, such as gcc's Gimple, which i don't know much about but which i guess may be like an assembly-like IR, except in tree form instead of linear form (gcc goes: source language -> gcc GENERIC -> gcc GIMPLE -> gcc RTL; RTL is an assembly-like language; GENERIC is a generic AST language; GIMPLE is a 3-operand variant of GENERIC, i think); i see this more as a tree-structured form of assembly than as a core language.
- i'm guessing that the proliferation of steps in GHC is because they ultimately convert from one computational paradigm (functional language/'graph machine'/lambda-calculusy with lots of partial evaluation, laziness, closures) to another (turing machine/imperative/von Neumann); this probably happens in the STG -> C-- step.

---

https://en.m.wikipedia.org/wiki/Setjmp.h

common compiler targets:

assembly languages (CISC, RISC)
compiler framework intermediate languages: LLVM
close-to-the-metal HLLs: C
VM bytecode: JVM, CLR, JS, asm.js, WebAssembly, etc
todo

stack based, accumulator-based, register-based

finite registers (actual assembly languages) vs. infinite (LLVM)

compiler backends: code generation

the basic steps (typically):

liveness analysis: "Determine for each variable whether it is dead after a particular use" [34]
register (and other memory) allocation: for each variable, determine where that variable will be stored (in a register (which one?), on the stack, on the heap, etc)
define an instruction-set description, which is a list of patterns in the source language, and for each pattern, a corresponding set of instructions in the target language. Match on this to do the actual translation.
- when there are multiple possible patterns that match some part of the source code, you could do some optimization to find the 'best' one

examples of some things that might have to be expanded or recognized:

branching: various assembly languages have various ways of jumping that may not exactly match the source language (branch-if rather than if-then-else; branch-on-zero rather than branch-if; use of condition codes; etc)
constants: assembly languages have constraints on the sizes of constants
exploiting complex instructions: it is usually more efficient to use the complex instructions provided by an assembly language than to emulate this with an equivalent sequence of simple instructions. Different assembly languages provide different sets of complex instructions.
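A very small sketch of pattern-based instruction selection, assuming expression trees written as nested tuples and a made-up target with ADD, MUL and a fused multiply-add called MADD (all instruction names and the temporary-naming scheme here are hypothetical). The larger pattern is tried first, which is one simple way of preferring a complex instruction over an equivalent sequence of simple ones.

```python
def select(tree, out):
    """Append instructions for `tree` to `out`; return the name holding its value.

    A tree is either a variable name (str) or a tuple (op, left, right)
    with op in {"add", "mul"}.
    """
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    # Larger pattern first: add(mul(a, b), c) becomes one fused instruction.
    if op == "add" and isinstance(left, tuple) and left[0] == "mul":
        a = select(left[1], out)
        b = select(left[2], out)
        c = select(right, out)
        dest = f"t{len(out)}"
        out.append(("MADD", dest, a, b, c))
        return dest
    # Fallback pattern: one simple instruction per tree node.
    l = select(left, out)
    r = select(right, out)
    dest = f"t{len(out)}"
    out.append((op.upper(), dest, l, r))
    return dest

code = []
select(("add", ("mul", "x", "y"), "z"), code)
print(code)  # [('MADD', 't0', 'x', 'y', 'z')]

code = []
select(("mul", ("add", "x", "y"), "z"), code)
print(code)  # [('ADD', 't0', 'x', 'y'), ('MUL', 't1', 't0', 'z')]
```

A real instruction-set description would be a table of patterns with costs rather than hand-written if statements, and the selector would pick the cheapest cover of the tree.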

liveness analysis

The output of liveness analysis is (a) a table that, for each variable, describes the region(s) of code in which space for that variable must be allocated, and (b) an interference graph, whose nodes are variables and where there is an edge between two variables if they cannot be allocated in the same memory location (because their live ranges overlap, that is, both hold values that are still needed at the same point in the code).

Naive solution: assume every variable must be allocated at the beginning of the program and is never reclaimed until the end of the program. The problem with this is that many (often, the vast majority) of variables are only needed for a very short while for storing intermediate results, so by allocating space for all of them throughout the entire program you vastly increase the memory used by the program beyond what is actually required.

http://www.montefiore.ulg.ac.be/~geurts/Cours/compil/2014/06-finalcode-2014-2015.pdf describes an algorithm for liveness analysis.
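For straight-line code (no branches), liveness can be computed in one backward pass. The sketch below is an assumption-laden simplification, not the algorithm from the slides linked above: instructions are (destination, sources) pairs, the caller supplies the set of variables live at the end, and the function name liveness is made up. It returns the live-after set for each instruction and the interference edges.

```python
from itertools import combinations

def liveness(instrs, live_out):
    """Backward liveness pass over straight-line code.

    instrs: list of (dest, [sources]) in execution order.
    live_out: variables still needed after the last instruction.
    Returns (live_after, edges), where edges is a set of frozenset pairs.
    """
    live = set(live_out)
    live_after = [None] * len(instrs)
    edges = set()
    for i in range(len(instrs) - 1, -1, -1):
        dest, srcs = instrs[i]
        live_after[i] = set(live)
        # The value written here must not share a location with anything
        # else that is still live after this instruction.
        for other in live - {dest}:
            edges.add(frozenset((dest, other)))
        live = (live - {dest}) | set(srcs)
    # Variables live on entry are all live at the same time.
    for a, b in combinations(sorted(live), 2):
        edges.add(frozenset((a, b)))
    return live_after, edges

# t1 = a + b ; t2 = c + d ; x = t1 * t2
program = [("t1", ["a", "b"]), ("t2", ["c", "d"]), ("x", ["t1", "t2"])]
live_after, edges = liveness(program, {"x"})
print(live_after[0])                      # t1, c and d are live after the first instruction
print(frozenset(("t1", "a")) in edges)    # False: t1 may reuse a's location
```

With branches and loops, the same equations are iterated over the control flow graph until the live sets stop changing.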

register allocation

(for register machines)

The output of register allocation is a table that for each variable, describes in which register that variable is allocated; and if not all variables can be stored in registers, which ones must be 'spilled' to other places. Register allocation can be run separately for each procedure.

Naive solution: store everything on the heap (in main memory). The problem with this is that you are constantly issuing load instructions before each use of a variable and store instructions after each assignment of a variable. For intermediate results that are used immediately after they are generated, or soon after, this is very inefficient, because the latency of accessing main memory can be orders of magnitude higher than accessing registers.

http://www.montefiore.ulg.ac.be/~geurts/Cours/compil/2014/06-finalcode-2014-2015.pdf describes an algorithm for register allocation.
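One common family of approaches treats register allocation as colouring the interference graph. The sketch below is a greedy simplification (highest-degree variables first, spill when no register is free), not the algorithm from the slides linked above; the function name allocate and the sample graph are assumptions for this example.

```python
def allocate(interference, k):
    """Greedily assign one of k registers to each variable.

    interference: dict mapping each variable to the set of variables it
    cannot share a register with.
    Returns (assignment, spilled): a dict var -> register number, and the
    list of variables that did not fit and must live elsewhere.
    """
    assignment, spilled = {}, []
    # Colour the most constrained variables first (a simple heuristic).
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    for var in order:
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(k) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)
    return assignment, spilled

graph = {
    "t1": {"t2", "c", "d"},
    "t2": {"t1"},
    "c": {"t1", "d"},
    "d": {"t1", "c"},
    "x": set(),
}
print(allocate(graph, 3))  # every variable gets a register
print(allocate(graph, 2))  # 'd' is spilled
```

A production allocator would also insert the load and store instructions for spilled variables and then rerun the allocation, since spill code introduces new short-lived temporaries.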

What 'other places' are available?

the heap: memory allocation for dynamic data (data whose size or quantity is not known at compile time)
the stack: often used to store activation records and local variables. Some architectures provide special optimized instructions for efficient access to the stack. Allocation (and deallocation) on the stack is quicker than on the heap (because rather than keeping track of and looking for free areas of memory as you do with the heap, with the stack you just have to increment/decrement the stack pointer (and, when it can't be statically guaranteed, check for stack overflow))
constants: compile-time constants can sometimes be stored in the code itself, and sometimes in special constant storage areas
static data: for variables whose size is known in advance, the data can be allocated at compile-time and its address can be hardwired into the generated code. For example, C uses this method to allocate global variables, which are allocated for the entire duration of the program [35]. In theory you could also do this for variables not needed for the entire duration of the program, using register allocation to determine when memory can be reused (is this commonly done?).

Links for: compiler backends: code generation

http://www.montefiore.ulg.ac.be/~geurts/Cours/compil/2014/06-finalcode-2014-2015.pdf

providing sandbox functionality

Links:

linking

links:

Ian Lance Taylor’s series of articles on linkers

register allocation

"Linear Scan Register Allocation" see https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf section 1.3

stack maps

Generics

Three strategies for code reuse via generics:

monomorphization: The object code contains many copies of each generic operation/structure, with one copy for each specialization of the generic code. For example, C++ templates work this way. Downsides include large code size, and difficulty with heterogeneous containers (eg a list containing both numbers and strings cannot be implemented either by a list of numbers or by a list of strings).
indirection: Each generic operation/structure is represented just once in object code, but contains function pointers, which are called at runtime whenever something needs to be done differently depending on the specialization of the generic structure. For example, C++ virtual methods (dynamic dispatch through inheritance) work this way. Downsides: indirection is slower.
discriminated unions: When all possible specializations are known, calls to operations on generic data structures can be replaced by a 'switch' statement whose cases are the possible specializations. Downsides: all possible specializations must be known before any object code operating on the generic data structures is generated.

(todo: is that accurate? i am mainly trying to summarize part of http://cglab.ca/~abeinges/blah/rust-reuse-and-recycle/ )
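The three strategies can be caricatured in Python. Python is dynamically typed, so this only shows the shape of the generated code under each strategy, not a real compiler's output; every name below is made up for the illustration.

```python
# 1. Monomorphization: one specialised copy of the routine per element type.
def sum_int(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_str(xs):
    total = ""
    for x in xs:
        total += x
    return total

# 2. Indirection: a single copy that reaches type-specific behaviour
#    through a table of function pointers passed in at runtime.
INT_OPS = {"zero": lambda: 0, "add": lambda a, b: a + b}
STR_OPS = {"zero": lambda: "", "add": lambda a, b: a + b}

def sum_via_table(xs, ops):
    total = ops["zero"]()
    for x in xs:
        total = ops["add"](total, x)
    return total

# 3. Discriminated union: a single copy that switches on a tag drawn
#    from a closed set of known specializations.
def sum_tagged(tag, xs):
    if tag == "int":
        total = 0
    elif tag == "str":
        total = ""
    else:
        raise ValueError(f"unknown specialization: {tag}")
    for x in xs:
        total += x
    return total

print(sum_int([1, 2, 3]), sum_via_table(["a", "b"], STR_OPS), sum_tagged("int", [4, 5]))
# 6 ab 9
```

Adding a new element type means writing another copy in case 1, building another table in case 2, and editing the switch itself in case 3, which is why the last strategy needs all specializations to be known up front.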

Links:

The Many Kinds of Code Reuse in Rust
The Generic Dilemma and Golang

Object file formats and binary execution

Some object file formats and related formats are:

ELF, used in Unix/Linux-ish systems
Mach-O, used in OS X and iOS
PE, used on Windows
- variant: Micro Framework PE
DWARF, a debugging file format
[37], an older debugging file format largely supplanted by DWARF
[38] Breakpad symbol files, a debugging format
- [39] more on Breakpad unwinding info
Minidumps, a crash reporting format
[40] the Apple compact unwinding format, also discusses unwinding techniques in general
AppImage (sorta): https://appimage.org/
https://en.wikipedia.org/wiki/Amiga_Hunk

See https://en.wikipedia.org/wiki/Comparison_of_executable_file_formats for a longer list

Links:

https://github.com/corkami/pics/tree/master/binary
http://www.muppetlabs.com/~breadbox/software/tiny/
https://nullprogram.com/blog/2018/05/27/ (the time savings is not significant but the article is an interesting tour through some parts of some methods of linking shared libraries)
- see also this comment: https://news.ycombinator.com/item?id=17175121
https://ownyourbits.com/2018/05/23/the-real-power-of-linux-executables/
https://lwn.net/Articles/519085/
https://software.intel.com/sites/default/files/m/a/1/e/dsohowto.pdf
https://kishuagarwal.github.io/life-of-a-binary.html
https://en.wikipedia.org/wiki/Comparison_of_executable_file_formats
https://www.agner.org/optimize/calling_conventions.pdf chapter 15
Special sections in Linux binaries
https://gankra.github.io/blah/rust-layouts-and-abis/#calling-conventions A discussion of calling conventions and ABIs, particularly Rust's ABI

a.out format

https://kestrelcomputer.github.io/kestrel/2018/02/01/on-elf-2 (Part 2 of a discussion about ELF; Part 2 compares it to a.out and Amiga Hunk; Part 1 is here: https://kestrelcomputer.github.io/kestrel/2018/01/29/on-elf )

ELF format

http://geezer.osdevbrasil.net/osd/exec/ esp. http://geezer.osdevbrasil.net/osd/exec/elf.txt and http://geezer.osdevbrasil.net/osd/exec/index.htm#elf
ELF Program Headers
https://github.com/corkami/pics/blob/master/binary/elf101/elf101.pdf
"The ELF format stands out as the most consistent, clear, robust and flexible of the object file formats. The other formats are full of patches and appear kludgy in comparison. I would recommend the ELF format for new applications." -- https://www.agner.org/optimize/calling_conventions.pdf
https://fasterthanli.me/series/making-our-own-executable-packer/part-1
https://www.bottomupcs.com/chapter07.xhtml
https://kestrelcomputer.github.io/kestrel/2018/01/29/on-elf (Part 1 of a discussion about ELF; Part 2 compares it to a.out and Amiga Hunk; Part 2 is here: https://kestrelcomputer.github.io/kestrel/2018/02/01/on-elf-2 )
https://cpu.land/
- https://github.com/hackclub/putting-the-you-in-cpu

Minimal executables with ELF:

https://nathanotterness.com/2021/10/tiny_elf_modernized.html
- contains an interesting figure showing which fields of the ELF header seem to be ignored by Linux if you just want to run the program (the article also says that section headers can be removed)
http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html

PE format

PE links:

Amiga Hunk format

https://kestrelcomputer.github.io/kestrel/2018/02/01/on-elf-2 (Part 2 of a discussion about ELF; Part 2 compares it to a.out and Amiga Hunk; Part 1 is here: https://kestrelcomputer.github.io/kestrel/2018/01/29/on-elf )

Other binary formats

Slim binaries / Semantic dictionary encoding:

Links:

https://www.thefreelibrary.com/Intermediate+representations+of+mobile+code.-a0179977557

---

zzzcpan:

Frankly, I don't see anything interesting in that list, especially for amateurs.

As an amateur compiler writer you would probably want to make something useful in a few weeks, not waste a year playing around. And it's a very different story. It's essentially about making a meta DSL, that compiles into another language and plays well with existing libraries, tooling, the whole ecosystem, but also does something special and useful for you. So, you should learn parsing, possibly recursive descend for the code and something else for expressions, a bit about working with ASTs and that's pretty much it.

PaulHoule:

Is amateur the right word?

I am in it for the money which I guess makes me a pro but I don't have a computer science background and frankly in 2016 I am afraid the average undergrad compiler course is part of the problem as much as the solution.

Another big issue is nontraditional compilers of many kinds such as js accelerators and things that compile to JavaScript, domain specific languages, data flow systems, etc. Frankly I want to generate Java source or JVM byte code and could care less for real machine code.

reymus:

"frankly in 2016 I am afraid the average undergrad compiler course is part of the problem as much as the solution."

What do you mean by that?

tikhonj:

I'm not the OP, but I sympathize. The specific details covered in a "classical" compilers course are heavy weight and not super-relevant right now. These days you don't have to understand LR parsing or touch a parser-generator, you don't have to worry about register coloring... etc. Courses still use the Dragon Book which is older than I am and covers a bunch of stuff only relevant to writing compilers for C on resource-constrained systems.

Instead, I figure a course should cover basics of DSL design, types and type inference, working with ASTs, some static analysis and a few other things. That has some overlap with a traditional compilers course, but a pretty different focus.

---

"David Kranz' diss is a Yale Computer Science Dept. tech report. I would say it is required reading for anyone interested in serious compiler technology for functional programming languages. You could probably order or download one from a web page at a url that I'd bet begins with http://www.cs.yale.edu/." -- [41]

David Kranz. ORBIT: An Optimizing Compiler for Scheme. Ph.D. dissertation, Yale University, February 1988. Research Report 632, Department of Computer Science.

---

" ...techniques like cdr coding serve to bring the average case of list access and construction down from their worst case behavior of generating massive heap fragmentation (and thus defeating caching and prefetching) towards the behavior of nicely caching sequential data layouts (classical arrays) typically via tree like structures or SRFI-101. While traditional linked list structures due to potential heap fragmentation provide potentially atrocious cache locality and this cripple the performance of modern cache accelerated machines, more modern datastructures can simultaneously provide the traditional cons list pattern while improving cache locality and thus performance characteristics. Guy Steele himself has even come out arguing this point. " [42]

---

"JIT engines are viciously complicated beasts" -- Nathaniel Smith

---

http://home.pipeline.com/~hbaker1/CheneyMTA.html

---

see also section 'Hash tables' in proj-plbook-plChStdLibraries.

---

" A quick way to judge a language implementation is by inspecting its string concatenation function. If concat is implemented as a realloc and memcpy, well, the upstairs lights probably aren’t set to full brightness. Behold, in MoarVM? strings are broken into strands, and strands can repeat without taking up more memory. So this expression:

    "All work and no play makes Jack a dull boy" x 1000

Doesn’t make a thousand copies of the string, or even make a thousand pointers to the same string. " [43]

---

some papers on implementation of continuations:

[44]

(as of [45])

ASDL

A language for describing ASTs. Software is available to autogenerate code that implements ASTs described by ASDL.

Used in Python and in Oil shell.

See [[plChDataLangs.txt]] for more details.

---

Transpilers

https://engineering.mongodb.com/post/transpiling-between-any-programming-languages-part-1

---

Closures

The Implementation of Lua5.0 section 5 "Functions and Closures" describes a simple implementation of closures using indirect pointers called 'upvalues'.
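A minimal sketch of the upvalue idea, assuming a Python list stands in for the VM stack. The class and method names are hypothetical and this is not Lua's actual data structure: while the enclosing function is still running, the upvalue is "open" and refers to the variable's stack slot; when that function returns, the upvalue is "closed" by copying the value into the upvalue itself, so the closure keeps working after the slot is gone.

```python
class Upvalue:
    """Indirect reference to a captured local variable."""

    def __init__(self, stack, index):
        self.stack = stack      # the VM stack the variable lives on while open
        self.index = index      # slot of the captured variable
        self.closed = False
        self.value = None       # private copy, used once closed

    def get(self):
        return self.value if self.closed else self.stack[self.index]

    def set(self, new_value):
        if self.closed:
            self.value = new_value
        else:
            self.stack[self.index] = new_value

    def close(self):
        # Called when the enclosing function's frame is about to be popped.
        self.value = self.stack[self.index]
        self.closed = True

stack = [10]                 # the enclosing function's local variable
uv = Upvalue(stack, 0)       # a closure captures it
stack[0] = 11                # the enclosing function updates the local
print(uv.get())              # 11: open upvalue sees the stack slot
uv.close()
stack.pop()                  # the enclosing function returns
uv.set(uv.get() + 1)
print(uv.get())              # 12: closed upvalue owns its value
```

Two closures that capture the same variable would share one Upvalue object, so an assignment made through one is visible through the other both before and after closing.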

Chapter: Resources about implementing programming languages

Misc

[52] recommends this "Responsive compilers" talk at PLISS 2019, saying "In that talk he also provided some examples of how the Rust language was accidentally mis-designed for responsive compilation. It is an entirely watchable talk about compiler engineering and I recommend checking it out." slides: https://nikomatsakis.github.io/pliss-2019/responsive-compilers.html#1

in the talk, Matsakis suggests using the Salsa framework (spreadsheet-y updates; "a Rust framework for writing incremental, on-demand programs -- these are programs that want to adapt to changes in their inputs, continuously producing a new output that is up-to-date").

Lists of links

http://stackoverflow.com/questions/1669/learning-to-write-a-compiler/1672#1672
A Bestiary of Single-File Implementations of Programming Languages
- discussion: https://news.ycombinator.com/item?id=22313003
https://gist.github.com/cellularmitosis/1f55f9679f064bcff02905acb44ca510
Compilers: Principles, Techniques, and Tools by Aho, Sethi, and Ullman
- recommended most frequently
Let's Build a Compiler
- recommended frequently, e.g. https://news.ycombinator.com/item?id=6641117 , http://stackoverflow.com/a/1678/171761
Modern Compiler Implementation in ML by Appel
- recommended by https://news.ycombinator.com/item?id=6641484 , http://stackoverflow.com/a/7085/171761
Lisp in Small Pieces
- recommended by https://news.ycombinator.com/item?id=6643609
Niklaus Wirth's Compiler Construction
- recommended by https://news.ycombinator.com/item?id=6644620 , http://stackoverflow.com/a/7085/171761
http://esumii.github.io/min-caml/paper.pdf
- http://stackoverflow.com/a/1931417/171761
Matt Might's class
- recommended by https://news.ycombinator.com/item?id=6643395
Terence Parr's course
- recommended by http://stackoverflow.com/a/1693/171761
A Retargetable C Compiler: Design and Implementation
- recommended by https://news.ycombinator.com/item?id=6642030
Coursera compilers course
- recommended by https://news.ycombinator.com/item?id=6642396
Hokstad's series
- recommended by https://news.ycombinator.com/item?id=6641435
An Incremental Approach to Compiler Construction (Ikarus Scheme)
- recommended by https://news.ycombinator.com/item?id=6641117
Frank Pfenning's course
- recommended by https://news.ycombinator.com/item?id=6644853
http://www.amazon.com/dp/0123745144/?tag=stackoverfl08-20
- recommended by http://stackoverflow.com/a/3810387/171761
http://stackoverflow.com/questions/3497168/lisp-compiler-design
Learn C and build your own programming language in 1000 lines of code!
- https://news.ycombinator.com/item?id=10474717
- https://github.com/orangeduck/BuildYourOwnLisp
Make a Lisp
- https://news.ycombinator.com/item?id=9121448
- https://news.ycombinator.com/item?id=7530427
- https://github.com/kanaka/mal/blob/master/process/guide.md
http://andrej.com/plzoo/ and https://github.com/andrejbauer/plzoo
http://www.amazon.com/Compiling-Continuations-Andrew-W-Appel/dp/052103311X
http://beautifulracket.com/first-lang.html
example compiler: https://github.com/thejameskyle/the-super-tiny-compiler/blob/master/super-tiny-compiler.js
- see also https://news.ycombinator.com/item?id=11396986
example compiler: https://github.com/jdh30/FSharpCompiler/blob/master/FSharpCompiler/FSharpCompiler/Tests.fs
example compiler: https://news.ycombinator.com/item?id=11396253
example compiler: https://github.com/bbu/simple-interpreter
http://www.cs.columbia.edu/~sedwards/classes/2016/4115-spring/microc.pdf (the microC pedagogic language)
http://www.cs.columbia.edu/~sedwards/classes/2016/4115-spring/
example compiler: https://github.com/thejameskyle/the-super-tiny-compiler/blob/master/the-super-tiny-compiler.js
http://exmortis.narod.ru/src_compilers_eng.html
The 90 Minute Scheme to C compiler
Camlp4:
- http://lambda-the-ultimate.org/classic/message4025.html
- http://venge.net/graydon/talks/mkc/html/index.html
Writing an interpreter, targeting a VM or writing from scratch?
http://web.archive.org/web/20081118065537/http://www.church-project.org/reports/electronic/Jim:MIT-LCS-1995-532.pdf
From Interpreter to Compiler and Virtual Machine
A Functional Correspondence between Evaluators and Abstract Machines
http://c9x.me/compile/bib/
on the pernicious use of optimization based on undefined behavior: What every compiler writer should know about programmers or “Optimization” based on undefined behaviour hurts performance by M. Anton Ertl
http://t3x.org/reload/index.html
https://cse.sc.edu/~mgv/csce531sp10/notes/531Ch7.ppt
http://staff.polito.it/silvano.rivoira/HowToWriteYourOwnCompiler.htm
https://github.com/bollu/tiny-optimising-compiler
https://www.cs.swarthmore.edu/~jpolitz/cs75/s16/s_schedule.html
http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=Compilers
https://github.com/ucsd-cse131-sp17/lectures
https://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-4-in-python/
https://github.com/certik/bcompiler
https://github.com/danistefanovic/build-your-own-x#build-your-own-programming-language
https://github.com/AlgoryL/Projects-from-Scratch#topics
https://github.com/rby90/Project-Based-Tutorials-in-C#programming-languages
Implementing a Simple Compiler on 25 Lines of JavaScript
https://github.com/jaseemabid/inc and https://github.com/namin/inc (based on http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf )
https://bootstrapping.miraheze.org/wiki/Main_Page
https://mpov.timmorgan.org/i-built-a-lisp-compiler/
https://carld.github.io/2017/06/20/lisp-in-less-than-200-lines-of-c.html
- https://news.ycombinator.com/item?id=15781883 -- some of the lower ranked comments point out some bugs in this; no bounds checking in gettoken; macros are broken (and unneeded; ordinary functions could be used instead); cast to (long) when they should use intptr_t. And a suggestion from rurban: "In the eval the dynamic intern("if") ... for all interned symbols should be moved to compile-time global storage of those interns. Otherwise it looks good. In reality one would use more tag bits, no just one. Typically 3."
Rune elisp implementation in Rust
http://norvig.com/lispy.html
https://m.stopa.io/risp-lisp-in-rust-90a0dad5b116
- https://news.ycombinator.com/item?id=19810504
http://aosabook.org/en/500L/a-python-interpreter-written-in-python.html#fn1
- https://news.ycombinator.com/item?id=16795049
https://github.com/nedbat/byterun
https://paul.bone.id.au/2018/05/10/bytecode-interpreter/
https://igor.io/2013/08/28/stack-machines-fundamentals.html
https://generalproblem.net/lets_build_a_compiler/01-starting-out/
http://thume.ca/2019/04/18/writing-a-compiler-in-rust/
- related post: http://thume.ca/2019/04/18/writing-a-compiler-in-rust/
https://github.com/DoctorWkt/acwj
https://justinmeiners.github.io/lc3-vm/
http://www.wolczko.com/CS294/
http://www.bayfronttechnologies.com/mc_tutorial.html
https://upload.wikimedia.org/wikipedia/commons/a/aa/Write_Yourself_a_Scheme_in_48_Hours.pdf
https://thesephist.com/posts/pl/ has a bunch of commentary and also
- recommends the following resources:
- https://craftinginterpreters.com/contents.html
- https://ruslanspivak.com/lsbasi-part1/
- and also sorta-recommends reading the following implementations:
- "Ale, a simple Lisp written in Go"
- "Lua, written in C. I’ve written a full separate article on interesting design choices in the Lua interpreter, because I’m a big fan of its architecture."
- "Wasm3, an interpreter for WebAssembly. I specifically enjoyed the thoughtful and unique bytecode interpreter design, outlined in a design document in the repo."
- "Boa, a JavaScript and WebAssembly runtime written in Rust"
https://speakerdeck.com/nineties/creating-a-language-using-only-assembly-language?slide=73
https://github.com/akeep/scheme-to-c
https://github.com/barak/scheme2c/
From System F to Typed Assembly Language
https://legacy.cs.indiana.edu/~dyb/pubs/nano-jfp.pdf
background: https://blog.sigplan.org/2019/07/09/my-first-fifteen-compilers/
Essentials of Compilation: An Incremental Approach
https://github.com/IUCompilerCourse/IU-P423-P523-E313-E513-Fall-2020
https://news.ycombinator.com/item?id=20419390
the Wren language source code looks like a good example implementation to read. It is a simple high-level language with a small implementation meant to be embedded. The docs claim that "The VM implementation is under 4,000 semicolons. You can skim the whole thing in an afternoon. It's small, but not dense. It is readable and lovingly-commented."
https://bernsteinbear.com/blog/lisp/
https://bernsteinbear.com/blog/compiling-a-lisp-0/
http://c9x.me/compile/bib/
A New C Compiler. Ken Thompson Plan 9 C compilers
A Tour Through the Portable C Compiler. S. C. Johnson
Compiler Construction. Niklaus Wirth
Write Your Own Compiler (T3X)
LISP SYSTEM IMPLEMENTATION
PRACTICAL COMPILER CONSTRUCTION A No-nonsense Tour through a C Compiler
Scheme 9 from Empty Space: A Guide to Implementing Scheme in C
COMPILING LAMBDA CALCULUS
Quasiquote Using Syntax-Rules
https://christine.website/blog/minicompiler-lexing-2020-10-29
https://c9x.me/git/qbe.git/tree/minic
https://github.com/jamiebuilds/the-super-tiny-compiler/blob/master/the-super-tiny-compiler.js
http://breuleux.net/blog/language-howto.html
http://standardpascal.org/pascals.html
http://blog.jeff.over.bz/assembly/compilers/jit/2017/01/15/x86-assembler.html
Let's write a compiler, part 1: Introduction, selecting a language, and doing some planning
- later parts: A lexer; A parser; Testing; A code generator; Input and output; Arrays; Strings, forward references, and conclusion; Let's get hands-on with QBE
- Let's write a self-hosting compiler, part 1: Reading in source code
https://www.rodrigoaraujo.me/posts/lets-build-an-lc-3-virtual-machine/ (LC-3 VM in Rust)
https://github.com/eatonphil/lisp-rosetta-stone/
https://en.wikibooks.org/wiki/Creating_a_Virtual_Machine/Register_VM_in_C
- https://github.com/dmjio/vm
https://www.micahcantor.com/blog/js-to-asm-in-hs/
https://notes.eatonphil.com/lua-in-rust.html
- https://news.ycombinator.com/item?id=29952516
  - " Being that tokens are the leaves of the AST, there are a lot of them and they can take a lot of space. To save memory it is a good idea to store only a file location instead of a full token. Whenever token information is needed, just lex again to get the full token, starting at the file location. This works only for languages with a context-free lexical syntax, of course (and not entirely sure "context-free" is the right term here but you get what I mean)...Token data will not typically be needed a lot: The token kind (integer-literal, plus-operator, open-paren etc.) at exactly one point in the parsing phase and possibly in the type checking phase but you could have a separate "literal expression" kind for that. Binary payloads (string bytes, floating point value etc.) will be needed in the constant phase for literal tokens. I wouldn't expect that a little optimization here is a difficult tradeoff to make - neither with regards to speed nor code complexity." -- https://news.ycombinator.com/item?id=29954747
  - "One quick and cheap trick that can help your VM, especially when you have a lot of arithmetics: explicitly represent the top of stack. Eg. Instruction::Add => { let left = data.pop().unwrap(); top = left + top; pc += 1; } " -- [53]
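
The top-of-stack trick quoted above can be sketched in Python as follows. This is only an illustration of the control flow, under the assumption of a tiny hypothetical instruction set (push, add, sub, mul); the names run, data and top are not from the linked article. In Python there is no speed benefit; in a compiled VM the point is that top can stay in a register, so a binary operation needs one pop and no push.

```python
def run(program):
    """Run a list of (opcode, argument) pairs and return the final top of stack."""
    data = []     # every stack element except the topmost one
    top = None    # cached top of stack; None is a sentinel for "empty"
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "push":
            # Spill the old top into memory, then cache the new value.
            data.append(top)
            top = arg
        elif op == "add":
            left = data.pop()     # one pop; the result stays cached in top
            top = left + top
        elif op == "sub":
            left = data.pop()
            top = left - top
        elif op == "mul":
            left = data.pop()
            top = left * top
        else:
            raise ValueError("unknown opcode: " + op)
        pc += 1
    return top


# (2 + 3) * 4
print(run([("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]))
```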
https://zserge.com/posts/too-many-forths/
https://borretti.me/article/lessons-writing-compiler (good read)
- https://lobste.rs/s/4jecle/lessons_from_writing_compiler
https://github.com/codecrafters-io/build-your-own-x#build-your-own-programming-language
https://drew.ltd/blog/posts/2019-7-18.html https://drew.ltd/blog/posts/2019-7-24.html https://drew.ltd/blog/posts/2019-8-2.html
https://www.courier.com/blog/build-a-webassembly-language-lexing/ https://www.courier.com/blog/build-a-webassembly-language-parsing/ (by the same guy who wrote the previous bullet point)
https://ketansingh.me/posts/toy-compiler-with-llvm-and-go/
https://mort.coffee/home/fast-interpreters/
A Couple of Meta-interpreters in Prolog

guides on implementing type checkers and type inference:

introductory tutorial on how to implement Hindley-Milner type inference
tutorial on an optimization that OCaml's type checker uses
TAPL's type checkers: [54]
https://www.andreinc.net/2021/12/01/writing-a-simple-vm-in-less-than-125-lines-of-c
https://github.com/spencertipping/jit-tutorial
https://blog.reverberate.org/2012/12/hello-jit-world-joy-of-simple-jits.html
https://peppe.rs/posts/lightweight_linting/
https://mukulrathi.com/create-your-own-programming-language/intro-to-compiler/
https://www.microsoft.com/en-us/research/publication/the-implementation-of-functional-programming-languages/
https://chidiwilliams.com/post/how-to-write-a-lisp-interpreter-in-javascript/
Complete and Easy Bidirectional Typechecking for Higher-Rank Polymorphism
" If you're interested in getting started with interpreters, which are easier, you might want to look into Daniel Holden's excellent Build Your Own Lisp (And Learn C). Although it has been criticized for many reasons, it's a great book, and if you find interpreters and compilers totally magic, it's a good place to start." [55]
Compiler Construction by Niklaus Wirth ( older version: https://web.archive.org/web/20191214175617/www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf )
Three Implementation Models for Scheme by Dybvig
http://createyourproglang.com/
http://matt.might.net/articles/books-papers-materials-for-graduate-students/#pl
https://github.com/mulderp/mulderp.github.com/issues/13
http://people.inf.ethz.ch/wirth/CompilerConstruction/index.html
https://intuitiveexplanations.com/tech/kalyn
https://plzoo.andrej.com/index.html
Languages Written in Rust
https://notes.eatonphil.com/compiler-basics-lisp-to-assembly.html
https://notes.eatonphil.com/interpreting-go.html
http://cowlark.com/mercat/index.html "Mercat is not really useful for anything much. It is, on the other hand, an excellent example of how to implement a self-hosted recursive-descent-parsed language and virtual machine."
Lisp in a weekend
https://news.ycombinator.com/item?id=15782849
https://yosefk.com/blog/c-as-an-intermediate-language.html
https://pvk.ca/Blog/2014/03/15/sbcl-the-ultimate-assembly-code-breadboard/ (implements a VM in SBCL to test an idea for efficient execution of an ISA with a small, fixed-size stack)
https://thma.github.io/posts/2021-12-27-Implementing-a-functional-language-with-Graph-Reduction.html

links for optimization/static analysis etc:

mb:

http://t3x.org/reload/index.html
the same author also wrote a book called 'Lightweight Compiler Techniques', one called 'Zen-style programming' about various programming constructs, and a book about an interpreter for R4RS Scheme; see http://www.t3x.org/

learning projects that document their learning:

https://github.com/veera-sivarajan/boba

misc:

" Graydon Hoare: 21 compilers and 3 orders of magnitude in 60 minutes

In 2019, Graydon Hoare gave a talk to undergraduates (PDF of slides) trying to communicate a sense of what compilers looked like from the perspective of people who did it for a living.

I've been aware of this talk for over a year and meant to submit a story here, but was overcome by the sheer number of excellent observations. I'll just summarise the groups he uses:

    The giants: by which he means the big compilers that are built the old-fashioned way that throw massive resources at attaining efficiency
    The variants, which use tricks to avoid being so massive:
        Fewer optimisations: be traditional, but be selective and only do the optimisations that really pay off
        Use compiler-friendly languages, by which he is really talking about languages that are good for implementing compilers, like Lisp and ML
        Theory-driven meta-languages, esp. how something like yacc allows a traditional Dragon-book style compiler to be written more easily
        Base compiler on a carefully designed IR that is either easy to compile or reasonable to bytecode-interpret
        Exercise discretion to have the object code be a mix of compiled and interpreted
        Use sophisticated partial evaluation
        Forget tradition and implement everything directly by hand

I really recommend spending time working through these slides. While much of the material I was familiar with, enough was new, and I really appreciated the well-made points, shout-outs to projects that deserve more visibility, such as Nanopass compilers and CakeML, and the presentation of the Futamura projections, a famously tricky concept, at the undergraduate level.

By Charles Stewart at 2022-02-27 14:47

" -- [56]

http://venge.net/graydon/talks/CompilerTalk-2019.pdf

might need a large stack when calling into the kernel:

https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/

---

on implementing an expression evaluator with automatic memory management in C++

" You don’t need to use the built in shared pointer type, which is generic and so not optimised for various cases, to write idiomatic and clean C++. You write a LispValue class that is a tagged union of either a pointer to some rich structure or an embedded value. Most Lisp implementations use tagging on the low bit, with modern CPUs and addressing modes it’s useful to use 0 to indicate an integer and 1 to indicate a pointer because the subtract 1 can be folded into other immediate offsets for field or vtable addressing. If the value is a pointer, the pointee exposes non-virtual refcount manipulation operations in the base class. Your value type calls these in its copy constructor and copy assignment operator. For move assignment or move construction, your value type simply transfers ownership and zeroes out the original. In its destructor, you drop the ref count. When the refcount reaches zero, it calls a virtual function which can call the destructor, put the object back into a pool, or do anything else.

Implemented like this, you have no virtual calls for refcount manipulation and so all of your refcount operations are inlined. All refcount operations are in a single place, so you can switch between atomic and non atomic ops depending on whether you want to support multiple threads. More importantly, it is now impossible by construction to have mismatched refcount operations. Any copy of the value object implicitly increments the refcount of the object that it refers to, any time one is dropped, it does the converse. Now you can store LispValues in any C++ collection and have memory management just work. " -- David Chisnall
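
The low-bit tagging scheme mentioned in the quote (0 for an integer, 1 for a pointer) can be sketched with plain integer arithmetic. The sketch below is in Python rather than C++, so it models only the bit manipulation and not the refcounting value class; it assumes 64-bit machine words and heap addresses that are at least 2-byte aligned, and all function names are hypothetical.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def tag_int(n):
    # Integers are shifted left by one, so their low bit is 0.
    # (Assumption: n fits in 63 bits; no overflow check in this sketch.)
    return (n << 1) & WORD_MASK


def tag_ptr(address):
    # Aligned addresses have a free low bit; setting it marks a pointer.
    assert address % 2 == 0, "heap objects are assumed to be aligned"
    return address | 1


def is_int(word):
    return word & 1 == 0


def is_ptr(word):
    return word & 1 == 1


def untag_int(word):
    # Reinterpret the word as signed, then undo the shift.
    if word >> (WORD_BITS - 1):
        word -= 1 << WORD_BITS
    return word >> 1


def add_ints(a, b):
    # With tag 0, two tagged integers can be added without untagging:
    # (x << 1) + (y << 1) == (x + y) << 1.
    return (a + b) & WORD_MASK


def field_address(word, offset):
    # Removing the pointer tag costs nothing extra: the "- 1" is folded
    # into the constant field offset, as the quote describes.
    return word + (offset - 1)


print(untag_int(add_ints(tag_int(20), tag_int(22))))
print(hex(field_address(tag_ptr(0x1000), 8)))
```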

---

Implementing JITs

https://carolchen.me/blog/jits-intro/ https://carolchen.me/blog/jits-impls/

Chapter ?: Parsing (and lexing)
Chapter ?: targets, IRs, VMs and runtimes
Chapter ?: Interop
Chapter ?: Hard things to make it easy for the programmer (contains: garbage collection, greenthreads, TCO and strictness analysis)
Chapter ?: Tradeoffs (contains: Constraints that performance considerations of the language and the toolchain impose on language design)

proj-plbook-plPartImplementation

Part V: Implementation of a programming language

Chapter : the general pipeline

Chapter : modularity

ml functors

intermediate representations (IRs)

Introducing structured control flow

Chapter : possible compiler safety checks beyond typechecking

Chapter : linker

Chapter : concurrency implementation

Useful algorithms and data structures

Chapter: designing and implementing a virtual machine

Chapter: ?? where to put these

Chapter: implementing functional languages

todo: what else?

Chapter : Normal forms

Defn in lambda vs here

SSA

CPS

A

ANF

Guarded SSA

Egraphs

Chapter : Performance optimizations

Inlining

Stack machines and stack items in registers

Tagged data

Startup time

Effects analysis

Common subexpression elimination (CSE)

Flow analysis

Chapter :

canonical impl vs std

The dismaying history of languages without a canonical implementation

self-hosting

cross-compiling when self-hosting

standards bodies

Thompson 'trusting trust' attacks

Chapter: Case studies

Chapter: Stacks

Stacks usually grow down

Stacks: more topics

by construct

short-circuit boolean expressions

regexs

for loops

misc

eta expansion: turning methods into functions

[((v1, v2), out2) | (v1, out1)

((p ‘then’ many p) ‘using’ cons)

todo

compiler backends: code generation

liveness analysis

register allocation

Links for: compiler backends: code generation

providing sandbox functionality

linking

register allocation

stack maps

Generics

Object file formats and binary execution

a.out format

ELF format

PE format

Amiga Hunk format

Other binary formats

ASDL

Transpilers

Closures

Chapter: Resources about implementing programming languages

Misc

Lists of links

Implementing JITs