This is not a book about how to implement a programming language. However, even at the earliest stages of language design, it is valuable to have some idea of how certain choices in the design of the language could make the implementation much less efficient or much more difficult to write. For this purpose, we'll provide a cursory overview of implementation topics, touching as we go on some of the language design choices that may have a large impact.
when do we convert to normal forms?
todo
" The first big phase of the compilation pipeline is parsing. You need to take your input and turn it into a tree. So you go through preprocessing, lexical analysis (aka tokenization), and then syntax analysis and IR generation. Lexical analysis is usually done with regexps. Syntax analysis is usually done with grammars. You can use recursive descent (most common), or a parser generator (common for smaller languages), or with fancier algorithms that are correspondingly slower to execute. But the output of this pipeline stage is usually a parse tree of some sort.
The next big phase is Type Checking. ...
The third camp, who tends to be the most isolated, is the code generation camp. Code generation is pretty straightforward, assuming you know enough recursion to realize your grandparents weren't Adam and Eve. So I'm really talking about Optimization...
-- http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html
modified from Python's docs:
" The usual steps for compilation are:
Parse source code into a parse tree
Transform parse tree into an Abstract Syntax Tree
Transform AST into a Control Flow Graph
Emit bytecode based on the Control Flow Graph" -- https://docs.python.org/devguide/compiler.html#abstract

discussions on the benefits of the introduction of a 'Core' or 'middle' language in between the HLL AST and a lower-level, LLVM-ish language:
some advantages of a core language:
Links:
todo: somewhere put my observations on LLLs and target languages:
'target' languages include both 'Core languages' and LLLs
goal for:
Core languages tend to be similar to the HLL, except that they:
* are de-sugared (eg apply various local 'desugar' transformations)
* have fewer primitives than the HLL (eg de-sugaring various control flow constructs to GOTO; see the sketch after this list)
* in safe or prescriptive languages, sometimes have Core primitives that are more powerful than the HLL's (eg in Rust the MIR core language has GOTO but Rust does not)
* still retain function calling
* have explicit types
* have a bunch of other things that are explicit, so that the result is far too verbose for human use, but with control flow similar to the original HLL
* like the HLL but unlike an LLL, are still nonlinear (parenthesized subexpressions are still present)
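To make the 'de-sugar control flow to GOTO' idea concrete, here is a hypothetical sketch (in C, not taken from any real compiler) of a while loop and the GOTO-only form a Core language might reduce it to:

    /* Hypothetical illustration: de-sugaring a structured loop to GOTO form. */
    #include <stdio.h>

    /* Original (HLL-style) loop: */
    int sum_to_n(int n) {
        int sum = 0;
        int i = 0;
        while (i < n) {
            sum += i;
            i++;
        }
        return sum;
    }

    /* De-sugared (Core-style) equivalent: only tests, assignments, and gotos. */
    int sum_to_n_desugared(int n) {
        int sum = 0;
        int i = 0;
    loop_head:
        if (!(i < n)) goto loop_exit;   /* negated loop condition */
        sum += i;
        i++;
        goto loop_head;
    loop_exit:
        return sum;
    }

    int main(void) {
        printf("%d %d\n", sum_to_n(5), sum_to_n_desugared(5));   /* both print 10 */
        return 0;
    }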
LLLs and 'assembly' languages are more similar to each other than you might expect.
LLL general properties:
Main families of LLLs:
* imperative
  * stack vs. register machines
    * stack machines (eg JVM, Python's, mb Forth)
    * register machines
  * 3-operand vs 2- or 1-operand register machines
  * various calling conventions
  * addressing modes
    * LOAD/STORE vs addressing modes vs LOAD/STORE with constant argument but self-modifying code
    * explicit constant instructions (eg ADDC) vs constant addressing mode vs LOADK
  * are registers and memory locations actual memory locations upon which address arithmetic can be done (like in assembly), or just variable identifiers (like in Lua)?
* functional
  * eg https://en.wikipedia.org/wiki/SECD_machine
* combinator
  * eg Nock
Also, instruction encodings might be fixed-width or variable (most are variable).
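To make the stack-machine vs. register-machine distinction concrete, here is a hypothetical sketch of the same assignment x = (a + b) * c encoded both ways; the opcode names and instruction layouts are invented for illustration, not those of any real VM:

    /* Hypothetical encodings of x = (a + b) * c in two LLL styles.
       The opcodes and layouts are invented for illustration only. */
    #include <stdio.h>

    /* Stack machine: instructions have at most one operand; values flow
       through an implicit operand stack (JVM / CPython style). */
    typedef enum { S_LOAD, S_ADD, S_MUL, S_STORE } StackOp;
    typedef struct { StackOp op; int slot; } StackInsn;

    static const StackInsn stack_code[] = {
        { S_LOAD, 0 },   /* push a (local slot 0)      */
        { S_LOAD, 1 },   /* push b (local slot 1)      */
        { S_ADD,  0 },   /* pop two, push a+b          */
        { S_LOAD, 2 },   /* push c (local slot 2)      */
        { S_MUL,  0 },   /* pop two, push (a+b)*c      */
        { S_STORE, 3 },  /* pop result into x (slot 3) */
    };

    /* 3-operand register machine: every instruction names its destination
       and both sources explicitly (Lua / Dalvik style). */
    typedef enum { R_ADD, R_MUL } RegOp;
    typedef struct { RegOp op; int dst, src1, src2; } RegInsn;

    static const RegInsn reg_code[] = {
        { R_ADD, 3, 0, 1 },  /* r3 = r0 + r1   (a + b)     */
        { R_MUL, 4, 3, 2 },  /* r4 = r3 * r2   ((a+b) * c) */
    };

    int main(void) {
        printf("stack version: %zu instructions, register version: %zu\n",
               sizeof stack_code / sizeof stack_code[0],
               sizeof reg_code  / sizeof reg_code[0]);
        return 0;
    }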
Primitive operations in LLLs: not as much variance as you might expect. Generally:
Links regarding stack machine vs register machines:
---
Someone's response to the question "Does anyone know if there is research into the effects of register counts on VMs?":
" Someone 1819 days ago [-]
If you can map all VM registers to CPU registers, the interpreter will be way simpler.
If you have more VM registers than CPU registers, you have to write code to load and store your VM registers.
If you have more CPU registers than VM registers, you will have to write code to undo some of the register spilling done by the compiler (if you want your VM to use the hardware to the fullest.)
So, the optimum (from a viewpoint of 'makes implementation easier') depends on the target architecture.
Of course, a generic VM cannot know how many registers (and what kinds; uniform register files are not universally present) the target CPU will have.
That is a reason that choosing either zero (i.e. a stack machine) or infinitely many (i.e. SSA) are popular choices fo VMs: for the first, you know for sure your target will not have fewer registers than your VM, for the second, that it will not have more.
If you choose any other number of VM registers, a fast generic VM will have to handle both cases.
Alternatively, of course, one can target a specific architecture, and have the VM be fairly close to the target; Google's NaCl is (was? I think they changed things a bit) an extreme example. I have not checked the code, but this, I think, is similar. " -- https://news.ycombinator.com/item?id=2930109
---
whole program analysis
if modules are separately compiled and types are allowed to remain completely abstract (e.g. when a module is compiled, the in-memory representation of the values it manipulates may not yet be known), then a host of optimizations cannot be performed. Alternatives:
see section "compilation model" in "Retrospective Thoughts on BitC?" http://www.coyotos.org/pipermail/bitc-dev/2012-March/003300.html
https://news.ycombinator.com/item?id=5422094
" Some common intermediate representations:
Examples:
-- [5]
[6] also describes and motivates SSA and CPS in more detail. It defines SSA and explains why you need phi in SSA, and refers to an algorithm to determine where to place the phis.
Abstract machines and the compilers that love/hate them
Some target languages, such as WASM and SPIR-V, require some amount of structure in control flow. If the source language doesn't provide this structure, it must be introduced.
Links:
check to see if there are any reads of uninitialized variables; this could be implemented via types
issue in C:
in C, according to one of the comments on http://www.slideshare.net/olvemaudal/deep-c , it was claimed (I didn't check) that if you declare your own printf with the wrong signature, the call will still be linked to the printf in the standard library, but will crash at runtime; e.g. "void printf( int x, int y); main() {int a=42, b=99; printf( a, b);}" will apparently crash.
-- A new programming language might want to throw a compile-time error in such a case (as C++ apparently does, according to the slides).
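For reference, the claimed crash scenario from that comment, written out as a complete program (as claimed there; not verified here):

    /* Mis-declaring printf: the linker still resolves this call to the libc
       printf, but the first argument is not a format string, so the behavior
       is undefined and typically crashes at runtime. */
    void printf(int x, int y);   /* wrong signature, no #include <stdio.h> */

    int main(void) {
        int a = 42, b = 99;
        printf(a, b);            /* libc printf treats 42 as a char* format string */
        return 0;
    }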
Links:
atomics
SIMD, MIMD
GPU
"Instructions should not bind together operations which an optimizing compiler might otherwise choose to separate in order to produce a more efficient program."
tags
descriptors
stacks
cactus/saguaro stacks
random book on stack-based hardware architectures: http://www.ece.cmu.edu/~koopman/stack_computers/contents.html
trampolines
vtables (C++)
interpreter vs spec:
" The technique that we had for Smalltalk was to write the VM in itself, so there’s a Smalltalk simulator of the VM that was essentially the only specification of the VM. You could debug and you could answer any question about what the VM would do by submitting stuff to it, and you made every change that you were going to make to the VM by changing the simulator. After you had gotten everything debugged the way you wanted, you pushed the button and it would generate, without human hands touching it, a mathematically correct version of C that would go on whatever platform you were trying to get onto." -- Alan Kay, http://queue.acm.org/detail.cfm?id=1039523
metacircular interpreters
Futamura projections https://news.ycombinator.com/item?id=7061913
"
nostrademons 1 day ago
IIUC PyPy is all 3 Futamura projections.
I'm a little rusty on the details, but IIUC the core of PyPy is an "abstract interpreter" for a restricted subset of Python known as RPython. Abstract interpretation is essentially a fold (in the functional programming sense) of the IR for a language, where you can parameterize the particular operation applied to each node. When the operation specified is "evaluate", you get partial evaluation, as described in the Wikipedia article.
The interesting part of PyPy happens when the RPython program to be evaluated is itself a Python interpreter, which it is. In this case, 'prog' is the Python interpreter written in RPython, and Istatic is your Python program. The first Futamura projection will give you a JIT; you specialize the RPython interpreter with your particular program, and it evaluates everything statically known at compile time (namely, your program) and produces an executable.
The second Futamura projection will give you a JIT compiler. Remember, since RPython is a subset of Python, an interpreter for Python written in RPython could be interpreted by itself. Moreover, it could be compiled by itself by the process described in the paragraph above. When you compile your JIT through the same process, you get a program that means the same thing (i.e. it will translate a Python program into an executable) but runs faster. The PyPy JIT that everybody's excited about for running Python scripts faster is this executable.
The third Futamura projection involves running PyPy on the toolchain that generated PyPy. Remember, this whole specializer machinery is all written in RPython, which can be interpreted by itself. So when your run the specializer machinery through itself, you get - an optimized toolset for building JIT compilers. That's what made programming language theorists excited about PyPy long before the Python community was. It's not just "faster Python", but it's a toolset for building a JIT out of anything you can build an interpreter for. Write an interpreter in RPython - for anything, Ruby, PHP, Arc, whatever - and run it through the PyPy toolchain, and it will give you a JIT compiler.
reply "
https://www.semipublic.comp-arch.net/wiki/ROMmability
http://www.quora.com/What-is-the-difference-between-PyPy-Parrot-and-LLVM
http://lambda-the-ultimate.org/node/2634
Links:
"Basic block": "a portion of the code within a program with only one entry point and only one exit point." (wikipedia)
"in CPS, as in SSA or ANF, expressions don't have subexpressions." -- http://wingolog.org/archives/2014/05/18/effects-analysis-in-guile
"If an intermediate language is in SSA form, then every variable has a single definition site. Single-assignment is a static property because in a general control flow graph the assignment might be in a loop. SSA form was developed by Wegman, Zadeck, Alpern, and Rosen [4, 37] for efficient computation of data flow problems, such as global value numbering and detecting equality of variables." -- https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf , Chapter 3
For example (examples taken from https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf ), if the source code is: x = a + b; y = x + 1; x = a + 1;
then this is equivalent to:
x1 = a + b; y1 = x1 + 1; x2 = a + 1;
When two control flow edges join, we have to do something else. Consider
if x < 5 then v = 1 else v = 2; w = v + 1;
Here, v has two possible definition sites, violating SSA, and we don't know statically which one will be used. As a control flow graph, we have a diamond:
(if x < 5) / \ / \(v1 = 1) (v2 = 2) \ / \ / (w = v? + 1)
SSA deals with this by introducing a "magic" operator, Phi. Phi is only allowed to occur at the beginning of a basic block. Its semantics are that Phi "knows", when it executes, from which basic block control most recently arrived; it is given multiple arguments and simply returns one of them, choosing which argument to return based on which basic block control arrived from. For example, the above diamond can be replaced by:
(if x < 5) / \ / \(v1 = 1) (v2 = 2) \ / \ / (v3 = Phi(v1,v2)
(w = v3 + 1)
here, Phi will return v1 if control came to it by the left path, or v2 if control came by the right path, meaning that v3 ends up equal to v1 if the left path is taken and equal to v2 if the right path is taken. Note that the SSA property is preserved: each variable is assigned to exactly once.
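One standard way to think about (and eventually get rid of) Phi is that it stands for copies placed at the end of each predecessor block; this "phi removal" step also comes up in the Stack Overflow quote further below. A hypothetical sketch of the diamond above, written in C with explicit gotos and with the Phi replaced by such copies:

    /* The diamond above, written with explicit gotos; the Phi is simulated
       by copies placed at the end of each predecessor block, which is how
       phi nodes are typically eliminated before code generation.
       A hypothetical sketch. */
    int diamond(int x) {
        int v1, v2, v3, w;
        if (x < 5) goto left; else goto right;
    left:
        v1 = 1;
        v3 = v1;        /* copy that replaces Phi's left argument  */
        goto join;
    right:
        v2 = 2;
        v3 = v2;        /* copy that replaces Phi's right argument */
        goto join;
    join:
        w = v3 + 1;     /* in SSA this was: v3 = Phi(v1, v2); w = v3 + 1 */
        return w;
    }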
"SSA form is typically used as intermediate representation for imperative languages. The functional programming community prefers the λ-calculus and continuations as intermediate language. Andrew Appel pointed out the close relationship between the two representations in his article ”SSA is Functional Programming”" -- https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf
"AliceML? uses a weakening of SSA called Executable SSA Form, which works only on acyclic (control?) graphs, and which does not require Phi nodes. See https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf Chapter 3, section 3.1.1. One disadvantage of that is that "the removal of Phi-functions in the abstract code might artificially extend the liveness of a variable across branches. Suppose a left branch sets a variable at the beginning and the right branch synchronizes this variable at the end. Then, the variable is declared to be live over the whole right branch. This increases register pressure and decreases code quality for register-poor architectures."
A good, simple description of phi nodes: https://capra.cs.cornell.edu/bril/lang/ssa.html
" A program is in SSA form if: 1. each definition has a distinct name 2. each use refers to a single definition ... Main interest: allows to implement several code optimizations " -- http://www.montefiore.ulg.ac.be/~geurts/Cours/compil/2014/05-intermediatecode-2014-2015.pdf
"
Transforming a 3-address representation into stack is easier than a stack one into 3-address.
Your sequence should be the following:
Form basic blocks
Perform an SSA-transform
Build expression trees within the basic blocks
Perform a register schedulling (and phi- removal simultaneously) to allocate local variables for the registers not eliminated by the previous step
Emit a JVM code - registers goes into variables, expression trees are trivially expanded into stack operations -- answered Dec 8 '11 at 8:18, SK-logic
Wow! Thanks this is just what I was looking for. Questions: do I need to do the SSA transform within each basic block or across the whole procedure? Do you have any pointers to tutorials, textbooks or other resources? – akbertram Dec 8 '11 at 9:09 SSA transform is aways procedure-wide: en.wikipedia.org/wiki/Single_static_assignment You'll just need to find a dominance frontier for each basic block where you're assigning a variable (with multiple assignment locations), insert the phi nodes there and then get rid of the redundant phis (n.b.: some may have circular dependencies). – SK-logic Dec 8 '11 at 9:13 @akbertram, LLVM can be a useful source of inspiration here, you can safely model your intermediate representation after it. Some important design decisions from there: do not allow to assign one register to another, and do not allow to assign a constant to a register, always substitute it in place instead. – SK-logic Dec 8 '11 at 9:16 The expression tree looks alot like the original AST -- is it worth making a round-trip through a TAC-like IR or does another sort of IR make sense when the only target is the JVM? – akbertram Dec 8 '11 at 9:17 @akbertram, if you can stuff your JVM compilation before making TAC - then yes, you do not need it, a direct stack code generation is easier. Otherwise, if you can only stack on top of the existing compiler infrastructure, you'll need to re-construct the expression trees. Funny bit is that the JIT will do it again too. – SK-logic Dec 8 '11 at 9:22 " -- [7]
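The "expression trees are trivially expanded into stack operations" step is just a post-order walk of each tree. A minimal hypothetical sketch (the node layout and opcode names are invented, and real JVM code generation has many more cases):

    /* Minimal sketch: emitting stack-machine code from an expression tree
       via a post-order traversal. Node layout and opcode names are invented. */
    #include <stdio.h>

    typedef enum { N_VAR, N_ADD, N_MUL } NodeKind;
    typedef struct Node {
        NodeKind kind;
        int slot;                     /* for N_VAR: local variable slot   */
        struct Node *left, *right;    /* for N_ADD / N_MUL                */
    } Node;

    static void emit(const Node *n) {
        switch (n->kind) {
        case N_VAR:
            printf("LOAD %d\n", n->slot);   /* push the variable          */
            break;
        case N_ADD:
            emit(n->left);                  /* code for left operand      */
            emit(n->right);                 /* code for right operand     */
            printf("ADD\n");                /* pop two, push the sum      */
            break;
        case N_MUL:
            emit(n->left);
            emit(n->right);
            printf("MUL\n");
            break;
        }
    }

    int main(void) {
        /* (a + b) * c, with a, b, c in slots 0, 1, 2 */
        Node a = { N_VAR, 0, 0, 0 }, b = { N_VAR, 1, 0, 0 }, c = { N_VAR, 2, 0, 0 };
        Node sum  = { N_ADD, 0, &a, &b };
        Node prod = { N_MUL, 0, &sum, &c };
        emit(&prod);   /* prints LOAD 0, LOAD 1, ADD, LOAD 2, MUL */
        return 0;
    }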
Alternatives to Phi nodes in SSA:
Links:
See e.g.
CPS vs. SSA: http://wingolog.org/archives/2011/07/12/static-single-assignment-for-functional-programmers
"in CPS, as in SSA or ANF, expressions don't have subexpressions." -- http://wingolog.org/archives/2014/05/18/effects-analysis-in-guile
" There are a couple of things going on with CPS as implemented in this style of compilers that make it particularly nice for writing optimization passes:
1) There are only function calls ("throws") and no returns. Though there's no theoretical reduction in the stuff you have to track (since what is a direct-style return is now a throw to the rest of the program starting at the return point), there's a huge reduction in in the complexity of your optimizations because you're not tracking the extra thing. For some examples, you can look in the Manticore codebase where even simple optimizations like contraction and let-floating are implemented in both our early direct-style IR (BOM) and our CPS-style IR (CPS). The latter is gobs more readable, and there's no way at all I'd be willing to port most of the harder optimizations like reflow-informed higher-order inlining or useless variable elimination to BOM.
2) IRs are cheap in languages like lisp and ML. So you write optimizations as a tree-to-tree transformation (micro-optimization passes). This style makes it much easier to enforce invariants. If you look at the internals of most compilers written in C++, you'll see far fewer copies of the IR made and a whole bunch of staged state in the object that's only valid at certain points in the compilation process (e.g., symbols fully resolved only after phase 2b, but possibly invalid for a short time during optimization 3d unless you call function F3d_resolve_symbol...). Just CPS'ing the representation of a C++ compiler without also making it easy to write your optimization passes as efficient tree-to-tree transformations will not buy you much, IMO. " -- [10]
" So the lineage of the CPS-as-compiler-IR thesis goes from Steele's Rabbit compiler through T's Orbit to SML/NJ. At which point Sabry & Felleisen at Rice published a series of very heavy-duty papers dumping on CPS as a representation and proposing an alternate called A-Normal Form. ANF has been the fashionable representation for about ten years now; CPS is out of favor. This thread then sort of jumps tracks over to the CMU ML community, where it picks up the important typed-intermediate-language track and heads to Cornell, and Yale, but I'm not going to follow that now. " -- http://www.paulgraham.com/thist.html
?
"in CPS, as in SSA or ANF, expressions don't have subexpressions." -- http://wingolog.org/archives/2014/05/18/effects-analysis-in-guile
http://lambda-the-ultimate.org/node/1617#comment-19700
http://lambda-the-ultimate.org/node/69
see Optimizing compilers for structured programming languages by Marc Michael Brandis
https://github.com/cfallin/rfcs/blob/cranelift-egraphs/accepted/cranelift-egraph.md
Many large books have been filled with the details of writing efficient compilers and interpreters, so this chapter will only provide an overview of selected techniques.
Calling a function typically involves manipulating the stack: upon entry, you push the return address and the caller's frame pointer onto the stack, push the arguments, adjust the top-of-the-stack pointer register, and adjust the frame pointer register to point to the new current stack position.
You can save this overhead by inlining.
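A hypothetical sketch of the per-call bookkeeping described above, as a toy VM might do it (all names are invented for illustration):

    /* Toy sketch of what a call does to the stack: save the return address
       and the caller's frame pointer, copy the arguments, bump the pointers.
       All names here are invented for illustration. */
    #include <string.h>

    typedef struct {
        int stack[1024];
        int sp;   /* index of next free slot (stack grows upward here) */
        int fp;   /* base of the current frame                         */
    } VM;

    static void vm_call(VM *vm, int return_addr, const int *args, int nargs) {
        vm->stack[vm->sp++] = return_addr;      /* push return address       */
        vm->stack[vm->sp++] = vm->fp;           /* push caller's frame ptr   */
        memcpy(&vm->stack[vm->sp], args, nargs * sizeof(int));  /* push args */
        vm->sp += nargs;
        vm->fp = vm->sp - nargs;                /* new frame starts at the args */
    }

    static int vm_return(VM *vm) {
        vm->sp = vm->fp - 2;                    /* drop args and saved slots  */
        vm->fp = vm->stack[vm->sp + 1];         /* restore caller's frame ptr */
        return vm->stack[vm->sp];               /* resume at the return address */
    }

    int main(void) {
        VM vm = { {0}, 0, 0 };
        int args[2] = { 42, 99 };
        vm_call(&vm, 123, args, 2);        /* "call" with return address 123 */
        int resume_at = vm_return(&vm);    /* unwinds the frame              */
        return resume_at == 123 ? 0 : 1;
    }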
Some languages present a computation model in which there is only a stack, no registers. In this case, assuming that the underlying hardware has registers, it may speed things up to use some of the registers to hold the top few stack positions.
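A minimal sketch of that idea (a hypothetical interpreter fragment, not taken from any real VM): the top of the operand stack is cached in a local variable, which the C compiler will usually keep in a machine register, so only deeper stack entries touch memory.

    /* Hypothetical interpreter fragment: the top-of-stack value is cached in
       the local 'tos', which the C compiler can keep in a machine register;
       only the second-from-top and deeper entries live in memory. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

    static void run(const int *code) {
        int stack[256];
        int sp = 0;        /* entries in memory, below the cached top */
        int tos = 0;       /* cached top-of-stack                     */
        for (;;) {
            switch (*code++) {
            case OP_PUSH:  stack[sp++] = tos; tos = *code++; break; /* spill old top  */
            case OP_ADD:   tos = stack[--sp] + tos; break;          /* no memory write */
            case OP_MUL:   tos = stack[--sp] * tos; break;
            case OP_PRINT: printf("%d\n", tos); break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        /* (2 + 3) * 4 */
        int code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL,
                       OP_PRINT, OP_HALT };
        run(code);   /* prints 20 */
        return 0;
    }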
Dynamic loading: Languages with dynamic loading can experience slow startup times if they:
See also relevant section of [11].
Links:
Links:
Languages without a canonical implementation tend to:
Scheme started out as an implementation, was described in a series of published papers (the Lambda Papers), turned into a standard, however
Haskell started out as a standard (todo confirm), then many implementations sprang up, then one won out (GHC). As of this writing, GHC has stopped maintaining conformance with the language standard (except while running in a special mode that is not interoperable with most libraries in the wild; see [12] ), and discussions regarding changing the language appear to revolve around changing GHC first, with updates to the standard to bring it into conformance with GHC treated as an afterthought [13].
benefits, costs (lots of popular languages don't)
portability
kernel approach
There is always the 'chain solution': every self-hosting language had to be bootstrapped from another language at the beginning, and each subsequent self-hosting version of a self-hosting language had to be compilable by the previous version. So, to bootstrap the language onto a new platform, one could simply re-bootstrap that same early version of the language, and then re-compile each version of the language's compiler in sequence (see eg [14]). Disadvantages to this include: (a) if you are worried about Thompson 'trusting trust' attacks (see section below), then you must audit the source code of each version of the language implementation in this chain; and (b) this seems like a lot more compilation steps than should be necessary.
Another option is for the language to deliberately restrict its own implementation to using code compatible with an earlier version of itself (eg Golang [15]). This removes some of the benefits of a self-hosting language in terms of using the compiler as a way for the language designers to eat their own dogfood.
various stories of standards processes and advice on what to do if you find yourself involved in a standards process:
Ken Thompson wrote a popular essay in which he pointed out that a compiler could be subverted to introduce attacker-desired behavior into the programs it compiles; if the subversion is clever, then such a compiler, when compiling itself from source, would continually re-introduce the subversion.
Links:
" 1.4 WHY ARE STACKS USED IN COMPUTERS?
Both hardware and software stacks have been used to support four major computing areas in computing requirements: expression evaluation, subroutine return address storage, dynamically allocated local variable storage, and subroutine parameter passing.
1.4.1 Expression evaluation stack
Expression evaluation stacks were the first kind of stacks to be widely supported by special hardware. As a compiler interprets an arithmetic expression, it must keep track of intermediate stages and precedence of operations using an evaluation stack. In the case of an interpreted language, two stacks are kept. One stack contains the pending operations that await completion of higher precedence operations. The other stack contains the intermediate inputs that are associated with the pending operations. In a compiled language, the compiler keeps track of the pending operations during its instruction generation, and the hardware uses a single expression evaluation stack to hold intermediate results.
To see why stacks are well suited to expression evaluation, consider how the following arithmetic expression would be computed:
X = (A + B) * (C + D)
First, A and B would be added together. Then, this intermediate results must be saved somewhere. Let us say that it is pushed onto the expression evaluation stack. Next, C and D are added and the result is also pushed onto the expression evaluation stack. Finally, the top two stack elements (A+B and C+D) are multiplied and the result is stored in X. The expression evaluation stack provides automatic management of intermediate results of expressions, and allows as many levels of precedence in the expression as there are available stack elements. Those readers who have used Hewlett Packard calculators, which use Reverse Polish Notation, have direct experience with an expression evaluation stack.
The use of an expression evaluation stack is so basic to the evaluation of expressions that even register-based machine compilers often allocate registers as if they formed an expression evaluation stack.
1.4.2 The return address stack
With the introduction of recursion as a desirable language feature in the late 1950s, a means of storing the return address of a subroutine in dynamically allocated storage was required. The problem was that a common method for storing subroutine return addresses in non-recursive languages like FORTRAN was to allocate a space within the body of the subroutine for saving the return address. This, of course, prevented a subroutine from directly or indirectly calling itself, since the previously saved return address would be lost.
The solution to the recursion problem is to use a stack for storing the subroutine return address. As each subroutine is called the machine saves the return address of the calling program on a stack. This ensures that subroutine returns are processed in the reverse order of subroutine calls, which is the desired operation. Since new elements are allocated on the stack automatically at each subroutine call, recursive routines may call themselves without any problems.
Modern machines usually have some sort of hardware support for a return address stack. In conventional machines, this support is often a stack pointer register and instructions for performing subroutine calls and subroutine returns. This return address stack is usually kept in an otherwise unused portion of program memory.
1.4.3 The local variable stack
Another problem that arises when using recursion, and especially when also allowing reentrancy (the possibility of multiple uses of the same code by different threads of control) is the management of local variables. Once again, in older languages like FORTRAN, management of information for a subroutine was handled simply by allocating storage assigned permanently to the subroutine code. This kind of statically allocated storage is fine for programs which are neither reentrant nor recursive.
However, as soon as it is possible for a subroutine to be used by multiple threads of control simultaneously or to be recursively called, statically defined local variables within the procedure become almost impossible to maintain properly. The values of the variables for one thread of execution can be easily corrupted by another competing thread. The solution that is most frequently used is to allocate the space on a local variable stack. New blocks of memory are allocated on the local variable stack with each subroutine call, creating working storage for the subroutine. Even if only registers are used to hold temporary values within the subroutine, a local variable stack of some sort is required to save register values of the calling routine before they are destroyed.
The local variable stack not only allows reentrancy and recursion, but it can also save memory. In subroutines with statically allocated local variables, the variables take up space whether the subroutine is active or not. With a local variable stack, space on the stack is reused as subroutines are called and the stack depth increases and decreases.
1.4.4 The parameter stack
The final common use for a stack in computing is as a subroutine parameter stack. Whenever a subroutine is called it must usually be given a set of parameters upon which to act. Those parameters may be passed by placing values in registers, which has the disadvantage of limiting the possible number of parameters. The parameters may also be passed by copying them or pointers to them into a list in the calling routine's memory. In this case, reentrancy and recursion may not be possible. The most flexible method is to simply copy the elements onto a parameter stack before performing a procedure call. The parameter stack allows both recursion and reentrancy in programs.
1.4.5 Combination stacks
Real machines combine the various stack types. It is common in register-based machines to see the local variable stack, parameter stack, and return address stack combined into a single stack of activation records, or "frames." In these machines, expression evaluation stacks are eliminated by the compiler, and instead registers are allocated to perform expression evaluation.
The approach taken by the stack machines described later in this book is to have separate hardware expression evaluation and return stacks. The expression evaluation stacks are also used for parameter passing and local variable storage. Sometimes, especially when conventional languages such as C or Pascal are being executed, a frame pointer register is used to store local variables in an area of program memory. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec1_4.html
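The X = (A + B) * (C + D) example from section 1.4.1 above, evaluated against an explicit expression evaluation stack in RPN order (a hypothetical sketch):

    /* Evaluating X = (A + B) * (C + D) on an explicit expression stack,
       i.e. the RPN sequence  A B + C D + *  . A hypothetical sketch. */
    #include <stdio.h>

    static double stack[64];
    static int sp = 0;

    static void push(double v) { stack[sp++] = v; }
    static double pop(void)    { return stack[--sp]; }

    int main(void) {
        double A = 1, B = 2, C = 3, D = 4;

        push(A); push(B); push(pop() + pop());   /* A+B now on the stack */
        push(C); push(D); push(pop() + pop());   /* C+D now on the stack */
        push(pop() * pop());                     /* (A+B)*(C+D)          */

        double X = pop();
        printf("X = %g\n", X);                   /* prints X = 21 */
        return 0;
    }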
" ...nesting of tasks or threads. The task and its creator share the stack frames that existed at the time of task creation, but not the creator's subsequent frames nor the task's own frames. This was supported by a cactus stack, whose layout diagram resembled the trunk and arms of a Saguaro cactus. Each task had its own memory segment holding its stack and the frames that it owns. The base of this stack is linked to the middle of its creator's stack. " -- https://en.wikipedia.org/wiki/Stack_machine (see also https://en.wikipedia.org/wiki/Spaghetti_stack , another name for the same concept)
" Use in programming language runtimes
The term spaghetti stack is closely associated with implementations of programming languages that support continuations. Spaghetti stacks are used to implement the actual run-time stack containing variable bindings and other environmental features. When continuations must be supported, a function's local variables cannot be destroyed when that function returns: a saved continuation may later re-enter into that function, and will expect not only the variables there to be intact, but it will also expect the entire stack to be present so the function is able to return again. To resolve this problem, stack frames can be dynamically allocated in a spaghetti stack structure, and simply left behind to be garbage collected when no continuations refer to them any longer. This type of structure also solves both the upward and downward funarg problems, so first-class lexical closures are readily implemented in that substrate also. " -- https://en.wikipedia.org/wiki/Spaghetti_stack
Some VMs with stack-based (as opposed to register-based) instructions have a value stack (for the VM instructions), a call stack, and a block stack (to keep track of eg that you are in a FOR loop nested inside another FOR loop; a BREAK would pop this stack); for example, Python [16].
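A hypothetical sketch of such a block stack (the names are modeled loosely on older CPython bytecodes, but the code itself is invented): entering a loop records where a BREAK should jump; BREAK pops that record and jumps there; leaving the loop normally just pops it.

    /* Hypothetical sketch of a VM block stack: setup_loop records where a
       BREAK should jump; break_loop pops that record and jumps there;
       pop_block is the normal exit from a loop. */
    typedef struct { int break_target; } Block;

    typedef struct {
        Block blocks[32];
        int nblocks;
        int pc;
    } Frame;

    static void setup_loop(Frame *f, int break_target) {
        f->blocks[f->nblocks++].break_target = break_target;   /* enter a loop */
    }

    static void pop_block(Frame *f) {
        f->nblocks--;                                   /* normal exit from the loop */
    }

    static void break_loop(Frame *f) {
        f->pc = f->blocks[--f->nblocks].break_target;   /* BREAK: pop and jump */
    }

    int main(void) {
        Frame f = { {{0}}, 0, 0 };
        setup_loop(&f, 100);   /* enter an outer loop: BREAK jumps to offset 100 */
        setup_loop(&f, 50);    /* enter a nested loop: BREAK jumps to offset 50  */
        break_loop(&f);        /* a BREAK in the inner loop jumps to 50          */
        pop_block(&f);         /* normal exit from the outer loop                */
        return (f.pc == 50 && f.nblocks == 0) ? 0 : 1;
    }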
In many architectures the stack grows downwards [17] [18] [19], eg x86, MOS 6502, PDP-11, most MIPS ABIs [20] [21] [22].
In early memory-constrained devices, there is often just a memory section for program code, a section for data, and a stack; the code is placed near location 0, followed by the data section, while the stack is placed in high memory (near the largest addresses) and grows down [23].
Memory-constrained devices often don't have a heap that grows (only a fixed-size data section, which can be thought of as a fixed-size heap), which means that often the only memory section with a dynamically changing size is the stack. [24]
This arrangement would also work if the fixed-size data were placed in high memory and the stack were placed in low memory and grew up. This is seen sometimes (eg HP PA-RISC, Multics [25]) but less frequently.
One reason given for the stack being placed in high memory and growing downwards is that if the ISA makes it more efficient to add only UNSIGNED (always positives) offsets to memory addresses, then since a common calculation is to add an offset to the stack pointer to locate something in the stack, this will be most efficient if that offset is always positive, that is, if the stack pointer points to the lowest memory location in the stack, which occurs if the stack grows downwards [26] [27]. In addition, alignment calculations are simpler; eg "If you place a local variable on the stack which must be placed on a 4-byte boundary, you can simply subtract the size of the object from the stack pointer, and then zero out the two lower bits to get a properly aligned address. If the stack grows upwards, ensuring alignment becomes a bit trickier." [28].
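The alignment trick from that last quote, written out (a hypothetical sketch in which a plain integer stands in for the stack pointer):

    /* With a downward-growing stack, allocating 'size' bytes aligned to a
       4-byte boundary is just: subtract, then clear the two low bits.
       A hypothetical sketch; 'sp' is just an integer standing in for the
       stack pointer. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uintptr_t sp = 0x7fff0013;            /* current (downward-growing) stack ptr */
        uintptr_t size = 10;                  /* object to place on the stack         */

        sp = (sp - size) & ~(uintptr_t)3;     /* subtract, then zero the two low bits */

        printf("0x%lx\n", (unsigned long)sp); /* 0x7fff0008: 4-byte aligned, at least
                                                 10 bytes below the old pointer       */
        return 0;
    }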
Elsewhere, we treat (todo):
Links:
There are three basic ways to implement regular expressions:
It is also possible to try to use one of these techniques and then fall back to another one in certain cases, or to partially combine the first two techniques via simulating the NFA but with caching [32].
grep, awk, tcl use combinations of the first two techniques. Perl, PCRE, Python, Ruby, and Java use the third technique [33]
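Perl-style engines (the "third technique" here) are backtracking matchers. A minimal sketch of a backtracking matcher, essentially the classic Kernighan/Pike code, supporting only literal characters, '.', '*', '^', and '$'; real engines are far more elaborate:

    /* A tiny backtracking matcher in the style of the third technique,
       after Kernighan & Pike ("The Practice of Programming"). Supports
       c, '.', '^', '$' and 'c*' only. */
    #include <stdio.h>

    static int matchhere(const char *re, const char *text);

    static int matchstar(int c, const char *re, const char *text) {
        do {                                  /* try each possible match length,  */
            if (matchhere(re, text))          /* backtracking on failure          */
                return 1;
        } while (*text != '\0' && (*text++ == c || c == '.'));
        return 0;
    }

    static int matchhere(const char *re, const char *text) {
        if (re[0] == '\0')
            return 1;
        if (re[1] == '*')
            return matchstar(re[0], re + 2, text);
        if (re[0] == '$' && re[1] == '\0')
            return *text == '\0';
        if (*text != '\0' && (re[0] == '.' || re[0] == *text))
            return matchhere(re + 1, text + 1);
        return 0;
    }

    static int match(const char *re, const char *text) {
        if (re[0] == '^')
            return matchhere(re + 1, text);
        do {                                  /* try matching at each start position */
            if (matchhere(re, text))
                return 1;
        } while (*text++ != '\0');
        return 0;
    }

    int main(void) {
        printf("%d\n", match("ab*c", "xyzac"));   /* 1: "ac" matches, b* matches empty */
        printf("%d\n", match("^ab*c$", "abbbc")); /* 1 */
        printf("%d\n", match("^ab*c$", "abd"));   /* 0 */
        return 0;
    }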