proj-plbook-plPartImplementation

Table of Contents for Programming Languages: a survey

Part V: Implementation of a programming language

This is not a book about how to implement a programming language. However, even at the earliest stages of language design, it is valuable to have some idea of how certain choices in the design of the language could make the implementation much less efficient or much more difficult to write. For this purpose, we'll provide a cursory overview of implementation topics, touching as we go on some of the language design choices that may have a large impact.

Chapter : the general pipeline

when do we convert to normal forms?

todo

" The first big phase of the compilation pipeline is parsing. You need to take your input and turn it into a tree. So you go through preprocessing, lexical analysis (aka tokenization), and then syntax analysis and IR generation. Lexical analysis is usually done with regexps. Syntax analysis is usually done with grammars. You can use recursive descent (most common), or a parser generator (common for smaller languages), or with fancier algorithms that are correspondingly slower to execute. But the output of this pipeline stage is usually a parse tree of some sort.

The next big phase is Type Checking. ...

The third camp, who tends to be the most isolated, is the code generation camp. Code generation is pretty straightforward, assuming you know enough recursion to realize your grandparents weren't Adam and Eve. So I'm really talking about Optimization...

-- http://steve-yegge.blogspot.ca/2007/06/rich-programmer-food.html
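To make the lexing and parsing steps described above concrete, here is a minimal sketch in Python (the grammar, token encoding, and function names are invented for illustration): a regexp-based tokenizer followed by a recursive-descent parser for simple arithmetic, producing a parse tree as nested tuples.

```python
import re

# lexical analysis: regexps turn the input string into a token stream
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(src):
    tokens = []
    for num, op in TOKEN_RE.findall(src):
        tokens.append(("NUM", int(num)) if num else ("OP", op))
    return tokens

# syntax analysis: recursive descent, one function per grammar rule
#   expr := term (('+' | '-') term)*
#   term := NUM | '(' expr ')'
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)
    def expr():
        nonlocal pos
        node = term()
        while peek() in (("OP", "+"), ("OP", "-")):
            op = tokens[pos][1]; pos += 1
            node = (op, node, term())
        return node
    def term():
        nonlocal pos
        kind, val = peek()
        pos += 1
        if kind == "NUM":
            return val
        if val == "(":
            node = expr()
            pos += 1  # consume ')'
            return node
        raise SyntaxError("unexpected token: %r" % (val,))
    return expr()

tree = parse(tokenize("1 + (2 - 3)"))
print(tree)  # ('+', 1, ('-', 2, 3))
```

Note that the tokenizer and the parser are separate stages: the parser never sees raw characters, only tokens.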

modified from Python's docs:

" The usual steps for compilation are:

    Parse source code into a parse tree
    Transform parse tree into an Abstract Syntax Tree
    Transform AST into a Control Flow Graph
    Emit bytecode based on the Control Flow Graph" -- https://docs.python.org/devguide/compiler.html#abstract
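Python itself exposes the first and last of these stages: the `ast` module shows the tree produced by parsing, and `dis` shows the bytecode that is eventually emitted (the Control Flow Graph stage is internal to CPython and not exposed). A quick demonstration:

```python
import ast
import dis

src = "x = a + b * 2"

# parse source code into an Abstract Syntax Tree
tree = ast.parse(src)
print(ast.dump(tree))

# compile (CFG construction happens internally) and emit bytecode
code = compile(tree, "<example>", "exec")
dis.dis(code)
```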

discussions on the benefits of the introduction of a 'Core' or 'middle' language in between the HLL AST and a lower-level, LLVM-ish language:

some advantages of a core language:

Links:


todo: somewhere put my observations on LLLs and target languages:

'target' languages include both 'Core languages' and LLLs

goal for:

Core languages tend to be similar to the HLL, except:

* de-sugared (various local 'desugar' transformations have been applied)
* fewer primitives than the HLL (eg de-sugaring various control flow constructs to GOTO)
* in safe or prescriptive languages, sometimes the Core primitives are more powerful than the HLL's (eg in Rust, the MIR core language has GOTO but Rust does not)
* function calling is still retained
* types are explicit
* a bunch of other things are explicit too, so that the result is far too verbose for human use, but the control flow is still similar to the original HLL's
* like the HLL but unlike an LLL, still nonlinear (parenthesized subexpressions are still present)
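As a toy illustration of de-sugaring to a Core language with fewer, lower-level primitives (the AST encoding and instruction names here are invented): a `while` loop lowered to labels and conditional GOTOs.

```python
import itertools

_fresh = itertools.count()

def fresh_label():
    return "L%d" % next(_fresh)

# HLL AST node (invented encoding): ('while', cond, body)
# Core instructions: ('label', L), ('branch_if_false', cond, L), ('goto', L)
def desugar_while(node):
    _, cond, body = node
    top, done = fresh_label(), fresh_label()
    return [
        ("label", top),
        ("branch_if_false", cond, done),  # exit the loop when cond is false
        *body,
        ("goto", top),                    # back edge to the loop head
        ("label", done),
    ]

core = desugar_while(("while", ("lt", "i", "n"), [("incr", "i")]))
for instr in core:
    print(instr)
```

The Core program is longer and unreadable compared to the source, but every construct in it is trivial to translate further down.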

LLLs and 'assembly' languages are more similar to each other than you might expect.

LLL general properties:

Main families of LLLs:

* imperative
  * stack vs. register machines
    * stack machines (eg the JVM, Python's, mb Forth)
    * register machines
      * 3-operand vs 2- or 1-operand register machines
      * various calling conventions
      * LOAD/STORE vs addressing modes vs LOAD/STORE with constant argument but self-modifying code
      * explicit constant instructions (eg ADDC) vs a constant addressing mode vs LOADK
      * are registers and memory locations actual memory locations upon which address arithmetic can be done (as in assembly), or just variable identifiers (as in Lua)?
* functional
  * eg https://en.wikipedia.org/wiki/SECD_machine
* combinator
  * eg Nock

Also, instruction encodings might be fixed-width or variable (most are variable).
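To contrast the two imperative families above, here is the expression 2 + 3 run on a toy stack machine and on a toy 3-operand register machine (a sketch; the instruction names are invented):

```python
def run_stack(program):
    # stack machine: operands are implicit, so instructions are compact
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack[-1]

def run_register(program, nregs=4):
    # register machine: operands are explicit register numbers
    regs = [0] * nregs
    for op, *args in program:
        if op == "LOADK":                 # load a constant into a register
            regs[args[0]] = args[1]
        elif op == "ADD":                 # 3-operand form: dst, src1, src2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
    return regs[0]

print(run_stack([("PUSH", 2), ("PUSH", 3), ("ADD",)]))                     # 5
print(run_register([("LOADK", 1, 2), ("LOADK", 2, 3), ("ADD", 0, 1, 2)]))  # 5
```

The stack program needs no operand fields on ADD; the register program needs fewer instructions per expression but wider ones.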

Primitive operations in LLLs: not as much variance as you might expect. Generally:


Links regarding stack machine vs register machines:

---

Someone's response to the question "Does anyone know if there is research into the effects of register counts on VMs?":

" Someone 1819 days ago [-]

If you can map all VM registers to CPU registers, the interpreter will be way simpler.

If you have more VM registers than CPU registers, you have to write code to load and store your VM registers.

If you have more CPU registers than VM registers, you will have to write code to undo some of the register spilling done by the compiler (if you want your VM to use the hardware to the fullest.)

So, the optimum (from a viewpoint of 'makes implementation easier') depends on the target architecture.

Of course, a generic VM cannot know how many registers (and what kinds; uniform register files are not universally present) the target CPU will have.

That is a reason that choosing either zero (i.e. a stack machine) or infinitely many (i.e. SSA) are popular choices for VMs: for the first, you know for sure your target will not have fewer registers than your VM, for the second, that it will not have more.

If you choose any other number of VM registers, a fast generic VM will have to handle both cases.

Alternatively, of course, one can target a specific architecture, and have the VM be fairly close to the target; Google's NaCl is (was? I think they changed things a bit) an extreme example. I have not checked the code, but this, I think, is similar. " -- https://news.ycombinator.com/item?id=2930109

---

Chapter : modularity

whole program analysis

if modules are separately compiled and types are allowed to remain completely abstract (e.g. at the time a module is compiled, the in-memory representation of the values in the module may be unknown), then a host of optimizations cannot be performed. Alternatives:

see section "compilation model" in "Retrospective Thoughts on BitC?" http://www.coyotos.org/pipermail/bitc-dev/2012-March/003300.html

https://news.ycombinator.com/item?id=5422094

ml functors

intermediate representations (IRs)

" Some common intermediate representations:

Examples:

-- [5]

[6] also describes and motivates SSA and CPS in more detail. It defines SSA and explains why you need phi in SSA, and refers to an algorithm to determine where to place the phis.
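A tiny worked example of why phi is needed: in SSA each variable is assigned exactly once, so straight-line code just gets its variables renamed, but at a control-flow join the current value depends on which predecessor executed, and phi is the construct that expresses that choice. The sketch below renames straight-line three-address code into SSA form (the code representation is invented for illustration):

```python
# straight-line three-address code: (dst, op, args...)
def to_ssa(code):
    version = {}          # current SSA version number of each variable
    def use(v):
        return "%s%d" % (v, version[v]) if v in version else v
    out = []
    for dst, op, *args in code:
        args = [use(a) for a in args]          # reads use the old version
        version[dst] = version.get(dst, 0) + 1  # each write makes a new one
        out.append((use(dst), op, *args))
    return out

ssa = to_ssa([
    ("x", "const", 1),
    ("x", "add", "x", "x"),   # reads the old x, writes a new x
    ("y", "add", "x", "x"),
])
for instr in ssa:
    print(instr)

# At a join point (e.g. after an if/else where both branches assign x),
# no single renamed version is correct, so SSA inserts
#   x3 = phi(x1, x2)
# to select the version from whichever predecessor actually ran.
```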

Abstract machines and the compilers that love/hate them

Introducing structured control flow

Some target languages, such as WASM and SPIR-V, require some amount of structure in control flow. If the source language doesn't provide this structure, it must be introduced.
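One baseline technique for this (sometimes called the dispatcher or 'loop-plus-switch' pattern; more sophisticated algorithms such as the Relooper recover nicer structure) is to wrap the entire CFG in a single loop whose body dispatches on a label variable, so that only structured constructs remain. A minimal sketch in Python, with an invented block encoding:

```python
# CFG encoding (invented): each basic block is a function that mutates
# the state and returns the label of its successor (None = exit).
# Arbitrary GOTOs become assignments to 'label'; the only control flow
# left is one structured loop plus a dispatch.

def run_structured(blocks, entry, state):
    label = entry
    while label is not None:          # the single structured loop
        label = blocks[label](state)  # dispatch on the label variable
    return state

def body(s):
    s["i"] += 1
    return "loop"

blocks = {
    "entry": lambda s: "loop",
    "loop":  lambda s: "body" if s["i"] < 3 else "exit",
    "body":  body,
    "exit":  lambda s: None,
}
print(run_structured(blocks, "entry", {"i": 0}))  # {'i': 3}
```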

Links:

Chapter : possible compiler safety checks beyond typechecking

check to see if there are any reads of uninitialized variables; this could be implemented as a type system feature
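A sketch of such a check as a simple forward dataflow pass over straight-line three-address code (the instruction encoding is invented; a real checker must also handle branches, e.g. by intersecting the initialized-variable sets of all predecessors at a join):

```python
def check_initialized(code, params=()):
    # forward pass: a variable is initialized once it has been assigned
    initialized = set(params)
    errors = []
    for dst, op, *args in code:
        for a in args:
            if isinstance(a, str) and a not in initialized:
                errors.append("read of uninitialized variable %r" % a)
        initialized.add(dst)
    return errors

print(check_initialized([
    ("x", "const", 1),
    ("y", "add", "x", "z"),   # z was never assigned
]))
```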

Chapter : linker

issue in C:

in C, according to one of the comments on http://www.slideshare.net/olvemaudal/deep-c , it is claimed (I didn't check) that if you declare your own printf with the wrong signature, it will still be linked to the printf in the standard library, but will crash at runtime; e.g. "void printf(int x, int y); main() {int a=42, b=99; printf(a, b);}" will apparently crash.

-- A new programming language might want to throw a compile-time error in such a case (as C++ apparently does, according to the slides).

Links:

Chapter : concurrency implementation

atomics

SIMD, MIMD

GPU

Useful algorithms and data structures

Chapter: designing and implementing a virtual machine

"Instructions should not bind together operations which an optimizing compiler might otherwise choose to separate in order to produce a more efficient program."

Chapter: ?? where to put these

tags

descriptors

stacks

cactus/saguaro stacks

random book on stack-based hardware architectures: http://www.ece.cmu.edu/~koopman/stack_computers/contents.html

trampolines
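A trampoline converts deep (e.g. tail-) recursion into a loop: instead of making a recursive call, a function returns a thunk, and a driver loop keeps invoking thunks until a non-callable result appears, so the native call stack never grows. A minimal sketch in Python:

```python
def trampoline(f, *args):
    # keep bouncing until the result is no longer a thunk
    result = f(*args)
    while callable(result):
        result = result()
    return result

def countdown(n):
    if n == 0:
        return "done"
    return lambda: countdown(n - 1)  # return a thunk instead of recursing

print(trampoline(countdown, 100000))  # completes without stack overflow
```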

vtables (C++)
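A sketch of C++-style vtable dispatch simulated in Python (the names and layout are invented for illustration): each class has one shared table of function pointers with a fixed slot layout, each object carries only a pointer to its class's table, and a virtual call indexes the table through the object.

```python
# one vtable per class, shared by all its instances; the slot order is
# fixed at "compile time": slot 0 = speak
DOG_VTABLE = [lambda self: "woof"]
CAT_VTABLE = [lambda self: "meow"]

def make_dog():
    return {"vtable": DOG_VTABLE}  # the object stores just a vtable pointer

def make_cat():
    return {"vtable": CAT_VTABLE}

def virtual_call(obj, slot):
    return obj["vtable"][slot](obj)  # indirect call through the table

print(virtual_call(make_dog(), 0))  # woof
print(virtual_call(make_cat(), 0))  # meow
```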

interpreter vs spec:

" The technique that we had for Smalltalk was to write the VM in itself, so there’s a Smalltalk simulator of the VM that was essentially the only specification of the VM. You could debug and you could answer any question about what the VM would do by submitting stuff to it, and you made every change that you were going to make to the VM by changing the simulator. After you had gotten everything debugged the way you wanted, you pushed the button and it would generate, without human hands touching it, a mathematically correct version of C that would go on whatever platform you were trying to get onto." -- Alan Kay, http://queue.acm.org/detail.cfm?id=1039523

metacircular interpreters

Futamura projections https://news.ycombinator.com/item?id=7061913

"

nostrademons 1 day ago


IIUC PyPy is all 3 Futamura projections.

I'm a little rusty on the details, but IIUC the core of PyPy