Bayle Shanks's website: proj-plbook-plChTargetsImpl

Table of Contents for Programming Languages: a survey

Chapter : targets, IRs, VMs and runtimes

note: HLL is an abbreviation for High Level Language.

IR is for Intermediate Representation.

The core language of a VM is often called 'bytecode'. The instructions in this language are called 'instructions' or 'operations' or 'ops' or 'opcodes' or 'bytecodes'.

vms

maximizing code density for embedded systems

https://dercuano.github.io/notes/tiny-interpreters-for-microcontrollers.html

register-based vs stack-based VMs

"It’s a bit easier to build a relatively efficient interpreter with a register-based machine...Having now done the code generation work for register and stack machines...Mostly, I found the register-based code generation a bit easier to do...on the JVM, the stack must be empty either side of a try/catch construct. So you have to do some work to spill the stack into locals if there’s anything already on it....I haven’t really come out of all this thinking register machines are overwhelmingly better than stack machines, though – but they are a bit easier on the code generation, which is nice. If you plan to put your call frames on the system stack, the stack approach seems more attractive from that angle. Once you decide you’re not doing that, you could go either way." -- http://www.i-programmer.info/programming/perl/6138-where-is-perl-heading-an-interview-with-jonathan-worthington.html?start=1

Shi, Yunhe; Gregg, David; Beatty, Andrew; Ertl, M. Anton (2005-06-11). "Virtual Machine Showdown: Stack Versus Registers".

"And never forget that stack machines are superior to register machines when it comes to code density!" -- http://www.pouet.net/topic.php?post=398318

http://blogs.msdn.com/b/ericlippert/archive/2011/11/28/why-have-a-stack.aspx

examples of stack-based VMs:

JVM (see Chapter ?: The Java bytecode (JVM) language)
CPython bytecode interpreter (see the relevant section of Chapter ?: other intermediate target languages designed for implementing a single high-level language)

"if you look at the VMs that do successfully host multiple languages today, they’ve typically started by hosting one language really well." -- http://www.josetteorama.com/all-about-perl-6-interview-of-jonathan-worthington-part-2-of-3/

i guess the reason that writing instructions in stack-mode is so concise is that it's really just a useful shorthand for how the outputs of earlier instructions are plugged into the inputs of later ones, and that in the common case (executing a bunch of sequential instructions whose inputs depend on the output of the previous instruction) it's concise. when you get far away from this common case is when you get lots of ROTs and DUPs, in which case maybe registers are better.

another way of saying that: when you are computing expressions, you end up recursively computing subexpressions and then plugging the result into a larger expression. E.g. to compute 0.5*(f(x,y,u+v) + z), you compute z, then you compute u+v, then f(x,y,u+v), then f(x,y,u+v)+z, then 0.5*f(x,y,u+v)+z. You don't save usually save these intermediate results to use twice; e.g. if you compute 0.5*(2*f(x) + 2*f(x)) you might actually compute f(x) twice (or you may not, if you memoize). This is perfect for a stack machine. Whereas, if you did y = 2*f(x) and then 0.5*(y + y), that's more easily done on a register machine. The point is, for the best-case scenario on a stack machine, each intermediate result is used exactly once and is consumed immediately, except when other operands must first be computed as further inputs to the function that is consuming it, in which case it is consumed as soon as all of its coinputs are produced. If each function further produces exactly one output, then that output represents a subexpression and the stack machine represents the recursive computation of an expression.

Registers are for multi output, stack for single (== expressions)

Does a free swap turn a stack into a traditional register set? by D. J. Bernstein

A Performance Survey on Stack-based and Register-based VirtualMachines (finds that register machines are more efficient; the register machine they used appears to be this one, which appears to have 4 registers: https://github.com/Conceptual-Inertia/Inertia/blob/master/docs/instr_file.txt )

discussion:

https://news.ycombinator.com/item?id=13154111

http://wingolog.org/archives/2013/11/26/a-register-vm-for-guile

another argument for register VMs is the simpler de-optimization of JIT compilers: " Modern JIT compilers use a technique called type feedback compilation. Briefly, the idea is that a compiler that is integrated into the runtime of the system can exploit information on how the program is actually used to compile more efficient code than would be possible on the basis of the program source code alone. A simple example in javascript would be the following routine:

function foo(a) { var r = 0; for (var i = 1; i < a.length; i++) { r += (a[i] * a[i-1]); } return r; }

foo([1,2,3,4,5,6]);

If all calls to foo happen to have a single argument consisting of an array of integers, the semantics of this routine become much simpler than they are otherwise. (For example, in javascript, the addition of a number and a string produces a well-defined result, so it is totally valid to call foo with an array of strings). A type-feedback compiler might notice a large number of calls to foo, all with integer arrays as their sole argument, assume this will always be so, and compile a much faster routine. In order to correctly handle arrays of strings too, the compiler inserts a 'guard clause' that checks if a is really an array of integers. If not, the routine must be 'de-optimised'. Note that spesh, which is the optimisation framework for MoarVM?, also works this way

The goal of de-optimisation is to resume the execution of the interpreted (slow) routine where the assumptions of the compiled routine have failed. A typical place in our 'foo' function would be on entry or on the addition to r. The idea is that the values that are calculated in the optimised routine are copied to the locations of the values of the interpreted routine. In a register machine, this is conceptually simple because all variables already have a fixed location. However, the layout of the stack in a stack vm is dynamic and changes with the execution of the routine, and mapping between compiled and interpreted values may not be very simple at all. It is certainly doable - after all, the JVM famously has an efficient optimising JIT compiler - but not simple. " -- http://brrt-to-the-future.blogspot.com/2014/05/moarvm-as-machine.html

a discussion on this is

Links:

How many registers?

" If you only count real GPRs in MIPS like that then in ARM32 there's only 13 registers (R0-R12, minus SP, LR and PC) and x86 has 7 (minus SP, when omitting frame pointers). x86_64 has 15 and ARM64 has 31 – phuclv Jan 23 '15 at 6:22 "

https://stackoverflow.com/questions/8466981/why-does-arm-have-16-registers

"The trade off is that to represent more registers you need more bits in your instruction, and you need more room on the chip for the register file, which increases power requirements"

also you need to store more state for context switches, which takes more time and more memory

" You can see how differing register counts affects code size and the frequency of load/store instructions by compiling the same set of code with different numbers of registers. The result of that type of exercise can be seen in table 1 of this paper:

Extendable Instruction Set Computing

27 100.00 27.90% 16 101.62 30.22% 8 114.76 44.45% " -- Bradley McKinley

good read: https://electronics.stackexchange.com/questions/256854/why-dont-we-have-more-registers-in-microprocessors

Potential differences with register machines in VMs vs bare metal

Links:

https://news.ycombinator.com/item?id=9517508

VMs: pros and cons

Why do some languages target virtual machines? Why add another layer that isn't strictly necessary?

Pros:

portability: to make your language run on a new platform, you only have to port the virtual machine
can be faster than interpreting the source code directly: e.g. within an inner loop, you don't need to be doing things like re-parsing each line of code each time you see it, desugaring language constructs the same way over and over, etc
separation of concerns: to some extent you can keep 'implementation stuff' in the VM and 'high level syntax' in the translation from source to the VM; although semantics may involve both
specification: if you implement the language's compiler to the VM in the language itself, and then also implement it in the VM, you can check for consistency by checking that the result of compiling the language's own compiler produces the same bytecode as your implementation of it in the VM. This would also work if you replaced the VM with native code in the previous sentence, but VMs are usually simpler than native architectures and so make for better specifications and easier reading.

Additional pros if you target an existing virtual machine:

interoperability with libraries from existing languages: if you target the same VM as another language, it becomes easier (although sometimes still requiring some boilerplate) to call between programs running in these languages, which means that the languages can use each others's libraries.
the virtual machine has already been written: it is probably easier to write a compiler or interpreter from the new language to the virtual machine, than to native hardware. The virtual machine might have already been ported to various architectures.
implementation functionality: the virtual machine may contain functionality such as a garbage collection, a green thread scheduler, and various optimizations.
toolchain: there is probably an existing toolchain to support installation, packaging, debugging, and profiling. The new language may eventually reimplement some of these systems but in the meantime at least there's something.

Cons:

Links:

Information Losses in bytecode IRs

Comment by Raphael Amiard on the efficiency problems with taking a pre-existing bytecode IR and compiling it, as opposed to compiling from the HLL's AST directly to a target without going thru a pre-existing bytecode IR (after the experience of writing a compiler for OCaml ZAM bytecode (the Z3 project)): https://news.ycombinator.com/item?id=4798320

Why no polymorphism in bytecode?

In bytecode instruction sets, often you see a bunch of different instructions that do similar things on different types. For example, add signed integers, add unsigned integers, add 32-bit floating points, add 64-bit floating points. You might think, gee, couldn't the darn bytecodes give us a higher level of abstraction than that? Couldn't there just be one opcode for ADD, and the VM figures out which addition routine is required based on the type?

Here's why that is not often done. If at compile time the types of everything were known, and then at runtime we encounter an instruction to add (for example) register 0 to register 1, we don't want to have to lookup the types of register 0 and register 1 to see whether we should jump to the integer addition routine, the floating point addition routine, etc, because that takes significant extra time.

So, why not just have a single ADD opcode, but also encode the statically-known types in a separate 'type' field in the instruction? Well, for one, this field would have to have enough bits to encode all types that you would want to distinguish, so if the bytecode supports many types, this could substantially reduce code density.

Well, we have two ways to do that. First, at runtime we could look up the implementation to use in a table that is indexed by (opcode, type). But this table would be huge, and wasteful; not all opcodes can be used with all types (or are handled differently with all types), so this table would be much larger than it needs to be. We could store the table sparsely or as a hash, but then lookup takes a long time, and this is something that we have to do upon every instruction so that would be bad.

The second way would be to pass the type given by the instruction's bytecode to the implementation of the instruction, and let the implementation conditionally branch to the subimplementation for that type. The problem with this is that now we're doing at least two branches for polymorphic instructions at runtime, first a branch to the opcode's handler, and then a branch to the implementation for the type that we have. This runtime branching is not needed, since for values whose type was statically known, we could have figured out where we would have ended up at compile time.

So, the most efficient thing to do is to simply have a separate opcode for each implementation routine; eg one opcode for integer addition, another opcode for floating point addition, etc.

Common constraints on VMs for efficiency

A VM does not need to obey these constraints in order to fit the definition of a VM, but they are often seen for reasons of efficiency:

space-efficient bytecode encoding: As the name 'bytecode' may suggest, VMs commonly go to some effort to create a binary encoding in which each instruction is as small as possible, for example by specifying each instruction as a struct composed of sub-byte-size bitfields. Others things being equal, the code density with respect to memory can have an impact on execution efficiency, by decreasing the amount of memory reading needed to execute a given amount of code, especially when high code density allows inner loops to fit within cache.
bytecode instructions are not first-class functions: Even if the HLL (high-level language) executing on the VM has the property that every function is first-class, the individual bytecode instructions and chunks of code in the VM are often not first-class.
everything that can be done beforehand, short of compiling the bytecode to platform native instructions, has been: one of the points of having bytecode instead of just interpreting the HLL directly is speed; by the time we are executing bytecode, we have already parsed the source code, we have already done whatever static optimizations we want to do. Note that whether we want to do some optimizations may depend on which underlying platform the virtual machine is running on, in which case we want to let the virtual machine choose later whether to do them, though.
todo what else?

How intermediate?

Leans up vs. leans down

should they be designed to be similar to hardware CPU instruction set architecture (ISA)? e.g. most extant VMs are either stack machines or register machines; but why not just memory-to-memory?

simple vs. complex

data model, evaluation strategy of HLL, sometimes object model

two layers of VMs

interoperable vms expand into toolchains

Impedence mismatches when targetting an existing VM

examples:

" I learned my own lessons about the “repurposed JIT phenomenon” the hard way. Before any work started on HHVM, I built a hackathon prototype that did naive compilation from PHP to JavaScript?. Running the output on V8 produced some really promising results at first; programming-language shootout[2] style benchmarks looked in the neighborhood of 10x the standard PHP interpreter.

The problems came when I tried to run real PHP programs. The primitive operations that PHP exposes, and that large PHP codebases rely on, are simply different from those exposed by JavaScript?, Java, the CLR, etc. For example, PHP engines expose their use of reference counting to the PHP program, while JavaScript? and Java implementations typically use some sort of tracing garbage collection (and even if they use reference counting, don’t expose it to programs). So, in order to faithfully run PHP programs on a JVM or JavaScript? engine, you need to reimplement all of PHP’s memory management, either in the source language or in the runtime itself. So right out of the gate, you lose one of the big pieces of effort you were hoping to save by recycling someone else’s engine: memory management.

PHP also has a notion of “references” that doesn’t have a direct analog in JavaScript?; passing parameters by reference is unmarked in the call site, so you need to do method resolution (“what am I actually calling at runtime?”) at the call site before you can even pass parameters; PHP has value semantics for its array datatype, while JavaScript? and Java provide reference semantics for containers; PHP’s numerics are very different from JavaScript’s?; PHP has to do protected/private/public method privacy checks at runtime; no, JVM fans, InvokeDynamic? does not magically make all of these mismatches disappear; and on and on. Once you start diving into the nuances of the low-level primitives, almost every language is different enough that it is hard to efficiently implement one in terms of the other. " -- http://www.forbes.com/sites/quora/2013/06/20/why-did-the-hip-hop-vm-team-build-a-new-vm-instead-of-targeting-the-jvm-tracemonkey-or-v8/

" Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for optimizing Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that. An example of needing to apply high-level knowledge to code generation is optimizing the Python stack access. LLVM will not fold loads from the Python stack across calls to external functions (ie the CPython runtime, so all the time). We eventually wrote an alias analysis to solve this problem, but it's an example of what you have to do if you don't roll your own code generator. LLVM also comes with other constraints. For example, LLVM doesn't really support back-patching, which PyPy? uses for fixing up their guard side exits. It's a fairly large dependency with high memory usage, but I would argue that based on the work Steven Noonan did for his GSOC that it could be reduced, especially considering that PyPy?'s memory usage had been higher. " -- http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospective.html

Author Profile Page chromatic replied to comment from Gerhard R.

January 4, 2013 10:23 AM

I think you're falling into the "JIT makes things fast" fallacy. You can see how well that works for languages which use LLVM--if they look like C or C++, they do pretty well. Otherwise they don't.

If you want your language to go really fast, your runtime needs to know a lot about the semantics of your language. How does memory allocation work? What's the layout of primitives? What precision and signedness do your primitives support? Do you support aliasing? What constness guarantees do you provide? What immutability guarantees do you provide? When can you resolve polymorphic dispatch? What boxing semantics do you provide? How does control flow work, and can you resolve questions about object escapes and lifetimes? What runtime support do you need to enable metaprogramming? When can you make these decisions?

I've often said that to support Perl 6 you need a virtual machine that looks a lot like Parrot (not that Parrot does everything right, but that it at least tries to make these things possible). I still suspect that mythical VM abstraction layer will go the way of the DLR. Effectively what NQP on the JVM or CLR will look like is a reimplementation of all of the C code in Rakudo right now using the underlying VM to provide very basic memory management and some platform access.

The result might run faster than current Rakudo, but it'll be so far away from the semantics of Java or C# that I doubt it'll run with the amazing speed people hope except on a few carefully chosen benchmarks that avoid the interesting language features that might make Perl 6 worth using.

(See also: JavaScript?'s terrible numeric support system and why JS JITs go to great length to perform calculations efficiently.)

" -- http://www.modernperlbooks.com/mt/2012/12/the-implementation-of-perl-5-versus-perl-6.html#comment-1957

" Besides the well-known issues of tail calls and continuations, VMs often include a runtime type system, garbage collection, calling functions with arguments, objects with fields, strings, dictionaries, typesafe arithmetic etc. It is unlikely that these operations match the semantics of an independently developed language. The type system will most likely not match at all. There are tons of problematic areas:

How are functions with a statically unknown arity handled?
Can objects other than functions be applied?
How are object fields identified? Are they accessed when knowing the object type or not, possibly including subtyping? Are fields identified by strings or symbols, or indices computed at compile time? Is the set of fields known at object creation time, or is the semantics of field access resolved dynamically?
What is the nature of characters, elements of strings? Unicode code points? UTF-16 code units? Bytes? Are they distinguished from 1-character strings? Are strings mutable, and can they change their length?
How is hashing and equality used by a hash table specified?
What is the semantics of weak references and finalizers? Is a finalizer specified when its associated object is created, or can it be added at any time to any object? May a finalizer refer to the object being finalized, or to other finalizable objects? Are finalizers run by other threads?
What numeric types are available? What is the effect of overflowing fixed-size integers: returning bignums? an exception? wrapping modulo integer size? What is the semantics of integer division and remainder of negative numbers? What happens when you try to add objects which are not of a builtin numeric type?
Which exceptions are thrown from particular erroneous situations? What happens when an array is accessed outside of bounds? What happens on integer and floating point division by zero? Are exceptions resumable? Is exception handling code in a tail position with respect to try?" -- http://lambda-the-ultimate.org/node/1617#comment-19661

" Language-independent VM

Here's the big problem with high-level language independent VMs: language designers invariably want to do something new. By definition, this is something you haven't thought of. Want lazyness? Out of luck. Continuations? Nope. First-class stack frames? Ditto. Fast multimethod dispatch? Well, you can hack it up, but most likely it'd be easier to just write a new VM.

In my case, I want to do a language with Erlang-style lightweight concurrency and message passing. This has to be built into the VM at a low level, because it affects everything from IO to heap allocation to stack representation to scheduling. I looked at Parrot, Neko, C--, and LLVM, but none of them offer much in concurrency support. The only halfway-suitable VM was, not surprisingly, Erlang itself, but I can't find any documentation on the bytecode format and the source is semi-closed for modifications (and makes what are IMHO poor choices in string represenation, making it unsuitable for my purposes). I'll probably end up starting with LLVM and then writing an instruction-compatible custom VM once I get to the concurrency parts. " -- http://lambda-the-ultimate.org/node/1617#comment-19670

supporting platform-dependent optimized primitives without bloating the VM language

Different platforms offer different optimized primitives. If the VM's language defined special facilities for the union of all of these, it would be very large indeed; this would also necessitate relatively large implementations of the VM because on each platform it would have to implement each of te defined facilities for . But if it does not define special facilities for some of these, then naive compilers targeting the VM language will transform operations that could be performed by optimized primitives into lengthly sequences of VM instructions, making it hard for a platform-specific runtime to figure out where the optimized primitives could be inserted.

What are some ideas for offering access to platform-dependent optimized primitives without bloating the VM language?

optimizer recognizes implementations of optimized primitives

A platform-specific optimizer, which could be part of a compiler or a runtime which is fed bytecode input, would try to recognize naive implementations of what on this platform could be rewritten in terms of optimized primitives, and rewrite them.

C. Guy Yarvin calls this "jets" at http://www.urbit.org/2013/08/22/Chapter-0-intro.html

annotations

Sections of the bytecode could support annotation, and then in the future annotations could be defined to say essentially, "this section is just doing X, if you have an optimized primitive for X please replace this section with that".

bytecode vs executing the AST

(todo: merge this with section "How intermediate?", above)

Interpreters are slower than compilers but there's a spectrum of types of interpreters.

An interpreter could actually re-parse the source code for each line of code each time it encounters it at runtime.

An interpreter pre-parse the source code once, and then interpret the AST, traveling from AST node to AST node instead of from source code line to source code line.

An interpreter pre-parse the source code once, and annotate the AST with "next" pointers that say, "once you're done with this node, here is the node to visit next". Perl5 takes this route [1]. This allows the interpreter to take a 'linear' path through the AST, even though the AST isn't being stored in linear form.

An interpreter could parse the source code, generate the AST, and compile the AST to a linear stream of bytecode. Java and Python take this route.

In 'bytecode' representations, you generally try not to defer something to runtime that you could just as easily do at compile time (examples: (a) resolve polymorphism which can be resolved statically; (b) for each instruction, figure out what the next instruction(s) are).

Orthogonal to this, an interpreter could JIT, that is, while interpreting the code also incrementally compile parts of it to native code (perhaps eventually compiling all of it, perhaps not).

misc

todo read this, mb put it here or mb in the computed goto section: http://wknight8111.blogspot.com/2009/06/understanding-opcode-dispatch.html

note: some platforms don't allow the execution of dynamically written code (e.g. Apple iOS) (umm, i think this is totally out of date now)

" Apple forbids the inclusion of mechanisms that enable execution of arbitrary third-party code. This rule effectively bars the use of nonstandard runtimes, JIT engines, and bytecode interpreters " -- http://arstechnica.com/information-technology/2009/09/monotouch-drops-net-into-apples-walled-app-garden/

" 3.3.2 — An Application may not itself install or launch other executable code by any means, including without limitation through the use of a plug-in architecture, calling other frameworks, other APIs or otherwise. No interpreted code may be downloaded or used in an Application except for code that is interpreted and run by Apple’s Documented APIs and built-in interpreter(s). "

http://stackoverflow.com/questions/12972802/appstore-ios-apps-and-interpreted-code-where-do-they-draw-the-line

" 2.7 Apps that download code in any way or form will be rejected

    2.8 Apps that install or launch other executable code will be rejected"

Links:

LLVM

JVM

CLR

SEAM

SEAM is a virtual machine framework that provides a component model for various services and leaves it up to the user to implement a 'execution unit', rather than defining an intermediate language, because "it is next to impossible to design a single intermediate language to accomodate all programming languages" [3]. Alice ML ('Mozart/Oz meets ML') and JVM have been implemented on SEAM [4].

There were reasons for developing Seam

The usual requirements of FPLs aside, the design of Alice relies on two central services that have to be provided by the VM:

    Pickling: a universal marshalling mechanism that includes code and supports configurable import/export transformations is absolutely crucial to the whole framework. Compiled components are actually pickles, and all distribution uses pickling. Basically, pickling has to be implemented on the same level as GC.

    Transients: are the generic, low-level representation of futures and support implicit data-flow synchronisation. Concurrency, laziness, dynamic linking, and just-in-time compilation all rely on this mechanism. For efficiency, VM and GC must be aware of transients and encapsulate them as far as possible. Transients also interfer with pickling.

Unfortunately, AFAIK, no existing VM or VM framework (besides Mozart) supports either of these features appropriately. LLVM wasn't even available when development on Seam started (in 2001). By Andreas Rossberg at Tue, 2004-12-28 18:12

" [5]

Some components of SEAM include:

graph store: used to provide garbage collection and pickling
scheduler
execution unit: SEAM does not define a specific code representation nor provide an execution engine to execute code in that representation, but rather it defines an API for an 'execution unit' service; languages can then provide execution units that know how to understand and execute code. "Seam defines an abstract task manager interface, which comprises the necessary functions for interpreting stack frames: This includes executing a task, handling exceptions (or raising them to the next stack frame), and clearing no-longer-needed references to store nodes from stack frames prior to garbage collection. We chose per-frame parameterization (instead of per-thread): Stack frames store a reference to the associated task manager in their first slot; operations on stack frames retrieve the actual task manager and delegate to it. Thisimplies that the computation within a single thread can be carried out by several task managers in conjunction. Accordingly, we need a generic way to create stack frames, asa function application does (or a procedure call, or a method invocation). Seam therefore defines an abstract task creator interface to perform stack frame creation. A first-class computation (a function closure, procedure, or method) is a pair of code and environment. In our representation, it is a block whose first subnode stores the code, while the re-maining subnodes store the environment. Code in turn is a block whose first subnode is the task creator implementa-tion, while the remaining subnodes are the code representation as defined by the task creator. A global register bankprovides for passing arguments. We obtain a generic way to execute a task, regardless of the language in which it isimplemented.The downside of parameterizing stack frames and firstclass computations is the extra indirection (virtual call) this im-plies. This overhead can be eliminated, however, when alanguage implementor knows that the same task manager/task creator is associated with the caller and the callee in an invocation (“intra-language invocation." [[6]
I/O subsystem: the scheduler needs to know about I/O so that it can wake up threads waiting on I/O when that I/O completes

Links:

GHC Cmm (similar to C--)

Cmm is an IL used in in Haskell's GHC. It was based on C-- (Cminusminus) which was more general, but the C-- project seems to have ended.

Example from [8]: Three functions that compute the cumulative sum from 1 to n, and the cumulative product from 1 to n:

/* Ordinary recursion */
export sp1;
sp1( bits32 n ) {
  bits32 s, p;
  if n == 1 {
    return( 1, 1 );
  } else {
    s, p = sp1( n-1 );
    return( s+n, p*n );
  }
}


/* Tail recursion */
export sp2;
sp2( bits32 n ) {
  jump sp2_help( n, 1, 1 );
}

sp2_help( bits32 n, bits32 s, bits32 p ) {
  if n==1 {
    return( s, p );
  } else {
    jump sp2_help( n-1, s+n, p*n )
  }
}


/* Loops */
export sp3;
sp3( bits32 n ) {
  bits32 s, p;
  s = 1; p = 1;
 loop:
  if n==1 {
    return( s, p );
  } else {
    s = s+n;
    p = p*n;
    n = n-1;
    goto loop;
  }
}

Motivation:

Why not target a high-level VM such as the JVM?

"One alternative is to make all these high-level services part of the abstraction offered by the portable assembler. For example, the Java Virtual Machine.... But a sophisticated platform like a virtual machine embodies too many design decisions. For a start, the semantics of the virtual machine may not match the semantics of the language being compiled (e.g., the exception semantics). Even if the semantics happen to match, the engineering tradeoffs may differ dramatically. For example, functional languages like Haskell or Scheme allocate like crazy (Diwan, Tarditi, and Moss 1993), and JVM implementations are typically not optimised for this case. Finally, a virtual machine typically comes complete with a very large infrastructure — class loaders, verifiers and the like — that may well be inappropriate. Our intended level of abstraction is much, much lower." -- [9]

Why not target C?

The C route rewards those who can map their high-level language rather directly onto C. A high-level language procedure becomes a C procedure, and so on. But this mapping is often awkward, and sometimes impossible. For example, some source languages fundamentally require tail-call optimisation; a procedure call whose result is returned to the caller of the current procedure must be executed in the stack frame of the current procedure. This optimisation allows iteration to be implemented efficiently using recursion. More generally, it allows one to think of a procedure as a labelled extended basic block that can be jumped to, rather than as sub-program that can only be called. Such procedures give a front end the freedom to design its own control flow. It is very difficult to implement the tail-call optimisation in C, and no C compiler known to us does so across separately compiled modules. Those using C have been very ingenious in finding ways around this deficiency (Steele 1978; Tarditi, Acharya, and Lee 1992;Peyton Jones 1992; Henderson, Conway, andSomogyi 1995), but the results are complex, fragile, and heavily tuned for one particular implementation of C (usually gcc).
A C compiler may lay out its stack frames as it pleases. This makes it difficult for a garbage collector to find the live pointers. Implementors either arrange not to keep pointers on the C stack, or they use a conservative garbage collector. These restrictions are Draconian.
The unknown stack-frame layout also complicates support for exception handling, debugging, profiling, and concurrency. For example, an exception-handling mechanism needs to walk the stack, perhaps removing stack frames as it goes. Again, C makes it essentially impossible to implement such mechanisms, unless they can be closely mapped onto what C provides (i.e., setjmp and longjmp).
A C compiler has to be very conservative about the possibility of memory aliasing. This seriously limits the ability of the instruction scheduler to permute memory operations or hoist them out of a loop. The front-end compiler often knows that aliasing cannot occur, but there is no way to convey this information to the C compiler.

So much for fundamental issues. C also lacks the ability to control a number of important low-level features, including returning multiple values in registers from a procedure, mis-aligned memory accesses, arithmetic, data layout, and omitting range checks on multi-way jumps. " -- [10]

Features:

C-- expressions and assignments are abstractions of computation. C-- provides a rich set of computational operators, but these operators work only on machine-level data types: bytes, words, etc. The expression abstraction hides the particular combination of machine instructions needed to compute values, and it hides the machine registers that may be needed to hold intermediate results.
C--’s goto and if statements are abstractions of control flow. (For convenience, C-- also providesstructured control-flow constructs.) The if abstraction hides the machine’s “condition codes;” branch conditions are arbitrary Boolean expressions.
C-- treats memory much as the machine does, except that addresses used in C-- programs may be arbitrary expressions. This abstraction hides the limitations of the machine’s addressing modes.
C-- variables are an abstraction of registers. A C-- back end puts as many variables as possible in registers; others go in memory. This abstraction hides the number and conventional uses of the machine’s registers.
In addition, C-- provides a procedure abstraction, the feature that looks least like an abstraction of a hardware primitive. However, many processor architectures provide direct support for procedures, although the nature of that support varies widely (procedure call or multiple register save instructions, register windows, link registers, branch prediction for return instructions, and so on). Because of this variety, calling conventions and activation-stack management are notoriously architecture dependent and hard to specify. C-- therefore offers prodedures as a primitive abstraction, albeit in a slightly unusual form (Section 3.4).

Our goal is to make it easy to retarget front ends, not to make every C-- program runnable everywhere." -- [11]

"C-- supports procedures that are both more and less general than C procedures - for example, C-- procedures offer multiple results and full tail calls, but they have a fixed number of arguments." -- [12]

"C-- supports a bare minimum of data types: a family of bits types (bits8, bits16, bits32, bits64), and a family of floating-pointtypes (float32, float64, float8)." -- [13]

"Memory accesses (loads and stores) are typed, and denoted with square brackets. Thus the statement: bits32[foo] = bits32[foo] + 1; loads a bits32 from the location whose address is in foo, adds one to it, and stores it at the same location." -- [14]

Links:

http://www.cs.tufts.edu/~nr/c--/index.html (or http://web.archive.org/web/20080822062234/http://www.cminusminus.org/ )
- http://www.cs.tufts.edu/~nr/c--/papers.html ("For an overview of the design")
- C--: a portable assembly language that supports garbage collection, by Simon Peyton Jones, Norman Ramsey, and Fermin Reig; 29 pages. Invited talk at PPDP, October 1999. "The paper to read if you're going to read only one. Reasonably thorough overview of the language and its run-time system, emphasizing garbage collection. Still, we have already changed a handful of details."
http://www.scs.stanford.edu/11au-cs240h/notes/ghc-slides.html
http://webcache.googleusercontent.com/search?q=cache:zDvHNEwVmusJ:www.scs.stanford.edu/11au-cs240h/notes/ghc-slides.html&client=ubuntu&hl=en&gl=us&strip=1
https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CmmType
https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Cmm
https://davidterei.com/downloads/papers/terei:2009:honours_thesis.pdf (section 2.4.3 page 13 PDF page 22, section 2.7 page 23 PDF page 32, and various places in section 3)
http://www.haskell.org/ghc/docs/latest/html/libraries/ghc/Cmm.html
C--: a portable assembly language that supports garbage collection
https://github.com/nrnrnr/qc--
https://en.wikipedia.org/wiki/C--
C--: A Portable Assembly Language (1997; Simon Peyton Jones, Thomas Nordin, Dino Oliva; 19 pages)

From https://ghc.haskell.org/trac/ghc/browser/ghc/compiler/cmm/CmmParse.y, grepping for string 'emit (mk' and by looking at the functions machOps and callishMachOps we can see some of the 'primitives' presumably emitted by Cmm:

branch return (note: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CmmType says "returns are only Cmm macros") rawJump jumpExtra Cbranch
add sub neg mul quot rem devu modu fadd fsub fneg fmul fquot
eq ne ge le gt lt geu leu gtu ltu feq fne fge fle fgt flt
and or xor com (not)
shl shrl shra
(conversions) lobits8 (UU W8) lobits16 (UU W16) lobits32 lobits64 zx16 (UU W16) zx32 zx64 sx16 (SS W16) sx32 sx64 f2f32 (FF W32) f2f64 f2i8 (FS W8) f2i16 (FS W16) f2i32 f2i64 i2f32 (SF W32) i2f64
write_barrier
memcpy memset memmove
prefetch0 prefetch1 prefetch2 prefetch3
popcnt8 popcnt16 popcnt32 popcnt64
cmpxchg8 cmpxchg16 cmpxchg32 cmpxchg64

https://ghc.haskell.org/trac/ghc/browser/ghc/compiler/cmm/CmmMachOp.hs also mentions:

clz ctz bswap atomicRMW(add sub and nand or xor
pwr sin cos tan sinh cosh tanh asin acos atan log exp sqrt pwr sin cos tan sinh cosh tanh asin acos atan log exp sqrt
insert extract (floating point vector element insertion and extraction)

" Cmm Statements implement:

no-op; the empty statement: ;
C-- (C99/C++ style) comments: ... \n and /* ... */
the assignment operator: =
store operation (assignment to a memory location): type[expr] =
control flow within procedures (goto) and between procedures (jump, returns) (note: returns are only Cmm macros)
foreign calls (foreign "C" ...) and calls to Cmm Primitive Operations (%)
procedure calls and tail calls
conditional statement (if ... { ... } else { ... })
tabled conditional (switch) " -- [15]

Cmm has labels and basic blocks.

https://mail.haskell.org/pipermail/ghc-devs/2014-September/006301.html says "C-- was originally envisaged as a target language for a variety of compilers. But in fact LLVM, which was developed at a similar time, "won" that race and has built a far larger ecosystem. That's fine with us -- it's great how successful LLVM has been -- but it means that C-- is now used essentially only in GHC."

Differences between Cmm and C-- include:

Cmm doesn't support C--'s runtime API
Cmm uses the C preprocessor [16]
In Cmm, foreign calls specify which registers need to be saved [17]
Cmm doesn't understand parenthetical list of arguments after procedure names, although its pretty-printer emits these [18]
Cmm doesn't support "procedures as section contents" [19]
Cmm doesn't support "const" [20]
In Cmm, "all-uppercase procedure invocations are special reserved tokens in compiler/cmm/CmmLex?.x and compiler/cmm/CmmParse?.y" [21]
Cmm doesn't support nested code blocks [22] (does C-- support this?)
"Although the C-- specification does not allow the C-- type system to statically distinguish between floats, signed ints, unsigned ints or pointers, Cmm does" [23]
Cmm uses "ptr" instead of "address" [24]
"Cmm does not recognise C-- "{ literal, ... }" initialisation syntax, such as bits32{10} or bits32[3] {1, 2, 3}. Cmm does recognise initialisation with a literal: string_name: bits8[] "twenty character string\n\0"; variable_num: bits32 10::bits32;" [25]
"C-- also allows the * (multiplication) operator in addressing expressions, for an approximation of scaled addressing ([base * (2^n)]); for example, n (the "scale") must be 0, 1, 2 or 4. C-- itself would not enforce alignment or limits on the scale." [26]
"While the C-- specification allows a bare data keyword directive, Cmm does not: this is valid C--, not Cmm! data { } all Cmm directives use this syntax: section [Cmm section directive] { }" [27]
"Cmm does not recognise the C-- "stack" declaration for allocating memory on the system stack." [28]
"Cmm does not support the target directive. This is partly due GHC generally lacking cross-compiler capabilities." [29]
quasi-operators. "The syntax for these quasi-operators is in some cases similar to syntax for Cmm statements and generally conform to the C-- specification, sections 3.3.2 (expr) and 7.4.1 (syntax of primitive operators), except that 3. and, by the equivalence of the two, 1. may return multiple arguments. In Cmm, quasi-operators may have side effects. The syntax for quasi-operators may be:
- expr0 `name` expr0 (just like infix-functions in Haskell)
- type[ expression ] (the memory access quasi-expression described in Memory Access; the Haskell representation of this syntax is CmmLoad? CmmExpr? MachRep?)
- %name( exprs0 ) (standard prefix form, similar to C-- statement syntax for procedures but with the distinguishing prefix %; in Cmm this is also used as statement syntax for calls, which are really built-in procedures, see Cmm Calls)" [30]
"Cmm does not implement the C-- specification for Spans (sec. 6.1) or Continuations (sec. 6.7)." [31]
"Although Cmm supports primitive operations that may have side effects (see Primitive Operations, below), it does not parse the syntax %% form mentioned in section 6.3 of the C-- specification. Use the %name(arg1,arg2) expression-syntax instead. " [32]
"Cmm does not implement the return statement (C-- spec, sec. 6.8.2) but provides a set of macros that return a list of tuples of a CgRep? and a CmmExpr?: [(CgRep?,CmmExpr?)]. " [33]
"...Cmm does not implement any of the control flow statements of C-- specification (section 6.8.1)..." [34]
but Cmm does include "both calls to foreign functions and calls to Cmm quasi-operators using expression syntax (see Quasi-operator Syntax). " [35]
Cmm may not implement all of the 75 primitive operators of C--? [36]

Pillar

In some ways similar to C--. Used by the Functional Language Research Compiler (FLRC) / Haskell Research Compiler (HRC).

" Pillar’s concurrency features include constructs for threading, synchronization, and explicit data-parallel operations. The threading constructs focus on creating new threads only when hardware resources are idle, and otherwise executing parallel work within existing threads, thus minimizing thread creation overhead. In addition to the usual synchronization constructs, Pillar includes transactional memory. Its sequential features include stack walking, second-class continuations, support for precise garbage collection, tail calls, and seamless integration of Pillar and legacy code.

...

successful languages...will share key features: language constructs that allow easy extraction of high levels of concurrency, a highly-scalable runtime that efficiently maps concurrenc y onto available hardware resources, a rich set of synchronization constructs like futures and transactions, and managed features from modern languages such as garbage collection and exceptions

...

We call this runtime the language-specific runtime (LSR) to distinguish it from the Pillar runtime. The LSR could be written in Pillar and make use of Pillar constructs, or could be written in a language such as C traditionally used for runtime implementation.

...

The LSR can also make use of Pillar’s runtime that, in addition to supporting the Pillar implementation, provides a set of services for high-level languages such as stack walking and garbage collection (GC) support. The Pillar runtime is layered on top of McRT?, the Multi-Core RunTime? [3], which provides scheduling, synchronization, and software transactional memory services.

...

The Pillar compiler produces both object code and associated metadata. This metadata is used by the Pil- lar runtime to provide services such as stack walking and root-set enumeration, and because of it, the code is said to be managed . (Pillar also supports integration with non-Pillar code, such as legacy code, which is said to be unmanaged .) The Pillar compiler controls the metadata format, and provides its own metadata decoder library to interpret it to the Pillar runtime. The metadata and decoder are also linked into the executable.

...

The Pillar language has several key design principles. First, it is a compiler target language, with the goal of mapping any parallel language onto Pillar while maintaining that language’s semantics. As such, Pillar cannot include features that vary across high- level languages, like object models and type-safety rules. C++, C#, and Java, for ex- ample, are too high-level to be effective target languages, as their object models and type-safety rules are not appropriate for many languages. Therefore, of necessity, Pillar is a fairly low-level language. Although most Pillar programs will be automatically generated, expert programmers must be able to directly create Pillar programs. As a result, assembly and bytecode languages are too low-level since they are difficult even for experts to use. Although inspired by C-- [4, 5], we decided to define Pillar as a set of extensions to C because then we could utilize existing optimizing C compilers to get quality implementations of Pillar quickly.

Since the Pillar language is based on C, type safety properties of the source paral- lel language must be enforced by the high-level converter. For example, array bounds checks might be implemented in Pillar using a combination of explicit tests and conditional branches. Similarly, null-dereference checks, divide-by-zero checks, enforcing data privacy, and restricting undesired data accesses must be done at a level above the Pillar language by the high-level converter. One notable exception is that we are work- ing on annotations to express immutability and disambiguation of memory references.

Second, Pillar must provide support for key sequential features of modern programming languages. Examples include garbage collection (specifically, the ability to identify live roots on stack frames), stack walking (e.g., for exception propagation), proper tail calls (important when compiling functional languages ), second-class continuations (e.g., for exception propagation and backtracking), and the ability to make calls between managed Pillar code and unmanaged legacy code

Third, Pillar must also support key concurrency features of parallel languages, such as parallel thread creation, transactions, data-parallel operations, and futures. Fig. 2 summarizes the syntax of the Pillar features added to the C language. These features are described in the following sections.

Fig. 2. Pillar syntactic elements:

Sequential constructs:

Second-class continuations
- continuation k(a, b, c):
- cut to k(x, y, z);
Alternate control flow
- foo() also cuts to k1, k2;
- foo() also unwinds to k3, k4;
- foo() never returns;
Tail call
- tailcall foo();
Spans
- span TAG value {...}
Virtual stack and destructors
- VSE(k) { ... continuation k(target): ... cut to target;}
GC references
- ref obj;
Managed/unmanaged calls
- #pragma managed(off); #include <stdio.h>; #pragma managed(on); ...; printf(...)

Concurrency constructs:

Pcall
- pcall(aff) foo(a, b, c);
Prscall
- prscall(aff) foo(a, b, c);
Futures
- fcall(aff, &st) foo(a, b, c);
- ftouch(&st);
- fwait(&st);
Transactions
- TRANSACTION(k) {... continuation k(reason): if (reason==RETRY) ... else if (reason==ABORT) ...}

2.1 Sequential features

Second-class continuations: This mechanism is used to jump back to a point in an older stack frame and discard intervening stack frames, similar to C’s setjmp/longjmp mechanism. The point in the older stack frame is called a continuation, and is declared by the continuation keyword; the jumping operation is called a cut and allows multiple arguments to be passed to the target continuation. For any function call in Pillar, if the target function might ultimately cut to some continuation defined in the calling function rather than returning normally, then the function call must be annotated with all such continuations (these can be thought of as all alternate return points) so that the compiler can insert additional control flow edges to keep optimizations safe.

Virtual stack elements: A VSE (virtual stack element) declaration associates a cleanup task with a block of code. The “virtual stack” terminology is explained in Section 5.2. This cleanup task is executed whenever a cut attempts to jump out of the region of code associated with the VSE. This mechanism solves a problem with traditional stack cutting (such as in C-- ) where cuts do not compose well with many other operations. For example, suppose that code executing within a transaction cuts to some stack frame outside the transaction. The underlying transactional memory system would not get notified and this is sure to cause problems during subsequent execution. By using a VSE per transaction, the transactional memory system in Pillar is notified when a cut attempts to bypass it and can run code to abort or restart the transaction. Since cuts in Pillar compose well with all the features of Pillar, we call them composable cuts .

Stack walking: The Pillar language itself has no keywords for stack walking , but the Pillar runtime provides an interface for iterating over the stack frames of a particular thread. Pillar has the also unwinds to annotation on a function call for providing a list of continuations that can be accessed during a stack walk. This is useful for implementing exception propagation using stack walking, as is typical in C++, Java, and C# implementations.

Spans: Spans are a mechanism for associating specific metadata with call sites within a syntactic region of code, which can be looked up during stack walking.

Root-set enumeration: Pillar adds a primitive type called ref that is used for declaring local variables that should be reported as roots to the garbage collector. During stack walking these roots can be enumerated. The ref type may also contain additional parameters that describe how the garbage collector should treat the reference: e.g., as a direct object pointer versus an interior pointer, as a weak root, or as a tagged union that conditionally contains a root. These parameters have meaning only to the garbage collector, and are not interpreted by Pillar or its runtime. If refs escape to unmanaged code, they must be wrapped and enumerated specially, similar to what is done in Java for JNI object handles.

Tail calls: The tailcall keyword before a function call specifies a proper tail call: the current stack frame is destroyed and replaced with the callee’s new frame.

Calls between managed and unmanaged code: All Pillar function declarations are implicitly tagged with the pillar attribute. The Pillar compiler also understands a special pragma that suppresses the pillar attribute on function declarations; this pragma is used when including standard C header files or defining non-Pillar functions. (footnote: One particularly pleasing outcome of this syntax is that managed Pillar code and unmanaged C code can coexist within the same source files.). Calling conventions and other interfacing depend on the presence or abs ence of the pillar attribute in both the caller and callee, and the Pillar compiler generates calls accordingly.

Note that spans, second-class continuations, and stack walking are C-- constructs and are described in more detail in the C-- specification (Ramsey, N., Peyton Jones, S., Lindig, C.: The C-- language specification, version 2.0. Available at http://cminusminus.org/papers.html (February 2005)).

2.2 Concurrency features

Pillar currently provides three mechanisms for creating new logical threads: pcall , prscall , and fcall .

((pcall (spawn):)) Adding the pcall keyword in front of a call to a function with a void return type creates a new child thread, whose entry point is the target function. Execution in the original parent thread continues immediately with the statement following the pcall . Any synchronization or transfer of results between the two threads should use global variables or parameters passed to the pcall target function.

((prscall (mb spawn):)) The prscall keyword is semantically identical to pcall , but implements a parallel-ready sequential call [7]. Prscalls allow programs to specify potential parallelism without incurring the overhead of spawning parallel threads if all processors are already busy. A prscall initially starts running the child thread as a sequential call (the parent is suspended). However, if a processor becomes free, it can start executing the parent in parallel with the child. Thus, prscalls are nearly as cheap as normal procedure calls, but take advantage of free processors when they become available.

((fcall (future):)) The fcall construct can be used to parallelize programs that have certain serial- izable semantics. The fcall annotation indicates that the call may be executed concurrently with its continuation, while allowing the call to be eagerly or lazily serialized if the compiler or runtime deems it unprofitable to parallelize it. The st parameter to the fcall is a synchronization variable, called a future, that indicates the status of the call: empty indicates that the call has not yet been started, busy indicates that the call is currently being computed, and full indicates that the call has completed. Two forcing operations are provided for futures: ftouch and fwait . If the future is full, both return immediately; if the future is empty, both cause the call to be run sequentially in the forcing thread; if the future is busy, fwait blocks until the call completes while ftouch returns immediately. The serializability requirement holds if, for each future, its first ftouch or fwait can be safely replaced by a call to the future’s target function. Both prscall and fcall are geared toward an execution environment where there is a great deal of available fine-grain concurrency, with the expectation that the vast majority of calls can be executed sequentially within their parents’ context instead of creating and destroying a separate thread.

These three keywords take an additional affinity parameter [8] that helps the scheduler place related threads close to each other to, e.g., improve memory locality.

Pillar provides support for transactions. A syntactic block of code is marked as a transaction, and transaction blocks may be nested. Within the transaction block, transactional memory accesses are specially annotated, and a continuation is specified as the “handler” for those situations where the underlying transactional memory system needs the program to respond to situations like a data conflict or a user retry.

The concurrency constructs described so far relate to light weight thread-level parallelism. To support data parallelism, we intend to add Ct primitives (Ghuloum, A., Sprangle, E., Fang, J.: Flexible Parallel Programming for Tera-scale Architectures with Ct (2007) Available at http://www.intel.com/research/platform/terascale/TeraScale_whitepaper.pdf ((note: now at https://software.intel.com/sites/default/files/2d/cf/25739 )) ) to Pillar. These primitives express a variety of nested data-parallel operations, and their semantics allow the compiler to combine and optimize multiple such operations.

...

Our experience shows that cooperative preemption offers several performance advantages over traditional preemptive scheduling in multi-core platforms (Saha, B., Adl-Tabatabai, A., Ghuloum, A., Rajagopalan, M ., Hudson, R., Petersen, L., Menon, V., Murphy, B., Shpeisman, T., Sprangle, E., Rohilla h, A., Carmean, D., Fang, J.: Enabling Scalability and Performance in a Large Scale CMP En vironment. EuroSys? (March 2007)). The Pillar compiler is expected to generate a yield check at each method entry and backward branch, a well-known technique that ensures yielding within a bounded time.

" -- [37]

Links:

Ocaml's VM

" > What virtual machine would be good to target if I need to implement typical things such tail call optimizations and closures? If you want TCO and closures, I'd seriously consider targeting ocaml's VM, or even just embedding your language via camlp4 LtU? link

" -- http://lambda-the-ultimate.org/node/1617#comment-19637

Camlp4 Tutorial