Table of Contents for Programming Languages: a survey

Chapter : targets, IRs, VMs and runtimes

note: HLL is an abbreviation for High Level Language.

IR is for Intermediate Representation.

The core language of a VM is usually called 'bytecode'. The instructions in this language are called 'instructions' or 'operations' or 'ops' or 'opcodes' or 'bytecodes'.


maximizing code density for embedded systems


register-based vs stack-based VMs

"It’s a bit easier to build a relatively efficient interpreter with a register-based machine...Having now done the code generation work for register and stack machines...Mostly, I found the register-based code generation a bit easier to do...on the JVM, the stack must be empty either side of a try/catch construct. So you have to do some work to spill the stack into locals if there’s anything already on it....I haven’t really come out of all this thinking register machines are overwhelmingly better than stack machines, though – but they are a bit easier on the code generation, which is nice. If you plan to put your call frames on the system stack, the stack approach seems more attractive from that angle. Once you decide you’re not doing that, you could go either way." --

Shi, Yunhe; Gregg, David; Beatty, Andrew; Ertl, M. Anton (2005-06-11). "Virtual Machine Showdown: Stack Versus Registers".

"And never forget that stack machines are superior to register machines when it comes to code density!" --

examples of stack-based VMs:

"if you look at the VMs that do successfully host multiple languages today, they’ve typically started by hosting one language really well." --

i guess the reason that writing instructions in stack-mode is so concise is that it's really just a useful shorthand for how the outputs of earlier instructions are plugged into the inputs of later ones, and that in the common case (executing a bunch of sequential instructions whose inputs depend on the output of the previous instruction) it's concise. when you get far away from this common case is when you get lots of ROTs and DUPs, in which case maybe registers are better.

another way of saying that: when you are computing expressions, you end up recursively computing subexpressions and then plugging the result into a larger expression. E.g. to compute 0.5*(f(x,y,u+v) + z), you compute z, then you compute u+v, then f(x,y,u+v), then f(x,y,u+v)+z, then 0.5*f(x,y,u+v)+z. You don't save usually save these intermediate results to use twice; e.g. if you compute 0.5*(2*f(x) + 2*f(x)) you might actually compute f(x) twice (or you may not, if you memoize). This is perfect for a stack machine. Whereas, if you did y = 2*f(x) and then 0.5*(y + y), that's more easily done on a register machine. The point is, for the best-case scenario on a stack machine, each intermediate result is used exactly once and is consumed immediately, except when other operands must first be computed as further inputs to the function that is consuming it, in which case it is consumed as soon as all of its coinputs are produced. If each function further produces exactly one output, then that output represents a subexpression and the stack machine represents the recursive computation of an expression.

Registers are for multi output, stack for single (== expressions)

A Performance Survey on Stack-based and Register-based VirtualMachines (finds that register machines are more efficient; the register machine they used appears to be this one, which appears to have 4 registers: )

another argument for register VMs is the simpler de-optimization of JIT compilers: " Modern JIT compilers use a technique called type feedback compilation. Briefly, the idea is that a compiler that is integrated into the runtime of the system can exploit information on how the program is actually used to compile more efficient code than would be possible on the basis of the program source code alone. A simple example in javascript would be the following routine:

function foo(a) { var r = 0; for (var i = 1; i < a.length; i++) { r += (a[i] * a[i-1]); } return r; }


If all calls to foo happen to have a single argument consisting of an array of integers, the semantics of this routine become much simpler than they are otherwise. (For example, in javascript, the addition of a number and a string produces a well-defined result, so it is totally valid to call foo with an array of strings). A type-feedback compiler might notice a large number of calls to foo, all with integer arrays as their sole argument, assume this will always be so, and compile a much faster routine. In order to correctly handle arrays of strings too, the compiler inserts a 'guard clause' that checks if a is really an array of integers. If not, the routine must be 'de-optimised'. Note that spesh, which is the optimisation framework for MoarVM?, also works this way

The goal of de-optimisation is to resume the execution of the interpreted (slow) routine where the assumptions of the compiled routine have failed. A typical place in our 'foo' function would be on entry or on the addition to r. The idea is that the values that are calculated in the optimised routine are copied to the locations of the values of the interpreted routine. In a register machine, this is conceptually simple because all variables already have a fixed location. However, the layout of the stack in a stack vm is dynamic and changes with the execution of the routine, and mapping between compiled and interpreted values may not be very simple at all. It is certainly doable - after all, the JVM famously has an efficient optimising JIT compiler - but not simple. " --

a discussion on this is


How many registers?

" If you only count real GPRs in MIPS like that then in ARM32 there's only 13 registers (R0-R12, minus SP, LR and PC) and x86 has 7 (minus SP, when omitting frame pointers). x86_64 has 15 and ARM64 has 31 – phuclv Jan 23 '15 at 6:22 "

"The trade off is that to represent more registers you need more bits in your instruction, and you need more room on the chip for the register file, which increases power requirements"

also you need to store more state for context switches, which takes more time and more memory

" You can see how differing register counts affects code size and the frequency of load/store instructions by compiling the same set of code with different numbers of registers. The result of that type of exercise can be seen in table 1 of this paper:

Extendable Instruction Set Computing

Register Program Load/Store Count Size Frequency

27 100.00 27.90% 16 101.62 30.22% 8 114.76 44.45% " -- Bradley McKinley

good read:

Potential differences with register machines in VMs vs bare metal


VMs: pros and cons

Why do some languages target virtual machines? Why add another layer that isn't strictly necessary?


Additional pros if you target an existing virtual machine:



Information Losses in bytecode IRs

Comment by Raphael Amiard on the efficiency problems with taking a pre-existing bytecode IR and compiling it, as opposed to compiling from the HLL's AST directly to a target without going thru a pre-existing bytecode IR (after the experience of writing a compiler for OCaml ZAM bytecode (the Z3 project)):

Why no polymorphism in bytecode?

In bytecode instruction sets, often you see a bunch of different instructions that do similar things on different types. For example, add signed integers, add unsigned integers, add 32-bit floating points, add 64-bit floating points. You might think, gee, couldn't the darn bytecodes give us a higher level of abstraction than that? Couldn't there just be one opcode for ADD, and the VM figures out which addition routine is required based on the type?

Here's why that is not often done. If at compile time the types of everything were known, and then at runtime we encounter an instruction to add (for example) register 0 to register 1, we don't want to have to lookup the types of register 0 and register 1 to see whether we should jump to the integer addition routine, the floating point addition routine, etc, because that takes significant extra time.

So, why not just have a single ADD opcode, but also encode the statically-known types in a separate 'type' field in the instruction? Well, for one, this field would have to have enough bits to encode all types that you would want to distinguish, so if the bytecode supports many types, this could substantially reduce code density.

Well, we have two ways to do that. First, at runtime we could look up the implementation to use in a table that is indexed by (opcode, type). But this table would be huge, and wasteful; not all opcodes can be used with all types (or are handled differently with all types), so this table would be much larger than it needs to be. We could store the table sparsely or as a hash, but then lookup takes a long time, and this is something that we have to do upon every instruction so that would be bad.

The second way would be to pass the type given by the instruction's bytecode to the implementation of the instruction, and let the implementation conditionally branch to the subimplementation for that type. The problem with this is that now we're doing at least two branches for polymorphic instructions at runtime, first a branch to the opcode's handler, and then a branch to the implementation for the type that we have. This runtime branching is not needed, since for values whose type was statically known, we could have figured out where we would have ended up at compile time.

So, the most efficient thing to do is to simply have a separate opcode for each implementation routine; eg one opcode for integer addition, another opcode for floating point addition, etc.

Common constraints on VMs for efficiency

A VM does not need to obey these constraints in order to fit the definition of a VM, but they are often seen for reasons of efficiency:

How intermediate?

Leans up vs. leans down

should they be designed to be similar to hardware CPU instruction set architecture (ISA)? e.g. most extant VMs are either stack machines or register machines; but why not just memory-to-memory?

simple vs. complex

data model, evaluation strategy of HLL, sometimes object model

two layers of VMs

interoperable vms expand into toolchains

Impedence mismatches when targetting an existing VM


" I learned my own lessons about the “repurposed JIT phenomenon” the hard way. Before any work started on HHVM, I built a hackathon prototype that did naive compilation from PHP to JavaScript?. Running the output on V8 produced some really promising results at first; programming-language shootout[2] style benchmarks looked in the neighborhood of 10x the standard PHP interpreter.

The problems came when I tried to run real PHP programs. The primitive operations that PHP exposes, and that large PHP codebases rely on, are simply different from those exposed by JavaScript?, Java, the CLR, etc. For example, PHP engines expose their use of reference counting to the PHP program, while JavaScript? and Java implementations typically use some sort of tracing garbage collection (and even if they use reference counting, don’t expose it to programs). So, in order to faithfully run PHP programs on a JVM or JavaScript? engine, you need to reimplement all of PHP’s memory management, either in the source language or in the runtime itself. So right out of the gate, you lose one of the big pieces of effort you were hoping to save by recycling someone else’s engine: memory management.

PHP also has a notion of “references” that doesn’t have a direct analog in JavaScript?; passing parameters by reference is unmarked in the call site, so you need to do method resolution (“what am I actually calling at runtime?”) at the call site before you can even pass parameters; PHP has value semantics for its array datatype, while JavaScript? and Java provide reference semantics for containers; PHP’s numerics are very different from JavaScript’s?; PHP has to do protected/private/public method privacy checks at runtime; no, JVM fans, InvokeDynamic? does not magically make all of these mismatches disappear; and on and on. Once you start diving into the nuances of the low-level primitives, almost every language is different enough that it is hard to efficiently implement one in terms of the other. " --

" Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for optimizing Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that. An example of needing to apply high-level knowledge to code generation is optimizing the Python stack access. LLVM will not fold loads from the Python stack across calls to external functions (ie the CPython runtime, so all the time). We eventually wrote an alias analysis to solve this problem, but it's an example of what you have to do if you don't roll your own code generator. LLVM also comes with other constraints. For example, LLVM doesn't really support back-patching, which PyPy? uses for fixing up their guard side exits. It's a fairly large dependency with high memory usage, but I would argue that based on the work Steven Noonan did for his GSOC that it could be reduced, especially considering that PyPy?'s memory usage had been higher. " --


Author Profile Page chromatic replied to comment from Gerhard R.

January 4, 2013 10:23 AM

I think you're falling into the "JIT makes things fast" fallacy. You can see how well that works for languages which use LLVM--if they look like C or C++, they do pretty well. Otherwise they don't.

If you want your language to go really fast, your runtime needs to know a lot about the semantics of your language. How does memory allocation work? What's the layout of primitives? What precision and signedness do your primitives support? Do you support aliasing? What constness guarantees do you provide? What immutability guarantees do you provide? When can you resolve polymorphic dispatch? What boxing semantics do you provide? How does control flow work, and can you resolve questions about object escapes and lifetimes? What runtime support do you need to enable metaprogramming? When can you make these decisions?

I've often said that to support Perl 6 you need a virtual machine that looks a lot like Parrot (not that Parrot does everything right, but that it at least tries to make these things possible). I still suspect that mythical VM abstraction layer will go the way of the DLR. Effectively what NQP on the JVM or CLR will look like is a reimplementation of all of the C code in Rakudo right now using the underlying VM to provide very basic memory management and some platform access.

The result might run faster than current Rakudo, but it'll be so far away from the semantics of Java or C# that I doubt it'll run with the amazing speed people hope except on a few carefully chosen benchmarks that avoid the interesting language features that might make Perl 6 worth using.

(See also: JavaScript?'s terrible numeric support system and why JS JITs go to great length to perform calculations efficiently.)

" --

" Besides the well-known issues of tail calls and continuations, VMs often include a runtime type system, garbage collection, calling functions with arguments, objects with fields, strings, dictionaries, typesafe arithmetic etc. It is unlikely that these operations match the semantics of an independently developed language. The type system will most likely not match at all. There are tons of problematic areas:

    How are functions with a statically unknown arity handled?
    Can objects other than functions be applied?
    How are object fields identified? Are they accessed when knowing the object type or not, possibly including subtyping? Are fields identified by strings or symbols, or indices computed at compile time? Is the set of fields known at object creation time, or is the semantics of field access resolved dynamically?
    What is the nature of characters, elements of strings? Unicode code points? UTF-16 code units? Bytes? Are they distinguished from 1-character strings? Are strings mutable, and can they change their length?
    How is hashing and equality used by a hash table specified?
    What is the semantics of weak references and finalizers? Is a finalizer specified when its associated object is created, or can it be added at any time to any object? May a finalizer refer to the object being finalized, or to other finalizable objects? Are finalizers run by other threads?
    What numeric types are available? What is the effect of overflowing fixed-size integers: returning bignums? an exception? wrapping modulo integer size? What is the semantics of integer division and remainder of negative numbers? What happens when you try to add objects which are not of a builtin numeric type?
    Which exceptions are thrown from particular erroneous situations? What happens when an array is accessed outside of bounds? What happens on integer and floating point division by zero? Are exceptions resumable? Is exception handling code in a tail position with respect to try?" --

" Language-independent VM

Here's the big problem with high-level language independent VMs: language designers invariably want to do something new. By definition, this is something you haven't thought of. Want lazyness? Out of luck. Continuations? Nope. First-class stack frames? Ditto. Fast multimethod dispatch? Well, you can hack it up, but most likely it'd be easier to just write a new VM.

In my case, I want to do a language with Erlang-style lightweight concurrency and message passing. This has to be built into the VM at a low level, because it affects everything from IO to heap allocation to stack representation to scheduling. I looked at Parrot, Neko, C--, and LLVM, but none of them offer much in concurrency support. The only halfway-suitable VM was, not surprisingly, Erlang itself, but I can't find any documentation on the bytecode format and the source is semi-closed for modifications (and makes what are IMHO poor choices in string represenation, making it unsuitable for my purposes). I'll probably end up starting with LLVM and then writing an instruction-compatible custom VM once I get to the concurrency parts. " --

supporting platform-dependent optimized primitives without bloating the VM language

Different platforms offer different optimized primitives. If the VM's language defined special facilities for the union of all of these, it would be very large indeed; this would also necessitate relatively large implementations of the VM because on each platform it would have to implement each of te defined facilities for . But if it does not define special facilities for some of these, then naive compilers targeting the VM language will transform operations that could be performed by optimized primitives into lengthly sequences of VM instructions, making it hard for a platform-specific runtime to figure out where the optimized primitives could be inserted.

What are some ideas for offering access to platform-dependent optimized primitives without bloating the VM language?

optimizer recognizes implementations of optimized primitives

A platform-specific optimizer, which could be part of a compiler or a runtime which is fed bytecode input, would try to recognize naive implementations of what on this platform could be rewritten in terms of optimized primitives, and rewrite them.

C. Guy Yarvin calls this "jets" at


Sections of the bytecode could support annotation, and then in the future annotations could be defined to say essentially, "this section is just doing X, if you have an optimized primitive for X please replace this section with that".

bytecode vs executing the AST

(todo: merge this with section "How intermediate?", above)

Interpreters are slower than compilers but there's a spectrum of types of interpreters.

An interpreter could actually re-parse the source code for each line of code each time it encounters it at runtime.

An interpreter pre-parse the source code once, and then interpret the AST, traveling from AST node to AST node instead of from source code line to source code line.

An interpreter pre-parse the source code once, and annotate the AST with "next" pointers that say, "once you're done with this node, here is the node to visit next". Perl5 takes this route [1]. This allows the interpreter to take a 'linear' path through the AST, even though the AST isn't being stored in linear form.

An interpreter could parse the source code, generate the AST, and compile the AST to a linear stream of bytecode. Java and Python take this route.

In 'bytecode' representations, you generally try not to defer something to runtime that you could just as easily do at compile time (examples: (a) resolve polymorphism which can be resolved statically; (b) for each instruction, figure out what the next instruction(s) are).

Orthogonal to this, an interpreter could JIT, that is, while interpreting the code also incrementally compile parts of it to native code (perhaps eventually compiling all of it, perhaps not).


todo read this, mb put it here or mb in the computed goto section:

note: some platforms don't allow the execution of dynamically written code (e.g. Apple iOS) (umm, i think this is totally out of date now)

" Apple forbids the inclusion of mechanisms that enable execution of arbitrary third-party code. This rule effectively bars the use of nonstandard runtimes, JIT engines, and bytecode interpreters " --

" 3.3.2 — An Application may not itself install or launch other executable code by any means, including without limitation through the use of a plug-in architecture, calling other frameworks, other APIs or otherwise. No interpreted code may be downloaded or used in an Application except for code that is interpreted and run by Apple’s Documented APIs and built-in interpreter(s). "

" 2.7 Apps that download code in any way or form will be rejected

    2.8 Apps that install or launch other executable code will be rejected"


See also [2].





SEAM is a virtual machine framework that provides a component model for various services and leaves it up to the user to implement a 'execution unit', rather than defining an intermediate language, because "it is next to impossible to design a single intermediate language to accomodate all programming languages" [3]. Alice ML ('Mozart/Oz meets ML') and JVM have been implemented on SEAM [4].


There were reasons for developing Seam

The usual requirements of FPLs aside, the design of Alice relies on two central services that have to be provided by the VM:

    Pickling: a universal marshalling mechanism that includes code and supports configurable import/export transformations is absolutely crucial to the whole framework. Compiled components are actually pickles, and all distribution uses pickling. Basically, pickling has to be implemented on the same level as GC.
    Transients: are the generic, low-level representation of futures and support implicit data-flow synchronisation. Concurrency, laziness, dynamic linking, and just-in-time compilation all rely on this mechanism. For efficiency, VM and GC must be aware of transients and encapsulate them as far as possible. Transients also interfer with pickling.

Unfortunately, AFAIK, no existing VM or VM framework (besides Mozart) supports either of these features appropriately. LLVM wasn't even available when development on Seam started (in 2001). By Andreas Rossberg at Tue, 2004-12-28 18:12

login or register to post comments

" [5]

Some components of SEAM include:


GHC Cmm (similar to C--)

Cmm is an IL used in in Haskell's GHC. It was based on C-- (Cminusminus) which was more general, but the C-- project seems to have ended.

Example from [8]: Three functions that compute the cumulative sum from 1 to n, and the cumulative product from 1 to n:

/* Ordinary recursion */
export sp1;
sp1( bits32 n ) {
  bits32 s, p;
  if n == 1 {
    return( 1, 1 );
  } else {
    s, p = sp1( n-1 );
    return( s+n, p*n );

/* Tail recursion */
export sp2;
sp2( bits32 n ) {
  jump sp2_help( n, 1, 1 );

sp2_help( bits32 n, bits32 s, bits32 p ) {
  if n==1 {
    return( s, p );
  } else {
    jump sp2_help( n-1, s+n, p*n )

/* Loops */
export sp3;
sp3( bits32 n ) {
  bits32 s, p;
  s = 1; p = 1;
  if n==1 {
    return( s, p );
  } else {
    s = s+n;
    p = p*n;
    n = n-1;
    goto loop;


Why not target a high-level VM such as the JVM?

"One alternative is to make all these high-level services part of the abstraction offered by the portable assembler. For example, the Java Virtual Machine.... But a sophisticated platform like a virtual machine embodies too many design decisions. For a start, the semantics of the virtual machine may not match the semantics of the language being compiled (e.g., the exception semantics). Even if the semantics happen to match, the engineering tradeoffs may differ dramatically. For example, functional languages like Haskell or Scheme allocate like crazy (Diwan, Tarditi, and Moss 1993), and JVM implementations are typically not optimised for this case. Finally, a virtual machine typically comes complete with a very large infrastructure — class loaders, verifiers and the like — that may well be inappropriate. Our intended level of abstraction is much, much lower." -- [9]

Why not target C?


So much for fundamental issues. C also lacks the ability to control a number of important low-level features, including returning multiple values in registers from a procedure, mis-aligned memory accesses, arithmetic, data layout, and omitting range checks on multi-way jumps. " -- [10]



Our goal is to make it easy to retarget front ends, not to make every C-- program runnable everywhere." -- [11]

"C-- supports procedures that are both more and less general than C procedures - for example, C-- procedures offer multiple results and full tail calls, but they have a fixed number of arguments." -- [12]

"C-- supports a bare minimum of data types: a family of bits types (bits8, bits16, bits32, bits64), and a family of floating-pointtypes (float32, float64, float8)." -- [13]

"Memory accesses (loads and stores) are typed, and denoted with square brackets. Thus the statement: bits32[foo] = bits32[foo] + 1; loads a bits32 from the location whose address is in foo, adds one to it, and stores it at the same location." -- [14]


From, grepping for string 'emit (mk' and by looking at the functions machOps and callishMachOps we can see some of the 'primitives' presumably emitted by Cmm: also mentions:

" Cmm Statements implement:

Cmm has labels and basic blocks. says "C-- was originally envisaged as a target language for a variety of compilers. But in fact LLVM, which was developed at a similar time, "won" that race and has built a far larger ecosystem. That's fine with us -- it's great how successful LLVM has been -- but it means that C-- is now used essentially only in GHC."

Differences between Cmm and C-- include:


In some ways similar to C--. Used by the Functional Language Research Compiler (FLRC) / Haskell Research Compiler (HRC).

" Pillar’s concurrency features include constructs for threading, synchronization, and explicit data-parallel operations. The threading constructs focus on creating new threads only when hardware resources are idle, and otherwise executing parallel work within existing threads, thus minimizing thread creation overhead. In addition to the usual synchronization constructs, Pillar includes transactional memory. Its sequential features include stack walking, second-class continuations, support for precise garbage collection, tail calls, and seamless integration of Pillar and legacy code.


successful languages...will share key features: language constructs that allow easy extraction of high levels of concurrency, a highly-scalable runtime that efficiently maps concurrenc y onto available hardware resources, a rich set of synchronization constructs like futures and transactions, and managed features from modern languages such as garbage collection and exceptions


We call this runtime the language-specific runtime (LSR) to distinguish it from the Pillar runtime. The LSR could be written in Pillar and make use of Pillar constructs, or could be written in a language such as C traditionally used for runtime implementation.


The LSR can also make use of Pillar’s runtime that, in addition to supporting the Pillar implementation, provides a set of services for high-level languages such as stack walking and garbage collection (GC) support. The Pillar runtime is layered on top of McRT?, the Multi-Core RunTime? [3], which provides scheduling, synchronization, and software transactional memory services.


The Pillar compiler produces both object code and associated metadata. This metadata is used by the Pil- lar runtime to provide services such as stack walking and root-set enumeration, and because of it, the code is said to be managed . (Pillar also supports integration with non-Pillar code, such as legacy code, which is said to be unmanaged .) The Pillar compiler controls the metadata format, and provides its own metadata decoder library to interpret it to the Pillar runtime. The metadata and decoder are also linked into the executable.


The Pillar language has several key design principles. First, it is a compiler target language, with the goal of mapping any parallel language onto Pillar while maintaining that language’s semantics. As such, Pillar cannot include features that vary across high- level languages, like object models and type-safety rules. C++, C#, and Java, for ex- ample, are too high-level to be effective target languages, as their object models and type-safety rules are not appropriate for many languages. Therefore, of necessity, Pillar is a fairly low-level language. Although most Pillar programs will be automatically generated, expert programmers must be able to directly create Pillar programs. As a result, assembly and bytecode languages are too low-level since they are difficult even for experts to use. Although inspired by C-- [4, 5], we decided to define Pillar as a set of extensions to C because then we could utilize existing optimizing C compilers to get quality implementations of Pillar quickly.

Since the Pillar language is based on C, type safety properties of the source paral- lel language must be enforced by the high-level converter. For example, array bounds checks might be implemented in Pillar using a combination of explicit tests and conditional branches. Similarly, null-dereference checks, divide-by-zero checks, enforcing data privacy, and restricting undesired data accesses must be done at a level above the Pillar language by the high-level converter. One notable exception is that we are work- ing on annotations to express immutability and disambiguation of memory references.

Second, Pillar must provide support for key sequential features of modern programming languages. Examples include garbage collection (specifically, the ability to identify live roots on stack frames), stack walking (e.g., for exception propagation), proper tail calls (important when compiling functional languages ), second-class continuations (e.g., for exception propagation and backtracking), and the ability to make calls between managed Pillar code and unmanaged legacy code

Third, Pillar must also support key concurrency features of parallel languages, such as parallel thread creation, transactions, data-parallel operations, and futures. Fig. 2 summarizes the syntax of the Pillar features added to the C language. These features are described in the following sections.

Fig. 2. Pillar syntactic elements:

Sequential constructs:

Concurrency constructs:

2.1 Sequential features

Second-class continuations: This mechanism is used to jump back to a point in an older stack frame and discard intervening stack frames, similar to C’s setjmp/longjmp mechanism. The point in the older stack frame is called a continuation, and is declared by the continuation keyword; the jumping operation is called a cut and allows multiple arguments to be passed to the target continuation. For any function call in Pillar, if the target function might ultimately cut to some continuation defined in the calling function rather than returning normally, then the function call must be annotated with all such continuations (these can be thought of as all alternate return points) so that the compiler can insert additional control flow edges to keep optimizations safe.

Virtual stack elements: A VSE (virtual stack element) declaration associates a cleanup task with a block of code. The “virtual stack” terminology is explained in Section 5.2. This cleanup task is executed whenever a cut attempts to jump out of the region of code associated with the VSE. This mechanism solves a problem with traditional stack cutting (such as in C-- ) where cuts do not compose well with many other operations. For example, suppose that code executing within a transaction cuts to some stack frame outside the transaction. The underlying transactional memory system would not get notified and this is sure to cause problems during subsequent execution. By using a VSE per transaction, the transactional memory system in Pillar is notified when a cut attempts to bypass it and can run code to abort or restart the transaction. Since cuts in Pillar compose well with all the features of Pillar, we call them composable cuts .

Stack walking: The Pillar language itself has no keywords for stack walking , but the Pillar runtime provides an interface for iterating over the stack frames of a particular thread. Pillar has the also unwinds to annotation on a function call for providing a list of continuations that can be accessed during a stack walk. This is useful for implementing exception propagation using stack walking, as is typical in C++, Java, and C# implementations.

Spans: Spans are a mechanism for associating specific metadata with call sites within a syntactic region of code, which can be looked up during stack walking.

Root-set enumeration: Pillar adds a primitive type called ref that is used for declaring local variables that should be reported as roots to the garbage collector. During stack walking these roots can be enumerated. The ref type may also contain additional parameters that describe how the garbage collector should treat the reference: e.g., as a direct object pointer versus an interior pointer, as a weak root, or as a tagged union that conditionally contains a root. These parameters have meaning only to the garbage collector, and are not interpreted by Pillar or its runtime. If refs escape to unmanaged code, they must be wrapped and enumerated specially, similar to what is done in Java for JNI object handles.

Tail calls: The tailcall keyword before a function call specifies a proper tail call: the current stack frame is destroyed and replaced with the callee’s new frame.

Calls between managed and unmanaged code: All Pillar function declarations are implicitly tagged with the pillar attribute. The Pillar compiler also understands a special pragma that suppresses the pillar attribute on function declarations; this pragma is used when including standard C header files or defining non-Pillar functions. (footnote: One particularly pleasing outcome of this syntax is that managed Pillar code and unmanaged C code can coexist within the same source files.). Calling conventions and other interfacing depend on the presence or abs ence of the pillar attribute in both the caller and callee, and the Pillar compiler generates calls accordingly.

Note that spans, second-class continuations, and stack walking are C-- constructs and are described in more detail in the C-- specification (Ramsey, N., Peyton Jones, S., Lindig, C.: The C-- language specification, version 2.0. Available at (February 2005)).

2.2 Concurrency features

Pillar currently provides three mechanisms for creating new logical threads: pcall , prscall , and fcall .

((pcall (spawn):)) Adding the pcall keyword in front of a call to a function with a void return type creates a new child thread, whose entry point is the target function. Execution in the original parent thread continues immediately with the statement following the pcall . Any synchronization or transfer of results between the two threads should use global variables or parameters passed to the pcall target function.

((prscall (mb spawn):)) The prscall keyword is semantically identical to pcall , but implements a parallel-ready sequential call [7]. Prscalls allow programs to specify potential parallelism without incurring the overhead of spawning parallel threads if all processors are already busy. A prscall initially starts running the child thread as a sequential call (the parent is suspended). However, if a processor becomes free, it can start executing the parent in parallel with the child. Thus, prscalls are nearly as cheap as normal procedure calls, but take advantage of free processors when they become available.

((fcall (future):)) The fcall construct can be used to parallelize programs that have certain serial- izable semantics. The fcall annotation indicates that the call may be executed concurrently with its continuation, while allowing the call to be eagerly or lazily serialized if the compiler or runtime deems it unprofitable to parallelize it. The st parameter to the fcall is a synchronization variable, called a future, that indicates the status of the call: empty indicates that the call has not yet been started, busy indicates that the call is currently being computed, and full indicates that the call has completed. Two forcing operations are provided for futures: ftouch and fwait . If the future is full, both return immediately; if the future is empty, both cause the call to be run sequentially in the forcing thread; if the future is busy, fwait blocks until the call completes while ftouch returns immediately. The serializability requirement holds if, for each future, its first ftouch or fwait can be safely replaced by a call to the future’s target function. Both prscall and fcall are geared toward an execution environment where there is a great deal of available fine-grain concurrency, with the expectation that the vast majority of calls can be executed sequentially within their parents’ context instead of creating and destroying a separate thread.

These three keywords take an additional affinity parameter [8] that helps the scheduler place related threads close to each other to, e.g., improve memory locality.

Pillar provides support for transactions. A syntactic block of code is marked as a transaction, and transaction blocks may be nested. Within the transaction block, transactional memory accesses are specially annotated, and a continuation is specified as the “handler” for those situations where the underlying transactional memory system needs the program to respond to situations like a data conflict or a user retry.

The concurrency constructs described so far relate to light weight thread-level parallelism. To support data parallelism, we intend to add Ct primitives (Ghuloum, A., Sprangle, E., Fang, J.: Flexible Parallel Programming for Tera-scale Architectures with Ct (2007) Available at ((note: now at )) ) to Pillar. These primitives express a variety of nested data-parallel operations, and their semantics allow the compiler to combine and optimize multiple such operations.


Our experience shows that cooperative preemption offers several performance advantages over traditional preemptive scheduling in multi-core platforms (Saha, B., Adl-Tabatabai, A., Ghuloum, A., Rajagopalan, M ., Hudson, R., Petersen, L., Menon, V., Murphy, B., Shpeisman, T., Sprangle, E., Rohilla h, A., Carmean, D., Fang, J.: Enabling Scalability and Performance in a Large Scale CMP En vironment. EuroSys? (March 2007)). The Pillar compiler is expected to generate a yield check at each method entry and backward branch, a well-known technique that ensures yielding within a bounded time.

" -- [37]


Ocaml's VM

" > What virtual machine would be good to target if I need to implement typical things such tail call optimizations and closures? If you want TCO and closures, I'd seriously consider targeting ocaml's VM, or even just embedding your language via camlp4 LtU? link

" --
Camlp4 Tutorial