proj-plbook-plChTargetsImpl

Table of Contents for Programming Languages: a survey

Chapter : targets, IRs, VMs and runtimes

note: HLL is an abbreviation for High Level Language.

IR is for Intermediate Representation.

The core language of a VM is often called 'bytecode'. The instructions in this language are called 'instructions' or 'operations' or 'ops' or 'opcodes' or 'bytecodes'.

vms

maximizing code density for embedded systems

https://dercuano.github.io/notes/tiny-interpreters-for-microcontrollers.html

register-based vs stack-based VMs

"It’s a bit easier to build a relatively efficient interpreter with a register-based machine...Having now done the code generation work for register and stack machines...Mostly, I found the register-based code generation a bit easier to do...on the JVM, the stack must be empty either side of a try/catch construct. So you have to do some work to spill the stack into locals if there’s anything already on it....I haven’t really come out of all this thinking register machines are overwhelmingly better than stack machines, though – but they are a bit easier on the code generation, which is nice. If you plan to put your call frames on the system stack, the stack approach seems more attractive from that angle. Once you decide you’re not doing that, you could go either way." -- http://www.i-programmer.info/programming/perl/6138-where-is-perl-heading-an-interview-with-jonathan-worthington.html?start=1

Shi, Yunhe; Gregg, David; Beatty, Andrew; Ertl, M. Anton (2005-06-11). "Virtual Machine Showdown: Stack Versus Registers".

"And never forget that stack machines are superior to register machines when it comes to code density!" -- http://www.pouet.net/topic.php?post=398318

http://blogs.msdn.com/b/ericlippert/archive/2011/11/28/why-have-a-stack.aspx

examples of stack-based VMs:

"if you look at the VMs that do successfully host multiple languages today, they’ve typically started by hosting one language really well." -- http://www.josetteorama.com/all-about-perl-6-interview-of-jonathan-worthington-part-2-of-3/

i guess the reason that writing instructions in stack-mode is so concise is that it's really just a useful shorthand for how the outputs of earlier instructions are plugged into the inputs of later ones, and that in the common case (executing a bunch of sequential instructions whose inputs depend on the output of the previous instruction) it's concise. when you get far away from this common case is when you get lots of ROTs and DUPs, in which case maybe registers are better.

another way of saying that: when you are computing expressions, you end up recursively computing subexpressions and then plugging the result into a larger expression. E.g. to compute 0.5*(f(x,y,u+v) + z), you compute z, then you compute u+v, then f(x,y,u+v), then f(x,y,u+v)+z, then 0.5*f(x,y,u+v)+z. You don't save usually save these intermediate results to use twice; e.g. if you compute 0.5*(2*f(x) + 2*f(x)) you might actually compute f(x) twice (or you may not, if you memoize). This is perfect for a stack machine. Whereas, if you did y = 2*f(x) and then 0.5*(y + y), that's more easily done on a register machine. The point is, for the best-case scenario on a stack machine, each intermediate result is used exactly once and is consumed immediately, except when other operands must first be computed as further inputs to the function that is consuming it, in which case it is consumed as soon as all of its coinputs are produced. If each function further produces exactly one output, then that output represents a subexpression and the stack machine represents the recursive computation of an expression.

Registers are for multi output, stack for single (== expressions)

Does a free swap turn a stack into a traditional register set? by D. J. Bernstein

A Performance Survey on Stack-based and Register-based VirtualMachines (finds that register machines are more efficient; the register machine they used appears to be this one, which appears to have 4 registers: https://github.com/Conceptual-Inertia/Inertia/blob/master/docs/instr_file.txt )

discussion:

http://wingolog.org/archives/2013/11/26/a-register-vm-for-guile

another argument for register VMs is the simpler de-optimization of JIT compilers: " Modern JIT compilers use a technique called type feedback compilation. Briefly, the idea is that a compiler that is integrated into the runtime of the system can exploit information on how the program is actually used to compile more efficient code than would be possible on the basis of the program source code alone. A simple example in javascript would be the following routine:

function foo(a) { var r = 0; for (var i = 1; i < a.length; i++) { r += (a[i] * a[i-1]); } return r; }

foo([1,2,3,4,5,6]);

If all calls to foo happen to have a single argument consisting of an array of integers, the semantics of this routine become much simpler than they are otherwise. (For example, in javascript, the addition of a number and a string produces a well-defined result, so it is totally valid to call foo with an array of strings). A type-feedback compiler might notice a large number of calls to foo, all with integer arrays as their sole argument, assume this will always be so, and compile a much faster routine. In order to correctly handle arrays of strings too, the compiler inserts a 'guard clause' that checks if a is really an array of integers. If not, the routine must be 'de-optimised'. Note that spesh, which is the optimisation framework for MoarVM?, also works this way

The goal of de-optimisation is to resume the execution of the interpreted (slow) routine where the assumptions of the compiled routine have failed. A typical place in our 'foo' function would be on entry or on the addition to r. The idea is that the values that are calculated in the optimised routine are copied to the locations of the values of the interpreted routine. In a register machine, this is conceptually simple because all variables already have a fixed location. However, the layout of the stack in a stack vm is dynamic and changes with the execution of the routine, and mapping between compiled and interpreted values may not be very simple at all. It is certainly doable - after all, the JVM famously has an efficient optimising JIT compiler - but not simple. " -- http://brrt-to-the-future.blogspot.com/2014/05/moarvm-as-machine.html

a discussion on this is

Links:

How many registers?

" If you only count real GPRs in MIPS like that then in ARM32 there's only 13 registers (R0-R12, minus SP, LR and PC) and x86 has 7 (minus SP, when omitting frame pointers). x86_64 has 15 and ARM64 has 31 – phuclv Jan 23 '15 at 6:22 "

https://stackoverflow.com/questions/8466981/why-does-arm-have-16-registers

"The trade off is that to represent more registers you need more bits in your instruction, and you need more room on the chip for the register file, which increases power requirements"

also you need to store more state for context switches, which takes more time and more memory

" You can see how differing register counts affects code size and the frequency of load/store instructions by compiling the same set of code with different numbers of registers. The result of that type of exercise can be seen in table 1 of this paper:

Extendable Instruction Set Computing

Register Program Load/Store Count Size Frequency

27 100.00 27.90% 16 101.62 30.22% 8 114.76 44.45% " -- Bradley McKinley

good read: https://electronics.stackexchange.com/questions/256854/why-dont-we-have-more-registers-in-microprocessors

Potential differences with register machines in VMs vs bare metal

Links:

VMs: pros and cons

Why do some languages target virtual machines? Why add another layer that isn't strictly necessary?

Pros:

Additional pros if you target an existing virtual machine:

Cons:

Links:

Information Losses in bytecode IRs

Comment by Raphael Amiard on the efficiency problems with taking a pre-existing bytecode IR and compiling it, as opposed to compiling from the HLL's AST directly to a target without going thru a pre-existing bytecode IR (after the experience of writing a compiler for OCaml ZAM bytecode (the Z3 project)): https://news.ycombinator.com/item?id=4798320

Why no polymorphism in bytecode?

In bytecode instruction sets, often you see a bunch of different instructions that do similar things on different types. For example, add signed integers, add unsigned integers, add 32-bit floating points, add 64-bit floating points. You might think, gee, couldn't the darn bytecodes give us a higher level of abstraction than that? Couldn't there just be one opcode for ADD, and the VM figures out which addition routine is required based on the type?

Here's why that is not often done. If at compile time the types of everything were known, and then at runtime we encounter an instruction to add (for example) register 0 to register 1, we don't want to have to lookup the types of register 0 and register 1 to see whether we should jump to the integer addition routine, the floating point addition routine, etc, because that takes significant extra time.

So, why not just have a single ADD opcode, but also encode the statically-known types in a separate 'type' field in the instruction? Well, for one, this field would have to have enough bits to encode all types that you would want to distinguish, so if the bytecode supports many types, this could substantially reduce code density.

Well, we have two ways to do that. First, at runtime we could look up the implementation to use in a table that is indexed by (opcode, type). But this table would be huge, and wasteful; not all opcodes can be used with all types (or are handled differently with all types), so this table would be much larger than it needs to be. We could store the table sparsely or as a hash, but then lookup takes a long time, and this is something that we have to do upon every instruction so that would be bad.

The second way would be to pass the type given by the instruction's bytecode to the implementation of the instruction, and let the implementation conditionally branch to the subimplementation for that type. The problem with this is that now we're doing at least two branches for polymorphic instructions at runtime, first a branch to the opcode's handler, and then a branch to the implementation for the type that we have. This runtime branching is not needed, since for values whose type was statically known, we could have figured out where we would have ended up at compile time.

So, the most efficient thing to do is to simply have a separate opcode for each implementation routine; eg one opcode for integer addition, another opcode for floating point addition, etc.

Common constraints on VMs for efficiency

A VM does not need to obey these constraints in order to fit the definition of a VM, but they are often seen for reasons of efficiency:

How intermediate?

Leans up vs. leans down

should they be designed to be similar to hardware CPU instruction set architecture (ISA)? e.g. most extant VMs are either stack machines or register machines; but why not just memory-to-memory?

simple vs. complex

data model, evaluation strategy of HLL, sometimes object model

two layers of VMs

interoperable vms expand into toolchains

Impedence mismatches when targetting an existing VM

examples:

" I learned my own lessons about the “repurposed JIT phenomenon” the hard way. Before any work started on HHVM, I built a hackathon prototype that did naive compilation from PHP to JavaScript?. Running the output on V8 produced some really promising results at first; programming-language shootout[2] style benchmarks looked in the neighborhood of 10x the standard PHP interpreter.

The problems came when I tried to run real PHP programs. The primitive operations that PHP exposes, and that large PHP codebases rely on, are simply different from those exposed by JavaScript?, Java, the CLR, etc. For example, PHP engines expose their use of reference counting to the PHP program, while JavaScript? and Java implementations typically use some sort of tracing garbage collection (and even if they use reference counting, don’t expose it to programs). So, in order to faithfully run PHP programs on a JVM or JavaScript? engine, you need to reimplement all of PHP’s memory management, either in the source language or in the runtime itself. So right out of the gate, you lose one of the big pieces of effort you were hoping to save by recycling someone else’s engine: memory management.

PHP also has a notion of “references” that doesn’t have a direct analog in JavaScript?; passing parameters by reference is unmarked in the call site, so you need to do method resolution (“what am I actually calling at runtime?”) at the call site before you can even pass parameters; PHP has value semantics for its array datatype, while JavaScript? and Java provide reference semantics for containers; PHP’s numerics are very different from JavaScript’s?; PHP has to do protected/private/public method privacy checks at runtime; no, JVM fans, InvokeDynamic? does not magically make all of these mismatches disappear; and on and on. Once you start diving into the nuances of the low-level primitives, almost every language is different enough that it is hard to efficiently implement one in terms of the other. " -- http://www.forbes.com/sites/quora/2013/06/20/why-did-the-hip-hop-vm-team-build-a-new-vm-instead-of-targeting-the-jvm-tracemonkey-or-v8/

" Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for optimizing Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that. An example of needing to apply high-level knowledge to code generation is optimizing the Python stack access. LLVM will not fold loads from the Python stack across calls to external functions (ie the CPython runtime, so all the time). We eventually wrote an alias analysis to solve this problem, but it's an example of what you have to do if you don't roll your own code generator. LLVM also comes with other constraints. For example, LLVM doesn't really support back-patching, which PyPy? uses for fixing up their guard side exits. It's a fairly large dependency with high memory usage, but I would argue that based on the work Steven Noonan did for his GSOC that it could be reduced, especially considering that PyPy?'s memory usage had been higher. " -- http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospective.html

"

Author Profile Page chromatic replied to comment from Gerhard R.

January 4, 2013 10:23 AM

I think you're falling into the "JIT makes things fast" fallacy. You can see how well that works for languages which use LLVM--if they look like C or C++, they do pretty well. Otherwise they don't.

If you want your language to go really fast, your runtime needs to know a lot about the semantics of your language. How does memory allocation work? What's the layout of primitives? What precision and signedness do your primitives support? Do you support aliasing? What constness guarantees do you provide? What immutability guarantees do you provide? When can you resolve polymorphic dispatch? What boxing semantics do you provide? How does control flow work, and can you resolve questions about object escapes and lifetimes? What runtime support do you need to enable metaprogramming? When can you make these decisions?

I've often said that to support Perl 6 you need a virtual machine that looks a lot like Parrot (not that Parrot does everything right, but that it at least tries to make these things possible). I still suspect that mythical VM abstraction layer will go the way of the DLR. Effectively what NQP on the JVM or CLR will look like is a reimplementation of all of the C code in Rakudo right now using the underlying VM to provide very basic memory management and some platform access.

The result might run faster than current Rakudo, but it'll be so far away from the semantics of Java or C# that I doubt it'll run with the amazing speed people hope except on a few carefully chosen benchmarks that avoid the interesting language features that might make Perl 6 worth using.

(See also: JavaScript?'s terrible numeric support system and why JS JITs go to great length to perform calculations efficiently.)

" -- http://www.modernperlbooks.com/mt/2012/12/the-implementation-of-perl-5-versus-perl-6.html#comment-1957

" Besides the well-known issues of tail calls and continuations, VMs often include a runtime type system, garbage collection, calling functions with arguments, objects with fields, strings, dictionaries, typesafe arithmetic etc. It is unlikely that these operations match the semantics of an independently developed language. The type system will most likely not match at all. There are tons of problematic areas:

    How are functions with a statically unknown arity handled?
    Can objects other than functions be applied?
    How are object fields identified? Are they accessed when knowing the object type or not, possibly including subtyping? Are fields identified by strings or symbols, or indices computed at compile time? Is the set of fields known at object creation time, or is the semantics of field access resolved dynamically?
    What is the nature of characters, elements of strings? Unicode code points? UTF-16 code units? Bytes? Are they distinguished from 1-character strings? Are strings mutable, and can they change their length?
    How is hashing and equality used by a hash table specified?
    What is the semantics of weak references and finalizers? Is a finalizer specified when its associated object is created, or can it be added at any time to any object? May a finalizer refer to the object being finalized, or to other finalizable objects? Are finalizers run by other threads?
    What numeric types are available? What is the effect of overflowing fixed-size integers: returning bignums? an exception? wrapping modulo integer size? What is the semantics of integer division and remainder of negative numbers? What happens when you try to add objects which are not of a builtin numeric type?
    Which exceptions are thrown from particular erroneous situations? What happens when an array is accessed outside of bounds? What happens on integer and floating point division by zero? Are exceptions resumable? Is exception handling code in a tail position with respect to try?" -- http://lambda-the-ultimate.org/node/1617#comment-19661

" Language-independent VM

Here's the big problem with high-level language independent VMs: language designers invariably want to do something new. By definition, this is something you haven't thought of. Want lazyness? Out of luck. Continuations? Nope. First-class stack frames? Ditto. Fast multimethod dispatch? Well, you can hack it up, but most likely it'd be easier to just write a new VM.

In my case, I want to do a language with Erlang-style lightweight concurrency and message passing. This has to be built into the VM at a low level, because it affects everything from IO to heap allocation to stack representation to scheduling. I looked at Parrot, Neko, C--, and LLVM, but none of them offer much in concurrency support. The only halfway-suitable VM was, not surprisingly, Erlang itself, but I can't find any documentation on the bytecode format and the source is semi-closed for modifications (and makes what are IMHO poor choices in string represenation, making it unsuitable for my purposes). I'll probably end up starting with LLVM and then writing an instruction-compatible custom VM once I get to the concurrency parts. " -- http://lambda-the-ultimate.org/node/1617#comment-19670

supporting platform-dependent optimized primitives without bloating the VM language

Different platforms offer different optimized primitives. If the VM's language defined special facilities for the union of all of these, it would be very large indeed; this would also necessitate relatively large implementations of the VM because on each platform it would have to implement each of te defined facilities for . But if it does not define special facilities for some of these, then naive compilers targeting the VM language will transform operations that could be performed by optimized primitives into lengthly sequences of VM instructions, making it hard for a platform-specific runtime to figure out where the optimized primitives could be inserted.

What are some ideas for offering access to platform-dependent optimized primitives without bloating the VM language?

optimizer recognizes implementations of optimized primitives

A platform-specific optimizer, which could be part of a compiler or a runtime which is fed bytecode input, would try to recognize naive implementations of what on this platform could be rewritten in terms of optimized primitives, and rewrite them.

C. Guy Yarvin calls this "jets" at http://www.urbit.org/2013/08/22/Chapter-0-intro.html

annotations

Sections of the bytecode could support annotation, and then in the future annotations could be defined to say essentially, "this section is just doing X, if you have an optimized primitive for X please replace this section with that".

bytecode vs executing the AST

(todo: merge this with section "How intermediate?", above)

Interpreters are slower than compilers but there's a spectrum of types of interpreters.

An interpreter could actually re-parse the source code for each line of code each time it encounters it at runtime.

An interpreter pre-parse the source code once, and then interpret the AST, traveling from AST node to AST node instead of from source code line to source code line.

An interpreter pre-parse the source code once, and annotate the AST with "next" pointers that say, "once you're done with this node, here is the node to visit next". Perl5 takes this route [1]