Table of Contents for Programming Languages: a survey
note: HLL is an abbreviation for High-Level Language, and IR for Intermediate Representation.
The core language of a VM is often called 'bytecode'. The instructions in this language are called 'instructions' or 'operations' or 'ops' or 'opcodes' or 'bytecodes'.
https://dercuano.github.io/notes/tiny-interpreters-for-microcontrollers.html
"It’s a bit easier to build a relatively efficient interpreter with a register-based machine...Having now done the code generation work for register and stack machines...Mostly, I found the register-based code generation a bit easier to do...on the JVM, the stack must be empty either side of a try/catch construct. So you have to do some work to spill the stack into locals if there’s anything already on it....I haven’t really come out of all this thinking register machines are overwhelmingly better than stack machines, though – but they are a bit easier on the code generation, which is nice. If you plan to put your call frames on the system stack, the stack approach seems more attractive from that angle. Once you decide you’re not doing that, you could go either way." -- http://www.i-programmer.info/programming/perl/6138-where-is-perl-heading-an-interview-with-jonathan-worthington.html?start=1
"And never forget that stack machines are superior to register machines when it comes to code density!" -- http://www.pouet.net/topic.php?post=398318
http://blogs.msdn.com/b/ericlippert/archive/2011/11/28/why-have-a-stack.aspx
examples of stack-based VMs:
"if you look at the VMs that do successfully host multiple languages today, they’ve typically started by hosting one language really well." -- http://www.josetteorama.com/all-about-perl-6-interview-of-jonathan-worthington-part-2-of-3/
I guess the reason that writing instructions in stack style is so concise is that it is really just a shorthand for how the outputs of earlier instructions are plugged into the inputs of later ones, and in the common case (executing a run of sequential instructions, each of whose inputs depends on the output of the previous one) this shorthand is very compact. When you get far away from this common case, you end up with lots of ROTs and DUPs, and at that point registers may be better.
Another way of saying that: when you are computing expressions, you end up recursively computing subexpressions and then plugging the results into a larger expression. E.g. to compute 0.5*(f(x,y,u+v) + z), you compute z, then u+v, then f(x,y,u+v), then f(x,y,u+v)+z, then 0.5*(f(x,y,u+v)+z). You don't usually save these intermediate results to use twice; e.g. if you compute 0.5*(2*f(x) + 2*f(x)) you might actually compute f(x) twice (or you might not, if you memoize). This is perfect for a stack machine. Whereas if you did y = 2*f(x) and then 0.5*(y + y), that's more easily done on a register machine. The point is, in the best-case scenario on a stack machine, each intermediate result is used exactly once and is consumed immediately, except when other operands must first be computed as further inputs to the function that is consuming it, in which case it is consumed as soon as all of its co-inputs are produced. If each function also produces exactly one output, then that output represents a subexpression, and the stack machine's execution mirrors the recursive computation of an expression.
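To make that best case concrete, here is a minimal stack-machine sketch in Python (the opcode names, the run() loop, and the stand-in f are all invented for illustration, not any particular VM's bytecode). Evaluating 0.5*(f(x,y,u+v) + z), every value pushed is consumed exactly once, with no ROTs or DUPs:

    # Toy stack machine; opcodes and encoding are hypothetical.
    def f(a, b, c):                     # stand-in for the f in the expression
        return a + b * c

    def run(code, env):
        stack = []
        for op, *args in code:
            if op == "PUSH":            # push a constant
                stack.append(args[0])
            elif op == "LOAD":          # push a variable's value
                stack.append(env[args[0]])
            elif op == "ADD":           # consume two values, push one result
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "MUL":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == "CALL":          # pop n arguments, push one result
                fn, n = args
                fargs = [stack.pop() for _ in range(n)][::-1]
                stack.append(fn(*fargs))
        return stack.pop()

    # 0.5 * (f(x, y, u+v) + z): each intermediate feeds exactly one consumer.
    code = [
        ("PUSH", 0.5),
        ("LOAD", "x"), ("LOAD", "y"),
        ("LOAD", "u"), ("LOAD", "v"), ("ADD",),  # u+v
        ("CALL", f, 3),                          # f(x, y, u+v)
        ("LOAD", "z"), ("ADD",),                 # f(x,y,u+v) + z
        ("MUL",),                                # 0.5 * (...)
    ]
    print(run(code, {"x": 1, "y": 2, "u": 3, "v": 4, "z": 5}))  # 10.0

By contrast, computing y = 2*f(x) once and using it twice needs a DUP on a stack machine, but is just two reads of the same register on a register machine; that multi-use case is where registers win.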
Registers are for multi-output computations; the stack is for single-output computations (i.e. expressions).
Does a free swap turn a stack into a traditional register set? by D. J. Bernstein
A Performance Survey on Stack-based and Register-based Virtual Machines (finds that register machines are more efficient; the register machine they used appears to be this one, which appears to have 4 registers: https://github.com/Conceptual-Inertia/Inertia/blob/master/docs/instr_file.txt )
discussion:
http://wingolog.org/archives/2013/11/26/a-register-vm-for-guile
another argument for register VMs is that they make de-optimization in JIT compilers simpler: " Modern JIT compilers use a technique called type feedback compilation. Briefly, the idea is that a compiler that is integrated into the runtime of the system can exploit information on how the program is actually used to compile more efficient code than would be possible on the basis of the program source code alone. A simple example in javascript would be the following routine:
    function foo(a) {
        var r = 0;
        for (var i = 1; i < a.length; i++) {
            r += (a[i] * a[i-1]);
        }
        return r;
    }

    foo([1,2,3,4,5,6]);
If all calls to foo happen to have a single argument consisting of an array of integers, the semantics of this routine become much simpler than they are otherwise. (For example, in javascript, the addition of a number and a string produces a well-defined result, so it is totally valid to call foo with an array of strings). A type-feedback compiler might notice a large number of calls to foo, all with integer arrays as their sole argument, assume this will always be so, and compile a much faster routine. In order to correctly handle arrays of strings too, the compiler inserts a 'guard clause' that checks if a is really an array of integers. If not, the routine must be 'de-optimised'. Note that spesh, which is the optimisation framework for MoarVM, also works this way
The goal of de-optimisation is to resume the execution of the interpreted (slow) routine where the assumptions of the compiled routine have failed. A typical place in our 'foo' function would be on entry or on the addition to r. The idea is that the values that are calculated in the optimised routine are copied to the locations of the values of the interpreted routine. In a register machine, this is conceptually simple because all variables already have a fixed location. However, the layout of the stack in a stack vm is dynamic and changes with the execution of the routine, and mapping between compiled and interpreted values may not be very simple at all. It is certainly doable - after all, the JVM famously has an efficient optimising JIT compiler - but not simple. " -- http://brrt-to-the-future.blogspot.com/2014/05/moarvm-as-machine.html
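To make the guard-and-deoptimize idea concrete, here is a minimal Python sketch (the Deopt exception, the function names, and the int-only fast path are all invented; note that a real deoptimizer maps live values back into the interpreter's value locations mid-routine rather than restarting from scratch, which is exactly where the fixed register locations described above help):

    class Deopt(Exception):
        """Raised when a compiled routine's type assumptions fail."""

    def foo_generic(a):
        # the slow, fully general routine (interpreter semantics)
        r = 0
        for i in range(1, len(a)):
            r = r + a[i] * a[i - 1]
        return r

    def foo_specialized_int(a):
        # the "compiled" fast path, valid only for arrays of ints
        if not all(type(x) is int for x in a):  # guard clause
            raise Deopt()                       # assumption failed
        r = 0
        for i in range(1, len(a)):
            r += a[i] * a[i - 1]
        return r

    def foo(a):
        try:
            return foo_specialized_int(a)       # optimistic fast path
        except Deopt:
            return foo_generic(a)               # de-optimize to the slow path

    print(foo([1, 2, 3, 4, 5, 6]))  # fast path: 70
    print(foo([1.5, 2.5, 3.5]))     # guard fails, generic path: 12.5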
a discussion on this is
Links:
" If you only count real GPRs in MIPS like that then in ARM32 there's only 13 registers (R0-R12, minus SP, LR and PC) and x86 has 7 (minus SP, when omitting frame pointers). x86_64 has 15 and ARM64 has 31 – phuclv Jan 23 '15 at 6:22 "
https://stackoverflow.com/questions/8466981/why-does-arm-have-16-registers
"The trade off is that to represent more registers you need more bits in your instruction, and you need more room on the chip for the register file, which increases power requirements"
also, you need to save and restore more state on context switches, which takes more time and more memory
" You can see how differing register counts affects code size and the frequency of load/store instructions by compiling the same set of code with different numbers of registers. The result of that type of exercise can be seen in table 1 of this paper:
Extendable Instruction Set Computing
Register Count   Program Size   Load/Store Frequency
27               100.00         27.90%
16               101.62         30.22%
8                114.76         44.45%
" -- Bradley McKinley
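The code-density half of this trade-off is simple arithmetic. A quick sketch (assuming, hypothetically, a three-operand instruction format with fields for dst, src1, src2):

    from math import ceil, log2

    # Operand-field bits for a hypothetical 3-operand instruction,
    # as a function of register count.
    for nregs in (8, 16, 32):
        bits = 3 * ceil(log2(nregs))
        print(f"{nregs} registers -> {bits} operand bits per instruction")
    # 8 registers -> 9 operand bits per instruction
    # 16 registers -> 12 operand bits per instruction
    # 32 registers -> 15 operand bits per instruction

Note that this pulls the opposite way from the table above: fewer registers make each instruction smaller, but the measured programs got bigger overall, because more values spill into extra load/store instructions.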
Links:
Why do some languages target virtual machines? Why add another layer that isn't strictly necessary?
Pros:
Additional pros if you target an existing virtual machine:
Cons:
Links:
Comment by Raphael Amiard on the efficiency problems of taking a pre-existing bytecode IR and compiling it, as opposed to compiling from the HLL's AST directly to a target without going through a pre-existing bytecode IR (after the experience of writing a compiler for the OCaml ZAM bytecode (the Z3 project)): https://news.ycombinator.com/item?id=4798320
In bytecode instruction sets, you often see a bunch of different instructions that do similar things on different types: for example, add signed integers, add unsigned integers, add 32-bit floating-point numbers, add 64-bit floating-point numbers. You might think, gee, couldn't the darn bytecode give us a higher level of abstraction than that? Couldn't there just be one opcode for ADD, with the VM figuring out which addition routine is required based on the type?
Here's why that is not often done. Even if the types of everything were known at compile time, when at runtime we encounter an instruction to add (for example) register 0 to register 1, we don't want to have to look up the types of register 0 and register 1 to decide whether to jump to the integer addition routine, the floating-point addition routine, etc., because that lookup takes significant extra time on every instruction.
So, why not just have a single ADD opcode, but also encode the statically-known type in a separate 'type' field of the instruction? For one, this field needs enough bits to encode every type you want to distinguish, so if the bytecode supports many types, it could substantially reduce code density.
Even if we accept that cost, the VM must still map (opcode, type) to an implementation at runtime, and there are two ways to do that. First, we could look up the implementation in a table indexed by (opcode, type). But this table would be huge and wasteful: not every opcode can be used with every type (and many are handled the same way across types), so the table would be much larger than it needs to be. We could store it sparsely, or as a hash table, but then lookups take longer, and this is a cost paid on every single instruction, which would be bad.
The second way would be to pass the type from the instruction's type field to the opcode's implementation, and let that implementation conditionally branch to the sub-implementation for that type. The problem is that now we're doing at least two branches per polymorphic instruction at runtime: first a branch to the opcode's handler, and then a branch to the implementation for the type at hand. This runtime branching is unnecessary, because for values whose types were statically known, we could have figured out at compile time where we would end up.
So the most efficient thing to do is simply to have a separate opcode for each implementation routine; e.g. one opcode for integer addition, another opcode for floating-point addition, etc.
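Here is a sketch of the three options in Python (the opcodes, instruction encodings, and handlers are all hypothetical; MASK fakes 64-bit integer wraparound so the two ADD routines genuinely differ):

    # Hypothetical interpreter fragments. An instruction is a tuple and
    # regs is a flat list of register values.
    MASK = (1 << 64) - 1

    # Option 1: a table indexed by (opcode, type), stored sparsely as a
    # dict -- every instruction pays a hash lookup.
    HANDLERS = {
        ("ADD", "i64"): lambda x, y: (x + y) & MASK,
        ("ADD", "f64"): lambda x, y: float(x) + float(y),
    }
    def step_table(inst, regs):
        op, ty, dst, a, b = inst
        regs[dst] = HANDLERS[op, ty](regs[a], regs[b])

    # Option 2: one ADD opcode plus a type field -- two branches per
    # instruction, even though the compiler knew the type statically.
    def step_type_field(inst, regs):
        op, ty, dst, a, b = inst
        if op == "ADD":                           # branch 1: opcode
            if ty == "i64":                       # branch 2: type
                regs[dst] = (regs[a] + regs[b]) & MASK
            elif ty == "f64":
                regs[dst] = float(regs[a]) + float(regs[b])

    # The conclusion: one opcode per implementation routine, so a single
    # dispatch selects the right routine, and no type field is encoded.
    def step_monomorphic(inst, regs):
        op, dst, a, b = inst
        if op == "ADD_I64":
            regs[dst] = (regs[a] + regs[b]) & MASK
        elif op == "ADD_F64":
            regs[dst] = float(regs[a]) + float(regs[b])

    regs = [2, 3, 0]
    step_monomorphic(("ADD_I64", 2, 0, 1), regs)  # regs[2] == 5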
A VM does not need to obey these constraints in order to fit the definition of a VM, but they are often seen for reasons of efficiency:
Leans up vs. leans down
should they be designed to be similar to a hardware CPU instruction set architecture (ISA)? e.g. most extant VMs are either stack machines or register machines; but why not just memory-to-memory?
simple vs. complex
data model, evaluation strategy of HLL, sometimes object model
two layers of VMs
interoperable VMs expand into toolchains
examples:
" I learned my own lessons about the “repurposed JIT phenomenon” the hard way. Before any work started on HHVM, I built a hackathon prototype that did naive compilation from PHP to JavaScript?. Running the output on V8 produced some really promising results at first; programming-language shootout[2] style benchmarks looked in the neighborhood of 10x the standard PHP interpreter.
The problems came when I tried to run real PHP programs. The primitive operations that PHP exposes, and that large PHP codebases rely on, are simply different from those exposed by JavaScript, Java, the CLR, etc. For example, PHP engines expose their use of reference counting to the PHP program, while JavaScript and Java implementations typically use some sort of tracing garbage collection (and even if they use reference counting, don’t expose it to programs). So, in order to faithfully run PHP programs on a JVM or JavaScript engine, you need to reimplement all of PHP’s memory management, either in the source language or in the runtime itself. So right out of the gate, you lose one of the big pieces of effort you were hoping to save by recycling someone else’s engine: memory management.
PHP also has a notion of “references” that doesn’t have a direct analog in JavaScript; passing parameters by reference is unmarked in the call site, so you need to do method resolution (“what am I actually calling at runtime?”) at the call site before you can even pass parameters; PHP has value semantics for its array datatype, while JavaScript and Java provide reference semantics for containers; PHP’s numerics are very different from JavaScript’s; PHP has to do protected/private/public method privacy checks at runtime; no, JVM fans, InvokeDynamic