

Intermediate languages

There are many languages whose goal is not to be a good language for humans to write or read code in, but rather to be a good language for a compiler or interpreter to target.

TODO: separate this section into 'implementation tours' in the implementation section, and purely target language tours, here.

Chapter : a tour of some language implementations

Go

Haskell: GHC

Python: CPython

http://docs.python.org/devguide/compiler.html

Python: PyPy

PyPy is a reimplementation of Python in RPython.

RPython is a restricted subset of Python, with restrictions on dynamic typing, reflection, and metaprogramming to enable type inference at compile time.
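To make the dynamic-typing restriction concrete, here is a rough sketch in plain Python (not RPython tooling; both function names are made up for illustration). The first function is legal Python but binds one variable to two different types, which is the kind of thing a compile-time type inferencer cannot accept; the second keeps a single type on every path.

```python
def dynamic_style(flag):
    # Ordinary Python: 'result' is an int on one path and a str on the other,
    # so no single static type can be inferred for it.
    if flag:
        result = 1
    else:
        result = "one"
    return result


def restricted_style(flag):
    # Restricted style: 'result' is an int on every path, so its type
    # can be worked out before the program runs.
    if flag:
        result = 1
    else:
        result = 0
    return result
```

Both functions run fine under a normal Python interpreter; the difference only matters to a tool that wants to assign static types.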

PyPy is also a compiler for RPython, which adds in JIT analysis. The RPython compiler is written in Python.

Then the PyPy Python interpreter is written in a mixture of Python (for slow initialization) and RPython (for the fast part), I think. This way the JIT analysis is applied to the resulting interpreter. I'm not sure I fully understand this.

I think it provides an extension API called CPyExt, but I'm not sure.

RPython:

Perl6: Rakudo

Perl6 source code is parsed, and the parse tree is annotated by firing "action methods" during parsing. The annotated AST is called a QAST. The QAST is then compiled to a virtual machine bytecode. Various virtual machines are planned to be supported, including Parrot, JVM, and MoarVM. The compilation steps from source code to bytecode are implemented in a subset of Perl6 called 'NQP' (Not Quite Perl).
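The idea of firing action methods during parsing can be sketched in Python. This is only an illustration of the general technique, not Rakudo's or NQP's actual API: the Actions class, its method names, and the tuple-shaped AST nodes are all made up here.

```python
import re


class Actions:
    """Hypothetical action methods, one per grammar rule."""

    def number(self, text):
        return ("num", int(text))

    def sum(self, left, right):
        return ("add", left, right)


def parse_sum(source, actions):
    """Parse 'n + n + ...' and fire an action method each time a rule matches."""
    tokens = re.findall(r"\d+|\+", source)
    pos = 0

    def number():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError("expected a number")
        tok = tokens[pos]
        pos += 1
        return actions.number(tok)  # action fired when 'number' matches

    node = number()
    while pos < len(tokens):
        if tokens[pos] != "+":
            raise SyntaxError("expected '+'")
        pos += 1
        node = actions.sum(node, number())  # action fired when 'sum' matches
    return node


ast = parse_sum("1 + 2 + 3", Actions())
# ast == ("add", ("add", ("num", 1), ("num", 2)), ("num", 3))
```

The parser itself only recognizes structure; what gets built is decided entirely by the actions object, so a different set of actions could produce a different tree from the same grammar.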

The Perl6 object model is being reimplemented in the 6model project.

Links:

Smalltalk: Squeak

Squeak runs on a VM.

The VM is implemented in Slang, a subset of Smalltalk that can be efficiently optimized.

Erlang

http://prog21.dadgum.com/127.html

Erlang -> Core Erlang -> BEAM -> optimized BEAM

Comparisons and observations

Generally, implementations of a high-level language in a restricted 'core' version of that same language define the core by making it a statically typed variant and disallowing various metaprogramming constructs.

Chapter : a tour of some targets, IRs, VMs and runtimes

(todo move most of the above here)

We'll refer to the top of the stack as TOP0, and to the second position on the stack as TOP1, and to the third position as TOP2, etc.
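As a small illustration of this notation, here is a Python sketch of an operand stack with a helper that reads TOPn without popping. The class and function names are made up for this example.

```python
class OperandStack:
    """Minimal operand stack; top(n) reads TOPn without popping."""

    def __init__(self):
        self._items = []

    def push(self, value):
        self._items.append(value)

    def pop(self):
        return self._items.pop()

    def top(self, n=0):
        # TOP0 is the last item pushed, TOP1 the one below it, and so on.
        return self._items[-1 - n]


def binary_add(stack):
    # Typical stack-machine arithmetic: pop TOP0 and TOP1, push TOP1 + TOP0.
    right = stack.pop()
    left = stack.pop()
    stack.push(left + right)


stack = OperandStack()
for value in (10, 20, 30):
    stack.push(value)
# Now TOP0 == 30, TOP1 == 20, TOP2 == 10.
binary_add(stack)
# Now TOP0 == 50, TOP1 == 10.
```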

stacks:

cpython: block stack

stack ops: cpython:

arithmetic:

JVM

stack-oriented

9 primitive types: int, long, short, byte, char, float, double, boolean, reference

invokedynamic Dynalink (Dynamic Linker Framework)

"

JVM bytecode verification
• JVM bytecode is statically verified before execution
    - An instruction must work on stack operands and variables of the right type
    - A method must use no more local variables and no more local stack positions than it claims to
    - For every point in the bytecode, the local stack has the same depth every time it is reached
    - A method must not throw more exceptions than it admits
    - A method must end with a return value or throw instruction
    - Method must not use one half of a two word value

Additional JVM runtime checks
• Array bounds check
• Array assignment type checks
• Null reference checks
• Checked casts
• Bottom line:
    - A JVM program cannot read or overwrite arbitrary memory
    - Better debugging, better security
    - No buffer overflow attacks, worms, etc as in C/C++

" -- Rasmus Ejlers Møgelberg, http://itu.dk/people/mogel/SPLC2012/lectures/SPLC.2012.12.pdf

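The stack-depth rule from the verification list above (the stack has the same depth every time a given point is reached, and never exceeds what the method claims) can be checked without running the code. Below is a minimal Python sketch of that idea for a made-up stack machine; the opcode names, the EFFECTS table, and verify_stack_depth are all hypothetical, and the real JVM verifier also tracks types, which this ignores.

```python
# (values popped, values pushed) for each opcode of a made-up stack machine.
EFFECTS = {
    "push": (0, 1),
    "pop": (1, 0),
    "add": (2, 1),
    "jump": (0, 0),
    "jump_if": (1, 0),
    "return": (1, 0),
}


def verify_stack_depth(code, max_stack):
    """Return {pc: stack depth on entry}, or raise ValueError if the code is bad."""
    depth_at = {0: 0}
    worklist = [0]
    while worklist:
        pc = worklist.pop()
        op, *args = code[pc]
        pops, pushes = EFFECTS[op]
        depth = depth_at[pc]
        if depth < pops:
            raise ValueError(f"stack underflow at {pc}")
        depth = depth - pops + pushes
        if depth > max_stack:
            raise ValueError(f"stack overflow at {pc}")
        if op == "return":
            successors = []
        elif op == "jump":
            successors = [args[0]]
        elif op == "jump_if":
            successors = [args[0], pc + 1]
        else:
            successors = [pc + 1]
        for nxt in successors:
            if not 0 <= nxt < len(code):
                raise ValueError(f"control falls outside the code at {pc}")
            if nxt not in depth_at:
                depth_at[nxt] = depth
                worklist.append(nxt)
            elif depth_at[nxt] != depth:
                raise ValueError(f"inconsistent stack depth at {nxt}")
    return depth_at


GOOD = [("push",), ("push",), ("add",), ("return",)]
# The loop below re-enters instruction 1 with a deeper stack each time around.
BAD = [("push",), ("push",), ("push",), ("jump_if", 1), ("return",)]

depths = verify_stack_depth(GOOD, max_stack=2)
# depths == {0: 0, 1: 1, 2: 2, 3: 1}
```

Calling verify_stack_depth(BAD, max_stack=8) raises ValueError, because instruction 1 is first reached with depth 1 and then again with depth 2.
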
Links:

on Android

Android has a non-canonical JVM toolchain called 'Dalvik', which compiles to bytecode for a separate, Dalvik-specific VM.

It doesn't support invokedynamic.

CLR

DLR for dynamic languages (built on top of CLR)

alternate, open implementation: Mono

15 primitive types: bool, byte, sbyte, char, decimal, double, float, int, uint, long, ulong, object, short, ushort, string

-- http://msdn.microsoft.com/en-us/library/ya5y69ds.aspx

The CIL has 67 base instructions and 33 object model instructions, according to "Common Language Infrastructure (CLI) Partition III CIL Instruction Set".

"

Common Language Infrastructure
• Much the same philosophy as JVM, but
    - Many source languages: C#, VB.NET, F#, SML, JScript, Eiffel, Ruby
    - Tail calls support functional languages
    - True generics in byte-code: safer and faster
    - User-defined structs

" -- Rasmus Ejlers Møgelberg, http://itu.dk/people/mogel/SPLC2012/lectures/SPLC.2012.12.pdf

Instruction encoding

Instructions are one or more byte opcodes (currently, one or two bytes) followed by zero or more operands. They can also have prefixes.
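To illustrate the encoding, here is a minimal Python sketch of a decoder. The handful of opcode values in the tables are real CIL encodings (two-byte opcodes start with the lead byte 0xFE, and operands are little-endian), but the tables are a tiny subset, the decode function is made up for this example, and prefixes are not handled.

```python
import struct

# A few CIL opcodes -> (mnemonic, struct format of the operand, or None).
ONE_BYTE = {
    0x00: ("nop", None),
    0x02: ("ldarg.0", None),
    0x03: ("ldarg.1", None),
    0x1F: ("ldc.i4.s", "<b"),  # 1-byte signed operand
    0x20: ("ldc.i4", "<i"),    # 4-byte signed operand
    0x2A: ("ret", None),
    0x58: ("add", None),
}
# Second byte of two-byte opcodes, which follow the 0xFE lead byte.
TWO_BYTE = {
    0x01: ("ceq", None),
}


def decode(code):
    """Decode a bytes object into a list of (mnemonic, operand...) tuples."""
    pos = 0
    out = []
    while pos < len(code):
        byte = code[pos]
        pos += 1
        if byte == 0xFE:
            name, fmt = TWO_BYTE[code[pos]]
            pos += 1
        else:
            name, fmt = ONE_BYTE[byte]
        if fmt is None:
            out.append((name,))
        else:
            (operand,) = struct.unpack_from(fmt, code, pos)
            pos += struct.calcsize(fmt)
            out.append((name, operand))
    return out


listing = decode(bytes([0x02, 0x1F, 0x05, 0x58,
                        0x20, 0x2A, 0x00, 0x00, 0x00,
                        0xFE, 0x01, 0x2A]))
# listing == [("ldarg.0",), ("ldc.i4.s", 5), ("add",),
#             ("ldc.i4", 42), ("ceq",), ("ret",)]
```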

Links

LLVM

Some issues with LLVM:

LLVM ISA

Instructions

Terminator Instructions: ret, br, switch, indirectbr, invoke, resume, unreachable

Binary Operations: add, fadd, sub, fsub, mul, fmul, udiv, sdiv, fdiv, urem, srem, frem

Bitwise Binary Operations: shl, lshr, ashr, and, or, xor

Vector Operations: extractelement, insertelement, shufflevector

Aggregate Operations: extractvalue, insertvalue

Memory Access and Addressing Operations: alloca, load, store, fence, cmpxchg, atomicrmw, getelementptr

Conversion Operations: trunc, zext, sext, fptrunc, fpext, fptoui, fptosi, uitofp, sitofp, ptrtoint, inttoptr, bitcast

Other Operations: icmp, fcmp, phi, select, call, va_arg, landingpad

Intrinsics

Variable Argument Handling Intrinsics: llvm.va_start, llvm.va_end, llvm.va_copy

Accurate Garbage Collection Intrinsics: llvm.gcroot, llvm.gcread, llvm.gcwrite

Code Generator Intrinsics: llvm.returnaddress, llvm.frameaddress, llvm.stacksave, llvm.stackrestore, llvm.prefetch, llvm.pcmarker, llvm.readcyclecounter

Standard C Library Intrinsics: llvm.memcpy, llvm.memmove, llvm.memset, llvm.sqrt, llvm.powi, llvm.sin, llvm.cos, llvm.pow, llvm.exp, llvm.exp2, llvm.log, llvm.log10, llvm.log2, llvm.fma, llvm.fabs, llvm.copysign, llvm.floor, llvm.ceil, llvm.trunc, llvm.rint, llvm.nearbyint, llvm.round

Bit Manipulation Intrinsics: llvm.bswap, llvm.ctpop, llvm.ctlz, llvm.cttz

Arithmetic with Overflow Intrinsics: llvm.sadd.with.overflow, llvm.uadd.with.overflow, llvm.ssub.with.overflow, llvm.usub.with.overflow, llvm.smul.with.overflow, llvm.umul.with.overflow

Specialised Arithmetic Intrinsics: llvm.fmuladd

Half Precision Floating Point Intrinsics: llvm.convert.to.fp16, llvm.convert.from.fp16

Debugger Intrinsics: llvm.dbg.declare, llvm.dbg.value

Exception Handling Intrinsics: llvm.eh.typeid.for, llvm.eh.sjlj.setjmp, llvm.eh.sjlj.longjmp, llvm.eh.sjlj.lsda, llvm.eh.sjlj.callsite

Trampoline Intrinsics: llvm.init.trampoline, llvm.adjust.trampoline

Memory Use Markers: llvm.lifetime.start, llvm.lifetime.end, llvm.invariant.start, llvm.invariant.end

General Intrinsics: llvm.var.annotation, llvm.ptr.annotation.*, llvm.annotation.*, llvm.trap, llvm.debugtrap, llvm.stackprotector, llvm.stackprotectorcheck, llvm.objectsize, llvm.expect, llvm.donothing

" The LLVM instruction set defines a register based virtual machine with an interesting twist: it has an infinite number of registers. In keeping with its design point as a compiler intermediate representation, LLVM registers enable static single assignment form. A register is used for exactly one value and never reassigned, making it easy for subsequent processing to determine whether values are live or can be eliminated. " -- http://codingrelic.geekhold.com/2010/07/virtual-instruction-sets-opcode.html
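The "each register is assigned exactly once" property can be illustrated with a small Python sketch that renames straight-line assignments into single-assignment form. This is a toy under stated assumptions: the to_ssa function and the name.N numbering scheme are made up here, and it handles only straight-line code, so it needs none of the phi instructions that LLVM uses where control flow merges.

```python
import re


def to_ssa(statements):
    """Rename 'x = expr' statements so that every name is assigned exactly once."""
    version = {}  # variable name -> number of its latest definition
    out = []
    for stmt in statements:
        target, expr = (part.strip() for part in stmt.split("=", 1))

        def latest(match):
            name = match.group(0)
            # Uses refer to the most recent definition; names never assigned
            # in this block (inputs) are left unchanged.
            return f"{name}.{version[name]}" if name in version else name

        expr = re.sub(r"[A-Za-z_]\w*", latest, expr)
        version[target] = version.get(target, 0) + 1
        out.append(f"{target}.{version[target]} = {expr}")
    return out


ssa = to_ssa(["x = a + 1", "x = x * 2", "y = x + x"])
# ssa == ["x.1 = a + 1", "x.2 = x.1 * 2", "y.1 = x.2 + x.2"]
```

After renaming, it is easy to see that x.1 is dead once x.2 has been computed, which is the kind of liveness question the quote refers to.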

LLVM Links

CPython bytecode

Two stacks:

Stack ops:

Arithmetic:
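One way to see the stack and arithmetic instructions for yourself is the standard library's dis module. The sketch below disassembles a small function (add_and_double is just a made-up example); the exact opcode names vary between CPython versions, so no expected listing is shown.

```python
import dis


def add_and_double(a, b):
    return (a + b) * 2


# Collect the opcode names; each instruction pushes to or pops from the
# value stack (loads push, binary arithmetic pops two values and pushes one).
opnames = [ins.opname for ins in dis.get_instructions(add_and_double)]

# Print a human-readable listing of the same bytecode.
dis.dis(add_and_double)
```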

Links:

Parrot

Parrot started out as a runtime for Perl6. Then it refocused on being an interoperable VM target for a variety of languages. However, it hasn't been very successful (due to not enough volunteers being motivated to spend enough hours hacking on it), and even Perl6 is moving away from it.

Even if unsuccessful, it is still of interest because it is one of the few VMs designed with interoperation between multiple HLLs in mind.

Note that multiple core Parrot devs claim that all core Parrot devs hate Parrot's object model: http://whiteknight.github.io/2011/09/10/dust_settles.html http://www.modernperlbooks.com/mt/2012/12/the-implementation-of-perl-5-versus-perl-6.html

It's register-based. It provides garbage collection.

It has a syntactic-sugar IR language called PIR (which handles register allocation and supports named registers), an assembly language called PASM, an AST serialization format called PAST, and a bytecode called PBC.

Its objects are called PMCs (Polymorphic Containers).

Its set of opcodes is extensible (a program written in Parrot can define custom opcodes). Parrot itself contains a lot of opcodes: http://docs.parrot.org/parrot/devel/html/ops.html

At one point there was an effort called M0 to redefine things in terms of a small, core set of opcodes, but I don't know what happened to it or whether it is still ongoing. This appears to be the list of M0 opcodes: https://github.com/parrot/parrot/blob/m0/src/m0/m0.ops . For background, see http://leto.net/dukeleto.pl/2011/05/what-is-m0.html https://github.com/parrot/mole http://reparrot.blogspot.com/2011/07/m0-roadmap-goals-for-q4-2011.html http://gerdr.github.io/on-parrot/rethinking-m0.html . The repo seems to be at https://github.com/parrot/parrot/tree/m0 . There is also an IL that compiles to M0: https://github.com/parrot/m1/blob/master/docs/pddxx_m1.pod .

There was an earlier effort at some sort of core language, called L1: http://wknight8111.blogspot.com/2009/06/l1-language-of-parrot-internals.html . I'm not sure what happened with that either.

The M0 opcodes:

https://github.com/parrot/parrot/blob/m0/docs/pdds/draft/pdd32_m0.pod

control flow: noop goto (go to a fixed offset in the current bytecode segment) goto_if (conditionally go to a fixed offset in the current bytecode segment) goto_chunk (go to an offset in another chunk)

arithmetic: add_i add_n sub_i sub_n mult_i mult_n div_i div_n mod_i (remainder) mod_n isgt_i isgt_n isge_i isge_n convert_i_n (convert from integer to numeric) convert_n_i

bitwise arithmetic: ashr (right bitshift with sign extension) lshr (right bitshift without sign extension) shl (left bitshift) and or xor

Memory/GC ops: gc_alloc sys_alloc sys_free copy_mem

todo: set set_imm deref set_ref set_byte get_byte set_word get_word csym ccall_arg ccall_ret ccall print_s print_i print_n exit

The _i suffix means arithmetic on integers, and the _n suffix means arithmetic on numeric (floating-point) values; both kinds of op work on two registers. From the spec: "Treat *$2 and *$3 as integer or floating-point values, (operate on) them and store the result in *$1."
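To give a feel for how little is needed, here is a toy Python interpreter for a few of the ops named above. It follows the "store the result in *$1" convention from the spec quote for the arithmetic ops, but everything else is my assumption rather than the spec: the operand order of goto_if, the simplified set_imm (which here just stores its second operand), and the run_m0_sketch function itself are made up for illustration.

```python
def run_m0_sketch(program, num_registers=8):
    """Run a list of (op, a, b, c) tuples; return the register file at 'exit'."""
    regs = [0] * num_registers
    pc = 0
    while True:
        op, a, b, c = program[pc]
        pc += 1
        if op == "set_imm":      # assumed simplification: regs[a] = immediate b
            regs[a] = b
        elif op == "add_i":      # *$1 = *$2 + *$3
            regs[a] = regs[b] + regs[c]
        elif op == "sub_i":
            regs[a] = regs[b] - regs[c]
        elif op == "mult_i":
            regs[a] = regs[b] * regs[c]
        elif op == "isgt_i":     # *$1 = 1 if *$2 > *$3 else 0
            regs[a] = int(regs[b] > regs[c])
        elif op == "goto":       # jump to fixed offset a
            pc = a
        elif op == "goto_if":    # assumed order: jump to offset a if regs[b] is nonzero
            if regs[b]:
                pc = a
        elif op == "exit":
            return regs
        else:
            raise ValueError(f"unknown op: {op}")


# Sum the integers 5 + 4 + 3 + 2 + 1 into register 0.
PROGRAM = [
    ("set_imm", 0, 0, 0),   # r0 = 0   (accumulator)
    ("set_imm", 1, 5, 0),   # r1 = 5   (counter)
    ("set_imm", 2, 1, 0),   # r2 = 1
    ("set_imm", 4, 0, 0),   # r4 = 0
    ("add_i", 0, 0, 1),     # r0 = r0 + r1
    ("sub_i", 1, 1, 2),     # r1 = r1 - r2
    ("isgt_i", 3, 1, 4),    # r3 = r1 > r4
    ("goto_if", 4, 3, 0),   # loop back to offset 4 while r3 is nonzero
    ("exit", 0, 0, 0),
]

result = run_m0_sketch(PROGRAM)
# result[0] == 15
```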

todo: explain the cryptic ones; descriptions here: https://github.com/parrot/parrot/blob/m0/docs/pdds/draft/pdd32_m0.pod

Links:

"

Parrot

Parrot is also a register based virtual machine. It defines four types of registers:

    Integers
    Numbers (i.e. floating point)
    Strings
    Polymorphic Containers (PMCs), which reference complex types and structures

Like LLVM, Parrot does not define a maximum number of registers: each function uses as many registers as it needs. Functions do not re-use registers for different purposes by storing their values to memory, they specify a new register number instead. The Parrot runtime will handle assignment of virtual machine registers to CPU registers.

So far as I can tell, integer registers are the width of the host CPU on which the VM is running. A Parrot bytecode might find itself using either 32 or 64 bit integer registers, determined at runtime and not compile time. This is fascinating if correct, though it seems like BigNum handling would be somewhat complicated by this. " -- http://codingrelic.geekhold.com/2010/07/virtual-instruction-sets-opcode.html

MoarVM

MoarVM is a VM built for Perl6's Rakudo implementation (the most canonical Perl6 implementation as of this writing).

Links:

GHC Core

Possible extensions:

GHC STG

GHC Cmm (C--)

Smalltalk

From http://wiki.squeak.org/squeak/2267 , the operations available in Slang include:

"&" "+" "-" min: max: bitAnd: bitOr: bitXor: bitShift:
"<" "<=" "=" ">" ">=" "~=" "=="
isNil notNil
whileTrue: whileFalse: to:do: to:by:do:
ifTrue: ifFalse: ifTrue:ifFalse: ifFalse:ifTrue:
at: at:put:
and: or: not