Bayle Shanks's website: books-programmingLanguages-programmingLanguagesPartCoreLanguagesCaseStudies

Chapter : a tour of some language implementations

Go

Haskell: GHC

Python: CPython

Python: PyPy

PyPy is a reimplementation of Python in Python.

RPython is a restricted subset of Python, with restrictions on dynamic typing, reflection, and metaprogramming that enable type inference at compile time. The RPython compiler is written in Python. The PyPy Python compiler/interpreter is written in a mixture of Python (for slow initialization) and RPython (for the fast part) (i think?).
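As a rough illustration of the kind of restriction involved, here is a sketch in ordinary Python. Both functions run under CPython; the function names are made up for this example, and the claim that the second one would be rejected is an assumption based on the rule in the PyPy coding guide that a variable should hold values of at most one type at each control-flow point.

```python
def total_length(words):
    # `count` is an int on every path, so its type can be inferred statically.
    count = 0
    for word in words:
        count += len(word)
    return count


def describe(flag):
    # Fine in ordinary Python, but `result` is an int on one path and a str
    # on the other; a statically inferable subset has to forbid this merge.
    if flag:
        result = 1
    else:
        result = "none"
    return result
```

The point is only that the first style leaves each variable with a single inferable type, while the second relies on dynamic typing.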

PyPy provides a compatibility layer for CPython's C extension API called cpyext.

RPython:

http://doc.pypy.org/en/latest/faq.html#the-rpython-translation-tool-chain
http://doc.pypy.org/en/latest/getting-started-dev.html
http://doc.pypy.org/en/latest/coding-guide.html#id1
http://code.google.com/p/rpythonic/ (for using RPython as a standalone programming language)

Perl6: Rakudo

Perl6 source code is parsed, and the parse tree is annotated by firing "action methods" during parsing. The annotated AST is called a QAST. The QAST is then compiled to virtual machine bytecode. Several virtual machines are planned to be supported, including Parrot, the JVM, and MoarVM. The compilation steps from source code to bytecode are implemented in a subset of Perl6 called 'NQP' (Not Quite Perl).

The Perl6 object model is being reimplemented in the 6model project.

Links:

Smalltalk: Squeak

Squeak runs on a VM.

The VM is implemented in Slang, a subset of Smalltalk that can be efficiently optimized.

Comparisons and observations

Generally, implementations of a high-level language in a restricted 'core' version of that same language define the core by producing a statically typed variant and disallowing various metaprogramming constructs.

Chapter : a tour of some targets, IRs, VMs and runtimes

(todo move most of the above here)

We'll refer to the top of the stack as TOP0, and to the second position on the stack as TOP1, and to the third position as TOP2, etc.

stacks:

cpython: block stack

stack ops: cpython:

arithmetic:

JVM

stack-oriented

primitive types: int, long, short, byte, char, float, double, boolean, reference

invokedynamic Dynalink (Dynamic Linker Framework)

Links:

http://en.wikipedia.org/wiki/Java_bytecode

on Android

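Android has its own non-standard VM called 'Dalvik'. It is not a JVM proper: it is register-based and runs its own bytecode format, which is translated from Java bytecode.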

It doesn't support invokedynamic.

CLR

DLR for dynamic languages (built on top of CLR)

alternate, open implementation: Mono

https://en.wikipedia.org/wiki/List_of_CIL_instructions

LLVM

Some issues with LLVM:

http://doc.pypy.org/en/latest/faq.html#could-we-use-llvm

CPython bytecode

Two stacks:

main stack
block stack: "Per frame, there is a stack of blocks, denoting nested loops, try statements, and such."

Stack ops:

various rots, dups
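in-place binary arithmetic on TOP0 and TOP1 (like the ordinary binary ops, these pop both operands and push the result; the operation is done in place when TOP1 supports it)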
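To make the TOP0/TOP1 vocabulary concrete, here is a minimal sketch of an operand stack in Python. The class and method names (`Stack`, `rot_two`, `rot_three`, `dup_top`, `binary`) are made up for illustration; they loosely mirror the ROT_TWO, ROT_THREE, DUP_TOP and BINARY_* opcodes described in the 3.3 dis documentation, and say nothing about how CPython implements them.

```python
import operator


class Stack:
    """Toy operand stack; the end of the list is TOP0."""

    def __init__(self, items=()):
        self.items = list(items)

    def push(self, value):
        self.items.append(value)

    def pop(self):
        return self.items.pop()

    def rot_two(self):
        # Swap TOP0 and TOP1.
        self.items[-1], self.items[-2] = self.items[-2], self.items[-1]

    def rot_three(self):
        # Lift TOP1 and TOP2 up one position; the old TOP0 moves down to TOP2.
        top0 = self.items.pop()
        self.items.insert(-2, top0)

    def dup_top(self):
        # Push a second reference to TOP0.
        self.items.append(self.items[-1])

    def binary(self, op):
        # Pop TOP0 and TOP1, then push op(TOP1, TOP0).
        right = self.pop()
        left = self.pop()
        self.push(op(left, right))


stack = Stack([10, 3])          # TOP1 = 10, TOP0 = 3
stack.binary(operator.sub)      # computes 10 - 3
result_sub = stack.items[:]

stack = Stack([1, 2, 3])        # TOP2 = 1, TOP1 = 2, TOP0 = 3
stack.rot_three()
result_rot = stack.items[:]
```

After the subtraction the stack holds a single 7; after `rot_three` the list is `[3, 1, 2]`, i.e. TOP0 = 2, TOP1 = 1, TOP2 = 3.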

Arithmetic:

arithmetical negation (-6 -> 6), boolean negation (-6 -> False), inversion (-6 -> 5), unary positive (does nothing by default but can be overridden by __pos__ in custom classes)
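The four unary operations can be checked directly at the Python level. This is a small sketch; the `Celsius` class is hypothetical and exists only to show `__pos__` being overridden.

```python
class Celsius:
    """Hypothetical class that overrides unary plus via __pos__."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __pos__(self):
        # Unary plus normally returns the operand unchanged; here it
        # returns the absolute value to show that it can be overridden.
        return Celsius(abs(self.degrees))


x = -6
negated = -x          # arithmetical negation
logical = not x       # boolean negation: any nonzero int is truthy
inverted = ~x         # bitwise inversion: ~x == -x - 1
plain_pos = +x        # unary positive leaves an int unchanged
custom_pos = (+Celsius(-6)).degrees
```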

Links:

http://docs.python.org/release/3.3.2/library/dis.html#bytecodes
www.troeger.eu/files/teaching/pythonvm08.pdf

Parrot

Parrot started out as a runtime for Perl6. Then it refocused on being an interoperable VM target for a variety of languages. However, it hasn't been very successful (due to not enough volunteers being motivated to spend enough hours hacking on it), and even Perl6 is moving away from it.

Even if unsuccessful, it is still of interest because it is one of the few VMs designed with interoperation between multiple HLLs in mind.

Note that multiple core Parrot devs claim that all core Parrot devs hate Parrot's object model: http://whiteknight.github.io/2011/09/10/dust_settles.html http://www.modernperlbooks.com/mt/2012/12/the-implementation-of-perl-5-versus-perl-6.html

It's register-based. It provides garbage collection.

It has a syntactic-sugar IR language called PIR (which handles register allocation and supports named registers), an assembly language called PASM, an AST serialization format called PAST, and a bytecode called PBC.

Its objects are called PMCs (Polymorphic Containers).

Its set of opcodes is extensible (a program written in Parrot can define custom opcodes). Parrot itself contains a lot of opcodes: http://docs.parrot.org/parrot/devel/html/ops.html

At one point there was an effort called M0 to redefine things from a small, core set of opcodes, but i don't know what happened to it or whether it is still ongoing; see http://leto.net/dukeleto.pl/2011/05/what-is-m0.html https://github.com/parrot/mole http://reparrot.blogspot.com/2011/07/m0-roadmap-goals-for-q4-2011.html http://gerdr.github.io/on-parrot/rethinking-m0.html . This appears to be the list of M0 opcodes: https://github.com/parrot/parrot/blob/m0/src/m0/m0.ops . The repo seems to be at https://github.com/parrot/parrot/tree/m0 . There is also an IL that compiles to M0: https://github.com/parrot/m1/blob/master/docs/pddxx_m1.pod .

There was an earlier effort for some sort of core language called L1 http://wknight8111.blogspot.com/2009/06/l1-language-of-parrot-internals.html . Not sure what happened with that either.

Links:

http://docs.parrot.org/parrot/latest/html/

Neko

MoarVM

MoarVM is a VM built for Perl6's Rakudo implementation (the most canonical Perl6 implementation as of this writing).

Register-based
primitive types: int, num, str, object
uses libuv for async I/O, libtommath for multiple precision arithmetic, libatomic_ops for atomic operations, uthash for hashes

Links:

http://jnthn.net/papers/2013-yapceu-moarvm.pdf
http://perl6.com/MoarVM_talk1.pdf
opcodes (that's pretty unenlightening; here's the implementation, i think: https://github.com/MoarVM/MoarVM/blob/master/src/core/interp.c )
http://6guts.wordpress.com/2013/05/31/moarvm-a-virtual-machine-for-nqp-and-rakudo/

Smalltalk

From http://wiki.squeak.org/squeak/2267 , the operations available in Slang are:

"&" ... and: or: not
"+" "-" ... min: max:
bitAnd: bitOr: bitXor: bitShift:
"<" "<=" "=" ">" ">=" "~=" "=="
isNil notNil
whileTrue: whileFalse: to:do: to:by:do:
ifTrue: ifFalse: ifTrue:ifFalse: ifFalse:ifTrue:
at: at:put:
... bitInvert32 preIncrement integerValueOf: integerObjectOf: isIntegerObject:

Links:

http://www.cosc.canterbury.ac.nz/wolfgang.kreutzer/cosc205/squeak.html
http://www.marcusdenker.de/talks/07SCGSmalltalk/11Bytecode.pdf
http://www.mirandabanda.org/bluebook/bluebook_chapter26.html#TheBytecodes26
http://www.mirandabanda.org/bluebook/bluebook_chapter27.html
http://www.mirandabanda.org/bluebook/bluebook_chapter28.html (the bytecode instruction set is here)
http://www.mirandabanda.org/bluebook/bluebook_chapter29.html (specification of the primitive methods)

The Smalltalk-80 Bytecodes

Range    Bits                        Function
0-15     0000iiii                    Push Receiver Variable #iiii
16-31    0001iiii                    Push Temporary Location #iiii
32-63    001iiiii                    Push Literal Constant #iiiii
64-95    010iiiii                    Push Literal Variable #iiiii
96-103   01100iii                    Pop and Store Receiver Variable #iii
104-111  01101iii                    Pop and Store Temporary Location #iii
112-119  01110iii                    Push (receiver, true, false, nil, -1, 0, 1, 2) [iii]
120-123  011110ii                    Return (receiver, true, false, nil) [ii] From Message
124-125  0111110i                    Return Stack Top From (Message, Block) [i]
126-127  0111111i                    unused
128      10000000 jjkkkkkk           Push (Receiver Variable, Temporary Location, Literal Constant, Literal Variable) [jj] #kkkkkk
129      10000001 jjkkkkkk           Store (Receiver Variable, Temporary Location, Illegal, Literal Variable) [jj] #kkkkkk
130      10000010 jjkkkkkk           Pop and Store (Receiver Variable, Temporary Location, Illegal, Literal Variable) [jj] #kkkkkk
131      10000011 jjjkkkkk           Send Literal Selector #kkkkk With jjj Arguments
132      10000100 jjjjjjjj kkkkkkkk  Send Literal Selector #kkkkkkkk With jjjjjjjj Arguments
133      10000101 jjjkkkkk           Send Literal Selector #kkkkk To Superclass With jjj Arguments
134      10000110 jjjjjjjj kkkkkkkk  Send Literal Selector #kkkkkkkk To Superclass With jjjjjjjj Arguments
135      10000111                    Pop Stack Top
136      10001000                    Duplicate Stack Top
137      10001001                    Push Active Context
138-143                              unused
144-151  10010iii                    Jump iii + 1 (i.e., 1 through 8)
152-159  10011iii                    Pop and Jump On False iii + 1 (i.e., 1 through 8)
160-167  10100iii jjjjjjjj           Jump (iii - 4) * 256 + jjjjjjjj
168-171  101010ii jjjjjjjj           Pop and Jump On True ii * 256 + jjjjjjjj
172-175  101011ii jjjjjjjj           Pop and Jump On False ii * 256 + jjjjjjjj
176-191  1011iiii                    Send Arithmetic Message #iiii
192-207  1100iiii                    Send Special Message #iiii
208-223  1101iiii                    Send Literal Selector #iiii With No Arguments
224-239  1110iiii                    Send Literal Selector #iiii With 1 Argument
240-255  1111iiii                    Send Literal Selector #iiii With 2 Arguments
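To show how the bit patterns in this table are meant to be read, here is a minimal Python sketch that decodes the one-byte push and store bytecodes (0-119) and computes the offset of a long jump (160-167). The function names `describe` and `long_jump_offset` are made up for this example; it is a reading aid for the table, not a Smalltalk VM.

```python
SPECIAL_PUSH = ["receiver", "true", "false", "nil", "-1", "0", "1", "2"]


def describe(byte):
    """Describe a one-byte Smalltalk-80 bytecode in the range 0-119."""
    if not 0 <= byte <= 119:
        raise ValueError("only bytecodes 0-119 are handled in this sketch")
    if byte <= 15:
        return "Push Receiver Variable #%d" % (byte & 0x0F)
    if byte <= 31:
        return "Push Temporary Location #%d" % (byte & 0x0F)
    if byte <= 63:
        return "Push Literal Constant #%d" % (byte & 0x1F)
    if byte <= 95:
        return "Push Literal Variable #%d" % (byte & 0x1F)
    if byte <= 103:
        return "Pop and Store Receiver Variable #%d" % (byte & 0x07)
    if byte <= 111:
        return "Pop and Store Temporary Location #%d" % (byte & 0x07)
    return "Push " + SPECIAL_PUSH[byte & 0x07]


def long_jump_offset(byte, next_byte):
    """Offset of a two-byte jump (160-167): (iii - 4) * 256 + jjjjjjjj."""
    if not 160 <= byte <= 167:
        raise ValueError("not a long jump bytecode")
    return ((byte & 0x07) - 4) * 256 + next_byte
```

For example, `describe(0b00010011)` falls in the 16-31 range and yields "Push Temporary Location #3", and `long_jump_offset(160, 0)` is -1024, the furthest backward jump this encoding allows.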

Dis

Links:

LuaVM

Links:

asm.js

Links:

http://asmjs.org/spec/latest/

Nock

Links:

http://www.urbit.org/2013/08/22/Chapter-2-nock.html

Runtimes

Apache Portable Runtime

"OS threads, blocking IO, and other system things"

libuv

asynchronous I/O and some other stuff.

Used by nodejs, Rust language, Luvit, Julia, pyuv, MoarVM.

Other useful runtime libraries

libatomic_ops: "Provides implementations for atomic memory update operations on a number of architectures"
libuv for async I/O
libtommath for multiple precision arithmetic
uthash for hashes

RISC CPU architecture instruction sets

For more inspiration about the sorts of instructions that might go into a VM, one might look at popular RISC CPU instruction sets.

My purpose in including this section is NOT to teach the reader the basics of assembly language and computer architecture; i assume that the reader already knows that. I just want to give you more food for thought about 'minimal' programming languages.

What is RISC?

RISC is "Reduced Instruction Set Computer", in contrast to CISC, "Complex Instruction Set Computer". The difference between RISC and CISC is not that RISC necessarily has fewer instructions (although it often does), but rather that RISC instructions are less complex and typically can be executed within a single data memory cycle ( http://en.wikipedia.org/wiki/Reduced_instruction_set_computing#Instruction_set ). Note that this means that RISC instruction sets sometimes eschew operations which access main memory and also do something else, preferring to provide only load/store operations and not other ways of accessing main memory. In more detail:

" what exactly is a RISC processor? This turns out to be quite hard to answer. Here is a list of possible criteria that have been used in the past.

    Instructions are conceptually simple — that is, no baroque things like `evaluate polynomial', or `edit string', both of which were found in the VAX.
    Instructions are uniform length — as opposed, to say, the VAX or M68000 which have a wide range of instruction lengths.
    Instructions use one, or very few, formats — again, unlike the VAX or M68000.
    The instruction set is orthogonal — that is, there are no special rules about what operations are permitted with particular addressing modes (which would complicate the life of a compiler writer).
    There is one, or very few, addressing modes.
    The architecture is load-and-store — that is, only load and store operations access memory — all operate instructions (e.g. arithmetic) only operate on registers.
    The architecture supports two (or perhaps a few more) datatypes — integer and floating point usually." -- http://euler.mat.uson.mx/~havillam/ca/CS323/0708.cs-323004.html

What are popular RISC architectures that might be worth looking at?

As of this writing, ARM is the most commercially successful RISC architecture. Other often-noted ones are SPARC, PowerPC, and MIPS. Of these, some say that MIPS is the prototypical, most elegant example of RISC:

"MIPS is the cleanest successful RISC. PowerPC and (32-bit) ARM have so many extra instructions (even a few operating modes, 32-bit ARM especially) that you could almost call them CISC. SPARC has a few odd features and Itanium is composed entirely of odd features. The latter two are more dead than MIPS." -- http://stackoverflow.com/a/2653951/171761

"Answering now your first question: the reason that MIPS features so prominently in books is that it is almost a perfect exemplar of a RISC system. It is a small, relatively pure RISC implementation that is easily understood and that illustrates RISC concepts well. For pedagogical purposes it is probably the best real-world architecture to show the nature of RISC, along with its warts. Other processors thought of as RISC (ARM, SPARC, Alpha, etc.) are more pragmatic and complicated, obfuscating RISC concepts with some more CISC-like enhancements for better performance or other benefits." -- http://stackoverflow.com/a/2796869/171761

"Almost every instruction found in the MIPS core is found in the other architectures" -- http://www.cis.upenn.edu/~milom/cis501-Fall05/papers/RISC-appendix-C.pdf

"MIPS is the most elegant among the effective RISC architectures; even the competition thought so, as evidenced by the strong MIPS influence to be seen in later architectures like DEC’s Alpha and HP’s Precision. Elegance by itself doesn’t get you far in a competitive marketplace, but MIPS microprocessors have generally managed to be among the most efficient of each generation by remaining among the simplest" -- http://v5.books.elsevier.com/bookscat/samples/9780120884216/9780120884216.PDF

In addition, there are 8-bit microcontrollers ("MCUs"), which are not considered in the same class as CPUs but which also have interesting, small instruction sets. The PIC and the AVR architectures are popular ones. The 8051 is also popular but is older, is CISC, and does not seem to be recommended as often; however, PIC and AVR are only manufactured by their respective developers, whereas 8051-compatibles are manufactured by a bunch of different companies. Note that Arduino, which you may have heard of, uses AVR or ARM. Many people comment that the AVR is easier to program than the (8-bit) PIC ( http://stackoverflow.com/questions/140049/avr-or-pic-to-start-programming-microcontroller , http://www.ladyada.net/library/picvsavr.html ), but others say that PIC is simpler (e.g. http://www.8051projects.net/lofiversion/t17539/what039s-diff039-between-8051pic-avr.html ). i suspect that they mean that the PIC has fewer instructions and a simpler architecture outside of the ISA, while the AVR has a more uniform architecture and more accessible C compilers, but i'm not too sure since i've never used either. The PIC and the AVR are both called RISC, but the AVR has a more RISC-y design (the PIC has indirect addressing), even though it also has a larger instruction set.

MIPS

todo

http://en.wikipedia.org/wiki/MIPS_architecture#MIPS_assembly_language

Atmel AVR

"The AVR processors were designed with the efficient execution of compiled C code in mind and have several built-in pointers for the task.... The mostly regular instruction set makes programming it using C (or even Ada) compilers fairly straightforward. GCC has included AVR support for quite some time, and that support is widely used. In fact, Atmel solicited input from major developers of compilers for small microcontrollers, to determine the instruction set features that were most useful in a compiler for high-level languages." -- http://en.wikipedia.org/wiki/Atmel_AVR#Device_overview

Most AVRs are modified Harvard architecture designs. Harvard architecture means that program code and data are stored in separate memory banks. Modified Harvard architecture means that program code can still be accessed as data. Most, but not all, AVRs can access program code in a read-only fashion; some AVRs can also write to program code memory. The smallest AVRs have 512 bytes of program memory and 32 bytes of data memory.

Here we look at the AVR Minimal Core ISA, found in: ATtiny11, ATtiny12, ATtiny15, ATtiny28. Note that i think the "Reduced Core" may be more current, found in: ATtiny10, ATtiny9, ATtiny5, ATtiny4. I'll omit explanations for instructions that can be guessed from the mnemonic when the mnemonic follows the same pattern as previously listed instructions.

The AVRs have 16 or more general purpose registers. The registers are mapped to RAM. This is a load-store architecture; only the load and store operations access RAM, everything else works with registers. The AVR has a status register composed of 8 flags: Carry, Zero, Negative, Overflow (V), Sign, Half carry (for Binary Coded Decimal arithmetic), Bit copy, Interrupts enabled. Most recent AVRs have an on-chip oscillator. Most AVRs have a 2-stage pipelined architecture ("the next instruction is fetched as the current one is executing") and most instructions are single cycle, allowing almost 1 MIPS per MHz (e.g. an 8 MHz processor can achieve 8 MIPS). AVRs have a 'watchdog timer', which can be used to generate an interrupt or to reboot (reset) the MCU after some amount of time; if enabled, the watchdog timer is continually counting up and the software must periodically reset it with the WDR instruction to prevent it from activating (e.g. it can be used as a timeout failsafe so you don't have to manually reboot hung devices in the field). Most AVRs support JTAG, a debugging and program-code loading mechanism. The stack is allocated out of ordinary RAM and can grow to the entire RAM size. Four addressing modes are supported (at least in Reduced Core): direct, indirect, indirect with pre-decrement, and indirect with post-increment.
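The watchdog behaviour described above can be sketched as a small simulation. This is a toy model in Python under stated assumptions: the `Watchdog` class, the `run` helper, and the tick-based timing are all made up for illustration and do not correspond to real AVR registers or clock rates.

```python
class Watchdog:
    """Toy model of a watchdog timer that reboots the MCU on timeout."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = 0
        self.reboots = 0

    def tick(self):
        # The timer counts up on its own, independent of the program.
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.reboots += 1      # timeout reached: reset the MCU
            self.counter = 0

    def wdr(self):
        # Models the effect of the WDR instruction: restart the count.
        self.counter = 0


def run(ticks, timeout_ticks, wdr_every=None):
    """Run for `ticks` ticks; call wdr() every `wdr_every` ticks if given."""
    dog = Watchdog(timeout_ticks)
    for t in range(1, ticks + 1):
        dog.tick()
        if wdr_every is not None and t % wdr_every == 0:
            dog.wdr()
    return dog.reboots


healthy_reboots = run(20, timeout_ticks=5, wdr_every=3)
hung_reboots = run(20, timeout_ticks=5)
```

A program that resets the timer more often than the timeout never reboots, while one that stops calling `wdr()` (a hung program) is rebooted every `timeout_ticks` ticks.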

Arithmetic instructions: ADD, ADC (add with carry), SUB (subtract), SUBI (subtract immediate), SBC (subtract with carry), SBCI, NEG, INC (increment), DEC, TST (test for zero or minus), CLR (clear register), SER (set register). The AVR Minimal Core ISA does not contain a MUL (multiplication) instruction but the AVR Enhanced Core ISA does.
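The reason for having both ADD and ADC is multi-byte arithmetic on an 8-bit machine: the low bytes are added with ADD, and the carry out of that addition is fed into the high bytes with ADC. A minimal Python sketch of the idea follows; `add8` and `add16` are hypothetical helper names, and the inputs are assumed to fit in 16 bits.

```python
def add8(a, b, carry_in=0):
    """8-bit add returning (result, carry_out); carry_in=0 models ADD,
    carry_in=previous carry models ADC."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add16(x, y):
    """Add two 16-bit values using only 8-bit adds, low byte first."""
    low, carry = add8(x & 0xFF, y & 0xFF)         # ADD on the low bytes
    high, carry = add8(x >> 8, y >> 8, carry)     # ADC on the high bytes
    return (high << 8) | low, carry
```

For example, `add16(0x12FF, 0x0001)` carries out of the low byte and gives `(0x1300, 0)`, while `add16(0xFFFF, 0x0001)` wraps around to `(0x0000, 1)`.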

Branches: RJMP (relative jump), RCALL (relative subroutine call), RET (subroutine return), RETI (interrupt return), CPSE (compare, skip if equal), CP (compare), CPC (compare with carry), CPI (compare immediate), SBRC (skip if bit in register cleared), SBRS, SBIC (skip if bit in I/O register cleared), SBIS, BRBS (branch if status flag set), BRBC (branch if status flag cleared), BREQ (branch if equal), BRNE (branch if not equal), BRCS (branch if carry set), BRCC, BRSH (branch if same or higher), BRLO (branch if lower), BRMI (branch if minus), BRPL (branch if plus), BRGE (branch if greater-than-or-equal, signed), BRLT, BRHS (branch if half-carry set), BRHC, BRTS (branch if T set), BRTC, BRVS (branch if overflow set), BRVC

Transfers: LD (load from memory), ST (store to memory), MOV (move), LDI (load immediate), IN (load from I/O memory), OUT, LPM (load from program memory). The AVR Reduced Core uses LD for program and data memory. The AVR Reduced Core has PUSH and POP instructions. The AVR Minimal Core ISA does not contain a SPM (store to program memory) instruction but the AVR Enhanced Core ISA does.

Bitwise: SBI (set bit in I/O register), CBI, LSL (logical shift left), LSR (logical shift right), ROL (rotate left thru carry), ROR, ASR (arithmetic shift right), SWAP (swap nibbles), BSET (flag set), BCLR, BST, BLD (bit load from T to register), SEC (set carry), CLC (clear carry), SEN, CLN, SEZ, CLZ, SES, CLS, SEV, CLV, SET, CLT, SEH, CLH, AND, ANDI (AND immediate), OR, ORI, EOR (xor), COM (bitwise negation (one's complement)), SBR (set register bit), CBR (clear register bit),

Control: SEI (set interrupt), CLI, BRIE (branch if interrupt enabled), BRID (branch if interrupt disabled), NOP, SLEEP (sleep until interrupt), WDR (reset watchdog timer)

Summary description: We have: load/store, mov, relative jump, relative call/return, comparison and branching (<, <=, =, >, >=), arithmetic (addition, subtraction, negation, comparisons, and set/clears for carry/negative/overflow/zero flags and various registers; many arithmetic operations have multiple forms for carry/no-carry), bitwise arithmetic (negation, and, or, xor, set/clear/skip-if bit, logical shifts, arithmetic shifts, rotate, swap nibbles), interrupts, NOP, sleep (until interrupt), and watchdog timer reset.

Links:

PIC

The PIC design is a Harvard architecture. It has:

"Separate code and data spaces (Harvard architecture).
A small number of fixed length instructions
Most instructions are single cycle execution (2 clock cycles, or 4 clock cycles in 8-bit models), with one delay cycle on branches and skips
One accumulator (W0), the use of which (as source operand) is implied (i.e. is not encoded in the opcode)
All RAM locations function as registers as both source and/or destination of math and other functions.
A hardware stack for storing return addresses
A small amount of addressable data space (32, 128, or 256 bytes, depending on the family), extended through banking
Data space mapped CPU, port, and peripheral registers
ALU status flags are mapped into the data space
The program counter is also mapped into the data space and writable (this is used to implement indirect jumps).

There is no distinction between memory space and register space because the RAM serves the job of both memory and registers, and the RAM is usually just referred to as the register file or simply as the registers....The addressability of memory varies depending on device series, and all PIC devices have some banking mechanism to extend addressing to additional memory...To implement indirect addressing, a "file select register" (FSR) and "indirect register" (INDF) are used. A register number is written to the FSR, after which reads from or writes to INDF will actually be to or from the register pointed to by FSR....PICs have a hardware call stack, which is used to save return addresses....Some operations, such as bit setting and testing, can be performed on any numbered register, but bi-operand arithmetic operations always involve W (the accumulator), writing the result back to either W or the other operand register. To load a constant, it is necessary to load it into W before it can be moved into another register. On the older cores, all register moves needed to pass through W, but this changed on the "high end" cores....PIC cores have skip instructions which are used for conditional execution and branching. The skip instructions are 'skip if bit set' and 'skip if bit not set'. Because cores before PIC18 had only unconditional branch instructions, conditional jumps are implemented by a conditional skip (with the opposite condition) followed by an unconditional branch.

In general, PIC instructions fall into 5 classes:

Operation on working register (WREG) with 8-bit immediate ("literal") operand... One instruction peculiar to the PIC is retlw, load immediate into WREG and return, which is used with computed branches to produce lookup tables.
Operation with WREG and indexed register. The result can be written to either the Working register (e.g. addwf reg,w). or the selected register (e.g. addwf reg,f).
Bit operations. These take a register number and a bit number, and perform one of 4 actions: set or clear a bit, and test and skip on set/clear. The latter are used to perform conditional branches. The usual ALU status flags are available in a numbered register so operations such as "branch on carry clear" are possible.
Control transfers. Other than the skip instructions previously mentioned, there are only two: goto and call.
A few miscellaneous zero-operand instructions, such as return from subroutine, and sleep to enter low-power mode.

... The architectural decisions are directed at the maximization of speed-to-cost ratio. The PIC architecture was among the first scalar CPU designs, and is still among the simplest and cheapest. The Harvard architecture—in which instructions and data come from separate sources—simplifies timing and microcircuit design greatly, and this benefits clock speed, price, and power consumption.

The PIC instruction set is suited to implementation of fast lookup tables in the program space. Such lookups take one instruction and two instruction cycles. Many functions can be modeled in this way. Optimization is facilitated by the relatively large program space of the PIC (e.g. 4096 × 14-bit words on the 16F690) and by the design of the instruction set, which allows for embedded constants. For example, a branch instruction's target may be indexed by W, and execute a "RETLW" which does as it is named - return with literal in W. ...

Limitations

    One accumulator
    Register-bank switching is required to access the entire RAM of many devices
    Operations and registers are not orthogonal; some instructions can address RAM and/or immediate constants, while others can only use the accumulator

The following stack limitations have been addressed in the PIC18 series, but still apply to earlier cores:

    The hardware call stack is not addressable, so preemptive task switching cannot be implemented
    Software-implemented stacks are not efficient, so it is difficult to generate reentrant code and support local variables

With paged program memory, there are two page sizes to worry about

...

The easy to learn RISC instruction set of the PIC assembly language code can make the overall flow difficult to comprehend. Judicious use of simple macros can increase the readability of PIC assembly language.

...

Baseline core devices (12 bit)

These devices feature a 12-bit wide code memory, a 32-byte register file, and a tiny two level deep call stack. They are represented by the PIC10 series, as well as by some PIC12 and PIC16 devices.

...

Generally the first 7 to 9 bytes of the register file are special-purpose registers, and the remaining bytes are general purpose RAM. Pointers are implemented using a register pair: after writing an address to the FSR (file select register), the INDF (indirect f) register becomes an alias for the addressed register. If banked RAM is implemented, the bank number is selected by the high 3 bits of the FSR. This affects register numbers 16–31; registers 0–15 are global and not affected by the bank select bits.

Because of the very limited register space (5 bits), 4 rarely read registers were not assigned addresses, but written by special instructions (OPTION and TRIS).

The ROM address space is 512 words (12 bits each), which may be extended to 2048 words by banking. CALL and GOTO instructions specify the low 9 bits of the new code location; additional high-order bits are taken from the status register. Note that a CALL instruction only includes 8 bits of address, and may only specify addresses in the first half of each 512-word page.

Lookup tables are implemented using a computed GOTO (assignment to PCL register) into a table of RETLW instructions. ...

" -- https://en.wikipedia.org/wiki/PIC_microcontroller

Instructions (remember that W is the accumulator; f is a register number):

misc: NOP, OPTION (copy W to option register), SLEEP, CLRWDT (reset watchdog timer), TRIS k (k = 1,2, or 3) (copy W to one of the tristate registers; tristate registers control port I/O direction; "in 12bit cores, the TRISn registers are not mapped in the file registers space, so the TRIS instruction is the only way of setting port direction for those processors." -- http://www.microchip.com/forums/m157552.aspx ),

moves: MOVF f r (r = f), MOVWF f (f = W)

set/clears:

CLRW (W = 0)
CLRF f (f = 0)
BCF f b (clear bit b of f)
BSF f b (set bit b of f)

bitwise arithmetic:

IORWF f r (r = f bitwise-OR W)
ANDWF f r (r = f bitwise-AND W)
XORWF f r (r = f bitwise-XOR W)
COMF f r (r = bitwise complement of f)
SWAPF f r (r = swap-nibbles of f)
RRF f r (r = rotate-right-thru-carry of f)
RLF f r (r = rotate-left-thru-carry of f)

arithmetic:

ADDWF f r (r = f + W)
SUBWF f r (r = f - W)
INCF f r (r = f + 1)
DECF f r (r = f - 1)

skips:

INCFSZ f r (r = f + 1, then skip if zero)
DECFSZ f r (r = f - 1, then skip if zero)
BTFSC f b (skip if bit b of f is clear)
BTFSS f b (skip if bit b of f is set)

control flow:

CALL k
RETLW k (W = k, then return from subroutine)
GOTO k

immediate addressing mode operations (i.e. operations that take a constant parameter):

MOVLW k (W = k)
IORLW k (W = W bitwise-OR k)
ANDLW k (W = W bitwise-AND k)
XORLW k (W = W bitwise-XOR k)

Summary: PIC's ISA is a non-regular design in that the accumulator register has a special role (every binary operation takes the accumulator as one operand; many operations cannot take constant operands; and constants can only be directly placed into W (and then can be moved into another register with a MOVWF instruction)). There are moves, clears, bit sets and clears, bitwise arithmetic (and, or, xor, not, swap nibbles, rotate-right/left-thru-carry; only and, or, xor can have a constant operand), arithmetic (add, subtract, increment, decrement), skips (inc/dec and skip if zero; skip if bit is set/clear), and control flow (call/return, goto). As noted above, there is an idiomatic lookup-table-in-program-memory implementation using RETLW k instructions to encode each table entry.
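The RETLW lookup-table idiom can be sketched as a tiny simulation. This is a Python toy under stated assumptions: the tuple-based instruction encoding, the `ADDWF_PCL` pseudo-instruction name, and the `run_lookup` helper are made up for this example, and banking and the hardware call stack are not modelled.

```python
def run_lookup(index, table):
    """Simulate a CALL into a table of RETLW instructions; return final W."""
    # Program memory: address 0 is the computed jump (add W to the program
    # counter), followed by one RETLW k instruction per table entry.
    program = [("ADDWF_PCL",)] + [("RETLW", k) for k in table]
    w = index            # the caller loads the index into W before the CALL
    pc = 0
    while True:
        instruction = program[pc]
        pc += 1          # PC already points at the next instruction
        if instruction[0] == "ADDWF_PCL":
            pc += w      # computed GOTO: skip W entries forward
        elif instruction[0] == "RETLW":
            w = instruction[1]   # W = k, then return to the caller
            return w


squares = [0, 1, 4, 9, 16]
looked_up = run_lookup(3, squares)
```

With W = 3 on entry, the computed jump lands on the fourth RETLW, so the subroutine returns with W = 9; the table itself lives entirely in program memory.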

Links:

http://en.wikipedia.org/wiki/PIC_microcontroller

Berkeley RISC II

todo


ARM: Intro

https://en.wikipedia.org/wiki/ARM_architecture#32-bit_architecture

ARM Thumb: "The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions." -- (ARM7TDMI Technical Reference Manual Revision: r4p1) "The Thumb instruction set provides better code density, at the expense of inferior performance....Thumb-2, a major enhancement of the Thumb instruction set. Thumb-2 provides almost exactly the same functionality as the ARM instruction set. It has both 16-bit and 32-bit instructions, and achieves ARM-like performance with Thumb-like code density." -- (RealView Compilation Tools Assembler Guide Version 4.0) https://en.wikipedia.org/wiki/ARM_Cortex-M

Some instructions have immediate addressing modes and others do not. i won't bother to include that information because my interest here is mainly in the instruction set. I leave out some instructions that are, to me, uninteresting variants of existing ones. Note that the purpose of these listings is not accuracy, but rather to get a sense of what sorts of instructions are in RISC-ish CPU instruction sets.

Note that in Thumb2, instructions cannot reference the PC (program counter) or SP (stack pointer) as operands, including destination operand, unless noted. Note that every instruction that returns a result takes an operand specifying the destination register; operations are NOT done in place on the input registers (except when the destination register given is the same as an input register).

ARM: 16-bit Thumb2 instructions

MOV
LSL r1 r2 r3 (logical shift left; r1 := r2 << r3)
LSR
ASR (arithmetic shift right)
ADD (note: the source and/or destination operands for ADD can include SP, the stack pointer; in this way you can get the SP into a register)
SUB (note: the source and destination operands for SUB can include SP, the stack pointer)

ADR (Add immediate to program counter; in this way you can get the PC into a register; useful for getting the address of a 'label' if your assembler translates labels to relative offsets )

CMP

AND EOR (xor)

ADC (Add with Carry; a + b + carry bit)
SBC (Subtract with Carry; a - b - carry bit)
ROR (Rotate Right)
TST (Test bits: TST x y: update condition code flags on Rn AND Rm)
RSB (Reverse subtract (from zero; e.g. negate))
CMP (update condition code flags on Rn - Rm)
CMN (Compare Negative; update condition code flags on Rn + Rm)
ORR (or)
MUL
BIC (Bit Clear: x AND (NOT y))
MVN (Move Negative/NOT: binary negation)
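A few of these are easy to misread from the mnemonic alone, so here is a minimal Python sketch of BIC, MVN, RSB-from-zero, and the flags set by CMP, modelling registers as unsigned 32-bit integers. The function names are made up for this example, and the flag rules are the usual ARM conventions as i understand them (in particular, for a subtraction the carry flag is set when there is no borrow).

```python
MASK = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values


def bic(x, y):
    """Bit Clear: x AND (NOT y)."""
    return x & ~y & MASK


def mvn(x):
    """Move NOT: bitwise complement."""
    return ~x & MASK


def rsb_zero(x):
    """Reverse subtract from zero: 0 - x, i.e. two's-complement negation."""
    return (0 - x) & MASK


def cmp_flags(rn, rm):
    """Return (N, Z, C, V) for rn - rm without storing the result."""
    result = (rn - rm) & MASK
    n = result >> 31                    # sign bit of the result
    z = int(result == 0)
    c = int(rn >= rm)                   # carry set means "no borrow"
    # Signed overflow: operands differ in sign and the result's sign
    # differs from rn's sign.
    v = ((rn ^ rm) & (rn ^ result)) >> 31
    return n, z, c, v
```

For example, `cmp_flags(3, 5)` gives `(1, 0, 0, 0)` (negative, borrow occurred), and `cmp_flags(0x80000000, 1)` gives `(0, 0, 1, 1)` because the most negative signed value minus one overflows.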

BL (branch with link; BL <label>: LR register = address of next instruction, PC = label)

BX (Branch and Exchange; this is used to enter/exit "thumb state")
BLX (Branch with Link and Exchange; this is used to enter/exit "thumb state")

Load and store:

STR (Store word. Addressing modes include immediate, register offset, PC offset, SP offset. Can store list of multiple registers (STMIA).) also STRH for store halfword, STRB for byte

LDR (Load word. Addressing modes include immediate, register offset, SP offset. Can load list of multiple registers (LDMIA).) also LDRH for Load unsigned halfword, LDRSH for signed halfword, LDRB for unsigned byte, LDRSB for signed byte

LDR (load from literal pool instrs) B (unconditional, conditional branch instructions: takes as an operand a 'condition field' (this is different from a condition code), which is one of equal, not equal, Carry Set / Unsigned higher or same, Carry Clear / Unsigned lower, Negative, Positive or zero, Overflow, No overflow, Unsigned higher, Unsigned lower or same, Signed greater than or equal, Signed less than or equal, Signed greater than, Signed less than, always)
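As a rough illustration of how the condition field relates to the condition code flags, the following Python sketch computes the N, Z, C, V flags that a CMP would produce and then evaluates each condition from them. The helper names are hypothetical; the two-letter condition mnemonics (EQ, NE, CS, ...) are the standard ARM ones, which the list above spells out in words.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags after CMP a, b (i.e. a - b), as a tuple (N, Z, C, V)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31                       # sign bit of the result
    z = int(result == 0)
    c = int(a >= b)                        # carry = no borrow (unsigned a >= b)
    v = ((a ^ b) & (a ^ result)) >> 31     # signed overflow of the subtraction
    return n, z, c, v


CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,             # unsigned higher or same
    "CC": lambda n, z, c, v: c == 0,             # unsigned lower
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,  # unsigned higher
    "LS": lambda n, z, c, v: c == 0 or z == 1,   # unsigned lower or same
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}


def branch_taken(cond, a, b):
    """Would 'CMP a, b' followed by 'B<cond>' take the branch?"""
    return CONDITIONS[cond](*cmp_flags(a, b))
```

For example, comparing 1 with 0xFFFFFFFF takes a GT branch (1 > -1 when read as signed) but not an HI branch (1 is not above 4294967295 when read as unsigned), which is the whole reason there are separate signed and unsigned conditions.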

SVC (service (system) call instructions; formerly SWI) SETEND (set endianness) CPS (change processor state; enables and disables specified interrupts) BKPT (software breakpoint) IT (If-Then; "Makes up to four following instructions conditional, according to pattern. pattern is a string of up to three letters. Each letter can be T (Then) or E (Else).")

Adjust stack pointer instructions Increment stack pointer ADD (SP plus immediate) Decrement stack pointer SUB (SP minus immediate)

Sign or zero extend instructions (these are used to convert a signed or unsigned value of a certain byte width into a value of a larger byte width, e.g. to convert a signed byte representing "-10" to a signed word representing "-10"; see http://odellconnie.blogspot.com/2012/03/sign-extension-zero-extension.html ) SXTH (Signed Extend Halfword to Word: SXTH Rd Rm: Rd[31:0] := SignExtend(Rm[15:0])) SXTB (Signed Extend Byte to Word: Rd[31:0] := SignExtend(Rm[7:0])) UXTH (Unsigned Extend Halfword to word: Rd[31:0] := ZeroExtend(Rm[15:0])) UXTB (Unsigned Extend Byte to word: Rd[31:0] := ZeroExtend(Rm[7:0]))
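A small Python sketch of what the four extend instructions compute, using the "-10" example from above. The function names mirror the mnemonics but are otherwise just illustrative helpers; registers are modeled as unsigned 32-bit integers.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Copy bit (bits - 1) of value into all higher bits of a 32-bit word."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return ((value ^ sign_bit) - sign_bit) & MASK32


def zero_extend(value, bits):
    """Keep the low 'bits' bits and fill the rest of the word with zeros."""
    return value & ((1 << bits) - 1)


def sxtb(rm):
    return sign_extend(rm, 8)


def sxth(rm):
    return sign_extend(rm, 16)


def uxtb(rm):
    return zero_extend(rm, 8)


def uxth(rm):
    return zero_extend(rm, 16)
```

The byte 0xF6 is -10 as a signed byte: sxtb(0xF6) gives 0xFFFFFFF6 (still -10 as a signed word), while uxtb(0xF6) gives 0x000000F6 (246).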

Compare and branch on (non-)zero instructions CBZ (Compare and branch on zero; CBZ r <label>: if r == 0, goto <label>) CBNZ (Compare and branch on non-zero)

PUSH (push selected registers onto stack) POP (pop selected registers from stack)

Reverse byte instructions REV (Byte-Reverse Word, i.e. reverse the ordering of the four bytes in the word (and put the result in the destination register)) REV16 (Byte-Reverse Packed Halfword, i.e. reverse the ordering of the two bytes in both halfwords) REVSH (Byte-Reverse Signed Halfword, i.e. reverse the bytes in the low halfword, and sign extend the result to fill the whole word)
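The three byte-reversal instructions are easy to confuse, so here is a short Python sketch of each. These are illustrative helpers on unsigned 32-bit integers, not an official definition.

```python
MASK32 = 0xFFFFFFFF


def rev(x):
    """REV: reverse the order of all four bytes in the word."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")


def rev16(x):
    """REV16: swap the two bytes inside each halfword independently."""
    x &= MASK32
    return ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)


def revsh(x):
    """REVSH: swap the bytes of the low halfword, then sign extend to 32 bits."""
    low = ((x & 0xFF) << 8) | ((x >> 8) & 0xFF)
    return (low | 0xFFFF0000) if (low & 0x8000) else low
```

REV is the one typically used for endianness conversion of a whole word (e.g. network byte order); REV16 and REVSH do the same job for 16-bit quantities.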

NOP-compatible hint instructions: NOP YIELD (Yield control to alternative thread) WFE (Wait For Event) WFI (Wait For Interrupt) SEV (Send event; signal event in multiprocessor system)

ARM: 32-bit Thumb2 instructions

ORN (OR (not)) TEQ (update condition code flags on a XOR b) MOVT (move the source halfword into the top halfword of the destination register) BFC (Bit Field Clear; set specified bits to zero; takes a starting bit and a bitwidth) BFI (Bit Field Insert; set specified bits to specified values; takes a starting bit and a bitwidth and a source value)

SBFX (Signed Bit Field extract) SSAT (Signed saturate, LSL, ASR) SSAT16 (Signed saturate 16-bit) UBFX (Unsigned Bit Field extract) USAT (Unsigned saturate, LSL, ASR) USAT16 (Unsigned saturate 16-bit)
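A hedged Python sketch of the bit field and saturate instructions, to show what "takes a starting bit and a bitwidth" means in practice. The argument order and helper names are my own; the optional shift that SSAT/USAT can apply to their input is left out.

```python
MASK32 = 0xFFFFFFFF


def field_mask(lsb, width):
    """Mask with 'width' one-bits starting at bit position 'lsb'."""
    return (((1 << width) - 1) << lsb) & MASK32


def bfc(rd, lsb, width):
    """BFC: clear the chosen field of rd."""
    return rd & ~field_mask(lsb, width) & MASK32


def bfi(rd, rn, lsb, width):
    """BFI: replace the chosen field of rd with the low 'width' bits of rn."""
    mask = field_mask(lsb, width)
    return (rd & ~mask & MASK32) | ((rn << lsb) & mask)


def ubfx(rn, lsb, width):
    """UBFX: extract the field and zero extend it."""
    return (rn >> lsb) & ((1 << width) - 1)


def sbfx(rn, lsb, width):
    """SBFX: extract the field and sign extend it to 32 bits."""
    field = ubfx(rn, lsb, width)
    sign_bit = 1 << (width - 1)
    return ((field ^ sign_bit) - sign_bit) & MASK32


def to_signed(x):
    """Reinterpret an unsigned 32-bit value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def ssat(rn, bits):
    """SSAT: clamp the signed value of rn to the signed 'bits'-bit range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, to_signed(rn))) & MASK32


def usat(rn, bits):
    """USAT: clamp the signed value of rn to the range 0 .. 2**bits - 1."""
    return max(0, min((1 << bits) - 1, to_signed(rn)))
```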

PKH (Pack halfword, BT, TB) RRX (Rotate Right with Extend)

Signed and unsigned extend instructions with optional addition: SXTAB (Signed extend byte and add) SXTAB16 (Signed extend two bytes to halfwords, and add) SXTAH (Signed extend halfword and add) SXTB16 (Signed extend two bytes to halfwords) UXTAB (Unsigned extend byte and add) UXTAB16 (Unsigned extend two bytes to halfwords, and add) UXTAH (Unsigned extend halfword and add) UXTB16 (Unsigned extend two bytes to halfwords)

SIMD add and subtract: QADD16, UADD16, QADD8, UADD8, QASX, UASX, QSUB16, UHADD16, QSUB8, UHADD8, QSAX, UHASX, SADD16, UHSUB16, SADD8, UHSUB8, SASX, UHSAX, SHADD16, UQADD16, SHADD8, UQADD8, SHASX, UQASX, SHSUB16, UQSUB16, SHSUB8, UQSUB8, SHSAX, UQSAX, SSUB16, USUB16, SSUB8, USUB8, SSAX

Mnemonic element: Meaning
Q prefix: Signed saturating arithmetic.
S prefix: Signed arithmetic, modulo 2^8 or 2^16.
SH prefix: Signed halving arithmetic. The result of the calculation is halved.
U prefix: Unsigned arithmetic, modulo 2^8 or 2^16.
UH prefix: Unsigned halving arithmetic. The result of the calculation is halved.
UQ prefix: Unsigned saturating arithmetic.
16 suffix: The instruction performs two 16-bit calculations.
8 suffix: The instruction performs four 8-bit calculations.
ASX mnemonic: The instruction performs one 16-bit addition and one 16-bit subtraction. The X indicates that the halfwords of the second operand are exchanged before the operation.
SAX mnemonic: The instruction performs one 16-bit subtraction and one 16-bit addition. The X indicates that the halfwords of the second operand are exchanged before the operation.
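To see how the prefixes change the result, here is a minimal Python sketch of three of the unsigned 8-bit variants (UADD8, UQADD8, UHADD8) applied to the same inputs. The lane helpers are hypothetical names of mine, and the GE flags that the real UADD8 also sets are ignored here.

```python
def lanes8(x):
    """Split a 32-bit word into four 8-bit lanes, lowest byte first."""
    return [(x >> (8 * i)) & 0xFF for i in range(4)]


def pack8(lanes):
    """Pack four 8-bit lanes (lowest byte first) back into a 32-bit word."""
    return sum((v & 0xFF) << (8 * i) for i, v in enumerate(lanes))


def uadd8(a, b):
    """U prefix: each lane wraps around modulo 2**8."""
    return pack8([(x + y) & 0xFF for x, y in zip(lanes8(a), lanes8(b))])


def uqadd8(a, b):
    """UQ prefix: each lane saturates at 0xFF instead of wrapping."""
    return pack8([min(x + y, 0xFF) for x, y in zip(lanes8(a), lanes8(b))])


def uhadd8(a, b):
    """UH prefix: each lane holds the full sum halved, so it never overflows."""
    return pack8([(x + y) >> 1 for x, y in zip(lanes8(a), lanes8(b))])
```

With a = 0x01FF80F0 and b = 0x01017F20, the lowest lane is 0xF0 + 0x20 = 0x110: UADD8 wraps it to 0x10, UQADD8 clamps it to 0xFF, and UHADD8 halves it to 0x88.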

CLZ (Count Leading Zeros (just what it sounds like)) QADD (Saturating Add) QDADD (Saturating Double and Add) QDSUB (Saturating Double and Subtract) QSUB (Saturating Subtract) RBIT (Reverse Bits) SEL (Select bytes; uses the 4 GE flag bits, which control, for each of the four byte positions of the output, which of the two input words will contribute that byte)
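A short Python sketch of CLZ, RBIT, QADD and SEL. As before these are illustrative helpers on unsigned 32-bit integers; in particular sel takes the four GE bits as an explicit argument, whereas the real instruction reads them from the status register.

```python
MASK32 = 0xFFFFFFFF


def clz(x):
    """CLZ: number of zero bits above the highest set bit (32 for x == 0)."""
    return 32 - (x & MASK32).bit_length()


def rbit(x):
    """RBIT: reverse the order of all 32 bits."""
    return int(format(x & MASK32, "032b")[::-1], 2)


def to_signed(x):
    """Reinterpret an unsigned 32-bit value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def qadd(a, b):
    """QADD: signed add that clamps to the 32-bit signed range."""
    total = to_signed(a) + to_signed(b)
    total = max(-(1 << 31), min((1 << 31) - 1, total))
    return total & MASK32


def sel(rn, rm, ge):
    """SEL: byte i comes from rn if bit i of ge is set, otherwise from rm."""
    result = 0
    for i in range(4):
        source = rn if (ge >> i) & 1 else rm
        result |= source & (0xFF << (8 * i))
    return result
```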

multiply/divide and accumulate (add/subtract the result of multiplying to the destination, in-place), with various different byte widths of the operands and destination register(s): MLA (multiply and accumulate; x + (y*z)) MLS (multiply and subtract) SMLAxy (Signed Multiply-Accumulate Add, with double-length result) SMLAD (Signed Dual Multiply-Accumulate Add) SMLAWx (Signed Multiply-Accumulate Add) SMLSD (Signed Dual Multiply Subtract and Accumulate) SMMLA (Signed 32 + 32 x 32-bit, most significant word) SMMLS (Signed 32 – 32 x 32-bit, most significant word) SMMUL (Signed 32 x 32-bit, most significant 32-bit word) SMUAD (Signed Dual Multiply Add) SMULxy SMULWx SMUSD (Signed Dual Multiply Subtract) USAD8 (Unsigned Sum of Absolute Differences) USADA8 (Unsigned Accumulate Absolute Differences)

with 64-bit results (two registers to hold result): SMULL (Signed multiply with double-length result) UMULL (Unsigned multiply with double-length result) SMLALxy (Signed multiply with double-length result and accumulate) SMLALD (Signed Multiply Accumulate Long Dual) SMLSLD (Signed Multiply Subtract accumulate Long Dual) UMLAL (Unsigned 64 + 32 x 32) UMAAL (Unsigned multiply and accumulate with double-length result). Listed in the same group, but producing ordinary 32-bit results: SDIV (Signed divide) UDIV (Unsigned divide)
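The multiply family is mostly about which bits of the product you keep. Here is a hedged Python sketch of four representative cases (MLA, MLS, UMULL, SMMUL). The argument order (accumulator first) is my own choice for readability and does not match the assembly operand order.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret an unsigned 32-bit value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mla(acc, y, z):
    """MLA: acc + y*z, keeping only the low 32 bits."""
    return (acc + y * z) & MASK32


def mls(acc, y, z):
    """MLS: acc - y*z, keeping only the low 32 bits."""
    return (acc - y * z) & MASK32


def umull(a, b):
    """UMULL: full 64-bit unsigned product as a (low word, high word) pair."""
    product = (a & MASK32) * (b & MASK32)
    return product & MASK32, product >> 32


def smmul(a, b):
    """SMMUL: most significant 32 bits of the signed 64-bit product."""
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK32
```

For instance umull(0xFFFFFFFF, 0xFFFFFFFF) yields the pair (0x00000001, 0xFFFFFFFE), i.e. the two registers that together hold 0xFFFFFFFE00000001.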

loads and stores:

add versions for postindexing, and for double words
PLD, PLI (preload)

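LDRD (load double) STRD (store double) LDREX (load exclusive word; marks the address for exclusive access, used to build atomic read-modify-write sequences such as semaphores) STREX (store exclusive word; the store only succeeds if the exclusive mark is still in place, and a status result says whether it did) CLREX (clear local processor exclusive tag, i.e. drop the mark without storing)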
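To give a feel for how the exclusive load/store pair is used, here is a toy single-threaded Python model. It is only a sketch under simplifying assumptions: the class, its method names, and the single-address monitor are hypothetical, and real hardware tracks exclusivity per processor and at a coarser granularity. The one detail taken from ARM is that STREX reports 0 on success and 1 on failure.

```python
class ExclusiveMonitor:
    """Toy model of memory plus one exclusive-access tag."""

    def __init__(self):
        self.memory = {}
        self.tagged = None  # address currently marked for exclusive access

    def ldrex(self, addr):
        """Load a word and mark its address as exclusive."""
        self.tagged = addr
        return self.memory.get(addr, 0)

    def strex(self, addr, value):
        """Store only if the mark is intact. Returns 0 on success, 1 on failure."""
        if self.tagged != addr:
            return 1
        self.memory[addr] = value
        self.tagged = None
        return 0

    def clrex(self):
        """Drop the exclusive mark without storing anything."""
        self.tagged = None

    def other_write(self, addr, value):
        """A write by some other agent, which invalidates a matching mark."""
        self.memory[addr] = value
        if self.tagged == addr:
            self.tagged = None


def atomic_increment(monitor, addr):
    """Retry loop: reload and retry until the exclusive store succeeds."""
    while True:
        value = monitor.ldrex(addr)
        if monitor.strex(addr, value + 1) == 0:
            return value + 1
```

The retry loop in atomic_increment is the usual shape of code built on these instructions: if anything else touched the location between the load and the store, the store fails and the whole sequence is repeated.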

TBB (Table Branch Byte) TBH (Table Branch Halfword)

LDMDB / LDMEA (Load Multiple Decrement Before / Empty Ascending) RFE (Return From Exception) SRS (Store Return State) STMDB / STMFD (Store Multiple Decrement Before / Full Descending)

MRS (Move from Status register to ARM Register, e.g. put the condition codes into a register) MSR (Move from ARM register to Status register, e.g. copy a register over the condition codes) SUBS (Return From Exception without stack)

DBG (Debug hint)

Special control operations: CLREX (Clear Exclusive) DSB (Data Synchronization Barrier) DMB (Data Memory Barrier) ISB (Instruction Synchronization Barrier)

Coprocessor instructions: not listed

Links:

ARM: Cortex M profile

Cortex M0, M0+, and M1 only have these instructions:

16-bit: ADC, ADD, ADR, AND, ASR, B, BIC, BKPT, BLX, BX, CMN, CMP, CPS, EOR, LDM, LDR, LDRB, LDRH, LDRSB, LDRSH, LSL, LSR, MOV, MUL, MVN, NOP, ORR, POP, PUSH, REV, REV16, REVSH, ROR, RSB, SBC, SEV, STM, STMIA, STR, STRB, STRH, SUB, SVC, SXTB, SXTH, TST, UXTB, UXTH, WFE, WFI, YIELD

32-bit: BL (branch with link), DMB (Data Memory Barrier; Ensure the order of observation of memory accesses), DSB (Data Synchronization Barrier; Ensure the completion of memory accesses), ISB (Instruction Synchronization Barrier; flush processor pipeline and branch prediction logic), MRS (Move from Status register), MSR (move to status register)

Note that the 16-bit instruction set is identical to the 16-bit thumb-2 instruction set above, except for SETEND (set endianness), IT (if-then), CBZ (Compare and branch on zero), CBNZ. (Also, BL here appears only as 32-bit, whereas it was listed in the 16-bit instruction set above. As far as I understand, in the original Thumb encoding BL is a pair of 16-bit halfwords that together act as one 32-bit instruction, which would explain why it shows up in both places.) IT, CBZ, CBNZ are added in the Cortex M3, as well as a bunch of 32-bit instructions:

new 32-bit instructions in the Cortex M3: BFC (Bit Field Clear), BFI (Bit Field Insert), CDP (coprocessor data processing), CLREX (clear local processor exclusive tag), CLZ (count leading zeros), DBG (debug hint), various loads (LDC, LDMA, LDMDB, LDRBT, LDRD, LDREX, LDREXB, LDREXH, LDRHT, LDRSB, LDRSBT, LDRSHT, LDRT), MCR (move to coprocessor from register), MLS (multiply and subtract), MCRR (move to coprocessor from two registers), MLA (multiply and accumulate; x + (y*z)), MOVT (move the source halfword into the top halfword of the destination register), MRC (move to register from coprocessor), MRRC (move to two registers from coprocessor), ORN (x or (not y)), PLD (preload data), PLDW, PLI (preload instructions), RRX (Rotate Right with Extend), SBFX (Signed Bit Field extract), SDIV (Signed divide), SMLAL (an SMULL-like thingee: signed multiply with 64-bit accumulate), SMULL, SSAT (signed saturate), STC (store coprocessor), various stores (STMDB, STRBT, STRD, STREX, STREXB, STREXH, STRHT, STRT), TBB (Table Branch Byte), TBH (Table Branch Halfword), TEQ (update condition code flags on a XOR b), UBFX (Unsigned Bit Field extract), UDIV (Unsigned divide), other multiply, multiply-accumulate, and saturate instructions (UMLAL, UMULL, USAT)

Links:

https://en.wikipedia.org/wiki/ARM_Cortex-M

ARM: summary

It seems like the 'core' instruction set is indeed the set found in Cortex M0, M0+, and M1. This is a subset of the 16-bit thumb2 set, but with a few 32-bit instructions too.

Those instructions are: MOV, arithmetic (ADD, ADC, SUB, SBC, RSB, MUL), bitwise arithmetic (LSL, LSR, ASR, AND, ORR, EOR, ROR, BIC, MVN), byte reversals (REV, REV16, REVSH), get/set special registers (ADR, MRS, MSR), comparisons (CMP, CMN, TST), branching (B, BL), load/stores with immediate, register offset, PC, SP offset, and multiple registers, push/pop, extension (SXTH, SXTB, UXTH, UXTB), misc control (SVC, NOP), multiprocessing and memory barriers (YIELD, WFE, WFI, SEV, DMB, DSB), and a few other misc instructions (ISB and some others).

When we get to the Cortex M3 we add 32-bit instructions for bit fields (BFC/BFI, SBFX, UBFX), multiprocessing (LDREX, STREX, CLREX), bitwise arithmetic (CLZ, MOVT, ORN, RRX, saturating versions of things), comparisons (TEQ), various loads and stores (with postindexing and various widths), arithmetic (division, multiply-accumulate (add/subtract) operations with various widths), branch tables (TBB, TBH), and some other misc instructions (DBG, PLD, PLI).

RISC summary

The AVR, the PIC, and the ARM all have:

All but the PIC have load/store, relative jumps (higher end PICs have this), <= etc comparisons/branching (higher-end PICs have this), bitwise arithmetic (LSL, LSR, ASR; higher-end PICs have this), negation, carry/no-carry forms of addition and subtraction (higher-end PICs have this), access to the stack pointer (higher-end PICs have this), register indirect addressing for loads (higher-end PICs have this too). AVR Reduced Core and ARM and higher-end PICs have PUSH and POP.

The PIC doesn't have load/store because it memory maps into registers and uses banked memory to deal with the fact that it only has so many registers. The PIC is the only one with banked memory. The ARM doesn't have single instruction bit set/clear until you get to the M3, but the PIC and the AVR do, and they also have a skip/branch-if-bit. Only the ARM has MUL (but the AVR Enhanced Core does, as do higher-end PICs), width extension instructions, multiprocessing instructions, multiple registers for load/store/push/pop, byte reversals; the ARM is lacking increment/decrement and swap nibbles, which the other two do have. Higher-end AVRs and ARMs have post-increment addressing for load/stores. Higher-end ARMs and PICs have multiply-accumulate and division.

Irregularities sometimes seen include not letting anything use immediate (constant) addressing, having an accumulator register with a special role; having to move some things into a certain register first and move it again from there to where you want it, and not having full access to the PC and SP.

In summary, it seems like a reasonable 'common core' would consist of:

mov, jump, call, addition, subtraction, bitwise arithmetic (and/or/not/xor, rotate right, bit clears), NOP, a way to make some hardware-specific calls, branch on zero, branch on condition flag, ways to get and set the condition flags, an operand to specify a destination register for each instruction, load/store, relative jumps, <= etc comparisons/branching, LSL, LSR, ASR, carry/no-carry forms of addition and subtraction, access to the stack pointer, register indirect addressing for loads, PUSH, POP, single instruction bit set/clear, skip/branch-if-bit.

A slightly extended core would also have MUL, post-increment addressing, multiply-accumulate, division, increment, decrement, swap nibbles.

RISC links

Only tangentially of interest:

3 ISA paradigms

Accumulator (certain registers for certain operations)
stack machine (operations work on the top few elements of the stack)
the winner was: (general purpose) registers

Special-purpose registers

PC (program counter)
zero register (always 0; a useful constant)
Condition code/flag registers: Carry, Zero, Negative, Overflow
Link registers: stores return address of caller; assigned during subroutine call
base register: added to memory addresses

Register Addressing modes

From http://www.cl.cam.ac.uk/teaching/0405/CompArch/mynotes.pdf :

Classic RISC Addressing Modes:

Register: Mov r0 <- r1
Immediate: Mov r0 <- 42
Register Indirect: Ldl r0 <- Mem[r1] (follow a pointer held in r1)
Register Indirect with Displacement: Ldl r0 <- Mem[128 + r1]

Less RISCy addr modes (ARM and PowerPC):

Register plus Register (Indexed): Ldl r0 <- Mem[r1 + r2]
- use: index into array
Register plus Scaled Register: Ldl r0 <- Mem[r1 + r2*k]
- multiplier k is specified by the programmer; must be power of 2
- use: index into array whose elements are of size k
Register Indirect with Displacement and Update
- like Register Indirect with Displacement, but composed with C's ++
- two forms, pre- and post-; in C, *(++p) and *(p++)
- use: creating stack (local) variables
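The register-based modes listed so far differ only in how the effective address is computed and whether the base register is updated. A minimal Python sketch, assuming a register file and memory modeled as plain dicts (all names here are hypothetical):

```python
def effective_address(regs, base, disp=0, index=None, scale=1):
    """Address for register indirect, displacement, indexed and scaled modes."""
    if scale <= 0 or scale & (scale - 1):
        raise ValueError("scale must be a power of 2")
    addr = regs[base] + disp
    if index is not None:
        addr += regs[index] * scale
    return addr


def load(regs, mem, base, disp=0, index=None, scale=1):
    """Ldl-style load through any of the non-updating modes."""
    return mem[effective_address(regs, base, disp, index, scale)]


def load_pre_update(regs, mem, base, disp):
    """Update the base first, then load: like *(p += disp) in C."""
    regs[base] += disp
    return mem[regs[base]]


def load_post_update(regs, mem, base, disp):
    """Load first, then update the base: like *p followed by p += disp in C."""
    value = mem[regs[base]]
    regs[base] += disp
    return value


# Usage: r1 holds an array base, r2 an element index, elements are 4 bytes.
regs = {"r1": 0x1000, "r2": 3}
mem = {0x1000: 11, 0x1004: 22, 0x100C: 33, 0x1080: 44}
third = load(regs, mem, "r1", index="r2", scale=4)  # Mem[r1 + r2*4]
```

The scaled form is what makes array indexing a single instruction: the index register holds the element number and the scale supplies the element size.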

CISC Addressing Modes:

Direct (Absolute): Mov r0 <- Mem[1000]
- Offset often large
- x86 Implicit base address
Memory Indirect: Mov r0 <- Mem[Mem[r1]]
- use: C ptr, linked lists
PC Indirect with Displacement: Mov r0 <- Mem[ProgramCounter + 128]
- use: Accessing constants