Bayle Shanks's website: proj-plbook-plChIsaMisc

Table of Contents for Programming Languages: a survey

Y86

A teaching language used on the web page http://www.plantation-productions.com/Webster/www.artofasm.com/Linux/HTML/ISA.html

" For example, most processors you find will have instructions like the following:

Data movement instructions (e.g., MOV)

Arithmetic and logical instructions (e.g., ADD, SUB, AND, OR, NOT)

Comparison instructions

A set of conditional jump instructions (generally used after the compare instructions)

Input/Output instructions

Other miscellaneous instructions "

" The Y86 CPU provides 20 instructions. Seven of these instructions have two operands, eight of these instructions have a single operand, and five instructions have no operands at all. The instructions are MOV (two forms), ADD, SUB, CMP, AND, OR, NOT, JE, JNE, JB, JBE, JA, JAE, JMP, BRK, IRET, HALT, GET, and PUT. "

HALT is program termination. BRK is a temporary halt that can be resumed from. JB (jump if below) and JA (jump if above) are JLT and JGT. IRET is return from interrupt. GET and PUT are input and output.
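A rough sketch of how the conditional jumps relate to a preceding CMP. It assumes the x86 convention that "below" and "above" are unsigned comparisons, and a 16-bit word; the function names and the flag model are made up for illustration and are not taken from the Y86 description.

```python
# Hypothetical model: CMP a, b sets a zero flag and a carry (borrow) flag;
# each conditional jump tests those flags. The 16-bit width is an assumption.
MASK = 0xFFFF


def cmp_flags(a, b):
    """Return (zero, carry) as an x86-style CMP a, b would set them."""
    a &= MASK
    b &= MASK
    return a == b, a < b  # carry set means an unsigned borrow, i.e. a < b


CONDITIONS = {
    "JE": lambda z, c: z,
    "JNE": lambda z, c: not z,
    "JB": lambda z, c: c,
    "JBE": lambda z, c: c or z,
    "JA": lambda z, c: not c and not z,
    "JAE": lambda z, c: not c,
}


def taken(mnemonic, a, b):
    """Would the jump be taken after CMP a, b?"""
    z, c = cmp_flags(a, b)
    return CONDITIONS[mnemonic](z, c)
```

For example, taken("JB", 1, 2) is true and taken("JA", 1, 2) is false.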

"The Y86 processor supports the register addressing mode, the immediate addressing mode, the indirect addressing mode, the indexed addressing mode, and the direct addressing mode."

Later, they mention expansion to the NEG (arithmetic negation) instruction, and the SHL, SHR, ROL, ROR, and XOR instructions.

HLA

http://www.plantation-productions.com/Webster/HighLevelAsm/HLADoc/HLARef/HLARef_html/HLAReference.htm

8051

todo

according to Wikipedia ( http://en.wikipedia.org/wiki/8051#Important_features_and_applications ), two distinctive and important features of the 8051 are bit-level boolean logic operations, which "helped cement the 8051's popularity in industrial control applications because it reduced code size by as much as 30%.", and "four bank selectable working register sets which greatly reduce the amount of time required to complete an interrupt service routine. With a single instruction the 8051 can switch register banks as opposed to the time consuming task of transferring the critical registers to the stack or designated RAM locations. These registers also allowed the 8051 to quickly perform a context switch."
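A small sketch of what the bank switch amounts to. On the 8051 the four banks of R0..R7 occupy internal RAM addresses 0x00..0x1F, and the RS1:RS0 bits of the PSW (bits 4 and 3) select the active bank. The helper names below are made up for illustration.

```python
# PSW bits 4 and 3 (RS1, RS0) select one of four banks of R0..R7,
# which live in internal RAM at 0x00..0x1F.
RS0_BIT = 3
BANK_MASK = 0b11 << RS0_BIT


def select_bank(psw, bank):
    """Return psw with RS1:RS0 replaced by bank (0..3)."""
    return ((psw & ~BANK_MASK) & 0xFF) | ((bank & 0b11) << RS0_BIT)


def register_address(psw, n):
    """Internal RAM address of register Rn for the bank selected in psw."""
    if not 0 <= n <= 7:
        raise ValueError("register number must be 0..7")
    bank = (psw >> RS0_BIT) & 0b11
    return bank * 8 + n
```

An interrupt handler that switches to bank 2 changes only two PSW bits, after which R3 refers to address 0x13 instead of 0x03; nothing is copied to the stack.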

Berkeley RISC II

todo

OpenRISC

This has an unofficial Debian port (or1k). It is also of interest because it is an open project attempting to provide a generally useful design; one might hope that its core ISA is close to a common core with few idiosyncrasies.

A list of all mandatory instructions in the OpenRISC 1200 core (as of this time the only extant implementation, I think), omitting all instructions whose mnemonic is the same as another but with 'i' appended, which I took to be immediate addressing mode variants (from http://openrisc.net/or1200-spec.html#_instructions ):

    add - Add Signed
    and
    bf - Branch if Flag
    bnf - Branch if No Flag
    j - Jump (immediate)
    jal - Jump and Link (immediate)
    jalr - Jump and Link Register
    jr - Jump (register)
    lbs - Load Byte and Extend with Sign
    lbz - Load Byte and Extend with Zero
    lhs - Load Half Word and Extend with Sign
    lhz - Load Half Word and Extend with Zero
    lws - Load Single Word and Extend with Sign
    lwz - Load Single Word and Extend with Zero
    mfspr - Move From Special-Purpose Register
    movhi - Move Immediate High
    mtspr - Move To Special-Purpose Register
    nop
    or
    rfe - Return From Exception
    rori - Rotate Right with Immediate (the 6-bit immediate value specifies the number of bit positions)
    sb - Store Byte (with immediate offset)
    sfeq - Set Flag if Equal (cmp)
    sfges - Set Flag if Greater or Equal Than Signed
    sfgeu - Set Flag if Greater or Equal Than Unsigned
    sfgts - Set Flag if Greater Than Signed
    sfgtu - Set Flag if Greater Than Unsigned
    sfleu - Set Flag if Less or Equal Than Unsigned
    sflts - Set Flag if Less Than Signed
    sfltu - Set Flag if Less Than Unsigned
    sfne - Set Flag if Not Equal
    sh - Store Half Word ("The offset is sign-extended and added to the contents of general-purpose register rA. The sum represents an effective address. The low-order 16 bits of general-purpose register rB are stored to memory location addressed by EA")
    sll - Shift Left Logical (number of bit positions specified in register)
    sra - Shift Right Arithmetic (number of bit positions specified in register)
    srl - Shift Right Logical (number of bit positions specified in register)
    sub - Subtract Signed
    sw - Store Single Word
    sys - System Call
    trap - Trap ("Execution of trap instruction results in the trap exception if specified bit in SR is set. Trap exception is a request to the operating system or to the debug facility to execute certain debug services. Immediate value is used to select which SR bit is tested by trap instruction")
    xor
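The difference between the "Extend with Sign" and "Extend with Zero" loads can be sketched as follows. This only models the value that ends up in a 32-bit register; the function name and the 32-bit register width are assumptions for illustration.

```python
REG_BITS = 32  # assumed register width
REG_MASK = (1 << REG_BITS) - 1


def extend(value, bits, signed):
    """Extend the low `bits` bits of value to a full register.

    signed=True models lbs/lhs (copy the top bit upwards),
    signed=False models lbz/lhz (fill with zeros).
    """
    low = value & ((1 << bits) - 1)
    if signed and low & (1 << (bits - 1)):
        low -= 1 << bits  # reinterpret as a negative two's-complement value
    return low & REG_MASK
```

So a byte 0x80 loaded with lbz gives 0x00000080, while lbs gives 0xFFFFFF80.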

8051

    ACALL - Absolute Call
    ADD, ADDC - Add Accumulator (With Carry)
    AJMP - Absolute Jump
    ANL - Bitwise AND
    CJNE - Compare and Jump if Not Equal
    CLR - Clear Register
    CPL - Complement Register
    DA - Decimal Adjust
    DEC - Decrement Register
    DIV - Divide Accumulator by B
    DJNZ - Decrement Register and Jump if Not Zero
    INC - Increment Register
    JB - Jump if Bit Set
    JBC - Jump if Bit Set and Clear Bit
    JC - Jump if Carry Set
    JMP - Jump to Address
    JNB - Jump if Bit Not Set
    JNC - Jump if Carry Not Set
    JNZ - Jump if Accumulator Not Zero
    JZ - Jump if Accumulator Zero
    LCALL - Long Call
    LJMP - Long Jump
    MOV - Move Memory
    MOVC - Move Code Memory
    MOVX - Move Extended Memory
    MUL - Multiply Accumulator by B
    NOP - No Operation
    ORL - Bitwise OR
    POP - Pop Value From Stack
    PUSH - Push Value Onto Stack
    RET - Return From Subroutine
    RETI - Return From Interrupt
    RL - Rotate Accumulator Left
    RLC - Rotate Accumulator Left Through Carry
    RR - Rotate Accumulator Right
    RRC - Rotate Accumulator Right Through Carry
    SETB - Set Bit
    SJMP - Short Jump
    SUBB - Subtract From Accumulator With Borrow
    SWAP - Swap Accumulator Nibbles
    XCH - Exchange Bytes
    XCHD - Exchange Digits
    XRL - Bitwise Exclusive OR
    Undefined - Undefined Instruction

-- http://www.win.tue.nl/~aeb/comp/8051/set8051.html

HC08

Cell Processor SPU

https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44/$file/SPU_ISA_v1.2_27Jan2007_pub.pdf

3. Memory—Load/Store Instructions: Load Quadword (d-form), Load Quadword (x-form), Load Quadword (a-form), Load Quadword Instruction Relative (a-form), Store Quadword (d-form), Store Quadword (x-form), Store Quadword (a-form), Store Quadword Instruction Relative (a-form), Generate Controls for Byte Insertion (d-form), Generate Controls for Byte Insertion (x-form), Generate Controls for Halfword Insertion (d-form), Generate Controls for Halfword Insertion (x-form), Generate Controls for Word Insertion (d-form), Generate Controls for Word Insertion (x-form), Generate Controls for Doubleword Insertion (d-form), Generate Controls for Doubleword Insertion (x-form)

4. Constant-Formation Instructions: Immediate Load Halfword, Immediate Load Halfword Upper, Immediate Load Word, Immediate Load Address, Immediate Or Halfword Lower, Form Select Mask for Bytes Immediate

5. Integer and Logical Instructions: Add Halfword, Add Halfword Immediate, Add Word, Add Word Immediate, Subtract from Halfword, Subtract from Halfword Immediate, Subtract from Word, Subtract from Word Immediate, Add Extended, Carry Generate, Carry Generate Extended, Subtract from Extended, Borrow Generate, Borrow Generate Extended, Multiply, Multiply Unsigned, Multiply Immediate, Multiply Unsigned Immediate, Multiply and Add, Multiply High, Multiply and Shift Right, Multiply High High, Multiply High High and Add, Multiply High High Unsigned, Multiply High High Unsigned and Add, Count Leading Zeros, Count Ones in Bytes, Form Select Mask for Bytes, Form Select Mask for Halfwords, Form Select Mask for Words, Gather Bits from Bytes, Gather Bits from Halfwords, Gather Bits from Words, Average Bytes, Absolute Differences of Bytes, Sum Bytes into Halfwords, Extend Sign Byte to Halfword, Extend Sign Halfword to Word, Extend Sign Word to Doubleword, And, And with Complement, And Byte Immediate, And Halfword Immediate, And Word Immediate, Or, Or with Complement, Or Byte Immediate, Or Halfword Immediate, Or Word Immediate, Or Across, Exclusive Or, Exclusive Or Byte Immediate, Exclusive Or Halfword Immediate, Exclusive Or Word Immediate, Nand, Nor, Equivalent, Select Bits, Shuffle Bytes

6. Shift and Rotate Instructions: Shift Left Halfword, Shift Left Halfword Immediate, Shift Left Word, Shift Left Word Immediate, Shift Left Quadword by Bits, Shift Left Quadword by Bits Immediate, Shift Left Quadword by Bytes, Shift Left Quadword by Bytes Immediate, Shift Left Quadword by Bytes from Bit Shift Count, Rotate Halfword, Rotate Halfword Immediate, Rotate Word, Rotate Word Immediate, Rotate Quadword by Bytes, Rotate Quadword by Bytes Immediate, Rotate Quadword by Bytes from Bit Shift Count, Rotate Quadword by Bits, Rotate Quadword by Bits Immediate, Rotate and Mask Halfword, Rotate and Mask Halfword Immediate, Rotate and Mask Word, Rotate and Mask Word Immediate, Rotate and Mask Quadword by Bytes, Rotate and Mask Quadword by Bytes Immediate, Rotate and Mask Quadword Bytes from Bit Shift Count, Rotate and Mask Quadword by Bits, Rotate and Mask Quadword by Bits Immediate, Rotate and Mask Algebraic Halfword, Rotate and Mask Algebraic Halfword Immediate, Rotate and Mask Algebraic Word, Rotate and Mask Algebraic Word Immediate

7. Compare, Branch, and Halt Instructions: Halt If Equal, Halt If Equal Immediate, Halt If Greater Than, Halt If Greater Than Immediate, Halt If Logically Greater Than, Halt If Logically Greater Than Immediate, Compare Equal Byte, Compare Equal Byte Immediate, Compare Equal Halfword, Compare Equal Halfword Immediate, Compare Equal Word, Compare Equal Word Immediate, Compare Greater Than Byte, Compare Greater Than Byte Immediate, Compare Greater Than Halfword, Compare Greater Than Halfword Immediate, Compare Greater Than Word, Compare Greater Than Word Immediate, Compare Logical Greater Than Byte, Compare Logical Greater Than Byte Immediate, Compare Logical Greater Than Halfword, Compare Logical Greater Than Halfword Immediate, Compare Logical Greater Than Word, Compare Logical Greater Than Word Immediate, Branch Relative, Branch Absolute, Branch Relative and Set Link, Branch Absolute and Set Link, Branch Indirect, Interrupt Return, Branch Indirect and Set Link if External Data, Branch Indirect and Set Link, Branch If Not Zero Word, Branch If Zero Word, Branch If Not Zero Halfword, Branch If Zero Halfword, Branch Indirect If Zero, Branch Indirect If Not Zero, Branch Indirect If Zero Halfword, Branch Indirect If Not Zero Halfword

8. Hint-for-Branch Instructions: Hint for Branch (r-form), Hint for Branch (a-form), Hint for Branch Relative

9. Floating-Point Instructions (subsections: 9.1 Single Precision (Extended-Range Mode); 9.2 Double Precision; 9.2.1 Conversions Between Single-Precision and Double-Precision Format; 9.2.2 Exception Conditions; 9.3 Floating-Point Status and Control Register): Floating Add, Double Floating Add, Floating Subtract, Double Floating Subtract, Floating Multiply, Double Floating Multiply, Floating Multiply and Add, Double Floating Multiply and Add, Floating Negative Multiply and Subtract, Double Floating Negative Multiply and Subtract, Floating Multiply and Subtract, Double Floating Multiply and Subtract, Double Floating Negative Multiply and Add, Floating Reciprocal Estimate, Floating Reciprocal Absolute Square Root Estimate, Floating Interpolate, Convert Signed Integer to Floating, Convert Floating to Signed Integer, Convert Unsigned Integer to Floating, Convert Floating to Unsigned Integer, Floating Round Double to Single, Floating Extend Single to Double, Double Floating Compare Equal, Double Floating Compare Magnitude Equal, Double Floating Compare Greater Than, Double Floating Compare Magnitude Greater Than, Double Floating Test Special Value, Floating Compare Equal, Floating Compare Magnitude Equal, Floating Compare Greater Than, Floating Compare Magnitude Greater Than, Floating-Point Status and Control Register Write, Floating-Point Status and Control Register Read

10. Control Instructions: Stop and Signal, Stop and Signal with Dependencies, No Operation (Load), No Operation (Execute), Synchronize, Synchronize Data, Move from Special-Purpose Register, Move to Special-Purpose Register

11. Channel Instructions: Read Channel, Read Channel Count, Write Channel

Links:

Programming the Cell Broadband Engine™ Architecture: Examples and Best Practices (666 pages) (alternate link with alternate version: [1])
Cell Broadband Engine Programming Handbook Version 1.1 (877 pages) (alternate link: [2])
2006 slide presentation
https://www.google.com/search?q=%22mfceieio%22+%22mfcsync%22+barrier+%22putlluc%22&oq=%22mfceieio%22+%22mfcsync%22+barrier+%22putlluc%22

Cypress PSoC MCU

Different versions use different MCU cores: PSoC 3 has an 8051, PSoC 4 has an ARM Cortex-M0, and PSoC 5 has an ARM Cortex-M3.

"The main problem for me is trying to find microcontrollers which have the peripheral set I want. This is very difficult as our requirements don't seem to be mainstream. We want things like 5 PWM channels, 5 Quadrature decoders, 2 non-standard SPI ports and a UART with negated IO....Also included on the chip are re-configurable digital and analogue blocks which can be made into a wide range of peripherals: ADCs, filters, op-amps, DACs, SPI, UART, quadrature decoder, CRC generator, etc...The real benefit is that you can stick with one chip, knowing that it can tackle a great many of the projects you'll want to do in the future." -- http://electronics.stackexchange.com/a/37438

XMOS Xcore

Multicore MCU; up to 8 MCUs.

XMOS Xcore Instructions

The following is mostly quoted or paraphrased from https://www.xmos.com/download/private/xCORE-200%3A-The-XMOS-XS2-Architecture-%28ISA%29%281.1%29.pdf .

XMOS Xcore Instructions: Data Access

access to words via stack pointer: LDWSP (load word from stack), STWSP (store word to stack), LDAWSP (load address of word in stack)
access to words via data pointer: LDWDP (load word from data), STWDP (store word to data), LDAWDP (load address of word in data)
load constants and program addresses: LDC (load constant), LDWCP (load word from constant pool), LDAWCP (load word address in constant pool), LDWCPL (load word from constant pool long), LDAPF (load address in program forward), LDAPB (load address in program backward)
load/store from address in register (plus scaled offset, which may be immediate or another register): LDW (load word), STW (store word), LDAWF (load address of word forward), LDAWB (load address of word backward), LDWI (load word), STWI (store word), LDAWFI (load address of word forward), LDAWBI (load address of word backward)
load/store subwords: LD16S (load 16-bit signed item), ST16 (store 16-bit item), LDA16F (load address of 16-bit item forward), LDA16B (load address of 16-bit item backward), LD8U (load byte unsigned), ST8 (store byte)
load/store double words: LDDSP (load two words from stack), STDSP (store two words to stack), LDDI (load two words), STDI (store two words), LDD (load two words), STD (store two words)
mask, extend, clear: MKMSK (make mask), MKMSKI (make mask immediate), SEXT (sign extend), SEXTI (sign extend immediate), ZEXT (zero extend), ZEXTI (zero extend immediate), ANDNOT (and not (clear field))
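A sketch of what the mask, extend and clear operations compute, going by their names. It assumes 32-bit words and that the extend operations take a bit count as their second operand; the exact operand conventions should be checked against the XS2 document, and the Python function names are only stand-ins for the mnemonics.

```python
WORD_BITS = 32  # assumed word size
WORD_MASK = (1 << WORD_BITS) - 1


def mkmsk(width):
    """MKMSK: a mask with the low `width` bits set."""
    if width >= WORD_BITS:
        return WORD_MASK
    return (1 << width) - 1


def zext(value, width):
    """ZEXT: keep the low `width` bits, clear the rest."""
    return value & mkmsk(width)


def sext(value, width):
    """SEXT: sign-extend the low `width` bits to a full word."""
    if width <= 0 or width >= WORD_BITS:
        return value & WORD_MASK
    low = value & mkmsk(width)
    sign = 1 << (width - 1)
    return ((low ^ sign) - sign) & WORD_MASK


def andnot(value, mask):
    """ANDNOT: clear the bits of value that are set in mask."""
    return value & ~mask & WORD_MASK
```

For instance, andnot(x, mkmsk(8)) clears the low byte of x, which is the "clear field" use mentioned above.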

XMOS Xcore Instructions: Expression Evaluation

ADDI (add immediate), ADD, SUBI (subtract immediate), SUB, NEG (negate), EQI (equal immediate), EQ (equal), LSU (less than unsigned), LSS (less than signed)
AND, OR, XOR, XOR4 (xor, four inputs), NOT, SHLI (logical shift left immediate), SHL (logical shift left), SHRI (logical shift right immediate), SHR (logical shift right), ASHRI (arithmetic shift right immediate), ASHR (arithmetic shift right)
MUL (multiply), DIVU (divide unsigned), DIVS (divide signed), REMU (remainder unsigned), REMS (remainder signed)
NOP, BITREV (bit reverse), BYTEREV (byte reverse), CLZ (count leading zeros), ZIP (zip double word), UNZIP (unzip double word)

XMOS Xcore Instructions: Branching, Jumping and Calling

branching:

BRFT (branch relative forward true)
BRFF (branch relative forward false)
BRBT (branch relative backward true)
BRBF (branch relative backward false)

jumping:

BRFU (branch relative forward unconditional)
BRBU (branch relative backward unconditional)
BRU (branch relative unconditional via reg)
BAU (branch absolute unconditional via reg)

jumping with link register: BLRF (branch and link relative forward), BLRB (branch and link relative backward), BLACP (branch and link absolute via CP), BLAT (branch and link absolute via table), BLA (branch and link absolute via register)

stack manipulation intended for calling, and returning:

EXTSP (extend stack)
EXTDP (extend data)
ENTSP (entry and extend stack, single issue)
DUALENTSP (entry and extend stack, dual issue)
RETSP (contract stack and return)
SETSP (set stack pointer)
SETDP (set data pointer)
SETCP (set pool pointer)

XMOS Xcore Instructions: Resources and the Thread Scheduler

Each xCORE Tile manages a number of different types of resource. These include threads, synchronisers, channel ends, timers and locks. For each type of resource a set of available items is maintained. The names of these sets are used to identify the type of resource to be allocated by the GETR (get resource) instruction. When the resource is no longer needed, it can be released for subsequent use by a FREER (free resource) instruction. Some resources have associated control modes which are set using the SETC instruction.

GETR (get resource), FREER (free resource), SETC (set resource control mode)

Many of the mode settings are defined only for a specific kind of resource and are described in the appropriate section; the ones which are used for several different kinds of resource are:

OFF (resource off)
ON (resource on)
START (resource active)
STOP (resource inactive)
EVENT (resource will cause events)
INTERRUPT (resource will raise interrupts)

Execution of instructions from each thread is managed by the thread scheduler. This maintains a set of runnable threads... from which it takes instructions in turn. When a thread is unable to continue, it is paused by removing it from the run set. The reason for this may be any of the following:

Its registers are being initialised prior to it being able to run.
It is waiting to synchronise with another thread before continuing.
It is waiting to synchronise with another thread and terminate (a join).
It has attempted an input from a channel which has no data available, or a port which is not ready, or a timer which has not reached a specified time.
It has attempted an output to a channel or a port which has no room for the data.
It has executed an instruction causing it to wait for one of a number of events or interrupts which may be generated when channels, ports or timers become ready for input.

The thread scheduler manages the threads, thread synchronisation and timing (using the synchronisers and timers). It is directly coupled to resources such as the ports and channels so as to minimise the delay when a thread becomes runnable as a result of a communication or input-output.
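A toy model of the run set described above: runnable threads are served in turn, and a paused thread is simply removed from the set until something makes it runnable again. The class and method names are invented for this sketch; the real scheduler is hardware and its issue order may differ.

```python
from collections import deque


class Scheduler:
    """Toy round-robin model of the run set; not the real hardware policy."""

    def __init__(self):
        self.run_set = deque()
        self.paused = set()

    def add(self, tid):
        self.run_set.append(tid)

    def pause(self, tid):
        """Thread cannot continue (waiting on a channel, timer, sync, ...)."""
        self.run_set.remove(tid)
        self.paused.add(tid)

    def resume(self, tid):
        """The awaited resource became ready; thread is runnable again."""
        self.paused.remove(tid)
        self.run_set.append(tid)

    def next_thread(self):
        """Pick the thread whose instruction is issued next, or None."""
        if not self.run_set:
            return None
        tid = self.run_set.popleft()
        self.run_set.append(tid)
        return tid
```

With threads 0, 1 and 2 added, instructions are taken from 0, 1, 2, 0, ...; after pause(1) only 0 and 2 alternate until resume(1) is called.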

XMOS Xcore Instructions: Concurrency and Thread Synchronisation

A thread can initiate execution on one or more newly allocated threads, and can subsequently synchronise with them to exchange data or to ensure that all threads have completed before continuing. Thread synchronisation is performed using hardware synchronisers, and threads using a synchroniser will move between running states and paused states. When a thread is first created, its status register is initialised as follows:

sr[bit eeble] <- 0
sr[bit ieble] <- 0
sr[bit inenb] <- 0
sr[bit inint] <- 0
sr[bit hipri] <- 0
sr[bit fast] <- 0
sr[bit kedi] <- 0
sr[bit waiting] <- 1 (the thread is paused)
sr[bit di] <- 0

The access registers of the newly created thread can be initialised using the following instructions.

TINITPC (set thread pc)
TINITSP (set thread stack)
TINITDP (set thread data)
TINITCP (set thread pool)
TINITLR (set thread link)

These instructions can only be used when the thread is paused. The TINITLR instruction is intended primarily to support debugging. On thread initialisation, the PC must be initialised. DP, SP, and CP will retain their value on freeing and allocating threads, so they may not have to be reinitialised. Data can be transferred between the operand registers of two threads using TSETR and TSETMR instructions, which can be used even when the destination thread is running.

TSETR (set thread operand register), TSETMR (set master thread operand register)

To start a synchronised slave thread a master must first acquire a synchroniser. This is done using a GETR SYNC instruction. If there is a synchroniser available its resource ID is returned, otherwise the invalid resource ID is returned. The GETST instruction is then used to get a synchronised thread. It is passed the synchroniser ID and if there is a free thread it will be allocated, attached to the synchroniser and its ID returned, otherwise the invalid resource ID is returned. The master thread can repeat this process to create a group of threads which will all synchronise together. To start the slave threads the master executes an MSYNC instruction using the synchroniser ID.

GETST (get synchronised thread), MSYNC (master synchronise)

The group of threads can synchronise at any point by the slaves executing the SSYNC and the master the MSYNC. Once all the threads have synchronised they are unpaused and continue executing from the next instruction. The processor maintains a set of paused master threads 'mpaused' and a set of paused slave threads 'spaused'...

Each synchroniser also maintains a record...of whether its master has reached a synchronisation point.

To terminate all of the slaves and allow the master to continue the master executes an MJOIN instruction instead of an MSYNC. When this happens, the slave threads are all freed and the master continues.
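The MSYNC / SSYNC / MJOIN behaviour can be sketched as a small state machine. This is only a model of the description above (slaves start paused, everyone is released once the master and all slaves have reached the synchronisation point, MJOIN frees the slaves instead of resuming them); the class, its fields and the return values are invented for illustration.

```python
class Synchroniser:
    """Toy model of one hardware synchroniser and its thread group."""

    def __init__(self):
        self.slaves = set()          # threads attached via GETST
        self.spaused = set()         # slaves currently paused at a sync point
        self.master_waiting = False  # has the master reached the sync point?
        self.join = False            # did the master use MJOIN?

    def getst(self, tid):
        """Attach a new slave; newly created threads start paused."""
        self.slaves.add(tid)
        self.spaused.add(tid)

    def _try_release(self):
        if not (self.master_waiting and self.spaused == self.slaves):
            return None  # someone has not arrived yet; caller stays paused
        released = sorted(self.slaves)
        self.master_waiting = False
        self.spaused.clear()
        if self.join:
            self.slaves.clear()  # MJOIN: slaves are freed, master continues
            self.join = False
            return ("joined", released)
        return ("resumed", released)

    def msync(self):
        self.master_waiting = True
        return self._try_release()

    def mjoin(self):
        self.master_waiting = True
        self.join = True
        return self._try_release()

    def ssync(self, tid):
        self.spaused.add(tid)
        return self._try_release()
```

Because the slaves are created paused, the master's first msync() finds every slave already at the synchronisation point and starts the whole group at once.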

A master thread can also create threads which can terminate themselves. This is done by the master executing a GETR THREAD instruction. This instruction returns either a thread ID if there is a free thread or the invalid resource ID. The unsynchronised thread can be initialised in the same way as a synchronised thread using the TINITPC, TINITSP, TINITDP, TINITCP, TINITLR and TSETR instructions. The unsynchronised thread is then started by the master executing a TSTART instruction specifying the thread ID. Once the thread has completed its task it can terminate itself with the FREET instruction.

TSTART (start thread), FREET (free thread)
GETID: The identifier of an executing thread can be accessed by the GETID instruction

XMOS Xcore instructions: Communication

Communication between threads is performed using channels, which provide full-duplex data transfer between channel ends, whether the ends are both in the same xCORE Tile, in different xCORE Tiles on the same chip or in xCORE Tiles on different chips. Channels carry messages constructed from data and control tokens between the two channel ends. The control tokens are used to encode communication protocols. Although most control tokens are available for software use, a number are reserved for encoding the protocol used by the interconnect hardware, and can not be sent and received using instructions. A channel end can be used to generate events and interrupts when data becomes available as described below. This allows a thread to monitor several channels, ports or timers, only servicing those that are ready. To communicate between two threads, two channel ends need to be allocated, one for each thread. This is done using the GETR c, CHANEND instruction. Each channel end has a destination register which holds the identifier of the destination channel end; this is initialised with the SETD instruction. It is also possible to use the identifier of a channel end to determine its destination channel end.

SETD, GETD (set/get destination of channel)

The identifier of the channel end c1 is used to initialise the channel end for thread c2, and vice versa. Each thread can then use the identifier of its own channel end to transfer data and messages using output and input instructions. The interconnect can be partitioned into several independent networks. This makes it possible, for example, to allocate channels carrying short control messages to one network whilst allocating channels carrying long data messages to another. There are instructions to allocate a channel to a network and to determine which network a channel is using.

SETN, GETN (set/get network of channel)

(my note: Writing to a channel is called 'outputting to the channel', and reading from it is called inputting from it.)

OUT (output data word), IN (input data word)
OUTT (output token), OUTCT (output control token), OUTCTI (output control token immediate)
INT (input token), INCT (input control token)
CHKCT (check control token), CHKCTI (check control token immediate), TESTCT (test for control token), TESTWCT (test word for control token)
TESTLCL (test destination local): determine whether a destination channel end is on the same processor; this makes it possible to optimise communication of large data structures where the two communicating threads are executed by the same processor

The channel connection is established when the first output is executed. If the destination channel end is on another xCORE Tile, this will cause the destination identifier to be sent through the interconnect, establishing a route for the subsequent data and control tokens. The connection is terminated when an END control token is sent. If a subsequent output is executed using the same channel end, the destination identifier will be used again to establish a new route which will again persist until another END control token is sent. A destination channel end can be shared by any number of outputting threads; they are served in a round-robin manner. Once a connection has been established it will persist until an END is received; any other thread attempting to establish a connection will be queued. In the case of a shared channel end, the outputting thread will usually transmit the identifier of its channel end so that the inputting thread can use it to reply. The OUT and IN instructions are used to transmit words of data through the channel; to transmit bytes of data the OUTT and INT instructions are used. Control tokens are sent using OUTCT or OUTCTI and received using INCT. To support efficient runtime checks that the type, length or structure of output data matches that expected by the inputer, CHKCT and CHKCTI instructions are provided. The CHKCT instruction inputs and discards a token provided that the input token matches its operand; otherwise it traps. The normal IN and INT instructions trap if they encounter a control token. To input a control token INCT is used; this traps if it encounters a data token. The END control token is one of the 12 tokens which can be sent using OUTCTI and checked using CHKCTI. By following each message output with an OUTCTI c, END and each input with a CHKCTI c, END it is possible to check that the size of the message is the same as the size of the message expected by the inputting thread.
To perform synchronised communication, the output message should be followed with (OUTCTI c, END; CHKCTI c, END) and the input with (CHKCTI c, END; OUTCTI c, END). Another control token is PAUSE. Like END, this causes the route through the interconnect to be disconnected. However the PAUSE token is not delivered to the receiving thread. It is used by the outputting thread to break up long messages or streams, allowing the interconnect to be shared efficiently. The remaining control tokens are used for runtime checking and for signalling the type of message being received; they have no effect on the interconnect. Note that in addition to END and PAUSE, ten of these can be efficiently handled using OUTCTI and CHKCTI. A control token takes up a single byte of storage in the channel. On the receiving end the software can test whether the next token is a control token using the TESTCT instruction, which waits until at least one token is available. It is also possible to test whether the next word contains a control token using the TESTWCT instruction. This waits until a whole word of data tokens has been received (in which case it returns 0) or until a control token has been received (in which case it returns the byte position after the position of the byte containing the control token). Channel ends have a buffer able to hold sufficient tokens to allow at least one word to be buffered. If an output instruction is executed when the channel is too full to take the data then the thread which executed the instruction is paused. It is restarted when there is enough room in the channel for the instruction to successfully complete. Likewise, when an input instruction is executed and there is not enough data available then the thread is paused and will be restarted when enough data becomes available.
Note that when sending long messages to a shared channel, the sender should send a short request and then wait for a reply before proceeding as this will minimise interconnect congestion caused by delays in accepting the message. When a channel end c is no longer required, it can be freed using a FREER c instruction. Otherwise it can be used for another message. It is sometimes necessary to determine the identifier of the destination channel end c2 stored in channel end c1. For example, this enables a thread to transmit the identifier of a destination channel end it has been using to a thread on another processor. This can be done using the GETD instruction.
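The token-level rules above (data inputs trap on a control token; CHKCT traps on anything but the expected control token; ending each message with END lets the receiver check the message length) can be sketched with a simple queue. Everything here is a model for illustration: the class, the method names, the Trap exception and the most-significant-byte-first word order are assumptions, and pausing on an empty or full channel is not modelled.

```python
from collections import deque

END = "END"  # stand-in for the END control token value


class Trap(Exception):
    """Raised where the text says the instruction traps."""


class ChannelEnd:
    def __init__(self):
        self.tokens = deque()  # each entry is (is_control, value)

    def _head(self):
        if not self.tokens:
            # Real hardware would pause the thread; not modelled here.
            raise RuntimeError("no token available (thread would pause)")
        return self.tokens[0]

    def outt(self, byte):
        """OUTT: output one data token (a byte)."""
        self.tokens.append((False, byte & 0xFF))

    def out(self, word):
        """OUT: output a 32-bit word as four data tokens (assumed MSB first)."""
        for shift in (24, 16, 8, 0):
            self.outt(word >> shift)

    def outct(self, token):
        """OUTCT/OUTCTI: output a control token."""
        self.tokens.append((True, token))

    def int_(self):
        """INT: input one data token; traps on a control token."""
        is_control, value = self._head()
        if is_control:
            raise Trap("data input met a control token")
        self.tokens.popleft()
        return value

    def in_(self):
        """IN: input a word; traps if a control token is encountered."""
        word = 0
        for _ in range(4):
            word = (word << 8) | self.int_()
        return word

    def chkct(self, token):
        """CHKCT/CHKCTI: consume the expected control token or trap."""
        if self._head() != (True, token):
            raise Trap("unexpected token")
        self.tokens.popleft()
```

If the sender outputs two words followed by END but the receiver inputs one word and then checks for END, chkct raises Trap, which is the message-size check described in the text.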

XMOS Xcore instructions: Locks

Mutual exclusion between a number of threads can be performed using locks. A lock is allocated using a GETR l, LOCK instruction. The lock is initially free. It can be claimed using an IN instruction and freed using an OUT instruction. When a thread executes an IN on a lock which is already claimed, it is paused and placed in a queue waiting for the lock. Whenever a lock is freed by an OUT instruction and the lock’s queue is not empty, the next thread in the queue is unpaused; it will then succeed in claiming the lock. When inputting from a lock, the IN instruction always returns the lock identifier, so the same register can be used as both source and destination operand. When outputting to a lock, the data operand of the OUT instruction is ignored. When the lock is no longer needed, it can be freed using a FREER l instruction.
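A minimal model of the lock behaviour described above: IN claims the lock or queues the thread, OUT hands the lock to the next queued thread. The class and the boolean/ID return values are invented for the sketch; in hardware the "paused" outcome is the thread leaving the run set.

```python
from collections import deque


class Lock:
    """Toy model of a hardware lock resource."""

    def __init__(self):
        self.owner = None     # thread currently holding the lock
        self.queue = deque()  # paused threads waiting to claim it

    def in_(self, tid):
        """IN on the lock: True if claimed, False if the thread is paused."""
        if self.owner is None:
            self.owner = tid
            return True
        self.queue.append(tid)
        return False

    def out(self):
        """OUT on the lock: free it; returns the thread unpaused, if any."""
        self.owner = self.queue.popleft() if self.queue else None
        return self.owner
```

Waiting threads are served in queue order, so a thread that was paused by in_() holds the lock as soon as out() returns its ID.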

XMOS Xcore instructions: Timers and clocks

Each xCORE Tile executes instructions at a speed determined by its own clock input. In addition, it provides a reference clock output which ticks at a standard frequency of 100 MHz. A set of programmable timers is provided and all of these can be used by threads to provide timed program execution relative to the reference clock.

The processor has a set of timers that can be used to wait for a time. The current time can be input from any timer, or it can be obtained by using GETTIME:

GETTIME (get current time)

Each timer can be used by a thread to read its current time or to wait until a specified time. A timer is allocated using the GETR t, TIMER instruction. It can be configured using the SETC instruction; the only two modes which can be set are UNCOND (timer always ready; inputs complete immediately) and AFTER (timer ready when its current time is after its DATA value). In unconditional mode, an IN instruction reads the current value of the timer. In AFTER mode, the IN instruction waits until the value of its current time is after (later than) the value in its DATA register. The value can be set using a SETD instruction.
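A sketch of the two timer modes. The text does not say how "after" is decided when the counter wraps around; the version below assumes a 32-bit counter and the usual signed-difference comparison, which is an assumption of this sketch rather than something taken from the XMOS document. The function names are invented.

```python
UNCOND = "UNCOND"
AFTER = "AFTER"


def is_after(now, data, bits=32):
    """True if `now` is later than `data`, tolerating counter wraparound.

    Assumption: times less than half the counter range apart are compared
    by the sign of their difference.
    """
    mask = (1 << bits) - 1
    diff = (now - data) & mask
    return 0 < diff < (1 << (bits - 1))


def timer_in(mode, now, data):
    """Model of IN on a timer: the time read, or None if the thread waits."""
    if mode == UNCOND or is_after(now, data):
        return now
    return None
```

A thread that wants to wait 1000 reference ticks would read the time in UNCOND mode, set DATA to that value plus 1000 with SETD, switch the mode to AFTER and input again.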

Timers can also be used to generate events as described below.

A set of programmable clocks is also provided and each can be used to produce a clock output to control the action of one or more ports and their associated port timers.

SETCLK (set the clock source of a port)

(my note: each clock can use a one-bit port as its clock source, again using SETCLK)

Alternatively, a clock may use the reference clock as its clock source (by 'SETCLK p, REF'). In either case the clock can be configured to divide the frequency using an 8-bit divider. When this is set to 0, the clock passes directly to the output. The falling edge of the clock is used to perform the division. Hence a setting of 1 will result in an output from the clock which changes each falling edge of the input, halving the input frequency f; and a setting of n will produce an output frequency of f/(2n). The division factor is set using the SETD instruction. The lowest eight bits of the operand are used and the rest ignored.
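
(my note: the divider arithmetic above as a small Python sketch. The function name is mine; the behaviour follows the paragraph: only the low eight bits of the SETD operand count, 0 passes the source through, and a setting of n gives f/(2n).)

```python
def divided_frequency(f_in_hz, divider):
    """Output frequency of a clock for a given SETD divider value."""
    n = divider & 0xFF          # only the lowest eight bits are used
    if n == 0:
        return f_in_hz          # 0 passes the source clock straight through
    return f_in_hz / (2 * n)    # output toggles once every n falling edges

REF_HZ = 100_000_000            # the 100 MHz reference clock
examples = {n: divided_frequency(REF_HZ, n) for n in (0, 1, 2, 50, 255, 0x101)}
```

With the reference clock as source, a setting of 1 gives 50 MHz and a setting of 50 gives 1 MHz; 0x101 behaves like 1 because the upper bits are ignored.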

To ensure that the timers in the ports which are attached to the same clock all record the same time, the clock should be started using a 'SETC c, START' instruction after the ports have all been attached to the clock. All of the clocks are initially stopped and a clock can be stopped by a 'SETC c, STOP' instruction. The data output on the pins of an output port changes state synchronously with the port clock. If several output ports are driven from the same clock, they will appear to operate as a single output port, provided that the processor is able to supply new data to all of them during each clock cycle. Similarly, the data input by an input port from the port pins is sampled synchronously with the port clock. If several input ports are driven from the same clock they will appear to operate as a single input port provided that the processor is able to take the data from all of them during each clock cycle. The use of clocked ports therefore decouples the internal timing of input and output program execution from the operation of synchronous input and output interfaces.

XMOS Xcore instructions: Ports, Input and Output

Ports are interfaces to physical pins. A port can be used for input or output. It can use the reference clock as its port clock or it can use one of the programmable clocks. Transfers to and from the pins can be synchronised with the execution of input and output instructions, or the port can be configured to buffer the transfers and to convert automatically between serial and parallel form. Ports can also be timed to provide precise timing of values appearing on output pins or taken from input pins. When inputting, a condition can be used to delay the input until the data in the port meets the condition. When the condition is met the captured data is time stamped with the time at which it was captured. The port clock input is initially the reference clock. It can be changed using the SETCLK instruction with a clock ID as the clock operand. This port clock drives the port timer and can also be used to determine when data is taken from or presented to the pins. A port can be used to generate events and interrupts when input data becomes available as described below. This allows a thread to monitor several ports, channels or timers, only servicing those that are ready. ... Each port has a transfer register. The input and output instructions used for channels, IN and OUT, can also be used to transfer data to and from a port transfer register. The IN instruction zero-extends the contents of a port transfer register and transfers the result to an operand register. The OUT instruction transfers the least significant bits from an operand register to a port transfer register ...

The port configuration is done using the SETC instruction which is used to define several independent settings of the port.

There are further instructions for shifting bits and partial words to and from the port and precisely controlling timing:

OUTSHR (output to port and shift), INSHR (shift and input from port), SETRDY (set source of port ready input), SETPT (set port time), CLRPT (clear port time), GETTS (get port timestamp), SETTW (set port transfer width), SETPSC (set port shift register count), ENDIN (end input), OUTPW (output part word), OUTPWI (output part word immediate), INPW (input part word)

XMOS Xcore instructions: Events

Events and interrupts allow timers, ports and channel ends to automatically transfer control to a pre-defined event handler. The resources generate events by default and must be reconfigured using a SETC instruction in order to generate interrupts. The ability of a thread to accept events or interrupts is controlled by information held in the thread status register (sr), and may be explicitly controlled using SETSR and CLRSR:

SETSR, GETSR (set/get thread state), CLRSR (clear thread state)

The operand of these instructions should be one (or more) of

EEBLE (enable events), IEBLE (enable interrupts), INENB (determine if thread is enabling events), ININT (determine if thread is in interrupt mode), HIPRI (set thread to high priority mode), FAST (set thread to fast mode), KEDI (set thread to switch to dual issue on kernel entry)

A thread normally enables one or more events and then waits for one of them to occur. Hence, on an event all the thread’s state is valid, allowing the thread to respond rapidly to the event. The thread can perform input and output operations using the port, channel or timer which gave rise to an event whilst leaving some or all of the event information unchanged. This allows the thread to complete handling an event and immediately wait for another similar event. Timers, ports and channel ends all support events, the only difference being the ready conditions used to trigger the event. The program location of the event handler must be set prior to enabling the event using the SETV instruction. The SETEV instruction can be used to set an environment for the event handler; this will often be a stack address containing data used by the handler. Timers and ports have conditions which determine when they will generate an event; these are set using the SETC and SETD instructions. Channel ends are considered ready as soon as they contain enough data. Event generation by a specific port, timer or channel can be enabled using an event enable unconditional (EEU) instruction and disabled using an event disable unconditional (EDU) instruction. The event enable true (EET) instruction enables the event if its condition operand is true and disables it otherwise; conversely the event enable false (EEF) instruction enables the event if its condition operand is false, and disables it otherwise. These instructions are used to optimise the implementation of guarded inputs.

SETV (set event vector), SETEV (set event environment vector), SETD (set resource data), GETD (get resource data), SETC (set event condition), EET (event enable true), EEF (event enable false), EDU (event disable), EEU (event enable)

Having enabled events on one or more resources, a thread can use a WAITEU, WAITET or WAITEF instruction to wait for at least one event. The WAITEU instruction waits unconditionally; the WAITET instruction waits only if its condition operand is true, and the WAITEF waits only if its condition operand is false.

WAITET (event wait if true), WAITEF (event wait if false), WAITEU (event wait)

This may result in an event taking place immediately with control being transferred to the event handler specified by the corresponding event vector with events disabled by clearing the thread’s eeble flag. Alternatively the thread may be paused until an event takes place with the eeble flag enabled; in this case the eeble flag will be cleared when the event takes place, and the thread resumes execution.

Note that the environment vector is transferred to the event data register, from where it can be accessed by the GETED instruction. This allows it to be used to access data associated with the event, or simply to enable several events to share the same event vector.

To optimise the responsiveness of a thread to high priority resources the SETSR EEBLE instruction can be used to enable events before starting to enable the ports, channels and timers. This may cause an event to be handled immediately, or as soon as it is enabled. An enabling sequence of this kind can be followed either by a WAITEU instruction to wait for one of the events, or it can simply be followed by a CLRSR EEBLE to continue execution when no event takes place. The WAITET and WAITEF instructions can also be used in conjunction with a CLRSR EEBLE to conditionally wait or continue depending on a guarding condition. The WAITET and WAITEF instructions can also be used to optimise the common case of repeatedly handling events from multiple sources until a terminating condition occurs.

All of the events which have been enabled by a thread can be disabled using a single CLRE instruction. This disables event generation in all of the ports, channels or timers which have had events enabled by the thread. The CLRE instruction also clears the thread’s eeble flag.

CLRE (disable all events for thread)

Interrupts: In contrast to events, interrupts can occur at any point during program execution, and so the current pc and sr (and potentially also some or all of the other registers) must be saved prior to execution of the interrupt handler. Interrupts are taken between instructions, which means that in an interrupt handler the previous instruction will have been completed, and the next instruction is yet to be executed on return from the interrupt. This is done using the spc and ssr registers. Any interrupt and exception causes the pc and sr registers to be saved into spc and ssr, and the status register to be modified to indicate that the processor is running in kernel mode. When the handler has completed, execution of the interrupted thread can be performed by a KRET instruction...

KRET (return from interrupt)

XMOS Xcore instructions: Exceptions

Exceptions which occur when an error is detected during instruction execution are treated in the same way as interrupts except that they transfer control to a location defined relative to the thread’s kernel entry point kep register.

Exception types:

ET_LINK_ERROR: Incorrect use of channel
ET_ILLEGAL_PC: Unaligned program counter
ET_ILLEGAL_INSTRUCTION: Illegal opcode
ET_ILLEGAL_RESOURCE: Illegal use of resource
ET_LOAD_STORE: Unaligned memory access
ET_ILLEGAL_PS: Undefined PS register
ET_ARITHMETIC: Arithmetic error
ET_ECALL: Assertion failed
ET_RESOURCE_DEP: Illegal resource use
ET_KCALL: KCALL executed

A program can force an exception as a result of a software detected error condition using ECALLT, ECALLF, or ELATE. These have the same effect as hardware detected exceptions, transferring control to the same location and indicating that an error has occurred in the exception type (et) register:

ECALLT (error on true), ECALLF (error on false), ELATE (error if late, 1r)

A program can explicitly cause entry to a handler using one of the kernel call instructions. These have a similar effect to exceptions, except that they transfer control to a location defined relative to the thread’s kep register (((my note: i don't understand the difference between the ECALLs and KCALL))):

KCALL, KCALLI (call kernel to enter exception handler)

The spc, ssr, et and sed registers can be saved and restored directly to the stack:

LDSPC (load exception pc), STSPC (store exception pc), LDSSR (load exception sr), STSSR (store exception sr), LDSED (load exception data), STSED (store exception data), STET (store exception type)

In addition, the et and ed registers can be transferred directly to a register:

GETET (get exception type), GETED (get exception data)

A handler can use the KENTSP instruction to save the current stack pointer into word 0 of the thread’s kernel stack (using the kernel stack pointer ksp) and change stack pointer to point at the base of the thread’s kernel stack. KRESTSP can then be used to restore the stack pointer on exit from the handler.

KENTSP n (switch to kernel stack), KRESTSP n (switch from kernel stack)

The kep can be initialised using the SETKEP instruction; the ksp can be read using the GETKSP instruction:

SETKEP (set kernel entry point), GETKSP (get kernel stack pointer)

The kernel stack pointer is initialised by the boot-ROM to point to a safe location near the last location of RAM - the last few locations are used by the JTAG debugging interface. ksp can be modified by using a sequence of SETSP followed by KRESTSP.

XMOS Xcore instructions: Initialisation and Debugging

The state of the processor includes additional registers to those used for the threads:

dspc: debug save pc
dssr: debug save sr
dssp: debug save sp
dtype: debug cause
dtid: thread identifier used to access thread state
dtreg: register identifier used to access thread state
DEBUG: flag that indicates that processor is in debug mode

All of the processor state can be accessed using the GETPS and SETPS instructions:

GETPS, SETPS (get/set processor state)

To access the state of a thread, first SETPS is used to set dtid and dtreg to the thread identifier and register number within the thread state. The contents of the register can then be accessed by:

DGETREG (get thread register)

The debugging state is entered by executing a DCALL instruction, by an instruction that triggers a watchpoint or a breakpoint, or by an external asynchronous DEBUG event (for example caused by asserting a DEBUG pin). During debug, thread 0 executes the debug handler and all other threads are frozen. The debugging state is exited on DRET, which causes thread 0 to resume at its saved PC, and all other threads to start where they were stopped. Entry to a debug handler operates in a manner similar to an interrupt.

DCALL (debug call (breakpoint)), DRET (return from debug), DENTSP (debug save stack pointer), DRESTSP (debug restore stack pointer)

On entering debug mode the DI bit is saved in the dspc register, and it is cleared.

Watchpoints and instruction breakpoints are supported by means of SETPS and GETPS instructions. An instruction breakpoint is an address that triggers a DCALL on a PC being equal to the value in the instruction break point. A data watchpoint is a pair of addresses l and h, and a condition that triggers a DCALL on stores and/or loads to specific memory addresses.

XMOS Xcore instructions: Specialised Instructions

Long arithmetic:
- LADD (add with carry)
- LSUB (subtract with borrow)
- LMUL (long multiply; The long multiply instruction multiplies two of its source operands, and adds two more source operands to the result, leaving the unsigned double length result in its two destination operands. The result can always be represented within two words...The two carry-in operands allow the component results of multi-length multiplications to be formed directly without the need for extra addition steps)
- LDIV (The long division instruction is very similar to the short unsigned division instruction, except that it returns the remainder as well as the result; it also allows the remainder from a previous step of a multi-length division to be loaded as the high part of the dividend; An ET_ARITHMETIC exception is raised if the result cannot be represented as a single word value)
- LEXTRACT (extracts a selection of bits from two words at a given offset; a sequence of LEXTRACT instructions can be used to implement a rotate, long shift, and misaligned loads), LINSERT (inserts a bit pattern into a double word; inverse operation of LEXTRACT)
Multiply accumulate: MACCU (long multiply, acc unsigned; multiplies two source operands to produce a double length result which it adds to its double length accumulator operand held in two other operands), MACCS (long multiply, acc signed), LSATS (saturate signed; saturates a number that is above an indicated threshold, or below the negation of that threshold), LSATSI (Saturate signed immediate)
Cyclic redundancy check: CRC32, CRC8, CRCN, CRC32_INC (CRC32 and a simultaneous increment on the second parameter)
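
(my note: a Python sketch of the LMUL and LDIV arithmetic described above, assuming 32-bit words. The function names, operand order and the use of Python's ArithmeticError to stand in for ET_ARITHMETIC are my own choices; treating a zero divisor as an error is also an assumption. It shows why the LMUL result always fits in two words, and how the carry-in operand and the remainder-as-high-word let multi-word operations chain without extra steps.)

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1

def lmul(a, b, c, d):
    """a*b + c + d on unsigned words; returns the (high, low) result words."""
    total = a * b + c + d
    assert total >> (2 * WORD_BITS) == 0   # always fits in two words
    return total >> WORD_BITS, total & MASK

def ldiv(high, low, divisor):
    """Divide the double word high:low by divisor; returns (quotient, remainder)."""
    if divisor == 0 or high >= divisor:
        # Quotient would not fit in one word (stands in for ET_ARITHMETIC);
        # treating divisor == 0 the same way is an assumption.
        raise ArithmeticError("ET_ARITHMETIC")
    return divmod((high << WORD_BITS) | low, divisor)

# Worst case: every operand is 0xFFFFFFFF, and the result is exactly 2**64 - 1.
worst = lmul(MASK, MASK, MASK, MASK)

def mul_multiword(xs, y):
    """Multiply a little-endian list of words xs by one word y using lmul."""
    out, carry = [], 0
    for x in xs:
        carry, low = lmul(x, y, carry, 0)   # carry-in feeds the next step
        out.append(low)
    return out + [carry]

def div_multiword(xs, y):
    """Divide a little-endian list of words xs by one word y using ldiv."""
    quotient, rem = [], 0
    for x in reversed(xs):                  # most significant word first
        q, rem = ldiv(rem, x, y)            # previous remainder is the high word
        quotient.append(q)
    return quotient[::-1], rem
```

The worst case works out as (2^32 - 1)^2 + 2(2^32 - 1) = 2^64 - 1, which is why two carry-in words can be added without overflow.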

Other CPU and MCU/MPU links

Note: i think an MCU has onboard memory whereas an MPU uses external memory. See http://www.atmel.com/Images/MCU_vs_MPU_Article.pdf . In this document i've just called everything MCU, todo find out which ones are really MPUs and update this.

http://en.wikibooks.org/wiki/Embedded_Systems/Particular_Microprocessors
http://www.embedded.com/electronics-blogs/other/4420311/MCU-popularity--Engineer-vs--provider-perceptions
https://community.freescale.com/thread/60391
http://stackoverflow.com/questions/2137153/does-it-matter-which-microcontroller-to-use-for-1st-time-embed-system-programmer
http://electronics.stackexchange.com/questions/37423/how-to-choose-a-mcu-platform
http://www.labsud.org/en/arduino-est-il-la-reponse/
http://www.reddit.com/r/ECE/comments/1e0apw/what_microcontroller/
"If you want to learn the basics of microprocessor technology, Freescale HC11 or HC12 is one of the cleanest designs. Atmel AVR comes a close second." -- http://forum.allaboutcircuits.com/archive/index.php/t-89174.html
http://ca.answers.yahoo.com/question/index?qid=20120120152818AAe94TP
"Everyone I talked to that is using microcontrollers in OEM applications is either using PIC, some sort of ARM variant or an MSP430 (only for low power applications). I have yet to come across anyone using an AVR." -- http://electronics.stackexchange.com/questions/2324/why-are-atmel-avrs-so-popular
http://www.analognotes.com/digitalnotes/
http://www.avrfreaks.net/index.php?name=PNphpBB2&file=printview&t=51991&start=0
https://svn.kbs.tu-berlin.de/svn/EOS2012/web/09-CPU.pdf
"Invention-wise, the forefather of the microcontroller world is 8051. Then came in PIC and Renesas and the most recent is AVR. 8-bit is dominated by 8051 because of its long existence, improvements with time, established tools and familiarity in student community. Though IP provider Intel has discontinued MCS 51, other vendors still offer 8051 flavours of MCUs. “The launch of Microchip’s PIC was a game changer in 8-bit as it offered a very small-size and low-cost MCU. Flash memory based PICs range from tiny 6-pin SOT23 package to 100-pin TQFP package targeted at small-size embedded products,” shares Upendra Patel, CTO, eInfochips. Although PIC and AVR have captured the market, 8051 is still in use because of its simplicity, familiarity and low cost. " "Unlike 8-bit and 32-bit, there is no standard core in 16-bit space. Proprietary cores offered by different vendors are targeted at different application areas. Renesas’ M16C multi-function MCU family meets the increasing functionality demands corresponding to different market segments. It is compatible with both the low-end as well as high-end MCUs in the family. Electronic household appliances, white goods, audio equipment, TV sets and cameras are some of the application areas. Mixed-signal processors family MSP430 from Texas Instruments is popular in India....In spite of the recent launches to keep 16-bit alive, experts feel that 8-bit and 32-bit are the answer for the microcontroller market. " "In 32-bit space, so far the outright leader is ARM Cortex based microcontroller." "“If you look at 8-bit space, 8051 has achieved its position in standard cores. There is possibly no room for new standard cores in 8-bit space. Slowly the 16-bit slot is migrating to 32-bit space. And in 32-bit, ARM has gained popularity with its core Cortex-M,” shares Upendra Patel, chief technology officer at eInfochips" "According to Wee Seng, “Most developers write their programs in ‘C’ language, and not Assembly language." -- https://www.einfochips.com/articles/2010-efy-Microcontrollers-Need-for-Standard-Core.pdf
http://www.microchip.com/investor/Pressrelease/MCHP%20Investor%20Presentation.060313.pdf
"Earlier we talked about how the CPU and MCU choices are boiling down to just two: ARM and x86. Lately it seems that most programmers and engineers are choosing one or the other for their next project, while perfectly good alternatives such as PowerPC, MIPS, AVR, and others fall by the wayside." "There aren’t quite the hundreds of variations that we see with, say, 8051-based chips or Microchip PIC variants, but ARM is a relative newcomer to this market, so give ’em time." "Where ARM has the lead over PIC or AVR or 8051 is growth headroom. It’s the only family that spans the spectrum from four bits (i.e., $0.50) to 64-bits devices, with plenty of stops along the way. ColdFire/68K, MIPS, and x86 all have something like that, but not with the same conviction. Freescale’s beloved old CISC family is on its last legs, and MIPS has far fewer suppliers and seems to be circling the drain. Only x86 has the same product breadth" "ARM’s waves of success threaten to carry away the flotsam, in the form of lesser MCU families. It also erodes and undermines seemingly permanent fixtures on the landscape, such as 8051, PIC, and 6805 families." "ARM’s roadmap, from lowly MCU to 64-bit multicore server cluster, makes it an architecture that engineers can stay with for their entire careers. Of course, there was a time when we thought that was true of 68K, and MIPS, and SPARC, and other processor families, too" -- http://www.eejournal.com/archives/articles/20120822-armchoice/
http://www.instructables.com/id/How-to-choose-a-MicroController/?ALLSTEPS mentions PIC, AVR, 8051, Freescale 68HC908 and HCS08, MSP430, ARM, Cypress PSOC, Renesas (Hitachi) H8 and M6, Zilog Z8 and Z80
http://www.microchip.com/investor/Pressrelease/MCHP%20Investor%20Presentation.060313.pdf suggests that the top 8-bit ISAs are PIC, Renesas (H8, R8), AVR, ST8, Freescale HCxx, the top 16-bit are PIC, MSP430, Freescale HC16, Renesas (H8, H8S, M16C), and the top 32-bit are ARM and Renesas (SuperH, H8SX, M32C, M32R). I'm having trouble understanding which Renesas products are the most popular, since Renesas always ranks at the top of sales, but much lower on the 'which chip would you consider for your next product' question. I hear about the H8 and the SuperH the most, perhaps.
http://www.labsud.org/en/arduino-est-il-la-reponse/ mentions AVR, PIC, MSP430, ARM, 8051, Cypress PSOC, Propeller, XMOS
many mention Arduino, which is AVR under the covers.
the ROS Robot Operating System by Willow Garage seems to run on Linux cores which seem to run on x86 and ARM.
http://popcon.debian.org/ shows the most popular architectures are x86, ARM, PowerPC, and SPARC
http://www.cpushack.com/IntelMicrocontrollers.html
http://techon.nikkeibp.co.jp/english/handbook/MCU/Freescale.html
http://www.cpushack.com/2013/05/21/mcu-of-the-day-st7-family-68hc05-reborn/
http://rab.ict.pwr.wroc.pl/~mw/mcu/tutorial/notes.pdf

SuperH (SH)

unofficial Debian port

The decline of the SuperH architecture

http://shared-ptr.com/sh_insns.html

Opinions:

" Few years ago, I designed my own ISA. In that time I investigated design decisions in lots of ISAs and compared them. There was nothing in the RISC-V instruction set that stood out to me, like for example, the SuperH instruction set, which is remarkably well designed. Edit: Don't get me wrong, I don't think RISC-V is "garbage" or anything like that. I just think it could have been better. But of course, most of an architecture's value comes from its ecosystem and the time spent optimizing and tailoring everything... " [3]
- "My memories of SuperH are a bit different. Yeah, it's cleaner than ARM, but the delay slots, hardware division, and the tiny register file among others made life unnecessarily difficult. A lot of those design decisions didn't hold up well over time. " -- [4]
- " What are the particularly good design features of SuperH? (As compared to, say, MIPS?) What sticks in my mind from my limited exposure to SuperH is that there's no load immediate instruction, so you have to do a PC-relative load instead. It was clearly optimized for compiled rather than handwritten code! " [5]
  - "SuperH has a mov #imm, Rx that can take an 8-bit #imm. But you're right, literal pools were used just like on ARM. Things I liked about SuperH: 16 bit fixed-width insn format (except for some SH2A and DSP ops), T flag for bit manipulation ops, GBR to enable scaled loads with offset, xtrct instruction, single-cycle division insns (div0, div1), MAC insns. In terms of code density SH was quite effective, see here http://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_density.pdf or here http://www.deater.net/weave/vmwprod/asm/ll/ll.html" [6]

DEC Alpha

The DEC Alpha has an unofficial Debian port (it used to have an official Debian port, though). The DEC Alpha is notable for having one of the most relaxed memory models out of relatively popular Linux-capable architectures.

"The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that, some versions of the Alpha CPU have a split data cache, permitting them to have two semantically-related cache lines updated at separate times. This is where the data dependency barrier really becomes necessary as this synchronises both caches with the memory coherence system, thus making it seem like pointer changes vs new data occur in the right order.

The Alpha defines the Linux kernel's memory barrier model." -- https://www.kernel.org/doc/Documentation/memory-barriers.txt

"It may seem strange to say much of anything about a CPU whose end of life has been announced, but Alpha is interesting because, with the weakest memory ordering model, it reorders memory operations the most aggressively. It therefore has defined the Linux-kernel memory-ordering primitives, which must work on all CPUs, including Alpha. Understanding Alpha is therefore surprisingly important to the Linux kernel hacker." -- http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2009.04.05a.pdf

m32r

This has an unofficial Debian port.

A 32-bit RISC microprocessor of Renesas Technology.

http://www.linux-m32r.org/

https://en.wikipedia.org/wiki/M32R

http://www.renesas.com/products/mpumcu/m32r/index.jsp

s390

This has an unofficial Debian port.

IBM S/390 and System z machines.

See z/Architecture.

HP PA-RISC

https://en.wikipedia.org/wiki/PA-RISC

This has an unofficial Debian port (it used to have an official Debian port, though).

IA-64 / Itanium

https://en.wikipedia.org/wiki/Itanium

This has an official Debian port.

"Once touted by Intel as a replacement for the x86 product line," -- http://features.techworld.com/operating-systems/2690/will-intel-abandon-the-itanium/

"As of 2008, Itanium was the fourth-most deployed microprocessor architecture for enterprise-class systems, behind x86-64, Power Architecture, and SPARC." -- https://en.wikipedia.org/wiki/Itanium#cite_ref-ITJungle_1-0

"MIPS is the cleanest successful RISC. PowerPC and (32-bit) ARM have so many extra instructions (even a few operating modes, 32-bit ARM especially) that you could almost call them CISC. SPARC has a few odd features and Itanium is composed entirely of odd features. The latter two are more dead than MIPS." -- http://stackoverflow.com/a/2653951/171761

Links:

Blackfin

https://en.wikipedia.org/wiki/Blackfin

Lisp Machines

https://en.wikipedia.org/wiki/Lisp_machine#Technical_overview

Rekursiv

Links:

http://en.wikipedia.org/wiki/Rekursiv

todo

https://en.wikipedia.org/wiki/Instruction_set

https://www.google.com/search?client=ubuntu&channel=fs&q=minimalist+vm&ie=utf-8&oe=utf-8

http://stackoverflow.com/questions/9439001/what-is-the-minimum-instruction-set-required-for-any-assembly-language-to-be-con

https://en.wikipedia.org/wiki/Orthogonal_instruction_set

https://en.wikipedia.org/wiki/PDP-11_architecture#Instruction_set

https://en.wikipedia.org/wiki/PDP-8#Instruction_set

https://www.dartlang.org/articles/why-not-bytecode/

https://en.wikipedia.org/wiki/One_instruction_set_computer

https://en.wikipedia.org/wiki/Minimal_instruction_set_computer

http://www.yumpu.com/en/document/view/19487455/composable-processor-virtualization-for-embedded-systems

http://semipublic.comp-arch.net/wiki/Atomic_list_and_queue_operations

http://semipublic.comp-arch.net/wiki/Big_List_of_Instructions

http://semipublic.comp-arch.net/wiki/Synchronization_Instructions

http://www.es.ele.tue.nl/~kgoossens/2010-caos.pdf composable processor virtualization for embedded systems

http://www.es.ele.tue.nl/~kgoossens/2012-dsd-virtual-memory.pdf Composable Virtual Memory for an Embedded SoC

https://www.google.com/search?client=ubuntu&channel=fs&q=uclinux+mmu&ie=utf-8&oe=utf-8

https://en.wikipedia.org/wiki/Memory_management_unit

http://www.makelinux.net/ldd3/chp-15-sect-1

https://www.kernel.org/doc/gorman/pdf/understand.pdf

http://stackoverflow.com/questions/10000298/arm-mmu-operation-in-various-operating-modes

http://www-sop.inria.fr/everest/personnel/Andres.Krapf/docs/mm.pdf

https://en.wikipedia.org/wiki/Virtual_machine

http://138.4.11.199:8080/multipartes/public/MPT-D6%202-SolutionsForDetectedLimitations_v1.0.pdf Mechanisms for hardware virtualization in multicore architectures

http://polaris.cs.uiuc.edu/lcpc07/accepted/41_Final_Paper.pdf Capsules: Expressing Composable Computations in a Parallel Programming Model

types of memory barriers

load/load, load/store, store/load, store/store

"Sparc V8 has a “membar” instruction that takes a 4-element bit vector. The four categories of barrier can be specified individually" -- http://developer.android.com/training/articles/smp.html#barrier_inst

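(my note: the four barrier categories above can be modelled as a 4-bit mask, in the spirit of the "membar" bit vector mentioned in the quote. The Python sketch below is only an illustration: the bit positions are arbitrary and are NOT claimed to match the SPARC encoding, and the names Barrier, orders, release and full are mine.)

```python
from enum import Flag

class Barrier(Flag):
    # Bit positions are illustrative only, not the SPARC encoding.
    LOAD_LOAD = 1      # earlier loads complete before later loads
    LOAD_STORE = 2     # earlier loads complete before later stores
    STORE_LOAD = 4     # earlier stores complete before later loads
    STORE_STORE = 8    # earlier stores complete before later stores

def orders(mask, earlier, later):
    """True if the barrier mask orders an earlier access before a later one."""
    name = f"{earlier.upper()}_{later.upper()}"
    return bool(mask & Barrier[name])

# A "release"-style barrier: everything before it is ordered before later stores.
release = Barrier.LOAD_STORE | Barrier.STORE_STORE
# A full barrier sets all four categories.
full = (Barrier.LOAD_LOAD | Barrier.LOAD_STORE
        | Barrier.STORE_LOAD | Barrier.STORE_STORE)
```

For example, `orders(release, "store", "load")` is False, since store/load is the one ordering that only the full barrier provides here.
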
M1

"M1 is a 'toy' machine used to teach undergraduates about the ACL2 formalization of the Java Virtual Machine. M1 is a von Neumann style stack machine. The state consists of four components, a program counter, an array of local variable values (akin to registers), a stack, and an execute-only program. The machine provides eight instructions for doing addition and multiplication on the stack, moving items from the locals to the stack and back, an unconditional jump and a conditional jump that tests the top of the stack against 0. M1 provides unbounded integers. Because of this, M1 is Turing equivalent." -- http://www.cs.utexas.edu/users/moore/acl2/seminar/2012.03-19-moore-abstract.txt

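(my note: a minimal Python interpreter for a machine of the shape described in the quote: program counter, locals, stack, execute-only program. The mnemonics ICONST, ILOAD, ISTORE, IADD, ISUB, IMUL, GOTO and IFEQ are my assumption about the eight instructions, and I use absolute jump targets for simplicity; neither detail is confirmed by the abstract.)

```python
def run_m1(program, locals_, max_steps=100_000):
    """Run an M1-style program; return (stack, locals) when pc runs off the end."""
    pc, stack, locs = 0, [], list(locals_)
    for _ in range(max_steps):
        if pc >= len(program):
            return stack, locs
        op, *args = program[pc]
        pc += 1
        if op == "ICONST":
            stack.append(args[0])
        elif op == "ILOAD":
            stack.append(locs[args[0]])
        elif op == "ISTORE":
            locs[args[0]] = stack.pop()
        elif op in ("IADD", "ISUB", "IMUL"):
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "IADD" else a - b if op == "ISUB" else a * b)
        elif op == "GOTO":
            pc = args[0]                  # absolute target (a simplification)
        elif op == "IFEQ":
            if stack.pop() == 0:          # tests the top of the stack against 0
                pc = args[0]
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit reached")

# Factorial: local 0 holds n, local 1 receives n!.
FACTORIAL = [
    ("ICONST", 1), ("ISTORE", 1),                          # acc = 1
    ("ILOAD", 0), ("IFEQ", 13),                            # loop: if n == 0, exit
    ("ILOAD", 1), ("ILOAD", 0), ("IMUL",), ("ISTORE", 1),  # acc = acc * n
    ("ILOAD", 0), ("ICONST", 1), ("ISUB",), ("ISTORE", 0), # n = n - 1
    ("GOTO", 2),
]
stack, locs = run_m1(FACTORIAL, [5, 0])
```

Because Python integers are unbounded, like M1's, the same program also computes large factorials without overflow.
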
Links:

Tadasv VMS

A toy virtual machine

https://github.com/tadasv/vms/

registers = ["eax", "ebx", "ecx", "edx", "esp", "ebp", "esi", "edi"]

https://github.com/tadasv/vms/blob/master/compiler/myis.py

instructions = { "push" : [{"opcode" : 0x00, "format" : "<cc", "params" : ["reg"]}], "pop" : [{"opcode" : 0x01, "format" : "<cc", "params" : ["reg"]}], "mov" : [{"opcode" : 0x02, "format" : "<ccI", "params" : ["reg", "imm"]}, {"opcode" : 0x03, "format" : "<ccc", "params" : ["reg", "reg"]}, {"opcode" : 0x04, "format" : "<ccc", "params" : ["reg", "@reg"]}, {"opcode" : 0x05, "format" : "<ccc", "params" : ["@reg", "reg"]}, {"opcode" : 0x06, "format" : "<ccI", "params" : ["reg", "ref"]} ], "inc" : [{"opcode" : 0x07, "format" : "<cc", "params" : ["reg"]}], "dec" : [{"opcode" : 0x08, "format" : "<cc", "params" : ["reg"]}], "add" : [{"opcode" : 0x09, "format" : "<ccc", "params" : ["reg", "reg"]}], "jmp" : [{"opcode" : 0x0A, "format" : "<cI", "params" : ["ref"]}], "jz" : [{"opcode" : 0x0B, "format" : "<ccI", "params" : ["reg", "ref"]}], "jnz" : [{"opcode" : 0x0C, "format" : "<ccI", "params" : ["reg", "ref"]}], "mul" : [{"opcode" : 0x0D, "format" : "<ccc", "params" : ["reg", "reg"]}], "halt" : [{"opcode" : 0xFF, "format" : "<c", "params" : []}], "emit" : [{"opcode" : None, "format" : "<s", "params" : ["str"]}] }
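
(my note: the "format" strings in the table above look like Python struct format strings: "<" is little-endian with no padding, "c" is a single byte, "I" is a 4-byte unsigned int. Here is a sketch of how a few of the entries could be turned into bytes. The encode function, the FORMS lookup and the choice to encode a register as its index in the registers list are my assumptions, not taken from the vms source.)

```python
import struct

registers = ["eax", "ebx", "ecx", "edx", "esp", "ebp", "esi", "edi"]

# A few (opcode, format) pairs copied from the table, keyed by operand kinds.
FORMS = {
    ("push", ("reg",)): (0x00, "<cc"),
    ("mov", ("reg", "imm")): (0x02, "<ccI"),
    ("mov", ("reg", "reg")): (0x03, "<ccc"),
    ("halt", ()): (0xFF, "<c"),
}

def encode(mnemonic, *operands):
    """Pack one instruction; the first 'c' field is the opcode byte."""
    kinds = tuple("reg" if op in registers else "imm" for op in operands)
    opcode, fmt = FORMS[(mnemonic, kinds)]
    fields = [bytes([opcode])]
    for op in operands:
        # Assumption: a register is encoded as its index in `registers`.
        fields.append(bytes([registers.index(op)]) if op in registers else op)
    return struct.pack(fmt, *fields)

code = encode("mov", "ecx", 1000) + encode("mov", "eax", "ecx") + encode("halt")
```

Under these assumptions `mov ecx, 1000` takes six bytes (opcode, register, 32-bit immediate), `mov eax, ecx` takes three, and `halt` takes one.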

Robot Odyssey chip file format

http://scanlime.org/2009/04/robot-odyssey-chip-disassembler/

Steamer16

" Instruction descriptions and opcode assignments:

  NOP,  {0}     ( x y z -- x y z)               no operation

  lit,  {1}     ( x y z -- y z data) PC++       push data at PC, increment PC

  @,    {2}     ( x y addr -- x y data)         fetch data from addr

  !,    {3}     ( x data addr -- x x x)         store data to addr

  +,    {4}     ( x n1 n2 -- x x n1+n2)         add 2ND to TOP

  AND,  {5}     ( x n1 n2 -- x x n1&n2)         and 2ND to TOP

  XOR,  {6}     ( x n1 n2 -- x x n1^n2)         exclusive-or 2ND to TOP

  zgo,  {7}     ( x flg addr -- x x x)          if flg equals 0
                                                then jump to addr
                                                else continue"

-- http://web.archive.org/web/20050909044602/http://www.stringtuner.com/myron.plichota/steamer.txt (see also http://web.archive.org/web/20051216155306/http://www.stringtuner.com/myron.plichota/steamer1.htm )
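
The stack comments above describe a fixed three-element stack ( x y z ), with z on top. A small Python sketch of those eight operations, to make the stack effects concrete. It is simplified: each memory cell holds either one opcode or one data word, whereas how the real Steamer16 packs opcodes into 16-bit words is not modeled here, and any cell value above 7 is treated like NOP.

```python
NOP, LIT, FETCH, STORE, ADD, AND, XOR, ZGO = range(8)

def steamer16_run(mem, steps):
    """Execute `steps` instructions from address 0; return ((x, y, z), pc)."""
    x = y = z = 0                    # three-element stack, z is TOP
    pc = 0
    for _ in range(steps):
        op = mem[pc]
        pc += 1
        if op == LIT:                # ( x y z -- y z data ), PC++
            x, y, z = y, z, mem[pc]
            pc += 1
        elif op == FETCH:            # ( x y addr -- x y data )
            z = mem[z]
        elif op == STORE:            # ( x data addr -- x x x )
            mem[z] = y
            y = z = x
        elif op in (ADD, AND, XOR):  # ( x n1 n2 -- x x result )
            result = {ADD: (y + z) & 0xFFFF, AND: y & z, XOR: y ^ z}[op]
            x, y, z = x, x, result
        elif op == ZGO:              # ( x flg addr -- x x x ), jump if flg == 0
            if y == 0:
                pc = z
            y = z = x
        # NOP (and anything else): no effect
    return (x, y, z), pc

# Hypothetical program: compute 2 + 3 and store the sum at address 15.
mem = [LIT, 2, LIT, 3, ADD, LIT, 15, STORE] + [0] * 8
final_stack, final_pc = steamer16_run(mem, steps=5)
```

After the five steps, `mem[15]` holds the sum and the stack is back to three copies of x.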

Xtensa

Table 3-11 on PDF page 57 of [7] lists the 'core' instructions of Tensilica's Xtensa, a 32-bit RISC ISA (Harvard architecture; 24-bit instruction width, with most instructions also having a 16-bit form; the J (jump) instruction has an 18-bit PC-relative immediate offset):

Load/store: L8UI, L16SI, L16UI, L32I, S8I, S16I, S32I
Memory ordering: MEMW (barrier for all memory and cache accesses (loads, stores, acquires, releases, prefetches, and cache operations, but not instruction fetches)), EXTW (MEMW, and in addition a barrier for all external effects)
Jump, Call: CALL0, CALLX0 (CALL0 is PC-relative immediate, CALLX0 is register-addressed), RET, J, JX (J is PC-relative immediate, JX is register-addressed)
Conditional branch: BALL, BNALL, BANY, BNONE (ALL, ANY, NONE refer to if all/any/none of the masked bits are set), BBC, BBCI, BBS, BBSI (BC and BS are if-bit-clear, if-bit-set; the I variants are immediate), BEQ, BEQI, BEQZ, BNE, BNEI, BNEZ, BGE, BGEI, BGEU, BGEUI, BGEZ, BLT, BLTI, BLTU, BLTUI, BLTZ
Move: MOVI (load register with 12-bit signed constant), conditional moves: MOVEQZ, MOVGEZ, MOVLTZ, MOVNEZ
Arithmetic: ADDI, ADDMI (add signed constant shifted by 8), ADD, ADDX2 (add register to register shifted left by 1, i.e. multiplied by 2), ADDX4, ADDX8, SUB, SUBX2, SUBX4, SUBX8, NEG, ABS
Bitwise logical: AND, OR, XOR
Shift: EXTUI ("Extract unsigned field immediate" "Shifts right by 0..31 and ANDs with a mask of 1..16 ones"), SRLI (Shift right logical immediate by 0..15 bit positions; use EXTUI if you need >=16), SRAI, SLLI, SRC ("Shift right combined (a funnel shift with shift amount from SAR) The two source registers are catenated, shifted, and the least significant 32 bits returned"), SLL, SRL, SRA, SSL, SSR ("Set shift amount register (SAR) for shift right logical. This instruction differs from WSR to SAR in that only the five least significant bits of the register are used."), SSAI ("Set shift amount register (SAR) immediate"), SSA8B ("Set shift amount register (SAR) for big-endian byte align"), SSA8L ("Set shift amount register (SAR) for little-endian byte align")
Instruction fetch synchronizes ("Processor control"): ISYNC (RSYNC, also do Instruction fetch synchronize), RSYNC (ESYNC, also do "Instruction register synchronize: Waits for all previously fetched WSR and XSR instructions to be performed before interpreting the register fields of the next instruction"), ESYNC (DSYNC, also do "Register value synchronize: Waits for all previously fetched WSR and XSR instructions to be performed before the next instruction uses any register values"), DSYNC ("Load/store synchronize: Waits for all previously fetched WSR and XSR instructions to be performed before interpreting the virtual address of the next load or store instruction")
Misc ("Processor control"): RSR, WSR (read/write special register), XSR (exchange special register; combined RSR, WSR), RUR, WUR (read/write user register), NOP
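
Two of the shift entries above, EXTUI and SRC, are easy to misread, so here is a small Python model of what their quoted descriptions say. The function names and argument order are just for this sketch and are not Xtensa assembler syntax. Register values are treated as unsigned 32-bit integers.

```python
MASK32 = 0xFFFFFFFF

def extui(value, shift, width):
    """EXTUI as quoted: shift right by 0..31, then AND with a mask of 1..16 ones."""
    if not (0 <= shift <= 31 and 1 <= width <= 16):
        raise ValueError("shift must be 0..31 and width 1..16")
    return ((value & MASK32) >> shift) & ((1 << width) - 1)

def src(high, low, sar):
    """SRC as quoted: concatenate high:low, shift right by SAR, keep the low 32 bits."""
    combined = ((high & MASK32) << 32) | (low & MASK32)
    return (combined >> sar) & MASK32

field = extui(0xABCD1234, 8, 8)          # bits 15..8 of the value
funnel = src(0xDEADBEEF, 0x12345678, 16)  # low half of high, high half of low
```

For example, `extui(0xABCD1234, 8, 8)` picks out the byte 0x12, and the funnel shift by 16 yields 0xBEEF1234.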

Links:

Xtensa: A new ISA and Approach (2000)
TIE Language—The Fast Path to High-Performance Embedded SoC Processing (the language used to extend the architecture)

ACPI ASL

http://www.acpi.info/DOWNLOADS/ACPI_5_Errata%20A.pdf section 19.4, page 714, PDF page 752

ZPU

http://htmlpreview.github.io/?https://github.com/zylin/zpu/master/zpu/docs/zpu_arch.html

BREAKPOINT
IM x (push immediate x)
LOADSP n, STORESP n (LOAD and STORE TOS from/to stack location n)
ADDSP n (ADD stack location n to TOS)
EMULATE "Push PC to stack and set PC to 0x0+xxxxx*32. This is used to emulate opcodes. See zpupgk.vhd for list of emulate opcode values used. zpu_core.vhd contains reference implementations of these instructions rather than letting the ZPU execute the EMULATE instruction"
PUSHPC (push PC) (emulated)
POPPC (set PC to popped addr)
LOAD, STORE (and B and H variants)
PUSHSP (push stack pointer), POPSP
ADD, SUB, MULT, DIV, MOD, NEG
bitwise: AND, OR, NOT, XOR, FLIP (reverse bit order)
NOP
PUSHSPADD "a=sp; b=popIntStack()*4; pushIntStack(a+b);"
POPPCREL "setPc(popIntStack()+getPc());"
LESSTHAN, LESSTHANOREQUAL, and Unsigned variants, EQ, NEQ
EQBRANCH (BE), NEQBRANCH (BNE)
LSHIFTRIGHT, ASHIFTLEFT, ASHIFTRIGHT
CALL, CALLPCREL
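
The quoted pseudocode for PUSHSPADD, POPPCREL and EMULATE can be transliterated into Python as below. This is only a sketch: the class name is made up, memory is a dict of 32-bit cells, and a stack that grows downward in 4-byte steps is an assumption of this sketch rather than something stated in the list above.

```python
class ZpuSketch:
    def __init__(self, sp, pc=0):
        self.mem, self.sp, self.pc = {}, sp, pc

    def push(self, value):               # pushIntStack
        self.sp -= 4                     # assumed: stack grows downward
        self.mem[self.sp] = value & 0xFFFFFFFF

    def pop(self):                       # popIntStack
        value = self.mem[self.sp]
        self.sp += 4
        return value

    def pushspadd(self):
        # "a=sp; b=popIntStack()*4; pushIntStack(a+b);"
        a = self.sp
        b = self.pop() * 4
        self.push(a + b)

    def poppcrel(self):
        # "setPc(popIntStack()+getPc());"
        self.pc = (self.pop() + self.pc) & 0xFFFFFFFF

    def emulate(self, index):
        # "Push PC to stack and set PC to 0x0+xxxxx*32", index is the 5-bit xxxxx.
        self.push(self.pc)
        self.pc = (index & 0x1F) * 32

z = ZpuSketch(sp=0x100, pc=0x40)
z.push(3)
z.pushspadd()          # replaces 3 with (address of that cell) + 3*4
```

So PUSHSPADD turns a cell index on top of the stack into the address of that stack slot, and EMULATE jumps into a table of 32-byte entries starting at address 0.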

NS 32k

https://en.wikipedia.org/wiki/NS320xx#Architecture http://cpu-ns32k.net/files/ismanual.pdf

" The processors had 8 general-purpose 32-bit registers, plus a series of special-purpose registers:

    Frame pointer
    Stack pointer (one each for user and supervisor modes)
    Static base register, for referencing global variables
    Link base register for dynamically linked modules (object orientation)
    Program counter
    A typical processor status register, with a low-order user byte and a high-order system byte.

(Additional system registers not listed).

The instruction set was very much in the CISC model, with 2-operand instructions, memory-to-memory operations, flexible addressing modes, and variable-length byte-aligned instruction encoding. Addressing modes could involve up to two displacements and two memory indirections per operand as well as scaled indexing, making the longest conceivable instruction 23 bytes. The actual number of instructions was much lower than that of contemporary RISC processors.

Unlike some other processors, autoincrement of the base register was not provided; the only exception was a "top of stack" addressing mode that would pop sources and push destinations. " -- https://en.wikipedia.org/wiki/NS320xx#Architecture

Add, Add Quick, Add with Carry, Subtract, Subtract with Carry [Borrow], Negate, abs, mult, Multiply Extended Integer, div, mod, quotient, rem, Divide Extended Integer
Move, Move Quick, Move with Sign-Extension, Move with Zero-Extension
Compare, Compare Quick
Packed Decimal Add and Subtract
AND, OR, XOR, NOT, bit clear
Arithmetic Shift, Logical Shift, Rotate
Jump, Conditional Branch, Unconditional Branch, Case Branch (Multiway), Add Compare and Branch
Jump to Subroutine, Branch to Subroutine, Return from Subroutine
FLOATING POINT
- float + - * / neg
- float abs cmp mv
- float dot polynomial
- float log binary, scale binary
- Move Long Floating to Floating, Move Floating to Long Floating, Move Integer to Floating, Round Floating to Integer, Truncate Floating to Integer, Floor Floating to Integer
- Load/Store FSR
Boolean
- Complement Boolean
- Save Condition as Boolean
BIT
- Bit Test, set, clear, invert
- Find First Set Bit
- Convert to Bit Pointer
BIT FIELD: Extract Field, Extract Field Short, Insert Field, Insert Field Short
STRING
- Move String, Move String Translating
- Compare Strings, Compare Strings Translating
- Skip String, Skip String Translating
BLOCK: Move Multiple, Compare Multiple
ARRAY: Bounds Check, Calculate Index
Call External Procedure, Call External Procedure with Descriptor, Return from External Procedure
Breakpoint Trap, Trap on Flag (conditional), Supervisor Call Trap
Return from Trap (Privileged instruction), Return from Interrupt (Privileged instruction)

---

J1

" J1 is a small (200 lines of Verilog) stack-based CPU, intended for FPGAs. A complete J1 with 16Kbytes of RAM fits easily on a small Xilinx FPGA. Some highlights:

        Extremely high code density. A complete system including the TCP/IP stack fits in under 8K bytes.
        Single cycle call, zero cycle return
        Instruction set maps trivially to Forth
        Cross compiler runs on Windows, Mac and Unix
        Basic software includes a sizeable subset of ANS Forth and a portable TCP/IP networking stack.

... The J1 is a simple 16-bit CPU. It has some RAM, a program counter (PC), a data stack and a call/return stack. It has a small set of built-in arithmetic instructions. Fields in the J1 instructions control the arithmetic function, and write the results back to the data stacks. There are more details on instruction coding in the paper. ... The CPU was designed to run Forth programs very efficiently: the machine’s instructions are so close to Forth that there is little benefit to writing code in assembler. Effectively Forth is the assembly language. J1 runs at about 100 Forth MIPS on a typical FPGA. This compares with about 0.1 Forth MIPS for a traditional threaded Forth running on an embedded 8-bit CPU. ... The code that defines the basic Forth operations as J1 instructions is in basewords.fs

The next layer up defines basic operations in terms of these simple words. These include many of the CORE words from the DPANS94 Forth standard. Some of the general facilities provided by nuc.fs

        byte memory access
        string handling
        double precision (i.e. 32 bit) math
        one’s complement addition
        memory copy and fill
        multiplication and division, fractional arithmetic
        pictured numeric output
        debug words: memory and stack dump, assert

The above files - about 2K of code - bring the J1 to the point where it can start to define application-specific code. " [8]

"operates reliably at 80 MHz in a Xilinx Spartan-3E FPGA" [9]

" The J1 is a small CPU core for use in FPGAs. It is a 16- bit von Neumann architecture with three basic instruction formats. The instruction set of the J1 maps very closely to ANS Forth. The J1 does not have:

condition registers or a carry flag
pipelined instruction execution
8-bit memory operations
interrupts or exceptions
relative branches
multiply or divide support

... This description follows the convention that the top of stack is T, the second item on the stack is N, and the top of the return stack is R. J1’s internal state consists of:

a 33 deep x 16-bit data stack
a 32 deep x 16-bit return stack
a 13-bit program counter

There is no other internal state: the CPU has no condition flags, modes or extra registers. Memory is 16-bits wide ... there are five categories of instructions: literal, jump, conditional jump, call, and ALU.

... Instruction encoding ((paraphrased from figure)):

literal: 1 + 15-bit literal
jump: 000 + 13-bit target
conditional jump: 001 + 13-bit target
call: 010 + 13-bit target
ALU: 011 + 12 bit ALU instruction (see below for fields) + 1 unused bit

... Literals are 15-bit, zero-extended to 16-bit, and hence use a single instruction when the number is in the range 0-32767. To handle numbers in the range 32768-65535, the compiler follows the immediate instruction with invert. Hence the majority of immediate loads take one instruction.
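
A short Python sketch of the encoding paraphrased above: classifying a 16-bit word by its leading bits, and choosing the one- or two-instruction sequence for loading a constant. The function names are made up. The ALU payload is returned undecoded, and "invert" is kept symbolic because the exact bit positions of the ALU fields are not given in the paraphrase.

```python
def j1_decode(word):
    """Classify a 16-bit J1 instruction word as (kind, payload)."""
    if word & 0x8000:                           # leading 1: 15-bit literal
        return ("literal", word & 0x7FFF)
    kinds = ("jump", "conditional jump", "call", "alu")   # 000, 001, 010, 011
    return (kinds[(word >> 13) & 0b11], word & 0x1FFF)

def j1_load_constant(n):
    """Instruction sequence (symbolic) that pushes the 16-bit constant n."""
    if not 0 <= n <= 0xFFFF:
        raise ValueError("constant must fit in 16 bits")
    if n <= 0x7FFF:                             # fits a 15-bit literal
        return [("literal", n)]
    # Otherwise push the bitwise complement and invert it afterwards.
    return [("literal", ~n & 0xFFFF), ("alu", "invert")]

plan = j1_load_constant(0xFFFF)
```

For 0xFFFF the plan is a literal 0 followed by invert. Every constant in 32768-65535 has a complement in 0-32767, so two instructions always suffice.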

All target addresses - for call, jump and conditional branch - are 13-bit. This limits code size to 8K words, or 16K bytes. The advantages are twofold. Firstly, instruction decode is simpler because all three kinds of instructions have the same format. Secondly, because there are no relative branches, the cross compiler avoids the problem of range overflow in resolve.

Conditional branches are often a source of complexity in CPUs and their associated compiler. J1 has a single instruction that tests and pops T, and if T = 0 replaces the current PC with the 13-bit target value. This instruction is the same as 0branch word found in many Forth implementations, and is of course sufficient to implement the full set of control structures.

ALU instructions have multiple fields:

field   width  action
T'         4	 ALU op, replaces T, see table II
T -> N     1	 copy T to N
R -> PC    1	 copy R to the PC
T -> R     1 	 copy T to R
dstack +-  2	 signed increment data stack
rstack +-  2	 signed increment return stack
N -> [T]   1 	 RAM write

ALU operation codes:

0: T
1: N
2: T+N
3: T and N
4: T or N
5: T xor N
6: ∼T
7: N=T
8: N < T
9: N rshift T
10: T − 1
11: R
12: [T]
13: N lshift T
14: depth
15: N u< T

Table III shows how these fields may be used together to implement several Forth primitive words. Hence each of these words map to a single cycle instruction. In fact J1 executes all of the frequent Forth words - as measured by (Gregg, M. A. Ertl, and J. Waldron, “The Common Case in Forth Programs,” in EuroForth, 2001) and (P. J. Koopman, Jr., Stack computers: the new wave. New York, NY, USA: Halsted Press, 1989) in a single clock cycle:

word     T'    T->N   R->PC   T->R   dstack+-   rstack+-   N->[T]
dup      T     1      0       0      +1         0          0
over     N     1      0       0      +1         0          0
invert   ~T    0      0       0      0          0          0
+        T+N   0      0       0      -1         0          0
swap     N     1      0       0      0          0          0
nip      T     0      0       0      -1         0          0
drop     N     0      0       0      -1         0          0
;        T     0      1       0      0          -1         0
>r       N     0      0       1      -1         +1         0
r>       R     1      0       1      +1         -1         0
r@       R     1      0       1      +1         0          0
@        [T]   0      0       0      0          0          0
!        N     0      0       0      -1         0          1

" -- [10]

My notes:

it looks like the way to interpret these is that first dstack+- or rstack+- is applied, then the others; in the others, T, N, and R have the values of what they were just before beginning of the instruction
'dup' is duplicate top-of-stack; 'over' is push a copy of the second item on the stack; 'swap' swaps the top two items on the stack; 'nip' removes the second item on the stack (or, you could say it pops, then destructively writes the value popped over the new top-of-stack); 'drop' removes the top item on the stack; ';' returns from a subroutine; >r pops the data stack and pushes the popped value onto the return stack; r> pops the return stack and pushes the popped value onto the data stack; r@ copies the top of the return stack and pushes it onto the data stack; @ pops a memory location, fetches (loads) the value in that memory location, and pushes that value onto the stack; ! pops a value and a memory location and stores that value at that memory location.
i don't see why 'dup' has T->N, isn't that already the case after dstack+1?
i don't see why r> and r@ have a '1' in T->R?
i don't see why '!' has dstack -1 instead of dstack -2
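
One way to test a reading of these fields is to simulate it. The Python sketch below encodes one plausible model, which is an assumption and not taken from the paper: T lives in a register of its own, N is the top cell of the data stack memory, every field sees the values of T, N and R from before the instruction, and T->N / T->R write the old T into whatever cell is on top after the stack pointer has moved. Under that model a push (dstack +1) without T->N would expose an unwritten cell as N. Only words whose outcome is unambiguous under this model are included; r>, r@ and ! are left out.

```python
def j1_alu_step(stack, rstack, t_new, t_to_n=0, t_to_r=0, d=0, r=0):
    """Apply one ALU instruction; `stack` is the data stack with T last."""
    *below, t = stack                        # T register; N is below[-1]
    n = below[-1] if below else 0
    top_r = rstack[-1] if rstack else 0
    new_t = {"T": t, "N": n, "R": top_r,
             "T+N": (t + n) & 0xFFFF, "~T": ~t & 0xFFFF}[t_new]
    below, rstack = list(below), list(rstack)
    if d == +1:                              # push: new N cell gets old T if T->N
        below.append(t if t_to_n else 0)     # 0 stands in for an unwritten cell
    elif d == -1:
        below.pop()
    elif t_to_n:                             # no pointer move: overwrite N
        below[-1] = t
    if r == +1:
        rstack.append(t if t_to_r else 0)
    elif r == -1:
        rstack.pop()
    elif t_to_r:
        rstack[-1] = t
    return below + [new_t], rstack

# A subset of Table III, as keyword arguments for j1_alu_step.
WORDS = {
    "dup":    dict(t_new="T",   t_to_n=1, d=+1),
    "over":   dict(t_new="N",   t_to_n=1, d=+1),
    "invert": dict(t_new="~T"),
    "+":      dict(t_new="T+N", d=-1),
    "swap":   dict(t_new="N",   t_to_n=1),
    "nip":    dict(t_new="T",   d=-1),
    "drop":   dict(t_new="N",   d=-1),
    ">r":     dict(t_new="N",   t_to_r=1, d=-1, r=+1),
}

def run_word(name, stack, rstack=()):
    return j1_alu_step(list(stack), list(rstack), **WORDS[name])
```

With this model the table rows reproduce the usual Forth stack effects, for example `run_word("swap", [1, 2])` gives a data stack of `[2, 1]`.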

" The CPU’s architecture encourages highly-factored code:

the call instruction is always single-cycle
; and 'exit' are usually free (due to an optimization)
the return stack is 32 elements deep

...

Almost all of the core words are written in pure Forth, the exceptions are 'pick' and 'roll', which must use assembly code because the stack is not accessible in regular memory. Much of the core is based on eforth [10].

" -- [11]

There is also the J1a, "a simplified variant of the original J1. The modifications from the original J1 are:

multi-bit shifts are gone, instead the J1a has single-bit shifts
the stacks are implemented as push/pop shift registers
support for single-ported RAMs "

"The board has 8K of RAM, and runs SwapForth?, a small but complete interactive Forth development environment. SwapForth? takes about 5K of the available RAM, and includes a full native compiler, the ANS standard CORE words, and several more modern extensions."

http://excamera.com/sphinx/article-j1a-swapforth.html

Links:

MicroCore

embedded CPU for Forth.

" MicroCore? [1] is a popular configurable processor core targeted at FPGAs. It is a dual-stack Harvard architecture, encodes instructions in 8 bits, and executes one instruction in two system clock cycles. A call requires two of these instructions: a push literal followed by a branch to Top-of-Stack (TOS). A 32-bit implementation with all options enabled runs at 25 MHz - 12.5 MIPS - in a Xilinx Spartan-2S FPGA. " -- [12]

Links:

Schleisiek, “MicroCore,” in EuroForth, 2001

b16-small

embedded CPU for Forth.

" b16-small [2], [3] is a 16-bit RISC processor. In addition to dual stacks, it has an address register A, and a carry flag C. Instructions are 5 bits each, and are packed 1-3 in each word. Byte memory access is supported. Instructions execute at a rate of one per cycle, except memory accesses and literals which take one extra cycle. The b16 assembly language resembles Chuck Moore’s ColorForth?. FPGA implementations of b16 run at 30 MHz. " -- [13]

Links:

B. Paysan. http://www.jwdt.com/~paysan/b16.html
B. Paysan, “b16-small – Less is More,” in EuroForth, 2004

eP32

embedded CPU for Forth.

" eP32 [4] is a 32-bit RISC processor with deep return and data stacks. It has an address register (X) and status register (T). Instructions are encoded in six bits, hence each 32-bit word contains five instructions. Implemented in TSMC’s 0.18um CMOS standard library the CPU runs at 100 MHz, providing 100 MIPS if all instructions are short. However a jump or call instruction causes a stall as the target instruction is fetched, so these instructions operate at 20 MIPS. " -- [14]

Links:

E. Hjørtland and L. Chen, "EP32 - a 32-bit Forth Microprocessor" in Canadian Conference on Electrical and Computer Engineering, pp. 518–521, 2007

F18 (GA144 core)

Two stacks (data and return). "Each stack is eight elements indexed circularly.". There are also registers to access some of the stack elements; T (top of data stack), S (second item on data stack), R (top of return stack). These registers are actually in addition to/spilled from the stack, so there is effectively a 9-element return stack and a 10-element data stack.

Registers:

P (program counter)
I (current instruction word)
A (GPR)
B (write-only)
T (top of data stack)
S (second item on data stack)
R (top of return stack)

"Having separate address registers may be a very strange thing to someone familiar with Forth. Normally, addresses go on the stack and the fetch (@) and store (!) operations use them from there. On the F18, fetch and store are always through P, A or B." [15]

"Having such a lightweight calling convention is what allows for extremely aggressive factoring in Forth...In Forth a call costs only a push/pop of a return address and having lots of small routines is encouraged without worrying about inlining." [16]

18 bit words. 64-word RAM. In addition, a 64-word ROM.

Instruction encoding: 5 bit instructions (no operands except for (sometimes) addresses). Note that 4 instructions fit in each 18-bit word, by restricting what the last instruction in a word may be (the first three slots take 15 bits, leaving only 3 bits for the last slot).
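
A rough Python sketch of packing four 5-bit opcodes into one 18-bit cell. The detail that the 3-bit last slot stores the upper three bits of an opcode whose low two bits are zero (so only opcodes that are multiples of 4 fit there) is my understanding and should be treated as an assumption, as is the function name. Any further transformation the hardware applies to stored words is not modeled.

```python
def f18_pack(slots):
    """Pack four 5-bit opcodes into an 18-bit word (5 + 5 + 5 + 3 bits)."""
    s0, s1, s2, s3 = slots
    if any(not 0 <= s <= 31 for s in slots):
        raise ValueError("opcodes are 5 bits wide")
    if s3 % 4 != 0:
        # Assumed rule: the last slot has 3 bits, low two opcode bits implied 0.
        raise ValueError("this opcode cannot go in the last slot")
    return (s0 << 13) | (s1 << 8) | (s2 << 3) | (s3 >> 2)

# Opcodes that fit in the last slot under the assumed rule: 8 of the 32.
LAST_SLOT_OPCODES = [op for op in range(32) if op % 4 == 0]

# Example using the opcode values that appear in the hex listings on this page
# (@p = 08, push = 1d, nop '.' = 1c).
word = f18_pack([0x08, 0x1D, 0x1C, 0x1C])
```

The packed word always fits in 18 bits, and trying to place, say, push (1d) in the last slot raises an error in this sketch.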

"The P register is the program counter; pointing at the next instruction cell to be fetched. I’ll call these “cells” so as not to confuse with Forth “words”. Initially P points at a port reading instructions from a block on disk (block 0, to which the assembler targets). When pointing at a port like this, it continues to do so until a (call), (jump), etc. directs it to fetch instructions from other ports or from RAM. However, when fetching from RAM, P will auto-increment. The most recently fetched instruction cell is placed in register I where it is executed slot by slot – four slots containing op codes (see below). It’s important to realize that we don’t execute through P. Instead we first fetch through P into I and then execute instruction slots from I. Instructions may in the mean time further advance P (e.g. @p to fetch a literal value inline) without immediately affecting execution. There may even be a micronext (unext) loop within I without further fetching at all. " [17]

"Note that upon boot, the B register conveniently points at the console I/O port." [18]

"These are the 32 instructions of this simple machine. I will briefly describe them here and will get into more detail on some (e.g. the multiply-step instruction) in future posts." [19]

control:

;: return (pop return stack and goto that address)
ex: "execute" (swap P and R)
(jump) addr
(call) addr
unext: loop within I (eg go back to the instruction in the first slot of I), decrement R (micronext)
next addr: loop to address, decrement R
if addr: jump if T=0 (leaving condition value on the stack)
-if addr: jump if T≥0 (leaving condition value on the stack)

'addr' operands take up the remaining slots within the current contents of I.

loads and stores:

@p: fetch inline literal via P, autoincrement ("fetch-p")
@+: fetch via A, autoincrement
@b: fetch via B
@: fetch via A
!p: store via P, autoincrement ("store-p")
!+: store via A, autoincrement
!b: store via B
!: store via A

"Fetching and storing through P auto-increments (except when pointing at a port)" [20]

arithmetic and misc ALU stuff:

+ (plus), +* (multiply-step), 2* (left shift), 2/ (right shift), – (invert all bits ("not"))
and, or (exclusive or)
drop, dup, pop (from top of RETURN stack to data stack), over
a: (A to T)
push: (from T to R)
b! (store into B ("b-store")), a! (store into A)
.: nop

Simulator-only instructions: break (break into debug view), mark (reset performance statistics)

Example programs from [21]:

– push . .      
@b !b unext

" a two-cell program that echos console input. There are no addresses shown because these are not packed in memory, but streamed over a port at which the P register points. The first cell negates the top of the stack (–), leaving ffffffff in T, and then does a push to R. The remaining two slots are nops (.). The following cell sets up a micronext (unext) loop; first reading from the console with fetch-B (@b) and echoing a key back with store-B (!b). Note that upon boot, the B register conveniently points at the console I/O port. Finally the unext causes execution to loop back to the first slot; decrement R as the induction variable counting down to zero. This means the machine will sit and spin for 2^32 iterations echoing keypresses. It’s executing the single-cell program with nothing in RAM! This is an interesting aspect of the F18; that you can execute code streaming over a port without first loading into memory and do micronext looping without instruction fetches. " -- [22])

" Port execution is a very interesting aspect of the F18. Executing code streaming in over a port rather than from memory is extremely useful as a “protocol” of sorts between nodes. No parsing. Just ask your neighbors to do things. You may use your neighbor’s RAM and stacks. You can treat the nodes as “agents” and messages become just code passed between them. " -- [23]) (see also http://excamera.com/sphinx/article-ga144-ram.html )

Another example program:

         08 0f 1c 1c    @p ! . .        
 0000    0a 0e 02 00    @b !b jump:0000 
         02 00 00 00    jump:0000

"The above does nearly the same thing as Program 1; echoing keypresses forever. The previous version didn’t work forever. It used a micronext loop to over ffffffff iterations. Here we use a (jump) instruction to loop literally forever. This requires an address to which to jump and thus requires the program to be in memory. These three cells are streamed in over the port to which P points upon boot (block 0) as usual. The first cell simply reads the second cell from the port (@p) and stores it in memory (!). Note that A points at 0000 upon boot. Then the third cell is executed (the second having just been fetched inline). This jumps to address 0000. Notice that the assembler shows that the second cell is indeed packed into RAM at address 0000. So three cells have been streamed in, but just one remains in memory. This program is just like the earlier one; reading from console-in (@b) and writing to console-out (!b), but instead of a unext for R iterations it does an unconditional jump to itself." -- [24]) (see also http://excamera.com/sphinx/article-ga144-ram.html )

Another example program:

         08 1d 1c 1c    @p push . .     
         00 00 00 05    5               
         08 0d 04 1c    @p !+ unext .   
 0000    08 08 1d 1c    @p @p push .    
 0001    00 00 00 60    60              
 0002    00 00 00 19    19              
 0003    08 14 18 0e    @p + dup !b     
 0004    00 00 00 01    1               
 0005    05 00 03 00    next:0003 ;     
         03 00 00 00    call:0000

" This is the last example we’ll walk through like this before moving on to see how much easier this is in colorForth. Try reading the code and working through what it does before continuing 🙂

Again this sequence of cells is streamed through a port at which P points. The first three cells load the next five and the last cell calls the freshly loaded code. In fact, you can think of the first three cells as a generic “loader” for the computer. It fetches a value inline (the green 5) with @p and does a push to R to set up for a micronext loop for six iterations (one more than R). Notice that literals like this are fetched inline and execution continues with the next cell. The loop then is in the third cell. The @p !+ unext will execute six times; reading in the following cells and appending to memory. Notice that the !+ stores to and then increments A. Finally, the last cell will call into this memory-packed code.

Notice that since this is a call and not a jump and because the packed code ends with a return (;) then the computer will go back to awaiting further instructions after the program executes. That is, the P register is initially pointing to a port. Upon executing the call, this port address is pushed to the return stack as usual and P now points at RAM. Upon returning from the call the port address is popped and it goes back to reading from the port. In general, this is an interesting idea. You can think of it as memory mapped ports or you can think of I/O vs. memory as being different “modes” triggered by bits in P. Being able to nest “mode switches” by pushing flags along with return addresses to the return stack is powerful. In the actual F18 computer for example, extended arithmetic mode is triggered by a bit in P and mode switching may be conveniently nested this way.

Back to the program above. Did you figure out what it does? This will emit the alphabet to the console. It fetched a couple of literals to the data stack (@p @p and the following green 60 and 19) and does a push of the 19 to R for use as a next loop counter. Realize that these are hex values (19 hex is 25 decimal for iterating the letters of the alphabet, 60 is just below the ASCII character ‘a’).

The loop starts at address 0003. We increment the top of the stack (the letter to emit); fetching a literal 1 (@p paired with the green 1) and adding (+) it. We then dup this so that we can use it and have a copy still available for the next iteration. The !b emits the letter to the console. This loop doesn’t fit within a micronext. Instead, the following cell is executed (address 0005) with a regular next instruction. Just like unext, this decrements R and loops while non-zero. Once the loop falls through, the return instruction causes the computer to return to awaiting instructions from the port as we discussed.

This, in general, is how you use the computers. Sometimes you can simply stream code to them to be executed. Other times the stream of code is effectively a “loader” to pack RAM with useful routines. These routines may then be called by further streamed code. These calls may return to awaiting further instructions from the stream (or possibly never return). " -- https://blogs.msdn.microsoft.com/ashleyf/2013/10/13/programming-the-f18/
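
The loop logic of the alphabet program can be restated in Python to check the counts: a `next` loop with R = 0x19 runs its body 26 times (one more than R), starting from 0x60 + 1 = 'a'. This sketch models only the arithmetic of the loop, not the F18 itself, and the function name is made up.

```python
def alphabet_loop(start=0x60, count=0x19):
    """Mimic the loop at address 0003: increment T, emit it, then `next`."""
    t, r, emitted = start, count, []
    while True:
        t += 1                     # @p + with the inline literal 1
        emitted.append(chr(t))     # dup !b : emit a copy of T to the console
        if r == 0:                 # next: fall through once R reaches zero
            break
        r -= 1                     # otherwise decrement R and loop again
    return "".join(emitted)

letters = alphabet_loop()
```

The same counting explains the loader: a push of 5 to R followed by a micronext loop copies six cells.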

Links:

https://blogs.msdn.microsoft.com/ashleyf/2013/10/13/programming-the-f18/
http://excamera.com/sphinx/article-ga144-ram.html
http://wayback.archive.org/web/20160331204751/http://www.colorforth.com/stack.htm
http://www.greenarraychips.com/home/documents/greg/DB001-110412-F18A.pdf
http://school.arrayforth.com/index.php?category=5#categoryContent
https://github.com/AshleyF/Color
https://blogs.msdn.microsoft.com/ashleyf/2013/11/02/the-beautiful-simplicity-of-colorforth/
https://blogs.msdn.microsoft.com/ashleyf/2013/11/10/multiply-step-instruction/
https://blogs.msdn.microsoft.com/ashleyf/2014/07/04/f18-variables/
https://blogs.msdn.microsoft.com/ashleyf/2013/09/21/chuck-moores-creations/
http://www.rs-online.com/designspark/electronics/blog/hands-on-with-a-144-core-processor
https://mschuldt.github.io/www.colorforth.com/ef.htm
http://wildirisdiscovery.blogspot.com/2015/02/the-ga144.html "Every commercial application you might think of ends up requiring more I/O pins then the designers of this part gave it...Another frustrating aspect of the GA144 is that...you can’t tile single GA144s together to form a much larger array because the top/bottom and right/left edges of the chip don’t match up pin to pin (and also)...the GA144 only has two SERDES links, and both are located on the same side of the part"
http://www.forth.org/svfig/kk/10-2013-Ruffer.pdf (about the difficulty of soldering the GA144?)

Comment:

" kragen on May 18, 2016

parent

favorite

on: The MOnSter? 6502

I think the GreenArrays F18A cores are similar in transistor count to the 6502, but the instruction set is arguably better, and the logic is asynchronous, leading to lower power consumption and no need for low-skew clock distribution. In 180nm fabrication technology, supposedly, it needs an eighth of a square millimeter (http://www.greenarraychips.com/home/documents/greg/PB003-110...), which makes it almost 4 million square lambdas. If we figure that a transistor is about 30 square lambdas and that wires occupy, say, 75% of the chip, that's about 32000 transistors per core, the majority of which is the RAM and ROM, not the CPU itself; the CPU is probably between 5000 and 10 000 transistors. The 6502 was 4528 transistors: http://www.righto.com/2013/09/intel-x86-documentation-has-mo...

The F18A is a very eccentric design, though: it has 18-bit words (and an 18-bit-wide ALU, compared to the 6502's 8, which is a huge benefit for multiplies in particular), with four five-bit instructions per word. You'll note that this means that there are only 32 possible instructions, which take no operands; that is correct. Also you'll note that two bits are missing; only 8 of the 32 instructions are possible in the last instruction slot in a word.

Depending on how you interpret things, the F18(A) has 20 18-bit registers, arranged as two 8-register cyclic stacks, plus two operand registers which form the top of one of the stacks, a loop register which forms the top of the other, and a read-write register that can be used for memory addressing. (I'm not counting the program counter, write-only B register, etc.)

Each of the 144 F18A cores on the GA144 chip has its own tiny RAM of 64 18-bit words. That, plus its 64-word ROM, holds up to 512 instructions, which isn't big enough to compile a decent-sized C program into; nearly anything you do on it will involve distributing your program across several cores. This means that no existing software or hardware development toolchain can easily be retargeted to it. You can program the 6502 in C, although the performance of the results will often make you sad; you can't really program the GA144 in C, or VHDL, or Verilog.

The GreenArrays team was even smaller than the lean 6502 team. Chuck Moore did pretty much the entire hardware design by himself while he was living in a cabin in the woods, heated by wood he chopped himself, using a CAD system he wrote himself, on an operating system he wrote himself, in a programming language he wrote himself. An awesome feat.

I don't think anybody else in the world is trying to do a practical CPU design that's under 100 000 transistors at this point. DRAM was fast enough to keep up with the 6502, but it isn't fast enough to keep up with modern CPUs, so you need SRAM to hold your working set, at least as cache. That means you need on the order of 10 000 transistors of RAM associated with each CPU core, and probably considerably more if you aren't going to suffer the apparent inconveniences of the F18A's programming model. (Even the "cacheless" Tera MTA had 128 sets of 32 64-bit registers, which works out to 262144 bits of registers, over two orders of magnitude more than the 1152 bits of RAM per F18A core.)

So, if you devote nearly all your transistors to SRAM because you want to be able to recompile existing C code for your CPU, but your CPU is well under 100k transistors like the F18A or the 6502, you're going to end up with an unbalanced design. You're going to wish you'd spent some of those SRAM transistors on multipliers, more registers, wider registers, maybe some pipelining, branch prediction, that kind of thing.

There are all kinds of chips that want to embed some kind of small microprocessor using a minimal amount of silicon area, but aren't too demanding of its power. A lot of them embed a Z80 or an 8051, which have lots of existing toolchains targeting them. A 6502 might be a reasonable choice, too. Both 6502 and Z80 have self-hosting toolchains available, too, but they kind of suck compared to modern stuff.

If you wanted to build your own CPU out of discrete components (like this delightful MOnSter!) and wanted to minimize the number of transistors without regard to the number of other components involved, you could go a long way with either diode logic or diode-array ROM state machines.

Diode logic allows you to compute arbitrary non-inverting combinational functions; if all your inputs are from flip-flops that have complementary outputs, that's as universal as NAND. This means that only the amount of state in your state machine costs you transistors. Stan Frankel's Librascope General Precision LGP-21 "had 460 transistors and about 300 diodes", but you could probably do better than that.

Diode-array ROM state machines are a similar, but simpler, approach: you simply explicitly encode the transition function of your state machine into a ROM, decode the output of your state flip-flops into a selection of one word of that ROM, and then the output data gives you the new state of your state machine. This costs you some more transistors in the address-decoding logic, and probably costs you more diodes, too, but it's dead simple. The reason people do this in real life is that they're using an EPROM or similar chip instead of wiring up a diode array out of discrete components. (The Apple ][ disk controller was a famous example of this kind of thing.) " -- [25]
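A minimal sketch of the ROM-as-transition-function idea from the quote above. The particular machine (a 2-bit counter with an enable input) and all names are made up for illustration; the quote only describes state bits selecting a ROM word, and the extra input bit in the address is my assumption.

```python
# Hypothetical example: a 2-bit counter with an "enable" input,
# implemented purely as a lookup in a ROM.
# ROM address = (current state << 1) | enable; ROM data word = next state.
STATE_BITS = 2


def build_rom():
    """Encode the whole transition function as a flat list of words."""
    rom = []
    for state in range(1 << STATE_BITS):
        for enable in (0, 1):
            if enable:
                next_state = (state + 1) % (1 << STATE_BITS)
            else:
                next_state = state
            rom.append(next_state)
    return rom


def step(rom, state, enable):
    """One clock tick: state flip-flops plus input select one ROM word."""
    return rom[(state << 1) | enable]


def run(rom, inputs, state=0):
    """Feed a sequence of enable bits and record every state visited."""
    trace = [state]
    for enable in inputs:
        state = step(rom, state, enable)
        trace.append(state)
    return trace
```

All of the logic lives in the table; the only stateful part is the 2-bit state register, which is the point made in the quote about transistor count.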

---

a library of hardware-level primitives:

http://www.utdallas.edu/~mxl095420/EE6306/Final%20project/tsmc18_component.pdf

---

Nand2Tetris (The Elements of Computing Systems)'s 'Hack' assembly language and ISA

assembly and ISA: http://www.nand2tetris.org/chapters/chapter%2004.pdf

And their HLL, Jack:

http://www.nand2tetris.org/lectures/PDF/lecture%2009%20high%20level%20language.pdf http://www.nand2tetris.org/projects/09/Jack%20OS%20API.pdf

---

ForwardCom

An open ISA designed by Agner Fog starting in 2016. E seems to have thought particularly hard about vector operations.

Links:

---

IBCM (Itty Bitty Computing Machine)

A teaching language. Has a single 16-bit accumulator. 16-bit words, 12-bit addressing (so 4k words of memory, or 8k bytes). Instructions are 16-bit words, with a 4-bit opcode and 12-bits for operands (instructions 'io' and 'shift' have a special format, halt doesn't use the operands at all, and the other instructions all have a 12-bit memory address here).
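A small sketch of the field split described above, assuming the 4-bit opcode sits in the most significant bits of the word (the text gives the field widths but not their positions). The function and constant names are made up.

```python
def decode(word):
    """Split a 16-bit instruction word into (opcode, operand)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("IBCM words are 16 bits")
    opcode = word >> 12       # assumed: top 4 bits
    operand = word & 0x0FFF   # low 12 bits; a memory address for most instructions
    return opcode, operand


# 12-bit addresses give 4k words, i.e. 8k bytes at 2 bytes per word.
ADDRESSABLE_WORDS = 1 << 12
ADDRESSABLE_BYTES = ADDRESSABLE_WORDS * 2
```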

Instructions (mostly quoted and paraphrased from [26]; i've changed some of the mnemonics):

dw (for “declare word”), for declaring variables
halt
io (can either read or print a value, either as hex or as an ASCII character)
shift (either shift or rotate the accumulator, either left or right, some number of bits between 0 and 15, inclusive)
load, store
add, sub (accumulator = accumulator +/- contents_of_memory_address)
and, or, xor, not (bitwise operations between memory and accumulator, mutating accumulator)
jmp, jmpe (jump if accumulator equals zero), jmpl (jump if accumulator less than zero)
brl (branch and link; set accumulator to the address of the next instruction, then jump to address)
nop

Link:

https://aaronbloomfield.github.io/pdr/book/ibcm-chapter.pdf

---

Steffens' Google Sheets VM

Implemented (for fun) on top of a spreadsheet application (Google Sheets). Registers and memory are cells in the spreadsheet.

6 registers: 4 GPRs, the IP, and a stack pointer.

Instructions (there is no instruction encoding, the textual assembly is directly executed):

mov dst src copies a value from src to dst.
add dst src, sub dst src, mul dst src
push src, pop dst
jmp target, jl cmp1 cmp2 target (jump if less than)
call target (push the IP onto the stack, then jump to target), ret
output src
end (halt)

4 addressing modes:

immediate
register direct
memory
memory indirect
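A sketch of how the four addressing modes could be resolved when reading an operand. The operand syntax here (bare number, register name, `[number]`, `[register]`) is an assumption for illustration and may not match the spreadsheet's actual notation.

```python
def read_operand(operand, regs, mem):
    """Return the value an operand refers to.

    Hypothetical syntax:
      "7"    immediate
      "r0"   register direct
      "[1]"  memory (literal address)
      "[r0]" memory indirect (address held in a register)
    """
    if operand.startswith("[") and operand.endswith("]"):
        inner = operand[1:-1]
        if inner in regs:
            return mem[regs[inner]]   # memory indirect
        return mem[int(inner)]        # memory
    if operand in regs:
        return regs[operand]          # register direct
    return int(operand)               # immediate
```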

Link:

https://briansteffens.github.io/2017/07/03/google-sheets-virtual-machine.html

---

M.core

" Then there was M.Core, Motorola’s direct assault on ARM. An entirely new 32-bit MCU family designed from the ground up to be small, efficient, and inexpensive, M.Core found few fans outside of its creator’s walls. " -- [27]

https://en.wikipedia.org/wiki/M%C2%B7CORE

http://www.ece.ualberta.ca/~cmpe490/documents/motorola/MCORERM.pdf

---

Epiphany

For massive parallelism

http://adapteva.com/docs/epiphany_arch_ref.pdf

Instructions

" Instruction-set highlights include:

Orthogonal instruction set, with no restrictions on register usage.
Instruction set optimized for floating point computation and efficient data movement.
Post-modify load/store instructions for efficient handling of large array structures.
Rich set of branch conditions, with 3-cycle branch penalty on all taken branches and zero penalty on untaken branches.
Conditional move instructions to reduce branch penalty for simple control-code structures.
Instructions with immediate modifies for high code density and low power consumption.
Compact and efficient floating-point instruction set

Branch Instructions:

Unrestricted branching is supported throughout the 32-bit memory map using branch instructions and register jump instructions. Branching can occur conditionally, based on the arithmetic flags set by the integer or floating-point execution unit. The following table illustrates the condition codes supported by the ISA. The architecture supports two sets of flags to allow independent conditional execution and branching of instructions based on results from two separate arithmetic units. The full set of branching conditions allows the synthesis of any high-level control comparison, including: <, <=, =, ==, !=, >=, and >. Both signed and unsigned arithmetic is supported.

B (conditional branch or absolute jump)
BL (Jump and Link)
JR (register jump)
JALR (Register Jump and Link)

Load/Store Instructions:

Load and store instructions move data between the general-purpose register file and any legal memory location within the architecture, including external memory and any other eCore CPU in the system. All other instructions, such as floating-point and integer arithmetic instructions, are restricted to using registers as source and destination operands.

The ISA supports the following addressing modes:

Displacement Addressing: The memory address is calculated by adding an immediate offset to a base register value. The immediate offset is limited to 3 bits for 16-bit load/store instructions or 11 bits for 32-bit load/store instructions. The base register value is not modified by the load/store operation. This mode is useful for accessing local variables.
Indexed Addressing: The memory address is calculated by adding a register value offset to a base register value. The base register value is not modified by the load/store operation. This mode is useful in array addressing.
Post-Modify Auto-increment Addressing: The memory address is taken directly from the base register value. After the memory operation has completed, a register offset is added to the base register value and written back to the base register. This mode is useful for processing large data arrays and for implementing an efficient stack-pop operation.

Byte, short, word, and double data types are supported by all load/store instruction formats. All loads and stores must be aligned with respect to the data size being used. Short (16-bit) data types must be aligned on 16-bit boundaries in memory, word (32-bit) data types must be aligned on 32-bit boundaries, and double (64-bit) data types must be aligned on 64-bit boundaries. Unaligned memory accesses returns unexpected data and generates a software exception. Double data-type load/store instructions must specify an even register in the general-purpose register file. The corresponding odd register is written implicitly. Attempts to use odd registers with double data format is flagged as an error by the assembler.

LDR (load register) STR (store register) TESTSET (test and set) addressing modes on LDR and STR: Immediate Offset (effective address is contents of registers, plus or minus constant), Postmodify-immediate, Register Offset (effective address is sum or difference of two registers), Postmodify-Register
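A rough model of the three effective-address calculations described in the quoted text (displacement, indexed, post-modify). Registers are modeled as a dict, the function names are made up, and any scaling of the offset by data size is ignored, so treat this as a sketch of the idea and not of the exact hardware behavior.

```python
def displacement(regs, base, imm):
    """Address = base register + immediate; base register is unchanged."""
    return regs[base] + imm


def indexed(regs, base, index):
    """Address = base register + index register; base register is unchanged."""
    return regs[base] + regs[index]


def post_modify(regs, base, offset):
    """Address = base register; afterwards the offset is written back into it."""
    address = regs[base]
    regs[base] = address + offset
    return address
```

Post-modify is what makes the array-walking loops cheap: each load both fetches an element and advances the pointer.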

Integer Instructions: General-purpose integer instructions, such as ADD, SUB, ORR, AND, are useful for control code and integer math. These instructions work with immediate as well as register-based operands. The instructions update the integer status bits of the STATUS register.

ADD SUB ASR LSR LSL ORR (or) AND EOR (xor) BITR (bit reverse; as far as i can tell this reverses the order of the bits in the register, which is different from a bitwise NOT)
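Bit reversal and bitwise NOT are different operations, so it matters which one BITR is. A minimal sketch of both on 32-bit values follows (function names are made up); my understanding is that BITR is the bit-order reversal, which is handy for FFT index shuffling, but check the architecture reference before relying on that.

```python
def bit_reverse32(x):
    """Reverse the order of the 32 bits: bit 0 becomes bit 31, and so on."""
    result = 0
    for i in range(32):
        result = (result << 1) | ((x >> i) & 1)
    return result


def bitwise_not32(x):
    """Flip every one of the 32 bits in place."""
    return ~x & 0xFFFFFFFF
```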

Floating-Point Instructions: An orthogonal set of IEEE754-compliant floating-point instructions for signal processing applications. These instructions update the floating-point status bits of the STATUS register.

FADD FSUB FMUL FMADD (multiply-add) FMSUB (multiply-subtract) FABS FIX (float to fixed point conversion) FLOAT (fixed to float conversion)

Secondary Signed Integer Instructions: The basic floating point instruction set can be substituted with a set of signed integer instructions by setting the appropriate mode bits in the CONFIG register [19:16]. These instructions use the same opcodes as the floating-point instructions and include: IADD, ISUB, IMUL, IMADD, IMSUB.

IADD ISUB IMUL IMADD (multiply-add) IMSUB (multiply-sub)

Register Move Instructions: All register moves are done as complete word (32-bit) entities. Conditional move instructions support the same set of condition codes as the branch instructions specified in Table 12.

MOV (can be immediate or register addr mode; register addr mode MOV can be conditional)
MOVT (move immediate high; RD := RD | (<imm16> << 16))

MOVTS (move to special register), MOVFS (move from special register)

Program Flow Instructions A number of special instructions used by the run time environment to enable efficient interrupt handling, multicore programming, and program debug:

NOP
IDLE (wait for interrupt)
RTS (return from subroutine; PC = LR (link register))
RTI (return from interrupts; PC = IRET)
GID (interrupts disable; disable all interrupts)
GIE (interrupts enable)
BKPT (breakpoint)
MBKPT (multicore breakpoint)
TRAP (halts program)
SYNC (forces an ILAT[0] on all cores in group)
WAND (multicore barrier)

" Figure 9 shows how the shared-memory architecture and the eMesh network work productively together. In the example, a dot-product routine writes its result to a memory location in another mesh node. The only thing required to pass data from one node to another is the setting of a pointer. The hardware decodes the transaction and determines whether it belongs to the local node's memory or to another node's memory.

Figure 9 : Pointer Manipulation Example

C-CODE
VecA array at 0x82002000
VecB array at 0x82004000
remote_res at 0x92004000
for (i=0;i<100;i++){ loc_sum+=vecA[i]*vecB[i]; }
remote_res=loc_sum;

ASSEMBLY
R0=pointer to VecA
R2=pointer to VecB
R6=pointer to remote_res
R4=loc_sum

MOV R5, #100;
_L: LDR R1,[R0],#1;
LDR R3,[R2],#1;
FMADD R4,R1,R3;
SUB R5,R5,#1;
BNE _L;
STR R4,[R6]; "
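To make the register usage in the figure easier to follow, here is a line-by-line model of the assembly loop. It is only a sketch: list indices stand in for the array pointers, a plain dict stands in for the shared address space, and the function name is made up. The address of remote_res is the one given in the figure.

```python
REMOTE_RES = 0x92004000  # address of remote_res, taken from the figure


def dot_product_to_remote(vec_a, vec_b, memory):
    """Model of the loop; `memory` is a dict standing in for the address space."""
    if not vec_a or len(vec_a) != len(vec_b):
        raise ValueError("need two non-empty vectors of equal length")
    r0 = r2 = 0       # R0, R2: indices standing in for the array pointers
    r4 = 0.0          # R4: loc_sum
    r5 = len(vec_a)   # R5: loop counter (MOV R5, #100 in the figure)
    while True:
        r1 = vec_a[r0]
        r0 += 1       # LDR R1,[R0],#1  (load, then post-modify the pointer)
        r3 = vec_b[r2]
        r2 += 1       # LDR R3,[R2],#1
        r4 += r1 * r3  # FMADD R4,R1,R3
        r5 -= 1       # SUB R5,R5,#1
        if r5 == 0:   # BNE _L falls through once the counter reaches zero
            break
    memory[REMOTE_RES] = r4  # STR R4,[R6]: an ordinary store to another node's address
    return r4
```

The remote write is just the final store; nothing in the loop knows or cares that the destination lives on another core.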

---

Transputer

" The 16 'primary' one-operand instructions were:

Mnemonic: Description
J: Jump (add immediate operand to instruction pointer)
LDLP: Load Local Pointer (load a Workspace-relative pointer onto the top of the register stack)
PFIX: Prefix (general way to increase lower nibble of following primary instruction)
LDNL: Load non-local (load a value offset from address at top of stack)
LDC: Load constant (load constant operand onto the top of the register stack)
LDNLP: Load Non-local pointer (Load address, offset from top of stack)
NFIX: Negative prefix (general way to negate, and possibly increase, lower nibble)
LDL: Load Local (load value offset from Workspace)
ADC: Add Constant (add constant operand to top of register stack)
CALL: Subroutine call (push instruction pointer and jump)
CJ: Conditional jump (depending on value at top of register stack)
AJW: Adjust workspace (add operand to workspace pointer)
EQC: Equals constant (test if top of register stack equals constant operand)
STL: Store local (store at constant offset from workspace)
STNL: Store non-local (store at address offset from top of stack)
OPR: Operate (general way to extend instruction set)

All these instructions take a constant, representing an offset or an arithmetic constant. If this constant was less than 16, all these instructions coded to a single byte.

The first 16 'secondary' zero-operand instructions (using the OPR primary instruction) were:

Mnemonic: Description
REV: Reverse (swap two top items of register stack)
LB: Load byte
BSUB: Byte subscript
ENDP: End process
DIFF: Difference
ADD: Add
GCALL: General Call (swap top of stack and instruction pointer)
IN: Input (receive message)
PROD: Product
GT: Greater Than (the only comparison instruction)
WSUB: Word subscript
OUT: Output (send message)
SUB: Subtract
STARTP: Start Process
OUTBYTE: Output Byte (send single-byte message)
OUTWORD: Output word (send single-word message) " -- [28]
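A sketch of how PFIX and NFIX build operands larger than one nibble. Each instruction byte is taken to be a 4-bit function code followed by a 4-bit data nibble, accumulated in an operand register. The function-code numbers (PFIX=2, LDC=4, NFIX=6) are assumed from the order of the primary list above, which matches my recollection of the encoding; the helper names are made up.

```python
PFIX, LDC, NFIX = 0x2, 0x4, 0x6  # assumed function codes (position in the list)


def encode(function, operand):
    """Return the byte sequence for `function` applied to integer `operand`."""
    low = operand & 0xF
    if 0 <= operand < 16:
        return [(function << 4) | operand]            # fits in a single byte
    if operand >= 16:
        return encode(PFIX, operand >> 4) + [(function << 4) | low]
    return encode(NFIX, (~operand) >> 4) + [(function << 4) | low]


def decode(code):
    """Replay the bytes, returning a list of (function, operand) pairs."""
    oreg = 0   # operand register, built up four bits at a time
    out = []
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        if function == PFIX:
            oreg = (oreg | data) << 4
        elif function == NFIX:
            oreg = (~(oreg | data)) << 4
        else:
            out.append((function, oreg | data))
            oreg = 0   # cleared after every non-prefix instruction
    return out
```

For example, `encode(LDC, 0x754)` yields three bytes (two prefixes, then the load), while any constant below 16 costs a single byte, which is the code-density point made in the quote.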

---

Sweet32

https://github.com/Basman74/Sweet32-CPU/blob/master/Sweet32_CPU_overview_0v95.pdf

16 32-bit GPRs

Registers:

R0-R15: GPRs
PC
IRQ0VEC: user IRQ0 interrupt vector address (the user sets this to give the entry point to their interrupt handler) (write-only)
CW: CPU control word (write-only): bit 31: trace/debug enable; bit0: interrupt IRQ0 enable; other bits: reserved for future use
TRACE_RTN: trace interrupt return address (read-only)
XR: (OPTIONAL): 32x32 multiply upper 32-bit result (read-only; only accessible via the GETMX opcode)

16-bit 3-operand instruction encoding; 4-bit opcode, then 3x4-bit operand fields selecting three registers. Some opcodes (like LDD, LDW, MJMP) require an additional word or two to encode data.

Sweet32 instructions

AND, XOR, NOT
TSTSZ: skip if (Rx (logical)AND Ry) == 0. TSTSNZ: same but skip if not zero.
BITSZ Rx Ry: skip if bit Ry of Rx is zero. BITSNZ: same but skip if not zero.
SUBSLT: Skip if Ry < Rx
SWAPB (swap two bytes in the lower 16-bits of the specified register), SWAPW (swap two 16-bit words of the specified register)
INCS (add signed immediate 4-bit value to Rx and store result in Rz)
ADD, MUL, GETMX (get upper 32-bit result; must occur immediately after a MUL; optional/not on all Sweet32 processors)
SJMP (12-bit relative jump), MJMP (28-bit relative jump), LJMP (absolute jump to address in Rx)
GETPC (get PC and add Rx), SETCW (set CPU control word), GETTR (get trace return address), SETIV (set IRQ0 interrupt vector address)
LDB (load immediate 8-bit), LDW (load immediate 16-bit), LDD (load immediate 32-bit)
RETI (return from IRQ0 interrupt handler), RETT (return from trace/debug interrupt handler)
LSR (logical shift right), ASR (arithmetic shift right)
MOVW (move 16-bit data from Rx to the address in Ry), MOVD (32-bit MOV), MOVSW (like MOVW but also sign-extends to 32-bits (that is, copies the most significant bit 16 times))
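The sign extension mentioned for MOVSW (copying the most significant bit of the 16-bit value into the upper 16 bits) can be sketched like this; the function name is made up and this only models the arithmetic, not the instruction itself.

```python
def sign_extend16(value):
    """Extend a 16-bit value to 32 bits by replicating bit 15 into bits 16..31."""
    value &= 0xFFFF
    if value & 0x8000:          # sign bit set: fill the upper half with ones
        value |= 0xFFFF0000
    return value
```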

---

FML

https://thegrandmotherblogg.wordpress.com/2014/08/27/fml-a-silly-virtual-machine/

---

1802

https://www.atarimagazines.com/computeii/issue3/page52.php

" rrinker - Wednesday, October 30, 2019 - link Indeed, back in the early 8 bit days, the RCA CDP1802 was absolutely the most elegant architecture, but most people probably say "the what?". Outside of some niche applications, where the completely static nature allowed it to clock down to 0 and draw picoamps, and it being all CMOS so low power even when running, and being available in rad hardened form, it was all but invisible in the pre-IBM PC 8 bit days. Elegance doesn't equal design wins.

...

rrinker - Wednesday, October 30, 2019 - link The registers and such aren't static though, I don't think, on most modern processors, so they have to run at some minimum clock rate to be refreshed. Unused elements can be turned off today, and with turbo you have varying frequency, but the old 1802 could run from 0-4MHz (or 6, or 8, or as high as 12MHz in the last variants). It was almost RISC-like, as well, just 91 instructions, and that very well organized (for Hex, Intel just loved Octal so the 8080 instruction set is highly aligned for octal - it makes no sense in hex). Most 1802 contemporaries had a very small register count - the 1802 has 16x16 general purpose registers plus accumulator and a couple others. Of the 16, any one could be program counter, which could be switched on the fly (that's how they did subroutine calls with no CALL instruction), and any could be the index pointer for memory directed operations, also switchable on the fly, which means the 1802 has the best operand ever - SEX, for SEt indeX. It was my first computer, and later when I had to add an assembly routine to an Apple 2 BASIC program to get proper performance, I was hugely frustrated by the lack of registers in the 6502. So many people worship the 6502, I frankly hated the thing. 1802 remains my favorite 8-bitter, followed by the Z80. Best part is, that computer I built from a kit more than 40 years ago still works perfectly. " -- comments on https://www.anandtech.com/comments/15036/sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip/669994

MRISC32

https://mrisc32.bitsnbites.eu/

https://www.bitsnbites.eu/

https://github.com/mrisc32/mrisc32/blob/master/doc/Instructions.md

MyNOR

http://www.mynor.org/boards-mynor.htm

Instruction Set

" MyNOR is a CISC (complex instruction set) CPU with von-Neumann architecture. Program code and data are stored together in the same RAM. Furthermore the RAM is used to store the stack memory and also the CPU registers. Because CPU registers are stored in RAM, MyNOR is capable of dealing with up to 256 8-bit registers.

Instruction: Function
LD reg,#: Load register with immediate value
LD reg,reg: Load register with other register
LDA #: Load ACCU with immediate value
LDA reg: Load ACCU from register
STA reg: Store ACCU to register
LAP: Load ACCU through pointer
SAP: Store ACCU through pointer
ADD reg: Add register to ACCU (with carry)
AND reg: Perform AND operation on ACCU and register
DEC reg: Decrement register
INC reg: Increment register
OR reg: Perform OR operation on ACCU and register
ROL reg: Rotate register left (with carry)
ROR reg: Rotate register right (with carry)
SUB reg: Subtract register from ACCU (with carry)
XOR reg: Perform XOR operation on ACCU and register
CMP reg: Compare ACCU with register and set FLAG
CMP #: Compare ACCU with immediate value and set FLAG
TST reg: Test register for zero and set FLAG
JMP abs: Unconditional jump to absolute memory address
JNF abs: Jump to absolute memory address if FLAG = 0
JPF abs: Jump to absolute memory address if FLAG = 1
JSR abs: Call subroutine
RET: Return from subroutine
RST: Reset the CPU
IO port: Input or Output ACCU on port
PSH reg: Push register to stack
POP reg: Pull register from stack

I have optimized the instruction set a lot, so programming becomes convenient and efficient. The Cross Assembler "myca" provides some special macro instructions to make programming even more convenient:

Instruction: Function
ADD #: Add immediate value to ACCU
AND #: Perform AND operation on ACCU and immediate value
OR #: Perform OR operation on ACCU and immediate value
SUB #: Subtract immediate value from ACCU (with carry)
XOR #: Perform XOR operation on ACCU and immediate value
CLC: Clear (carry) FLAG
SEC: Set (carry) FLAG

If you are interested in a full description of the registers and the instruction set, please read the MyNOR-Instruction-Set documentation. "

---

WRAMP

https://wramp.wand.nz/insn.html

" 16 registers, each being 32 bits wide ...

$0 Hardwired zero
$1-$13 General purpose registers
$sp Stack pointer
$ra Return address register

... Each CPU instruction is a word (32 bits) in length. An instruction is encoded in one of the three formats..." I-Type, R-Type, J-Type.

All three instruction types start with a 4-bit opcode, a 4-bit destination register specifier, and a 4-bit source register specifier. J-Type then uses the last 20-bits for an address/offset. I-type and R-type both use the next 4 bits for a function specifier (sub-opcode). I-type ends with a 16-bit immediate, whereas R-type has 12 bits of constant zeros and then 4-bits of Rt, a second source register specifier:

I-Type: opcode (4 bits), Rd (4 bits), Rs (4 bits), Func (4 bits), Immediate (16 bits)
R-Type: opcode (4 bits), Rd (4 bits), Rs (4 bits), Func (4 bits), 0s (12 bits), Rt (4 bits)
J-Type: opcode (4 bits), Rd (4 bits), Rs (4 bits), Address / Offset (20 bits)
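A sketch of pulling the fields out of a 32-bit instruction word for each of the three formats. It assumes the fields are packed in the order listed, starting from the most significant bit; the notes above give widths and order but not bit positions, so that layout is an assumption, as are the function names.

```python
def common_fields(word):
    """The three 4-bit fields shared by all formats (assumed to be the top 12 bits)."""
    return {
        "opcode": (word >> 28) & 0xF,
        "rd": (word >> 24) & 0xF,
        "rs": (word >> 20) & 0xF,
    }


def decode_i(word):
    """I-Type: opcode, Rd, Rs, Func, 16-bit immediate."""
    fields = common_fields(word)
    fields["func"] = (word >> 16) & 0xF
    fields["imm"] = word & 0xFFFF
    return fields


def decode_r(word):
    """R-Type: opcode, Rd, Rs, Func, 12 zero bits, Rt."""
    fields = common_fields(word)
    fields["func"] = (word >> 16) & 0xF
    fields["rt"] = word & 0xF
    return fields


def decode_j(word):
    """J-Type: opcode, Rd, Rs, 20-bit address/offset."""
    fields = common_fields(word)
    fields["address"] = word & 0xFFFFF
    return fields
```

Each format adds up to 32 bits: 4+4+4+4+16, 4+4+4+4+12+4, and 4+4+4+20.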

Arithmetic Instructions:

add, addi (add immediate), addu (add unsigned), addui; likewise sub, mult, div, and rem, each with the same i, u, and ui variants (all are 3-operand instructions)

lhi Rd, immediate (Load High Immediate; Rd <- immed << 16), la Rd, address (Load Address)

Bitwise Instructions:

and, or, xor, sll, srl, sra, each also with an i (immediate) variant (all are 3-operand instructions)

Test Instructions:

slt (set on less than), sgt, sle, sge, seq, sne, each also with i, u, and ui variants (all are 3-operand instructions)

Jump/Branch Instructions:

j address (jump)
jr Rs (jump to register)
jal address (jump and link)
jalr Rs (Jump and Link Register)
beqz Rs, offset (Branch on equal to 0; 20-bit signed offset)
bnez Rs, offset (Branch on not equal to 0; 20-bit signed offset)

Memory Instructions:

lw Rd, offset(Rs) (Load word; Rd <- MEM[Rs+offset])
sw Rd, offset(Rs) (Store word; MEM[Rs+offset] <- Rd)

Special Instructions:

movgs Rd, Rs (Move General to Special Register)
movsg Rd, Rs (Move Special to General Register)
break (Generate Break Point Exception)
syscall (Generate Syscall Exception)
rfe (Return from Exception)

Links:

---

Plzoo comm's underlying machine

https://github.com/andrejbauer/plzoo/blob/master/src/comm/machine.ml

instructions:

NOOP (* no operation *)

SET of int (* pop from stack and store in the given location *)

GET of int (* push from given location onto stack *)

PUSH of int (* push integer constant onto stack *)

ADD (* addition *)

SUB (* subtraction *)

MUL (* multiplication *)

DIV (* division *)

MOD (* remainder *)

EQ (* equal *)

LT (* less than *)

AND (* conjunction *)

OR (* disjunction *)

NOT (* negation *)

JMP of int (* relative jump *)

JMPZ of int (* relative jump if zero on the stack *)

PRINT (* pop from stack and print *)
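A small interpreter sketch for the instruction list above, written from the comments alone and not from machine.ml, so several details are assumptions: relative jumps are taken from the following instruction, booleans are 1/0, binary operations pop the right operand first, the machine halts when the program counter runs off the end, and PRINT appends to a list. Python's floor division stands in for DIV and MOD, which may differ from the original for negative numbers.

```python
def run(program, ram_size=8):
    """Execute a list of tuples such as ("PUSH", 3) or ("ADD",); return printed values."""
    ram = [0] * ram_size
    stack, output, pc = [], [], 0
    binops = {
        "ADD": lambda a, b: a + b,
        "SUB": lambda a, b: a - b,
        "MUL": lambda a, b: a * b,
        "DIV": lambda a, b: a // b,
        "MOD": lambda a, b: a % b,
        "EQ": lambda a, b: int(a == b),
        "LT": lambda a, b: int(a < b),
        "AND": lambda a, b: int(bool(a) and bool(b)),
        "OR": lambda a, b: int(bool(a) or bool(b)),
    }
    while 0 <= pc < len(program):
        op, *args = program[pc]
        pc += 1                                  # jumps are relative to this value
        if op == "NOOP":
            pass
        elif op == "SET":
            ram[args[0]] = stack.pop()
        elif op == "GET":
            stack.append(ram[args[0]])
        elif op == "PUSH":
            stack.append(args[0])
        elif op in binops:
            right = stack.pop()
            left = stack.pop()
            stack.append(binops[op](left, right))
        elif op == "NOT":
            stack.append(int(not stack.pop()))
        elif op == "JMP":
            pc += args[0]
        elif op == "JMPZ":
            if stack.pop() == 0:
                pc += args[0]
        elif op == "PRINT":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return output


# Count down from 3, printing each value.
COUNTDOWN = [
    ("PUSH", 3), ("SET", 0),
    ("GET", 0), ("PRINT",),
    ("GET", 0), ("PUSH", 1), ("SUB",), ("SET", 0),
    ("GET", 0), ("JMPZ", 1), ("JMP", -9),
]
```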

---

Y86

https://www.cs.utexas.edu/~byoung/cs429/y86primer.pdf

---

4stack

https://bernd-paysan.de/4stack.html

"The 4stack Processor uses stack based instructions for a four way VLIW processor...Multiple stacks overcome the limited parallelism of single-stack machines. Using several stack-like organized register files instead of one huge multiported register file minimizes conflicts and greatly reduces gate count for these units."

"The 4stack processor is a research project to create a high performance VLIW (very long instruction word) microprocessor architecture without the typical disadvantages as uncompact code by using the stack paradigm for the individual operating units. " -- https://bernd-paysan.de/userman.pdf

"The proposed machine uses four stacks cached in four LIFO latch files. Each stack has its own ALU. The four stack locations from the top of stack may be accessed by any other ALU (four read ports). The next four stack locations may only be accessed by the stack’s ALU itself (one read port). The stack items below are (usually) unaddressable. " -- https://bernd-paysan.de/4stack.pdf

---

Rex

VLIW

"256 cores per chip, scratchpad memory, a 2D-mesh interconnect, and a revolutionary high bandwidth chip-to-chip interconnect achieve:"

Lisp Machine Inc. K-machine

http://fare.tunes.org/tmp/emergent/kmachine.htm

https://lobste.rs/s/lxkq2r/lisp_machine_inc_k_machine