Table of Contents for Programming Languages: a survey


A teaching language used on the web page

" For example, most processors you find will have instructions like the following:

Data movement instructions (e.g., MOV)

Arithmetic and logical instructions (e.g., ADD, SUB, AND, OR, NOT)

Comparison instructions

A set of conditional jump instructions (generally used after the compare instructions)

Input/Output instructions

Other miscellaneous instructions "

" The Y86 CPU provides 20 instructions. Seven of these instructions have two operands, eight of these instructions have a single operand, and five instructions have no operands at all. The instructions are MOV (two forms), ADD, SUB, CMP, AND, OR, NOT, JE, JNE, JB, JBE, JA, JAE, JMP, BRK, IRET, HALT, GET, and PUT. "

HALT is program termination. BRK is a temporary halt that can be resumed from. JB (jump if below) and JA (jump if above) are JLT and JGT. IRET is return from interrupt. GET and PUT are input and output.

"The Y86 processor supports the register addressing mode, the immediate addressing mode, the indirect addressing mode, the indexed addressing mode, and the direct addressing mode."

Later, they mention expansion to the NEG (arithmetic negation) instruction, and the SHL, SHR, ROL, ROR, and XOR instructions.
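
The conditional jumps in the Y86 list (JE, JNE, JB, JBE, JA, JAE) are meant to follow a CMP. A minimal sketch of that pairing, treating operands as unsigned values: the 16-bit word size, the flag names and the function names are assumptions made for illustration, not something the quoted text specifies.

```python
MASK = 0xFFFF  # assumed 16-bit word size, for illustration only


def cmp_flags(a, b):
    """Model CMP a, b: compare two unsigned values and record the outcome."""
    a &= MASK
    b &= MASK
    return {"equal": a == b, "below": a < b}


# Each conditional jump is taken when its predicate holds on the flags.
JUMP_TAKEN = {
    "JE": lambda f: f["equal"],
    "JNE": lambda f: not f["equal"],
    "JB": lambda f: f["below"],
    "JBE": lambda f: f["below"] or f["equal"],
    "JA": lambda f: not f["below"] and not f["equal"],
    "JAE": lambda f: not f["below"],
}


def taken_jumps(a, b):
    """Return the names of the jumps that would be taken after CMP a, b."""
    flags = cmp_flags(a, b)
    return sorted(name for name, pred in JUMP_TAKEN.items() if pred(flags))
```

For example, `taken_jumps(3, 5)` reports JB, JBE and JNE, since 3 is below 5 and not equal to it.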





According to Wikipedia, two distinctive and important features of the 8051 are bit-level boolean logic operations, which "helped cement the 8051's popularity in industrial control applications because it reduced code size by as much as 30%.", and "four bank selectable working register sets which greatly reduce the amount of time required to complete an interrupt service routine. With a single instruction the 8051 can switch register banks as opposed to the time consuming task of transferring the critical registers to the stack or designated RAM locations. These registers also allowed the 8051 to quickly perform a context switch."
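
A rough sketch of why bank switching is cheap: on the standard 8051 the four banks of R0-R7 occupy internal RAM addresses 0x00-0x1F, and the active bank is chosen by the two PSW bits RS1:RS0 (PSW.4 and PSW.3). The function names below are made up for illustration, and the model covers addressing only, not timing.

```python
RS0_BIT = 3  # PSW.3 (RS0); PSW.4 (RS1) is the next bit up


def register_address(psw, n):
    """Internal RAM address of register Rn under the bank selected in PSW."""
    if not 0 <= n <= 7:
        raise ValueError("register number must be 0..7")
    bank = (psw >> RS0_BIT) & 0b11
    return bank * 8 + n


def select_bank(psw, bank):
    """Return a PSW value with RS1:RS0 set to `bank`, other bits unchanged."""
    if not 0 <= bank <= 3:
        raise ValueError("bank must be 0..3")
    return (psw & ~(0b11 << RS0_BIT) & 0xFF) | (bank << RS0_BIT)
```

Switching banks changes two bits in one register, so an interrupt handler gets a fresh R0-R7 without pushing the old ones anywhere.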


Berkeley RISC II



OpenRISC has an unofficial Debian port (or1k). In addition, it is of interest because it is an open project attempting to provide a generally useful design; one might hope that its core ISA is close to a common core with few idiosyncrasies.

A list of all mandatory instructions in the OpenRISC 1200 core (as of this time the only extant implementation, I think), omitting all instructions whose mnemonic is the same as another but with 'i' appended, which I took to be immediate addressing mode variants:

    add - Add Signed
    and
    bf - Branch if Flag
    bnf - Branch if No Flag
    j - Jump (immediate)
    jal - Jump and Link (immediate)
    jalr - Jump and Link Register
    jr - Jump (register)
    lbs - Load Byte and Extend with Sign
    lbz - Load Byte and Extend with Zero
    lhs - Load Half Word and Extend with Sign
    lhz - Load Half Word and Extend with Zero
    lws - Load Single Word and Extend with Sign
    lwz - Load Single Word and Extend with Zero
    mfspr - Move From Special-Purpose Register
    movhi - Move Immediate High
    mtspr - Move To Special-Purpose Register
    nop
    or
    rfe - Return From Exception
    rori - Rotate Right with Immediate (the 6-bit immediate value specifies the number of bit positions)
    sb - Store Byte (with immediate offset)
    sfeq - Set Flag if Equal (cmp)
    sfges - Set Flag if Greater or Equal Than Signed
    sfgeu - Set Flag if Greater or Equal Than Unsigned
    sfgts - Set Flag if Greater Than Signed
    sfgtu - Set Flag if Greater Than Unsigned
    sfleu - Set Flag if Less or Equal Than Unsigned
    sflts - Set Flag if Less Than Signed
    sfltu - Set Flag if Less Than Unsigned
    sfne - Set Flag if Not Equal
    sh - Store Half Word ("The offset is sign-extended and added to the contents of general-purpose register rA. The sum represents an effective address. The low-order 16 bits of general-purpose register rB are stored to memory location addressed by EA")
    sll - Shift Left Logical (number of bit positions specified in register)
    sra - Shift Right Arithmetic (number of bit positions specified in register)
    srl - Shift Right Logical (number of bit positions specified in register)
    sub - Subtract Signed
    sw - Store Single Word
    sys - System Call
    trap - Trap ("Execution of trap instruction results in the trap exception if specified bit in SR is set. Trap exception is a request to the operating system or to the debug facility to execute certain debug services. Immediate value is used to select which SR bit is tested by trap instruction")
    xor
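
The load pairs in this list (lbs/lbz, lhs/lhz) differ only in how the loaded value fills the upper bits of the destination register. A small sketch of the two extensions, assuming a 32-bit destination register; the helper names are hypothetical.

```python
def zero_extend(value, bits):
    """Keep the low `bits` bits; upper bits of the result are zero (lbz, lhz)."""
    return value & ((1 << bits) - 1)


def sign_extend(value, bits, width=32):
    """Copy bit `bits - 1` into all upper bits of a `width`-bit result (lbs, lhs)."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits  # interpret as a negative two's-complement number
    return value & ((1 << width) - 1)
```

Loading the byte 0x80 gives 0x00000080 with zero extension and 0xFFFFFF80 with sign extension.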




    ACALL - Absolute Call
    ADD, ADDC - Add Accumulator (With Carry)
    AJMP - Absolute Jump
    ANL - Bitwise AND
    CJNE - Compare and Jump if Not Equal
    CLR - Clear Register
    CPL - Complement Register
    DA - Decimal Adjust
    DEC - Decrement Register
    DIV - Divide Accumulator by B
    DJNZ - Decrement Register and Jump if Not Zero
    INC - Increment Register
    JB - Jump if Bit Set
    JBC - Jump if Bit Set and Clear Bit
    JC - Jump if Carry Set
    JMP - Jump to Address
    JNB - Jump if Bit Not Set
    JNC - Jump if Carry Not Set
    JNZ - Jump if Accumulator Not Zero
    JZ - Jump if Accumulator Zero
    LCALL - Long Call
    LJMP - Long Jump
    MOV - Move Memory
    MOVC - Move Code Memory
    MOVX - Move Extended Memory
    MUL - Multiply Accumulator by B
    NOP - No Operation
    ORL - Bitwise OR
    POP - Pop Value From Stack
    PUSH - Push Value Onto Stack
    RET - Return From Subroutine
    RETI - Return From Interrupt
    RL - Rotate Accumulator Left
    RLC - Rotate Accumulator Left Through Carry
    RR - Rotate Accumulator Right
    RRC - Rotate Accumulator Right Through Carry
    SETB - Set Bit
    SJMP - Short Jump
    SUBB - Subtract From Accumulator With Borrow
    SWAP - Swap Accumulator Nibbles
    XCH - Exchange Bytes
    XCHD - Exchange Digits
    XRL - Bitwise Exclusive OR
    Undefined - Undefined Instruction
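
Two of the accumulator operations in the 8051 list above, SWAP and RLC, are easy to misread from their one-line descriptions. A minimal sketch on an 8-bit accumulator; the function names are made up, and no other flags are modelled.

```python
def swap(acc):
    """SWAP: exchange the high and low nibbles of the 8-bit accumulator."""
    acc &= 0xFF
    return ((acc << 4) | (acc >> 4)) & 0xFF


def rlc(acc, carry):
    """RLC: rotate left through carry. Returns (new_acc, new_carry)."""
    acc &= 0xFF
    new_carry = (acc >> 7) & 1                   # old bit 7 moves into carry
    new_acc = ((acc << 1) | (carry & 1)) & 0xFF  # old carry moves into bit 0
    return new_acc, new_carry
```

Unlike RL, which rotates the eight accumulator bits among themselves, RLC treats the carry flag as a ninth bit in the ring.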


Cell Processor SPU

Instruction categories from the SPU ISA document (SPU_ISA_v1.2_27Jan2007_pub.pdf):

3. Memory—Load/Store Instructions

Load Quadword (d-form); Load Quadword (x-form); Load Quadword (a-form); Load Quadword Instruction Relative (a-form); Store Quadword (d-form); Store Quadword (x-form); Store Quadword (a-form); Store Quadword Instruction Relative (a-form); Generate Controls for Byte Insertion (d-form); Generate Controls for Byte Insertion (x-form); Generate Controls for Halfword Insertion (d-form); Generate Controls for Halfword Insertion (x-form); Generate Controls for Word Insertion (d-form); Generate Controls for Word Insertion (x-form); Generate Controls for Doubleword Insertion (d-form); Generate Controls for Doubleword Insertion (x-form)

4. Constant-Formation Instructions

Immediate Load Halfword; Immediate Load Halfword Upper; Immediate Load Word; Immediate Load Address; Immediate Or Halfword Lower; Form Select Mask for Bytes Immediate

5. Integer and Logical Instructions

Add Halfword; Add Halfword Immediate; Add Word; Add Word Immediate; Subtract from Halfword; Subtract from Halfword Immediate; Subtract from Word; Subtract from Word Immediate; Add Extended; Carry Generate; Carry Generate Extended; Subtract from Extended; Borrow Generate; Borrow Generate Extended; Multiply; Multiply Unsigned; Multiply Immediate; Multiply Unsigned Immediate; Multiply and Add; Multiply High; Multiply and Shift Right; Multiply High High; Multiply High High and Add; Multiply High High Unsigned; Multiply High High Unsigned and Add; Count Leading Zeros; Count Ones in Bytes; Form Select Mask for Bytes; Form Select Mask for Halfwords; Form Select Mask for Words; Gather Bits from Bytes; Gather Bits from Halfwords; Gather Bits from Words; Average Bytes; Absolute Differences of Bytes; Sum Bytes into Halfwords; Extend Sign Byte to Halfword; Extend Sign Halfword to Word; Extend Sign Word to Doubleword; And; And with Complement; And Byte Immediate; And Halfword Immediate; And Word Immediate; Or; Or with Complement; Or Byte Immediate; Or Halfword Immediate; Or Word Immediate; Or Across; Exclusive Or; Exclusive Or Byte Immediate; Exclusive Or Halfword Immediate; Exclusive Or Word Immediate; Nand; Nor; Equivalent; Select Bits; Shuffle Bytes

6. Shift and Rotate Instructions

Shift Left Halfword; Shift Left Halfword Immediate; Shift Left Word; Shift Left Word Immediate; Shift Left Quadword by Bits; Shift Left Quadword by Bits Immediate; Shift Left Quadword by Bytes; Shift Left Quadword by Bytes Immediate; Shift Left Quadword by Bytes from Bit Shift Count; Rotate Halfword; Rotate Halfword Immediate; Rotate Word; Rotate Word Immediate; Rotate Quadword by Bytes; Rotate Quadword by Bytes Immediate; Rotate Quadword by Bytes from Bit Shift Count; Rotate Quadword by Bits; Rotate Quadword by Bits Immediate; Rotate and Mask Halfword; Rotate and Mask Halfword Immediate; Rotate and Mask Word; Rotate and Mask Word Immediate; Rotate and Mask Quadword by Bytes; Rotate and Mask Quadword by Bytes Immediate; Rotate and Mask Quadword Bytes from Bit Shift Count; Rotate and Mask Quadword by Bits; Rotate and Mask Quadword by Bits Immediate; Rotate and Mask Algebraic Halfword; Rotate and Mask Algebraic Halfword Immediate; Rotate and Mask Algebraic Word; Rotate and Mask Algebraic Word Immediate

7. Compare, Branch, and Halt Instructions

Halt If Equal; Halt If Equal Immediate; Halt If Greater Than; Halt If Greater Than Immediate; Halt If Logically Greater Than; Halt If Logically Greater Than Immediate; Compare Equal Byte; Compare Equal Byte Immediate; Compare Equal Halfword; Compare Equal Halfword Immediate; Compare Equal Word; Compare Equal Word Immediate; Compare Greater Than Byte; Compare Greater Than Byte Immediate; Compare Greater Than Halfword; Compare Greater Than Halfword Immediate; Compare Greater Than Word; Compare Greater Than Word Immediate; Compare Logical Greater Than Byte; Compare Logical Greater Than Byte Immediate; Compare Logical Greater Than Halfword; Compare Logical Greater Than Halfword Immediate; Compare Logical Greater Than Word; Compare Logical Greater Than Word Immediate; Branch Relative; Branch Absolute; Branch Relative and Set Link; Branch Absolute and Set Link; Branch Indirect; Interrupt Return; Branch Indirect and Set Link if External Data; Branch Indirect and Set Link; Branch If Not Zero Word; Branch If Zero Word; Branch If Not Zero Halfword; Branch If Zero Halfword; Branch Indirect If Zero; Branch Indirect If Not Zero; Branch Indirect If Zero Halfword; Branch Indirect If Not Zero Halfword

8. Hint-for-Branch Instructions

Hint for Branch (r-form); Hint for Branch (a-form); Hint for Branch Relative

9. Floating-Point Instructions

    9.1 Single Precision (Extended-Range Mode)
    9.2 Double Precision
    9.2.1 Conversions Between Single-Precision and Double-Precision Format
    9.2.2 Exception Conditions
    9.3 Floating-Point Status and Control Register

Floating Add; Double Floating Add; Floating Subtract; Double Floating Subtract; Floating Multiply; Double Floating Multiply; Floating Multiply and Add; Double Floating Multiply and Add; Floating Negative Multiply and Subtract; Double Floating Negative Multiply and Subtract; Floating Multiply and Subtract; Double Floating Multiply and Subtract; Double Floating Negative Multiply and Add; Floating Reciprocal Estimate; Floating Reciprocal Absolute Square Root Estimate; Floating Interpolate; Convert Signed Integer to Floating; Convert Floating to Signed Integer; Convert Unsigned Integer to Floating; Convert Floating to Unsigned Integer; Floating Round Double to Single; Floating Extend Single to Double; Double Floating Compare Equal; Double Floating Compare Magnitude Equal; Double Floating Compare Greater Than; Double Floating Compare Magnitude Greater Than; Double Floating Test Special Value; Floating Compare Equal; Floating Compare Magnitude Equal; Floating Compare Greater Than; Floating Compare Magnitude Greater Than; Floating-Point Status and Control Register Write; Floating-Point Status and Control Register Read

10. Control Instructions

Stop and Signal; Stop and Signal with Dependencies; No Operation (Load); No Operation (Execute); Synchronize; Synchronize Data; Move from Special-Purpose Register; Move to Special-Purpose Register

11. Channel Instructions

Read Channel; Read Channel Count; Write Channel


Cypress PSoC MCU

Different versions have different MCU cores: PSoC 3 has an 8051, PSoC 4 has an ARM Cortex-M0, and PSoC 5 has an ARM Cortex-M3.

"The main problem for me is trying to find microcontrollers which have the peripheral set I want. This is very difficult as our requirements don't seem to be mainstream. We want things like 5 PWM channels, 5 Quadrature decoders, 2 non-standard SPI ports and a UART with negated IO....Also included on the chip are re-configurable digital and analogue blocks which can be made into a wide range of peripherals: ADCs, filters, op-amps, DACs, SPI, UART, quadrature decoder, CRC generator, etc...The real benefit is that you can stick with one chip, knowing that it can tackle a great many of the projects you'll want to do in the future." --


XMOS Xcore

Multicore MCU; up to 8 MCUs.


XMOS Xcore Instructions

The following is mostly quoted or paraphrased from .

XMOS Xcore Instructions: Data Access

XMOS Xcore Instructions: Expression Evaluation

XMOS Xcore Instructions: Branching, Jumping and Calling



jumping with link register: BLRF (branch and link relative forward), BLRB (branch and link relative backward), BLACP (branch and link absolute via CP), BLAT (branch and link absolute via table), BLA (branch and link absolute via register)

stack manipulation intended for calling, and returning:

XMOS Xcore Instructions: Resources and the Thread Scheduler

Each xCORE Tile manages a number of different types of resource. These include threads, synchronisers, channel ends, timers and locks. For each type of resource a set of available items is maintained. The names of these sets are used to identify the type of resource to be allocated by the GETR (get resource) instruction. When the resource is no longer needed, it can be released for subsequent use by a FREER (free resource) instruction. Some resources have associated control modes which are set using the SETC instruction.
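
The allocate/release cycle described here can be pictured as a pool per resource type. The sketch below is a toy model only: the class, the tuple used as a resource identifier and the `INVALID_ID` placeholder are assumptions for illustration, not the real encoding.

```python
INVALID_ID = None  # placeholder for "the invalid resource ID"; real encoding not modelled


class ResourcePool:
    """Set of available items for one resource type (e.g. TIMER, LOCK)."""

    def __init__(self, kind, count):
        self.kind = kind
        self.free = list(range(count))
        self.in_use = set()

    def getr(self):
        """GETR: allocate an item, or return INVALID_ID if none is available."""
        if not self.free:
            return INVALID_ID
        item = self.free.pop(0)
        self.in_use.add(item)
        return (self.kind, item)

    def freer(self, resource):
        """FREER: release an item so that a later GETR can reuse it."""
        kind, item = resource
        if kind != self.kind or item not in self.in_use:
            raise ValueError("resource is not allocated from this pool")
        self.in_use.remove(item)
        self.free.append(item)
```

With a pool of two timers, a third `getr()` returns the invalid ID until one of the first two is released with `freer()`.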

Many of the mode settings are defined only for a specific kind of resource and are described in the appropriate section; the ones which are used for several different kinds of resource are:

Execution of instructions from each thread is managed by the thread scheduler. This maintains a set of runnable threads... from which it takes instructions in turn. When a thread is unable to continue, it is paused by removing it from the run set. The reason for this may be any of the following:

The thread scheduler manages the threads, thread synchronisation and timing (using the synchronisers and timers). It is directly coupled to resources such as the ports and channels so as to minimise the delay when a thread becomes runnable as a result of a communication or input-output.

XMOS Xcore Instructions: Concurrency and Thread Synchronisation

A thread can initiate execution on one or more newly allocated threads, and can subsequently synchronise with them to exchange data or to ensure that all threads have completed before continuing. Thread synchronisation is performed using hardware synchronisers, and threads using a synchroniser will move between running states and paused states. When a thread is first created, its status register is initialised as follows:

The access registers of the newly created thread can be initialised using the TINITPC, TINITSP, TINITDP, TINITCP and TINITLR instructions.

These instructions can only be used when the thread is paused. The TINITLR instruction is intended primarily to support debugging. On thread initialisation, the PC must be initialised. DP, SP, and CP will retain their value on freeing and allocating threads, so they may not have to be reinitialised. Data can be transferred between the operand registers of two threads using TSETR and TSETMR instructions, which can be used even when the destination thread is running.

To start a synchronised slave thread a master must first acquire a synchroniser. This is done using a GETR SYNC instruction. If there is a synchroniser available its resource ID is returned, otherwise the invalid resource ID is returned. The GETST instruction is then used to get a synchronised thread. It is passed the synchroniser ID and if there is a free thread it will be allocated, attached to the synchroniser and its ID returned, otherwise the invalid resource ID is returned. The master thread can repeat this process to create a group of threads which will all synchronise together. To start the slave threads the master executes an MSYNC instruction using the synchroniser ID.

The group of threads can synchronise at any point by the slaves executing the SSYNC and the master the MSYNC. Once all the threads have synchronised they are unpaused and continue executing from the next instruction. The processor maintains a set of paused master threads 'mpaused' and a set of paused slave threads 'spaused'...

Each synchroniser also maintains a record...of whether its master has reached a synchronisation point.

To terminate all of the slaves and allow the master to continue the master executes an MJOIN instruction instead of an MSYNC. When this happens, the slave threads are all freed and the master continues.
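
The master/slave synchronisation above can be sketched as a small state machine. This is a toy model, not the hardware: thread IDs are plain strings, the field names are made up, and each method returns the list of threads that become runnable as a result of the call.

```python
class Synchroniser:
    """Toy model of one hardware synchroniser."""

    def __init__(self, master):
        self.master = master
        self.slaves = set()    # threads attached with GETST
        self.spaused = set()   # slaves currently paused at a sync point
        self.msyn = False      # has the master reached the sync point?
        self.joining = False   # did the master use MJOIN rather than MSYNC?

    def getst(self, tid):
        # A newly allocated slave stays paused until the master's first MSYNC.
        self.slaves.add(tid)
        self.spaused.add(tid)

    def _try_release(self):
        if not (self.msyn and self.spaused == self.slaves):
            return []                    # someone has not arrived yet
        released = [self.master]
        if self.joining:
            self.slaves.clear()          # MJOIN frees every slave
        else:
            released += sorted(self.slaves)
        self.spaused.clear()
        self.msyn = self.joining = False
        return released

    def msync(self):
        self.msyn = True
        return self._try_release()

    def mjoin(self):
        self.msyn = self.joining = True
        return self._try_release()

    def ssync(self, tid):
        self.spaused.add(tid)
        return self._try_release()
```

The first `msync()` after the `getst()` calls starts the slaves; later rounds release everyone only once the master and every slave have arrived, and `mjoin()` releases the master alone.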

A master thread can also create threads which can terminate themselves. This is done by the master executing a GETR THREAD instruction. This instruction returns either a thread ID if there is a free thread or the invalid resource ID. The unsynchronised thread can be initialised in the same way as a synchronised thread using the TINITPC, TINITSP, TINITDP, TINITCP, TINITLR and TSETR instructions. The unsynchronised thread is then started by the master executing a TSTART instruction specifying the thread ID. Once the thread has completed its task it can terminate itself with the FREET instruction.

XMOS Xcore instructions: Communication

Communication between threads is performed using channels, which provide full-duplex data transfer between channel ends, whether the ends are both in the same xCORE Tile, in different xCORE Tiles on the same chip or in xCORE Tiles on different chips. Channels carry messages constructed from data and control tokens between the two channel ends. The control tokens are used to encode communication protocols. Although most control tokens are available for software use, a number are reserved for encoding the protocol used by the interconnect hardware, and cannot be sent and received using instructions.

A channel end can be used to generate events and interrupts when data becomes available as described below. This allows a thread to monitor several channels, ports or timers, only servicing those that are ready.

To communicate between two threads, two channel ends need to be allocated, one for each thread. This is done using the GETR c, CHANEND instruction. Each channel end has a destination register which holds the identifier of the destination channel end; this is initialised with the SETD instruction. It is also possible to use the identifier of a channel end to determine its destination channel end.

The identifier of the channel end c1 is used to initialise the channel end for thread c2, and vice versa. Each thread can then use the identifier of its own channel end to transfer data and messages using output and input instructions. The interconnect can be partitioned into several independent networks. This makes it possible, for example, to allocate channels carrying short control messages to one network whilst allocating channels carrying long data messages to another. There are instructions to allocate a channel to a network and to determine which network a channel is using.

(my note: Writing to a channel is called 'outputting to the channel', and reading from it is called inputting from it.)

The channel connection is established when the first output is executed. If the destination channel end is on another xCORE Tile, this will cause the destination identifier to be sent through the interconnect, establishing a route for the subsequent data and control tokens. The connection is terminated when an END control token is sent. If a subsequent output is executed using the same channel end, the destination identifier will be used again to establish a new route which will again persist until another END control token is sent. A destination channel end can be shared by any number of outputting threads; they are served in a round-robin manner. Once a connection has been established it will persist until an END is received; any other thread attempting to establish a connection will be queued. In the case of a shared channel end, the outputting thread will usually transmit the identifier of its channel end so that the inputting thread can use it to reply.

The OUT and IN instructions are used to transmit words of data through the channel; to transmit bytes of data the OUTT and INT instructions are used. Control tokens are sent using OUTCT or OUTCTI and received using INCT. To support efficient runtime checks that the type, length or structure of output data matches that expected by the inputer, CHKCT and CHKCTI instructions are provided. The CHKCT instruction inputs and discards a token provided that the input token matches its operand; otherwise it traps. The normal IN and INT instructions trap if they encounter a control token. To input a control token INCT is used; this traps if it encounters a data token. The END control token is one of the 12 tokens which can be sent using OUTCTI and checked using CHKCTI. By following each message output with an OUTCTI c, END and each input with a CHKCTI c, END it is possible to check that the size of the message is the same as the size of the message expected by the inputting thread.

To perform synchronised communication, the output message should be followed with (OUTCTI c, END; CHKCTI c, END) and the input with (CHKCTI c, END; OUTCTI c, END). Another control token is PAUSE. Like END, this causes the route through the interconnect to be disconnected. However the PAUSE token is not delivered to the receiving thread. It is used by the outputting thread to break up long messages or streams, allowing the interconnect to be shared efficiently. The remaining control tokens are used for runtime checking and for signalling the type of message being received; they have no effect on the interconnect. Note that in addition to END and PAUSE, ten of these can be efficiently handled using OUTCTI and CHKCTI.

A control token takes up a single byte of storage in the channel. On the receiving end the software can test whether the next token is a control token using the TESTCT instruction, which waits until at least one token is available. It is also possible to test whether the next word contains a control token using the TESTWCT instruction. This waits until a whole word of data tokens has been received (in which case it returns 0) or until a control token has been received (in which case it returns the byte position after the position of the byte containing the control token).

Channel ends have a buffer able to hold sufficient tokens to allow at least one word to be buffered. If an output instruction is executed when the channel is too full to take the data then the thread which executed the instruction is paused. It is restarted when there is enough room in the channel for the instruction to successfully complete. Likewise, when an input instruction is executed and there is not enough data available then the thread is paused and will be restarted when enough data becomes available.

Note that when sending long messages to a shared channel, the sender should send a short request and then wait for a reply before proceeding as this will minimise interconnect congestion caused by delays in accepting the message. When a channel end c is no longer required, it can be freed using a FREER c instruction. Otherwise it can be used for another message.

It is sometimes necessary to determine the identifier of the destination channel end c2 stored in channel end c1. For example, this enables a thread to transmit the identifier of a destination channel end it has been using to a thread on another processor. This can be done using the GETD instruction.
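
The message-length check built from an END token (output followed by OUTCTI END, input followed by CHKCTI END) can be sketched as follows. This is a toy model under stated assumptions: tokens are (kind, value) pairs, delivery is a direct append to the receiver's buffer, routing and thread pausing are not modelled, and every class and function name is hypothetical.

```python
from collections import deque

END = "END"  # stand-in for the END control token; real encodings are not modelled


class Trap(Exception):
    """Raised where the text says an instruction traps."""


class ChannelEnd:
    """Receiving side of a channel, holding tokens as (kind, value) pairs."""

    def __init__(self):
        self.tokens = deque()

    def _peek(self, wanted_kind, instr):
        if not self.tokens:
            # Real hardware would pause the thread here; pausing is not modelled.
            raise RuntimeError(f"{instr} would pause: no token available")
        kind, value = self.tokens[0]
        if kind != wanted_kind:
            raise Trap(f"{instr} met a {kind} token")
        return value

    def in_byte(self):
        """INT: input one data token; traps on a control token."""
        value = self._peek("data", "INT")
        self.tokens.popleft()
        return value

    def chkct(self, expected):
        """CHKCT: discard the next token if it is the expected control token."""
        value = self._peek("control", "CHKCT")
        if value != expected:
            raise Trap(f"CHKCT expected {expected}, got {value}")
        self.tokens.popleft()


def send_message(dest, payload):
    """OUTT each byte, then OUTCT END."""
    for byte in payload:
        dest.tokens.append(("data", byte & 0xFF))
    dest.tokens.append(("control", END))


def receive_message(chan, length):
    """Input `length` bytes, then check that the sender ended the message."""
    data = [chan.in_byte() for _ in range(length)]
    chan.chkct(END)
    return data
```

If the sender's message is shorter than the receiver expects, an INT meets the END token and traps; if it is longer, the CHKCT meets a data token and traps. Either way the size mismatch is caught at run time.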

XMOS Xcore instructions: Locks

Mutual exclusion between a number of threads can be performed using locks. A lock is allocated using a GETR l, LOCK instruction. The lock is initially free. It can be claimed using an IN instruction and freed using an OUT instruction. When a thread executes an IN on a lock which is already claimed, it is paused and placed in a queue waiting for the lock. Whenever a lock is freed by an OUT instruction and the lock’s queue is not empty, the next thread in the queue is unpaused; it will then succeed in claiming the lock. When inputting from a lock, the IN instruction always returns the lock identifier, so the same register can be used as both source and destination operand. When outputting to a lock, the data operand of the OUT instruction is ignored. When the lock is no longer needed, it can be freed using a FREER l instruction.
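
A minimal sketch of the lock behaviour just described, with the queue made explicit. The class and method names are made up, thread IDs are plain strings, and "paused" is represented simply by `in_` returning `None`.

```python
from collections import deque


class Lock:
    """Toy model of a hardware lock with a queue of waiting threads."""

    def __init__(self, lock_id):
        self.lock_id = lock_id
        self.owner = None        # None means the lock is free
        self.waiting = deque()   # threads paused on this lock, in arrival order

    def in_(self, tid):
        """IN: claim the lock and return its identifier, or queue the thread.

        Returns None when the thread has to pause.
        """
        if self.owner is None:
            self.owner = tid
            return self.lock_id
        self.waiting.append(tid)
        return None

    def out(self, data=None):
        """OUT: free the lock; the data operand is ignored.

        Returns the thread that is unpaused and now holds the lock, if any.
        """
        if self.waiting:
            self.owner = self.waiting.popleft()
            return self.owner
        self.owner = None
        return None
```

Handing the lock straight to the head of the queue on OUT is a simplification of "the next thread in the queue is unpaused; it will then succeed in claiming the lock".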

XMOS Xcore instructions: Timers and clocks

Each xCORE Tile executes instructions at a speed determined by its own clock input. In addition, it provides a reference clock output which ticks at a standard frequency of 100MHz. A set of programmable timers is provided and all of these can be used by threads to provide timed program execution relative to the reference clock.

The processor has a set of timers that can be used to wait for a time. The current time can be input from any timer, or it can be obtained by using GETTIME:

Each timer can be used by a thread to read its current time or to wait until a specified time. A timer is allocated using the GETR t, TIMER instruction. It can be configured using the SETC instruction; the only two modes which can be set are UNCOND (timer always ready; inputs complete immediately) and AFTER (timer ready when its current time is after its DATA value). In unconditional mode, an IN instruction reads the current value of the timer. In AFTER mode, the IN instruction waits until the value of its current time is after (later than) the value in its DATA register. The value can be set using a SETD instruction.

Timers can also be used to generate events as described below.

A set of programmable clocks is also provided and each can be used to produce a clock output to control the action of one or more ports and their associated port timers.

(my note: each clock can use a one-bit port as its clock source, again using SETCLK.)

Alternatively, a clock may use the reference clock as its clock source (by 'SETCLK p, REF'). In either case the clock can be configured to divide the frequency using an 8-bit divider. When this is set to 0, the clock passes directly to the output. The falling edge of the clock is used to perform the division. Hence a setting of 1 will result in an output from the clock which changes each falling edge of the input, halving the input frequency f; and a setting of n will produce an output frequency of f/(2n). The division factor is set using the SETD instruction. The lowest eight bits of the operand are used and the rest ignored.
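
The divider arithmetic can be written out directly. The sketch reads the output frequency for a setting of n as f divided by 2n, which agrees with the statement that a setting of 1 halves the input frequency; the function name is made up.

```python
def divided_frequency(f_in_hz, divider_operand):
    """Output frequency of a clock for a given SETD divider operand."""
    n = divider_operand & 0xFF   # only the lowest eight bits are used
    if n == 0:
        return f_in_hz           # 0: the clock passes directly to the output
    return f_in_hz / (2 * n)
```

With the 100MHz reference clock as input, a setting of 5 gives 10MHz, and the largest setting, 255, gives roughly 196kHz.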

To ensure that the timers in the ports which are attached to the same clock all record the same time, the clock should be started using a 'SETC c, START' instruction after the ports have all been attached to the clock. All of the clocks are initially stopped and a clock can be stopped by a 'SETC c, STOP' instruction. The data output on the pins of an output port changes state synchronously with the port clock. If several output ports are driven from the same clock, they will appear to operate as a single output port, provided that the processor is able to supply new data to all of them during each clock cycle. Similarly, the data input by an input port from the port pins is sampled synchronously with the port clock. If several input ports are driven from the same clock they will appear to operate as a single input port provided that the processor is able to take the data from all of them during each clock cycle. The use of clocked ports therefore decouples the internal timing of input and output program execution from the operation of synchronous input and output interfaces.

XMOS Xcore instructions: Ports, Input and Output

Ports are interfaces to physical pins. A port can be used for input or output. It can use the reference clock as its port clock or it can use one of the programmable clocks. Transfers to and from the pins can be synchronised with the execution of input and output instructions, or the port can be configured to buffer the transfers and to convert automatically between serial and parallel form. Ports can also be timed to provide precise timing of values appearing on output pins or taken from input pins. When inputting, a condition can be used to delay the input until the data in the port meets the condition. When the condition is met the captured data is time stamped with the time at which it was captured. The port clock input is initially the reference clock. It can be changed using the SETCLK instruction with a clock ID as the clock operand. This port clock drives the port timer and can also be used to determine when data is taken from or presented to the pins. A port can be used to generate events and interrupts when input data becomes available as described below. This allows a thread to monitor several ports, channels or timers, only servicing those that are ready. ... Each port has a transfer register. The input and output instructions used for channels, IN and OUT, can also be used to transfer data to and from a port transfer register. The IN instruction zero-extends the contents of a port transfer register and transfers the result to an operand register. The OUT instruction transfers the least significant bits from an operand register to a port transfer register ...

The port configuration is done using the SETC instruction which is used to define several independent settings of the port.

There are further instructions for shifting bits and partial words to and from the port and precisely controlling timing:

XMOS Xcore instructions: Events

Events and interrupts allow timers, ports and channel ends to automatically transfer control to a pre-defined event handler. The resources generate events by default and must be reconfigured using a SETC instruction in order to generate interrupts. The ability of a thread to accept events or interrupts is controlled by information held in the thread status register (sr), and may be explicitly controlled using SETSR and CLRSR:

The operand of these instructions should be one (or more) of

A thread normally enables one or more events and then waits for one of them to occur. Hence, on an event all the thread’s state is valid, allowing the thread to respond rapidly to the event. The thread can perform input and output operations using the port, channel or timer which gave rise to an event whilst leaving some or all of the event information unchanged. This allows the thread to complete handling an event and immediately wait for another similar event. Timers, ports and channel ends all support events, the only difference being the ready conditions used to trigger the event. The program location of the event handler must be set prior to enabling the event using the SETV instruction. The SETEV instruction can be used to set an environment for the event handler; this will often be a stack address containing data used by the handler. Timers and ports have conditions which determine when they will generate an event; these are set using the SETC and SETD instructions. Channel ends are considered ready as soon as they contain enough data. Event generation by a specific port, timer or channel can be enabled using an event enable unconditional (EEU) instruction and disabled using an event disable unconditional (EDU) instruction. The event enable true (EET) instruction enables the event if its condition operand is true and disables it otherwise; conversely the event enable false (EEF) instruction enables the event if its condition operand is false, and disables it otherwise. These instructions are used to optimise the implementation of guarded inputs.

Having enabled events on one or more resources, a thread can use a WAITEU, WAITET or WAITEF instruction to wait for at least one event. The WAITEU instruction waits unconditionally; the WAITET instruction waits only if its condition operand is true, and the WAITEF waits only if its condition operand is false.

This may result in an event taking place immediately with control being transferred to the event handler specified by the corresponding event vector with events disabled by clearing the thread’s eeble flag. Alternatively the thread may be paused until an event takes place with the eeble flag enabled; in this case the eeble flag will be cleared when the event takes place, and the thread resumes execution.

Note that the environment vector is transferred to the event data register, from where it can be accessed by the GETED instruction. This allows it to be used to access data associated with the event, or simply to enable several events to share the same event vector.

To optimise the responsiveness of a thread to high priority resources the SETSR EEBLE instruction can be used to enable events before starting to enable the ports, channels and timers. This may cause an event to be handled immediately, or as soon as it is enabled. An enabling sequence of this kind can be followed either by a WAITEU instruction to wait for one of the events, or it can simply be followed by a CLRSR EEBLE to continue execution when no event takes place. The WAITET and WAITEF instructions can also be used in conjunction with a CLRSR EEBLE to conditionally wait or continue depending on a guarding condition. The WAITET and WAITEF instructions can also be used to optimise the common case of repeatedly handling events from multiple sources until a terminating condition occurs.

All of the events which have been enabled by a thread can be disabled using a single CLRE instruction. This disables event generation in all of the ports, channels or timers which have had events enabled by the thread. The CLRE instruction also clears the thread’s eeble flag.

Interrupts: In contrast to events, interrupts can occur at any point during program execution, and so the current pc and sr (and potentially also some or all of the other registers) must be saved prior to execution of the interrupt handler. Interrupts are taken between instructions, which means that in an interrupt handler the previous instruction will have been completed, and the next instruction is yet to be executed on return from the interrupt. This is done using the spc and ssr registers. Any interrupt and exception causes the pc and sr registers to be saved into spc and ssr , and the status register to be modified to indicate that the processor is running in kernel mode. When the handler has completed, execution of the interrupted thread can be performed by a KRET instruction...

XMOS Xcore instructions: Exceptions

Exceptions which occur when an error is detected during instruction execution are treated in the same way as interrupts except that they transfer control to a location defined relative to the thread’s kernel entry point kep register.

Exception types:

A program can force an exception as a result of a software detected error condition using ECALLT, ECALLF, or ELATE. These have the same effect as hardware detected exceptions, transferring control to the same location and indicating that an error has occurred in the exception type (et) register:

A program can explicitly cause entry to a handler using one of the kernel call instructions. These have a similar effect to exceptions, except that they transfer control to a location defined relative to the thread’s kep register (((my note: i don't understand the difference between the ECALLs and KCALL))):

The spc , ssr , et and sed registers can be saved and restored directly to the stack:

In addition, the et and ed registers can be transferred directly to a register:

A handler can use the KENTSP instruction to save the current stack pointer into word 0 of the thread’s kernel stack (using the kernel stack pointer ksp) and change stack pointer to point at the base of the thread’s kernel stack. KRESTSP can then be used to restore the stack pointer on exit from the handler.

The kep can be initialised using the SETKEP instruction; the ksp can be read using the GETKSP instruction:

The kernel stack pointer is initialised by the boot-ROM to point to a safe location near the last location of RAM - the last few locations are used by the JTAG debugging interface. ksp can be modified by using a sequence of SETSP followed by KRESTSP.

XMOS Xcore instructions: Initialisation and Debugging

The state of the processor includes additional registers to those used for the threads:

All of the processor state can be accessed using the GETPS and SETPS instructions:

To access the state of a thread, first SETPS is used to set dtid and dtreg to the thread identifier and register number within the thread state. The contents of the register can then be accessed by:

The debugging state is entered by executing a DCALL instruction, by an instruction that triggers a watchpoint or a breakpoint, or by an external asynchronous DEBUG event (for example caused by asserting a DEBUG pin). During debug, thread 0 executes the debug handler; all other threads are frozen. The debugging state is exited on DRET, which causes thread 0 to resume at its saved PC, and all other threads to start where they were stopped. Entry to a debug handler operates in a manner similar to an interrupt.

On entering debug mode the DI bit is saved in the dspc register, and it is cleared.

Watchpoints and instruction breakpoints are supported by means of SETPS and GETPS instructions. An instruction breakpoint is an address that triggers a DCALL on a PC being equal to the value in the instruction breakpoint. A data watchpoint is a pair of addresses l and h , and a condition that triggers a DCALL on stores and/or loads to specific memory addresses.

XMOS Xcore instructions: Specialised Instructions

Other CPU and MCU/MPU links

Note: i think an MCU has onboard memory whereas an MPU uses external memory. See . In this document i've just called everything MCU, todo find out which ones are really MPUs and update this.


unofficial Debian port

DEC Alpha

The DEC Alpha has an unofficial Debian port (it used to have an official Debian port, though). The DEC Alpha is notable for having one of the most relaxed memory models out of relatively popular Linux-capable architectures.

"The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that, some versions of the Alpha CPU have a split data cache, permitting them to have two semantically-related cache lines updated at separate times. This is where the data dependency barrier really becomes necessary as this synchronises both caches with the memory coherence system, thus making it seem like pointer changes vs new data occur in the right order.

The Alpha defines the Linux kernel's memory barrier model." --

"It may seem strange to say much of anything about a CPU whose end of life has been announced, but Alpha is interesting because, with the weakest memory ordering model, it reorders memory operations the most aggressively. It therefore has defined the Linux-kernel memory-ordering primitives, which must work on all CPUs, including Alpha. Understanding Alpha is therefore surprisingly important to the Linux kernel hacker." --


This has an unofficial Debian port.

A 32-bit RISC microprocessor of Renesas Technology.


This has an unofficial Debian port.

IBM S/390 and System z machines.

See z/Architecture.


This has an unofficial Debian port (it used to have an official Debian port, though).


This has an official Debian port.

"Once touted by Intel as a replacement for the x86 product line," --

"As of 2008, Itanium was the fourth-most deployed microprocessor architecture for enterprise-class systems, behind x86-64, Power Architecture, and SPARC." --


Lisp Machines



todo composable processor virtualization for embedded systems Composable Virtual Memory for an Embedded SoC? Mechanisms for hardware virtualization in multicore architectures Capsules: Expressing Composable Computations in a Parallel Programming Model

types of memory barriers

load/load, load/store, store/load, store/store

"Sparc V8 has a “membar” instruction that takes a 4-element bit vector. The four categories of barrier can be specified individually" --

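The four barrier categories above can be modelled as a 4-bit mask, in the spirit of the "membar" bit vector mentioned in the quote. This is only a sketch: the `Barrier` class, the `describe` helper and the bit positions are illustrative assumptions, not the encoding of any real instruction.

```python
from enum import IntFlag


class Barrier(IntFlag):
    """One bit per ordering category (bit positions are hypothetical)."""
    LOAD_LOAD = 1      # earlier loads ordered before later loads
    LOAD_STORE = 2     # earlier loads ordered before later stores
    STORE_LOAD = 4     # earlier stores ordered before later loads
    STORE_STORE = 8    # earlier stores ordered before later stores


def describe(mask):
    """List the names of the categories selected in a mask."""
    return [b.name for b in Barrier if b in mask]


# The categories can be requested individually or combined with "|".
FULL = (Barrier.LOAD_LOAD | Barrier.LOAD_STORE
        | Barrier.STORE_LOAD | Barrier.STORE_STORE)

print(describe(Barrier.STORE_LOAD | Barrier.STORE_STORE))
print(bin(FULL))
```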

"M1 is a ``toy machine used to teach undergraduates about the ACL2 formalization of the Java Virtual Machine. M1 is a von Neumann style stack machine. The state consists of four components, a program counter, an array of local variable values (akin to registers), a stack, and an execute-only program. The machine provides eight instructions for doing addition and multiplication on the stack, moving items from the locals to the stack and back, an unconditional jump and a conditional jump that tests the top of the stack against 0. M1 provides unbounded integers. Because of this, M1 is Turing equivalent." --

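A machine of the kind described in the M1 quote (program counter, locals, stack, execute-only program) can be sketched in a few lines. The mnemonics below are borrowed from JVM naming, and the relative jump offsets and "halt when the pc leaves the program" rule are assumptions made for this sketch; the quote does not spell them out. Python integers are unbounded, which matches the quote's remark about unbounded integers.

```python
def run_m1(program, local_vars, max_steps=10_000):
    """Run until the pc leaves the program; return the final locals."""
    pc, stack, local_vars = 0, [], list(local_vars)
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            return local_vars
        op, *args = program[pc]
        pc += 1
        if op == "ICONST":
            stack.append(args[0])
        elif op == "ILOAD":
            stack.append(local_vars[args[0]])
        elif op == "ISTORE":
            local_vars[args[0]] = stack.pop()
        elif op in ("IADD", "ISUB", "IMUL"):
            b, a = stack.pop(), stack.pop()
            stack.append({"IADD": a + b, "ISUB": a - b, "IMUL": a * b}[op])
        elif op == "GOTO":
            pc += args[0] - 1          # offset is relative to the jump itself
        elif op == "IFEQ":
            if stack.pop() == 0:       # tests the top of the stack against 0
                pc += args[0] - 1
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit reached")


# Factorial: local 0 holds n, local 1 accumulates the product.
FACTORIAL = [
    ("ICONST", 1), ("ISTORE", 1),                            # acc = 1
    ("ILOAD", 0), ("IFEQ", 10),                              # if n == 0: exit
    ("ILOAD", 1), ("ILOAD", 0), ("IMUL",), ("ISTORE", 1),    # acc = acc * n
    ("ILOAD", 0), ("ICONST", 1), ("ISUB",), ("ISTORE", 0),   # n = n - 1
    ("GOTO", -10),                                           # back to the test
]

print(run_m1(FACTORIAL, [5, 0]))
```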

Tadasv VMS

A toy virtual machine

registers = ["eax", "ebx", "ecx", "edx", "esp", "ebp", "esi", "edi"]

instructions = {
    "push" : [{"opcode" : 0x00, "format" : "<cc", "params" : ["reg"]}],
    "pop" : [{"opcode" : 0x01, "format" : "<cc", "params" : ["reg"]}],
    "mov" : [{"opcode" : 0x02, "format" : "<ccI", "params" : ["reg", "imm"]},
             {"opcode" : 0x03, "format" : "<ccc", "params" : ["reg", "reg"]},
             {"opcode" : 0x04, "format" : "<ccc", "params" : ["reg", "@reg"]},
             {"opcode" : 0x05, "format" : "<ccc", "params" : ["@reg", "reg"]},
             {"opcode" : 0x06, "format" : "<ccI", "params" : ["reg", "ref"]}],
    "inc" : [{"opcode" : 0x07, "format" : "<cc", "params" : ["reg"]}],
    "dec" : [{"opcode" : 0x08, "format" : "<cc", "params" : ["reg"]}],
    "add" : [{"opcode" : 0x09, "format" : "<ccc", "params" : ["reg", "reg"]}],
    "jmp" : [{"opcode" : 0x0A, "format" : "<cI", "params" : ["ref"]}],
    "jz" : [{"opcode" : 0x0B, "format" : "<ccI", "params" : ["reg", "ref"]}],
    "jnz" : [{"opcode" : 0x0C, "format" : "<ccI", "params" : ["reg", "ref"]}],
    "mul" : [{"opcode" : 0x0D, "format" : "<ccc", "params" : ["reg", "reg"]}],
    "halt" : [{"opcode" : 0xFF, "format" : "<c", "params" : []}],
    "emit" : [{"opcode" : None, "format" : "<s", "params" : ["str"]}],
}
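The "format" strings in the table look like Python `struct` format strings (little-endian, one byte per `c`, a 4-byte unsigned integer per `I`). The sketch below shows how a few instructions might be encoded under that reading. Treating a register operand as a single byte holding its index in the `registers` list is an assumption, and the helpers `op` and `reg` are hypothetical names introduced here.

```python
import struct

REGISTERS = ["eax", "ebx", "ecx", "edx", "esp", "ebp", "esi", "edi"]


def op(code):
    """An opcode as the single byte that the 'c' format expects."""
    return bytes([code])


def reg(name):
    """A register as one byte holding its index (assumed encoding)."""
    return bytes([REGISTERS.index(name)])


# "mov reg, imm": opcode 0x02, format "<ccI"
mov_eax_imm = struct.pack("<ccI", op(0x02), reg("eax"), 0x1234)
# "push reg": opcode 0x00, format "<cc"
push_ebx = struct.pack("<cc", op(0x00), reg("ebx"))
# "halt": opcode 0xFF, format "<c"
halt = struct.pack("<c", op(0xFF))

print(mov_eax_imm.hex(), push_ebx.hex(), halt.hex())
```

With the `<` prefix `struct` adds no padding, so the instruction lengths are simply 6, 2 and 1 bytes.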

Robot Odyssey chip file format


" Instruction descriptions and opcode assignments:

  NOP,  {0}     ( x y z -- x y z)               no operation
  lit,  {1}     ( x y z -- y z data) PC++       push data at PC, increment PC
  @,    {2}     ( x y addr -- x y data)         fetch data from addr
  !,    {3}     ( x data addr -- x x x)         store data to addr
  +,    {4}     ( x n1 n2 -- x x n1+n2)         add 2ND to TOP
  AND,  {5}     ( x n1 n2 -- x x n1&n2)         and 2ND to TOP
  XOR,  {6}     ( x n1 n2 -- x x n1^n2)         exclusive-or 2ND to TOP
  zgo,  {7}     ( x flg addr -- x x x)          if flg equals 0
                                                then jump to addr
                                                else continue"

-- (see also )
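The eight opcodes above are enough for a tiny interpreter. The following is a minimal sketch, assuming a single shared memory for code and data, an unbounded stack of unbounded integers, and a "stop when the PC leaves memory" rule; none of those details are stated in the quoted table.

```python
NOP, LIT, FETCH, STORE, ADD, AND, XOR, ZGO = range(8)


def run(mem, max_steps=1000):
    """Execute from address 0; return the final memory and stack."""
    mem, stack, pc = list(mem), [], 0
    for _ in range(max_steps):
        if not 0 <= pc < len(mem):       # assumed halting rule
            break
        opcode = mem[pc]
        pc += 1
        if opcode == LIT:                # push data at PC, increment PC
            stack.append(mem[pc])
            pc += 1
        elif opcode == FETCH:            # ( addr -- data )
            stack.append(mem[stack.pop()])
        elif opcode == STORE:            # ( data addr -- )
            addr, data = stack.pop(), stack.pop()
            mem[addr] = data
        elif opcode in (ADD, AND, XOR):  # ( n1 n2 -- result )
            n2, n1 = stack.pop(), stack.pop()
            stack.append({ADD: n1 + n2, AND: n1 & n2, XOR: n1 ^ n2}[opcode])
        elif opcode == ZGO:              # ( flg addr -- ) jump if flg == 0
            addr, flag = stack.pop(), stack.pop()
            if flag == 0:
                pc = addr
        elif opcode != NOP:
            raise ValueError(f"bad opcode {opcode}")
    return mem, stack


# Store 2 + 3 at address 14, then jump past the end of memory to stop.
PROGRAM = [LIT, 2, LIT, 3, ADD, LIT, 14, STORE, LIT, 0, LIT, 15, ZGO, NOP, 0]
final_mem, final_stack = run(PROGRAM)
print(final_mem[14], final_stack)
```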


table 3-11 from PDF page 57 of [3] lists the 'core' ISA of the Xtensa Tensilica 32-bit RISC ISA (harvard architecture, 24-bit instruction width, most instructions have a 16-bit form also; the J (jump) instruction has an 18-bit PC-relative immediate offset):


ACPI ASL section 19.4, page 714, PDF page 752


NS 32k

" The processors had 8 general-purpose 32-bit registers, plus a series of special-purpose registers:

    Frame pointer
    Stack pointer (one each for user and supervisor modes)
    Static base register, for referencing global variables
    Link base register for dynamically linked modules (object orientation)
    Program counter
    A typical processor status register, with a low-order user byte and a high-order system byte.

(Additional system registers not listed).

The instruction set was very much in the CISC model, with 2-operand instructions, memory-to-memory operations, flexible addressing modes, and variable-length byte-aligned instruction encoding. Addressing modes could involve up to two displacements and two memory indirections per operand as well as scaled indexing, making the longest conceivable instruction 23 bytes. The actual number of instructions was much lower than that of contemporary RISC processors.

Unlike some other processors, autoincrement of the base register was not provided; the only exception was a "top of stack" addressing mode that would pop sources and push destinations. " --



J1 is a small (200 lines of Verilog) stack-based CPU, intended for FPGAs. A complete J1 with 16Kbytes of RAM fits easily on a small Xilinx FPGA. Some highlights:

        Extremely high code density. A complete system including the TCP/IP stack fits in under 8K bytes.
        Single cycle call, zero cycle return
        Instruction set maps trivially to Forth
        Cross compiler runs on Windows, Mac and Unix
        Basic software includes a sizeable subset of ANS Forth and a portable TCP/IP networking stack.

... The J1 is a simple 16-bit CPU. It has some RAM, a program counter (PC), a data stack and a call/return stack. It has a small set of built-in arithmetic instructions. Fields in the J1 instructions control the arithmetic function, and write the results back to the data stacks. There are more details on instruction coding in the paper. ... The CPU was designed to run Forth programs very efficiently: the machine’s instructions are so close to Forth that there is little benefit to writing code in assembler. Effectively Forth is the assembly language. J1 runs at about 100 Forth MIPS on a typical FPGA. This compares with about 0.1 Forth MIPS for a traditional threaded Forth running on an embedded 8-bit CPU. ... The code that defines the basic Forth operations as J1 instructions is in basewords.fs

The next layer up defines basic operations in terms of these simple words. These include many of the CORE words from the DPANS94 Forth standard. Some of the general facilities provided by nuc.fs

        byte memory access
        string handling
        double precision (i.e. 32 bit) math
        one’s complement addition
        memory copy and fill
        multiplication and division, fractional arithmetic
        pictured numeric output
        debug words: memory and stack dump, assert

The above files - about 2K of code - bring the J1 to the point where it can start to define application-specific code. " [4]

"operates reliably at 80 MHz in a Xilinx Spartan-3E FPGA" [5]

" The J1 is a small CPU core for use in FPGAs. It is a 16- bit von Neumann architecture with three basic instruction formats. The instruction set of the J1 maps very closely to ANS Forth. The J1 does not have:

There is no other internal state: the CPU has no condition flags, modes or extra registers. Memory is 16-bits wide ... there are five categories of instructions: literal, jump, conditional jump, call, and ALU.

... Instruction encoding ((paraphrased from figure)):

All target addresses - for call, jump and conditional branch - are 13-bit. This limits code size to 8K words, or 16K bytes. The advantages are twofold. Firstly, instruction decode is simpler because all three kinds of instructions have the same format. Secondly, because there are no relative branches, the cross compiler avoids the problem of range overflow in resolve.

Conditional branches are often a source of complexity in CPUs and their associated compiler. J1 has a single instruction that tests and pops T, and if T = 0 replaces the current PC with the 13-bit target value. This instruction is the same as 0branch word found in many Forth implementations, and is of course sufficient to implement the full set of control structures.

ALU instructions have multiple fields:

field      width  action
T'         4      ALU op, replaces T, see table II
T -> N     1      copy T to N
R -> PC    1      copy R to the PC
T -> R     1      copy T to R
dstack +-  2      signed increment data stack
rstack +-  2      signed increment return stack
N -> [T]   1      RAM write

ALU operation codes:

Table III shows how these fields may be used together to implement several Forth primitive words. Hence each of these words maps to a single cycle instruction. In fact J1 executes all of the frequent Forth words - as measured by (Gregg, M. A. Ertl, and J. Waldron, “The Common Case in Forth Programs,” in EuroForth, 2001) and (P. J. Koopman, Jr., Stack computers: the new wave. New York, NY, USA: Halsted Press, 1989) - in a single clock cycle:

word      T'    T->N   R->PC   T->R   dstack+-   rstack+-   N->[T]
dup       T     1      0       0      +1         0          0
over      N     1      0       0      +1         0          0
invert    ~T    0      0       0      0          0          0
+         T+N   0      0       0      -1         0          0
swap      N     1      0       0      0          0          0
nip       T     0      0       0      -1         0          0
drop      N     0      0       0      -1         0          0
;         T     0      1       0      0          -1         0
>r        N     0      0       1      -1         +1         0
r>        R     1      0       1      +1         -1         0
r@        R     1      0       1      +1         0          0
@         [T]   0      0       0      0          0          0
!         N     0      0       0      -1         0          1

" -- [6]

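To see how the ALU fields combine, the rows of the table that touch only the data stack can be simulated directly. This is a sketch under assumptions: the data stack is modelled as a Python list whose last two entries are T and N, the order of the field effects (adjust the stack depth, then apply T -> N, then replace T) is my reading of the table, and the return-stack and memory fields are left out. `WORDS` and `step` are hypothetical names.

```python
MASK = 0xFFFF  # the J1 is a 16-bit machine

# word: (T' function of old T and N, T->N flag, data stack delta)
WORDS = {
    "dup":    (lambda t, n: t,              1, +1),
    "over":   (lambda t, n: n,              1, +1),
    "invert": (lambda t, n: ~t & MASK,      0,  0),
    "+":      (lambda t, n: (t + n) & MASK, 0, -1),
    "swap":   (lambda t, n: n,              1,  0),
    "nip":    (lambda t, n: t,              0, -1),
    "drop":   (lambda t, n: n,              0, -1),
}


def step(stack, word):
    """Apply one ALU instruction; stack[-1] is T and stack[-2] is N.

    The stack must hold at least two entries (a simplification).
    """
    alu, t_to_n, delta = WORDS[word]
    t, n = stack[-1], stack[-2]
    new_t = alu(t, n)
    below = stack[:-1]       # cells under T; below[-1] is N
    if delta == +1:
        below.append(0)      # make room for a new N
    elif delta == -1:
        below.pop()          # discard the old N
    if t_to_n:
        below[-1] = t        # copy the old T into N
    return below + [new_t]


for word in WORDS:
    print(word, step([7, 1, 2], word))
```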
My notes:

" The CPU’s architecture encourages highly-factored code:


Almost all of the core words are written in pure Forth, the exceptions are 'pick' and 'roll', which must use assembly code because the stack is not accessible in regular memory. Much of the core is based on eforth [10].

" -- [7]

There is also the J1a, "a simplified variant of the original J1. The modifications from the original J1 are:

"The board has 8K of RAM, and runs SwapForth, a small but complete interactive Forth development environment. SwapForth takes about 5K of the available RAM, and includes a full native compiler, the ANS standard CORE words, and several more modern extensions."



embedded CPU for Forth.

" MicroCore [1] is a popular configurable processor core targeted at FPGAs. It is a dual-stack Harvard architecture, encodes instructions in 8 bits, and executes one instruction in two system clock cycles. A call requires two of these instructions: a push literal followed by a branch to Top-of-Stack (TOS). A 32-bit implementation with all options enabled runs at 25 MHz - 12.5 MIPS - in a Xilinx Spartan-2S FPGA. " -- [8]



embedded CPU for Forth.

" b16-small [2], [3] is a 16-bit RISC processor. In addition to dual stacks, it has an address register A, and a carry flag C. Instructions are 5 bits each, and are packed 1-3 in each word. Byte memory access is supported. Instructions execute at a rate of one per cycle, except memory accesses and literals which take one extra cycle. The b16 assembly language resembles Chuck Moore’s ColorForth. FPGA implementations of b16 run at 30 MHz. " -- [9]



embedded CPU for Forth.

" eP32 [4] is a 32-bit RISC processor with deep return and data stacks. It has an address register (X) and status register (T). Instructions are encoded in six bits, hence each 32-bit word contains five instructions. Implemented in TSMC’s 0.18um CMOS standard library the CPU runs at 100 MHz, providing 100 MIPS if all instructions are short. However a jump or call instruction causes a stall as the target instruction is fetched, so these instructions operate at 20 MIPS. " -- [10]


F18 (GA144 core)

Two stacks (data and return). "Each stack is eight elements indexed circularly.". There are also registers to access some of the stack elements; T (top of data stack), S (second item on data stack), R (top of return stack). These registers are actually in addition to/spilled from the stack, so there is effectively a 9-element return stack and a 10-element data stack.


"Having separate address registers may be a very strange thing to someone familiar with Forth. Normally, addresses go on the stack and the fetch (@) and store (!) operations use them from there. On the F18, fetch and store are always through P, A or B." [11]

"Having such a lightweight calling convention is what allows for extremely aggressive factoring in Forth...In Forth a call costs only a push/pop of a return address and having lots of small routines is encouraged without worrying about inlining." [12]

18 bit words. 64-word RAM. In addition, a 64-word ROM.

Instruction encoding: 5-bit instructions (no operands except for (sometimes) addresses). Note that 4 instructions fit in each 18-bit word, by restricting what the last instruction in a word may be.
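Four 5-bit opcodes would need 20 bits, so the last slot of an 18-bit word has only 3 bits. One way to picture the restriction is sketched below, assuming the short slot stores the top three bits of an opcode whose low two bits are zero. That layout is an assumption made for illustration (any bit scrambling done by the real hardware is ignored), and `pack` and `unpack` are hypothetical helpers.

```python
def pack(slots):
    """Pack four 5-bit opcodes into one 18-bit cell (5 + 5 + 5 + 3 bits)."""
    s0, s1, s2, s3 = slots
    if any(not 0 <= s < 32 for s in slots):
        raise ValueError("opcodes are 5 bits wide")
    if s3 % 4:
        raise ValueError("last slot only holds opcodes that are multiples of 4")
    return (s0 << 13) | (s1 << 8) | (s2 << 3) | (s3 >> 2)


def unpack(cell):
    """Recover the four opcodes from an 18-bit cell."""
    return [(cell >> 13) & 31, (cell >> 8) & 31, (cell >> 3) & 31,
            (cell & 7) << 2]


# Only a quarter of the 32 opcodes are representable in the last slot.
LAST_SLOT_OPCODES = [opcode for opcode in range(32) if opcode % 4 == 0]

cell = pack([0x08, 0x0F, 0x1C, 0x1C])   # sample opcode values
print(f"{cell:018b}", unpack(cell), len(LAST_SLOT_OPCODES))
```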

"The P register is the program counter; pointing at the next instruction cell to be fetched. I’ll call these “cells” so as not to confuse with Forth “words”. Initially P points at a port reading instructions from a block on disk (block 0, to which the assembler targets). When pointing at a port like this, it continues to do so until a (call), (jump), etc. directs it to fetch instructions from other ports or from RAM. However, when fetching from RAM, P will auto-increment. The most recently fetched instruction cell is placed in register I where it is executed slot by slot – four slots containing op codes (see below). It’s important to realize that we don’t execute through P. Instead we first fetch through P into I and then execute instruction slots from I. Instructions may in the mean time further advance P (e.g. @p to fetch a literal value inline) without immediately affecting execution. There may even be a micronext (unext) loop within I without further fetching at all. " [13]

"Note that upon boot, the B register conveniently points at the console I/O port." [14]

"These are the 32 instructions of this simple machine. I will briefly describe them here and will get into more detail on some (e.g. the multiply-step instruction) in future posts." [15]


'addr' operands take up the remaining slots within the current contents of I.

loads and stores:

"Fetching and storing through P auto-increments (except when pointing at a port)" [16]

arithmetic and misc ALU stuff:

Simulator-only instructions: break (break into debug view), mark (reset performance statistics)

Example programs from [17]:

– push . .      
@b !b unext 

" a two-cell program that echoes console input. There are no addresses shown because these are not packed in memory, but streamed over a port at which the P register points. The first cell negates the top of the stack (–), leaving ffffffff in T, and then does a push to R. The remaining two slots are nops (.). The following cell sets up a micronext (unext) loop; first reading from the console with fetch-B (@b) and echoing a key back with store-B (!b). Note that upon boot, the B register conveniently points at the console I/O port. Finally the unext causes execution to loop back to the first slot; decrement R as the induction variable counting down to zero. This means the machine will sit and spin for 2^32 iterations echoing keypresses. It’s executing the single-cell program with nothing in RAM! This is an interesting aspect of the F18; that you can execute code streaming over a port without first loading into memory and do micronext looping without instruction fetches. " -- [18]

" Port execution is a very interesting aspect of the F18. Executing code streaming in over a port rather than from memory is extremely useful as a “protocol” of sorts between nodes. No parsing. Just ask your neighbors to do things. You may use your neighbor’s RAM and stacks. You can treat the nodes as “agents” and messages become just code passed between them. " -- [19]

Another example program:

         08 0f 1c 1c    @p ! . .        
 0000    0a 0e 02 00    @b !b jump:0000 
         02 00 00 00    jump:0000       

"The above does nearly the same thing as Program 1; echoing keypresses forever. The previous version didn’t work forever. It used a micronext loop to over ffffffff iterations. Here we use a (jump) instruction to loop literally forever. This requires an address to which to jump and thus requires the program to be in memory. These three cells are streamed in over the port to which P points upon boot (block 0) as usual. The first cell simply reads the second cell from the port (@p) and stores it in memory (!). Note that A points at 0000 upon boot. Then the third cell is executed (the second having just been fetched inline). This jumps to address 0000. Notice that the assembler shows that the second cell is indeed packed into RAM at address 0000. So three cells have been streamed in, but just one remains in memory. This program is just like the earlier one; reading from console-in (@b) and writing to console-out (!b), but instead of a unext for R iterations it does an unconditional jump to itself." -- [20]

Another example program:

         08 0d 1c 1c    @p push . .     
         00 00 00 05    5               
         08 0d 04 1c    @p !+ unext .   
 0000    08 08 1d 1c    @p @p push .    
 0001    00 00 00 60    60              
 0002    00 00 00 19    19              
 0003    08 14 18 0e    @p + dup !b     
 0004    00 00 00 01    1               
 0005    05 00 03 00    next:0003 ;     
         03 00 00 00    call:0000       

" This is the last example we’ll walk through like this before moving on to see how much easier this is in colorForth. Try reading the code and working through what it does before continuing 🙂

Again this sequence of cells is streamed through a port at which P points. The first three cells load the next five and the last cell calls the freshly loaded code. In fact, you can think of the first three cells as a generic “loader” for the computer. It fetches a value inline (the green 5) with @p and does a push to R to set up for a micronext loop for six iterations (one more than R). Notice that literals like this are fetched inline and execution continues with the next cell. The loop then is in the third cell. The @p !+ unext will execute six times; reading in the following cells and appending to memory. Notice that the !+ stores to and then increments A. Finally, the last cell will call into this memory-packed code.

Notice that since this is a call and not a jump and because the packed code ends with a return (;) then the computer will go back to awaiting further instructions after the program executes. That is, the P register is initially pointing to a port. Upon executing the call, this port address is pushed to the return stack as usual and P now points at RAM. Upon returning from the call the port address is popped and it goes back to reading from the port. In general, this is an interesting idea. You can think of it as memory mapped ports or you can think of I/O vs. memory as being different “modes” triggered by bits in P. Being able to nest “mode switches” by pushing flags along with return addresses to the return stack is powerful. In the actual F18 computer for example, extended arithmetic mode is triggered by a bit in P and mode switching may be conveniently nested this way.

Back to the program above. Did you figure out what it does? This will emit the alphabet to the console. It fetched a couple of literals to the data stack (@p @p and the following green 60 and 19) and does a push of the 19 to R for use as a next loop counter. Realize that these are hex values (19 hex is 25 decimal for iterating the letters of the alphabet, 60 is just below the ASCII character ‘a’).

The loop starts at address 0003. We increment the top of the stack (the letter to emit); fetching a literal 1 (@p paired with the green 1) and adding (+) it. We then dup this so that we can use it and have a copy still available for the next iteration. The !b emits the letter to the console. This loop doesn’t fit within a micronext. Instead, the following cell is executed (address 0005) with a regular next instruction. Just like unext, this decrements R and loops while non-zero. Once the loop falls through, the return instruction causes the computer to return to awaiting instructions from the port as we discussed.

This, in general, is how you use the computers. Sometimes you can simply stream code to them to be executed. Other times the stream of code is effectively a “loader” to pack RAM with useful routines. These routines may then be called by further streamed code. These calls may return to awaiting further instructions from the stream (or possibly never return). " --
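The loop at addresses 0003 to 0005 of the alphabet program can be traced in ordinary code. This is only a sketch of the behaviour the walkthrough describes (T starts at hex 60, R at hex 19, and next loops while R is non-zero, decrementing it); `alphabet_program` is a hypothetical name, and the port, loader and return stack are not modelled.

```python
def alphabet_program():
    """Trace the packed loop: increment T, emit it, count R down."""
    t, r = 0x60, 0x19            # literals from "@p @p"; "push" moves 0x19 to R
    emitted = []
    while True:
        t += 1                   # "@p +" with the inline literal 1
        emitted.append(chr(t))   # "dup !b" sends a copy of T to the console
        if r == 0:               # "next" falls through once R is zero
            break
        r -= 1                   # otherwise decrement R and jump back to 0003
    return "".join(emitted)


print(alphabet_program())
```

Because next runs the body once more than the initial value of R, a count of hex 19 (25) gives 26 passes, one per letter.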



" kragen on May 18, 2016

on: The MOnSter 6502

I think the GreenArrays F18A cores are similar in transistor count to the 6502, but the instruction set is arguably better, and the logic is asynchronous, leading to lower power consumption and no need for low-skew clock distribution. In 180nm fabrication technology, supposedly, it needs an eighth of a square millimeter, which makes it almost 4 million square lambdas. If we figure that a transistor is about 30 square lambdas and that wires occupy, say, 75% of the chip, that's about 32000 transistors per core, the majority of which is the RAM and ROM, not the CPU itself; the CPU is probably between 5000 and 10 000 transistors. The 6502 was 4528 transistors.

The F18A is a very eccentric design, though: it has 18-bit words (and an 18-bit-wide ALU, compared to the 6502's 8, which is a huge benefit for multiplies in particular), with four five-bit instructions per word. You'll note that this means that there are only 32 possible instructions, which take no operands; that is correct. Also you'll note that two bits are missing; only 8 of the 32 instructions are possible in the last instruction slot in a word.

Depending on how you interpret things, the F18(A) has 20 18-bit registers, arranged as two 8-register cyclic stacks, plus two operand registers which form the top of one of the stacks, a loop register which forms the top of the other, and a read-write register that can be used for memory addressing. (I'm not counting the program counter, write-only B register, etc.)

Each of the 144 F18A cores on the GA144 chip has its own tiny RAM of 64 18-bit words. That, plus its 64-word ROM, holds up to 512 instructions, which isn't big enough to compile a decent-sized C program into; nearly anything you do on it will involve distributing your program across several cores. This means that no existing software or hardware development toolchain can easily be retargeted to it. You can program the 6502 in C, although the performance of the results will often make you sad; you can't really program the GA144 in C, or VHDL, or Verilog.

The GreenArrays team was even smaller than the lean 6502 team. Chuck Moore did pretty much the entire hardware design by himself while he was living in a cabin in the woods, heated by wood he chopped himself, using a CAD system he wrote himself, on an operating system he wrote himself, in a programming language he wrote himself. An awesome feat.

I don't think anybody else in the world is trying to do a practical CPU design that's under 100 000 transistors at this point. DRAM was fast enough to keep up with the 6502, but it isn't fast enough to keep up with modern CPUs, so you need SRAM to hold your working set, at least as cache. That means you need on the order of 10 000 transistors of RAM associated with each CPU core, and probably considerably more if you aren't going to suffer the apparent inconveniences of the F18A's programming model. (Even the "cacheless" Tera MTA had 128 sets of 32 64-bit registers, which works out to 262144 bits of registers, over two orders of magnitude more than the 1152 bits of RAM per F18A core.)

So, if you devote nearly all your transistors to SRAM because you want to be able to recompile existing C code for your CPU, but your CPU is well under 100k transistors like the F18A or the 6502, you're going to end up with an unbalanced design. You're going to wish you'd spent some of those SRAM transistors on multipliers, more registers, wider registers, maybe some pipelining, branch prediction, that kind of thing.

There are all kinds of chips that want to embed some kind of small microprocessor using a minimal amount of silicon area, but aren't too demanding of its power. A lot of them embed a Z80 or an 8051, which have lots of existing toolchains targeting them. A 6502 might be a reasonable choice, too. Both 6502 and Z80 have self-hosting toolchains available, too, but they kind of suck compared to modern stuff.

If you wanted to build your own CPU out of discrete components (like this delightful MOnSter!) and wanted to minimize the number of transistors without regard to the number of other components involved, you could go a long way with either diode logic or diode-array ROM state machines.

Diode logic allows you to compute arbitrary non-inverting combinational functions; if all your inputs are from flip-flops that have complementary outputs, that's as universal as NAND. This means that only the amount of state in your state machine costs you transistors. Stan Frankel's Librascope General Precision LGP-21 "had 460 transistors and about 300 diodes", but you could probably do better than that.

Diode-array ROM state machines are a similar, but simpler, approach: you simply explicitly encode the transition function of your state machine into a ROM, decode the output of your state flip-flops into a selection of one word of that ROM, and then the output data gives you the new state of your state machine. This costs you some more transistors in the address-decoding logic, and probably costs you more diodes, too, but it's dead simple. The reason people do this in real life is that they're using an EPROM or similar chip instead of wiring up a diode array out of discrete components. (The Apple ][ disk controller was a famous example of this kind of thing.) " -- [21]
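The ROM state machine described at the end of the quote above can be sketched in a few lines of Python. This is only an illustration, not a model of any real controller: the 2-bit state, the single input line, and the transition table (a counter that advances only while the input is 1) are all assumptions made up for the example.

```python
# A ROM-based state machine: the ROM word addressed by (state, input)
# holds the next state. All sizes and contents here are illustrative.

STATE_BITS = 2   # hypothetical: four states
INPUT_BITS = 1   # hypothetical: one input line


def build_rom():
    """Fill the ROM with the transition function of a gated 2-bit counter."""
    rom = [0] * (1 << (STATE_BITS + INPUT_BITS))
    for state in range(1 << STATE_BITS):
        for inp in range(1 << INPUT_BITS):
            address = (state << INPUT_BITS) | inp
            # Count up (wrapping) when the input is 1, otherwise hold.
            rom[address] = (state + inp) % (1 << STATE_BITS)
    return rom


def step(rom, state, inp):
    """One clock tick: the state flip-flops load the ROM word they address."""
    return rom[(state << INPUT_BITS) | inp]


def run(rom, inputs, state=0):
    """Return the sequence of states visited for a list of input bits."""
    trace = [state]
    for inp in inputs:
        state = step(rom, state, inp)
        trace.append(state)
    return trace
```

For instance, run(build_rom(), [1, 1, 0, 1, 1]) visits the states 0, 1, 2, 2, 3, 0. Changing the machine's behaviour means changing only the table contents, which is the appeal of the approach.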


a library of hardware-level primitives:


Nand2Tetris (The Elements of Computing Systems)'s 'Hack' assembly language and ISA

assembly and ISA:

See also their stack VM:

And their HLL, Jack:



An open ISA designed by Agner Fog starting in 2016. Fog seems to have thought particularly hard about vector operations.



IBCM (Itty Bitty Computing Machine)

A teaching language. Has a single 16-bit accumulator. 16-bit words, 12-bit addressing (so 4k words of memory, or 8k bytes). Instructions are 16-bit words, with a 4-bit opcode and 12 bits for operands (the 'io' and 'shift' instructions have a special format, 'halt' doesn't use the operand bits at all, and the other instructions all hold a 12-bit memory address there).
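As a rough illustration of that layout, the sketch below splits a 16-bit instruction word into its two fields and derives the memory size from the address width. It assumes the opcode sits in the top four bits, which the description above implies but does not state outright; the function and constant names are made up for the example.

```python
WORD_BITS = 16
OPCODE_BITS = 4
ADDRESS_BITS = WORD_BITS - OPCODE_BITS          # 12

MEMORY_WORDS = 1 << ADDRESS_BITS                # 4096 words (4k)
MEMORY_BYTES = MEMORY_WORDS * WORD_BITS // 8    # 8192 bytes (8k)


def decode(word):
    """Split a 16-bit instruction word into (opcode, address).

    Assumes the opcode is the high nibble and the address the low 12 bits.
    """
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("instruction must fit in 16 bits")
    opcode = word >> ADDRESS_BITS
    address = word & (MEMORY_WORDS - 1)
    return opcode, address
```

Under that assumption, decode(0x3123) gives opcode 3 and address 0x123.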

Instructions (mostly quoted and paraphrased from [22]; I've changed some of the mnemonics):



Steffens' Google Sheets VM

Implemented (for fun) on top of a spreadsheet application (Google Sheets). Registers and memory are cells in the spreadsheet.

6 registers: 4 GPRs, the IP, and a stack pointer.

Instructions (there is no instruction encoding, the textual assembly is directly executed):

4 addressing modes:







" Then there was M.Core, Motorola’s direct assault on ARM. An entirely new 32-bit MCU family designed from the ground up to be small, efficient, and inexpensive, M.Core found few fans outside of its creator’s walls. " -- [23]



For massive parallelism


" Instruction-set highlights include:


Branch Instructions:

Unrestricted branching is supported throughout the 32-bit memory map using branch instructions and register jump instructions. Branching can occur conditionally, based on the arithmetic flags set by the integer or floating-point execution unit. The following table illustrates the condition codes supported by the ISA. The architecture supports two sets of flags to allow independent conditional execution and branching of instructions based on results from two separate arithmetic units. The full set of branching conditions allows the synthesis of any high-level control comparison, including: <, <=, =, ==, !=, >=, and >. Both signed and unsigned arithmetic is supported.

Load/Store Instructions:

Load and store instructions move data between the general-purpose register file and any legal memory location within the architecture, including external memory and any other eCore CPU in the system. All other instructions, such as floating-point and integer arithmetic instructions, are restricted to using registers as source and destination operands.

The ISA supports the following addressing modes:

Byte, short, word, and double data types are supported by all load/store instruction formats. All loads and stores must be aligned with respect to the data size being used. Short (16-bit) data types must be aligned on 16-bit boundaries in memory, word (32-bit) data types must be aligned on 32-bit boundaries, and double (64-bit) data types must be aligned on 64-bit boundaries. Unaligned memory accesses returns unexpected data and generates a software exception. Double data-type load/store instructions must specify an even register in the general-purpose register file. The corresponding odd register is written implicitly. Attempts to use odd registers with double data format is flagged as an error by the assembler.

LDR (load register), STR (store register), TESTSET (test and set). Addressing modes on LDR and STR: Immediate Offset (effective address is the contents of a register plus or minus a constant), Postmodify-Immediate, Register Offset (effective address is the sum or difference of two registers), Postmodify-Register.

Integer Instructions: General-purpose integer instructions, such as ADD, SUB, ORR, AND, are useful for control code and integer math. These instructions work with immediate as well as register-based operands. The instructions update the integer status bits of the STATUS register.

Floating-Point Instructions: An orthogonal set of IEEE754-compliant floating-point instructions for signal processing applications. These instructions update the floating-point status bits of the STATUS register.

Secondary Signed Integer Instructions: The basic floating point instruction set can be substituted with a set of signed integer instructions by setting the appropriate mode bits in the CONFIG register [19:16]. These instructions use the same opcodes as the floating-point instructions and include: IADD, ISUB, IMUL, IMADD, IMSUB.

Register Move Instructions: All register moves are done as complete word (32-bit) entities. Conditional move instructions support the same set of condition codes as the branch instructions specified in Table 12.

(<imm16><< 16)))

Program Flow Instructions: A number of special instructions used by the run time environment to enable efficient interrupt handling, multicore programming, and program debug:

" Figure 9 shows how the shared-memory architecture and the eMesh network work productively together. In the example, a dot-product routine writes its result to a memory location in another mesh node. The only thing required to pass data from one node to another is the setting of a pointer. The hardware decodes the transaction and determines whether it belongs to the local node's memory or to another node's memory.

Figure 9: Pointer Manipulation Example

C-CODE

VecA array at 0x82002000
VecB array at 0x82004000
remote_res at 0x92004000

for (i=0;i<100;i++){
  loc_sum+=vecA[i]*vecB[i];
}
remote_res=loc_sum;

ASSEMBLY

R0=pointer to VecA
R2=pointer to VecB
R6=pointer to remote_res
R4=loc_sum;

MOV R5, #100;
_L: LDR R1,[R0], #1;
LDR R3,[R2], #1;
FMADD R4,R1,R3;
SUB R5,R5,#1;
BNE _L;
STR R4,[R6]; "
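To make the "local or remote?" decision in the quoted example concrete, here is a small sketch of how an address might be split into a node coordinate and an offset. The quoted text does not give the address layout; the split used below (upper 12 bits identify the core as a 6-bit row and a 6-bit column, lower 20 bits are the offset inside that core) is an assumption that should be checked against the Epiphany architecture reference, and the function names are invented for the example.

```python
CORE_ID_SHIFT = 20   # assumed: low 20 bits are the offset within a core
COL_BITS = 6         # assumed: core ID is a 6-bit row followed by a 6-bit column


def split_address(addr):
    """Split a 32-bit address into (row, col, offset) under the assumed layout."""
    core_id = addr >> CORE_ID_SHIFT
    row = core_id >> COL_BITS
    col = core_id & ((1 << COL_BITS) - 1)
    offset = addr & ((1 << CORE_ID_SHIFT) - 1)
    return row, col, offset


def is_local(addr, my_row, my_col):
    """True if the address falls in the memory of the core at (my_row, my_col)."""
    row, col, _ = split_address(addr)
    return (row, col) == (my_row, my_col)
```

Under these assumptions, the VecA address 0x82002000 decodes to row 32, column 32, offset 0x2000, while remote_res at 0x92004000 decodes to row 36, column 32, offset 0x4000, so the final STR in the example would be routed to a different node than the one holding the vectors.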



" The 16 'primary' one-operand instructions were:

Mnemonic  Description
J         Jump: add immediate operand to instruction pointer.
LDLP      Load Local Pointer: load a Workspace-relative pointer onto the top of the register stack
PFIX      Prefix: general way to increase lower nibble of following primary instruction
LDNL      Load non-local: load a value offset from address at top of stack
LDC       Load constant: load constant operand onto the top of the register stack
LDNLP     Load Non-local pointer: Load address, offset from top of stack
NFIX      Negative prefix: general way to negate (and possibly increase) lower nibble
LDL       Load Local: load value offset from Workspace
ADC       Add Constant: add constant operand to top of register stack
CALL      Subroutine call: push instruction pointer and jump
CJ        Conditional jump: depending on value at top of register stack
AJW       Adjust workspace: add operand to workspace pointer
EQC       Equals constant: test if top of register stack equals constant operand
STL       Store local: store at constant offset from workspace
STNL      Store non-local: store at address offset from top of stack
OPR       Operate: general way to extend instruction set

All these instructions take a constant, representing an offset or an arithmetic constant. If this constant was less than 16, all these instructions coded to a single byte.

The first 16 'secondary' zero-operand instructions (using the OPR primary instruction) were:

Mnemonic  Description
REV       Reverse: swap two top items of register stack
LB        Load byte
BSUB      Byte subscript
ENDP      End process
DIFF      Difference
ADD       Add
GCALL     General Call: swap top of stack and instruction pointer
IN        Input: receive message
PROD      Product
GT        Greater Than: the only comparison instruction
WSUB      Word subscript
OUT       Output: send message
SUB       Subtract
STARTP    Start Process
OUTBYTE   Output Byte: send single-byte message
OUTWORD   Output word: send single-word message

" -- [24]
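The quote says that constants below 16 fit in a single byte and that PFIX and NFIX extend the operand a nibble at a time. The sketch below shows one way that scheme can work. It is a sketch under stated assumptions, not a verified transputer model: it assumes each instruction byte holds a 4-bit function code in the high nibble and 4 data bits in the low nibble, that the function codes run 0 to 15 in the order the primary instructions are listed above, and that PFIX shifts an operand register left by four bits while NFIX complements it before shifting. The names encode, decode and oreg are chosen for the example.

```python
# Function codes, assumed to be 0..15 in the order the quoted list gives them.
PRIMARY = ["J", "LDLP", "PFIX", "LDNL", "LDC", "LDNLP", "NFIX", "LDL",
           "ADC", "CALL", "CJ", "AJW", "EQC", "STL", "STNL", "OPR"]
CODE = {name: i for i, name in enumerate(PRIMARY)}


def encode(mnemonic, operand):
    """Return the byte sequence for one primary instruction and its operand."""
    op = CODE[mnemonic]
    low = operand & 0xF
    if 0 <= operand < 16:
        return [(op << 4) | low]                 # fits in a single byte
    if operand >= 16:
        return encode("PFIX", operand >> 4) + [(op << 4) | low]
    return encode("NFIX", (~operand) >> 4) + [(op << 4) | low]


def decode(code_bytes):
    """Replay bytes through an operand register; return (mnemonic, operand) pairs.

    Python integers are unbounded, so the register is not truncated to a
    machine word here as it would be in hardware.
    """
    result = []
    oreg = 0
    for byte in code_bytes:
        op, data = byte >> 4, byte & 0xF
        oreg |= data                             # data nibble goes into the low bits
        if PRIMARY[op] == "PFIX":
            oreg <<= 4
        elif PRIMARY[op] == "NFIX":
            oreg = (~oreg) << 4
        else:
            result.append((PRIMARY[op], oreg))
            oreg = 0                             # operand consumed
    return result
```

With these assumptions, encode("LDC", 5) is the single byte 0x45, encode("LDC", 0x754) is the three bytes 0x27 0x25 0x44, and encode("LDC", -1) is 0x60 0x4F. Small operands stay short, which suits code dominated by small offsets and constants.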



16 32-bit GPRs


16-bit 3-operand instruction encoding; 4-bit opcode, then 3x4-bit operand fields selecting three registers. Some opcodes (like LDD, LDW, MJMP) require an additional word or two to encode data.

Sweet32 instructions