Bayle Shanks's website: books-programmingLanguages-programmingLanguagesChArmIsa

ARM: Intro

https://en.wikipedia.org/wiki/ARM_architecture#32-bit_architecture

http://users.ece.utexas.edu/~valvano/EE345M/Arm_EE382N_4.pdf

https://sourceware.org/cgen/gen-doc/arm-thumb-insn.html list of instructions with names, todo

A recent addition to the ARM ISA family is ARM64 (ARMv8 A64 / AArch64), described on the pages http://www.arm.com/products/processors/instruction-set-architectures/index.php http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0677b/ch01s01.html http://www.arm.com/files/downloads/ARMv8_Architecture.pdf http://www.cs.utexas.edu/~peterson/arm/DDI0487A_a_armv8_arm_errata.pdf http://www.arm.com/files/pdf/ARMv8R__Architecture_Oct13.pdf.

ARM has various versions and 3 profiles; A (full-features for use as e.g. CPU of smartphone or computer; has virtual addressing MMU), R (real-time, for use in e.g. car engines; has deterministic (i think) physical addressing MMU), M (microcontroller; only supports Thumb ISA). The latest version is v8, but according to the ARM Wikipedia page only A and R profiles are (yet) available for v8. v7 has all 3 profiles (e.g. http://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readings/ARMv7-M_ARM.pdf ). There's also an E-M which is like M with a DSP extension, found in v7.

ARM Thumb: "The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions." -- (ARM7TDMI Technical Reference Manual Revision: r4p1) "The Thumb instruction set provides better code density, at the expense of inferior performance....Thumb-2, a major enhancement of the Thumb instruction set. Thumb-2 provides almost exactly the same functionality as the ARM instruction set. It has both 16-bit and 32-bit instructions, and achieves ARM-like performance with Thumb-like code density." -- (RealView Compilation Tools Assembler Guide Version 4.0) https://en.wikipedia.org/wiki/ARM_Cortex-M

"The biggest register difference involves the SP register. The Thumb state has unique stack mnemonics (PUSH, POP) that don't exist in the ARM state. These instructions assume the existence of a stack pointer, for which R13 is used. They translate into load and store instructions in the ARM state. " -- http://www.embedded.com/electronics-blogs/beginner-s-corner/4024632/Introduction-to-ARM-thumb

"The original Thumb-Instruction set only contained 16-bit instructions. Thumb2 introduced mixed 16/32 bit instructions....The ARM processor has 2 instruction sets, the traditional ARM set, where the instructions are all 32-bit long, and the more condensed Thumb(2) set, where most common instructions are 16-bit long (and some are 32-bit long)." -- http://stackoverflow.com/questions/10638130/thumb-instruction-in-arm

Some instructions have immediate addressing modes and others do not. i won't bother to include that information because my interest here is mainly in the instruction set. I leave out some instructions that are, to me, uninteresting variants of existing ones. Note that the purpose of these listings is not accuracy, but rather to get a sense of what sorts of instructions are in RISC-ish CPU instruction sets.

Note that in Thumb2, instructions cannot reference the PC (program counter) or SP (stack pointer) as operands, including destination operand, unless noted. Note that every instruction that returns a result takes an operand specifying the destination register; operations are NOT done in place on the input registers (except when the destination register given is the same as an input register).

ARM has 'barrel shifting', meaning that shifts and rotates can be performed on operands without issuing separate instructions.
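As a small illustration of what the barrel shifter buys you, here is a Python sketch of an instruction like ADD Rd, Rn, Rm, LSL #2, which shifts one operand and adds in a single step. The function names are invented for this example and condition flags are ignored; it is a model of the idea, not of any particular core.

```python
MASK32 = 0xFFFFFFFF

def lsl(value, amount):
    """Logical shift left within a 32-bit register."""
    return (value << amount) & MASK32

def add_shifted(rn, rm, shift_amount):
    """Model of ADD Rd, Rn, Rm, LSL #shift_amount (condition flags ignored)."""
    return (rn + lsl(rm, shift_amount)) & MASK32

# Address of element 5 in an array of 4-byte words starting at 0x1000.
print(hex(add_shifted(0x1000, 5, 2)))
```

This is the typical use: scaling an array index by the element size while computing the address, without a separate shift instruction.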

It has a clever way of representing 32-bit immediate values with only 8 bits plus 4 bits to determine a shift, which allows it to represent any power of 2 as an immediate value: http://alisdair.mcdiarmid.org/2014/01/12/arm-immediate-value-encoding.html . "Thumb-2 immediate encoding is even more gleeful--in addition to allowing rotation, it also allows for spaced repetition of any 8-bit pattern (common in low level hack patterns, like from [1]) to be encoded in single instructions." -- https://news.ycombinator.com/item?id=7046803 . If the value you want isn't accessible as an immediate, you can load it from a constant table or you can compute it, or some instruction sets have MOVW and MOVT which can construct and combine 16-bit immediates into a 32-bit value. Some assemblers let you just specify the immediate and the assembler figures out how to get it ( https://news.ycombinator.com/item?id=7045898 ).
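To make the classic ARM (not Thumb-2) scheme concrete, here is a Python sketch that searches for an 8-bit value and a 4-bit rotation field that reproduce a given 32-bit constant. It assumes the rule described in the linked article, namely that the immediate is the 8-bit value rotated right by twice the 4-bit field; the function names are made up for this example.

```python
MASK32 = 0xFFFFFFFF

def ror32(value, amount):
    """Rotate a 32-bit value right by amount bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32

def encode_arm_immediate(value):
    """Return (rot4, imm8) such that imm8 rotated right by 2*rot4 equals value,
    or None if the constant cannot be encoded this way."""
    value &= MASK32
    for rot4 in range(16):
        # Rotating left by 2*rot4 undoes a right rotation by 2*rot4.
        imm8 = ror32(value, 32 - 2 * rot4)
        if imm8 <= 0xFF:
            return rot4, imm8
    return None

def decode_arm_immediate(rot4, imm8):
    """Expand an encoded (rot4, imm8) pair back to the 32-bit constant."""
    return ror32(imm8, 2 * rot4)

print(encode_arm_immediate(0xFF))        # fits directly in the low 8 bits
print(encode_arm_immediate(0xF000000F))  # wraps around the ends of the word
print(encode_arm_immediate(0x101))       # set bits span 9 positions: not encodable
```

A constant for which this returns None is one of the cases where you would fall back on a literal pool load, a MOVW/MOVT pair, or computing the value.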

ARM instructions traditionally encoded a conditional execution field, allowing instructions to be skipped depending on the flags, without doing a branch. On ARM64 this has been changed:

" arm64 ... sort of ditches conditional execution. It’s not on every instruction any more, but it’s still available on more instructions than on most other arches.

To the usual complement of typical conditional instructions (branch, add/sub with carry, select and set), arm64 adds select with increment, negate, or inversion, the ability to conditionally set to -1 as well as +1, and the ability to conditionally compare and merge the flags in a fairly flexible manner (it’s really a conditional select of condition flags between the result of a comparison and an immediate). This actually preserves most of the power of conditional execution (except for really exotic hand-coded usages), while taking up much less encoding space. " -- stephencanon , https://news.ycombinator.com/item?id=7047762
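The conditional select family mentioned in the quote can be sketched in Python as below. This is a simplified model on 64-bit values where the condition is passed in as a boolean instead of being read from the flags; the function names mirror the A64 mnemonics CSEL, CSINC, CSINV, CSNEG, CSET and CSETM, but the code itself is only an illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def csel(cond, rn, rm):
    """Select rn if the condition holds, else rm."""
    return rn if cond else rm

def csinc(cond, rn, rm):
    """Select rn if the condition holds, else rm + 1."""
    return rn if cond else (rm + 1) & MASK64

def csinv(cond, rn, rm):
    """Select rn if the condition holds, else bitwise NOT of rm."""
    return rn if cond else ~rm & MASK64

def csneg(cond, rn, rm):
    """Select rn if the condition holds, else the two's complement negation of rm."""
    return rn if cond else (-rm) & MASK64

def cset(cond):
    """Set to 1 if the condition holds, else 0 (CSINC of zero with inverted condition)."""
    return csinc(not cond, 0, 0)

def csetm(cond):
    """Set to all ones (-1) if the condition holds, else 0."""
    return csinv(not cond, 0, 0)

# Branch-free absolute value: keep x if it is non-negative, otherwise negate it.
x = (-7) & MASK64
print(csneg(x < (1 << 63), x, x))
```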

ARM has 8 operating modes. "Each mode has its own mode-specific registers, including a status register":

User – normal operation
Fast interrupt – handling of "fast" interrupts
Interrupt – handling of all other interrupts
Supervisor – operating system protected mode
Abort – abortion of memory access
System – operating system privileged mode
Undefined – invalid instruction in stream
Secure monitor – on-chip security features

(descriptions from http://www.cs.virginia.edu/~skadron/cs433_s09_processors/arm11.pdf )

Addressing modes ( http://www.cs.uregina.ca/Links/class-info/301/ARM-addressing/lecture.html ):

register
absolute
immediate
register indirect
register indirect with immediate offset
register indirect preincrementing by immediate offset
register indirect postincrementing by immediate offset
register indirect with register offset
register indirect with register offset with scaling
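The difference between the plain offset, preincrementing, and postincrementing modes in the list above is easiest to see side by side. Below is a Python sketch with memory as a dictionary from address to word and registers as a dictionary from name to value; all names are invented for this example.

```python
def load_offset(mem, regs, rt, rn, offset):
    """LDR rt, [rn, #offset]: use base + offset, leave the base register alone."""
    regs[rt] = mem[regs[rn] + offset]

def load_preindexed(mem, regs, rt, rn, offset):
    """LDR rt, [rn, #offset]!: update the base first, then load from it."""
    regs[rn] += offset
    regs[rt] = mem[regs[rn]]

def load_postindexed(mem, regs, rt, rn, offset):
    """LDR rt, [rn], #offset: load from the base, then update the base."""
    regs[rt] = mem[regs[rn]]
    regs[rn] += offset

mem = {0x100: 11, 0x104: 22, 0x108: 33}
regs = {"r0": 0, "r1": 0x100}
load_postindexed(mem, regs, "r0", "r1", 4)  # reads 0x100, then r1 becomes 0x104
print(regs["r0"], hex(regs["r1"]))
```

Postindexing is what makes walking an array one element per load cheap: the pointer bump comes for free with the access.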

ARM: 16-bit Thumb2 instructions

MOV LSL r1 r2 r3 (logical shift left; r1 := r2 << r3) LSR (logical shift right) ASR (arithmetic shift right) ADD (note; the source and/or destination operands for ADD can include SP, the stack pointer; in this way you can get the SP into a register) SUB (note; the source and destination operands for SUB can include SP, the stack pointer)

ADR (Add immediate to program counter; in this way you can get the PC into a register; useful for getting the address of a 'label' if your assembler translates labels to relative offsets )

CMP

AND EOR (xor)

ADC (Add with Carry; a + b + carry bit) SBC (Subtract with Carry; a - b - NOT(carry bit), i.e. the carry flag acts as an inverted borrow) ROR (Rotate Right) TST (Test bits: TST x y: update condition code flags on Rn AND Rm) RSB (Reverse subtract (from zero; e.g. negate)) CMP (update condition code flags on Rn - Rm) CMN (Compare Negative; update condition code flags on Rn + Rm) ORR (or) MUL BIC (Bit Clear: x AND (NOT y)) MVN (Move NOT: bitwise negation)
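The main reason ADC exists is multi-word arithmetic. Here is a Python sketch of a 64-bit addition built from a 32-bit add that sets the carry flag followed by an add-with-carry; the helper names are invented, and only the carry flag is modelled.

```python
MASK32 = 0xFFFFFFFF

def adds(a, b):
    """32-bit add that also returns the carry out (like ADDS)."""
    total = a + b
    return total & MASK32, total >> 32

def adc(a, b, carry):
    """32-bit add with carry in (like ADC), returning result and carry out."""
    total = a + b + carry
    return total & MASK32, total >> 32

def add64(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit numbers held as (low word, high word) register pairs."""
    lo, carry = adds(a_lo, b_lo)
    hi, _ = adc(a_hi, b_hi, carry)
    return lo, hi

# 0x00000000_FFFFFFFF + 1 carries into the high word.
print(add64(0xFFFFFFFF, 0, 1, 0))
```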

BL (branch with link; BL <label>: LR register = address of next instruction, PC = label)

BX (Branch and Exchange; this is used to enter/exit "thumb state") BLX (Branch with Link and Exchange; this is used to enter/exit "thumb state")

Load and store:

STR (Store word. Addressing modes include immediate, register offset, PC offset, SP offset. Can store list of multiple registers (STMIA).) also STRH for store halfword, STRB for byte

LDR (Load word. Addressing modes include immediate, register offset, SP offset. Can load list of multiple registers (LDMIA).) also LDRH for Load unsigned halfword, LDRSH for signed halfword, LDRB for unsigned byte, LDRSB for signed byte

LDR (load from literal pool instructions) B (unconditional, conditional branch instructions: takes as an operand a 'condition field' (this is different from a condition code), which is one of equal, not equal, Carry Set / Unsigned higher or same, Carry Clear / Unsigned lower, Negative, Positive or zero, Overflow, No overflow, Unsigned higher, Unsigned lower or same, Signed greater than or equal, Signed less than or equal, Signed greater than, Signed less than, always)

SVC (service (system) call instructions; formerly SWI) SETEND (set endianness) CPS (change processor state; enables and disables specified interrupts) BKPT (software breakpoint) IT (If-Then; "Makes up to four following instructions conditional, according to pattern. pattern is a string of up to three letters. Each letter can be T (Then) or E (Else).")

Adjust stack pointer instructions Increment stack pointer ADD (SP plus immediate) Decrement stack pointer SUB (SP minus immediate)

Sign or zero extend instructions (these are used to convert a signed or unsigned value of a certain byte width into a value of a larger byte width, e.g. to convert a signed byte representing "-10" to a signed word representing "-10"; see http://odellconnie.blogspot.com/2012/03/sign-extension-zero-extension.html ) SXTH (Signed Extend Halfword to Word: SXTH Rd Rm: Rd[31:0] := SignExtend(Rm[15:0])) SXTB (Signed Extend Byte to Word: Rd[31:0] := SignExtend(Rm[7:0])) UXTH (Unsigned Extend Halfword to word: Rd[31:0] := ZeroExtend(Rm[15:0])) UXTB (Unsigned Extend Byte to word: Rd[31:0] := ZeroExtend(Rm[7:0]))
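The "-10" example can be worked through in Python. The sketch below models the four extend instructions on 32-bit register values stored as unsigned integers; to_signed is a helper invented here just to display the result as a signed number.

```python
def sxtb(rm):
    """Sign-extend the low byte to 32 bits."""
    byte = rm & 0xFF
    return byte | 0xFFFFFF00 if byte & 0x80 else byte

def sxth(rm):
    """Sign-extend the low halfword to 32 bits."""
    half = rm & 0xFFFF
    return half | 0xFFFF0000 if half & 0x8000 else half

def uxtb(rm):
    """Zero-extend the low byte to 32 bits."""
    return rm & 0xFF

def uxth(rm):
    """Zero-extend the low halfword to 32 bits."""
    return rm & 0xFFFF

def to_signed(word):
    """Interpret a 32-bit register value as a signed integer."""
    return word - (1 << 32) if word & 0x80000000 else word

minus_ten_byte = 0xF6  # -10 as a signed 8-bit value
print(to_signed(sxtb(minus_ten_byte)))  # still -10 after sign extension
print(to_signed(uxtb(minus_ten_byte)))  # zero extension reads it as 246
```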

Compare and branch on (non-)zero instructions CBZ (Compare and branch on zero; CBZ r <label>: if r == 0, goto <label>) CBNZ (Compare and branch on non-zero)

PUSH (push selected registers onto stack) POP (pop selected registers from stack)

Reverse byte instructions REV (Byte-Reverse Word, e.g. reverse the ordering of the four bytes in the word (and put the result in the destination register)) REV16 (Byte-Reverse Packed Halfword, e.g. reverse the ordering of the two bytes in both halfwords) REVSH (Byte-Reverse Signed Halfword, e.g. reverse the bytes in the low halfword, and sign extend the result to fill the whole word)
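A quick Python sketch of the three byte-reversal instructions, operating on 32-bit values stored as unsigned integers. The function names simply follow the mnemonics; this is a model of the described behaviour, not code from any ARM documentation.

```python
MASK32 = 0xFFFFFFFF

def rev(x):
    """Reverse the order of the four bytes in the word."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")

def rev16(x):
    """Swap the two bytes inside each halfword independently."""
    return (((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)) & MASK32

def revsh(x):
    """Swap the bytes of the low halfword, then sign-extend to 32 bits."""
    low = x & 0xFFFF
    swapped = ((low & 0xFF) << 8) | (low >> 8)
    return swapped | 0xFFFF0000 if swapped & 0x8000 else swapped

print(hex(rev(0x12345678)))
print(hex(rev16(0x12345678)))
print(hex(revsh(0x000080FF)))
```

REV is the usual tool for converting a word between big-endian and little-endian byte order.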

NOP-compatible hint instructions: NOP YIELD (Yield control to alternative thread) WFE (Wait For Event) WFI (Wait For Interrupt) SEV (Send event; signal event in multiprocessor system)

ARM: 32-bit Thumb2 instructions

ORN (OR (not)) TEQ (update condition code flags on a XOR b) MOVT (move the source halfword into the top halfword of the destination register) BFC (Bit Field Clear; set specified bits to zero; takes a starting bit and a bitwidth) BFI (Bit Field Insert; set specified bits to specified values; takes a starting bit and a bitwidth and a source value)

SBFX (Signed Bit Field extract) SSAT (Signed saturate, LSL, ASR) SSAT16 (Signed saturate 16-bit) UBFX (Unsigned Bit Field extract) USAT (Unsigned saturate, LSL, ASR) USAT16 (Unsigned saturate 16-bit)
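The bit field instructions (BFC, BFI, UBFX, SBFX) all take a starting bit and a width. Here is a Python sketch of their effect on 32-bit values; field_mask is a helper invented for this example, and the results are returned instead of being written to a destination register.

```python
MASK32 = 0xFFFFFFFF

def field_mask(lsb, width):
    """Mask with width one-bits starting at bit position lsb."""
    return ((1 << width) - 1) << lsb

def bfc(rd, lsb, width):
    """Bit Field Clear: zero the selected bits of rd."""
    return rd & ~field_mask(lsb, width) & MASK32

def bfi(rd, rn, lsb, width):
    """Bit Field Insert: copy the low width bits of rn into rd at lsb."""
    mask = field_mask(lsb, width)
    return ((rd & ~mask) | ((rn << lsb) & mask)) & MASK32

def ubfx(rn, lsb, width):
    """Unsigned Bit Field Extract: zero-extended field."""
    return (rn >> lsb) & ((1 << width) - 1)

def sbfx(rn, lsb, width):
    """Signed Bit Field Extract: field sign-extended to 32 bits."""
    value = ubfx(rn, lsb, width)
    if value & (1 << (width - 1)):
        value -= 1 << width
    return value & MASK32

print(hex(bfc(0xFFFFFFFF, 8, 8)))
print(hex(bfi(0, 0xAB, 4, 8)))
print(hex(ubfx(0x12345678, 8, 8)))
print(hex(sbfx(0x0000F000, 12, 4)))
```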

PKH (Pack halfword, BT, TB) RRX (Rotate Right with Extend)

Signed and unsigned extend instructions with optional addition: SXTAB (Signed extend byte and add) SXTAB16 (Signed extend two bytes to halfwords, and add) SXTAH (Signed extend halfword and add) SXTB16 (Signed extend two bytes to halfwords) UXTAB (Unsigned extend byte and add) UXTAB16 (Unsigned extend two bytes to halfwords, and add) UXTAH (Unsigned extend halfword and add) UXTB16 (Unsigned extend two bytes to halfwords)

SIMD add and subtract: QADD16, UADD16, QADD8, UADD8, QASX, UASX, QSUB16, UHADD16, QSUB8, UHADD8, QSAX, UHASX, SADD16, UHSUB16, SADD8, UHSUB8, SASX, UHSAX, SHADD16, UQADD16, SHADD8, UQADD8, SHASX, UQASX, SHSUB16, UQSUB16, SHSUB8, UQSUB8, SHSAX, UQSAX, SSUB16, USUB16, SSUB8, USUB8, SSAX

Mnemonic element – Meaning:

Q prefix – Signed saturating arithmetic.
S prefix – Signed arithmetic, modulo 2^8 or 2^16.
SH prefix – Signed halving arithmetic. The result of the calculation is halved.
U prefix – Unsigned arithmetic, modulo 2^8 or 2^16.
UH prefix – Unsigned halving arithmetic. The result of the calculation is halved.
UQ prefix – Unsigned saturating arithmetic.
16 suffix – The instruction performs two 16-bit calculations.
8 suffix – The instruction performs four 8-bit calculations.
ASX mnemonic – The instruction performs one 16-bit addition and one 16-bit subtraction. The X indicates that the halfwords of the second operand are exchanged before the operation.
SAX mnemonic – The instruction performs one 16-bit subtraction and one 16-bit addition. The X indicates that the halfwords of the second operand are exchanged before the operation.
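To see how the prefixes change the result, here is a Python sketch of three of the four-lane byte additions: UADD8 (wraps within each byte), UQADD8 (saturates at 255), and UHADD8 (halves the sum). The lane helpers are invented for this example, and the GE flags that the non-saturating forms also set are not modelled.

```python
def lanes8(x):
    """Split a 32-bit word into four byte lanes, least significant first."""
    return [(x >> (8 * i)) & 0xFF for i in range(4)]

def pack8(lanes):
    """Pack four byte lanes back into a 32-bit word, truncating each to 8 bits."""
    return sum((lane & 0xFF) << (8 * i) for i, lane in enumerate(lanes))

def uadd8(a, b):
    """Unsigned add per byte, wrapping within each lane."""
    return pack8([x + y for x, y in zip(lanes8(a), lanes8(b))])

def uqadd8(a, b):
    """Unsigned saturating add per byte: each lane is clamped to 0xFF."""
    return pack8([min(x + y, 0xFF) for x, y in zip(lanes8(a), lanes8(b))])

def uhadd8(a, b):
    """Unsigned halving add per byte: each lane holds (x + y) / 2."""
    return pack8([(x + y) >> 1 for x, y in zip(lanes8(a), lanes8(b))])

a, b = 0xFF804001, 0x01808002
print(hex(uadd8(a, b)))
print(hex(uqadd8(a, b)))
print(hex(uhadd8(a, b)))
```

Note that no carry ever crosses from one lane into the next, which is what distinguishes these from an ordinary 32-bit ADD.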

CLZ (Count Leading Zeros (just what it sounds like)) QADD (Saturating Add) QDADD (Saturating Double and Add) QDSUB (Saturating Double and Subtract) QSUB (Saturating Subtract) RBIT (Reverse Bits) SEL (Select bytes; uses the 4 GE bits, which control, in each of the four byte positions of the output, which of the two input words will contribute that byte)
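Three of these are simple enough to sketch in a few lines of Python: CLZ, RBIT, and the signed saturating add QADD. Values are 32-bit registers stored as unsigned integers; to_signed is a helper invented for this example, and the sticky Q flag that QADD sets on saturation is not modelled.

```python
MASK32 = 0xFFFFFFFF

def clz(x):
    """Count the zero bits above the highest set bit (32 for an input of 0)."""
    return 32 - (x & MASK32).bit_length()

def rbit(x):
    """Reverse the order of all 32 bits."""
    return int(format(x & MASK32, "032b")[::-1], 2)

def to_signed(word):
    """Interpret a 32-bit register value as a signed integer."""
    return word - (1 << 32) if word & 0x80000000 else word

def qadd(a, b):
    """Signed add that clamps to the 32-bit signed range instead of wrapping."""
    total = to_signed(a & MASK32) + to_signed(b & MASK32)
    total = max(-(1 << 31), min((1 << 31) - 1, total))
    return total & MASK32

print(clz(1), clz(0))
print(hex(rbit(1)))
print(hex(qadd(0x7FFFFFFF, 1)))
```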

multiply/divide and accumulate (add/subtract the result of multiplying to the destination, in-place), with various different byte widths of the operands and destination register(s): MLA (multiply and accumulate; x + (y*z)) MLS (multiply and subtract) SMLAxy (Signed Multiply-Accumulate Add, with double-length result) SMLAD (Signed Dual Multiply-Accumulate Add) SMLAWx (Signed Multiply-Accumulate Add) SMLSD (Signed Dual Multiply Subtract and Accumulate) SMMLA (Signed 32 + 32 x 32-bit, most significant word) SMMLS (Signed 32 – 32 x 32-bit, most significant word) SMMUL (Signed 32 x 32-bit, most significant 32-bit word) SMUAD (Signed Dual Multiply Add) SMULxy SMULWx SMUSD (Signed Dual Multiply Subtract) USAD8 (Unsigned Sum of Absolute Differences) USADA8 (Unsigned Accumulate Absolute Differences)

with 64-bit results (two registers to hold result): SMULL (Signed multiply with double-length result) UMULL (Unsigned multiply with double-length result) SMLALxy (Signed multiply with double-length result and accumulate) SMLALD (Signed Multiply Accumulate Long Dual) SMLSLD (Signed Multiply Subtract accumulate Long Dual) UMLAL (Unsigned 64 + 32 x 32) UMAAL (Unsigned multiply and accumulate with double-length result). Division (32-bit result in a single register): SDIV (Signed divide) UDIV (Unsigned divide)

loads and stores:

add versions for postindexing, and for double words
PLD, PLI (preload)

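LDRD (load double) STRD (store double) LDREX (load exclusive word; tags the address for exclusive access, the first half of an atomic read-modify-write sequence as used for semaphores) STREX (store exclusive word; the store only happens if the exclusive tag is still intact, and a result register reports success or failure) CLREX (clear local processor exclusive tag; abandons a pending exclusive access)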
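The way LDREX and STREX fit together is a retry loop: load exclusively, compute, try to store, and start over if the store reports failure. The Python sketch below uses a deliberately simplified monitor model (one tagged address per CPU, cleared by any successful exclusive store to that address). The class and its methods are invented for this illustration and gloss over many details of real hardware monitors; the one detail taken from the architecture is that STREX reports 0 for success and 1 for failure.

```python
class ExclusiveMemory:
    """Toy memory with a per-CPU exclusive monitor (simplified assumption)."""

    def __init__(self):
        self.words = {}     # address -> value
        self.monitors = {}  # cpu id -> address tagged by that CPU's last LDREX

    def ldrex(self, cpu, addr):
        """Load a word and tag the address for exclusive access by this CPU."""
        self.monitors[cpu] = addr
        return self.words.get(addr, 0)

    def strex(self, cpu, addr, value):
        """Store only if this CPU's tag is intact; return 0 on success, 1 on failure."""
        if self.monitors.get(cpu) != addr:
            return 1
        self.words[addr] = value
        # A successful store invalidates every tag on this address.
        for other in [c for c, a in self.monitors.items() if a == addr]:
            del self.monitors[other]
        return 0

    def clrex(self, cpu):
        """Drop this CPU's tag without storing."""
        self.monitors.pop(cpu, None)

def atomic_increment(mem, cpu, addr):
    """Retry loop: repeat the load/modify/store until the store succeeds."""
    while True:
        old = mem.ldrex(cpu, addr)
        if mem.strex(cpu, addr, old + 1) == 0:
            return old + 1

mem = ExclusiveMemory()
mem.words[0x20] = 5
stale = mem.ldrex(0, 0x20)              # CPU 0 starts an update...
print(atomic_increment(mem, 1, 0x20))   # ...but CPU 1 gets in first
print(mem.strex(0, 0x20, stale + 1))    # so CPU 0's store is refused
```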

TBB (Table Branch Byte) TBH (Table Branch Halfword)

LDMDB / LDMEA (Load Multiple Decrement Before / Empty Ascending) RFE (Return From Exception) SRS (Store Return State) STMDB / STMFD (Store Multiple Decrement Before / Full Descending)

MRS (Move from Status register to ARM Register, e.g. put the condition codes into a register) MSR (Move from ARM register to Status register, e.g. copy a register over the condition codes) SUBS (Return From Exception without stack)

DBG (Debug hint)

Special control operations: CLREX (Clear Exclusive) DSB (Data Synchronization Barrier) DMB (Data Memory Barrier) ISB (Instruction Synchronization Barrier)

Coprocessor instructions: not listed

ARM: Cortex M profile

Cortex M0, M0+, and M1 only have these instructions:

16-bit: ADC, ADD, ADR, AND, ASR, B, BIC, BKPT, BLX, BX, CMN, CMP, CPS, EOR, LDM, LDR, LDRB, LDRH, LDRSB, LDRSH, LSL, LSR, MOV, MUL, MVN, NOP, ORR, POP, PUSH, REV, REV16, REVSH, ROR, RSB, SBC, SEV, STM, STMIA, STR, STRB, STRH, SUB, SVC, SXTB, SXTH, TST, UXTB, UXTH, WFE, WFI, YIELD

32-bit: BL (branch with link), DMB (Data Memory Barrier; Ensure the order of observation of memory accesses), DSB (Data Synchronization Barrier; Ensure the completion of memory accesses), ISB (Instruction Synchronization Barrier; flush processor pipeline and branch prediction logic), MRS (Move from Status register), MSR (move to status register)

Note that the 16-bit instruction set is identical to the 16-bit Thumb-2 instruction set above, except that it lacks SETEND (set endianness), IT (if-then), CBZ (Compare and branch on zero), and CBNZ. (Also, BL here appears only as 32-bit, whereas it was listed in the 16-bit instruction set above. In the original Thumb encoding, BL is a pair of 16-bit halfwords that together form one 32-bit instruction, which is why it can show up in either list.) IT, CBZ, CBNZ are added in the Cortex M3, as well as a bunch of 32-bit instructions:

new 32-bit instructions in the Cortex M3: BFC (Bit Field Clear), BFI (Bit Field Insert), CDP (?), CLREX (clear local processor exclusive tag), CLZ (count leading zeros), DBG (debug hint), various loads (LDC, LDMA, LDMDB, LDRBT, LDRD, LDREX, LDREXB, LDREXH, LDRHT, LDRSB, LDRSBT, LDRSHT, LDRT), MCR (?), MLS (multiply and subtract), MCRR (?), MLA (multiply and accumulate; x + (y*z)), MOVT (move the source halfword into the top halfword of the destination register), MRC (?), MRRC (?), ORN (x or (not(y)), PLD (preload data), PLDW, PLI (preload instructions), RRX (Rotate Right with Extend), SBFX (Signed Bit Field extract), SDIV (Signed divide), SMLAL (an SMULL-like thingee), SMULL, SSAT (signed saturate), STC (?), various stores (STMDB, STRBT, STRD, STREX, STREXB, STREXH, STRHT, STRT), TBB (Table Branch Byte), TBH (Table Branch Halfword), TEQ (update condition code flags on a XOR b), UBFX (Unsigned Bit Field extract), UDIV (Unsigned divide), other multiply, multiply-accumulate, and saturate instructions (UMLAL, UMULL, USAT)

Note that http://www.eetimes.com/document.asp?doc_id=1319726 claims that "SoCs based on ARM's M0+ Flycatcher core will not run Linux, although they do hit the sub-50-cent price point for the IoT, including security engines and targeted peripherals."

As of this writing, the Cortex M0+ seems to be the leading design for tiny low-power 32-bit devices. There are very small versions of them, e.g. http://cache.freescale.com/files/microcontrollers/doc/fact_sheet/KINETISKL02CSPFS.pdf?fpsp=1 which is 16 mm^2. This device runs at about 48 MHz and the M0+ design yields about 1 MIPS/MHz, which means that according to http://www.roylongbottom.org.uk/mips.htm it's about as powerful as a 486! It has 32 KB flash (presumably for program storage) and 4 KB RAM. Intel recently released a small low-power chip called the Quark which is a SoC with a 486 ISA, 512 KB SRAM, 16 KB cache.

Links:

https://en.wikipedia.org/wiki/ARM_Cortex-M

ARM history

" [Acorn] always had a reputation for weirdness, and I suppose this was the ultimate. While everyone else went 16-bit (or disappeared altogether), Acorn just kept selling variations on the same 8-bit theme. Then, all of a sudden, in 1987, they launched a machine known as Archimedes. It was based on an entirely new processor; the Acorn Risc Machine. This was fully 32-bit data, although it only boasted a 26 bit (equivalent) address bus. It was the first RISC-based home micro in production.

" The ARM chip owed a lot to the experience of its designers with the 6502 upon which its instruction set was based, but it introduced a couple of new ideas. First it had four processor modes with 16 general-purpose registers available. Some of the 16 were different in each mode. It also introduced conditional execution of instructions, avoiding many jumps in code, and helping increase the efficiency of the pipeline. The other interesting feature was its ability to use a barrel-shifter on one of the operands of an instruction with no performance penalty. In other words, a multiply and add can be done in one instruction. This is the kind of technology that Intel are hyping with their 'MMX' Pentiums. Yes, I know MMX is more than that, but it does say something...

Variants

The first ARM chip was available as a second processor for Acorn's 8-bit micros. The ARM chip in the Archimedes was an ARM 2 which ran at 8 MHz. The ARM 3 was installed in several later machines running at speeds up to 25 MHz. Its greatest performance boost came from a simple onboard 4k cache. It was after this that ARM Ltd was spun off from Acorn and started licensing the designs. They came up with the ARM 6 macrocell (what happened to 4 and 5?) and turned it into the ARM 610 processor used in the first Risc PCs. It was coupled with an 8k cache, full 32-bit addressing mode, better cache algorithms and 30 MHz clock. The ARM 710 soon followed with a few performance tweaks, running at 40 MHz, and the ARM 810 was announced.

Then along came Digital. I'm not sure who initiated the pairing, but somehow Digital Equipment Corp, makers of the blindingly fast Alpha processors, got hold of the ARM designs, and built a processor using their semiconductor expertise. The result was the StrongARM; a processor that functionally is little different from the ARM 710 except that it is (internally) clocked at 202 MHz. Oh yes, it also has two 8k caches; one for instructions and one for data. Rumour has it that the interpreter of RiscOS's built-in BASIC fits neatly into the instruction cache. If this is the case, it explains why interpreted BBC BASIC V is so flippin' fast. The other thing, and this is the cause of most of the few software problems, is that the length of the pipeline has been increased, so that self-modifying code which relies on knowing the length of the pipeline to calculate the PC gets in a real mess."

-- http://www.landley.net/history/mirror/acorn/processors.html

ARM opinions

" I'll just cover those things I really like about ARM in general :)

1. load/store multiple of any arbitrary register combination Yes, thats right. One can do "STM r0, {r0-r15}" if they want to and save every register. LDM is the same.

2. Address updates available for every memory instruction Reusing STM from above, "STM r0!, {r1-r15}", will write the final address to r0 (I've forgotten the exact specifics here). Pretty much every memory op supports this

3. The stack is my territory, and mine alone The processor will never touch the stack. I don't have to deal with processor built stack frames. This greatly simplifies some things

4. Pre-shifts available on all basic ALU instructions (Where "basic ALU" is defined as pretty much everything except MUL. ARM doesn't have division)

This is an incredibly useful feature, though it does make the instructions occasionally look like huge monstrosities! It also means that ARM's ADD instruction can double for most architecture's LEA.

5. Three operand instruction set Well, that one should be reasonably clear ;)

6. No mode flags (or those which exist are implicit) For example, while there are both the ARM and Thumb instruction sets, they're designated by the least significant bit of the branch target address. The BX/BLX instructions automatically move this bit into the current program status register (CPSR)

7. PC is in the register file Yes, you can do "MOV pc, lr" (this is the traditional way to return), and can use the ALU operations for relative branches.

(Caveat: On machines prior to ARMv7 [ARM11 and older processors], these instructions will not transition to/from Thumb mode and the result of loading the least significant bits of PC is Unpredictable. ARMv7 makes them interwork properly with Thumb)

(By the way - when ARM say Unpredictable they mean "May raise a trap, may do something completely unrelated, may be a NOP - behaviour is undefined except that it cannot cause a security hole" and be redefined by future revisions) " -- http://forum.6502.org/viewtopic.php?t=1594
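Point 6 of the quoted list, where the instruction set is selected by the least significant bit of the branch target, can be sketched in Python as follows. This is a simplified model of BX with invented names: it returns the new program counter and a Thumb-state flag, and ignores everything else the instruction does.

```python
MASK32 = 0xFFFFFFFF

def bx(target):
    """Model of BX: bit 0 of the target selects Thumb state and is then
    stripped from the address that goes into the PC."""
    thumb_state = bool(target & 1)
    pc = target & ~1 & MASK32
    return pc, thumb_state

print(bx(0x8001))  # odd target: continue in Thumb state at 0x8000
print(bx(0x8000))  # even target: continue in ARM state at 0x8000
```

This is why function pointers to Thumb code have their low bit set: the address itself carries the mode.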

ARM: Links

Thumb (thumb1, i think) ISA https://ece.uwaterloo.ca/~ece222/ARM/ARM7-TDMI-manual-pt3.pdf

ARM: summary

It seems like the 'core' instruction set is indeed the set found in Cortex M0, M0+, and M1. This is a subset of the 16-bit thumb2 set, but with a few 32-bit instructions too.

Those instructions are: MOV, arithmetic (ADD, ADC, SUB, SBC, RSB, MUL), bitwise arithmetic (LSL, LSR, ASR, AND, ORR, EOR, ROR, BIC, MVN), byte reversals (REV, REV16, REVSH), get/set special registers (ADR, MRS, MSR), comparisons (CMP, CMN, TST), branching (B, BL), load/stores with immediate, register offset, PC, SP offset, and multiple registers, push/pop, extension (SXTH, SXTB, UXTH, UXTB), misc control (SVC, NOP), multiprocessing and memory barriers (YIELD, WFE, WFI, SEV, DMB, DSB), and a few other misc instructions (ISB and some others).

When we get to the Cortex M3 we add 32-bit instructions for bit fields (BFC/BFI, SBFX, UBFX), multiprocessing (LDREX, STREX, CLREX), bitwise arithmetic (CLZ, MOVT, ORN, RRX, saturating versions of things), comparisons (TEQ), various loads and stores (with postindexing and various widths), arithmetic (division, multiply-accumulate (add/subtract) operations with various widths), branch tables (TBB, TBH), and some other misc instructions (DBG, PLD, PLI).