proj-plbook-plChArmIsa

ARM: Intro

https://en.wikipedia.org/wiki/ARM_architecture#32-bit_architecture

http://users.ece.utexas.edu/~valvano/EE345M/Arm_EE382N_4.pdf

https://sourceware.org/cgen/gen-doc/arm-thumb-insn.html list of instructions with names, todo

A recent addition to the ARM ISA family is ARM64 (ARMv8 A64 / AArch64), described on the pages http://www.arm.com/products/processors/instruction-set-architectures/index.php http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0677b/ch01s01.html http://www.arm.com/files/downloads/ARMv8_Architecture.pdf http://www.cs.utexas.edu/~peterson/arm/DDI0487A_a_armv8_arm_errata.pdf http://www.arm.com/files/pdf/ARMv8R__Architecture_Oct13.pdf.

ARM has various versions and 3 profiles: A (full-featured, for use as e.g. the CPU of a smartphone or computer; has a virtual-addressing MMU), R (real-time, for use in e.g. car engines; has a deterministic (I think) physical-addressing MMU), M (microcontroller; only supports the Thumb ISA). The latest version is v8, but according to the ARM Wikipedia page only A and R profiles are (yet) available for v8. v7 has all 3 profiles (e.g. http://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readings/ARMv7-M_ARM.pdf ). There's also an E-M which is like M with a DSP extension, found in v7.

ARM Thumb: "The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions." -- (ARM7TDMI Technical Reference Manual Revision: r4p1) "The Thumb instruction set provides better code density, at the expense of inferior performance....Thumb-2, a major enhancement of the Thumb instruction set. Thumb-2 provides almost exactly the same functionality as the ARM instruction set. It has both 16-bit and 32-bit instructions, and achieves ARM-like performance with Thumb-like code density." -- (RealView Compilation Tools Assembler Guide Version 4.0) https://en.wikipedia.org/wiki/ARM_Cortex-M

"The biggest register difference involves the SP register. The Thumb state has unique stack mnemonics (PUSH, POP) that don't exist in the ARM state. These instructions assume the existence of a stack pointer, for which R13 is used. They translate into load and store instructions in the ARM state. " -- http://www.embedded.com/electronics-blogs/beginner-s-corner/4024632/Introduction-to-ARM-thumb

"The original Thumb-Instruction set only contained 16-bit instructions. Thumb2 introduced mixed 16/32 bit instructions....The ARM processor has 2 instruction sets, the traditional ARM set, where the instructions are all 32-bit long, and the more condensed Thumb(2) set, where most common instructions are 16-bit long (and some are 32-bit long)." -- http://stackoverflow.com/questions/10638130/thumb-instruction-in-arm

Some instructions have immediate addressing modes and others do not. I won't bother to include that information because my interest here is mainly in the instruction set. I leave out some instructions that are, to me, uninteresting variants of existing ones. Note that the purpose of these listings is not accuracy, but rather to get a sense of what sorts of instructions are in RISC-ish CPU instruction sets.

Note that in Thumb2, instructions cannot reference the PC (program counter) or SP (stack pointer) as operands, including destination operand, unless noted. Note that every instruction that returns a result takes an operand specifying the destination register; operations are NOT done in place on the input registers (except when the destination register given is the same as an input register).

ARM has 'barrel shifting', meaning that shifts and rotates can be performed on operands without issuing separate instructions.

It has a clever way of representing 32-bit immediate values with only 12 bits (an 8-bit value plus 4 bits that select a rotation), which allows it to represent any power of 2 as an immediate value: http://alisdair.mcdiarmid.org/2014/01/12/arm-immediate-value-encoding.html . "Thumb-2 immediate encoding is even more gleeful--in addition to allowing rotation, it also allows for spaced repetition of any 8-bit pattern (common in low level hack patterns, like from [1]) to be encoded in single instructions." -- https://news.ycombinator.com/item?id=7046803 . If the value you want isn't accessible as an immediate, you can load it from a constant table or you can compute it, or some instruction sets have MOVW and MOVT which can construct and combine 16-bit immediates into a 32-bit value. Some assemblers let you just specify the immediate and the assembler figures out how to get it ( https://news.ycombinator.com/item?id=7045898 ).
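As a rough illustration of the classic (32-bit ARM state) immediate encoding described in the linked article, here is a minimal Python sketch. It assumes the scheme is "an 8-bit value rotated right by twice a 4-bit field"; the function names are my own, and the Thumb-2 variant with repeated byte patterns is not modeled.

```python
MASK32 = 0xFFFFFFFF

def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32

def encode_arm_immediate(value):
    """Return (imm8, rot4) such that value == ror32(imm8, 2 * rot4), or None.

    Assumption: classic ARM data-processing immediates are an 8-bit value
    rotated right by an even amount (2 * rot4, with rot4 in 0..15).
    """
    value &= MASK32
    for rot4 in range(16):
        # Rotating left by 2*rot4 undoes the encoding's right rotation.
        imm8 = ror32(value, 32 - 2 * rot4)
        if imm8 < 256:
            return imm8, rot4
    return None

def movw_movt(low16, high16):
    """Model of building a 32-bit constant from two 16-bit immediates."""
    return ((high16 & 0xFFFF) << 16) | (low16 & 0xFFFF)
```

For example, encode_arm_immediate(0x100) finds the pair (1, 12), while 0x101 has no encoding and would need a literal pool load or a MOVW/MOVT pair.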

ARM instructions traditionally encoded a conditional execution field, allowing instructions to be skipped depending on the flags, without doing a branch. On ARM64 this has been changed:

" arm64 ... sort of ditches conditional execution. It’s not on every instruction any more, but it’s still available on more instructions than on most other arches.

To the usual complement of typical conditional instructions (branch, add/sub with carry, select and set), arm64 adds select with increment, negate, or inversion, the ability to conditionally set to -1 as well as +1, and the ability to conditionally compare and merge the flags in a fairly flexible manner (it’s really a conditional select of condition flags between the result of a comparison and an immediate). This actually preserves most of the power of conditional execution (except for really exotic hand-coded usages), while taking up much less encoding space. " -- stephencanon , https://news.ycombinator.com/item?id=7047762

ARM has 8 operating modes. "Each mode has its own mode-specific registers, including a status register":

(descriptions from http://www.cs.virginia.edu/~skadron/cs433_s09_processors/arm11.pdf )

Addressing modes ( http://www.cs.uregina.ca/Links/class-info/301/ARM-addressing/lecture.html ):

For ARM64 (AArch64), see also https://developer.arm.com/documentation/102374/0101/Loads-and-stores---addressing , which presents just 4 addressing modes applied only to loads/stores:

The AArch64 spec ( https://developer.arm.com/documentation/ddi0487/latest/ ) speaks of other "addressing modes", but as far as I can tell from section C1.3 "Address generation" subsection "Address calculation", these are just ways to compute addresses with instructions like ADD, rather than ways to avoid using a separate instruction to compute an address.

The notes in section C1.3 "Address generation" subsection "Address calculation" indicate that when using an ADD instruction to add an immediate offset to a base address, the size of the immediate is 12 bits.

I can't tell if there is a way to use a single ADD instruction to compute (base + scale*index + immediate_offset), but it appears to me that this would require two instructions, one to add the scaled index, and a second to add the immediate offset.
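To make the two-instruction guess concrete, here is a small Python model. It is a sketch under assumptions, not something taken from the spec text: I assume one ADD with a left-shifted register operand followed by one ADD with an unshifted 12-bit unsigned immediate, and the function names are invented for illustration.

```python
MASK64 = (1 << 64) - 1

def add_shifted_register(base, index, shift):
    """Model of an ADD with a shifted register operand: base + (index << shift)."""
    return (base + (index << shift)) & MASK64

def add_immediate(base, imm12):
    """Model of an ADD with a 12-bit unsigned immediate (no optional shift)."""
    if not 0 <= imm12 < (1 << 12):
        raise ValueError("immediate does not fit in 12 bits")
    return (base + imm12) & MASK64

def element_address(base, index, shift, offset):
    """base + (index << shift) + offset, computed as two separate additions."""
    scaled = add_shifted_register(base, index, shift)  # first instruction
    return add_immediate(scaled, offset)               # second instruction
```

With base 0x1000, index 3, 8-byte elements (shift 3) and offset 16, the result is 0x1028; an offset of 4096 or more is rejected by this model because it no longer fits the 12-bit field.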

ARM: 16-bit Thumb2 instructions

MOV
LSL (logical shift left; LSL r1 r2 r3: r1 := r2 << r3)
LSR (logical shift right)
ASR (arithmetic shift right)
ADD (note: the source and/or destination operands for ADD can include SP, the stack pointer; in this way you can get the SP into a register)
SUB (note: the source and destination operands for SUB can include SP, the stack pointer)

ADR (Add immediate to program counter; in this way you can get the PC into a register; useful for getting the address of a 'label' if your assembler translates labels to relative offsets )

CMP

AND EOR (xor)

ADC (Add with Carry; a + b + carry bit)
SBC (Subtract with Carry; a - b - (1 - carry bit), since ARM's carry flag acts as an inverted borrow)
ROR (Rotate Right)
TST (Test bits; TST Rn Rm: update condition code flags on Rn AND Rm)
RSB (Reverse subtract (from zero; i.e. negate))
CMP (update condition code flags on Rn - Rm)
CMN (Compare Negative; update condition code flags on Rn + Rm)
ORR (or)
MUL
BIC (Bit Clear: x AND (NOT y))
MVN (Move NOT: bitwise negation)

BL (branch with link; BL <label>: LR register = address of next instruction, PC = label)

BX (Branch and Exchange; this is used to enter/exit "thumb state") BLX (Branch with Link and Exchange; this is used to enter/exit "thumb state")

Load and store:

STR (Store word. Addressing modes include immediate, register offset, PC offset, SP offset. Can store list of multiple registers (STMIA).) also STRH for store halfword, STRB for byte

LDR (Load word. Addressing modes include immediate, register offset, SP offset. Can load list of multiple registers (LDMIA).) also LDRH for Load unsigned halfword, LDRSH for signed halfword, LDRB for unsigned byte, LDRSB for signed byte

LDR (load from literal pool instrs)
B (unconditional and conditional branch instructions; takes as an operand a 'condition field' (this is different from a condition code), which is one of: equal; not equal; Carry Set / Unsigned higher or same; Carry Clear / Unsigned lower; Negative; Positive or zero; Overflow; No overflow; Unsigned higher; Unsigned lower or same; Signed greater than or equal; Signed less than or equal; Signed greater than; Signed less than; always)
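The condition fields listed for B are all tests on the four flags N, Z, C, V. The Python sketch below models how a CMP would set those flags and how each condition reads them. The two-letter mnemonics (EQ, NE, ...) are the usual ARM names; the function names and the 32-bit register width are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags (N, Z, C, V) produced by comparing a with b, i.e. computing a - b."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31                           # sign bit of the result
    z = int(result == 0)
    c = int(a >= b)                            # carry set means "no borrow"
    v = ((a ^ b) & (a ^ result)) >> 31         # signed overflow of a - b
    return n, z, c, v

CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,             # equal
    "NE": lambda n, z, c, v: z == 0,             # not equal
    "CS": lambda n, z, c, v: c == 1,             # carry set / unsigned higher or same
    "CC": lambda n, z, c, v: c == 0,             # carry clear / unsigned lower
    "MI": lambda n, z, c, v: n == 1,             # negative
    "PL": lambda n, z, c, v: n == 0,             # positive or zero
    "VS": lambda n, z, c, v: v == 1,             # overflow
    "VC": lambda n, z, c, v: v == 0,             # no overflow
    "HI": lambda n, z, c, v: c == 1 and z == 0,  # unsigned higher
    "LS": lambda n, z, c, v: c == 0 or z == 1,   # unsigned lower or same
    "GE": lambda n, z, c, v: n == v,             # signed greater than or equal
    "LT": lambda n, z, c, v: n != v,             # signed less than
    "GT": lambda n, z, c, v: z == 0 and n == v,  # signed greater than
    "LE": lambda n, z, c, v: z == 1 or n != v,   # signed less than or equal
    "AL": lambda n, z, c, v: True,               # always
}

def branch_taken(condition, a, b):
    """Would 'CMP a, b' followed by 'B<condition> label' take the branch?"""
    return CONDITIONS[condition](*cmp_flags(a, b))
```

Comparing 0xFFFFFFFF with 1 shows why both signed and unsigned conditions exist: HI holds (a large unsigned value) and so does LT (the same bits read as -1).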

SVC (service (system) call instructions; formerly SWI)
SETEND (set endianness)
CPS (change processor state; enables and disables specified interrupts)
BKPT (software breakpoint)
IT (If-Then; "Makes up to four following instructions conditional, according to pattern. pattern is a string of up to three letters. Each letter can be T (Then) or E (Else).")

Adjust stack pointer instructions: ADD (SP plus immediate; increments the stack pointer), SUB (SP minus immediate; decrements the stack pointer)

Sign or zero extend instructions (these are used to convert a signed or unsigned value of a certain byte width into a value of a larger byte width, e.g. to convert a signed byte representing "-10" to a signed word representing "-10"; see http://odellconnie.blogspot.com/2012/03/sign-extension-zero-extension.html ):
SXTH (Signed Extend Halfword to Word; SXTH Rd Rm: Rd[31:0] := SignExtend(Rm[15:0]))
SXTB (Signed Extend Byte to Word: Rd[31:0] := SignExtend(Rm[7:0]))
UXTH (Unsigned Extend Halfword to Word: Rd[31:0] := ZeroExtend(Rm[15:0]))
UXTB (Unsigned Extend Byte to Word: Rd[31:0] := ZeroExtend(Rm[7:0]))
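The "-10" example can be checked with a few lines of Python. This is only a model of the four extend instructions on 32-bit registers; the helper name sign_extend is my own.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a 32-bit word."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    # XOR then subtract moves the sign bit's weight from +2^(bits-1) to -2^(bits-1).
    return ((value ^ sign_bit) - sign_bit) & MASK32

def sxtb(rm):
    return sign_extend(rm, 8)

def sxth(rm):
    return sign_extend(rm, 16)

def uxtb(rm):
    return rm & 0xFF

def uxth(rm):
    return rm & 0xFFFF
```

The byte 0xF6 is -10 as a signed byte: sxtb(0xF6) gives 0xFFFFFFF6 (still -10 as a signed word), whereas uxtb(0xF6) gives 0x000000F6 (246).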

Compare and branch on (non-)zero instructions CBZ (Compare and branch on zero; CBZ r <label>: if r == 0, goto <label>) CBNZ (Compare and branch on non-zero)

PUSH (push selected registers onto stack) POP (pop selected registers from stack)

Reverse byte instructions:
REV (Byte-Reverse Word, i.e. reverse the ordering of the four bytes in the word (and put the result in the destination register))
REV16 (Byte-Reverse Packed Halfword, i.e. reverse the ordering of the two bytes in both halfwords)
REVSH (Byte-Reverse Signed Halfword, i.e. reverse the bytes in the low halfword, and sign extend the result to fill the whole word)
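A short Python model of the three byte-reversal instructions, operating on 32-bit values. This is a sketch of the behaviour as described above, not code from any ARM manual.

```python
MASK32 = 0xFFFFFFFF

def rev(x):
    """Reverse the order of the four bytes in a word."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")

def rev16(x):
    """Swap the two bytes inside each halfword independently."""
    x &= MASK32
    return ((x & 0x00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF)

def revsh(x):
    """Swap the bytes of the low halfword, then sign-extend to a word."""
    swapped = ((x & 0xFF) << 8) | ((x >> 8) & 0xFF)
    if swapped & 0x8000:
        swapped |= 0xFFFF0000
    return swapped
```

For instance, rev(0x12345678) is 0x78563412 and rev16(0x12345678) is 0x34127856; these are the typical building blocks for endianness conversion.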

NOP-compatible hint instructions: NOP YIELD (Yield control to alternative thread) WFE (Wait For Event) WFI (Wait For Interrupt) SEV (Send event; signal event in multiprocessor system)

ARM: 32-bit Thumb2 instructions

ORN (OR (not)) TEQ (update condition code flags on a XOR b) MOVT (move the source halfword into the top halfword of the destination register) BFC (Bit Field Clear; set specified bits to zero; takes a starting bit and a bitwidth) BFI (Bit Field Insert; set specified bits to specified values; takes a starting bit and a bitwidth and a source value)

SBFX (Signed Bit Field extract) SSAT (Signed saturate, LSL, ASR) SSAT16 (Signed saturate 16-bit) UBFX (Unsigned Bit Field extract) USAT (Unsigned saturate, LSL, ASR) USAT16 (Unsigned saturate 16-bit)
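The bit-field and saturate instructions are easier to picture with a small model. The Python sketch below assumes fields are given as (lsb, width), as described above for BFC and BFI; ssat and usat take an ordinary signed Python integer rather than a register bit pattern and ignore the optional LSL/ASR pre-shift, both simplifications of mine.

```python
MASK32 = 0xFFFFFFFF

def ubfx(rn, lsb, width):
    """Unsigned Bit Field Extract: the field, zero-extended."""
    return (rn >> lsb) & ((1 << width) - 1)

def sbfx(rn, lsb, width):
    """Signed Bit Field Extract: the field, sign-extended to 32 bits."""
    field = ubfx(rn, lsb, width)
    sign_bit = 1 << (width - 1)
    return ((field ^ sign_bit) - sign_bit) & MASK32

def bfc(rd, lsb, width):
    """Bit Field Clear: zero `width` bits of rd starting at bit `lsb`."""
    mask = ((1 << width) - 1) << lsb
    return rd & ~mask & MASK32

def bfi(rd, rn, lsb, width):
    """Bit Field Insert: copy the low `width` bits of rn into rd at `lsb`."""
    mask = ((1 << width) - 1) << lsb
    return ((rd & ~mask) | ((rn << lsb) & mask)) & MASK32

def ssat(value, bits):
    """Clamp a signed integer to the signed `bits`-bit range."""
    return max(-(1 << (bits - 1)), min((1 << (bits - 1)) - 1, value))

def usat(value, bits):
    """Clamp a signed integer to the unsigned `bits`-bit range."""
    return max(0, min((1 << bits) - 1, value))
```

As an example, ssat(200, 8) clamps to 127 instead of wrapping around, which is the point of saturating arithmetic in audio and DSP-style code.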

PKH (Pack halfword, BT, TB) RRX (Rotate Right with Extend)

Signed and unsigned extend instructions with optional addition: SXTAB (Signed extend byte and add) SXTAB16 (Signed extend two bytes to halfwords, and add) SXTAH (Signed extend halfword and add) SXTB16 (Signed extend two bytes to halfwords) UXTAB (Unsigned extend byte and add) UXTAB16 (Unsigned extend two bytes to halfwords, and add) UXTAH (Unsigned extend halfword and add) UXTB16 (Unsigned extend two bytes to halfwords)

SIMD add and subtract: QADD16, UADD16, QADD8, UADD8, QASX, UASX, QSUB16, UHADD16, QSUB8, UHADD8, QSAX, UHASX, SADD16, UHSUB16, SADD8, UHSUB8, SASX, UHSAX, SHADD16, UQADD16, SHADD8, UQADD8, SHASX, UQASX, SHSUB16, UQSUB16, SHSUB8, UQSUB8, SHSAX, UQSAX, SSUB16, USUB16, SSUB8, USUB8, SSAX

Mnemonic element: Meaning
Q prefix: Signed saturating arithmetic.
S prefix: Signed arithmetic, modulo 2^8 or 2^16.
SH prefix: Signed halving arithmetic. The result of the calculation is halved.
U prefix: Unsigned arithmetic, modulo 2^8 or 2^16.
UH prefix: Unsigned halving arithmetic. The result of the calculation is halved.
UQ prefix: Unsigned saturating arithmetic.
16 suffix: The instruction performs two 16-bit calculations.
8 suffix: The instruction performs four 8-bit calculations.
ASX mnemonic: The instruction performs one 16-bit addition and one 16-bit subtraction. The X indicates that the halfwords of the second operand are exchanged before the operation.
SAX mnemonic: The instruction performs one 16-bit subtraction and one 16-bit addition. The X indicates that the halfwords of the second operand are exchanged before the operation.
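To illustrate how the prefixes and suffixes combine, here is a Python sketch of the unsigned additions only (U, UQ and UH prefixes with the 8 and 16 suffixes). The helper names are invented, the signed variants and ASX/SAX forms are left out, and the GE flags that the real instructions may update are ignored.

```python
MASK32 = 0xFFFFFFFF

def split_lanes(word, width):
    """Split a 32-bit word into unsigned lanes of `width` bits, lowest first."""
    lane_mask = (1 << width) - 1
    return [(word >> shift) & lane_mask for shift in range(0, 32, width)]

def join_lanes(lanes, width):
    """Pack unsigned lanes (lowest first) back into a 32-bit word."""
    lane_mask = (1 << width) - 1
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & lane_mask) << (i * width)
    return word & MASK32

def parallel_unsigned_add(a, b, width, prefix):
    """prefix 'U': modulo 2^width; 'UQ': saturating; 'UH': halving."""
    if prefix not in ("U", "UQ", "UH"):
        raise ValueError("unsupported prefix")
    lane_max = (1 << width) - 1
    out = []
    for x, y in zip(split_lanes(a, width), split_lanes(b, width)):
        total = x + y
        if prefix == "UQ":
            total = min(total, lane_max)   # clamp instead of wrapping
        elif prefix == "UH":
            total >>= 1                    # halve the full-precision sum
        out.append(total & lane_max)
    return join_lanes(out, width)

def uadd8(a, b):
    return parallel_unsigned_add(a, b, 8, "U")

def uqadd8(a, b):
    return parallel_unsigned_add(a, b, 8, "UQ")

def uhadd8(a, b):
    return parallel_unsigned_add(a, b, 8, "UH")

def uadd16(a, b):
    return parallel_unsigned_add(a, b, 16, "U")
```

Adding 0x01010101 to 0xFF010203 shows the difference in the top byte: uadd8 wraps it to 0x00, uqadd8 clamps it to 0xFF, and uhadd8 yields 0x80.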

CLZ (Count Leading Zeros (just what it sounds like))
QADD (Saturating Add)
QDADD (Saturating Double and Add)
QDSUB (Saturating Double and Subtract)
QSUB (Saturating Subtract)
RBIT (Reverse Bits)
SEL (Select bytes; uses the 4 GE flag bits, which control, for each of the four byte positions of the output, which of the two input words contributes that byte)
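A Python sketch of CLZ, RBIT, QADD and SEL on 32-bit values. The GE flags are passed in as a plain 4-bit integer, and I assume that a set GE bit selects the byte from the first operand; both are choices made for this illustration.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def clz(x):
    """Count leading zero bits of a 32-bit word (32 for zero)."""
    return 32 - (x & MASK32).bit_length()

def rbit(x):
    """Reverse the order of all 32 bits."""
    return int(format(x & MASK32, "032b")[::-1], 2)

def qadd(a, b):
    """Signed saturating add, returned as a 32-bit pattern."""
    total = to_signed(a) + to_signed(b)
    total = max(-(1 << 31), min((1 << 31) - 1, total))
    return total & MASK32

def sel(rn, rm, ge):
    """Per byte i: take the byte from rn if bit i of ge is set, else from rm."""
    result = 0
    for i in range(4):
        source = rn if (ge >> i) & 1 else rm
        result |= source & (0xFF << (8 * i))
    return result & MASK32
```

For example, qadd(0x7FFFFFFF, 1) stays at 0x7FFFFFFF rather than wrapping to the most negative value.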

multiply/divide and accumulate (add/subtract the result of multiplying to the destination, in-place), with various different byte widths of the operands and destination register(s): MLA (multiply and accumulate; x + (y*z)) MLS (multiply and subtract) SMLAxy (Signed Multiply-Accumulate Add, with double-length result) SMLAD (Signed Dual Multiply-Accumulate Add) SMLAWx (Signed Multiply-Accumulate Add) SMLSD (Signed Dual Multiply Subtract and Accumulate) SMMLA (Signed 32 + 32 x 32-bit, most significant word) SMMLS (Signed 32 – 32 x 32-bit, most significant word) SMMUL (Signed 32 x 32-bit, most significant 32-bit word) SMUAD (Signed Dual Multiply Add) SMULxy SMULWx SMUSD (Signed Dual Multiply Subtract) USAD8 (Unsigned Sum of Absolute Differences) USADA8 (Unsigned Accumulate Absolute Differences)

with 64-bit results (two registers to hold result): SMULL (Signed multiply with double-length result) UMULL (Unsigned multiply with double-length result) SDIV (Signed divide) UDIV (Unsigned divide) SMLALxy (Signed multiply with double-length result and accumulate) SMLALD (Signed Multiply Accumulate Long Dual) SMLSLD (Signed Multiply Subtract accumulate Long Dual) UMLAL (Unsigned 64 + 32 x 32) UMAAL (Unsigned multiply and accumulate with double-length result)
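The difference between a 32-bit and a double-length multiply result can be sketched in Python. The (low word, high word) return order and the function signatures are my own convention for this sketch.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1

def to_signed(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mul(rn, rm):
    """Multiply keeping only the lower 32 bits of the product."""
    return (rn * rm) & MASK32

def mla(rn, rm, ra):
    """Multiply and accumulate: ra + rn * rm, truncated to 32 bits."""
    return (ra + rn * rm) & MASK32

def umull(rn, rm):
    """Unsigned 32 x 32 -> 64; returns (low word, high word)."""
    product = (rn & MASK32) * (rm & MASK32)
    return product & MASK32, product >> 32

def smull(rn, rm):
    """Signed 32 x 32 -> 64; returns (low word, high word)."""
    product = (to_signed(rn) * to_signed(rm)) & MASK64
    return product & MASK32, product >> 32
```

Note that mul(0x10000, 0x10000) is 0 because the true product 2^32 does not fit in 32 bits, while umull(0x10000, 0x10000) returns (0, 1) with the high word preserved in a second register.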

loads and stores:

LDRD (load double) STRD (store double) LDREX (load exclusive word; something to do with semaphores) STREX (store exclusive word; something to do with semaphores) CLREX (clear local processor exclusive tag; something to do with semaphores)

TBB (Table Branch Byte) TBH (Table Branch Halfword)

LDMDB / LDMEA (Load Multiple Decrement Before / Empty Ascending) RFE (Return From Exception) SRS (Store Return State) STMDB / STMFD (Store Multiple Decrement Before / Full Descending)

MRS (Move from Status register to ARM Register, e.g. put the condition codes into a register) MSR (Move from ARM register to Status register, e.g. copy a register over the condition codes) SUBS (Return From Exception without stack)

DBG (Debug hint)

Special control operations: CLREX (Clear Exclusive) DSB (Data Synchronization Barrier) DMB (Data Memory Barrier) ISB (Instruction Synchronization Barrier)

Coprocessor instructions: not listed

Links:

ARM: Cortex M profile

Cortex M0, M0+, and M1 only have these instructions:

16-bit: ADC, ADD, ADR, AND, ASR, B, BIC, BKPT, BLX, BX, CMN, CMP, CPS, EOR, LDM, LDR, LDRB, LDRH, LDRSB, LDRSH, LSL, LSR, MOV, MUL, MVN, NOP, ORR, POP, PUSH, REV, REV16, REVSH, ROR, RSB, SBC, SEV, STM, STMIA, STR, STRB, STRH, SUB, SVC, SXTB, SXTH, TST, UXTB, UXTH, WFE, WFI, YIELD

32-bit: BL (branch with link), DMB (Data Memory Barrier; Ensure the order of observation of memory accesses), DSB (Data Synchronization Barrier; Ensure the completion of memory accesses), ISB (Instruction Synchronization Barrier; flush processor pipeline and branch prediction logic), MRS (Move from Status register), MSR (move to status register)

Note that the 16-bit instruction set is identical to the 16-bit Thumb-2 instruction set above, except for SETEND (set endianness), IT (if-then), CBZ (Compare and branch on zero), and CBNZ. (Also, BL here appears only as 32-bit, whereas above it was listed in the 16-bit instruction set; I think this is because the original Thumb encoded BL as a pair of 16-bit halfwords, which Thumb-2 treats as a single 32-bit instruction.) IT, CBZ, CBNZ are added in the Cortex M3, as well as a bunch of 32-bit instructions:

new 32-bit instructions in the Cortex M3: BFC (Bit Field Clear), BFI (Bit Field Insert), CDP (?), CLREX (clear local processor exclusive tag), CLZ (count leading zeros), DBG (debug hint), various loads (LDC, LDMA, LDMDB, LDRBT, LDRD, LDREX, LDREXB, LDREXH, LDRHT, LDRSB, LDRSBT, LDRSHT, LDRT), MCR (?), MLS (multiply and subtract), MCRR (?), MLA (multiply and accumulate; x + (y*z)), MOVT (move the source halfword into the top halfword of the destination register), MRC (?), MRRC (?), ORN (x or (not y)), PLD (preload data), PLDW, PLI (preload instructions), RRX (Rotate Right with Extend), SBFX (Signed Bit Field extract), SDIV (Signed divide), SMLAL (an SMULL-like thingee), SMULL, SSAT (signed saturate), STC (?), various stores (STMDB, STRBT, STRD, STREX, STREXB, STREXH, STRHT, STRT), TBB (Table Branch Byte), TBH (Table Branch Halfword), TEQ (update condition code flags on a XOR b), UBFX (Unsigned Bit Field extract), UDIV (Unsigned divide), other multiply, multiply-accumulate, and saturate instructions (UMLAL, UMULL, USAT)

Note that http://www.eetimes.com/document.asp?doc_id=1319726 claims that "SoCs based on ARM's M0+ Flycatcher core will not run Linux, although they do hit the sub-50-cent price point for the IoT, including security engines and targeted peripherals."

As of this writing, the Cortex M0+ seems to be the leading design for 32-bit tiny low-power devices. There are very small versions of them, e.g. http://cache.freescale.com/files/microcontrollers/doc/fact_sheet/KINETISKL02CSPFS.pdf?fpsp=1 which is 16 mm^2. This device runs about 48 MHz and the M0+ design yields about 1 MIPS/MHz, which means that according to http://www.roylongbottom.org.uk/mips.htm it's about as powerful as a 486! It has 32KB flash RAM (presumably for program storage) and 4 KB RAM. Intel recently released a small low-power chip called the Quark which is a SoC with a 486 ISA, 512 KB SRAM, 16 KB cache.

ARM Cortex M0 instruction list

from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0432c/CHDCICDF.html and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CIHJJEIH.html

[...] ldr(b|h|sb|sh) (load byte / halfword / signed byte / signed halfword), ldm (load multiple), str (store), str(b|h), stm (store multiple), (push|pop) (push/pop registers onto/from stack)
[...] (u|s)xt(b|h) (extend unsigned/signed byte/halfword)
[...] cps(id|ie) (disable/enable interrupts), (mrs|msr) (read/write special register), bkpt (breakpoint)

More notes on ARM Cortex Ms

from https://en.m.wikipedia.org/wiki/ARM_Cortex-M

"

The Cortex-M0 / M0+ / M1 implement the ARMv6-M architecture,[9] the Cortex-M3 implements the ARMv7-M architecture,[10] and the Cortex-M4 / M7 implements the ARMv7E-M architecture.[10] The architectures are binary instruction upward compatible from ARMv6-M to ARMv7-M to ARMv7E-M. Binary instructions available for the Cortex-M0 / M0+ / M1 can execute without modification on the Cortex-M3 / M4 / M7. Binary instructions available for the Cortex-M3 can execute without modification on the Cortex-M4 / M7 / M33.[9][10] Only Thumb-1 and Thumb-2 instruction sets are supported in Cortex-M architectures, but the legacy 32-bit ARM instruction set isn't supported.

All six Cortex-M cores implement a common subset of instructions that consists of most Thumb-1, some Thumb-2, including a 32-bit result multiply. The Cortex-M0 / M0+ / M1 / M23 were designed to create the smallest silicon die, thus having the fewest instructions of the Cortex-M family.

The Cortex-M0 / M0+ / M1 include Thumb-1 instructions, except new instructions (CBZ, CBNZ, IT) which were added in ARMv7-M architecture. The Cortex-M0 / M0+ / M1 include a minor subset of Thumb-2 instructions (BL, DMB, DSB, ISB, MRS, MSR). The Cortex-M3 / M4 / M7 / M33 have all base Thumb-1 and Thumb-2 instructions. The Cortex-M3 adds three Thumb-1 instructions, all Thumb-2 instructions, hardware integer divide, and saturation arithmetic instructions. The Cortex-M4 adds DSP instructions and an optional single-precision floating-point unit (VFPv4-SP). The Cortex-M7 adds an optional double-precision FPU (VFPv5).[9][10] ...

    The 32-bit ARM instruction set is not included in Cortex-M cores.
    Endianness is chosen at silicon implementation in Cortex-M cores. Legacy cores allowed "on-the-fly" changing of the data endian mode.
    Co-processors aren't supported on Cortex-M cores.

"

see also chart https://en.m.wikipedia.org/wiki/ARM_Cortex-M#Instruction_sets

" SysTick timer: A 24-bit system timer that extends the functionality of both the processor and the Nested Vectored Interrupt Controller (NVIC). When present, it also provides an additional configurable priority SysTick interrupt.[9][10][11] Though the SysTick timer is optional, it is very rare to find a Cortex-M microcontroller without it. "

" Memory Protection Unit (MPU): Provides support for protecting regions of memory through enforcing privilege and access rules. It supports up to eight different regions, each of which can be split into a further eight equal-size sub-regions.[9][10][11] " -- Cortex-M3, M4, M7, and M23 have an MPU option

M0, M1, M0+ and the new M23 are Von Neumann; M3, M4, M7 are Harvard.

" The Cortex-M0 core is optimized for small silicon die size and use in the lowest price chips. (ARMv6-M) ... The Cortex-M0+ is an optimized superset of the Cortex-M0. (ARMv6-M architecture) ... The Cortex-M1 is an optimized core especially designed to be loaded into FPGA chips. (ARMv6-M) ... (Cortex-M3 is ARMv7-M) ... Conceptually the Cortex-M4 is a Cortex-M3 plus DSP instructions, and optional floating-point unit (FPU). If a core contains an FPU, it is known as a Cortex-M4F, otherwise it is a Cortex-M4. (ARMv7E-M) ... The Cortex-M7 is a high-performance core with almost double the power efficiency of the older Cortex-M4. (ARMv7E-M) ... The Cortex-M23 core was announced in October 2016[23] and based on the newer ARMv8-M architecture that was previously announced in November 2015.[24] Conceptually the Cortex-M23 is similar to a Cortex-M0+ plus integer divide instructions and TrustZone security features, and also has a 2-stage instruction pipeline. ... The Cortex-M33 core was announced in October 2016[23] and based on the newer ARMv8-M architecture that was previously announced in November 2015.[24] Conceptually the Cortex-M33 is similar to a Cortex-M4 plus TrustZone security features, and also has a 3-stage instruction pipeline. "

note: the 32-bit multiply on the Cortex-M23 only gives a 32-bit result! (the lower 32 bits)

"The Cortex-M0 / M0+ / M1 / M23 only has 32-bit multiply instructions with a lower-32-bit result (32bit × 32bit = lower 32bit), where as the Cortex-M3 / M4 / M7 / M33 includes additional 32-bit multiply instructions with 64-bit results (32bit × 32bit = 64bit)."

relevant parts of table "ARM Cortex-M instruction groups":

all Cortex-M parts have the following Thumb1 instrs (16-bit):

ADC, ADD, ADR, AND, ASR, B, BIC, BKPT, BLX, BX, CMN, CMP, CPS, EOR, LDM, LDR, LDRB, LDRH, LDRSB, LDRSH, LSL, LSR, MOV, MUL, MVN, NOP, ORR, POP, PUSH, REV, REV16, REVSH, ROR, RSB, SBC, SEV, STM, STMIA, STR, STRB, STRH, SUB, SVC, SXTB, SXTH, TST, UXTB, UXTH, WFE, WFI, YIELD (52 instrs)

and the following Thumb2 instrs (32-bit):

BL, DMB, DSB, ISB, MRS, MSR (6 instrs)

The M23 but not the M0+ has the following Thumb1 (16-bit):

CBNZ, CBZ

and the following Thumb2 (32-bit):

SDIV, UDIV

Only the M23 and M33 have the following TrustZone instrs:

16-bit: BLXNS, BXNS 32-bit: SG, TT, TTT, TTA, TTAT

Links:

ARM Cortex M4 floating-point

"The FPU fully supports single-precision add, subtract, multiply, divide, multiply and accumulate, and square root operations. It also provides conversions between fixed-point and floating-point data formats, and floating-point constant instructions."

Instructions:

" 7.2.5 Complete implementation of the IEEE 754 standard

The Cortex‑M4 FPU supports fused MAC operations as described in the IEEE standard. For complete implementation of the IEEE 754-2008 standard, floating-point functionality must be augmented with library functions. The Cortex‑M4 floating point instruction set does not support all operations defined in the IEEE 754-2008 standard. Unsupported operations include, but are not limited to the following:

    Remainder.
    Round floating-point number to integer-valued floating-point number.
    Binary-to-decimal conversions.
    Decimal-to-binary conversions.
    Direct comparison of single-precision and double-precision values."

" The FPU sets the cumulative exception status flag in the FPSCR register as required for each instruction, in accordance with the FPv4 architecture. The FPU does not support exception traps. The processor also has six output pins, FPIXC, FPUFC, FPOFC, FPDZC, FPIDC, and FPIOC, that each reflect the status of one of the cumulative exception flags. See the Cortex®‑M4 Integration and Implementation Manual for a description of these outputs. "

ARM interrupts

ARM provides nested vectored interrupts. "Nested" because if another interrupt occurs while the first one is executing, the currently executing interrupt may itself be interrupted. "Vectored" because each interrupt causes the code at the corresponding interrupt handler entry point to be executed [2] (as opposed to the alternative, "polled" interrupts, in which, during an interrupt, the system calls each handler out of a large group of handlers until one handler 'claims' the interrupt).

ARM has some built-in interrupt types (these have IRQ numbers less than 0); see [3] for a list of built-in interrupt types [4]. It also supports vendor specific interrupt types (which have non-negative IRQ numbers) "typically for devices like UART/I²C/USB/etc" [5].

Interrupts have priority levels. In ARM, a lower priority number is more urgent (that is, an interrupt may interrupt another currently executing interrupt if the currently executing interrupt has a higher priority number). There must be at least 4 priority levels available on any Cortex M0 or M0+ device; more on Cortex M3/M4/M7 devices. There are also 'subpriority' levels, which are used to determine which interrupt goes first when multiple interrupts are pending.

You can temporarily disable all interrupts. Disabling interrupts is sometimes called "masking" them [6], but other times a distinction is made between disabling (where the interrupt is never emitted or is completely ignored) and masking (where the interrupt is not delivered yet but held for later) [7].

On Cortex M3/M4/M7, but not on M0/M0+, you can also temporarily disable all interrupts higher than a certain priority (that is, LESS urgent than the given priority).
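The priority rules above can be summarised in a small Python model. This is a sketch under assumptions, not a description of the actual interrupt controller: the threshold parameter is a hypothetical stand-in for the priority-based masking just mentioned (here 0 means "no masking" and any priority number at or above the threshold is blocked), pending interrupts are (priority, irq_number) tuples, and ties are broken by the lower IRQ number.

```python
def can_preempt(new_priority, running_priority):
    """A lower priority number is more urgent, so only it can preempt."""
    return new_priority < running_priority

def is_masked(priority, threshold):
    """Hypothetical priority mask: 0 disables it, otherwise block >= threshold."""
    return threshold != 0 and priority >= threshold

def next_to_run(pending, running_priority=None, threshold=0):
    """Pick the pending (priority, irq_number) pair that should run next.

    Returns None if nothing is allowed to run right now.
    """
    candidates = [
        (priority, irq)
        for priority, irq in pending
        if not is_masked(priority, threshold)
        and (running_priority is None or can_preempt(priority, running_priority))
    ]
    # Tuple ordering picks the lowest priority number, then the lowest IRQ number.
    return min(candidates, default=None)
```

With pending interrupts [(3, 7), (1, 9), (1, 2)] and nothing running, the model picks (1, 2); if a priority-2 handler is already running, the priority-3 interrupt has to wait.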

For each interrupt, you can also:

The other built-in interrupts are: [10]

Cortex-M0/M0+/M1 can have up to 32 interrupts, M3/M4/M7/M23 can have up to 240, M33 up to 480. [12].

Links:

ARM history

" Acorn always had a reputation for weirdness, and I suppose this was the ultimate. While everyone else went 16-bit (or disappeared altogether), Acorn just kept selling variations on the same 8-bit theme. Then, all of a sudden, in 1987, they launched a machine known as Archimedes. It was based on an entirely new processor; the Acorn Risc Machine. This was fully 32-bit data, although it only boasted a 26 bit (equivalent) address bus. It was the first RISC-based home micro in production.

" The ARM chip owed a lot to the experience of its designers with the 6502 upon which its instruction set was based, but it introduced a couple of new ideas. First it had four processor modes with 16 general-purpose registers available. Some of the 16 were different in each mode. It also introduced conditional execution of instructions, avoiding many jumps in code, and helping increase the efficiency of the pipeline. The other interesting feature was its ability to use a barrel-shifter on one of the operands of an instruction with no performance penalty. In other words, a multiply and add can be done in one instruction. This is the kind of technology that Intel are hyping with their 'MMX' Pentiums. Yes, I know MMX is more than that, but it does say something...

Variants

The first ARM chip was available as a second processor for Acorn's 8-bit micros. The ARM chip in the Archimedes was an ARM 2 which ran at 8 MHz. The ARM 3 was installed in several later machines running at speeds up to 25 MHz. Its greatest performance boost came from a simple onboard 4k cache. It was after this that ARM Ltd was spun off from Acorn and started licensing the designs. They came up with the ARM 6 macrocell (what happened to 4 and 5?) and turned it into the ARM 610 processor used in the first Risc PCs. It was coupled with an 8k cache, full 32-bit addressing mode, better cache algorithms and 30 MHz clock. The ARM 710 soon followed with a few performance tweaks, running at 40 MHz, and the ARM 810 was announced.

Then along came Digital. I'm not sure who initiated the pairing, but somehow Digital Equipment Corp, makers of the blindingly fast Alpha processors, got hold of the ARM designs, and built a processor using their semiconductor expertise. The result was the StrongARM; a processor that functionally is little different from the ARM 710 except that it is (internally) clocked at 202 MHz. Oh yes, it also has two 8k caches; one for instructions and one for data. Rumour has it that the interpreter of RiscOS's built-in BASIC fits neatly into the instruction cache. If this is the case, it explains why interpreted BBC BASIC V is so flippin' fast. The other thing, and this is the cause of most of the few software problems, is that the length of the pipeline has been increased, so that self-modifying code which relies on knowing the length of the pipeline to calculate the PC gets in a real mess."

-- http://www.landley.net/history/mirror/acorn/processors.html

ARM opinions

" I'll just cover those things I really like about ARM in general :)

1. load/store multiple of any arbitrary register combination Yes, thats right. One can do "STM r0, {r0-r15}" if they want to and save every register. LDM is the same.

2. Address updates available for every memory instruction Reusing STM from above, "STM r0!, {r1-r15}", will write the final address to r0 (I've forgotten the exact specifics here). Pretty much every memory op supports this

3. The stack is my territory, and mine alone The processor will never touch the stack. I don't have to deal with processor built stack frames. This greatly simplifies some things

4. Pre-shifts available on all basic ALU instructions (Where "basic ALU" is defined as pretty much everything except MUL. ARM doesn't have division)

This is an incredibly useful feature, though it does make the instructions occasionally look like huge monstrosities! It also means that ARM's ADD instruction can double for most architecture's LEA.

5. Three operand instruction set Well, that one should be reasonably clear ;)

6. No mode flags (or those which exist are implicit) For example, while there are both the ARM and Thumb instruction sets, they're designated by the least significant bit of the branch target address. The BX/BLX instructions automatically move this bit into the current program status register (CPSR)

7. PC is in the register file Yes, you can do "MOV pc, lr" (this is the traditional way to return), and can use the ALU operations for relative branches.

(Caveat: On machines prior to ARMv7 [ARM11 and older processors], these instructions will not transition to/from Thumb mode and the result of loading the least significant bits of PC is Unpredictable. ARMv7 makes them interwork properly with Thumb)

(By the way - when ARM say Unpredictable they mean "May raise a trap, may do something completely unrelated, may be a NOP - behaviour is undefined except that it cannot cause a security hole" and be redefined by future revisions) " -- http://forum.6502.org/viewtopic.php?t=1594
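
Points 4 and 6 in the list above can be illustrated with a small model. This is only a sketch in Python, not ARM code: the helper names (add_shifted, bx_target) are hypothetical, registers are modelled as plain 32-bit integers, and only the LSL form of the shifted operand is covered.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values


def lsl(value, amount):
    """Logical shift left, truncated to 32 bits."""
    return (value << amount) & MASK32


def add_shifted(rn, rm, shift):
    """Model of `ADD Rd, Rn, Rm, LSL #shift`: Rd = Rn + (Rm << shift)."""
    return (rn + lsl(rm, shift)) & MASK32


def bx_target(address):
    """Model of BX interworking: bit 0 selects Thumb state.

    Returns (branch_address, thumb_state), with bit 0 cleared from the
    address that is actually branched to.
    """
    thumb = bool(address & 1)
    return address & ~1 & MASK32, thumb


# base + index * 4 in a single instruction, which is what LEA is used for
# on other architectures.
element_address = add_shifted(0x1000, 3, 2)
```

Here add_shifted(0x1000, 3, 2) gives the address of element 3 in an array of 4-byte items starting at 0x1000, and bx_target(0x8001) describes a branch to 0x8000 in Thumb state.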

"ARM wasn't really a pure RISC from the beginning (e.g. multicycle instructions like LDM/STM, pre/post-increment addressing modes, built-in shifts)..." [13]

" Yes, although the new instruction set in ARMv8 removes several of the things that made programming in 'classic' ARM assembly such fun on the Acorn, such as the free barrel shifter on most arithmetic ops, conditional execution on all instructions and fast multiple loads/stores with groups of registers. These have gone for various reasons; the fully-flexible barrel shifter is awkward at high frequencies with deep pipelines, the conditional execution flags became a waste of opcode space as branch prediction improved and the load/store multiples required microcode on modern implementations and so increased complexity. " [14]

ARM: Links

ARM: summary

It seems like the 'core' instruction set is indeed the set found in the Cortex-M0, M0+, and M1. This is a subset of the 16-bit Thumb-2 set, plus a few 32-bit instructions.

Those instructions are: MOV, arithmetic (ADD, ADC, SUB, SBC, RSB, MUL), bitwise arithmetic (LSL, LSR, ASR, AND, ORR, EOR, ROR, BIC, MVN), byte reversals (REV, REV16, REVSH), get/set special registers (ADR, MRS, MSR), comparisons (CMP, CMN, TST), branching (B, BL), loads/stores with immediate, register offset, PC, SP offset, and multiple registers, push/pop, extension (SXTH, SXTB, UXTH, UXTB), misc control (SVC, NOP), multiprocessing (YIELD, WFE, WFI, SEV, DMB, DSB), and a few other misc instructions (ISB and some others).

When we get to the Cortex-M3 we add 32-bit instructions for bit fields (BFC/BFI, SBFX, UBFX), multiprocessing (LDREX, STREX, CLREX), bitwise arithmetic (CLZ, MOVT, ORN, RRX, saturating operations), comparisons (TEQ), various loads and stores (with postindexing and various widths), arithmetic (division, multiply-accumulate (add/subtract) operations with various widths), branch tables (TBB, TBH), and some other misc instructions (DBG, PLD, PLI).
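
As a rough illustration of what the bit field instructions listed above compute, here is a Python sketch. The function names simply mirror the mnemonics and are not an API from any library; registers are modelled as unsigned 32-bit integers, and the operand checks a real assembler would make (for example lsb + width <= 32) are assumed rather than enforced.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values


def ubfx(rn, lsb, width):
    """Unsigned bit field extract: `width` bits of rn starting at bit `lsb`."""
    return (rn >> lsb) & ((1 << width) - 1)


def sbfx(rn, lsb, width):
    """Signed bit field extract: like ubfx, then sign-extended to 32 bits."""
    field = ubfx(rn, lsb, width)
    if field & (1 << (width - 1)):
        field -= 1 << width
    return field & MASK32


def bfi(rd, rn, lsb, width):
    """Bit field insert: copy the low `width` bits of rn into rd at `lsb`."""
    mask = ((1 << width) - 1) << lsb
    return ((rd & ~mask) | ((rn << lsb) & mask)) & MASK32


def bfc(rd, lsb, width):
    """Bit field clear: zero `width` bits of rd starting at bit `lsb`."""
    mask = ((1 << width) - 1) << lsb
    return rd & ~mask & MASK32
```

For example, ubfx(0xABCD1234, 8, 8) picks out the byte 0x12, while sbfx applied to a field whose top bit is set yields a negative value in two's complement.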

ARM64 instruction list

General instructions:

Data transfer instructions:

A64 floating-point instructions:

A64 SIMD scalar instructions:

A64 SIMD Vector instructions:

ARM64 things removed compared to ARM32

" If you are familiar with ARMv7-A, you’ll know that many instructions can be conditionally executed. In A32, this is supported via a condition field in the instruction itself; in T32, we have the IT (if-then) instruction for building conditional sequences. This isn’t supported in A64 and we have a different set of specific conditional instructions. You can find examples below.

The ability to “embed” shift and rotate operations into data processing instructions is not supported in the same way in A64, although it is still possible to shift, rotate and sign-extend or zero-extend the second operand.

The Program Counter (PC) is no longer generally accessible. In particular, it can’t be read or modified like other general purpose registers. There are pseudo-instructions which can be used to use it indirectly (for instance, to generate PC-relative addresses at run-time).

Historically, the ARM instruction set has included a space for «coprocessors». Originally, these were external blocks of logic which were connected to the core via a dedicated coprocessor interface. More recently, this support for external coprocessors has been dropped and the instruction set space is used for extension instructions. One specific use of it has been to provide for system configuration and control operations via the notional «coprocessor 15». You won’t find anything like this in A64.

The load and store multiple instructions have been replaced with instructions which load and store pairs of 64-bit registers. These are used for stack operations as well, in place of the earlier PUSH and POP. " -- https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-00-52-01/Porting-to-ARM-64_2D00_bit.pdf
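
To make the last point concrete, the sketch below plans how a register list that A32 would save with a single STM/PUSH could be split into pair stores on A64. It is a Python illustration under stated assumptions: plan_register_saves is a hypothetical helper, an odd leftover register is saved with a single STR, and the frame size is rounded up to 16 bytes because the AArch64 procedure call standard keeps the stack pointer 16-byte aligned. Real compilers may order and place the stores differently.

```python
def plan_register_saves(registers):
    """Return (ops, frame_bytes) for saving 64-bit registers in pairs.

    ops is a list of tuples such as ("stp", "x19", "x20") or ("str", "x21").
    frame_bytes is the stack space needed, rounded up to a multiple of 16.
    """
    ops = []
    # Pair up registers two at a time; each pair maps to one STP.
    for i in range(0, len(registers) - 1, 2):
        ops.append(("stp", registers[i], registers[i + 1]))
    # An odd register out is stored on its own (an assumption of this sketch).
    if len(registers) % 2:
        ops.append(("str", registers[-1]))
    # 8 bytes per register, rounded up to keep the stack 16-byte aligned.
    frame_bytes = (8 * len(registers) + 15) // 16 * 16
    return ops, frame_bytes


ops, frame_bytes = plan_register_saves(["x19", "x20", "x21"])
```

With three registers this plans one STP, one STR, and a 32-byte frame, so 8 bytes of the frame are padding.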

A comment on which of the ARMv8 changes were improvements, from https://www.anandtech.com/Show/Index/15036?cPage=5&all=False&sort=0&page=1&slug=sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip:

" ... There is a HUGE amount of learning that informed ARMv8, from the dropping of predication and shifting everywhere, to the way constants are encoded, to high-impact ideas like load/store pair and their particular version of conditional selection, to the codification of the memory ordering rules. Look at SVE as the newest version of something very different from what they were doing earlier. " -- name99

ARM64 calling convention

https://c9x.me/compile/bib/abi-arm64.pdf

Misc

"ARM/64 fact: integer divide-by-zero doesn't cause an exception. The Microsoft C compiler implicitly inserts a check if the divisor is 0 and triggers an pseudo-div-by-0 exception. Clang and GCC compiles the code as it so you always get 0." [15]

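The divide behaviour in the quote above can be modelled in a few lines. This is a Python sketch of the result values only, not of any compiler's output: udiv and sdiv are hypothetical names mirroring the A64 mnemonics, operands are treated as 64-bit registers by default, and the signed overflow case (most negative value divided by -1) is modelled as wrapping back to the most negative value.

```python
def to_signed(value, bits):
    """Interpret an unsigned `bits`-wide value as two's complement."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def udiv(n, m, bits=64):
    """Unsigned divide: a zero divisor yields 0 instead of trapping."""
    mask = (1 << bits) - 1
    n &= mask
    m &= mask
    if m == 0:
        return 0
    return n // m


def sdiv(n, m, bits=64):
    """Signed divide, rounding toward zero; a zero divisor yields 0."""
    mask = (1 << bits) - 1
    sn = to_signed(n & mask, bits)
    sm = to_signed(m & mask, bits)
    if sm == 0:
        return 0
    quotient = abs(sn) // abs(sm)
    if (sn < 0) != (sm < 0):
        quotient = -quotient
    # Truncating to `bits` makes INT_MIN / -1 wrap back to INT_MIN.
    return quotient & mask
```

A C compiler that wants a trap on division by zero therefore has to emit its own test of the divisor before the divide, which is the check the quote attributes to the Microsoft compiler.
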
Opinions

Links

A64:

Arm v1: