Table of Contents for Programming Languages: a survey
ARM: Intro
https://en.wikipedia.org/wiki/ARM_architecture#32-bit_architecture
http://users.ece.utexas.edu/~valvano/EE345M/Arm_EE382N_4.pdf
https://sourceware.org/cgen/gen-doc/arm-thumb-insn.html list of instructions with names, todo
A recent addition to the ARM ISA family is ARM64 (ARMv8 A64 / AArch64), described on the pages http://www.arm.com/products/processors/instruction-set-architectures/index.php http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0677b/ch01s01.html http://www.arm.com/files/downloads/ARMv8_Architecture.pdf http://www.cs.utexas.edu/~peterson/arm/DDI0487A_a_armv8_arm_errata.pdf http://www.arm.com/files/pdf/ARMv8R__Architecture_Oct13.pdf.
ARM has various versions and 3 profiles: A (full-featured, for use as e.g. the CPU of a smartphone or computer; has an MMU with virtual addressing), R (real-time, for use in e.g. car engines; as far as I can tell it has deterministic physical addressing, with a memory protection unit rather than a full MMU), M (microcontroller; only supports the Thumb ISA). The latest version is v8; according to the ARM Wikipedia page at the time of writing, only the A and R profiles were available for v8 (the Cortex-M23 and M33 notes further down describe the later ARMv8-M). v7 has all 3 profiles (e.g. http://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readings/ARMv7-M_ARM.pdf ). There's also an E-M variant, which is like M with a DSP extension, found in v7.
ARM Thumb: "The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions." -- (ARM7TDMI Technical Reference Manual Revision: r4p1) "The Thumb instruction set provides better code density, at the expense of inferior performance....Thumb-2, a major enhancement of the Thumb instruction set. Thumb-2 provides almost exactly the same functionality as the ARM instruction set. It has both 16-bit and 32-bit instructions, and achieves ARM-like performance with Thumb-like code density." -- (RealView Compilation Tools Assembler Guide Version 4.0) https://en.wikipedia.org/wiki/ARM_Cortex-M
"The biggest register difference involves the SP register. The Thumb state has unique stack mnemonics (PUSH, POP) that don't exist in the ARM state. These instructions assume the existence of a stack pointer, for which R13 is used. They translate into load and store instructions in the ARM state. " -- http://www.embedded.com/electronics-blogs/beginner-s-corner/4024632/Introduction-to-ARM-thumb
"The original Thumb-Instruction set only contained 16-bit instructions. Thumb2 introduced mixed 16/32 bit instructions....The ARM processor has 2 instruction sets, the traditional ARM set, where the instructions are all 32-bit long, and the more condensed Thumb(2) set, where most common instructions are 16-bit long (and some are 32-bit long)." -- http://stackoverflow.com/questions/10638130/thumb-instruction-in-arm
Some instructions have immediate addressing modes and others do not. i won't bother to include that information because my interest here is mainly in the instruction set. I leave out some instructions that are, to me, uninteresting variants of existing ones. Note that the purpose of these listings is not accuracy, but rather to get a sense of what sorts of instructions are in RISC-ish CPU instruction sets.
Note that in Thumb2, instructions cannot reference the PC (program counter) or SP (stack pointer) as operands, including destination operand, unless noted. Note that every instruction that returns a result takes an operand specifying the destination register; operations are NOT done in place on the input registers (except when the destination register given is the same as an input register).
ARM has 'barrel shifting', meaning that shifts and rotates can be performed on operands without issuing separate instructions.
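As a rough illustration of what barrel shifting buys you, the Python sketch below models a data-processing instruction whose second operand is shifted on the way in, as in `ADD r0, r1, r2, LSL #2`. The function names are made up for this sketch, shift amounts are assumed to be in the range 1..31, and condition flags are not modelled.

```python
MASK32 = 0xFFFFFFFF

def shift_operand(value, kind, amount):
    """Apply a barrel-shifter style shift to a 32-bit value (amount assumed 1..31)."""
    value &= MASK32
    if kind == "LSL":
        return (value << amount) & MASK32
    if kind == "LSR":
        return value >> amount
    if kind == "ASR":
        # Reinterpret as signed so the sign bit is replicated into the high bits.
        signed = value - (1 << 32) if value & 0x80000000 else value
        return (signed >> amount) & MASK32
    if kind == "ROR":
        return ((value >> amount) | (value << (32 - amount))) & MASK32
    raise ValueError(kind)

def add_shifted(rn, rm, kind, amount):
    """Model of ADD Rd, Rn, Rm, <kind> #amount: shift and add in one instruction."""
    return (rn + shift_operand(rm, kind, amount)) & MASK32
```

For example, `add_shifted(100, 3, "LSL", 2)` computes 100 + (3 << 2), the sort of base-plus-scaled-index arithmetic that would otherwise need a separate shift instruction.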
ARM has a clever way of representing 32-bit immediate values with only 8 bits plus 4 bits that select a rotation (the 8-bit value is rotated right by twice the 4-bit amount), which allows it to represent any power of 2 as an immediate value: http://alisdair.mcdiarmid.org/2014/01/12/arm-immediate-value-encoding.html . "Thumb-2 immediate encoding is even more gleeful--in addition to allowing rotation, it also allows for spaced repetition of any 8-bit pattern (common in low level hack patterns, like from [1]) to be encoded in single instructions." -- https://news.ycombinator.com/item?id=7046803 . If the value you want isn't accessible as an immediate, you can load it from a constant table or you can compute it; some instruction sets also have MOVW and MOVT, which can construct and combine 16-bit immediates into a 32-bit value. Some assemblers let you just specify the immediate and the assembler figures out how to get it ( https://news.ycombinator.com/item?id=7045898 ).
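A minimal Python sketch of the classic (32-bit ARM state) immediate encoding, assuming the scheme described in the linked article: an 8-bit value rotated right by twice a 4-bit amount. The function names are hypothetical, and the Thumb-2 repeated-pattern forms are not covered.

```python
MASK32 = 0xFFFFFFFF

def rotate_right(value, amount):
    """Rotate a 32-bit value right by amount bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32

def encode_arm_immediate(value):
    """Return (rot4, imm8) with rotate_right(imm8, 2 * rot4) == value, or None."""
    value &= MASK32
    for rot4 in range(16):
        # Undo the candidate rotation: rotating left by 2*rot4 is the same
        # as rotating right by 32 - 2*rot4.
        imm8 = rotate_right(value, 32 - 2 * rot4)
        if imm8 <= 0xFF:
            return rot4, imm8
    return None  # not representable; use a literal pool or MOVW/MOVT instead
```

Every power of 2 is encodable this way, while a value such as 0x101 (two set bits nine positions apart) is not, because no even rotation brings both bits into the low byte.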
ARM instructions traditionally encoded a conditional execution field, allowing instructions to be skipped depending on the flags, without doing a branch. On ARM64 this has been changed:
" arm64 ... sort of ditches conditional execution. It’s not on every instruction any more, but it’s still available on more instructions than on most other arches.
To the usual complement of typical conditional instructions (branch, add/sub with carry, select and set), arm64 adds select with increment, negate, or inversion, the ability to conditionally set to -1 as well as +1, and the ability to conditionally compare and merge the flags in a fairly flexible manner (it’s really a conditional select of condition flags between the result of a comparison and an immediate). This actually preserves most of the power of conditional execution (except for really exotic hand-coded usages), while taking up much less encoding space. " -- stephencanon , https://news.ycombinator.com/item?id=7047762
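To make the quoted description a little more concrete, here is a Python sketch of the AArch64 conditional-select family (CSEL, CSINC, CSINV, CSNEG) and the "set to +1 / set to -1" forms, which I understand to be aliases built on CSINC and CSINV with the zero register. The condition is modelled as a plain boolean rather than being derived from the NZCV flags, and conditional compare is left out.

```python
MASK64 = (1 << 64) - 1

def csel(cond, xn, xm):
    """CSEL: xn if the condition holds, else xm."""
    return (xn if cond else xm) & MASK64

def csinc(cond, xn, xm):
    """CSINC: xn if the condition holds, else xm + 1."""
    return (xn if cond else xm + 1) & MASK64

def csinv(cond, xn, xm):
    """CSINV: xn if the condition holds, else bitwise NOT of xm."""
    return (xn if cond else ~xm) & MASK64

def csneg(cond, xn, xm):
    """CSNEG: xn if the condition holds, else the negation of xm."""
    return (xn if cond else -xm) & MASK64

def cset(cond):
    """Conditionally set to 1: CSINC of the zero register with the inverted condition."""
    return csinc(not cond, 0, 0)

def csetm(cond):
    """Conditionally set to -1 (all ones): CSINV with the inverted condition."""
    return csinv(not cond, 0, 0)
```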
ARM has 8 operating modes. "Each mode has its own mode-specific registers, including a status register":
- User – normal operation
- Fast interrupt – handling of ”fast” interrupts
- Interrupt – handling of all other interrupts
- Supervisor – operating system protected mode
- Abort – abortion of memory access
- System – operating system privileged mode
- Undefined – invalid instruction in stream
- Secure monitor – on-chip security features
(descriptions from http://www.cs.virginia.edu/~skadron/cs433_s09_processors/arm11.pdf )
Addressing modes ( http://www.cs.uregina.ca/Links/class-info/301/ARM-addressing/lecture.html ):
- register
- absolute
- immediate
- register indirect
- register indirect with immediate offset
- register indirect preincrementing by immediate offset
- register indirect postincrementing by immediate offset
- register indirect with register offset
- register indirect with register offset with scaling
For ARM64 (AArch64), see also https://developer.arm.com/documentation/102374/0101/Loads-and-stores---addressing , which presents just 4 addressing modes applied only to loads/stores:
- simple (register), offset (register + immediate offset), pre-indexed (register += immediate offset), post-indexed (like pre-indexed except the address used is before adding the offset)
- the AArch64 spec, ( https://developer.arm.com/documentation/ddi0487/latest/ ), also mentions a PC-relative addressing mode called 'literal'
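The difference between the offset, pre-indexed and post-indexed forms is easiest to see side by side. The Python sketch below models registers and memory as dictionaries (an assumption of this sketch, as are the function names) and ignores access sizes, alignment and the case where the base and destination registers coincide.

```python
def ldr_offset(regs, memory, rt, rn, imm):
    """LDR rt, [rn, #imm]: the address is rn + imm; rn is left unchanged."""
    regs[rt] = memory[regs[rn] + imm]

def ldr_pre_indexed(regs, memory, rt, rn, imm):
    """LDR rt, [rn, #imm]!: rn is updated first, then used as the address."""
    regs[rn] += imm
    regs[rt] = memory[regs[rn]]

def ldr_post_indexed(regs, memory, rt, rn, imm):
    """LDR rt, [rn], #imm: the old rn is the address; rn is updated afterwards."""
    regs[rt] = memory[regs[rn]]
    regs[rn] += imm
```

The simple (register) mode is just `ldr_offset` with an immediate of 0. Post-indexing suits walking an array forwards, while pre-indexing suits stack-style pushes where the pointer moves before the access.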
The AArch64 spec, ( https://developer.arm.com/documentation/ddi0487/latest/ ), speaks of other "addressing modes", but afaict from section C1.3 "Address generation" subsection "Address calculation", these are just ways to compute addresses with instructions like ADD, rather than ways to avoid using a separate instruction to compute an address.
The notes in section C1.3 "Address generation" subsection "Address calculation" indicate that when using an ADD instruction to add an immediate offset to a base address, the size of the immediate is 12 bits.
I can't tell if there is a way to use a single ADD instruction to compute (base + scale*index + immediate_offset), but it appears to me that this would require two instructions, one to add the scaled index, and a second to add the immediate offset.
ARM: 16-bit Thumb2 instructions
MOV LSL r1 r2 r3 (logical shift left; r1 := r2 << r3) LSR (logical shift right) ASR (arithmetic shift right) ADD (note: the source and/or destination operands for ADD can include SP, the stack pointer; in this way you can get the SP into a register) SUB (note: the source and destination operands for SUB can include SP, the stack pointer)
ADR (Add immediate to program counter; in this way you can get the PC into a register; useful for getting the address of a 'label' if your assembler translates labels to relative offsets )
CMP
AND EOR (xor)
ADC (Add with Carry; a + b + carry bit) SBC (Subtract with Carry; a - b - NOT(carry bit), i.e. the carry flag acts as an inverted borrow) ROR (Rotate Right) TST (Test bits: TST Rn Rm: update condition code flags on Rn AND Rm) RSB (Reverse subtract; in 16-bit Thumb this subtracts from zero, i.e. negates) CMP (update condition code flags on Rn - Rm) CMN (Compare Negative; update condition code flags on Rn + Rm) ORR (or) MUL BIC (Bit Clear: x AND (NOT y)) MVN (Move NOT: bitwise NOT)
BL (branch with link; BL <label>: LR register = address of next instruction, PC = label)
BX (Branch and Exchange; this is used to enter/exit "thumb state") BLX (Branch with Link and Exchange; this is used to enter/exit "thumb state")
Load and store:
STR (Store word. Addressing modes include immediate, register offset, PC offset, SP offset. Can store list of multiple registers (STMIA).) also STRH for store halfword, STRB for byte
LDR (Load word. Addressing modes include immediate, register offset, SP offset. Can load list of multiple registers (LDMIA).) also LDRH for Load unsigned halfword, LDRSH for signed halfword, LDRB for unsigned byte, LDRSB for signed byte
LDR (load from literal pool instrs)
B (unconditional and conditional branch instructions: takes as an operand a 'condition field' (this is different from a condition code), which is one of equal, not equal, Carry Set / Unsigned higher or same, Carry Clear / Unsigned lower, Negative, Positive or zero, Overflow, No overflow, Unsigned higher, Unsigned lower or same, Signed greater than or equal, Signed less than or equal, Signed greater than, Signed less than, always)
SVC (service (system) call instructions; formerly SWI) SETEND (set endianness) CPS (change processor state; enables and disables specified interrupts) BKPT (software breakpoint) IT (If-Then; "Makes up to four following instructions conditional, according to pattern. pattern is a string of up to three letters. Each letter can be T (Then) or E (Else).")
Adjust stack pointer instructions Increment stack pointer ADD (SP plus immediate) Decrement stack pointer SUB (SP minus immediate)
Sign or zero extend instructions (these are used to convert a signed or unsigned value of a certain byte width into a value of a larger byte width, e.g. to convert a signed byte representing "-10" to a signed word representing "-10"; see http://odellconnie.blogspot.com/2012/03/sign-extension-zero-extension.html ) SXTH (Signed Extend Halfword to Word: SXTH Rd Rm: Rd[31:0] := SignExtend(Rm[15:0])) SXTB (Signed Extend Byte to Word: Rd[31:0] := SignExtend(Rm[7:0])) UXTH (Unsigned Extend Halfword to Word: Rd[31:0] := ZeroExtend(Rm[15:0])) UXTB (Unsigned Extend Byte to Word: Rd[31:0] := ZeroExtend(Rm[7:0]))
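The four extend instructions can be sketched in Python as follows, treating registers as unsigned 32-bit integers. The function names simply mirror the mnemonics; this is an illustration of the semantics, not of any particular encoding.

```python
def uxtb(rm):
    """Zero-extend the low byte to a 32-bit word."""
    return rm & 0xFF

def uxth(rm):
    """Zero-extend the low halfword to a 32-bit word."""
    return rm & 0xFFFF

def sxtb(rm):
    """Sign-extend the low byte: copy bit 7 into bits 8..31."""
    low = rm & 0xFF
    return (low | 0xFFFFFF00) if low & 0x80 else low

def sxth(rm):
    """Sign-extend the low halfword: copy bit 15 into bits 16..31."""
    low = rm & 0xFFFF
    return (low | 0xFFFF0000) if low & 0x8000 else low
```

The byte 0xF6 is -10 as a signed byte: `sxtb(0xF6)` gives 0xFFFFFFF6, which is -10 as a signed word, whereas `uxtb(0xF6)` gives 0xF6, i.e. 246.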
Compare and branch on (non-)zero instructions CBZ (Compare and branch on zero; CBZ r <label>: if r == 0, goto <label>) CBNZ (Compare and branch on non-zero)
PUSH (push selected registers onto stack) POP (pop selected registers from stack)
Reverse byte instructions REV (Byte-Reverse Word, i.e. reverse the ordering of the four bytes in the word (and put the result in the destination register)) REV16 (Byte-Reverse Packed Halfword, i.e. reverse the ordering of the two bytes in both halfwords) REVSH (Byte-Reverse Signed Halfword, i.e. reverse the bytes in the low halfword, and sign extend the result to fill the whole word)
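A small Python sketch of the three byte-reverse operations on 32-bit values; the function names mirror the mnemonics and registers are modelled as unsigned integers.

```python
def rev(x):
    """Reverse the order of the four bytes in a 32-bit word."""
    return int.from_bytes((x & 0xFFFFFFFF).to_bytes(4, "little"), "big")

def rev16(x):
    """Swap the two bytes within each halfword independently."""
    x &= 0xFFFFFFFF
    return ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)

def revsh(x):
    """Swap the bytes of the low halfword, then sign-extend to 32 bits."""
    swapped = ((x & 0xFF) << 8) | ((x >> 8) & 0xFF)
    return (swapped | 0xFFFF0000) if swapped & 0x8000 else swapped
```

These are mostly used for endianness conversion, e.g. `rev(0x12345678)` yields 0x78563412.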
NOP-compatible hint instructions: NOP YIELD (Yield control to alternative thread) WFE (Wait For Event) WFI (Wait For Interrupt) SEV (Send event; signal event in multiprocessor system)
ARM: 32-bit Thumb2 instructions
ORN (OR (not)) TEQ (update condition code flags on a XOR b) MOVT (move the source halfword into the top halfword of the destination register) BFC (Bit Field Clear; set specified bits to zero; takes a starting bit and a bitwidth) BFI (Bit Field Insert; set specified bits to specified values; takes a starting bit and a bitwidth and a source value)
SBFX (Signed Bit Field extract) SSAT (Signed saturate, LSL, ASR) SSAT16 (Signed saturate 16-bit) UBFX (Unsigned Bit Field extract) USAT (Unsigned saturate, LSL, ASR) USAT16 (Unsigned saturate 16-bit)
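The bit-field instructions and MOVT are easier to follow with a sketch. The Python below models each one as a pure function on unsigned 32-bit values, with the starting bit called `lsb` (my naming); widths are assumed to be valid (1 to 32 - lsb), and the saturating instructions are left out.

```python
MASK32 = 0xFFFFFFFF

def field_mask(lsb, width):
    """Mask with `width` one-bits starting at bit `lsb`."""
    return ((1 << width) - 1) << lsb

def bfc(rd, lsb, width):
    """Bit Field Clear: zero the selected bits of rd."""
    return rd & ~field_mask(lsb, width) & MASK32

def bfi(rd, rn, lsb, width):
    """Bit Field Insert: copy the low `width` bits of rn into rd at `lsb`."""
    mask = field_mask(lsb, width)
    return ((rd & ~mask) | ((rn << lsb) & mask)) & MASK32

def ubfx(rn, lsb, width):
    """Unsigned Bit Field Extract: the field, zero-extended."""
    return (rn >> lsb) & ((1 << width) - 1)

def sbfx(rn, lsb, width):
    """Signed Bit Field Extract: the field, sign-extended to 32 bits."""
    field = ubfx(rn, lsb, width)
    sign = 1 << (width - 1)
    return ((field ^ sign) - sign) & MASK32

def movt(rd, imm16):
    """MOVT: replace the top halfword of rd, keeping the bottom halfword."""
    return ((imm16 & 0xFFFF) << 16) | (rd & 0xFFFF)
```

A 32-bit constant can then be built in two steps, e.g. `movt(0x5678, 0x1234)` gives 0x12345678 after the low half has been loaded first.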
PKH (Pack halfword, BT, TB) RRX (Rotate Right with Extend)
Signed and unsigned extend instructions with optional addition: SXTAB (Signed extend byte and add) SXTAB16 (Signed extend two bytes to halfwords, and add) SXTAH (Signed extend halfword and add) SXTB16 (Signed extend two bytes to halfwords) UXTAB (Unsigned extend byte and add) UXTAB16 (Unsigned extend two bytes to halfwords, and add) UXTAH (Unsigned extend halfword and add) UXTB16 (Unsigned extend two bytes to halfwords)
SIMD add and subtract: QADD16, UADD16, QADD8, UADD8, QASX, UASX, QSUB16, UHADD16, QSUB8, UHADD8, QSAX, UHASX, SADD16, UHSUB16, SADD8, UHSUB8, SASX, UHSAX, SHADD16, UQADD16, SHADD8, UQADD8, SHASX, UQASX, SHSUB16, UQSUB16, SHSUB8, UQSUB8, SHSAX, UQSAX, SSUB16, USUB16, SSUB8, USUB8, SSAX
Mnemonic elements and their meanings:
- Q prefix: Signed saturating arithmetic.
- S prefix: Signed arithmetic, modulo 2^8 or 2^16.
- SH prefix: Signed halving arithmetic. The result of the calculation is halved.
- U prefix: Unsigned arithmetic, modulo 2^8 or 2^16.
- UH prefix: Unsigned halving arithmetic. The result of the calculation is halved.
- UQ prefix: Unsigned saturating arithmetic.
- 16 suffix: The instruction performs two 16-bit calculations.
- 8 suffix: The instruction performs four 8-bit calculations.
- ASX mnemonic: The instruction performs one 16-bit addition and one 16-bit subtraction. The X indicates that the halfwords of the second operand are exchanged before the operation.
- SAX mnemonic: The instruction performs one 16-bit subtraction and one 16-bit addition. The X indicates that the halfwords of the second operand are exchanged before the operation.
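To show how the prefixes combine with the 8 suffix, here is a Python sketch of four of the byte-lane additions: UADD8 (modular), UQADD8 (unsigned saturating), UHADD8 (unsigned halving) and QADD8 (signed saturating). The helper names are hypothetical, and the GE flags that some of these instructions set are not modelled.

```python
def lanes8(x):
    """Split a 32-bit word into four byte lanes, lowest lane first."""
    return [(x >> shift) & 0xFF for shift in (0, 8, 16, 24)]

def pack8(lanes):
    """Pack four byte lanes (lowest first) back into a 32-bit word."""
    return sum((lane & 0xFF) << (8 * i) for i, lane in enumerate(lanes))

def to_signed8(b):
    """Reinterpret an unsigned byte as a signed 8-bit value."""
    return b - 256 if b & 0x80 else b

def uadd8(a, b):
    """Unsigned add per lane, wrapping modulo 2^8."""
    return pack8([x + y for x, y in zip(lanes8(a), lanes8(b))])

def uqadd8(a, b):
    """Unsigned saturating add per lane: clamp at 255."""
    return pack8([min(x + y, 255) for x, y in zip(lanes8(a), lanes8(b))])

def uhadd8(a, b):
    """Unsigned halving add per lane: (x + y) / 2, so it cannot overflow."""
    return pack8([(x + y) >> 1 for x, y in zip(lanes8(a), lanes8(b))])

def qadd8(a, b):
    """Signed saturating add per lane: clamp to -128..127."""
    sums = [to_signed8(x) + to_signed8(y) for x, y in zip(lanes8(a), lanes8(b))]
    return pack8([max(-128, min(127, s)) for s in sums])
```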
CLZ (Count Leading Zeros (just what it sounds like)) QADD (Saturating Add) QDADD (Saturating Double and Add) QDSUB (Saturating Double and Subtract) QSUB (Saturating Subtract) RBIT (Reverse Bits) SEL (Select bytes; uses the 4 GE flag bits, which control, for each of the four byte positions of the output, which of the two input words will contribute that byte)
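A Python sketch of four of these operations on 32-bit values. For SEL, the four GE flag bits are passed in as a plain integer argument (an assumption of this sketch; on the hardware they live in a status register), with a set bit selecting the byte from the first operand.

```python
MASK32 = 0xFFFFFFFF

def clz(x):
    """Count leading zero bits of a 32-bit value (32 for an input of 0)."""
    return 32 - (x & MASK32).bit_length()

def rbit(x):
    """Reverse the order of all 32 bits."""
    return int(format(x & MASK32, "032b")[::-1], 2)

def to_signed32(v):
    """Reinterpret an unsigned 32-bit value as signed."""
    v &= MASK32
    return v - (1 << 32) if v & 0x80000000 else v

def qadd(a, b):
    """Signed saturating add: clamp to the signed 32-bit range instead of wrapping."""
    total = to_signed32(a) + to_signed32(b)
    total = max(-(1 << 31), min((1 << 31) - 1, total))
    return total & MASK32

def sel(rn, rm, ge):
    """Byte select: bit i of ge picks byte i from rn (if set) or rm (if clear)."""
    result = 0
    for i in range(4):
        source = rn if (ge >> i) & 1 else rm
        result |= source & (0xFF << (8 * i))
    return result
```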
multiply/divide and accumulate (add/subtract the result of multiplying to the destination, in-place), with various different byte widths of the operands and destination register(s): MLA (multiply and accumulate; x + (y*z)) MLS (multiply and subtract) SMLAxy (Signed Multiply-Accumulate Add, with double-length result) SMLAD (Signed Dual Multiply-Accumulate Add) SMLAWx (Signed Multiply-Accumulate Add) SMLSD (Signed Dual Multiply Subtract and Accumulate) SMMLA (Signed 32 + 32 x 32-bit, most significant word) SMMLS (Signed 32 – 32 x 32-bit, most significant word) SMMUL (Signed 32 x 32-bit, most significant 32-bit word) SMUAD (Signed Dual Multiply Add) SMULxy SMULWx SMUSD (Signed Dual Multiply Subtract) USAD8 (Unsigned Sum of Absolute Differences) USADA8 (Unsigned Accumulate Absolute Differences)
with 64-bit results (two registers to hold result): SMULL (Signed multiply with double-length result) UMULL (Unsigned multiply with double-length result) SDIV (Signed divide) UDIV (Unsigned divide) SMLALxy (Signed multiply with double-length result and accumulate) SMLALD (Signed Multiply Accumulate Long Dual) SMLSLD (Signed Multiply Subtract accumulate Long Dual) UMLAL (Unsigned 64 + 32 x 32) UMAAL (Unsigned multiply and accumulate with double-length result)
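The distinction between a 32-bit result and a double-length result can be sketched in Python as below. The long multiplies return a (low word, high word) pair to stand in for the two result registers; the function names and argument order are choices made for this sketch, not the assembler operand order.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(v):
    """Reinterpret an unsigned 32-bit value as signed."""
    v &= MASK32
    return v - (1 << 32) if v & 0x80000000 else v

def mul(rn, rm):
    """MUL: only the lower 32 bits of the product are kept."""
    return (rn * rm) & MASK32

def mla(rn, rm, ra):
    """MLA: multiply and accumulate, ra + rn * rm, truncated to 32 bits."""
    return (ra + rn * rm) & MASK32

def umull(rn, rm):
    """UMULL: unsigned 32 x 32 -> 64, returned as (low, high)."""
    product = (rn & MASK32) * (rm & MASK32)
    return product & MASK32, product >> 32

def smull(rn, rm):
    """SMULL: signed 32 x 32 -> 64, returned as (low, high)."""
    product = to_signed32(rn) * to_signed32(rm)
    return product & MASK32, (product >> 32) & MASK32

def umlal(rd_lo, rd_hi, rn, rm):
    """UMLAL: add the unsigned 64-bit product to the 64-bit value in (lo, hi)."""
    acc = ((rd_hi << 32) | rd_lo) + (rn & MASK32) * (rm & MASK32)
    return acc & MASK32, (acc >> 32) & MASK32
```

For instance 0x10000 * 0x10000 is 2^32, so `mul` returns 0 while `umull` returns a low word of 0 and a high word of 1.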
loads and stores:
- add versions for postindexing, and for double words
- PLD, PLI (preload)
LDRD (load double) STRD (store double) LDREX (load exclusive word; something to do with semaphores) STREX (store exclusive word; something to do with semaphores) CLREX (clear local processor exclusive tag; something to do with semaphores)
TBB (Table Branch Byte) TBH (Table Branch Halfword)
LDMDB / LDMEA (Load Multiple Decrement Before / Empty Ascending) RFE (Return From Exception) SRS (Store Return State) STMDB / STMFD (Store Multiple Decrement Before / Full Descending)
MRS (Move from Status register to ARM Register, e.g. put the condition codes into a register) MSR (Move from ARM register to Status register, e.g. copy a register over the condition codes) SUBS (Return From Exception without stack)
DBG (Debug hint)
Special control operations: CLREX (Clear Exclusive) DSB (Data Synchronization Barrier) DMB (Data Memory Barrier) ISB (Instruction Synchronization Barrier)
Coprocessor instructions: not listed
ARM: Cortex M profile
Cortex M0, M0+, and M1 only have these instructions:
16-bit: ADC, ADD, ADR, AND, ASR, B, BIC, BKPT, BLX, BX, CMN, CMP, CPS, EOR, LDM, LDR, LDRB, LDRH, LDRSB, LDRSH, LSL, LSR, MOV, MUL, MVN, NOP, ORR, POP, PUSH, REV, REV16, REVSH, ROR, RSB, SBC, SEV, STM, STMIA, STR, STRB, STRH, SUB, SVC, SXTB, SXTH, TST, UXTB, UXTH, WFE, WFI, YIELD
32-bit: BL (branch with link), DMB (Data Memory Barrier; Ensure the order of observation of memory accesses), DSB (Data Synchronization Barrier; Ensure the completion of memory accesses), ISB (Instruction Synchronization Barrier; flush processor pipeline and branch prediction logic), MRS (Move from Status register), MSR (move to status register)
Note that the 16-bit instruction set is identical to the 16-bit Thumb-2 instruction set above, except for SETEND (set endianness), IT (if-then), CBZ (Compare and branch on zero), and CBNZ, which are missing here. (Also, BL appears here only as 32-bit, whereas it was listed with the 16-bit instruction set above; as far as I understand, in the original Thumb set BL was encoded as a pair of 16-bit halfwords, which Thumb-2 treats as a single 32-bit instruction.) IT, CBZ, CBNZ are added in the Cortex M3, as well as a bunch of 32-bit instructions:
new 32-bit instructions in the Cortex M3: BFC (Bit Field Clear), BFI (Bit Field Insert), CDP (coprocessor data processing), CLREX (clear local processor exclusive tag), CLZ (count leading zeros), DBG (debug hint), various loads (LDC, LDMA, LDMDB, LDRBT, LDRD, LDREX, LDREXB, LDREXH, LDRHT, LDRSB, LDRSBT, LDRSHT, LDRT), MCR (move to coprocessor from register), MLS (multiply and subtract), MCRR (move to coprocessor from two registers), MLA (multiply and accumulate; x + (y*z)), MOVT (move the source halfword into the top halfword of the destination register), MRC (move to register from coprocessor), MRRC (move to two registers from coprocessor), ORN (x or (not y)), PLD (preload data), PLDW, PLI (preload instructions), RRX (Rotate Right with Extend), SBFX (Signed Bit Field extract), SDIV (Signed divide), SMLAL (like SMULL, but the 64-bit product is added to a 64-bit accumulator), SMULL, SSAT (signed saturate), STC (store coprocessor), various stores (STMDB, STRBT, STRD, STREX, STREXB, STREXH, STRHT, STRT), TBB (Table Branch Byte), TBH (Table Branch Halfword), TEQ (update condition code flags on a XOR b), UBFX (Unsigned Bit Field extract), UDIV (Unsigned divide), other multiply, multiply-accumulate, and saturate instructions (UMLAL, UMULL, USAT)
Note that http://www.eetimes.com/document.asp?doc_id=1319726 claims that "SoCs based on ARM's M0+ Flycatcher core will not run Linux, although they do hit the sub-50-cent price point for the IoT, including security engines and targeted peripherals."
As of this writing, the Cortex M0+ seems to be the leading design for tiny low-power 32-bit devices. There are very small versions of them, e.g. http://cache.freescale.com/files/microcontrollers/doc/fact_sheet/KINETISKL02CSPFS.pdf?fpsp=1 which is 16 mm^2. This device runs at about 48 MHz and the M0+ design yields about 1 MIPS/MHz, which means that according to http://www.roylongbottom.org.uk/mips.htm it's about as powerful as a 486! It has 32 KB of flash memory (presumably for program storage) and 4 KB RAM. Intel recently released a small low-power chip called the Quark, which is a SoC with a 486 ISA, 512 KB SRAM, 16 KB cache.
ARM Cortex M0 instruction list
from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0432c/CHDCICDF.html and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CIHJJEIH.html
- move: mov movs
- arithmetic: add(s) adcs (add with carry) adr (PC-relative Address to Register) sub subs sbcs (sub with carry) rsbs (reverse subtract; negate) muls (multiply 32-bit with 32-bit result)
- compare: cmp cmn (compare negative)
- logical: ands (and) orrs (or) eors (xor) bics (bit clear) mvns (move NOT) tst (AND test)
- bit shifts: lsls lsrs asrs rors
- loads and stores: ldr (load) ldr(b | h | sb | sh) (load byte | halfword | signed byte | signed halfword) ldm (load multiple) str (store) str(b | h) stm (store multiple) (push | pop) (push/pop registers onto/from stack)
- control: b (branch, conditional or unconditional) bl (branch with link) bx (branch with exchange) blx (branch with link and exchange)
- extend: (u | s)xt(b | h) (extend unsigned | signed byte | halfword)
- byte-reverse: rev (reverse bytes in word) rev16 (reverse bytes in both halfwords) revsh (reverse signed bottom half word)
- State change: svc (supervisor call) cpsi(d | e) (disable/enable interrupts) (mrs | msr) (read/write special register) bkpt (breakpoint)
- Hint/events: sev (Send event) wfe (wait for event) wfi (wait for interrupt) yield (this is a no-op) nop
- barriers: isb (instruction sync barrier) dmb (Data Memory Barrier) dsb (data sync barrier)
More notes on ARM Cortex Ms
from https://en.m.wikipedia.org/wiki/ARM_Cortex-M
" See also: ARM architecture § Instruction set
The Cortex-M0 / M0+ / M1 implement the ARMv6-M architecture,[9] the Cortex-M3 implements the ARMv7-M architecture,[10] and the Cortex-M4 / M7 implements the ARMv7E-M architecture.[10] The architectures are binary instruction upward compatible from ARMv6-M to ARMv7-M to ARMv7E-M. Binary instructions available for the Cortex-M0 / M0+ / M1 can execute without modification on the Cortex-M3 / M4 / M7. Binary instructions available for the Cortex-M3 can execute without modification on the Cortex-M4 / M7 / M33.[9][10] Only Thumb-1 and Thumb-2 instruction sets are supported in Cortex-M architectures, but the legacy 32-bit ARM instruction set isn't supported.
All six Cortex-M cores implement a common subset of instructions that consists of most Thumb-1, some Thumb-2, including a 32-bit result multiply. The Cortex-M0 / M0+ / M1 / M23 were designed to create the smallest silicon die, thus having the fewest instructions of the Cortex-M family.
The Cortex-M0 / M0+ / M1 include Thumb-1 instructions, except new instructions (CBZ, CBNZ, IT) which were added in ARMv7-M architecture. The Cortex-M0 / M0+ / M1 include a minor subset of Thumb-2 instructions (BL, DMB, DSB, ISB, MRS, MSR). The Cortex-M3 / M4 / M7 / M33 have all base Thumb-1 and Thumb-2 instructions. The Cortex-M3 adds three Thumb-1 instructions, all Thumb-2 instructions, hardware integer divide, and saturation arithmetic instructions. The Cortex-M4 adds DSP instructions and an optional single-precision floating-point unit (VFPv4-SP). The Cortex-M7 adds an optional double-precision FPU (VFPv5).[9][10] ...
The 32-bit ARM instruction set is not included in Cortex-M cores.
Endianness is chosen at silicon implementation in Cortex-M cores. Legacy cores allowed "on-the-fly" changing of the data endian mode.
Co-processors aren't supported on Cortex-M cores.
"
see also chart https://en.m.wikipedia.org/wiki/ARM_Cortex-M#Instruction_sets
" SysTick? timer: A 24-bit system timer that extends the functionality of both the processor and the Nested Vectored Interrupt Controller (NVIC). When present, it also provides an additional configurable priority SysTick? interrupt.[9][10][11] Though the SysTick? timer is optional, it is very rare to find a Cortex-M microcontroller without it. "
" Memory Protection Unit (MPU): Provides support for protecting regions of memory through enforcing privilege and access rules. It supports up to eight different regions, each of which can be split into a further eight equal-size sub-regions.[9][10][11] " -- Cortex-M3, M4, M7, and M23 have an MPU option
M0, M1, M0+ and the new M23 are Von Neumann; M3, M4, M7 are Harvard.
" The Cortex-M0 core is optimized for small silicon die size and use in the lowest price chips. (ARMv6-M) ... The Cortex-M0+ is an optimized superset of the Cortex-M0. (ARMv6-M architecture) ... The Cortex-M1 is an optimized core especially designed to be loaded into FPGA chips. (ARMv6-M) ... (Cortex-M3 is ARMv7-M) ... Conceptually the Cortex-M4 is a Cortex-M3 plus DSP instructions, and optional floating-point unit (FPU). If a core contains an FPU, it is known as a Cortex-M4F, otherwise it is a Cortex-M4. (ARMv7E-M) ... The Cortex-M7 is a high-performance core with almost double the power efficiency of the older Cortex-M4. (ARMv7E-M) ... The Cortex-M23 core was announced in October 2016[23] and based on the newer ARMv8-M architecture that was previously announced in November 2015.[24] Conceptually the Cortex-M23 is similar to a Cortex-M0+ plus integer divide instructions and TrustZone? security features, and also has a 2-stage instruction pipeline. ... The Cortex-M33 core was announced in October 2016[23] and based on the newer ARMv8-M architecture that was previously announced in November 2015.[24] Conceptually the Cortex-M33 is similar to a Cortex-M4 plus TrustZone? security features, and also has a 3-stage instruction pipeline. "
note: the 32-bit multiply on the Cortex-M23 only gives a 32-bit result! (the lower 32 bits)
"The Cortex-M0 / M0+ / M1 / M23 only has 32-bit multiply instructions with a lower-32-bit result (32bit × 32bit = lower 32bit), where as the Cortex-M3 / M4 / M7 / M33 includes additional 32-bit multiply instructions with 64-bit results (32bit × 32bit = 64bit)."
relevant parts of table "ARM Cortex-M instruction groups":
all Cortex-M parts have the following Thumb1 instrs (16-bit):
ADC, ADD, ADR, AND, ASR, B, BIC, BKPT, BLX, BX, CMN, CMP, CPS, EOR, LDM, LDR, LDRB, LDRH, LDRSB, LDRSH, LSL, LSR, MOV, MUL, MVN, NOP, ORR, POP, PUSH, REV, REV16, REVSH, ROR, RSB, SBC, SEV, STM, STMIA, STR, STRB, STRH, SUB, SVC, SXTB, SXTH, TST, UXTB, UXTH, WFE, WFI, YIELD (51 instrs as listed here)
and the following Thumb2 instrs (32-bit):
BL, DMB, DSB, ISB, MRS, MSR (6 instrs)
The M23 but not the M0+ has the following Thumb1 (16-bit):
CBNZ, CBZ
and the following Thumb2 (32-bit):
SDIV, UDIV
Only the M23 and M33 have the following TrustZone instrs:
16-bit: BLXNS, BXNS 32-bit: SG, TT, TTT, TTA, TTAT
ARM Cortex M4 floating-point
"The FPU fully supports single-precision add, subtract, multiply, divide, multiply and accumulate, and square root operations. It also provides conversions between fixed-point and floating-point data formats, and floating-point constant instructions."
Instructions:
- arithmetic: VADD, VSUB, VMUL, VDIV, VNEG, VABS, VSQRT
- comparisons: VCMP, VCMPE (compare; the E variant raises an exception if either operand is any kind of NaN, including a quiet NaN; otherwise an exception is raised only if either operand is a signaling NaN)
- conversions: VCVT, VCVTR (convert; 'R' means to use custom rounding mode instead of rounding towards 0 -- only applicable to conversion from float to int)
- loads and stores and movs: VLDR, VSTR, VMOV, multiple load/store: VLDM, VSTM, special register load/store: VMRS (load from special), VMSR (store to special)
- multiply variants: VMLA (then accumulate), VMLS (then subtract), VNMLA (then negate, then accumulate), VNMLS (then subtract, then negate), VNMUL (negate then multiply)
- stack ops: VPOP, VPUSH
" 7.2.5 Complete implementation of the IEEE 754 standard
The Cortex-M4 FPU supports fused MAC operations as described in the IEEE standard. For complete implementation of the IEEE 754-2008 standard, floating-point functionality must be augmented with library functions. The Cortex-M4 floating point instruction set does not support all operations defined in the IEEE 754-2008 standard. Unsupported operations include, but are not limited to the following:
Remainder.
Round floating-point number to integer-valued floating-point number.
Binary-to-decimal conversions.
Decimal-to-binary conversions.
Direct comparison of single-precision and double-precision values."
" The FPU sets the cumulative exception status flag in the FPSCR register as required for each instruction, in accordance with the FPv4 architecture. The FPU does not support exception traps. The processor also has six output pins, FPIXC, FPUFC, FPOFC, FPDZC, FPIDC, and FPIOC, that each reflect the status of one of the cumulative exception flags. See the Cortex®‑M?4 Integration and Implementation Manual for a description of these outputs. "
ARM interrupts
ARM provides nested vectored interrupts. "Nested" because if another interrupt occurs while the first one is executing, the currently executing interrupt may itself be interrupted. "Vectored" because each interrupt causes the code at the corresponding interrupt handler entry point to be executed [2] (as opposed to the alternative, "polled" interrupts, in which, during an interrupt, the system calls each handler out of a large group of handlers until one handler 'claims' the interrupt).
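The contrast between vectored and polled dispatch can be sketched in a few lines of Python. Here the vector table is a dictionary from IRQ number to handler, and a polled handler returns None when it does not claim the interrupt; both conventions are assumptions of this sketch rather than anything ARM specifies.

```python
def dispatch_vectored(vector_table, irq):
    """Vectored: one table lookup goes straight to the handler for this IRQ."""
    return vector_table[irq]()

def dispatch_polled(handlers, irq):
    """Polled: ask each handler in turn until one claims the interrupt."""
    for handler in handlers:
        result = handler(irq)
        if result is not None:
            return result
    return None  # nobody claimed it
```

With a table such as `{3: handle_uart}` (a hypothetical handler), the vectored path costs one lookup regardless of how many handlers exist, whereas the polled path may call every handler before finding the right one.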
ARM has some built-in interrupt types (these have IRQ numbers less than 0); see [3] for a list of built-in interrupt types [4]. It also supports vendor specific interrupt types (which have non-negative IRQ numbers) "typically for devices like UART/I²C/USB/etc" [5].
Interrupts have priority levels. In ARM, a lower priority number is more urgent (that is, an interrupt may interrupt another currently executing interrupt if the currently executing interrupt has a higher priority number). There must be at least 4 priority levels available on any Cortex M0 or M0+ device; more on Cortex M3/M4/M7 devices. There are also 'subpriority' levels, which are used to determine which interrupt goes first when multiple interrupts are pending.
You can temporarily disable all interrupts. Disabling interrupts is sometimes called "masking" them [6], but other times a distinction is made between disabling (where the interrupt is never emitted or is completely ignored) and masking (where the interrupt is not delivered but is held pending for later) [7].
On Cortex M3/M4/M7, but not on M0/M0+, you can also temporarily disable all interrupts higher than a certain priority (that is, LESS urgent than the given priority).
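Putting the priority and masking rules together, a pending interrupt gets to run only if it is more urgent than whatever is currently executing and is not masked. The Python sketch below is a simplified model: `mask_all` and `mask_threshold` are names invented here (they correspond roughly to what the Cortex-M3 documentation calls PRIMASK and BASEPRI), the threshold value itself is assumed to be masked too, a threshold of 0 is assumed to mean "no threshold", and subpriorities are ignored.

```python
def can_preempt(pending_priority, active_priority=None,
                mask_all=False, mask_threshold=0):
    """Decide whether a pending interrupt may run now (lower number = more urgent)."""
    # Fixed negative priorities (Reset, NMI, HardFault) cannot be masked.
    if pending_priority >= 0:
        if mask_all:
            return False
        # Assumption: priorities numerically >= the threshold are held back.
        if mask_threshold != 0 and pending_priority >= mask_threshold:
            return False
    # Preempt only something strictly less urgent than the pending interrupt.
    if active_priority is not None and pending_priority >= active_priority:
        return False
    return True
```

So `can_preempt(1, active_priority=2)` is true, while an interrupt of equal priority has to wait until the active one finishes.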
For each interrupt, you can also:
- enable/disable it (individually)
- change its priority (except Reset (-3), NMI (-2) and HardFault (-1), which "have a fixed (negative) priority and cannot be disabled" [8]). Reset is "invoked on power up or a warm reset". NMI: "A Non Maskable Interrupt (NMI) can be signalled by a peripheral or triggered by software". "A HardFault is an exception that occurs because of an error during exception processing, or because an exception cannot be managed by any other exception mechanism" [9].
- mark it as pending, or clear this mark
The other built-in interrupts are: [10]
- (not M0) MemManage "a memory protection related fault"
- (not M0) BusFault "a memory related fault for an instruction or data memory transaction. This might be from an error detected on a bus in the memory system."
- (not M0) UsageFault "an undefined instruction, an illegal unaligned access, invalid state on instruction execution, an error on exception return", and maybe "unaligned address on word and halfword memory access", and maybe "division by zero."
- SVCall "supervisor call (SVC) is an exception that is triggered by the SVC instruction. In an OS environment, applications can use SVC instructions to access OS kernel functions and device drivers."
- PendSV "an interrupt-driven request for system-level service. In an OS environment, use PendSV for context switching when no other exception is active."
- SysTick (optional on M0) (IRQ #-1) "an exception the system timer generates when it reaches zero. Software can also generate a SysTick exception. In an OS environment, the processor can use this exception as system tick."
- On M3/M4/M7 there is also a negative IRQ reserved for debugging [11]
Cortex-M0/M0+/M1 can have up to 32 interrupts, M3/M4/M7/M23 can have up to 240, M33 up to 480. [12].
Links:
ARM history
" [Acorn] always had a reputation for weirdness, and I suppose this was the ultimate. While everyone else went 16-bit (or disappeared altogether), Acorn just kept selling variations on the same 8-bit theme. Then, all of a sudden, in 1987, they launched a machine known as Archimedes. It was based on an entirely new processor; the Acorn Risc Machine. This was fully 32-bit data, although it only boasted a 26 bit (equivalent) address bus. It was the first RISC-based home micro in production.
" The ARM chip owed a lot to the experience of its designers with the 6502 upon which its instruction set was based, but it introduced a couple of new ideas. First it had four processor modes with 16 general-purpose registers available. Some of the 16 were different in each mode. It also introduced conditional execution of instructions, avoiding many jumps in code, and helping increase the efficiency of the pipeline. The other interesting feature was its ability to use a barrel-shifter on one of the operands of an instruction with no performance penalty. In other words, a multiply and add can be done in one instruction. This is the kind of technology that Intel are hyping with their 'MMX' Pentiums. Yes, I know MMX is more than that, but it does say something...
Variants
The first ARM chip was available as a second processor for Acorn's 8-bit micros. The ARM chip in the Archimedes was an ARM 2 which ran at 8 MHz. The ARM 3 was installed in several later machines running at speeds up to 25 MHz. Its greatest performance boost came from a simple onboard 4k cache. It was after this that ARM Ltd was spun off from Acorn and started licensing the designs. They came up with the ARM 6 macrocell (what happened to 4 and 5?) and turned it into the ARM 610 processor used in the first Risc PCs. It was coupled with an 8k cache, full 32-bit addressing mode, better cache algorithms and 30 MHz clock. The ARM 710 soon followed with a few performance tweaks, running at 40 MHz, and the ARM 810 was announced.
Then along came Digital. I'm not sure who initiated the pairing, but somehow Digital Equipment Corp, makers of the blindingly fast Alpha processors, got hold of the ARM designs, and built a processor using their semiconductor expertise. The result was the StrongARM; a processor that functionally is little different from the ARM 710 except that it is (internally) clocked at 202 MHz. Oh yes, it also has two 8k caches; one for instructions and one for data. Rumour has it that the interpreter of RiscOS's built-in BASIC fits neatly into the instruction cache. If this is the case, it explains why interpreted BBC BASIC V is so flippin' fast. The other thing, and this is the cause of most of the few software problems, is that the length of the pipeline has been increased, so that self-modifying code which relies on knowing the length of the pipeline to calculate the PC gets in a real mess."
-- http://www.landley.net/history/mirror/acorn/processors.html
ARM opinions
" I'll just cover those things I really like about ARM in general :)
1. load/store multiple of any arbitrary register combination Yes, thats right. One can do "STM r0, {r0-r15}" if they want to and save every register. LDM is the same.
2. Address updates available for every memory instruction Reusing STM from above, "STM r0!, {r1-r15}", will write the final address to r0 (I've forgotten the exact specifics here). Pretty much every memory op supports this
3. The stack is my territory, and mine alone The processor will never touch the stack. I don't have to deal with processor built stack frames. This greatly simplifies some things
4. Pre-shifts available on all basic ALU instructions (Where "basic ALU" is defined as pretty much everything except MUL. ARM doesn't have division)
This is an incredibly useful feature, though it does make the instructions occasionally look like huge monstrosities! It also means that ARM's ADD instruction can double for most architecture's LEA.
5. Three operand instruction set Well, that one should be reasonably clear ;)
6. No mode flags (or those which exist are implicit) For example, while there are both the ARM and Thumb instruction sets, they're designated by the least significant bit of the branch target address. The BX/BLX instructions automatically move this bit into the current program status register (CPSR)
7. PC is in the register file Yes, you can do "MOV pc, lr" (this is the traditional way to return), and can use the ALU operations for relative branches.
(Caveat: On machines prior to ARMv7 [ARM11 and older processors], these instructions will not transition to/from Thumb mode and the result of loading the least significant bits of PC is Unpredictable. ARMv7 makes them interwork properly with Thumb)
(By the way - when ARM say Unpredictable they mean "May raise a trap, may do something completely unrelated, may be a NOP - behaviour is undefined except that it cannot cause a security hole" and be redefined by future revisions) " -- http://forum.6502.org/viewtopic.php?t=1594
"ARM wasn't really a pure RISC from the beginning (e.g. multicycle instructions like LDM/STM, pre/post-increment addressing modes, built-in shifts)..." [13]
" Yes, although the new instruction set in ARMv8 removes several of the things that made programming in 'classic' ARM assembly such fun on the Acorn, such as the free barrel shifter on most arithmetic ops, conditional execution on all instructions and fast multiple loads/stores with groups of registers. These have gone for various reasons; the fully-flexible barrel shifter is awkward at high frequencies with deep pipelines, the conditional execution flags became a waste of opcode space as branch prediction improved and the load/store multiples required microcode on modern implementations and so increased complexity. " [14]
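To make the "pre-shift" point in the opinions above concrete, here is a small Python model of a classic 32-bit ARM data-processing instruction whose second operand goes through the barrel shifter, e.g. `ADD r0, r1, r2, LSL #2`. It is a sketch only: the helper names `shift_operand` and `add_shifted` are made up for illustration, shift amounts are assumed to be in the range 0..31, and special encodings such as RRX or shifts by 32 are not modelled.

```python
MASK32 = 0xFFFFFFFF


def shift_operand(value, kind, amount):
    """Apply an LSL/LSR/ASR/ROR shift to a 32-bit value (amount 0..31)."""
    value &= MASK32
    if kind == "LSL":
        return (value << amount) & MASK32
    if kind == "LSR":
        return value >> amount
    if kind == "ASR":
        # Reinterpret as signed so Python's >> replicates the sign bit.
        signed = value - (1 << 32) if value & 0x80000000 else value
        return (signed >> amount) & MASK32
    if kind == "ROR":
        amount %= 32
        return ((value >> amount) | (value << (32 - amount))) & MASK32
    raise ValueError("unknown shift kind: " + kind)


def add_shifted(rn, rm, kind="LSL", amount=0):
    """Model of ADD Rd, Rn, Rm, <shift> #amount with 32-bit wraparound."""
    return (rn + shift_operand(rm, kind, amount)) & MASK32
```

This is why ADD can stand in for LEA: `add_shifted(base, index, "LSL", 2)` computes the address of element `index` in an array of 4-byte words starting at `base`.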
ARM: Links
ARM: summary
It seems like the 'core' instruction set is indeed the set found in Cortex-M0, M0+, and M1. This is a subset of the 16-bit Thumb-2 set, but with a few 32-bit instructions too.
Those instructions are: MOV, arithmetic (ADD, ADC, SUB, SBC, RSB, MUL), bitwise arithmetic (LSL, LSR, ASR, AND, ORR, EOR, ROR, BIC, MVN), byte reversals (REV, REV16, REVSH), get/set special registers (ADR, MRS, MSR), comparisons (CMP, CMN, TST), branching (B, BL), load/stores with immediate, register offset, PC, SP offset, and multiple registers, push/pop, extension (SXTH, SXTB, UXTH, UXTB), misc control (SVC, NOP), multiprocessing (YIELD, WFE, WFI, SEV, DMB, DSB), and a few other misc instructions (ISB and some others).
When we get to the Cortex M3 we add 32-bit instructions for bit fields (BFC/BFI, SBFX, UBFX), multiprocessing (LDREX, STREX, CLREX), bitwise arithmetic (CLZ, MOVT, ORN, RRX, saturating versions of things), comparisons (TEQ), various loads and stores (with postindexing and various widths), arithmetic (division, multiply-accumulate (add/subtract) operations with various widths), branch tables (TBB, TBH), and some other misc instructions (DBG, PLD, PLI).
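As an illustration of what the bit-field instructions mentioned above compute, the following Python sketch models UBFX, SBFX, BFC and BFI on 32-bit values. The function names simply mirror the mnemonics; the argument order is a simplification chosen for this sketch, and inputs are assumed to be valid (register values within 32 bits, `lsb + width <= 32`, `width >= 1`).

```python
MASK32 = 0xFFFFFFFF


def _field_mask(width):
    """Mask with the low 'width' bits set."""
    return (1 << width) - 1


def ubfx(rn, lsb, width):
    """Unsigned bit-field extract: zero-extended field from rn."""
    return (rn >> lsb) & _field_mask(width)


def sbfx(rn, lsb, width):
    """Signed bit-field extract: field from rn, sign-extended to 32 bits."""
    field = ubfx(rn, lsb, width)
    sign = 1 << (width - 1)
    return ((field ^ sign) - sign) & MASK32


def bfc(rd, lsb, width):
    """Bit-field clear: zero 'width' bits of rd starting at 'lsb'."""
    return rd & ~(_field_mask(width) << lsb) & MASK32


def bfi(rd, rn, lsb, width):
    """Bit-field insert: copy the low 'width' bits of rn into rd at 'lsb'."""
    return bfc(rd, lsb, width) | ((rn & _field_mask(width)) << lsb)
```

Without these instructions (as on the M0 subset) the same results need a shift-and-mask sequence of two or three instructions.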
ARM64 instruction list
General instructions:
- ADC Add with Carry
- ADCS Add with Carry, setting flags
- ADD (extended register) Add (extended register)
- ADD (immediate) Add (immediate)
- ADD (shifted register) Add (shifted register)
- ADDS (extended register) Add (extended register), setting flags
- ADDS (immediate) Add (immediate), setting flags
- ADDS (shifted register) Add (shifted register), setting flags
- ADR Form PC-relative address
- ADRL pseudo-instruction Load a PC-relative address into a register
- ADRP Form PC-relative address to 4KB page
- AND (immediate) Bitwise AND (immediate)
- AND (shifted register) Bitwise AND (shifted register)
- ANDS (immediate) Bitwise AND (immediate), setting flags
- ANDS (shifted register) Bitwise AND (shifted register), setting flags
- ASR (register) Arithmetic Shift Right (register)
- ASR (immediate) Arithmetic Shift Right (immediate)
- ASRV Arithmetic Shift Right Variable
- AT Address Translate
- AUTDA, AUTDZA Authenticate Data address, using key A
- AUTDB, AUTDZB Authenticate Data address, using key B
- AUTIA, AUTIZA, AUTIA1716, AUTIASP, AUTIAZ Authenticate Instruction address, using key A
- AUTIB, AUTIZB, AUTIB1716, AUTIBSP, AUTIBZ Authenticate Instruction address, using key B
- B.cond Branch conditionally
- B Branch
- BFC Bitfield Clear, leaving other bits unchanged
- BFI Bitfield Insert
- BFM Bitfield Move
- BFXIL Bitfield extract and insert at low end
- BIC (shifted register) Bitwise Bit Clear (shifted register)
- BICS (shifted register) Bitwise Bit Clear (shifted register), setting flags
- BL Branch with Link
- BLR Branch with Link to Register
- BLRAA, BLRAAZ, BLRAB, BLRABZ Branch with Link to Register, with pointer authentication
- BR Branch to Register
- BRAA, BRAAZ, BRAB, BRABZ Branch to Register, with pointer authentication
- BRK Breakpoint instruction
- CBNZ Compare and Branch on Nonzero
- CBZ Compare and Branch on Zero
- CCMN (immediate) Conditional Compare Negative (immediate)
- CCMN (register) Conditional Compare Negative (register)
- CCMP (immediate) Conditional Compare (immediate)
- CCMP (register) Conditional Compare (register)
- CINC Conditional Increment
- CINV Conditional Invert
- CLREX Clear Exclusive
- CLS Count leading sign bits
- CLZ Count leading zero bits
- CMN (extended register) Compare Negative (extended register)
- CMN (immediate) Compare Negative (immediate)
- CMN (shifted register) Compare Negative (shifted register)
- CMP (extended register) Compare (extended register)
- CMP (immediate) Compare (immediate)
- CMP (shifted register) Compare (shifted register)
- CNEG Conditional Negate
- CRC32B, CRC32H, CRC32W, CRC32X CRC32 checksum performs a cyclic redundancy check (CRC) calculation on a value held in a general-purpose register
- CRC32CB, CRC32CH, CRC32CW, CRC32CX CRC32C checksum performs a cyclic redundancy check (CRC) calculation on a value held in a general-purpose register
- CSEL Conditional Select
- CSET Conditional Set
- CSETM Conditional Set Mask
- CSINC Conditional Select Increment
- CSINV Conditional Select Invert
- CSNEG Conditional Select Negation
- DC Data Cache operation
- DCPS1 Debug Change PE State to EL1
- DCPS2 Debug Change PE State to EL2
- DCPS3 Debug Change PE State to EL3
- DMB Data Memory Barrier
- DRPS Debug restore process state
- DSB Data Synchronization Barrier
- EON (shifted register) Bitwise Exclusive OR NOT (shifted register)
- EOR (immediate) Bitwise Exclusive OR (immediate)
- EOR (shifted register) Bitwise Exclusive OR (shifted register)
- ERET Returns from an exception
- ERETAA, ERETAB Exception Return, with pointer authentication
- ESB Error Synchronization Barrier
- EXTR Extract register
- HINT Hint instruction
- HLT Halt instruction
- HVC Hypervisor call to allow OS code to call the Hypervisor
- IC Instruction Cache operation
- ISB Instruction Synchronization Barrier
- LSL (register) Logical Shift Left (register)
- LSL (immediate) Logical Shift Left (immediate)
- LSLV Logical Shift Left Variable
- LSR (register) Logical Shift Right (register)
- LSR (immediate) Logical Shift Right (immediate)
- LSRV Logical Shift Right Variable
- MADD Multiply-Add
- MNEG Multiply-Negate
- MOV (to or from SP) Move between register and stack pointer
- MOV (inverted wide immediate) Move (inverted wide immediate)
- MOV (wide immediate) Move (wide immediate)
- MOV (bitmask immediate) Move (bitmask immediate)
- MOV (register) Move (register)
- MOVK Move wide with keep
- MOVL pseudo-instruction Load a register with either a 32-bit or 64-bit immediate value or any address
- MOVN Move wide with NOT
- MOVZ Move wide with zero
- MRS Move System Register
- MSR (immediate) Move immediate value to Special Register
- MSR (register) Move general-purpose register to System Register
- MSUB Multiply-Subtract
- MUL Multiply
- MVN Bitwise NOT
- NEG (shifted register) Negate (shifted register)
- NEGS Negate, setting flags
- NGC Negate with Carry
- NGCS Negate with Carry, setting flags
- NOP No Operation
- ORN (shifted register) Bitwise OR NOT (shifted register)
- ORR (immediate) Bitwise OR (immediate)
- ORR (shifted register) Bitwise OR (shifted register)
- PACDA, PACDZA Pointer Authentication Code for Data address, using key A
- PACDB, PACDZB Pointer Authentication Code for Data address, using key B
- PACGA Pointer Authentication Code, using Generic key
- PACIA, PACIZA, PACIA
- PACIB, PACIZB, PACIB
- PSB Profiling Synchronization Barrier
- RBIT Reverse Bits
- RET Return from subroutine
- RETAA, RETAB Return from subroutine, with pointer authentication
- R
- REV32 Reverse bytes in 32-bit words
- REV64 Reverse Bytes
- REV Reverse Bytes
- ROR (immediate) Rotate right (immediate)
- ROR (register) Rotate Right (register)
- RORV Rotate Right Variable
- SBC Subtract with Carry
- SBCS Subtract with Carry, setting flags
- SBFIZ Signed Bitfield Insert in Zero
- SBFM Signed Bitfield Move
- SBFX Signed Bitfield Extract
- SDIV Signed Divide
- SEV Send Event
- SEVL Send Event Local
- SMADDL Signed Multiply-Add Long
- SMC Supervisor call to allow OS or Hypervisor code to call the Secure Monitor
- SMNEGL Signed Multiply-Negate Long
- SMSUBL Signed Multiply-Subtract Long
- SMULH Signed Multiply High
- SMULL Signed Multiply Long
- SUB (extended register) Subtract (extended register)
- SUB (immediate) Subtract (immediate)
- SUB (shifted register) Subtract (shifted register)
- SUBS (extended register) Subtract (extended register), setting flags
- SUBS (immediate) Subtract (immediate), setting flags
- SUBS (shifted register) Subtract (shifted register), setting flags
- SVC Supervisor call to allow application code to call the OS
- SXTB Signed Extend Byte
- SXTH Sign Extend Halfword
- SXTW Sign Extend Word
- SYS System instruction
- SYSL System instruction with result
- TBNZ Test bit and Branch if Nonzero
- TBZ Test bit and Branch if Zero
- TLBI TLB Invalidate operation
- TST (immediate) Test bits (immediate), setting the condition flags and discarding the result
- TST (shifted register) Test (shifted register)
- UBFIZ Unsigned Bitfield Insert in Zero
- UBFM Unsigned Bitfield Move
- UBFX Unsigned Bitfield Extract
- UDIV Unsigned Divide
- UMADDL Unsigned Multiply-Add Long
- UMNEGL Unsigned Multiply-Negate Long
- UMSUBL Unsigned Multiply-Subtract Long
- UMULH Unsigned Multiply High
- UMULL Unsigned Multiply Long
- UXTB Unsigned Extend Byte
- UXTH Unsigned Extend Halfword
- WFE Wait For Event
- WFI Wait For Interrupt
- XPACD, XPACI, XPACLRI Strip Pointer Authentication Code
- YIELD YIELD
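Several entries in the list above (CSET, CSETM, CINC, CINV, CNEG) are aliases of the conditional-select instructions CSINC, CSINV and CSNEG with the condition inverted. The Python sketch below models that relationship on 64-bit values. It is illustrative only: `cond` is a plain boolean standing in for an already evaluated condition code, the zero register is represented by the literal 0, and the function names just mirror the mnemonics.

```python
MASK64 = (1 << 64) - 1


def csel(cond, rn, rm):
    """Conditional select: rn if cond holds, else rm."""
    return rn if cond else rm


def csinc(cond, rn, rm):
    """Conditional select increment: rn if cond holds, else rm + 1."""
    return rn if cond else (rm + 1) & MASK64


def csinv(cond, rn, rm):
    """Conditional select invert: rn if cond holds, else NOT rm."""
    return rn if cond else ~rm & MASK64


def csneg(cond, rn, rm):
    """Conditional select negation: rn if cond holds, else -rm."""
    return rn if cond else -rm & MASK64


# Aliases: each one is a select instruction with the condition inverted.
def cset(cond):
    return csinc(not cond, 0, 0)      # 1 if cond else 0


def csetm(cond):
    return csinv(not cond, 0, 0)      # all ones if cond else 0


def cinc(cond, rn):
    return csinc(not cond, rn, rn)    # rn + 1 if cond else rn


def cinv(cond, rn):
    return csinv(not cond, rn, rn)    # NOT rn if cond else rn


def cneg(cond, rn):
    return csneg(not cond, rn, rn)    # -rn if cond else rn
```

These instructions cover much of what per-instruction conditional execution was used for in classic ARM code, without spending condition bits in every opcode.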
Data transfer instructions:
- CASA, CASAL, CAS, CASL Compare and Swap word or doubleword in memory
- CASAB, CASALB, CASB, CASLB Compare and Swap byte in memory
- CASAH, CASALH, CASH, CASLH Compare and Swap halfword in memory
- CASPA, CASPAL, CASP, CASPL Compare and Swap Pair of words or doublewords in memory
- LDADDA, LDADDAL, LDADD, LDADDL Atomic add on word or doubleword in memory
- LDADDAB, LDADDALB, LDADDB, LDADDLB Atomic add on byte in memory
- LDADDAH, LDADDALH, LDADDH, LDADDLH Atomic add on halfword in memory
- LDAPR Load-Acquire RCpc Register
- LDAPRB Load-Acquire RCpc Register Byte
- LDAPRH Load-Acquire RCpc Register Halfword
- LDAR Load-Acquire Register
- LDARB Load-Acquire Register Byte
- LDARH Load-Acquire Register Halfword
- LDAXP Load-Acquire Exclusive Pair of Registers
- LDAXR Load-Acquire Exclusive Register
- LDAXRB Load-Acquire Exclusive Register Byte
- LDAXRH Load-Acquire Exclusive Register Halfword
- LDCLRA, LDCLRAL, LDCLR, LDCLRL Atomic bit clear on word or doubleword in memory
- LDCLRAB, LDCLRALB, LDCLRB, LDCLRLB Atomic bit clear on byte in memory
- LDCLRAH, LDCLRALH, LDCLRH, LDCLRLH Atomic bit clear on halfword in memory
- LDEORA, LDEORAL, LDEOR, LDEORL Atomic exclusive OR on word or doubleword in memory
- LDEORAB, LDEORALB, LDEORB, LDEORLB Atomic exclusive OR on byte in memory
- LDEORAH, LDEORALH, LDEORH, LDEORLH Atomic exclusive OR on halfword in memory
- LDLAR Load LOAcquire Register
- LDLARB Load LOAcquire Register Byte
- LDLARH Load LOAcquire Register Halfword
- LDNP Load Pair of Registers, with non-temporal hint
- LDP Load Pair of Registers
- LDPSW Load Pair of Registers Signed Word
- LDR (immediate) Load Register (immediate)
- LDR (literal) Load Register (literal)
- LDR pseudo-instruction Load a register with either a 32-bit or 64-bit immediate value or any address
- LDR (register) Load Register (register)
- LDRAA, LDRAB Load Register, with pointer authentication
- LDRB (immediate) Load Register Byte (immediate)
- LDRB (register) Load Register Byte (register)
- LDRH (immediate) Load Register Halfword (immediate)
- LDRH (register) Load Register Halfword (register)
- LDRSB (immediate) Load Register Signed Byte (immediate)
- LDRSB (register) Load Register Signed Byte (register)
- LDRSH (immediate) Load Register Signed Halfword (immediate)
- LDRSH (register) Load Register Signed Halfword (register)
- LDRSW (immediate) Load Register Signed Word (immediate)
- LDRSW (literal) Load Register Signed Word (literal)
- LDRSW (register) Load Register Signed Word (register)
- LDSETA, LDSETAL, LDSET, LDSETL Atomic bit set on word or doubleword in memory
- LDSETAB, LDSETALB, LDSETB, LDSETLB Atomic bit set on byte in memory
- LDSETAH, LDSETALH, LDSETH, LDSETLH Atomic bit set on halfword in memory
- LDSMAXA, LDSMAXAL, LDSMAX, LDSMAXL Atomic signed maximum on word or doubleword in memory
- LDSMAXAB, LDSMAXALB, LDSMAXB, LDSMAXLB Atomic signed maximum on byte in memory
- LDSMAXAH, LDSMAXALH, LDSMAXH, LDSMAXLH Atomic signed maximum on halfword in memory
- LDSMINA, LDSMINAL, LDSMIN, LDSMINL Atomic signed minimum on word or doubleword in memory
- LDSMINAB, LDSMINALB, LDSMINB, LDSMINLB Atomic signed minimum on byte in memory
- LDSMINAH, LDSMINALH, LDSMINH, LDSMINLH Atomic signed minimum on halfword in memory
- LDTR Load Register (unprivileged)
- LDTRB Load Register Byte (unprivileged)
- LDTRH Load Register Halfword (unprivileged)
- LDTRSB Load Register Signed Byte (unprivileged)
- LDTRSH Load Register Signed Halfword (unprivileged)
- LDTRSW Load Register Signed Word (unprivileged)
- LDUMAXA, LDUMAXAL, LDUMAX, LDUMAXL Atomic unsigned maximum on word or doubleword in memory
- LDUMAXAB, LDUMAXALB, LDUMAXB, LDUMAXLB Atomic unsigned maximum on byte in memory
- LDUMAXAH, LDUMAXALH, LDUMAXH, LDUMAXLH Atomic unsigned maximum on halfword in memory
- LDUMINA, LDUMINAL, LDUMIN, LDUMINL Atomic unsigned minimum on word or doubleword in memory
- LDUMINAB, LDUMINALB, LDUMINB, LDUMINLB Atomic unsigned minimum on byte in memory
- LDUMINAH, LDUMINALH, LDUMINH, LDUMINLH Atomic unsigned minimum on halfword in memory
- LDUR Load Register (unscaled)
- LDURB Load Register Byte (unscaled)
- LDURH Load Register Halfword (unscaled)
- LDURSB Load Register Signed Byte (unscaled)
- LDURSH Load Register Signed Halfword (unscaled)
- LDURSW Load Register Signed Word (unscaled)
- LDXP Load Exclusive Pair of Registers
- LDXR Load Exclusive Register
- LDXRB Load Exclusive Register Byte
- LDXRH Load Exclusive Register Halfword
- PRFM (immediate) Prefetch Memory (immediate)
- PRFM (literal) Prefetch Memory (literal)
- PRFM (register) Prefetch Memory (register)
- PRFUM (unscaled offset) Prefetch Memory (unscaled offset)
- STADD, STADDL Atomic add on word or doubleword in memory, without return
- STADDB, STADDLB Atomic add on byte in memory, without return
- STADDH, STADDLH Atomic add on halfword in memory, without return
- STCLR, STCLRL Atomic bit clear on word or doubleword in memory, without return
- STCLRB, STCLRLB Atomic bit clear on byte in memory, without return
- STCLRH, STCLRLH Atomic bit clear on halfword in memory, without return
- STEOR, STEORL Atomic exclusive OR on word or doubleword in memory, without return
- STEORB, STEORLB Atomic exclusive OR on byte in memory, without return
- STEORH, STEORLH Atomic exclusive OR on halfword in memory, without return
- STLLR Store LORelease Register
- STLLRB Store LORelease Register Byte
- STLLRH Store LORelease Register Halfword
- STLR Store-Release Register
- STLRB Store-Release Register Byte
- STLRH Store-Release Register Halfword
- STLXP Store-Release Exclusive Pair of registers
- STLXR Store-Release Exclusive Register
- STLXRB Store-Release Exclusive Register Byte
- STLXRH Store-Release Exclusive Register Halfword
- STNP Store Pair of Registers, with non-temporal hint
- STP Store Pair of Registers
- STR (immediate) Store Register (immediate)
- STR (register) Store Register (register)
- STRB (immediate) Store Register Byte (immediate)
- STRB (register) Store Register Byte (register)
- STRH (immediate) Store Register Halfword (immediate)
- STRH (register) Store Register Halfword (register)
- STSET, STSETL Atomic bit set on word or doubleword in memory, without return
- STSETB, STSETLB Atomic bit set on byte in memory, without return
- STSETH, STSETLH Atomic bit set on halfword in memory, without return
- STSMAX, STSMAXL Atomic signed maximum on word or doubleword in memory, without return
- STSMAXB, STSMAXLB Atomic signed maximum on byte in memory, without return
- STSMAXH, STSMAXLH Atomic signed maximum on halfword in memory, without return
- STSMIN, STSMINL Atomic signed minimum on word or doubleword in memory, without return
- STSMINB, STSMINLB Atomic signed minimum on byte in memory, without return
- STSMINH, STSMINLH Atomic signed minimum on halfword in memory, without return
- STTR Store Register (unprivileged)
- STTRB Store Register Byte (unprivileged)
- STTRH Store Register Halfword (unprivileged)
- STUMAX, STUMAXL Atomic unsigned maximum on word or doubleword in memory, without return
- STUMAXB, STUMAXLB Atomic unsigned maximum on byte in memory, without return
- STUMAXH, STUMAXLH Atomic unsigned maximum on halfword in memory, without return
- STUMIN, STUMINL Atomic unsigned minimum on word or doubleword in memory, without return
- STUMINB, STUMINLB Atomic unsigned minimum on byte in memory, without return
- STUMINH, STUMINLH Atomic unsigned minimum on halfword in memory, without return
- STUR Store Register (unscaled)
- STURB Store Register Byte (unscaled)
- STURH Store Register Halfword (unscaled)
- STXP Store Exclusive Pair of registers
- STXR Store Exclusive Register
- STXRB Store Exclusive Register Byte
- STXRH Store Exclusive Register Halfword
- SWPA, SWPAL, SWP, SWPL Swap word or doubleword in memory
- SWPAB, SWPALB, SWPB, SWPLB Swap byte in memory
- SWPAH, SWPALH, SWPH, SWPLH Swap halfword in memory
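The atomic read-modify-write entries in the list above (LDADD, LDCLR, LDEOR, LDSET, LDSMAX, LDSMIN, LDUMAX, LDUMIN, SWP, CAS) all follow one pattern: load the old value, combine it with an operand, store the result, and hand back the old value. The ST* forms do the same but discard the old value. The Python sketch below models only the value semantics for 32-bit words. It is not atomic, it ignores the acquire/release ordering implied by the A/L/AL suffixes, and the names `OPS`, `ld_op`, `st_op`, `swp` and `cas`, plus the use of a dict as memory, are inventions of this sketch.

```python
MASK32 = 0xFFFFFFFF


def _signed(value):
    """Reinterpret a 32-bit pattern as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


# How each operation combines the old memory value with the operand.
OPS = {
    "ADD": lambda old, v: (old + v) & MASK32,
    "CLR": lambda old, v: old & ~v & MASK32,   # clear the bits set in v
    "EOR": lambda old, v: old ^ v,
    "SET": lambda old, v: old | v,
    "SMAX": lambda old, v: old if _signed(old) >= _signed(v) else v,
    "SMIN": lambda old, v: old if _signed(old) <= _signed(v) else v,
    "UMAX": max,
    "UMIN": min,
}


def ld_op(mem, addr, op, value):
    """LD<op>: update mem[addr] and return the value it held before."""
    old = mem[addr]
    mem[addr] = OPS[op](old, value & MASK32)
    return old


def st_op(mem, addr, op, value):
    """ST<op>: same update as LD<op>, but the old value is not returned."""
    ld_op(mem, addr, op, value)


def swp(mem, addr, value):
    """SWP: store value and return the previous contents."""
    old = mem[addr]
    mem[addr] = value & MASK32
    return old


def cas(mem, addr, expected, new):
    """CAS: store 'new' only if memory equals 'expected'; return old value."""
    old = mem[addr]
    if old == expected:
        mem[addr] = new & MASK32
    return old
```

A caller checks whether a `cas` succeeded by comparing the returned value with `expected`, which is the usual building block for lock-free update loops.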
A64 floating-point instructions:
- FABS (scalar) Floating-point Absolute value (scalar)
- FADD (scalar) Floating-point Add (scalar)
- FCCMP Floating-point Conditional quiet Compare (scalar)
- FCCMPE Floating-point Conditional signaling Compare (scalar)
- FCMP Floating-point quiet Compare (scalar)
- FCMPE Floating-point signaling Compare (scalar)
- FCSEL Floating-point Conditional Select (scalar)
- FCVT Floating-point Convert precision (scalar)
- FCVTAS (scalar) Floating-point Convert to Signed integer, rounding to nearest with ties to Away (scalar)
- FCVTAU (scalar) Floating-point Convert to Unsigned integer, rounding to nearest with ties to Away (scalar)
- FCVTMS (scalar) Floating-point Convert to Signed integer, rounding toward Minus infinity (scalar)
- FCVTMU (scalar) Floating-point Convert to Unsigned integer, rounding toward Minus infinity (scalar)
- FCVTNS (scalar) Floating-point Convert to Signed integer, rounding to nearest with ties to even (scalar)
- FCVTNU (scalar) Floating-point Convert to Unsigned integer, rounding to nearest with ties to even (scalar)
- FCVTPS (scalar) Floating-point Convert to Signed integer, rounding toward Plus infinity (scalar)
- FCVTPU (scalar) Floating-point Convert to Unsigned integer, rounding toward Plus infinity (scalar)
- FCVTZS (scalar, fixed-point) Floating-point Convert to Signed fixed-point, rounding toward Zero (scalar)
- FCVTZS (scalar, integer) Floating-point Convert to Signed integer, rounding toward Zero (scalar)
- FCVTZU (scalar, fixed-point) Floating-point Convert to Unsigned fixed-point, rounding toward Zero (scalar)
- FCVTZU (scalar, integer) Floating-point Convert to Unsigned integer, rounding toward Zero (scalar)
- FDIV (scalar) Floating-point Divide (scalar)
- FJCVTZS Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero
- FMADD Floating-point fused Multiply-Add (scalar)
- FMAX (scalar) Floating-point Maximum (scalar)
- FMAXNM (scalar) Floating-point Maximum Number (scalar)
- FMIN (scalar) Floating-point Minimum (scalar)
- FMINNM (scalar) Floating-point Minimum Number (scalar)
- FMOV (register) Floating-point Move register without conversion
- FMOV (general) Floating-point Move to or from general-purpose register without conversion
- FMOV (scalar, immediate) Floating-point move immediate (scalar)
- FMSUB Floating-point Fused Multiply-Subtract (scalar)
- FMUL (scalar) Floating-point Multiply (scalar)
- FNEG (scalar) Floating-point Negate (scalar)
- FNMADD Floating-point Negated fused Multiply-Add (scalar)
- FNMSUB Floating-point Negated fused Multiply-Subtract (scalar)
- FNMUL (scalar) Floating-point Multiply-Negate (scalar)
- FRINTA (scalar) Floating-point Round to Integral, to nearest with ties to Away (scalar)
- FRINTI (scalar) Floating-point Round to Integral, using current rounding mode (scalar)
- FRINTM (scalar) Floating-point Round to Integral, toward Minus infinity (scalar)
- FRINTN (scalar) Floating-point Round to Integral, to nearest with ties to even (scalar)
- FRINTP (scalar) Floating-point Round to Integral, toward Plus infinity (scalar)
- FRINTX (scalar) Floating-point Round to Integral exact, using current rounding mode (scalar)
- FRINTZ (scalar) Floating-point Round to Integral, toward Zero (scalar)
- FSQRT (scalar) Floating-point Square Root (scalar)
- FSUB (scalar) Floating-point Subtract (scalar)
- LDNP (SIMD and FP) Load Pair of SIMD and FP registers, with Non-temporal hint
- LDP (SIMD and FP) Load Pair of SIMD and FP registers
- LDR (immediate, SIMD and FP) Load SIMD and FP Register (immediate offset)
- LDR (literal, SIMD and FP) Load SIMD and FP Register (PC-relative literal)
- LDR (register, SIMD and FP) Load SIMD and FP Register (register offset)
- LDUR (SIMD and FP) Load SIMD and FP Register (unscaled offset)
- SCVTF (scalar, fixed-point) Signed fixed-point Convert to Floating-point (scalar)
- SCVTF (scalar, integer) Signed integer Convert to Floating-point (scalar)
- STNP (SIMD and FP) Store Pair of SIMD and FP registers, with Non-temporal hint
- STP (SIMD and FP) Store Pair of SIMD and FP registers
- STR (immediate, SIMD and FP) Store SIMD and FP register (immediate offset)
- STR (register, SIMD and FP) Store SIMD and FP register (register offset)
- STUR (SIMD and FP) Store SIMD and FP register (unscaled offset)
- UCVTF (scalar, fixed-point) Unsigned fixed-point Convert to Floating-point (scalar)
- UCVTF (scalar, integer) Unsigned integer Convert to Floating-point (scalar)
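The FCVT*S entries in the list above differ only in rounding direction: A (nearest, ties away from zero), M (toward minus infinity), N (nearest, ties to even), P (toward plus infinity) and Z (toward zero). The Python sketch below shows the five modes side by side for conversion to a signed integer. The function names are hypothetical, the letter passed as `mode` is just the suffix from the mnemonic, and the handling of NaN (result 0) and out-of-range inputs (saturation) reflects my understanding of the architecture rather than anything stated in these notes.

```python
import math
from fractions import Fraction


def _round(x, mode):
    """Round a finite float to an integer using the given FCVT mode letter."""
    if mode == "A":
        # Nearest, ties away from zero; Fraction keeps the arithmetic exact.
        n = math.floor(abs(Fraction(x)) + Fraction(1, 2))
        return n if x >= 0 else -n
    if mode == "M":
        return math.floor(x)          # toward minus infinity
    if mode == "N":
        return round(x)               # nearest, ties to even
    if mode == "P":
        return math.ceil(x)           # toward plus infinity
    if mode == "Z":
        return math.trunc(x)          # toward zero
    raise ValueError("unknown rounding mode: " + mode)


def fcvt_signed(x, mode, bits=32):
    """Model of FCVT<mode>S: float to signed integer of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if math.isnan(x):
        return 0                      # assumed: NaN converts to zero
    if math.isinf(x):
        return hi if x > 0 else lo    # assumed: infinities saturate
    return max(lo, min(hi, _round(x, mode)))
```

For instance, `[fcvt_signed(2.5, m) for m in "AMNPZ"]` gives `[3, 2, 2, 3, 2]`, and the same call with `-2.5` gives `[-3, -3, -2, -2, -2]`.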
A64 SIMD scalar instructions:
- ABS (scalar) Absolute value (vector)
- ADD (scalar) Add (vector)
- ADDP (scalar) Add Pair of elements (scalar)
- CMEQ (scalar, register) Compare bitwise Equal (vector)
- CMEQ (scalar, zero) Compare bitwise Equal to zero (vector)
- CMGE (scalar, register) Compare signed Greater than or Equal (vector)
- CMGE (scalar, zero) Compare signed Greater than or Equal to zero (vector)
- CMGT (scalar, register) Compare signed Greater than (vector)
- CMGT (scalar, zero) Compare signed Greater than zero (vector)
- CMHI (scalar, register) Compare unsigned Higher (vector)
- CMHS (scalar, register) Compare unsigned Higher or Same (vector)
- CMLE (scalar, zero) Compare signed Less than or Equal to zero (vector)
- CMLT (scalar, zero) Compare signed Less than zero (vector)
- CMTST (scalar) Compare bitwise Test bits nonzero (vector)
- DUP (scalar, element) Duplicate vector element to scalar
- FABD (scalar) Floating-point Absolute Difference (vector)
- FACGE (scalar) Floating-point Absolute Compare Greater than or Equal (vector)
- FACGT (scalar) Floating-point Absolute Compare Greater than (vector)
- FADDP (scalar) Floating-point Add Pair of elements (scalar)
- FCMEQ (scalar, register) Floating-point Compare Equal (vector)
- FCMEQ (scalar, zero) Floating-point Compare Equal to zero (vector)
- FCMGE (scalar, register) Floating-point Compare Greater than or Equal (vector)
- FCMGE (scalar, zero) Floating-point Compare Greater than or Equal to zero (vector)
- FCMGT (scalar, register) Floating-point Compare Greater than (vector)
- FCMGT (scalar, zero) Floating-point Compare Greater than zero (vector)
- FCMLA (scalar, by element) Floating-point Complex Multiply Accumulate (by element)
- FCMLE (scalar, zero) Floating-point Compare Less than or Equal to zero (vector)
- FCMLT (scalar, zero) Floating-point Compare Less than zero (vector)
- FCVTAS (scalar) Floating-point Convert to Signed integer, rounding to nearest with ties to Away (vector)
- FCVTAU (scalar) Floating-point Convert to Unsigned integer, rounding to nearest with ties to Away (vector)
- FCVTMS (scalar) Floating-point Convert to Signed integer, rounding toward Minus infinity (vector)
- FCVTMU (scalar) Floating-point Convert to Unsigned integer, rounding toward Minus infinity (vector)
- FCVTNS (scalar) Floating-point Convert to Signed integer, rounding to nearest with ties to even (vector)
- FCVTNU (scalar) Floating-point Convert to Unsigned integer, rounding to nearest with ties to even (vector)
- FCVTPS (scalar) Floating-point Convert to Signed integer, rounding toward Plus infinity (vector)
- FCVTPU (scalar) Floating-point Convert to Unsigned integer, rounding toward Plus infinity (vector)
- FCVTXN (scalar) Floating-point Convert to lower precision Narrow, rounding to odd (vector)
- FCVTZS (scalar, fixed-point) Floating-point Convert to Signed fixed-point, rounding toward Zero (vector)
- FCVTZS (scalar, integer) Floating-point Convert to Signed integer, rounding toward Zero (vector)
- FCVTZU (scalar, fixed-point) Floating-point Convert to Unsigned fixed-point, rounding toward Zero (vector)
- FCVTZU (scalar, integer) Floating-point Convert to Unsigned integer, rounding toward Zero (vector)
- FMAXNMP (scalar) Floating-point Maximum Number of Pair of elements (scalar)
- FMAXP (scalar) Floating-point Maximum of Pair of elements (scalar)
- FMINNMP (scalar) Floating-point Minimum Number of Pair of elements (scalar)
- FMINP (scalar) Floating-point Minimum of Pair of elements (scalar)
- FMLA (scalar, by element) Floating-point fused Multiply-Add to accumulator (by element)
- FMLS (scalar, by element) Floating-point fused Multiply-Subtract from accumulator (by element)
- FMUL (scalar, by element) Floating-point Multiply (by element)
- FMULX (scalar, by element) Floating-point Multiply extended (by element)
- FMULX (scalar) Floating-point Multiply extended
- FRECPE (scalar) Floating-point Reciprocal Estimate
- FRECPS (scalar) Floating-point Reciprocal Step
- FRSQRTE (scalar) Floating-point Reciprocal Square Root Estimate
- FRSQRTS (scalar) Floating-point Reciprocal Square Root Step
- MOV (scalar) Move vector element to scalar
- NEG (scalar) Negate (vector)
- SCVTF (scalar, fixed-point) Signed fixed-point Convert to Floating-point (vector)
- SCVTF (scalar, integer) Signed integer Convert to Floating-point (vector)
- SHL (scalar) Shift Left (immediate)
- SLI (scalar) Shift Left and Insert (immediate)
- SQABS (scalar) Signed saturating Absolute value
- SQADD (scalar) Signed saturating Add
- SQDMLAL (scalar, by element) Signed saturating Doubling Multiply-Add Long (by element)
- SQDMLAL (scalar) Signed saturating Doubling Multiply-Add Long
- SQDMLSL (scalar, by element) Signed saturating Doubling Multiply-Subtract Long (by element)
- SQDMLSL (scalar) Signed saturating Doubling Multiply-Subtract Long
- SQDMULH (scalar, by element) Signed saturating Doubling Multiply returning High half (by element)
- SQDMULH (scalar) Signed saturating Doubling Multiply returning High half
- SQDMULL (scalar, by element) Signed saturating Doubling Multiply Long (by element)
- SQDMULL (scalar) Signed saturating Doubling Multiply Long
- SQNEG (scalar) Signed saturating Negate
- SQRDMLAH (scalar, by element) Signed Saturating Rounding Doubling Multiply Accumulate returning High Half (by element)
- SQRDMLAH (scalar) Signed Saturating Rounding Doubling Multiply Accumulate returning High Half (vector)
- SQRDMLSH (scalar, by element) Signed Saturating Rounding Doubling Multiply Subtract returning High Half (by element)
- SQRDMLSH (scalar) Signed Saturating Rounding Doubling Multiply Subtract returning High Half (vector)
- SQRDMULH (scalar, by element) Signed saturating Rounding Doubling Multiply returning High half (by element)
- SQRDMULH (scalar) Signed saturating Rounding Doubling Multiply returning High half
- SQRSHL (scalar) Signed saturating Rounding Shift Left (register)
- SQRSHRN (scalar) Signed saturating Rounded Shift Right Narrow (immediate)
- SQRSHRUN (scalar) Signed saturating Rounded Shift Right Unsigned Narrow (immediate)
- SQSHL (scalar, immediate) Signed saturating Shift Left (immediate)
- SQSHL (scalar, register) Signed saturating Shift Left (register)
- SQSHLU (scalar) Signed saturating Shift Left Unsigned (immediate)
- SQSHRN (scalar) Signed saturating Shift Right Narrow (immediate)
- SQSHRUN (scalar) Signed saturating Shift Right Unsigned Narrow (immediate)
- SQSUB (scalar) Signed saturating Subtract
- SQXTN (scalar) Signed saturating extract Narrow
- SQXTUN (scalar) Signed saturating extract Unsigned Narrow
- SRI (scalar) Shift Right and Insert (immediate)
- SRSHL (scalar) Signed Rounding Shift Left (register)
- SRSHR (scalar) Signed Rounding Shift Right (immediate)
- SRSRA (scalar) Signed Rounding Shift Right and Accumulate (immediate)
- SSHL (scalar) Signed Shift Left (register)
- SSHR (scalar) Signed Shift Right (immediate)
- SSRA (scalar) Signed Shift Right and Accumulate (immediate)
- SUB (scalar) Subtract (vector)
- SUQADD (scalar) Signed saturating Accumulate of Unsigned value
- UCVTF (scalar, fixed-point) Unsigned fixed-point Convert to Floating-point (vector)
- UCVTF (scalar, integer) Unsigned integer Convert to Floating-point (vector)
- UQADD (scalar) Unsigned saturating Add
- UQRSHL (scalar) Unsigned saturating Rounding Shift Left (register)
- UQRSHRN (scalar) Unsigned saturating Rounded Shift Right Narrow (immediate)
- UQSHL (scalar, immediate) Unsigned saturating Shift Left (immediate)
- UQSHL (scalar, register) Unsigned saturating Shift Left (register)
- UQSHRN (scalar) Unsigned saturating Shift Right Narrow (immediate)
- UQSUB (scalar) Unsigned saturating Subtract
- UQXTN (scalar) Unsigned saturating extract Narrow
- URSHL (scalar) Unsigned Rounding Shift Left (register)
- URSHR (scalar) Unsigned Rounding Shift Right (immediate)
- URSRA (scalar) Unsigned Rounding Shift Right and Accumulate (immediate)
- USHL (scalar) Unsigned Shift Left (register)
- USHR (scalar) Unsigned Shift Right (immediate)
- USQADD (scalar) Unsigned saturating Accumulate of Signed value
- USRA (scalar) Unsigned Shift Right and Accumulate (immediate)
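Many of the scalar entries above are "saturating" variants (SQADD, UQADD, UQSUB, SQDMULH, ...). As a rough illustration of what that means, here is a minimal Python sketch that models a few of them on plain integers. The function names and the `bits` parameter are made up for this sketch; the real instructions operate on register lanes and also set the cumulative saturation flag (FPSR.QC), which is not modeled here.

```python
def saturate_signed(value, bits):
    """Clamp value to the signed range of a `bits`-wide lane."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def saturate_unsigned(value, bits):
    """Clamp value to the unsigned range of a `bits`-wide lane."""
    return max(0, min((1 << bits) - 1, value))


def sqadd(a, b, bits=8):
    # Signed saturating Add: clamps instead of wrapping around.
    return saturate_signed(a + b, bits)


def uqadd(a, b, bits=8):
    # Unsigned saturating Add.
    return saturate_unsigned(a + b, bits)


def uqsub(a, b, bits=8):
    # Unsigned saturating Subtract: a negative result clamps to 0.
    return saturate_unsigned(a - b, bits)


def sqdmulh(a, b, bits=16):
    # Signed saturating Doubling Multiply returning High half:
    # double the full-width product, keep the top `bits` bits, then clamp.
    return saturate_signed((2 * a * b) >> bits, bits)


print(sqadd(100, 100))          # clamps to the 8-bit signed maximum
print(uqsub(5, 10))             # clamps to 0
print(sqdmulh(-32768, -32768))  # the one 16-bit input pair that saturates
```

With 16-bit lanes, SQDMULH behaves like a Q15 fixed-point multiply: 0.5 * 0.5 (16384 * 16384) gives 0.25 (8192).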
A64 SIMD Vector instructions:
- ABS (vector) Absolute value (vector)
- ADD (vector) Add (vector)
- ADDHN, ADDHN2 (vector) Add returning High Narrow
- ADDP (vector) Add Pairwise (vector)
- ADDV (vector) Add across Vector
- AND (vector) Bitwise AND (vector)
- BIC (vector, immediate) Bitwise bit Clear (vector, immediate)
- BIC (vector, register) Bitwise bit Clear (vector, register)
- BIF (vector) Bitwise Insert if False
- BIT (vector) Bitwise Insert if True
- BSL (vector) Bitwise Select
- CLS (vector) Count Leading Sign bits (vector)
- CLZ (vector) Count Leading Zero bits (vector)
- CMEQ (vector, register) Compare bitwise Equal (vector)
- CMEQ (vector, zero) Compare bitwise Equal to zero (vector)
- CMGE (vector, register) Compare signed Greater than or Equal (vector)
- CMGE (vector, zero) Compare signed Greater than or Equal to zero (vector)
- CMGT (vector, register) Compare signed Greater than (vector)
- CMGT (vector, zero) Compare signed Greater than zero (vector)
- CMHI (vector, register) Compare unsigned Higher (vector)
- CMHS (vector, register) Compare unsigned Higher or Same (vector)
- CMLE (vector, zero) Compare signed Less than or Equal to zero (vector)
- CMLT (vector, zero) Compare signed Less than zero (vector)
- CMTST (vector) Compare bitwise Test bits nonzero (vector)
- CNT (vector) Population Count per byte
- DUP (vector, element) Duplicate vector element to vector
- DUP (vector, general) Duplicate general-purpose register to vector
- EOR (vector) Bitwise Exclusive OR (vector)
- EXT (vector) Extract vector from pair of vectors
- FABD (vector) Floating-point Absolute Difference (vector)
- FABS (vector) Floating-point Absolute value (vector)
- FACGE (vector) Floating-point Absolute Compare Greater than or Equal (vector)
- FACGT (vector) Floating-point Absolute Compare Greater than (vector)
- FADD (vector) Floating-point Add (vector)
- FADDP (vector) Floating-point Add Pairwise (vector)
- FCADD (vector) Floating-point Complex Add
- FCMEQ (vector, register) Floating-point Compare Equal (vector)
- FCMEQ (vector, zero) Floating-point Compare Equal to zero (vector)
- FCMGE (vector, register) Floating-point Compare Greater than or Equal (vector)
- FCMGE (vector, zero) Floating-point Compare Greater than or Equal to zero (vector)
- FCMGT (vector, register) Floating-point Compare Greater than (vector)
- FCMGT (vector, zero) Floating-point Compare Greater than zero (vector)
- FCMLA (vector) Floating-point Complex Multiply Accumulate
- FCMLE (vector, zero) Floating-point Compare Less than or Equal to zero (vector)
- FCMLT (vector, zero) Floating-point Compare Less than zero (vector)
- FCVTAS (vector) Floating-point Convert to Signed integer, rounding to nearest with ties to Away (vector)
- FCVTAU (vector) Floating-point Convert to Unsigned integer, rounding to nearest with ties to Away (vector)
- FCVTL, FCVTL2 (vector) Floating-point Convert to higher precision Long (vector)
- FCVTMS (vector) Floating-point Convert to Signed integer, rounding toward Minus infinity (vector)
- FCVTMU (vector) Floating-point Convert to Unsigned integer, rounding toward Minus infinity (vector)
- FCVTN, FCVTN2 (vector) Floating-point Convert to lower precision Narrow (vector)
- FCVTNS (vector) Floating-point Convert to Signed integer, rounding to nearest with ties to even (vector)
- FCVTNU (vector) Floating-point Convert to Unsigned integer, rounding to nearest with ties to even (vector)
- FCVTPS (vector) Floating-point Convert to Signed integer, rounding toward Plus infinity (vector)
- FCVTPU (vector) Floating-point Convert to Unsigned integer, rounding toward Plus infinity (vector)
- FCVTXN, FCVTXN2 (vector) Floating-point Convert to lower precision Narrow, rounding to odd (vector)
- FCVTZS (vector, fixed-point) Floating-point Convert to Signed fixed-point, rounding toward Zero (vector)
- FCVTZS (vector, integer) Floating-point Convert to Signed integer, rounding toward Zero (vector)
- FCVTZU (vector, fixed-point) Floating-point Convert to Unsigned fixed-point, rounding toward Zero (vector)
- FCVTZU (vector, integer) Floating-point Convert to Unsigned integer, rounding toward Zero (vector)
- FDIV (vector) Floating-point Divide (vector)
- FMAX (vector) Floating-point Maximum (vector)
- FMAXNM (vector) Floating-point Maximum Number (vector)
- FMAXNMP (vector) Floating-point Maximum Number Pairwise (vector)
- FMAXNMV (vector) Floating-point Maximum Number across Vector
- FMAXP (vector) Floating-point Maximum Pairwise (vector)
- FMAXV (vector) Floating-point Maximum across Vector
- FMIN (vector) Floating-point Minimum (vector)
- FMINNM (vector) Floating-point Minimum Number (vector)
- FMINNMP (vector) Floating-point Minimum Number Pairwise (vector)
- FMINNMV (vector) Floating-point Minimum Number across Vector
- FMINP (vector) Floating-point Minimum Pairwise (vector)
- FMINV (vector) Floating-point Minimum across Vector
- FMLA (vector, by element) Floating-point fused Multiply-Add to accumulator (by element)
- FMLA (vector) Floating-point fused Multiply-Add to accumulator (vector)
- FMLS (vector, by element) Floating-point fused Multiply-Subtract from accumulator (by element)
- FMLS (vector) Floating-point fused Multiply-Subtract from accumulator (vector)
- FMOV (vector, immediate) Floating-point move immediate (vector)
- FMUL (vector, by element) Floating-point Multiply (by element)
- FMUL (vector) Floating-point Multiply (vector)
- FMULX (vector, by element) Floating-point Multiply extended (by element)
- FMULX (vector) Floating-point Multiply extended
- FNEG (vector) Floating-point Negate (vector)
- FRECPE (vector) Floating-point Reciprocal Estimate
- FRECPS (vector) Floating-point Reciprocal Step
- FRECPX (vector) Floating-point Reciprocal exponent (scalar)
- FRINTA (vector) Floating-point Round to Integral, to nearest with ties to Away (vector)
- FRINTI (vector) Floating-point Round to Integral, using current rounding mode (vector)
- FRINTM (vector) Floating-point Round to Integral, toward Minus infinity (vector)
- FRINTN (vector) Floating-point Round to Integral, to nearest with ties to even (vector)
- FRINTP (vector) Floating-point Round to Integral, toward Plus infinity (vector)
- FRINTX (vector) Floating-point Round to Integral exact, using current rounding mode (vector)
- FRINTZ (vector) Floating-point Round to Integral, toward Zero (vector)
- FRSQRTE (vector) Floating-point Reciprocal Square Root Estimate
- FRSQRTS (vector) Floating-point Reciprocal Square Root Step
- FSQRT (vector) Floating-point Square Root (vector)
- FSUB (vector) Floating-point Subtract (vector)
- INS (vector, element) Insert vector element from another vector element
- INS (vector, general) Insert vector element from general-purpose register
- LD1 (vector, multiple structures) Load multiple single-element structures to one, two, three, or four registers
- LD1 (vector, single structure) Load one single-element structure to one lane of one register
- LD1R (vector) Load one single-element structure and Replicate to all lanes (of one register)
- LD2 (vector, multiple structures) Load multiple 2-element structures to two registers
- LD2 (vector, single structure) Load single 2-element structure to one lane of two registers
- LD2R (vector) Load single 2-element structure and Replicate to all lanes of two registers
- LD3 (vector, multiple structures) Load multiple 3-element structures to three registers
- LD3 (vector, single structure) Load single 3-element structure to one lane of three registers
- LD3R (vector) Load single 3-element structure and Replicate to all lanes of three registers
- LD4 (vector, multiple structures) Load multiple 4-element structures to four registers
- LD4 (vector, single structure) Load single 4-element structure to one lane of four registers
- LD4R (vector) Load single 4-element structure and Replicate to all lanes of four registers
- MLA (vector, by element) Multiply-Add to accumulator (vector, by element)
- MLA (vector) Multiply-Add to accumulator (vector)
- MLS (vector, by element) Multiply-Subtract from accumulator (vector, by element)
- MLS (vector) Multiply-Subtract from accumulator (vector)
- MOV (vector, element) Move vector element to another vector element
- MOV (vector, from general) Move general-purpose register to a vector element
- MOV (vector) Move vector
- MOV (vector, to general) Move vector element to general-purpose register
- MOVI (vector) Move Immediate (vector)
- MUL (vector, by element) Multiply (vector, by element)
- MUL (vector) Multiply (vector)
- MVN (vector) Bitwise NOT (vector)
- MVNI (vector) Move inverted Immediate (vector)
- NEG (vector) Negate (vector)
- NOT (vector) Bitwise NOT (vector)
- ORN (vector) Bitwise inclusive OR NOT (vector)
- ORR (vector, immediate) Bitwise inclusive OR (vector, immediate)
- ORR (vector, register) Bitwise inclusive OR (vector, register)
- PMUL (vector) Polynomial Multiply
- PMULL, PMULL2 (vector) Polynomial Multiply Long
- RADDHN, RADDHN2 (vector) Rounding Add returning High Narrow
- RBIT (vector) Reverse Bit order (vector)
- REV16 (vector) Reverse elements in 16-bit halfwords (vector)
- REV32 (vector) Reverse elements in 32-bit words (vector)
- REV64 (vector) Reverse elements in 64-bit doublewords (vector)
- RSHRN, RSHRN2 (vector) Rounding Shift Right Narrow (immediate)
- RSUBHN, RSUBHN2 (vector) Rounding Subtract returning High Narrow
- SABA (vector) Signed Absolute difference and Accumulate
- SABAL, SABAL2 (vector) Signed Absolute difference and Accumulate Long
- SABD (vector) Signed Absolute Difference
- SABDL, SABDL2 (vector) Signed Absolute Difference Long
- SADALP (vector) Signed Add and Accumulate Long Pairwise
- SADDL, SADDL2 (vector) Signed Add Long (vector)
- SADDLP (vector) Signed Add Long Pairwise
- SADDLV (vector) Signed Add Long across Vector
- SADDW, SADDW2 (vector) Signed Add Wide
- SCVTF (vector, fixed-point) Signed fixed-point Convert to Floating-point (vector)
- SCVTF (vector, integer) Signed integer Convert to Floating-point (vector)
- SDOT (vector, by element) Dot Product signed arithmetic (vector, by element)
- SDOT (vector) Dot Product signed arithmetic (vector)
- SHADD (vector) Signed Halving Add
- SHL (vector) Shift Left (immediate)
- SHLL, SHLL2 (vector) Shift Left Long (by element size)
- SHRN, SHRN2 (vector) Shift Right Narrow (immediate)
- SHSUB (vector) Signed Halving Subtract
- SLI (vector) Shift Left and Insert (immediate)
- SMAX (vector) Signed Maximum (vector)
- SMAXP (vector) Signed Maximum Pairwise
- SMAXV (vector) Signed Maximum across Vector
- SMIN (vector) Signed Minimum (vector)
- SMINP (vector) Signed Minimum Pairwise
- SMINV (vector) Signed Minimum across Vector
- SMLAL, SMLAL2 (vector, by element) Signed Multiply-Add Long (vector, by element)
- SMLAL, SMLAL2 (vector) Signed Multiply-Add Long (vector)
- SMLSL, SMLSL2 (vector, by element) Signed Multiply-Subtract Long (vector, by element)
- SMLSL, SMLSL2 (vector) Signed Multiply-Subtract Long (vector)
- SMOV (vector) Signed Move vector element to general-purpose register
- SMULL, SMULL2 (vector, by element) Signed Multiply Long (vector, by element)
- SMULL, SMULL2 (vector) Signed Multiply Long (vector)
- SQABS (vector) Signed saturating Absolute value
- SQADD (vector) Signed saturating Add
- SQDMLAL, SQDMLAL2 (vector, by element) Signed saturating Doubling Multiply-Add Long (by element)
- SQDMLAL, SQDMLAL2 (vector) Signed saturating Doubling Multiply-Add Long
- SQDMLSL, SQDMLSL2 (vector, by element) Signed saturating Doubling Multiply-Subtract Long (by element)
- SQDMLSL, SQDMLSL2 (vector) Signed saturating Doubling Multiply-Subtract Long
- SQDMULH (vector, by element) Signed saturating Doubling Multiply returning High half (by element)
- SQDMULH (vector) Signed saturating Doubling Multiply returning High half
- SQDMULL, SQDMULL2 (vector, by element) Signed saturating Doubling Multiply Long (by element)
- SQDMULL, SQDMULL2 (vector) Signed saturating Doubling Multiply Long
- SQNEG (vector) Signed saturating Negate
- SQRDMLAH (vector, by element) Signed Saturating Rounding Doubling Multiply Accumulate returning High Half (by element)
- SQRDMLAH (vector) Signed Saturating Rounding Doubling Multiply Accumulate returning High Half (vector)
- SQRDMLSH (vector, by element) Signed Saturating Rounding Doubling Multiply Subtract returning High Half (by element)
- SQRDMLSH (vector) Signed Saturating Rounding Doubling Multiply Subtract returning High Half (vector)
- SQRDMULH (vector, by element) Signed saturating Rounding Doubling Multiply returning High half (by element)
- SQRDMULH (vector) Signed saturating Rounding Doubling Multiply returning High half
- SQRSHL (vector) Signed saturating Rounding Shift Left (register)
- SQRSHRN, SQRSHRN2 (vector) Signed saturating Rounded Shift Right Narrow (immediate)
- SQRSHRUN, SQRSHRUN2 (vector) Signed saturating Rounded Shift Right Unsigned Narrow (immediate)
- SQSHL (vector, immediate) Signed saturating Shift Left (immediate)
- SQSHL (vector, register) Signed saturating Shift Left (register)
- SQSHLU (vector) Signed saturating Shift Left Unsigned (immediate)
- SQSHRN, SQSHRN2 (vector) Signed saturating Shift Right Narrow (immediate)
- SQSHRUN, SQSHRUN2 (vector) Signed saturating Shift Right Unsigned Narrow (immediate)
- SQSUB (vector) Signed saturating Subtract
- SQXTN, SQXTN2 (vector) Signed saturating extract Narrow
- SQXTUN, SQXTUN2 (vector) Signed saturating extract Unsigned Narrow
- SRHADD (vector) Signed Rounding Halving Add
- SRI (vector) Shift Right and Insert (immediate)
- SRSHL (vector) Signed Rounding Shift Left (register)
- SRSHR (vector) Signed Rounding Shift Right (immediate)
- SRSRA (vector) Signed Rounding Shift Right and Accumulate (immediate)
- SSHL (vector) Signed Shift Left (register)
- SSHLL, SSHLL2 (vector) Signed Shift Left Long (immediate)
- SSHR (vector) Signed Shift Right (immediate)
- SSRA (vector) Signed Shift Right and Accumulate (immediate)
- SSUBL, SSUBL2 (vector) Signed Subtract Long
- SSUBW, SSUBW2 (vector) Signed Subtract Wide
- ST1 (vector, multiple structures) Store multiple single-element structures from one, two, three, or four registers
- ST1 (vector, single structure) Store a single-element structure from one lane of one register
- ST2 (vector, multiple structures) Store multiple 2-element structures from two registers
- ST2 (vector, single structure) Store single 2-element structure from one lane of two registers
- ST3 (vector, multiple structures) Store multiple 3-element structures from three registers
- ST3 (vector, single structure) Store single 3-element structure from one lane of three registers
- ST4 (vector, multiple structures) Store multiple 4-element structures from four registers
- ST4 (vector, single structure) Store single 4-element structure from one lane of four registers
- SUB (vector) Subtract (vector)
- SUBHN, SUBHN2 (vector) Subtract returning High Narrow
- SUQADD (vector) Signed saturating Accumulate of Unsigned value
- SXTL, SXTL2 (vector) Signed extend Long
- TBL (vector) Table vector Lookup
- TBX (vector) Table vector lookup extension
- TRN1 (vector) Transpose vectors (primary)
- TRN2 (vector) Transpose vectors (secondary)
- UABA (vector) Unsigned Absolute difference and Accumulate
- UABAL, UABAL2 (vector) Unsigned Absolute difference and Accumulate Long
- UABD (vector) Unsigned Absolute Difference (vector)
- UABDL, UABDL2 (vector) Unsigned Absolute Difference Long
- UADALP (vector) Unsigned Add and Accumulate Long Pairwise
- UADDL, UADDL2 (vector) Unsigned Add Long (vector)
- UADDLP (vector) Unsigned Add Long Pairwise
- UADDLV (vector) Unsigned sum Long across Vector
- UADDW, UADDW2 (vector) Unsigned Add Wide
- UCVTF (vector, fixed-point) Unsigned fixed-point Convert to Floating-point (vector)
- UCVTF (vector, integer) Unsigned integer Convert to Floating-point (vector)
- UDOT (vector, by element) Dot Product unsigned arithmetic (vector, by element)
- UDOT (vector) Dot Product unsigned arithmetic (vector)
- UHADD (vector) Unsigned Halving Add
- UHSUB (vector) Unsigned Halving Subtract
- UMAX (vector) Unsigned Maximum (vector)
- UMAXP (vector) Unsigned Maximum Pairwise
- UMAXV (vector) Unsigned Maximum across Vector
- UMIN (vector) Unsigned Minimum (vector)
- UMINP (vector) Unsigned Minimum Pairwise
- UMINV (vector) Unsigned Minimum across Vector
- UMLAL, UMLAL2 (vector, by element) Unsigned Multiply-Add Long (vector, by element)
- UMLAL, UMLAL2 (vector) Unsigned Multiply-Add Long (vector)
- UMLSL, UMLSL2 (vector, by element) Unsigned Multiply-Subtract Long (vector, by element)
- UMLSL, UMLSL2 (vector) Unsigned Multiply-Subtract Long (vector)
- UMOV (vector) Unsigned Move vector element to general-purpose register
- UMULL, UMULL2 (vector, by element) Unsigned Multiply Long (vector, by element)
- UMULL, UMULL2 (vector) Unsigned Multiply Long (vector)
- UQADD (vector) Unsigned saturating Add
- UQRSHL (vector) Unsigned saturating Rounding Shift Left (register)
- UQRSHRN, UQRSHRN2 (vector) Unsigned saturating Rounded Shift Right Narrow (immediate)
- UQSHL (vector, immediate) Unsigned saturating Shift Left (immediate)
- UQSHL (vector, register) Unsigned saturating Shift Left (register)
- UQSHRN, UQSHRN2 (vector) Unsigned saturating Shift Right Narrow (immediate)
- UQSUB (vector) Unsigned saturating Subtract
- UQXTN, UQXTN2 (vector) Unsigned saturating extract Narrow
- URECPE (vector) Unsigned Reciprocal Estimate
- URHADD (vector) Unsigned Rounding Halving Add
- URSHL (vector) Unsigned Rounding Shift Left (register)
- URSHR (vector) Unsigned Rounding Shift Right (immediate)
- URSQRTE (vector) Unsigned Reciprocal Square Root Estimate
- URSRA (vector) Unsigned Rounding Shift Right and Accumulate (immediate)
- USHL (vector) Unsigned Shift Left (register)
- USHLL, USHLL2 (vector) Unsigned Shift Left Long (immediate)
- USHR (vector) Unsigned Shift Right (immediate)
- USQADD (vector) Unsigned saturating Accumulate of Signed value
- USRA (vector) Unsigned Shift Right and Accumulate (immediate)
- USUBL, USUBL2 (vector) Unsigned Subtract Long
- USUBW, USUBW2 (vector) Unsigned Subtract Wide
- UXTL, UXTL2 (vector) Unsigned extend Long
- UZP1 (vector) Unzip vectors (primary)
- UZP2 (vector) Unzip vectors (secondary)
- XTN, XTN2 (vector) Extract Narrow
- ZIP1 (vector) Zip vectors (primary)
- ZIP2 (vector) Zip vectors (secondary)
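The permute instructions in the list above (ZIP1/ZIP2, UZP1/UZP2, TRN1/TRN2) are easier to follow with a concrete lane layout. A minimal Python sketch, assuming each vector is modeled as a list with lane 0 first and both inputs have the same even length; the function names simply mirror the mnemonics and are not an actual API.

```python
def zip1(n, m):
    # Interleave the lower halves of n and m.
    half = len(n) // 2
    return [x for pair in zip(n[:half], m[:half]) for x in pair]


def zip2(n, m):
    # Interleave the upper halves of n and m.
    half = len(n) // 2
    return [x for pair in zip(n[half:], m[half:]) for x in pair]


def uzp1(n, m):
    # Even-numbered lanes of the concatenation n followed by m.
    return (n + m)[0::2]


def uzp2(n, m):
    # Odd-numbered lanes of the concatenation n followed by m.
    return (n + m)[1::2]


def trn1(n, m):
    # Even-numbered lanes of n and m, interleaved (transpose, primary).
    return [x for pair in zip(n[0::2], m[0::2]) for x in pair]


def trn2(n, m):
    # Odd-numbered lanes of n and m, interleaved (transpose, secondary).
    return [x for pair in zip(n[1::2], m[1::2]) for x in pair]


n = [0, 1, 2, 3]
m = [10, 11, 12, 13]
print(zip1(n, m), zip2(n, m))
print(uzp1(n, m), uzp2(n, m))
print(trn1(n, m), trn2(n, m))
```

In this model UZP undoes ZIP: applying `uzp1` and `uzp2` to the pair `(zip1(n, m), zip2(n, m))` gives back `n` and `m`. That is the usual way to interleave and de-interleave data held in registers.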
ARM64 things removed compared to ARM32
" If you are familiar with ARMv7-A, you’ll know that many instructions can be conditionally executed. In A32, this is supported via a condition field in the instruction itself; in T32, we have the IT (if-then) instruction for building conditional sequences. This isn’t supported in A64 and we have a different set of specific conditional instructions. You can find examples below.
The ability to “embed” shift and rotate operations into data processing instructions is not supported in the same way in A64, although it is still possible to shift, rotate and sign-extend or zero-extend the second operand.
The Program Counter (PC) is no longer generally accessible. In particular, it can’t be read or modified like other general purpose registers. There are pseudo-instructions which can be used to use it indirectly (for instance, to generate PC-relative addresses at run-time).
Historically, the ARM instruction set has included a space for «coprocessors». Originally, these were external blocks of logic which were connected to the core via a dedicated coprocessor interface. More recently, this support for external coprocessors has been dropped and the instruction set space is used for extension instructions. One specific use of it has been to provide for system configuration and control operations via the notional «coprocessor 15». You won’t find anything like this in A64.
The load and store multiple instructions have been replaced with instructions which load and store pairs of 64-bit registers. These are used for stack operations as well, in place of the earlier PUSH and POP. " -- https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-00-52-01/Porting-to-ARM-64_2D00_bit.pdf
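The "specific conditional instructions" mentioned in the quote presumably include the A64 conditional-select family (CSEL, CSINC, CSINV, CSNEG, and the CSET alias). A rough Python model of their results, assuming 64-bit X registers represented as non-negative Python ints and a `cond` argument standing in for the already-evaluated flag test; the function names are illustrative only.

```python
MASK64 = (1 << 64) - 1  # X registers are 64 bits wide


def csel(cond, xn, xm):
    # CSEL Xd, Xn, Xm, cond  ->  Xd = cond ? Xn : Xm
    return (xn if cond else xm) & MASK64


def csinc(cond, xn, xm):
    # CSINC Xd, Xn, Xm, cond  ->  Xd = cond ? Xn : Xm + 1
    return (xn if cond else xm + 1) & MASK64


def csinv(cond, xn, xm):
    # CSINV Xd, Xn, Xm, cond  ->  Xd = cond ? Xn : ~Xm
    return (xn if cond else ~xm) & MASK64


def csneg(cond, xn, xm):
    # CSNEG Xd, Xn, Xm, cond  ->  Xd = cond ? Xn : -Xm
    return (xn if cond else -xm) & MASK64


def cset(cond):
    # CSET Xd, cond is an alias of CSINC Xd, XZR, XZR, <inverted cond>,
    # so it yields 1 when cond holds and 0 otherwise.
    return csinc(not cond, 0, 0)


print(csel(True, 1, 2), csinc(False, 5, 7), cset(True))
```

Unlike A32 predication, only the choice of result is conditional here; the instruction itself always executes.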
A comment on which ARMv8 changes were improvements, from https://www.anandtech.com/Show/Index/15036?cPage=5&all=False&sort=0&page=1&slug=sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip:
" ... There is a HUGE amount of learning that informed ARMv8, from the dropping of predication and shifting everywhere, to the way constants are encoded, to high-impact ideas like load/store pair and their particular version of conditional selection, to the codification of the memory ordering rules. Look at SVE as the newest version of something very different from what they were doing earlier. " -- name99
ARM64 calling convention
https://c9x.me/compile/bib/abi-arm64.pdf
Misc
"ARM/64 fact: integer divide-by-zero doesn't cause an exception. The Microsoft C compiler implicitly inserts a check if the divisor is 0 and triggers a pseudo-div-by-0 exception. Clang and GCC compile the code as is, so you always get 0." [15]
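A minimal Python sketch of the divide behaviour described in the quote, modeling SDIV and UDIV on plain integers. The function names and the `bits` parameter are assumptions of this sketch. The INT_MIN / -1 case is not in the quote; it is the architecture's other non-trapping corner case, where the quotient wraps back to INT_MIN.

```python
def sdiv(n, m, bits=32):
    """Model of signed divide: no trap, result truncated toward zero."""
    if m == 0:
        return 0  # divide-by-zero yields 0 instead of raising an exception
    q = abs(n) // abs(m)
    if (n < 0) != (m < 0):
        q = -q
    # Wrap into the signed range of the register width, so that
    # INT_MIN / -1 overflows back to INT_MIN.
    q &= (1 << bits) - 1
    if q >= 1 << (bits - 1):
        q -= 1 << bits
    return q


def udiv(n, m):
    """Model of unsigned divide: divide-by-zero yields 0."""
    return 0 if m == 0 else n // m


def checked_sdiv(n, m, bits=32):
    """Hypothetical model of a compiler-inserted divisor check."""
    if m == 0:
        raise ZeroDivisionError("pseudo div-by-0 exception")
    return sdiv(n, m, bits)


print(sdiv(7, 0), sdiv(-7, 2), sdiv(-2**31, -1))
```

`checked_sdiv` only illustrates the idea of testing the divisor before the divide; how a given compiler actually raises the exception is not modeled.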
Opinions
- "...ARM was a lot more pleasant than MIPS specifically because MIPS was overly minimalistic, so it’d take a surprising number of instructions to get stuff done." -- gecko
- "...the driving success of ARM was its ability to run small, compact code held in cheap, small memory. ARM was a success because it made the most of limited resources. Not because it was the perfect on-paper design." [16]
- "ARM was a pretty damn fine on-paper design (still is). And it was one of the fastest designs you could get back in the day. ARM gives you anything you need to make it fast (like advanced addressing modes and complex instructions) while still admitting simple implementations with good performance." [17]
- " > Most current ISAs have hundreds of instructions which will never be generated by compilers.
The only ISA with this problem is x86 and compilers have gotten better at making use of the instruction set. If you want to see what an instruction set optimised for compilers looks like, check out ARM64. It has instructions like “conditional select and increment if condition” which compiler writers really love." -- FUZxxl
- "ARM64 (aka AArch64) is the best version of x86 yet. It's clean, very little warts, they learned from their mistakes with ARMv7/Thumb2 (specifically the IT instruction)." [18] (note: ARM64 is NOT a version of x86, they are being sarcastic)
- "Thumb2 was good. The main issue with AArch64 is that they dropped the variable instruction length. As such all instructions are huge and this significantly slows down code, especially after a mispredicted branch. I'm observing on average a 20% performance loss from thumb2 to aarch64 on the exact same CPU and same kernel, just switching executables, and 40% larger code or so. Also something to consider, an A53 can only read 64 bits per cycle from the cache, i.e. just two instructions. That doesn't even allow it to fetch a bit more and start to decode in advance. " [19]
- "The original ARM ISA felt very VAX-inspired to me, such as the elegant (but ultimately inefficient) use of a general-purpose register for the program counter. I've only just started looking at AArch64 but I agree that it feels a lot more like MIPS though. I think that's a good thing." [20]
- "It is interesting to watch ARM finally adopting many of the great architectural solutions that MIPS used 22 years ago, back in 1991, when it launched the MIPS R4000 family of 64 bit processors. [21]" [22]
- " > This killed the basic advantage of RISC. The "all instructions the same length" concept really killed it - it meant 2x code bloat. That meant bigger caches or worse cache performance. It meant more RAM and more RAM bandwidth or worse memory performance. The x86 instruction set, for all its faults, is compact.
I'm not really sure that is a big deal today, however. Maybe it was when caches were much smaller, but AArch64 went back from a variable width encoding scheme (Thumb) to uniform width instructions in 64-bit mode, without any problems that I'm aware of, and the performance is quite good. At the same time, the x86-64 ISA has gotten quite a bit less space-efficient: because of the extension to 16 registers, REX prefixes are everywhere and eat up lots of bytes of the instruction stream. " [23]
- "AArch64 and x86-64 have about the same code size" -- pcwalton
- " > in fact I'd say one of the reasons ARM remained competitive is because of conditional execution, the "free" barrel shifter
No compiler developer would agree with you. The conditional execution wreaks havoc with dependencies, and branches are very cheap if correctly predicted. The barrel shifter is not as useful as you would think (what fraction of instructions are shl or shr?) Thumb mode does help code density, but not as much as you might think due to Thumb-1 not being practical and Thumb-2 being fairly large. AArch64 is quite a bit denser than x86-64 already. It is true that the ISA doesn't matter too much from a performance point of view. But why not take advantage of the necessary compatibility break to clean things up? There's a lot of needless complexity in our ISAs from the programmer's point of view, and cleaning it up is just good engineering practice. Let's not saddle future generations with the mistakes of the 1980s.
... (further down, replying to a comment about the barrel shifter on immediate value encodings being useful) ...
The immediate value encoding is still there. What's gone is the barrel shifter on arithmetic instructions, other than those that explicitly mention that they perform a shift. " -- pcwalton
- "...generally speaking ARM v8 is pretty damn well designed..." -- ksec
- "The two modern ARM instruction sets, the 16-bit-encoded ARMv7-M / ARMv8-M (for microcontrollers) and the 64-bit (32-bit-encoded) ARMv8-A, are very different from the traditional ARM ISA and they both are very well designed, incomparably better than RISC-V." -- adrian_b
- "ARMv8.2 or newer is a very well designed ISA, while RISC-V is a very bad ISA and I would hate to be forced to use it. OpenPOWER is a far better ISA than RISC-V, but unfortunately most developers do not have any experience with POWER and they have the wrong belief that POWER is some antique ISA while RISC-V must be some modern fashionable ISA. Therefore even if OpenPOWER is much better, it is less likely than RISC-V to be used as a replacement for ARM." -- adrian_b
- adrian_b on ARMv8 vs POWER: " ARMv8 was a clean design not constrained by compatibility with the past and it was created by people having a lot of experience with the implementation of ISAs in hardware and in software tools. Therefore there is no surprise that it is an efficient ISA. The only significant flaw in its first version was the lack of atomic instructions, but that was corrected in the subsequent versions. The 32-bit POWER was a very nice ISA, but it was not designed to be extendable to 64-bit. It had blocks of the encoding space reserved for future extensions, but various details of the instruction word formats depended on the fact that the size of the registers was 32 bit. When POWER was extended to 64-bit, much earlier than IBM expected, i.e. only 5 years after the introduction of the 32-bit variant, the extension was constrained because IBM has chosen to not have a mode switch like ARM but they have chosen to make a compatible ISA extension, i.e. which has the original POWER ISA as an instruction subset. This has constrained the instruction encodings, so the 64-bit POWER ISA has some parts that seem more clumsy than in ARMv8 and the result is that programs for POWER are usually slightly larger than their ARMv8 equivalents. However, the hardware implementation effort for equivalent performance levels should be very similar for POWER and ARM and significantly less for both than for x86. POWER also had a compressed encoding variant, but that was implemented in very few chips. Now the latest ISA variant has introduced 2 instruction word lengths, i.e. both 64-bit and 32-bit long instructions, instead of just 32-bit long instructions. This allows the embedding of large immediate constants in the instructions, which is an important advantage of x86 vs. traditional RISC ISAs. This might help to reduce the sizes of many POWER programs. " -- adrian_b
- "AArch64 is a pretty well-designed instruction set that learns a lot of lessons from AArch32 and other competing ISAs." -- David Chisnall
- "I’m not really a fan of RISC-V, but RISC-V manages to copy MIPS while avoiding the most awful parts of MIPS. If you want to learn a simple RISC assembly language, RISC-V is a better choice than MIPS. If you want to learn assembly language for a well-designed ISA, learn AArch64. If you want to learn assembly language that’s a joy to write, learn AArch32 (things like stm and ldm, predication, and the fact that $pc is a general-purpose register are great to use for assembly programmers, difficult to use for compilers, and awful to implement)." -- David Chisnall
- "There are cases when cbz/tbz are very useful, but for loops they do not help at all. All the ARMv8 loops need 2 instructions, i.e. 8 bytes, instead of the single compare-and-branch of RISC-V. There are 2 ways to do simple loops in ARM, you can either use an addition that stores the flags, then a conditional branch, or you can use an addition that does not store the flags, then a CBNZ (which tests whether the loop counter is null). Both ways need a pair of instructions. Nevertheless, ARM has an unused opcode space equal in size to the space used by CBNZ/CBZ/TBNZ/TBZ (bits 29 to 31 equal to 3 or 7 instead of 1 or 5). In that unused opcode space, 4 pairs of compare-and-branch instructions could be encoded (3 pairs corresponding to those of RISC-V plus 1 pair of test-under-mask, corresponding to the TEST instruction of x86; each pair being for a condition and its negation). All 4 pairs of compare-and-branch would have 14-bit offsets, like TBZ/TBNZ, i.e. a range larger than that of the RISC-V branches. This addition to the ARM ISA would decrease the code size by 4 bytes for each 25 to 30 bytes, so a 10% to 15% improvement." -- adrian_b
- "arm64 does seem constrained by compatibility with arm32 -- at least in that they until now (ten years later) usually have to share an execution pipeline and register set. Is it really conceivable that the arm64 designers had free rein to make the choice whether to use condition codes or not on a purely technical basis? I don't think so. Even if they thought -- as all other designers of ISAs intended for high performance since 1990 have (Alpha, Itanium, RISC-V) -- that it's better not to use condition codes, I don't think they would have been free to make that choice. The same goes for whether to expose instructions using the "free" shift on the 2nd ALU input. It's not really free -- it's paid for with a longer clock cycle or an extra pipeline stage or splitting instructions into uops. And since it was there for 32 bit they might as well use it in 64 bit as well. And the same for the complex addressing modes." -- brucehoult
- (commenting upon https://gmplib.org/list-archives/gmp-devel/2021-September/006013.html ) "Funny, I thought the whole thing was bitching that RISC V has no carry flag which obviously causes multi word arithmetic to take more instructions. The obvious workaround is to use half-words and use the upper half for carry. There may be better solutions, but at twice the number of instructions this "dumb" method is better than what the author did. Flags were removed because they cause a lot of unwanted dependencies and contention in hardware designs and they aren't even part of any high level language. I still think instead of compare-and-branch they should have made "if" which would execute the following instruction only if true. But that's just an opinion. I also hate the immediate constants (12 bits?) inside the instruction. Nothing wrong with 16 32 or 64bit immediate data after the opcode. I hope RISC 6 will come along down the road (not soon) and fix a few things. But I like the lack of flags..." -- [24]
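Note (mine, not from the quotes): a rough Python sketch of the two flagless multi-word addition strategies mentioned in the last comment above. It models 64-bit registers with masks; the function names (add_limbs_compare, add_limbs_halfword, to_limbs, from_limbs) are made up for illustration, and this is a simulation of the idea, not real RISC-V code and not what GMP actually does.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def to_limbs(n, bits, count):
    """Split a non-negative int into `count` little-endian limbs of `bits` bits."""
    return [(n >> (bits * i)) & ((1 << bits) - 1) for i in range(count)]


def from_limbs(limbs, bits):
    """Inverse of to_limbs."""
    return sum(limb << (bits * i) for i, limb in enumerate(limbs))


def add_limbs_compare(a, b):
    """Full 64-bit limbs, no carry flag: recover each carry with an unsigned
    compare (the job an instruction like sltu does on a flagless ISA)."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = (x + y) & MASK64          # wrapping add
        c1 = 1 if s < x else 0        # wrapped iff the sum is below an operand
        t = (s + carry) & MASK64      # wrapping add of the incoming carry
        c2 = 1 if t < s else 0        # wrapped iff adding the carry overflowed
        out.append(t)
        carry = c1 | c2               # at most one of the two adds can wrap
    return out, carry


def add_limbs_halfword(a, b):
    """The "half-word" workaround: keep only 32 bits of data per 64-bit
    register, so the carry simply lands in the unused upper half."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry             # at most 2*(2**32 - 1) + 1, fits in 64 bits
        out.append(s & MASK32)        # low half is the result limb
        carry = s >> 32               # upper half is the carry (0 or 1)
    return out, carry
```

Usage: for a 128-bit add, call add_limbs_compare(to_limbs(x, 64, 2), to_limbs(y, 64, 2)) or add_limbs_halfword(to_limbs(x, 32, 4), to_limbs(y, 32, 4)). The half-word version does less work per limb but needs twice as many limbs, which is presumably the "twice the number of instructions" trade-off the commenter has in mind.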
Links
A64:
Arm v1: