RISC-V
open source
clearest concise summary of major opcodes is in tables at the end of http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf
Younger (2010) than, and claims to have learned from, SPARC V8 (1994) and OpenRISC (2000).
http://riscv.org/
http://riscv.org/download.html#tab_isaspec (base opcode listing in Chapter 8)
- Eschews condition codes (dependencies are explicit) [1] [2]
- eschews "branch delay slots, which complicate higher performance implementations" [3] [4]
- "rs1, rs2, rd fields are always in the same location and all register sources and destinations are explicit (makes decoding faster and you can start fetching/renaming without having decoded)" [5]
- "The sign bit for all immediate fields is in a fixed location" [6]
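A minimal Python sketch of what the fixed field positions buy you, assuming the standard 32-bit R-type and I-type layouts from the base ISA spec. The function names are made up for illustration. The register fields come out with the same shifts and masks regardless of the opcode, and the sign of an immediate is always bit 31.

```python
def decode_fields(insn):
    """Extract the fixed-position fields of a 32-bit RISC-V instruction word."""
    return {
        "opcode": insn & 0x7F,          # bits 6..0
        "rd":     (insn >> 7) & 0x1F,   # bits 11..7
        "funct3": (insn >> 12) & 0x7,   # bits 14..12
        "rs1":    (insn >> 15) & 0x1F,  # bits 19..15
        "rs2":    (insn >> 20) & 0x1F,  # bits 24..20 (only meaningful if the format has rs2)
        "sign":   (insn >> 31) & 0x1,   # immediate sign bit, always bit 31
    }

def i_type_imm(insn):
    """Sign-extended 12-bit immediate of an I-type instruction (bits 31..20)."""
    imm = (insn >> 20) & 0xFFF
    return imm - 0x1000 if imm & 0x800 else imm

ADD_X3_X1_X2 = 0x002081B3    # add  x3, x1, x2
ADDI_X1_X0_M1 = 0xFFF00093   # addi x1, x0, -1

print(decode_fields(ADD_X3_X1_X2))
print(i_type_imm(ADDI_X1_X0_M1))
```

A decoder can read rs1/rs2 out of the register file speculatively using these positions and simply ignore the values if the format turns out not to use them.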
I'm not sure exactly which addressing modes are supported, but i'm guessing it's non-uniform, with different opcodes for different modes: mostly register direct, except for the 'immediate' opcodes, which carry an immediate operand, and loads and stores, which use a base+offset mode with the base address in register rs1. Unconditional jumps use PC-relative addressing.
interesting comparison of RISC-V with Epiphany (the one used in Parallella) http://www.adapteva.com/andreas-blog/analyzing-the-risc-v-instruction-set-architecture/
note: RISC-V instructions that the Epiphany guy thought maybe could have been left out:
AUIPC (but in the comments a RISC-V guy says AUIPC was important for relocatable code), SLT/SLTI/SLTU/SLTIU (compare: set-less-than, with unsigned and immediate and unsigned immediate variants), XORI/ORI/ANDI (boolean logic with immediate values), FENCE (mb; sync threads), MULH/MULHSU/MULHU (the 'upper half' multiply variants), FSGNJ/FSGNJN/FSGNJX (Sign Inject: Sign source), FCLASS (Categorization: Classify Type).
Then there was a bunch for which he said "Not needed for epiphany"; i dunno if he means 'this is good but since Epiphany had a restricted use case target (DSP) we didn't include it'. These are: FLW/FSW (load/store 'W'; i didn't note this below), FMV.X/FMV.S (move from/to integer), FRSCSR (typo? read status regs? i didn't note this below), FSRM/FSRMI (swap rounding mode; i didn't note this below), FSFLAGS (swap flags; i didn't note this below).
Then there are some for which he said 'Needed?'. These are: FNMSUB (Negative Multiply-SUBtract), FMIN/FMAX (min/max), FCMP (i can't find this in the table in http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf , so i didn't note this below).
Epiphany instructions that the Epiphany guy said RISC-V left out and that are good: LDRD (load/store double), LDR and STR with the POSTMOD addressing mode (postincrement).
RISC-V has a choice of 16 or 32 integer registers (32 is more typical, i think?) and also optionally, 32 additional floating point registers. Memory is addressed as 8-bit bytes. The instruction encoding is 32-bit, but the 'Compressed' instruction encoding has 16-bit instructions. Instructions tend to have 32-bit, 64-bit, and 128-bit variants; arithmetic is done in at least 32-bit width ("RISC-V can load and store 8 and 16-bit items, but it lacks 8 and 16-bit arithmetic, including comparison-and-branch instructions." [7] ). Register 0 is constant 0.
RISC; no indirect or memory-memory addr modes, but instead there are LOAD and STORE instructions. No autoincrement addr modes. Some opcodes indicate immediate addr mode, others indicate register direct. Little-endian. Branching is compare-and-branch. Variable-length encoding.
"The RISC-V ISA has been designed to include small, fast, and low-power real-world implementations,[2][3] but without over-architecting for a particular microarchitecture style." [8]
"the RISC-V instruction set is designed for practicality of implementation, with features to increase a computer's speed, while reducing its cost and power use. These include placing most-significant bits at a fixed location to speed sign-extension, and a bit-arrangement designed to reduce the number of multiplexers in a CPU." [9]
"RISC-V intentionally lacks condition codes, and even a carry bit.[3] The designers claim that this can simplify CPU designs by minimizing interactions between instructions.[3] Instead RISC-V builds comparison operations into its conditional-jumps.[3] Use of comparisons may slightly increase its power use in some applications. The lack of a carry bit complicates multiple-precision arithmetic. RISC-V does not detect or flag most arithmetic errors, including overflow, underflow, and divide by zero.[3] RISC-V also lacks the "count leading zero" and bit-field operations normally used to speed software floating-point in a pure-integer processor." [10]
" A load or store can add a twelve-bit signed offset to a register that contains an address. A further 20 bits (yielding a 32-bit address) can be generated at an absolute address.[3]
RISC-V was designed to permit position-independent code. It has a special instruction to generate 20 upper address bits that are relative to the program counter. The lower twelve bits are provided by normal loads, stores and jumps.[3] " [11]
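To make the 20+12 bit split concrete, here is a small Python sketch of how a 32-bit address (or a PC-relative displacement) can be divided between an LUI/AUIPC-style upper immediate and the signed 12-bit offset of the following load, store or jump. The helper names are hypothetical. The one subtle point is that the low 12 bits are sign-extended, so when bit 11 is set the upper part has to be rounded up by one to compensate.

```python
MASK32 = 0xFFFFFFFF

def split_hi_lo(addr):
    """Split a 32-bit value into (hi20, lo12) with lo12 treated as signed."""
    lo = addr & 0xFFF
    if lo & 0x800:              # low part will be sign-extended by the consumer
        lo -= 0x1000
    hi = ((addr - lo) >> 12) & 0xFFFFF
    return hi, lo

def rebuild(hi, lo):
    """What the hardware effectively computes: (hi << 12) + sign-extended lo."""
    return ((hi << 12) + lo) & MASK32

for addr in (0x12345678, 0x12345FFF, 0xFFFFFFFF):
    hi, lo = split_hi_lo(addr)
    print(hex(addr), hex(hi), lo, hex(rebuild(hi, lo)))
```

For position-independent code the same split would be applied to (target - pc) instead of an absolute address, with the upper part added to the PC.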
" RISC-V does define a special set of integer multiplication instructions. This includes a recommended sequence of instructions that a compiler can generate and a CPU can interpret to perform a fused multiply-accumulate operation. Multiply-accumulate is a core primitive of numerical linear algebra, and so is incorporated as part of a common benchmark, Coremark.[3][15] " [12]
Tutorials:
Retrospectives
RISC-V Compressed (16-bit encoding) opcodes
From [13] (Draft version 1.9):
Summary of RISC-V Compressed (16-bit encoding) opcodes
MOVs and loads and stores and LOADK:
- MOV
- load/store from (the stack pointer plus a 6-bit offset (scaled by 4))
- load/store from (memory address in a register, plus a 5-bit immediate offset)
- Load 6-bit immediate into register
Jumps and branches:
- Jump to a PC-relative offset given as an immediate (signed 11-bit range), or to an address held in a register; optionally update the link register with the current PC (plus 1 instruction).
- Branch if zero (or if not-zero) to PC-relative offset, signed 8-bit.
Other stack-pointer-related:
- Increment or decrement the stack pointer by 6-bit immediate
- Add 8-bit immediate to stack pointer, and write the result to a register
Arithmetic and boolean logic:
- Shifts: left, right, logical, arithmetic, 6-bit immediate value for shift amount
- ADD, SUB (overwriting result onto one of the inputs)
- bitwise AND, OR, XOR (overwriting result onto one of the inputs); also bitwise AND with 6-bit immediate
Misc:
- NOP, BREAK, BAD (illegal instruction; no actual mnemonic, i made up 'BAD')
Details of RISC-V Compressed (16-bit encoding) opcodes
Variants are bit-width and integer vs. floating-point (there are (optional) floating-point registers in RISC-V).
C.(F)L(W|D|Q)SP: Load value from stack (stack-pointer + 6-bit offset) into register. (there is no FLQSP though)
C.(F)S(W|D|Q)SP: Store value from register to stack (stack-pointer + 6-bit offset). (there is no FSQSP though)
C.(F)L(W|D|Q): Load value from memory (memory address in a register, plus a 5-bit immediate offset) into register (there is no FLQ though)
C.(F)S(W|D|Q): Store value from register into memory (memory address in a register, plus a 5-bit immediate offset) (there is no FSQ though)
C.J: Jump to offset given as an immediate constant (PC-relative, signed 11-bit, +-2k range (so +-1k instructions)).
C.JAL: Like C.J but also writes the current PC (plus 1 instruction) to the link register.
C.JR: Jump to the address held in a register (not PC-relative).
C.JALR: Like C.JR but also writes the current PC (plus 1 instruction) to the link register.
C.BEQZ: Branch if the value in the given register is zero. Offset is signed 8-bit, +-256 (so +-128 instructions).
C.BNEZ: Like C.BEQZ but branch if NOT zero.
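The ranges quoted for C.J and C.BEQZ follow from the field width and the fact that offsets are counted in 2-byte units. A quick Python check of the arithmetic (the helper is my own, not from any tool):

```python
def scaled_signed_range(field_bits, scale):
    """Byte range reachable by a signed immediate of field_bits bits, scaled by scale."""
    lowest = -(1 << (field_bits - 1)) * scale
    highest = ((1 << (field_bits - 1)) - 1) * scale
    return lowest, highest

# Offsets are multiples of 2 bytes because compressed instructions are 16 bits wide.
print("C.J   :", scaled_signed_range(11, 2))
print("C.BEQZ:", scaled_signed_range(8, 2))
```

So "+-2k" and "+-256" are slightly asymmetric in practice: the largest forward offset is one unit short of the largest backward one.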
C.LI: Load 6-bit immediate into register.
C.LUI: Load 6-bit immediate into bits 17-12 of register.
C.ADDI(W): Add 6-bit immediate to register (mutating the register).
C.ADDI16SP: Scale 6-bit immediate by 16 then add to stack-pointer (mutating the stack pointer). "used to adjust the stack pointer in procedure prologues and epilogues".
C.ADDI4SPN: Scale 8-bit immediate by 4, add to stack pointer, and write the result to register. "used to generate pointers to stack-allocated variables".
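A rough Python model of how i understand these immediates to be interpreted, assuming RV32, sign-extension for the 6-bit immediates and zero-extension for the 8-bit one in C.ADDI4SPN. The function names just mirror the mnemonics, and the sketch ignores the reserved encodings (e.g. a zero immediate).

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def c_li(imm6):
    return sign_extend(imm6, 6) & MASK32

def c_lui(imm6):
    # The immediate lands in bits 17..12; bit 17 is sign-extended upward.
    return (sign_extend(imm6, 6) << 12) & MASK32

def c_addi16sp(sp, imm6):
    return (sp + sign_extend(imm6, 6) * 16) & MASK32

def c_addi4spn(sp, uimm8):
    return (sp + (uimm8 & 0xFF) * 4) & MASK32

print(hex(c_lui(0x01)), hex(c_lui(0x3F)))
print(hex(c_addi16sp(0x1000, 0x3F)), hex(c_addi4spn(0x1000, 3)))
```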
C.S(L|R)(L|A)I: (logical|arithmetic) (left|right)-shifts a register (mutating it) (6-bit immediate shift amount). These variants have a non-uniform scheme for interpreting the immediate to allow it to be most useful. (there is no SLAI though)
C.ANDI is bitwise AND of a register and a 6-bit immediate (mutating the register).
C.MV is register-register MOV.
C.(ADD|SUB)(W) adds|subtracts two registers and writes the result over one of the input registers.
C.AND, C.OR, C.XOR are bitwise AND, OR, XOR of two registers, writing the result over one of the input registers.
C.BAD, the all-zero instruction, is illegal (no mnemonic is given, i made up 'BAD')
C.NOP is NOP
C.EBREAK breaks into the debugging environment.
Base instructions (32-bit encoding)
from https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf ; see also http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf or https://www.cl.cam.ac.uk/teaching/1617/ECAD+Arch/files/docs/RISCVGreenCardv8-20151013.pdf although they have an older version of the ISA:
- Loads: LB (Load Byte), LH (Load Halfword), LW (Load Word), LBU (Load Byte Unsigned), LHU (Load Half Unsigned),
- Stores: SB (Store Byte), SH (Store Halfword), SW (Store Word)
- Arithmetic: ADD, ADDI (ADD Immediate), SUB (SUBtract), LUI (Load Upper Imm), AUIPC (Add Upper Imm to PC (note: the Epiphany guy thought this could have been left out, but in the comments a RISC-V guy said it was useful))
- Logical (note: the Epiphany guy thought the immediate versions of these could have been left out): XOR, XORI, OR, ORI, AND, ANDI
- Shifts: SLL (Shift Left), SLLI (Shift Left Immediate), SRL (Shift Right), SRLI (Shift Right Immediate), SRA (Shift Right Arithmetic), SRAI (Shift Right Arith Imm),
- Compare (note: the Epiphany guy thought these could have been left out): SLT (Set <), SLTI (Set < Immediate), SLTU (Set < Unsigned), SLTIU (Set < Imm Unsigned)
- Branch: BEQ (Branch =), BNE (Branch !=), BLT (Branch <), BGE (Branch >=), BLTU (Branch < Unsigned), BGEU (Branch >= Unsigned)
- Jump & Link: JAL (Jump and Link), JALR (Jump & Link Register)
- Synch: FENCE (Synch threads (note: the Epiphany guy thought this could have been left out)), FENCE.I (Synch Instr & Data)
- System: SCALL (System CALL), SBREAK (System BREAK) (these are called ECALL and EBREAK in newer versions of the spec)
- Counters pseudo-instructions: RDCYCLE (ReaD CYCLE), RDCYCLEH (ReaD CYCLE upper Half), RDTIME (ReaD TIME), RDTIMEH (ReaD TIME upper Half), RDINSTRET (ReaD INSTRuctions RETired), RDINSTRETH (ReaD INSTRuctions RETired upper Half)
- Control and Status Register: CSRRW (Atomic Read/Write CSR), CSRRS (Atomic Read and Set Bits in CSR), CSRRC (Atomic Read and Clear Bits in CSR), CSRRWI (CSRRW immediate), CSRRSI (CSRRS immediate), CSRRCI (CSRRC immediate)
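The 'upper Half' counter reads (RDCYCLEH etc.) exist because on a 32-bit machine a 64-bit counter has to be read in two pieces, and the low half can roll over between the two reads. As i understand it, the usual trick is to read high, low, high again, and retry if the high half changed. A toy Python sketch, where FakeCounter is a made-up stand-in for the hardware counter that advances on every read:

```python
class FakeCounter:
    """Hypothetical 64-bit counter that advances by one on every read."""
    def __init__(self, start):
        self.value = start

    def _tick(self):
        old = self.value
        self.value += 1
        return old

    def read_lo(self):                       # stands in for RDCYCLE
        return self._tick() & 0xFFFFFFFF

    def read_hi(self):                       # stands in for RDCYCLEH
        return (self._tick() >> 32) & 0xFFFFFFFF

def read_counter64(read_hi, read_lo):
    """Read a consistent 64-bit value from two 32-bit halves."""
    while True:
        hi = read_hi()
        lo = read_lo()
        if read_hi() == hi:                  # no rollover happened in between
            return (hi << 32) | lo

counter = FakeCounter(0xFFFFFFFF)            # low half is about to roll over
print(hex(read_counter64(counter.read_hi, counter.read_lo)))
```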
Multiply-divide extension ('M') instructions
from https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf or http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf :
(note: the Epiphany guy thought the 'upper half' multiply variants could have been left out)
- MULtiply
- MULtiply upper Half
- MULtiply upper Half Signed/Unsigned
- MULtiply upper Half Uns
- DIVide
- DIVide Unsigned
- REMainder
- REMainder Unsigned
(mul, mulh, mulhsu, mulhu, div, divu, rem, remu)
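A small Python model of the four multiply variants for 32-bit registers, to show what 'upper half' means: the full product is 64 bits wide, MUL keeps the low 32 bits, and the MULH* variants keep the high 32 bits under different signedness assumptions. This is my own emulation using Python's big integers, not how a CPU would implement it.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Reinterpret a 32-bit pattern as a signed two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mul(a, b):      # low 32 bits; the same for signed and unsigned operands
    return (a * b) & MASK32

def mulh(a, b):     # high 32 bits of signed x signed
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK32

def mulhu(a, b):    # high 32 bits of unsigned x unsigned
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32

def mulhsu(a, b):   # high 32 bits of signed x unsigned
    return ((to_signed(a) * (b & MASK32)) >> 32) & MASK32

a = b = 0xFFFFFFFF  # -1 if signed, 2**32 - 1 if unsigned
print(hex(mul(a, b)), hex(mulh(a, b)), hex(mulhu(a, b)), hex(mulhsu(a, b)))
```

A 64-bit product on RV32 would then be a MUL for the low word plus one of the MULH* instructions for the high word.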
Floating-point extension ('F') instructions
from https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf or http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf :
Load and store:
- Float Load (FLW)
- Float Store (FSW)
Arithmetic
- Float ADD
- Float SUBtract
- Float MULtiply
- Float DIVide
- Float SQuare RooT
Mul-Add:
- Float Multiply-ADD
- Float Multiply-SUBtract
- Float Negative Multiply-SUBtract (note: the Epiphany guy thought this could have been left out)
- Float Negative Multiply-ADD
Move (note: the Epiphany guy thought these could have been left out):
- Float MoVe from integer (FMV.W.X)
- Float MoVe to integer (FMV.X.W)
Sign Inject (note: the Epiphany guy thought these could have been left out):
- Float SiGN source (FSGNJ)
- Float SiGN source Negate (FSGNJN)
- Float SiGN source Xor (FSGNJX)
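The sign-injection instructions are easier to understand as bit operations: the result takes everything except the sign bit from the first operand, and the sign bit is derived from the second operand. A Python sketch on single-precision bit patterns (helper names are mine; struct is only used to get at the IEEE 754 bits):

```python
import struct

SIGN = 0x80000000

def f2b(x):
    """Bit pattern of x as an IEEE 754 single."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def b2f(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def fsgnj(a, b):    # sign of b
    return b2f((f2b(a) & 0x7FFFFFFF) | (f2b(b) & SIGN))

def fsgnjn(a, b):   # opposite of the sign of b
    return b2f((f2b(a) & 0x7FFFFFFF) | (~f2b(b) & SIGN))

def fsgnjx(a, b):   # sign of a XOR sign of b
    return b2f(f2b(a) ^ (f2b(b) & SIGN))

print(fsgnj(1.0, -2.0), fsgnjn(3.5, 3.5), fsgnjx(-3.5, -3.5))
```

With both operands set to the same register these give move, negate and absolute value, which is presumably why there are no separate instructions for those.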
Min/Max (note: the Epiphany guy thought these could have been left out):
- Float MINimum
- Float MAXimum
Compare:
- compare Float EQual
- compare Float Less Than
- compare Float Less than or Equal to
Convert:
- Float ConVerT from int (FCVT.S.W)
- Float ConVerT from int Unsigned (FCVT.S.WU)
- Float ConVerT to int (FCVT.W.S)
- Float ConVerT to int Unsigned (FCVT.WU.S)
Categorization (note: the Epiphany guy thought these could have been left out):
- Float CLASSify type (FCLASS)
Configuration instructions (read/write the Floating-Point Control and Status Register, fcsr):
- Float Read Control Status Register (read the fcsr into an integer register)
- Float Swap Control Status Register (swap the fcsr with an integer register)
Configuration pseudo-op instructions:
- Float Read Rounding Mode, Float Swap Rounding Mode, Float Set Rounding Mode Immediate
- Float Read FLAGS, Float Swap FLAGS (accrued exception flags), Float Set FLAGS Immediate
(flw fsw fadd fsub fmul fdiv fsqrt fmadd fmsub fnmsub fnmadd fmv.w.x fmv.x.w fsgnj fsgnjn fsgnjx fmin fmax feq flt fle fcvt.s.w fcvt.s.wu fcvt.w.s fcvt.wu.s fclass frcsr fscsr frrm fsrm fsrmi frflags fsflags fsflagsi)
Atomicity extension opcodes
from http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf :
- Load Reserved
- Store Conditional
- SWAP
- ADD
- XOR
- AND
- OR
- MINimum
- MAXimum
- MINimum Unsigned
- MAXimum Unsigned
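Load Reserved / Store Conditional are the general-purpose pair here: the other operations in the list can in principle be built from an LR/SC retry loop. A toy single-threaded Python model of the idea, assuming one reservation per memory and the convention that a store-conditional returns 0 on success and nonzero on failure. ToyMemory and amo_add are invented for illustration and say nothing about real hardware behaviour.

```python
MASK32 = 0xFFFFFFFF

class ToyMemory:
    """Word memory with a single reservation, for illustrating LR/SC only."""
    def __init__(self):
        self.words = {}
        self.reservation = None

    def load_reserved(self, addr):
        self.reservation = addr
        return self.words.get(addr, 0)

    def store(self, addr, value):
        # An ordinary store to a reserved address breaks the reservation.
        self.words[addr] = value & MASK32
        if self.reservation == addr:
            self.reservation = None

    def store_conditional(self, addr, value):
        if self.reservation != addr:
            return 1                      # failure: nothing is written
        self.words[addr] = value & MASK32
        self.reservation = None
        return 0                          # success

def amo_add(mem, addr, increment):
    """Atomic fetch-and-add expressed as an LR/SC retry loop."""
    while True:
        old = mem.load_reserved(addr)
        if mem.store_conditional(addr, old + increment) == 0:
            return old

mem = ToyMemory()
mem.store(0x100, 5)
print(amo_add(mem, 0x100, 3), mem.words[0x100])
```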
suggested register usage
- 0: Constant 0
- 1: Reserved for the assembler
- 2 - 3: Variables
- 4 - 7: arguments
- 8 -15: temporary values
- 16 - 23: saved
- 24 - 25: temporary values
- 26 - 27: reserved for kernel use
- 28: Global pointer
- 29: Stack pointer
- 30: Frame pointer
- 31: Return address
-- [14]
RISC-V interrupts
" In this second RISC-V article I talk about its interrupt and exception system and about SiFive's FE310G, the first commercial silicon implementation of a RISC-V ...
RISC-V ISA defines two major interrupt types: global and local. Basically, global interrupts are designed for multicore environments, while local interrupts are always associated with one specific core. Local interrupts suffer less overhead as there is no need for arbitration (which is the case of global interrupts on multicore systems).
...
Local interrupt system is responsible for processing a limited (and usually small) number of interrupt sources. The CLINT (Coreplex Local Interrupts) module has three basic interrupt sources: software interrupt (SI), timer interrupt (TI) and external interrupt (EI). RISC-V ISA also defines sixteen other optional local interrupt sources (which are not present on E31). One important note: all global interrupts from PLIC (Platform-level Interrupt Controller) are applied to the external interrupt input within CLINT!
RISC-V interrupt system will suspend execution flow and branch to an ISR if a local interrupt source (as long as it is previously enabled) sets its pending interrupt flag. There is also a global interrupt enable bit (MIE/SIE/UIE according to the current mode) available on MSTATUS register. This register also controls interrupt nesting, memory access privileges, etc. For further information regarding take a look at the RISC-V privileged instructions manual.
There are two ways to deal with interrupts on RISC-V: by using a single vector or multiple vectors. On the single vector mode, register MTVEC (CSR number 0x305) points to the ISR base address, that is, MTVEC points to the single/unique entry point for all ISR code. On the multiple vector mode, on the other hand, MTVEC works as a pointer to the vector table base address and the index for that table is taken from the MCAUSE register (CSR number 0x342). " [15]
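A sketch of how i understand the two MTVEC modes to pick the handler address, assuming the layout in the privileged spec: the low two bits of MTVEC select the mode (0 = single vector, 1 = multiple vectors), the remaining bits give the base, and the top bit of MCAUSE distinguishes interrupts from exceptions. The function name and the xlen parameter are mine.

```python
def trap_target(mtvec, mcause, xlen=32):
    """Address the hart jumps to when a trap is taken, given mtvec and mcause."""
    base = mtvec & ~0x3
    mode = mtvec & 0x3
    is_interrupt = (mcause >> (xlen - 1)) & 1
    code = mcause & ((1 << (xlen - 1)) - 1)
    if mode == 1 and is_interrupt:
        return base + 4 * code        # one 4-byte table slot per interrupt cause
    return base                       # single entry point (and all exceptions)

MACHINE_TIMER_IRQ = 0x80000007        # interrupt bit set, cause code 7
print(hex(trap_target(0x80000100, MACHINE_TIMER_IRQ)))   # single vector mode
print(hex(trap_target(0x80000101, MACHINE_TIMER_IRQ)))   # multiple vector mode
```

In single vector mode the ISR has to read MCAUSE itself and dispatch in software; in multiple vector mode each table slot typically holds a jump to the real handler.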
RISC-V variants
DarkRiscV subset
I think it contains:
- lui auipc
- jal jalr
- beq bne blt bge bltu bgeu
- lb lh lw lbu lhu sb sh sw
- addi add sub
- slli srli srai sll srl sra
- slti sltiu slt sltu
- xori ori andi xor or and
Note that it does not contain the fence*, e*, and csr* instructions (memory fences, privilege levels and configuration registers). I believe that it also omits the SCALL, SBREAK, and the counter (RD*) instructions. The above instructions are all of the RV32I instructions except for these omissions.
RISC-V links
RISC-V discussion
Lack of execute-only/read-only memory
" tropo 51 days ago [-]
Security:
It still won't do execute-only and true read-only memory. We've had true read-only for ages now on x86, and just got execute-only. You need these: rw- r-- --x
It still has poor support for ASLR, especially the limited-MMU variants. Even the most limited version should be able to require that the uppermost address bits be something randomish, even if it's only a per-priv-level random cookie. " -- [16]
Lack of overflow checks
" pizlonator 51 days ago [-]
"We did not include special instruction set support for overflow checks on integer arithmetic operations, as many overflow checks can be cheaply implemented using RISC-V branches."
False. For example, JavaScript add/sub will require 3x more instructions on RISC-V than x86 or ARM. Same will be true for any other language requires (either implicitly, like JS, or explicitly, like .NET) overflow checking. Good luck with that, lol.
__s 50 days ago [-]
Many overflow checks can be removed with optimization. RISC-V's compressed encoding has shown to be ~70-80% more compact, so it has room for overflow checks. The efficiency of the architecture can always compile it out by time it hits microcode
http://joeduffyblog.com/2015/12/19/safe-native-code
pizlonator 50 days ago [-]
I pioneered most of WebKit's overflow check optimizations, and our compiler is bleeding-edge when it comes to eliminating them. Still, the overwhelming majority of the checks remain, because most integer values are not friendly to analysis (because they came from some heap location, or they came from some hard math, etc).
I doubt that the architecture will compile out signed integer addition overflow checks, which are the most common. They are brutal to express correctly without an overflow bit, and the architecture will have a hard time with this.
zxcdw 50 days ago [-]
Why do you suppose they left it out? Is it merely a matter of "cheap implementation" being purely subjective, and hence they might have thought it as cheap, while you seem to disagree? Or could there be a more pressing reason, but "Oh well, its cheap enough anyway" is more of an excuse?
pizlonator 50 days ago [-]
I don't think they knew that modern languages rely on overflow checks so heavily and that the perf of overflow checks dominates perf overall. " -- [17]
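For reference, this is roughly what "implemented using RISC-V branches" looks like, emulated in Python with 32-bit masking. The signed check follows the add / slti / slt / bne pattern i believe the spec suggests for general signed addition; the unsigned one recovers the missing carry bit with an unsigned compare (SLTU). The function names are mine.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Reinterpret a 32-bit pattern as a signed two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def add_signed_checked(a, b):
    """Return (sum, overflowed) for a signed 32-bit add without a flags register."""
    s = (a + b) & MASK32                       # add  t0, t1, t2
    b_negative = to_signed(b) < 0              # slti t3, t2, 0
    sum_less = to_signed(s) < to_signed(a)     # slt  t4, t0, t1
    return s, b_negative != sum_less           # bne  t3, t4, overflow

def add_unsigned_carry(a, b):
    """Return (sum, carry) for an unsigned 32-bit add."""
    s = (a + b) & MASK32                       # add  t0, t1, t2
    return s, 1 if s < (a & MASK32) else 0     # sltu t3, t0, t1

print(add_signed_checked(0x7FFFFFFF, 1))
print(add_unsigned_carry(0xFFFFFFFF, 1))
```

So a checked signed add costs about three extra instructions on top of the add itself, which is the overhead the thread above is arguing about.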
Opinions
Overall opinions
http://www.adapteva.com/andreas-blog/analyzing-the-risc-v-instruction-set-architecture/
" There are a lot of things about the RISCV design that come from a very ideological place and hurt in a high end design. Yes, there are extensions and designs with high end features, that's certainly true, and I'm sure people someone will be making a high end version at some point. But the ISA isn't very well suited to it compared to Power or ARM.
By default, code density on RISC-V is pretty bad. You can try to solve that by using variable length instructions which many high end RISC-V projects intend to do but having variable length instructions means your front end is going to have to be more complicated to reach the same level of performance that a fixed width instruction machine can achieve.
More instructions for a task means your back end also has to execute more instructions to reach the same level of performance. One way to do better is to fuse together ISA-level instructions into a smaller number of more complex instructions that get executed in your core. This is something that basically every high end design does but RISC-V would have to do it far more extensively than other architectures to achieve a similar level of density on the back end which makes designing a high end core more complex and possibly uses extra pipeline stages making mispredicts more costly.
And more more criticisms here: https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d9982f7618ef68
EDIT: But in fairness it looks like conditional move might be getting added to the bit manipulation RISC-V extension which would fix one big pain point.
This isn't to say that RISC-V is bad. It's simplicity makes it wonderful for low end designs. It's extensibility makes it great for higher level embedded uses where you might want to add some instruction that makes your life easier for your hard driver controller or whatever in a way that would require a very expensive architecture license if you were using ARM. It's open, which would be great if you were making a high end open-source core for other people to use except the Power ISA just opened up so if I were to start a project like that I'd use that instead. " -- Symmetry
"Aarch64 has more complex addressing modes (base + index<<shift in particular) whereas RISC-V needs both RVC and fusion to do the same with similar code size and execution slot occupation. Personally, I'm leaving towards thinking that it was a mistake for RISC-V to not support such addressing modes. Unless you're aiming for something really super-constrained in terms of gate counts, having an adder and small shifter as part of your memory pipeline(s) seem like an obvious choice. And thus, having single instructions to use those pipelines isn't really committing any sins against the RISC philosophy. " -- jabl
(On ARM) "NEON is guaranteed to exist on everything, and this means you're never going to see Aarch64 replace the Cortex M0 and M3. That's fragmentation right there. Severe fragmentation. Two completely incompatible ISAs. Small 32 bit RISC-V comes in smaller and lower power than an M0, and small 64 bit RISC-V is not much bigger than an M0 and is rather popular controlling something in the corner of a larger 64 bit SoC." -- brucehoult
- "Personally I think the POWER instruction set is better in many ways. It has a proven track record of high performance and embedded implementations." -- orbifold
- "POWER was designed to be a compiler writer's dream and has some sharp implementation corners. I think I would probably recycle the Alpha ISA circa 21164 (EV-5) with maybe a CAS instruction. It was pretty balanced between hardware and software and a lot of the complications in the VLSI design (dynamic logic, mostly) are moot with a modern technology if you stick with reasonable speeds. Presumably now that the MIPS unaligned byte access patents are expired, a whole bunch of the idiocy that Alpha had to abide to avoid that patent can just be sidestepped." -- bsder
Misc opinions
- "The V in RISC-V stands for the 5 different immediate encodings" -- @erincandescent
- "I guess my opinion is that RISC-V doesn't have many gigantic flaws but for a modern architecture it does contain dozens of unnecessary unforced errors and especially given who was involved in designing it that's just very disappointing" -- @erincandescent
- "... Instruction Fusion looks way better in benchmarks than reality (Fusion wants specific patterns of adjacent instructions which a good fusion unaware compiler - say, one targetting preexisting CPUs - will try its hardest to avoid!)" -- @erincandescent
- "Even out of order CPUs rarely fuse non adjacent instructions. It requires that you verify that there's no observation of intermediate side effects, and is a big combinational explosion in terms of muxing. It's already bad enough that if you're trying to reduce executed instruction count with fusion (rather than just improve latencies) it is basically a new set of variable length instructions by the back door. Now, one neat thing RISC-V implementations do do (or have proposed) is treat a fused adjacent pair of 16 bit instructions as 32 bit. This is still somewhat painful (because fusion is slower than length decode) but it does mean you can save on the size of some structures." -- @erincandescent
- "Agree it does feel like you're just re-CISCing the RISC via backdoor uarch policies! Just to understand, the 16-bit fusion is neat just because requires half the encoding space vs 32-bit? But otherwise all the same thing, right?" [18]
- "The core already supports both 16 and 32 bit instructions. You detect the 16 bit pair you want to fuse and pretend it's one 32 bit instruction, it doesn't massively complicate it" [19]
- "I'm still annoyed they put integer multiply in an extension instead of core" -- Peter Barufss
- "RISC-V scales down to chips smaller than Cortex M0 chips. Guess why ARM never replaced Z80 chips?" [20]
- "Honestly the bit that annoys me most is no LL/SC in core. THEY CAN JUST BE ALIASED FOR LOAD/STORE EXCEPT FOR ONE BIT OF STATE" -- @erincandescent
- "The fact that JAL encoded a register made my brain explode and I stopped there. From my perspective as a linker engineer the removal of J is basically game over from a size and perf standpoint, at least for mobile and desktop." -- Igerbarg
- "It's like... They went for a specific kind of dogmatic simplicity and... The real world is uglier" -- @erincandescent
- "I was at a talk on RISC V given by Patterson recently, and apparently the main original goal of the project was to have a modern ISA for teaching purposes that wouldn't incur in licensing issues. I would imagine that many of such unrealistic simplifications come from that." -- Alessandro @volcacius
- "Yeah, I'm reading all this in the first place because I'm like "I bet it would be interesting to implement a soft cpu, just to see if i could" and every "simple" thing y'all are decrying I'm like "oh, that makes it so much easier for me!"...but it makes it easier for me BECAUSE I'm making a minimal, for-one-person, non-pipelined no-memory-hierarchy implementation, I can totally see how if my goal was to make a commercial chip it would make my life only harder" -- @mcclure111
- "Sorry for the v 101 question, but does this mean that building an IOT or even a CPU chip with RSIC V is just going to be always a failure/substantially weaker than ARM/x86? There’s a lot of interest in CHina now - wondering if it’s pure hype? Thanks!"
- "Lots of these hurt more at the higher end of the market, and good engineering can overcome a lot. x86 has many sins but still has world leading performance." -- @erincandescent
- "Just based on trends – Moore’s law and Spectre/Meltdown – my impression is the same: RISC-V is designed for simplicity, not modernity." -- anordal
- "...lack of a rotate instruction in the base ISA, and unix standard extension groups. This makes symmetric key encryption and most hash functions much slower. The Bitmanip extension has a rotate instruction, but it should have been in the base ISA." -- bem94
- "While crypto is important, it's not the whole world and not everyone needs the absolute fastest crypto possible...Most software doesn't use rotate operations. When you do need one, it takes three instructions to synthesize it if the shift count is a constant, or four otherwise. Unless you're doing nothing but rotates you're not going to notice it. If you're doing any memory loads that miss in the L1 cache to get the data you're rotating then you're also not going to notice it." -- brucehoult
- "I'd also emphasise the lack of indexed load/store instructions again. It is a disaster for any sort of array indexing where you're addressing >1 array using the same index. I found this in the context of multi-precision arithmetic for crypto, but examples come from all over." -- bem94
- "On the author's point about multiply and divide in the same extension: crypto is another good example. Lots of crypto really benefits from multiply, but doesn't need divide." -- bem94
- "The most important thing about RISC-V is the idea behind it's openness as a standard and the ecosystem around that standard. The engineering of the ISA itself is not what makes it remarkable, and actually leaves a lot to be desired." -- bem94
- "Having one instruction that can be a branch, call, or return...makes tracking the callstack harder" -- [21] combined with [22]
- a thread about RISC-V and whether relying on fusing instructions is acceptable: https://www.reddit.com/r/programming/comments/cixatj/an_exarm_engineer_critiques_riscv/evarycd
- one response is that empirically, RISC-V's 'Total dynamic bytes' is lower than other popular ISAs: https://www.reddit.com/r/programming/comments/cixatj/an_exarm_engineer_critiques_riscv/evg87ti/
- "There is no point in having an artificially small set of instructions. Instruction decoding is a laughably small part of the overall die space and mostly irrelevant to performance if you don't get it terribly wrong. It's always possible to start with complex instructions and make them execute faster. However, it is very hard to speed up anything when the instructions are broken down like on RISC V as you can't do much better than execute each individually." -- [23]
- "...This is already a terrible pain point with ARM and the RISC-V people go even further and put fundamental instructions everybody needs into extensions. For example: Multiply is optional" -- [24]
- "RISCV is not meant for high performance. It's optimizing for low cost, where it has the potential to really compete with ARM." [25]
- "The sweet spot for RISCV in my opinion is competing with higher-end ARM microcontrollers, like the -M4, and various low-end application processors like the Cortex-A9. But those all have full integer instructions and often an FPU as well." -- [26]
- "There are better ISAs, like ARM64 or POWER. And it's very hard to make a design fast if it doesn't give you anything to make fast." [27]
- "RISC is better for hardware-constrained simple in-order implementations, because it reduces the overhead of instruction decoding and makes it easy to implement a simple, fast core. Typically, these implementations have on-chip SRAM that the application runs out of, so memory speed isn't much of an issue. However, this basically limits you to low-end embedded microcontrollers. This is basically why the original RISC concept took off in the 80s -- microprocessors back then had very primitive hardware, so an instruction set that made the implementation more hardware-efficient greatly improved performance. RISC becomes a problem when you have a high-performance, superscalar out-of-order core. These cores operate by taking the incoming instructions, breaking them down into basically RISC-like micro-ops, and issuing those operations in parallel to a bunch of execution units. The decoding step is parallelizable, so there is no big advantage to simplifying this operation. However, at this point, the increased code density of a non-RISC instruction set becomes a huge advantage because it greatly increases the efficiency of the various on-chip caches (which is what ends up using a good 70% of the die area of a typical high-end CPU). So basically, RISCV is good for low-end chips, but becomes suboptimal for higher-performance ones, where you want a more dense instruction set...there's nothing really wrong with riscv. It's likely not as good as arm64 for big chips. It is definitely good enough to be useful" -- psycoee and subthread
- "You might have some sort of point if x86_64 code was more compact than RV64GC code, but in fact it is typically something like 30% *bigger*. And Aarch64 code is of similar size to x86_64, or even a little bigger. In 64 bit CPUs (which is what anyone who cares about high performance big systems cares about) RISC-V is by *far* the most compact code. It's only in 32 bit that it has competition from Thumb2 and some others." -- brucehoult
- "Expert opinion is divided -- to say the least -- on whether complex addressing modes help to make a machine fast. You assert that they do, but others up to and including Turing award winners in computer architecture disagree." -- brucehoult
- "With RISCV, the overhead of, say, passing arguments into a function, or accessing struct fields via a pointer is absolutely insane. Easily 3x vs ARM or x86. Even in an embedded system where you don't care about speed that much, this is insane purely from a code size standpoint. The compressed instruction set solves that problem to some extent, but there is still a performance hit." -- psycoee
- "...why you can't have multiply without divide. That's crazy." -- IshKebab
- "Compilers definitely handle instruction set extensions without too much trouble." -- theQuandary
- "Actually they don't. Unless you specifically tell the compiler to assume more, it's only going to use SSE and SSE2 on amd64." -- FUZxxl
- "It is a problem if I (a) want to write assembly code or (b) want to distribute binary code. Imagine you had no access to binary packages on your computer and instead every package installation was a half-hour wait for compilation to finish. Or alternatively, packages only make use of half the available instructions and are thus much slower than they could be. That's what you get when the ISA is fragmented. It wouldn't be as bad if the RISC-V people didn't place even fundamentally important instructions into instruction set extensions. You can't even count trailing zeroes in the base ISA! Or multiply!" -- FUZxxl
- "If you're compiling a say Linux binary you can very much assume the presence of multiplication. RISC-V's "base ISA" as you call it, that is, RISC-V without any of the (standard!) extensions is basically a 32-bit MOS 6510. A ridiculously small ISA, a ridiculously small core, something you won't ever see if you aren't developing for an embedded platform. How, pray tell, do things look in the case of ARM? Why can't I run an armhf binary on a Cortex-M0? Why can't I execute sse instructions on a Z80? Because they're entirely different classes of chips and no one in their right mind would even try running code for the big cores on a small core. The other way around, sure, and that's why RISC-V can do exactly that." -- barsoap
- "...Yes, ARM has the same fragmentation issues. They fixed this in ARM64 mostly and I'm really surprised RISC-V makes the same mistake..." -- FUZxxl
- "That'd be because there's no such thing as 64-bit microcontrollers." -- barsoap
- "Fragmentation is okay if the base instruction set is sufficiently powerful and if it's not fragmentation but rather a one-dimensional axis of instruction set extensions. Also, there must be binary compatibility. This means that I can optimise my code for n possible sets of available instructions (one for each CPU generation) instead of 2^n sets (one for each combination of available extensions). The same shit is super annoying with ARM cores, especially as there isn't really a way to detect what instructions are available at runtime. Though it got better with ARM64." -- FUZxxl
- "RISC-V aims to be suitable for both very small & simple and very large & complex & fast implementations. Where there is a conflict between the two RISC-V errs in the direction of making the small implementation simple, even if it puts more complexity on the high end -- it's complex anyway and a little more won't be very noticeable. Take the macro-op fusion vs splitting complex instructions into micro-ops argument. Maybe in a mid-level CPU it's a bit easier to do instruction splitting than instruction combining, but having complex instructions means that the very simplest cores are burdened with splitting complex addressing modes into multiple operations or having a sequencer for load/store multiple. That makes a *significant* difference to the size and complexity of those cores that can least afford it." -- brucehoult
- "It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling." -- [28]
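The fragmentation argument above (a one-dimensional axis of extensions versus freely combinable ones) comes down to counting build targets. A minimal sketch of that count, assuming five illustrative extension letters and treating them as fully independent (which real RISC-V is not; D requires F, for example). The function names are made up for this note:

```python
from itertools import combinations

def independent_targets(extensions):
    """Every subset of independently optional extensions is a distinct target."""
    return [frozenset(c)
            for r in range(len(extensions) + 1)
            for c in combinations(extensions, r)]

def linear_targets(extensions):
    """On a one-dimensional axis, each level implies all earlier levels."""
    return [frozenset(extensions[:i]) for i in range(len(extensions) + 1)]

# Illustrative choice of extension letters (assumption, not a real profile).
EXTS = ["M", "A", "F", "D", "C"]

print(len(independent_targets(EXTS)))  # 2**5 subsets, base-only included
print(len(linear_targets(EXTS)))       # 5 levels plus the bare base
```

With n extensions the first count grows as 2^n and the second as n + 1 (counting the bare base in both), which is the gap the quote is complaining about.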
Selected criticisms from https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d9982f7618ef68 (Erin Shepherd):
- "The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and number of instructions."
- "RISC-V's simplifications make the decoder (i.e. CPU frontend) easier, at the expense of executing more instructions. However, scaling the width of a pipeline is a hard problem, while the decoding of slightly (or highly) irregular instructions is well understood (the primary difficulty arises when determining the length of an instruction is nontrivial - x86 is a particularly bad case of this with its numerous prefixes). The simplification of an instruction set should not be pursued to its limits. A register + shifted register memory operation is not a complicated instruction; it is a very common operation in programs, and very easy for a CPU to implement performantly. If a CPU is not capable of implementing the instruction directly, it can break it down into its constituent operations with relative ease; this is a much easier problem than fusing sequences of simple operations. We should distinguish the "Complex" instructions of CISC CPUs - complicated, rarely used, and universally low performance, from the "Featureful" instructions common to both CISC and RISC CPUs, which combine a small sequence of operations, are commonly used, and high performance."
- "Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care."
- "Same instruction (JALR) used for both calls, returns and register-indirect branches (requires extra decode for branch prediction)"
- "RV64I requires sign extension of all 32-bit values. This produces unnecessary top-half toggling or requires special accommodation of the upper half of registers. Zero extension is preferable (as it reduces toggling, and can generally be optimized by tracking an "is zero" bit once the upper half is known to be zero)"
- "All 32-bit ops on x86 zero extend into the upper half of the register. There are various reasons to prefer zero extend, primarily that 32-bit ops leave the top half of the register completely static and you can get good power savings there" [29]
- "LR/SC has a strict eventual forward progress requirement for a limited subset of uses. While this constraint is quite tight, it does potentially pose some problems for small implementations (particularly those without cache)"
- "FP sticky bits and rounding mode are in the same register. This requires serialization of the FP pipe if a RMW operation is performed to change rounding mode"
- "FP Instructions are encoded for 32, 64 and 128-bit precision, but not 16-bit (which is significantly more common in hardware than 128-bit)"
- "No condition codes, instead compare-and-branch instructions. This is not problematic by itself, but rather in its implications:
- Decreased encoding space in conditional branches due to requirement to encode one or two register specifiers
- No conditional selects (useful for highly unpredictable branches)
- No add with carry/subtract with carry or borrow
- (Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags)"
- "Multiply and divide are part of the same extension, and it appears that if one is implemented the other must be also. Multiply is significantly simpler than divide, and common on most CPUs even where divide is not"
- "No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common, and LL/SC type atomics inexpensive (only 1 bit of CPU state required for minimal single CPU implementations)."
- "LR/SC are in the same extension as more complicated atomic instructions"
- "General (non LR/SC) atomics do not include a CAS primitive. The motivation is to avoid the need for an instruction which reads 5 registers (Addr, CmpHi:CmpLo, SwapHi:SwapLo), but this is likely to impose less overhead on the implementation than the guaranteed-forward-progress LR/SC which is provided to replace it"
- "Atomic instructions are provided which operate on 32-bit and 64-bit quantities, but not 8 or 16-bit"
- "For RV32I, no way to transfer a DP FP value between the integer and FP register files except through memory"
- "e.g. RV32I 32-bit ADD and RV64I 64-bit ADD share encodings, and RV64I adds a different ADD.W encoding. This is needless complication for a CPU which implements both instructions - it would have been preferable to add a new 64-bit encoding instead"
- "No MOV instruction. The MV assembler alias is implemented as MV rD, rS -> ADDI rD, rS, 0. MOV optimization is commonly performed by high-end processors (especially out-of-order); recognizing RISC-V's canonical MV requires oring a 12-bit immediate"
- "JAL wastes 5 bits encoding the link register, which will always be R1 (or R0 for branches). This means that RV32I has 21-bit branch displacements (insufficient for large applications - e.g. web browsers - without using multiple instruction sequences and/or branch islands)"
- "Despite great effort being expended on a uniform encoding, load/store instructions are encoded differently (register vs immediate fields swapped). It seems orthogonality of destination register encoding was preferred over orthogonality of encoding two highly related instructions. This choice seems a little odd given that address generation is the more timing critical operation."
- "No loads with register offsets (Rbase+Roffset) or indexes (Rbase+Rindex << Scale)."
- "FENCE.I implies full synchronization of instruction cache with all preceding stores, fenced or unfenced. Implementations will need to either flush entire I$ on fence, or snoop both D$ and the store buffer"
- "In RV32I, reading the 64-bit counters requires reading upper half twice, comparing and branching in case a carry occurs between the lower and upper half during a read operation. Normally 32-bit ISAs include a "read pair of special registers" instruction to avoid this issue"
- "No architecturally defined "hint" encoding space. Hint encodings are those which execute as NOPs on current processors but which have some behavior on later variants. Common examples of pure "NOP hints" are things like spinlock yields. More complicated hints have also been implemented (i.e. those which have visible side effects on new processors; for example, the x86 bounds checking instructions are encoded in hint space so that binaries remain backwards compatible)"
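The RV32I counter-read criticism in the list above (read the upper half, read the lower half, re-read the upper half, retry on mismatch) is easier to follow as code. This is only a Python model of the idea, not RISC-V code: the `Counter64` class and its tick-on-every-read behaviour are assumptions made so the carry case can be reproduced deterministically.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

class Counter64:
    """Toy 64-bit counter that advances by one tick on every 32-bit read."""
    def __init__(self, value):
        self.value = value & MASK64

    def _tick(self):
        self.value = (self.value + 1) & MASK64

    def read_lo(self):
        v = self.value & MASK32
        self._tick()
        return v

    def read_hi(self):
        v = (self.value >> 32) & MASK32
        self._tick()
        return v

def read_naive(counter):
    """Two independent reads: wrong if a carry lands between them."""
    hi = counter.read_hi()
    lo = counter.read_lo()
    return (hi << 32) | lo

def read_consistent(counter):
    """Read hi, lo, hi again; retry until both hi reads agree."""
    while True:
        hi = counter.read_hi()
        lo = counter.read_lo()
        if counter.read_hi() == hi:
            return (hi << 32) | lo

start = 0x00000001FFFFFFFF  # one tick away from a carry into the upper half
print(hex(read_naive(Counter64(start))))       # 0x100000000
print(hex(read_consistent(Counter64(start))))  # 0x200000003
```

The naive read pairs the old upper half with the new lower half and lands almost 2^32 ticks in the past; the retry loop costs an extra read and a branch, which is the overhead the quote objects to.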
- "The worst issue, at least for the versions of the ISA that will run a "real" OS, are the lack of conditional move instructions and lack of bitwise rotation instructions. Lack of shift-and-sum instructions or equivalently addresses with shifted indexes is usually mitigated by optimization of induction variables in the compiler. They are nice to have (I have written code where I took advantage of x86's ability to compute a+b*9 with a single instruction) but not particularly common with the massive inlining that is common in C++ or Rust." -- bonzini
- "Having taken a look at the RISC-V ISA spec I'm wondering if they did cripple LL/SC (LR/SC in RISC-V). Basically:
- LL/SC can prevent ABA if the ABA-prone part is in-between a LL and SC instruction
- To have an ABA-prone problem you need some state implicitly dependent on the atomic state but not encoded in it. Normally (always?) the atomic state is a pointer and we depend on some state behind the pointer not changing in the context of an ABA situation (roughly ~ switch out ptr, change ptr target, switch back in ptr, though often more complex to prevent race conditions). This means in all situations I'm aware of LL/SC only prevents the ABA problem if you at least can do one atomic (relaxed ordering) load "somehow" depending on the LL load. (LL load pointer, offset or similar). But the RISC-V spec doesn't only not guarantee forward progress in these cases (which I guess is fine) but goes as far as explicitly stating that guaranteed not having forward progress is ok, e.g. doing any load between the load reserved and store conditional is allowed to make the store conditional fail. Doesn't that mean that if you target RISC-V you will not benefit from LL/SC based ABA workaround and instead it's just a slightly more flexible and potentially faster compare exchange which can spuriously fail? The spec says you are supposed to detect if it works and potentially switch implementations. But how can you do that reasonably if it means that you have to switch to fundamentally different data structures, which isn't something easily and reasonably done at runtime. Or do I miss something fundamental?" -- dathinab
- "The use of LL/SC for atomics is a common mistake. It makes replay debuggers like rr impossible to implement." -- souprock
- "So many flaws, and without even mentioning the missing POPCOUNT. (No, the M extension does not help.)" [30]
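For the LL/SC versus compare-and-swap ABA question raised in the dathinab quote above, a toy model may help. It is only a sketch: the `Word` class, its single reservation flag, and the rule that any store to the word clears the reservation are simplifying assumptions for this note, not the RISC-V memory model.

```python
class Word:
    """Toy single memory word with one LR/SC reservation flag."""
    def __init__(self, value):
        self.value = value
        self.reserved = False

    def store(self, new):
        # Assumption: any store clears the reservation, even one that
        # writes back the value that was there before.
        self.value = new
        self.reserved = False

    def load_reserved(self):
        self.reserved = True
        return self.value

    def store_conditional(self, new):
        if not self.reserved:
            return False  # reservation lost: caller must retry
        self.value = new
        self.reserved = False
        return True

    def compare_and_swap(self, expected, new):
        if self.value != expected:
            return False
        self.store(new)
        return True

def aba_with_cas():
    w = Word("A")
    seen = w.value            # thread 1 reads A
    w.store("B")              # thread 2 swaps A out ...
    w.store("A")              # ... and back in
    return w.compare_and_swap(seen, "C")   # succeeds: ABA goes unnoticed

def aba_with_lrsc():
    w = Word("A")
    w.load_reserved()         # thread 1 reads A and takes a reservation
    w.store("B")              # thread 2 swaps A out ...
    w.store("A")              # ... and back in
    return w.store_conditional("C")        # fails: reservation was cleared

print(aba_with_cas())   # True
print(aba_with_lrsc())  # False
```

The quote's worry is the step this model leaves out: dereferencing the loaded pointer between the load-reserved and the store-conditional, which the spec allows to make the store-conditional fail.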