RISC-V
open source
The clearest concise summary of the major opcodes is in the tables at the end of http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf
Younger (2010) than, and claims to have learned from, SPARC V8 (1994) and OpenRISC (2000).
http://riscv.org/
http://riscv.org/download.html#tab_isaspec (base opcode listing in Chapter 8)
- Eschews condition codes (dependencies are explicit) [1] [2]
- eschews "branch delay slots, which complicate higher performance implementations" [3] [4]
- "rs1, rs2, rd fields are always in the same location and all register sources and destinations are explicit (makes decoding faster and you can start fetching/renaming without having decoded)" [5]
- "The sign bit for all immediate fields is in a fixed location" [6]
I'm not sure which addressing modes are supported, but I'm guessing it's non-uniform, with different opcodes for different modes: mostly register direct, except for the 'immediate' opcodes, which have an immediate component, and loads and stores, which have a base+offset mode with the base address in register rs1. Unconditional jumps have PC-relative addressing.
interesting comparison of RISC-V with Epiphany (the one used in Parallella) http://www.adapteva.com/andreas-blog/analyzing-the-risc-v-instruction-set-architecture/
note: RISC-V instructions that the Epiphany guy thought maybe could have been left out:
AUIPC (but in the comments a RISC-V guy says AUIPC was important for relocatable code), SLT/SLTI/SLTU/SLTIU (compare: set-less-than, with unsigned, immediate, and unsigned-immediate variants), XORI/ORI/ANDI (boolean logic with immediate values), FENCE (mb; sync threads), MULH/MULHSU/MULHU (multiply variants yielding the 'upper half' of the product), FSGNJ/FSGNJN/FSGNJX (Sign Inject: Sign source), FCLASS (Categorization: Classify Type). Then there was a bunch for which he said "Not needed for Epiphany", which may mean 'this is good, but since Epiphany had a restricted use-case target (DSP) we didn't include it'. These are: FLW/FSW (load/store 'W'; i didn't note this below), FMV.X/FMV.S (move from/to integer), FRSCSR (presumably FRCSR, read control/status register; i didn't note this below), FSRM/FSRMI (swap rounding mode; i didn't note this below), FSFLAGS (swap flags; i didn't note this below). Then there are some for which he said 'Needed?'; these are: FNMSUB (Negative Multiply-SUBtract), FMIN/FMAX (min/max), FCMP (i can't find this in the table in http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf , so i didn't note this below).
Epiphany instructions that the Epiphany guy said RISC-V wrongly left out: LDRD (load/store double), and LDR and STR with the POSTMOD addressing mode (postincrement).
RISC-V has a choice of 16 or 32 integer registers (16 only in the embedded RV32E variant; 32 is standard), plus, optionally, 32 additional floating-point registers. Memory is addressed as 8-bit bytes. The base instruction encoding is 32-bit, but the 'Compressed' extension adds 16-bit encodings. Instructions tend to have 32-bit, 64-bit, and 128-bit variants; arithmetic is done in at least 32-bit width ("RISC-V can load and store 8 and 16-bit items, but it lacks 8 and 16-bit arithmetic, including comparison-and-branch instructions." [7]). Register 0 is constant 0.
RISC; no indirect or memory-memory addr modes; instead there are LOAD and STORE instructions. No autoincrement addr modes. Some opcodes indicate immediate addr mode, others register direct. Little-endian. Branching is compare-and-branch. Variable-length encoding (base instructions are 32 bits, but the low-order bits of each instruction encode its length, allowing 16-bit compressed and longer instructions).
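The length code lives in the low-order bits of each instruction, so a fetch unit can find instruction boundaries before fully decoding anything. A minimal sketch of the rule from the base spec's encoding chapter (the function name is mine):

```python
def insn_length(low_bits: int) -> int:
    """Return the encoded length (in bytes) of a RISC-V instruction,
    given its first 16 bits; the spec puts the length code in the low bits."""
    if low_bits & 0b11 != 0b11:
        return 2   # compressed (RVC) 16-bit instruction: low bits are 00/01/10
    if low_bits & 0b11100 != 0b11100:
        return 4   # standard 32-bit instruction: xxx != 111 in bits [4:2]
    if low_bits & 0b111111 == 0b011111:
        return 6   # 48-bit instruction
    if low_bits & 0b1111111 == 0b0111111:
        return 8   # 64-bit instruction
    raise ValueError("longer encodings reserved")
```

Note how a decoder needs only the bottom bits of the first halfword; this is part of why the Compressed extension can coexist with the 32-bit base encoding.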
"The RISC-V ISA has been designed to include small, fast, and low-power real-world implementations,[2][3] but without over-architecting for a particular microarchitecture style." [8]
"the RISC-V instruction set is designed for practicality of implementation, with features to increase a computer's speed, while reducing its cost and power use. These include placing most-significant bits at a fixed location to speed sign-extension, and a bit-arrangement designed to reduce the number of multiplexers in a CPU." [9]
"RISC-V intentionally lacks condition codes, and even a carry bit.[3] The designers claim that this can simplify CPU designs by minimizing interactions between instructions.[3] Instead RISC-V builds comparison operations into its conditional-jumps.[3] Use of comparisons may slightly increase its power use in some applications. The lack of a carry bit complicates multiple-precision arithmetic. RISC-V does not detect or flag most arithmetic errors, including overflow, underflow, and divide by zero.[3] RISC-V also lacks the "count leading zero" and bit-field operations normally used to speed software floating-point in a pure-integer processor." [10]
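The missing carry bit described above is worked around with SLTU: after a wrapping add, the unsigned comparison `sum < a` is exactly the carry-out. A sketch of a 64-bit add on RV32 (function name and register comments are mine, modeling 32-bit registers as Python ints):

```python
MASK32 = (1 << 32) - 1

def add64_on_rv32(a_lo, a_hi, b_lo, b_hi):
    """Model the RV32 sequence for a 64-bit add without a carry flag:
    add lo ; sltu carry, lo, a_lo ; add hi ; add hi, carry."""
    lo = (a_lo + b_lo) & MASK32           # add  t0, a_lo, b_lo
    carry = 1 if lo < a_lo else 0         # sltu t1, t0, a_lo  (carry-out)
    hi = (a_hi + b_hi + carry) & MASK32   # add  t2, a_hi, b_hi ; add t2, t2, t1
    return lo, hi
```

This is the extra-instruction cost the quote alludes to: each limb of a multiple-precision add needs an explicit SLTU (and an extra ADD) that a carry-flag ISA gets for free.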
" A load or store can add a twelve-bit signed offset to a register that contains an address. A further 20 bits (yielding a 32-bit address) can be generated at an absolute address.[3]
RISC-V was designed to permit position-independent code. It has a special instruction to generate 20 upper address bits that are relative to the program counter. The lower twelve bits are provided by normal loads, stores and jumps.[3] " [11]
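Materializing a full 32-bit address therefore takes a LUI (or AUIPC) for the upper 20 bits plus a 12-bit low part in the load/store/jump itself — and because that low part is sign-extended when added back, the upper part must be rounded up by one whenever bit 11 of the address is set. A sketch of this %hi/%lo split (function name mine):

```python
def hi_lo(addr: int):
    """Split a 32-bit constant into the LUI/ADDI pair. ADDI sign-extends
    its 12-bit immediate, so when bit 11 of addr is set the 20-bit upper
    part must be rounded up by one to compensate."""
    lo = addr & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                     # low part as a signed 12-bit value
    hi = ((addr - lo) >> 12) & 0xFFFFF   # 20-bit value for LUI
    # reconstruction: lui puts hi<<12 in the register, addi adds signed lo
    assert ((hi << 12) + lo) & 0xFFFFFFFF == addr & 0xFFFFFFFF
    return hi, lo
```

The same rounding shows up in assemblers and linkers as the `%hi`/`%lo` relocation pair.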
" RISC-V does define a special set of integer multiplication instructions. This includes a recommended sequence of instructions that a compiler can generate and a CPU can interpret to perform a fused multiply-accumulate operation. Multiply-accumulate is a core primitive of numerical linear algebra, and so is incorporated as part of a common benchmark, Coremark.[3][15] " [12]
Tutorials:
Retrospectives
Risc-V Compressed (16-bit encoding) opcodes
From [13] (Draft version 1.9):
Summary of Risc-V Compressed (16-bit encoding) opcodes
MOVs and loads and stores and LOADK:
- MOV
- load/store from (the stack pointer plus a 6-bit offset (scaled by 4))
- load/store from (memory address in a register, plus a 5-bit immediate offset)
- Load 6-bit immediate into register
Jumps and branches:
- Jump to a PC-relative immediate offset (signed, +-2KiB range), or to an absolute address held in a register; optionally update the link register with the current PC (plus 1 instruction).
- Branch if zero (or if not-zero) to PC-relative offset, signed 8-bit.
Other stack-pointer-related:
- Increment or decrement the stack pointer by 6-bit immediate
- Add 8-bit immediate to stack pointer, and write the result to a register
Arithmetic and boolean logic:
- Shifts: left, right, logical, arithmetic, 6-bit immediate value for shift amount
- ADD, SUB (overwriting result onto one of the inputs)
- bitwise AND, OR, XOR (overwriting result onto one of the inputs); also bitwise AND with 6-bit immediate
Misc:
- NOP, BREAK, BAD (illegal instruction; no actual mnemonic, i made up 'BAD')
Details of Risc-V Compressed (16-bit encoding) opcodes
Variants are bit-width and integer vs. floating-point (there are (optional) floating-point registers in RISC-V).
C.(F)L(W | D | Q)SP: Load value from stack (stack-pointer + 6-bit offset) into register. (there is no FLQSP though)
C.(F)S(W | D | Q)SP: Store value from register to stack (stack-pointer + 6-bit offset). (there is no FSQSP though)
C.(F)L(W | D | Q): Load value from memory (memory address in a register, plus a 5-bit immediate offset) into register. (there is no FLQ though)
C.(F)S(W | D | Q): Store value from register into memory (memory address in a register, plus a 5-bit immediate offset). (there is no FSQ though)
C.J: Jump to offset given as an immediate constant (PC-relative, signed, +-2KiB range (so +-1K compressed instructions)). C.JAL: Like C.J but also writes the current PC (plus 1 instruction) to the link register. C.JR: Jump to the absolute address held in a register. C.JALR: Like C.JR but also writes the current PC (plus 1 instruction) to the link register.
C.BEQZ: Branch if the value in the given register is zero. Offset is PC-relative, signed, +-256 bytes (so +-128 compressed instructions). C.BNEZ: Like C.BEQZ but branch if NOT zero.
C.LI: Load 6-bit immediate into register. C.LUI: Load 6-bit immediate into bits 17-12 of register.
C.ADDI(W): Add 6-bit immediate to register (mutating the register) C.ADDI16SP: Scale 6-bit immediate by 16 then add to stack-pointer (mutating the stack pointer). "used to adjust the stack pointer in procedure prologues and epilogues". C.ADDI4SPN: Scale 8-bit immediate by 4, add to stack pointer, and write the result to register. "used to generate pointers to stack-allocated variables".
C.S(L | R)(L | A)I: (logical | arithmetic) (left | right)-shift of a register (mutating it), with a 6-bit immediate shift amount. These variants have a non-uniform scheme for interpreting the immediate, to allow it to be most useful. (there is no SLAI though)
C.ANDI is bitwise AND of a register and a 6-bit immediate (mutating the register).
C.MV is register-register MOV.
C.(ADD | SUB)(W): adds or subtracts two registers and writes the result over one of the input registers.
C.AND, C.OR, C.XOR are bitwise AND, OR, XOR of two registers, writing the result over one of the input registers.
C.BAD, the all-zero instruction, is illegal (no mnemonic is given; i made up 'BAD')
C.NOP is NOP
C.EBREAK breaks into the debugging environment.
Base instructions (32-bit encoding)
from https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf ; see also http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf or https://www.cl.cam.ac.uk/teaching/1617/ECAD+Arch/files/docs/RISCVGreenCardv8-20151013.pdf although they have an older version of the ISA:
- Loads: LB (Load Byte), LH (Load Halfword), LW (Load Word), LBU (Load Byte Unsigned), LHU (Load Half Unsigned),
- Stores: SB (Store Byte), SH (Store Halfword), SW (Store Word)
- Arithmetic: ADD, ADDI (ADD Immediate), SUB (SUBtract), LUI (Load Upper Immediate), AUIPC (Add Upper Immediate to PC (note: the Epiphany guy thought this could have been left out, but in the comments a RISC-V guy said it was useful for relocatable code))
- Logical (note: the Epiphany guy thought the immediate versions of these could have been left out): XOR, XORI, OR, ORI, AND, ANDI
- Shifts: SLL (Shift Left), SLLI (Shift Left Immediate), SRL (Shift Right), SRLI (Shift Right Immediate), SRA (Shift Right Arithmetic), SRAI (Shift Right Arith Imm),
- Compare (note: the Epiphany guy thought these could have been left out): SLT (Set <), SLTI (Set < Immediate), SLTU (Set < Unsigned), SLTIU (Set < Imm Unsigned)
- Branch: BEQ (Branch =), BNE (Branch !=), BLT (Branch <), BGE (Branch >=), BLTU (Branch < Unsigned), BGEU (Branch >= Unsigned)
- Jump & Link: JAL (Jump and Link), JALR (Jump & Link Register)
- Synch: FENCE (Synch threads (note: the Epiphany guy thought this could have been left out)), FENCE.I (Synch Instr & Data)
- System: SCALL (System CALL), SBREAK (System BREAK) (renamed ECALL and EBREAK in later spec versions)
- Counter pseudo-instructions: RDCYCLE (ReaD CYCLE), RDCYCLEH (ReaD CYCLE upper Half), RDTIME (ReaD TIME), RDTIMEH (ReaD TIME upper Half), RDINSTRET (ReaD INSTRs RETired), RDINSTRETH (ReaD INSTRs RETired upper Half)
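On RV32, each 64-bit counter is exposed as two CSRs (low and upper-Half), and the spec recommends a read-retry sequence to get a consistent 64-bit value even if the low word wraps between reads. A minimal model of that sequence (function name is mine; the two arguments stand in for the CSR-read instructions):

```python
def read_cycle64(rdcycleh, rdcycle):
    """Spec-recommended RV32 sequence for an atomic 64-bit counter read:
    again: rdcycleh t0 ; rdcycle t1 ; rdcycleh t2 ; bne t0, t2, again."""
    while True:
        hi1 = rdcycleh()
        lo = rdcycle()
        hi2 = rdcycleh()
        if hi1 == hi2:               # no carry out of the low word meanwhile
            return (hi1 << 32) | lo
```

On RV64 a single RDCYCLE suffices; the retry loop only exists because RV32 can't read both halves in one instruction.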
- Control and Status Register: CSRRW (Atomic Read/Write CSR), CSRRS (Atomic Read and Set Bits in CSR), CSRRC (Atomic Read and Clear Bits in CSR), CSRRWI (CSRRW immediate), CSRRSI (CSRRS immediate), CSRRCI (CSRRC immediate)
Multiply-divide extension ('M') instructions
from https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf or http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf :
(note: the Epiphany guy thought the 'upper half' multiply variants could have been left out)
- MULtiply
- MULtiply upper Half
- MULtiply upper Half Signed/Unsigned
- MULtiply upper Half Uns
- DIVide
- DIVide Unsigned
- REMainder
- REMainder Unsigned
(mul, mulh, mulhsu, mulhu, div, divu, rem, remu)
Floating-point extension ('F') instructions
from https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf or http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf :
Load and store:
- Float Load (FLW)
- Float Store (FSW)
Arithmetic
- Float ADD
- Float SUBtract
- Float MULtiply
- Float DIVide
- Float SQuare RooT?
Mul-Add:
- Float Multiply-ADD
- Float Multiply-SUBtract
- Float Negative Multiply-SUBtract (note: the Epiphany guy thought this could have been left out)
- Float Negative Multiply-ADD
Move (note: the Epiphany guy thought these could have been left out):
- Float MoVe? from integer (FMV.W.X)
- Float MoVe? to integer (FMV.X.W)
Sign Inject (note: the Epiphany guy thought these could have been left out):
- Float SiGN? source (FSGNJ)
- Float SiGN? source Negate (FSGNJN)
- Float SiGN? source Xor (FSGNJX)
Min/Max (note: the Epiphany guy thought these could have been left out):
- Float MINimum
- Float MAXimum
Compare:
- compare Float EQual
- compare Float Less Than
- compare Float Less than or Equal to
Convert:
- Float ConVerT? from int (FCVT.S.W)
- Float ConVerT? from int Unsigned (FCVT.S.WU)
- Float ConVerT? to int (FCVT.W.S)
- Float ConVerT? to int Unsigned (FCVT.WU.S)
Categorization (note: the Epiphany guy thought these could have been left out):
- Float CLASSify (FCLASS): writes a mask classifying the value (NaN, infinity, zero, subnormal, normal, with sign)
Configuration instructions (read/write the Floating-Point Control and Status Register, fcsr):
- Float Read Control Status Register (read the fcsr into an integer register)
- Float Swap Control Status Register (swap the fcsr with an integer register)
Configuration pseudo-op instructions:
- Float Read Rounding Mode, Float Swap Rounding Mode, Float Swap Rounding Mode Immediate
- Float Read FLAGS, Float Swap FLAGS (accrued exception flags), Float Swap FLAGS Immediate
(flw fsw fadd fsub fmul fdiv fsqrt fmadd fmsub fnmsub fnmadd fmv.w.x fmv.x.w fsgnj fsgnjn fsgnjx fmin fmax feq flt fle fcvt.s.w fcvt.s.wu fcvt.w.s fcvt.wu.s fclass frcsr fscsr frrm fsrm fsrmi frflags fsflags fsflagsi)
Atomicity extension opcodes
from http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.pdf :
- Load Reserved
- Store Conditional
- SWAP
- ADD
- XOR
- AND
- OR
- MINimum
- MAXimum
- MINimum Unsigned
- MAXimum Unsigned
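Load Reserved / Store Conditional is how the extension builds atomic read-modify-write sequences beyond the fixed AMO set: LR sets a reservation on the address, and SC succeeds only if the reservation is still held. A toy single-threaded model (class and function names are mine; real reservations can also be lost to interference from other harts, which this model doesn't show):

```python
class Memory:
    """Toy model of one memory word with LR/SC reservation semantics."""
    def __init__(self, value):
        self.value = value
        self.reserved = False

    def lr(self):          # lr.w rd, (a0): load and set the reservation
        self.reserved = True
        return self.value

    def sc(self, new):     # sc.w rd, rs, (a0): store only if still reserved
        if self.reserved:
            self.value = new
            self.reserved = False
            return 0       # success: rd gets 0
        return 1           # failure: rd gets nonzero, store dropped

def atomic_add(mem, n):
    """retry: lr.w t0,(a0) ; add t0,t0,n ; sc.w t1,t0,(a0) ; bnez t1,retry"""
    while True:
        old = mem.lr()
        if mem.sc(old + n) == 0:
            return old     # returns the pre-increment value
```

The same retry-loop shape implements compare-and-swap, test-and-set, and anything else between the LR and the SC.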
suggested register usage
- 0: Constant 0
- 1: Reserved for the assembler
- 2 - 3: Variables
- 4 - 7: arguments
- 8 -15: temporary values
- 16 - 23: saved
- 24 - 25: temporary values
- 26 - 27: reserved for kernel use
- 28: Global pointer
- 29: Stack pointer
- 30: Frame pointer
- 31: Return address
-- [14]
RISC-V interrupts
" In this second RISC-V article I talk about its interrupt and exception system and about SiFive's FE310G, the first commercial silicon implementation of a RISC-V ...
RISC-V ISA defines two major interrupt types: global and local. Basically, global interrupts are designed for multicore environments, while local interrupts are always associated with one specific core. Local interrupts suffer less overhead as there is no need for arbitration (which is the case of global interrupts on multicore systems).
...
Local interrupt system is responsible for processing a limited (and usually small) number of interrupt sources. The CLINT (Coreplex Local Interrupts) module has three basic interrupt sources: software interrupt (SI), timer interrupt (TI) and external interrupt (EI). RISC-V ISA also defines sixteen other optional local interrupt sources (which are not present on E31). One important note: all global interrupts from PLIC (Platform-level Interrupt Controller) are applied to the external interrupt input within CLINT!
RISC-V interrupt system will suspend execution flow and branch to an ISR if a local interrupt source (as long as it is previously enabled) sets its pending interrupt flag. There is also a global interrupt enable bit (MIE/SIE/UIE according to the current mode) available on MSTATUS register. This register also controls interrupt nesting, memory access privileges, etc. For further information, take a look at the RISC-V privileged instructions manual.
There are two ways to deal with interrupts on RISC-V: by using a single vector or multiple vectors. On the single vector mode, register MTVEC (CSR number 0x305) points to the ISR base address, that is, MTVEC points to the single/unique entry point for all ISR code. On the multiple vector mode, on the other hand, MTVEC works as a pointer to the vector table base address and the index for that table is taken from the MCAUSE register (CSR number 0x342). " [15]
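The single-vector vs. multiple-vector choice above is selected by the low bits of MTVEC (0 = direct, 1 = vectored); in vectored mode an interrupt with cause N traps to base + 4*N, while synchronous exceptions still go to the base address. A sketch, assuming RV32 register widths (function name mine):

```python
def trap_target(mtvec: int, mcause: int) -> int:
    """Where the core jumps on a trap. mtvec's low 2 bits select the mode:
    0 = direct (all traps to base), 1 = vectored (interrupts to
    base + 4*cause). Exceptions always use the base address."""
    base = mtvec & ~0b11
    mode = mtvec & 0b11
    is_interrupt = bool(mcause >> 31)   # top bit of mcause on RV32
    cause = mcause & 0x7FFFFFFF
    if mode == 1 and is_interrupt:
        return base + 4 * cause
    return base
```

So a vectored handler table is just an array of 4-byte slots, each typically holding a jump to the real ISR.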
RISC-V variants
DarkRiscV subset
I think it contains:
- lui auipc
- jal jalr
- beq bne blt bge bltu bgeu
- lb lh lw lbu lhu sb sh sw
- addi add sub
- slli srli srai sll srl sra
- slti sltiu slt sltu
- xori ori andi xor or and
Note that it does not contain the fence*, e*, and csr* instructions (memory fences, environment call/breakpoint — the instructions formerly named SCALL and SBREAK — and configuration registers). I believe that it also omits the counter (RD*) pseudo-instructions. The above instructions are all of the RV32I instructions except for these omissions.
RISC-V links
RISC-V discussion
Lack of execute-only/read-only memory
tropo 51 days ago:
Security:
It still won't do execute-only and true read-only memory. We've had true read-only for ages now on x86, and just got execute-only. You need these: rw- r-- --x
It still has poor support for ASLR, especially the limited-MMU variants. Even the most limited version should be able to require that the uppermost address bits be something randomish, even if it's only a per-priv-level random cookie. " -- [16]
Lack of overflow checks
pizlonator 51 days ago:
"We did not include special instruction set support for overflow checks on integer arithmetic operations, as many overflow checks can be cheaply implemented using RISC-V branches."
False. For example, JavaScript add/sub will require 3x more instructions on RISC-V than x86 or ARM. Same will be true for any other language that requires (either implicitly, like JS, or explicitly, like .NET) overflow checking. Good luck with that, lol.
__s 50 days ago:
Many overflow checks can be removed with optimization. RISC-V's compressed encoding has shown to be ~70-80% more compact, so it has room for overflow checks. The efficiency of the architecture can always compile it out by time it hits microcode
http://joeduffyblog.com/2015/12/19/safe-native-code
pizlonator 50 days ago:
I pioneered most of WebKit's overflow check optimizations, and our compiler is bleeding-edge when it comes to eliminating them. Still, the overwhelming majority of the checks remain, because most integer values are not friendly to analysis (because they came from some heap location, or they came from some hard math, etc).
I doubt that the architecture will compile out signed integer addition overflow checks, which are the most common. They are brutal to express correctly without an overflow bit, and the architecture will have a hard time with this.
zxcdw 50 days ago:
Why do you suppose they left it out? Is it merely a matter of "cheap implementation" being purely subjective, and hence they might have thought it as cheap, while you seem to disagree? Or could there be a more pressing reason, but "Oh well, its cheap enough anyway" is more of an excuse?
pizlonator 50 days ago:
I don't think they knew that modern languages rely on overflow checks so heavily and that the perf of overflow checks dominates perf overall. " -- [17]
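For concreteness, the flagless checked signed add the thread is arguing about is usually synthesized as `add t0,a,b ; slti t1,b,0 ; slt t2,t0,a ; bne t1,t2,overflow`: overflow occurred iff (b < 0) differs from (sum < a) under signed comparison. A model with 32-bit wraparound (function names mine):

```python
MASK32 = (1 << 32) - 1

def to_signed(x):
    """Reinterpret a 32-bit pattern as a signed value."""
    return x - (1 << 32) if x & (1 << 31) else x

def checked_add(a: int, b: int):
    """Model of the RV32 branch sequence for a checked signed add:
       add t0,a,b ; slti t1,b,0 ; slt t2,t0,a ; bne t1,t2,overflow
    Inputs are 32-bit patterns; returns (signed sum, overflow flag)."""
    s = (a + b) & MASK32                                   # add  t0, a, b
    overflow = (to_signed(b) < 0) != (to_signed(s) < to_signed(a))
    return to_signed(s), overflow
```

This is where the "3x more instructions" claim comes from: one add becomes add plus two compares plus a branch, versus add-then-branch-on-overflow-flag on x86 or ARM.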
Opinions
Overall opinions
http://www.adapteva.com/andreas-blog/analyzing-the-risc-v-instruction-set-architecture/
" There are a lot of things about the RISCV design that come from a very ideological place and hurt in a high end design. Yes, there are extensions and designs with high end features, that's certainly true, and I'm sure someone will be making a high end version at some point. But the ISA isn't very well suited to it compared to Power or ARM.
By default, code density on RISC-V is pretty bad. You can try to solve that by using variable length instructions which many high end RISC-V projects intend to do but having variable length instructions means your front end is going to have to be more complicated to reach the same level of performance that a fixed width instruction machine can achieve.
More instructions for a task means your back end also has to execute more instructions to reach the same level of performance. One way to do better is to fuse together ISA-level instructions into a smaller number of more complex instructions that get executed in your core. This is something that basically every high end design does but RISC-V would have to do it far more extensively than other architectures to achieve a similar level of density on the back end which makes designing a high end core more complex and possibly uses extra pipeline stages making mispredicts more costly.
And more criticisms here: https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d9982f7618ef68
EDIT: But in fairness it looks like conditional move might be getting added to the bit manipulation RISC-V extension which would fix one big pain point.
This isn't to say that RISC-V is bad. Its simplicity makes it wonderful for low end designs. Its extensibility makes it great for higher level embedded uses where you might want to add some instruction that makes your life easier for your hard drive controller or whatever in a way that would require a very expensive architecture license if you were using ARM. It's open, which would be great if you were making a high end open-source core for other people to use, except the Power ISA just opened up, so if I were to start a project like that I'd use that instead. " -- Symmetry
"Aarch64 has more complex addressing modes (base + index<<shift in particular) whereas RISC-V needs both RVC and fusion to do the same with similar code size and execution slot occupation. Personally, I'm leaning towards thinking that it was a mistake for RISC-V to not support such addressing modes. Unless you're aiming for something really super-constrained in terms of gate counts, having an adder and small shifter as part of your memory pipeline(s) seems like an obvious choice. And thus, having single instructions to use those pipelines isn't really committing any sins against the RISC philosophy. " -- jabl
(On ARM) "NEON is guaranteed to exist on everything, and this means you're never going to see Aarch64 replace the Cortex M0 and M3. That's fragmentation right there. Severe fragmentation. Two completely incompatible ISAs. Small 32 bit RISC-V comes in smaller and lower power than an M0, and small 64 bit RISC-V is not much bigger than an M0 and is rather popular controlling something in the corner of a larger 64 bit SoC." -- brucehoult
- "Personally I think the POWER instruction set is better in many ways. It has a proven track record of high performance and embedded implementations." orbifold
- "POWER was designed to be a compiler writer's dream and has some sharp implementation corners. I think I would probably recycle the Alpha ISA circa 21164 (EV-5) with maybe a CAS instruction. It was pretty balanced between hardware and software and a lot of the complications in the VLSI design (dynamic logic, mostly) are moot with a modern technology if you stick with reasonable speeds. Presumably now that the MIPS unaligned byte access patents are expired, a whole bunch of the idiocy that Alpha had to abide to avoid that patent can just be sidestepped." bsder
Misc opinions
- "The V in RISC-V stands for the 5 different immediate encodings" -- @erincandescent
- "I guess my opinion is that RISC-V doesn't have many gigantic flaws but for a modern architecture it does contain dozens of unnecessary unforced errors and especially given who was involved in designing it that's just very disappointing" -- @erincandescent
- "... Instruction Fusion looks way better in benchmarks than reality (Fusion wants specific patterns of adjacent instructions which a good fusion unaware compiler - say, one targetting preexisting CPUs - will try its hardest to avoid!)" -- @erincandescent
- "Even out of order CPUs rarely fuse non adjacent instructions. It requires that you verify that there's no observation of intermediate side effects, and is a big combinational explosion in terms of muxing. It's already bad enough that if you're trying to reduce executed instruction count with fusion (rather than just improve latencies) it is basically a new set of variable length instructions by the back door. Now, one neat thing RISC-V implementations do do (or have proposed) is treat a fused adjacent pair of 16 bit instructions as 32 bit. This is still somewhat painful (because fusion is slower than length decode) but it does mean you can save on the size of some structures. " @erincandescent
- "Agree it does feel like you're just re-CISCing the RISC via backdoor uarch policies! Just to understand, the 16-bit fusion is neat just because requires half the encoding space vs 32-bit? But otherwise all the same thing, right?" [18]
- "The core already supports both 16 and 33 bit instructions. You detect the 16 bit pair you want to fuse and pretend it's one 32 bit instruction, it doesn't massively complicate it" [19]
- "I'm still annoyed they put integer multiply in an extension instead of core" -- Peter Barufss
- "RISC-V scales down to chips smaller than Cortex M0 chips. Guess why ARM never replaced Z80 chips?" [20]
- "Honestly the bit that annoys me most is no LL/SC in core. THEY CAN JUST BE ALIASED FOR LOAD/STORE EXCEPT FOR ONE BIT OF STATE" -- @erincandescent
- "The fact that JAL encoded a register made my brain explode and I stopped there. From my perspective as a linker engineer the removal of J is basically game over from a size and perf standpoint, at least for mobile and desktop." -- Igerbarg
- "It's like... They went for a specific kind of dogmatic simplicity and... The real world is uglier" -- @erincandescent
- "I was at a talk on RISC V given by Patterson recently, and apparently the main original goal of the project was to have a modern ISA for teaching purposes that wouldn't incur licensing issues. I would imagine that many of such unrealistic simplifications come from that." -- Alessandro @volcacius
- "Yeah, I'm reading all this in the first place because I'm like "I bet it would be interesting to implement a soft cpu, just to see if i could" and every "simple" thing y'all are decrying I'm like "oh, that makes it so much easier for me!"...but it makes it easier for me BECAUSE I'm making a minimal, for-one-person, non-pipelined no-memory-hierarchy implementation, I can totally see how if my goal was to make a commercial chip it would make my life only harder" -- @mcclure111
- "Sorry for the v 101 question, but does this mean that building an IoT device or even a CPU chip with RISC-V is just going to be always a failure/substantially weaker than ARM/x86? There's a lot of interest in China now - wondering if it's pure hype? Thanks!"
- "Lots of these hurt more at the higher end of the market, and good engineering can overcome a lot. x86 has many sins but still has world leading performance." -- @erincandescent
- "Just based on trends – Moore’s law and Spectre/Meltdown – my impression is the same: RISC-V is designed for simplicity, not modernity." -- anordal
- "...lack of a rotate instruction in the base ISA, and unix standard extension groups. This makes symmetric key encryption and most hash functions much slower. The Bitmanip extension has a rotate instruction, but it should have been in the base ISA." -- bem94
- "While crypto is important, it's not the whole world and not everyone needs the absolute fastest crypto possible...Most software doesn't use rotate operations. When you do need one, it takes three instructions to synthesize it if the shift count is a constant, or four otherwise. Unless you're doing nothing but rotates you're not going to notice it. If you're doing any memory loads that miss in the L1 cache to get the data you're rotating then you're also not going to notice it." -- brucehoult
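Here is the three-instruction synthesis brucehoult describes for a constant rotate amount — `slli t0, x, k ; srli t1, x, 32-k ; or rd, t0, t1` — modeled in Python (function name mine):

```python
MASK32 = (1 << 32) - 1

def rol32(x: int, k: int) -> int:
    """Rotate-left synthesized from base-ISA shifts, for a constant count:
       slli t0, x, k ; srli t1, x, 32-k ; or rd, t0, t1"""
    k &= 31
    return ((x << k) & MASK32) | (x >> (32 - k))
```

With a variable count, a fourth instruction is needed to compute 32-k; the Bitmanip extension's rotate instruction collapses all of this to one.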
- "I'd also emphasise the lack of indexed load/store instructions again. It is a disaster for any sort of array indexing where you're addressing >1 array using the same index. I found this in the context of multi-precision arithmetic for crypto, but examples come from all over." -- bem94
- "On the author's point about multiply and divide in the same extension: crypto is another good example. Lots of crypto really benefits from multiply, but doesn't need divide." -- bem94
- "The most important thing about RISC-V is the idea behind it's openness as a standard and the ecosystem around that standard. The engineering of the ISA itself is not what makes it remarkable, and actually leaves a lot to be desired." -- bem94
- "Having one instruction that can be a branch, call, or return...makes tracking the callstack harder" -- [21] combined with [22]
- a thread about RISC-V and whether relying on fusing instructions is acceptable: https://www.reddit.com/r/programming/comments/cixatj/an_exarm_engineer_critiques_riscv/evarycd
- one response is that empirically, RISC-V's 'Total dynamic bytes' is lower than other popular ISAs: https://www.reddit.com/r/programming/comments/cixatj/an_exarm_engineer_critiques_riscv/evg87ti/
- "There is no point in having an artificially small set of instructions. Instruction decoding is a laughably small part of the overall die space and mostly irrelevant to performance if you don't get it terribly wrong. It's always possible to start with complex instructions and make them execute faster. However, it is very hard to speed up anything when the instructions are broken down like on RISC V as you can't do much better than execute each individually." -- [23]
- "...This is already a terrible pain point with ARM and the RISC-V people go even further and put fundamental instructions everybody needs into extensions. For example: Multiply is optional" -- [24]
- "RISCV is not meant for high performance. It's optimizing for low cost, where it has the potential to really compete with ARM." [25]
- "The sweet spot for RISCV in my opinion is competing with higher-end ARM microcontrollers, like the -M4, and various low-end application processors like the Cortex-A9. But those all have full integer instructions and often an FPU as well." -- [26]
- "There are better ISAs, like ARM64 or POWER. And it's very hard to make a design fast if it doesn't give you anything to make fast." [27]
- "RISC is better for hardware-constrained simple in-order implementations, because it reduces the overhead of instruction decoding and makes it easy to implement a simple, fast core. Typically, these implementations have on-chip SRAM that the application runs out of, so memory speed isn't much of an issue. However, this basically limits you to low-end embedded microcontrollers. This is basically why the original RISC concept took off in the 80s -- microprocessors back then had very primitive hardware, so an instruction set that made the implementation more hardware-efficient greatly improved performance. RISC becomes a problem when you have a high-performance, superscalar out-of-order core. These cores operate by taking the incoming instructions, breaking them down into basically RISC-like micro-ops, and issuing those operations in parallel to a bunch of execution units. The decoding step is parallelizable, so there is no big advantage to simplifying this operation. However, at this point, the increased code density of a non-RISC instruction set becomes a huge advantage because it greatly increases the efficiency of the various on-chip caches (which is what ends up using a good 70% of the die area of a typical high-end CPU). So basically, RISCV is good for low-end chips, but becomes suboptimal for higher-performance ones, where you want a more dense instruction set...there's nothing really wrong with riscv. It's likely not as good as arm64 for big chips. It is definitely good enough to be useful" -- psycoee and subthread
- "You might have some sort of point if x86_64 code was more compact than RV64GC code, but in fact it is typically something like 30% *bigger*. And Aarch64 code is of similar size to x86_64, or even a little bigger. In 64 bit CPUs (which is what anyone who cares about high performance big systems cares about) RISC-V is by *far* the most compact code. It's only in 32 bit that it has competition from Thumb2 and some others." -- brucehoult
- "Expert opinion is divided -- to say the least -- on whether complex addressing modes help to make a machine fast. You assert that they do, but others up to and including Turing award winners in computer architecture disagree." -- brucehoult
- "With RISCV, the overhead of, say, passing arguments into a function, or accessing struct fields via a pointer is absolutely insane. Easily 3x vs ARM or x86. Even in an embedded system where you don't care about speed that much, this is insane purely from a code size standpoint. The compressed instruction set solves that problem to some extent, but there is still a performance hit." -- psycoee
- "...why you can't have multiply without divide. That's crazy." IshKebab
- "Compilers definitely handle instruction set extensions without too much trouble." -- theQuandary
- "Actually they don't. Unless you specifically tell the compiler to assume more, it's only going to use SSE and SSE2 on amd64." -- FUZxxl
- "It is a problem if I (a) want to write assembly code or (b) want to distribute binary code. Imagine you had no access to binary packages on your computer and instead every package installation was a half-hour wait for compilation to finish. Or alternatively, packages only make use of half the available instructions and are thus much slower than they could be. That's what you get when the ISA is fragmented. It wouldn't be as bad if the RISC-V people didn't place even fundamentally important instructions into instruction set extensions. You can't even count trailing zeroes in the base ISA! Or multiply!" -- FUZxxl
- "If you're compiling a say Linux binary you can very much assume the presence of multiplication. RISC-V's "base ISA" as you call it, that is, RISC-V without any of the (standard!) extensions is basically a 32-bit MOS 6510. A ridiculously small ISA, a ridiculously small core, something you won't ever see if you aren't developing for an embedded platform. How, pray tell, things look in the case of ARM? Why can't I run an armhf binary on a Cortex-M0? Why can't I execute sse instructions on a Z80? Because they're entirely different classes of chips and noone in their right mind would even try running code for the big cores on a small core. The other way around, sure, and that's why RISC-V can do exactly that." -- barsoap
- "...Yes, ARM has the same fragmentation issues. They fixed this in ARM64 mostly and I'm really surprised RISC-V makes the same mistake..." -- FUZxxl
- "That'd be because there's no such thing as 64-bit microcontrollers." barsoap
- "Fragmentation is okay if the base instruction set is sufficiently powerful and if it's not fragmentation but rather a one-dimensional axis of instruction set extensions. Also, there must be binary compatibility. This means that I can optimise my code for n possible sets of available instructions (one for each CPU generation) instead of 2n sets (one for each combination of available extensions). The same shit is super annoying with ARM cores, especially as there isn't really a way to detect what instructions are available at runtime. Though it got better with ARM64." FUZxxl
- "RISC-V aims to be suitable for both very small & simple and very large & complex & fast implementations. Where there is a conflict between the two RISC-V errs in the direction of making the small implementation simple, even if it puts more complexity on the high end -- it's complex anyway and a little more won't be very noticeable. Take the macro-op fusion vs splitting complex instructions into micro-ops argument. Maybe in a mid-level CPU it's a bit easier to do instruction splitting than instruction combining, but having complex instructions means that the very simplest cores are burdened with splitting complex addressing modes into multiple operations or having a sequencer for load/store multiple. That makes a *significant* difference to the size and complexity of those cores that can least afford it." -- brucehoult
- "It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling." -- [28]
Selected criticisms from https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d9982f7618ef68 (Erin Shepherd):
- "The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and number of instructions."
- "RISC-V's simplifications make the decoder (i.e. CPU frontend) easier, at the expense of executing more instructions. However, scaling the width of a pipeline is a hard problem, while the decoding of slightly (or highly) irregular instructions is well understood (the primary difficulty arises when determining the length of an instruction is nontrivial - x86 is a particularly bad case of this with its' numerous prefixes). The simplification of an instruction set should not be pursued to its' limits. A register + shifted register memory operation is not a complicated instruction; it is a very common operation in programs, and very easy for a CPU to implement performantly. If a CPU is not capable of implementing the instruction directly, it can break it down into its' constituent operations with relative ease; this is a much easier problem than fusing sequences of simple operations. We should distinguish the "Complex" instructions of CISC CPUs - complicated, rarely used, and universally low performance, from the "Featureful" instructions common to both CISC and RISC CPUs, which combine a small sequence of operations, are commonly used, and high performance."
- "Highly unconstrained extensibility. While this is a goal of RISC-V, it is also a recipe for a fragmented, incompatible ecosystem and will have to be managed with extreme care."
- "Same instruction (JALR) used for both calls, returns and register-indirect branches (requires extra decode for branch prediction)"
- "RV64I requires sign extension of all 32-bit values. This produces unnecessary top-half toggling or requires special accomodation of the upper half of registers. Zero extension is preferable (as it reduces toggling, and can generally be optimized by tracking an "is zero" bit once the upper half is known to be zero)"
- "All 32-bit ops on x86 zero extend into the upper half of the register. There are various reasons to prefer zero extend, primarily that 32-bit leave the top half of the register completely static and you can get good power savings there" [29]
- "LR/SC has a strict eventual forward progress requirement for a limited subset of uses. While this constraint is quite tight, it does potentially pose some problems for small implementations (particularly those without cache) "
- "FP sticky bits and rounding mode are in the same register. This requires serialization of the FP pipe if a RMW operation is performed to change rounding mode"
- "FP Instructions are encoded for 32, 64 and 128-bit precision, but not 16-bit (which is significantly more common in hardware than 128-bit)"
- "No condition codes, instead compare-and-branch instructions. This is not problematic by itself, but rather in its' implications:
- Decreased encoding space in conditional branches due to requirement to encode one or two register specifiers
- No conditional selects (useful for highly unpredictable branches)
- No add with carry/subtract with carry or borrow
- (Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags) "
- "Multiply and divide are part of the same extension, and it appears that if one is implemented the other must be also. Multiply is significantly simpler than divide, and common on most CPUs even where divide is not"
- "No atomic instructions in the base ISA. Multi-core microcontrollers are increasingly common, and LL/SC type atomics inexpensive (only 1 bit of CPU state required for minimal single CPU implementations)."
- "LR/SC are in the same extension as more complicated atomic instructions"
- "General (non LR/SC) atomics do not include a CAS primitive. The motivation is to avoid the need for an instruction which reads 5 registers (Addr, CmpHi:CmpLo?, SwapHi:SwapLo?), but this is likely to impose less overhead on the implementation than the guaranteed-forward-progress LR/SC which is provided to replace it"
- "Atomic instructions are provided which operate on 32-bit and 64-bit quantities, but not 8 or 16-bit"
- "For RV32I, no way to tranfer a DP FP value between the integer and FP register files except through memory"
- "e.g. RV32I 32-bit ADD and RV64I 64-bit ADD share encodings, and RVI64 adds a different ADD.W encoding. This is needless complication for a CPU which implements both instructions - it would have been preferable to add a new 64-bit encoding instead"
- "No MOV instruction. The MV assembler alias is implemted as MV rD, rS -> ADDI rD, rS, 0. MOV optimization is commonly performed by high-end processors (especially out-of-order); recognizing RISC-V's canonical MV requires oring a 12-bit immediate "
- "JAL wastes 5 bits encoding the link register, which will always be R1 (or R0 for branches). This means that RV32I has 21-bit branch displacements (insufficient for large applications - e.g. web browsers - without using multiple instruction sequences and/or branch islands)"
- "Despite great effort being expended on a uniform encoding, load/store instructions are encoded differently (register vs immediate fields swapped). It seems orthogonality of destination register encoding was preferred over orthogonality of encoding two highly related instructions. This choice seems a little odd given that address generation is the more timing critical operation."
- "No loads with register offsets (Rbase+Roffset) or indexes (Rbase+Rindex << Scale)."
- "FENCE.I implies full synchronization of instruction cache with all preceding stores, fenced or unfenced. Implementations will need to either flush entire I$ on fence, or snoop both D$ and the store buffer"
- "In RV32I, reading the 64-bit counters requires reading upper half twice, comparing and branching in case a carry occurs between the lower and upper half during a read operation. Normally 32-bit ISAs include a "read pair of special registers" instruction to avoid this issue"
- "No architecturally defined "hint" encoding space. Hint encodings are those which execute as NOPs on current processors but which have some behavior on later varients. Common examples of pure "NOP hints" are things like spinlock yields. More complicated hints have also been implemented (i.e. those which have visible side effects on new processors; for example, the x86 bounds checking instructions are encoded in hint space so that binaries remain backwards compatible)"
- "The worst issue, at least for the versions of the ISA that will run a "real" OS, are the lack of conditional move instructions and lack of bitwise rotation instructions. Lack of shift-and-sum instructions or equivalently addresses with shifted indexes is usually mitigated by optimization of induction variables in the compiler. They are nice to have (I have written code where I took advantage of x86's ability to compute a+b*9 with a single instruction) but not particularly common with the massive inlining that is common in C++ or Rust." bonzini
- " Having taken a look at the RISC-V ISA spec I'm wondering if they did cripple LL/SC (LR/SC in RISC-V). Basically:
- LL/SC can prevent ABA if the ABA-prone part is in-between a LL and SC instruction
- To have a ABA prone problem you need some state implicitly dependent on the atomic state but not encoded in it. Normally (always?) the atomic state is a pointer and we depend on some state behind the pointer not changing in a context of a ABA situation (roughly ~ switch out ptr, change ptr target, switch back in ptr, through often more complex to prevent race conditions). This means in all situations I'm aware of LL/SC only prevents the ABA problem if you at least can do one atomic (relaxed ordering) load "somehow" depending on the LL load. (LL load pointer, offset or similar). But the RISC-V spec doesn't only not guarantee forward process in this cases (which I guess is fine) but goes as far as explicitly stating that guaranteed not having forward provess is ok, e.g. doing any load between the load reserved and store conditional is allowed to make the store conditional fail> Doesn't that mean that if you target RISC-V you will not benefit from LL/SC based ABA workaround and instead it's just a slightly more flexible and potential faster compare exchange which can spuriously fail? The spec says you are supposed to detect if it work and potentially switch implementations. But how can you do that reasonable if it means that you have to switch to fundamentally different data structures, which isn't something easily and reasonably done at runtime. Or do I miss something fundamental? " -- dathinab
- "The use of LL/SC for atomics is a common mistake. It makes replay debuggers like rr impossible to implement." souprock
- "So many flaws, and without even mentioning the missing POPCOUNT. (No, the M extension does not help.)" [30]
- "...Sure. The most notable one I can think of is that integer multiplication is not in core. It's an extension. Anybody that's written a compiler or even stared at assembly would tell you how prevalent integer multiplication is, for any sort of loop or pointer addressing. Even more baffling, is that the same extension that adds integer multiplication, also requires integer division, telling me that the spec authors think that multiplication and division are of the same circuit complexity, and used with the same frequency. And for one example for how goofy of an idea that is, ARM has never included integer division in its core ISA. Then there's the whole JALR mess, which I don't know how to describe it other than "tech debt". Basically, they tried to simplify different kinds of jumps and returns into a single extension, by allowing any register to be used for the source and destination addresses. But for branch prediction reasons, and prefetching, you really do want to have a single link register, so the spec says "please don't use anything other than x1 as the link register, or you might not get proper performance". There's a lot more that I can go into. " -- [31]
- "Design flaws. RISC-V seems like it hasn’t learned anything from CPUs designed after 1991. Between some rookie mistakes like few addressing modes (register churn, code density) and blowing out the encoding space. However, despite its flaws, it’s poised to take over embedded and possibly beyond anyways – worse truly is better." [32]
- "If you want to get an understanding of a simple close-to-the-metal environment, RISC-V is fine. If you want to write assembly code, it’s painful. The lack of complex addressing modes means that you end up burning registers and doing arithmetic for simple tasks. If you want to do complex things like bitfield manipulation, you either need to write a lot of logic with shifts and masks or you need to use an extension (I think the bitmanip extension is standardised now, but the cores from ETH have their own variants). There are lots of clunky things in RISC-V. ARM (AArch32 or AArch64) is much nicer to use as an assembly programmer. Both are big instruction sets, but the addressing modes on ARM are really nice to work with (it’s almost as if they, unlike the RISC-V project, followed the RISC I methodology of examining the output from compilers and working out what the common sequences of operations were, before designing an instruction set). Note that ARM doesn’t call itself a RISC ISA anymore, it calls itself a load-store architecture. This is one of the key points of RISC (memory-register and memory-memory instructions make out-of-order execution difficult), but they’re definitely not a small ISA. They do have a much more efficient encoding than RISC-V (which, in a massive case of premature optimisation, optimised the ISA to be simple to decode in an in-order pipeline)." [33]
- "The important lesson for RISC-V is why MIPS died. MIPS was not intended as an open ISA, but it was a de-facto one. Aside from LWL / LWR, everything in the ISA was out of patent. Anyone could implement an almost-MIPS core (and GCC could target MIPS-without-those-two-instructions) and many people did. Three things killed it in the market: First, fragmentation. This also terrifies ARM. Back in the PDA days, ARM handed out licenses that allowed people to extend the ISA. Intel’s XScale series added a floating-point extension called Wireless MMX that was incompatible with the ARM floating point extension. This cost a huge amount for software maintenance. Linux, GCC, and so on had to have different code paths for Intel vs non-Intel ARM cores. It doesn’t actually matter which one was better, the fact both existed prevented Linux from moving to a hard-float ABI for userland for a long time: the calling convention passed floating-point values in integer registers, so code could either call a soft-float library or be compiled for one or the other floating-point extensions and still interop with other libraries that were portable across both. There are a few other examples, but that’s the most painful one for ARM. In contrast, every MIPS vendor extended the ISA in incompatible ways. The baseline for 64-bit MIPS is still often MIPS III (circa 1991) because it’s the only ISA that all modern 64-bit MIPS processors can be expected to handle. Vendor extensions only get used in embedded products. RISC-V has some very exciting fragmentation already, with both a weak memory model and TSO: the theory is that TSO will be used for systems that want x86 compatibility, the weak model for things that don’t, but code compiled for the TSO cores is not correct on weak cores. There are ELF header flags reserved to indicate which is which, but it’s easy to compile code for the weak model, test it on a TSO core, see it work, and have it fail in subtle ways on a weak core. 
That’s going to cause massive headaches in the future, unless all vendors shipping cores that run a general-purpose OS go with TSO. Second, a modern ISA is big. Vector instructions, bit-manipulation instructions, virtualisation extensions, two-pointer atomic operations (needed for efficient RCU and a few other lockless data structures) and so on. Dense encoding is really important for performance (i-cache usage). RISC-V burned almost all of their 32-bit instruction space in the core ISA. It’s quite astonishing how much encoding space they’ve managed to consume with so few instructions. The C extension consumes all of the 16-bit encoding space and is severely over-fitted to the output of an unoptimised GCC on a small corpus of C code. At the moment, every vendor is trampling over all of the other vendors in the last remaining bits of the 32-bit encoding space. RISC-V really should have had a 48-bit load-64-bit-immediate instruction in the core spec to force everyone to implement support for 48-bit instructions, but at the moment no one uses the 48-bit space and infrequently used instructions are still consuming expensive 32-bit real-estate. Third, the ISA is not the end of the story. There’s a load of other stuff (interrupt controllers, DMA engines, management interfaces, and so on) that need to be standardised before you can have a general-purpose compute platform. Porting an OS to a new ARM SoC? used to be a huge amount of effort because of this. It’s now a lot easier because ARM has standardised a lot of this. x86 had some major benefits from Compaq copying IBM: every PC had a compatible bootloader that provided device enumeration and some basic device interfaces. You could write an OS that would access a disk, read from a keyboard, and write text to a display for a PC that would run on any PC (except the weird PC98 machines from Japan). 
After early boot, you’d typically stop doing BIOS thunks and do proper PCI device numeration and load real drivers, but that baseline made it easy to produce boot images that ran on all hardware. The RISC-V project is starting to standardise this stuff but it hasn’t been a priority. MIPS never standardised any of it. The RISC-V project has had a weird mix from the start of explicitly saying that it’s not a research project and wants to be simple and also depending on research ideas. The core ISA is a fairly mediocre mid-90s ISA. Its fine, but turning it into something that’s competitive with modern x86 or AArch64 is a huge amount of work. Some of those early design decisions are going to need to either be revisited (breaking compatibility) or are going to incur technical debt. The first RISC-V spec was frozen far too early, with timelines largely driven by PhD? students needing to graduate rather than the specs actually being in a good state. Krste is a very strong believer in micro-op fusion as a solution to a great many problems, but if every RISC-V core needs to be able to identify 2-3 instruction patterns and fuse them into a single micro-op to do operations that are a single instruction on other ISAs, that’s a lot of power and i-cache being consumed just to reach parity. There’s a lot of premature optimisation (e.g. instruction layouts that simplify decoding on an in-order core) that hurt other things (e.g. use more encoding space than necessary), where the saving is small and the cost will become increasingly large as the ISA matures. AArch64 is a pretty well-designed instruction set that learns a lot of lessons from AArch32 and other competing ISAs. RISC-V is very close to MIPS III at the core. The extensions are somewhat better, but they’re squeezed into the tiny amount of left-over encoding space. The value of an ecosystem with no fragmentation is huge. 
For RISC-V to succeed, it needs to get a load of the important extensions standardised quickly, define and standardise the platform specs (underway, but slow, and without enough of the people who actually understand the problem space contributing, not helped by the fact that the RISC-V Foundation is set up to discourage contributions), and get software vendors to agree on those baselines. The problem is that, for a silicon vendor, one big reason to pick RISC-V over ARM is the ability to differentiate your cores by adding custom instructions. Every RISC-V vendor’s incentives are therefore diametrically opposed to the goals of the ecosystem as a whole. " [34]
- "
3
ethoh edited 6 months ago | link |
Ignoring the parent and focusing on hard data instead, RV64GC has higher code density than ARM, x86 and even MIPS16, so the encoding they chose isn’t exactly bad, objectively speaking.
8
david_chisnall 6 months ago | link |
Note that Andrew’s dissertation is using integer-heavy, single-threaded, C code as the evaluation and even then, RISC-V does worse than Thumb-2 (see Figure 8 of the linked dissertation). Once you add atomics, higher-level languages, or vector instructions, you see a different story. For example, RISC-V made an early decision to make the offset of loads and stores scaled with the size of the memory value. Unfortunately, a lot of dynamic languages set one of the low bits to differentiate between a pointer and a boxed value. They then use a complex addressing mode to combine the subtraction of one with the addition of the field offset for field addressing. With RISC-V, this requires two instructions. You won’t see that pattern in pure C code anywhere but you’ll see it all over the place in dynamic language interpreters and JITs.
1
ethoh 6 months ago | link |
Interesting. There’s work on an extension to help interpreters, JITs, which might or might not help mitigate this.
In any event, it is far from ready.
6
david_chisnall 6 months ago | link |
I was the chair of that working group but I stepped down because I was unhappy with the way the Foundation was being run.
The others involved are producing some interesting proposals though a depressing amount of it is trying to fix fundamentally bad design decisions in the core spec. For example, the i-cache is not coherent with respect to the d-cache on RISC-V. That means you need explicit sync instructions after every modification to a code page. The hardware cost of making them coherent is small (i-cache lines need to participate in cache coherency, but they can only ever be in shared state, so the cache doesn’t have to do much. If you have an inclusive L2, then the logic can all live in L2) but the overheads from not doing it are surprisingly high. SPARC changed this choice because the overhead on process creating from the run-time linker having to do i-cache invalidates on every mapped page were huge. Worse, RISC-V’s i-cache invalidate instruction is local to the current core. That means that you actually need to do a syscall, which does an IPI to all cores, which then invalidates the i-cache. That’s insanely expensive but the initial measurements were from C code on a port of Linux that didn’t do the invalidates (and didn’t break because the i-cache was so small you were never seeing the stale entries). " [35]
- " No one who had worked on an non-toy OS or compiler was involved in any of the design work until all of the big announcements had been made and the spec was close to final. The Foundation was set up so that it was difficult for any individuals to contribute (that’s slowly changing) - you had to pay $99 or ask for the fee to be waived to give feedback on the specs as an individual. You had to pay more to provide feedback as a corporation and no corporation was going to pay thousands of dollars membership and the salaries of their contributors to provide feedback unless they were pretty confident that they were going to use RISC-V.
It probably shouldn’t come as a surprise that saying to people ‘we need your expertise, please pay us money so that you can provide it’ didn’t lead to a huge influx of expert contributors. There were a few, but not enough. " [36]
- "Here’s the opinion of probably THE most important ARM engineer of the 1990s and 2000s, Dave Jaggar who developed the ARM7TDMI, Thumb, Thumb2.
https://www.youtube.com/watch?v=_6sh097Dk5k
Check at 51:30 where he says “I would Google RISC-V and find out all about it. They’ve done a fine instruction set, a fine job […] it’s the state of the art now for 32-bit general purpose instruction sets. And it’s got the 16-bit compressed stuff. So, yeah, learning about that, you’re learning from the best.” "
" Krste is a strong believer in macro-op fusion but I remain unconvinced. It requires decoder complexity (power and complexity), more i-cache space (power), trace caches if you want to avoid having it on the hot path in loops (power and complexity), weird performance anomalies when the macro-ops span a fetch granule and so the fusion doesn’t happen (software pain). And, in exchange for all of this, you get something that you could have got for free from a well-designed instruction set. " -- David Chisnall
" 1 smaddox 8 months ago
The paper linked in the article appears to show RV64GC, the compressed variant of RV64G, results in smaller program sizes than x86_64. If that’s true, wouldn’t that mean you would need less i-cache space? This isn’t my area of expertise, but I find it fascinating.
3
david_chisnall 8 months ago | link |
There are a lot of variables here. One is the input corpus. As I recall, that particular paper evaluated almost exclusively C code. The generated code for C++ will use a slightly different instruction mix, for other languages the difference is even greater. To give a concrete example, C/C++ do not have (in the standard) any checking for integer overflow. It either wraps for unsigned arithmetic or is undefined for signed. This means that a+b on any C integer type up to [u]int64_t is a single RISC-V instruction. A lot of other languages (including Rust, I believe, and the implementations of most dynamic languages) depend on overflow-checked arithmetic on their fast paths. With Arm or x86 (32- or 64-bit variants), the add instructions set a condition code that you can then branch on, accumulate in a GPR, or use in a conditional move instruction. If you want to have a good fast path, you accumulate the condition code after each arithmetic op in a hot path then branch at the end and hit a slow path if any of the calculations overflowed. This is very dense on x86 or Arm.
RISC-V does not have condition codes. This is great for microarchitects. Condition code registers are somewhat painful because they’re an implicit data dependency from any arithmetic instruction to a load of others. In spite of this, Arm kept them with AArch64 (though dramatically reduced the number of predicated instructions, to simplify the microarchitecture) because they did a lot of measurement and found that a carefully optimised compiler made significant use of them.
RISC-V also doesn’t have a conditional move instruction. Krste likes to cite a paper by the Alpha authors regretting their choice of a conditional move, because it required one extra read port on the register file. These days, conditional moves are typically folded into the register rename engine and so are quite cheap in the microarchitecture of anything doing a non-trivial amount of register rename (they’re just an update in the rename directory telling subsequent instructions which value to use). Compilers have become really good at if-conversion, turning small if blocks into a path that does both versions and selects the results. This is so common that LLVM has a select instruction in the IR. To do the equivalent with RISC-V, you need to have logic in decode that recognises a small branch forward and converts it into a predicated sequence. That’s a lot more difficult to do than simply having a conditional move instruction and reduces code density.
I had a student try adding a conditional move to a small RISC-V processor a few years ago and they reproduced the result that Arm used in making this decision: Without conditional moves, you need roughly four times as much branch predictor state to get the same overall performance.
Note, also, that these results predate any vector extensions for RISC-V. They are not comparing autovectorised code with SSE, AVX, Neon, or SVE. RISC-V has used up all of its 16-bit and most of its 32-bit instruction space and so can’t add the instructions that other architectures have introduced to improve code density without going into the larger 48-bit encoding space.
1 dbremner 8 months ago
The paper linked in the article appears to show RV64GC, the compressed variant of RV64G, results in smaller program sizes than x86_64
x86-64 has pretty large binary sizes; if your compressed instruction set doesn’t have smaller binaries than it should be redesigned.
I would take the measurements in that paper with a grain of salt; they aren’t comparing like-to-like. The Cortex A-15 Core benchmarks, for example, should have also been run in Thumb-2 mode. Thumb-2 causes substantial reductions in code size; it’s dubious to compare your compressed ISA to a competitor’s uncompressed ISA.
1 Forty-Bot 8 months ago
This paper has some size comparisons against thumb (along with Huawei’s custom extensions). This page has sone as well. " -- [37]
(links in the last comment above are: [HW/SW approaches for RISC-V code size reduction and Zephyr code examples)
- https://gmplib.org/list-archives/gmp-devel/2021-September/006013.html
- "Let's look at some examples of how Risc V underperforms. First, addition of a double-word integer with carry-out: add t0, a4, a6 add low words sltu t6, t0, a4 compute carry-out from low add add t1, a5, a7 add hi words sltu t2, t1, a5 compute carry-out from high add add t4, t1, t6 add carry to low result sltu t3, t4, t1 compute carry out from the carry add add t6, t2, t3 combine carries Same for 64-bit arm: adds x12, x6, x10 adcs x13, x7, x11 Same for 64-bit x86: add %r8, %rax adc %r9, %rdx (Some additional move insn might be needed for x86 due to the 2-operand nature of this arch.) " -- [38]
- "Godbolt:

    typedef __int128_t int128_t;
    int128_t add(int128_t left, int128_t right) {
        return left + right;
    }

  GCC 10, -O2, RISC-V:

    add(__int128, __int128):
        mv   a5,a0
        add  a0,a0,a2
        sltu a5,a0,a5
        add  a1,a1,a3
        add  a1,a5,a1
        ret

  ARM64:

    add(__int128, __int128):
        adds x0, x0, x2
        adc  x1, x1, x3
        ret

  This issue hurts the wider types that are compiler built-ins. Even though C has a programming model that is devoid of any carry flag concept, canned types like a 128 bit integer can take advantage of it. Portable C code to simulate a 128 bit integer will probably emit bad code across the board. The code will explicitly calculate the carry as an additional operand and pull it into the result. The RISC-V won't look any worse, then, in all likelihood. (The above RISC-V instruction sequence is shorter than the mailing list post author's 7-line sequence because it doesn't calculate a carry out: the result is truncated. You'd need a carry out to continue a wider addition.) " -- [39]
- "Moreover, the results for RISC-V are hugely influenced by the programming language and the compiler options that are chosen. RISC-V has an acceptable code size only for unsafe code; if the programming language or the compiler options require run-time checks to ensure safe behavior, then the RISC-V code size increases enormously, while for other CPUs it barely changes.

  The RISC-V ISA has only 1 good feature for code size, the combined compare-and-branch instructions. Because there typically is 1 branch for every 6 to 8 instructions, using 1 instruction instead of 2 saves a lot. Except for this good feature, the rest of the ISA is full of bad features, which frequently require at least 2 instructions instead of 1 instruction in any other CPU, e.g. the lack of indexed addressing, which is needed in any loop that must access some aggregate data structure, in order to be able to implement the loop with a minimum number of instructions.

  ... All the comparisons where RISC-V has better code density compare the compressed encoding with the 32-bit ARMv8-A. This is a classical example of apples-to-oranges, because the compressed encoding will never have a performance in the same league with ARMv8-A. When the comparisons are matched, 16-bit RISC-V encoding with 16-bit ARMv8-M and 32-bit RISC-V with 32-bit ARMv8-A, RISC-V always loses in code density in both comparisons, because only the RISC-V branch instructions are frequently shorter than those of ARM, while all the other instructions are frequently longer.

  ... If you want to use a RISC-V at a performance level good enough for being used in something like a mobile phone or a personal computer, you need to simultaneously decode at least 8 instructions per clock cycle and preferably much more, because to match 8 instructions of other CPUs you need at least 10 to 12 RISC-V instructions and sometimes much more. Nobody has succeeded to simultaneously decode a significant number of compressed RISC-V instructions and it is unlikely that anyone would attempt this, because the cost in area and power of a decoder able to do this is much larger than the cost of a decoder for simultaneous decoding of fixed-length instructions. This is the reason why also ARM uses a compressed encoding in their -M CPUs for embedded applications but a 32-bit fixed-length encoding in their -A CPUs for applications where more than 1 watt per core is available and high performance is needed. " -- adrian_b and [40] and [41]
- " RISC-V compressed instructions cannot be compared to CISC variable-length instructions. The instruction boundaries are easy to determine in parallel for multiple decoders, something which is hard for e.g. x86. Compressed instructions don't have arbitrary length: it is two instructions fitted in a 32-bit word. Decompression is part of the instruction decoding itself. It only requires a minuscule 400 logic gates to do. In fact RISC-V is very well designed for doing out-of-order execution of multiple instructions, as instructions have been specifically designed to share as little state as possible. No status registers or conditional execution bits. Thus most instructions can run in separate pipelines without influencing each other. " [42]
- " RISC-V doesn't hinder safe code, that was an incorrect claim. Bound checks are done with one instruction - bltu for slices, bgeu for indexes. On intel processor you need cmp+jb pair for this. The linked message is about carry propagation pattern used in gmp. AIU optimized bignum algorithms accumulate carry bits and propagate them in bulk and don't benefit from optimal one bit at a time carry propagation pattern. " [43]
- " The compressed instruction encoding is very good and it is mandatory for any use of RISC-V in embedded computers. With this extension, RISC-V can be competitive with ARM Cortex-M. On the other hand, the compressed instruction encoding is useless for general-purpose computers intended as personal computers or as servers, because it limits the achievable performance to much lower levels than for ARMv8-A or Intel/AMD. " adrian_b
- The compressed instruction encoding "only addresses a subset of the available registers. Small revisions in a function which change the number of live variables will suddenly and dramatically change the compressibility of the instructions. Higher-level languages rely heavily on inlining to reduce their abstraction penalty. Profiles which were taken from the Linux kernel and (checks notes...) Drystone are not representative of code from higher-level languages..." brandmeyer
- "Note that a number of the C instructions can in fact use all 32 registers. This includes stack pointer-relative loads and stores, load immediate ({-32..+31}), load upper immediate (4096 * {-32..+31}, add immediate and add immediate word ({-32..+31}), shift left logical immediate, register to register add, and register move. It's certainly possible that another compressed encoding might do better using fewer opcodes, and I've seen the suggestions. The main thing wrong with the standard one in my opinion is that it gives too much prominence to floating point code, having been developed to optimise for SPEC including SPECFP (no, not the Linux kernel or Dhrystone ... I have no idea where you got that from). But anyway it does well, and the opcode space used is not excessive. If anything it's TOO SMALL. Thumb2 gets marginally better code size while using 7/8ths of the opcode space for the 16 bit instructions instead of RISC-V's 3/4. " brucehoult
- "The "C" extension is technically optional, but I'm not aware of anyone who has made or sold a production chip without it -- generally only student projects or tiny cores for FPGAs running very simple programs don't have it. My estimate is if you have even 200 to 300 instructions in your code it's cheaper to implement "C" than to build the extra SRAM/cache to hold the bigger code without it. " -- brucehoult
- vs ARM64 and ARM32: " ...compressed instructions on RISC-V, which 64-bit ARM does not have. And if you compare 32-bit CPUs then RISC-V has twice as many registers, reducing the number of instructions needed to read and write from memory. RISC-V branching takes less space and so do vector instructions. There are many cases like that which add up, with the end result that RISC-V has the densest ISA in all studies when using compressed instructions. " -- [44]
- vs ARM: "I've worked on designs with both ARM and RISC-V cores. The RISC-V code outperforms the ARM core, with smaller gate count, and has similar or higher code density in real world code, depending on the extensions supported." -- audunw
- w/r/t performance: "Branching is a problem, but the branch predictors do an excellent job. (especially for function calls which are very well predicted by the RAS [Return Address Stack]) But the biggest bottleneck to fetch a large instruction group is decoding. Especially the instruction size decoding. An ISA like RISC-V or ARM that drastically reduces the possible instruction sizes is a big advantage to decode large instruction groups. And dependencies between instructions within the decoding group is also a concern. For example, register renaming will quickly require several cycles (several stages) when the decoding group scales up. RISC-V also addresses this since the register indexes are easily decoded earlier and the number of registers used can also be quickly decoded." -- [45]
"The 2 most annoying missing features are the lack of support for multi-word operations, which are needed to compute with numbers larger than 64 bits, but also the lack of support for detecting overflow in the operations with standard-size integers. If you either want larger integers or safe computations with normal integers, the number of RISC-V instructions needed for implementation is very large compared to any other ISA." -- adrian_b
- "> (condition code flags are) a bit of an architectural nightmare for everyone doing any more than the simplest CPUs, they make every instruction potentially have dependencies or anti-dependencies with every other It's doesn't have to be _that_ bad. As long as condition flags are all written at once (or are essentially banked like PowerPC?) the dependency issue can go away because they're renamed and their results aren't dependent on previous data. Now, of course, instructions that only update some condition flags and preserve others are the devil." [46]
- "RISC-V designers optimized for C and found overflow flag isn't used much and got rid of it. It was the wrong choice: overflow flag is used a lot for JavaScript and any language with arbitrary precision integer (including GMP, the topic of OP)." [47]
- " Over just the time I've been aware of things, there's been a constant positive feedback loop of "checked overflow isn't used by software, so CPU designers make it less performant" followed by "checked overflow is less performant, so software uses it less". I wish there was a way out. Language features are also often implemented at least partly because they can be done efficiently on the premiere hardware for the language. Then new hardware can make such features hard to implement. WASM implemented return values in a way that was different from register hardware, and it makes efficient codegen of Common Lisp more challenging. This was brought to the attention of the committee while WASM was still in flux, and they (perhaps rightfully) decided CL was insufficiently important to change things. I'm sure that people brought up the overflow situation to the RISC-V designers, and it was similarly dismissed. It's just unfortunate that legacy software is such a big driver of CPU features as that's a race towards lowest-common-denominator hardware.

  ... Common Lisp will often have auxiliary return values that are often not needed (e.g. the mathematical floor function returns the remainder as a second value). Unused extra values are silently discarded. So you can do, for example (+ (floor x y) z) without worrying about the second value to floor. A lisp compiler for a register machine will usually return the first N values in registers (for some small value of N), so the common case of using only the primary return value generates exactly the same code regardless of how many values are actually returned. I don't remember the details, but however wasm happens to implement multiple return values, you can't just pretend that a function that returns 2 values only returns one.

  ... wasm functions must have a fixed number of results[1]. Lisp functions may have variadic results. A wasm function call must explicitly pop every result off of the stack[2]. These rules add significant overhead to implementing: (foo (bar)) which calls foo with just the first result of bar, where the number of results yielded by bar is possibly unknown. In pseudo-assembly for a register machine, implementing this is roughly:

    CALL bar
    MOVE ResultReg1 -> ArgumentReg1
    CALL foo

  And this works regardless of how many values the function bar happens to return[3]. Any implementation that is correct for any number of results of "bar" will be slow for the common case of bar yielding one or two results.

  1: https://webassembly.github.io/spec/core/syntax/types.html#sy...
  2: https://webassembly.github.io/spec/core/exec/instructions.ht...
  3: Showing only the caller side of things, when the caller doesn't care about extra values, hides some complexity of the implementation of returning values, because you also need to be able to detect at run-time how many values were actually returned. e.g. SBCL on x86 uses a condition flag to indicate if there are multiple values, because branching on a condition flag lets you handle the only-1-value case efficiently.
" -- aidenn0
- "There's a blog page somewhere that's a rant for implementing saturating and other arithmetic modes. Would be a really good idea. Main one is interrupt on overflow." -- R0b0t
- "I agree. A lot of software only wants protection against overflows but does not depend on them for functionality. If something wants to read out the carry bit, it should be explicit and although it is unfortunate, indicating that requires a full instruction." [48]
- " It kind of chafed when I excitedly read the ISA docs and found that overflow testing was cumbersome." [49]
- "It just feels backwards to me to increase the cost of these checks in a time where we have realized that unchecked arithmetic is not a good idea in general. " [50]
- "I think I agree it was a mistake/wart." [51]
- " Shouldn't settle for less than interrupt on overflow tbh. " R0b0t
- " They provide recommended insn sequences for overflow checking as commentary to the ISA specification, and this enables efficient implementation in hardware. " [52]
- "I would like to see some benchmarks of this efficient implementation in hardware, even simulated hardware, compared against conventional architectures. Even for C, it's a recurring source of bugs and vulnerabilities that int overflow goes undetected. What we really need is an overflow trap like the one in IEEE floating point. RISC-V went the opposite direction. " [53]
- "RISC-V makes the same basic tradeoff (simplicity above all else) across the board. You can see this in the (lack of) addressing modes, compare-and-branch, etc. Where this really bites you is in workloads dominated by tight loops (image processing, cryptography, HPC, etc). While a microarchitecture may be more efficient thanks to simpler instructions (ignoring the added complexity of compressed instructions and macro-fusion, the usual suggested fixes...), it's not going to be 2-3x faster, so it's never going to compensate for a 2-3x larger inner loop. " [54], discussing the lack of an overflow condition code
- "Macro fusion definitely has a place in microarchitecture performance, especially when you have to deal with a legacy ISA. RISC-V makes the very unusual choice of depending on it for performance, when most ISAs prefer to fix the problem upstream. " [55]
- "...today the biggest speed ups compilers get for super scalar arch's has nothing to do with the addressing modes so much attention is being focused on here. It comes from avoiding conditional jumps. The compilers will often emit code that evaluates both paths of the computation (thus burning 50% more ALU time on a calculating a result that will never be used), then choose the result they want with a cmov. In extreme cases, I've seen doing that sort of thing gain them a factor of 10, which is far more than playing tiddly winks with addressing modes will get you." [56]
- "So do what Power does: most instructions that update the condition flags can do so optionally (except for instructions like stdcx. or cmpd where they're meaningless without it, and corner oddballs like andi.)." [57]
- " On POWER, all the comparison instructions can store their result in any of the 8 sets of flags. The conditional branches can use any flag from any set. The arithmetic instructions, e.g. addition or multiplication, do not encode a field for where to store the flags, so they use, like you said, an implicit destination, which is still different for integer and floating-point. In large out-of-order CPUs, with flag register renaming, this is no longer so important, but in 1990, when POWER was introduced, the multiple sets of flags were a great advance, because they enabled the parallel execution of many instructions even in CPUs much simpler than today. Besides POWER, the 64-bit ARMv8 also provides most of the 14 predicates that exist for a partial order relation. For some weird reason, the IEEE FP standard requires only 12 of the 14 predicates, so ARM implemented just those 12, even if they have 14 encodings, by using duplicate encodings for a pair of predicates. I consider this stupid, because there would not have been any additional cost to gate correctly the missing predicate pair, even if it is indeed one that is only seldom needed (distinguishing between less-or-greater and equal-or-unordered). " adrian_b
- "ARM also does something similar, many instructions has a flag bit specifying whether flags should be updated or not. It doesn't have the multiple flag registers of POWER though. " [58]
- "Compressed instructions and macro-fusion aren't magical solutions. It's not always possible to convince the compiler to generate the magical sequence required, and it actually makes high-performance implementations (wide superscalar) more difficult thanks to the variable width decoding. Beyond that, compressed instructions are not a 1:1 substitute for more complex instructions, because a pair of compressed instructions cannot have any fields that cross the 16-bit boundary. This means you can't recover things like larger load/store offsets. Additionally, you can't discard architectural state changes due to the first instruction. If you want to fuse an address computation with a load, you still have to write the new address to the register destination of the address computation. If you want to perform clever fusion for carry propagation, you still have to perform all of the GPR writes. This is work that a more complex instruction simply wouldn't have to perform, and again it complicates a high performance implementation. " [59]
- " I don’t see why offsets larger than 16-bit are important. Are you implying that most fusion candidate pairs would need this? In tight inner loops why would you need large offsets? Of course you discard architectural state changes in fusion. If I have a bunch of instructions which end up reading from memory into register x10, then I can fuse with all previous instructions which wrote into x10, as their results get clobbered anyway. Disclaimer: I may have misunderstood the point you made. However you don’t seem to make it clear how fusion is bad for performance. What performance tricks are you giving up by doing fusion? " [60]
- " > and it actually makes high-performance implementations (wide superscalar) more difficult thanks to the variable width decoding. More difficult than x86? We're talking about a damn simple variable width decoding here. I could imagine RISC-V with C extension being more tricky than 64-bit ARM. Maybe. > and again it complicates a high performance implementation. But so much of the rationale behind the design of RISC-V is to simplify high performance implementation in other ways. So the big question is what the net effect is. The other big question is if extensions will be added to optimise for desktop/server workloads by the time RISC-V CPUs penetrate that market significantly. " [61]
- " I think overall it is amazing. Considering everything they had to do and everything they wanted to achieve, I think they made really, really good overall choices. There are some things here and there one can argue, depending on what one considers the main usecase. Overall I think starting with a very small base makes a lot of sense. In fact they actually went too large in some places, and that's why they have been working on a profile that is considerably smaller, with fewer registers and floats in the int registers." [62]
- " I think talking about ISAs as better or worse than one another is often a bad idea for the same reason that arguing about whether C or Python is better is a bad idea. Different ISAs are used for different purposes. We can point to some specific things as almost always being bad in the modern world, like branch delay slots or the way the C preprocessor works, but even then for widely employed languages or ISAs there was a point to it when it was created. RISC-V has a number of places where it makes an excellent fit. First of all, academia. For an undergrad building the netlist for their first processor or a grad student doing their first out-of-order processor, RISC-V's simplicity is great for the pedagogical purpose. For a researcher trying to experiment with better branch prediction techniques, having a standard high-ish performance open source design they can take and modify with their ideas is immensely helpful. And for many companies in the real world with their eyes on the bottom line, it's useful to have an ISA where you can add instructions that happen to accelerate your own particular workload, where you can use a standard compiler framework outside your special assembly inner loops, and where you don't have to spend transistors on features you don't need. I'm not optimistic about RISC-V's widescale adoption as an application processor. If I were going to start designing an open source processor in that space I'd probably start with IBM's now open Power ISA. But there are so many more niches in the world than just that, and RISC-V is already a success in some of them. " [63]
- "I don't think its vector extensions would be good for video codecs because they seem designed around large vectors. (and the article the designers wrote about it was quite insulting to regular SIMD) " [64]
- "RISC-V is pretty good. Probably slightly better for some things than ARM, and slightly worse for others. It's open, which is awesome, and the instruction set lends itself to extensions which is nice (but possibly risks the ecosystem fragmenting). Building really high performance RISC-V designs looks like it's going to rely on slightly smarter instruction decoders than we've seen in the past for RISCs, but it doesn't look insurmountable. " [65]
- "...people are already arguing past each other because nobody seems to agree what the ISA is for. If you look at the early history of RISC-V, it does indeed look like as something built for teaching. But I don't think that use case warrants all the hype around it. So how did all the hype form, and why is it that there are people seemingly hyping it as the next-gen dream-come-true super elegant open developed-with-hindsight ISA that will eventually displace crufty old x86 and proprietary ARM while offering better performance and better everything? Of course that just baits you into arguing about its potential performance. And don't worry if it doesn't have all the instructions you need for performance yet, we'll just slap it with another extension and it totally won't turn into a clusterfuck with a stench of legacy and numerous attempts at fixing it (coz' remember, hindsight)! And then if you question its potential, you'll get someone else arguing that no no, it's not a high performance ISA for general use in desktops / servers, it's just an extensible ISA that companies can customize for their special sauce microcontrollers or whatever. Of course it's all armchair speculation because there are no high performance real world implementations and there aren't enough experts you can trust." [66]
- "...I have never looked at the ISA of the Alpha (referenced in post), but RISC V has always struck me as being nearly identical to (early) MIPS, just without the HI and LO registers for multiply results and the addition of variable length instruction support, even if the core ISA doesn't use them. MIPS didn't have a flag register either and depended on a dedicated zero register and slt instructions (set if less than) " [67]
- "Yes, that's exactly my thought every time it comes out; RISC-V is likely to displace MIPS everywhere performance doesn't matter, but it'll have a hard time competing with ARM or x86 on that. " [68]
- "The hardware adders used for addition/subtraction provide, at a negligible additional cost, 2 extra bits, carry and overflow, which are needed for operations with large integers and for safe operations with 64-bit integers." -- adrian_b
- "One thing that bothers me: RISC-V seems to use up a lot of the available instruction set space with "HINT" instructions that nobody has (yet) found a use for. Is it anticipated that all of the available HINTs will actually be used, or is the hope that the compressed version of the instruction set will avoid the wasted space? " [69]
- " Few years ago, I designed my own ISA. In that time I investigated design decisions in lots of ISAs and compared them. There was nothing in the RISC-V instruction set that stood out to me, like for example, the SuperH instruction set, which is remarkably well designed. Edit: Don't get me wrong, I don't think RISC-V is "garbage" or anything like that. I just think it could have been better. But of course, most of an architecture's value comes from its ecosystem and the time spent optimizing and tailoring everything... " [70]
- "> Is RISC-V the future? No, RISC-V is 1980s done correctly, 30 years later. It still concentrates on fixing those problems that we had in the 1980s (making an instruction set that is easy to pipeline with a simple pipeline), but which we mostly don't have anymore, because we have managed to find other, more practical solutions to those problems. And it's “done correctly” because it abandons the most stupid RISC features such as delay slots. But it ignores most of the things we have learned after that. ARMv8 is a much more advanced and better instruction set which makes much more sense from a technical point of view. Many common things require many more RISC-V instructions than ARMv8 instructions. The only good reason to use RISC-V instead of ARM is to avoid paying licence fees to ARM." -- Heikki Kultala, Technical lead, SoC architecture at Nokia (2020-present)
- " Doesn't RISC-V have an add-with-carry instruction as part of the vector extension? I see it listed here: https://github.com/riscv/riscv-v-spec/releases/tag/v1.0 "
- " Afaict that's only for operations on the vector register file. Most of the complaints about the lack of addc/subc are around how heavily they're used in JITs for languages that want to speculatively optimize multi-precision arithmetic into the integer register file for their regular integer ops. JavaScript, a lot of Lisps, and the MLs all fit into that space. " -- [71]
RISC-V's RV32V vector extension vs fixed-width SIMD
- https://www.sigarch.org/simd-instructions-considered-harmful/
- for backwards compatibility, as SIMD registers get wider, new instructions are introduced for each size
- e.g. "The IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD."
- statically, "The SIMD computation code is dwarfed by the bookkeeping code. Two-thirds to three-fourths of the code for MIPS-32 MSA and IA-32 AVX2 is SIMD overhead, either to prepare the data for the main SIMD loop or to handle the fringe elements when n is not a multiple of the number of floating-point numbers in a SIMD register."
- dynamically, "The SIMD instructions execute 10 to 20 times more instructions than RV32V because each SIMD loop does only 2 or 4 elements instead of 64 in the vector case."
A comment on that article with an opposing view:
" 1. If you work with long dense vectors and nothing else, you don't need any CPU instructions. GPGPUs win performance and power efficiency by an order of magnitude.
Current SIMD can be used for more than that.
A register can be treated as a complete small vector as opposed to a chunk in a long vector. Try implementing a 3D vector cross product with your approach and you'll see.
A register can be treated as a 2D bitmap as opposed to vector, here's an example: https://github.com/Const-me/SimdIntroArticle/blob/master/FloodFill/Vector/vectorFill.cpp#L135-L138
2. But the main problem with vector architectures is this part: "Vector architectures then scatter the results back from the vector registers to main memory." Main memory is very slow. When you work on SIMD algorithms, you want to avoid main memory access, instead you do as much as possible with the data while it's in vector registers. The approach you're advocating for can't quite do that. You can't invent a sane calling convention for a function that takes or returns kilobytes of data. Current architectures all pass arguments and return values in these vector registers, because their count and size are part of the ISA, i.e. stable and known to compilers. " -- Soonts
A similar argument is made by glangdale at https://news.ycombinator.com/item?id=19198758 :
" This argument is less effective given that SIMD is not always a straightforward substitute for vector processing. Sometimes we want 128, 256 or 512 bits of processing as a unit and will follow it up with something different, not a repeated instance of that same process. ... We also used SIMD quite extensively as a 'wider GPR' - not doing stuff over tons of input characters but instead using the superior size of SIMD registers to implement things like bitwise string and NFA matchers.
A SIMD instruction can be a reasonable proxy for a wide vector processor but the reverse is not true - a specialized vector architecture is unlikely to be very helpful for this kind of 'mixed' SIMD processing. Almost any "argument from DAXPY" fails for the much richer uses of SIMD processing among active practitioners using modern SIMD. "
Down that thread, other commentators point out that permute/shuffle instructions are useful but don't scale up (the arbitrary-sized 'permute' is gather/scatter, but that uses main memory (or at least cache) which is slower).
- "As a SIMD guy, I can't say their take on SIMD/vectors floats my boat either. Very tailored for the elegant large-scale vector ops, but I've spent a whole career doing grubby things with quick SIMD ops that don't fit the big-vector model. x86 is great for that." -- Geoff Langdale
maybe see also ARM SVE, SVE2 (Scalable Vector Extension)
---
https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d9982f7618ef68
A post by 'erincandescent', a former ARM engineer, with some detailed complaints about RISC-V.
---
A negative comment on RISC-V, from https://www.anandtech.com/Show/Index/15036?cPage=5&all=False&sort=0&page=1&slug=sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip:
"...RISC-V has insisted on a certain kind of intellectual purity that makes no sense in terms of commerce, or the future properties of CPU manufacturing (plentiful transistors)...." -- name99
"... My point is it's the same people repeating the exact same mistakes. It has the same issues as MIPS like no register offset addressing or base with update. Some things are worse, for example branch ranges and immediate ranges are smaller than MIPS. That's what you get when you're stuck in the 80's dogma of making decode as simple as possible... " -- Wilco1
"...Saving a fraction of a mm^2 due to simplified decode is a great marketing story without doubt. However if you look at a modern SoC, typically less than 5% is devoted to the actual CPU cores. If the resulting larger codesize means you need to add more cache/flash/DRAM, increase clock frequency to deal with the extra instructions or makes it harder for a compiler to produce efficient code, is it really an optimal system-wide decision?" -- Wilco1
" RISC-V is very similar to MIPS - MIPS never was great at codesize. When optimizing for size, compilers call special library functions to emulate instructions which are available on Arm. So you pay for saving a few transistors with lower performance and higher power consumption. " -- Wilco1
"It's not a MIPS variant. MIPS is based on work at Stanford. RISC-V is the latest incarnation of the Berkeley RISC project. You are probably thinking of SPARC which is a derivative of earlier RISC project work. MIPS is only related in that it comes from similar ideas but the two projects, Stanford and Berkeley were different." -- zmatt
"You're applying 80's RISC dogma which are no longer relevant. Transistors are cheap and efficient today, so we don't need to minimize them. We no longer optimize just the core or decoder but optimize the system as a whole. Who cares if you saved a few mW in the decoder when moving the extra instructions between DRAM and caches costs 10-100 times as much?
The RISC-V focus on simple instructions and decode is as crazy as a cult. They even want to add instruction fusion for eg. indexed accesses. So first simplify decode by leaving out useful instructions, then make it more complex again to try to make up for the missing instructions..." -- wilco1
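To make Wilco1's fusion complaint concrete, here is a toy sketch (all names and tuple encodings hypothetical, not any real core's decoder) of the macro-op fusion idea: a decoder recognizes an adjacent add + load pair and emits a single indexed-load micro-op, recovering the addressing mode the base ISA left out:

```python
def fuse(insns):
    """Toy macro-op fuser: merge adjacent (add rd, rs1, rs2) + (lw rd2, 0(rd))
    pairs into one indexed-load micro-op. Instructions are tuples:
    ("add", rd, rs1, rs2) and ("lw", rd, offset, base)."""
    out, i = [], 0
    while i < len(insns):
        a = insns[i]
        b = insns[i + 1] if i + 1 < len(insns) else None
        if (b is not None
                and a[0] == "add" and b[0] == "lw"
                and b[2] == 0          # lw offset must be zero
                and b[3] == a[1]):     # lw base is the add's destination
            # one fused micro-op: rd <- mem[rs1 + rs2]
            # (a real fuser would also check the add's rd is dead afterwards)
            out.append(("lw.indexed", b[1], a[2], a[3]))
            i += 2
        else:
            out.append(a)
            i += 1
    return out

prog = [("add", "t0", "a0", "a1"),   # t0 = base + offset
        ("lw",  "a2", 0, "t0"),      # a2 = mem[t0]
        ("addi", "a1", "a1", 4)]
print(fuse(prog))
```

Note the irony Wilco1 points at: the pattern matching above is exactly the decode complexity the ISA was supposed to avoid, just moved past the decoder.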
---
" The two modern ARM instruction sets, the 16-bit-encoded ARMv7-M / ARMv8-M (for microcontrollers) and the 64-bit (32-bit-encoded) ARMv8-A, are very different from the traditional ARM ISA and they both are very well designed, incomparably better than RISC-V.
RISC-V is primitive even compared to the instructions sets used 50 years ago. It includes a few good ideas and the RISC-V team has the merit of popularizing the fact that the older vector ISAs of the seventies were better than the more recent SIMD ISAs of the nineties, which led to modern vector ISAs, e.g. the RISC-V vector extension and ARM SVE.
However the base RISC-V ISA is extremely weak and its only merit is that it is simple enough to be easy to implement in student projects. " -- adrian_b
- "ARMv8.2 or newer is a very well designed ISA, while RISC-V is a very bad ISA and I would hate to be forced to use it. OpenPOWER? is a far better ISA than RISC-V, but unfortunately most developers do not have any experience with POWER and they have the wrong belief that POWER is some antique ISA while RISC-V must be some modern fashionable ISA. Therefore even if OpenPOWER? is much better, it is less likely than RISC-V to be used as a replacement for ARM." -- adrian_b
- "I wish I could upvote your comment a thousand times!" -- ksec
- "...I got the impression that RISC-V’s “simplicity” is actively harmful to making it a good ISA outside of teaching because it’s instructions are just not really all that great; indexing into an integer array (for example) is a shift, add, then load while basically every other architecture can do this in one instruction. It seems to me that the ISA is really designed more towards being pretty than being practical." saagarjha
- "RISC-V is inefficient because it requires more instructions to do the same work as other ISAs and it does not have any advantage to compensate for this flaw. Those extra instructions appear especially in almost all loops and the most important reason is that RISC-V has a worse set of addressing modes than the the vacuum-tube computers from more than 60 years ago, which were built only with a few thousands tubes, compared to the millions or billions of transistors available now for a CPU. Because of this defect of the RISC-V ISA, the Alibaba team who designed the RISC-V implementation with the highest current performance (Xuantie910, which was presented last month at Hot Chips) had to add a custom ISA extension with additional addressing modes, in order to be able to reach an acceptable speed. Whenever the designers of the RISC-V ISA are criticized, they reply that the larger number of instructions is not important, because any high-performance implementation should do instruction fusion, to be able to reach the IPC of other ISAs. Nevertheless, that is wrong for 2 reasons, instruction fusion cannot reduce the larger code size due to the inefficient instruction encoding and the hardware required for decoding more instructions in parallel and for doing instruction fusion is much more complex than the hardware required for decoding less instructions with a better encoding as in other ISAs. " -- adrian_b
- "Also, when comparing ISAs, I place a large weight on how good those ISAs are at GMPbench, i.e. at large number arithmetic. In my experience with embedded system programming large integer operations are useful much more frequently than traditional RISC ISA designers believe. While x86 has always been very good at GMPbench, many traditional RISC ISAs suck badly, because they lack either good carry handling instructions or good double-word multiply/divide/shift instructions. RISC-V also seems to have particularly bad multi-word operation support. " -- adrian_b
- "What convinced me was how thin the RISC-V book is, and I have seen both ARM and Intel reference manuals. For example the V extension removes need to specify data size, greatly reducing instructions needed." --
* https://medium.com/codex/addressing-criticism-of-risc-v-microprocessors-803239b53284
---
this paper describes some extensions to improve code density:
[https://raw.githubusercontent.com/riscv/riscv-code-size-reduction/master/CARRV2020_final.pdf HW/SW approaches for RISC-V code size reduction]
--- why is it RISC-V?
i dunno, but this comment in the spec suggests that RISC-IV was SPUR: "Decoding register specifiers is usually on the critical paths in implementations, and so the instruction format was chosen to keep all register specifiers at the same position in all formats at the expense of having to move immediate bits across formats (a property shared with RISC-IV aka. SPUR [11])."
a commentator on an unrelated web discussion forum says:
"The previous design, RISC-IV (named SPUR), was a Lisp machine.
http://pages.cs.wisc.edu/~markhill/papers/computer86_spur.pd...
The basic instructions (Table 3) are close enough to RISC-V but it also had a few Lisp specific instructions (Table 4). Though the name J Extension for RISC-V might lead you to think it is only about Java or Javascript, the group working on it is also interested in making it as good as possible for Lisp. " -- https://news.ycombinator.com/item?id=19266152
---
(in the middle of a discussion about a RISC-V implementation that doesn't necessarily have RAM): "Many of the applications for a CPU like this don't need any state outside of the CPU registers, especially as RISC-V lets you do multiple levels of subroutine call without touching RAM if you manually allocate different registers and a different return address register for each function (which means programming in asm not C). A lot of 8051 / PIC / AVR have been sold without any RAM (or with RAM == memory mapped registers)" -- https://lobste.rs/s/nqxfoc/serv_is_award_winning_bit_serial_risc_v#c_cv88ud
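The trick in that quote works because RISC-V's `jal` can write the return address to any register, not just `ra`. A toy model (all names hypothetical; the dict stands in for the register file, and no stack or memory is ever touched): each function is hand-assigned its own link register, so two levels of nesting need no RAM.

```python
regs = {}  # stands in for the register file; no stack/list is ever used

def inner():
    regs["a0"] = 42          # compute a result in an argument register
    # return via inner's dedicated link register: jalr zero, 0(t1)
    return regs["t1"]

def outer():
    regs["t1"] = "ret_to_outer"   # jal t1, inner  (link reg chosen by hand)
    inner()
    # return via outer's dedicated link register: jalr zero, 0(ra)
    return regs["ra"]

regs["ra"] = "ret_to_caller"      # jal ra, outer
assert outer() == "ret_to_caller" and regs["a0"] == 42
```

The cost, as the quote says, is that this register assignment is a whole-program decision, which is why it means hand-written asm rather than C.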
Links