" Results
After recompiling the program with libctiny and the method above, the EXE jumped from a giant 64K to a much more reasonable 4K! (4096 bytes to be exact). For comparison, the entire code section of the linker map file is reproduced below:
  0001:00000000 ?DumpFile@@YAXPAD@Z 00401000 f hd.obj
  0001:0000013d _main               0040113d f hd.obj
  0001:0000021a _fopen              0040121a f libct:file.obj
  0001:000002a7 _fread              004012a7 f libct:file.obj
  0001:000003c2 _fwrite             004013c2 f libct:file.obj
  0001:0000048b _fgetc              0040148b f libct:file.obj
  0001:000004b6 _printf             004014b6 f libct:printf.obj
  0001:000004ef _memset             004014ef f libct:memory.obj
  0001:0000050e __doexit            0040150e f libct:initterm.obj
  0001:0000053a _mainCRTStartup     0040153a f libct:crt0tcon.obj
  0001:000005f5 _malloc             004015f5 f libct:alloc.obj
  0001:00000607 __init_args         00401607 f libct:argcargv.obj
  0001:00000705 __ismbcspace        00401705 f libct:isctype.obj
"
---
so just to remind myself, next steps are:
---
Kragen Javier Sitaker compares Fibonacci in various target languages:
https://web.archive.org/web/20120811013032/http://lists.canonical.org/pipermail/kragen-tol/2007-September/000871.html (alternate link: http://www.1strecon.org/downloads/Forth_Resources/ByteCodeInterpretters_4_TinyComputers.pdf ) discussion:
http://thread.gmane.org/gmane.culture.people.kragen.thinking/94
(note to self: yes, i already copied this to plPartTargetLanguages)
---
i'm leaning towards making the 'programmable instructions' just compiler macros, or even just a compiler technique used to compile BootX to Boot in the reference compiler.
In which case, maybe the BootX spec should just reduce the SMALLSTACK size (to reserve some of the Boot SMALLSTACK for compiler use) and/or reduce the number of registers (to reserve one or more registers for compiler use). Of course, you can still form syntactically correct BootX instructions which use the extra stack locations or registers; those are simply reserved for the compiler.
---
should we add an instruction pop_jump_if_false?
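(CPython's bytecode has a POP_JUMP_IF_FALSE along these lines. a minimal sketch of the semantics in a toy C stack-VM; the opcode set and bytecode format here are made up:)
  /* toy stack-VM: pop_jump_if_false pops the condition off the operand
     stack and branches to its target operand if the condition is zero */
  #include <stdint.h>
  #include <stdio.h>

  enum { OP_PUSH, OP_POP_JUMP_IF_FALSE, OP_PRINT, OP_HALT };

  void run(const int32_t *code) {
      int32_t stack[64], *sp = stack;
      const int32_t *pc = code;
      for (;;) {
          switch (*pc++) {
          case OP_PUSH:  *sp++ = *pc++; break;
          case OP_POP_JUMP_IF_FALSE: {      /* pop cond; branch if zero */
              int32_t target = *pc++;
              if (*--sp == 0) pc = code + target;
              break;
          }
          case OP_PRINT: printf("%d\n", *--sp); break;
          case OP_HALT:  return;
          }
      }
  }

  int main(void) {
      /* pushes 0, so the branch is taken and the print is skipped */
      const int32_t prog[] = { OP_PUSH, 0, OP_POP_JUMP_IF_FALSE, 7,
                               OP_PUSH, 42, OP_PRINT, OP_HALT };
      run(prog);
      return 0;
  }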
---
xenadu02 1 hour ago
> the kernel has an easier time of single-handedly cleaning up all file descriptors, memory mappings, etc. in one fell swoop inside execve()
fork() / execve() isn't a good mechanism for launching new processes; it's one of the worst aspects of early Unix design.
A properly designed syscall should require the caller to specify the desired behavior explicitly:
1. Do I want to inherit file descriptors, etc or do I want a clean slate?
2. Do I want a separate address space or to share address space?
3. What happens to threads, mutexes, locks, etc in the new child process?
4. If a separate address space should I inherit my parent's VM mappings or should I just load and execute a new binary image?
That gives flexibility to the caller and maximum information to the kernel to optimize. Instead we have a bunch of ad-hoc patches like vfork(), clone(), and posix_spawn() (+ non-standard attributes like POSIX_SPAWN_CLOEXEC_DEFAULT, POSIX_SPAWN_SETEXEC, and POSIX_SPAWN_USEVFORK).
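(to make the comment's four questions concrete, a hypothetical C sketch of such a call; the name xspawn and all of these flags are invented for illustration, they don't exist in any real OS:)
  /* hypothetical explicit process-creation syscall: the caller states
     every piece of desired behavior up front instead of inheriting
     fork()'s implicit defaults */
  #include <stdint.h>
  #include <sys/types.h>

  #define XSPAWN_INHERIT_FDS   (1u << 0) /* Q1: inherit fds vs. clean slate */
  #define XSPAWN_SHARE_VM      (1u << 1) /* Q2: share vs. separate address space */
  #define XSPAWN_COPY_THREADS  (1u << 2) /* Q3: carry threads/locks into child */
  #define XSPAWN_COPY_MAPPINGS (1u << 3) /* Q4: copy mappings vs. fresh image */

  pid_t xspawn(const char *path, char *const argv[], uint32_t flags);

  /* e.g.: fresh binary image, inherited descriptors, nothing else:
     pid_t child = xspawn("/bin/ls", args, XSPAWN_INHERIT_FDS); */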
---
" There are many mechanisms for communicating information between user-space applications and the kernel. System calls and pseudo-filesystems such as /proc and /sys are of course the most well known. Signals are similarly well known; the kernel employs signals to inform a process of various synchronous or asynchronous events—for example, when the process tries to write to a broken pipe or a child of the process terminates.
There are also a number of more obscure mechanisms for communication between the kernel and user space. These include the Linux-specific netlink sockets and user-mode helper features. Netlink sockets provide a socket-style API for exchanging information with the kernel. The user-mode helper feature allows the kernel to automatically invoke user-space executables; this mechanism is used in a number of places, including the implementation of control groups and piping core dumps to a user-space application.
The auxiliary vector, a mechanism for communicating information from the kernel to user space, has remained largely invisible until now. However, with the addition of a new library function, getauxval(), in the GNU C library (glibc) 2.16 release that appeared at the end of June, it has now become more visible. " [5]
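(getauxval() is easy to try; a minimal example reading the kernel-reported page size via the documented AT_PAGESZ entry:)
  /* read one auxiliary-vector entry via glibc >= 2.16; getauxval()
     and AT_PAGESZ are real, documented interfaces */
  #include <stdio.h>
  #include <sys/auxv.h>

  int main(void) {
      unsigned long pagesz = getauxval(AT_PAGESZ); /* page size from the kernel */
      printf("page size: %lu\n", pagesz);
      return 0;
  }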
---
In deciding to use WASM, the Ethereum project also considered:
so it seems like i'm looking at the right places...
---
i checked out the Ethereum project's WASM-based VM, EWASM, and afaict it doesn't actually deprecate any WASM instructions... [7] [8]
it does have this list of opcodes but i think these are EVM opcodes, not WASM?
https://github.com/ewasm/ewasm-kernel/blob/master/opcodes.js
However https://github.com/ethereum/EIPs/issues/48 (which is outdated [9]) does have some deprecations:
Furthermore that EIP has a list of instructions:
Integer Operations
Flow Control
Memory:
Actually the other URL does link to a (longer) instruction list, at [10]
i really like this EVM proposal to get rid of dynamic jumps, discussed in plChMiscIntermedLangs, section 'EVM: proposal to remove dynamic jumps to assist static analysis'.
probably not at the BootX level (not sure tho...), but maybe at the next level up.
The EVM proposal seems to provide a very minimal set of primitive static jumps that let you do most stuff. In order to do this, they explicitly delimit subroutine boundaries and arity, and define simple semantics for frame pointers, a return stack containing frame pointers, etc; this also lets them define 'getlocal' and 'putlocal' instructions.
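(to fix ideas, here's a toy C model of those semantics; the names loosely follow the EIP-615 proposal's JUMPSUB/RETURNSUB/GETLOCAL/PUTLOCAL, but everything else is my own invention, not the proposal's actual notation:)
  /* toy model: static subroutine calls with a return stack of frames */
  #define STACK_MAX  1024
  #define RSTACK_MAX 256

  typedef struct { int return_pc; int fp; } Frame;

  int stack[STACK_MAX];     int sp  = 0; /* operand stack */
  Frame rstack[RSTACK_MAX]; int rsp = 0; /* return stack of frame pointers */
  int fp = 0;                            /* current frame pointer */

  void jumpsub(int *pc, int target, int arity) {
      /* target and arity are compile-time constants: jumps stay static */
      rstack[rsp++] = (Frame){ .return_pc = *pc, .fp = fp };
      fp = sp - arity;                   /* the args become the first locals */
      *pc = target;
  }

  void returnsub(int *pc) {
      Frame f = rstack[--rsp];
      fp  = f.fp;
      *pc = f.return_pc;
  }

  int  getlocal(int n)        { return stack[fp + n]; }
  void putlocal(int n, int v) { stack[fp + n] = v; }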
---
kinda interesting. assembler with label and macro support, based on Coq:
---
mac012345 says: June 25, 2018 at 7:08 am
Does the interrupt latency take register saving into account for the risc-v core? As there are 32 registers and no multiple load/store instruction, it means that if it's not done by hardware you can have up to 32 cycles latency (depending on EABI), quite different from the 6 cycle figure given…
Dmitry Grinberg says: June 25, 2018 at 10:50 am
The point the OP is making remains valid. Comparing their “6 cycles” with no reg saving is unfair to Cortex-M, whose published latency of 12-13 cycles DOES include saving regs.
Bruce Hoult (@BruceHoult) says: June 25, 2018 at 11:07 am
The time quoted is to get into a C function compiled with the “interrupt” attribute. That means the function itself will save only the registers it actually needs. A simple handler might only need to save one register.
If the interrupt handler function needs more extensive processing and calls normal C functions (i.e. without the “interrupt” attribute) then it will automatically save the volatile registers first. In the standard *nix ABI that’s a0-a7, t0-t6, and the return address (16 registers). An EABI is in the process of being defined. It’s probably going to end up needing to save 7 registers.
Note that a load/store multiple instruction makes your program smaller, but it doesn’t make it any faster. The same goes for hardware register stacking.
asdf says: June 25, 2018 at 2:35 pm
The Cortex-M also sets up nested interrupts in those same 12 cycles. It’s too bad the RISC-V designers never seem to have looked at what makes Cortex-M such an elegant microcontroller architecture. As it stands, it has all the drawbacks and awkwardness of MIPS32.
---
ChuckMcM 16 hours ago | on: Why Use an FPGA Instead of a CPU or GPU?
The author isn't aware apparently of over a decade of work done in FPGA/CPU integration. Both in more sequential languages like System C[1] or in extended instruction set computing like the Stretch[2]. Not to mention the Zynq[3] series where the CPU is right there next to the FPGA fabric.
For "classic" CPUs (aka x86 cpus with a proprietary frontside bus and southbridge bus) it can be challenging to work an FPGA into the flow. But the problem was pretty clear in 2007 in this Altera whitepaper[4] where they postulate that not keeping up with Moore's law means you probably end up specializing the hardware for different workloads.
The tricky bit on these systems is how much effort/time it takes to move data and context between the "main" CPU and the FPGA and then back again (Amdahl's law basically). When the cost of moving the data around is less than the scaling benefit of designing a circuit to do the specific computation, then adding an FPGA is a winning move.
[1] https://www.xilinx.com/products/design-tools/vivado/prod-advantage/rtl-synthesize.html
[2] http://www.stretchinc.com/index.php
[3] https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html
[4] https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01029.pdf
Yeah, the High-Level Synthesis work goes back to the 1990's. Wikipedia has a nice list:
https://en.wikipedia.org/wiki/High-level_synthesis
They didn't really pan out in terms of large uptake. They did solve some problems pretty well with shorter developer time. There have also been open-source ones designed. Here's a few:
Many of the others seem dead or don't have a software page. I'm not sure what's current in FOSS HLS with C-like languages. Anyone know of other developments at least as good as the two I linked?
fulafel 15 hours ago
Nitpick: LegUp is published under a "no commercial use" license. Synflow mentions they use an "open-source language" but it is not clear how much of the tool chain is open source.
---
events/locks need a primitive where only ONE thread, the one that next gets the lock, gets woken up when a lock is released and multiple threads are waiting on that lock. O/w you get this:
" A lock convoy occurs when multiple threads of equal priority contend repeatedly for the same lock.[1] Unlike deadlock and livelock situations, the threads in a lock convoy do progress; however, each time a thread attempts to acquire the lock and fails, it relinquishes the remainder of its scheduling quantum and forces a context switch. The overhead of repeated context switches and underutilization of scheduling quanta degrade overall performance. ...
Example
Critical sections as implemented in Microsoft Windows operating systems provide a good example of how lock convoys can occur. In Windows, critical sections use a combination of a spinlock and a kernel synchronization object called an "event" to ensure mutual exclusion. For low-contention critical sections, the spinlock will provide mutual exclusion most of the time, falling back on the event only when a thread fails to acquire the spinlock within a certain amount of time. When contention is high, however, it is possible for many threads to fail to acquire the spinlock and enter a waiting state, all waiting on the same event.
When the event is signaled, all threads that are waiting on the event are woken, but only one will be allowed to acquire the critical section and continue execution; the remaining threads will each block again. " [12]
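(in pthreads terms this is the difference between pthread_cond_signal, which wakes at least one waiter, and pthread_cond_broadcast, which wakes all of them. a minimal sketch of the wake-one flavor:)
  /* wake-one handoff: signal() wakes (at least) one waiter rather than
     stampeding every waiter the way broadcast() does */
  #include <pthread.h>

  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
  int available = 0;

  void release_lock(void) {
      pthread_mutex_lock(&m);
      available = 1;
      pthread_cond_signal(&c);   /* wake one waiter, not all of them */
      pthread_mutex_unlock(&m);
  }

  void acquire_lock(void) {
      pthread_mutex_lock(&m);
      while (!available)         /* re-check: wakeups can be spurious */
          pthread_cond_wait(&c, &m);
      available = 0;
      pthread_mutex_unlock(&m);
  }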
---
This explains a Windows design decision for a non-fair scheduler to increase performance. For Oot, we wouldn't do this; round-robin scheduling is simpler and we usually prioritize simplicity over performance:
---
neelance commented on Jun 29
The upcoming Go 1.11 release will have experimental support for WebAssembly. This will include full support for all of Go's features, including goroutines, channels, etc. However, the performance of the generated WebAssembly is currently not that good.
This is mainly because of the missing goto instruction. Without the goto instruction we had to resort to using a toplevel loop and jump table in every function. Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function. The relooper cannot help with this, only a goto instruction can.
It is awesome that WebAssembly got to the point where it can support a language like Go. But to be truly the assembly of the web, WebAssembly should be equally powerful as other assembly languages. Go has an advanced compiler which is able to emit very efficient assembly for a number of other platforms. This is why I would like to argue that it is mainly a limitation of WebAssembly and not of the Go compiler that it is not possible to also use this compiler to emit efficient assembly for the web.
kripken (Member) commented on Jun 29
> Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function.
Just to clarify, a regular goto would not be enough for that, a computed goto is required for your use case, is that correct?
neelance commented on Jun 29
I think a regular goto would probably be sufficient in terms of performance. Jumps between basic blocks are static anyways and for switching goroutines a br_table with gotos in its branches should be performant enough. Output size is a different question though.
kripken (Member) commented on Jun 29
It sounds like you have normal control flow in each function, but also need the ability to jump from the function entry to certain other locations in the "middle", when resuming a goroutine - how many such locations are there? If it's every single basic block, then the relooper would be forced to emit a toplevel loop that every instruction goes through, but if it's just a few, that shouldn't be a problem. (That's actually what happens with setjmp support in emscripten - we just create the extra necessary paths between LLVM's basic blocks, and let the relooper process that normally.)
neelance commented on Jun 29
Every call to some other function is such a location and most basic blocks have at least one call instruction. We're more or less unwinding and restoring the call stack.
kripken (Member) commented on Jun 29
I see, thanks. Yeah, I agree that for that to be practical you need either static goto or call stack restoring support (which has also been considered).
Heimdell commented on Jun 30
Will it be possible to call functions in CPS style or implement call/cc in WASM?
rossberg (Member) commented on Jul 9
@Heimdell, support for some form of delimited continuations (a.k.a. "stack switching") is on the road map, which should be enough for almost any interesting control abstraction. We cannot support undelimited continuations (i.e., full call/cc), though, since the Wasm call stack can be arbitrarily intermixed with other languages, including reentrant calls out to the embedder, and thus cannot be assumed to be copyable or movable.
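(note to self: written out in C, the 'toplevel loop and jump table' workaround neelance describes looks roughly like this; my own illustration, not Go's actual output:)
  /* what a function body becomes when the target has no goto: a
     dispatch loop over basic-block ids; resuming in the "middle" is
     then just entering the loop with a saved block id. this version
     computes sum 1..n so it actually runs. */
  int sum_to_n(int n) {
      int i = 1, acc = 0;
      int block = 0;               /* entry block */
      for (;;) {
          switch (block) {
          case 0:                  /* loop header */
              block = (i <= n) ? 1 : 2;
              break;
          case 1:                  /* loop body */
              acc += i; i++;
              block = 0;
              break;
          case 2:                  /* exit block */
              return acc;
          }
      }
  }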
---
C compiler with support for structs written in Assembly (nongnu.org) https://bootstrapping.miraheze.org/wiki/Stage0 http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage2/cc_x86.s http://git.savannah.nongnu.org/cgit/stage0.git/tree/ISA_HEX_Map.org https://news.ycombinator.com/item?id=17851311
https://github.com/oriansj/knight-vm https://github.com/oriansj/knight-vm/blob/master/vm_instructions.c
https://bootstrapping.miraheze.org/wiki/Main_Page
mmastrac 1 day ago
I started working on a similar VM-based bootstrap for getting from bare metal to C compiler. If anyone is interested in collaborating let me know.
The idea is that you implement a simple, ASCII-based VM for your platform and that's enough to bootstrap a number of different assemblers, to a basic C compiler, to a full-fledged C compiler and (very minimal) POSIX environment.
The goal is twofold: a base for trusted compilation like this one, and a way to guarantee long-term executability of various archiving programs (ie: guarantee that you can unrar a file in 20 years with minimal work).
EDIT: very rough repo is here - https://github.com/mmastrac/bootstrap
giomasce 1 day ago
I am also doing something like that here:
https://gitlab.com/giomasce/asmc
My philosophy is to have a very simple custom language (which I call G) for which it is easy to write a compiler in (x86) Assembly, then write a C compiler in G that is good enough to compile tcc. Then tcc is known to be able to compile gcc 4.7.4 (the last version which does not require a C++ compiler).
My C compiler is starting to take shape, and in theory (if I find the time to work on it) it should not be far from supporting all that tcc needs.
The README in the linked page contains more information.
I'll look at your code too!
---
https://github.com/darklife/darkriscv
"implements most of the RISC-V RV32I instruction set"
implemented (in Verilog!) in one night (or mb one week?)
"The RV32I specification itself is really impressive and easy to implement (see [1], page 16). Of course, there are some drawbacks, such as the funny little-endian bus (opposed to the network oriented big-endian bus found in the 680x0 family), but after some empirical tests it is easy to make work."
not sure which instructions are implemented; i think it's in this file:
https://github.com/darklife/darkriscv/blob/master/rtl/darkriscv.v
akuma73 5 hours ago
Sorry to burst any bubbles here, but this is a very incomplete implementation.
You couldn’t run anything but small toy programs on this machine. This is more like what a student would build in an undergraduate course in computer architecture.
For example, there is no MMU, no debug support, no traps, no interrupts, no exception handling, no OS privilege levels, no FP, no memory controller, etc.
Of course, one wouldn’t implement all of these in a few hours.
The fact that this is RISCV is somewhat of a red herring as you could do a similar thing with a restricted subset of MIPS or ARM or even x86 as they do in UT Austin’s comp arch class.
aappleby 46 minutes ago
An Arduino has none of the missing parts you mention (except interrupts), yet it's quite a useful device even in non-toy applications.
waterhouse 8 hours ago
I carefully studied the 2.1 and 2.2 versions of the spec. I wrote, in Racket, a miniature interpreter for lists of RISC-V instructions (just symbolic lists, not the byte strings they assemble into), then wrote programs to compute triangular numbers and Fibonacci, and verified they worked. I've built the RISC-V toolchain and compiled a few test C programs and run them in "spike". I don't think I took more than a glance at the "gcc -S" output. I have no familiarity with SPARC or Motorola assembly, but I have decent familiarity with x86-64 assembly, having written what looks like a couple dozen miniature test programs and one medium-sized program, and having tried a few times to write a decent x86-64 assembler.
I found the RISC-V spec enjoyable to read, especially with all the rationales; it felt clean, minimal, and well-thought-out. Writing a program in it was probably not much different from writing in x86-64 assembly. The encodings seemed much easier to deal with than x86-64, though as mentioned I didn't go as far as writing an encoder.
What kinds of important features are present in SPARC or MC68k but not in RISC-V? Are they absent from x86-64 as well?
brandmeyer 5 hours ago
There are a few instructions that nearly every modern processor ended up adopting, but RISC-V still lacks. Off the top of my head (with the RV ISA developers' responses in parentheses):
So yeah, there are deficiencies. None of them are crippling, but I wouldn't say that RV is super-wham-o-dyne, either.
Annatar 7 hours ago
Motorola assembler is what is known as an orthogonal instruction set. The flow is logical: move.b for byte, move.w for word, move.l for longword, from source to destination, which reflects real life.
The assembler reads almost like a high level programming language. The register scheme is intuitive as well, from a0-a7 being the address registers, to d0-d7 being the data registers.
Now, let's do a mental exercise: I'm going to load an effective address relative to the program counter, 32-bits wide, into the 1st address register. Then, I'm going to load an arbitrary value from an arbitrary memory location into the second address register. Do the same in RISC-V; compare intuitiveness.
  lea MemoryAddress(pc), a0
  move.l $00bfe001, a1
  rts
  MemoryAddress: DC.l 0
waterhouse 7 hours ago
Indeed, in RISC-V, full 32-bit constants must be broken across two instructions. The translation goes:
  ; 1. lea MemoryAddress(pc), a0
  auipc a0, [upper 20 bits of (MemoryAddress - label)]
  addi a0, a0, [lower 12 bits of (MemoryAddress - label)]
  label:
  ; 2. move.l #$00bfe001, a1
  lui a1, 0x00bfe000
  addi a1, a1, 0x001
  ; 3. rts
  jalr x0, x1, 0
It is cumbersome in that sense. But it will probably be handled by an assembler (as a macro or builtin). The spec contains an appendix of "pseudoinstructions", which gives the first translation above. There isn't even a dedicated "move" instruction—it's just "addi dst, src, 0" by convention! Clearly anyone writing assembly will use at least that pseudoinstruction.
If there's a canonical format for "pseudoinstructions", and all assemblers handle them in the same way, and the abstraction doesn't leak in any way (i.e. the only temporary registers you use are ones you overwrite fully by the end; it is true that now some "instructions" have longer encodings, but that comes with the compressed instructions anyway; and it is true that an interrupt could happen between the two halves of the "instruction", but I think that shouldn't make a difference), then I don't think there's much of a problem.
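(note to self: the split has one subtlety worth remembering: addi sign-extends its 12-bit immediate, so the upper 20 bits need compensating when bit 11 of the constant is set. a little C sketch of the split:)
  /* split a 32-bit constant into RISC-V lui+addi halves; since addi
     sign-extends its 12-bit immediate, the upper part is bumped up
     whenever bit 11 of the constant is set */
  #include <stdint.h>

  void split_const(uint32_t value, uint32_t *upper20, int32_t *lower12) {
      *lower12 = (int32_t)(value << 20) >> 20;       /* sign-extended low 12 */
      *upper20 = (value - (uint32_t)*lower12) >> 12; /* compensated high 20 */
      /* lui rd, upper20 ; addi rd, rd, lower12  reconstructs value */
  }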
Annatar 1 hour ago
"Indeed, in RISC-V, full 32-bit constants must be broken across two instructions."
Any RISC processor has that issue, because encoding is fixed at 32-bits to keep the hardware simple. That's not what I'm referring to.
Look at that retardation, "auipc". Because "auipc" is intuitive, right? (For the record, I'm being sarcastic.) What the hell was the instruction designer smoking? Then there is the square bracket notation, like on intel, and intel has some of the most idiotic, non-conventional assembler syntax -- and this thing mimics something so bad? Every other processor uses parenthesis, that's a well understood norm.
Then there is more intel retardation in the form of dst, src (or dst, src, src on RISC). What kind of a warped, twisted mind came up with that? What was going on in that person's head to work in such twisted ways?
Then there's the "cherry on top":
jalr x0, x1, 0 ; because "jalr" is intuitive as well, it immediately tells you what it does?
You'll know a bad processor design by its instruction set. Always and forever.
---
http://www.darklife.org/pics/xilinx/src/core.v
"Verilog 16-bit RISC processor!"
  `define ROR 4'd0
  `define ROL 4'd1
  `define ADD 4'd2
  `define SUB 4'd3
  `define AND 4'd4
  `define OR  4'd5
  `define XOR 4'd6
  `define NOT 4'd7
  `define LOD 4'd8
  `define STO 4'd9
  `define IMM 4'd10
  `define MUL 4'd11
  `define BRA 4'd12
  `define BSR 4'd13
  `define RET 4'd14
  `define LOP 4'd15
---
it's tough to keep the # of instructions small because, yeah, 16-bit is probably the smallest power-of-two bitwidth that can be used to address reasonably sized programs, but otoh sometimes you really want to index more than 4 gigs of memory/list items (~2^32 bytes). So we seem to be inexorably led to having a plethora of data types in assembly, which greatly raises the # of instructions.
But.. actually floating point can represent some integers too.. and in fact it seems that a 64-bit double can exactly represent every integer up to 2^53. [14]. So it may be possible to get by with just two types: a 16-bit integer and a 64-bit float.
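(quick C check of that 2^53 boundary:)
  /* doubles are exact up to 2^53 and lose exactness just past it */
  #include <stdio.h>

  int main(void) {
      double a = 9007199254740992.0;  /* 2^53, exactly representable */
      printf("%.0f\n", a);            /* 9007199254740992 */
      printf("%.0f\n", a + 1.0);      /* also 9007199254740992:
                                         2^53+1 is not representable */
      printf("%.0f\n", a + 2.0);      /* 9007199254740994 */
      return 0;
  }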
Very few if any assembly-like languages do that i think; even WASM has i32, i64, f32, f64.
I guess a more practical choice would be:
i16, i64, f64
and even more practical:
i8, i16, i32, i64, f64
but imo that's already a lot...
---
to reiterate some thoughts regarding the tiled/fabric CPU-node, memory-node, network-node design at the end of ootAssembly19 and mentioned in ootAssemblyThoughts:
i probably won't have time in my lifetime to muck around with hardware, even FPGAs, much, so having hardware-supported specialized CPU, memory, and network nodes is probably something i'll never get to. So really the best we could hope for is a uniform tiling of one type of node (which could maybe be, in software, specialized into CPU/memory/network nodes). So the lessons from the tiled/fabric CPU/memory/net node design should probably be distilled down to constraints on low-level memory and communications capabilities rather than used directly, even in demos. Which is pretty much what i did at the end of the relevant sections of ootAssembly19 and ootAssemblyThoughts. To reiterate some of this:
so taking the sorta-max of some of these (except packet size), for easy development given that we won't actually have a hardware implementation and so don't really know what the exact constraints are:
in terms of geometry, either:
or:
eh, let's have a payload size of 4k. We are choosing maxes here, right? If the hardware page size is 256 then the Oot VM layer has to do some networking, whatever. Also, this allows all of the constraints to be derived just from the bitwidth of 16; if each node has a one-word address, that's 64k nodes per chip, and if one packet payload can fit one bit for every node on the chip, that's a 4k-word payload (4096 16-bit words = 65536 bits, one bit per node).
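(quick sanity-check of that arithmetic in C:)
  /* derive the node count and payload size from the 16-bit word width */
  #include <stdio.h>

  int main(void) {
      unsigned word_bits = 16;
      unsigned nodes = 1u << word_bits;           /* one-word addresses -> 64k nodes */
      unsigned payload_words = nodes / word_bits; /* 65536 bits / 16 = 4096 words */
      printf("%u nodes, payload = %u words = %u bits\n",
             nodes, payload_words, payload_words * word_bits);
      return 0;
  }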
so to summarize further:
(note: this allows a data structure with one bit for every addressable node to fit within one packet payload)
(what happens when the buffer is full? do packets get overwritten? Or do packets get dropped and the sender notified? Or does the sender wait for acknowledgement which never arrives in this case? If packets can get silently dropped or acks can not arrive, then is this any better than an unreliable network?)
(is in-order streaming provided or only datagrams/packets?)
(is error correction/reliability provided/assumed? i think yes; this is supposed to be like hardware/IPC, not like external network transmission; if you wanted to access a 'real' (external) network then you'd want more than 16-bit addresses)
(does each node have multiple 'ports' for packets to be sent to/from? If so, i assume that signals don't have ports?)
Remaining Qs:
i copied those conclusions to ootAssemblyThoughts.
---
let's unify packets and signals, and also allow them to be ordered via pipes (channels)
---
dschuetz 6 minutes ago
The Cell was way ahead of its time. Its security architecture capabilities were almost perfect, but they screwed up the implementation and they failed to provide good programming tools. And it was very, very expensive.
---
both WASM and LLVM have an instruction called 'unreachable' that is like our BAD instruction. So we should probably call it 'unreachable'.
---
WASM's opcodes 0 and 1 are 'unreachable' and 'NOP'. Which is nice b/c it means you can immediately identify non-instructions by comparing the opcode to 2.
---
WASM has an instruction called 'drop' that i don't quite understand yet.
---
concordance of WASM, RISC-V, LLVM, JVM, ARM Cortex M0 instructions (todo: also LuaJIT 2):
RISC-V 2.2 base 32-bit integer instruction set RV32I version 2.0:
RISC-V multiply/divide extension: mul, mulh, mulhsu, mulhu, div, divu, rem, remu
RISC-V single-precision float extension: flw fsw fadd fsub fmul fdiv fsqrt fmadd fmsub fnmsub fnmadd fmv.w.x fmv.x.w fsgnj fsgnjn fsgnjx fmin fmax feq flt fle fcvt.s.w fcvt.s.wu fcvt.w.s fcvt.wu.s fclass frcsr fscsr frrm fsrm fsrmi frflags fsflags fsflagsi
WASM: [16]
LLVM:
in addition to what is listed above, there are various 'added later' instructions, vector ops, aggregate ops including getelementptr, fence and atomic memory ops, and intrinsics
JVM:
https://en.wikipedia.org/wiki/Java_bytecode_instruction_listings
(a | d | f | i | l)(load | store) (a=reference, d=double, f=float, i=int, l=long)
(a | d | f | i | l)(load | store)_(0..3)
(a | b | c | d | f | i | l | s)(aload | astore) anewarray arraylength getfield getstatic multianewarray new newarray putfield putstatic
(d | f | i | l)(add | sub | mul | div | neg | rem)
(i | l)(and | or | xor | shl | shr | ushr)
(d | f)cmp(g | l)
(a | d | f | i | l)return athrow goto goto_w invokedynamic invokeinterface invokespecial invokestatic invokevirtual jsr jsr_w lookupswitch ret tableswitch
if_acmp(eq | ne)
if_icmp(eq | ne | ge | gt | le | lt)
if(eq | ne | ge | gt | le | lt)
d2(f | i | l) f2(d | i | l) i2(b | c | d | f | l | s) l2(d | f | i)
ARM Cortex M0:
from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CIHJJEIH.html or http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0432c/CHDCICDF.html
note: these are 32-bit operations
ldr (load) ldr(b | h | sb | sh) (load byte | halfword | signed byte | signed halfword) ldm (load multiple) str (store) str(b | h) stm (store multiple) (push | pop) (push/pop registers onto/from stack)
(u | s)xt(b | h) (extend unsigned | signed byte | halfword)
cps(id | ie) (disable/enable interrupts) (mrs | msr) (read/write special register) bkpt (breakpoint)
ARM Cortex M4 floating-point:
VABS, VADD, VCMP, VCMPE, VCVT, VCVTR, VDIV, VLDM, VLDR, VMLA, VMLS, VMOV, VMRS, VMSR, VMUL, VNEG, VNMLA, VNMLS, VNMUL, VPOP, VPUSH, VSQRT, VSTM, VSTR, VSUB
todo mb later after taking the initial sorta-intersection, also consider 68k/Coldfire, MIPS, those educational subsets of RISC-V, GNU Lightning, RISC-V compressed, the ARM subset in this document https://community.arm.com/processors/b/blog/posts/arm-cortex-m3-and-later-basic-integer-math-operations-32-bit-1-5 , what else?
CONCORDANCE:
loads/stores:
load/store byte or halfword integer: all but the JVM have some straightforward way to load an integer byte or halfword/short (two bytes), with the option to treat it as unsigned or signed (that is, to zero-extend or sign-extend it; see the little C sketch at the end of this section). All but the JVM can also store integer bytes and halfwords. The JVM would appear to use array operations (bytearrays) for this?
load/store 32-bit word: all 5 have 32-bit integer load/store commands
add/sub 32-bit ints: all 5 have this
negate signed integers: all 5 can do this; (in RISC-V this would be SUB 0, $x; note that constant 0 is always available even to non-immediate instructions in RISC-V, by using the zero register); in WASM and LLVM i'm assuming you would also subtract the value from the constant 0; JVM has a negate instruction; and ARM has a weird-sounding instruction rsbs (reverse subtract; negate).
add immediate: RISC-V, ARM, and JVM all have this; n/a for the others
load upper immediate: all but armcm0 have either a load upper immediate, or direct loading of 64-bit integers
add immediate to PC: riscv and armcm0 both have something like this. Arguably the other 3 are higher level so may not need it as much. ARM can add or subtract 4k [17]
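(re: the load/store entry above, the zero- vs sign-extension distinction spelled out in C:)
  /* zero- vs sign-extension of a loaded byte: the distinction that
     lb/lbu-style instruction pairs encode */
  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint8_t mem = 0xFF;                  /* the byte in memory */
      int32_t sign_extended = (int8_t)mem; /* lb-style  -> -1  */
      int32_t zero_extended = mem;         /* lbu-style -> 255 */
      printf("%d %d\n", sign_extended, zero_extended);
      return 0;
  }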