proj-oot-ootAssemblyNotes22

[1]

" Results

After recompiling the program with libctiny and the method above, the EXE jumped from a giant 64K to a much more reasonable 4K! (4096 bytes to be exact). For comparison, the entire code section of the linker map file is reproduced below:

"
  0001:00000000 ?DumpFile@@YAXPAD@Z 00401000 f hd.obj
  0001:0000013d _main 0040113d f hd.obj
  0001:0000021a _fopen 0040121a f libct:file.obj
  0001:000002a7 _fread 004012a7 f libct:file.obj
  0001:000003c2 _fwrite 004013c2 f libct:file.obj
  0001:0000048b _fgetc 0040148b f libct:file.obj
  0001:000004b6 _printf 004014b6 f libct:printf.obj
  0001:000004ef _memset 004014ef f libct:memory.obj
  0001:0000050e __doexit 0040150e f libct:initterm.obj
  0001:0000053a _mainCRTStartup 0040153a f libct:crt0tcon.obj
  0001:000005f5 _malloc 004015f5 f libct:alloc.obj
  0001:00000607 __init_args 00401607 f libct:argcargv.obj
  0001:00000705 __ismbcspace 00401705 f libct:isctype.obj "

---

so just to remind myself, next steps are:

---

Kragen Javier Sitaker compares Fibonacci in various target languages:

https://web.archive.org/web/20120811013032/http://lists.canonical.org/pipermail/kragen-tol/2007-September/000871.html (alternate link: http://www.1strecon.org/downloads/Forth_Resources/ByteCodeInterpretters_4_TinyComputers.pdf ) discussion:

http://thread.gmane.org/gmane.culture.people.kragen.thinking/94

(note to self: yes, i already copied this to plPartTargetLanguages)

---

i'm leaning towards making the 'programmable instructions' just compiler macros, or even just a compiler technology used to compile BootX to Boot in the reference compiler.

In which case, maybe the BootX spec should just reduce the SMALLSTACK size (to reserve some of the Boot SMALLSTACK for compiler use) and/or reduce the number of registers (to reserve one or more registers for compiler use). Of course, you can still form syntactically correct BootX instructions which use the extra stack locs or registers (these can be reserved for compiler use).

---

should we add an instruction

Pop_jump_if_false?

---

xenadu02 1 hour ago [-]

> the kernel has an easier time of single-handedly cleaning up all file descriptors, memory mappings, etc. in one fell swoop inside execve()

fork() / execve() isn't a good mechanism for launching new processes; it's one of the worst aspects of early Unix design.

A properly designed syscall should require the caller to specify the desired behavior explicitly:

1. Do I want to inherit file descriptors, etc, or do I want a clean slate?
2. Do I want a separate address space or to share address space?
3. What happens to threads, mutexes, locks, etc in the new child process?
4. If a separate address space, should I inherit my parent's VM mappings or should I just load and execute a new binary image?

That gives flexibility to the caller and maximum information to the kernel to optimize. Instead we have a bunch of ad-hoc patches like vfork(), clone(), and posix_spawn() (+ non-standard attributes like POSIX_SPAWN_CLOEXEC_DEFAULT, POSIX_SPAWN_SETEXEC, and POSIX_SPAWN_USEVFORK).

reply
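(For concreteness: the closest POSIX currently comes to 'specify the desired behavior explicitly' is posix_spawn with its file-actions and attributes objects -- one of the very ad-hoc patches the comment criticizes. A minimal sketch, with a made-up path and arguments:)

  #include <fcntl.h>
  #include <spawn.h>
  #include <unistd.h>

  extern char **environ;

  int main(void) {
      pid_t pid;
      posix_spawn_file_actions_t fa;
      posix_spawn_file_actions_init(&fa);
      /* state explicitly what the child's stdout should be, instead of
         implicitly inheriting whatever the parent happened to have open */
      posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "/tmp/ls.out",
                                       O_WRONLY | O_CREAT | O_TRUNC, 0644);
      char *argv[] = { "ls", "-l", NULL };
      int rc = posix_spawn(&pid, "/bin/ls", &fa, NULL, argv, environ);
      posix_spawn_file_actions_destroy(&fa);
      return rc;
  }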

---

" There are many mechanisms for communicating information between user-space applications and the kernel. System calls and pseudo-filesystems such as /proc and /sys are of course the most well known. Signals are similarly well known; the kernel employs signals to inform a process of various synchronous or asynchronous events—for example, when the process tries to write to a broken pipe or a child of the process terminates.

There are also a number of more obscure mechanisms for communication between the kernel and user space. These include the Linux-specific netlink sockets and user-mode helper features. Netlink sockets provide a socket-style API for exchanging information with the kernel. The user-mode helper feature allows the kernel to automatically invoke user-space executables; this mechanism is used in a number of places, including the implementation of control groups and piping core dumps to a user-space application.

The auxiliary vector, a mechanism for communicating information from the kernel to user space, has remained largely invisible until now. However, with the addition of a new library function, getauxval(), in the GNU C library (glibc) 2.16 release that appeared at the end of June, it has now become more visible. " [5]
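(A minimal C example of the new call; AT_PAGESZ and AT_HWCAP are standard auxv entries on Linux:)

  #include <stdio.h>
  #include <sys/auxv.h>

  int main(void) {
      /* two values the kernel passed to this process in the auxiliary vector */
      printf("page size: %lu\n", getauxval(AT_PAGESZ));
      printf("hw capability bitmask: %#lx\n", getauxval(AT_HWCAP));
      return 0;
  }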

---

In deciding to use WASM, the Ethereum project also considered:

[6]

so it seems like i'm looking at the right places...

---

i checked out the Ethereum project's WASM-based VM, EWASM, and afaict it doesn't actually deprecate any WASM instructions... [7] [8]

it does have this list of opcodes but i think these are EVM opcodes, not WASM?

https://github.com/ewasm/ewasm-kernel/blob/master/opcodes.js

However https://github.com/ethereum/EIPs/issues/48 (which is outdated [9]) does have some deprecations:

Furthermore that EIP has a list of instructions:

Integer Operations

Flow Control

Memory:

Actually the other URL does link to a (longer) instruction list, at [10]


i really like this EVM proposal to get rid of dynamic jumps, discussed in plChMiscIntermedLangs, section 'EVM: proposal to remove dynamic jumps to assist static analysis'.

probably not at the BootX level (not sure tho...), but maybe at the next level up.

The EVM proposal seems to provide a very minimal set of primitive static jumps that let you do most stuff. In order to do this, they explicitly delimit subroutine boundaries and arity, and define simple semantics for frame pointers, a return stack containing frame pointers, etc; this also lets them define 'getlocal' and 'putlocal' instructions.
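(A rough C sketch of those semantics as i understand them -- the names here are mine, not the EIP's: calls push (return address, frame pointer) pairs on a separate return stack, the frame pointer is derived from the declared arity, and getlocal/putlocal index off the frame pointer:)

  typedef struct { int ret_addr; int fp; } frame;
  static frame return_stack[1024];
  static int rsp = 0;             /* return-stack pointer */
  static int stack[65536];        /* operand stack; locals live here too */
  static int fp = 0, sp = 0;      /* frame pointer, operand-stack pointer */

  static void jumpsub(int *pc, int dest, int arity) {
      return_stack[rsp++] = (frame){ *pc + 1, fp };
      fp = sp - arity;            /* locals begin at the first argument */
      *pc = dest;                 /* dest is a static constant: analyzable */
  }
  static void returnsub(int *pc) {
      frame f = return_stack[--rsp];
      fp = f.fp;
      *pc = f.ret_addr;
  }
  static int  getlocal(int n)        { return stack[fp + n]; }
  static void putlocal(int n, int v) { stack[fp + n] = v; }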

---

kinda interesting. assembler with label and macro support, based on Coq:

[11]

---

mac012345 says: June 25, 2018 at 7:08 am

Does the interrupt latency take register saving into account for the risc-v core? As there are 32 registers and no multiple load/store instruction, it means that if it’s not done by hardware you can have up to 32 cycles latency (depending on EABI), quite different from the 6 cycle figure given…

Dmitry Grinberg says: June 25, 2018 at 10:50 am

The point the OP is making remains valid. Comparing their “6 cycles” with no reg saving is unfair to Cortex-M, whose published latency of 12-13 cycles DOES include saving regs.

Bruce Hoult (@BruceHoult) says: June 25, 2018 at 11:07 am

The time quoted is to get into a C function compiled with the “interrupt” attribute. That means the function itself will save only the registers it actually needs. A simple handler might only need to save one register.

If the interrupt handler function needs more extensive processing and calls normal C functions (i.e. without the “interrupt” attribute) then it will automatically save the volatile registers first. In the standard *nix ABI that’s a0-a7, t0-t6, and the return address (16 registers). An EABI is in the process of being defined. It’s probably going to end up needing to save 7 registers.

Note that a load/store multiple instruction makes your program smaller, but it doesn’t make it any faster. The same goes for hardware register stacking.

asdf says: June 25, 2018 at 2:35 pm

The Cortex-M also sets up nested interrupts in those same 12 cycles. It’s too bad the RISC-V designers never seem to have looked at what makes Cortex-M such an elegant microcontroller architecture. As it stands, it has all the drawbacks and awkwardness of MIPS32.

---

ChuckMcM 16 hours ago

on: Why Use an FPGA Instead of a CPU or GPU?

The author isn't aware apparently of over a decade of work done in FPGA/CPU integration. Both in more sequential languages like System C[1] or in extended instruction set computing like the Stretch[2]. Not to mention the Zynq[3] series where the CPU is right there next to the FPGA fabric.

For "classic" CPUs (aka x86 cpus with a proprietary frontside bus and southbridge bus) it can be challenging to work an FPGA into the flow. But the problem was pretty clear in 2007 in this Altera whitepaper[4] where they postulate that not keeping up with Moore's law means you probably end up specializing the hardware for different workloads.

The tricky bit on these systems is how much effort/time it takes to move data and context between the "main" CPU and the FPGA and then back again (Amdahl's law basically). When the cost of moving the data around is less than the scaling benefit of designing a circuit to do the specific computation, then adding an FPGA is a winning move.

[1] https://www.xilinx.com/products/design-tools/vivado/prod-advantage/rtl-synthesize.html

[2] http://www.stretchinc.com/index.php

[3] https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html

[4] https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01029.pdf

Yeah, the High-Level Synthesis work goes back to the 1990's. Wikipedia has a nice list:

https://en.wikipedia.org/wiki/High-level_synthesis

They didn't really pan out in terms of large uptake. They did solve some problems pretty well with shorter developer time. There have also been open source ones designed. Here's a few:

http://legup.eecg.utoronto.ca

https://www.synflow.com

Many of the other ones seem dead or don't have a software page. I'm not sure what's current in FOSS HLS with C-like languages. Anyone know of other developments at least as good as the two I linked?

reply

fulafel 15 hours ago [-]

Nitpick: LegUp is published under a "no commercial use" license. Synflow mentions they use an "open-source language" but it is not clear how much of the tool chain is open source.

reply

---

events/locks need a primitive where only ONE thread, the one that next gets the lock, gets woken up when a lock is released and multiple threads are waiting on that lock. O/w you get this:

" A lock convoy occurs when multiple threads of equal priority contend repeatedly for the same lock.[1] Unlike deadlock and livelock situations, the threads in a lock convoy do progress; however, each time a thread attempts to acquire the lock and fails, it relinquishes the remainder of its scheduling quantum and forces a context switch. The overhead of repeated context switches and underutilization of scheduling quanta degrade overall performance. ...

Example

Critical sections as implemented in Microsoft Windows operating systems provide a good example of how lock convoys can occur. In Windows, critical sections use a combination of a spinlock and a kernel synchronization object called an "event" to ensure mutual exclusion. For low-contention critical sections, the spinlock will provide mutual exclusion most of the time, falling back on the event only when a thread fails to acquire the spinlock within a certain amount of time. When contention is high, however, it is possible for many threads to fail to acquire the spinlock and enter a waiting state, all waiting on the same event.

When the event is signaled, all threads that are waiting on the event are woken, but only one will be allowed to acquire the critical section and continue execution; the remaining threads will each block again. " [12]
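(A minimal sketch of such a wake-one primitive using POSIX condition variables: pthread_cond_signal wakes (at least) one waiter, where pthread_cond_broadcast would wake all of them and reproduce the herd behavior described above:)

  #include <pthread.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
  static int available = 0;

  void release(void) {
      pthread_mutex_lock(&m);
      available = 1;
      pthread_cond_signal(&c);    /* wake one waiter, not all of them */
      pthread_mutex_unlock(&m);
  }

  void acquire(void) {
      pthread_mutex_lock(&m);
      while (!available)          /* re-check: a wakeup is a hint, not a handoff */
          pthread_cond_wait(&c, &m);
      available = 0;
      pthread_mutex_unlock(&m);
  }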

---

This explains a Windows design decision for a non-fair scheduler to increase performance. For Oot, we wouldn't do this; round-robin scheduling is simpler and we usually prioritize simplicity over performance:

[13]

---

neelance commented on Jun 29

The upcoming Go 1.11 release will have experimental support for WebAssembly. This will include full support for all of Go's features, including goroutines, channels, etc. However, the performance of the generated WebAssembly is currently not that good.

This is mainly because of the missing goto instruction. Without the goto instruction we had to resort to using a toplevel loop and jump table in every function. Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function. The relooper can not help with this, only a goto instruction can.

It is awesome that WebAssembly got to the point where it can support a language like Go. But to be truly the assembly of the web, WebAssembly should be equally powerful as other assembly languages. Go has an advanced compiler which is able to emit very efficient assembly for a number of other platforms. This is why I would like to argue that it is mainly a limitation of WebAssembly and not of the Go compiler that it is not possible to also use this compiler to emit efficient assembly for the web.

kripken commented on Jun 29

    Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function.

Just to clarify, a regular goto would not be enough for that, a computed goto is required for your use case, is that correct?

neelance commented on Jun 29

I think a regular goto would probably be sufficient in terms of performance. Jumps between basic blocks are static anyways and for switching goroutines a br_table with gotos in its branches should be performant enough. Output size is a different question though.

kripken commented on Jun 29

It sounds like you have normal control flow in each function, but also need the ability to jump from the function entry to certain other locations in the "middle", when resuming a goroutine - how many such locations are there? If it's every single basic block, then the relooper would be forced to emit a toplevel loop that every instruction goes through, but if it's just a few, that shouldn't be a problem. (That's actually what happens with setjmp support in emscripten - we just create the extra necessary paths between LLVM's basic blocks, and let the relooper process that normally.)

neelance commented on Jun 29

Every call to some other function is such a location and most basic blocks have at least one call instruction. We're more or less unwinding and restoring the call stack.

kripken commented on Jun 29

I see, thanks. Yeah, I agree that for that to be practical you need either static goto or call stack restoring support (which has also been considered).

Heimdell commented on Jun 30

Will it be possible to call a function in CPS style or implement call/cc in WASM?

rossberg commented on Jul 9

@Heimdell, support for some form of delimited continuations (a.k.a. "stack switching") is on the road map, which should be enough for almost any interesting control abstraction. We cannot support undelimited continuations (i.e., full call/cc), though, since the Wasm call stack can be arbitrarily intermixed with other languages, including reentrant calls out to the embedder, and thus cannot be assumed to be copyable or movable.
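(To make neelance's 'toplevel loop and jump table' workaround concrete, here is a hypothetical C rendering -- mine, not Go's actual output: every function body becomes a loop over a switch on a block index, so that execution can re-enter at any basic block when a goroutine is resumed:)

  int f(int resume_block) {
      int block = resume_block;        /* 0 on a normal call */
      for (;;) {
          switch (block) {
          case 0: /* ...entry code... */ block = 1; break;
          case 1: /* ...loop body...  */ block = 2; break;
          case 2: /* ...exit code...  */ return 0;
          }
      }
  }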

---

C compiler with support for structs written in Assembly (nongnu.org) https://bootstrapping.miraheze.org/wiki/Stage0 http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage2/cc_x86.s http://git.savannah.nongnu.org/cgit/stage0.git/tree/ISA_HEX_Map.org https://news.ycombinator.com/item?id=17851311

https://github.com/oriansj/knight-vm https://github.com/oriansj/knight-vm/blob/master/vm_instructions.c

https://bootstrapping.miraheze.org/wiki/Main_Page

mmastrac 1 day ago [-]

I started working on a similar VM-based bootstrap for getting from bare metal to C compiler. If anyone is interested in collaborating let me know.

The idea is that you implement a simple, ASCII-based VM for your platform and that's enough to bootstrap a number of different assemblers, to a basic C compiler, to a full-fledged C compiler and (very minimal) POSIX environment.

The goal is twofold: a base for trusted compilation like this one, and a way to guarantee long-term executability of various archiving programs (ie: guarantee that you can unrar a file in 20 years with minimal work).

EDIT: very rough repo is here - https://github.com/mmastrac/bootstrap

reply

giomasce 1 day ago [-]

I am also doing something like that here:

https://gitlab.com/giomasce/asmc

My philosophy is of having a very simple and custom language (which I called G) for which it is easier to write a compiler in (x86) Assembly, then write a C compiler in G that is good enough to compile tcc. Then tcc is known to be able to compile gcc 4.7.4 (the last version which does not require a C++ compiler).

My C compiler is starting to have a shape, and in theory (if I find the time to work on it) it should not be far from supporting all that tcc needs.

The README in the linked page contains more information.

I'll look at your code too!

reply

---

https://github.com/darklife/darkriscv

"implements most of the RISC-V RV32I instruction set"

implemented (in Verilog!) in one night (or mb one week?)

"The RV32I specification itself is really impressive and easy to implement (see [1], page 16). Of course, there are some drawbacks, such as the funny little-endian bus (opposed to the network oriented big-endian bus found in the 680x0 family), but after some empirical tests it is easy to make work."

not sure which instructions are implemented; i think it's in this file:

https://github.com/darklife/darkriscv/blob/master/rtl/darkriscv.v

akuma73 5 hours ago [-]

Sorry to burst any bubbles here, but this is a very incomplete implementation.

You couldn’t run anything but small toy programs on this machine. This is more like what a student would build in an undergraduate course in computer architecture.

For example, there is no MMU, no debug support, no traps, no interrupts, no exception handling, no OS privilege levels, no FP, no memory controller etc.

Of course, one wouldn’t implement all of these in a few hours.

The fact that this is RISCV is somewhat of a red herring as you could do a similar thing with a restricted subset of MIPS or ARM or even x86 as they do in UT Austin’s comp arch class.

reply

aappleby 46 minutes ago [-]

An Arduino has none of the missing parts you mention (except interrupts), yet it's quite a useful device even in non-toy applications.

reply

waterhouse 8 hours ago [-]

I carefully studied the 2.1 and 2.2 versions of the spec. I wrote, in Racket, a miniature interpreter for lists of RISC-V instructions (just symbolic lists, not the byte strings they assemble into), then wrote programs to compute triangular numbers and Fibonacci, and verified they worked. I've built the RISC-V toolchain and compiled a few test C programs and run them in "spike". I don't think I took more than a glance at the "gcc -S" output. I have no familiarity with SPARC or Motorola assembly, but I have decent familiarity with x86-64 assembly, having written what looks like a couple dozen miniature test programs and one medium-sized program, and having tried a few times to write a decent x86-64 assembler.

I found the RISC-V spec enjoyable to read, especially with all the rationales; it felt clean, minimal, and well-thought-out. Writing a program in it was probably not much different from writing in x86-64 assembly. The encodings seemed much easier to deal with than x86-64, though as mentioned I didn't go as far as writing an encoder.

What kinds of important features are present in SPARC or MC68k but not in RISC-V? Are they absent from x86-64 as well?

reply

brandmeyer 5 hours ago [-]

There are a few instructions that nearly every modern processor ended up adopting, but RISC-V still lacks. Off the top of my head, (with the RV ISA developers response in parenthesis):

So yeah, there are deficiencies. None of them are crippling, but I wouldn't say that RV is super-wham-o-dyne, either.

reply

Annatar 7 hours ago [-]

Motorola assembler is what is known as an orthogonal instruction set. The flow is logical: move.b for byte, move.w for word, move.l for longword, from source to destination, which reflects real life.

The assembler reads almost like a high level programming language. The register scheme is intuitive as well, from a0-a7 being the address, to d0-d7 being the data registers.

Now, let's do a mental exercise: I'm going to load an effective address relative to the program counter, 32-bits wide, into the 1st address register. Then, I'm going to load an arbitrary value from an arbitrary memory location into the second address register. Do the same in RISC-V; compare intuitiveness.

  lea MemoryAddress(pc), a0
  move.l $00bfe001, a1
  rts
  MemoryAddress:  DC.l 0

reply

waterhouse 7 hours ago [-]

Indeed, in RISC-V, full 32-bit constants must be broken across two instructions. The translation goes:

  ; 1. lea MemoryAddress(pc), a0
  auipc a0, [upper 20 bits of (MemoryAddress - label)]
  addi a0, a0, [lower 12 bits of (MemoryAddress - label)]
  label:
  ; 2. move.l #$00bfe001, a1
  lui a1, 0x00bfe000
  addi a1, a1, 0x001
  ; 3. rts
  jalr x0, x1, 0

It is cumbersome in that sense. But it will probably be handled by an assembler (as a macro or builtin). The spec contains an appendix of "pseudoinstructions", which gives the first translation above. There isn't even a dedicated "move" instruction—it's just "addi dst, src, 0" by convention! Clearly anyone writing assembly will use at least that pseudoinstruction.

If there's a canonical format for "pseudoinstructions", and all assemblers handle them in the same way, and the abstraction doesn't leak in any way (i.e. the only temporary registers you use are ones you overwrite fully by the end; it is true that now some "instructions" have longer encodings, but that comes with the compressed instructions anyway; and it is true that an interrupt could happen between the two halves of the "instruction", but I think that shouldn't make a difference), then I don't think there's much of a problem.

reply

Annatar 1 hour ago [-]

"Indeed, in RISC-V, full 32-bit constants must be broken across two instructions."

Any RISC processor has that issue, because encoding is fixed at 32-bits to keep the hardware simple. That's not what I'm referring to.

Look at that retardation, "auipc". Because "auipc" is intuitive, right? (For the record, I'm being sarcastic.) What the hell was the instruction designer smoking? Then there is the square bracket notation, like on intel, and intel has some of the most idiotic, non-conventional assembler syntax -- and this thing mimics something so bad? Every other processor uses parenthesis, that's a well understood norm.

Then there is more intel retardation in the form of dst, src (or dst, src, src on RISC). What kind of a warped, twisted mind came up with that? What was going on in that person's head to work in such twisted ways?

Then there's the "cherry on top":

  jalr x0, x1, 0 ; because "jalr" is intuitive as well, it immediately tells you what it does?

You'll know a bad processor design by its instruction set. Always and forever.

reply

---

http://www.darklife.org/pics/xilinx/src/core.v

"Verilog 16-bit RISC processor!"

  `define ROR 4'd0
  `define ROL 4'd1
  `define ADD 4'd2
  `define SUB 4'd3
  `define AND 4'd4
  `define OR  4'd5
  `define XOR 4'd6
  `define NOT 4'd7
  `define LOD 4'd8
  `define STO 4'd9
  `define IMM 4'd10
  `define MUL 4'd11
  `define BRA 4'd12
  `define BSR 4'd13
  `define RET 4'd14
  `define LOP 4'd15

---

it's tough to keep the # of instructions small because, yeah, 16-bit is probably the smallest power-of-two bitwidth that can be used to address reasonably sized programs, but otoh sometimes you really want to index more than 4 gigs of memory/list items (~2^32 bytes). So we seem to be inexorably led to having a plethora of data types in assembly, which greatly raises the # of instructions.

But... actually floating point can represent some integers too... and in fact it seems that a 64-bit double can exactly represent every integer up to 2^53 [14]. So it may be possible to get by with just two types: a 16-bit integer and a 64-bit float.

Very few if any assembly-like languages do that i think; even WASM has i32, i64, f32, f64.
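(A quick C check of the 2^53 claim:)

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      int64_t n = (int64_t)1 << 53;       /* 9007199254740992 */
      printf("%.0f\n", (double)(n - 1));  /* exact: below 2^53 every integer fits */
      printf("%.0f\n", (double)n);        /* exact: 2^53 itself */
      printf("%.0f\n", (double)(n + 1));  /* rounds back to 2^53: first lost integer */
      return 0;
  }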

I guess a more practical choice would be:

i16, i64, f64

and even more practical:

i8, i16, i32, i64, f64

but imo that's already a lot...

---

to reiterate some thoughts regarding the tiled/fabric CPU-node, memory-node, network-node design at the end of ootAssembly19 and mentioned in ootAssemblyThoughts:

i probably won't have time in my lifetime to muck around with hardware, even FPGAs, much, so having hardware-supported specialized CPU, memory, and network nodes is probably something i'll never get to. So really the best we could hope for is a uniform tiling of one type of node (which could maybe be, in software, specialized into CPU/memory/network nodes). So the lessons from the tiled/fabric CPU/memory/net node design should probably be distilled down to constraints on low-level memory and communications capabilities rather than used directly, even in demos. Which is pretty much what i did at the end of the relevant sections of ootAssembly19 and ootAssemblyThoughts. To reiterate some of this:

so taking the sorta-max of some of these (except packet size), for easy development given that we won't actually have a hardware implementation and so don't really know what the exact constraints are:

in terms of geometry, either:

or:

eh, let's have a payload size of 4k. We are choosing maxes here, right? If the hardware page size is 256 then the Oot VM layer has to do some networking, whatever. Also, this allows all of the constraints to be derived just from the bitwidth of 16; if each node has a one-word address, that's 64k nodes per chip, and if one packet payload can fit one bit for every node on the chip, that's a 4k payload size (65536 nodes = 65536 bits = 4096 16-bit words).

so to summarize further:

(note: this allows a data structure with one bit for every addressable node to fit within one packet payload)

(what happens when the buffer is full? do packets get overwritten? Or do packets get dropped and the sender notified? Or does the sender wait for acknowledgement which never arrives in this case? If packets can get silently dropped or acks can not arrive, then is this any better than an unreliable network?)

(is in-order streaming provided or only datagrams/packets?)

(is error correction/reliability provided/assumed? i think yes; this is supposed to be like hardware/IPC, not like external network transmission; if you wanted to access a 'real' (external) network then you'd want more than 16-bit addresses)

(does each node have multiple 'ports' for packets to be sent to/from? If so, i assume that signals don't have ports?)

Remaining Qs:

i copied those conclusions to ootAssemblyThoughts.

---

let's unify packets and signals, and also allow them to be ordered via pipes (channels)

---

dschuetz 6 minutes ago [-]

The Cell was way ahead of its time. Its security architecture capabilities were almost perfect, but they've screwed up the implementation and they failed to provide good programming tools. And, it was very very expensive.

reply

---

both WASM and LLVM have an instruction called 'unreachable' that is like our BAD instruction. So we should probably call it 'unreachable'.

---

WASM's opcodes 0 and 1 are 'unreachable' and 'NOP'. Which is nice b/c it means you can immediately identify non-instructions by comparing the opcode to 2.

---

WASM has an instruction called 'drop' that i don't quite understand yet.

---

concordance of WASM, RISC-V, LLVM, JVM, ARM Cortex M0 instructions (todo: also LuaJIT 2):

RISC-V 2.2 base 32-bit integer instruction set RV32I version 2.0:

RISC-V multiply/divide extension: mul, mulh, mulhsu, mulhu, div, divu, rem, remu

RISC-V single-precision float extension: flw fsw fadd fsub fmul fdiv fsqrt fmadd fmsub fnmsub fnmadd fmv.w.x fmv.x.w fsgnj fsgnjn fsgnjx fmin fmax feq flt fle fcvt.s.w fcvt.s.wu fcvt.w.s fcvt.wu.s fclass frcsr fscsr frrm fsrm fsrmi frflags fsflags fsflagsi

WASM: [16]

LLVM:

in addition to what is listed above, there are various 'added later' instructions, vector ops, aggregate ops including getelementptr, fence and atomic memory ops, and intrinsics

JVM:

https://en.wikipedia.org/wiki/Java_bytecode_instruction_listings

with constants: (adfil)const, bipush, sipush, ldc (a=references, d=doubles, f=float, i=integer, l=long; also b=byte, c=char, s=short below)
(adfil)(load|store), (adfil)(load|store)_(0..3)
(abcdfils)a(load|store), anewarray, arraylength, getfield, getstatic, multianewarray, new, newarray, putfield, putstatic
(dfil)(add|sub|mul|div|neg|rem)
(il)(and|or|xor|shl|shr|ushr)
(df)cmp(g|l), lcmp
(adfil)return, return, athrow, goto, goto_w, invokedynamic, invokeinterface, invokespecial, invokestatic, invokevirtual, jsr, jsr_w, lookupswitch, ret, tableswitch
if_acmp(eq|ne)
if_icmp(eq|ne|ge|gt|le|lt)
if(eq|ne|ge|gt|le|lt)
d2(fil), f2(dil), i2(bcdfls), l2(dfi)

ARM Cortex M0:

from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CIHJJEIH.html or http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0432c/CHDCICDF.html

note: these are 32-bit operations

ldr, ldr(b|h|sb|sh) (load byte|halfword|signed byte|signed halfword), ldm (load multiple), str (store), str(b|h), stm (store multiple), push/pop (push/pop registers onto/from stack)
(u|s)xt(b|h) (extend unsigned|signed byte|halfword)
cps(id|ie) (disable/enable interrupts), mrs/msr (read/write special register), bkpt (breakpoint)

ARM Cortex M4 floating-point:

VABS, VADD, VCMP, VCMPE, VCVT, VCVTR, VDIV, VLDM, VLDR, VMLA, VMLS, VMOV, VMRS, VMSR, VMUL, VNEG, VNMLA, VNMLS, VNMUL, VPOP, VPUSH, VSQRT, VSTM, VSTR, VSUB

todo mb later after taking the initial sorta-intersection, also consider 68k/Coldfire, MIPS, those educational subsets of RISC-V, GNU Lightning, RISC-V compressed, the ARM subset in this document https://community.arm.com/processors/b/blog/posts/arm-cortex-m3-and-later-basic-integer-math-operations-32-bit-1-5 , what else?

CONCORDANCE:

loads/stores:

load/store byte or halfword integer: all but JVM have some straightforward way to load an integer byte or halfword/short (two bytes), with the option to treat it as unsigned or signed (that is, to zero-extend or sign-extend it). All but the JVM can also store integer bytes and halfwords. The JVM would appear to use array operations (bytearrays) for this?

load/store 32-bit word: all 5 have 32-bit integer load/store commands

add/sub 32-bit ints: all 5 have this

negate signed integers: all 5 can do this (in RISC-V this would be SUB $0, $x, i.e. 0 - $x; note that the constant 0 is always available even to non-immediate instructions in RISC-V, by using the zero register); in WASM and LLVM i'm assuming you would likewise subtract the value from the constant 0; JVM has a negate instruction; and ARM has a weird-sounding instruction rsbs (reverse subtract; negate).

add immediate: both RISC-V and ARM and JVM; n/a for the others

load upper immediate: all but armcm0 have either a load upper immediate, or direct loading of 64-bit integers

add immediate to PC: riscv and armcm0 both have something like this. Arguably the other 3 are higher level so may not need it as much. ARM can add or subtract 4k [17].

load constant (at least 8 bits, possibly 12; also, all 5 can load a 32-bit constant within 2 instructions): (eg in riscv you would do an ADDI i think; that's a 12-bit immediate; and a LUI, which is 20 bits) (in JVM there's sipush, which i think is 16 bits; a java short?) (what about ARM? i think in ARM there are 8-bit immediates; but also i think maybe the ADR instruction can load a constant stored in the program code by computing a target address which equals PC + an immediate)

and, or, xor, shift left, shift right (logical/unsigned), shift right (arithmetic/signed): all 5

immediate versions of shifts: both risc-v and ARM; n/a for WASM and LLVM

(strangely, only RISC-V seems to have immediate XOR; its 'not' pseudoinstruction is just xori rd, rs, -1. immediates are n/a for WASM and LLVM, but immediate XOR is missing from JVM and ARM; ARM at least has a separate MVN (bitwise NOT) instruction, while the JVM seems to need to push a constant to do bitwise NOT)

(note: i'm not sure if i got the comparisons correct, but it's close enough for my purposes)

integer comparisons: RISC-V, WASM, LLVM have: lt_unsigned lt_signed (oddly, JVM has compares, but only for longs, floats, and doubles, afaict) (ARM's compare just updates condition flags) (WASM and LLVM also have eq ne le lt ge gt, both signed and unsigned) (RISC-V also has immediate versions of lt_unsigned lt_signed) (note: the Epiphany guy thought these could have been left out of RISC-V)

integer compare-and-branch, one instruction: RISC-V and JVM and ARM have beq bne blt bge (signed). WASM and LLVM and JVM have BNZ ('br_if' in WASM; BNZ and BZ are not distinguishable in LLVM's 'BR'). JVM and ARM also have ble bgt. JVM and ARM and WASM also have bz. JVM and ARM also have branch_if_lt_zero branch_if_ge_zero. JVM also has branch_if_le_zero branch_if_gt_zero. RISC-V also has bltu bgeu (unsigned).

two instruction integer compare-and-branch: All five have beq bne blt_signed bge_signed. WASM has bz. All but RISC-V have le gt (signed). All but JVM have le gt (unsigned). WASM and LLVM and ARM also have lt ge (unsigned). ARM also has branch_if_overflow.

So it sounds like a sorta-intersection would be: bnz beq bne blt_signed bge_signed. But probably cleaner to just expand this to bz bnz beq bne, ble blt bge bgt (signed), and maybe also ble blt bge bgt (unsigned). (todo: what does LuaJIT 2 have? see 'Comparison ops' in http://wiki.luajit.org/Bytecode-2.0 ; it is eq ne le lt ge gt, with two instructions, and the only number type available is f64)

compare-and-branch offset sizes: todo

absolute jump: all 5 have some sort of absolute jump/branch.

absolute jump offset sizes: todo

indirect jump: RISC-V, ARM have low-level indirect branch or jump. WASM, LLVM, JVM have stuff like br_table, indirectbr (which has a list of permissible targets), call_indirect, and invokedynamic.

indirect jump offset sizes: todo

calls: RISC-V and ARM have branch/jump with link. WASM, LLVM, JVM have higher-level calls/returns of various sorts. LLVM and JVM have exceptions; i don't see any instructions for exceptions in WASM 1.0 but i heard talk of a throw instruction in earlier versions, so i don't know if WASM still has exceptions or not.

switch: WASM, LLVM, JVM have some sort of switch or br_table

nop: all 5 have some sort of nop, although this is a pseudoinstruction in RISC-V (addi x0, x0, 0) and an intrinsic in LLVM (llvm.donothing)

fence/sync/concurrency: all but WASM have some sort of fence, barrier, or monitor

system calls, special registers: RISC-V and ARM have system calls and special registers. The others don't; presumably the ordinary higher-level calls are also used for this sort of thing. Some have breakpoints.

unreachable: LLVM and WASM have 'unreachable' instructions

rotate right: WASM, LLVM, ARM have a rotate right (or for LLVM, a funnel shift right intrinsic)

registers and stacks: RISC-V's mov is a pseudoinstruction (addi rd, rs, 0). ARM has mov. JVM has stack ops. WASM and LLVM are higher-level.

multiplication: RISC-V doesn't support multiplication without the multiply-divide extension ('M'); but if RISC-V includes the multiply-divide extension, then i think all 5 support multiply of the form: unsigned 32-bit x 32-bit -> lower 32-bits of the result (i think; i'm not 100% sure i understand the various specs here). If there is no overflow, then this same operation will work correctly on 32-bit signed twos-complement values. If there are signed values and overflow, the sign bit may be incorrect. In addition, RISC-V with the 'M' extension has instructions to get the lower-order result bits for a signed multiply and to get the higher-order result bits; so does ARM Cortex M3; the rest apparently do not offer that (except in the form of converting your 32-bit value to a 64-bit one and then doing 64-bit multiply).
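(A small C illustration of the signed/unsigned point: the low 32 bits of the product are identical either way, so one multiply instruction covers both interpretations as long as the signed result fits:)

  #include <assert.h>
  #include <stdint.h>

  int main(void) {
      int32_t  a = -7, b = 100000;
      /* multiply as unsigned: the low 32 bits equal a*b mod 2^32 */
      uint32_t lo = (uint32_t)a * (uint32_t)b;
      /* two's-complement reinterpretation recovers the signed product */
      assert((int32_t)lo == -700000);
      return 0;
  }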

div and rem: ARM M0 doesn't have div and rem, but M3 has sdiv, udiv (there is no rem, but you can use 'mls' after 'udiv' to get a remainder [18]). All the others have signed div and signed rem. All except ARM M0 and JVM also have unsigned div and unsigned rem.
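(The 'mls after udiv' trick in C: the remainder is one multiply-subtract away from the quotient, so a dedicated rem instruction isn't strictly needed:)

  #include <stdint.h>

  uint32_t urem(uint32_t a, uint32_t b) {
      /* b must be nonzero; on Cortex-M3 this can compile to udiv + mls */
      return a - (a / b) * b;
  }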

TODO here

floating point arith: ARM M0 has no floating point, but ARM M4 has it optionally (see [19]). fadd fsub fmul fdiv. Only LLVM and JVM appear to have frem. All but JVM have sqrt. TODO.

floating point load/store/convert: todo

64-bit integer load/store/convert: RISC-V with the 64I extension, rather than adding new instructions for 64-bits, respecifies the registers to 64-bits and respecifies all existing instructions to use 64-bits. Then new instructions are added for 32-bits instead: addw addiw subw slliw srliw sraiw lw sw. These new instructions are signed. 32-bit quantities are stored in registers as sign-extended 64-bit values, so 'conversion' between 32-bit and 64-bit representations is a no-op/unnecessary.

todo other ISAs

64-bit integer arith: todo

higher-level control flow in WASM, LLVM, JVM only: todo

higher-level data structures in WASM, LLVM, JVM only: todo

(in the above, recall that 'ARM' is a shortcut for 'ARM Cortex M0'; other ARM profiles have many more instructions)

summarizing the sorta-intersection ISA:

mb also: other compare and compare-and-branch variants, ror, switch, syscalls, system registers, ret, div_u, div_s, rem_u, rem_s, fadd fsub fmul fdiv fsqrt

(47 instructions so far (excluding 'mb also'), and haven't even included anything from 'div and rem' or later yet)

todo finish summarizing the sorta-intersection ISA

todo modify Boot and BootX if needed.

BELOW THIS ARE INSTRUCTIONS YET TO BE PROCESSED FOR THE CONCORDANCE

RISC-V 2.2 base 32-bit integer instruction set RV32I version 2.0:

RISC-V single-precision float extension: fmadd fmsub fnmsub fnmadd fmv.w.x fmv.x.w fsgnj fsgnjn fsgnjx fmin fmax feq flt fle fcvt.s.w fcvt.s.wu fcvt.w.s fcvt.wu.s fclass frcsr fscsr frrm fsrm fsrmi frflags fsflags fsflagsi

only in RISC-V, not ARM, and n/a in the others:

only in RISC-V:

WASM: [20]

only in WASM:

only in WASM and LLVM:

LLVM:

in addition to what is listed above, there are various 'added later' instructions, vector ops, aggregate ops including getelementptr, atomic memory ops, and intrinsics

only LLVM and JVM:

only in LLVM, although WASM has memory.*:

only in LLVM:

JVM:

https://en.wikipedia.org/wiki/Java_bytecode_instruction_listings

(dfil)(mul|div|neg|rem)
d2(fil), f2(dil), i2(bcdfls), l2(dfi)

only LLVM and JVM:

only in JVM:

(abcdfils)a(load|store), anewarray, arraylength, getfield, getstatic, multianewarray, new, newarray, putfield, putstatic
if_acmp(eq|ne)

ARM Cortex M0:

from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CIHJJEIH.html or http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0432c/CHDCICDF.html

note: these are 32-bit operations

(u|s)xt(b|h) (extend unsigned|signed byte|halfword)

ARM Cortex M4 floating-point:

"The FPU fully supports single-precision add, subtract, multiply, divide, multiply and accumulate, and square root operations. It also provides conversions between fixed-point and floating-point data formats, and floating-point constant instructions."

VADD, VSUB, VMUL, VDIV, VNEG, VABS, VSQRT; VCMP, VCMPE (compare; the E variant raises an exception even if either operand is a quiet NaN, o/w it raises an exception only if either operand is a signaling NaN); VCVT, VCVTR (convert; 'R' means to use the selected rounding mode instead of rounding towards 0 -- only applicable to conversion from float to int); loads and stores and movs: VLDR, VSTR, VMOV; multiple load/store: VLDM, VSTM; special register load/store: VMRS (load from special), VMSR (store to special)

" 7.2.5 Complete implementation of the IEEE 754 standard

The Cortex-M4 FPU supports fused MAC operations as described in the IEEE standard. For complete implementation of the IEEE 754-2008 standard, floating-point functionality must be augmented with library functions. The Cortex-M4 floating point instruction set does not support all operations defined in the IEEE 754-2008 standard. Unsupported operations include, but are not limited to the following:

    Remainder.
    Round floating-point number to integer-valued floating-point number.
    Binary-to-decimal conversions.
    Decimal-to-binary conversions.
    Direct comparison of single-precision and double-precision values."

" The FPU sets the cumulative exception status flag in the FPSCR register as required for each instruction, in accordance with the FPv4 architecture. The FPU does not support exception traps. The processor also has six output pins, FPIXC, FPUFC, FPOFC, FPDZC, FPIDC, and FPIOC, that each reflect the status of one of the cumulative exception flags. See the Cortex®‑M?4 Integration and Implementation Manual for a description of these outputs. "

only in ARM:

(push|pop) (push/pop registers onto/from stack)

note: RISC-V 'G' extension is shorthand for the base + M+A+F+D, that is, base + multiply/divide + atomic + single and double precision. This suggests that we should prioritize those instructions somewhat.

Also, should prioritize instructions in the RISC-V 'C' (compressed) extension, although the spec for that extension also talks about argument form constraints:

" shorter 16-bit versions of common 32-bit RISC-V instructions when:

the argument form constraints perhaps should inform our choice of 8-bit 'shortcut' instructions in BootX.

---

hmm i keep thinking it might be a little quixotic to make Boot a 16-bit ISA. Everyone else uses i32.

but i do also think that for massive concurrency, 16-bit ISAs make more sense.

---

in Boot, may want to expose only 15 or 14 registers instead of 16, so that the compiler has a register or two to work with, e.g. to synthesize 'BZ' instructions on RISC-V.

---

interesting:

http://forum.6502.org/viewtopic.php?f=2&t=2541

" Hi there,

I recently found a really nice article on static recompilation of 6502 code into LLVM compiler intermediate representation, and the difficulties that come with it.

Here's my write up: https://plus.google.com/108984290462000253857/posts/DPHzWoosXYS and here's the original (lengthy, but worthy) article: http://andrewkelley.me/post/jamulator.html

What I find interesting is how to resolve the static recompilation issues posed by indirect addressing modes, stack-based jumps (like pushing the address of a function from a table onto the stack and then RTS). His final words on that are that it's probably easier to use just-in-time compilation, but I wonder if there still is a way for static recompilation - be it with manually added hints, or simulation at analysis time to find all possible code paths.

What do you think? André

"

---

already copied to plChAssemblyFrequent...:

" IEEE Spectrum in the early 80's had an article on minimal instruction sets. The author made a salient point that 7 instructions did most of the heavy lifting, another 7 did the most of the rest required. He did highlight that one mistake made with instruction sets was not separating the addressing modes from the instruction itself. ... I have managed to rummage around and found the article. It was actually in IEEE Micro May 1982, "A Unique Microprocessor Instruction Set" Dennis A Fairclough.

The article looked at the statistical usage of instructions and placed them into 8 broad groups, Data Movement, Program Modifying, Arithmetic, Compare, Logical, Shift, Bit and I/O & Misc groups.

The result of the analysis was as follows:

For groups data movement and Program Modifying, 1 Instruction - MOVE [Cumulative usage 75%]

For group Arithmetic, 4 Instructions - ADD, SUB, MULT, DIV [Cumulative usage 87.5%]

For group Compare, 1 previous Instruction - SUB [Cumulative usage 93.75%]

For group Logical, 3 Instructions - AND, OR, XOR [Cumulative usage 96.88%]

For group Shift, 1 Instruction - SHIFT [Cumulative usage 98.44%]

For group Bit, 1 Instruction - MOVEB [Cumulative usage 99.22%]

For group I/O & Misc, depends on whether i/o is memory mapped or otherwise, so either 0 or 1 Instruction.

Some possible extended instructions included INC, DEC, I/DBRC and MOVEM. "

---

i haven't quite finished up the 'concordance' above yet, but it's already seeming that Boot is kind of weird, in that it doesn't support all of the things in the concordance, yet it has a bunch of other stuff like the stack ops and the pointer-typed operations. The I/O stuff i feel fits fine, though. Also, it has almost everything in the concordance, notably excepting everything having to do with floating point and larger integers -- this may be enough.

also doing everything in words instead of bytes is kind of funky, but necessary for portability.

on the one hand, the stack ops seem too high-level and far away from the goal of a 'sort-of intersection' of common platform instruction sets. On the other hand, we want the stack to be opaque for portability/interop, so we can't just implement it manually.

on the one hand, there is a bunch of I/O stuff that is far away from the goal of a 'sort-of intersection' of common platform instruction sets. On the other hand, it's called 'Boot' because it's supposed to be used for bootstrapping a self-hosting compiler, which means that we need some way to do I/O.

maybe what we really need is another layer beneath Boot -- a layer with only the 'sort-of intersection' of common platform instruction sets, with no stack ops, no I/O, and no idiosyncrasies -- i guess this would only be good for pure computation.

even that wouldn't get past the funkiness of having ptrs and ints have different instructions, though (most CPUs use the same instructions to operate on pointers and ints; that is to say, pointers are usually just unsigned ints).

also, CAS and fence are a bit funky; unlike machine ISAs, programming-language VMs usually don't need these because they usually don't default to relaxed memory orderings. Still, they are good for our purposes.

and on the other hand, the stack ops, at least, don't really hurt our goals because they are easy to implement on any target platform.

anyway, we have about 56 instructions in Boot, and about 128 opcodes in BootX. It seems to me that the sorta-intersection of common platforms is shaping up to look like about 50 instructions or so (many of which are already in Boot). So it looks like we can fit this in BootX with room to spare (i'd probably like to use this extra room to add some alternate models of computation, eg combinator calculus, reversible gates, mb logic programming, mb some vector/matrix ops like scatter/gather, some loops and switches, etc; and/or maybe extend to the sorta-union of common platforms). Remember that the 128 opcodes give us 128 3-operand instructions, but we can have many more if we convert some of these to 2-operand instructions (each 3-operand opcode converts into 16 2-operand opcodes when one 4-bit operand field is reused as a sub-opcode). On the other hand, writing that, it seems like a lot to ask; i still need to actually just write out the logical extension of the (majority-vote-of-common-platforms) ISA and see how much space we have left after that. That would include stuff like:

on the other hand, the reason we aren't just using RISC-V or WASM is b/c i thought those were too complicated and hence hard to implement. But now we're ending up with as many instructions as they have!

also, i want to keep in mind the interesting higher-level control structures etc seen in WASM (and maybe LLVM) -- stuff like that will be useful for the next language up in the stack.

hmm so it seems that i am veering closer to having TWO MORE languages in this stack:

eh, after writing that, i think OVM and the '?' below it can still be merged. The upshot is that this is just a reminder that the language supported by OVM should be really simple.

Also, how would Boot Core work exactly? Would we use a sequentially consistent memory ordering and so remove CAS and FENCE? Would we declare that Boot Core is not only pure computation, but pure computation without concurrency? Would we still have opaque pointers? Would we still have opaque stacks? If we still have opaque pointers and opaque stacks, are we really gaining enough by eliminating the few I/O operations to justify the overhead of having a separately-named 'Boot Core'?

and i can't help but think that if we would just allow segmented memory, with a base pointer that is implicitly added to integer offsets, and integer offsets taking the place of addresses, then we could get rid of pointer-typed pointer arithmetic instructions. This seems funkier than it's worth, though.
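(To make that concrete, a hypothetical sketch -- the names are illustrative only: programs would hold bare integer offsets, and every load/store would implicitly add an opaque per-segment base:)

  #include <stdint.h>

  static uint16_t *segment_base;  /* set up by the runtime; opaque to programs */

  uint16_t load(uint16_t offset)              { return segment_base[offset]; }
  void     store(uint16_t offset, uint16_t v) { segment_base[offset] = v; }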

i dunno. i don't think Boot Core is worth it as some separate thing. Maybe just as the name for a whitelist of pure computation instructions within Boot.

---

floating point support presents a conundrum.

what i'd like to do is have floating point instructions in BootX, but not in Boot. But since BootX is supposed to be compilable to Boot, this means that the toolchain that compiles BootX to Boot must include a soft floating point implementation. This is undesirable for a few reasons:

so, i think i'd rather make floating point support in BootX optional, a 'profile' or 'extension' sort of thing.

so, a BootX->Boot toolchain could have optional soft floating point support. But if you exclude this, you are still allowed to call your implementation a standard-compliant implementation of BootX.

---

remember to check the plChArithmeticImpl subsections under 'De-facto IEEE 754' to see which parts of IEEE 754 are often ignored

---

should have a Boot and BootX and OVM interpreter, too

---

since we have an immediate addressing mode in BootX, we don't really need the zero register anymore. So in fact this register can be used by the toolchain. So maybe we only need to reserve one register, not 2. Otoh on architectures with 16 regs, one of which is a constant zero register, this doesn't really give us a spare, so maybe still reserve 2.

---

in BootX, we could make fuller use of what happens when the output operand is in the immediate constant addressing mode. We could either use this to expand the space of two-operand opcodes (two-operand opcode space x16), or we could use it to quickly output one word at a time to various standard streams (with outputting to stream zero being a synonym for discard, which may allow the toolchain to internally use register 0 for something else), or some combination (eg 0 thru 3 for special streams, the rest for two-operand opcodes x12).

but we already have plenty of two-operand opcodes. So maybe let the program assign streams to 5-8 and use those as wordwise device output?

---

i was thinking the other day that Boot is just a bare-bones portable assembly language that can be implemented really quickly, and BootX adds in the other instructions that typical assembly languages have.

But then i remember that if we have room, we may also want to put in SECD-machine-like stuff and higher-order functions such as function composition, map, reduce, etc (and the stuff in Python Copperhead) (and the stuff in Simplicity) (and maps, arrays) (and the control flow stuff in WASM) (and Forth stuff) (and basic Scheme stuff). But i don't know if we have room yet. So the jury is still out on this.

Also there's a consideration of how high-level we want to be. If we're composing functions then doesn't that require transparent memory allocation (and hence a runtime) on many platforms?

---

"

KMag 24 minutes ago [-]

Second, on architectures designed with 4-operand fused multiply-add (FMA4) from the start, and a zero-register (like the Alpha's r31, SPARC's g0, MIPS's r0, etc.), why not make the zero-register instead an identity-element-register that acts as a zero when adding/subtracting and a one when multiplying/dividing? An architecture could optimize an FMA to a simple add, a simple multiply, or simply a move (depending on identity-element-register usage) in the decode stage, or a minimal area FPGA implementation could just run the full FMA. This avoids using up valuable opcodes for these 3 operations that can just be viewed as special cases of FMA. Move: rA = 1.0 * rC + 0.0. Add: rA = 1.0 * rC + rD. Multiply: rA = rB * rC + 0.0. FMA: rA = rB * rC + rD. "
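(A toy C rendering of that idea, all names hypothetical: reads of the identity register yield the identity element for the slot they appear in, so move, add, and multiply all fall out of a single FMA datapath:)

  enum { IDENT_REG = 31 };                  /* the repurposed zero-register */

  static double read_mul_operand(const double *r, int n) {
      return n == IDENT_REG ? 1.0 : r[n];   /* identity for multiplicand slots */
  }
  static double read_add_operand(const double *r, int n) {
      return n == IDENT_REG ? 0.0 : r[n];   /* identity for the addend slot */
  }

  /* rA = rB * rC + rD; covers move, add, multiply, and full FMA */
  static void exec_fma4(double *r, int rA, int rB, int rC, int rD) {
      r[rA] = read_mul_operand(r, rB) * read_mul_operand(r, rC)
            + read_add_operand(r, rD);
  }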

---

" gpderetta 7 hours ago [-]

...Separate load and op instructions and fixed size instructions are pretty much the only things left differentiating RISC and CISC architectures (there is nothing reduced about modern RISCs), so I do not think the claim that x86 CPUs are RISC inside does hold. "

---

" Almost every architecture has now added SIMD vector extensions, including SPARC (VIS), x86 (MMX/SSE/AVX), POWER/PowerPC? (AltiVec?) and ARM (NEON). Only relatively recent processors from each architecture can execute some of these new instructions, however, which raises backward-compatibility issues, especially on x86 where the SIMD vector instructions evolved somewhat haphazardly (MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX, AVX2). "

---

i don't think i need to read this but it does shed a little light on the primitive operations needed for concurrency: [21]