proj-oot-ootAssemblyNotes22

[1]

" Results

After recompiling the program with libctiny and the method above, the EXE jumped from a giant 64K to a much more reasonable 4K! (4096 bytes to be exact). For comparison, the entire code section of the linker map file is reproduced below:

"
  0001:00000000 ?DumpFile@@YAXPAD@Z 00401000 f hd.obj
  0001:0000013d _main 0040113d f hd.obj
  0001:0000021a _fopen 0040121a f libct:file.obj
  0001:000002a7 _fread 004012a7 f libct:file.obj
  0001:000003c2 _fwrite 004013c2 f libct:file.obj
  0001:0000048b _fgetc 0040148b f libct:file.obj
  0001:000004b6 _printf 004014b6 f libct:printf.obj
  0001:000004ef _memset 004014ef f libct:memory.obj
  0001:0000050e __doexit 0040150e f libct:initterm.obj
  0001:0000053a _mainCRTStartup 0040153a f libct:crt0tcon.obj
  0001:000005f5 _malloc 004015f5 f libct:alloc.obj
  0001:00000607 __init_args 00401607 f libct:argcargv.obj
  0001:00000705 __ismbcspace 00401705 f libct:isctype.obj "

---

so just to remind myself, next steps are:

---

Kragen Javier Sitaker compares Fibonacci in various target languages:

https://web.archive.org/web/20120811013032/http://lists.canonical.org/pipermail/kragen-tol/2007-September/000871.html (alternate link: http://www.1strecon.org/downloads/Forth_Resources/ByteCodeInterpretters_4_TinyComputers.pdf ) discussion:

http://thread.gmane.org/gmane.culture.people.kragen.thinking/94

(note to self: yes, i already copied this to plPartTargetLanguages)

---

i'm leaning towards making the 'programmable instructions' just compiler macros, or even just a compiler technology used to compile BootX to Boot in the reference compiler.

In which case, maybe the BootX spec should just reduce the SMALLSTACK size (to reserve some of the Boot SMALLSTACK for compiler use) and/or reduce the number of registers (to reserve one or more registers for compiler use). Of course, you can still form syntactically correct BootX instructions which use the extra stack locs or registers (these can be reserved for compiler use).

---

should we add an instruction

Pop_jump_if_false?

---

xenadu02 1 hour ago [-]

> the kernel has an easier time of single-handedly cleaning up all file descriptors, memory mappings, etc. in one fell swoop inside execve()

fork() / execve() isn't a good mechanism for launching new processes; it's one of the worst aspects of early Unix design.

A properly designed syscall should require the caller to specify the desired behavior explicitly:

1. Do I want to inherit file descriptors, etc, or do I want a clean slate?
2. Do I want a separate address space or to share address space?
3. What happens to threads, mutexes, locks, etc in the new child process?
4. If a separate address space, should I inherit my parent's VM mappings or should I just load and execute a new binary image?

That gives flexibility to the caller and maximum information to the kernel to optimize. Instead we have a bunch of ad-hoc patches like vfork(), clone(), and posix_spawn() (+ non-standard attributes like POSIX_SPAWN_CLOEXEC_DEFAULT, POSIX_SPAWN_SETEXEC, and POSIX_SPAWN_USEVFORK).

reply
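(For concreteness: the closest POSIX currently comes to 'specify the desired behavior explicitly' is posix_spawn with its file-actions and attributes objects -- one of the very ad-hoc patches the comment criticizes. A minimal sketch, with a made-up path and arguments:)

  #include <fcntl.h>
  #include <spawn.h>
  #include <unistd.h>

  extern char **environ;

  int main(void) {
      pid_t pid;
      posix_spawn_file_actions_t fa;
      posix_spawn_file_actions_init(&fa);
      /* state explicitly what the child's stdout should be, instead of
         implicitly inheriting whatever the parent happened to have open */
      posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "/tmp/ls.out",
                                       O_WRONLY | O_CREAT | O_TRUNC, 0644);
      char *argv[] = { "ls", "-l", NULL };
      int rc = posix_spawn(&pid, "/bin/ls", &fa, NULL, argv, environ);
      posix_spawn_file_actions_destroy(&fa);
      return rc;
  }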

---

" There are many mechanisms for communicating information between user-space applications and the kernel. System calls and pseudo-filesystems such as /proc and /sys are of course the most well known. Signals are similarly well known; the kernel employs signals to inform a process of various synchronous or asynchronous events—for example, when the process tries to write to a broken pipe or a child of the process terminates.

There are also a number of more obscure mechanisms for communication between the kernel and user space. These include the Linux-specific netlink sockets and user-mode helper features. Netlink sockets provide a socket-style API for exchanging information with the kernel. The user-mode helper feature allows the kernel to automatically invoke user-space executables; this mechanism is used in a number of places, including the implementation of control groups and piping core dumps to a user-space application.

The auxiliary vector, a mechanism for communicating information from the kernel to user space, has remained largely invisible until now. However, with the addition of a new library function, getauxval(), in the GNU C library (glibc) 2.16 release that appeared at the end of June, it has now become more visible. " [5]
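(A minimal C example of the new call; AT_PAGESZ and AT_HWCAP are standard auxv entries on Linux:)

  #include <stdio.h>
  #include <sys/auxv.h>

  int main(void) {
      /* two values the kernel passed to this process in the auxiliary vector */
      printf("page size: %lu\n", getauxval(AT_PAGESZ));
      printf("hw capability bitmask: %#lx\n", getauxval(AT_HWCAP));
      return 0;
  }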

---

In deciding to use WASM, the Ethereum project also considered:

[6]

so it seems like i'm looking at the right places...

---

i checked out the Ethereum project's WASM-based VM, EWASM, and afaict it doesn't actually deprecate any WASM instructions... [7] [8]

it does have this list of opcodes but i think these are EVM opcodes, not WASM?

https://github.com/ewasm/ewasm-kernel/blob/master/opcodes.js

However https://github.com/ethereum/EIPs/issues/48 (which is outdated [9]) does have some deprecations:

Furthermore that EIP has a list of instructions:

Integer Operations

Flow Control

Memory:

Actually the other URL does link to a (longer) instruction list, at [10]


i really like this EVM proposal to get rid of dynamic jumps, discussed in plChMiscIntermedLangs, section 'EVM: proposal to remove dynamic jumps to assist static analysis'.

probably not at the BootX level (not sure tho...), but maybe at the next level up.

The EVM proposal seems to provide a very minimal set of primitive static jumps that let you do most stuff. In order to do this, they explicitly delimit subroutine boundaries and arity, and define simple semantics for frame pointers, a return stack containing frame pointers, etc; this also lets them define 'getlocal' and 'putlocal' instructions.
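(A rough C sketch of those semantics as i understand them -- the names here are mine, not the EIP's: calls push (return address, frame pointer) pairs on a separate return stack, the frame pointer is derived from the declared arity, and getlocal/putlocal index off the frame pointer:)

  typedef struct { int ret_addr; int fp; } frame;
  static frame return_stack[1024];
  static int rsp = 0;             /* return-stack pointer */
  static int stack[65536];        /* operand stack; locals live here too */
  static int fp = 0, sp = 0;      /* frame pointer, operand-stack pointer */

  static void jumpsub(int *pc, int dest, int arity) {
      return_stack[rsp++] = (frame){ *pc + 1, fp };
      fp = sp - arity;            /* locals begin at the first argument */
      *pc = dest;                 /* dest is a static constant: analyzable */
  }
  static void returnsub(int *pc) {
      frame f = return_stack[--rsp];
      fp = f.fp;
      *pc = f.ret_addr;
  }
  static int  getlocal(int n)        { return stack[fp + n]; }
  static void putlocal(int n, int v) { stack[fp + n] = v; }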

---

kinda interesting. assembler with label and macro support, based on Coq:

[11]

---

mac012345 says: June 25, 2018 at 7:08 am

Does the interrupt latency take register saving into account for the risc-v core? As there are 32 registers and no multiple load/store instruction, it means that if it’s not done by hardware you can have up to 32 cycles latency (depending on EABI), quite different from the 6 cycle figure given…

Dmitry Grinberg says: June 25, 2018 at 10:50 am

The point the OP is making remains valid. Comparing their “6 cycles” with no reg saving is unfair to Cortex-M, whose published latency of 12-13 cycles DOES include saving regs.

Bruce Hoult (@BruceHoult) says: June 25, 2018 at 11:07 am

The time quoted is to get into a C function compiled with the “interrupt” attribute. That means the function itself will save only the registers it actually needs. A simple handler might only need to save one register.

If the interrupt handler function needs more extensive processing and calls normal C functions (i.e. without the “interrupt” attribute) then it will automatically save the volatile registers first. In the standard *nix ABI that’s a0-a7, t0-t6, and the return address (16 registers). An EABI is in the process of being defined. It’s probably going to end up needing to save 7 registers.

Note that a load/store multiple instruction makes your program smaller, but it doesn’t make it any faster. The same goes for hardware register stacking.

asdf says: June 25, 2018 at 2:35 pm

The Cortex-M also sets up nested interrupts in those same 12 cycles. It’s too bad the RISC-V designers never seem to have looked at what makes Cortex-M such an elegant microcontroller architecture. As it stands, it has all the drawbacks and awkwardness of MIPS32.

---

ChuckMcM 16 hours ago

on: Why Use an FPGA Instead of a CPU or GPU?

The author isn't aware apparently of over a decade of work done in FPGA/CPU integration. Both in more sequential languages like System C[1] or in extended instruction set computing like the Stretch[2]. Not to mention the Zynq[3] series where the CPU is right there next to the FPGA fabric.

For "classic" CPUs (aka x86 cpus with a proprietary frontside bus and southbridge bus) it can be challenging to work an FPGA into the flow. But the problem was pretty clear in 2007 in this Altera whitepaper[4] where they postulate that not keeping up with Moore's law means you probably end up specializing the hardware for different workloads.

The tricky bit on these systems is how much effort/time it takes to move data and context between the "main" CPU and the FPGA and then back again (Amdahl's law basically). When the cost of moving the data around is less than the scaling benefit of designing a circuit to do the specific computation, then adding an FPGA is a winning move.

[1] https://www.xilinx.com/products/design-tools/vivado/prod-advantage/rtl-synthesize.html

[2] http://www.stretchinc.com/index.php

[3] https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html

[4] https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01029.pdf

Yeah, the High-Level Synthesis work goes back to the 1990's. Wikipedia has a nice list:

https://en.wikipedia.org/wiki/High-level_synthesis

They didn't really pan out in terms of large uptake. They did solve some problems pretty well with shorter developer time. There have also been open source ones designed. Here's a few:

http://legup.eecg.utoronto.ca

https://www.synflow.com

Many of the other ones seem dead or don't have a software page. I'm not sure what's current in FOSS HLS with C-like languages. Anyone know of other developments at least as good as the two I linked?

reply

fulafel 15 hours ago [-]

Nitpick: LegUp is published under a "no commercial use" license. Synflow mentions they use an "open-source language" but it is not clear how much of the tool chain is open source.

reply

---

events/locks need a primitive where only ONE thread, the one that next gets the lock, gets woken up when a lock is released and multiple threads are waiting on that lock. O/w you get this:

" A lock convoy occurs when multiple threads of equal priority contend repeatedly for the same lock.[1] Unlike deadlock and livelock situations, the threads in a lock convoy do progress; however, each time a thread attempts to acquire the lock and fails, it relinquishes the remainder of its scheduling quantum and forces a context switch. The overhead of repeated context switches and underutilization of scheduling quanta degrade overall performance. ...

Example

Critical sections as implemented in Microsoft Windows operating systems provide a good example of how lock convoys can occur. In Windows, critical sections use a combination of a spinlock and a kernel synchronization object called an "event" to ensure mutual exclusion. For low-contention critical sections, the spinlock will provide mutual exclusion most of the time, falling back on the event only when a thread fails to acquire the spinlock within a certain amount of time. When contention is high, however, it is possible for many threads to fail to acquire the spinlock and enter a waiting state, all waiting on the same event.

When the event is signaled, all threads that are waiting on the event are woken, but only one will be allowed to acquire the critical section and continue execution; the remaining threads will each block again. " [12]
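(A minimal sketch of such a wake-one primitive using POSIX condition variables: pthread_cond_signal wakes (at least) one waiter, where pthread_cond_broadcast would wake all of them and reproduce the herd behavior described above:)

  #include <pthread.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
  static int available = 0;

  void release(void) {
      pthread_mutex_lock(&m);
      available = 1;
      pthread_cond_signal(&c);    /* wake one waiter, not all of them */
      pthread_mutex_unlock(&m);
  }

  void acquire(void) {
      pthread_mutex_lock(&m);
      while (!available)          /* re-check: a wakeup is a hint, not a handoff */
          pthread_cond_wait(&c, &m);
      available = 0;
      pthread_mutex_unlock(&m);
  }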

---

This explains a Windows design decision for a non-fair scheduler to increase performance. For Oot, we wouldn't do this; round-robin scheduling is simpler and we usually prioritize simplicity over performance:

[13]

---

neelance commented on Jun 29

The upcoming Go 1.11 release will have experimental support for WebAssembly. This will include full support for all of Go's features, including goroutines, channels, etc. However, the performance of the generated WebAssembly is currently not that good.

This is mainly because of the missing goto instruction. Without the goto instruction we had to resort to using a toplevel loop and jump table in every function. Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function. The relooper can not help with this, only a goto instruction can.

It is awesome that WebAssembly got to the point where it can support a language like Go. But to be truly the assembly of the web, WebAssembly should be equally powerful as other assembly languages. Go has an advanced compiler which is able to emit very efficient assembly for a number of other platforms. This is why I would like to argue that it is mainly a limitation of WebAssembly and not of the Go compiler that it is not possible to also use this compiler to emit efficient assembly for the web.

kripken commented on Jun 29

    Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function.

Just to clarify, a regular goto would not be enough for that, a computed goto is required for your use case, is that correct?

neelance commented on Jun 29

I think a regular goto would probably be sufficient in terms of performance. Jumps between basic blocks are static anyways and for switching goroutines a br_table with gotos in its branches should be performant enough. Output size is a different question though.

kripken commented on Jun 29

It sounds like you have normal control flow in each function, but also need the ability to jump from the function entry to certain other locations in the "middle", when resuming a goroutine - how many such locations are there? If it's every single basic block, then the relooper would be forced to emit a toplevel loop that every instruction goes through, but if it's just a few, that shouldn't be a problem. (That's actually what happens with setjmp support in emscripten - we just create the extra necessary paths between LLVM's basic blocks, and let the relooper process that normally.)

neelance commented on Jun 29

Every call to some other function is such a location and most basic blocks have at least one call instruction. We're more or less unwinding and restoring the call stack.

kripken commented on Jun 29

I see, thanks. Yeah, I agree that for that to be practical you need either static goto or call stack restoring support (which has also been considered).

Heimdell commented on Jun 30

Will it be possible to call a function in CPS style or implement call/cc in WASM?

rossberg commented on Jul 9

@Heimdell, support for some form of delimited continuations (a.k.a. "stack switching") is on the road map, which should be enough for almost any interesting control abstraction. We cannot support undelimited continuations (i.e., full call/cc), though, since the Wasm call stack can be arbitrarily intermixed with other languages, including reentrant calls out to the embedder, and thus cannot be assumed to be copyable or movable.
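(To make neelance's 'toplevel loop and jump table' workaround concrete, here is a hypothetical C rendering -- mine, not Go's actual output: every function body becomes a loop over a switch on a block index, so that execution can re-enter at any basic block when a goroutine is resumed:)

  int f(int resume_block) {
      int block = resume_block;        /* 0 on a normal call */
      for (;;) {
          switch (block) {
          case 0: /* ...entry code... */ block = 1; break;
          case 1: /* ...loop body...  */ block = 2; break;
          case 2: /* ...exit code...  */ return 0;
          }
      }
  }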

---

C compiler with support for structs written in Assembly (nongnu.org) https://bootstrapping.miraheze.org/wiki/Stage0 http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage2/cc_x86.s http://git.savannah.nongnu.org/cgit/stage0.git/tree/ISA_HEX_Map.org https://news.ycombinator.com/item?id=17851311

https://github.com/oriansj/knight-vm https://github.com/oriansj/knight-vm/blob/master/vm_instructions.c

https://bootstrapping.miraheze.org/wiki/Main_Page

mmastrac 1 day ago [-]

I started working on a similar VM-based bootstrap for getting from bare metal to C compiler. If anyone is interested in collaborating let me know.

The idea is that you implement a simple, ASCII-based VM for your platform and that's enough to bootstrap a number of different assemblers, to a basic C compiler, to a full-fledged C compiler and (very minimal) POSIX environment.

The goal is twofold: a base for trusted compilation like this one, and a way to guarantee long-term executability of various archiving programs (ie: guarantee that you can unrar a file in 20 years with minimal work).

EDIT: very rough repo is here - https://github.com/mmastrac/bootstrap

reply

giomasce 1 day ago [-]

I am also doing something like that here:

https://gitlab.com/giomasce/asmc

My philosophy is of having a very simple and custom language (which I called G) for which it is easier to write a compiler in (x86) Assembly, then write a C compiler in G that is good enough to compile tcc. Then tcc is known to be able to compile gcc 4.7.4 (the last version which does not require a C++ compiler).

My C compiler is starting to have a shape, and in theory (if I find the time to work on it) it should not be far from supporting all that tcc needs.

The README in the linked page contains more information.

I'll look at your code too!

reply

---

https://github.com/darklife/darkriscv

"implements most of the RISC-V RV32I instruction set"

implemented (in Verilog!) in one night (or mb one week?)

"The RV32I specification itself is really impressive and easy to implement (see [1], page 16). Of course, there are some drawbacks, such as the funny little-endian bus (opposed to the network oriented big-endian bus found in the 680x0 family), but after some empirical tests it is easy to make work."

not sure which instructions are implemented; i think it's in this file:

https://github.com/darklife/darkriscv/blob/master/rtl/darkriscv.v

akuma73 5 hours ago [-]

Sorry to burst any bubbles here, but this is a very incomplete implementation.

You couldn’t run anything but small toy programs on this machine. This is more like what a student would build in an undergraduate course in computer architecture.

For example, there is no MMU, no debug support, no traps, no interrupts, no exception handling, no OS privilege levels, no FP, no memory controller etc.

Of course, one wouldn’t implement all of these in a few hours.

The fact that this is RISCV is somewhat of a red herring as you could do a similar thing with a restricted subset of MIPS or ARM or even x86 as they do in UT Austin’s comp arch class.

reply

aappleby 46 minutes ago [-]

An Arduino has none of the missing parts you mention (except interrupts), yet it's quite a useful device even in non-toy applications.

reply

waterhouse 8 hours ago [-]

I carefully studied the 2.1 and 2.2 versions of the spec. I wrote, in Racket, a miniature interpreter for lists of RISC-V instructions (just symbolic lists, not the byte strings they assemble into), then wrote programs to compute triangular numbers and Fibonacci, and verified they worked. I've built the RISC-V toolchain and compiled a few test C programs and run them in "spike". I don't think I took more than a glance at the "gcc -S" output. I have no familiarity with SPARC or Motorola assembly, but I have decent familiarity with x86-64 assembly, having written what looks like a couple dozen miniature test programs and one medium-sized program, and having tried a few times to write a decent x86-64 assembler.

I found the RISC-V spec enjoyable to read, especially with all the rationales; it felt clean, minimal, and well-thought-out. Writing a program in it was probably not much different from writing in x86-64 assembly. The encodings seemed much easier to deal with than x86-64, though as mentioned I didn't go as far as writing an encoder.

What kinds of important features are present in SPARC or MC68k but not in RISC-V? Are they absent from x86-64 as well?

reply

brandmeyer 5 hours ago [-]

There are a few instructions that nearly every modern processor ended up adopting, but RISC-V still lacks. Off the top of my head, (with the RV ISA developers response in parenthesis):

So yeah, there are deficiencies. None of them are crippling, but I wouldn't say that RV is super-wham-o-dyne, either.

reply

Annatar 7 hours ago [-]

Motorola assembler is what is known as an orthogonal instruction set. The flow is logical: move.b for byte, move.w for word, move.l for longword, from source to destination, which reflects real life.

The assembler reads almost like a high level programming language. The register scheme is intuitive as well, from a0-a7 being the address, to d0-d7 being the data registers.

Now, let's do a mental exercise: I'm going to load an effective address relative to the program counter, 32-bits wide, into the 1st address register. Then, I'm going to load an arbitrary value from an arbitrary memory location into the second address register. Do the same in RISC-V; compare intuitiveness.

  lea MemoryAddress(pc), a0
  move.l $00bfe001, a1
  rts
  MemoryAddress:  DC.l 0

reply

waterhouse 7 hours ago [-]

Indeed, in RISC-V, full 32-bit constants must be broken across two instructions. The translation goes:

  ; 1. lea MemoryAddress(pc), a0
  auipc a0, [upper 20 bits of (MemoryAddress - label)]
  addi a0, a0, [lower 12 bits of (MemoryAddress - label)]
  label:
  ; 2. move.l #$00bfe001, a1
  lui a1, 0x00bfe000
  addi a1, a1, 0x001
  ; 3. rts
  jalr x0, x1, 0

It is cumbersome in that sense. But it will probably be handled by an assembler (as a macro or builtin). The spec contains an appendix of "pseudoinstructions", which gives the first translation above. There isn't even a dedicated "move" instruction—it's just "addi dst, src, 0" by convention! Clearly anyone writing assembly will use at least that pseudoinstruction.

If there's a canonical format for "pseudoinstructions", and all assemblers handle them in the same way, and the abstraction doesn't leak in any way (i.e. the only temporary registers you use are ones you overwrite fully by the end; it is true that now some "instructions" have longer encodings, but that comes with the compressed instructions anyway; and it is true that an interrupt could happen between the two halves of the "instruction", but I think that shouldn't make a difference), then I don't think there's much of a problem.

reply

Annatar 1 hour ago [-]

"Indeed, in RISC-V, full 32-bit constants must be broken across two instructions."

Any RISC processor has that issue, because encoding is fixed at 32-bits to keep the hardware simple. That's not what I'm referring to.

Look at that retardation, "auipc". Because "auipc" is intuitive, right? (For the record, I'm being sarcastic.) What the hell was the instruction designer smoking? Then there is the square bracket notation, like on intel, and intel has some of the most idiotic, non-conventional assembler syntax -- and this thing mimics something so bad? Every other processor uses parenthesis, that's a well understood norm.

Then there is more intel retardation in the form of dst, src (or dst, src, src on RISC). What kind of a warped, twisted mind came up with that? What was going on in that person's head to work in such twisted ways?

Then there's the "cherry on top":

  jalr x0, x1, 0 ; because "jalr" is intuitive as well, it immediately tells you what it does?

You'll know a bad processor design by its instruction set. Always and forever.

reply

---

http://www.darklife.org/pics/xilinx/src/core.v

"Verilog 16-bit RISC processor!"

  `define ROR 4'd0
  `define ROL 4'd1
  `define ADD 4'd2
  `define SUB 4'd3
  `define AND 4'd4
  `define OR  4'd5
  `define XOR 4'd6
  `define NOT 4'd7
  `define LOD 4'd8
  `define STO 4'd9
  `define IMM 4'd10
  `define MUL 4'd11
  `define BRA 4'd12
  `define BSR 4'd13
  `define RET 4'd14
  `define LOP 4'd15

---

it's tough to keep the # of instructions small because, yeah, 16-bit is probably the smallest power-of-two bitwidth that can be used to address reasonably sized programs, but otoh sometimes you really want to index more than 4 gigs of memory/list items (~2^32 bytes). So we seem to be inexorably led to having a plethora of data types in assembly, which greatly raises the # of instructions.

But... actually floating point can represent some integers too... and in fact it seems that a 64-bit double can exactly represent every integer up to 2^53 [14]. So it may be possible to get by with just two types: a 16-bit integer and a 64-bit float.

Very few if any assembly-like languages do that i think; even WASM has i32, i64, f32, f64.
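(A quick C check of the 2^53 claim:)

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      int64_t n = (int64_t)1 << 53;       /* 9007199254740992 */
      printf("%.0f\n", (double)(n - 1));  /* exact: below 2^53 every integer fits */
      printf("%.0f\n", (double)n);        /* exact: 2^53 itself */
      printf("%.0f\n", (double)(n + 1));  /* rounds back to 2^53: first lost integer */
      return 0;
  }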

I guess a more practical choice would be:

i16, i64, f64

and even more practical:

i8, i16, i32, i64, f64

but imo that's already a lot...

---

to reiterate some thoughts regarding the tiled/fabric CPU-node, memory-node, network-node design at the end of ootAssembly19 and mentioned in ootAssemblyThoughts:

i probably won't have time in my lifetime to muck around with hardware, even FPGAs, much, so having hardware-supported specialized CPU, memory, and network nodes is probably something i'll never get to. So really the best we could hope for is a uniform tiling of one type of node (which could maybe be, in software, specialized into CPU/memory/network nodes). So the lessons from the tiled/fabric CPU/memory/net node design should probably be distilled down to constraints on low-level memory and communications capabilities rather than used directly, even in demos. Which is pretty much what i did at the end of the relevant sections of ootAssembly19 and ootAssemblyThoughts. To reiterate some of this:

so taking the sorta-max of some of these (except packet size), for easy development given that we won't actually have a hardware implementation and so don't really know what the exact constraints are:

in terms of geometry, either:

or:

eh, let's have a payload size of 4k. We are choosing maxes here, right? If the hardware page size is 256 then the Oot VM layer has to do some networking, whatever. Also, this allows all of the constraints to be derived just from the bitwidth of 16; if each node has a one-word address, that's 64k nodes per chip, and if one packet payload can fit one bit for every node on the chip, that's a 4k payload size (65536 nodes = 65536 bits = 4096 16-bit words).

so to summarize further:

(note: this allows a data structure with one bit for every addressable node to fit within one packet payload)

(what happens when the buffer is full? do packets get overwritten? Or do packets get dropped and the sender notified? Or does the sender wait for acknowledgement which never arrives in this case? If packets can get silently dropped or acks can not arrive, then is this any better than an unreliable network?)

(is in-order streaming provided or only datagrams/packets?)

(is error correction/reliability provided/assumed? i think yes; this is supposed to be like hardware/IPC, not like external network transmission; if you wanted to access a 'real' (external) network then you'd want more than 16-bit addresses)

(does each node have multiple 'ports' for packets to be sent to/from? If so, i assume that signals don't have ports?)

Remaining Qs:

i copied those conclusions to ootAssemblyThoughts.

---

let's unify packets and signals, and also allow them to be ordered via pipes (channels)

---

dschuetz 6 minutes ago [-]

The Cell was way ahead of its time. Its security architecture capabilities were almost perfect, but they've screwed up the implementation and they failed to provide good programming tools. And, it was very very expensive.

reply

---

both WASM and LLVM have an instruction called 'unreachable' that is like our BAD instruction. So we should probably call it 'unreachable'.

---

WASM's opcodes 0 and 1 are 'unreachable' and 'NOP'. Which is nice b/c it means you can immediately identify non-instructions by comparing the opcode to 2.

---

WASM has an instruction called 'drop' that i don't quite understand yet.

---

concordance of WASM, RISC-V, LLVM, JVM, ARM Cortex M0 instructions (todo: also LuaJIT 2):

RISC-V 2.2 base 32-bit integer instruction set RV32I version 2.0:

RISC-V multiply/divide extension: mul, mulh, mulhsu, mulhu, div, divu, rem, remu

RISC-V single-precision float extension: flw fsw fadd fsub fmul fdiv fsqrt fmadd fmsub fnmsub fnmadd fmv.w.x fmv.x.w fsgnj fsgnjn fsgnjx fmin fmax feq flt fle fcvt.s.w fcvt.s.wu fcvt.w.s fcvt.wu.s fclass frcsr fscsr frrm fsrm fsrmi frflags fsflags fsflagsi

WASM: [16]

LLVM:

in addition to what is listed above, there are various 'added later' instructions, vector ops, aggregate ops including getelementptr, fence and atomic memory ops, and intrinsics

JVM:

https://en.wikipedia.org/wiki/Java_bytecode_instruction_listings

with constants: (adfil)const, bipush, sipush, ldc (a=references, d=doubles, f=float, i=integer, l=long; also b=byte, c=char, s=short below)
(adfil)(load|store), (adfil)(load|store)_(0..3)
(abcdfils)a(load|store), anewarray, arraylength, getfield, getstatic, multianewarray, new, newarray, putfield, putstatic
(dfil)(add|sub|mul|div|neg|rem)
(il)(and|or|xor|shl|shr|ushr)
(df)cmp(g|l), lcmp
(adfil)return, return, athrow, goto, goto_w, invokedynamic, invokeinterface, invokespecial, invokestatic, invokevirtual, jsr, jsr_w, lookupswitch, ret, tableswitch
if_acmp(eq|ne)
if_icmp(eq|ne|ge|gt|le|lt)
if(eq|ne|ge|gt|le|lt)
d2(fil), f2(dil), i2(bcdfls), l2(dfi)

ARM Cortex M0:

from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CIHJJEIH.html or http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0432c/CHDCICDF.html

note: these are 32-bit operations

ldr, ldr(b|h|sb|sh) (load byte|halfword|signed byte|signed halfword), ldm (load multiple), str (store), str(b|h), stm (store multiple), push/pop (push/pop registers onto/from stack)
(u|s)xt(b|h) (extend unsigned|signed byte|halfword)
cps(id|ie) (disable/enable interrupts), mrs/msr (read/write special register), bkpt (breakpoint)

ARM Cortex M4 floating-point:

VABS, VADD, VCMP, VCMPE, VCVT, VCVTR, VDIV, VLDM, VLDR, VMLA, VMLS, VMOV, VMRS, VMSR, VMUL, VNEG, VNMLA, VNMLS, VNMUL, VPOP, VPUSH, VSQRT, VSTM, VSTR, VSUB

todo mb later after taking the initial sorta-intersection, also consider 68k/Coldfire, MIPS, those educational subsets of RISC-V, GNU Lightning, RISC-V compressed, the ARM subset in this document https://community.arm.com/processors/b/blog/posts/arm-cortex-m3-and-later-basic-integer-math-operations-32-bit-1-5 , what else?

CONCORDANCE:

loads/stores:

load/store byte or halfword integer: all but JVM have some straightforward way to load an integer byte or halfword/short (two bytes), with the option to treat it as unsigned or signed (that is, to zero-extend or sign-extend it). All but the JVM can also store integer bytes and halfwords. The JVM would appear to use array operations (bytearrays) for this?

load/store 32-bit word: all 5 have 32-bit integer load/store commands

add/sub 32-bit ints: all 5 have this

negate signed integers: all 5 can do this (in RISC-V this would be SUB $0, $x, i.e. 0 - $x; note that the constant 0 is always available even to non-immediate instructions in RISC-V, by using the zero register); in WASM and LLVM i'm assuming you would likewise subtract the value from the constant 0; JVM has a negate instruction; and ARM has a weird-sounding instruction rsbs (reverse subtract; negate).

add immediate: both RISC-V and ARM and JVM; n/a for the others

load upper immediate: all but armcm0 have either a load upper immediate, or direct loading of 64-bit integers

add immediate to PC: riscv and armcm0 both have something like this. Arguably the other 3 are higher level so may not need it as much. ARM can add or subtract 4k [17].

load constant (at least 8 bits, possibly 12; also, all 5 can load a 32-bit constant within 2 instructions): (eg in riscv you would do an ADDI i think; that's a 12-bit immediate; and a LUI, which is 20 bits) (in JVM there's sipush, which i think is 16 bits; a java short?) (what about ARM? i think in ARM there are 8-bit immediates; but also i think maybe the ADR instruction can load a constant stored in the program code by computing a target address which equals PC + an immediate)

and, or, xor, shift left, shift right (logical/unsigned), shift right (arithmetic/signed): all 5

immediate versions of shifts: both risc-v and ARM; n/a for WASM and LLVM

(strangely, only RISC-V seems to have immediate XOR; its 'not' pseudoinstruction is just xori rd, rs, -1. immediates are n/a for WASM and LLVM, but immediate XOR is missing from JVM and ARM; ARM at least has a separate MVN (bitwise NOT) instruction, while the JVM seems to need to push a constant to do bitwise NOT)

(note: i'm not sure if i got the comparisons correct, but it's close enough for my purposes)

integer comparisons: RISC-V, WASM, LLVM have: lt_unsigned lt_signed (oddly, JVM has compares, but only for longs, floats, and doubles, afaict) (ARM's compare just updates condition flags) (WASM and LLVM also have eq ne le lt ge gt, both signed and unsigned) (RISC-V also has immediate versions of lt_unsigned lt_signed) (note: the Epiphany guy thought these could have been left out of RISC-V)

integer compare-and-branch, one instruction: RISC-V and JVM and ARM have beq bne blt bge (signed). WASM and LLVM and JVM have BNZ ('br_if' in WASM; BNZ and BZ are not distinguishable in LLVM's 'BR'). JVM and ARM also have ble bgt. JVM and ARM and WASM also have bz. JVM and ARM also have branch_if_lt_zero branch_if_ge_zero. JVM also has branch_if_le_zero branch_if_gt_zero. RISC-V also has bltu bgeu (unsigned).

two instruction integer compare-and-branch: All five have beq bne blt_signed bge_signed. WASM has bz. All but RISC-V have le gt (signed). All but JVM have le gt (unsigned). WASM and LLVM and ARM also have lt ge (unsigned). ARM also has branch_if_overflow.

So it sounds like a sorta-intersection would be: bnz beq bne blt_signed bge_signed. But probably cleaner to just expand this to bz bnz beq bne, ble blt bge bgt (signed), and maybe also ble blt bge bgt (unsigned). (todo: what does LuaJIT 2 have? see 'Comparison ops' in http://wiki.luajit.org/Bytecode-2.0 ; it is eq ne le lt ge gt, with two instructions, and the only number type available is f64)

compare-and-branch offset sizes: todo

absolute jump: all 5 have some sort of absolute jump/branch.

absolute jump offset sizes: todo

indirect jump: RISC-V, ARM have low-level indirect branch or jump. WASM, LLVM, JVM have stuff like br_table, indirectbr (which has a list of permissible targets), call_indirect, and invokedynamic.

indirect jump offset sizes: todo

calls: RISC-V and ARM have branch/jump with link. WASM, LLVM, JVM have higher-level calls/returns of various sorts. LLVM and JVM have exceptions; i don't see any instructions for exceptions in WASM 1.0 but i heard talk of a throw instruction in earlier versions, so i don't know if WASM still has exceptions or not.

switch: WASM, LLVM, JVM have some sort of switch or br_table

nop: all 5 have some sort of nop, although this is a pseudoinstruction in RISC-V (addi x0, x0, 0) and an intrinsic in LLVM (llvm.donothing)

fence/sync/concurrency: all but WASM have some sort of fence, barrier, or monitor

system calls, special registers: RISC-V and ARM have system calls and special registers. The others don't; presumably the ordinary higher-level calls are also used for this sort of thing. Some have breakpoints.

unreachable: LLVM and WASM have 'unreachable' instructions

rotate right: WASM, LLVM, ARM have a rotate right (or for LLVM, a funnel shift right intrinsic)

registers and stacks: RISC-V's mov is a pseudoinstruction (addi rd, rs, 0). ARM has mov. JVM has stack ops. WASM and LLVM are higher-level.

multiplication: RISC-V doesn't support multiplication without the multiply-divide extension ('M'); but if RISC-V includes the multiply-divide extension, then i think all 5 support multiply of the form: unsigned 32-bit x 32-bit -> lower 32-bits of the result (i think; i'm not 100% sure i understand the various specs here). If there is no overflow, then this same operation will work correctly on 32-bit signed twos-complement values. If there are signed values and overflow, the sign bit may be incorrect. In addition, RISC-V with the 'M' extension has instructions to get the lower-order result bits for a signed multiply and to get the higher-order result bits; so does ARM Cortex M3; the rest apparently do not offer that (except in the form of converting your 32-bit value to a 64-bit one and then doing 64-bit multiply).
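(A small C illustration of the signed/unsigned point: the low 32 bits of the product are identical either way, so one multiply instruction covers both interpretations as long as the signed result fits:)

  #include <assert.h>
  #include <stdint.h>

  int main(void) {
      int32_t  a = -7, b = 100000;
      /* multiply as unsigned: the low 32 bits equal a*b mod 2^32 */
      uint32_t lo = (uint32_t)a * (uint32_t)b;
      /* two's-complement reinterpretation recovers the signed product */
      assert((int32_t)lo == -700000);
      return 0;
  }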

div and rem: ARM M0 doesn't have div and rem, but M3 has sdiv, udiv (there is no rem, but you can use 'mls' after 'udiv' to get a remainder [18]). All the others have signed div and signed rem. All except ARM M0 and JVM also have unsigned div and unsigned rem.
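(The 'mls after udiv' trick in C: the remainder is one multiply-subtract away from the quotient, so a dedicated rem instruction isn't strictly needed:)

  #include <stdint.h>

  uint32_t urem(uint32_t a, uint32_t b) {
      /* b must be nonzero; on Cortex-M3 this can compile to udiv + mls */
      return a - (a / b) * b;
  }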

TODO here

floating point arith: ARM M0 has no floating point, but ARM M4 has it optionally (see [19]). fadd fsub fmul fdiv. Only LLVM and JVM appear to have frem. All but JVM have sqrt. TODO.

floating point load/store/convert: todo

64-bit integer load/store/convert: RISC-V with the 64I extension, rather than adding new instructions for 64-bits, respecifies the registers to 64-bits and respecifies all existing instructions to use 64-bits. Then new instructions are added for 32-bits instead: addw addiw subw slliw srliw sraiw lw sw. These new instructions are signed. 32-bit quantities are stored in registers as sign-extended 64-bit values, so 'conversion' between 32-bit and 64-bit representations is a no-op/unnecessary.

todo other ISAs

64-bit integer arith: todo

higher-level control flow in WASM, LLVM, JVM only: todo

higher-level data structures in WASM, LLVM, JVM only: todo

(in the above, recall that 'ARM' is a shortcut for 'ARM Cortex M0'; other ARM profiles have many more instructions)

summarizing the sorta-intersection ISA:

mb also: other compare and compare-and-branch variants, ror, switch, syscalls, system registers, ret, div_u, div_s, rem_u, rem_s, fadd fsub fmul fdiv fsqrt

(47 instructions so far (excluding 'mb also'), and haven't even included anything from 'div and rem' or later yet)

todo finish summarizing the sorta-intersection ISA

todo modify Boot and BootX if needed.

BELOW THIS ARE INSTRUCTIONS YET TO BE PROCESSED FOR THE CONCORDANCE

RISC-V 2.2 base 32-bit integer instruction set RV32I version 2.0:

RISC-V single-precision float extension: fmadd fmsub fnmsub fnmadd fmv.w.x fmv.x.w fsgnj fsgnjn fsgnjx fmin fmax feq flt fle fcvt.s.w fcvt.s.wu fcvt.w.s fcvt.wu.s fclass frcsr fscsr frrm fsrm fsrmi frflags fsflags fsflagsi

only in RISC-V, not ARM, and n/a in the others:

only in RISC-V:

WASM: [20]

only in WASM:

only in WASM and LLVM:

LLVM:

in addition to what is listed above, there are various 'added later' instructions, vector ops, aggregate ops including getelementptr, atomic memory ops, and intrinsics

only LLVM and JVM:

only in LLVM, although WASM has memory.*:

only in LLVM:

JVM:

https://en.wikipedia.org/wiki/Java_bytecode_instruction_listings

(dfil)(mul|div|neg|rem)
d2(fil), f2(dil), i2(bcdfls), l2(dfi)

only LLVM and JVM:

only in JVM:

(abcdfils)a(load|store), anewarray, arraylength, getfield, getstatic, multianewarray, new, newarray, putfield, putstatic
if_acmp(eq|ne)

ARM Cortex M0:

from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0497a/CIHJJEIH.html or http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0432c/CHDCICDF.html

note: these are 32-bit operations

(u|s)xt(b|h) (extend unsigned|signed byte|halfword)

ARM Cortex M4 floating-point:

"The FPU fully supports single-precision add, subtract, multiply, divide, multiply and accumulate, and square root operations. It also provides conversions between fixed-point and floating-point data formats, and floating-point constant instructions."

VADD, VSUB, VMUL, VDIV, VNEG, VABS, VSQRT; VCMP, VCMPE (compare; the E variant raises an exception even if either operand is a quiet NaN, o/w it raises an exception only if either operand is a signaling NaN); VCVT, VCVTR (convert; 'R' means to use the selected rounding mode instead of rounding towards 0 -- only applicable to conversion from float to int); loads and stores and movs: VLDR, VSTR, VMOV; multiple load/store: VLDM, VSTM; special register load/store: VMRS (load from special), VMSR (store to special)

" 7.2.5 Complete implementation of the IEEE 754 standard

The Cortex-M4 FPU supports fused MAC operations as described in the IEEE standard. For complete implementation of the IEEE 754-2008 standard, floating-point functionality must be augmented with library functions. The Cortex-M4 floating point instruction set does not support all operations defined in the IEEE 754-2008 standard. Unsupported operations include, but are not limited to the following:

    Remainder.
    Round floating-point number to integer-valued floating-point number.
    Binary-to-decimal conversions.
    Decimal-to-binary conversions.
    Direct comparison of single-precision and double-precision values."

" The FPU sets the cumulative exception status flag in the FPSCR register as required for each instruction, in accordance with the FPv4 architecture. The FPU does not support exception traps. The processor also has six output pins, FPIXC, FPUFC, FPOFC, FPDZC, FPIDC, and FPIOC, that each reflect the status of one of the cumulative exception flags. See the Cortex®‑M?4 Integration and Implementation Manual for a description of these outputs. "

only in ARM:

(push|pop) (push/pop registers onto/from stack)

note: RISC-V 'G' extension is shorthand for the base + M+A+F+D, that is, base + multiply/divide + atomic + single and double precision. This suggests that we should prioritize those instructions somewhat.

Also, should prioritize instructions in the RISC-V 'C' (compressed) extension, although the spec for that extension also talks about argument form constraints:

" shorter 16-bit versions of common 32-bit RISC-V instructions when:

the argument form constraints perhaps should inform our choice of 8-bit 'shortcut' instructions in BootX.

---

hmm i keep thinking it might be a little quixotic to make Boot a 16-bit ISA. Everyone else uses i32.

but i do also think that for massive concurrency, 16-bit ISAs make more sense.

---

in Boot, may want to expose only 15 or 14 registers instead of 16, so that the compiler has a register or two to work with, e.g. to synthesize 'BZ' instructions on RISC-V.

---

interesting:

http://forum.6502.org/viewtopic.php?f=2&t=2541

" Hi there,

I recently found a really nice article on static recompilation of 6502 code into LLVM compiler intermediate representation, and the difficulties that come with it.

Here's my write up: https://plus.google.com/108984290462000253857/posts/DPHzWoosXYS and here's the original (lengthy, but worthy) article: http://andrewkelley.me/post/jamulator.html

What I find interesting is how to resolve the static recompilation issues posed by indirect addressing modes, stack-based jumps (like pushing the address of a function from a table onto the stack and then RTS). His final words on that are that it's probably easier to use just-in-time compilation, but I wonder if there still is a way for static recompilation - be it with manually added hints, or simulation at analysis time to find all possible code paths.

What do you think? André

"

---

already copied to plChAssemblyFrequent...:

" IEEE Spectrum in the early 80's had an article on minimal instruction sets. The author made a salient point that 7 instructions did most of the heavy lifting, another 7 did the most of the rest required. He did highlight that one mistake made with instruction sets was not separating the addressing modes from the instruction itself. ... I have managed to rummage around and found the article. It was actually in IEEE Micro May 1982, "A Unique Microprocessor Instruction Set" Dennis A Fairclough.

The article looked at the statistical usage of instructions and placed them into 8 broad groups, Data Movement, Program Modifying, Arithmetic, Compare, Logical, Shift, Bit and I/O & Misc groups.

The result of the analysis was as follows:

For groups data movement and Program Modifying, 1 Instruction - MOVE [Cumulative usage 75%]

For group Arithmetic, 4 Instructions - ADD, SUB, MULT, DIV [Cumulative usage 87.5%]

For group Compare, 1 previous Instruction - SUB [Cumulative usage 93.75%]

For group Logical, 3 Instructions - AND, OR, XOR [Cumulative usage 96.88%]

For group Shift, 1 Instruction - SHIFT [Cumulative usage 98.44%]

For group Bit, 1 Instruction - MOVEB [Cumulative usage 99.22%]

For group I/O & Misc, depends on whether i/o is memory mapped or otherwise, so either 0 or 1 Instruction.

Some possible extended instructions included INC, DEC, I/DBRC and MOVEM. "

---

i haven't quite finished up the 'concordance' above yet, but it's already seeming that Boot is kind of weird, in that it doesn't support all of the things in the concordance, yet it has a bunch of other stuff like the stack ops and the pointer-typed operations. The I/O stuff i feel fits fine, though. Also, it has almost everything in the concordance, notably excepting everything having to do with floating point and larger integers -- this may be enough.

also doing everything in words instead of bytes is kind of funky, but necessary for portability.

on the one hand, the stack ops seem too high-level and far away from the goal of a 'sort-of intersection' of common platform instruction sets. On the other hand, we want the stack to be opaque for portability/interop, so we can't just implement it manually.

on the one hand, there is a bunch of I/O stuff that is far away from the goal of a 'sort-of intersection' of common platform instruction sets. On the other hand, it's called 'Boot' because it's supposed to be used for bootstrapping a self-hosting compiler, which means that we need some way to do I/O.

maybe what we really need is another layer beneath Boot -- a layer with only the 'sort-of intersection' of common platform instruction sets, with no stack ops, no I/O, and no idiosyncrasies -- i guess this would only be good for pure computation.

even that wouldn't get past the funkiness of having ptrs and ints have different instructions, though (most CPUs use the same instructions to operate on pointers and ints; that is to say, pointers are usually just unsigned ints).

also, CAS and fence are a bit funky; unlike machine ISAs, programming-language VMs usually don't need these because they usually don't default to relaxed memory orderings. Still, they are good for our purposes.

and on the other hand, the stack ops, at least, don't really hurt our goals because they are easy to implement on any target platform.

anyway, we have about 56 instructions in Boot, and about 128 opcodes in BootX. It seems to me that the sorta-intersection of common platforms is shaping up to look like about 50 instructions or so (many of which are already in Boot). So it looks like we can fit this in BootX with room to spare (i'd probably like to use this extra room to add some alternate models of computation, eg combinator calculus, reversible gates, mb logic programming, mb some vector/matrix ops like scatter/gather, some loops and switches, etc; and/or maybe extend to the sorta-union of common platforms). Remember that the 128 opcodes give us 128 3-operand instructions, but we can have many more if we convert some of these to 2-operand instructions (each 3-operand opcode converts into 16 2-operand opcodes when one 4-bit operand field is reused as a sub-opcode). On the other hand, writing that, it seems like a lot to ask; i still need to actually just write out the logical extension of the (majority-vote-of-common-platforms) ISA and see how much space we have left after that. That would include stuff like:

on the other hand, the reason we aren't just using RISC-V or WASM is b/c i thought those were too complicated and hence hard to implement. But now we're ending up with as many instructions as they have!

also, i want to keep in mind the interesting higher-level control structures etc seen in WASM (and maybe LLVM) -- stuff like that will be useful for the next language up in the stack.

hmm so it seems that i am veering closer to having TWO MORE languages in this stack:

eh, after writing that, i think OVM and the '?' below it can still be merged. The upshot is that this is just a reminder that the language supported by OVM should be really simple.

Also, how would Boot Core work exactly? Would we use a sequentially consistent memory ordering and so remove CAS and FENCE? Would we declare that Boot Core is not only pure computation, but pure computation without concurrency? Would we still have opaque pointers? Would we still have opaque stacks? If we still have opaque pointers and opaque stacks, are we really gaining enough by eliminating the few I/O operations to justify the overhead of having a separately-named 'Boot Core'?

and i can't help but think that if we would just allow segmented memory, with a base pointer that is implicitly added to integer offsets, and integer offsets taking the place of addresses, then we could get rid of pointer-typed pointer arithmetic instructions. This seems funkier than it's worth, though.
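(To make that concrete, a hypothetical sketch -- the names are illustrative only: programs would hold bare integer offsets, and every load/store would implicitly add an opaque per-segment base:)

  #include <stdint.h>

  static uint16_t *segment_base;  /* set up by the runtime; opaque to programs */

  uint16_t load(uint16_t offset)              { return segment_base[offset]; }
  void     store(uint16_t offset, uint16_t v) { segment_base[offset] = v; }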

i dunno. i don't think Boot Core is worth it as some separate thing. Maybe just as the name for a whitelist of pure computation instructions within Boot.

---

floating point support presents a conundrum.

what i'd like to do is have floating point instructions in BootX, but not in Boot. But since BootX is supposed to be compilable to Boot, this means that the toolchain that compiles BootX to Boot must include a soft floating point implementation. This is undesirable for a few reasons:

so, i think i'd rather make floating point support in BootX optional, a 'profile' or 'extension' sort of thing.

so, a BootX->Boot toolchain could have optional soft floating point support. But if you exclude this, you are still allowed to call your implementation a standard-compliant implementation of BootX.

---

remember to check the plChArithmeticImpl subsections under 'De-facto IEEE 754' to see which parts of IEEE 754 are often ignored

---

should have a Boot and BootX and OVM interpreter, too

---

since we have an immediate addressing mode in BootX, we don't really need the zero register anymore. So in fact this register can be used by the toolchain. So maybe we only need to reserve one register, not 2. Otoh on architectures with 16 regs, one of which is a constant zero register, this doesn't really give us a spare, so maybe still reserve 2.

---

in BootX, we could make fuller use of what happens when the output operand is in the immediate constant addressing mode. We could either use this to expand the space of two-operand opcodes (two-operand opcode space x16), or we could use it to quickly output one word at a time to various standard streams (with outputting to stream zero being a synonym for discard, which may allow the toolchain to internally use register 0 for something else), or some combination (eg 0 thru 3 for special streams, the rest for two-operand opcodes x12).

but we already have plenty of two-operand opcodes. So maybe let the program assign streams to 5-8 and use those as wordwise device output?

---

i was thinking the other day that Boot is just a bare-bones portable assembly language that can be implemented really quickly, and BootX adds in the other instructions that typical assembly languages have.

But then i remember that if we have room, we may also want to put in SECD-machine-like stuff and higher-order functions such as function composition, map, reduce, etc (and the stuff in Python Copperhead) (and the stuff in Simplicity) (and maps, arrays) (and the control flow stuff in WASM) (and Forth stuff) (and basic Scheme stuff). But i don't know if we have room yet. So the jury is still out on this.

Also there's a consideration of how high-level we want to be. If we're composing functions then doesn't that require transparent memory allocation (and hence a runtime) on many platforms?

---

"

KMag 24 minutes ago [-]

Second, on architectures designed with 4-operand fused multiply-add (FMA4) from the start, and a zero-register (like the Alpha's r31, SPARC's g0, MIPS's r0, etc.), why not make the zero-register instead an identity-element-register that acts as a zero when adding/subtracting and a one when multiplying/dividing? An architecture could optimize an FMA to a simple add, a simple multiply, or simply a move (depending on identity-element-register usage) in the decode stage, or a minimal area FPGA implementation could just run the full FMA. This avoids using up valuable opcodes for these 3 operations that can just be viewed as special cases of FMA. Move: rA = 1.0 * rC + 0.0. Add: rA = 1.0 * rC + rD. Multiply: rA = rB * rC + 0.0. FMA: rA = rB * rC + rD. "
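(A toy C rendering of that idea, all names hypothetical: reads of the identity register yield the identity element for the slot they appear in, so move, add, and multiply all fall out of a single FMA datapath:)

  enum { IDENT_REG = 31 };                  /* the repurposed zero-register */

  static double read_mul_operand(const double *r, int n) {
      return n == IDENT_REG ? 1.0 : r[n];   /* identity for multiplicand slots */
  }
  static double read_add_operand(const double *r, int n) {
      return n == IDENT_REG ? 0.0 : r[n];   /* identity for the addend slot */
  }

  /* rA = rB * rC + rD; covers move, add, multiply, and full FMA */
  static void exec_fma4(double *r, int rA, int rB, int rC, int rD) {
      r[rA] = read_mul_operand(r, rB) * read_mul_operand(r, rC)
            + read_add_operand(r, rD);
  }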

---

" gpderetta 7 hours ago [-]

...Separate load and op instructions and fixed size instructions are pretty much the only things left differentiating RISC and CISC architectures (there is nothing reduced about modern RISCs), so I do not think the claim that x86 CPUs are RISC inside does hold. "

---

" Almost every architecture has now added SIMD vector extensions, including SPARC (VIS), x86 (MMX/SSE/AVX), POWER/PowerPC? (AltiVec?) and ARM (NEON). Only relatively recent processors from each architecture can execute some of these new instructions, however, which raises backward-compatibility issues, especially on x86 where the SIMD vector instructions evolved somewhat haphazardly (MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX, AVX2). "

---

i don't think i need to read this but it does shed a little light on the primitive operations needed for concurrency: [21]