proj-oot-ootAssemblyNotes23

linux ioctl is applied to open file descriptors, e.g.:

" Open the KVM device, kvmfd=open("/dev/kvm", O_RDWR

Do create a VM, vmfd=ioctl(kvmfd, KVM_CREATE_VM, 0) Set up memory for VM guest, ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region) "
O_CLOEXEC)
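
for concreteness, here's that sequence as a compilable C sketch (untested, no error handling; the mmap backing buffer is my addition, not part of the quote):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    int main(void) {
        int kvmfd = open("/dev/kvm", O_RDWR | O_CLOEXEC); /* open the KVM device */
        int vmfd  = ioctl(kvmfd, KVM_CREATE_VM, 0);       /* create a VM; returns a new fd */

        /* back 4 KiB of guest physical memory with an anonymous host mapping */
        void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        struct kvm_userspace_memory_region region = {
            .slot            = 0,
            .guest_phys_addr = 0x1000,
            .memory_size     = 0x1000,
            .userspace_addr  = (unsigned long)mem,
        };
        return ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
    }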

---

i think having the polymorphic 64-bit instructions is overengineering; we already have two stages of assembly-like instructions, 16-bit and 32-bit, having a third stage is just spending too much complexity on the assembly-like stuff. The 64-bit LONG instructions should just be the grammar-like, AST-like/data-structure-like flexible encoding.

---

having the 8-bit shortcut instructions prevents 16-bit alignment, which is a pity as 16 is a "more perfect" power-of-2 than 8. Let's get rid of that, and instead have alternate formats for 16-bit instructions. We still have 1 format bit in the 16-bit parcel: one value means our ordinary 16-bit instructions; the other value means there is a second format bit, whose one value means a 32-bit (or longer) instruction, and whose other value means the new 16-bit format.

So the new 16-bit format has 14 bits left for us to work with. We can make the first two of these be more format bits. When they are 11, we have 12 bits left to work with, which we can use to recover the 6-bit shortcuts (in packets of 2) by treating these 12 bits as two 6-bit shortcut instructions. Now we still have 00, 01, 10 to work with.
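
a sketch of that dispatch in C, to check that it hangs together (bit positions and polarities here are my assumptions, not a spec):

    #include <stdint.h>

    enum fmt { ORDINARY16, LONG32, SHORTCUT_PAIR, ALT16 };

    enum fmt classify(uint16_t parcel) {
        if ((parcel & 0x8000) == 0) return ORDINARY16;  /* first format bit */
        if ((parcel & 0x4000) == 0) return LONG32;      /* second format bit: 32-bit (or longer) */
        if (((parcel >> 12) & 0x3) == 0x3)              /* next two format bits == 11 */
            return SHORTCUT_PAIR;   /* remaining 12 bits are two 6-bit shortcut
                                       instructions: (parcel >> 6) & 0x3f and parcel & 0x3f */
        return ALT16;               /* 00, 01, 10: JREL/LDI/ANNOTATE etc., or reserved */
    }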

One thing we could do here is pull out all of the instructions that don't really fit well with the 3-operand model from the ordinary 16-bit instructions, in order to save a few opcodes over there. This could include JREL, LDI, ANNOTATE. Note, however, that these each require at least 11 bits of content (and it would be great if we could give them 12), so if we did all three of those, that would probably use our entire budget; but we may want to save some for longer macro instructions, or for future expansion. The 6-bit shortcuts will probably already need a NOP, so we could also pull NOP out of the ordinary 16-bit instructions and represent it as two shortcut NOPs.

---

[1] says that SMS payloads, even though they are 160 characters long, are actually encoded into only 140 bytes (by using 7-bit characters). And then a 36 byte header is attached. This suggests:

32 or 64 bytes would be a good low-level message size (to fit a little less than this header, or this header + some content), or 128 or 256 bytes (to fit a little less than this content, or a little more than it).
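
(the arithmetic: 160 chars x 7 bits = 1120 bits = 140 bytes of payload; 36 + 140 = 176 bytes with the header. So 32/64 bytes bracket the header, and 128/256 bytes bracket the payload.)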

---

[2] speaks of:

a 512-bit unit

256-wide ports

"Each memory operation can be of any register size up to 512 bits"

"The L2 to L1 bandwidth in Skylake is the same as Haswell at 64 bytes per cycle in either direction"

"Each (DDR4) channel is 72-bit (64 bit and an 8-bit ECC)"

" Store is now 64B/cycle (from 32B/cycle) Load is now 2x64B/cycle (from 2x32B/cycle)"

" L1I Cache:

    32 KiB/core, 8-way set associative
        64 sets, 64 B line size

"

L1D, L2, L3 caches also each have 64 B line size
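
(sanity check: the quoted L1I organization is self-consistent: 64 sets x 8 ways x 64 B/line = 32 KiB.)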

" DTLB

    4 KiB page translations:
        64 entries; 4-way set associative"

" Port 4 now performs 512b stores (from 256b)"

L1D cache:

" 128 B/cycle load bandwidth 64 B/cycle store bandwidth "

the picture/section "Individual Core" [3] has some bandwidths of:

16 bytes/cycle

64B/cycle

512bit/cycle

and the Decoded Stream Buffer (DSB) has a "64 B window" and Instruction Fetch and PreDecode has a "16 B window"

this suggests that message sizes should be 64 bytes (8 bytes through 128 bytes are observed).

---

so it sounds like 64 bytes is a pretty good message size..

---

"Another famous example of NIH is Linux’s epoll(2), which is a broken version of BSD kqueue(2)."

---

"

aseipp:

If we're talking about desktop processors, it's still irrelevant, honestly; any desktop processor that can compete in a modern world is extraordinarily complex in implementation at every level. POWER, ARM, RISC-V, x86 -- they all try to extract high performance using similar techniques in their execution engines. Out of order execution, pipelined decoding etc are architecture design aspects and have little to do with the ISA. Front-end decoding, etc traditionally takes an extremely small amount of area and space in the overall system, though it's still complex (it's kind of hard to talk about raw "area" metrics when cache is what dominates and is most important.) But the CPUs themselves are incredibly complex, no matter what way you cut it.

If you're talking about embedded CPUs or whatnot, such as picorv32, the frontend is a relatively bigger piece of the "pie" so having a simpler ISA is nice. But really, verifying any CPU design is complex as hell and the decoders/frontends are not the biggest problem (even in RISC-V a source of bugs I've seen float around and even fell into myself, for my own emulator, is the semantics of JALR clearing low bits, for instance -- but that's not a decoding problem, it's a semantics one, flat out. Also, let's be real here, the whole memory-model approach in RISC-V where there's a spectrum of extensions adding TSO on top of weak memory ordering etc is nice and cool but I'd hardly call it fitting in the form of what people think of as a "simple" RISC CPU. That's not even getting into the array of other extensions that will crop up -- bitmanip, a crypto extension will almost certainly pop up, the vector extension, etc etc...) "

---

What if there are 64k nodes and each node had a 64k-bit bitmap with each bit corresponding to a node, with each node able to change the state of its bit in everyone else's bitmap (globally; that is, everyone's bitmap shares the same values at all times; they are all copies of one another). That's kind of like how the brain works I guess; each neuron's output is copied to each recipient. It generalizes the idea of letting each node have a bus shared with its vertical and horizontal siblings. To make it more like the brain's rate code (I'm not saying there's not also spike-timing coding), you could give each node an output byte instead of an output bit.
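
(scale check: one 64k-bit bitmap is 8 KiB per node, so the 64k replicated copies total 512 MiB machine-wide; with an output byte per node instead, each node's copy is 64 KiB.)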

of course, generalize that one more time and you just get a 64k shared global memory

---

so, if we eschewed the 'geometry' aspects of the tesselated/tiled processor design elsewhere in these notes, and wanted to make massively concurrent 'chips' where every CPU on the chip could communicate equally with every other CPU, a simple architecture could be:

The global shared memory would have a very relaxed memory ordering, however. Updates are atomic and affect 16 bits at once (no 8-bit update option). Updates coming from any given CPU always arrive in order (i think this is called 'processor consistency'?).

Should we have no other memory ordering consistency constraints at all? How about one more: for any given choice of CPUs A, B, C, from the point of view of C, if A generates an update, and B generates an update, and then, before receiving B's update, A generates another update, then C can perceive no more than 255 of B's updates before it perceives A's second update (255 chosen b/c sqrt(64k) is 256). I'm guessing that this will at least allow algorithms that guarantee synchronization. (this implicitly contemplates an implementation consisting of up to 64k 'cache' copies of the 'global' memory with globally broadcast update messages each time someone makes a change, and lossless deterministic routing of these update broadcasts, but different caches receive different messages at slightly different times depending on their nearness to the sender)

however, we could simplify even further by just getting rid of the globally shared memory and instead providing message passing including global broadcast messages with the above guarantees (namely: (a) if a CPU generates two messages to the same destination, the destination receives them in order; and, (b) if a CPU A generates one message and a CPU B generates one message which A is destined to receive, and A generates another message before receiving B's message, then no CPU that is destined to observe these messages (except B itself) can observe more than 255 of B's messages before observing A's second message).

one could make the condition (b) much simpler by tightening the constraint somewhat: the maximum propagation time is at most 255 times the minimum propagation time. How to state this in terms of memory ordering constraints?

(yes, this limits the total size of the system, but we're doing that anyways here with our 16-bit addressing; also, you could connect multiples of this system together, but in that case the interface provided to the program is more like a distributed system and less like a CPU)

One could even implement this by making message emission very slow, by just making it block for an amount of time that is at least 1/255th of the max propagation time; i think this turns the whole thing into sort of a quasi-synchronous system, in the sense that there is now a shared unit of time and a shared 'fuzzy clock' whose fuzziness is bounded.
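
a sketch of that 'slow emit', with broadcast() and sleep_ticks() as hypothetical platform primitives and MAX_PROP_TICKS a made-up bound:

    extern void broadcast(const void *msg, unsigned len); /* hypothetical fabric primitive */
    extern void sleep_ticks(unsigned ticks);              /* hypothetical timer primitive */

    #define MAX_PROP_TICKS 25500  /* illustrative upper bound on propagation time */

    void send_slow(const void *msg, unsigned len) {
        broadcast(msg, len);
        sleep_ticks(MAX_PROP_TICKS / 255); /* no sender can outrun the slowest path by >255 msgs */
    }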

maybe the way to state this stronger constraint is that: if any node observes event P before event Q, then no other node can observe Q plus more than 254 events from the same source as Q before observing P.

---

but... actually the message passing architecture brings us back to the idea of a 'shared memory' where each piece of memory can only be written to by one processor; because now we need buffers/mailboxes to store the incoming messages; and the simplest thing to do here is just to have a per-node buffer/mailbox size of 1 and have any incoming message overwrite the previous one (like interrupts); but that leads to a lot of coupling insofar as all of the nodes' messages are overwriting each other -- so an obvious solution is to have, at each node, one mailbox for each other node. Which, if there are 64k nodes, is like a 64k-word shared memory where each memory location can only be written by one node.

so there is something to the idea of a special 'shared memory' where each word is only writable by one node; just generalizing it to a 64k shared memory might lose the architectural feasibility.

the two constraints from above could still give the memory ordering constraints:

an alternative system would be to have a smaller amount of total mailbox buffer (e.g. a buffer of 100 instead of 64k), and let each CPU set a 'receive mask'; messages from senders not in the receive mask are immediately discarded and don't reach the buffer. You could also combine these somewhat:

use the 'shared receive memory' just as an interface/API concept; that is, there is a RECV instruction that takes as a parameter which sender to receive from (so, it's like reading a memory address), and it just returns the last message that you got from that sender, if any -- but you can only get messages from senders if they are in your 'receive mask', and there is a max number of '1's allowed in your receive mask at any one time.

that is kind of begging to be generalized from 'senders' to 'channels', though -- so that a message doesn't contain the address of its sender, only which 'channel' it is on, and any sender can send on any channel (this model assumes that all senders are trusted, because otherwise it might be nicer to not let different senders overwrite each other's messages, which feature was provided back when each sender had its own dedicated 'channel'). Now each 'channel' is a virtual bus.

So the way it works now would be:

how many subscriptions should be allowed? well, maybe all of them (if 64k are possible, due to the 16-bit registers); i think neurons can often have tens of thousands of synapses. Note that you don't need a separate mailbox buffer for each CPU; you can have shared mailbox caches (which makes a destination parameter on SEND seem a little silly, btw, so more evidence that we shouldn't do that). The architecture could actually do this:

---

one could also imagine fancy architectures where you have full SUBSCRIBE bitmasks for each space in the buffer, and as messages come in, they overwrite each other (or do more complicated accumulation operations, such as sum). This is sort of circuit-switchy/signal-processy/neural -- i like it.

So for example, you would have say 16 'smart mailbox buffer' locations. You could say "program smart-mailbox-buffer-location #3 to: receive on channels 0,1,2...,2047, and accumulate with the SUM operation". You could also write directly to these smart mailbox buffer locs (eg. you could write a '0' to start over with a new sum). Other accumulate ops could be AND and OR.

This would let you implement simplified 'neurons'; accumulate signals from 'dendrites' with SUM, and then write directly to simulate decay. This suggests of course that what you really want is a WEIGHTED sum accumulation; which can be expressed by letting the subscription be a mask of integers rather than just a bitmask.

oo, i like that. Each 'smart mailbox buffer' is doing a dot product between incoming signals and a weight vector. The fact that there is more than one 'smart mailbox buffer' location is like dendritic computation (eg. you can use arbitrary functions to combine information from different regions of the dendritic tree, where each region is just a linear sum of the synapses into that region).
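
a C sketch of one such 'smart mailbox buffer' slot (all names hypothetical; channel count scaled down for illustration):

    #include <stdint.h>

    #define N_CHANNELS 2048  /* illustrative; the text contemplates up to 64k */

    typedef struct {
        int32_t value;               /* accumulated value; directly writable, e.g. to model decay */
        int8_t  weight[N_CHANNELS];  /* per-channel weight; 0 = not subscribed */
    } smart_mailbox;

    /* called by the fabric when a message arrives on a channel */
    void deliver(smart_mailbox *m, uint16_t channel, int16_t payload) {
        if (channel < N_CHANNELS && m->weight[channel] != 0)
            m->value += (int32_t)m->weight[channel] * payload; /* the SUM accumulate op */
    }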

it doesn't quite allow for per-synapse dynamics though, unless you have enough 'smart mailbox buffer' locations to have one per sender -- in which case the 'smart mailbox buffers' would be like synapses, and the limited number of them is the constraint that you can only fit so many synapses on a neuron.

unfortunately if, per each 'smart mailbox buffer' location, you stored one weight integer per each other node, that would be 64k words per 'smart mailbox buffer' per CPU (since these guys are local to each CPU)! So i guess instead we just want a list of (channel, weight) pairs, and the list can only be so long.

i guess that really brings out how complex the brain could be; if each synapse on each neuron stores important state, and each neuron has tens of thousands of synapses, then we have ~64k words of memory per neuron just to store that state (otoh the 'word size' here is like 4 bits rather than 16 bits, but that's only a 4x savings in memory space; note that Sejnowski's lab found 4.7 bits of information per synapse in 2015).

here's a compromise; each CPU would have 4 'smart mailbox buffer' locations (4 regions of the dendritic tree). Each of these smart buffers is associated with a mask of 64k 4-bit entries, which is 32k bytes per buffer; so in total these masks (the synaptic weights) take up 128k bytes of memory. Since we already have 64k words (128k bytes) of local memory per CPU, that alone doubles it. Each 'smart mailbox buffer' has a programmable accumulate operation (overwrite, sum, and, or). AND and OR might be a little redundant here; they're really multiply and sum with normalization. In addition, each CPU has 8 smart buffer locations with 1-bit receive masks, with programmable accumulate operations (overwrite, sum, and, or), for another 64k bytes. So in total these masks consume 192k bytes (96k words) per CPU; that is, we've more than doubled the local memory requirements.

The 1-bit receive masks operate with a separate set of channels that carry 1-bit messages btw. Note that these can emulate the data-processing part of interrupts (using the OR accumulate operation) (but with interrupts, there is also the control-flow 'interrupt' part! but we can have a 16-bit interrupt mask for these channels!).

So the way it works now is:

total memory cost per CPU of this stuff is about 96k words (192k bytes). We also have the 64k words of local memory (both program and data memory).

---

consider adding instructions for higher-level interop with HLLs into which we are embedding.

---

y'know, if in Boot and BootX we specify that the registers are only required to hold 16 bits, then even if we emulate wider regs in code we may not have enough; if we have 16 regs + 16 stack locs, that's 32 16-bit regs, or 16 32-bit regs, or 8 64-bit regs (of which only 4 are actually registers, and the other 4 are stack locs). Otoh we do have addressing modes in BootX so we could maybe just do 64-bit computations in memory. Let's take a look at the addr modes again and see if they would be suited to that. (...) yeah, we have both displacement and indexed addr modes in BootX, so i guess we can use memory locations to do 64-bit math. Crisis averted (although inefficiently, i guess?)
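
for reference, the lowered operation itself is simple; here's a C sketch of a 64-bit add over four 16-bit memory words (little-endian limb order is my assumption, and this is plain C, not Boot syntax):

    #include <stdint.h>

    /* a += b, where each 64-bit value is four 16-bit limbs in memory */
    void add64(uint16_t a[4], const uint16_t b[4]) {
        uint32_t carry = 0;
        for (int i = 0; i < 4; i++) {
            uint32_t s = (uint32_t)a[i] + b[i] + carry;
            a[i] = (uint16_t)s;  /* low 16 bits stay in this limb */
            carry = s >> 16;     /* high bit carries into the next limb */
        }
    }

the inefficiency is all in the extra loads/stores and carry bookkeeping, not the logic.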

is it worth it? this makes 64-bit math much less efficient. Otoh the motivation for only requiring 16-bit is to allow massively concurrent systems (with millions of processors). That's probably worth it, given our goals here.

one remaining worry:

Displacement and indexed addr modes split the 4 bits in the operand_value in half, though, so in displacement, assuming that the first half indicates a register holding a pointer to the base of a 64-bit pseudoregister table, then the second half is only 2 bits, so we only get 4 64-bit registers this way. And we don't get the smallstack, so that's not doubled. Indexed, however, could be used to address many more pseudoregisters (up to 2^16), but at the cost of having extra constant loads every other instruction to load the index.

In addition, displacement REALLY only addresses 1 64-bit register, because remember that what we're talking about here is how easy it is to write universal programs that can do 64-bit math even on a 16-bit system, in which case each memory location is 16 bits and it takes 4 of those to make 64 bits, so all 4 displacements are just locations within one 64-bit pseudoregister. So basically we'd have to resort to indexed mode, with interleaved constants, to do 64-bit math.

Eh, probably worth it, given our goals here. Remember that all registers can hold native pointers, whatever size those are on the current platform, so on a 64-bit platform we have 32 spots for pointers. The 4 spots spoken of above are just 4 memory locations.

There is an alternative, though; give up predecrement/postincrement addressing mode and replace it with a displacement-segment address mode that uses a fixed register as base; this now allows displacements of a full 4 bits (16 displacements), which now allows addressing of 4 64-bit-registers-made-out-of-16-bit-ones instead of just 1. Not sure if that's worth it (although getting rid of predecrement/postincrement addressing mode is tempting anyway because it's mutating, which makes analysis ugly).

Another thing we could do is specify Boot as 16-bit but specify BootX as 64-bit. I'm not sure that's a good idea given our massive concurrency aspirations, though.

One issue is that if we specify registers as greater-of-16-bit-or-whatever-is-needed-to-hold-a-native-pointer, then should 64-bit arithmetic instructions use one register or auto-combine multiple ones when executing on a 64-bit platform? For uniform register use semantics across platforms i would think the answer has to be to auto-combine -- although maybe we don't need uniform register use semantics across platforms when doing >16-bit arithmetic.

What does WASM do? I think WASM has an infinite number of variables, rather than registers, and i think they are typed, so you tell it ahead of time which variables are 64-bit. Also, there are actually 2 WASM architecture variants, wasm32 and wasm64, which say how wide the pointer values are. This certainly shows the attraction of variables rather than registers, but i think that's too high-level (too hard to implement) for us -- implementing WASM is not 'dead simple' imo (if it were, we'd probably just use that rather than invent our own thing).

Another thing we could do is specify some registers as 16-bit and some as 64-bit. This seems a little wasteful, though -- it prevents straightforward compilation on true 16-bit systems, which is half the point.

Now even if, say, 64-bit arithmetic instructions use 4 consecutive registers, what do we do if those registers really are each 64-bits because the native pointer size is 64-bits? I guess we just specify that a precondition is that each of those registers holds a 16-bit integer value, not a native pointer, and that if that's not true the result is undefined.

i guess another solution is to do what many architectures do with floating-point registers, and have a second set of distinct registers with maximum width (64-bits in this case), and provide instructions to load and unload them into the ordinary registers. As ridiculous as this is, it may be the least-bad solution. But what about SMALLSTACK? Are there two SMALLSTACKS? I'm thinking that SMALLSTACK is just 16-bit-or-native-pointer and that's it. Or, again, we could combine adjacent items in SMALLSTACK and specify that any native pointers in there make the result undefined.

this whole mess is making me rethink the very idea of a 16-bit machine. Why not just make the registers 64-bits or at least 32 bits? Will it really limit concurrency so much?

what ARM64 does is have one set of 64-bit registers, but 32-bit instructions only read the bottom halves of these (but reset the top half to 0s upon write): http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801a/BABBGCAC.html
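
in C terms, the A64 rule on that page is that a 32-bit result is zero-extended on write-back (a small illustration, not ARM code):

    #include <stdint.h>

    uint64_t demo(void) {
        uint64_t x0 = 0xFFFFFFFFFFFFFFFFull; /* pretend x0 is an A64 X register */
        uint32_t w0 = (uint32_t)x0 + 1;      /* a 32-bit op reads the bottom half (the W register) */
        x0 = (uint64_t)w0;                   /* write-back zero-extends: x0 == 0 */
        return x0;
    }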

hmm... i guess if we don't want to force tiny 16-bit realizations to set aside extra storage for 64-bit words, and we don't want to splinter the architecture into word-size variants, what we should do is just have the 64-bit arith instrs operate on quadruplets of adjacent registers, in which case we only have 3 64-bit GPRs (b/c we have 4 special registers, so 12/4 left) + 4 more in SMALLSTACK.

---

[4]

---

another alternative: leave Boot with 16-bit registers, but specify BootX to have 64-bit registers. This destroys an easy compilation of BootX to Boot (but it's not that hard; we just need a register table in memory) but allows us to preserve Boot as good for 16-bit machines and BootX as an easy target for contemporary compilers. But it kind of destroys the mechanical sympathy of BootX.

otoh if BootX only has 4 64-bit regs, that kind of destroys any mechanical sympathy between most HLL programs and BootX (and hence with Boot). Of course, we knew that before; the idea behind BootX was supposed to be that it's really good at running small interpreters on tiny MCU-like CPUs within a massively parallel computer, not that it's supposed to be efficient at running ordinarily-structured desktop programs on a traditional (single, large) CPU.

could we just have 64-bit regs throughout? is it really that much of a cost in terms of power consumption, cost, die size? yeah probably, b/c there's probably a reason ppl don't make 64-bit MCUs [5].

having one 64-bit register correspond to two 16-bit ones could cause trouble in terms of making it hard to make compilers understand that when they write to the 64-bit register they invalidate multiple corresponding 16-bit ones.

[6] says that it is inefficient to have dependencies between registers.

otoh x86 already has different register names for the lower bits in each 64-bit register eg rax is 64-bit register A, eax is the lower 32 bits, and ax is the lower 16 bits, and al is the lower 8 bits; and ah is the second 8 bits (the high 8 bits of ax). [7]. So since x86 is popular, compilers probably have some way of expressing this already.

to maximize implementation freedom we would:

---

random page on differences between MCUs of different bit-width:

[8]

---

here is a page about a defunct effort to have the Linux kernel-space stack be 4k. The current default is 8k. The 4k effort failed b/c there were too many times when the 4k stacks overflowed (producing hard-to-diagnose errors, b/c a kernel stack overflow just silently corrupts memory). However the 4k effort went on for a while, so it's close.

Some things that were noted causing potential overflows:

"

...

do_select, do_sys_poll

The structure 'struct poll_wqueue' is a large data structure used for the select() and poll() system calls to manage a sub-set of the file descriptors being polled. This structure includes an array of wait queues which can be used immediately (without requiring or waiting for a memory allocation) for polling file I/O.

The number of entries in the array of wait queues can be controlled via macros in include/linux/poll.h nlm_clnt_reclaim

network lock manager for network filesystems. Not applicable to most embedded products (except possibly during development). security_load_policy

An selinux routine, not applicable to embedded.

...

To make matter worse, there's been stuff done to the storage stack that significantly increases stack usage since 4k stacks went away (e.g. the on-stack block plugging changes).

...

There's a good reason 4k stacks went away: it's simply not enough space for the deep 60+ function call stacks we see with even trivial storage stack configurations.

"

[9]

---


one huge important difference between Boot and ordinary assembly is that pointers are typed in Boot.

if we didn't allow pointers to be coerced to and from ints, then rather than saying that 'each register must be big enough to contain pointers' we could say 'there are 16 registers for ints, 16 for pointers, and 16 for floats, and they are all separate from each other'.

hmm... keeping ints and floats separate does seem to be what more recent ISAs do (e.g. RISC-V). And we might want to have 'fat pointers' instead of pointers, which can be really big, to store a triplet (base, bound, permissions); and if you want to be able to, say, recover an object base pointer from an item within the object, you might want to have a quadruplet, (base, offset, bound, permissions)), which means that a pointer might not fit in a reasonably-sized int anyway. And allowing pointers to cast to ints and back makes automated garbage collection/memory safety impossible (because you can smuggle pointers as ints and then the GC could collect the underlying object, then you transform the ints back into invalid pointers) and in any case doesn't make sense with fat pointers (in which case the hardware or OS must enforce that the pointers are read-only by the user app).

and keeping pointers separate makes it impossible-by-default to accidentally (or maliciously) coerce between pointers and ints. If they are all in one register bank then if the implementation wishes to prohibit this it must either do compile-time typechecking, or keep some extra state around at runtime to keep track of the current type of each register.

also i suspect that contemporary highly-optimized processors like to have their floating-point regs separate so that the floating-point arith unit can 'own' these for efficiency reasons (eg doing stuff in parallel, eg having them close on the die to the FPU logic, etc) (see [10]).

it seems like keeping three separate 64-bit register banks might be best for the goal of an interoperability layer with a performance-insensitive HLL (this allows garbage-collection/memory safety machinery to sit 'beneath' BootX; which is where it is anyways if we are embedding within a HLL not as a low-level plugin but via an interpreter written in the HLL itself). However having a single 16-bit register bank seems best for the goal of massive concurrency. In addition, 3x64-bit 16-register banks is 384 bytes (8 bytes * 16 * 3) (which is also > 1 cache line), which means that at least that much state must be preserved for each greenthread when it is paused, which means that 4 gigs of greenthread state storage can only store less than 12 million greenthreads (4*2^30/384). By comparison, 1x16-bit 16-register bank = 32 bytes (which is also <= 1 cache line btw), so 4 gigs could store up to 128 million greenthreads (4*2^30/32). ARM Cortex M0 only has about 16 32-bit registers (64 bytes of state).

hmm this feels like a real fork in the road. I think one choice (3 banks of 16 64-bit regs) is obviously better for interop with HLLs, and the other choice (1 bank of 16 16-bit regs) is obviously better for massive concurrency.

Is there a compromise solution? 1 bank of 16 64-bit regs, and 1 bank of 16 pointers. If pointers are 16-bit, that's 160 bytes of state; if pointers are 64-bit, that's 256 bytes. This gives up 'automatic type safety' between ints and floats, but we don't really care about that, it's pointers that we want to keep separate. It also gives up the efficiency of having separate floating-point regs that can be 'owned by' the FPU (but we gain the efficiency of having less overall state, which is more important for greenthreads).

A more drastic compromise is 1 bank of 16 16-bit regs, and 1 bank of 16 pointers. If pointers are 16-bit, that's 64 bytes total.

---

intel assembler syntax is said to be more readable. Comparison: https://www.imada.sdu.dk/~kslarsen/dm546/Material/IntelnATT.htm

---

" 1) Everybody has FP instructions if they care about FP at all; FP simulated via integer is uncompetively slow. As shown above, you have a huge range of choices for how much hardware you want in the actual implementation. 2) RISCs usually don't have hardware for sin, cos, log , etc, because you can do a pretty good job synthesizing them, and because they normally are packaged as out-of-line libraries, so there is little code-density impact. 3) RISCs generally don't have CALL instructions of the VAX ilk, although one might argue for load-multiple+store-multiple as instruction-density improvements.

" -- [11]

---

"

From: mash@mash.engr.sgi.com (John R. Mashey) Newsgroups: comp.arch Subject: Re: FPU in 64-bit MACHINE Date: 13 Jan 1998 05:50:16 GMT

In article <69et2p$dqp$1@newbabylon.rs.itd.umich.edu>, hasdi@umich.edu (Hasdi Rodzmann Hashim) writes:

> Organization: University of Michigan ITD News Server
>
> Assuming the FPU unit operates on 64-bit registers, in a 64-bit machine
> where the integer register size is also 64-bit:
>
> Is there any reason why FPU unit still must use a separate
> register set than other arithmetic unit's register set?
>
> Case point: Merced. The main register set is still separate from FPU's
> register. Is there any point to separate them now that their register
> sizes are identical?

This was discussed extensively a year or too ago, and the answer is still the same: a) With the same number of bits in an instruction format, you can staightforwardly double the number of architecturally visible registers.

	b) Integer program contain no FP code (by definition).
	FP codes normally contain substantial integer instructions for
	indexing, addressing, and counting, and such actions often offer
	substantial potential overlap with the FP operations.
	c) A combined register file either requires substantially more
	read/write ports, which further adds to the pressure of
	wide-issue, multiple-unit designs on register files, or
	else extra stall cycles for regsier reads/writes when there are
	conflicts, thus complexifying compiler scheduling.
	d) It used to be helpful to allow a well-integrated off-chip FP
	coprocessor; this is no longer very important.
	e) This can ease layout issues, since FP units are area-intensive,
	and it is nice to be able to keep the FP registers close to them
	to keep wire lengths down, and it is nice to keep the number of
	units on a bus down for speed.

The price of course is: e) You need more registers, since they are split. This costs hardware, and perhaps context-save time, although some designs use dirty bits or trap-on-use bits to avoid wasted save/restores of FP registers in integer-dominated multi-tasking workloads.

	f) int<->fp conversion instructions sometimes suffer extra cycles.
	g) You consume some more opcodes, if only because you cannot use
	the integer load/store instructions to directly load/unload the
	FP registers.

Well-known single-register-set machines include the VAX and Motorola 88100; separate register sets are used by at least the following: S/360; MC68K; Intel IA-32; PA-RISC; MIPS; SPARC; POWER+PPC; Alpha; IA64.

-- -john mashey DISCLAIMER: <generic disclaimer: I speak for me only...> EMAIL: mash@sgi.com DDD: 650-933-3090 FAX: 650-932-3090 USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey) Newsgroups: comp.arch Subject: Re: special registers, was TOP 10 MISTAKES Date: 7 Aug 1998 05:14:05 GMT

In article <6qd5jq$eg2@crl3.crl.com>, gherbert@crl3.crl.com (George Herbert) writes:

> The advantage is that you simplify some things: you only end
> up with one register file, and you additionally get bonus
> from being able to define in the compiler and on the fly
> how many registers are doing FP things and how many are
> doing INT things. Assuming you keep the same total number
> of registers, you get a net win on flexibility.

But, of course, you usually don't, that is, quite commonly, you have N-bit register fields, and if you have int & fp sets, you get twice as many registers.

> The primary disadvantage is that your register memory has
> to have more ports to use both int and fp simultaneously,
> which slows down all accesses.

1) And you usually get half the registers. 2) And you get to have some longer busses, since people like to have integer registers near the integer units, and FP registers near the FP units. 3) And you mix together operations that are mostly natural to do in one cycle, with those that are not, which can in some designs add to the port issue. 4) And you cannot play the tricks that many have done to avoid save/restoring FP registers on every context switch.

Anyway, one can make the following observation: 1) I would guess that most people who have actually designed new, real ISAs in the last 20 years have had some-to-much familiarity with the VAX ISA, which of course has one register set used for integer & FP. many microprocessor ISAs were designed on VAXen, compared to VAXen, etc.

2) Nevertheless, with the exception of the Motorola 88K, pretty much everybody chose to use separate integer & FP registers. This doesn't mean that it's necessarily right, but it is a data point that people who actually had to do it (fairly) consistently made the same choice.

-- -john mashey DISCLAIMER: <generic disclaimer: I speak for me only...> EMAIL: mash@sgi.com DDD: 650-933-3090 FAX: 650-969-6289 USPS: Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

" -- [12]

---

" Subject: Re: Speed, and its need ... From: mash@mash.engr.sgi.com (John R. Mashey) Date: May 30 1996 Newsgroups: sci.math.num-analysis,comp.arch

...

One more time: in the R4000, a 64-bit integer datapath cost less than 5%; in the R4400, it was more like 2-3%, and in an R4300i, it's hard to believe it's any more than that, and probably less. No one in their right mind would create yet another architectural version (i.e., R4K minus all 64-bit stuff) for a couple % of die space. " -- [13]

---

another good discussion as to why FP regs should be separate from integer regs (just touches on the same points as before, though): [14]

another one (not as good): [15]

also the risc-v spec says:

"We considered a unified register file for both integer and floating-point values as this simplifies software register allocation and calling conventions, and reduces total user state. However, a split organization increases the total number of registers accessible with a given instruction width, simplifies provision of enough regfile ports for wide superscalar issue, supports decoupled floating-point-unit architectures, and simplifies use of internal floating-point encoding techniques. Compiler support and calling conventions for split register file architectures are well understood, and using dirty bits on floating-point register file state can reduce context-switch overhead."

---

probably a lot of good stuff in here; this is some guy's notes on Usenet threads about computers that he found interesting:

[16]

...(later) although, paging through it a bit, i don't see much else of interest to me... i guess the topic that led me there (whether or not the FP registers should be distinct from integer ones) is something that, if it were not directly relevant to what i am doing right now, i would consider very boring.

---

What does Lua do for its stack and registers?

[17] says "Lua employs two stacks. The Callinfo stack tracks activation frames. There is the secondary stack L->stack that is an array of TValue objects. The Callinfo objects index into this array. Registers are basically slots in the L->stack array.". Also note that, below, that page says that register specifiers within bytecode instructions (A, B, C) are at least 8 bits (so 256 regs). TValue, or tagged value, is:

" /* Tagged Values. This is the basic representation of values in Lua, an actual value plus a tag with its type.

/* Union of all Lua values

  1. define TValuefields Value value_; int tt_

typedef struct lua_TValue { TValuefields; } TValue; " -- [18]

so it looks to me like the value part of a Lua TValue (as opposed to the tag part, which is an int) is a union which can be, amongst other things, a native pointer or an int.
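
for reference, that union in Lua 5.3's lobject.h is roughly:

    typedef union Value {
      GCObject *gc;    /* collectable objects */
      void *p;         /* light userdata */
      int b;           /* booleans */
      lua_CFunction f; /* light C functions */
      lua_Integer i;   /* integer numbers */
      lua_Number n;    /* float numbers */
    } Value;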

---

The JVM DOES have issues with 64-bit architectures making pointers take up lots of space; in fact they have this elaborate mechanism to compress some native pointers into 32-bits: [19]

---

some other points in favor of having only one bank of 16 16-bit regs (i'm writing this on May 11 2019 btw):

and some other points in favor of having 3 banks of 64-bit regs:

also, either way, note that:

---

the only ways i can think of that a VM can preserve memory safety (short of proving stuff about program code) are things like:

these are all heavyweight and unlikely to be used. But at least our architecture, with separate pointer opcodes, makes them possible.

our architecture also allows pointer access to be mediated by a garbage collector (read and write barriers) and allows the garbage collector to put metadata about the pointer into 'fat pointers'.

remember that if you care about security, the VM can't trust the program code not to corrupt native pointers. The above solns can be interpreted as, respectively:

note that even some of these allow one program running on the VM to mess up another program running on the VM (e.g. even if the program has no native pointers in its memory space, but only indices into an external array of native pointers, one program could mess up another program's indices)

the alternative is to make these impossible and specify that pointers are just ints, making us in a sense a "true (or at least conventional) assembly language" (if you believe, as i do, that one of the defining characteristics of 'true' or conventional assembly is that pointers are just ints).

as for needing to know pointer size for memory layout, we could just use sizeof() like C. This probably additionally means the ecosystem gets complicated, as ppl probably will want either something like C's #ifdef or something like Makefiles in order to have different behavior for different targets. But this might be tolerated in a low-level language like Boot.

---

so i guess one conclusion is that having a separate bank of pointer registers doesn't actually improve security, because you have to worry about the program blindly treating pointers-in-memory as ints. You might want a separate bank for implementation efficiency, or because you can increase the number of available (architecturally visible) registers with the same instruction encoding, but it doesn't (by itself) give security.

another conclusion is that if you are leaving open the possibility that pointers are fat pointers, or cryptographically signed pointers, you can't really pre-specify pointer size, which may well need to be greater than 64-bits in some cases. Which means that sizeof() or something like it (eg sysinfo()) is required.

---

so does this help decide the issue?

i guess we can say:

---

ok i think i have a tentative decision:

still deciding about integer register width. On the one hand, it seems silly to force stack slots to be 64-bits if they don't need to be. On the other hand, if 1 stack slot can accommodate a native pointer, then they'll have to be huge in many cases anyways. So the question is, do we use sizeof() or do we make 'slots' big enough for pointers?

another thing to think about is to specify that a pointer takes multiple stack slots. The cost of this is wasted space on machines with small native pointers. The benefit is saved space on machines with large native pointers (b/c non-pointer data doesn't have to be so large). I guess if we care about efficient implementation on 16-bit machines then we want to save space on small machines at the cost of wasting space on large ones. So we want to say that a stack slot can fit a native pointer. However, unless we use sizeof() for everything, not just pointers, this does push us to say that a stack slot can accommodate 32-bit ints instead of just 16-bit ints, b/c o/w on a 32-bit machine a 32-bit int will consume two stack slots = 64 bits -- and on a 64-bit machine 128 bits! So this really pushes us to have 32-bits be the fundamental size, not 16 bits -- which is more practical on contemporary machines anyway (even ARM Cortex M0s are 32-bit).

Saying that a stack slot is at least 16-bits and pointers take 2 stack slots allows efficient 16-bit variables on 32-bit machines, but makes 16-bit machines inefficient. If we only care about 32-bit machines, the additional complexity (and 2x reduction in how many 32-bit vars a constant displacement can skip over) probably isn't worth it, since the 16-bit variables won't be used much.

can we separate the pointers onto a separate 'pointer stack'? Probably not; this requires maintaining two frame pointers, one for the data stack and one for the pointer stack, which not only wastes space but also time (must update both of them upon function call/return, which is a bottleneck). In addition, this doesn't solve the problem of how to layout memory.

what does using sizeof() look like? this means that at compile time, things like offsets into the stack (to select a variable) are resolved to numerical constants depending on the target platform. Which means that the 'bytecode' emitted is not 'universal' in the sense that it is made for just one target platform, with one native pointer size, and cannot be reused on a different target with a different native pointer size. The assembly language source code is universal, however. Can we just include the arithmetic to be computed in the target, to make it universal? No: these will be variable-length arithmetic expressions, to compute how many slots to jump over when accessing stack variables, e.g. LOAD R2+ (sizeof(int) + sizeof(int) + sizeof(ptr)); those can't fit in a fixed-length instruction encoding (or indeed anything less than an AST, unless you use something weird like a special facility for lists of types referenced by offsets within instructions) -- and you don't want to explicitly compute it at runtime b/c the simplest common variable access would become super slow.

so, allowing sizeof() implies that programs in bytecode form are specialized for one target platform, rather than being universal.
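
the same effect is visible in plain C: a struct offset involving a pointer member folds to a different constant per target, and the compiled artifact bakes it in:

    #include <stddef.h>

    struct frame { int a; char *p; int b; };

    /* offsetof(struct frame, b) is folded at compile time: e.g. 8 on a
       typical ILP32 target, 16 on LP64 (8-byte pointer alignment), so the
       emitted code is not reusable across pointer sizes */
    size_t off_b = offsetof(struct frame, b);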

what was that thing ppl said about LLVM not being cross-platform? oh yeah here it is:

plChLlvmImpl.txt:* "LLVM IR is generally not considered to be a cross-platform way to distribute code. The problem does not lie with LLVM IR or its containing bitcode format (although you could use platform-specific features like 80-bit floats if you really wanted to). Rather, it has to do with the source language, particularly when the source language is C or C++. Most C compilers, LLVM's clang included, will define types and macros depending on the target platform and architecture, e.g., long, size_t. As you probably already know, the preprocessor resolves macros before compilation even begins, so right away your code becomes (potentially) platform-dependent. Similarly, size_t and friends are fixed way before LLVM bitcode generation begins in order to turn expressions like sizeof(long) into constants as well as just to produce more efficient code in general." -- [21]

another option is to have two kinds of pointers; internal pointers (which are just ints, and which are indices into the memory space of the Boot program, rather than the host memory space), and native pointers (which are opaque to Boot). This would mean that whatever machinery the VM provides for fat native pointers wouldn't get to be reused for the internal pointers; and also if they aren't fat pointers, the speed advantage of having pointers just be pointers would be lost (because dereferencing an internal pointer would at least involve the implementation adding it to the base address of Boot memory (even if not bounds checking it, which should be done)). Native pointers could be indices into an array of native pointers, like in WASM. And internal pointers could be of a known size, and small. I kind of like this approach, let's keep considering it.

so, so far we have two good options:

regarding the JVM, see:

" The word size must be large enough to hold a value of type byte, short, int, char, float, returnAddress, or reference. Two words must be large enough to hold a value of type long or double. An implementation designer must therefore choose a word size that is at least 32 bits, but otherwise can pick whatever word size will yield the most efficient implementation. The word size is often chosen to be the size of a native pointer on the host platform.

...the local variables and operand stack-- are defined in terms of words....

As they run, Java programs cannot determine the word size of their host virtual machine implementation. " [22]

let's explore the second idea a little, because i haven't thought about it much. It sounds like what WASM does. I think that WASM has function pointers which work like 'native pointers as opaque indices into an external array of host pointers'. Does it also have 'internal pointers'? I don't see any in [23]. But that's because they are just ints. In [24] you see that LOADS and STORES take "iPTR"s which are defined as "Within the signature for a linear-memory access instruction, iPTR refers an integer type with the index bitwidth of the accessed linear memory.".

note also that WASM supports (at least in theory?) multiple linear memory spaces. So i imagine you can call malloc to get another one? [25] says that in fact this isn't implemented yet, and right now there can be only one.

we could also have variants of LOAD/STORE (and addr modes) to support multiple pointer sizes; LOAD16, LOAD32, LOAD64 to load from 16-bit, 32-bit, 64-bit pointers (corresponding to 16-bit, 32-bit, 64-bit linear memory spaces). i don't think WASM supports this. But this would allow us to allocate 16-bit memory spaces for most objects without limiting object sizes to 64k.

---

so where are we right now?

as for the latter choice, i guess what we really need to decide is:

Is the Boot VM sitting above or below where we do garbage collections and (within-Oot) memory safety?

If the Boot VM is below all that, and the next layer up is where we write the garbage collector, then internal ptrs make sense. And this is what WASM does; the language runtime that compiles to WASM provides memory management in whichever way it likes (otoh WASM keeps talking about adding garbage collection in the future); the WASM runtime only checks that the program doesn't escape the sandbox (eg access memory beyond the bounds of the linear memory space). This way we can just expose the raw internal ptrs to the boot program (and so save memory if the boot program needs ptrs smaller than native ptrs).

If the Boot VM is above all that, and the Boot VM itself provides garbage collection and memory safety WITHIN Boot programs (via fat pointers or whatever), then slots that fit native pointers make sense; because now all the pointers, not just external pointers, may need to be fat pointers. So all of the pointers, not just external ones, must be opaque.

And my current guess as to the answer is that Boot is BELOW -- Boot is just supposed to be an easier-to-implement alternative to LLVM or WASM or RISC-V, because i think it would take me too long to implement any of those on each of my desired interop target platforms. And BootX came in because i couldn't square the circle of being both dead easy to implement, and also sufficiently expressive to do the sorta-intersection of LLVM/WASM/RISC-V instructions (also i wanted addressing modes) -- so i decided to make one language which was dead simple (Boot) and another one which can compile to that (BootX) yet which can incrementally replace its Boot code with platform primitives. Neither Boot nor BootX is supposed to be making the hard, possibly language-specific, decisions on data representation and memory management that language-specific VMs get into. The plan is to have something else, OVM, above Boot and below Oot Core to take care of the administrative nonsense.

---

so how would it work if we had multiple address spaces? WASM hasn't implemented this yet so we can't totally rely on them to provide a good design. And can we have multiple addr spaces of different sizes?

The advantage of multiple addr spaces of different sizes is that we could have 16-bit pointers where appropriate, and therefore save a lot of space (and cache space), while also staying closer to our goal of an ecosystem that could run on a massively concurrent system (which may have to be 16-bits); but still have 32-bit and 64-bit addr spaces for when someone wants to fit a large list in one address space. (even in our massively concurrent machine we'd probably have some large virtual 'address spaces' that map to external memory storage, even if that storage is actually implemented as a little bit of memory spread over many small NUMA memory banks; whether or not those 'address spaces' are actually seen as address spaces from the point of view of the Boot ISA is an open question).

If we have multiple mem spaces we probably need to link between them with external/native pointers. So we need some sort of data type as large as if not identical to 'native pointer' to identify mem spaces. Note that this external pointer could itself be a fat pointer that identifies both the address space and a position within that mem space.

Now for LOAD and STORE do we want to have TWO pointer operands, one to identify the mem space, and one to identify the location within it? This may not be crazy; after all, we have three operands to work with, and we are currently using the third as a displacement. But what about when we have indirect addr modes (and indexed addr mode, etc) on other instructions? I guess this complexity is another reason to think about getting rid of address modes :). But is there another solution? Some ideas:

Sadly, my immediate first impression is that the segment register idea is the best here. A 'mode switch' isn't really any different from this. Different addr modes would be too many addr modes and in any case doesn't help to identify the specific mem space. A segment register is strictly more expressive than an unchangeable default mem space. So the remaining choices are:

Assuming it's a native pointer means that addr modes aren't useful for intra-mem space addressing using internal ptrs. So that leaves:

But on second thought, ruling out native ptrs actually seems pretty limiting, because (if addr mode usage is desired) it rules out an implementation that mallocs small data structures from the platform and then has links in those data structures to other data structures from different mallocs. And this pattern might be required for interop with a HLL.

But Boot can't really be what we write interop in anyhow, right? Because we'd have to match the host HLL's data structures byte for byte, so either we'd write the interop in the HLL, or we'd write it in something like C that has sizeof(). If we're using the Boot implementation of the OVM on some platform, then by necessity our interop is limited to various gateways (provided as library functions and I/O primitives), rather than directly sharing memory structures. If we want to share memory structures we'd have to port the OVM to the platform directly. So, if we're using Boot at all, our situation is more like WASM's.

But.. one could imagine starting with the default Boot implementation of OVM and then writing platform-specific 'plugins' for it, ALSO IN BOOT, that provide interop --- IF boot can support dealing with native ptrs and byte-for-byte data layout.

Hmm.. this whole thing makes me lean towards just saying, y'know what, Boot bytecode doesn't have to be cross-platform, it can be specific to one native pointer size. There can be a simple sizeof()-like facility in the assembly syntax (mb as simple as the user defining a constant like 'psz') and in sysinfo() to indicate what the pointer size is on the current target. Then all these other problems go away; we can make memory and stacks byte-addressable rather than using slots and we can have the int data types specify exact widths using i16, u16, i64, etc.

hmm.. actually i do think that's the most pragmatic solution.

---

as for the register bitwidth, we could do 16-bit registers, but that means that in the common case where 32-bit values are desired, there are only 8 registers rather than 16, which is a lot of register pressure. So let's get real; although i believe that ultimately massively concurrent computers will be 16-bit architectures, that's not going to happen for a while, and in the meantime ARM Cortex M0, JVM, WASM, etc are all 32-bit. So let's make the registers 32 bits.

---

so current choices seem to be:

---

if some of the platforms in the concordance (risc-v, wasm, llvm, jvm, arm cortex, luajit2, cil) trap on signed overflow and some don't, which should Boot do? And if some of the platforms in the concordance constrain indirect jumps with a compile-time enumerated list of potential jump targets and some don't, what should Boot do?

i think Boot should (a) trap on signed overflow and (b) constrain indirect jumps with an enumerated list. This will make it easier to port Boot to all platforms by just lowering the Boot instruction to (possibly a short sequence of instructions containing) the similar platform instruction.
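
lowering (a) is cheap on platforms with overflow intrinsics; a sketch using the GCC/Clang builtin (which does exist with this signature), with trap() standing in for the VM's real trap path:

    #include <stdint.h>
    #include <stdlib.h>

    static void trap(void) { abort(); } /* placeholder for the real trap handler */

    int32_t add_trapping(int32_t a, int32_t b) {
        int32_t r;
        if (__builtin_add_overflow(a, b, &r))
            trap();   /* signed overflow -> trap, per decision (a) above */
        return r;
    }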

---

should we call Boot "Boot Assembly" or "Boot VM" to make it easy to search for it?

---

" /* StoneKnifeForth? (currently) only uses the following 23 * instructions, in decimal: * * - 15 157 192: setge %al * - 15 182 0: movzbl (%eax), %eax * - 15 190 192: movsbl %al, %eax * - 41 4 36: sub %eax, (%esp) * - 80: push %eax * - 88: pop %eax * - 89: pop %ecx * - 90: pop %edx * - 91: pop %ebx * - 116: jz (short, PC-relative) (1-byte operand) * - 117: jnz (short, PC-relative) (1-byte operand) * - 129 237: subtract immediate from %ebp (4-byte operand) * - 133 192: test %eax, %eax * - 135 236: xchg %ebp, %esp * - 136 8: mov %cl, (%eax) * - 137 229: mov %esp, %ebp * - 139 0: mov (%eax), %eax * - 143 0: popl (%eax) * - 184: load immediate %eax (e.g. mov $12345678, %eax) (4-byte operand) * - 195: ret * - 205 128: int $0x80 * - 232: PC-relative call to absolute address (4-byte operand) * - 254 200: dec %al * * Moreover, jz and jnz are always preceded two instructions * previously by a test instruction, and setge is always preceded two * instructions previously by a sub instruction, so of the flags, only * ZF and SF are used, and their state is never saved or restored or * transferred across jumps. * */ " -- [27]

" Register-use conventions: `%eax` — the top of the operand stack `%esp` — normally, points to the rest of the operand stack in memory, in the usual `%esp` way: growing downward, predecrement, postincrement. Temporarily swapped with `%ebp` during `call` and `ret` instructions. `%ebp` — normally, points to the return stack in memory, in the usual `%esp` way. `%ecx` — occasionally used as a scratch register Other registers are only used to pass arguments to Linux system calls. ) " -- [28]

so:

---

"Some instructions that proved to be rarely useful are not supported in 64-bit mode, including saving/restoring of segment registers on the stack, saving/restoring of all registers (PUSHA/POPA), decimal arithmetic, BOUND and INTO instructions, and "far" jumps and calls with immediate operands." -- [29]

---

another awesome Mike Pall (LuaJIT 2 guy) (mikemike on HN) creation:

https://luajit.org/dynasm.html

---

" mikemike 31 points · 9 years ago · edited 9 years ago

LuaJIT's interpreter is fast, because:

    It's written in assembler.
    It keeps all important state in registers. No C compiler manages to do that on x86.
    It uses indirect threading (aka labeled goto in C).
    It has a very small I-cache footprint (the core of the interpreter fits in 6K).
    The parser generates a register-based bytecode.
    The bytecode is really a word-code (32 bit/ins) and designed for fast decoding.
    Bytecode decode and dispatch is heavily optimized for superscalar CPUs.
    The bytecode is type-specialized and patched on-the-fly.
    The dispatch table is patched to allow for debug hooks and trace recording. No need to check for these cases in the fast paths.
    It uses NaN tagging for object references. This allows unboxed FP numbers with a minimal cache footprint for stacks/arrays. FP stores are auto-tagging.
    It inlines all fast paths.
    It uses special calling conventions for built-ins (fast functions).
    Tons more tuning in the VM ... and the JIT compiler has its own bag of tricks.

E.g. x=x+1 is turned into the ADDVN instruction. This means it's specialized for the 2nd operand to be a constant. Here's the x86 code (+ SSE2 enabled) for this instruction:

Prologue for type ABC instructions (others have a zero prologue):

    movzx ebp, ah             ; Decode RC (split of RD)
    movzx eax, al             ; Decode RB (split of RD)

The instruction itself:

    cmp [edx+ebp*8+0x4], -13  ; Type check of [RB]
    ja ->lj_vmeta_arith_vn
    movsd xmm0, [edx+ebp*8]   ; Load of [RB]
    addsd xmm0, [edi+eax*8]   ; Add to [RC]
    movsd [edx+ecx*8], xmm0   ; Store in [RA]

Standard epilogue: decode + dispatch the next instruction:

    mov eax, [esi]            ; Load next bytecode
    movzx ecx, ah             ; Decode RA
    movzx ebp, al             ; Decode opcode
    add esi, 0x4              ; Increment PC
    shr eax, 0x10             ; Decode RD
    jmp [ebx+ebp*4]           ; Dispatch to next instruction

Yes, that's all of it. I don't think you can do this with less instructions. This code reaches up to 2.5 ipc on a Core2 and takes 5-6 cycles (2 nanoseconds on a 3 GHz machine). "
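Since the quote credits indirect threading ("labeled goto in C"), here's a tiny hedged sketch of that dispatch style for a made-up two-opcode bytecode, using the GNU C computed-goto extension (everything here is illustrative, not LuaJIT's actual code):

    #include <stdint.h>
    #include <stdio.h>

    /* toy interpreter: opcode 0 = increment accumulator, opcode 1 = halt */
    static int run(const uint8_t *pc) {
        static void *dispatch[] = { &&op_inc, &&op_halt }; /* label addresses */
        int acc = 0;
        goto *dispatch[*pc++];     /* initial dispatch */
    op_inc:
        acc++;
        goto *dispatch[*pc++];     /* decode+dispatch replicated in each handler */
    op_halt:
        return acc;
    }

    int main(void) {
        const uint8_t prog[] = { 0, 0, 0, 1 };  /* inc, inc, inc, halt */
        printf("%d\n", run(prog));              /* prints 3 */
        return 0;
    }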


"

A64 Instruction Set

The A64 instruction set is supported by the Armv8-A architecture. Key features of A64 include:

Major Features for Engineers in A64

The A64 instruction set introduces a number of changes from the A32 and T32 instruction sets. These changes include:

New instructions to support 64-bit operands
    Most instructions support 32-bit or 64-bit arguments.

Assumes 64-bit address size
    All addresses are assumed to be 64 bits in size. LP64 and LLP64 are the primary data models targeted.

Reduced conditional instruction set
    The set of conditional instructions has been reduced down to cover branches, compares and selects only.

No arbitrary length load/store multiple instructions
    Added LD/ST ‘P’ for handling pairs of registers.

"


"

@rdtsc: going from 8 to 16 architectural registers gives big improvements in the amount of spills/reloads for typical code, according to data from simulations in a paper linked from this answer. It affects code size, instruction count, and how important low-latency store-forwarding is. 16->32 is a much smaller effect. AFAICT, 16 architectural registers is a good choice for hardware with register renaming to remove WAR and WAW hazards. – Peter Cordes Sep 9 '16 at 18:39

"

---

should look at Dis for the level above Boot

http://www.vitanuova.com/inferno/papers/dis.html

---

Over the last few days i've been puzzling over a design decision:

In order to allow code to port to different platforms with different notions of pointer (fat pointers (e.g. 256-bit fat pointers), ordinary pointers, 32-bit pointers, 64-bit pointers), while allowing code to at the same time precisely specify the bitwidth of integer operations, i decided to have multiple banks of registers; one for pointers, one for ints (and maybe one for floating points). This has the side benefit of helping instruction encoding density, because now you can have more (total) registers for fewer encoding bits.

But it allows us to perhaps have TOO MANY total registers. Say that the underlying platform is a CPU with 16 registers. If Boot has 2 banks of 16 registers each then they certainly can't fit in the CPU's physical registers. This closes off the option of having an implementation be a dead simple compiler to native instructions with no runtime whatsoever.

So one thing we could do is drop down from 16 registers per bank to 8 per bank (x2 banks = 16 regs total). We want each bank to have registers for SMALLSTACK and TOS (smallstack slots will be large enough to hold a native pointer or an int), and we want a 0 register in the int reg bank and a PC in the addr reg bank, so now we have 16 regs total, of which 3 in each bank are special. So we have 5 regs in each bank for program use, or 10 total. Also, note that the underlying implementation must save the PC, and SMALLSTACK, and probably should save TOS, but that's only 3 regs, less than the 6 specials that the Boot program sees, which means that if the platform has 16 registers then only 13 are required for Boot data, leaving the implementation 3 to work with (or possibly letting us do the same thing in an architecture like ARM32, which has 13 GPRs (R0 thru R12), if we can find a way to operate without any temporary register).

In terms of instruction encoding, now we only need 3 bits for operands, not 4, so we save 2 bits (because the first operand field only got 3 bits anyways in the old plan). But otoh the number of extra instructions we gain by compressing them into subinstructions of a 'more' instruction is now 8 + 8 + 8 instead of 16 + 16 + 16, so we should probably spend these extra 2 bits expanding the opcode field from 4 bits to 6. Now we have 64 instructions with three operands, instead of 16 in the old plan, and also we got rid of the constraint that the first operand of the 3-operand instructions can only address half the registers. And this is before even using our 'more' trick. Conceivably there are some situations in which the 'more' trick could make implementation more complicated, so maybe we can dispense with it (or reserve it for implementation use, which lets the implementation do its own 'more' trick if there are no problems with that).

At first glance, having 3 operands for everything sounds great, but it does give me pause because of the existence of 'compressed' instruction sets like RISC-V C which explicitly add 16-bit-encoded variants of 32-bit-encoded instructions where some operands are identical etc like in 2-address instruction sets.

And although we have 10 free registers in total for Boot programs, only 5 of those can be used for a single purpose (like pointers), which makes them much less useful than 10 GPRs. And because we duplicated SMALLSTACK and TOS we lost 2 registers; 12 registers would have been much better than 10 (except for the potential need for implementation registers, as noted above). And we are passing up the opportunity to have 32 registers in a 16-bit instruction encoding! How awesome would that be (for code density)!

Also, having so few free registers of each type almost guarantees that most high-level programs will not be able to identify 'variables' with 'registers', as we'll be constantly swapping stuff in and out of the few available registers. Which means that using Boot creates a 'bottleneck' that 'scrambles' the semantics of code, making it unsuitable for use as an IL if we want to (a) compile ('transpile') Oot into another HLL, such as Python, (b) apply some higher-level optimizations (LLVM style) AFTER the Boot stage.

So i think we have a fork in the road here, with two or three alternatives:

1) Have one bank of 16 registers
2) Have two banks of 16 registers each
3) Have two banks of 8 registers each

And we have various design goals to consider. I think the most important ones are:

1) Having many registers, for code density, and to preserve source semantics as much as possible.
2) Having few registers, for dead-simple compilation onto CPU ISAs.
3) Having banks of registers to allow the size of native pointers to vary independently of arithmetic precision.
4) Using the 16-bit encoding for 'compressed' instructions with properties like having some of the operands be identical.
5) Not requiring implementations to understand the 'more' trick, or to deal with non-3-address code, for a slight simplification (although imo this is not a very important simplification, but maybe i'm wrong).
6) Having 16 more registers increases the size necessary to save upon context switches, which may matter for massive concurrency. Having fewer registers fits more with our 'lean, for tiny CPUs in a massively concurrent system' concept.

Note that i don't think having 3-address code is that important for code density here; if you only have 5 registers of each type, you aren't going to be able to preserve things in one register and copy them to another one to be modified as much anyways, so i don't think 3-address code will help much with this few registers. It's only good for simplicity in this case imo.

I think i'm mostly decided that we want goal (3), so alternative (1) is off the table. So we just have to decide whether each bank should have 8 or 16 registers. I think the choice can be summarized as:

1) Have two banks of 16 registers each, for awesome code density, and preservation of source semantics in many cases (after all, most procedures will have fewer than 13 variables).
2) Have two banks of 8 registers each, for dead-simple implementation and fast compilation onto contemporary native processor ISAs.

One thing to consider is, if BootX? gets 16 registers per bank but Boot only gets 8, then we lose the property of being able to easily incrementally extend some of the Boot implementation to BootX? implementations. Also, we'll never have as awesome code density if we need to go to 32-bits to use the other 16 registers.

This is a tough choice because:

---

I wonder, is it worth considering mixing these? One idea to mix:

Two banks of 8 regs each; BUT have optional support for the 'more' instructions with a different instruction encoding.

Well, not really, because the MORE instructions would need 2 of the opcode bits for their other encoding. So we can't do this unless we only have 16 opcodes in the base 8-reg instruction set.

Another idea to mix: use alternate instruction encodings. Instead of having an 8-bit encoding, have 2 16-bit encodings. One of them uses 2 encoding bits i guess? I guess that would be the 8-registers-per-bank one? So that one only gets 32 opcodes now?

Seems feasible but we'll have to see if we can really fit in 32 opcodes. Sounds really tight to me (could always use the MORE trick to get 8 + 8 + 8 = 24 additional opcodes though), and also kinda redundant.

---

Even if we don't mix it, i can always just say, hey, there are these 2 separate instruction encodings, that can't be mixed (without some out-of-band way of identifying which one you are using), that are good for separate purposes.

---

So i guess i may as well keep stewing over this while i finish ootAssemblyConcordance.

---

i guess two things to consider are:

---

should also consider what LuaJIT?2 does vs. BootX?:

BootX? has 4-bit operands and 3-bit addressing modes and a 7-bit field mixing opcodes and addressing modes. LuaJIT?2 has 8-bit operands, no addressing modes, and an 8-bit opcode.

So, should the next step after Boot (the 32-bit encoding) be to go to ~256 'registers' (local variables instead of registers), or to add addressing modes? Or somewhere in between (e.g. 5 bit operands, 2 bit addressing modes)

---

OK, yknow, actually i do think we can do without 8 addressing modes. Right now, the 8 addr modes in the BootX? proposal are:

Out of these, the last four are all tricky for one reason or another:

So let's just get rid of these for now. Maybe they belong in the higher-level 64-bit encoding/in OVM, but probably not in a 'Boot'-style assembly language.

---

So now we have space for 5-bit operands in BootX?. Which means that, in order to really make use of the 'smallstack' addressing mode, we should have a 32-item smallstack. But that poses a problem if we want to cache the smallstack entirely in registers in a 32-register machine. And, even if we have more registers in BootX? than in Boot, do we want a larger SMALLSTACK? If the BootX?->Boot compiler has to map a 32-item stack onto a 16-item stack, it has to predict when stack spilling needs to occur, which is annoying (it has to compute stack maps etc).

My current feeling is that we should specify a 32-item smallstack for Boot. Even if there are 32 registers, some of those could cache the top items in SMALLSTACK, and the other items could be spilled (although that sorta defeats the purpose of caching any of them i guess, except when SMALLSTACK mode is being used to access them directly).

One issue though is in context switches. Now the context switch overhead is much larger, especially if we allow BootX?'s smallstack addr mode to write past the end of the stack. Hmm.. can we find another use in BootX? for the last 16 operand values when applied to SMALLSTACK addr mode? Probably... we could even reintroduce the constant table... or displacement mode on a subset of registers, with a fixed displacement... oo, OR we could select a distinguished register (probably pointer-R3), treat it as a memory stack, and do indirect addressing to a displacement from that (so this is 'stack mode' on memory, but can also be used as displacement mode for whatever is in that register; so we could do -7..+8 displacements; actually 0 is kinda useless, so we can do -8..+8).
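A hedged sketch of what that operand decoding could look like (names and exact mapping are mine): a 5-bit operand in SMALLSTACK mode where values 0..15 pick a smallstack slot and values 16..31 encode a nonzero displacement in -8..+8 off pointer register r3:

    #include <stdint.h>

    typedef struct { int is_stack; int slot; int disp; } SSOperand;

    static SSOperand decode_ss(unsigned op5) {      /* op5 in 0..31 */
        SSOperand r = { 0, 0, 0 };
        if (op5 < 16) {                             /* smallstack slot */
            r.is_stack = 1;
            r.slot = (int)op5;
        } else {                                    /* r3 displacement */
            int d = (int)op5 - 16 - 8;              /* 16..31 -> -8..7 */
            r.disp = (d >= 0) ? d + 1 : d;          /* skip 0: -8..-1, +1..+8 */
        }
        return r;
    }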

hmm okay now i'm back to recommending a 16-item SMALLSTACK.

---

incidentally, i thought of a good reason to have an architecturally visible cyclic stack for once:

with smallstack mode, you can sorta use the stack like registers. But sometimes you also want to use it like a stack. So what if you used the top-most items as a stack and the bottom-most items as registers? Now if you push onto the stack, the bottom-most items simply become deallocated.

This would also more cleanly resolve the annoying semantics around the question of:

i don't think this is a good enough reason. Instead, we should just specify things so that it CAN BE but DOESNT HAVE TO BE implemented as a cyclic stack:

---

since there is an integer TOS register, what if TOS holds a native ptr and those are too big for our integer registers?

i think in this case reads or writes to integer TOS should be undefined.

---

also, i guess an implementation might have a hidden register with the smallstack depth (or alternately, with the base memory address of the bottom of the smallstack; the SMALLSTACK register could hold a pointer to TOS, or actually to the second item on the stack, if TOS is in the TOS register). Should we expose this? That might help to do fast context switches.

No, the argument above about cyclic stacks suggests that we should NOT require the stack depth to be readable; cyclic stacks don't keep track of a stack depth, that's one of their advantages.

---

and if we can spare one bit from the opcode (so only 32 opcodes instead of 64) then we can have 2 bits for the encoding bit, letting us have a second 16-bit encoding. But since the second 16-bit encoding also needs 2 bits for the encoding bit, that may not be so useful.

---

so to summarize the last few sections:

---

we should have a 'call external function using native ABI' instruction, and also, in order to accept calls into Oot code FROM native fns, a 'prolog for native ABI' and 'epilog for native ABI' (and mb a 'prolog' and 'epilog' for our own ABI).

should we have a datatype 'native int' and conversions to/from it? Probably; it might be important for interop, if we want to pass around shared persistent data structures with native code.

---

i kind of like the following instructions even tho they are unpopular:

---

MoarVM?'s Expression Trees have kind of the minimalistic macros that i probably want for my BootX? macro assembler:

" The syntax used for expression templates is derived from LISP.

    Lists are the basic constructs of the language. A list is delimited by an opening '(' and a closing ')' parenthesis. Lists can be nested. Items in the lists are separated by whitespace.
    A word is anything that matches the perl regular expression [^\s\(\)#"'].
        A word that consists only of alphanumeric symbols (and possibly underscore) is called a symbol: sp_p6oget_o. The first symbol in a list is generally its operator.
        A word suffixed by : is generally a keyword, e.g. let:
        A word prefixed by $ is a reference, either to a MoarVM instruction operand ($1, $2) or to a declared name ($val).
        A word prefixed by ^ invokes a list macro, which can be declared with the macro: keyword.
        Macro parameters are prefixed by , (e.g. ,obj), and they are replaced with whatever symbol or list is 'bound' to them when the macro is invoked.
        Parameter macros are indicated by a prefix &, e.g. &sizeof. They are always substituted with a function macro: (&sizeof foo) becomes sizeof(foo). These must form constant expressions at compilation time.
    A string is anything delimited by " -- but note that escape sequences are not supported, so "foo " bar" doesn't do what you might think it does.

An example of a template definition would be as follows:

(template: sp_p6oget_o
  (let: (($val (load (add (^p6obody $1) $2) ptr_sz)))
    (if (nz $val) $val (^vmnull))))

The let: keyword is not an operator but a special construct for the template precompiler. It allows the declaration of names for values that are computed and reused in further operators.

And an example of a macro definition would be:

(macro: ^getf (,object ,type ,field)
  (load (addr ,object (&offsetof ,type ,field))
        (&SIZEOF_MEMBER ,type ,field))) " -- [30]

MoarVM?'s set of expression tree operators is also pretty cool and small (44 instructions):

---

So here is how i'm currently thinking of the levels of Oot intermediate languages:

should 'binary structure' (e.g. stuff like constant pools) be added in BootX? or OVM? Probably OVM.

For porting, the idea is to initially bootstrap by porting Boot (this should be dead easy; a few days' work). Then, if the porter has more time and wants a more optimized port, they port BootX? (so that Boot is no longer used on this platform/implementation) (this should be easier than porting WASM, RISC-V, LLVM, or JVM). Then, if the porter has even more time and wants better interop or a more optimized port, they port OVM (so that neither Boot nor BootX? are used).

Ideally i'd like one to be able to incrementally extend an implementation of Boot to BootX? and then to OVM, but i'm not sure if i can accomplish that.

---

yknow, i guess most Boot implementations WILL have to keep track of the stack depth; because, assuming the stack is stored in memory, even if it's a ring/circular stack with no branching for checking overflow/underflow, still, upon PUSH/POP you have to compute (stackPos mod stackSize), and that means you have to store (i) a pointer to the stack base address and (ii) a pointer to the current TOS (or equivalently, (i) the stack base address and (ii) the current stack depth; or equivalently, (i) a ptr to TOS and (ii) the current stack depth).

Which means that a Boot implementation consumes at least 14 registers, not 13:

1) PC
2) current TOS stackptr
3) current stack depth
4) current TOS
5) 5 integer GPRs, 5 ptr GPRs

which means that with 16 GPRs on the platform, the implementation only has 2 registers to work with.
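To make the PUSH/POP bookkeeping concrete, here's a hedged sketch (my names, nothing normative) of a 16-slot circular smallstack with a cached TOS; a power-of-two size turns the mod into a mask:

    #include <stdint.h>

    enum { STACK_SLOTS = 16 };                 /* must be a power of two */

    typedef struct {
        uintptr_t slots[STACK_SLOTS];          /* ring buffer in memory */
        unsigned  pos;                         /* ring index of the slot below TOS */
        uintptr_t tos;                         /* cached top-of-stack 'register' */
    } SmallStack;

    static void push(SmallStack *s, uintptr_t v) {
        s->pos = (s->pos + 1) & (STACK_SLOTS - 1); /* stackPos mod stackSize */
        s->slots[s->pos] = s->tos;                 /* spill old TOS into the ring */
        s->tos = v;
    }

    static uintptr_t pop(SmallStack *s) {
        uintptr_t v = s->tos;
        s->tos = s->slots[s->pos];                 /* refill TOS from the ring */
        s->pos = (s->pos - 1) & (STACK_SLOTS - 1);
        return v;
    }

Note that this variant tracks a base (the array) and a position but no depth, and overflow silently wraps, cyclic-stack style; a depth-checking variant would need the extra state discussed above.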

---

so our Boot ISA is shaping up to look like this:

32-bit ints

if 32 instrs:

cpy
pushi (push 9-bit immediate onto stack)
add
sub
mul (result is lower 32 bits of 'full' result)
cpy-ptr
ptr-add
ptr-int-sub
ptr-ptr-sub
jmp
jmpi ?? (for portability, this should be constrained... mb use semi-mandatory ANNOTATEs immediately following to do so, possibly preceded by a JMP? i guess a JMP doesn't end a basic block?)
bnz
and
or
xor
not? (note: bitwise, but can also be boolean if -1 is the representation for TRUE)
shl
srs
sru
ld
st
ld-ptr
st-ptr
ldh
sth
ldb
stb
fence
??? (?undef? addi? ?ldi 6-bit immediate? break? nop? comment? print? impdep?)
xcall (platform ABI possibly-external call; arguments are function number, number of arguments (is that needed? probably yes...), ?where to put the returned value? mb 3rd argument is WHERE the arguments being passed are: on smallstack, or in memory pointed to by a pointer (which is needed for external functions taking more arguments than fit on SMALLSTACK; note: passing arguments on a memory stack could be the same as 'in memory pointed to by a pointer'))
syscall (we can have 8 syscalls with one input and one output argument; in, out, prolog (platform-native ABI function prolog), xret (platform-native ABI function epilog and return), malloc, mfree, and more could be some of these; halt, mfree take one argument)
annotate

cpy 0 0 can be the BAD/ILLEGAL instruction. The idea of having control flow be low opcodes clashes with the idea of having the same opcodes for the same instructions in Boot and BootX?, so let's give up the control-flow-in-low-opcodes.

if 64 instrs:

the above, plus: call, calli, ret, xcall, xret, nop, halt (?), break, beq, bne, blt, bge, clt, ...

hmm, looks like we can fit in 32 after all (if there are only 32-bit signed ints, no floating-point or 64-bit)

---

yknow, even if you have 32 registers, it probably doesn't make sense to cache the entire 16-item stack in registers, because registers can't be indexed, so pushing and popping the stack would either mean 16 or so instructions to move things through the registers, or it would mean 16 or so copies of the stack access code to deal with which register is at the top of the stack. So it's okay to make SMALLSTACK be 32 items. In which case we can use SMALLSTACK access in BootX? the way we want to, unchanged.

otoh i'm growing attached to using the other 16 SMALLSTACK addr mode entries as r3 displacements. So let's do that.

---

so i think it'll be like this:

Boot:

types: native ptrs, 32-bit signed ints

registers and stack:

fixed-length encoding: 16 bits: 2 encoding length bits (always 10), 5 opcode bits, 3x3 operand bits

encoding is little-endian. The above listings of fields in the encoding go from LEAST-significant bit to most.

BootX?:

types: native ptrs, 32-bit, 64-bit signed ints, 32-bit floats, 64-bit floats

registers and stack:

addr modes: immediate, register, (smallstack-as-register, r3 displacement), register indirect

encoding (no alignment restrictions):
8 bits: 1 encoding bit (always 0), 128 'shortcut' ops
16 bits: Boot
32 bits: 3 encoding bits (always 110), 6 opcode bits + 2 opcode addr bits (160 instructions; immediate + half of the others are instructions), 3x(5 operand bits + 2 addr mode bits)
note: instructions with a bit-prefix of 111 are RESERVED

---

ovm must support weak references

---

yknow, i think it's really too much trouble to compile BootX?'s 32-register banks down to Boot's 8-register banks. Even with a fixed assignment of the first 8 BootX? registers to Boot registers, this requires a runtime, that is, an initial memory allocation, to have a place to store the other 24 registers. If we're going to do that anyways, just make it a VM (Boot) builtin facility. So have two MOV instructions (cpy-to and cpy-from) to move stuff from the wider 32-register set to the 8-register set, and back.

Each of these MOVs will have one 6-bit argument and one 3-bit. So we have a spare bit. So may as well use the spare bit to provide direct SMALLSTACK-as-register access too. So in that case why not have SMALLSTACK just be 32 items. And in that case, forget about the r3 displacement addressing mode in BootX?... wait actually we would need a second pointer-copy too. So just use the extra bit to distinguish between cpy-to and cpy-from.
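Here's a hedged sketch of that cpy-to/cpy-from pair (struct layout and names are mine; the 5-bit index below is the 6-bit operand minus the spare bit that distinguishes the two directions): the first 8 registers of a bank stand in for 'real' machine registers, and the other 24 live in a runtime-allocated spill area:

    #include <stdint.h>

    typedef struct {
        int32_t reg[8];    /* the 8 directly addressable registers of the bank */
        int32_t ext[24];   /* registers 8..31, backed by memory */
    } IntBank;

    /* cpy-from: copy any of the 32 registers into one of the 8 direct ones */
    static void cpy_from(IntBank *b, unsigned dst3, unsigned src5) {
        b->reg[dst3 & 7] = (src5 < 8) ? b->reg[src5] : b->ext[src5 - 8];
    }

    /* cpy-to: copy one of the 8 direct registers out to any of the 32 */
    static void cpy_to(IntBank *b, unsigned dst5, unsigned src3) {
        if (dst5 < 8) b->reg[dst5] = b->reg[src3 & 7];
        else          b->ext[dst5 - 8] = b->reg[src3 & 7];
    }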

So now the Boot instructions look like:

cpy
pushi (push 9-bit immediate onto stack), or ldi (load 6-bit immediate into register) (note: or ldi, large immediate format?)
add
sub
mul (result is lower 32 bits of 'full' result)
cpy-ptr
ptr-add
ptr-int-sub
ptr-ptr-sub
jpc (jump relative to pc) (note: large immediate format?)
ji (jump indirect) ?? (for portability, this should be constrained... mb use semi-mandatory ANNOTATEs immediately following to do so, possibly preceded by a JMP? i guess a JMP doesn't end a basic block?) (note: large immediate format?)
bnz
and
or
xor
not? (note: bitwise, but can also be boolean if -1 is the representation for TRUE)
shl
srs
sru
ld
st
ld-ptr
st-ptr
ldh
sth
ldb
stb
fence
??? (?undef? addi? ?ldi 6-bit immediate? break? nop? comment? print? impdep? other branching instrs? leaning towards impdep)
xcall (platform ABI possibly-external call; arguments are function number, number of arguments (is that needed? probably yes... but could pass them with the args themselves and then make this a syscall, that would be better, right?), ?where to put the returned value? mb 3rd argument is WHERE the arguments being passed are: on smallstack, or in memory pointed to by a pointer (which is needed for external functions taking more arguments than fit on SMALLSTACK; note: passing arguments on a memory stack could be the same as 'in memory pointed to by a pointer'))
syscall (we can have 8 syscalls with one input and one output argument; in, out, prolog (platform-native ABI function prolog), xret (platform-native ABI function epilog and return), malloc, mfree, and more could be some of these; halt, mfree take one argument)
annotate

---

but if we don't like runtimes and so don't want to add new regs in BootX?, then what about the 64-bit/FP regs? i guess the most portable, easiest-to-compile thing to do would be to just say that each of those 64-bit regs is really just 2 adjacent 32-bit regs, and you can only use even-numbered registers for 64-bit. That wastes 3 bits per FP or 64-bit instruction, but what can ya do?

also note that, b/c on some systems there might be separate banks of FP registers, the program can't assume that the 64-bit or FP values are actually stored there.

also from a concise-state point of view that makes sense to me, after all we already have 64 registers (32 integer, 32 pointer), jeez!

you can think of everything except for the first 8 regs in each bank as local variables with instructions in the ISA (CPY) to load/store those local vars.

---

but... mixing the FPs in like that is ugly. Not sure if i can get around that though without compromising ease of compilation, that is, w/o adding a runtime.

Also, on systems in which the ordinary registers can't hold floats or 64-bits, this could complicate special-case code in a vm that names specific registers. But when the vm's registers are actually just memory locations, it's easy to use adjacent memory locations as FPs or 64-bit ints. So why not restrict the 64-bit/floats to the upper 16 registers? Even on a 32-register underlying platform, since we have integer and pointer register banks, only the lower 16 in each bank will be in actual registers, so the upper 16 is already in memory, so that should be easier. So now we only have 8 64-bit/FP registers. Fine with me.

Note that a single 64-bit float can store a 32-bit int too [32]. But not a 64-bit int. So some crazy implementations could even use its FP register bank as 8 of the upper int registers, not that i care about enabling that crazy.

---

eh, it's just too ugly. Just make the last 8 registers 64-bit and floating-point capable. But still possibly each in their own bank (e.g. undefined data if you write a 64-bit and read a 32-bit) in order to allow for platforms that have a separate FP register bank.

---

houston, we have a problem. With this ISA we only have 9 bits to encode JMP targets!!!

With the 16-register ISA we had 12 bits, which is much more feasible.

Although even 12 bits is obviously still not enough, so mb it doesn't matter? We're going to have to use JMPI all over the place in either case.

This is kind of the achilles heel of fixed-length 16-bit ISAs, i guess? Even with 32-bits it seems tough.

What does RISC-V do with their 32-bit fixed-length ISA? Their JAL instruction has a 20-bit signed immediate, but with units of 2 bytes, so it can target a range of +-1MiB? (2^20 bytes). So scaling that down that's like a +-1KiB? range for 16-bit instructions, which still isn't much.

i guess there's no escaping this for 16-bit ISAs; after all, even 16 bits is only 2^16, which is only enough to encode short programs. So we can really only directly encode small, relative jumps.

i guess the one thing we could do is make the 8-bit encoding give up a bit, and then we use this bit for a second instruction format, a 15-bit signed relative jump. So now we can encode relative jumps of +-~16384 bytes. That seems worth it to me. Let's do it.

---

Boot:

types: native ptrs, 32-bit signed ints

registers and stack:

fixed-length encoding: 16 bits: 1 encoding length bit (always 0), 1 sub-format bit (0 except for jump relative (?and mb also ji, LI, LUI, AUIPC?), where it is 1), 5 opcode bits, 3x3 operand bits

encoding is little-endian. The above listings of fields in the encoding go from LEAST-significant bit to most.
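As a hedged sanity check of that layout, a decode sketch (field names are mine; widths as just listed, least-significant field first):

    #include <stdint.h>

    typedef struct { unsigned sub, op, a, b, c; } BootInsn;

    static BootInsn boot_decode(uint16_t w) {
        BootInsn i;
        /* bit 0 is the encoding-length bit (always 0 for 16-bit Boot) */
        i.sub = (w >> 1)  & 0x1;    /* sub-format bit (1 = jump-relative form) */
        i.op  = (w >> 2)  & 0x1F;   /* 5 opcode bits */
        i.a   = (w >> 7)  & 0x7;    /* first 3-bit operand */
        i.b   = (w >> 10) & 0x7;    /* second 3-bit operand */
        i.c   = (w >> 13) & 0x7;    /* third 3-bit operand */
        return i;
    }

(When the sub-format bit is 1, the remaining bits would instead be read together as the large relative-jump immediate discussed earlier; that variant is omitted here.)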

BootX?:

types: native ptrs, 32-bit, 64-bit signed ints, 32-bit floats, 64-bit floats

registers and stack:

addr modes: immediate, register, (smallstack-as-register, r3 displacement), register indirect

encoding (no alignment restrictions):
8 bits: 2 encoding bits (always 10), 64 'shortcut' ops
16 bits: Boot
32 bits: 3 encoding bits (always 110), 6 opcode bits + 2 opcode addr bits (160 instructions; immediate + half of the others are instructions), 3x(5 operand bits + 2 addr mode bits)
note: instructions with a bit-prefix of 111 are RESERVED

---

hmm.. if we have so few bits for jump targets, i wonder what the 6502 did? The 6502 had a 'zeropage', which is kinda like having 256 registers, so how did they fit that in there?

ah: [33]

As for the jumps: 6502 opcodes were 1 byte, and included some register choices (but these didn't have separate operand fields, b/c there were so few registers), and they were followed by data. So an absolute JMP was 3 bytes long: the opcode, then a 16-bit jump target. In Boot, we're trying to fit the instruction AND the data into 2 bytes, so that's why we're having trouble.

As for the 256 zeropage 'registers': instructions with zeropage addr mode did fit in two bytes (one byte for the opcode, one for the zeropage 'register' choice); however, the cost was that (a) the other register choices were completely determined by the opcode, and (b) there were only one or two operands; e.g. you had instructions like ADC, add-with-carry, where the other register was always the A (accumulator) register, and it always mutated the A register, and you had instructions like LSR, which shifted 1 bit and mutated the target zeropage 'register'.

So i think it's worth it to do what we are doing: give up three-quarters of the zeropage 'registers' (we have 64 instead of 256) in exchange for a little more power in the instruction (three-address code).

---

hmm... if we allow 2 opcode bits in the 'large immediate' format, then we can have 4 instructions with 12-bit immediates. We could use this to encode larger jumps as stereotyped sequences of (lui,..., ji) or (auipc,..., ji), like the way that RISC-V does.

Without this, our LI instruction can only load 9 bits (so call it 8 bits), and the immediate offset of ji can only be 6 bits, so LI, JI only has 15 bits at most. With it, AUIPC could load 12 bits then JI could have another 12 bits, so +-8 MiB? in two instructions -- but at the cost of JPC having a range of +-2k instead of +-16k. But, we also free up another opcode in the other format.

Could we save an instruction by just having an ordinary 12-bit LI and then having JI shift the address in its immediate up by 12 bits? Well, that prevents JI being used for displacing into jump tables, so it's probably not worth it. Also, we need some way to indicate whether or not PC-relativity is being used here -- it might be nice to provide absolute addressing for implementation-dependent and toolchain usage.

so, i think we should do this.

also, don't forget my idea of having an instruction that both shifts and then adds an immediate. This would allow a chain of three instructions to load 32 bits:

ldi (load 8 bits) register #imm
shift-by-12-then-load-12-bits register #imm
shift-by-12-then-load-12-bits register #imm

wait tho, we don't have space for a register operand in large-immediate mode. So how about pushing these onto the stack:

pushi (push 8 bits) #imm
shift-by-12-then-load-12-bits #imm
shift-by-12-then-load-12-bits #imm

so now it takes 4 instructions (8 bytes) to load a 32-bit (4-byte) constant into a register:

pushi (push 8 bits) #imm
shift-by-12-then-addTOS-12-bits #imm
shift-by-12-then-addTOS-12-bits #imm
cpy reg smallstack

and it takes 3 instructions to do a jump anywhere with a 32-bit range:

pushi (push 9 bits) #imm9
shift-by-12-then-addTOS-12-bits #imm12
shift-by-12-then-ji #imm12

so mb the large-immediate instructions that we want are:

or alternately?:

hmm that's nice but then we need 4 instrs instead of 3 to do PC-relative jumps with a 32-bit offset. Also the semantics gets confusing if the result is > 2^32. Also if ji gets a register and a large immediate, now we just added another instruction format, which might scare off implementors. Otoh wouldn't you like a nice ldi #imm8? That's the same new format.

i feel like it's gotta be:

This means that to do a PC-relative jump with a 32-bit range:

cpy smallstack, 0
shift-TOS-by-12-then-add-imm-to-TOS #imm12
shift-TOS-by-12-then-add-imm-to-TOS #imm12
add TOS TOS PC
ji TOS #imm9

ugh... that's 5 instructions (80 bits), not 3 or even 4. Can't we do better?

And for a PC-relative jump with a 16-bit range:

cpy smallstack, 0
shift-TOS-by-12-then-add-imm-to-TOS #imm12
add TOS PC
ji TOS #imm9

It makes sense though. You want the shift-TOS-by-12-then-add-imm-to-TOS to be something that you can iterate to get more and more bits. And you can't add the PC in until all the shifting is done.

What if it were add-imm-to-TOS-then-shift-TOS-by-12? You still can't add the PC in until all the shifting is done.

Maybe we can make use of ldi somehow:

ldi smallstack, #imm9
shift-TOS-by-12-then-add-imm-to-TOS #imm12
add TOS TOS PC
ji TOS #imm9

doesn't work; we only have 30 bits, not 32.

we could demand a weird accumulating ADDI instruction that adds a source register AND a 4-bit immediate AND a destination register, and puts the result in the destination register:

ldi smallstack, #imm9
shift-TOS-by-12-then-add-imm-to-TOS #imm12
addi TOS PC #imm4
ji TOS #imm9

i dunno if that's worth it though -- would it be used anywhere else? And it's weird.

so i still think our best option may be:

wait actually that's not good:

shift-TOS-by-12-then-add-imm-to-TOS #imm12
shift-TOS-by-12-then-add-imm-to-TOS #imm12

leaves a 12-bit gap in the low order bits; ji reg #imm9 can't fill that gap. We need two variants; one which shifts and one which does not.

Let's just give up on the 12-bit shifter instruction and use ldi. And let's use up the 9th bit to give ldi two forms:

1) ordinary ldi
2) shift by 8 bits, and then addi

Now make ji the same:

1) ordinary ji
2) shift by 8 bits, add #imm8, then add PC, then ji

so now we do:

ldi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi-then-addPC-then-ji reg #imm8

what's nice is that this works well for 16-bit and 64-bit also:

16-bit:
ldi reg #imm8
shift-8-then-addi-then-addPC-then-ji reg #imm8

64-bit:
ldi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi-then-addPC-then-ji reg #imm8

and to do ABSOLUTE (not PC-relative) ji takes just one more instruction:

16-bit:
ldi reg #imm8
shift-8-then-addi reg #imm8
ji #0

yknow what tho, this seems a little ugly, and besides, we could make ji more powerful (at the expense of making these sequences longer) by having ji's alternatives be:

1) ordinary ji
2) jpci

Now we have:

16-bit PC-relative ji:
ldi reg #imm8
shift-8-then-addi reg #imm8
jpci reg #0

16-bit absolute ji:
ldi reg #imm8
shift-8-then-addi reg #imm8
ji reg #0

32-bit PC-relative ji:
ldi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi reg #imm8
shift-8-then-addi reg #imm8
jpci reg #0

etc

this is prettier and more regular/easier to process; large constants are in bytes, and also: to find large constants, just look for sequences of the form:

ldi shift-8-then-addi*

which are repeats of the same opcode (with a different subopcode bit at the beginning)
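As a hedged illustration of how such a chain composes a constant (plain C, names mine): each shift-8-then-addi step computes acc = (acc << 8) + imm8, so a 32-bit value is fed in as its 4 bytes, most significant first:

    #include <stdint.h>

    static uint32_t build_const(const uint8_t imm[4]) {
        uint32_t acc = imm[0];             /* ldi reg #imm8 */
        for (int i = 1; i < 4; i++)
            acc = (acc << 8) + imm[i];     /* shift-8-then-addi reg #imm8 */
        return acc;
    }

    /* e.g. build_const((const uint8_t[]){0x12, 0x34, 0x56, 0x78}) == 0x12345678 */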

so it's gotta be:

Also, you might consider giving CPY its own format.

So we now have 4(!) instruction encoding formats (after the encoding length bit):

is it worth it trying to keep the register operands at the same places everywhere, like in RISC? we can best achieve that by splitting the immediate in the 3bit/9bit format. Yes, let's do that:

hmmm... that RESERVED guy... it kind of suggests using it for CPY, so that we have one less weird format... hmm... then we'd have 2 bits left over for a subopcode, so that would be 3 subopcodes left for other 32-register 2-operand thingees.. hmmm... but an issue is that on 16-register platforms, the implementation already only has 1 register for its own use, and this would require the implementation to use a temporary when doing a copy whose source and destination are both outside the 8-register set; the current way of doing it (force either src or dest to be in the 8-register set) allows the implementation to use one less register, which i think might be the difference between being able to do this without more implementation state, and not. so no.

we could use it for a jmp absolute with #imm12. But that's kind of useless; jmp absolute would only be useful for platform-specific addresses, and those will be at least 16 bits.

we could add a shift-12-then-addi. That actually saves us one instruction when creating >16-bit constants. But it takes away the nice properties that (a) those constants are being built up in 8-bit chunks, (b) the instructions in constant builds are all using the same major opcode.

i have an idea. you know what we really need larger immediates for, just as with jpc? displacement mode addressing for ld and st. we could do:

(organized so that register operands are in the same places as usual, of course)

btw we could avoid splitting the immediate by having the registers at the end, rather than next to the opcode. Then to make which-register-in-which-encoding-position-does-what uniform, we could put the destination register at the end, instead of at the beginning. But i notice that RISC-V always puts the destination register first/next to the opcode. Is there a reason for that?

MIPS did it the other way: https://en.wikibooks.org/wiki/MIPS_Assembly/Instruction_Formats

i'm going to guess it doesn't matter much. Let's put the dest register always last.

also, if we did that, should the assembly reflect this ordering, or is it easiest for the reader to have the dest register first? "Intel syntax uses "dest, source" while the AT&T syntax uses "source, dest"." So i guess ppl can tolerate both.

So lets put the dest reg always last, and have assembly reflect that.

now we don't have to split the immediate for that other format:

so it's:

and now we have some free opcodes in the regular encoding, too!

---

do not use ordinary comparison operators on floating points; floating-point equality is not even reflexive (a NaN is not equal to itself)! Must use a separate 'fcmp' instruction to make the difference clear. (we have to anyhow because floats are a different type and our instructions are non-polymorphic)

---

should probably provide our variant of IEEE 754 totalOrder, since that's more useful than fcmp for floating points anyhow:

" Total-ordering predicate

The standard provides a predicate totalOrder which defines a total ordering for all floating-point data for each format. The predicate agrees with the normal comparison operations when one floating-point number is less than the other one. The normal comparison operations, however, treat NaNs? as unordered and compare −0 and +0 as equal. The totalOrder predicate orders all floating-point data strictly and totally. When comparing two floating-point numbers, it acts as the ≤ operation, except that totalOrder(−0, +0) ∧ ¬ totalOrder(+0, −0), and different representations of the same floating-point number are ordered by their exponent multiplied by the sign bit. The ordering is then extended to the NaNs? by ordering −qNaN < −sNaN < numbers < +sNaN < +qNaN, with ordering between two NaNs? in the same class being based on the integer payload, multiplied by the sign bit, of those data.[21] " [34]

our variant will just compare all NaNs? as equal, because some platforms will not let us access details about NaNs? anyhow. I guess mb we'll provide two variants, one that puts NaNs? less than everything else, and the other that puts them greater than everything else. Although if there has to be a default, it should be the one that puts the NaNs? at the top, b/c when you print out a nan on most platforms it says "nan", never "-nan", so this will be less surprising to people.
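A hedged sketch of that default variant (my code, not spec; returns -1/0/+1, with NaNs comparing equal to each other and above all numbers; note it does not order -0 below +0 the way real totalOrder does):

    static int fcmp_total(double a, double b) {
        int an = (a != a), bn = (b != b);   /* NaN test without isnan() */
        if (an && bn) return 0;             /* all NaNs compare equal */
        if (an) return 1;                   /* NaNs sort above everything */
        if (bn) return -1;
        if (a < b) return -1;
        if (a > b) return 1;
        return 0;                           /* note: treats -0 and +0 as equal */
    }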

if a platform doesn't have an 'isnan', then we can determine that by using the platform's floating-point comparison operator and checking if x == x. If it's a NaN?, this will be FALSE, otherwise it will always be TRUE [35].

we will provide an isnan function, of course; and an isinf function.

i'm not 100% sure what the proper way to make an isinf is, but i think it's probably this: 0*finite = 0, 0*inf = nan, 0*nan = nan. So since we already know how to identify a nan, i think we can now identify an inf too.

to distinguish -0 and +0, divide by it: 1.0/0.0 = inf, 1.0/-0.0 = -inf. Some platforms, e.g. Python, don't like even floating-point division by zero though, so watch out. Python has math.copysign though.
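Putting the last three paragraphs together, a hedged C sketch (assumes IEEE semantics; in C, float division by zero yields ±inf rather than raising, unlike the Python caveat just noted):

    static int my_isnan(double x)    { return x != x; }      /* only NaN fails x == x */
    static int my_isinf(double x)    { return !my_isnan(x) && my_isnan(0.0 * x); }
                                       /* 0*finite == 0; 0*inf == nan */
    static int is_neg_zero(double x) { return x == 0.0 && 1.0 / x < 0.0; }
                                       /* 1.0/-0.0 == -inf */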

---

for specification of arithmetic, let's just refer to WASM's spec. They appear to have done a good job, and the spec is free/libre ( https://github.com/WebAssembly/spec/blob/master/document/core/LICENSE ) and easily available on the web.

https://github.com/WebAssembly/spec/blob/master/document/core/exec/numerics.rst

---

RISC-V had an interesting note on why they didn't include conditional MOV (CMOV):

" We considered but did not include conditional moves or predicated instructions, which can effectively replace unpredictable short forward branches. Conditional moves are the simpler of the two, but are difficult to use with conditional code that might cause exceptions (memory accesses and floating-point operations).... Both conditional move and predicated instructions add complexity to out-of-order microarchitec- tures, adding an implicit third source operand due to the need to copy the original value of the destination architectural register into the renamed destination physical register if the predicate is false. "

---

the above specs have 32 smallstack items for Boot and 16 smallstack items for BootX?! which is it? i think my last word was 16 smallstack items, so that in BootX? the other 16 values for SMALLSTACK addr mode would actually be r3 displacements.

---

So now the Boot instructions look like:

cpy
swap? (generic stack or just smallstack? if generic stack then why not DUP and DROP too?)
add
sub (note: sub 0, x == neg)
mul (result is lower 32 bits of 'full' result)
cpy-ptr
ptr-add
ptr-int-sub?
???
bne (note: bnz is a special case)
beq (note: bez is a special case)
blt
bne-ptr
blt-ptr
and
or
xor (note: xor x, -1 == not x)
???
shl
srs
sru
st
cas (compare-and-swap)
st-ptr
ldh
sth
ldb
stb
fence
???
mb xcall (platform ABI possibly-external call; arguments are function number, number of arguments (is that needed? probably yes... but could pass them with the args themselves and then make this a syscall, that would be better, right?), ?where to put the returned value? mb 3rd argument is WHERE the arguments being passed are: on smallstack, or in memory pointed to by a pointer (which is needed for external functions taking more arguments than fit on SMALLSTACK; note: passing arguments on a memory stack could be the same as 'in memory pointed to by a pointer'))
syscall (we can have 8 syscalls with one input and one output argument; in, out, prolog (platform-native ABI function prolog), xret (platform-native ABI function epilog and return), malloc, mfree, and more could be some of these; halt, mfree take one argument)
annotate

(?undef? addi? break? nop? comment? print? impdep? other branching instrs? leaning towards impdep)

large immediate formats:

---

i can't say the zero register is doing much good here... it's useful in:

well, i guess that's enough to justify it after all. It's not like we have that many spare opcodes.

---

what about stack ops?

we don't need DUP because that's just

cpy smallstack tos

we don't need DROP b/c that's just

cpy 0 smallstack

i guess SWAP would be nice? ok i added it

---

let's get rid of ptr-ptr-sub. it raises annoying questions about what happens when the distance between pointers doesn't fit within 2^32.

let's get rid of NOT. it can be synth'd easily with xor:

not? (note: bitwise but can also be boolean if -1 is the representation for TRUE)
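In C terms, these syntheses are just (hedged sketch; neg via sub is the same trick noted in the instruction list above):

    #include <stdint.h>

    static int32_t neg32(int32_t x) { return 0 - x; }   /* sub 0, x  == neg x */
    static int32_t not32(int32_t x) { return x ^ -1; }  /* xor x, -1 == not x */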

---

instructions i took out from boot_reference (for now):

sysinfo (query system metadata)
cvt-int-ptr, addi-int, neg, sub-ptr, cvt-int-to-ptr, bitnot
halt (terminate program, returning a value, or reset program, reinitializing and restarting it)
pushpc (push program counter to stack), over, push, pop
addi-ptr-int, swap? (generic stack or just smallstack? if generic stack then why not DUP and DROP too?)
memory allocation (optional): malloc, mdealloc, mrealloc
stream I/O and IPC (files and channels) (optional): read, write, open, close, seek, flush, poll
other I/O: peek-i8, poke-i8, devop
debugging: log, break
platform library call: library

---

do we need compare-and-swap for NATIVE POINTERS or just for 32-bit ints?

---

i recall a bunch of complexity was introduced by having to distinguish 'stack-aware instructions', which would act on SMALLSTACK rather than push/pop to it; e.g. LOAD could use its offset to treat SMALLSTACK locations as registers.

i think i'll just forget about all that. It's too complicated for Boot, and in BootX?, we can just use stack addr mode.