proj-oot-ootAssemblyNotes23

linux ioctl is applied to open file descriptors, e.g.:

" Open the KVM device: kvmfd = open("/dev/kvm", O_RDWR | O_CLOEXEC)

To create a VM: vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0)

Set up memory for the VM guest: ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region) "

---

i think having the polymorphic 64-bit instructions is overengineering; we already have two stages of assembly-like instructions, 16-bit and 32-bit, and having a third stage spends too much complexity on the assembly-like stuff. The 64-bit LONG instructions should just be the grammar-like, AST-like/data-structure-like flexible encoding.

---

having the 8-bit shortcut instructions prevents 16-bit alignment, which is a pity, as 16 is a "more perfect" power of 2 than 8. Let's get rid of them, and instead have alternate formats for 16-bit instructions. We still have 1 format bit for 16 bits: when it takes one value, we have our ordinary 16-bit instructions; when it takes the other value, a second format bit follows. When that second bit takes one value, we have a 32-bit (or longer) instruction; when it takes the other value, we have the new 16-bit format.

So the new 16-bit format has 14 bits left for us to work with. We can make the first two of these be more format bits. When they are 11, we have 12 bits left to work with, which we can use to recover the 6-bit shortcuts (in packets of 2) by treating these 12 bits as two 6-bit shortcut instructions. Now we still have 00, 01, 10 to work with.

One thing we could do here is pull out all of the instructions that don't really fit well with the 3-operand model from the ordinary 16-bit instructions, in order to save a few opcodes over there. This could include JREL, LDI, ANNOTATE. Note that each of these requires at least 11 bits of content (and it would be great if we could give them 12), so if we did all three of those, that would probably be our entire budget; but we may want to save some for longer macro instructions, or for future expansion. The 6-bit shortcuts will probably already need a NOP, so we could also pull out NOP and represent it as two shortcut NOPs.

---

[1] says that SMS payloads, even though they are 160 characters long, are actually encoded into only 140 bytes (by using 7-bit characters), and then a 36-byte header is attached. This suggests:

32 or 64 bytes would be a good low-level message size (to fit a little less than this header, or this header + some content), or 128 or 256 bytes (to fit a little less than this content, or a little more than it).

---

[2] speaks of:

a 512-bit unit

256-wide ports

"Each memory operation can be of any register size up to 512 bits"

"The L2 to L1 bandwidth in Skylake is the same as Haswell at 64 bytes per cycle in either direction"

"Each (DDR4) channel is 72-bit (64 bit and an 8-bit ECC)"

" Store is now 64B/cycle (from 32B/cycle) Load is now 2x64B/cycle (from 2x32B/cycle)"

" L1I Cache:

    32 KiB/core, 8-way set associative
        64 sets, 64 B line size

"

L1D, L2, L3 caches also each have 64 B line size

" DTLB

    4 KiB page translations:
        64 entries; 4-way set associative"

" Port 4 now performs 512b stores (from 256b)"

L1D cache:

" 128 B/cycle load bandwidth 64 B/cycle store bandwidth "

the picture/section "Individual Core" in [3] gives some bandwidths of:

16 bytes/cycle

64B/cycle

512bit/cycle

and the Decoded Stream Buffer (DSB) has a "64 B window" and Instruction Fetch and PreDecode has a "16 B window"

this suggests that message sizes should be 64 bytes (8 bytes through 128 bytes are observed).

---

so it sounds like 64 bytes is a pretty good message size.

---

"Another famous example of NIH is Linux’s epoll(2), which is a broken version of BSD kqueue(2)."

---

"

aseipp 27 minutes ago [-]

If we're talking about desktop processors, it's still irrelevant, honestly; any desktop processor that can compete in a modern world is extraordinarily complex in implementation at every level. POWER, ARM, RISC-V, x86 -- they all try to extract high performance using similar techniques in their execution engines. Out of order execution, pipelined decoding etc are architecture design aspects and have little to do with the ISA. Front-end decoding, etc traditionally takes an extremely small amount of area and space in the overall system, though it's still complex (it's kind of hard to talk about raw "area" metrics when cache is what dominates and is most important.) But the CPUs themselves are incredibly complex, no matter what way you cut it.

If you're talking about embedded CPUs or whatnot, such as picorv32, the frontend is a relatively bigger piece of the "pie" so having a simpler ISA is nice. But really, verifying any CPU design is complex as hell and the decoders/frontends are not the biggest problem (even in RISC-V a source of bugs I've seen float around and even fell into myself, for my own emulator, is the semantics of JALR clearing low bits, for instance -- but that's not a decoding problem, it's a semantics one, flat out. Also, let's be real here, the whole memory-model approach in RISC-V where there's a spectrum of extensions adding TSO on top of weak memory ordering etc is nice and cool but I'd hardly call it fitting in the form of what people think of as a "simple" RISC CPU. That's not even getting into the array of other extensions that will crop up -- bitmanip, a crypto extension will almost certainly pop up, the vector extension, etc etc...) "

---

What if there are 64k nodes and each node had a 64k-bit bitmap with each bit corresponding to a node, with each node able to change the state of its bit in everyone else's bitmap (globally; that is, everyone's bitmap shares the same values at all times; they are all copies of one another)? That's kind of like how the brain works, I guess; each neuron's output is copied to each recipient. It generalizes the idea of letting each node have a bus shared with its vertical and horizontal siblings. To make it more like the brain's rate code (I'm not saying there's not also spike-timing coding), you could give each node an output byte instead of an output bit.

of course, generalize that one more time and you just get a 64k shared global memory

---

so, if we eschewed the 'geometry' aspects of the tesselated/tiled processor design elsewhere in these notes, and wanted to make massively concurrent 'chips' where every CPU on the chip could communicate equally with every other CPU, a simple architecture could be:

The global shared memory would have a very relaxed memory ordering, however. Updates are atomic and affect 16 bits at once (no 8-bit update option). Updates coming from any given CPU always arrive in order (i think this is called 'processor consistency'?).

Should we have no other memory ordering consistency constraints at all? How about one more: for any given choice of CPUs A, B, C, from the point of view of C, if A generates an update, and B generates an update, and then, before receiving B's update, A generates another update, then C can perceive no more than 255 of B's updates before it perceives A's second update (255 chosen b/c sqrt(64k) is 256). I'm guessing that this will at least allow algorithms that guarantee synchronization. (this implicitly contemplates an implementation consisting of up to 64k 'cache' copies of the 'global' memory with globally broadcast update messages each time someone makes a change, and lossless deterministic routing of these update broadcasts, but different caches receive different messages at slightly different times depending on their nearness to the sender)

however, we could simplify even further by just getting rid of the globally shared memory and instead providing message passing, including global broadcast messages, with the above guarantees (namely: (a) if a CPU generates two messages to the same destination, the destination receives them in order; and (b) if a CPU A generates one message and a CPU B generates one message which A is destined to receive, and A generates another message before receiving B's message, then no CPU that is destined to observe these messages (except B itself) can observe more than 255 of B's messages before observing A's second message).

one could make the condition (b) much simpler by tightening the constraint somewhat: the maximum propagation time is at most 255 times the minimum propagation time. How to state this in terms of memory ordering constraints?

(yes, this limits the total size of the system, but we're doing that anyways here with our 16-bit addressing; also, you could connect multiples of this system together, but in that case the interface provided to the program is more like a distributed system and less like a CPU)

One could even implement this by making message emission very slow, by just making it block for an amount of time that is at least 1/255th of the max propagation time; i think this turns the whole thing into sort of a quasi-synchronous system, in the sense that there is now a shared unit of time and a shared 'fuzzy clock' whose fuzziness is bounded.

maybe the way to state this stronger constraint is that: if any node observes event P before event Q, then no other node can observe Q plus more than 254 events from the same source as Q before observing P.

---

but... actually the message passing architecture brings us back to the idea of a 'shared memory' where each piece of memory can only be written to by one processor; because now we need buffers/mailboxes to store the incoming messages; and the simplest thing to do here is just to have a per-node buffer/mailbox size of 1 and have any incoming message overwrite the previous one (like interrupts); but that leads to a lot of coupling insofar as all of the nodes' messages are overwriting each other -- so an obvious solution is to have, at each node, one mailbox for each other node. Which, if there are 64k nodes, is like a 64k-word shared memory where each memory location can only be written by one node.

so there is something to the idea of a special 'shared memory' where each word is only writable by one node; just generalizing it to a 64k shared memory might lose the architectural feasibility.

the two constraints from above could still give the memory ordering constraints:

an alternative system would be to have a smaller amount of total mailbox buffer (e.g. a buffer of 100 instead of 64k), and let each CPU set a 'receive mask'; messages from senders not in the receive mask are immediately discarded and don't reach the buffer. You could also combine these somewhat:

use the 'shared receive memory' just as an interface/API concept; that is, there is a RECV instruction that takes as a parameter which sender to receive from (so, it's like reading a memory address), and it just returns the last message that you got from that sender, if any -- but you can only get messages from senders if they are in your 'receive mask', and there is a max number of '1's allowed in your receive mask at any one time.

that is kind of begging to be generalized from 'senders' to 'channels', though -- so that a message doesn't contain the address of its sender, only which 'channel' it is on, and any sender can send on any channel (this model assumes that all senders are trusted, because otherwise it might be nicer to not let different senders overwrite each other's messages, which feature was provided back when each sender had its own dedicated 'channel'). Now each 'channel' is a virtual bus.

So the way it works now would be:

how many subscriptions should be allowed? well, maybe all of them (if 64k are possible, due to the 16-bit registers); i think neurons can often have tens of thousands of synapses. Note that you don't need a separate mailbox buffer for each CPU; you can have shared mailbox caches (which makes a destination parameter on SEND seem a little silly, btw, so more evidence that we shouldn't do that). The architecture could actually do this:

---

one could also imagine fancy architectures where you have full SUBSCRIBE bitmasks for each space in the buffer, and, as messages come in, they overwrite each other (or do more complicated accumulation operations, such as sum). This is sort of circuit-switchy/signal-processy/neural -- i like it.

So for example, you would have say 16 'smart mailbox buffer' locations. You could say "program smart-mailbox-buffer-location #3 to: receive on channels 0,1,2...,2047, and accumulate with the SUM operation". You could also write directly to these smart mailbox buffer locs (eg. you could write a '0' to start over with a new sum). Other accumulate ops could be AND and OR.

This would let you implement simplified 'neurons'; accumulate signals from 'dendrites' with SUM, and then write directly to simulate decay. This suggests of course that what you really want is a WEIGHTED sum accumulation; which can be expressed by letting the subscription be a mask of integers rather than just a bitmask.

oo, i like that. Each 'smart mailbox buffer' is doing a dot product between incoming signals and a weight vector. The fact that there is more than one 'smart mailbox buffer' location is like dendritic computation (eg. you can use arbitrary functions to combine information from different regions of the dendritic tree, where each region is just a linear sum of the synapses into that region).

it doesn't quite allow for per-synapse dynamics though, unless you have enough 'smart mailbox buffer' locations to have one per sender -- in which case the 'smart mailbox buffers' would be like synapses, and the limited number of them is the constraint that you can only fit so many synapses on a neuron.

unfortunately if, per each 'smart mailbox buffer' location, you stored one weight integer per each other node, that would be 64k words per 'smart mailbox buffer' per CPU (since these guys are local to each CPU)! So i guess instead we just want a list of (channel, weight) pairs, and the list can only be so long.

i guess that really brings out how complex the brain could be; if each synapse on each neuron stores important state, and each neuron has tens of thousands of synapses, then we have ~64k words of memory per neuron just to store that state (otoh the 'word size' here is like 4 bits rather than 16 bits, but that's only a 4x savings in memory space; note that Sejnowski's lab found 4.7 bits of information per synapse in 2015).

here's a compromise; each CPU would have 4 'smart mailbox buffer' locations (4 regions of the dendritic tree). Each of these smart buffers is associated with a mask of 64k 4-bit entries. So in total these masks (the synaptic weights) take up 4 x 64k x 4 bits = 128k bytes (64k words) of memory. So, since we already have 64k words of local memory per CPU, we're doubling that. Each 'smart mailbox buffer' has a programmable accumulate operation (overwrite, sum, and, or). AND and OR might be a little redundant here; they're really multiply and sum with normalization. In addition, each CPU has 8 smart buffer locations with 1-bit receive masks, with programmable accumulate operations (overwrite, sum, and, or), for another 64k bytes. So in total these masks consume 192k bytes or 96k words per CPU; that is, we've more than doubled the local memory requirements.

The 1-bit receive masks operate with a separate set of channels that carry 1-bit messages btw. Note that these can emulate the data-processing part of interrupts (using the OR accumulate operation) (but with interrupts, there is also the control-flow 'interrupt' part! but we can have a 16-bit interrupt mask for these channels!).

So the way it works now is:

total memory cost per CPU of this stuff is about 96k words (192k bytes). We also have the 64k words of local memory (both program and data memory).

---

consider adding instructions for higher-level interop with HLLs into which we are embedding.

---

y'know, if in Boot and BootX we specify that the registers are only required to hold 16 bits, then even if we emulate wider regs in code we may not have enough; if we have 16 regs + 16 stack locs, that's 32 16-bit regs, or 16 32-bit regs, or 8 64-bit regs (of which only 4 are actually registers, and the other 4 are stack locs). Otoh we do have addressing modes in BootX, so we could maybe just do 64-bit computations in memory. Let's take a look at the addr modes again and see if they would be suited to that. (...) yeah, we have both displacement and indexed addr modes in BootX, so i guess we can use memory locations to do 64-bit math. Crisis averted (although inefficiently, i guess?)

is it worth it? this makes 64-bit math much less efficient. Otoh the motivation for only requiring 16-bit is to allow massively concurrent systems (with millions of processors). That's probably worth it, given our goals here.

one remaining worry:

Displacement and indexed addr modes split the 4 bits in the operand_value in half, though, so in displacement, assuming that the first half indicates a register holding a pointer to the base of a 64-bit pseudoregister table, then the second half is only 2 bits, so we only get 4 64-bit registers this way. And we don't get the smallstack, so that's not doubled. Indexed, however, could be used to address many more pseudoregisters (up to 2^16), but at the cost of having extra constant loads every other instruction to load the index.

In addition, displacement REALLY only addresses 1 64-bit register, because remember that what we're talking about here is how easy it is to write universal programs that can do 64-bit math even on a 16-bit system, in which case each memory location is 16 bits and it takes 4 of those to make 64 bits, so all 4 displacements are just locations within one 64-bit pseudoregister. So basically we'd have to resort to indexed mode, with interleaved constants, to do 64-bit math.

Eh, probably worth it, given our goals here. Remember that all registers can hold native pointers, whatever size those are on the current platform, so on a 64-bit platform we have 32 spots for pointers. The 4 spots spoken of above are just 4 memory locations.

There is an alternative, though; give up predecrement/postincrement addressing mode and replace it with a displacement-segment address mode that uses a fixed register as base; this now allows displacements of a full 4 bits (16 displacements), which now allows addressing of 4 64-bit-registers-made-out-of-16-bit-ones instead of just 1. Not sure if that's worth it (although getting rid of predecrement/postincrement addressing mode is tempting anyway, because it's mutating, which makes analysis ugly).

Another thing we could do is specify Boot as 16-bit but specify BootX