Bayle Shanks's website: proj-oot-ootAssemblyNotes15

pcwalton 13 hours ago [-]

Also worth noting: x86 supports four hardware breakpoints, which are sometimes more convenient, since they don't involve overwriting any instructions. (They're also more flexible because they can be used to trap on memory reads, etc.) https://en.wikipedia.org/wiki/X86_debug_register

EvanAnderson? 2 hours ago [-]

I remember the first time I read about the hardware-based debugging functionality in the 80386 and being in awe. I'd programmed 6502 and 8086 assembler and was just blown-away by the treasure-trove of debugging functionality the '386 had available to it. It seemed like it was just short of an in-circuit emulator and so very, very cool.

compiler-guy 59 minutes ago [-]

It's not just x86, pretty much every modern architecture has hardware breakpoint registers, data watch point registers and so on.

wazari972 2 hours ago [-]

I thought that these HW 'breakpoint' registers were (only) used to implement watchpoint, that is, read/write operations on data. Can they be used to implement 'instruction breakpoints'?

pcwalton 1 hour ago [-]

You can set the "break on" bitfield to "execute" (0) to have the debug register function as a normal breakpoint.

jasonzemos 1 hour ago [-]

On register %dr7 there are 4-bit switches for each of the 4 breakpoints. The first two bits can trap on Instruction (0b00), Write (0b01), and ReadWrite? (0b11). The second two bits specify the alignment of Byte (0b00), Short (0b01), Long (0b10), and Int (0b11).

--- https://m.phys.org/news/2017-01-middle-popular-yields-more-efficient-parallel.html

" Cilk adds just two commands to C: "spawn," which initiates a fork, and "sync," which initiates a join. That makes things easy for programmers writing in Cilk but a lot harder for Cilk's developers. ... All previous compilers for fork-join languages are adaptations of serial compilers and add the runtime invocations in the front end, before translating a program into an intermediate representation, and thus before optimization. In their paper, the researchers give an example of what that entails. Seven concise lines of Cilk code, which compute a specified term in the Fibonacci series, require the compiler to add another 17 lines of runtime invocations. The middle end, designed for serial code, has no idea what to make of those extra 17 lines and throws up its hands.

The only alternative to adding the runtime invocations in the front end, however, seemed to be rewriting all the middle-end optimization algorithms to accommodate the fork-join model. And to many—including Leiserson, when his group was designing its first Cilk compilers—that seemed too daunting.

Schardl and Moses's chief insight was that injecting just a little bit of serialism into the fork-join model would make it much more intelligible to existing compilers' optimization algorithms. Where Cilk adds two basic commands to C, the MIT researchers' intermediate representation adds three to a compiler's middle end: detach, reattach, and sync.

The detach command is essentially the equivalent of Cilk's spawn command. But reattach commands specify the order in which the results of parallel tasks must be recombined. That simple adjustment makes fork-join code look enough like serial code that many of a serial compiler's optimization algorithms will work on it without modification, while the rest need only minor alterations. "

https://cpc2016.infor.uva.es/wp-content/uploads/2016/06/CPC2016_paper_12.pdf

" Tapir introduces three new instructions to LLVM’s IR in order to represent parallel tasks. These instructions are detach , reattach and sync . "

detach label <detached>, label <continuation> reattach label <continuation> sync

http://wmoses.scripts.mit.edu/docs/parallelir.pdf

" Namely, all of these frameworks share two fundamental con- cepts: a detach operation and a synchronize operation. In this sense, a detach operation represents marking code to be run in par- allel. In OpenMP?, detach can be thought of entering a parallel con- text such as a parallel for loop. In Cilk, detach can be though of a spawn operation, assuming all function arguments have been eval- uated. Using PThreads, detach represents running a thread on a specific function. The synchronize operation represents taking code that was pre- viously detached and waiting for it to complete "

" The new detach instruction is a block terminator. This func- tion acts much like a conditional break, as it is provided to blocks which it can branch to. However, unlike the conditional break, the detach instruction represents branching to both blocks simultane- ously. For clarity, we will denote the first argument of the instruc- tion as the continuation while the second argument represents the spawned block. Additionally, a new reattach instruction will be created. This instruction is also a block terminator. It takes one argument – the block representing the continuation. Theoretically this is not nec- essary, however, its inclusion greatly simplifies optimization passes and compilation. The reattach has one successor – the continu- ation. However, unlike successors produced by a normal branch instruction, successors to a reattach instruction cannot use SSA registers inside of the reattach ed block –even in PHI nodes. This is done to represent the fact that the detach ’ed block is not guar- enteed to have executed before the continuation. However, any reg- isters before the detach statement can be used in either the contin- uation or the detach ed block without the use of PHI nodes since it is guarenteed that they were executed before entering the block. When a reattach instruction is executed in a serial context (e.g. no parallelism is available on the machine), it simply acts as an un- conditional branch to the continuation – thus representing the serial ellision. When executed in a parallel context, however, a reattach instruction is used to signal the end of execution for the detach -ed thread. Any detach ’s to function calls using the old syntax can be con- verted to use this new syntax as Figure 3 illustrates. In order to maintain the correctness as shown above, we give this new detach function-like properties, where it can consider any memory accesses in the detached code to be its arguments. "

http://wsmoses.com/tapir.pdf

" Conceptu- ally, the detach instruction is similar to a function call, and the reattach instruction is similar to a return. The three in- structions have the following syntax: detach label b , label c reattach label c sync where b , c ∈ V . The label keywords indicate that b and c are (labels of) basic blocks in V . A detach instruction takes a detached block b and a con- tinuation block c as its arguments, and it allows b and c to operate in parallel. A detach instruction in a block a ter- minates a and spawns a parallel task starting with b which can operate in parallel with the continuation block c . The CFG contains a detach edge ( a , b ) ∈ E and a continue edge ( a , c ) ∈ E . The block b must be a single-entry block for which every exit from the sub-CFG reachable from b — the parallel task spawned by a — is terminated by a reattach whose continuation c matches the continuation in the corre- sponding detach instruction — the detach instruction reat- tached by the reattach . For each such reattach instruc- tion, the CFG contains the reattach edge ( b 0 , c ) ∈ E , where b 0 is the block terminated by the reattach . In a sense, detach works like a function call to b that resumes at c after the subcomputation rooted at b returns, whereas reattach acts like a return. But unlike a function call and return, b is a block within the CFG of the function containing it, rather than a di ff erent function in the program, and b and c can operate in parallel. A detach does not require that b and c execute in parallel, however. It simply allows the runtime system to schedule them for parallel execution, if it so desires. Tapir assumes that every CFG G = ( V , E , v 0 ) obeys the following invariants: 1. A reattach instruction reattaches one detach instruc- tion. 2. For each reattach instruction j that reattaches a detach instruction i , every path from v 0 to the block terminated by j passes through the detach edge of i . 3. Every path starting from the detached block of a detach instruction i must reach a block containing a reattach instruction that reattaches i . 4. If a path from a detach instruction i to a reattach instruction j that reattaches i passes through the detach edge of another detach instruction i 0 , then it must also pass through a reattach instruction j 0 that reattaches i 0 . 5. Every cycle containing a detach instruction i must pass through a reattach instruction that reattaches i . 6. Any immediate successor of a reattach instruction can- not contain “ φ instructions.” "

---

instead of hard limits on length of custom instruction code and nesting of custom instructions, we could just say 'whatever is fine, as long as no fully expanded custom instruction is more than 64k'. But that would mean that if someone is programming custom instructions and using a library of someone else's custom instructions nested within them, that changes in the implementation of the library could make the top-level illegal. So mb better to have hard limits. But that's such a waste, because the limit will almost never be achieved, so we'd be wasting a lot of potential custom instruction length.

How about: hard limits on every layer except the top one; the top one's limit is computed assuming that the layers beneath it are all their own max, but by using more or less of them its own limit changes. This way, the top one's limit is not dependent on how the ones beneath are implemented. This means that the 'layer number' of a custom instruction is part of its fixed interface signature.

But in fact, we could do the same thing for every intermediate layer. The various intermediate layers could have different max expanded code size budgets, monotonically nondecreasing. Presumably the max expanded code size budget of each layer would be powers of two.

But in this case, we could simplify it by throwing out the idea of a fixed numbering of layers, and just specify the max expanded code size budget of any custom instruction as part of its interface signature/API. These could be constrained to be powers of two, and the total max is 64k (2^16). So there are 16 choices (so this parameter can be specified in only 4 bits; maybe the other 4 bits could be flags like 'deterministic', 'side-effectless', 'idempotent', 'threadsafe', etc).

Alternately, the other 4 bits could indicate how much hiddenstack is needed/consumed by this custom instruction, which the compiler should track since there is a small, hard limit on hiddenstack size.

What should the stack size limit be? These 4 bits suggest 16. 16 is also nice because we can't have more than 16 layers of custom instructions. https://people.ece.cornell.edu/land/courses/ece5760/DE2/Stack_cpu.html mentions 16 and 8 as sizes for small stacks. https://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html#212 mentions 16 as the size for a stack buffer (" In the case of a stack that is only used as an expression evaluation stack, "large" may only be approximately 16 elements, since single expressions seldom nest very deeply (Haley 1962). "). Let's say 16.

---

note: we used to have 12-bit operands (4k variables). Is this enough to encode most HLL programs? Probably; MS suggests using no more than 64 locals because the CLR runtime won't consider enregistering anymore [1]. So that's only 6 bits of local variables, and we'd be giving 12 bits for all variables.

but currently we have 8-bit operands (256 vars). This is only 4x the 64 suggested locals in CLR.

---

note: since we are throwing out stack addressing and address modes in SHORT, the hiddenstack is no longer enough for custom instructions. We have three alternatives:

at least one reserved register to pop the hiddenstack into, and this register must be saved/recalled upon calling/returning from a custom instruction
re-introduce 1-bit address modes into SHORT, use them to distinguish between register direct/register indirect (so register indirect on a stack accessing TOS), reduce the number of SHORT-accessible registers from 16 to 8, and possibly remove the LOAD/STORE instructions
mmap the hiddenstack TOS to memory location 6

discussion on what to do about the lack of direct access to the hiddenstack TOS:

one of the simplest things to implement is to mmap the hiddenstack TOS to memory location 6. The disadvantage is that this is a special case that doesn't expand our expressive power except by just enough for this problem. We could also add TOS mmaps for datastack and callstack, but then we'd have more than 8 special purpose registers, and only 8 GPRs. An obvious choice is to make the hiddenstack and datastack TOSs mmapped, but not callstack.

btw, somewhat tangentially, we probably need a RET instruction if that CALL instruction is to be useful in SHORT. Otherwise you have to return with a temporary location still in use: (a) pop the callstack (or possibly, the entire regreg saved state) to a temporary variable, (b) set the PC to the popped value.

btw, i got rid of swap, but didn't i figure out that we need it, if there is no scratch register within hidden functions? how else would you swap between datastack and callstack?

if we have a 'scratch' register reserved for custom instructions, that's nice i guess, but we incur the cost of saving and restoring this register in between calls to custom instructions. Although i suppose we could have a 'callee saves' policy there, rather than doing it automatically. I guess if this is not automatic, then it is also simple to implement, i guess even simpler than the mmap.

re-introducing 1-bit address modes into SHORT is the most powerful option, and we would get to eliminate the LOAD/STORE instructions, which means we'd have room for a RET and SWAP. But it also introduces the most theoretical complexity, and also it means the SHORT instruction set is no longer RISC.

i guess based on implementation and conceptual simplicity for SHORT the best answer is a reserved scratch register for use by custom instructions.

hmm should we try for 12 instructions instead of 16?

(b/c base 12 = 2*2*3)

currently we have:

ANNOTATE_OPCODE = 0 LOADI_OPCODE = 1 CPY_OPCODE = 2 ADD_U16_U16_OPCODE = 3 SUB_U16_U16_OPCODE = 4 LEQ_U16_U16_OPCODE = 5 ADD_PTR_U16_OPCODE = 6 SUB_PTR_U16_OPCODE = 7 CALL3_OPCODE = 8 LEA_OPCODE = 9 SYSCALL_OPCODE = 10 JZREL_OPCODE = 11 JNZREL_OPCODE = 12 J_OPCODE = 13 JI_OPCODE = 14 SWAP_OPCODE = 15

initial thoughts:

we are 'missing' MOV, COW, AND, OR, NOT, NEG, MUL
we have LEQ but also JZ and JNZ
we have JZREL, JNZREL, J, and JI. We can get rid of some of these if we allow jumps via PC mutation (but i doubt that's a good idea)
do we need both JZREL and JNZREL? Not formally, but for the sake of branch predictors it's nice to have both so that the common case can always be the one without the jump.
do we really need SWAP? LEA?
we have CALL3 and SYSCALL
should we get rid of the 'stack' address modes, and add PUSH and POP? If we get rid of SWAP and LEA we could add these.
if we allow a special case where the syscalls are mapped to the high address space, we can get rid of SYSCALL and just use CALL3. But this wouldn't work out so well for SHORT.

let's get rid of SWAP at least. Probably also LEA.

also, i'm glad that i thought through SHORT, but now, instead of officially specifying both SHORT and MEDIUM, mb just list SHORT as 'reserved for future use' and don't include it in the spec (except that it still is called 'SHORT' and still has format bits 00 and still is a 64-bit fixed length package). No, as the next paragraph indicates, we are still making changes to everything based on SHORT.

also remember to look at ootAssemblyNotes14, which suggests a different instruction set with less 'signatures' and which also removes the need for addr modes in SHORT, at the expense of making the PC writable. The more i look at that list (today), the more i kind of like it. Perhaps we could still provide the J instructions in MEDIUM mode and make PC read-only in MEDIUM mode? But would that require signatures again?

in that list there are SKIPZ and SKIPNZ. But since we have two input operands, we can use the second input to specify how many to skip. In which case we may as well allow short backwards jumps too and call this JZREL and JNZREL, right? wait no, we had to leave it SKIP because of custom instruction expansion; remember that we were going to let the implementation use the two unused operands to turn each SKIP into a jump of up to 64k, allowing for depth 2 of custom instruction nesting if each custom instruction is up to ~256 instructions. Since the PC is writable in SHORT, we can already do a JREL by just ADDing or SUBing to the PC.

also is CALL3 really CALL3? Because one input operand must be used to give the location of the function reference in memory, so really we just give the function one input and get one output, right?

hmm.. we could actually replace SYSCALL with J (jump to label, that is jump to entry in constant jump table), and just define the first n entries of the jump table to be syscalls. But no, part of the point of syscalls is that they accept an input and give an output, like call3s. Well, how about we rename CALL3 to CALLI (call indirect), and replace SYSCALL with CALL? The low CALLs are the syscalls (at least 16 of them, because in SHORT we want to reach at least 16 syscalls, but probably many more; how about reserving 256? but no, currently our MEDIUM operands are only 8 bits, so that would be all of them, so maybe this should just be SYSCALL and not CALL). This isn't that useful, though, because we already have custom instructions; can't they do everything that CALL does, and more? Will have to think about this more.

as nice as REGREG is, it should probably be a syscall that has an argument saying how many registers are being saved.

registers/memory map with a 0 register prepended: 0: 0 register 1: datastack 2: callstack 3: hiddenstack 4: first-class function (? do we still need a register for this now that SHORT operands can access 16 memory locations instead of 4? probably not) 5: err 6: GPR (now that SHORT has 4 operand bits, we probably don't need a GPR here) 7: PC 8: regreg (this is used to tell the implementation to copy registers to memory so that the copy will persist across function calls)

ok so it's been 5 months and i don't remember the details. Must re-read at least ootAssemblyNotes14, ootAssemblyNotes15, and ootAssemblyThoughts. But after only glancing at some stuff, for future reference, i think the current new plan is:

16 instructions, all can be signature 'out in in' (except for maybe loadi?): annotate (note: annotate with all zero operands is NOP) loadi load store cpy skipz (note: this skips the rest of this 64-bit instruction pack, if we're in a SHORT, and the entire next 64-bit instruction/instruction pack, which might be either one instruction (if it is MEDIUM) or 4 (if it is SHORT)) (note: the skips only use one operand, the other two are reserved for the implementation for use in storing an adjusted target relative jump during custom instruction expansion) skipnz calli (call3?) pop (note: pop and push take stack pointers (pop's at in and push's at out?) and i guess also take a number saying how far down in the stack they operate) push add-u16 sub-u16 leq-u16 add-ptr (i guess that when navigating a struct this should go to the next struct field, even if this means you are moving different amounts of bytes as you move over different fields) sub-ptr syscall

(how shall we use the extra operands in load, store, cpy?)

no addr modes in SHORT.

SHORT encoding format: 1 form bit + 4 packed 16-bit instructions (actually 1 15-bit instruction + 3 16-bit instructions): 1 form bit (0) + (3 opcode bits + 3 x (4 operand bits) ) + 3 x (4 opcode bits + 3 x (4 operand bits)). (so the first instruction in every SHORT must be one of the first 8 primitives (or is it, an even-opcode primitive?); the others must be one of the first 16 primitives; note that the first instruction can be decoded as normal). Encodings are little-endian, and bits are from LSB to MSB.

MEDIUM encoding format: same as before. 2 form bits (10), 2 reserved bits, 5 x (4 addr mode bits + 8 data bits). field1 is the opcode. field0 is a type argument to allow dynamic types to be passed in calls to polymorphic functions.

writable PC in SHORT

read-only PC in MEDIUM

addr modes in MEDIUM. But no STACK addr modes: 000: immediate 001: register direct 010: register indirect 011: RESERVED 100: constant table 101: boxed register direct 110: boxed register indirect 111: RESERVED 1xxxx: RESERVED

some non-primitive instructions (do we allow 'inout inout inout' signatures here? do we have special signatures for the jump instructions?): MOV COW AND OR NOT NEG MUL J JI LEA SWAP

registers/low memory map: 0: 0 register 1: PC 2: err/status (usually zero for no error, nonzero for error, but also used for carry) 3: scratch register (by convention, this may be overwritten by some custom instructions, and should be callee-saved, if needed, onto the hiddenstack before invoking any custom instruction) 4: hiddenstack (hard limit on hiddenstack stack size: 15) 5: datastack 6: callstack 7: reserved

note: we took out 'regreg' so now we want two syscalls to load and save register state

note: we want syscalls to load from the constant table and from the jump table

(assuming that we no longer need a first-class function register (see above); must think about that more, but note that it's not even referenced in pymovm.__init__.py).

some ways that this differs from typical bare-bones assembly:

'data structures' (addr subspaces) traversed by add-ptr
custom instructions (and the hiddenstack)
call (first class function call with one input and one output in operands)
boxed addr modes
annotations
polymorphism
large number of 'registers'

declarations of custom instructions include the following as part of the signature (note that each of these fit in a nibble, so both of them together can be packed into a representation within one byte):

code length budget (integer between 1 and 16, inclusive; the actual code length budget is 2^x, where x is that number. This is the maximum length that a fully expanded custom instruction can expand to. Each nested call to a custom instruction is counted as an expenditure of its full code length budget)
hiddenstack consumption (integer between 0 and 15, inclusive) (the compiler can use this to check if this much hiddenstack is still available whenever the custom instruction is called from within another custom instruction. The compiler assumes that this much hiddenstack is 'in use' when this custom instruction calls another custom instruction. The compiler can also check that more than this amount is never used).

The compiler can check custom instructions to make sure that hiddenstack overflow and underflow is impossible. Custom instructions must leave the hiddenstack pointer in the place where they found it. Custom instructions cannot call each other in a recursive loop. The compiler can check that the code length budget for custom instructions is not overflowed (this allows an implementation to expand custom instructions into a sequence of primitive instructions, and change SKIPs into internally-implemented JRELs which jump by no more than 64k, which can be held in the 16 bits of the two unused operands in a SKIP in MEDIUM encoding).

should put the stuff in this section into an updated README after the code is updated to make it true:

remove stack addr mode
update primitive instruction list
change this to make it clear that the CALL instruction only passes in 1 input: "Single-instruction function calls to first-class-functions: a memory location holding a reference to function is called, passing in 2 inputs, and assigning one output"
"SHORT (16-bit fixed length instructions; defined but not implemented yet" --> mention that they are packed into 64-bit packages
.. what else?

---

interesting idea for even easier initial bootstrapping on a new platform: write a full OVM interpreter completely in SHORT (or at least compile one to that). Now for an initial bootstrap, someone only has to implement a SHORT interpreter. So they don't have to think about addressing modes or custom instructions or polymorphism. Further simplifications along these lines may be possible.

Of course, if they want good performance or good host platform integration, they will need to write their own full OVM interpreter in the host language, but at least this way they don't have to. And it doesn't add complexity, since SHORT is a proper subset of OVM anyway.

Ooo, i think i like that.

We might rename SHORT to OVMCORE to make this clearer, although that may be too easily be confused with Oot Core. Maybe go back to OotB? or SOot? I guess "Soot" is the easiest and most distinct to remember; so we'd have "Soot" (SHORT Oot), then OotB? (Oot Bytecode, the language than Ovm executes), then Oot Core, then Oot.

so i guess now i'm going the other direction from something i said before: instead of not specifying Oot SHORT yet, we very much specify that. Of course, this specification is tentative and unreleased and liable to change at any time as we gain experience from implementing the full Ovm.

actually i'll call it Boot, not Soot. This way Oot Core could be called Soot later. The Ovm language can just be called Ovm.

I started drafting an initial spec: https://gitlab.com/oot/boot/blob/master/doc/boot_reference.txt

---

in order to make Soot simpler, maybe the skips will just skip one soot instruction, regardless of quad. This provides one more way to have errors in OVM, but to do otherwise would mean to (a) have to plan for an 8x reduction in custom instruction length budgets (from 64k to 8k) due to the possibility of skipping over 7 SHORT instructions that have been represented in the MEDIUM encoding, (b) make translation from SHORT to MEDIUM more complicated, since a SKIP in SHORT couldn't be represented by a SKIP in MEDIUM.

---

hmm it's pretty obnoxious not to be able to JREL. How about we allow JREL and give up some of our total 'custom instruction length budget' in order to allow skips. Our total budget is 64k. If we skip more than one, does that just divide the budget, or is it worse than that? I think it just divides it, because if the outermost layer of instructions can still make it over the gap, then all of the inner ones can too, because that means the total gap is less than 64k.

in SHORT the most we'd want to skip is 16, which would mean our total budget would go to 4k instead of 64k. But actually we want negative skips too, so 8 at most, so our total budget is 8k. That seems reasonable; that's saying that no custom instruction can ever expand to more than 8k of SHORT instructions, which sounds like quite a lot to fit into one instruction.

---

todo think about and add some of the vmcalls i thought of before (i assume they are in ootAssemblyThoughts or ootAssumblyNotes14 or something like that) to boot_reference, add stuff from similar kernels that include things like file I/O eg http://www.shenlanguage.org/learn-shen/shendoc.htm#The%20Primitive%20Functions%20of%20K%20Lambda

---

3 operand get/set for use with maps/arrays/OOP? could replace PUSH and POP by changing k argument to register direct but then would have to LOADI all the darn time when doing stack manipulation

ok moved PUSH and POP to 2-operand instruction list and removed their k argument

---

'read' and 'wait for event' are similar if the file being 'read' is an event stream. Also similar to both of these are iterators with iterator.next().

---

'poll' can be used as 'sleep' too, b/c could wait on an empty set of events with a timeout

---

instead of having a 'poll' (or epoll or kqueue/kevent or completion ports) to wait on multiple events at once, could simply create a virtual stream of the union of events from all of the things you're waiting for; then this virtual stream can be 'read' via the file I/O system. So if you have a way to multiplex/union event streams together like that, then all you need afterwards is a 'wait', not a 'poll'.

---

for awhile i had 'get' and 'put' in the Boot instruction set. But then i thought:

get and put are useful for the extensible Ovm future, but how do they actually work right now? are these vmcalls? arrays? maps? objects? Do we really want a get and a put here? Are they used for channels/files?

i decided that the most sensible non-polymorphic get and put would just be load and store with offsets, which we already have. Although we could have load and store with the offset being a compile-time constant, or stored in a register, i guess we should have both. But save the words get and put for polymorphic stuff later in OotB?.

---

todo: since regsave takes a continuous range of registers, figure out about the ordering of the first 8 special registers; which ones do we want to make it easy to save/restore? do we include stacks?

---

note that i like having REGSAVE k and REGLOAD k so as to make static verification easier -- the compiler should be able to always easily know the types of the contents of each register and also to have a stack map at least sometimes.

similarly, i like having the PC readonly and having to use jd to write to it.

i envision the HIDDENSTACK (mb that's not the best name) as being compile-time verifiable. The CALLSTACK should usually be compile-time verifiable but since it contains recursion this is trickier, also i'm not sure if we want to absolutely require verifiability here, maybe some weird interpreter or metaprogramming techniques can use it in a weird way. DATASTACK is not intended to be verifiable ; you can use DATASTACK to share stuff across multiple subroutine returns without popping and repushing it, and you can push a variable number of elements onto DATASTACK.

note: in MEDIUM mode, there are duplicate opcodes for the TWO, ONE, and ZERO guys. This is to make analysis easier; remember, MEDIUM is the encoding for facilitating analysis.

---

so how should call/ret work? The simplest thing seems to be that (call t k) is a macro for: regsave k; push callstack pc; jd t, and (ret) is a macro for pop scratch callstack; jd scratch. Note the asymmetry: CALL here does the regsave, but RET doesn't do the corresponding regload; the instruction after the CALL instruction should contain the regload. This is because we have 'regload k' (k is the number of registers to restore), and the callee doesn't know what 'k' should be. In this case, though, we don't even really need call/ret, right?

should regsave k save k items to the stack, or just 1 pointer to a list of k items elsewhere in memory? It seems inefficient to be dereferencing a pointer in the extremely common operation of subroutine return. Otoh, since this is a language for VMs more than for hardware, the vm can just implement this by having a pointer to the register file; then regsave and regload are just reading/writing to that pointer. Note that this means that either you can only do that when k = 8 (the biggest possible), or you just say that 'regload k' is allowed to overwrite ALL of the registers 8-15 if it chooses, but it is guaranteed to restore at least k. Saving a pointer also makes it simpler to verify the callstack, because the number of items there per function is somewhat fixed, although we still have the annoyance that the callstack contains not only code addresses but also the pointers produced by REGSAVEs. If we did do a hardware implementation, could just have a 'shadow callstack' that contains the actual saved items; the regsave'd pointers could just be something used internally rather than pointers into main memory. Otoh that would make it more difficult to implement the functionality in which the user program is free to treat the regsave'd pointer as ordinary first-class data and pass it around, copy it, etc.

which reminds me, if REGSAVE produces a pointer, than who deallocates that memory?

now, even if regsave just saves the items on the callstack directly, rather than saving a pointer, then we probably want to also push a length onto the callstack, so that just by looking at the callstack you can tell how many registers are saved. This allows a routine other than the callee to clear the stack of everything having to do with the callee in case you decide never to return. Which reminds me, what about exception/llvm "invoke"? This makes me want to just have an opaque struct, a single item on the callstack, containing something like a tuple (return address, regs, exceptional return address) (maybe a little more flexible than that, though). But that's sounding higher level than we want for Boot -- not only is this harder to implement/too complicated, but it may be making design choices that make it unusable for some languages.

Haskell had some issue with LLVM's 'call' (or was it invoke?) instruction, what was it again? Oh here it is, i wrote:

"CALL/RET, which GHC Haskell apparently doesn't use due to the need to return to after an 'info table' instead of immediately after CALL instruction"

" if you are wondering why GHC manages its stack separately from the C stack, and does not make use of the CPU's CALL and RET instructions, see section 7.1, "Using call/return instructions", in Marlow, Yakushev, Jones, Faster Laziness Using Dynamic Pointer Tagging (2007). In summary: (a) GHC supports large numbers of threads, each with very small stacks, but operating systems take signals on the user stack, and don't do limit checking (i.e. i think they mean that very small stacks might cause stack overflows during OS signal handling), (b) in order to tell the garbage collector which items in a stack frame are pointers, the code for each 'case continuation' is normally preceded by an 'info table', but when you RET from a CALL, you RET to the instruction right after the CALL instruction, which is where we'd like to put the info table, necessitating an extra JMP (not sure if i'm understanding that part correctly; my understanding is that the info table is embedded in the code and that they save one JMP instruction by skipping over the info tables with their synthetic returns, is that correct?). "

also i wrote: " Furthermore, LLVM itself had to be modified to be able to work with Haskell. LLVM added: (a) tail-call optimization, and (b) a new calling convention that allows the LLVM programmer (GHC) to control register allocation in some way. I don't fully understand (b) but here's what it seems to me that people are saying is happening:

Lazy functional languages (or at least, Haskell) tend to generate orders of magnitude more short-term memory allocations than other types of languages (see http://lhc-compiler.blogspot.com/2009/01/why-llvm-probably-wont-replace-c.html , http://lhc-compiler.blogspot.com/2009/01/case-against-cllvm.html ). This means that GHC presumably feels the need to hyper-optimize this sort of thing. Doing the usual things (whatever they are, i dont fully understand this part) w/r/t allocating stuff on the heap, pushing stuff onto the stack, talking to garbage collector subsystems, just doesn't cut it (i'm guessing; note that the last two links are NOT about GHC). Therefore, GHC keeps track of available registers and uses registers as global variables, rather than pushing stuff onto the stack, whenever there are free registers. Even the stack isn't fast enough for GHC! ... So, LLVM added an option to give the programmer (GHC) more control over more registers. See http://stackoverflow.com/questions/6394937/llvms-calling-convention-for-ghc for info on the additions that LLVM made for GHC. "

Also tangentially related, i wrote "Apparently, it currently isn't POSSIBLE to port the haskell runtime to LLVM because the fine-grained register control it requires can't be expressed in LLVM http://marc.info/?l=haskell-cafe&m=130253901622420&w=2 "

" i'm struck how even LLVM, C, and EXE ( http://groupoid.space/exe.htm#raise ) have escape continuations/exceptions of some sort. It seems like even core languages include these. I note that EXE models them as effects, so i guess that's similar to Haskell's IO monad exceptions. "

llvm tail calls literally just reuse stack frame and so require caller callee identical signatures (prototypes)

llvm cannot do tail calls if 'invoke' is used (which is 'call' but with exception handling) b/c it cant bubble up exceptions thru discarded stack frames

this suggests that llvm 'call' w/o delimited continuations etc is a useful special case that could be identified in annotaions of oot AST because it would allow lower levels of the compiler to optimize it to platform primitives (such as LLVM CALL, and many assembly languages CALL/RET, which GHC Haskell apparently doesn't use due to the need to return to after an 'info table' instead of immediately after CALL instruction) "

so toread (re-read): section 7.1, "Using call/return instructions", in Marlow, Yakushev, Jones, Faster Laziness Using Dynamic Pointer Tagging (2007). http://stackoverflow.com/questions/6394937/llvms-calling-convention-for-ghc http://marc.info/?l=haskell-cafe&m=130253901622420&w=2 LLVM call LLVM invoke

llvm invoke just specifies a call address and an exception landingpad address and says that RET returns to call, while RESUME returns to the exception landingpad. The exception address's first non-phi instruction must be LANDINGPAD. The INVOKE docs say "This instruction is used in languages with destructors to ensure that proper cleanup is performed in the case of either a longjmp or a thrown exception. Additionally, this is important for implementation of ‘catch‘ clauses in high-level languages that support them." LANDINGPAD specifies the types of exceptions which may be caught, and the types which may be thrown (note, these types can be things like 'double', not exception classes (at least, they can definitely be 'double' and i don't think they can also be classes, but i'm not sure there).

---

for 2-operand instructions (after we run out of spots for 3-operand), in-place or accumulator?

cse.unl.edu/~jiang/cse430/Lecture%20Notes/reference-ppt-slides/isa_introduction.ppt page 8 suggests that people stopped making accumulator ISAs, but still make 2 operand ISAs. But their definition of 'accumulator' is more narrow than, say, having two register input arguments and putting the result in the SCRATCH register; they seem to take one of the inputs from the accumulator as well.

what does RISC-V do? RISC-V's ADD, FMUL, etc are 3-operand (they have 32-bit instructions so they have room for 3 operands, and more)

" Due to the large number of bits needed to encode the three registers of a 3-operand instruction, RISC processors using 16-bit instructions are invariably 2-operand machines, such as the Atmel AVR, the TI MSP430, and some versions of the ARM Thumb. RISC processors using 32-bit instructions are usually 3-operand machines, such as processors implementing the Power Architecture, the SPARC architecture, the MIPS architecture, the ARM architecture, and the AVR32 architecture. " [2]

thumb appears to use in-place:

https://sourceware.org/cgen/gen-doc/arm-thumb-insn.html#insn-alu-adc

avr's ADDWF takes as inputs a specified location and the accumulator, and puts the output in a specified location (so the accumulator is one of the input operands, but is not in the output)

MSP-430 appears to use in-place: https://www.ti.com/sc/docs/products/micro/msp430/userguid/as_5.pdf

so let's use in-place

---

RISC-V atomic instructions in Chapter 6, page 33 pdf page 43 of [3], "A" Standard Extension for Atomic Instructions, Version 2.0":

memory consistency model: release consistency

" .1 Specifying Ordering of Atomic Instructions The base RISC-V ISA has a relaxed memory model, with the FENCE instruction used to impose additional ordering constraints. The address space is divided by the execution environment into memory and I/O domains, and the FENCE instruction provides options to order accesses to one or both of these two address domains.

 o provide more e cient support for release consistency [6], each atomic instruction has two bits, aq and rl, used to specify additional memory ordering constraints as viewed by other RISC-V threads. The bits order accesses to one of the two address domains, memory or I/O, depending on which address domain the atomic instruction is accessing.  No ordering constraint is implied to accesses to the other domain, and a FENCE instruction should be used to order across both domains.

 f both bits are clear, no additional ordering constraints are imposed on the atomic memory oper- ation.  If only the aq bit is set, the atomic memory operation is treated as an acquire access, i.e., no following memory operations on this RISC-V thread can be observed to take place before the acquire  memory  operation.   If  only  the rl bit  is  set,  the  atomic  memory  operation  is  treated  as a release access, i.e., the release memory operation can not be observed to take place before any earlier memory operations on this RISC-V thread.  If both the aq and rl bits are set, the atomic memory operation is sequentially  consistent and cannot be observed to happen before any earlier memory  operations  or  after  any  later  memory  operations  in  the  same  RISC-V  thread,  and  can only be observed by any other thread in the same global order of all sequentially consistent atomic memory operations to the same address domain.

Theoretically, the de nition of the aq and rl bits allows for implementations without global store atomicity. When both aq and rl bits are set, however, we require full sequential consistency for the atomic operation which implies global store atomicity in addition to both acquire and release semantics. In practice, hardware systems are usually implemented with global store atomicity, embodied in local processor ordering rules together with single-writer cache coherence protocols. "

LR/SC (Load-Reserved/Store-Conditional) swap add and or xor max min

" AMOs can also be used to provide sequentially consistent loads and stores. A sequentially consistent load can be implemented as an AMOADD of x0 with both aq and rl set. A sequentially consistent store can be implemented as an AMOSWAP that writes the old value to x0 and has both aq and rl set "

---

" On systems with multiple processors, non-blocking synchronization algorithms are much easier to implement[citation needed] if the instruction set includes support for something such as "fetch-and-add", "load-link/store-conditional" (LL/SC), or "atomic compare-and-swap". "

---

" . Some examples of "complex" instructions include:

    Transfering multiple registers to or from memory (especially the stack) at once
    Moving large blocks of memory (e.g. string copy or DMA transfer).
    Complicated integer and floating-point arithmetic (e.g. transcendental functions such as sine, cosine, square root, etc.).
    SIMD instructions, a single instruction performing an operation on many homogeneous values in parallel, possibly in dedicated SIMD registers.
    Performing an atomic test-and-set instruction or other read-modify-write atomic instruction.
    Instructions that perform ALU operations with an operand from memory rather than a register.

Complex instructions are more common in CISC instruction sets than in RISC instruction sets, but RISC instruction sets may include them as well. RISC instruction sets generally do not include ALU operations with memory operands, or instructions to move large blocks of memory, but most RISC instruction sets include SIMD or vector instructions that perform the same arithmetic operation on multiple pieces of data at the same time. "

---

"VAX has been perceived as the quintessential CISC ISA, with its very large number of assembly-language-programmer-friendly addressing modes and machine instructions, highly orthogonal architecture, and instructions for complex operations such as queue insertion or deletion and polynomial evaluation.[1]"

---

could use the last bit of the MEDIUM format addr mode as a 'displacement'; eg if the last bit is 1, then this operand resolves (effective address) memory location x + y, where x is the operand value, and y is the value of the first 3 bits of the addr mode. So we can do things like "the memory location found in R9, + 7". Unfortunately, +7 bytes doesn't even hop over one 64-bit word (then again, we're dealing with at least 16-bit words, so +4 16-bit words does indeed equal 1 64-bit word). Otoh this does allow us to access the top 7 items on the stacks without even using the k in the LOAD and STOREs.

---

could use the last bit of the MEDIUM format addr mode as a 'sequential consistency needed for this instruction' (like both aq and rl in RISC-V). That may be a waste, because presumably most instructions won't even support a sequentially consistent variant

---

if CAS, or some atomic, isn't supported (on this memory subspace, at least), then put an error code in ERR when it is attempted

---

pass everything on datastack for syscalls? shouldn't we use registers for at least some of them?

discussion for golang to use 3 regs for inputs and 3 for outputs. They are worried that it makes tracebacks in exceptions harder: https://gist.github.com/dr2chase/5a1107998024c76de22e122ed836562d

https://news.ycombinator.com/item?id=7676954 notes that currently Golang uses the stack for everything, compared to MS x86-64 which passes 4 words via registers. This commentor is critical of uses only the stack, however.

i think we care about simplicity even more than golang here. If it's possible to get away with only using the stack, as apparently it is, then we should do it. So let's do it.

---

well, i was all set with a core of 16 3-operand instructions, and so i wrote boot_reference.md, until i realized we could cram in another 48 instructions with fewer operands via TWO, ONE, ZERO. Now the stuff in boot_reference really should just be more notes here because the choice of 48 more instructions is still in flux, but maybe i'll leave it over there because i already started.

---

which RISC-V most compressed compressed instructions are we lacking (excepting variants)?

interestingly BEQZ, branch if zero, is more frequent; not BEQNZ. This justifies our decision to put SKIPZ in the first 8 instructions (which are the only ones allowed in the first instruction in each quad), but not SKIPNZ.

immediate variants
scaled variants
sign- and zero- extending variants
load immediate into upper halfword
stack pointer variants
bit-width variants
SRLI, SRAI (shift right logical, shift right arithmetic, in that order)
JAL, JALR (jump with link register, JAL dynamic ('register'))

---

note that (i think) zero-extension can be easily done via shifts.

---

i removed CALL and RET because i think, depending on how much they should do, either they so simple so as to be unnecessary, or they are too complicated to implement (given our goal of dead-simple implementation)

---

i went back and forth between using u16 vs. u16n (16-bit integers, vs. AT LEAST 16-bit integers, where the real bitwidth is platform-dependent). The pros for u16n is that if the actual platform doesn't actually have u16s, but instead has bigger things, then the Boot implementation can still translate single Boot arithmetic instructions into single platform arithmetic instructions, which is easy, and which means that Boot code will run more efficiently when it can guarantee that values will never have carries/overflows. The downside is that Boot code must manually check for overflow/carry upon doing addition when this is possible, which is inefficient and bloats Boot code. This won't be a problem at the OotB? level because OotB? can have custom instructions for each case (that is to say, we can have both u16 and u16ns in OotB?).

Currently i'm thinking u16s are better. It seems dumb to do (carrying while ADDing) at the Boot level when most platforms will be able to do it in one instruction anyways.

---

seems like we are missing single-bit sets and clears. These would be two-operand instructions, and currently we are full there. For the first 8 bits, this can be accomplished in two instructions via AND and OR (that is, using LOADI to load an immediate bitmask, then using AND and OR to apply it); for the next 8, you can either use LOADK to load a constant, or use LOADI and then use shifts to shift the bit.

---

i took out the instruction:

sub-ptr: op ip ii (subtract a u16 from a pointer)

because i don't see a need for it, since we discourage unnecessary pointer arithmetic.

---

an issue i have with the current proposal is that it's a mishmash of register-style instructions and stack-style instructions. If one is going register-style, one might also wish for immediate instruction variants. This highlights the need for a 'stack addressing mode' in MEDIUM, but i thought i decided that 'stack addressing mode' was too complicated to implement (i can implement it no problem, but Boot is supposed to be dead simple). The lack of immediate variants will be solved with immediate addressing mode in MEDIUM, but the lack of stack addressing mode demands an alternate stack version of almost every instruction, in addition to stack manipulation instructions. This is making me question my idea of not having a stack addressing mode.

(but note that even if MEDIUM had a stack addressing mode, it seems silly not to have stack operations in SMALL (Boot), because Boot/SMALL has a more compact encoding, which is what stack ops are supposed to be good for anyhow)

Or, perhaps we could get rid of the syscall-y zero-operand instructions and replace them with stack variants of various other instructions. But no, we want our stack instructions to take a 'which stack' argument so that they can be used on any of datastack, callstack, hiddenstack (and maybe cap), right? Maybe we just need to create another set of 16 one-operand instructions by giving up another spot in the two-operand list, like "one" does. This pushes the total # of instructions over 64, though, which seems crazy. Maybe we could justify it by saying the syscall-y instructions are their own thing and should only sorta count.

another note: if we want stack versions of everything, we need stack jump instructions too. RET could be like jd for the CALLSTACK, which seems convenient.

this mishmash problem probably ultimately stems from my kooky idea that the ISA should support both register stuff and stack stuff, and should have a separate DATASTACK and CALLSTACK.

if this is going to be this complicated, why even bother with Boot? We still want OotB? for a platform-independent Oot implementation, but if Boot is not going to be dead simple to implement, while remaining a useful subset of OotB?, then it's not going to be much harder to port an OotB? implementation to each platform directly, which porters will eventually want to do anyways because it's more efficient and provides better interop.

perhaps the problem is that the ability to have a zillion lesser-operand instructions just by reserving one greater-operand instructions is too awesome, which makes the design space too large. Before i realized that you can do that, i thought i had a set of 16 three-operand instructions with a simple 16-bit 4-field uniform encoding that seemed somewhat clear to be among the optimally elegant such sets, which is when i started writing boot_reference. But now that i can fit 64+ instructions into such an encoding, there are many choices and it no longer seems like the current proposal stands out as better than the many alternatives. It is true that there appears to be a basic set of operands in the fuzzy intersection of most modern ISAs; but this basic set is many more than 16 operations. Perhaps the truth is just that any ISA which includes all of these is also more complicated than i would like, therefore there is no obvious universal best choice (or set of near-best choices) and what is best for a particular project depends on its particular requirements. In which case, why bother inventing something new, just use RISC-V or WebAssembly? or LLVM or JVM or CLR or ...

So let's go back to our project requirements and see why i was motivated to invent something new. We want portability and interoperability. We want OotB? so that we can write the Oot implementation once and port it to many platforms. OotB? implementations need to be able to use platform-native composite data structures and operations for interoperability, which is perhaps the primary reason why we want it to have custom instructions and address subspaces. But we also want a naive OotB? implementation that can itself be easily ported, which means that we probably want to write it on a simpler platform that is itself easily ported, because OotB? might be too complicated to port from scratch. I fear that none of WebAssembly?, LLVM, CLR are simple enough to easily port to a new architecture in a day or three, and probably not RISC-V or JVM either.

on the complexity of a stack mode in MEDIUM: i sort of remember that most of the complexity arose from the COMBINATION of having to provide read/write in operand signatures combined with stack mode, because the 'framework' code that calls the actual instruction code is supposed to push or pop for it, but (a) without read/write signatures, it doesn't know whether to push or to pop, (b) we also have to introduce a convention about the ordering of all of these push/pop operations, because multiple operands may target the same stack. Maybe the thing to do is to just to come up with a simpler, clearer, more generic model for how MEDIUM custom instructions read from, write to, and get the effective address of their operands. Not sure we can get much better than signatures without sacrificing either performance (eg having each custom instruction implementation call a special vmcall to read from or write to its own operands; this would also be bad because it destroys any convention for ordering between operand read/writing) or memory (eg having two extra bits in each addr mode to say whether this operand is pushed before the instruction, popped after the instruction, or neither or both).

i'm currently leaning towards leaving stack mode as it is (signature-based) in MEDIUM, and not having any non-stack-manipulation stack ops in Boot/SHORT.

(for future reference, the "current proposal" that i spoke of above in this section is:

The 16 three-operand instruction opcodes and mnemonics and operand signatures are: 0. annotate: ? ? ? (or BAD, if all zero bits) 1. loadi: oi k k (load 8-bit u16 immediate) 2. loadk: k k k (load from 12-bit constant table, and push onto DATASTACK) 3. load: o i si (load from memory (plus offset) to register) 4. store: so i i (store from register to memory (plus offset)) 5. two: ? ? k (two-operand instructions)) 6. jrel: 0 0 k (relative jump) 7. skipz: 0 0 ii (skip if zero) 8. skipnz: 0 0 ii (skip if non-zero) 9. add-u16: oi ii ii (addition of u16s) 10. sub-u16: oi ii ii (subtraction of u16s) 11. leq-u16: o ii ii (less-than-or-equal of u16s) 12. add-ptr: op ip ii (add a u16 to a pointer) 13. 14. RESERVED 15. cas: pio i i (compare-and-swap)

The 16 two-operand instruction opcodes and mnemonics and operand signatures are: 0. cpy: o i (copy from register to register (in other architectures this is usually called "MOV")) 1. or: io i (bitwise OR: op0 = op0 OR op1) 2. and: io i (bitwise AND: op0 = op0 AND op1) 3. xor: io i (bitwise XOR: op0 = op0 XOR op1) 4. swap: io io (swap the contents of two registers) (note: 'roll STACK 0' can be used to swap the top 2 elements in a stack) 5. mul-u16: io i (multiplication: op0 = op0 MUL op1) 6. malloc: po i (allocate memory) (?since memory allocation is expensive and syscall-y, should this be a 0-operand instruction that passes arguments on the stack?) 7. mrealloc: po pi (resize memory allocation (like C's realloc)) 8. pick: sm i (push a copy of the n-th item in a stack (zero-indexed) onto the top of the stack) (note: this can also do dup (with r0), over) 9. roll: sm i (rotate the top n+2 elements on a stack) (note: this can also do swap (with r0), rot) 10. sllk: io k (op0 = logical_shift_left(op0 by k bits)) 11. srlk: io k (op0 = logical_shift_right(op0 by k bits)) 12. srak: io k (op0 = arithmetic_shift_right(op0 by k bits)) 13. pop: o sii 14. push: som i 15. one: ? k (one-operand instructions)

The 16 one-operand instruction opcodes and mnemonics and operand signatures are: 0. jd: pc (dynamic jump) 1. stackload: k (pop saved stack items further, and push to DATASTACK) (todo: 'k' is unused!) 2. stacksave: k (save stack items further back than k, and push to DATASTACK) (todo: do we really need a 'k' here?) 3. log: pi (log a message) 4. mdealloc: pi (deallocate memory (like C's free)) 5. capunpack: pi (from a capability, generate a struct describing it. there is no 'cappack' because you can't edit capabilities) 6. capalloc: i (allocate new capability stack; returns the new stack pointer on data stack. Deallocate with ordinary mdealloc) 8. addstk-u16 sm 8. substk-u16 sm 9. leqstk-u16 sm 10. addstk-ptr sm 11. RESERVED 12. syscall: k (platform-specific zero-operand instructions) 13. regload: k (pop saved k registers from CALLSTACK and restore k registers starting with 8) 14. regsave: k (save registers greater than 7 and less than 7+k, and push to CALLSTACK; stored register format is opaque (but is it one or many entries on CALLSTACK?!? b/c we want to be able to manually manipulate CALLSTACK and even move saved registers elsewhere in memory, right?)) 15. zero: k (zero-operand instructions)

The 16 zero-operand instruction opcodes and mnemonics and operand signatures are: 0. fence 1. delete 2. walk 3. rand 4. sync/flush 5. seek 6. fork 7. wait (fork's join) 8. open 9. close 10. read 11. write 12. poll 13. time (in ns; 64-bits (4 16-bit items, ordered least significant to most)) 14. halt (pop DATASTACK and terminate program execution, returning the popped value) 15. break

)

---

tombread: https://www.google.com/search?q=minimal+instruction+set&oq=minimal+instruction+set&gs_l=mobile-heirloom-serp.3..0i30.16225.20450.0.20630.17.10.0.0.0.5.299.1102.5j3j1.9.0....0...1c..34.mobile-heirloom-serp..15.2.369.9L_CESk-vrY

---

i'm rethinking the inclusion of the constant table into Boot.

the issue is that we want the constants to be able to be more than 16 bits, so that we can support Boot programs of length more than 64k words which can have jumps to anywhere within the program. But, different platforms will support different lengths, so we don't want to mandate support for any particular bitlength above 16 bits. We could allow constants of variable lengths, but leads to more implementation complexity; also, if each constant in the table had a length prefix, that would be a space-ineffecient encoding of the constant table; the more efficient way would be to have segments of the constant table for various sizes.

we could just fix the constant table item length at 64 bits.

another issue is that it's cache-inefficient to store things which may only be used in one or two places, like jump addresses, in the constant table -- either we end up caching the whole constant table, in which case much of this cache is wasted, or we don't cache the constant table, in which case those few constant which are frequently used are not cached. It would be better to have a way to store constants directly interspersed with code. If we did that we could eliminate the constant table in Boot, too (OotB? would still have a constant table).

one thing we could do is have a 'loadk' instruction which loads the following word or few words as a constant. This doesn't sound too dirty, because after all it's just like having an instruction whose semantics include '.. and then skip over the next instruction(s)'. But it's a little annoying, because a dead simple interpreter would like to just loop over each instruction and pass it to an evaluation function, and this makes that not work. On the other hand, it's not TOO bad, because if we limit the maximium bit size of constants, we can just give the evaluation function a window of instructions that starts with the actual instruction and includes a few extras after it. On the other hand, that's probably inefficient, as we're almost always passing around extra data that we don't need (which is fine if we're actually just passing a pointer into the instruction stream, but imagine a hardward implementation; there there's probably actual electrical signals copying the data around, although i'm not too sure how those things work). Also if we did this we'd want to worry about 64-bit alignment, which would be annoying.

another thing we could do is mimic the ISAs that have a 'load high' instruction; this allows an ISA to have register widths twice as wide as the largest supported immediate constant width (eg register width of 16 bits and immediate constants of 8 bits), and then use a pair of instructions to load the low and high bytes of the 16-bit word. If we don't want to specify the maximum supported constant bitwidth, we could generalize this to a "multiply and load immediate" instruction that multiplies the previous contents of the register by 256, and then adds a new 8-bit immediate constant to the result. This would allow us to embed arbitrary-length constants in ordinary code in most-significant-bit-first (big endian bit) order.

perhaps to make things easier on compilers (who will want to check at compile time to ensure that there is no sequence of 'multiply and load' instructions that is so long so as to exceed the platform-specific maximum constant size), we could require that any 'multiply and load immediate' instruction must be preceded by either (a) a LOADI instruction to the same register, or (b) another multiply and load immediate instruction to the same register. That is, we forbid doing a 'multiply and load immediate' instruction on a register that previously has a computed value. This is annoying because it's cross-instruction syntax, but in practice it's not too hard to deal with. We'd also want to add a field in the .ob0 files that gives the 'maximum constant length'.

a downside of the 'multiply and load immediate' technique is that it's space-inefficient; each instruction has the overhead of the (redundant) multiply and load immediate opcode, and, if we have a destination register, of that. We could get rid of this by having a special processor state for being in the middle of such a series, which seems to be to be simple enough, but i bet there's some hidden issue that i'm not thinking of that would add to implementation complexity there. So maybe ignore this space-inefficiency.

instead of specifying the destination register in here, we could (a) use this space for more data, or use it to denote the (b) position within, or (c) the remaining length, or (d) the termination, of the sequence. A problem with (a) is that now we're adding to the constant in units that are not a power-of-two bits. A problem with (b) or (c) is that, if we want to support constants more than 16 bytes (not words; each LOADI instruction only has 2x4 bits of constant) (=128 bits) long, either we can't, or we need an escape special case. 128 bits is pretty long, but the Ethereum Virtual Machine, for example, uses 256-bit words, so 128 may not be enough for all cases. (d) is fine but it is (mostly or totally) a special case of (c), and it wastes three bits of space for each item in the sequence.

i guess destination operand is the simplest because anything else requires either (a) we always 'multiply and load immediate' to the same destination register, probably TOS of DATASTACK, or (b) we maintain extra processor state to remember what the destination operand is in the middle of a sequence of constant load instructions.

i'm leaning towards:

replacing LOADK with a LOADIM (load immediate with multiply), each of which loads 8-bits of constant
adding a 'maximum constant length' field in the .ob0 format
destination operand in the extra operand

---

speaking of 16 bits, mandating everything be at least 16 bits would make it annoying to implement Boot on a 6502 processor, which is sad. It would still be very possible to implement Boot on a 6502, though, so not a total loss.

---

i guess Boot's full name should be 'Oot Boot' or 'ootboot', to make it Googleable. The full bytecode (not just Boot) is called OotB?.

---

don't the CALL and RET macro instructions, which manipulate the callstack, and the syscall-y instructions, which do whatever, violate the no-main-memory-access-outside-of-load/store/push/pop rule?
- i took out CALL and RET. The syscally things are still there; they are really just a push to the callstack and a jump, but one could still argue that's too much.
two/one/zero are too good to give up, but now we have not 16 but 64 'primitive' instructions. That's a lot, right? It's more than Lua's 37. Should we reduce the bit-width of our 4 fields? If we had 3 bits of data and 1 addr mode bit (immediate/register direct), then we'd only have 32 primitives, which seems more reasonable, but we'd also only have 8 registers instead of 16, which means we'd probably have to give up some of our 3 stacks, 1 capability stack, ERR, SCRATCH; also then 2 fields would only provide a 6-bit constant instead of an 8-bit constant, which seems too limiting to me, and also then we have an addressing mode again (although immediate/direct is no big deal imo).
- yeah, i think 8 regs is too limiting if we want all of these special stacks
otoh 64 instructions is the next power of two above Lua's 37, and still seems simpler than webassembly's binary encoding. But maybe not simpler than RISC-V (well, the encoding may be, but not the instruction set), and maybe not much simpler than JVM (except for JVM's more types and OOP instructions)
- yeah i think 64 is tolerable
otoh it's worse than Tiny RISC-V http://www.csl.cornell.edu/courses/ece4750/handouts/ece4750-tinyrv-isa.txt
so far i think we'll stick with this, even though it has 64 instructions and is not RISC-y
consider immediate forms of a bunch of things, like eg RISC-V
- ok added them. also added stack forms.

---

Should we store the return op0 on the callstack when we CALLI? If so, then the contents of CALLSTACK is something other than codeptrs.

---

vs. RISC-V atomics, we're missing: * acquire vs. release vs. both vs. neither bits in atomics * LR/SC (ie LL/SC; but we have CAS instead) * swap, add, and, or, xor, max, min

note on atomics: yes, CAS lets you do eg atomic add (c = a + b), but afaict all threads must also agree on a 'lock' memory location corresponding to a,b,c.

which atomic operations are supported by various VMs/ISAs, in addition to CAS, LL/SC, fence and OOP stuff?

RISC-V atomics: swap add and or xor max min
C# interlocked [4]: add inc dec exchange read_64bit_word_atomically. But https://msdn.microsoft.com/en-us/library/system.threading.interlocked.aspx only mentions exchange dec inc
C++ atomics [5]: atomic_flag_test_and_set atomic_flag_clear exchange cas add sub and or xor. But requires a separate type for all atomic objects (so, might be augmenting with a 'lock' location)
LLVM atomic load, atomic store, atomicrmw: xchg add sub and nand or xor max min (signed max and min), umax, umin (unsigned max and min)
pnacl: llvm.nacl.atomic.store, llvm.nacl.atomic.load, llvm.nacl.atomic.rmw (read-modify-write operations; compile-time constant specifying the operation as one of: add, sub, or, and, xor, exchange), llvm.nacl.atomic.cmpxchg, llvm.nacl.atomic.fence, llvm.nacl.atomic.fence.all, llvm.nacl.atomic.is.lock.free: See atomic intrinsics
asm.js atomics addendum [6]: load store exchange add sub and or xor
js proposal [7]: i only skimmed it and i may have missed a lot. what i see is: load store xchg add sub and or xor (in list in 'Atomics.isLockFree' and also in '6.3.3Function Properties of the Atomics Object')
some GPUs (NVIDIA/CUDA) offer: barrier synchronization ("syncthreads") between a 'warp' of SIMD threads; add sub xcng min max inc dec and or xor cas; threadfence "waits until all global and shared memory accesses made by the calling thread prior to call are visible to all threads in the thread block. In general, when a thread issues a series of writes to memory in a particular order, other threads may see the effects of these memory writes in a different order."[8]
gcc follows C [9]: load store xcng cas add sub and or xor nand test_and_set clear
i don't think webassembly has atomics yet [10], although i'm not sure, and the previous link and [11] imply that atomics will be inspired by LLVM/PNACL/asm.js
i didn't look at this, but there's also https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man3/atomic.3.html

so, near-intersection of all of the above is:

xchng (not in RISC-V, which does have 'swap'; i'm interpreting 'swap' as swapping two memory addresses (shared memory) whereas xchng swaps a (local) register with a memory address, although i dunno if that's what RISC-V meant)
add (not in one version of C#, but in the other one)
and or xor (not in C#)
load/store on integers up to 64-bit (not sure if in RISC-V, probably is. Does C# have only load and not store?)
sub (not in RISC-V or C#, but C# has dec)
atomic_is_lockfree: in many of the C-ish higher level ones

so, my thoughts are:

i guess CAS in a tight loop is equivalent to XCHG? i see why they want XCHG though, that would be more efficient if that's what you want
add, or at least inc
load/store on integers up to 64-bit
maybe: and or xor dec (dec/sub not supported by RISC-V, so probably not)

is there any value to having only inc instead of add? One version of C# does this. Cassandra's counters only support inc: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/counter_type.html. However afaict both the old and new implementation of Cassandra counters could have supported add: http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters .

so, in order from most pressing to least, we want:

load on integers up to 64-bit (2 operand)
add (3 operand, but could deal with 2 operand)
store on integers up to 64-bit (2 operand)
xchg (2 operand)
and maybe, and or xor (2 operand)

so, we'd like 4 2-operand spots. Actually, load/store on multiple words can use datastack, so only need 1 operand spots for those 2, so 2 2-operand spots and 2 1-operand spots. Actually, we need another 2-operand spot b/c we need ADD in both u16 and ptr variants. Done.

also, the is_lockfree stuff should be covered by our cpuinfo analog

---

i'm having second thoughts about including any capability stuff at all into Boot.

The reason is that checking capabilities requires inserting extra instructions into the LOAD and STORE implementation (and probably also fat pointers). We don't want to incur the efficiency costs of this while the OotB? implementation (written on top of Boot) is running -- and probably nor while certain parts of the Oot implementation is running on top of that.

For OotB? or maybe just for Oot, we wanted boundschecking anyways. But not for Boot -- Boot code can be unsafe.

We could have ordinary, unsafe LOAD and STORE, and then additional safe LOADWITHCAP and STOREWITHCAP instructions. But in this case, couldn't we just have the LOADWITHCAP and STOREWITHCAP stuff in the OotB? implementation at the next level up?

If we eliminate these, this brings up the question as to whether we want a special register CAP (and whether we even want CAP to be a stack). It would be useful to have an SOS (second of stack) register instead. However, OotB? will definitely want a CAP register and it might be nice to let OotB? have 8 other GPRs.

i guess i think we should eliminate capabilities and the CAP register. I guess we should make it SOS.

---

for NUMA systems, how to specify the desired topology of our processes when FORKing?

maybe the Fortress language has something to say about that

i think it's probably too complicated/esoteric for me to worry about yet, though

---

so, memory ordering. What kind of memory barriers do we need?

RISC-V atomics has a FENCE instruction, which provides that "no other RISC-V thread or external device can observe any operation in the successor set following a FENCE before any operation in the predecessor set preceding the FENCE"

RISC-V also says "After much debate, the language community and architecture community appear to have nally settled on release consistency as the standard memory consistency model and so the RISC-V atomic support is built around this model."

and "To provide more e cient support for release consistency [6], each atomic instruction has two bits, aq and rl , used to specify additional memory ordering constraints as viewed by other RISC-V threads....If both bits are clear, no additional ordering constraints are imposed on the atomic memory oper- ation. If only the aq bit is set, the atomic memory operation is treated as an acquire access, i.e., no following memory operations on this RISC-V thread can be observed to take place before the acquire memory operation. If only the rl bit is set, the atomic memory operation is treated asa release access, i.e., the release memory operation can not be observed to take place before any earlier memory operations on this RISC-V thread. If both the aq and rl bits are set, the atomic memory operation is sequentially consistent and cannot be observed to happen before any earlier memory operations or after any later memory operations in the same RISC-V thread, and can only be observed by any other thread in the same global order of all sequentially consistent atomic memory operations to the same address domain.

RISC-V also says "Theoretically, the de nition of the aq and rl bits allows for implementations without global store atomicity. When both aq and rl bits are set, however, we require full sequential consistency for the atomic operation which implies global store atomicity in addition to both acquire and release semantics. In practice, hardware systems are usually implemented with global store atomicity, embodied in local processor ordering rules together with single-writer cache coherence protocols"

RISC-V also says "In general, a multi-word atomic primitive is desirable but there is still considerable debate about what form this should take, and guaranteeing forward progress adds complexity to a system. Our current thoughts are to include a small limited-capacity transactional memory buffer along the lines of the original transactional memory proposals as an optional standard extension \T"."

A Tutorial Introduction to the ARM and POWER Relaxed Memory Models says "The dmb/sync barriers are potentially expensive, but satisfy the property that if inserted between every pair of memory accesses, they restore sequentially consistent behaviour". DMB is a fence on ARM and SYNC is a fence on POWER.

so, it seems like C11, RISC-V, LLVM all use the 'release model' in which ordinary access is relaxed (which has a particular meaning in C++14, at least) unless synchronized by acquire/release commands (which can be placed on memory accesses or on separate memory barriers), and there is also a way to specify an acquire+release+sequential_consistency memory barrier. So let's do that (except only provide memory barriers, don't attach acquire/release to commands, because we don't have any free bits in the SHORT (Boot) instruction encoding). Apparently ARM is the weakest popular memory model hardware, with the old POWER being similar to it, and both of them offer a memory barrier that ensures sequential consistency between barrier statements, so that should be feasible. Noting RISC-V's statement on seq. cst. implying 'global store atomicity', let's divide up our memory into 'memory domains' so that if, say, some of the memory is actually distributed, we don't have to provide sequential consistency between blocks of memory on different nodes.

RISC-V suggests that there's some difficulty with multiword atomics necessitating a transactional memory buffer. What exactly are the issues, and what exactly are they thinking here? I'm guessing that the issue is not in making a multiword LOAD and STORE (which we want), but rather in making a multiword CAS; https://www.google.com/search?q=multiword+atomic+primitive pulls up results talking about CAS.

---

https://en.wikipedia.org/wiki/RISC-V#Atomic_memory_operations says:

" Release consistency is also an unusual choice.[3] The IBM System i and x86 both implement a compare-and-swap (cas). First; cas guarantees that all threads will progress. Also cas tests data, not bus access, simplifying the bus logic. The classic problem with cas is that a thread "A" can read the data, followed by a read by thread B, and a write by A, causing failure in which both thread A and B appear to have access. In general, this is solved with a complex lock structure. However, a complex lock slows and complicates the computer. For example, the x86's "double-wide cas" requires a special instruction format to specify multiple registers, performs several reads and writes, and can have complex bus operation.[3]

However, release consistency is more efficient:[3] It usually requires only one memory load, and minimizing slow memory operations is desirable. It's also exact: It controls all accesses to the memory cell, rather than just assuring a bit pattern. However, unlike cas, it can permit a livelock, in which two or more threads hang waiting for each other. RISC-V guarantees forward progress (no live-lock) if the code follows rules on the timing and sequence of instructions: 1) It must use only the "I" subset. 2) To prevent repetitive cache misses, the code (including the retry loop) must occupy no more than 16 consecutive instructions. 3) It must not include any system or fence instructions, or taken backward branches between the lr and sc. 4) The backward branch to the retry loop must be to the original sequence.[3] "

i don't really see why 'release consistency' is the name of lr/sc (ll/sc). isn't that separate?

possibly related/possible toread anyways: http://www.realworldtech.com/arm64/5/

---

some remaining issues:

64 instructions (or worse, 64 instructions + 16 syscalls) seems like a lot for a 'dead simple' bootstrap language. Perhaps we should just leave many of the opcodes 'implementation dependent' and then define them with custom instructions in OotB?? But, how would custom instructions target things in the middle of TWO, ONE, etc? Perhaps we could define them in the OotB? implementation, rather than with custom instructions?

done:

do we handle capabilities at the Boot level? Do we have a CAP register? A MODE register? Are they read-only by ordinary program code? Do we have the concept of supervisor mode?
- done: i said that access privileges are implementation-dependent (within limits), and renamed CAP to IMPREG and said its semantics (including whether it is readable/writable) are implementation-dependent
when our operands were 12 bits, we had lots of opcodes, but now that we have only 256 opcodes in MEDIUM, things are looking tighter, esp. if we already have 80 of them defined by Boot. Recall that we wanted each Boot instruction, including those buried within TWO, ONE, etc, to correspond to one medium opcode -- now it's true that some of the Boot instructions will be 'duplicates' that map to the same opcode (things like ADD and ADDI where the additional addressing modes in MEDIUM let you cover multiple SHORT/Boot instructions with one MEDIUM opcode), but some aren't.
- although, due to constant table addressing mode, at least we'll have another 256 opcodes in the OotB? module 'custom instruction' table
- also, it's not like we've been adding instructions to Boot that we wouldn't have added to OotB? anyways
- done: i think this is fine b/c we would be adding these to OotB? anyways
If we are just using the last two addr mode bits in MEDIUM as a convenience displacement, then consider having 10-bit opcodes (field1) (we still want at least 2 addr mode bits in the opcode field for the 'call3' indirect/dynamic method of first-class function/instruction execution)
- not worth the complexity, esp. b/c we want to limit the number of builtin opcodes anyways

---