Bayle Shanks's website: proj-oot-ootAssemblyNotes12

if SootB? is also SHORT, i guess we really should find a way to at least call custom instructions from there.

also, if we've 'broken the seal' of non-fixed-size instructions in SHORT, OR if we use 16-bit fixed size instructions, should consider if we can do better and get some other 'addressing modes' in there besides pure stack.

if we throw some register addressing in there, that doesn't really increase the complexity of implementation very much imo, it's still a good bootstrap language.

---

we basically want 3 addr modes:

immediate
register
stack

we need at least ~4 bits for opcode, plus some way to call longer 'custom' instructions.

3-operand format is nice but 2-operand mode is probably more efficient.

do we need to call first-class instructions? nah, save that for MEDIUM. So we never need the 4-th (3rd) operand.

With fixed length, we'd need at least 16 bits; because you really want at least 8 bits for absolute jumps, and you want the bitlength to be a power of 2. Another reason for this is 4 opcode bits + 2 operands * (2 addr mode bits) = 8 bits, without even putting in the operands yet.

So let's say we have 16 bits.

For absolute jumps, that gives us an instruction format with 4-bit opcodes and 12 bits of data.

For 2 operands with 2 bit addr modes, that's (16 - 4-bit opcode - 2*2 addr mode bits)/2 = 4 bits per operand left.

Now if we need more opcode bits, so that we can have more instructions, that reduces the above number. For 3 bits per operand, we can have 6-bit opcodes, and for 2 bits per operand, 8-bit opcodes.

but recall that since this is SHORT mode, we need 2 form bits now.

---

Since we only really need 3 addr modes, and since stack addr mode doesn't need an operand, we could compress things further:

1 bit: register or (immediate+stack)
3 bits: (immediate+stack), if 0, stack, o/w, # - 1 is immediate
or, we could just pretend that the stack register is magic, and have 7 real registers, and reads/writes to the stack register do stack addressing; or, we could have a PC register too, like with MEDIUM; actually, i like that better

---

So we're looking at:

2 form bits
14 bits left, in 2 formats:
2 opcode bits, 12 bit immediate operand (JMP, etc)
6 opcode bits, 2x (4 bit operand)

---

wait, we COULD do first-class functions in here pretty easily, since we have 6 opcode bits, if we just use the first 2 bits to indicate when that happens;

2 form bits
14 bits left, in 3 formats:
2 opcode bits, 12 bit immediate operand (JMP, etc)
6 opcode bits, 2x (4 bit operand)
3 bits + 3-bit first class function operand + 2x (4 bit operand)

---

if we added a 4th format we could also pack 3 4-bit stack opcodes in there.. jeez this is getting to be a lot of formats. Also, that would no longer be fixed-length. So mb not? Also, will chains of 3 stack-addressed instructions be all that common when we allow register addressing too?

---

could we possibly reduce this whole thing to about 8 bits?

2 form bits
6 bits left, in 3 formats:
1 opcode bit, 5 bit immediate operand (LOADI)
2 opcode bits, 2x (2 bit operand): immediate 0, immediate 1, stack, register 0 (or just register-style access to TOS?)
1 opcode bit, 1 bit for which register the first class instruction is in, 2x (2 bit operand)

---

so that reminds me, we'd really like at least 4 addr modes:

immediate
register
stack
register-like access to the first few stack locations

---

previous proposal is nice but 5-bit immediates isn't much; that only gives us a JMP table with 32 entries. We could accept variable-length addressing w/r/t the FORM bits. This wouldn't be too much more crazy than the old, simpler, 4-bit proposal:

we already had 'variable length' in the old proposal because sometimes we had those 12-bit immediates
we already only had 4 bits of opcodes, now we're just adding a standardized format to do some things that would require multiple 4-bit instructions, some of which with 12-bit immediates, before

---

so how about:

2 formats:

64-bit MEDIUM IMMEDIATE: 2 form bits + 12-bit opcode + 3x (4 bit addr mode + 12 bit operand) combined into >=32-bit immediate
2 form bits + 1x 6 bit (see below) + 3x 8-bit SHORT:

well now we're getting into the same sort of packed SHORT format we had before, where the first (and maybe the last) instruction has a special encoding, to accomodate those 2 form bits. But now we're trying to do a 2-operand format instead of 1. The old ootAssemblyNotes7 SHORT format used a system similar to the limitations on expressions in terms of pushes/pops and parenthesis described slightly earlier in ootAssemblyNotes11. The idea is that you probably want to PUSH the outcome of most of the instructions except at the end.

---

So how about:

2 formats:

64-bit MEDIUM IMMEDIATE: 2 form bits + 12-bit opcode + 3x (4 bit addr mode + 12 bit operand) combined into >=32-bit immediate
64-bit SHORT (4 packed instructions): 2 form bits + 1x 7-bit + 2x 8-bit + 1x 7-bit

hmm:

immediate 0 and 1 are probably less useful than you might think
surely we could spare a bit to indicate if the output should be PUSHed onto the stack, or if it should overwrite a register, or if it should overwrite TOS? Or mb just always PUSH.
also, i forgot we only need 1 format bit for SHORT, not 2
this format doesn't leave much room for annotations

---

for the SHORT ones:

4 bit opcode
various formats depending on opcode:
- imm4: 4-bit immediate
- (in

(in

out)2 (in	out)2: 2x 2-bit ordinary (see below)
out)2 imm2: 2-bit ordinary + 2-bit immediate

2-bit ordinary operand encoding:
- register 0
- register 1
- stack (POP)
- TOS (read the input from the top of the stack, but don't pop)
output is always PUSH'd, except for when there is an 'out' argument
the 7-bit instruction (the first one) replaces the first input in2 with 1 bit: register 0

POP, or if it's an imm4, then it becomes an imm3

instructions:

ANNOTATE imm4: annotation_type2,annotation_value2: informative/notification (not consequential/impactful) annotation; has no effect (could be hints, information like stack maps used for program verification, etc). NOTE: should reserve one or more annotation_types for implementation-defined annotations (or does 'comment' serve as this as well?)
JS in2 imm2: jump to in2 + imm2
BR imm4: relative branch by (signed) imm4 instructions
SKZ in2: if in4 was zero, then skip next instruction (note: in MEDIUM mode, the skip can have a constant from 1 to 256, but this must always be 1 in user code, because it may expand as custom instructions are inlined)
SKNZ in2: if in4 was non-zero, then skip next instruction
MOV out2 in2:
ADD-U15 in2 in2: add
SUB-U15 in2 in2: subtract
LEQ in2 in2: <=
LOAD out2 in2: Load value from memory location at in2 to out2
STORE out2 in2: Store value from in2 into memory location at out2
LOADI imm4: Push constant
SYSCALL n4: Call out to external library function or custom instructions n4, which will pop any needed arguments from the stack, and push any results to the stack
??: LEA in2 in2: Load Effective Address; inputs are address mode and operand
??: CALL
??: RET
??: MOV1 (MOV with two input operands concatenated to a single long input operand)
??: MOV2 (MOV with dest and one of the input operands concatenated to a single long output operand)
?:: alternative translation block (eg CALL as a single LONG instruction vs CALL as many medium instructions; but in general every alternative block may span multiple instructions). Perhaps this has the effect of an unconditional JMP to the last alternative block, for those interpreters that can't understand the other alternatives.
??: MOVSTACK (when you want to mmap a datastack or callstack)
CALL4, CALL5, CALL6 (for calling first-class functions in SHORT mode)
SAVEREGS(highest # reg to save), RESTOREREGS(highest # reg to restore) (to memory, or to stack?)
todo: would be nice to have a LOADI2 out2 imm2
todo: would be nice to have an ERR or RAISE that reports (or throws?) a specific error code, and would be nice to have a BRK that traps (or just a TRAP, at least)

plus one MEDIUM instruction (LOADI 16-bits; note it can fit more than 16 bits but for SootB? we don't assume that our regs can hold more than 16 bits; i guess we should offer variants of LOADI that load into register 0, register 1, stack, or TOS? ordinary MEDIUM mode can support each of those. Maybe we also offer a 'STOREI' which writes a 16-bit immediate to any 11-bit address in (register?) memory)

NOTE: in MEDIUM, 'BR' in user-level code is constrained to only use 8-bits (one of the two inputs). The other input is reserved for the implementation. It is expected that the implementation uses this when inlining custom instructions (which increases the distance between points in user-level code).

todo: for hierarchical modules, need a way to LOADK an imported module into somewhere, and then LOADK from that memory location to access something exported by that module. I guess just treat modules as address subspaces.

---

y'know, i'm backing off the idea of unifying SootB? and SHORT. The thing is, SHORT is supposed to be efficient, and we won't know what's efficient until we profile, and we may want to change it each release to be most efficient. Whereas SootB? is for bootstrapping, so it prboably shouldn't change so rapidly. Otoh SootB? bootstrapping will run more efficiently if code density is high.

i guess the answer is, unify them but not yet? mb until then just use the subset-of-MEDIUM thing?

---

eh, maybe unifying SootB? and SHORT is good after all.

i guess what's really going on is that i think it's not worth spending as much time and energy as i have been on the details of SootB? (and even OotB?) assembly encodings at this point. This is supposed to be just the underpinnings for an easy-to-implement, interoperable implementation of Oot. And Oot isn't even designed yet.

---

i'm considering killing the second 4 addr modes and replacing them with a 'meta' bit. This might fit in with 3-Lisp too (although there the 'reflection procedures' were declared as such at the declaration site, not the callsite).

we also want some way to do 'stack' addressing, though. Maybe mmap it after all.

as for the sort of thing a 'meta' bit could be for, i'm copying this bit in here from ootAssemblyNotes7:

Edges may be represented as reified 'metanodes', with 'metaedges' connecting them to their source and target
An edge from one node A to another node B may point to one of many 'views' of the target node B. For example, in addition to the typical view, there may be a 'hyperview'; in the metaview, metaedges and metanodes are explicitly represented. If there is an edge from node B to node C in the ordinary view, then in the hyperview there will be an edge from node B to a node reifying that edge.

---

(following the previous) If meta level 2 is specifying a metaview as the view when targeting an edge (that is, the target of an edge is a node, but we can specify the metaview of that node, and then from there we can traverse from that node to other 'nodes' representing reified edges), then maybe meta level 3 is like an edge whose target is a view which is a graph which represents the different meta levels and their interrelationships.

 noting that there's different ways to construe meta (eg application level meta, implementation level (pointers), etc), as noted above

---

the 'meta' bit could choose between one of two 'view registers', which (similar to the capability registers) contain something which changes the 'view mode'.

my sense however is that 'meta' should be a mode within addressing modes (eg a direct product with existing addressing modes, doubling their amount; a modifier bit attached to addressing modes).

---

the interesting thing is that the 'unbox' bit and 'meta' bit are going in opposite directions, in a sense. A BOXED thing is a pointer (when the unbox bit is OFF), and a meta thing is a pointer (when the meta bit is ON); a pointer is more complicated (more meta), so the compilicated case occurs when the unbox bit is off but when the meta bit is on.

of course, if i had made it a 'box' bit that wouldn't be. But there's a reason it's an 'unbox' bit; treating a pointer in memory as a pointer is more primitive, less abstract, then pretending it is the thing it points to.

i think it's because 'unbox' refers to application level/platform stuff, whereas 'meta' refers to pure abstraction.

---

copied from ootModuleNotes1:

About C and C++ header files, and the C++ modules proposal [1]:

header files are useful because it allows the #including module to be compiled (otherwise) separately, because the header file provides enough information to determine the in-memory representation of types
in C++, header files must be re-loaded many times, upon each #include, because they might have macros that affect the file #include'ing them, or they might compile-time-execute differently in different files (eg b/c of logic like #ifdef)
in C++, header files cause a significant increase in compilation times
in C++, the macros in header files make things difficult for tooling
- Holzmann's P10 rules recommends avoiding "Token pasting, variable argument lists (ellipses), and recursive macro calls" in macros, and requiring "All macros must expand into complete syntactic units". They recommend minimizing but not totally avoiding conditional compilation directives
another problem with header files is that they make you have to change stuff in two places, rather than one, when you change function signatures

recommendations:

abstractly, consider the type signatures of a module different from its content
in binary semi-compiled module files (OotB??) represent this explicitly (maybe with a hash, so you can easily see when it did not change?)
- this has implications for semver versioning; when the major version is above 0, patch version increases should not change the signature section hash; minor version increases can add new signatures but not alter existing ones. Major version increases can change everything.
this will be something like a module descriptor/interface/header file in order to allow separate compilation. Need enough information to determine in-memory representation of types.
avoid macros in that file
avoid any logic that would require the file to be loaded more than once if imported by different modules
the order of module imports should be irrelevant
private attributes of classes/structs should not be visible

---

could we just skip both SootB? and OotB? and go directly to Root Core?

No, i don't think so, because it's more intimidating to write a parser than to write an interpreter for a linear instruction stream.

---

Could we just trust OotB? code and dispense with capabilities (capabilities would then be on the Root Core level)? Maybe... this gives up the ability to execute OotB? code in a sandbox, but perhaps we don't really need that, perhaps we only need to sandbox in the Oot Core interpreter.

Of course, OotB? could still be an efficient bytecode REPRESENTATION for Oot Core. But i'm saying that the bootstrapping OotB? interpreter need not handle capabilities.

And we could reduce the primitives instructions for (R?)Oot Core down to what we are getting for SHORT mode (SootB?) anyways..

And we could write the module loading code later, as part of the Root Core implementations, so that the bootstrapping intrepreter doesn't have to do it.

So then we'd have only:

Primitive OotB? : This is the only thing that a porter must write an interpreter for
- OotB? stdlib: Adds convenience to OotB? to produce 'full OotB?'
'Full' OotB?: Portable high-level 3-address IR/VM
- Root Core interpreter. This is where we implement security (capabilities) and language services (preemptive greenthreading, garbage collection)
Root Core: Restricted subset of Oot Core, designed to be easy to compile to efficient code
- Oot Core interpreter
Oot Core: language used for metaprogramming in Oot
- Oot implemented as Oot Core + metaprogramming

to bootstrap a naive port, just implement Primitive OotB?. To make it faster:

incrementally override parts of OotB? with a native implementation
then, write a native RootB? Core interpreter
then, write a native Oot Core interpreter

Note that Root Core can be encoded in OotB?.

So what about the other difficult thing in OotB?, the unboxing (and 'meta' bits, and LONG, etc)? Just leave that in the encoding, but don't use it in OotB?; that's for Root Core representated using the OotB? encoding!

hmmm.. i kinda like this... seems like the Root Core interpreter may not be the best place for language services, though? I thought we wanted to implement those beneath the implementation of language constructs, at least naively. Well... i guess that qualifies, actually.. the Root Core language implementation will be much simpler, and we are an interpreter tower level below the Oot Core implementation.

---

ok so if we are following the immediately previous section, then the tower can be easily conceptualized:

OotB?'s goal is to be easy to implement, for bootstrapping, while as serving as a simple 'universal bytecode encoding' for Oot (and maybe other) ASTs
Root Core has two goals. First, the implementation of Root Core encapsulates various language administrative services, such as garbage collection, preemptive greenthreading, capabilities. Second, Root Core is a restricted (statically typed, no metaprogramming, non-lazy, etc) subset of Oot Core which is easy to compile/interpret (somewhat) efficiently.
Oot/Oot Core are implemented in Root Core. Oot is just Oot Core with some metaprogramming done within Oot Core, so this is not a separate interpreter tower level.

OotB? is a language in itself, but it's also an AST representation encoding which can be used to encode Root Core or Oot Core code. When we come to additional 'higher-level' things like that for which it is not clear how OotB? should support them, we should try to push them up to Root Core. For example, there is an unboxing bit in OotB? MEDIUM format, but this isn't used in OotB?, only in Root Core. Also, when there is something that would put too much implementation burden on porters, we should to push it up to Root Core. For example, capabilities.

In a naive Oot implementation, then, the Oot program is syntactic sugar for an Oot Core program. The Oot Core program runs on an interpreter implemented in Root Core, which is itself running on a Root Core interpreter implemented in OotB? (which is itself syntactic sugar for a program composed only of OotB? primitives).

---

mb should have a different name for Oot Bytecode, the encoding format, and OotB?, the language.

Maybe Oot Bytecode vs Oot Assembly.

---

the BCPL language is interesting. BCPL was the predecessory to B, which was the predecessor to C. "B was essentially the BCPL system stripped of any component Thompson felt he could do without.". B was the first language to use a bytecode, called OCODE, for portability. Later, a lower-level assembly VM was added below the bytecode, called INTCODE. INTCODE was initially very simple, with 8 primary functions: LOAD, ADD, STORE, JUMP, JUMP ON TRUE, JUMP OF FALSE, CALL, EXECUTE OPERATION, with 23 more functions accessible via EXECUTE OPERATION, and then about 5-10 OS interfacing functions also in EXECUTE OPERATION.

There were two accumulators, two index register, an address register, and the PC. Like OotB?, the size of words was unspecified. An instruction was "a 3 bit function field, and an address field of unspecified size, 2 bits for index modification and an indirection bit"

See proj-plbook-plChSingleIntermedLangs for details.

---

hmm, as nice as it seems when i describe the idea of keeping OotB? simple and pushing anything complicated up to Root Core, it feels wrong.

So maybe keep capabilities and unboxing in OotB?.

I do think we should abandon SootB?, though.

And perhaps also Root Core should not be a wholly separate stage. I do think that the Oot Core implementation should be written in a restricted subset of Oot, and i do think we might have a special Oot implementation for this restricted subset (since it can assume static typing, etc). And so maybe when Oot is interpreted, it will be interpreted by an interpreter written in Oot Core, which is itself interpreted in an OotB? interpreter. But when compiled, i don't think ordinary Oot code should compile to this restricted subset, it should compile directly to OotB?. And since we can write platform-independent self-hosting compilers, we'll write a compiler from Oot Core to OotB?.

So the plan is:

OotB?
- Oot Core implementation; possibly separate Root Core implementation
Oot Core

---

another way we could have quintuples and interpret one field as 'meta' is just to reserve one field for metadata/annotations/prefix modifiers.

Eg we could take 2 bits from each of the 4 fields to make 8 bits of metadata per instruction, reducing the operand sizes from 12-bits to 10-bits (we could also interpret this metadata as being a 2-bit 'meta addr mode' plus 6 bits of 'meta operand').

Or we could also say that the last three fields (the 'true operands') should be largest, then the instruction field, then the meta field, and similarly with their addr mode bits. If the instruction field were only 8 bits, and the meta addr field were only 1 bit (so the whole meta field were 7 bits, not 8), then we could make the 'true operands' each 11 bits: 2 form bits, 1 meta addr mode bit, 6 meta operand bits, 2 instruction addr mode bits, 8 instruction operand bits, 3 x (4 addr mode bits + 11 operand bits). Lotsa nasty odd numbers in there, though. Also, the addressable instructions drop from 4k to 256. Also, that goes against the principal of allowing any register to serve hold a first-class function which can be immediately called. Unless... we reduce the number of registers to those addressable by the instruction field, and let the three 'true' registers hold larger numbers than the number of registers!

I don't think we need 8 bits of meta, and 10 bits operands are getting tight. How about 256 registers and:

2 form bits, 4 meta bits, 2 instruction addr mode bits, 8 instruction operand bits, 3 x (4 addr mode bits + 12 operand bits)

8 instruction operand bits sounds perfect, from the perspective of how many instructions (and custom instructions) we need, and from the perspective of keeping the max number of registers down to a reasonable number. But otoh recall that we are intending to map HLL local variables to registers. 255 locals is a pretty low limit; out of the popular languages that i've looked at, only Lua and 1988 C comes near that (200 local limit and 127, respectively). Also, Perl6's MoarVM? has ~700 instructions.

So maybe:

2 form bits, 1 meta addr mode bit, 4 meta bits, 2 instruction addr mode bits, 10 instruction operand bits, 3 x (4 addr mode bits + 11 operand bits)

or just:

2 form bits, 4 meta bits, 2 instruction addr mode bits, 11 instruction operand bits, 3 x (4 addr mode bits + 11 operand bits)

2k (11 bits) makes me happy in that:

it's a decent number of HLL local variables (but.. http://web.eecs.umich.edu/~tnm/trev_test/techRepts/2000.10.The_Need_for_Large_register_files_in_Integer_codes.pdf shows that the Perl interpreter would need around 2k local vars)
if each of 2k registers is 16 bits, then that's 4k for the register page, which is exactly a common OS page size [2]. 4k is already more than http://web.eecs.umich.edu/~tnm/trev_test/techRepts/2000.10.The_Need_for_Large_register_files_in_Integer_codes.pdf suggests is almost ever useful

buuuut... i think taking a whole OS page for the registers is pushing it. It would be nicer if the interpreter could fit the registers, and some additional internal state, onto one page.

So, mb back to:

2 form bits, 1 meta addr mode bit, 4 meta bits, 2 instruction addr mode bits, 10 instruction operand bits, 3 x (4 addr mode bits + 11 operand bits)

The issue with this is just that those 5 meta bits are totally unused right now, which is a big waste of space.

---

one reason to have 8 bits of registers is the idea of per-module custom instruction dispatch tables. With 10 bits of registers, even if we can only have 256 modules imported into any one module, that's an 18-bit dispatch table.

But even with 8 bits of registers, that's a 64k x 2 (assuming 16-bit addressing) dispatch table. Which is already way too large.

So maybe make the 'custom instructions' per-program instead of per-module.

---

incidentally, if you want quints, then we're getting closer to 128-bit instructions if we wanted 16 bit operands:

5 x (4 addr mode bits + 12 operand bits) = 80 5 x (8 addr mode bits + 12 operand bits) = 100 8 form bits + 5 x (8 addr mode bits + 12 operand bits) = 108

still have 20 bits free though, and we don't really need 8 addr mode bits, so this doesn't look like a great bargain.

---

i'm thinking of bringing SootB? back in the form of an 'extensible interpreter' with EXEC; that is, you can specify, in Oot code, how to handle the 2 custom addr mode bits (including the unbox bit), the form bits, and the meta bits. And how to deal with address subspaces, which can include getters and setters and capabilities. And perhaps handle 'custom instructions' too. You specify this stuff with 'plugins' for the interpreter, then you do something like 'EXEC' to tell the interpreter to start using those plugins. The underlying dispatch in the interpreter doesn't change; so this way you are not running an OotB? interpreter on top of a SootB? interpreter, you are running an extensible OotB? interpreter with plugins written in OotB?.

---

here's a proposal (i think this was a quals proposal, not an actual project) for a language to define data representation format/layout:

http://tap2k.org/projects/WIL/

---

note: to prevent security hazards, mb 0 is false and 1 is true, but must admit the possibility of some other integer, eg 2 or -1, being given.

---

mb just make it uadd08 after all, why not?

well, here's a reason why not: we mandate that registers must be able to accomodate at least U15s. Less confusing for implementors this way.

---

http://cs.lmu.edu/~ray/notes/squid/ is a neat IR, close to our own in some ways. It aspires to platform representation format independence. This webpage which describes it has admirably concise wording.

in addition, because the examples at the bottom show HLL-style syntax rather than assembly style, this would make it clear to a newbie how you could possibly do this sort of HLL stuff in assembly.

---

So if we don't want a Primitive OotB? implementor to have to implement LOADMODULE, then what primitives do we need?

We could make custom instruction lookup extensible. But that seems to gut the purpose of have an extensible interpreter instead of one interpreter running on the other, because it would be really slow. Remember, in the simplest case (unbox bit zero, meta bit zero, instruction-level meta bits all zero?), we want to run an instruction without touching the plugin code.

But if we don't do that, then the implementation must at least sort of understand the concept of different modules, right? It must know which module a piece of code is currently in, and maintain a dispatch table for each module (or map them all together).

No, not if we make custom instructions per-program instead of per-module. Which is fine provided that custom instructions are really just a bootstrapping technique, not a truly modular instruction set. In this case all we need is ISDEFINE and DEFINE. Note that SYSCALLs can also be defined.

---

LONG format would be simpler if we just had a series of 64-bit length fields!

but since we're 16-bit-y, maybe a series of 16-bit length fields instead? This is still different because the grouping of fields into 'instructions' is variable-length.

---

so i guess the main things in LONG are:

represent arbitrary-length fields
represent arbitrary modifiers on any field (and modifiers on modifiers, etc)
do we want to do the 'syntactic categories' (see previous proposal in ootAssemblyNotes8; search for 'current proposal')
arbitrary graphs, with meta stuff
the unfinished graph-y proposal at the bottom on ootAssemblyNotes7
there's another list of LONG goals in ootAssemblyNotes7

for representing arbitary-length fields, there are a few obvious choices:

when the last m bits of a number is zero, let the first n bits of a number just encode a (small) number
when the last m bits of a number are non-zero, depending on what they are, do one of:
- let the first n bits encode the first n bits of the output, and continue adding on bits in this way until the last m bits are zero
- let the first n bits encode the length of the field to follow; but if the length is the maxlength then go that much and then look for another length header (which may be zero, if the max length is exactly the actual length; or we may exclude zero and reduce maxlength by 1, for redundancy)
- continue going until we have a sentinel (NULL, 0); also, first have a table of where actual NULLs should be inserted
- continue going until we have a sentinel (NULL, 0); have quoting
- continue going until we have a sentinel (NULL, 0); have neither quoting, nor a NULL location insertion table

---

y'know, on an actual 16-bit machine, yes, 1k registers (which, in most implementations, probably each must be able to hold a native pointer) will take up 2k bytes of memory; but on typical machines, which are 64-bit, 1k regs will take up 8k of memory, which is two full pages! To make it take up half a page, you'd need at most ~256 regs. Which means <256 local vars, since the implementation needs some regs.

of course, most functions won't need all those regs, and better implementations will take advantage of that.

---

eh, if we're gonna have 'quints', may as well be more uniform:

2 form bits + 2 flag bits + 5 x (4 addr mode bits + 8 data bits)

dunno what that'll accomplish, maybe something like letting the meta stuff address any register. This is kind of a big waste, though, since there are 12 bits of meta stuff (and 2 custom flag bits) that aren't needed for most instructions. But it's general and clean...

and it means that even with max registers, if each register is the size of a native pointer, then the registers fit into half of a 4k page on a 64-bit machine, leaving the other half for other implementation state.

This means that we would succumb to Lua-esque restrictions on number of local vars. Maybe worse, as, to be clean, we'd probably just limit the HLL to 128 vars and let the implementation use the other 128.

We'll also have trouble inlining custom instructions that use more than a few registers. We should define a small, fixed set of scratch regs for the custom instruction. Mb the implementation is responsible for saving and restoring these via a hidden stack? Or mb each custom instruction says how many it needs? Or mb just provide the 'hidden' stack explicitly, make each custom instruction declare how much hidden stack space it needs, and make the custom instruction 'callee save' to it as needed. Callee save is insecure, so we'll have to assume that any custom instructions are maximally trusted. Since custom instructions must not be recursive, we can statically determine the maximal nesting depth and similarly the maximal hidden stack space needed once we know the set of available custom instructions.

If we don't use it in the language, then the implementation could use the meta stuff or the flag bits for implementation-dependent annotation, eg whether or not this instruction has already been JIT'd. I guess we should specify whether that's allowed.

Yeah, ok, let's give the implementation at least one bit:

2 form bits + 1 implementation-dependent bit + 1 flag bit + 5 x (4 addr mode bits + 8 data bits)

the implementation-dependent bit must be 0 in Oot bytecode in a module file.

---

a minimal/bootstrap implementation of capabiliies would be easier if we didn't provide the ability to union capabilities together; so code would have to be constantly swapping different capabilities into the CAP register in order to access different address subspaces. Perhaps this unioning could be implemented later at a higher level?

i don't immediately see how to simply add the unioning on at a higher level later.. and forcing even higher-level code to manually swap stuff into the CAP register just for this is unacceptable.

could we instead just get rid of the special CAPMALLOC'd subspaces, and just unify capabilities with pointers? But that would mean necessarily typechecking to make sure we don't coerce capabilities.

or... capability tags on everything.. or... a separate map of where all capabilities are in various address subspaces, checked before each write

hmm.. none of these sound very appetizing

---

stuffing as much as possible into address subspace getters/setters seems like a good idea for easy bootstrapping.

---

hmm... we could just go OOP early and encapsulate the allowed operations on capabilities that way... ("early" just means in Oot Assembly, which is supposed to be somewhat lower level..)

---

ok i think you could do most everything by providing the following hooks in the interpreter:

custom instructions
'timer': executes every k instruction (where k is implementation-specific, mb even given on the commandline?), or every k seconds, or some random amount of instructions or seconds whose expected value is finite k. This allows you to implement preemptive multitasking in greenthreads.
'wrapper': this is called for each instruction. It is what actually calls the custom instruction, the getters and setters to find the effective addresses of the operands, etc. This allows you to impose atomicity depending on flags or do other weird things based on the instruction-level flag or the meta field.
'get' and 'set': these are called whenever a read or a write is performed to memory, or even to a register. This is what interprets the addressing modes, and does stuff like unboxing, and most crucially, enforces r/w capabilities. Accepts opaque arguments so that 'wrapper' can tell it things like 'and do it atomically'.
'jump': this is called whenever a JMP is made that may cross address subspaces (not just a skip or a branch). This is what enforces execute capabilities.
'short': this is called whenever a SHORT encoding is encountered. It is supposed to translate the SHORT encoded instruction into a series of MEDIUM instructions
'long': this is called whenever a LONG encoding is encountered. It is an arbitrary wrapper and has full access to the state of the interpreter (PC, etc).

Regarding address subspaces, they each also have getters and setters (and maybe jumpers), but calling these hooks is done by the core get and set hooks, so the porter doesn't have to worry about that. The type of address subspace will be statically known and fixed per-program (and hence fixed if you are using OotB? to run Oot) (in the same sense that the set of custom instructions will be fixed), so a more advanced implementation of an OotB? compiler can inline the address subspace-specific getter and setter hooks at compile time (and implement them using platform-native primitives).

Note that 'get' and 'set' and wrapper are themselves executing Oot code, but while executing this code, no wrapper is called, and getting and setting is done atomically! So the interpreter must have at least two 'modes' (or privilege 'rings'). Could generalize this to an interpreter tower. Note that in any case, it must be impossible for ordinary code to 'escape' back into the primitive 'mode' where getters and setters are unhooked, because this would allow them to ignore capabilities.

Problem is, the implementation of these hooks need to be able to call each other, and call out to other components of the implementation. Maybe use some of the 'meta bits' to indicate calling from (our) 'real mode' into 'protected mode' (ring transition; also we might want the reverse, ie TRAP)? And/or use LEA to resolve addresses, first-class functions (with some sort of 'emulation' bit to show that it isn't a CALL) to call instructions? Or just have a special jump table in some known memory location? Or use SYSCALLs?

One argument for an extensible 'ring tower' is that CPUs already have protected mode and real mode, and to some extent we're duplicating that (but also extending it, via capabilities). Which suggests that some applications running on top of us may like to extend it further in the future.

But here, b/c security, you can never go back down the tower, only further up. Contrast to 3-Lisp and reflective procedures. Maybe we should have a 'root bit' in our ambient capabilities (by 'ambient' i mean the ones not tied to any particular addr space; this could be encoded by only paying attention to them when they are attached to some special sentinel addr space, that would allow it to still be safe for MALLOC to always return full capabilities and to union these into our capability set) that allows you to ascend (just one level, not all levels); this would allow a 3-Lisp to extend and make use of our 'tower' encoding syntax.

---

We need a way for these hooks to call parts of the base-level (or next-lowest-level, if we have a tower) implementation. For instance, need to resolve (addr mode, operand) tuples into effective addresses.

LEA instruction? way to call getter and setter?

We also need ways for these hooks to call each other, eg for one hook to call, not the underlying implementation's LEA or getter or setter or jumper, but rather the same-level LEA/getter/setter/jumper hook.

---

do we need to hook LEA too?

---

is the previous worth it? my fear is that:

invoking a wrapper around each instruction, and get/set wrappers around each memory access, will be really slow, almost as slow as just executing a new interpreter on top of this one. All we are really gaining is that we don't have to do instruction decoding and dispatch twice.
the implementation complexity of the custom instructions is managable, but the complexity of those two privilege 'rings' may be significant -- if the only reason for this is to allow newbie implementors to implement it very quickly and easily because we've pre-written a lot of the code in the hooks so that they don't have to, having to sometimes enable and sometimes disable getters and setters (without making a security mistake) may be just as complicated as just telling them to implement everything directly.

All the implementation is really doing 'hardwired' is instruction decoding and dispatch. Is this useful? Having a simple, somewhat fixed instruction decoding is a little useful b/c it allows for other tooling (eg disassemblers), but that's sort of a different question (you could have that with one interpreter running on top of another, too). It's probably a little useful speedwise, and the same with dispatch.

Of course having these hooks is also kind of cool in itself, and if we have a tower, could be useful for something. Since a primary goal at this stage is ease of implementation, though, i'm not sure if those benefits really count.

---

each register must be large enough to hold either of:

a 16-bit integer
a native pointer (or at least, something that the implementation pretends is a pointer)

---

if we are going back to 8-bit operands, then we have to worry again about relative branch expansion while inlining custom instructions, limited jump table size per module, limited constant table size per module, and limited number of module imports.

jump table size per module and constant table size per module are not huge problems, because the JMP and LOADK instructions can just combine both input operands to give a 16-bit index into jump and constant tables. However, we may want to use one of these inputs to specify a module and the other one to specify an index within a module, in which case we're back to only an 8-bit within-module index.

If we don't use one of the indices for JMP and LOADK to select a module, then we'll need some special instruction JMPEXTERN and LOADKEXTERN to do that sort of thing. This might be preferable anyways; make inter-module operations bigger and uglier in order to optimize intra-module operations.

256 module imports max is probably fine, but mb not if each module is constrained to 256 exports, which it is.

The relative branch expansion is a big problem, though. If we constrain user-level code to only do SKIPs, and constrain fully expanded (recursively expanded) custom instructions to never expand to more than 256, then we can expand the skips into relative branches. But this would mean needing a lot more JMP labels, and since our JMP label table is now very limited in size this'll probably cause trouble. Also, having all these JMPs when we could have BRs seems ugly and wasteful. Also, do we want to limit custom instructions to fully expand to only 256 instructions (or maybe 127 since branches might be negative)? It's not a huge deal since we also have a stdlib (so eg SHA-256 can be a library function). And i guess custom instructions should be 'small', right?

We could probably live with these limitations with hand-written code. But a compiler targeting OotB? code will probably have to know how to break up modules, and replace primitive lookup with do multistep hierarchical module item lookup, when it hits the jump table or import/export limits.

No, wait, we don't really need 256 export max; 256 is just the max export # that can be directly accessed, within a single LOADK instruction using immediate addressing. By using a series of two LOADs (first LOADI a 16-bit number into a register, then do a LOADKEXTERN with register addressing (or if LOADKEXTERN is a syscall, then have the first LOADI push the index onto the stack), we can address 64k modules and address 64k items within each module.

We don't care if it's easy for compiler to optimize cross-module calls, so the fact that the target of the LOADKEXTERN is slightly obscured is ok.

So i guess we're good.

---

One thing that we could do to ease pressure on the JMP table is to introduce a BR instruction which takes a 16-bit input. Then we restrict it in user-level code to only taking an 8-bit input, and use the other 8 bits to deal with relative branch expansion while inlining custom instructions.

---

note: BRs count 64-bit 'instructions'; a BR within a SHORT can only jump to the first instruction of each 64-bit instruction group, it can't jump within SHORTs.

maybe SKIPs can jump within SHORTs though?

---

recall that when using the new combined 'immediate/constant' address mode, half (~127) of the items are immediate (so we can only count to 127?), and half (~127) are entries in the constant table.

similarly for instructions, perhaps half of the opcodes (~127) are real instructions (both primitive and custom), and the other half are syntactic sugar for putting the two inputs on the stack, CALLing something in the stdlib (or something in some sort of module-specific CALL table? or 64 of each?), and then returning and popping the return argument and writing it to the 'dest' operand's effective address. CALLs aren't inlined and so there's no relative branch expansion to worry about (so these can be things like SHA-256, garbage collection, hash tables, etc)

---

ugh, i really don't like not having immediate arguments up to 255. But i also don't like not having any addressing mode access to the constant table. And i also don't like not having any addressing mode access to the stack. And i also don't like not having any addressing mode access to LOAD/STORE.

This suggests that either 8-bit operands are too small, or 2-bit address modes are too small.

We could just reduce the size of those meta fields.

512 regs. 2 format bits + 1 implementation-dependent bit + 1 flag bit + 8 meta bits + 4 x (4 addr mode bits + 9 data bits)

One issue here is that 9 bits, being just one bit over 8, forces the implementation to use 16-bit shorts instead of 8-bit ints, which is a waste of 7 bits. Not too concerning since this is supposed to be a 16-bit-y thing, but still.

But this extra bit is exactly what lets us fit both 255 immediate indices, and 255 constant table indices, from one operand. And 512 regs is nice b/c it means we can give the HLL at least 256 regs. This also means that if each register is 64-bits, the register page is exactly 4k.

This also means that fully expanded custom instructions can go up to 255 MEDIUM instructions (b/c a user-level BR instruction can go +-255, and the implementation-reserved second argument of a BR instruction can be used to add up to 255*256; actually this gives a limit of 512 but let's leave it at 255 because it's more symmetric; this also gives us space to let the implementation intersperse an annotation in between each instruction).

---

OR we could have 256 regs, and use 1 bit of the 9 bits to switch between registers and stack, allowing us to use 'register mode' to directly access stack offsets, with only 4 addr modes.

hmm, i like this. We are essentially fitting in two extra addr modes.

otoh that's kind of silly/inefficient. Why not just call it what it is, and have 5 addr mode bits and 8 bit operands, if this is what we want? The addr modes would be:

immediate
constant
register direct
stack direct (no push/pop)
register indirect
stack indirect (no push/pop)
stack addressing (pre/post dec/inc), starting from a register
stack addressing (pre/post dec/inc), starting from the stack (but not push/popping the first stack)

for custom instructions, the 'stack addressing' could refer to the 'hidden stack' rather than to the actual stack. This is how they access temporary storage. If they need to access the actual stack, they must manually go through the stack register. When inlining, the hidden stack can just be placed on the actual stack, and stack indexing can be translated; alternately, the hidden stack can actually be in a register, and user code is just forbidden from using it (i like that a little better, actually; so reserve some registers for the implementation).

hmm... it's not crazy.

also mb migrate the 'meta' addr mode bit into the 'meta bits' in front, and make the flag bit implementation-dependent, now it's nice and even again:

2 format bits + 2 implementation-dependent bits + 12 meta bits + 4 x (4 addr mode bits + 8 data bits)

so, here's 8 special registers:

ERR/STATUS (holds variously errors, carries, etc)
GPR 1 (accessible from SHORT)
PC
STACK
CALLSTACK
RESERVED (MODE? FLAGS/STATUS?)
reserved for implementation (hidden stack)
reserved for implementation (TBD)

note: if we have an extensible interpreter tower, the pattern of 'hidden stack' being the current level, and 'STACK' being the level just 'above' you, and the register being shared between all levels, could be repeated. If we use two of the 12 'meta bits' for 'go up', 'go down, 'stay here', where 'go up' means 'run this command in the emulated environment' and 'go down' means 'TRAP to request that this command be executed in privileged ring (the one emulating us)', then that could also be a repeated pattern. The 4th bit combo could mean 'execute at highest available level (infinity; as user code)' (this makes more sense than 'execute at level 0' because who knows how things are being represented at level 0? Also, it would be a security hazard to just do that, so that would actually be a trap request). Of course, in other contexts (eg when representing graphs instead of Oot Assembly) we might use this to indicate 'level 0' instead of 'level infinity'.

---

actually it would be nice to have per-argument meta, because o/w how else to do a MOV between items on the stack on different levels of the tower?

i suppose though that having two levels' stacks accessible provides the same thing. So maybe the instruction-level meta bits control which two stacks these are?

---

TODO: need to define an API for LOADMODULE; where does it put the code? Where does it put the JMP table? Where does it put the constant table? Also, need a plugin for LOADKEXTERN, to go with LOADMODULE. Is this API so complex that it's better just to make each porter implement their own LOADMODULE? Their own getter/setter/timer/jumper/SHORT/LONG? Note that all this stuff must be (eventually) exactly specified, since there can be third-party implementations. If that spec gets too long, then we made it too complicated.

yeah, i'm beginning to think that specifying all this interpreter plugin stuff is not much shorter than specifying the full interpreter (except for custom instructions). The reason is that there are all these interdependencies; LOADMODULE has to put the constant table somewhere where LOADK knows to look for it, and also where LOADKEXTERN can find it; the JMP table, once loaded, must be able to be found by the 'jumper' code; there are offsets within the plugins, and then there are native pointers.

Custom instructions are different, they are just inlined, they are not entangled with all this stuff; they still make sense.

Otoh it would be annoying to read a specification of all that junk, without plugins.

Maybe we can salvage the plugins. Don't have an API for the constant table or the JMP table. Translate to native pointers, and then reuse LOADKEXTERN for an actual constant-table addressing plugin, and use 'jumper' plugin to actually do the JMP.

---

so the obvious format for interpreter plugins:

all little-endian. All offsets (in the table given below; BR offsets are still in units 64-bit 'instruction packs') are in units of 2-byte 'words'. Total must be < 64k BYTES (<32k words) (why bytes and not words? Some implementations might want to translate this code to use byte addressing with 16-bit addresses). No single plugin or custom instruction may be more than 255 instruction packs (64-bit/8-byte groupings) (so, max 2040 bytes/1020 words). All of this should be loaded into a memory subaddress space. There is no JMP table; JMP targets in this code are interpreted as 2-byte offsets within this memory space. There is no constant table; only immediate constants are supported.

2 bytes: version
2 bytes: offset to wrapper code, or 0 if none
2 bytes: offset to getter code, or 0 if none
2 bytes: offset to setter code, or 0 if none
2 bytes: offset to jumper code, or 0 if none
2 bytes: offset to timer code, or 0 if none
2 bytes: offset to SHORT code, or 0 if none
2 bytes: offset to LONG code, or 0 if none
2 bytes: offset to MALLOC code, or 0 if none
2 bytes: offset to MFREE code, or 0 if none
2 bytes: offset to MLEN code, or 0 if none
2 bytes: offset to CAPMALLOC code, or 0 if none
2 bytes: offset to CAPMFREE code, or 0 if none
2 bytes: offset to CAPMLEN code, or 0 if none
2 bytes: offset to CAPUNION code, or 0 if none
2 bytes: offset to CAPONLYADDR code, or 0 if none
2 bytes: offset to CAPONLYPERM code, or 0 if none
2 bytes: offset to CAPGETPERM code, or 0 if none
2 bytes: offset to CAPGETADDRS code, or 0 if none
2 bytes: offset to CAPSEAL code, or 0 if none
2 bytes: offset to CAPUNSEAL code, or 0 if none
2 bytes: offset to CAPJMPSEAL code, or 0 if none
2 bytes: offset to LOADKEXTERN code, or 0 if none
2 bytes: offset to LOADMODULE code, or 0 if none
padding, so that the following code is 64-bit aligned, if needed
code (MEDIUM OotB?)

Each plugin is called with the following on the hidden stack: (a) the return address, and (b) its arguments, and (c) a pointer to a special 256-word (512 byte) global address space for use by plugins (of course, they will MALLOC when more is required); and returns by consuming these and leaving arguments on the hidden stack, and placing a return address at the top of the hidden stack (which need not be the same return address they were passed), then executing a RET instruction. Any RET instruction in here refers to a return to user code. These plugins may JMP between each other's code, or to shared code in their file; absolute JMPs within plugins refer to 2-byte offsets within this memory subspace. They cannot do absolute JMPs to the custom instruction implementations, that's a different memory subspace.

Now here's the format for the custom instruction plugin. Same encoding rules as above, except that custom instruction plugins may NOT do absolute JMPs.

2 bytes: version
2 bytes: custom instruction table length
- 2 bytes: custom instruction #
- 2 bytes: offset to the implementation of this custom instruction
padding, so that the following code is 64-bit aligned
code
code (MEDIUM OotB?)

Perhaps have a requirement that basic plugins + custom instructions together fit in 64k - 2 bytes (the 2 bytes are for an offset to the beginning of the custom instructions)? This allows interpreters to just allocate 64k for this ROM.

---

ok wait, first off, SYSCALLS are often platform primitives, but otherwise they are called like ordinary functions; they don't use the hidden stack.

Second, why have the standard library loaded like a module? Just map it into the special global memory. Standard library calls, also, are called like ordinary functions. Well no, mb the standard library should be loaded like an ordinary module; the implementation can optimize this if it likes by just mapping into memory what the result of loading it would be, but at least in theory it's just an ordinary library except that there are shortcuts to call its functions.

OK so if the stdlib is loaded like a module, that presents a little bit of a problem in terms of checking its signature, because the code to check a Ed25519 signature is in the stdlib.

Well, since the stdlib is already part of the distribution, just trust it and don't check its signature; problem solved.

Well, not quite, b/c we want to reuse the same code to load modules later. How about, this code CAN statically call syslib functions, as long as dynamically, they are not actually called until after syslib is loaded (ie during the loading of syslib). Mb LOADMODULE rejects calls to load module 0 (stdlib) if module 0 is already loaded, but otherwise, module 0 is trusted.

---

how will addresses be represented? as a tuple (subspace, address_within_subspace) or as a single linear address? this should be pluggable. And i guess it is, b/c all you can do with addresses is get them, set them, or jump to them, which are all plugins.

but i guess we need some way to jump to stdlib fns? No, LOADKEXTERN should do it.

but wait, LOADMODULE needs access to MALLOC, so what does MALLOC return then? it must be a wrapper around the true MALLOC.

---

to see what SYSCALLS are needed, consider looking into the C POSIX library (section with links in plChStdLibraries)

---

y'know, we can just make all the plugins syscalls; there's no reason why user code shouldn't be able to call any of them.

---

revised format for plugins:

(moved to ootAssemblyThoughts)

---

regarding SHORT mode:

there's an awesome ISA for a tiny simple chip called the J1. I described it in a section in proj-plbook-plChIsaMisc. Copied from there:

" The J1 does not have:

condition registers or a carry flag
pipelined instruction execution
8-bit memory operations
interrupts or exceptions
relative branches
multiply or divide support ... This description follows the convention that the top of stack is T, the second item on the stack is N, and the top of the return stack is R. J1’s internal state consists of:
a 33 deep x 16-bit data stack
a 32 deep x 16-bit return stack
a 13-bit program counter

There is no other internal state: the CPU has no condition flags, modes or extra registers. Memory is 16-bits wide ... there are five categories of instructions: literal, jump, conditional jump, call, and ALU.

... Instruction encoding ((paraphrased from figure)):

literal: 1 + 15-bit literal
jump: 000 + 13-bit target
conditional jump: 001 + 13-bit target
call: 010 + 13-bit target
ALU: 011 + 12 bit ALU instruction (see below for fields) + 1 unused bit ... ALU operation codes:

T N T+N and or xor ∼ == < rshift −1 R [T] lshift depth u<

((note that in the simplified successor, the J1a, "multi-bit shifts are gone, instead the J1a has single-bit shifts")) ... ALU instructions are composed of an ALU code plus 8 other bits:

field   width  action
T'         4	 ALU op, replaces T, see table II
T -> N     1	 copy T to N
R -> PC    1	 copy R to the PC
T -> R     1 	 copy T to R
dstack +-  2	 signed increment data stack
rstack +-  2	 signed increment return stack
N -> [T]   1 	 RAM write

this lets you make, among other things, the following Forth primtives:

dup over invert + swap nip dropN ; >r r> r@ @ !

(pick and roll are also primitive, but they require more than one instruction)

the primitives ('basewords') listed in: https://github.com/jamesbowman/swapforth/blob/master/j1a/basewords.fs

are:

noop + - xor and or invert = < u< swap dup drop over nip >r r> r@ io@ ! io! 2/ 2* depth exit hack

TODO copy the cool stuff from it to here, such as its 12-bit ALU ops

If the SHORT format is 16-bits instead of 8-bits, and if JMP addressing is done in 16-bit words rather than 64-bit, then can return to right after a CALL, so SHORT mode could use subroutines like normal

If we aren't saying that SHORT mode = primitive Oot, then we can add some more common primitive instructions such as bit shifts to SHORT.

In fact if we have 16-bit SHORTS then we can probably fit all of the usual primitive instructions, and then some.

---

16-bit idea:

4 fields:

opcode: 1 addr mode bit (immediate vs register) + 5 data bits
dest: 1 addr mode bit (register vs stack) + 1 data bits
src1: 2 addr mode bits + 2 data bits
src2: 2 addr mode bits + 2 data bits

eh, that's kinda nasty. 2 of the 4 main regs can't be written to, and the first-class fns can be any reg 1-31, which is useless

how about:

opcode: 5 data bits (and reserve 4 opcodes for first-class fns in each of the 4 GPR regs)
dest: 1 addr mode bit (register vs stack) + 2 data bits
src1: 2 addr mode bits + 2 data bits
src2: 2 addr mode bits + 2 data bits

we have 4 'stack' data choices, so how about: data stack pop/push, call stack pop/push, data stack TOS destructive write, call stack TOS destructive write (or, instead of call stack destructive read/write), could access 2nd item on data stack)

i doubt the call stack will be needed so often though.

---

"In computer engineering and in programming language implementations, a belt machine is a real or emulated computer that uses a first in, first out (FIFO) queue rather than individual machine processor registers to evaluate each sub-expression in the program. ... A belt machine implements temporary storage with a fixed-length FIFO queue, or belt by analogy to a conveyor belt. The operands of the arithmetic logic units (ALUs) and other functional units may be taken from any position on the belt, and the result from the computation is dropped (stored) in the front position of the belt, advancing the belt to make room. As the belt is fixed length, drops in the front are matched by older operands falling off the back; pushed-off operands become inaccessible and must be explicitly saved if still needed for later work. Most operations of the instruction set work only with data on the belt, not on data registers or main memory cells. ... For a typical instruction like add, both argument operands come from explicitly named positions on the belt, and the result is dropped on the front, ready for the next instruction. Operations with multiple results simply drop more values at the belt front. Most belt instructions are encoded as just an operation code (opcode) and two belt positions, with no added fields to specify a result register, memory address, or literal constant. This encoding is easily extended to richer operations with more than two inputs or more than one result. "

---

more on the GA144:

kragen 104 days ago [-]

I think the GreenArrays? F18A cores are similar in transistor count to the 6502, but the instruction set is arguably better, and the logic is asynchronous, leading to lower power consumption and no need for low-skew clock distribution. In 180nm fabrication technology, supposedly, it needs an eighth of a square millimeter (http://www.greenarraychips.com/home/documents/greg/PB003-110...), which makes it almost 4 million square lambdas. If we figure that a transistor is about 30 square lambdas and that wires occupy, say, 75% of the chip, that's about 32000 transistors per core, the majority of which is the RAM and ROM, not the CPU itself; the CPU is probably between 5000 and 10 000 transistors. The 6502 was 4528 transistors: http://www.righto.com/2013/09/intel-x86-documentation-has-mo...

The F18A is a very eccentric design, though: it has 18-bit words (and an 18-bit-wide ALU, compared to the 6502's 8, which is a huge benefit for multiplies in particular), with four five-bit instructions per word. You'll note that this means that there are only 32 possible instructions, which take no operands; that is correct. Also you'll note that two bits are missing; only 8 of the 32 instructions are possible in the last instruction slot in a word.

Depending on how you interpret things, the F18(A) has 20 18-bit registers, arranged as two 8-register cyclic stacks, plus two operand registers which form the top of one of the stacks, a loop register which forms the top of the other, and a read-write register that can be used for memory addressing. (I'm not counting the program counter, write-only B register, etc.)

Each of the 144 F18A cores on the GA144 chip has its own tiny RAM of 64 18-bit words. That, plus its 64-word ROM, holds up to 512 instructions, which isn't big enough to compile a decent-sized C program into; nearly anything you do on it will involve distributing your program across several cores. This means that no existing software or hardware development toolchain can easily be retargeted to it. You can program the 6502 in C, although the performance of the results will often make you sad; you can't really program the GA144 in C, or VHDL, or Verilog.

The GreenArrays? team was even smaller than the lean 6502 team. Chuck Moore did pretty much the entire hardware design by himself while he was living in a cabin in the woods, heated by wood he chopped himself, using a CAD system he wrote himself, on an operating system he wrote himself, in a programming language he wrote himself. An awesome feat.

I don't think anybody else in the world is trying to do a practical CPU design that's under 100 000 transistors at this point. DRAM was fast enough to keep up with the 6502, but it isn't fast enough to keep up with modern CPUs, so you need SRAM to hold your working set, at least as cache. That means you need on the order of 10 000 transistors of RAM associated with each CPU core, and probably considerably more if you aren't going to suffer the apparent inconveniences of the F18A's programming model. (Even the "cacheless" Tera MTA had 128 sets of 32 64-bit registers, which works out to 262144 bits of registers, over two orders of magnitude more than the 1152 bits of RAM per F18A core.)

So, if you devote nearly all your transistors to SRAM because you want to be able to recompile existing C code for your CPU, but your CPU is well under 100k transistors like the F18A or the 6502, you're going to end up with an unbalanced design. You're going to wish you'd spent some of those SRAM transistors on multipliers, more registers, wider registers, maybe some pipelining, branch prediction, that kind of thing.

---

so the belt machine makes a lot of sense for temporaries, actually. It's SSA -- the processor benefits because it can assume that no one mutates these guys but it.

so, mb:

registers for locals
stack for expressions and calling (trees/depth)
belt for temporaries

so instead of 4 GPRs should we have a belt?

nah, i get the impression that the main benefit of a belt to the implementation is that b/c it's SSA you can pipeline without being afraid that instructions will update a register. I guess without this they have to store both versions of the register (before update and after update), and so with the belt, the memory used to store these multiple versions can instead be devoted to giving the user more registers (well, more belt locations). With only 4 GPRs this probably wouldn't free up that much space for us. Anyhow a compiler could compile something like OotB? to a belt assembly language.

---

yes, pascal has pointers:

typed pointers (cite http://www.swissdelphicenter.ch/en/niklauswirth.php)
- see http://borlpasc.narod.ru/english/faqs/pointers.htm for examples of pointer use in Pascal

---

http://www.eighty-twenty.org/2012/11/27/arm-tail-calling-convention

A calling convention for ARM that supports proper tail-calls efficiently

Because proper tail calls are necessary for object-oriented languages, we can’t quite use the standard calling conventions unmodified when compiling OO languages efficiently to ARM architectures.

Here’s one approach to a non-standard, efficient, tail-call-supporting calling convention that I’ve been exploring recently.

The big change from the standard is that we do not move the stack pointer down over outbound arguments when we make a call.

Instead, the callee moves the stack pointer as they see fit. The reason for this is so that the callee can tail-call someone else without having to do any hairy adjusting of the frame, and so that the original caller doesn’t have to know anything about what’s left to clean up when they receive control: all the clean-up has already been completed.

This bears stating again: just after return from a subroutine, all clean-up has already been completed.

In the official standard, the stack space used to communicate arguments to a callee is owned by the caller. In this modified convention, that space is owned by the callee as soon as control is transferred.

Other aspects of the convention are similar to the AAPCS standard:

    keep the stack Full Descending, just like the standard.
    ensure it is 8-byte aligned at all times, just like (a slight restriction of) the standard.
    make outbound arguments leftmost-low in memory, that is, “pushed from right to left”. This makes the convention compatible with naive C struct overlaying of memory.
    furthermore, ensure argument 0 in memory is also 8-byte aligned.

Details of the stack layout

Consider compiling a single subroutine, either a leaf or a non-leaf routine. We need to allocate stack space to incoming arguments, to saved temporaries, to outbound arguments, and to padding so we maintain the correct stack alignment. Let

    Ni = inward-arg-count, the number of arguments the routine expects
    No = most-tail-args, the largest number of outbound tail-call arguments the routine produces
    Nt = inward-temp-count, the number of temps the routine requires
    Na = outward-arg-count, the number of arguments supplied in a particular call the routine makes to some other routine

Upon entry to the routine, where Ni=5, No=7, Nt=3, Na=3, we have the following stack layout. Recall that stacks are full-descending.

(low) (high)

outbound		temps		shuffle	inbound
0	1	2	---	0	1	2	---	-	-	0	1	2	3	4	---

                    ^                                               ^
                  sp for non-leaf                                sp for leaf

I’ve marked two interesting locations in the stack: the position of the stack pointer for leaf routines, and the position of the stack pointer for non-leaf routines, which need some space of their own to store their internal state at times when they delegate to another routine. Leaf routines simply leave the stack pointer in place as they start execution; non-leaf routines adjust the stack pointer themselves as control arrives from their caller.

Note that the first four arguments are transferred in registers, but that stack slots still need to be reserved for them. Note also the padding after the outbound arguments, the temps, and the inbound/shuffle-space.

The shuffle-space is used to move values around during preparation for a tail call whenever the routine needs to supply more arguments to the tail-called routine than it received in turn from its caller.

The extra shuffle slots are only required if there’s no room in the inbound slots plus padding. For example, if Ni=5 and No=6, then since we expect the inbound arguments to have one slot of padding, that slot can be used as shuffle space. Addressing calculations

Leaf procedures do not move the stack pointer on entry. Nonleaf procedures do move the stack pointer on entry. This means we have different addressing calculations depending on whether we’re a leaf or nonleaf procedure.

    Pad8(x) = x rounded up to the nearest multiple of 8.
    sp_delta = Pad8(No * 4) + Pad8(Nt * 4), the distance SP might move on entry and exit.

Leaf procedures, where the stack pointer does not move on entry to the routine:

inward(n) = rn, if n < 4

temp(n) = sp - sp_delta + (n * 4) outward(n) (tail calls only) = rn, if n < 4

sp - Pad8(Ni * 4) + (n * 4)

sp - Pad8(Na * 4) + (n * 4)

Nonleaf procedures, where the stack pointer moves down by sp_delta bytes on entry to the routine:

inward(n) = rn, if n < 4

temp(n) = sp + (n * 4) outward(n) (non-tail calls) = rn, if n < 4 outward(n) (tail calls) = rn, if n < 4

sp + sp_delta - Pad8(Ni * 4) + (n * 4)

sp - Pad8(Na * 4) + (n * 4)

sp + sp_delta - Pad8(Na * 4) + (n * 4)

Variations

This convention doesn’t easily support varargs. One option would be to sacrifice simple C struct overlaying of the inbound argument stack area, flipping arguments so they are pushed from left to right instead of from right to left. That way, the first argument is always at a known location.

Another option would be to use an argument count at the end of the argument list in the varargs case. This requires both the caller and callee to be aware that a varargs-specific convention is being used.

Of course, varargs may not even be required: instead, a vector could be passed in as a normal argument. Whether this makes sense or not depends on the language being compiled. tonyg posted at: 12:06 EST

tags: [tech]

permalink (2 comments)

...

((note that the caller must do any cleanup required before the tail call:))

⚓ Tony Garnock-Jones 10:23, 28 Nov 2012 (in reply to this comment)

Stack allocation requires you to wait around for the called subroutine in order to then release the allocated space, so it isn't really a tail call. Imagine modelling stack allocation in Scheme:

(let ((v (stack-alloc!))) (let ((result (do-something-with v))) (stack-release! v) result))

The call to do-something-with can't be in tail position, because there's a pending storage-reclamation-action in its continuation.

)

---

_asummers 4 days ago [-]

Const does not mean immutability, only immutable references to the outermost pointer. It is equivalent to final in Java. While that solves the issue with numbers changing state, it does not help objects e.g. For that you need something like immutable.js from Facebook.

---

there is a tension between using Oot Assembly as a bytecode to be executed somewhat efficiently, vs using Oot Assembly as an interchange language capturing high-level concepts without low-level details

---

one example of the bytecode/HLL-interchange tension is polymorphic instructions. For a HLL interchange language, we want a single polymorphic ADD; for a bytecode with efficient dispatch, we want specialized fns ADDUINT32, ADDDOUBLE, etc. This one is relatively easy to resolve; just offer both opcodes (have a polymorphic ADD opcode, but also a separate ADDUINT32 opcode, etc).

but what about things that are even more language-specific; can we have NORMALIZE and REDUCE operands? SOLVE for equations and constraint systems, OPTIMIZE, PROVE and VERIFY-PROOF? PARSE? can we even have MATCH (string match? regex? graph regex? ADT 'case')?

my feeling is that we should have them.

i am wondering if we should use those 'meta' bits to provide a 'context' for these instructions.

"Nothing that transcompiles into JavaScript? can fix JavaScript?'s lack of a native integer. "

---

"Much like immutability: if it had been the default, with a "DANGER: mutation ahead" keyword required otherwise, that would arguably have been good. But it's too late, now. (Java) Beans, beans, the magical fruit... "

---

https://github.com/Microsoft/nodejs-guidelines/blob/master/windows-environment.md#max_path-explanation-and-workarounds

" MAX_PATH explanation and workarounds

For the uninitiated, MAX_PATH is a limitation with many Windows tools and APIs that sets the maximum path character length to 260 characters. There are some workarounds involving UNC paths, but unfortunately not all APIs support it, and that's not the default. This can be problematic when working with Node modules because dependencies are often installed in a nested manner. "

---

a problem with optimizing implementations in the potential presence of metaprogramming:

Consider the 'while' control structure. Optimizers (and transpilers) will need to make assumptions about how 'while' relates to control flow (to CFGs, etc). But if 'while' is just an ordinary function that takes a block, and especially if this function can itself be tweaked or overridden when it appears in a metaprogrammed context, then how can an compiler know when these assumptions are valid?

So, we need to make it easy to (a) annotate when stdlib functions such as WHILE are present in their original form (not overridden or tweaked by metaprogramming), and (b) annotate contexts in which normal control flow has not been overridden by metaprogramming..

---

i like Lua's "specified endianness on disk, host endianness in memory"

---

todo: check sootb against webassembly again to see if there are any any unneeded instructions in SootB? that we should get rid of

---

interestingly, we DO care at least a little bit about performance for OotB? (more than Oot)

---

if we had an extra meta bit on operands, it could also be used for modality in some subformats, eg 'forall' vs 'exists', 'necessary' vs 'sufficient', 'requires' vs 'ensures'

---

ga144's capability of executing code streamed to it from a port is really cool

---

You can't add pointers. But you CAN concatenate paths, eg ".x.3" can be concatenated with ".y.2" to get ".x.3.y.2"

Similarly, you could add a base pointer P (which is like a path from 0) to the path "add 2, dereference, subtract 3, that's your effective address", to yield "*(P+2) - 3"

---

so should this sort of 'pointer path' be a primitive data type for us?

---

i added a PLATFORM syscall for platform-specific syscalls. This could be used for eg access to the DOM on webbrowser implementations of Oot.

i also added IMPLEMENTATION for implementation-specific syscalls. The difference from PLATFORM is that PLATFORM syscalls are officially defined for each platform, but implementation-specific syscalls are not standardized. This sort of thing is discouraged but we should support it, for eg embedded systems.

---

should we allow redefinition of custom instruction 'further up the tower'?

---

implementation would be simpler if the cap register only contained a single capability (for a single address subspace) at a time. But this would make the programs much more annoying, because they'd be constantly copying different capabilities from cap storage into the cap register in order to use multiple address subspaces over the course of the program.

---

i guess we want 'stack addressing' to also be polymorphic in the sense that we want the language to be able to create custom 'stack' data structures and then to be able to access them via 'stack addressing'. So how does that work? I guess we just need an LEA hook (or rather, mb one LEA hook for each addr mode, or for certain 'custom' addr modes).

---

bcpl had 'ocode' and "The global vector also made it very simple to replace or augment standard library routines. A program could save the pointer from the global vector to the original routine and replace it with a pointer to an alternative version. The alternative might call the original as part of its processing. This could be used as a quick ad-hoc debugging aid."

i guess that's similar to our 'custom instructions'

---

so it would be nice to free up an addr mode b/c SHORT mode can't reach more than 4 (or mb 2!) addr modes.

we could combine 'immediate' and 'constant table' addr modes by sacrificing half of the constant addr space to immediate (eg there would be one addr mode for both constants and immediates, and the first half of the values mean 'immediate', and the second half mean 'constant'; i think Lua does something similar).

but if we're going to have 8-bit operands, if we have immediates at all, then we should be able to represent every 8-bit immediate.

Or, we could mmap the constants (eg put a pointer to an address subspace holding the constants into a well-known register).

but having a LOADK in short mode is probably good enough.

---

so any good ideas on what the 'meta' operand (or 12-bits) should be used for?

i guess my best guess is that it can be used to select an 'interpreter' which is used to interpret this instruciton. An 'interpreter' can be a constant, or it can be held in memory. An 'interpreter' can redefine all the 'hooks' (hook syscalls), and/or it can hold a different set of registers (and even share some memory addresses with other interpreters, yet have some adjacent addresses be unshared; i guess the hooks are sufficient to implement that? yes, as long as it can't link register in that fashion, because registers don't use the get and set hooks; or i guess by hooking vmexec_instr it could even do that with the registers, it could do whatever it wants), and/or it can have the effect of executing things normally, in the typical memory context, but as if various 'mode bits' that i've previously contemplated were set (eg one interpreter fetches inputs 1 and 2 atomically, another one does the whole instruction atomically, etc), and they can have different sets of custom instructions (or even override the typical instructions, again via vmexec_instr).

---

what is the format of LABEL annotations?

If we only have two operands of 8 bits to work with, we probably need all of them for label IDs.

what about within custom instructions though? There we won't need more than 256 labels per instruction.

could reserve one of the operands for the implementation, which can then use it to store INLINELEVEL as it is doing inlining. So, during inlining, the label operands would be (INLINELEVEL, LABELID).

could even accomplish SYSCALL without a syscall opcode, just by combining local labels with stdlib labels (eg some of the labels refer to stdlib functions, so just JMP to them). We don't really need that though.

if we had 12 bit operands then could have: 5 bit inline level, 5 bit label, 1 bit stdlib or not, 1 bit global/local. But we don't need GLOBAL labels in custom instructions, and we need more than 32 local labels in user code.

---