Bayle Shanks's website: proj-oot-ootAssemblyNotes13

panic 626 days ago [-]

it occurred to me that (some) Forth systems are awfully close to being macro assemblers, and that really the only reason you’d use a higher-level language rather than a macro assembler is that error handling in macro assemblers, and in Forth, is terrible verging on nonexistent.

Another reason to use a high-level language is that macro assembly code and Forth programs aren't very portable across machines. I guess you could model a virtual machine in the same way the authors model x86, target the virtual machine instead, then have a verified translation from VM operations to x86 instructions, ARM instructions, etc. I wonder if you could get similar performance to a native compiler with this approach.

kragen 626 days ago [-]

Forth programs are fairly portable across machines — I would guess at least as portable as C — although often not across Forth implementations. That's why Open Firmware drivers are written in Forth.

Macro assembly code can be portable across machines; that's how e.g. SNOBOL4 was ported to 50 different machines starting in the late 1960s, including even the IBM PC; see http://www.snobol4.org/history.html and https://news.ycombinator.com/item?id=3108946 (a thread here that I posted in, before HN was apparently taken over by drooling cock-obsessed chimpanzees who can't take the time to read EVEN THE ABSTRACT of linked papers, by which I do not mean you, panic) for more details.

In effect, the coqasm approach moves portability, just like safety, from the language into a library. Normal macro assemblers (and, for that matter, C) make it difficult to tell if your application code has avoided processor dependencies. It seems like Coq might actually make it possible to prove that you've done that.

---

i guess you can think of both C and Forth as 'close to the metal'; 'portable macro assemblers' plus a little bit more.

Oot Assembly is supposed to be kind of on that level (although not quite; it isn't CTM for performance motivations, just for the sake of portability and interoperability). It isn't exactly Forth, because it has variables, not just a stack.

And it also has some functional stuff, such as the first-class functions.

---

actually, a variant of 'inline expansion' is a common bug/attack vector:

https://en.m.wikipedia.org/wiki/Billion_laughs

---

another name for something was 'moot' (minimal oot)

---

could combine the fifth 8-bit field (the 'meta' field) to any of the other four in order to make a single 16-bit operand field, when the 'meta' field's addr mode is 0-3 (specifying which other field to combine it with).

This would allow eg 16 bits to call imported subroutines, 16-bit constants without loadk, etc.

kinda ugly though.

since i can't think of a way to use meta that would use it a lot, maybe this is simpler and better:

2 form bits + 2 implementation-dependent bits + 4 x (4 bit addr mode + 11 bit operand)

---

otoh, when i say i want this to be usable in intermediate representations, who am i kidding, in reality everyone is going to unpack the MEDIUMs into:

1 byte: the 2 implementation-dependent bits 4 x (1 byte addr mode + 2 byte operand) 3 bytes padding (to 64-bit align, so 128-bits at the end)

maybe should just give up on 64-bits and have that for MEDIUM, or at least make it easier for that to be in LONG.

SHORT could be 32 bits (or even 16, as now) and could be the only encoded one.

---

an obvious (but wrong) SHORT 16-bit instruction encoding would be: 4 x (2 addr mode bits + 2 operand bits)

note that the first addr mode would be constrained in the first instruction of each 64-bit grouping

the problem is that this is not enough operand bits for the opcode operand. Could ditch the addr mode bits for the opcode:

4 opcode bits + 3 x (2 addr mode bits + 2 operand bits)

---

so mb:

256 regs

256 imports per module. Each import can import 256 symbols. Each module can have 32k symbols internally and can export up to 32k of these symbols, but in groups of 256.

Code is addressed by 16-bits, but 128-bit (16-byte) aligned (yuck! but how else would we fit 4 x 16-bit operands?).

SHORT: 16-bit (128-bit aligned)

4 opcode bits + 3 x (2 addr mode bits + 2 operand bits) (every 128 bits, the first opcode bit must be "0")

MEDIUM: 128-bit 16 bits: header (format bits + ?) 16 bits: implementation-dependent 16 bits: addr modes (4 x 4 bits) 16 bits: reserved 16 bits: operand 0 16 bits: operand 1 16 bits: operand 2 16 bits: operand 3

LONG: variable length. 128-bit aligned.

can CALL functions with at most 2 input arguments and 1 return argument via the 'first class function' feature where the function is indicated by operand 0. This module's function table is accessible via immediate address mode with values >=32k. Imported module's functions are accessed via constant table address mode (16-bit operand0 in constant table addr mode = 1 byte to specify import #, and one byte to specify fn within that import).

hmmm i still like 12-bit land though. but 64 imports of 64 fns each? that sounds like way too little..

---

mb 'parens' are not just annotations, but the only representation of expressions (by which i mean, no explicit PUSHs and POPs needed, at least not for that); they would translate to the appropriate PUSHs and POPs on the data stack

---

mb try to make an even more minimal set of 8 SootB? instructions, since SHORT has a limitation on which instructions can be used in the first instruction on a 128-bit boundary

---

maybe have a generic way for a routine to demand that some (potentially custom) permission is set in the CAP register (either ambient, or on a specific addr space)

---

should check upon module load that all the indices from one table to another in the module metadata are valid

---

i said above that mb we should have parens not as annotations but actually implying the pushes and pops, which would not be present.

One thing to think about here is that although an ordinary, eagerly-evaluated expression is computed 'bottom-up' by transforming the expression to RPN and using pushes/pops, a lazily-evaluated expression is evaluated top down.

so, as interop, the parens don't really MEAN using pushes and pops in a certain way. Although when they appear in based OotB?, they could mean that.

---

y'know, if we have modules signed using public key encryption, then maybe we should understand the concept of 'delegated' keys, eg when doing a module import (or on the cmdline) you can specify that you accept signatures by such-and-such a public key; then that public key can also delegate its power to some other public key; now if the other public key signs the module, and if your system is aware of the signed delegation, now it accepts the module.

---

should we have a separate 'string' section for strings, and a separate 'blob' section for binary blobs, in addition to the ordinary constant table (which would be for structured data of known size and type)?

---

co/contra/in-variance on generic type variables in module type table? <: type bounds?

---

u should be able to export a symbol that was imported from another module

---

MEDIUM:

nah, i just can't stand the large size of 128-bit instructions, and the large size of 64k tables.

so go back to:

2 form bits, 2 implementation-dependent bits, 4 x (4 addr mode bits + 11 operand-data bits)

2k limits on # of module imports, # of fns in module, # of fns exported

use CALL for inter-module CALLs: one input is the imported module being CALL'd into, the second argument identifies the fn in that module. Possibly use LONG for this, so as to clearly mark what is being passed in/out.

use 1st-class CALL for intra-module CALLs (eg you just put the fn # in the opcode field, with addr mode constant table; this implicitly refers to only fns in the current module). Or possibly use this to reference the stdlib. Or possibly make the limits 1k instead of 2k, use 1k for this module and 1k for the stdlib. But i think the inter-module calling convention is fine for the stdlib. We already have the stdlib custom instructions which can be highly optimized (just with no cross-instruction recursion).

this is fine b/c we don't care about optimizing inter-module things as much.

the file format allows 16-bit quantities for things, and tables up to 64k length; but when using it for OotB? only 2k can be used

2k has the nice bonus that 2k * 16-bits = 4k bytes which is one page size on many systems

---

We can make the 64k-limited version a variant of LONG, splitting LONG into 2 (fixed and variable)

---

i should point out that one reason i'm attracted to these 2k limits is that, as per the dream of Oot supporting 'brain-like' programming, whatever that means, or at least massively parallel programming, i envision a very large number (>=~64k) of MIMD processors, each one relatively slow and with a relatively small amount of memory (~64k), each one running some OotB? code (and the Oot runtime), with OotB? code being dynamically pushed around this network of processors at runtime.

So i would like some pressure on the interpreter runtime and the stdlib (not necessarily the compiler) to fit in these small spaces. This is also why i am hung up on 16-bit words and 16-bit integers instead of 64-bit; i imagine that in such a machine, if each processor only has 64k of memory to address, there's no need for more than 16-bit words, and there are heat dissipation and power and code density (and data density, due to pointer size) savings if the processors were actually 16-bit, and these things would in fact be important because (a) we need ~64k or more processors, so i bet heat and power will probably be an issue, at least if we want the thing to fit on a consumer's desk (or pocket!) and connect to ordinary household outlets and run decently fast, and (b) memory in each processor is at a premium, so 16-bit pointers and words would actually save space. However, i don't actually have experience with embedded stuff, and i don't want to focus TOO much on efficiency concerns especially at this stage, esp. since afaict there aren't any cheap machines of this form being produced right now; so the desire for the runtime to fit there is more theoretical than practical.

Note that i am imagining that each processor has ~64k; maybe less than that (8k?), but things like the GA144 cores, where each processor has only 64 memory locations, are smaller than what i am targeting here. Of course, it wouldn't be a problem if a huge array of physical processors each like a GA144 core were configured into a smaller number of virtual processors meeting my specs.

Note: if i want 'brain-like', why am i giving each processor so MUCH memory? Shouldn't each processor be like a neuron? Well, perhaps if all you wanted to do was actually simulate a neural network, and you had a good idea of what sort of neuron models you would want to simulate; but we want Oot to be more general purpose than this. So we want each processor node to support an interpreter runtime for a general purpose programming language, which could then be used to run simulations of neurons, or to do other stuff.

Note: by the arguments above, the processors might end up having 16-bit pointers (because 8-bit pointers is too little memory, and 16-bit is the smallest amount that would be sufficient). By a similar argument, perhaps the processors would be grouped into clusters of at most 64k processors (so that you could have 16-bit routing addresses within each cluster).

---

ok in the course of writing the previous section i realize:

the file format can clearly SUPPORT up to 64k-length tables, but b/c of our small-interpreter-runtime aspirations we should probably actually forbid >2k for some or all of them.
i'll probably skip a fixed-length variant of LONG. This variant wouldn't be usable in our 'brain-like' target architecture, and compilers etc can have that in-memory representation anyways even if we don't provide it as an encoding.

---

todo consider the following ABI/calling convention:

At the time of a call (and at the time of a return), the stack pointer points to the space reserved for the stackwise highest return value; note that this means that the arguments are actually ABOVE the "TOS" if "TOS" is defined as the location to which the stack pointer is pointing.
- This would enable https://eighty-twenty.org/2012/11/27/arm-tail-calling-convention . However, since/if we have a special 'stack addressing mode', this may not actually be efficient because (a) either the caller is going to have to either load the arguments onto the stack manually/without use of the stack addressing mode, or we have to change stack addressing mode to accept negative offsets (b) implementations cannot use some platform-dependent stack to implement the stack, because we are demanding that things below the stack pointer are actually still persisting in memory.

because of (b), i think we won't do this for now.

---

now that SHORT includes 4 GPRs, i guess the first one should be ERR/CARRY, the second should be STACK, and the other 2 should be just GPRs; or maybe also a CALLSTACK register (remember, Forth has two stacks and this works well), and then just 1 GPR. Yeah, i guess the latter. But wait, what about the 'hidden stack' for custom instruction implementation? Maybe load the pointer to that into the 4th reg when calling a custom instruction, and push the previous ('actual') contents of the 4th reg onto the top of that stack.

---

so if MEDIUM has only 256 registers, but 11-bit operands, then in 'register' and 'register indirect' addressing mode, we have 3 extra bits.

Three ideas:

(a) go back to 2k registers

(b) in 'register indirect' addressing mode, have the extra 3 bits be an offset

(c) just let there be some extra bits here; maybe the implementation will use it to denote hidden registers, or for some other annotation

i'm leaning towards (c). (a) would be nice for the # of locals, but it's also nice not to make that demand on implementations. Also, some implementations might enjoy having a power-of-two number of bits to address a register. (b) adds one more unnecessary complexity to the design.

---

should we just get rid of the 'hidden stack' and just use the data stack for that?

wait, i forgot, we do need a hidden stack so that indirectly called custom instrutions can access the data stack

---

should we forbid 'register mode' access to a stack pointer?

argument for:

this allows the stack to be implemented by a native primitive on the target platform, even if the target platform's native stack primitive isn't memory-mapped

argument against:

this makes it more annoying to implement saving parts of the stack on the heap, and memory-like access to items deep in the stack.

However, the stuff in 'argument against' can be done, just not as efficiently, and as long as we encapsulate that stuff in custom instructions, implementations with memory-mapped stacks can substitute more efficient algorithms.

so i guess yes, make it illegal.

---

but since register mode is illegal on stack pointers, could use register mode in SHORT to access registers 4-6. This would allow us to access some register even though the first four registers are taken by ERR, the stacks, and the hidden stack.

---

should distinguish 'ignorable' and 'semantics-affecting' annotations

---

do we really allow up to 2k custom instructions, or do we hold it at 256 to make it easier for implementations?

---

hmm.. if we only need 4 addr modes, not 8, then can get rid of 1 addr mode bit and have 'constant table' be the 'boxed' form of immediate constant read!

now we have 12 operand bits instead of 11, and we have 4k limits instead of 2k

---

we can get rid of module 'jump tables' if we limit module code size to 64k words

---

the issue with whether to represent Oot Assembly expressions with POPs and PUSHs or with parens, and in the latter case, what happens in a context when they should be interpreted lazily eg top-down rather than bottom-up, suggests that that sort of issue should be bumped up to Oot Core, that Oot Core should be the interop language, not OotB? (even though OotB? can be a representation format for Oot Core), that we should use PUSHEs and POPs and leave expressions to Oot Core, and also provides some insight into the sort of way that Oot Core is going to be focused on defining the interpretation of Oot Core.

---

but mb even in Oot Core representation we should use PUSH and POP in expressions, instead of a 'subaddress mode'. During top-down lazy evaluation, these could just be processed differently.

---

todo: with register direct mode on stack pointers illegal, we currently have no way to directly address items deep in the stack

---

tangentially, 3 defns of 'high level assembly language', given in http://www.plantation-productions.com/Webster/HighLevelAsm/HLADoc/HLARef/HLARef_pdf/01_Overview.pdf :

" David Salomon in his 1992 text "Assemblers and Loaders" (Ellis Horwood, ISBN 0-13-052564- (alternate, search)2) ...offers the following definitions for a High Level Assembler (or HLA): A high-level assembler language (HLA) is a programming language where each instruction is translated into a few machine instructions. The translator is somewhat more complex than an assembler, but much simpler than a compiler. Such a language should not have features like the if , for , and case control structures, complex arithmetic, logical expressions, and multi-dimensional arrays. It should consist of simple instructions, closely resembling traditional assembler instructions, and of a few simple data types. "

" A high-level assembler language (HLA) is a language that combines most of the features of higher-level languages (easy to use control structures, variables, scope, data types, block structure) with one important feature of assembler languages namely, machine dependence. "

" A "high level assembly language" (HLAL) is a language that provides a set of statements or instructions that practically map one-to-one to machine instructions of the underlying architecture. The HLAL exposes the underlying machine architecture including access to machine registers, flags, memory, I/O, and addressing modes. Any operation that is possible with a traditional assembler should be possible within the HLAL. In addition to providing access to the underlying architecture, the HLAL must provide some abstractions that are not normally found in traditional assemblers and that are typically found in traditional high level languages; this could include structured control statements (e.g., if, for, and while ), high level data types and data structuring facilities, extensive compile-time language facilities, run-time expression evaluation, and standard library support. A "High Level Assembler" is a translator that converts a high level assembly language to machine code. "

Oot Assembly doesn't meet these criteria because we don't expose the underlying architecture. The reason we are 'assembly-like' is just because we want to be easy to implement. I include these quotes, rather, because they list some things that Oot Core, at least, and mb Oot Assembly, will want to support.

" MASM and TASM are probably best considered medium-to-high-level HLALs since they provide high level data structuring facilities, structured control statements, high level procedure definitions and invocations, a limited block structure, powerful compile-time language (macro) facilities, standard library support (e.g., the UCR Standard Library and many other available library modules), and other high level language features. "

" syntax but otherwise maps statements fairly closely to the assembly counterparts. It does provide some higher-level data structuring capabilities, though this is inherited from the underlying assembler(s) on which Terse is based. PL/360 and PL516 are definitely high-level HLALs because they fully support simplified arithmetic expressions, control structures, high-level data types, and other features. These languages provide access to the underlying architecture, but the emphasis is to use these languages as a high level language and drop down to the machine instructions only as necessary. "

---

i are thinking about having various 'classes' of address spaces, but how many will we have to have? Possibly lots; because address spaces for structs count in terms of items in the struct, not by words, therefore every struct is sort of a different 'address space class'. Of course, we could implement structs via one parameterized class.

---

Haskell GHC Cmm doesn't support C--'s runtime API. This is salient b/c the C-- website said "The run-time interface is the most novel and most distinguishing feature of C--. ".

I dunno if that means the C-- runtime API is a bad idea, it could just be that since Cmm is only being used by a single language (Haskell) with a complex runtime, it just wasn't worth it to have a language-independent middle layer there.

---

ok i've been wondering about how the semantics of stack mode addressing, expressed as pre and post decrement and increment, play well with custom instructions. Here's some ways i came up with before but don't like:

whether decrement or increment is applied goes solely on the argument position (eg dest, input2, input1); dest is always PUSH (decrement) and the inputs are always POP. The problem here is that it doesn't admit custom instructions that use all three operands as both input/output, as may be needed for eg reversible computation
effective addresses are passed, and 'read' and 'write' are hooked, and whether decrement or increment is applied depends on whether reads or writes actually occur within the custom instruction. The problem is that, first, this is inefficient, but worse, the semantics are complex, because what if the custom instruction reads or writes to this thing more than once?
all three values are passed as inputs, then all three are pushed as outputs, and updated as needed. The instruction also pushes a specifier that says which of the 3 should be incremented and decremented, if we are in that mode. But this isn't much better than signatures.

i think we gotta bite the bullet and just have signatures for custom instructions. The signatures determine how many inputs will be pushed onto the hidden stack, and how many outputs will be popped from it. Valid signatures include:

i (eg JMP)
o i
o i i
io io io

o is output, i is input, io is both, 'a' is effective address. The values corresponding to 'i's will be pushed onto the hidden stack before the call as inputs (and will be POPed from the target stack if in stack addr mode), the values corresponding to 'o's will be popped from the hidden stack afterwards as outputs (and will be PUSHd to the target stack if in stack addr mode), and ios will be both pushed onto the hidden stack before the call and popped from it afterwards, and simply replaced on the target stack.

None of these signatures give the custom instruction the effective address, only the value which is found there.

---

so how would we actually do dynamic types and generics?

actually, a quint with a 'meta' arg could be really useful here. The 'context' in the 'meta' arg could include a concrete assignment to all type variables. So eg if you had a generic RMS routine that expected a LIST-OF-<T>-typed input and used generic ADD and MULT instructions, and you called it with a LIST-OF-DOUBLES-typed argument (so T=DOUBLE), then when you call a generic ADD, the 'meta' field of that instruction could reference a Context struct that includes 'T=DOUBLE' (perhaps the Context struct is sitting in a register and the 'meta' field is in register direct mode, referencing that register).

This is good because now a custom instruction implementation of ADD can pass along that Context object without much extra ceremony. This also means that you can 'pass in the type arguments' to any generic instruction while still having 3 operands available for ordinary use.

Now, one potential issue with this is that you still probably wouldn't need to switch the value of the meta field much within an instruction, right? You'd just pass the same Context argument into every subfunction that you call. In which case, we could save instruction encoding space by just having a dedicated Context register.

But in fact, perhaps if there were multiple type arguments then it would be the generic custom instruction implementation's job to break them out of the Context struct and sent the appropriate type to each instruction called from the implementation. Perhaps some of these called instructions would require multiple type arguments, necessitating passing a smaller Context struct, and others would only need a single type, in which case the type could be passed directly. So in this case, various instructions within the custom implementation would in fact be passing different things in their 'meta' field.

Another issue is that if primitive types are 8 bits (b/c 8-bit immediate meta field), that's not really enough bits to specify the cross product of dest_type x input2_type x input1_type, unless there were only 4 primitive types. But that's fine, we can use pointers to structs for specifying multiple types, and 8-bits is enough for a single primitive type.

---

i think that's a pretty powerful argument in favor of the 'meta' field, actually. So i removed the following from ootAssemblyThoughts:

2 form bits, 2 implementation-dependent bits, 4 x (3 addr mode bits + 12 operand-data bits)

and went back to:

2 format bits + 2 implementation-dependent bits + 5 x (4 addr bits + 8 data bits)

---

so now we're back to limits of 256. We can only have 256 locals and 256 imports, but more critically, each module can only have 256 exports (that can be called in one step, at least, and maybe total).

---

y'know, UADD15 may make things slightly easier on some platforms but since it's uncommon, it's also just a weird thing i thought up that'll be one more thing for ppl to understand

so i'm gonna get rid of it and just start with addu16 instead

wait... nevermind, the JVM doesn't have unsigned integers, but it does have 16-bit signed integers.. so stick with U15

---

so how are we going to represent eg 'while' loops, if-then-else, etc?

well we don't really need to at all, we can just compile while loops to branches. But we'd like to, to make it easier for a transpiler to translate OotB? into some arbitrary HLL in a semi-readable way.

the obvious thing to do is just to have annotations in the code; eg a 'while' annotation at the beginning of the while loop which contains the offset to the end of the while loop.

but.. what if we want to have the offset to the end of the condition, too? In that case, since we only have 3 operand fields (assuming we don't repurpose the meta/generic type field for this), we don't have enough fields for 2 16-bit offsets. And if we don't have a separate opcode for while, but it's really just an annotation, then the 'dest' field is burnt specifying the annotation type, so we only have 2 one-byte fields.

In addition, even 16-bit offsets don't go so far when you recall that custom instruction inlining can make offsets expand by up to 256. So even a 16-bit offset can only be guaranteed to cover 256 instructions (we'd have to require user code to only use 1 of the two bytes for such offsets, and the compiler then has room to increase it during inlining). Surely we'll have some longer while loops sometimes.

So, either a while block takes multiple annotations to define, or we use the LONG format to keep it in one long instruction, as i'm proposing to do with CALLs that need more than 2 input arguments.

I think we should take the LONG route.

---

note: if we combine all three operands for RJMP, we get 24 bits, or 32 bits if we get rid of two of the three redundant addr modes; which means 24 bits actually since we must reserve 8 bits for custom instruction inlining expansion. Which isn't so shabby.

---

ok, regarding the LONG encoding.

What's the most basic/obvious/easiest-to-implement way to do this? Probably to have a length header that says how long the entire LONG instruction is. Just as i originally had the idea that the whole language would have variable-length everything and be bitwidth agnostic, but then went in the other direction for the sake of simplicity, here we can save implementors a lot of trouble if we accept a fixed upper length.

This length should probably be either 256 or 65536 (16-bit words, i guess). 65536 would approach true variable-length-ness. Otoh an instruction decoder that pre-allocated buffers for whatever it might need would then have to allocate a huge buffer to hold potentially 65536-word-long instructions. So 256 is better (and maybe even this is too large).

Should we still have a null terminator at the end? But then that interferes with nulls within the LONG. I guess the easiest to implement thing is to not have that.

Note that we want LONGs to be 64-bit aligned. So we probably pad them.

The easiest thing is to have the LONG be, at first glance, a 16-bit length header followed by a 16-bit header followed by a stream of 16-bit words, followed by padding for 64-bit alignment. But as we said above, we don't want the length header to be able to go as high as 2^16. So we can combine the length header and the following header into one 16-bit header.

Now we want a tree structure so we probably want something like the 'phrases' and 'roles' described in proj-oot-ootAssemblyNotes8.

We might want to consider a special format for an entire phrase in one 16-bit word. This would involve (a) a format bit taken from the length header, (b) a role, (c) the data. The obvious choice would be to:

use the most-significant-bit in the length for this, which restricts the max length of the whole thing to 128 words instead of 256 words, which, given our worries about the decoder having to allocate too much memory, is probably fine
then have 7 bits of role
then have 8 bits of value

If we wanted more than max 128 words in a LONG, we could use the most-significant-four bits to mark single-word phrases; now we can go up to a length of 240. Now we have either: 4 bits of role and 8 bits of data, or 8 bits of role and 4 bits of data.

Or we could have:

first bit is single-word marker; so max 128 words (256 bytes) per LONG
4 bits of role
11 bits of value

or do we want addr modes? Then:

first bit is single-word marker; so max 128 words (256 bytes) per LONG
4 bits of role
3 bits of addr mode
8 bits of value

But would we ever do this in the initial header? Probably not because we need more than one phrase to make a complete instruction; and even if we only needed one 16-bit word to say what we've gotta say we gotta be 64-bit aligned so fitting an entire LONG into 16-bits doesn't help us at all. So we go back to max 256 bytes per long, but now we have a limit of 128 bytes per phrase, which probably isn't so bad.

Do we want phrases WITHIN a long to be 64-bit aligned, though? mb...

that would restrict us to 64 phrases (if we have 256 16-bit words), which isn't so bad.

but i think what the CPUs like is more that if you have a 64-bit atom, such as a 64-bit integer, then that should be 64-bit aligned. But we want LONGs to be able to contain actual 64-bit atoms, and this would probably be encoded as a 16-bit phrase header followed by a 64-bit atom; which means that the phrase itself would not be power-of-two words (unless we waste 3 words to pad it to 8 words).

In this case, if there is intra-phrase padding, how/where do we specify that?

also, in our example of a 5-word phrase with one 16-bit phrase header and then 4 16-bit payload words for a 64-bit payload, if we do want the payload to be 64-bit aligned, we might have to add up to 3 words of padding. So, if we demand payload alignment, in fact we do have to be ready to pad up to 8. Given that, the simplest thing might be to just have it be 8 words all the time in this case.

Does that mean we make all phrases 64-bit aligned? No, because we can still have one word (16-bit) phrases. We're just saying that maybe phrase lengths should include the maximal possibly-necessary padding. I don't think we are even requiring that phrase lengths be powers of two (but perhaps there is some other reason that we should; and note here that phrases of the form (1 word header, power-of-two-word payload) will always end up having a power-of-two length under this maximal possibly-necessary padding rule.

If all we care about is 64-bit alignment, we need at most 3 words of padding, so we can specify padding using a 2 bit index saying where the payload begins within the phrase. If we take two bits out of an 8-bit role, that leaves us with a 6-bit role, which sounds ideal anyways. Otoh mb only a few roles (those marking 32-bit and 64-bit payloads) even need padding, in which case we can just do this in those cases rather than generally (so we'd need 6 roles to encompass those two roles x padding; 32-bit 0 padding, 32-bit 1 padding, 64-bit 0 padding, 64-bit 1 padding, 64-bit 2 padding, 64-bit 3 padding; leaving us 250 other roles if we had 8-bit roles, or 10 other roles if we had 4-bit roles).

But now if we also want the option to encode 128-bit and 256-bit literals, now we need 3 bits of padding and 4 bits respectively; so we've burned 2 + 4 + 8 + 16 = 30 roles. Which still isn't too bad, if we started with 256 (or even 64).

Also, with this sort of phrase, the phrase length is redundant. We could just guess the literal size from the phrase length, and burn only 16 roles on this to specify the padding.

The roles described in proj-oot-ootAssemblyNotes8 were:

0: operation/head/verb; this selects what sort of construct or operation is represented by this phrase or sentence. Somewhat analogous to opcode.
1: input/rvalue/source: Somewhat analogous to input operands.
2: output/lvalue/destination: Somewhat analogous to destination operands.
3: modality: modifications to the way that the meaning is processed. For example, lazy vs strict.
4: constraints ("such that" or "where" clauses): For example, "orange" in "Pick up the orange ball."
5: conjunctions. For example, 'a and b and c' or 'a or b or c' or 'a xor b xor c'
6: alternate formats. For example, can be used to select an alternate, spartan 'packed' format for this frame.
7: grouping constructs that span multiple frames. For example, hints about how long a multi-frame sentence is; parentheses; quoting and antiquoting; annotations

i guess the multiword literal encodings could be role 6 in here (but recall that there would actually be 16 such roles, to account for padding).

we probably also need a role for 'padding'. This could be a type of 'annotation' within longs. So we need an 'annotation' role. And of course, we need to separate annotations into 'annotations that every decoder must understand' and 'annotations that the decoder can ignore'.

do we mandate that the roles be stored in some certain order? Do we want to have only one way of representing a given LONG, or do we want to allow it to be reordered? Because of padding requirements, allowing reordering may decrease space requirements by allowing the compiler to search for an optimal ordering w/r/t minimizing padding.

do we need an additional role for the 'meta' or 'type' field? or do we use some form of 'modality' for that?

do we do as the above suggests, and use one 'input' role for all operands, letting their ordering with the LONG be significant? (probably, yes, b/c otherwise in order to accomodate CALLs with many inputs, we'd have to burn many roles).

what about operands that are both input and output? do we have a separate role for that, or do we just reuse 'output' for this?

ok i added a new LONG proposal to ootAssemblyThoughts.

---

so what about when we have eg a WHILE loop or a CALL instruction with many arguments? Should we encode it in a single LONG instruction (semantically correct), or encode it down to many MEDIUM instructions (easiest to implement interpreter), or should we have many medium instructions with a series of annotations so that transpilers can see what's going on?

How about both:

we do encode down to many MEDIUM instructions, so that naive interpreters don't have to bother with LONG or with control structures
we do annotate. But Instead of many annotations, we have one LONG-format annotation
this LONG-format annotation is a grouping annotation that indicate the block of MEDIUM instructions which are just implementing one WHILE or one CALL or whatever
but the LONG-format annotation includes, as a subphrase, a LONG-format specification of that WHILE or CALL. This means that advanced transpilers/interpreters can just strip out the indicated block of MEDIUM instructions wholesale and replace them with its understanding of the WHILE or CALL.
so, we need a special kind of ('language_specifier'?) ('super-role') to indicate LONG instructions that provide 'alternative translations' of a block of MEDIUM instructions

---

well, actually, sometimes the 'alternative translation' might span multiple instructions. So this really needs to be an OotB? primitive.

i added:

alternative translation block (eg CALL as a single LONG instruction vs CALL as many medium instructions; but in general every alternative block may span multiple instructions). Perhaps this has the effect of an unconditional JMP to the last alternative block, for those interpreters that can't understand the other alternatives

---

so now an Oot interpreter that doesn't understand LONG can't just scan naively the code beforehand and crash if there are any LONGs present. Rather, it must accept LONGs that it will never reach due to the alternative translation blocks. It must also accept LONGs that it can ignore due to LONG_TYPE<64. So it must be able to at least read the LONG header to determine the LONG length (so that it can skip it) and the LONG_TYPE (to see if its skippable), and it must also ignore other LONGs in the non-default alternative translation blocks.

Aside from these, though, it can crash upon module load if it sees LONGS in code.

---

we can fit both RJMPs and JMPs into the same opcode, even if we only used 2 operands instead of 3, if we constrain them each to half of the address space, eg max RJMP of +-64 instructions (not 128) and max 32k jump table per module (not 64k). (with JMP probably want to use 3 operands, so its even better, but 2 operands could be useful for grouping annotations)

---

of course, with just 24-bit absolute jumps (all 3 operands, and use the unused addr modes (require them to be 0 in user code) for the extra x256 for inlining expansion) and 16-bit relative jumps when one operand it taken (so for a block annotation; dest operand is annotation type, other 2 operands are length, require addr modes to be 0, unused addr modes are for the extra x256 inlining expansion), we might not even need a jump table.

i guess we still need a table for intramodule CALLs? no, b/c again we don't need the output b/c it'll go onto the stack (this is assuming a calling convention where we don't have to tell the CALL the number of inputs or outputs), so we have 24 bits (3 operands) to specify the target, plus use the 2 unused addr modes for the extra x256 inlining expansion.

---

still, even if no table, might be cool to fit both JMP and RJMP into the same instruction..

we could use immediate addressing mode for RJMP and constant table addressing mode for JMP via a constant table (or even for absolute immediate jump)

---

i dunno if we should have a separate instruction to mmap the stack. Maybe it would be simpler just to assume the stack is always mmapped, and let the implementation trap register-mode accesses to the stack registers if they have a better way to do it.

we also need to have a way to mmap the register, again so that we can save state between function calls/coroutine yields, etc.

---

ok, the primitive instruction set and instruction encoding is coming along. Things that remain to be done:

timer hook
type signatures, including for functions
layout of composite data structures in address subspaces (eg if there is a struct whose first item takes 8 words, then accessing x.2 still accesses the second item, not x.8)
- and .data declaration in textual assembly format?
address subspace hook methods?: getters, setters, constructors, serialize; how to malloc them
cow (copy-on-write) (the possiblity to implement this, not necessarily a primitive) (does this need to be able to be done implicitly as a setter, or is explicit enough?)
capabilities
aliases?
ffi
should we have a primitive hook-ful interpreter (vmget, etc) and implement capabilities and fancy addr subspaces etc on top of this, or just implement this stuff directly?
exit with exception: need to specify a way for an Oot program to exit to the host platform but with an expection.
- do we also need the possibility to let the host tell us to resume after the exception?
  - this would provide us with coroutines, so may be a good idea
need a way to query the height of a stack
currently we have one addr mode bit free
how to do unboxing hook?

(copied to proj-oot-ootAssemblyThoughts)

---

thoughts on the previous:

the following is straightforward:

timer hook
exit with exception
query height of stack

there is also

type signatures
ffi
and .data declaration in assembly?
currently we have one addr mode bit free. hook?
how to do unboxing hook?

and then most of the rest has to do with functionality of address subspaces that could be provided by per-subspace hooks, and the question of whether we should implement this stuff in the primitive Oot interpreter, or if the primitive Oot interpreter should have lots of hooks.

to the extent that FFI requires the implementation to know about this stuff, that argues for implementing it in the primitive Oot interpreter.

but we want to keep the primitive interpreter easy to implement.

perhaps if really all we need are per-subspace hooks, then the way to go is to implement a primitive Oot interpreter without many hooks itself, and with simple addr subspaces with hooks.

could we do unboxing as an addr subspace hook too?

---

so, the idea i'm kicking around is:

addr subspaces are classes with the following methods:

ctor
get (takes 'unbox' argument)
set (takes 'unbox' argument)
len
serialize (serialize to stream of 16-bit words)?
- but what about cyclic references? i guess we could pass in a list of already visited pointers so that we can recursively call this while still cutting off cycles? and how would we serialize pointers? i guess we could just print them as arbitrary unique integers, with a type marker to show that they are pointers.
strserialize (serialize to an ASCII string, for interchange purposes (could also be sent across the network), also for printing for debugging)
unserialize
unstrserialize (or unserialize could detect when it's dealing with an ASCII string)
dtor
.. and can be extended with further methods (eg for atomicity)?

the free addr mode bit is implemented by an instruction wrapper (to do atomicity, etc)

which addr subspace class is present must be known at compile-time (no dynamic dispatch)

malloc takes two arguments: the class, and a word (typed as pointer?) that will be passed into the class's ctor

---

interesting that at this low level, 'addr subspace' objects have 'len' where Python objects would have 'dir'. This is because at this level, parts of the struct are addressed by a contiguous range of unsigned integers, instead of by name. HLLs either map names to unsigned integers (C's 'enum's, structs, presumably Python objects), or use hash maps (Python 'dicts').

---

should we bake in a higher-level object model interoperability? what would that buy us? introspection of fields? subclassing?

---

some lower-level languages provide both a 'goto' for intra-function gotos, and a 'jump' for inter-function gotos

---

would be nice if functions that read/write lists could be polymorphic about whether they are reading/writing from an array of memory, or popping from/pushing onto the stack

eg SAVEREGS, RESTOREREGS could use this

---

ok so we're doing pretty good for how things are implemented here, do we need to introduce lots of concurrency primitives at this low level though? or save that for oot core?

---

https://en.m.wikipedia.org/wiki/Canonical_S-expressions

these could be useful for serializaton

or also for LONG encoding. Otoh there i think having the fixed-length length header is ok.

otoh these XML features are cool, mb add them:

"every XML element has a name" [1]
"An XML element may also have attributes" in addition to children [2]
"XML ID and IDREF attributes" [3]
"an XML element may contain comments and/or processing instructions" [4]

"The first atom in a csexp list, by convention roughly corresponds to an XML element type name in identifying the "type" of the list. "

also http://json-ld.org/ looks cool

---

hmmm...IDs and IDREFs sound like the right way to serialize pointers

---

TODO add csexp and XML stuff, above, to LONG.

---

use ad-hoc polymorphism to update the GC when copying or moving pointers, and to do special things for COW

---

i guess you kinda have to hook writes for the GC? or is just hooking 'unboxed' writes enough?

depends on threat model; is it bad if attacker code can bypass the Garbage Collector but not the capabilities?

---

may need a type for 'type'

---

our idea is that the 'primitives' should provide the 'only thing you need' but with possibly abysmal performance. Then later layers can add things that do not provide as many guarantees, and can be written in terms of the primitives; but then the implementation can replace these later layers with higher-performing platform primitives.

By this logic, at the more primitive levels, we should have wait-free data structures whenever possible, and then the (simpler to implement) NON-wait-free versions should be provided at higher levels. The reference implementation can implement the non-wait-free versions just by calling the wait-free versions, but then higher-performing implementations can replace this by actual non-wait-free versions.

One issue with this is that we don't want the porter to have to implement wait-free data structures. We can deal with that however by just having the actual primitives include atomicity things like CAS, and then the reference implementation provides the wait-free data structures.

Another reason we might have 'wait free by default' is that we care about predictability over performance, so we care about worst-case performance more than average-case performance (similar to how we care about latency over throughput). Eg we'd prefer no sporadic long GC pauses at the cost of low throughput.

Note: what do we do if the platform doesn't even provide CAS? Cross our fingers and pray?

oh; i guess you implement CAS via locks, and implement locks via Dekker's algorithm (which busy-waits), which only requires a shared memory with a decent level of memory order consistency. So you'll have CAS, but i guess it won't be waitfree (?).

---

it's known that a FIFO queue with enqueue and dequeue and an additional peek1 operation (is that all of its operations, or am i forgetting one?) has infinite consensus number.

what about a stack with atomic push/pop of multiple items, and with peek1? that would seem appropriate for assembly languages like ours, which already make use of stacks

---

Over the past few days the main currently issue (which is currently a blocking issue) for the design of Oot Assembly continues to be one that i have mentioned before:

We want to have capabilities and address subspaces (which might just be my way of saying OOP classes, i'm not sure) and timers. Either capabilities and timers must be built in at a low level, or we're going to have to emulate another VM on top of this one in which they are built in, because ultimately we're going to be running untrusted Oot Core code compiled to Oot Assembly, and we want to place the burden of the capabilities system on the runtime instead of the compiler, and so we can't allow untrusted code to be able to escape the capabilities system merely by bypassing it with 'unboxed' primitives or something like that.

And we're also going to have boxing, to handle stuff like lazy thunks, later. This will probably be implemented in the address subspace 'classes'.

So, it seems we have three choices for the capabilities and the address subspaces:

provide a minimal VM with per-program 'hooks' that are loaded (and fixed) during program initiaization. These hooks then hook get/set and implement address subspaces, and either the hooks or the address subspaces implement capabilities
provide a minimal 'SootB?' VM and then implement a more featureful 'OotB?' VM running on top of it that includes capabilities and address subspaces
hardcode the capabilities and address subspaces in the VM design

If portability was really the only design goal, then clearly the answer would be to have a minimal SootB? VM and write an OotB? VM on top of it (no hooks, just one VM running another one). There are two reasons not to do this:

performance. I know i say i don't care about performance much, but i care about it more down here in OotB? than in Oot Core, and i care about it more vs portability than vs simplicity and vs metaprogrammability.
simplicity. If we add another layer, that's one more thing (more 'accidentaly complexity') for devs to understand before they can hack on the Oot implementation.

(why do we want portability anyways? Three main reasons:

i want a language i can use everywhere. I was disappointed a few years ago that i couldn't use Python on Android (and it's still kind of hard as far as i here)
i want a glue language, and one way to get there is to be able to implement it within other languages. Which means a lot of implementations, which means it should be easy to port.
system adoption can depend a lot on whether the system is available on platforms that people want to use it on, and that changes over time, and whether or not it gets ported is probably critically dependent on how easy it is to port )

So, one VM on top of another (the middle alternative) is probably out. There is some commonality between the first and the second alternatives, because we could design SootB? specifically for running other similar VMs on top of it, such that that you don't have to totally run another VM eg the dispatch could be shared; in this case the second VM can be viewed in terms of hooks.

So there is a decision between:

have a super extensible VM, with one or both of lots of hooks, and/or facilities for semi-efficiently/semi-elegantly running other VMs on top of it and switching between 'levels'
have an ordinary VM which includes capabilities and addr subspaces

Pros of (1): A super extensible VM would be cool in itself, and removes the need for the porter to bother with capabilities and addr subspaces, and also makes it easier for us to update the system, because if eg the capability system were part of the VM, then changes to its design obsoletes existing ports that hardcoded the old design, whereas with hooks we just update the platform-independent hook code. Also, the design of capabilities and addr subspaces is uncertain, so shouldn't we avoid baking in design decisions regarding those?

Pros of (2): On the other hand, this 'super-extensible VM' feels like an unnecessary distraction from our task, which is merely to provide a lower layer for the actual language we are interested in, Oot Core. The reasons to have a lower layer are:

to make Oot portable by allowing a porter only to implement an interpreter or compiler for Primitive OotB?, rather than all of Oot
to make the design of the Oot implementation more comprehensible by separating 'low-level stuff' like implementation of garbage collection and a greenthread scheduler from the 'high-level' implementation of Oot Core semantics
to inform my thinking about the design of Oot Core by clarifying what the available 'low-level computational primitives' are

Not having to implement capabilities might make OotB? slightly more easy to port, but this probably won't make a big difference because implementing a lot of hooks might be almost as bad. And there's a good chance that the hook will be even harder to implement than capabilities. And other than portability, nothing in our motivations for the lower layer seems to be encouraging us to go off into the la-la land of extensible VMs. Other languages and VM projects seem to be showing that the way to go is to design a VM for one language, rather than just trying to design a good 'generic' VM. The ultimate task of this VM is just to run one program, the Oot Core implementation, so why spend a lot of time and complexity on extensibility? This seems to be a place where we should embrace the Golang ethic of making things simple by accepting mechanical sympathy instead of pursuing generality.

So, i am torn. Super-extensible VM because it's cool in itself, it might be easier, and the design of capabilities etc is liable to change? Or ordinary VM because it's foolish to venture into extensible VM la-la land? I keep going back and forth (that is, thinking i am certain i'm going to go one way, and then later feeling equally certain about the other). This is a blocking decision because an naive implementation of an OotB? interpreter would have to be mostly rewritten to switch between them.

What i'd really like is for stuff to eventually fall into place like they did with using a fifth 'meta' operand for 'context' including the runtime types that polymorphic functions are being specialized to. Perhaps some sort of 'meta bit' to switch between 'levels' of an interpreter tower?

I guess one of my concerns is that both ways seem to have too many moving parts:

the 'extensible interpreter' way either:
- has very few, very simple, very powerful hooks, in which case it's much like just running another VM on top of this one
- has many, constrained hooks, in which case there are too many moving parts
the 'implement stuff directly in the VM' way means at the least putting capabilities in there, which already seems like too many more primitive instructions

hmm, a new idea i just had while writing this: One way things could 'fall together' would be if i could find a basis set of a few, simple, yet constrained hooks, and perhaps encode some of the capabilities functionality in the VM outright, and use these few hooks to do the rest. Maybe this could end up being a middle ground between the two choices? Which may be good. That sounds like it would minimize moving parts.

One might say, just implement both of them and see which one you like, but the thing is that i won't know which one i like until i implement the rest of Oot Core on it, which is a very long way off. I guess i could just pick one and get going and plan to revisit the decision, but i bet the decision will end up getting frozen and not really revisited later.

Some other related concerns:

We probably need low-level hardcoded (or at least hooked) support for security stuff like capabilities and timers. But there are other features where it's unclear if they should be handled by extensible VM stuff. For example, 'symlink' variables? Read/write barriers for GC? I talk about OotB? being both a VM in itself, but also a serialization format for Oot Core code; does that mean that we want the VM to handle eg read/write barriers so HLL code can be represented naturally in OotB? without interspersing GC stuff, or is this just an interchange format and whoever executes that serialized Oot Core code won't be the OotB? VM, at least not without an intermediate compiliation step (to add in stuff like the GC stuff)? If we want to have these in the VM, and we have hooks, then we could implement these sorts of things in the cross-platform hooks, the same way we handled capabilities, which is nice, so this would argue for the extensible VM. But i'm not sure if we are going to demand this other stuff at all.
Right now we have 4 addr mode but but only 4 hardcoded addr modes (plus one variant, constant table), which leaves 11 addr modes undefined (or, roughly 2 bits). One of these bits may be an 'unboxing' bit, which allow the VM to implicitly include stuff like lazy thunks, which means that the semantics of this bit would have to be extensible. What do we do with these 2 bits?

In summary, i gotta either choose between hooks or a direct implementation of capabiliites and addr spaces and who knows what else in the VM. Or i gotta find a middle ground via having a few, constrained hooks. Right now the middle ground idea sounds good.

---

is the open/close read/write a good enough basis set of primitives/the best way to do it?

maybe look the Plan 9's thing (9P, latest version apparently 9P2000, Inferno apparently calls 9P2000 Styx, not sure if those are exactly identical).

https://en.wikipedia.org/wiki/9P_(protocol)#Implementation gives a summary of the 9P protocol:

version: Negotiate protocol version
error: Return an error
flush: Abort a message
auth, attach: Establish a connection
walk: Descend a directory hierarchy
create, open: Prepare a fid for I/O on an existing or new file
read, write: Transfer data from and to a file
clunk: Forget about a fid
remove: Remove a file from a server
stat, wstat: Inquire or change file attributes

(details linked from https://en.wikipedia.org/wiki/9P_(protocol)#Implementation )

more docs at

http://man.cat-v.org/plan_9/5/intro
and possibly at http://9p.io/sys/doc/

---

'middle ground' ideas:

get
set
timer
lea?

here's an older list (currently in ootAssemblyThoughts)::

" The following syscalls are considered 'vm plugins': vminit, vmexec_instr, vmlea*, vmsea, vmjmp, vmtimer, vmshort, vmlong, vmloadk, vmsyscall, vmterminate. This means that they are invoked automatically by the VM in the course of processing ordinary code (but not while processing code inside a syscall implementation). In conjunction with the other syscalls in syscall.bin, these are responsible for implementing 'address subspaces', cababilities, pre-emption (via vmtimer), decoding of SHORT and LONG encodings, module management, etc. "

y'know that's already not actually so many... mb we're already at the 'middle ground'.

and a list of this plus included syscalls (currently in ootAssemblyThoughts):

The following syscalls are guaranteed to be included:

vminit
vmexec_instr
vmlear, vmsear
vmlea0, vmlea1, vmlea2, vmlea3, vmlea4, vmlea5, vmlea6, vmlea7
vmlea, vmsea, vmjmp
vmtimer
vmshort, vmlong
vmterminate
vmloadk
vmsyscall
vmread, vmwrite? (these get the type of the thing they are supposed to be reading or writing as an argument, so that they can do different things if it is a pointer or an int, etc)
malloc, mfree, mlen
capmalloc, capmfree, capmlen
capunion, caponlyaddr, caponlyperm, capgetperm, capgetaddrs, capseal, capunseal, capjmpseal
loadmodule
unloadmodule
halt
handlesignal

this list is longer but that's b/c it includes stuff like all the cap* syscalls, which are just syscalls (rather than custom instructions) so that they can be called from vmlea, vmsea, vmjmp and access the global syscall state. If this 'extensible' interpreter wasn't being used for this instruction set, maybe they wouldn't be there/would be different. We don't really have ANY required ones, because there is an obvious default (do nothing) for any that aren't there.

---

mb the thing to do is to have vmlea, vmsea etc be defined on a class instance for an addr subspace. So that bakes in addr subspaces. Then we can also define capabilities in a similar way (which also allows us to bake in an error handling scheme for capabilities, which will probably be as simple as setting ERR to a certain predefined integer). Also, we bake in the 4 basic addr modes, we don't let vmlea/vmsea handle that.

we could also defined unboxing in a similar way, and/or we could only let unboxing look at values, not the effective address that that value is coming from/going to. And should unboxing be per-program, or per-address subspace? And for the forth addr mode bit, the putative meta bit or atomicity bit, we could let vmexec_instr see these bits, but not anything else (that probably isn't enough, though...).

And should per-address-subspace vmlea/vmsea be able to do stuff like getters/setters, or just remap the address (so the addr subspace can only be a struct that hides the sizes and layout of its contents)? Maybe they can be getters/setters on a per-program but not a per-addr-subspace-class basis? op mb a per-addr-subspace-class basis (remember, the class is guaranteed staticaly known) but just not per-addr-subspace-instace? And i guess they get to know the type of the thing being loaded or stored?

Should CAP have a register, or should it be 'hidden state'; maybe the latter, since its representation is opaque and you're only supposed to interact with it via the CAP* custom instruction API (and it isn't being saved via 'caller save' and stuff like that).

mb for caps have just one hook: bool vmhascap(ptr, cap), which checks if the progr Currently has the indicated capability for the indicated address

---

wait, do the address subspace hooks really have 'instances'? Or is it just static methods on a class, and the only 'instance' is the program-visible data actually stored in the address subspace? If all methods are static, then it's really just a callback, isn't it, not a class instance? Otoh it really is a class instance if you could the program-visible data.

note that we want to encapsulate structs via address subspace hooks; at runtime can there be new kinds of structs with their own layout? Possibly, but what if whatever the hooks need to decode the layout is actually stored in the program visible data; for example, consider a variable length array; if the program-visible data is, first a length header byte, and then the array elements, then the address subspace hook doesn't need to store any extra hidden 'instance data'.

But what if the user compiles a program from some C-like HLL and in their program they define some special kind of struct. Now do we recompile the VM? Or do we just say, yeah, well, just like Java, you can't actually expose the byte layout in this VM except via the special FFI and Bytes data types and such.

---

should we automatically place the context operand into a Context register when calling a subroutine or custom instruction? nah, for the same reason we don't do that with operand inputs; it'll get clobbered in a subroutine call. We should place it on the stack (or the hidden stack), instead.

and since SHORT doesn't have a context operand, how do we interpret that? I'm guessing that every SHORT instruction just uses some default context. Is this DYN or is it something low-level and static (like U15 or native pointers, depending on the instruction signature?). I guess this is just a question of optimization, since everything about SHORT is just a question of optimization. So, eventually if we care, try both and profile them (obviously, as we switch from one to the other we'll have to recompile programs with different subset of code in SHORT and in MEDIUM format). My guess is that low-level static, like U15 and PTR, will be better than DYN, because we'd prefer to optimize the code density for stuff like the custom instructions and syscalls which are in the interpreter loop. This caters to slow, naive implementations, rather than fast advanced implementations that implement all that stuff as platform-native code, but that's okay because advanced implementations will probably directly compile Oot Core to some other optimized bytecode anyways.

---

i did a Google search for https://www.google.com/search?q=extensible+vm to see what others did.

top hits include:

Towards An Extensible Virtual Machine by Chandrasekhar Boyapati. I read it. This is an 'area exam report' and it is mainly a literature review. It links to some cool things that you could do with an extensible virtual machine, but doesn't actually enumerate a set of hooks.

Extensible virtual machines by Timothy L. Harris. Skimmed it. This one does provide hooks but its purpose is a bit different than ours. It seems to be mostly providing hooks for individual applications to provide optimized overrides of stuff like greenthread scheduling and lazy class loading. So the hooks are higher-level than what we are looking for; their hooks are stuff like 'blockProcess'. It does have a literature review of 'flexible virtual machines' but i haven't read that part yet (already added to ootToReads). There's a few papers by Harris on the same theme, i didn't look at them.

https://www.academia.edu/1462710/A_generic_purpose_cross-platform_highly_extensible_virtual_machine . This is more our style as the intent is for one VM implementation that can run custom instruction sets, but although it says there is a "Small kernel (less than 40 lines of ANSI-C code)", this writeup doesn't say what's in the kernel, and the actual instruction set has 300 instructions. The purpose seems to me more for experimentation and education than to allow easy bootstrapping by making the extensible VM itself dead easy to code.

Jupiter: A Modular and Extensible Java Virtual Machine Framework. Skimmed the first chapter or two. This breaks a JVM into the following modules: meory, execution engine, opcode spec, threads, objects, monitors, classes, native, frames/contexts:

memory: basically malloc
classes: Class metadata. Classes, MethodDecls?, MethodBodies?, Fields.
objects: two kinds, arrays, and class instances. 'new' is in here
frames/contexts: call stack for a thread, frames, in each frame: local vars, operand stack, PC
execution engine: contains a function ee_executeTo(context, stop_frame), which executes the method whose frame is at the top of the context's call stack, until the stop_frame is found to be at the top of the call stack
thread and monitor: closely related. Do what you'd expect (monitor in the sense of synchronization)
native: native methods. getNative provides the appropriate Native for any native method.
opcodeSpec: what you'd think; a declarative list of instructions and what their behavior is

http://wayback.archive.org/web/20160328032222/http://www.simondobson.org/2010/05/languages-extensible-vms/ isn't really talking about this at all, it's asking which implementation language/platform makes implementing an extensible VM easier. We're trying to make one that is easy to implement on any platform, so we don't care.

---

in some ways a Forth-like extensible VM may be a good idea, because calling is cheap, due to no caller-saving of registers. I guess our custom instructions and syscalls are like this.

---

an argument against allowing full getters and setters for every field in the addr subspaces:

these guys are sorta of like object instances (except they can be like arrays in addition to structs), so you can just put in get and set methods in the 'struct' if you need them

but an argument FOR allowing full getters and setters for every field in the addr subspaces:

this lets you transparently do GC stuff (read/write barriers, reference count increments/decrements, keeping a list of where pointers are) when the addr subspace is read from or written to

---

should we only allow 'ground'/'saturated' types (those without any unfilled type variables in them, eg 'List Int' is saturated, 'List' is not) to be assigned a type number in modules? or do we allow unsaturated/generic types too?

if we do allow generic types, perhaps reserve some of the type address space. If the type address space is 32k (15 bits), then maybe 16k of that is saturated types, and 16k is unsaturated.

---

with imports, probably want to allow 'range imports', which say eg 'import module 3's symbols #23-#45 into our symbols starting at #134'

---

i'm still not sure what to do with the last addr mode bit. Let's leave it reserved for now (must always be 0), and/or have a vm hook for it

---

old ovm.py code, before i decided (reluctantly) to write module loading in Oot Assembly.

!/usr/bin/python

import argparse, struct

class OvmException?(Exception): pass class OvmNotAObFileError?(OvmException?): pass class OvmVersionNotUnderstoodError?(OvmException?): pass class OvmVersionTooLowError?(OvmVersionNotUnderstoodError?): pass class OvmVersionTooHighError?(OvmVersionNotUnderstoodError?): pass

OB_MAGIC_NUMBER = 1953459995 # the four bytes at the beginning of any .ob file should always be this (1B 6F 6F 74) LOWEST_VERSION_UNDERSTOOD = 0 HIGHEST_VERSION_UNDERSTOOD = 0

parser = argparse.ArgumentParser?(description='Execute Oot Bytecode.') parser.add_argument('file', metavar='f', type=str, nargs=1, help='Oot Bytecode (.ob) file to execute')

args = parser.parse_args() mainfilepath = args.file[-1] currentfilepath = mainfilepath with open(currentfilepath, 'rb') as fh: f = fh.read()

bytes 0-3 are the magic number: 1B 6F 6F 74 (1953459995 as a little-endian 32-bit integer) magic_number = struct.unpack("<I", f)[0]
convert the first 4 bytes to binary
'<' prefix means 'little endian, standard byte sizes' (as opposed to the default 'native byte ordering, native byte sizes')
'I' is 'unsigned int' (32-bit integer)
struct_unpack returns a tuple of unpacked fields, but in this case we only asked for one (one 'I') if magic_number != OB_MAGIC_NUMBER: raise(OvmNotAObFileError?("First 4 bytes ('magic number') of putative .ob file '%s' should be 1651798006 (1B 6F 6F 74), but was %s" % (currentfilepath, magic_number)))
bytes 4-5 are the version number
H is 'unsigned short' (16-bit integer) version_number = struct.unpack("<H", f[4:6])[0] if (version_number < LOWEST_VERSION_UNDERSTOOD): raise(OvmVersionTooLowError?("This OVM implementation only understands OVM versions %s through %s, but file '%s' is version %s" % (LOWEST_VERSION_UNDERSTOOD, HIGHEST_VERSION_UNDERSTOOD, currentfilepath, version_number))) if (version_number > HIGHEST_VERSION_UNDERSTOOD): raise(OvmVersionTooHighError?("This OVM implementation only understands OVM versions %s through %s, but file '%s' is version %s" % (LOWEST_VERSION_UNDERSTOOD, HIGHEST_VERSION_UNDERSTOOD, currentfilepath, version_number)))
bytes 5-6 are reserved (so ignore for now)

---

if non-immediate-mode opcode field meant to execute a single instruction whose opcode is taken from the effective address, we'd be in trouble for disassembly b/c we wouldn't know whether the addr modes are normal or are being used as operand data (as in JMPs) until we resolve the opcode. But since non-immediate-mode opcode just means first-class function, we're fine.

---

a friend (DR) pointed out that the scheme i was thinking of for runtime polymorphic functions isn't 'open', in the sense that the types that a function can be ad-hoc polymorphic on would be fixed ahead of time (generically polymorphic is already fine).

but here's how to fix this. Instead of implementing ad-hoc dispatch functions, let the OVM (what i was calling Oot Assembly) language do that. The OVM maintains the dispatch table for ad-hoc polymorphic functions. Now when you load a module, it can hook up more types to an existing polymorphic function.

---

ok i started coding up an assembler and disassembler.

---

One thing still to be resolved is exactly how the jump and branch instructions' operands work; whether they take offsets; whether they are relative or absolute; whether there is a module jump table.

i guess i am in favor of a jump table, instead of jumps targeting actual PC-locations in the instruction stream. The reason is that the idea of subroutine identities being equal to line numbers in a linear instruction stream seems peculiar to Turing machine models, rather than general to computation. I am also influenced by LLVM's demand that indirect branches statically enumerate all possible branch targets.

on the other hand, in SHORT encoding mode, you really want branches to be relative, because you don't have many bits (2 or 4 bits) and you can't fit all of the jump locations for the entire module into a 4-item or 16-item jump table.

also, it would be nice to allow implementations to get rid of the jump table and use direct (relative or absolute?) jump locations. This would also allow us to bootstrap without jump tables.

maybe have some hybrid scheme where if an 'indirect branch' or 'offset' operand is immediate 0, the others are treated as absolute or relative locations, or equivalently indices into a jump table (1-indexed, not zero-indexed, i guess, since immediate zero is being used to indicate something else), and if the 'offset' operand is otherwise, then the first address is treated as jump table identifier and the offset operand is an index into that jump table. This would necessitate having not just one but multiple jump tables per module. The idea is that, for LLVM-like indirectbr requirements to have an enumerated jump table for indirect offsets, this gives us that. mb this is too special-casey and we should just have a separate opcode for indirect branch.

---

in writing the assembler and disassembler, i noticed that the multiple encodings are more of a pain than i thought. i'm still in favor of them though.

---

i'm guessing that the instruction signatures are also going to be a huge pain.

Why did we want those again?

First, to save some space so that we could have space to inline-expand jump targets.

Second, so that stack-addressing knows whether to PUSH or POP each operand.

Third, so that we can have 'reversible computation' that uses all 3 operands as both input and output (esp. if combined with stack addressing, where the stack depth should end up unchanged in this case).

Is this worth it? Is there a simpler alternative?

---

mb in the capabilities hidden register, have a 256-bit bitmask indicating, for each of the 256 syscalls, whether that syscall is allowed to be called by this caller.

---

regarding whether the SHORT instructions, which have no field0, should 'default' to primitives like U15, or to DYN, i guess it's primitives. The reason is that, since they have no field0, our technique for runtime generic resolution doesn't work, because they can't pass along the types that they would have been given via field0.

---

still not quite happy with the JMPs.

My determination to have indirect branches only into a jump table is shaken by RET. RET pops from the call stack, but since we have a first-class call stack that could be anything. LLVM deals with this only because it has a notion of the call stack that permits only conventional control flow. And now we have JMP, BRI, BRZ, RET. And RET only uses one of its three operations.

Seems to me that we could have:

JZ can be a conditional jump to an absolute address in the jump table (signature cond jmp16, which is in jmp)
BZ (or JRELZ) can be a conditional relative jump (signature cond jmp16, which is in jmp)
BRIZ can be a conditional indirect jump, and also return (signature ret cond target)

---

old notes moved from implementation:

    (1, 'JZ', ('in', 'jmp')),

mb when field2 is immediate 0, this is a relative jump to an immediate (or an absolute jump to the module jump table), and when field2 is otherwise, field2 dynamically resolves to an offset into a jump table (a module jump table? or a jump table found at 'jmp'?) or is this just a relative jmp with a dynamic offset? Considerations include the annoyance of having a module jump table, and LLVM's insistence that we statically know the potential targets of dynamic jumps ("indirect branches", LLVM indirectbr). If we have a BRI instruction, then how do we make the 'in' argument not a waste, couldn't we just use BRZ for this?
note: so are the stack regs normal pointers or not? If so, we can do DUP via MOV, by using an indirect read for the source and a PUSH write for the dest.
wait, even if register mode is prohibited on stack pointers, register indirect can be allowed. so we're good.
CSWAP can do SWAP.
older idea:
(11, 'PUTILS', ('imm', 'in', 'inout')), # 4 more primitives, TBD. This is provided so that we can do more in SHORT instructions, eg maybe manipulate the hidden stack, and do more MOVs. TODO this is annoying, is is worth it?. ideas for primitives: dup, rot, swap, over, nip, xor, mov1, mov2, nop, prefix modifier, label, CCNOT, CSWAP, CAS, COW, CENTIS.

'jmp' is an immediate 16-bit jump table identifier (some implementations may just store the jump table in memory, in which case this can just be a pointer; implementations may use the vacant 8 bits of addr mode here). 'in' is the index into that table.

    (2, 'BRI', ('in', 'jmp'))    # indirect branch (unconditional). 'jmp' is an immediate 16-bit jump table identifier (some implementations may just store the jump table in memory, in which case this can just be a pointer; implementations may use the vacant 8 bits of addr mode here). 'in' is the index into that table.

    (8, 'RET', ('in')),   # the 'in' is the value to be returned; note that, because this jumps to an address on the call stack, this allows a way to jump to addresses other than via BRI jump tables. todo, is there a way to make use of the other unused 2 operands? Should we add an operand for the jump target?

    (4, 'CMOV', ('out', 'in', 'in')),  # MOV 0 0 1 is NOP. Note that this is a memory-to-memory MOV. In MEDIUM mode, this is CMOV, a move conditional upon the third operand being 0. In SHORT mode, instead of CMOV, we only have two operands, and we split up the 4 bits from the third operand so that each of the other two operands have 4 data bits instead of 2 (16 registers instead of 4) (you can represent this as a CMOV with immediate 0 in the condition operand).

    (15, 'LOADI', ('out', 'imm')),

    (4, 'CMOV', ('out', 'in', 'in')),  # MOV 0 0 1 is NOP. Note that this is a memory-to-memory MOV. Conditional upon the third operand being 0.

    (15, 'LOADI', ('out', 'imm')),

---

too many instruction encodings:

""" here's what the signatures mean:

imm: immediate. The addr mode and data bits are combined.
imm2: two operands are combined to make a longer immediate. In MEDIUM, two operands are combined and the user may specify a 16-bit immediate. The 8 addr mode bits are unused and may be used by the implementation in an implementation-dependent way. In SHORT, the operands are combined and their data bits are also used in order to let the user specify an 8-bit immediate.
in: an input
in2: two operands are combined to make a longer input operand. In MEDIUM, two operands are combined and the user may specify a 16-bit immediate or an 8-bit other-mode operand. In SHORT, two operands are combined and the user may specify a 6-bit operand in any addressing mode.
out: an output
out2: like in2, but output
inout: both an input and an output

as seen in the table below, disregarding the in/out distinction, this means that there are 4 instruction formats allowed:

imm imm imm (only used in ANNOTATE, so this is a special case)
x imm2
x x x
x x2

including the in/out distinction, we have:

imm imm imm
in imm2
in in in
out in2
in out2
out in in
in inout inout

"""

PRIMITIVE_OPCODES = [ (0, 'ANNOTATE', ('imm', 'imm', 'imm')), # 'jmp' signature means concatenate two fields into a 16-bit immediate for use by the user, plus another 16-bit immediate for use by the compiler to deal with inlining expansion (1, 'JZ', ('in', 'imm2')), (2, 'JZI', ('in', 'in', 'in')) # indirect branch (conditional) and also RET. The arguments are; condition, target, return value. 'Return value' only acts as a return value if the target is a CALL instruction; otherwise it is moved into the ERR register. (2, 'JZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is zero) (3, 'JNZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is not zero) (4, 'MOVFROM', ('out', 'in2')), # MOV 0 0 1 is NOP. Note that this is a memory-to-memory MOV. (5, 'ADD-U15', ('out', 'in', 'in')), (6, 'CALL', ('out', 'in', 'in')), (7, 'LEQ', ('out', 'in', 'in')), (4, 'MOVTO', ('in', 'out2')), (9, 'SUB-U15', ('out', 'in', 'in')), (10, 'SYSCALL', ('out', 'in', 'in')), (11, 'CSWAP', ('in', 'inout', 'inout'), # if 'in' is non-zero, then read the second 'inout', then read the first 'inout', then write the second 'inout', then write the first 'inout'. If the inputs are both STACK mode on the same stack, this would have the effect of swapping them. Note that this is (a generalization of) a universal reversible logic gate (Fredkin gate). (12, 'CALL4', ('out', 'in', 'in')), # CALL the function held in register 4, passing two inputs and taking one output. This is provided so that SHORT mode can make first-class function calls. (13, 'LEA', ('out', 'in', 'in')), (14, 'EVAL1', ('out', 'in', 'in')), ]

---

so how can we simplify that?

we need imm2 so that the compiler has 8 bits free to deal with inlining expansion in relative jumps.

we need imm imm imm, or some other special case, for ANNOTATE

we need x x x

x x2 lets us do fat asymmetric MOVs in SHORT mode, to reach more than 4 registers

i guess we could get rid of x x2

but actually, MOVTO is obnoxious for the programmer: (9, 'MOVTO', ('in', 'out2')),

so change that to out2 in, making even another format

--- old stuff:

    (8, 'MOVFROM', ('out', 'in2')),  # MOV 0 0 is NOP. Note that this is a memory-to-memory MOV. 
    (9, 'MOVTO', ('out2', 'in')),

---

yknow i think it's still too complex. It would be better to waste bits and make the decoding simpler, since our goal here is easy porting, and implementations that care about efficiency will do something else anyways.

--- old stuff:

inw, outw: a third operand is cannibalized and split even between the other two. This is not very useful in MEDIUM mode, since there are still only 256 registers (although it does allow for 14-bit immediates), but in SHORT mode it lets MOVs address 16 registers instead of 4.
xw xw
out in2
out2 in

    (8, 'MOV', ('outw', 'inw')),  # MOV 0 0 is NOP. Note that this is a memory-to-memory MOV.

imm2: two operands are combined to make a longer immediate. In MEDIUM, two operands are combined and the user may specify a 16-bit immediate. The 8 addr mode bits are unused and may be used by the implementation in an implementation-dependent way. In SHORT, the operands are combined and their data bits are also used in order to let the user specify an 8-bit immediate.

---

ok i got rid of in2, inw, etc. Now there's just in, out, inout, imm, and imm2, and in imm2 the address modes are not used as data, they are just supposed to be sitting there at 0 (or being used by the implementation). I like this b/c now you don't have to write a special decoding case to treat the addr modes as data; the instruction implementation can just combine the operands it was given.

Could we do something similar for ANNOTATE? Just say that the addressing modes must be fixed at 0? I think it's not as important here because the ANNOTATE instructions are skipped at runtime (presuming we don't use them as prefix modifiers). In any case, they won't have implementations that must be called.

---

in fact, i'm going to highlight that 'imm' is only available to ANNOTATE, as a special case.

old:

    (0, 'ANNOTATE', ('imm', 'imm', 'imm')),

new:

    (0, 'ANNOTATE', ()),

old: (12, 'CSWAP', ('in', 'inout', 'inout'), # if 'in' is non-zero, then read the second 'inout', then read the first 'inout', then write the second 'inout', then write the first 'inout'. If the inputs are both STACK mode on the same stack, this would have the effect of swapping them. Note that this is (a generalization of) a universal reversible logic gate (Fredkin gate).

new: (12, 'CSWAP', ('inout', 'inout', 'inout'), # if 'in' is non-zero, then read the second 'inout', then read the first 'inout', then write the second 'inout', then write the first 'inout'. If the inputs are both STACK mode on the same stack, this would have the effect of swapping them. Note that this is (a generalization of) a universal reversible logic gate (Fredkin gate).

---

ok here's what i have now, i think this is much better, because now the decoder doesn't have to care about the format, the instruction implementation can do everything:

""" here's what the signatures mean:

imm2: two operands are combined to make a longer immediate. In MEDIUM, this allows the user to specify a 16-bit immediate; in SHORT, a 4-bit immediate. The addr mode bits must be set to mode 0 (immediate mode) in user code, but may be used by the implementation in an implementation-dependent way.
in: an input
out: an output
inout: both an input and an output

as seen in the table below, disregarding the in/out distinction, this means that there are 2 instruction formats allowed (excluding ANNOTATE, but that's a special case):

x imm2
x x x

including the in/out distinction, we have (excluding ANNOTATE):

in imm2
in in in
out in in
inout inout inout

Custom instructions must have one of these 4 signatures.

"""

PRIMITIVE_OPCODES = [ (0, 'ANNOTATE', ()), (1, 'JZ', ('in', 'imm2')), # conditional absolute (or jump table) jump (2, 'JZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is zero) (3, 'JNZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is not zero) (4, 'CALL', ('out', 'in', 'in')), (5, 'ADD-U15', ('out', 'in', 'in')), (6, 'MOVZ', ('out', 'in', 'in')), # CMOV 0 0 1 is NOP. Note that this is a memory-to-memory MOV. Conditional upon third operand; MOV takes place if it is zero. (7, 'LEQ', ('out', 'in', 'in')), (8, 'JZI', ('in', 'in', 'in')) # conditional indirect branch and also used as RET. The arguments are; condition, target, return value. 'Return value' only acts as a return value if the target is a CALL instruction; otherwise it is moved into the ERR register. # todo what goes here? (10, 'SUB-U15', ('out', 'in', 'in')), (11, 'SYSCALL', ('out', 'in', 'in')), (12, 'CSWAP', ('inout', 'inout', 'inout'), # if 'in' is non-zero, then read the second 'inout', then read the first 'inout', then write the second 'inout', then write the first 'inout'. If the inputs are both STACK mode on the same stack, this would have the effect of swapping them. Note that this is (a generalization of) a universal reversible logic gate (Fredkin gate). (13, 'CALL4', ('out', 'in', 'in')), # CALL the function held in register 4, passing two inputs and taking one output. This is provided so that SHORT mode can make first-class function calls. (14, 'LEA', ('out', 'in', 'in')), (15, 'EVAL1', ('out', 'in', 'in')), ]