An Efficient Implementation of SELF, a Dynamically-Typed Object-Oriented Language Based on Prototypes.
SELF: push self onto the execution stack
LITERAL <value index>: push a literal value onto the execution stack
SEND <message name index>: send a message, popping the receiver and arguments off the execution stack and pushing the result
SELF SEND <message name index>: send a message to self, popping the arguments off the execution stack and pushing the result
SUPER SEND <message name index>: send a message to self, delegated to all parents, popping the arguments off the execution stack and pushing the result
DELEGATEE <parent name index>: delegate the next message send to the named parent
NON-LOCAL RETURN: execute a non-local return from the lexically-enclosing method activation
INDEX-EXTENSION <index extension>: extend the next index by prepending the index extension
"The byte codes needed to express SELF programs fall into only three classes: base values (LITERAL and SELF), message sends, and non-local return." ((and index extension, which is encoding))
SEND is sorta just CALL, and SELF or another object sorta just provides a "context" of mutable state for the call; consider self.x used to load the current value of instance variable x onto the stack. for oot, this suggests a system with both multimethods and also a single stateful 'world' context for each call (by 'world' i refer to the undoable versionable state worlds)
are the other 'message' bytecodes needed? what exactly does the non-local return bytecode do?
"Smalltalk-80 defines a much larger set of byte codes [12], tuned to minimize space and maximize interpretation speed, and includes byte codes to fetch and store local, instance, class, pool, and global vari- ables and shortcut byte codes for common-case operations such as loading constants like nil, true, and 0. Smalltalk-80 systems also use special control flow byte codes to implement common boolean messages like ifTrue:ifFalse: and whileTrue:; the Small- talk parser translates the message sends into conditional and unconditional branch byte codes, open-coding the argument blocks. Similarly, the == message is auto- matically translated into the identity comparison primitive operation byte code. A similar optimization is included for messages like + and <, which the parser trans- lates into special byte codes. When executed, these byte codes either directly invoke the corresponding integer primitive operation (if the receiver is an integer), or perform the message send (if the receiver isn’t an integer). "
"Each byte code in the byte array represents a single byte-sized virtual machine instruction, and is divided into two parts: a 3-bit opcode and a 5-bit object array index. " eg for self.x, the five bits are the index of method 'x' within object 'self'.
i guess that's kinda like indirect indexed address mode; you could have just loaded a literal first and then done a SEND instead of putting the object array index in the instruction; my split indirect indexed addressing mode is a more address-modey solution to the same desire to include all this in one instruction, except that with Self, the object instance address is on the stack, NOT in the instruction.
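for concreteness, a sketch of decoding that byte format (the 3/5 split is from the quote; that the opcode sits in the high bits is my assumption):

def decode(byte):
    opcode = byte >> 5         # 3-bit opcode
    index = byte & 0b11111     # 5-bit object array index
    return opcode, index

# eg for self.x, index would be the position of 'x' in the object's array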
---
we can take some interesting cues from the RISC-V compressed extension set
https://en.wikipedia.org/wiki/RISC-V#cite_note-isa-3
https://riscv.org/wp-content/uploads/2015/11/riscv-compressed-spec-v1.9.pdf
RISC-V has at least 16 and (usually?) 32 integer registers and, optionally, 32 floating point registers. Memory is addressed as 8-bit bytes. The Compressed instruction encoding has 16-bit instructions. Instructions tend to have 32-bit, 64-bit, and 128-bit variants. C.LI, Load Immediate, has a 6-bit immediate. C.J, PC-relative jump, has an 11-bit signed immediate (+-2KB, which i think is +-1k instructions, since instructions are 16 bits). C.BEQZ has an 8-bit signed jump, or +-128 instructions. There is stack-offset addressing (meaning, read and write directly into the stack; there is no autoincrement, no PUSH/POP addressing). There are special instructions to add to/subtract from the stack pointer a 6-bit quantity in multiples of 16 bytes, and to get the stack pointer plus an 8-bit offset in multiples of 4 bytes. The latter suggests that (a) the stack pointer is usually 32-bit aligned, that is, on the stack are 32-bit integers, 64-bit integers and floats and larger, and pointers, and (b) there are typically no more than 64 local variables (in which case you can get the address of any local variable on the stack with one of these special add instructions), or maybe 4-128 (in which case you can adjust the stack pointer in the prologue or epilogue with one of the special stack pointer mutation instructions). According to [1], the non-compressed ISA "was designed to permit position-independent code. It has a special instruction to generate 20 upper address bits that are relative to the program counter. The lower twelve bits are provided by normal loads, stores and jumps." -- in other words another instance of 12 bits being used for something.
This is interestingly rather close to what we have, except that we support 16-bit alignment instead of 32, and our instructions are 64 bits instead of 16 (and with a more uniform/simple encoding) (and of course we aren't concerning ourselves with bit-width variants of instructions; and we are CISC not RISC (we have addr modes, not LOAD/STORE)), we support 12-bit immediates instead of 6, and we are supporting 2k local variables instead of 64 (or maybe 128). Note that if we had 12-bit operands with split mode, we'd have 64 instead of 2k, which fits in well with RISC-V.
copying from proj-plbook-plChMipsDlxIsa section 'Summary of Risc-V Compressed (16-bit encoding) opcodes':
MOVs and loads and stores and LOADK:
Jumps and branches:
Other stack-pointer-related:
Arithmetic and boolean logic:
Misc:
with our indirect-immediate split mode, we could do both of the load/stores here (with either the stack register or the target register), except without the scaling. And if we also had a 'register-immediate' mode (take the value held in a register, and add an immediate; that's your value, NOT your effective address), then we could do a PC-relative jump but with only 6-bit range (and unscaled); we could do the branch but with 6-bit range (and unscaled). Ordinary addition on the stack pointer can do the two 'other stack-pointer-related' instructions with our 3-address format and immediate and register addressing modes. Yes, we should have shifts, ADD, SUB, AND, OR, XOR, NOP, BREAK, and BAD.
So this suggests that we should have a split indirect-immediate mode for efficiency, even though we only get 6 bits for it. AND maybe a 'register immediate' split mode too (or even a 'PC plus immediate offset' non-split mode, to allow for 12-bit position-independent/PC-relative JMPs; or mb have a 4/8 split for a register immediate mode, to allow either the stack or the PC to be the base register for this).
actually, since JMP has two input arguments, if JMP just adds its two input arguments, then someone wanting to make a relative JMP with a 12-bit signed offset could just specify register direct mode for the first argument, and specify the PC for the register, and then specify immediate mode for the second argument. Which is what our current proposal has JMP doing anyways (in proj-oot-ootAssemblyNotes7). So i guess we're already done.
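a toy sketch of those JMP semantics (the register-file layout and which register holds the PC are placeholders here, not part of the proposal):

PC_REG = 0   # hypothetical: pretend the PC is register 0

def operand_value(regs, mode, v):
    if mode == 'imm':    # 12-bit signed immediate
        return v
    if mode == 'reg':    # register direct (regs[PC_REG] holds the PC)
        return regs[v]
    raise ValueError(mode)

def do_jmp(regs, m1, v1, m2, v2):
    # JMP just adds its two inputs: (reg PC_REG) + (imm k) is a relative
    # jump by k, while using immediates for both inputs gives an absolute jump
    regs[PC_REG] = operand_value(regs, m1, v1) + operand_value(regs, m2, v2)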
---
so it's interesting how little our ideas for oot assembly have changed since the proposal in proj-oot-ootAssemblyNotes7. The proposal initially had an encoding with 8 bits in the opcode operand, and 4 addr mode bits, and 2 flag bits; the current version has 12 bits in the opcode operand, and no flag bits, and 2 addr mode bits.
And we decided on 3 'optimization' addr modes that might be nice to hardcode instead of have as custom.
i'm currently feeling pretty good about this proposal, maybe it's finally stabilizing. We'll see, i've been wrong before.
Recall that the purpose of Ootasm is to make it easy to naively implement Oot on a new platform, by having an OotBytecode that is really easy to implement on any platform, and bootstrapping the rest of Oot in this. At the same time, Ootasm is supposed to retain enough of the high-level semantics to allow interop features to be provided below the OotBytecode interpreter runtime (eg by using Python dicts to implement Ootasm dicts, etc). Ootasm should make it easy to begin to optimize this implementation by swapping out parts of the Oot implementation for platform primitives (by having the Oot Bytecode interpreter replace calls to various custom instructions which are themselves written in crossplatform Ootasm with calls to platform-specific implementations using native platform primitives). Really advanced implementations of Oot on a given platform could bypass Ootasm altogether and implement Oot Core directly on the platform (see 'Why not just implement Oot Core directly?' below).
I feel like the current OotBytecode proposal would be pretty easy to implement in Python, for example.
---
still concerned about whether we can have relative jumps in conditional branches. In the current proposal we only have skips and absolute JMPs. The idea behind that was: as we inline-expand custom instructions, the relative spacing between instructions will expand, so (a) things that might be within relative jump range in the source code might be pushed out of it after inline expansions, and (b) it would be complex or inefficient to make the compiler rewrite affected relative jumps between each expansion, so the compiler is going to just assign each relative jump a label anyways and then resolve the labels once after all inline expansions.
But we only have room for 4k jump labels in each module's constant table. This isn't enough for all branches in the code. Which means that an inlining compiler must maintain a much bigger jump table, which means that (a) intermediate compilation stages wouldn't be valid Oot, and (b) in the end the jumps are going to have to be PC-relative, rather than absolute, anyways; and (b) also implies (c) that PC-relative jumps are currently limited to 12 bits. Maybe we want to add a JMPREL (relative jump) instruction, which is implicitly PC-relative, so the jump offset has 24 bits instead of just 12? 24 bits is 16 MB, which is not huge by conventional standards, but perhaps an okay largest size for a naively-compiled Oot compilation unit (i say 'naively compiled' because a really fancy compiler could split one Oot module into many Ootasm modules).
Seems to me (but i havent thought hard about it) that if, in all source code, near jumps are limited to an offset of 12 bits, and each inlining step at most doubles the distance between any two instructions, then we can do up to 12 inlining steps and still guarantee that the end result will be within an offset of 24 bits (each doubling adds at most one bit, so 12 bits + 12 steps = 24 bits).
Which means that if we limit the Oot Custom Instruction 'call stack' to a depth of 8 (or even 12; but 8 gives us some breathing room), we're good to go. We just specify that the Ootasm 'runtime' must be able to handle the 'JMPREL' opcode, but Ootasm 'source code' must never contain it.
(although i guess the compiler could also just STOP INLINING if the call stack gets deeper than that; but since this isn't a normal stack it would be nice to give it a predefined limit anyways).
---
list of 'magic' numbers/file format signatures https://en.m.wikipedia.org/wiki/List_of_file_signatures
---
in a way Oot 'Bytecode' is a misnomer, since we have 12 bits of opcode, not just 8 (that being said, the primitive ops are limited to 8 bits).
---
i think it's time to write an updated draft of the current proposal (currently found in proj-oot-ootAssemblyNotes7). Not much has changed, but i bet the second draft will be better.
---
yknow after writing the 'motivation' section below, actually i'm not so sure that we can get by without 'core oot'. The motivation below stresses that one of the reasons for this bytecode instead of others is so that it is high-level enough to produce readable code when compiled into an HLL. But if bytecode is flat/linear, then how is that true? Sure, we can reconstruct expressions via tracing use of the implicit stack, but that seems kind of backwards.
Maybe we should just have an AST and walk/execute the AST. This is probably less efficient than linear bytecode but the point here was to make a naive implementation, not an efficient one.
Anyway, didn't i read that Perl5 executed the AST, or maybe that was Ruby? (i think Perl6 does have bytecode though, ie MoarVM?)
https://www.reddit.com/r/learnprogramming/comments/w6i0p/bytecode_execution_vs_ast_walking_speed/c5arute gives a simple example of AST walking:
" Consider the trivial program
x = 3 + 2
It produces an AST along the lines of
ast = { op_code: OP_ASSIGN,
        name: 'x',
        expr: { op_code: OP_ADD,
                lhs: { op_code: OP_LITERAL, value: 3 },
                rhs: { op_code: OP_LITERAL, value: 2 } } }
So an evaluation loop for the trivial language would look something like the following, using a Pythonish pseudocode.
op_environ = {}
op_stack = []
result_stack = []
op_stack.push(ast)
while op_stack:
    op = op_stack.pop()
    switch op.op_code:
        case OP_ADD:
            op_stack.push(AUX_OP_ADD_END)
            op_stack.push(op.rhs)
            op_stack.push(op.lhs)
        case AUX_OP_ADD_END:
            rhs_result = result_stack.pop()
            lhs_result = result_stack.pop()
            result_stack.push(lhs_result + rhs_result)
        case OP_ASSIGN:
            op_stack.push(AUX_OP_ASSIGN_END)
            result_stack.push(op.name)
            op_stack.push(op.expr)
        case AUX_OP_ASSIGN_END:
            value = result_stack.pop()
            name = result_stack.pop()
            op_environ[name] = value
        case OP_LITERAL:
            result_stack.push(op.value)
yada... yada... yada... " (my note: i changed OP_ASSIGN_END and OP_ADD_END to AUX_OP_ASSIGN_END and AUX_OP_ADD_END)
as you can see, the AST walker implementation is not harder to implement or to read and can be organized into a big SWITCH just like bytecode. The 'implicit stack' is very explicit here and replaces our 'stack addressing' within expressions (which is nice, because it makes a clear separation between the 'implicit stack', which is invisible to the code and is an implementation detail of the AST walker, and the stack which is visible to the code).
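for concreteness, here is a runnable Python version of the same loop (a sketch; the AUX sentinel strings and dict-shaped nodes are just one way to do it):

def evaluate(ast):
    op_environ = {}
    op_stack = [ast]
    result_stack = []
    while op_stack:
        op = op_stack.pop()
        if op == 'AUX_OP_ADD_END':
            rhs = result_stack.pop()
            lhs = result_stack.pop()
            result_stack.append(lhs + rhs)
        elif op == 'AUX_OP_ASSIGN_END':
            value = result_stack.pop()
            name = result_stack.pop()
            op_environ[name] = value
        elif op['op_code'] == 'OP_ADD':
            op_stack.append('AUX_OP_ADD_END')   # runs after both children
            op_stack.append(op['rhs'])
            op_stack.append(op['lhs'])          # lhs on top: evaluated first
        elif op['op_code'] == 'OP_ASSIGN':
            op_stack.append('AUX_OP_ASSIGN_END')
            result_stack.append(op['name'])
            op_stack.append(op['expr'])
        elif op['op_code'] == 'OP_LITERAL':
            result_stack.append(op['value'])
    return op_environ

ast = {'op_code': 'OP_ASSIGN', 'name': 'x',
       'expr': {'op_code': 'OP_ADD',
                'lhs': {'op_code': 'OP_LITERAL', 'value': 3},
                'rhs': {'op_code': 'OP_LITERAL', 'value': 2}}}
assert evaluate(ast) == {'x': 5}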
it kind of reminds me of an SECD machine. Of course, we would add partial application stuff to our implementation, bringing it closer to an SECD machine.
Now what about compilation to more serial target languages like assembly (or maybe BASIC, which allows arithmetic expressions within lines of code, but not blocks of code such as while loops)? Porting Oot to these is a little harder because the person doing the port must do an additional compile step to replace subexpressions with a sequence of instructions using an in-program stack. But since this step is the same for every target language, we could provide a module to do it for them.
Or, maybe we should just include parens in the bytecode? But i guess that is basically a serialized AST...
So i guess what i am proposing is a unification of Oot Core and Oot Bytecode. This is a 'bottom-up' way of designing Oot Core. The notes files proj-oot-ootCore, proj-oot-ootCoreThoughts, proj-oot-ootCoreNotes1, proj-oot-ootCoreNotes2 contain a more 'top-down' approach, which will eventually have to be merged with this one.
If we unified Oot Bytecode and Oot Core, then the subset of 'primitive operations' in Oot Bytecode becomes another, smaller core language within Oot Core; it might be called Oot CoreCore (a less silly name would be Oot Core Primitives, but i think Oot Core Primitives is harder to keep straight than Oot CoreCore; ie i think ppl would confuse 'Oot Core' with 'Oot Core Primitives' but no one will confuse Oot Core with Oot CoreCore).
So i guess all this thinking about the Oot Bytecode encoding will eventually become just a fancy, well-thought-out AST serialization format. I think this way of thinking about it is somewhat valuable, however, as it allowed me to get a good feeling for how assembly-language constructs map onto the AST. As i've seen, assembly language is more determined by encoding constraints than i would have thought, so thinking about these constraints and designing an assembly-language-like AST serialization format with similar constraints first probably helps in making Oot Core something that will be easier to practically implement on a variety of platforms.
Now, what happens to our stack addressing mode if programs are an AST? I guess it stays (we can use it for the call stack and/or for a data stack holding local variables and/or passing parameters and return values across function calls). And how do we actually encode subexpressions? A naive AST in a language like Python would probably just give each node an arbitrary-length list of references to child nodes, so at the low level a node would probably be represented in memory as an object which holds a pointer to a dynamically-sized list of pointers. But our Oot Bytecode was designed so that each instruction had exactly 3 operands. So instead of variable-length lists, each node is a fixed-size struct (64 bits, to be exact).
Note that immediately-specified conditionals and jumps in our current scheme use 12-bit relative offsets or absolute node IDs (locations in the code stream), as opposed to an AST, which would probably use native pointers between nodes.
Currently, the way you'd express the expression "x + y" would be a sequence of statements, PUSH x; PUSH y; + (stack) (stack);. But an AST expresses it via a node + that has pointers to the nodes for "x" and for "y". So how do we update our stack addressing mode to do this? We could add an additional SUBEXPR addr mode for subexpressions. Instead of native pointers, each node could hold a relative offset to other nodes.
Another thing about executable ASTs is that they seem to tend towards a LISP-y or Haskell-y idea of everything-is-an-expression. By contrast, one thing about assembly and flat bytecode is that it tends towards serial sequences of statements. How would a sequence of statements be represented in an AST? Possibly either as a 'block' node with arbitrarily many children (like Lisp PROGN), or with each node having a 'successor' pointer. But we'd rather do it the assembly way; with sequences just being successive entries in the flat array of statements.
In other words here's how i see this working. Consider encoding something like the following:
x = 2
y = 3
z = x + y
it would (still) be:
CPY x, 2   # where x is a register, and 2 is immediate 2
CPY y, 3
ADD z, x, y
whereas
x = 2
y = 3
z = 5*x + 7*y
would be:
CPY x, 2
CPY y, 3
ADD z, *SUBEX1, *SUBEX2   # where * is 'subexpression addressing mode', and SUBEX1 represents the relative distance to the SUBEX1 label
J CONTINUE
( SUBEX1:
    MUL RESULT*, 5, x     # where postfix * is 'result of subexpr' addr mode
)
( SUBEX2:
    MUL RESULT*, 7, y
)
CONTINUE: ...
(note: i guess you don't really need the '('s in the instruction stream, only the ')'s; and in fact not even those, as the subexpression terminates when RESULT* is encountered; note that this would mean that it must be verified that every control flow path through the subexpression terminates with a RESULT*; to avoid having to do that verification, you could just include the ')'s as a fallthru default (if one is encountered, you return NIL))
(note: 'RESULT*' could just be '0' in subexpr addr mode)
(note: in a strict language, you'd always need to execute SUBEX1 and SUBEX2 before the ADD, so you'd want to put them before the ADD and execute them upon encountering them, postfix style; but in a lazy language, you need to see the ADD first. So this way is more general, although more inefficient, since to execute SUBEX1 and SUBEX2 you're jumping all over the place (which is how the AST tree walking always works, however; it encounters the top-level of an expression first, then expands it top-down). If you want more efficient execution of small subexpressions, could use SHORT encoding)
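to make the subexpression addr mode concrete, a toy Python evaluator for it (the operand tagging, and using a special RESULT destination to stand for postfix-*, are my own guesses at a concrete encoding):

RESULT = 'RESULT*'   # sentinel destination meaning 'this is the subexpression's value'

def eval_operand(prog, regs, pc, operand):
    kind, v = operand
    if kind == 'imm':              # immediate
        return v
    if kind == 'reg':              # register direct
        return regs[v]
    if kind == 'subexpr':          # '*SUBEXn': v is the relative distance to the label
        return run_subexpr(prog, regs, pc + v)
    raise ValueError(kind)

def run_subexpr(prog, regs, pc):
    while True:
        op, dst, a, b = prog[pc]   # just MUL and ADD in this toy
        x = eval_operand(prog, regs, pc, a)
        y = eval_operand(prog, regs, pc, b)
        val = x * y if op == 'MUL' else x + y
        if dst == RESULT:          # RESULT* terminates the subexpression
            return val
        regs[dst] = val
        pc += 1

# so when the main loop hits "ADD z, *SUBEX1, *SUBEX2", it calls eval_operand
# for each source, which runs the out-of-line MULs and returns 5*x and 7*y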
So, this is a pretty neat thing we're contemplating; kind of a superpowered assembly that can include expressions and has a 'subexpression addressing mode', or a superpowered AST that can include linear sequences without the overhead of a 'next' pointer in every node, with relative addressing instead of native pointers, and with a concise fixed-length node format.
Also, there is a unique obvious linearization of this, so if you do need to convert it into flat bytecode, there is an obvious way to do that (convert the subexpression addr mode into stack ops).
note: for this reason, register '1' might want to be reserved! (so that it can be used as the stack pointer for the implicit stack in linear bytecode)
note: so is non-local control flow disallowed within subexpressions? Probably not; since in languages like eg Scheme, everything is a subexpression, this would make it hard to allow continuations. Also, you could always CALL something from within a subexpression that generates an exception that you have to bubble up past the original call.
if non-local control flow is allowed, then you might want to expose the 'implicit' stack to the program so that it can eg save it if it is serializing its state when a continuation is reached. Which means that you may just want to combine it with the data stack.
i'm having second thoughts about whether this is precisely Oot Core, though. Does Oot Core really address conditionals in terms of 12-bit relative branches? It seems to me that Oot Core is more of an ideal data model for an AST, possibly with some attached constraints, and that this bytecode is just one representation of that data model, such that the physical constraints of this representation are reflected in the constraints attached to the Oot Core specification. Also, should those constraints even be there? Philosophically, the idea that "2k functions is enough for one module" and "2k local variables is enough for one function" sound similar to "2 bytes is enough for one number"; but i thought we wanted arbitrary-precision arithmetic at the Oot Core level. So maybe these limitations should apply to the bytecode format but not to Oot Core, in which case there is a distinction between them.
More glaringly, Oot Bytecode doesn't assume that the platform has garbage collection and schedulers, and so provides these, although they can be swapped out; so there may be explicit calls to GC and scheduler stuff all over the place, right? But Oot Core probably just assumes these things. All of which suggest that Oot Assembly is, actually, different than Oot Core.
another issue: shouldn't the interpreter implicitly perform read and write barriers for us when we access the heap through the ordinary addressing modes? although it could call Oot Bytecode Stdlib routines to do this.
Another issue: What about polymorphism? We've kind of swept that under the rug. The reasoning for having just ADD, rather than ADD8, ADD16, ADD32, ADD64, UADD8, ADD_RATIONAL, etc, was that (a) we want to use the numeric types of whatever platform Oot is running on, so we can't guarantee that any particular type of number is available, (b) sometimes we really do need late bound polymorphic ADD (eg when ADDing OOP instances), and (c) it's simpler; if Lua can do that, why not us?
But this means that either a type tag is stored at runtime for each value; or that at compile time the program is compiled down to a level below Oot Bytecode, a level at which ADD is resolved into things like ADD8, with type annotations in the Oot Bytecode being used to statically choose; or that Oot Bytecode contains a prefix annotation before each ADD saying what the type is, which would effectively make ADD instructions 128 bits, not 64 (or i suppose we could have a 'default' type for each operation (eg INT16 for ADD) and prefix instructions for others).
Also, the HLL might say 'i promise this value is always < 2^16', in which case 16-bit integers can be used; but the HLL might also say, 'i need wraparound at 2^16!' in which case 16-bit integers MUST be used, or at least emulated.
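(emulating that wraparound is cheap on any platform whose integers are at least 16 bits wide; a one-line sketch:)

def add16_wrap(x, y):
    # guaranteed wraparound at 2^16, whatever the platform's native width
    return (x + y) & 0xFFFF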
re: polymorphism; i guess the idea is that we do the usual thing (ADD32, ADD64, UADD32, etc; that is, a different opcode for each specialized method), except that the primitive ADD (or UADD0 or something like that) isn't any of these; it's UADD015:
In order to avoid making assumptions about what happens upon overflow, the Oot implementation promises not to call UADD015 unless it can guarantee that the result will be < 2^15. Either it is known ahead of time that the result is < 2^15, or USUB015 is first used to check that ((x < 2^15 - 1 - y) AND (y < 2^15 - 1 - x))
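a Python sketch of that contract (raising OverflowError is just a stand-in for however the implementation reports a failed guard):

MAX15 = 2**15 - 1

def uadd015(x, y):
    # the one primitive the porter must supply; only ever called when the
    # caller has guaranteed x + y < 2^15, so a bare platform add suffices
    return x + y

def guarded_add(x, y):
    # the check from above (the real thing would use USUB015, not Python comparisons)
    if x < MAX15 - y and y < MAX15 - x:
        return uadd015(x, y)
    raise OverflowError('result not guaranteed < 2^15')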
Super-slow, yes, but also makes it super-easy for the implementer to get their first naive implementation running on a new platform. Later on, more advanced implementations can override ADD32, etc, with platform primitives.
hmm otoh maybe the interpreter should support a polymorphic ADD instruction. This would leave open the possibility of doing dynamic dispatch on boxed values. On primitive values, the interpreter would consult its stack/memory type maps to see what type something is, and then dispatch to that. This could be done at runtime; or the interpreter could specialize the ADD at load time based on the stack/memory map.
The advantage here is it allows generic code to be written (eg you can have an ADD function that you can pass to MAP that works on a list no matter whether the numbers to be added are 015 (ie < 2^15), 16-bit, 32-bit, etc). The disadvantage is slowness, unless the interpreter bothers to consult its memory maps and pre-specialize.
Let's do it.
So, the Oot Bytecode implementer still only has to implement UADD015, and the dynamic dispatch in ADD. But when they implement the interpreter, they have to deal with the dynamic dispatch.
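a sketch of that dispatch (the (tag, payload) boxing and the table layout are my assumed encoding, not part of the proposal):

def uadd015(x, y): return x + y                  # caller guarantees result < 2^15
def uadd32(x, y): return (x + y) & 0xFFFFFFFF    # later overridden with a platform primitive

ADD_IMPL = {'u15': uadd015, 'u32': uadd32}

def poly_add(a, b):
    # values are boxed as (tag, payload) pairs; dispatch on the runtime tag
    # (or pre-specialize at load time using the stack/memory type maps)
    (tag_a, x), (tag_b, y) = a, b
    if tag_a != tag_b:
        raise TypeError('mixed-type ADD needs coercion first')
    return (tag_a, ADD_IMPL[tag_a](x, y))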
Maybe we should also make a rule that a directly-addressable memory location cannot change its type during one function; although it can be of type DYN.
Maybe type maps should have space for 2k per-module types, plus 2k type variables (used in generic functions).
---
why not support 8-bit, eg UADD07? Because
even on systems which don't have native instructions for adding and subtracting 16-bit numbers, it's often easy to do so. Eg, on the 6502 [2]:
SEC             ; RESULT = NUM1 - NUM2
LDA NUM1L       ; After the SBC NUM2H instruction:
SBC NUM2L       ;   V = 0 if -32768 <= RESULT <= 32767
STA RESULTL     ;   V = 1 if RESULT < -32768 or RESULT > 32767
LDA NUM1H
SBC NUM2H
STA RESULTH
---
the primitive operations are not truly minimal; for example, we have a USUB0 operation even though you could build subtraction out of addition and conditional branching. Another example: if you have branch-if-zero and jumps you can make branch-if-nonzero (BNZ x, target becomes BZ x, skip; JMP target; skip: ...).
hmm otoh mb we SHOULD make it more minimal just to drive home the point that implementors really really should make a not-100%-naive implementation ASAP.
---
another random note: i guess ideally LONG format would not be just a random semantic-less serialization format for future DSLs and languages, but rather a certain way of expressing predefined concepts such as prefix instructions and addressing mode flags which has a fixed semantics in terms of more unwieldy MEDIUM-format code constructs. In other words, i want to give LONG a semantics in terms of specifying how to automatically compile any LONG code into MEDIUM code.
---
we need a way to pass simple instructions as 'first-class functions', too!
one idea is to make a 'builtins' module which resides at constant 1 of every constant table, but i prefer the next idea:
another idea, which i currently like best, is to just interpret immediate-mode arguments given where a function is expected as opcodes for instructions
(eg if opcode 4 is CPY, and the first argument of a MAP instruction expects a function, then by giving immediate-mode 4 for that argument, CPY is indicated as the fn).
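a toy sketch of that convention (the operand tagging and MAP signature are hypothetical):

OPCODES = {4: lambda x: x}   # pretend opcode 4 is CPY, which just copies its source

def map_instr(fn_operand, xs):
    kind, v = fn_operand
    if kind == 'imm':        # immediate 4 in function position means 'use CPY as the fn'
        fn = OPCODES[v]
    else:                    # other addr modes would fetch a real first-class function
        raise NotImplementedError
    return [fn(x) for x in xs]

# map_instr(('imm', 4), [1, 2, 3]) == [1, 2, 3]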
---
old verbiage:
Oot Bytecode is a simple, easy to implement virtual machine. Oot Core is implemented in Oot Bytecode. This makes it easy to bootstrap a naive implementation of Oot; all you have to do is to implement Oot Bytecode.
Oot Assembly is a textual language which corresponds to a binary representation called Oot Bytecode. Oot Bytecode consists of a very small and easy to implement core language (Oot Bytecode Core), and the Oot Bytecode Standard Library, which is itself written in the Oot Bytecode core language.
In order to port Oot to a new platform, all you have to do is to implement an interpreter or compiler of Oot Bytecode Core (plus whichever OS services, such as file access, you want to be able to use from Oot). The reference implementation of Oot Core is provided in Oot Bytecode, and the reference implementation of Oot can be bootstrapped from Oot Core.
Why did we create a new language Oot Core?
Existing popular target languages were judged to be too laborious to implement or too low-level. But why do we require our target language to be easy-to-implement and high-ish-level?
Oot is intended to be a cross-platform 'glue language' and 'scripting language' that can be compiled not only into various lower-level target platforms (for example, GNU/Linux, MacOS, Windows, Android, iOS, JVM, CLR, Webassembly, LLVM), but also into various other high-level languages (HLLs) (for example, Java, Javascript, C, Python). It is intended to be able to smoothly interoperate with code written in these languages (to call and be called by foreign code, to pass data to and from foreign code, and to be compilable to reasonably readable foreign code); the idea is that if you already have a project in some other language, but there is a library in Oot that you would like to use, then you should be able to extend your project by scripting in Oot; and if you already have a library in some other language but you want to write a project in Oot, then you should be able to use that library from Oot.
Because Oot wants to live on many different platforms, Oot's implementation must be easy to port. Furthermore, for the sake of interoperability, Oot should be able to run 'on top of' various other platforms, making use of the host platform's data types (such as strings, lists) and services (such as garbage collection and greenthreads).
The idea is that an interpreter or compiler and runtime for Oot Bytecode Core should be really easy to implement naively on a new platform. Many existing target languages are too large to be really quick to port.
The Oot Bytecode standard library then uses the Oot Bytecode Core primitives to implement basic data structures, control flow, and services such as lists, while-loops, garbage collection, greenthreads etc. At this point, the implementer can bootstrap a naive implementation of Oot on the new platform.
Later, in order to provide better interoperability or performance, the implementer can incrementally 'swap out' the default Oot Bytecode standard library implementations of things like lists, while-loops, garbage collection, greenthreads, etc, for platform-native implementations.
So, we want almost all of the data types, control flow, and services that the host platform might provide to be able to live 'below' the Oot Bytecode layer if desired, being implemented by the Oot Bytecode implementation using native platform primitives (even though, in an initial naive Oot Bytecode implementation, these things are provided by the Oot Bytecode Standard library). If we just compiled everything to, say, LLVM, then, say, while-loops would no longer be easy to recognize when expressed as LLVM assembly. This would make it harder to compile Oot to readable code in a target HLL later on. Instead, we want a while-loop to retain its identity as a while-loop. We could compile to LLVM and introduce a complex system of annotations on top of the LLVM bytecode, but that's almost as much work as creating Oot Bytecode.
Why not just implement Oot Core directly? Really advanced Oot implementations on some platforms could bypass Oot Bytecode altogether and provide an interpreter or compiler directly from Oot Core to the host platform; that will probably yield the best results. But that's slightly more work, and we want to make it really easy for people to quickly write naive ports even when they don't have the time or expertise to implement Oot Core.
---
should we mb have 2k memory locations instead of 4k, allowing the interpreter to use the other 2k for something else?
also limit depth of 'subexpression' address mode implicit nesting to 2k
---
so i guess each module must have a polymorphic function lookup table for functions that it provides?
---
todo: update the following for the AST walking stuff
---
RISC-V's memory model looks pretty simple, mb we should use it. It's 'relaxed', with two FENCE instructions: FENCE (for data) and FENCE-I (flush instruction cache).
FENCE is parameterized by 8 bits, P(I,O,R,W), S(I,O,R,W) which indicate a 'predecessor set' and 'successor set' of operation types. "no other RISC-V thread or external device can observe any operation in the successor set following a FENCE before any operation in the predecessor set preceding the FENCE". Whether a read/write is considered R/W or I/O is implementation-dependent.
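a sketch of packing the two sets into those 8 bits (i believe each set is ordered I,O,R,W from high bit to low, but check the spec):

def fence_bits(pred, succ):
    # pred and succ are subsets of {'i', 'o', 'r', 'w'}
    def pack(s):
        return (('i' in s) << 3) | (('o' in s) << 2) | (('r' in s) << 1) | int('w' in s)
    return (pack(pred) << 4) | pack(succ)

# fence_bits({'r', 'w'}, {'r', 'w'}) == 0b00110011, a data-only full fence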
---
RISC-V places their FORMAT bits in LSB, not MSB, so maybe we should do the same:
" We chose little-endian byte ordering for the RISC-V memory system because little-endian sys- tems are currently dominant commercially.... We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that the length-encoding bits always appear rst in halfword address order. This allows the length of a variable-length instruction to be quickly determined by an instruction fetch unit by examining only the rst few bits of the rst 16-bit instruction parcel. Once we had decided to fix on a little-endian memory system and instruction parcel ordering, this naturally led to placing the length-encoding bits in the LSB positions of the instruction format to avoid breaking up opcode elds. " [4]
RISC-V encodes sign with the MSB, i think:
"Sign-extension is one of the most critical operations on immediates (particularly in RV64I), and in RISC-V the sign bit for all immediates is always held in bit 31 of the instruction to allow sign-extension to proceed in parallel with instruction decoding."
---
here's what RISC-V says about overflow:
" We did not include special instruction set support for over ow checks on integer arithmetic operations, as many over ow checks can be cheaply implemented using RISC-V branches. Over- ow checking for unsigned addition requires only a single additional branch instruction after the addition. Similarly, signed array bounds checking requires only a single branch instruction. Over ow checks for signed addition require several instructions depending on whether the ad- dend is an immediate or a variable. We considered adding branches that test if the sum of their signed register operands would over ow, but ultimately chose to omit these from the base ISA. "
however [5] says this is a terrible idea and that HLL performance is dominated by overflow checks.
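for reference, their unsigned-add recipe sketched in Python (32-bit width chosen arbitrarily):

MASK32 = 0xFFFFFFFF

def uadd32_checked(x, y):
    t = (x + y) & MASK32   # the add, with 32-bit wraparound
    if t < x:              # the single extra branch: it wrapped iff sum < an operand
        raise OverflowError
    return t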
---
i guess our 'types' are more like Haskell typeclasses, because we can have 'abstract' ones like INT?
---
(text of current proposal moved to ootAssemblyThoughts)
Isn't it really slow to check all the time if ((x < 2^15 - 1 - y) AND (y < 2^15 - 1 - x))?
Yeah, but this allows the implementer to just call the platform primitive in UADD015 without doing extra work to communicate a detected overflow using Oot's conventions. This makes it dead easy to get an initial naive port running. In addition, often we will know ahead of time anyways that the result is < 2^15, in which case we just want to do the addition without extra unnecessary work. When we don't know ahead of time, Oot code will be using higher-level functions like UADD16 and ADD16, which almost all implementations will be overriding and implementing using platform primitives anyways; only in the most naive implementations will the default implementations, which do these checks via subtraction, be used.
(also, it may not be that inefficient; i think there was even an ISA that took a similar approach; was it RISC-V?)
---
Why only 2k directly-accessible memory locs, rather than 4k, since immediate memory addresses are 12 bits?
B/c the other 2k immediate address values are reserved for use by the implementation in intermediate representations.
---
i guess we don't strictly need prefix modifiers because we can have custom instructions that examine the next instruction and execute it in a wrapper
---
has anyone else tried to make a polymorphic assembly language? Yes, apparently "TALx86". Should read about it mb:
https://www.cs.cornell.edu/talc/papers/talx86-wcsss.pdf
---
[6] agrees with me that "Interop between languages needs to happen at a level far higher than abstract assembly to get any meaningful benefit. "
---
design choice commentary:
It's kinda weird that we have limits like 2k 'virtual registers', and all these other 2k limits. Many programming languages have 64k limits. LLVM has infinite virtual registers (and we're supposed to be 'higher level' than LLVM!). Why didn't we do that?
What LLVM does is it has a scheme for representing arbitrarily-large integers. At first i wanted to do that too, but i decided not to, because (a) such a scheme is more complicated to implement, and (b) besides the effort spent directly in implementing this, this means that other parts of the implementation, for example any program analysis, cannot assume an upper bound on how many registers there are; so you cannot for example have a 2-byte integer 'register label' anywhere in your implementation, because what if more than 2^16 registers are used? Since one of our primary goals is that this be easy to port, via making it easy to implement, i judged that this additional implementation complexity was not worth it.
Now, why not 64k limits instead of 2k? If we had 64k limits with our 4-address format, then we'd have 64 bits per instruction just to hold the operands. If we want to 64-bit align our instructions, that's 128 bits per instruction, which is much higher than other bytecodes (even 64 bits is higher than most 'high level' bytecodes). And, whether or not we jump all the way to 128 bits, most of the time most of the 64 operand bits will just be 0s, because it will be very rare that any of these values goes above 2k. I just couldn't stand that amount of waste, even though this is premature optimization; and since we can get close to 64 bits, i think it is cleaner to stick close to a power of 2 (and there are probably some efficiency benefits to that as well).
2k is still much larger than most other assembly languages (and most of those operand bits will still be 0s), and yet much smaller than the 64k limits on most HLLs. Is this a Goldilocks perfect compromise, or a useless system that tries to serve multiple use-cases and ends up doing neither well? We'll see.
---
One other thing we could do is to also provide a compiler from the textual Oot Assembly into Oot Bytecode which removes most of the limits; eg it internally uses structs in functions which have more than 2k locals, and breaks apart modules which have more than 2k functions or constants.
---
"LLVM is a simpler language to target than traditional assembly languages—it has nice features like infinite registers and recently support for stack frames. " [7]
---
"Rust compiles into LLVM because anything higher-level wouldn't provide sufficient control over the processor and memory." [8]
"a lot of languages that compile to assembly do so because it has the fewest opinions on how your program should run" [9]
---
[10] advocates a Rust compile-target for interoperability:
" 1. Rust has an advanced type system. Interoperability between languages is limited because often times the least common denominator between two programming languages is at the assembly level...
2. Rust has several useful programming constructs that don't exist in LLVM. A lot of ideas we take for granted in our programming languages like loops, closures, and enums (sum types), have to be re-implemented every time a new language compiles to LLVM. By compiling a language to Rust, not only do you get access to these constructs, but in my mind it's generally far easier to write the code generator than generating LLVM by hand, particularly with quasiquotations.
3. Rust comes with a package manager, Cargo...If new languages compiled to Rust, not only would they get a package manager for free, but they also get free access to programs written in every other language that compiles to Rust.
4. This means writing a language that compiles to Rust gets all existing Rust libraries potentially for free—no need to rewrite HTTP servers or command-line parsing for the millionth time.
5. Rust only imposes one opinion on your program: that it should be safe. I would argue that the only real difference at the runtime level between programs you can write in LLVM and programs you can write in Rust is that Rust requires all of your programs to be safe, i.e. not have memory errors/data races. Rust does not require you to have dynamic types, garbage collection, or really any particular runtime overhead—all of the hard work is done at compile time. "
for Oot, though, having Rust be the only compile target is insufficient, b/c afaict Rust only targets what LLVM can target [11], which probably doesn't include transpilation to Lua etc, at least not without losing HLL constructs (also i think writing a new LLVM backend is probably too complicated).
i guess it should be ONE of our compile targets though. If Rust becomes the new C then it will become the default interop compile target.
---
All i know is that (a) it'll mostly look like assembly (but are there subexpressions, blocks?), and (b) there will be some compile-time macros and possibly other compile-time metaprogramming stuff
I guess insofar as there are a bunch of less important/bikesheddy syntax decisions, we should follow the syntax of one or more existing popular assembly-like textual formats. Let's survey what these are:
In summary, the more commonly mentioned ones (plus newer standards that i think we should consider) are: