notes-computer-jasper-jasperAssemblyThoughts

not sure if this should be part of Jasper Core, of Jasper itself, or neither.. anyways, i've been thinking about a sort of 'augmented assembly' language for statements (as opposed to expressions). The basic idea is 'polymorphic assembly'. Ideas:

Polymorphic: instead of having different instructions for different types, you have polymorphic instructions. Whether this maps to polymorphic opcodes, or whether it is compiled to a lower-level bytecode which has a different opcode per type, i'm not sure.
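To make the polymorphic-instruction idea concrete, here is a minimal interpreter sketch (purely illustrative, not the actual Jasper design) in which a single ADD opcode carries a type field that selects the implementation, instead of there being one ADD opcode per type:

```
# Minimal sketch (hypothetical): one polymorphic ADD opcode whose behavior is
# selected by a type field carried in the instruction, rather than a
# lower-level bytecode with one opcode per type.
from dataclasses import dataclass

@dataclass
class Insn:
    opcode: str      # e.g. 'ADD'
    typ: str         # e.g. 'i32', 'f64' -- type is orthogonal to opcode
    a: int
    b: int
    c: int           # destination

IMPLS = {
    ('ADD', 'i32'): lambda x, y: (x + y) & 0xFFFFFFFF,
    ('ADD', 'f64'): lambda x, y: x + y,
}

def execute(insn, mem):
    op = IMPLS[(insn.opcode, insn.typ)]      # dispatch on (opcode, type)
    mem[insn.c] = op(mem[insn.a], mem[insn.b])

mem = {0: 7, 1: 5, 2: None}
execute(Insn('ADD', 'i32', 0, 1, 2), mem)
assert mem[2] == 12
```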

User defined types

user defined opcodes, like Parrot

user defined addressing modes. Addressing modes are like expressions, except that the only expressions are addressing modes, and each one must be defined once in a precompilation phase. Maybe, to make the language almost pointer-safe, raw pointer arithmetic could be allowed only within addressing mode definitions; otherwise use type descriptors ("stream descriptors"), e.g. with each data type is associated an element size, and with each data object, a length. This means that you should also be able to access the metainformation in the type descriptors (e.g. the element size) when you are defining custom addressing modes; see the sketch below.
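As a rough illustration of the pointer-safety idea, here is a hypothetical sketch; the names (StreamDescriptor, element_size, indexed) are made up, and the point is only that raw pointer arithmetic lives inside the addressing-mode definition, which can consult the type descriptor's metainformation:

```
# Hypothetical sketch of a user-defined addressing mode.  All names here are
# invented for illustration; the idea is that raw pointer arithmetic is
# confined to the mode definition, which can consult the type descriptor's
# metainformation (element size, length).
from dataclasses import dataclass

@dataclass
class StreamDescriptor:
    base: int           # base address of the data object
    element_size: int   # bytes per element, from the type descriptor
    length: int         # number of elements in the object

def indexed(desc, i):
    """Custom addressing mode: address of element i, with a bounds check."""
    if not 0 <= i < desc.length:
        raise IndexError("out of bounds")
    return desc.base + i * desc.element_size   # the only place arithmetic is allowed

arr = StreamDescriptor(base=0x1000, element_size=4, length=10)
assert indexed(arr, 3) == 0x100C
```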

macros for generating inline, as opposed to called, subroutines

as for expressions, it seems like the point of a bytecode is to spell out the operations imperatively, and for each instruction to be made of parts that take up a roughly fixed amount of space ( http://onlinedisassembler.com/blog/?p=23 points out that x86 allows an arbitrary number of 'prefixes' within each instruction, and i think that even modulo prefixes, x86 instructions are of variable length; but i think their components are of fixed length ). so i don't think we want arbitrary expressions in jasper assembly, except as they are predefined via user defined addressing modes, as suggested above

design todos

addressing modes

immediate addressing mode (constants) seems like a given

register addressing mode isn't that different from absolute indirect addressing mode if we are thinking of registers as just 'i'm using this soon' annotations

register indirect addressing mode seems like the * operator in C

register indirect with (immediate or register) offset addressing mode seems like accessing a structure's items, or like accessing an array element. in Jasper we don't want to expose the linear memory layout (except thru the FFI), so we want to have array/structure access but not actually adding offsets to pointers

scaling offsets seems applicable only if you are exposing the linear memory layout

increment and decrement are identical if you allow the user to specify the amount of the increment and allow this amount to be negative (i think ARM does this, but i'm not sure we want that, though)

i'm not sure we want pre/post increment/decrement because this adds side-effects at the addressing mode level, which breaks the concept of addressing mode as (only) an expression, and also breaks the concept of addressing mode as (only) a way of pointing to things

addressing modes seem reminiscent of the need in jasper to be able to talk about either a pointer itself or the thing it points to (and to talk about an individual copy of a synced thing vs. the synced thing itself, which is sorta the same, i guess), and to sometimes defeat the __get and __set methods and access data directly, but usually to go through __get and __set.

ISA encoding ideas

Proposal for opcode table: allow multiple custom opcode lookup tables in memory. This allows an interpreter to maintain computed goto versions of itself for each custom opcode table, which will execute faster. Now have a command to switch the current table (or, actually, i think this should usually be lexically scoped rather than dynamic?). In addition, have a command that says 'execute the specified command in the context of the specified opcodetable, then skip the next instruction'. This can be used for the effect of executing an instruction which is not currently assigned an opcode. Also, it seems we have an operand to spare there (one operand giving the location of the command to execute (usually the next instruction); one operand giving the desired codetable; and a third operand which would normally be a result, but is not needed now). Perhaps the codetables have a parameter which can be set here? This command does not need to be in the always-present 16-command set because it is just a way of allowing the user to have more than 16 custom commands usable without switching codetables; if the user needs that they can make it one of their custom commands.
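Here is a toy sketch of the multiple-codetable idea, with a made-up XTABLE instruction that executes a given instruction under another table and then skips it; the instruction word layout and opcode numbers are invented for illustration:

```
# Sketch (toy 4-tuple instructions, invented opcode numbers) of multiple opcode
# lookup tables plus an XTABLE instruction that executes one instruction under
# another table and then skips that instruction slot.
def make_tables():
    def add(st, a, b, c): st['mem'][c] = st['mem'][a] + st['mem'][b]
    def sub(st, a, b, c): st['mem'][c] = st['mem'][a] - st['mem'][b]
    return {0: {1: add}, 1: {1: sub}}      # table 0: opcode 1 = ADD; table 1: opcode 1 = SUB

def run(program, tables):
    st = {'mem': [3, 2, 0, 0], 'pc': 0, 'table': 0}
    while st['pc'] < len(program):
        opcode, a, b, c = program[st['pc']]
        if opcode == 255:                   # XTABLE: a = target table, b = offset of instr to run
            target_op, ta, tb, tc = program[st['pc'] + b]
            tables[a][target_op](st, ta, tb, tc)
            st['pc'] += b + 1               # ... then skip the borrowed instruction
        else:
            tables[st['table']][opcode](st, a, b, c)
            st['pc'] += 1
    return st['mem']

prog = [(255, 1, 1, 0),    # run the next instruction under table 1 (so it acts as SUB), then skip it
        (1, 0, 1, 2),      # opcode 1: ADD under table 0, SUB under table 1
        (1, 0, 1, 3)]      # plain opcode 1 under the current table (ADD)
assert run(prog, make_tables()) == [3, 2, 1, 5]
```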

special purpose registers: PC, data stack register, instruction stack register, link register, parameter register? what else? how do we represent these if 'register mode' is just memory? where is the stack stored? is it the same as registers?

note: if the PC is an accessible register and if you had a zillion operands, then you wouldn't need BLE and BNE per se; you could have 'if A <= B then put C into D'. But unless the PC were register 0 or 1, you couldn't do this in tiny mode, which would cripple tiny mode.

todo?: multiple stacks in place of registers

ok we're getting close to a good basic polymorphic assembly, but we need to add in all the jasper-y things..

also we need to take out stuff that only makes sense with linear memory mapping, and replace it with HLL constructs

ok, the yuck about having the weird format for 4-bit pointers and having to convert between 4-bit and 16-bit pointers; really that's because we're exposing the linear memory model, and we're exposing the internal representation of pointers, neither of which we should do.

it might be time soon to stop the exercise of worrying a lot about bit widths and instruction encoding compactness (code density) and everything being aligned to a power of two; this is a useful exercise for enforcing simplicity through a design constraint, but ultimately it's the simplicity we're after, not the code density; a very good high-level assembly will probably end up with better code density anyway. some things we've learned:

idea: parity (or rather, opcode mod n for some n) could determine how an instruction is unpacked, namely, whether any operands are split to produce a >3-operand instruction, or which operands are inputs and which are outputs.

idea: immediate mode on an output operand could mean 'discard the output'.

considerations in designing the Jasper assembly

Our goal is theoretical elegance, expressiveness, and ease of implementation, not efficiency on current hardware.

considerations in designing the instruction encoding

One way would be to have idiosyncratic forms and lengths for various instructions or groups of instructions, under the constraint that no two instructions can be confused. This may be optimal in terms of efficiency, and indeed x86 does this to an extent, but it is not very uniform and requires a large specification (and makes it more work to implement), so we don't want that for Jasper.

Another way is to have all instructions of the same length and to have the same fields in the same places in each instruction. This is simple, but it hardcodes the instruction length, which hardcodes choices about how many bits the various fields are. It is my observation that a lot of hubbub is created when processor families move from, say, 8-bit words to 16-bit words, 16-bit to 32-bit, 32-bit to 64-bit, etc. This is an architectural detail that we don't want to overly expose in Jasper. Rather than making compilers or programmers do tricks to, e.g., deal with 64-bit values if only 12-bit immediates are available (this makes the implementation of a compiler more difficult), a Jasper assembler (assembly compiler) should be able to simply represent arbitrarily large bit-width values, and operations on them, directly.

Therefore, we want to be able to have instructions that take the word size that they operate on as a parameter. Also, we want our instruction encoding to support arbitrarily large immediate values.

The traditional way to support different word sizes is to have each instruction have a range of opcodes, corresponding to different word sizes. This is not our choice because semantically, the choice of word size seems orthogonal to the choice of instruction. Therefore, for the most part, we will explicitly encode the word size in the instruction encoding. This has the disadvantages that: (a) sometimes we won't need to specify a word size, so by forcing one anyway we'll waste bits; (b) it would be more efficient to just have a longer opcode field that encodes opcode x word size; this doesn't need to set a maximum word size, as we can specify a pattern that uses the code mod n to specify the word size, which would allow arbitrarily large word sizes to be encoded by arbitrarily large encodings; however this is more complicated and makes the implementation harder, so we won't do it; (c) we are still left with some instructions that use a different word size on different operands. We'll have to handle those in a non-uniform manner, specifying word size with the encoding for one operand and specifying it in an operand for another. We could overcome this by simply having a uniform format that separately specifies word size for each operand, but this would waste a lot of bits because in most cases the desired word size is the same for all operands (or differs in a prespecified way, e.g. sometimes when multiplying two numbers you want a double-length result).

Thinking about this a little, I realized that word size is just a special case of type. So we will store type information, generally, not just word sizes.

Now, as to how to allow our instruction encoding to support arbitrarily large immediate values. One way is to store the immediate values separately from the instructions themselves, and to use a variable-length encoding. A second way is to use a variable-length encoding for the instructions themselves. The latter is more complex, makes implementation more difficult, and makes decoding less efficient, but has the benefit that the number of custom opcodes can also be arbitrarily large, which is desirable for reasons of elegance even if not practically useful; in essence we avoid hardcoding a choice on the maximum information content of a single instruction, which is theoretically attractive. It also has the practical benefit of allowing small instructions with a high code density, and the pragmatic benefit of prompting us to consider how to best use the few bits of the smaller instruction lengths. However, in practice we expect that most Jasper Assembly implementations will eschew arbitrarily-long instructions, and instead restrict themselves to one or a few fixed choices, perhaps some subset of 16-bit, 32-bit, 64-bit. If i had to choose just one i would choose 32-bit, which would become significantly more functional if the two format bits were reassigned to opcode, or to opcode and type.

In fact i might end up doing that in general/'for real', and leave this variable-length stuff as an interesting thought exercise. The JVM only has 9 basic types and the CIL has 15 so i should be able to make do with 16. Lua only has 35 instructions so i should be able to make do with 64.

Now, if the instruction encoding is variable-length, you must have a way to tell when one instruction ends and the next one begins. We could have an idiosyncratic method depending on instruction groups and opcodes, but we choose not to do this for the same reasons as given above. We could have a method with a sentinel terminator, or a method in which the instruction length is announced explicitly at the beginning of each instruction. We choose the latter method as it would seem to make the implementation of decoding and encoding simpler, and might have efficiency benefits if hardware can be designed that loads in and parses the entire instruction in parallel, at least for the smaller lengths.

Let's constrain the instruction lengths to be powers of two. This is inspired by real-world efficiency concerns, but in the real world instruction sets exist with e.g. 6-byte instructions; the real-world instruction lengths are not powers of two but only multiples of some power of two. Theoretically, that would not be very constraining, because any even number (or any number, if 2^0 is considered) is a multiple of a power of two, so for the sake of having an inspiring design constraint, we'll stick with actual powers of two for now. Note that we'll feel free to break this constraint later if we want.

Now we have less information to represent when we announce the instruction length; all we have to do is encode a non-negative integer n such that the length of the instruction in bits is 2^(n+d), where d is some fixed offset.

How shall we encode the instruction length integer n? There are various schemes, but i think that the one which uses the smallest number of bits for the smallest values of n is unary encoding with a terminator. If you demand that the encoding of zero be as short as it can be, and then given that encoding for zero you demand that the encoding of 1 be as short as it can be, and then given those encodings for zero and for 1 you demand that the encoding of 2 be as short as it can be, etc., I think (but have not proved) that you must end up with a unary encoding. Unary encoding takes O(n) bits to represent n, and so relative to other encodings such as a positional/radix encoding, it is very inefficient for large values of n, but in this context, who cares? Since the total length of the instruction increases as 2^n, the overhead fraction taken up by an initial n bits decreases as n gets larger and converges to zero. But we do care a lot about the efficiency of representing small values of n, because space is tight there, and also these will presumably be the most frequently encountered case, so that matters a lot for code density.

So each instruction will start with a terminated unary encoding of n; specifically, we'll encode 0 by '0', 1 by '10', 2 by '110', 3 by '1110', etc, and the instruction length will be 2^(n+d). What shall d be? It would be elegant to choose 0 for d. In this case, for n=0, the instruction length would be 1 bit, and the instruction length field would occupy 1 bit; so there would be no room for anything else. This might be fine, because we could choose the most commonly encountered instruction (in which i include the choice of operands) and represent it this way, allowing a 1-bit representation of the most frequent instruction, which would improve code density. A similar thing occurs when n is 1 or 2; for 1, we have 2^n = 2, and the instruction length field would occupy 2 bits, and for 2, we have 2^n = 4, and the instruction length field would occupy 3 bits. So we represent the most common three instructions in this manner, and then we really get started at n = 3, an 8-bit instruction in which 4 bits are occupied and 4 remain free for us to use. We won't be able to do much with that, although if we're clever we might be able to encode a small Turing-complete instruction set into 4 bits (not sure), so really the first 'really usable' instruction length would be 16 bits. So a program would have a bunch of 1-, 2-, and 4-bit instructions with no operands that compactly specify the most common instructions; a bunch of 8-bit instructions which are really 4-bit instructions which specify a few more common instructions; and then 16-bit instructions (which are really 11-bit instructions) that can actually do a wider range of things. I've found that 16 bits isn't much to work with if you want a uniform, extensible instruction set, and 11 bits will be even worse, so really we'll end up going to 32 bits for lots of things.
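A sketch of this prefix scheme, with d = 0 as discussed:

```
# Sketch of the terminated-unary length prefix: n is encoded as n '1' bits
# followed by a '0', and the whole instruction is 2**(n + d) bits long
# (d is a fixed offset; d = 0 here, matching the discussion above).
D = 0

def encode_prefix(n):
    return '1' * n + '0'                      # 0 -> '0', 1 -> '10', 2 -> '110', ...

def read_prefix(bits):
    n = 0
    while bits[n] == '1':
        n += 1
    length = 2 ** (n + D)                     # total instruction length in bits
    payload_bits = length - (n + 1)           # what's left after the prefix
    return n, length, payload_bits

assert encode_prefix(3) == '1110'
assert read_prefix('1110' + '0000')[1:] == (8, 4)   # n=3: an 8-bit instruction with 4 free bits
```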

This might actually be very efficient, if the common instructions are wisely chosen, and it is elegant, in that we aren't arbitrarily choosing a lower bound on instruction size. However, the sort of elegance i am after is a compact yet orthogonal representation of the primitives of universal computation, which in this scheme isn't available until we reach 16- and 32-bit instructions. Since there were a bunch of great computers which had a bunch of 8-bit instructions (often they also had larger instructions too; e.g. the 6502), it seems lame to have a variable-length encoding scheme which can't do very much in 8 bits.

Another solution is to group multiple instructions together into "packets", allowing you to amortize the cost of the instruction length bit(s) and possibly some other shared fields. For example, one could fit 3 x 4-bit instructions, plus a four-bit header field, into 16 bits. Another advantage of this is that your instructions are aligned on a larger power of two. This has been tried: https://www.cs.uaf.edu/2011/spring/cs641/lecture/01_25_instruction_frequency.html says "Itanium, Intel's strange 1990's 64-bit architecture. Instructions come in quasi-VLIW 3-instruction "packets" of 16 bytes, but often the packets are full of no-ops"; this quote suggests that code density isn't helped all that much, because often you won't have 3 consecutive instructions that fit the slot size and that can be put in the same packet. Scaling Itanium's 16-byte packets down by 8 gives 16 bits: 3 4-bit instructions, plus a 4-bit header. However i'm not sure there would be many places where 3 consecutive instructions would each fit in 4 bits. Of course, it may work better with, say, two six-bit instructions plus a 4-bit header in 16 bits, or 3 8-bit instructions plus an 8-bit header in 32 bits. I think Itanium might have been focused on executing instructions in a packet in parallel, but if you serialize them then you could assume they are in a 'pipeline' and omit the third operand from the first instruction and the first operand from the second instruction. (btw in http://www.intel.com/content/dam/www/public/us/en/documents/manuals/itanium-architecture-vol-3-manual.pdf i see that those things are called 'bundles', not 'packets').

How then to choose the minimal instruction size? Since there were few computers with 4-bit or less instructions (even http://en.wikipedia.org/wiki/4-bit only gives examples of computers with 4-bit words but 8-bit instructions), it seems desirable for us to be able to do a lot with 8-bit instructions, at the cost of not being able to do as much with smaller instructions.

Another criterion we can use is to look at various small systems for universal computation and see how much instruction width they would minimally require, or parts of them would require.

Universal combinatorial logic (only universal for combinatorial logic; not Turing-complete yet) requires gates with two inputs and one output. One might represent the connections between these gates as two operands, or three. A single gate type (NAND or NOR) suffices; however, a simpler system breaks this into 3 gate types (AND, OR, and NOT), which would require two bits to choose among. The goal for Jasper is not merely to provide universality, but to provide it in a simple way, so, although it would be cool to have a single instruction for each universal gate (and we probably will), before worrying about that we want to have the simpler sets which are only universal together, such as AND, OR, NOT.

What combinatorial logic is missing is conditionals and recursion, or looping. Both can be implemented in the assembly language paradigm with a conditional branch instruction; a 4-opcode set (2 bits) could represent AND,OR,NOT,BCC. However the operand needs to be large enough to allow you to branch sufficiently far; I don't know how much is required.

Universal combinatorial reversible computing requires three-operand gates with three inputs and three outputs. Two universal reversible gates are the CCNOT (Toffoli) gate and the CSWAP (Fredkin) gate. They can be built out of the vague concepts IF, AND, NOT, IDENTITY, and SWAP; CCNOT is "if the first two bits are set, it inverts the third bit, otherwise all bits stay the same" and CSWAP "swaps the last two bits if the first bit is 1". IF, AND, NOT, IDENTITY, and SWAP are five concepts, and so would require 3 bits, but maybe we could make IDENTITY a default and so get by with 2 bits.
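Written out directly (just restating the two definitions quoted above):

```
# The two reversible gates mentioned above, written out directly.
def ccnot(a, b, c):          # Toffoli: invert c iff a and b are both 1
    return a, b, c ^ (a & b)

def cswap(a, b, c):          # Fredkin: swap b and c iff a is 1
    return (a, c, b) if a else (a, b, c)

assert ccnot(1, 1, 0) == (1, 1, 1)
assert cswap(1, 0, 1) == (1, 1, 0)
# Both are their own inverses, as reversible gates must be:
assert ccnot(*ccnot(1, 1, 0)) == (1, 1, 0)
```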

A Turing machine head only looks at one bit at a time, but this is far outside the 'instruction' paradigm. Nevertheless we'll look at small Turing machines. There is a tradeoff between how many Turing head states there are, and how many symbols there are on the tape. According to https://en.wikipedia.org/wiki/Universal_Turing_machine#Smallest_machines , there is also a notion of 'instruction' in at least some of these machines. From http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.8947&rep=rep1&type=pdf ("Four small universal Turing machines"), we see that 'instruction' means a combination of head state and symbol; so e.g. a machine with 5 states and 5 symbols could have at most 25 instructions, but some of these combinations could be declared illegal, yielding fewer instructions. According to that same Wikipedia page, "If we denote by (m, n) the class of UTMs with m states and n symbols the following tuples have been found: (15, 2), (9, 3), (6, 4), (5, 5), (4, 6), (3, 9), and (2, 18).[4][5][6] Rogozhin's (4, 6) machine uses only 22 instructions, and no standard UTM of lesser descriptional complexity is known." This quote mentions quantities that would take from 1 to 5 bits to represent. Each of the pairs from the quote has at least one member which would take at least 3 bits to represent. The sum of bits of the numbers in the pairs range from one to six bits. Especially notable are the pairs (15, 2) (it takes 4 bits of internal state to work with a binary tape), (6,4) and (4,6) (it takes 4 bits of internal state to work with pairs of bits and vice versa), and (9,3) and (3,9) and the gap around them (suggesting that it takes 3 bits of internal state to work with nibble words, and vice versa).

Nock is a small combinatorial language with 5 instructions, plus an implicit cons, plus 5 more instructions that could have been expressed in terms of the first 5 and the cons. The core 5 instructions have 2 or 3 arguments, one of which is a state-carrying argument ('state-carrying' won't make sense unless you know Nock). An 8-opcode set (3 bits) could handle the first 5 instructions and the cons; but it would take 4 bits to handle all 10 of the Nock instructions.

There are various "one instruction set computers" (OISCs) ( https://en.wikipedia.org/wiki/One_instruction_set_computer ). The 'universal instructions' include subleq ("SUbtract and Branch if Less than or EQual to zero"), which requires 3 operands. There is also subleq2, which uses an accumulator and requires only 2 operands. Looking at https://en.wikipedia.org/wiki/One_instruction_set_computer#Synthesized_instructions , we see that at least three 'registers' are needed for simple computations with subleq2. Note however that memory-memory addressing is assumed; if we actually used registers, we'd also need an indirect addressing mode or LOADs and STORES or some other way to reach main memory. The example given for flipping bits also uses an immediate-mode operand of -1 (it would take 2 bits to represent -1,0,1,2). Other OISC instructions are 'subneg' ("SUbtract and Branch if NEGative"), with three operands, and rssb (Reverse Subtract and Skip if Borrow), which has only one operand but also uses an accumulator. That Wikipedia page also mentions Transport triggered architecture, in which there is only MOV but certain memory locations have magic side effects including arithmetic.
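For concreteness, here is a tiny subleq interpreter; note that operand conventions vary between subleq descriptions, so this follows one common reading (mem[b] -= mem[a]; branch to c when the result is <= 0):

```
# A tiny subleq machine, following the common convention
# (mem[b] -= mem[a]; branch to c if the result is <= 0); operand order
# varies between descriptions, so treat this as one possible reading.
def run_subleq(mem, pc=0, max_steps=1000):
    for _ in range(max_steps):
        if pc < 0:                       # negative branch target used as halt
            return mem
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
    raise RuntimeError("did not halt")

# clear mem[7] by subtracting it from itself, then halt (jump to -1)
mem = [7, 7, -1, 0, 0, 0, 0, 42]
assert run_subleq(mem)[7] == 0
```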

So it seems that we want at least immediate and also either register or memory mode, and that we need 2 or 3 operands, opcodes of at least 3 bits, and that operands must be at least 2 bits (because subleq2 needs to access 3 different locations in order to flip all of the bits in a byte). If we did all of these, we'd already be over 8 bits (at least 1 format bit, 3 2-bit operands, 3-bit opcode, and that's neglecting the need for immediate mode, which would add 3 more addressing mode bits). However there are combinations that fit within 8 bits (e.g. 1 format bit, 2 2-bit operands, 3 bit opcode).

We also have stack machines; it would be good if our core architecture supported PUSHes and POPs with small instruction lengths.

Also, how to represent the fields? x86 and ARM have a large number of idiosyncratic field representations for various instructions. We don't want this, as it leads to a large specification and makes it difficult to implement decoding and encoding. Shall we have a single, uniform field specification for all instructions, varying only based on instruction length? Maybe, but it also seems reasonable to have a small number of formats, more than one. For example, NIOS II has 3 formats: one with a single very large immediate value (used for jumps), one with a few operands and a large immediate value, and one with no immediate value and many registers.

How to represent the bit-width types? The most elegant thing to do is to include a type for each power of 2. We might also include both a signed and an unsigned type for each of these except 1-bit. We might also have a 'dynamic' type, and custom types. However this requires many type bits before we reach the commonly used bit-widths; 4 bits to get to 8-bit widths, and 5 to get to 16-128.

So we might also consider omitting some of the choices below 8, such as 2, because they are unpopular. Another consideration is the sequence 2, 2^2, 2^2^2, 2^2^2^2 which shows that 2, 4, 16 are 'more important' than 8. If we omitted bit widths 2 and 4, then we could get to 8 with 3 type bits, and 32 with 4 type bits. We could also omit all choices below the minimal instruction size (e.g. give the system a minimal 'word size' after all, if not a maximal one).

And how about the variable encoding/decoding scheme? We didn't want it idiosyncratic, but then what do we want? Note that the scheme must provide a field layout for arbitrarily long instructions. One choice is to provide a simple procedure where you read the unary field, call it n, and take 2^n to determine the number of bits; then proceed through the fields in a certain order, keeping track of how many bits are left (so first subtract the bits used by the unary length field itself). For each field, the number of bits in that field is of the form 'n - a', where a is a field-specific constant. The sizes of some fields such as the type field and the addressing modes really should vary by the log of n, not by n, but for small values of n we can do the same thing with n - a, which is much easier for the decoder to compute, and for large values a linear waste of space doesn't matter. The last field holds all of the remaining bits, which is nice if we want to include immediate values within instructions, rather than having them sit in memory in between instructions.
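A sketch of the 'n - a' rule; the particular field names and per-field constants here are invented, only the shape of the procedure follows the description above:

```
# Sketch of the 'n - a' field-sizing rule: after the unary length field, each
# field gets n - a bits (a is a per-field constant, made up here), and the
# last field soaks up whatever is left.  Purely illustrative numbers.
FIELD_OFFSETS = [('opcode', -2), ('type', 0), ('addr_mode', 1)]   # size = n - a

def field_layout(n):
    total = 2 ** n
    remaining = total - (n + 1)                # subtract the terminated unary prefix
    layout = []
    for name, a in FIELD_OFFSETS:
        width = min(max(n - a, 0), remaining)
        layout.append((name, width))
        remaining -= width
    layout.append(('immediate', remaining))    # last field takes the rest
    return layout

# n=5 -> a 32-bit instruction: 6 prefix bits, then 7 + 5 + 4 field bits and a 10-bit immediate
assert field_layout(5) == [('opcode', 7), ('type', 5), ('addr_mode', 4), ('immediate', 10)]
```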

An even simpler choice is to preset a field layout, make each field of equal size, then simply equally divide up the available bits. If the number of fields is a power of two, so much the better. If needed, we could then add or subtract small constants to certain fields

for even formats:
- x unary format bits
- 1 custom opcode bit
- 2x opcode bits
- 1 custom type bit
- 3x - 2 type bits
- divide the remaining bits by 3 and assign them evenly to the operands; addr mode bits are 1 custom addr mode bit + x per operand; any remainder goes to the first (input) operand, then to the third (output) operand's addr mode

fmt 1 (16 bits):
- 1 format bit
- 1 custom opcode bit
- 2 opcode bits
- 1 custom type bit
- 1 type bit
- 10 addr mode + operand bits, either:
  (1 custom addr mode bit + 1-bit addr mode + 2-bit operand) x 1 operand + (1 custom addr mode bit + 1-bit addr mode + 1-bit operand) x 2 operands,
  or (1 custom addr mode bit + 1-bit addr mode + 3-bit operand) x 2 operands,
  or (1 custom addr mode bit + 1-bit addr mode + 8-bit operand) x 1 operand

fmt 2 (32 bits):
- 2 unary format bits
- 1 custom opcode bit
- 4 opcode bits
- 1 custom type bit
- 4 type bits
- 20 addr mode + operand bits, either:
  (1 custom addr mode bit + 2-bit addr mode + 4-bit operand) x 2 operands + (1 custom addr mode bit + 2-bit addr mode + 3-bit operand) x 1 operand,
  or (1 custom addr mode bit + 2-bit addr mode + 7-bit operand) x 2 operands,
  or (1 custom addr mode bit + 2-bit addr mode + 17-bit operand) x 1 operand

fmt 3 (64 bits):
- 3 unary format bits
- 1 custom opcode bit
- 6 opcode bits
- 1 custom type bit
- 7 type bits
- 46 addr mode + operand bits, either:
  (1 custom addr mode bit + 3-bit addr mode + 11-bit operand) x 2 operands + (4-bit addr mode + 12-bit operand) x 1 operand,
  or (1 custom addr mode bit + 3-bit addr mode + 19-bit operand) x 2 operands,
  or (1 custom addr mode bit + 3-bit addr mode + 42-bit operand) x 1 operand

note: how many operands there are is implied by the opcode

note: when there are different numbers of bits for different operands, the output (c) operand gets them first, then the first input operand (a)

note: it would be possible to encode a useful 8-bit format too if one left out the custom bits (making the design less regular) and the third operand (1-bit format, 2-bit opcode, 1-bit addr mode, 2 x 2-bit operands), but this would necessitate adding one more format bit to the 16-bit format, which would reduce the 16-bit functionality quite a lot. Alternately, we could have 2 format bits for the 8-bit format and 1 for the 16-bit one, and have one of the 8-bit operands either have no addressing mode or be just a 1-bit operand. I decided neither one was worth it.
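For concreteness, a decoder sketch for the 3-operand variant of fmt 2 above; the field widths come from the notes, but the ordering and MSB-first packing within the 32-bit word are assumptions:

```
# Decoder sketch for the 3-operand variant of 'fmt 2 (32 bits)' above.  The
# field order within the word (most-significant first) is an assumption here;
# only the field widths come from the notes.
def take(word, bitpos, width):
    """Return (field value, new bit position), reading from the MSB side of a 32-bit word."""
    shift = 32 - bitpos - width
    return (word >> shift) & ((1 << width) - 1), bitpos + width

def decode_fmt2_3op(word):
    fields, pos = {}, 0
    for name, width in [('format', 2), ('custom_opcode', 1), ('opcode', 4),
                        ('custom_type', 1), ('type', 4),
                        ('custom_am_a', 1), ('am_a', 2), ('a', 4),
                        ('custom_am_b', 1), ('am_b', 2), ('b', 4),
                        ('custom_am_c', 1), ('am_c', 2), ('c', 3)]:
        fields[name], pos = take(word, pos, width)
    assert pos == 32
    return fields

f = decode_fmt2_3op(0b10_0_0001_0_0010_0_00_0011_0_00_0100_0_00_101)
assert (f['opcode'], f['type'], f['a'], f['b'], f['c']) == (1, 2, 3, 4, 5)
```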

the fixed types are:

-- 1 bit for fixed types: pointer, dynamic

-- 2 bits for fixed types: arbitrary precision signed integer, 1-bit

-- 3 bits for fixed types: 2-bit unsigned, 2-bit signed, 4-bit unsigned, 4-bit signed

-- 4 bits for fixed types: 8-bit unsigned, 8-bit signed, 16-bit unsigned, 16-bit signed, reserved for future binary16 (half) float, 32-bit unsigned, 32-bit signed, binary32 (single) float

-- 5 bits for fixed types: 64-bit unsigned, 64-bit signed, binary64 (double) float, 128-bit unsigned, 128-bit signed, binary128 (quad) float, ...

todo: what about other jasper types?!? what about strings, processes, graphs, etc?!? should we bite the bullet and remove 2-bit types? do we even want fixed-width types at all in the fixed types? i think i should get rid of the above and handcraft some fixed types, and put 'het' where 'dynamic' is, and put 'dynamic' later. and what about thunks and promises (and can't we unify those?). should a 'thunk of ints' be its own type at this level? recall that the point of types at this level is not to make proofs about programs; it's just so that the instruction knows exactly what implementation code to run after consulting only the operand and the type. If there's a thunk of ints, you can only perform generic thunk operations on it for the most part; but then again, you can also sometimes get an int out of it, if the thunk is ready. Hmm.. i think what you could do is have the generic thunk return a het, and then do a CONV from het to int.

note: according to http://stackoverflow.com/a/12583866/171761 and http://www.mrob.com/pub/math/floatformats.html 4-bit floating points exist and can represent some fractions (only +-0.5 and +-1.5, using the representation in http://stackoverflow.com/a/12583866/171761 ). I don't think they are worth including. There are also 8-bit floating point types which seem to me to be significantly more useful than that; http://research.cs.queensu.ca/home/cisc121/2006s/webnotes/JavaPrimitiveTypes.html#FloatingPointTypes presents a 1.2.5 format for which the smallest value above 1 that it can represent exactly is 1.0625. https://en.wikipedia.org/wiki/Minifloat#Example presents a 1.4.3.-2 format; i'm not sure what the corresponding least number greater than one is but i think it's 1.125.

However, the IEEE 754 stuff is complicated and we don't want someone writing a Jasper interpreter or compiler to have to implement this sort of thing themselves. Also, since 8-bit floats aren't standardized, we would have to choose how many bits go in the exponent and how many in the mantissa. binary16 (half float) is at least standardized, and although it isn't supported everywhere (e.g. it's not in C), it is supported in some places (e.g. for ARM targets only, GCC adds a type __fp16).

So we'll keep __fp16 in the types list in case it becomes more widely supported in the future. But it's marked as 'reserved', because again, we don't want a Jasper compiler writer to have to implement it.

note: you might think that, using the above encoding, you cannot write 16-bit instructions that use types such as double-precision floating point, or 8-bit integers. This is not true; the 'custom' types can simply be set to point to predefined types if this is desired. Similarly, the programmer can choose 2 custom address modes, and 4 custom opcodes which are accessible in 16-bit instructions; these can be set to alias built-in address modes and opcodes which are not usually available in 16-bit instructions.

address modes:

-- 1 bit for fixed address modes: memory (register); immediate (or, for output operands, memory indirect postincrement, e.g. PUSH)

-- 2 bits for fixed address modes: memory indirect; ??

-- 3 bits for fixed address modes: ?? ?? ?? ??

note on opcode specification: each opcode defn specifies not only its implementation but also its signature, namely:

opcodes:

-- 2 bits for fixed opcodes:

CPY: a -> c
CPYFROMIDX: *(a[b]) -> c
CPYTOIDX: c -> *(a[b])

note: these instructions provide an analog to the popular indexed indirect addressing mode on other systems; although 16-bit instructions here don't have an indirect addressing mode (unless a custom addressing mode is set to that), the same thing can be achieved through these instructions, at the cost of another instruction. This also removes the need to either encode/associate both the base and the index into a single operand, which would be hard to fit into a fixed-width instruction, or to have special base or index registers.

BT: if a evaluates to True, branch to c (c is a signed value, relative to the PC; e.g. for 3 bits, one of -4, -3, -2, -1, +2, +3, +4, +5)

note: would it be better to just MOV to the PC directly, or to ADDACCUMULATOR to it? in which case we'd need a conditional MOV or conditional ADD, but then we'd be back to the 3-operand version, and it would be hard to reach the PC, right? maybe not.. hmm...

CPYIF: if a, then b -> c, or ADDIF: if a, then c = c + b. problem: we need 2 bits in c to reach the PC, but we need 2 bits in b to have a displacement of more than 1. possible solution: put the PC at location 0. the problem with this is that now a and b in the other instructions above can only access the PC and one other thing. If there is also a stack pointer, then there is no room for an index register, or for actual data. or should we have SWAP instead of CPY as the default, which is nicer for reversible computing? or should we have CONV instead of BT?

NOTE: there are 'separate memories' for each type of thing. So e.g. there is a pointer at location 0 in the memory for things of type 'pointer', and there is also a dynamic object at location 0 in the memory for things of type 'dynamic', and there is also an integer in the memory for things of type 'arbitrary-precision integer'. If you do CPY 0->1 with type 'pointer', that only changes pointer memory, it doesn't affect the dynamic object in location 1 or the integer at location 1. There is also a 'heterogeneous' type or 'het' for short; the het space can store something of any type. What het actually does is, if the size of the object is less than or equal to the system word size, it puts the object in there directly, and otherwise it puts a pointer and transparently dereferences this pointer. To move something between spaces (including between Het space and something else), you do a CONV. Note that these separate memory spaces are abstract concepts and may not correspond to implementation. It is unlikely that a Jasper Assembly implementation will actually allocate a separate contiguous region of memory for each type; more likely it will store small things on the stack or in registers and compile references to these items into stack and register references, and store large things on a single heap, with one lookup table per type. You should think of these separate memory spaces as merely a way to name variables; e.g. if the program source code has a float variable called "altitude", in Jasper assembly that might be translated to "float variable number 3".
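A toy sketch of the separate-memories idea (illustrative names, not a real Jasper API): the same location number names different cells in different type spaces, and CONV is the only way to move values between spaces:

```
# Sketch of the 'separate memory per type' idea: location 1 names a different
# cell in each type's space, so a CPY in pointer space does not disturb
# location 1 of integer space.  CONV moves values between spaces (including
# the het space).
from collections import defaultdict

spaces = defaultdict(dict)     # spaces[type][location] -> value

def cpy(typ, src, dst):
    spaces[typ][dst] = spaces[typ][src]

def conv(src_typ, src, dst_typ, dst, convert=lambda v: v):
    spaces[dst_typ][dst] = convert(spaces[src_typ][src])

spaces['pointer'][0] = 0x2000
spaces['int'][1] = 99
cpy('pointer', 0, 1)                       # CPY 0->1 with type 'pointer'
assert spaces['pointer'][1] == 0x2000
assert spaces['int'][1] == 99              # the integer at location 1 is untouched
conv('int', 1, 'het', 0)                   # move the int into het space
assert spaces['het'][0] == 99
```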

the motivation for separate memories is:

it is NOT because we expect to implement it that way! one disadvantage is that it forces implementations to not just use the number in the operand as the actual address of a register or memory location on the underlying platform, and to instead go thru another level of indirection (which leads either to an extra lookup at runtime or to extra compilation effort). But since different objects are different sizes, something like this would be needed anyways (e.g. if we only had one space, it would be like our Het space, and the object placed at location 3 in het space, even if it's a 1-byte integer, may not necessarily be at word 3 from the start of het space in the underlying memory, because the object at location 2 may be 10,000 bytes long; so either het space is governed by a lookup table (extra lookup at runtime) or we map stuff at compile time -- so we'd already have the same problem).

the motivation for types is twofold:

part of the motivation for addressing modes is similar: we avoid a proliferation of opcodes that do the same thing but in different addressing modes.

both of these suffer from the code density disadvantage that we do not tailor our instruction encoding to the product of opcode x type x addressing mode. however, this is an advantage from the point of view of ease of implementation.

todo: how do we handle the common pattern of looping through a 1-D array? you don't want to go back to the header for the whole array and __get each item, right? you just want to increment a pointer by the size of the items. is there any way to allow this without allowing flat-out pointer arithmetic (we could include bounds-checking, but it should be optional)? if we have pointer arithmetic, should it just be on the pointer type, type 01? we need some way to access raw memory anyways for the FFI. i'm not sure what the right answer is, but the simplest answer seems to be to just let people do arithmetic on the pointer type, type 01. So let's go with that unless i think of something better.

What you can do with the fixed 16-bit instructions and fixed address modes: our 16-bit mode is mostly useless except for MOVs (CPYs), because a and b are 1-bit operands in 3-operand instructions. Assuming that location 0 is the stack register, that means that all you can really do with pointers is use location 1 as an index register, fetch stuff from memory (CPYFROMIDX), move it around among the first 8 memory locations, and store it to somewhere in memory (CPYTOIDX). However, since there is a separate memory for each type of object, you can still work with 2 objects of each type in addition to the stack pointer and the index register. You can push stuff onto the stack using the memory indirect postincrement address mode. You can look at any of the first 8 memory locations and branch if it is not zero (useful for loops).

Moves are very common instructions and branches are sorta common, so these might actually get used.

Note: why is CPYTOIDX taking input from c? because if this is a 16-bit instruction, a and b can only target memory locations 0 and 1, and so it makes sense to place a base pointer and an index into those locations, and to let c look at locations 2 and 3, which can contain values that are being worked on.

-- 4 bits for fixed opcodes:

AND
OR
NOT
ADD
POP (this is really a POP and a ROT combined, since you can specify a location other than TOS to POP)
LEA (i guess this is like Jasper's give-me-the-link-itself-not-the-destination; but is it also like an EVAL?)
CALL (? what to do with the 3 operands? i guess one of them is the location of the call stack pointer, and one is the jump address? shall we even include a parameter? can we do some backwards calling and handler trees with this? can we combine CALL and RET?)
RET (? what to do with the 3 operands? i guess one of them is the location of the call stack pointer? shall we even include a return value? can we do some backwards calling and handler trees with this? can we combine CALL and RET?)
JMP
EQ
LEQ (note: Lua's LT is a branch, not a CMP, and Lua also has a LOADBOOL which both converts to bool (or does it just load an immediate value?) and also optionally is a BCC (conditional branch based on a boolean value); again, i'm not sure if the conditional value is immediate or not. these confusing constructs are optimized constructions for if/then and branching (see http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf page 19) but i'm not sure if we want that; it kinda seems to me like using NAND when we could use NOT and AND, because this 'BLEQ' is just the same as a pair of LEQ and BCC; similarly LOADBOOL seems like CONV followed by BCC. i think we want LEQ, and maybe also a BLEQ for code density, but if so with the latter in the expanded (64-opcode) opcode set only.)
PUSH (this is really a PUSH and a ROT combined, since you can specify a location other than TOS to PUSH; it's really an INSERT)
CONV (type conversion, two operand)
BEGIN, END (blocks, transactions, etc. also, one BEGIN/END type could redundantly identify the start and end of the instruction stream, although i guess that's a waste of space for embedded)
CONTEXT SWITCH (e.g. which variables/upvals have which values; or this could be a special register, or just based on the stack register)
LOADI?
?? 1 more
?? special JMP to jump between code segments? or maybe a flag?
?? indexed (two operand) jump? 3-operand jump?
?? 3-operand LEA?
?? load/store multiple registers?

?? ALIAS
?? MOV (in addition to CPY)
?? SYNC
?? register interrupt/exception handlers
?? register event watchers
?? group variables into state objects
?? how do ARM, MSP430, x86 do 'exceptions' e.g. div by 0? x86 has a divide_error handler, ARM has no divide: http://stackoverflow.com/questions/12526276/where-is-the-zero-divide-done-in-kernel-for-arm-cortex-a-9 http://infocenter.arm.com/help/topic/com.arm.doc.dui0471j/pge1358787016339.html http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0471c/BABGCFHB.html
?? ARM-ish Branch and exchange (BX)
?? substitution into expressions?
?? function application?
?? nock operator?
?? grammars?
?? horn clauses?
?? graphs?
?? expression evaluation?
?? exceptions/backwards calling?
?? concurrency?
?? transactions/marking code blocks?
?? perspectives?
?? while loop?
?? BF
?? LT
?? ROT
?? DUP
?? CAS
?? SWAP
?? CMPS, MOVS, SCAS, STOS, LODS (see http://umcs.maine.edu/~cmeadow/courses/cos335/COA10.pdf )
Consensus choice (and other amorphous computing stuff)
parallel Union
parallel Intersection
non-parallel set ops
Pmap
Pfold (reduce?)

-- 6 bits for fixed opcodes: ...

other popular ones are shl, xor, add, call, cmp, pop, inc, jz, lea, test, je, jnz

remaining: shl, xor

idea: the JVM does something interesting: a variable-length encoding, but many opcodes, and many opcodes don't require any operands (they are stack-based). the JVM has a one-byte opcode followed by zero or more operands. CIL has a one- or two-byte opcode followed by zero or more operands.

let's see how it would look to have different numbers of operands in different instruction lengths:

fmt 1 (2 instructions in 32 bits):
- 1 format bit
- 1 custom type bit
- 5 type bits

first packed instruction:
- 1 custom opcode bit
- 6 opcode bits
- 1 custom addr mode bit
- 2 addr mode bits
- 2 operand bits (a)
- 0 is assumed for operands b and c

second packed instruction:
- 1 custom opcode bit
- 6 opcode bits
- 1 custom addr mode bit
- 2 addr mode bits
- 2 operand bits (c)
- 1 unused bit
- 0 is assumed for operands a and b

fmt 2 (32 bits):
- 1 format bit
- 1 custom type bit
- 5 type bits
- 1 custom opcode bit
- 6 opcode bits
- 18 addr mode + operand bits, either:
  (1 custom addr mode bit + 2-bit addr mode + 3-bit operand) x 3 operands,
  or (1 custom addr mode bit + 2-bit addr mode + 6-bit operand) x 2 operands,
  or (1 custom addr mode bit + 2-bit addr mode + 15-bit operand) x 1 operand

note: maybe add another small operand to the third form; this would allow us to e.g. have a flag in JMP for jumping to a different segment, etc. no on the flag: see below. but mb have a segment identifier or something, with '0' meaning 'current'?

note: or maybe eliminate fmt 1 (it's neat but adds complexity) and eliminate 1 type bit and add another small operand to EVERY form; this would allow us to e.g. have a flag in JMP for jumping to a different segment, have one CMP that does LEQ, LT, GEQ, GT, have one CMP that does EQ, NEQ, have an ARITH that does +/-/*/div, have AND/OR/NAND/NOR etc. Not sure if that's any better than just having more opcodes though.. probably not.. so let's scrap that idea

i guess i like that -- it gives us double the code density for some short stretches of code, and gives us more opcodes and types than the previous alternative

i guess if we have fixed length then we should have a LOADI instruction that loads a variable-length operand from outside the instruction. I'm thinking:

(a) a unary number counting the number of bits in (b)
(b) a binary number counting the number of bits of the variable-length data in (d)
(c) a binary number, same length as (b), counting the number of bits of padding
(d) the variable-length data
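A sketch of one possible reading of this layout (it is ambiguous whether the count in (b) includes the padding, and where the padding sits; here (b) includes the padding and the padding precedes the data):

```
# Sketch of the LOADI operand layout described above: a terminated-unary count
# of the width of field (b), a binary length, an equally wide binary padding
# count, then the data itself.  Bit-string encoding only, to keep it simple.
# Assumption: the length in (b) includes the padding, and padding comes first.
def encode_loadi_operand(data_bits, pad=0):
    length = len(data_bits) + pad
    width = max(length.bit_length(), 1)            # bits needed for the length field
    a = '1' * width + '0'                          # unary count of bits in (b)
    b = format(length, '0{}b'.format(width))       # bits of data incl. padding
    c = format(pad, '0{}b'.format(width))          # bits of padding
    return a + b + c + '0' * pad + data_bits

def decode_loadi_operand(bits):
    width = bits.index('0')                        # read the unary field (a)
    pos = width + 1
    length = int(bits[pos:pos + width], 2); pos += width
    pad = int(bits[pos:pos + width], 2); pos += width
    return bits[pos + pad:pos + length]            # strip the padding, return the data

payload = format(1234, 'b')
assert decode_loadi_operand(encode_loadi_operand(payload)) == payload
```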

on second thought i am questioning the 'separate address spaces' idea. the problem is not time but space. if there are many types, then the runtime must maintain many tables mapping the type address spaces to the real underlying platform address space. since there are many types, most likely most of the types will have few entries in these tables at any one time. so either the tables are of fixed size, and mostly empty, or there are a whole bunch of different stacks constantly expanding and contracting, which will be annoying to lay out in memory (either you have to overreserve space for them or each one must be partitionable). also, a minor concern, but it would be annoying and inefficient to have to CONV to and from Het all the time. simpler to just have one address space.

so currently i think we should go back to one address space.

if 0 is the stack pointer then to make this more useful we need a POP addressing mode or at least an indirect addressing mode. also what about the PC?

also, i question the utility of fmt 1. it is only 100% more efficient; i dunno if that's a big enough difference to justify the extra complexity in the language standard. implementations that care to optimize could implement their own better bytecode if they wanted, e.g. the way Dalvik implements the JVM. i do like 32-bit fixed lengths because in the massively parallel future, all instructions in the stream could be parsed simultaneously if they are all aligned. maybe fmt 1 should be removed; or maybe it should take out one operand and reduce another one in size to allow for packing 3 instructions instead of 2; or maybe we should just stick with the current proposal as a good compromise.

the second option is:

fmt 1 (3-instruction pipeline in 32 bits):
- 1 format bit
- 1 custom type bit
- 5 type bits

4 packed instruction routing bits:
- 1 bit: first input of first instr comes from top of stack vs register 1
- 1 bit: first input of second instr is top of stack, vs output of first instr
- 1 bit: first input of third instr is top of stack, vs output of first instr
- 1 bit: ?

or maybe:
- 1 bit: first input of first instr comes from top of stack vs register 1
- 1 bit: second input of first instr comes from top of stack vs register 1
- 1 bit: first input of second instr is top of stack, vs register 1
- 1 bit: second input of second instr is top of stack, vs register 1

or maybe:
- 1 bit: output of first instr is consumed in first input of second instr, or first input of third instr
- other bits: ? (and what if the first instr is a multi-output thing like DUP?)

or maybe 2 operand bits and 2 packed instruction routing bits

or maybe 2 operand bits and 1 addressing mode bit and one 1 bit saying whether that operand is for the 1st or 2nd instruction

or maybe:
- 1 bit: nothing, or SWAP in between second and third instr
- 1 bit: first input of first instr comes from top of stack vs register 1
- 1 bit: first input of second instr is top of stack, vs register 1
- 1 bit: first input of third instr is top of stack, vs register 1
(note: so far i like this one the best) in this one, each instruction consumes two inputs from the stack and pushes one output

1 custom opcode bit 6 opcode bits

1 custom opcode bit 6 opcode bits

1 custom opcode bit 6 opcode bits

note btw that the instructions without operands, both in the previous proposal and in this one, run in 'stack mode', e.g. they take their unspecified input operands off the stack, and push their unspecified output operands to the stack. so the idea is that for this sort of instruction you get to stuff 3 instructions into 32 bits (about 11 bits per instruction, pretty good code density); the tradeoff is that they are only getting inputs and outputs from the stack, from register 1, and/or from each other. note that this fmt 1 has no immediate values.
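A toy sketch of executing such a bundle, using the routing-bit variant marked 'i like this one the best' above (one bit per packed instruction choosing top-of-stack vs register 1 for the first input); opcodes and values are invented:

```
# Sketch of the 3-instruction stack-mode bundle: each packed instruction pops
# its inputs, pushes its output, and one routing bit per instruction decides
# whether the *first* input comes from the stack or from register 1.  The
# optional SWAP bit from the notes is omitted; toy opcodes only.
OPS = {'ADD': lambda x, y: x + y, 'SUB': lambda x, y: x - y, 'MUL': lambda x, y: x * y}

def run_bundle(opcodes, routing_bits, stack, reg1):
    for op, from_stack in zip(opcodes, routing_bits):
        first = stack.pop() if from_stack else reg1
        second = stack.pop()
        stack.append(OPS[op](first, second))
    return stack

# compute ((reg1 - 10) + 5) * 2 with reg1 = 12: push the other operands first
stack = [2, 5, 10]
assert run_bundle(['SUB', 'ADD', 'MUL'], [0, 1, 1], stack, reg1=12) == [14]
```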

note: branches using relative addressing count a fmt 1 instruction bundle as 1 instruction, not 3

also, should take a close look at Dalvik because whoever made that up appears to have had some of the same weird ideas as me. e.g. 4-bit and 16-bit registers, variable length constant data (LEB128).

and let's add back in our stack addressing mode, even though it is side-effectful on inputs:

addr modes:
- mainstack (pop from stack pointed to by reg 0 at offset of operand / insert into stack pointed to by reg 0 at offset of operand)
- register
- immediate / ?? for output
- ?? memory indirect
- ?? secondary stack (top of stack PUSH/POP only, stack pointer in register specified by operand?)
- ?? mainstack indirect indexed (no PUSH/POP, just read/overwrite single locations)
- ?? looks like we don't really need a relative addressing mode in the addr mode bits; this is assumed by certain instructions on top of the addr mode in the bits?
- ?? get/put upval (this could be an instr like lua but i think it makes more sense as an addressing mode) (or is everything just upval addressing by default, e.g. no registers so 'memory' is just upvals?) http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf page 21 notes that this is important because things like first-class functions and closures could make this more complicated than just a stack. i guess each closure might have its own separate stack? and what were the problems with saguaro stacks again? https://en.wikipedia.org/wiki/Spaghetti_stack

note that we have a lot of stack-like stuff here.

note that the pointer to the main stack is in register 0 and this register is treated specially in two places: (1) in fmt 1 (2) by mainstack addr mode, both of which use register 0 implicitly as the stack pointer. Do we also want to put the PC at a special place? Or maybe just have opcodes to deal with it?

note that since we don't have arbitrarily-long variable-length instructions anymore, we have to have a LOADCONSTANT opcode that loads a long constant from a constant area.

note that there is no switch codetables instruction, rather, code is grouped together into segments that each contain lookup tables of custom types, addressing modes, and opcodes, an instruction stream, and a table of constants.

embedded systems vendors offering Jasper Assembly-executing systems (interpreters or hardware) may wish to prohibit programmer-defined custom types, addressing modes, and opcodes, and instead hardcode their own, or simply not have any. In this case the segments would not have a lookup table for these things. conforming jasper implementations are allowed to choose to not support this feature, although they would then only be conforming to some weird 'nocustom' profile, not the 'standard' profile. The custom things probably improve code density, so embedded systems would probably usually want them.

i guess the reason that writing instructions in stack-mode is so concise is that it's really just a useful shorthand for how the outputs of earlier instructions are plugged into the inputs of later ones, and that in the common case (executing a bunch of sequential instructions whose inputs depend on the output of the previous instruction) it's concise. when you get far away from this common case is when you get lots of ROTs and DUPs. our mainstack addr mode should make this even more concise.

note: the JVM idea that the size of the (local?) stack each time you hit a given instruction must be the same is interesting, we should consider doing that.

todo: http://www.csie.nuk.edu.tw/~stpan/course/101_9_digitalCircuit_Chap_10_1.ppt has lists of common instructions that i guess we should include in our 32- or 64-opcode set (by 32-opcode set i just mean the first 32 opcodes of the 64 opcode set, which i guess should be the most important ones).

note: since the 32-bit 3-operand form has 3-bit operands, this appears to suggest an 8-register system. However, do note (a) that there is also the stack, and (b) that up to 64 'register-like' memory locations can be directly accessed with the 6-bit, 2-operand instructions.

note: these small operands are one of the weaknesses of the design; 8 isn't enough to use the 3-operand commands to directly address local variables in a HLL, and even 64 is cutting it close. Lua allows at most 200 local variables per function and 60 upvals (http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf page 4). If we could add even 2 bits per operand to the 3-operand form that would be much better, as it would allow direct instructions to be used when there are 32 or fewer variables in use, which is probably the majority of the time; this would be 6 more bits, which could be achieved at the cost of the format bit, the custom type and opcode bits, one type bit, one opcode bit, and limiting the addressing mode of the second operand. i feel like that cost is too much to pay, however.

a better compromise might be to add just 1 bit per operand:

fmt 0 (not recommended, just for discussion) (32 bits):
- 0 format bits
- 1 custom type bit
- 4 type bits
- 1 custom opcode bit
- 5 opcode bits
- 21 addr mode + operand bits, either:
  (1 custom addr mode bit + 2-bit addr mode + 4-bit operand) x 3 operands,
  or (1 custom addr mode bit + 2-bit addr mode + 7-bit operand) x 2 operands + 1 unused bit,
  or (1 custom addr mode bit + 2-bit addr mode + 18-bit operand) x 1 operand

with 16 directly accessible memory locations, it is likely that many subroutines could directly map all HLL variables to registers/memory locations, allowing them to be written without excessive CPYs to shuffle around things between locations. the price is: (a) no awesome 32-bit 3-instruction pipeline bundle, (b) only 16 fixed types, (c) only 32 fixed opcodes

not sure if i like this better or worse than the above proposal.. (a), (b), and (c) hurt but seem feasible, and all of them force simplicity, (a) in particular... i guess i like it a little better... it's gonna be tough to have fewer opcodes than even Lua, though! and the lack of enough types will probably force us to give up the dream of having both high-level types like 'arbitrary-precision int' and 'string' and also things like uint8 and float16 all as fixed types. http://www.csie.nuk.edu.tw/~stpan/course/101_9_digitalCircuit_Chap_10_1.ppt mentions 35 instructions, presumably must-haves, and that's without even mentioning the branch instructions! and we want to add a bunch of weird instructions to express other computational paradigms, too.. even PIC has 34 or 35 instructions, for pete's sake.. even the minimalist guy who is cited on Wikipedia's MISC (minimal instruction set computer) page has 32 or 33 instructions, and some of those have multiple forms so there's really more: http://www.colorforth.com/inst.htm . however, looking at the MIPS instruction set, after you remove 'duplicates' that are immediate or otherly-typed versions of each other, there are only about 22 instructions.

no, i don't think we can give up that last fixed opcode bit.. we really need at least 6 bits there.

one thing we could do that would save a ton of space is to give up on the custom addressing modes.. maybe just for 1 operand..

another thing we could do is to hope that our fancy addressing modes will save the day and make us need fewer instructions.. e.g. many of the http://www.colorforth.com/inst.htm instructions are just addressing modes for us..

todo look up http://www.colorforth.com/inst.htm

also i guess humans have 7+-2 registers for sequential computation, right? so if we have 8 regs with one reserved for the stack register, that's about right.

also 64 instructions seems like a lot to learn, or to implement.

but KLambda has 48 primitive functions http://www.shenlanguage.org/Documentation/shendoc.htm#The%20Primitive%20Functions%20of%20K%20Lambda

(Shen Prolog has 12 functions: http://www.shenlanguage.org/Documentation/shendoc.htm#Prolog )

so... i guess the thing is to try to fit the ISA into 32 instructions and the types into 16 types. if that's possible, then use the 4-bit operand format. otherwise, use the 3-bit format. the 3-in-1 format sounds fun but adds lots of implementation complexity.

as for the size of the interpreter (see jasperLowEndTargets for rationale), it should ideally fit within 8k (4k would be even better; Parallax Propeller's SPIN fits into 2k!), 16k is great too, but we can go up to 64k if needed (16k is the max that we could use for a Propeller though).

so really we should try pretty hard to fit within 16k, and 8k if possible.

a little estimation: if we had 8k for the implementation and half of the Jasper Assembly code was for implementing 64 opcodes, then that's 2^12/64 = 64 bytes per opcode. If each instruction in the implementation was 32 bits (like Jasper Assembly itself), that's 64/4 = 16 instructions per opcode. Seems possible but tight. I doubt we'll fit a garbage collecting cycle-tracer in there! Maybe reference-counting though. But maybe we can; according to http://electronicdesign.com/displays/inferno-operating-system-burns-its-way-embedded-systems , Inferno's Styx can fit into 6k on embedded systems; does this include Dis?

this paper fits Inferno into 1 MB on some small system: http://doc.cat-v.org/inferno/java_on_dis/java_on_dis.pdf

idea: regarding a tracing garbage collector to collect cycles: maybe make it an optional external library module, that you can link in & explicitly call if/when you want it?

my best guess is that:

idea: in addition to 'dynamic', in which the actual type is stored in memory with the value, could have a second type of dynamic type, 'register dynamic', in which the type is stored in a certain register (the last one, register 7?). similarly, could have an INTERP or EMU instruction to run custom instructions not included in the table (or, could just formulate the custom instruction implementations as an ordinary call, e.g. return addr pushed onto stack at call time, RET at end, and then instead of 'EMU custom instruction' you could just CALL it).

idea: above we had fmt zero treat register 1 specially, and register 0 was stack. mb better to put PC in register 0, stack in reg 1, and treat reg 7 specially.

idea: could have fmt 0 (or fmt 2 or whatever) be for putting immediate data blocks in there. These would be accessed via PC-relative addressing. Due to caching/locality of reference, on some systems accessing constants embedded in code and therefore near the PC may be faster. But actually i see no reason for a special fmt for this; just JMP over the data.

idea: should we do the 'base register plus offset' thing? i'm thinking no, just use LEA for that.

idea: a bitfield register whose size is implementation-dependent (its own little memory bank)

idea: Bank support via base reg indirection?

idea: 3 bits (8 values) isn't much for an immediate value specifying e.g. an index into a jump table. how about making another format for a 'two operand format' where the result is put either into an accumulator (r2? r0?) or on top of the stack, and the six bits saved from the third operand are used to add 3 bits to each of the first two operands, giving the capacity for immediate values up to 64?

note: you can have a stack machine but transparently cache the top 3 values in hidden registers, providing some of the efficiency advantages of accumulator machines to stack machines
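
to make that concrete, here's a tiny Python sketch of an interpreter loop that keeps the top of the stack in a local variable (only the top value, for brevity; the opcode names are made up and none of this is part of the Jasper design):

    # sketch: a stack machine whose top-of-stack lives in a Python variable ('tos')
    # instead of in the list, so binary ops touch the real stack only once
    def run(program):
        stack = []          # everything below the cached top
        tos = None          # hidden 'register' caching the top of stack
        for op, *args in program:
            if op == 'PUSH':
                if tos is not None:
                    stack.append(tos)
                tos = args[0]
            elif op == 'ADD':           # pops one real slot, not two
                tos = stack.pop() + tos
            elif op == 'PRINT':
                print(tos)
        return tos

    # usage: ((1 + 2) + 3) -> 6
    run([('PUSH', 1), ('PUSH', 2), ('ADD',), ('PUSH', 3), ('ADD',), ('PRINT',)])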

note: if the # of bits in the third operand + # bits in its addr mode (incld custom bit) is even, then the two operand form has no wasted bit

note: i guess the # of bits available for the 3-operand form immediates isn't a huge deal because you could always just precede the instruction with a LOADI

note: for parametric types, we don't store the subtype, e.g. something is of type ARRAY, not ARRAY[int] (after you index into the array you use INT types in the instructions) -- or mb we do but only one level deep? but if so then what about things of many types, e.g. heterogeneous tuples?

idea: should explore what we could gain by having a 64-bit fixed size instruction:

this should be guided by what makes the ISA easier to remember, and what makes the implementation simpler, not just by efficiency

idea: types for slices/selected areas

idea: types for cursors

e.g.

fmt x (64 bits): 2 format bits 1 custom type bit 6 type semantics bits 1 type format bit 3 type size bits 6 type flags 1 custom opcode bit 6 opcode bits 2 modifier bits 36 addr mode + operand bits (1 custom addr mode bit + 3 bit addr mode + 8 bits operand x 3 operands) or (1 custom addr mode bit + 3 bit addr mode + 14 bits operand x 2 operands) or (1 custom addr mode bit + 3 bit addr mode + 20 bits operand x 1 operands + 1 custom addr mode bit + 3 bit addr mode + 8 bits operand x 1 operands ) or (1 custom addr mode bit + 3 bit addr mode + 32 bits operand x 1 operand)

note: fixed type flags might make it less a universal assembly encoding and more of a jasper one..

note: at least half of the type flags should be custom, or custom when in the custom opcodes, or something like that

eh, in the end i guess the modifiers are dumb. for AND and OR they seem okay (negate either or both of the two inputs), but for others there is not always a logical way to use them (shifts: ASR, ASL, LSL, ?? ROR?). so better to have more opcodes, i guess. there are more ways to use 1 modifier bit (e.g. LT vs. LEQ, EQ vs. NEQ, ASR vs ASL), but not always (DUP) and since the interpretation changes for different opcodes let's just let it be more opcodes

idea: structural vs. reference equality

idea: a type type

now, should type formats and sizes really be broken out from semantics and from each other? by the same argument as for opcodes, i guess the answer is no, because some types don't have all those sizes, and some don't have many formats, so you're wasting a bunch of bits if you force them to be broken out.

so we have:

fmt x (64 bits): 2 format bits 1 custom type bit 8 type bits 4 type flags 4 custom type flags 1 custom opcode bit 8 opcode bits 36 addr mode + operand bits (1 custom addr mode bit + 3 bit addr mode + 8 bits operand x 3 operands) or (1 custom addr mode bit + 3 bit addr mode + 14 bits operand x 2 operands) or (1 custom addr mode bit + 3 bit addr mode + 20 bits operand x 1 operands + 1 custom addr mode bit + 3 bit addr mode + 8 bits operand x 1 operands ) or (1 custom addr mode bit + 3 bit addr mode + 32 bits operand x 1 operand)

so what are we really gaining here over 32 bits?

in other words this would 'take the pressure off' in every dimension but not add anything really new except possibly the type flags.

how many types would we want if we had a zillion specific types for fixed-length arith?

int lengths of: 1, 2, 4, 8, 16, 32, 64, 128 (8 choices, 3 bits) x signed or unsigned (1 bit) (note: 1-bit length can't be signed..) + 4 floating point choices. so far we only have about 20, so we can still fit in 5 bits
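
a quick sanity check of that count in Python (the type names are just placeholders; by this counting it comes out to 19 rather than 20, but either way it fits easily in 5 bits):

    # widths x signedness, minus the signed 1-bit case, plus 4 floating-point widths
    widths = [1, 2, 4, 8, 16, 32, 64, 128]
    ints = ['uint%d' % w for w in widths] + ['int%d' % w for w in widths if w > 1]
    floats = ['float16', 'float32', 'float64', 'float128']
    types = ints + floats
    print(len(types))   # 19, comfortably under the 32 slots that 5 bits give us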

i don't think the above 64-bit instruction format is worth the increase in implementation complexity. the type flags would be nice but the other things are inessential.

as for bigger immediates, we could have the 2-operand form be asymmetrical, making LOADI more useful, at the expense of reducing the number of registers that can be directly reached by CPY (of course we could allow different opcodes to have different operand bit size breakdowns, instead of just setting the number of operands and their types and I/O-ness, but that seems to increase complexity for little benefit; if custom opcodes really want this they can bit-twiddle the input operands but we needn't help):

fmt 2 (32 bits): 1 format bit 1 custom type bit 5 type bits 1 custom opcode bit 6 opcode bits 18 addr mode + operand bits (1 custom addr mode bit + 2 bit addr mode + 3 bits operand x 3 operands) or (1 custom addr mode bit + 2 bit addr mode + 8 bits operand x 1 operands + 1 custom addr mode bit + 2 bit addr mode + 4 bits operand x 1 operands) or (1 custom addr mode bit + 2 bit addr mode + 15 bits operand x 1 operand)

one issue is that CONV can only specify one type. One solution is to use the 3-operand form and have input operand #2 specify the other type, but then it only has 3 bits to specify the type. Another solution is to use the 2-operand form but with our new asymmetrical layout there aren't enough bits here either. Of course, the destination type needn't be immediate addressed, in which case we're fine either way (we just put the dest type into a register with a LOADI beforehand)

let's keep the option for another fmt for later, but not require one right now.

so the best fmt so far is:

fmt 2 (32 bits): 1 reserved bit 1 custom type bit 5 type bits 1 custom opcode bit 6 opcode bits 18 addr mode + operand bits (1 custom addr mode bit + 2 bit addr mode + 3 bits operand x 3 operands) or (1 custom addr mode bit + 2 bit addr mode + 8 bits operand x 1 operands + 1 custom addr mode bit + 2 bit addr mode + 4 bits operand x 1 operands) or (1 custom addr mode bit + 2 bit addr mode + 15 bits operand x 1 operand)

now, is there really any advantage to having different types? it seems like a good way to reduce the number of opcodes and make them easier to remember, but for similar reasons as not having a modifier argument above, shouldn't we just have more opcodes? the assembly could hide/infer the types..

i was thinking that you could write polymorphic algorithms but you can't, because the type is statically specified in the instruction. So why not specify the type with another register?

fmt x (32 bits): 1 reserved bit 1 bit custom addr mode for type 2 bit addr for type 2 bit type 1 custom opcode bit 6 opcode bits 18 addr mode + operand bits (1 custom addr mode bit + 2 bit addr mode + 3 bits operand x 3 operands) or (1 custom addr mode bit + 2 bit addr mode + 8 bits operand x 1 operands + 1 custom addr mode bit + 2 bit addr mode + 4 bits operand x 1 operands) or (1 custom addr mode bit + 2 bit addr mode + 15 bits operand x 1 operand)

we don't have nearly enough types this way.. but we don't really need 4 addr modes for types, do we? just register and immediate. and we can burn the reserved bit.

fmt x (32 bits): 1 bit custom addr mode for type 1 bit addr mode for type 4 bit type 1 custom opcode bit 6 opcode bits 18 addr mode + operand bits (1 custom addr mode bit + 2 bit addr mode + 3 bits operand x 3 operands) or (1 custom addr mode bit + 2 bit addr mode + 8 bits operand x 1 operands + 1 custom addr mode bit + 2 bit addr mode + 4 bits operand x 1 operands) or (1 custom addr mode bit + 2 bit addr mode + 15 bits operand x 1 operand)

also we can disallow custom addr modes for types:

fmt x (32 bits): 1 bit addr mode for type (immediate or memory) 5 bit type 1 custom opcode bit 6 opcode bits 18 addr mode + operand bits (1 custom addr mode bit + 2 bit addr mode + 3 bits operand x 3 operands) or (1 custom addr mode bit + 2 bit addr mode + 8 bits operand x 1 operands + 1 custom addr mode bit + 2 bit addr mode + 4 bits operand x 1 operands) or (1 custom addr mode bit + 2 bit addr mode + 15 bits operand x 1 operand)

now we can write polymorphic algorithms and save on cache space :)

let's reorder to put the opcode first so that a sequential bitstream reader can start decoding the operands ASAP

fmt a (32 bits): 1 bit addr mode for type (immediate or memory) 1 bit custom opcode 1 bit custom type 6 opcode bits 5 bit type operand 18 addr mode + operand bits (1 custom addr mode bit + 2 bit addr mode + 3 bits operand x 3 operands) or (1 custom addr mode bit + 2 bit addr mode + 8 bits operand x 1 operands + 1 custom addr mode bit + 2 bit addr mode + 4 bits operand x 1 operands) or (1 custom addr mode bit + 2 bit addr mode + 15 bits operand x 1 operand)

one problem with this is that we see 3 different conceptions of 'register-like' parts of memory in 3 places; 64 locations (types), 16 locations (destination operand in 2-operand form), and 8 locations (operands in 3-operand form). One thing we could do is if the type addr mode is not 'immediate', use two more bits for addr mode and custom addr mode, leaving 4 bits for the operand. Now we only have 3 fixed addr modes for types instead of 4. If we use the same trick in the normal operands, we can achieve 5-bit immediate operands, at the expense of 1 addr mode and some decoding complexity. Note that this does NOT increase the number of registers that we get, so it's questionable how much it even helps; all it does is let us express constants of up to 32 instead of 8 without a LOADI, but there's an extra step in decoding. i'm guessing this extra step in decoding would defeat any improvements in code density.

if we have 7 bits for types including addressing mode, we could have 4 fixed addressing modes and no custom, leaving 5 bits for types including custom, or 16 fixed types. This seems like either just enough or too few. so, i guess the above is a good compromise.

a silver lining is that we could declare that 'type locations' above 16 are actually slow-changing type parameters in a special memory region, e.g. an optimizing interpreter could watch for when these locations change and then recompile the polymorphic code into immediately-typed versions.

note that if the first bit is '0', then the next 13 bits can be treated as an extended opcode. If the first three bits are '000', then the next 11 bits statically determine the code needed to execute the instruction (hmm.. unless the object is dynamically typed).

fmt a (32 bits): 1 bit addr mode for type (immediate or memory) 1 bit custom opcode 1 bit custom type 5 bit type operand 6 opcode bits 18 addr mode + operand bits (1 custom addr mode bit + 2 bit addr mode + 3 bits operand x 3 operands) or (1 custom addr mode bit + 2 bit addr mode + 8 bits operand x 1 operands + 1 custom addr mode bit + 2 bit addr mode + 4 bits operand x 1 operands) or (1 custom addr mode bit + 2 bit addr mode + 15 bits operand x 1 operand)

ok, now if the first three bits are '000' and the next five bits are not 00000, then bits 4:14 statically determine the code needed to execute the instruction

mb better would be to put that last check on the custom side, e.g. custom type 00000 is fixed at 'dynamic'. Now we can say: if the first three bits are '000', then the next 11 bits statically determine the code needed to execute the instruction.

some types: klambda types, fixed-length 1d array, variable-length 1d array terminated by sentinel, n-d array (or mb even a dataframe), ints, floating points, UTF-8, linked list?, jasper graph

arg should we even expose type 'formats' like bit widths and linked-list vs. variable length terminated array? or just interfaces? if the latter we don't need so many types.. but how to find the format otherwise?

i think we should expose them.. the whole point here is to be able to quickly execute code without necessarily checking a boxed dynamic to find out its actual type (format) before dispatching.. to be as quick as RPython

python's built-in types: http://docs.python.org/3/library/stdtypes.html

see also http://docs.python.org/3/library/types.html#standard-interpreter-types

rpython's types: http://doc.pypy.org/en/latest/rtyper.html

some potential addr modes if you had a base and an index operand and a scale:

immediate; b + i*s

in other words: immediate value; indirect on b?; indirect on i?; indirect on sum of b and i (after possible indirection)?; 2 bits for pre/post inc/dec

so: 5 bits plus 1 value

note: http://skilldrick.github.io/easy6502/ says that 6502 "Indexed indirect" mode, which is *(b + i), is rarely used, while "Indirect indexed", which is *b + i, is used often. i don't think 6502 even has *i modes. So the remaining modes are:

immediate; b; b+i*s (later note: i guess we'd never use this, this is a register that is computed arithmetically from three immediates)

can use the 00 predec/postinc bits for immediate: 1 bit: b or *b; two predec/postinc bits: 00: side-effectless, 01: pre/post on i, 10: pre/post on b or *b, 11: immediate (push-to in output operands)

umm actually i guess indexed indirect is *(b+*i) and indirect indexed is *b + *i in our notation, because the 6502 had an X register, which in our case would be a memory location

this doesn't cover pop-from accesses tho (pop from middle of stack)

note that since i isn't separately immediate (or is it?) we need to treat case 'b' separately

note that we don't need mainstack vs. secondary stack, because here we have both b and i, not just one of them

some of these combinations aren't as useful as others. e.g. pre/post on b or *b is probably only relevant in b or *b mode, and pre/post on i is probably the only thing relevant when i is involved.

so: immediate b (register/memory)

so 8 modes

pdp-8: "Z/C bit: Page 0 or current page (with this instruction)"

another use of 64-bit instructions could be to have a base register and an index register

note/todo: in the above, i've begun to stray from my original goal to have the instruction set be able to accommodate any fixed bit widths in pointers, integers, etc

i'm going back and forth on the 64-bit instructions.. the original goal was to make things elegant, right? isn't it more elegant to be able to do a base reg + index reg effective address calculation on any operand? isn't it more elegant to have 8 addressing modes, or at least to have 1 bit say whether we are using a side-effectful addressing mode?

but i don't want to make all the other instructions waste so much space.. and i don't want to make implementors implement multiple formats.. arg

probably once i make Jasper and then Jasper Core, i'll find new features that i want in Jasper Assembly, so we'll probably end up with 64 bits

so what are minimal, small, comfortable, desirable, and luxurious levels for the various things?

format bits: 0, 0, 0, 1 or 2, unary

size of operands: 1, 2, 3, 6 or 8 (allows base/index indexing with 3 or 4 bits ea), 9 or 12 (base/index/scale addressing with 3 or 4 bits ea)

# of operands: 0 (stack), 2, 3, 3, 4 (allows modifier operand)

# of addr modes: 1, 2, 4 * 2 for custom, 8 * 2 for custom, 16 * 2 for custom

opcode bits: 3, 4, 5 + 1 for custom, 6 + 1 for custom

type bits: 0, 3, 4 + 1 for custom, 5 + 1 for custom

type flag bits: 0, 0, 0, 4 * 2 for custom

type width bits: 0, 0, 0, 3

type and type width addr mode bits: 0, 0, 0, 1 or 2 or 3 or 4

barrel shifting bits on operands: 0, 0, 0, 4

when we have less operands we can put the savings into either: smaller instructions, or larger immediate operand length(s)

note: for the type and type scale operands, we need more bits if we want to do indirect indexing than we need for immediate mode representations of the concrete types and type scales. This suggests not separating the types and type scales after all?

fmt x (64 bits): 2 bit addr mode for type 1 bit custom opcode 1 bit custom type 4 bit type operand 2 bit type scale addr mode 4 bit type scale operand 2 bit opcode addr mode 6 opcode bits 18 addr mode + operand bits (1 custom addr mode bit + 5 bit addr mode + 3 bits operand x 3 operands) or (1 custom addr mode bit + 5 bit addr mode + 8 bits operand x 1 operands + 1 custom addr mode bit + 2 bit addr mode + 4 bits operand x 1 operands) or (1 custom addr mode bit + 5 bit addr mode + 15 bits operand x 1 operand)

ok without type/type scale separation:

fmt x (64 bits): 1 format bit 2 bit addr mode for type (immediate, register, *b + (i*s), *b + (*i)*s) 8 bit type operand (incld. custom) 2 bit opcode addr mode (immediate, register, *b pre on b, *b + (*i)*s pre on i) 8 opcode bit operand (incld. custom) 2 bit addr mode for type flags 4 bit type flags (incld. custom) 36 addr mode + operand bits (1 custom addr mode bit + 3 bit addr mode + 8 bits operand x 3 operands) or (1 custom addr mode bit + 3 bit addr mode + 18 bits operand x 1 operands + 1 custom addr mode bit + 3 bit addr mode + 10 bits operand x 1 operands) or (1 custom addr mode bit + 3 bit addr mode + 32 bits operand x 1 operand) 1 reserved bit

where the type flags could be used for something else, so really we have 7 bits to spare

ideas for type flags are: nullable, isPromise, isArray. however this is more than 2; and also it may not make sense to have separate type flags if they can't be used in fmt x2, below.

this doesn't seem so awful. It allows semi-dynamic polymorphism via the possibility of a table of types, and also fast interpretation via the register or memory indirect opcode specifications.

but with 64 bit instructions, we definitely need either packages or variable length.

going with packages:

fmt x2 (4x16 bits): 1 format bit 1 bit stack or accumulator format 4 bit type operand (incld. custom) 5 opcode bit operand (incld. custom) 5 addr mode + operand bits (2 bits addr mode + 3 bits operand x 1 operands)

reg. 0 is PC, reg. 1 is stack pointer, reg. 2 is accumulator

in fmt x2, only the second operand is given (or the only one, if there is only one operand); in stack format, the first argument is popped from the stack and the third (if any) is popped or pushed to the stack, and in accumulator format, the first argument is read from the accumulator register and the third (if any) is read from or written to the accumulator register. The addressing modes for the second operand are: 0: immediate, 1: register, 2: memory indirect, 3: memory indirect pre/post (stack access)

note that the 16-bit form can be pretty easily 'expanded' to the 64-bit form; the implicit operands are just zeros and ones and twos with addressing modes 1 and 3.

todo: we only did 1 format bit in the packet so we have 3 spare bits

hmm, actually, in the 64-bit format, you wouldn't use opcode indirection for emulation or interpretation because the operands are still the same. But you would use this for functions-as-a-first-class-object.

so you can use it for closures, partial function application, etc. So this is a very interesting feature.

So i think we should give it all the addressing modes then. May as well do the same with types?

fmt x (64 bits): 1 format bit 3 bit addr mode for type 8 bit type operand (incld. custom) 3 bit opcode addr mode 8 opcode bit operand (incld. custom) 36 addr mode + operand bits (1 custom addr mode bit + 3 bit addr mode + 8 bits operand x 3 operands) or (1 custom addr mode bit + 3 bit addr mode + 18 bits operand x 1 operands + 1 custom addr mode bit + 3 bit addr mode + 10 bits operand x 1 operands) or (1 custom addr mode bit + 3 bit addr mode + 32 bits operand x 1 operand) 5 reserved bits

so now functions that want more than 2 operands can simply return a partial application, which can be called using opcode indirection

note: all operands are at least 8 bits so that when split in half for indexing addressing modes, they are still at least 4 bits, yielding 16 registers

since we have more type bits now, we might use some of these as flags

if custom opcodes are just macros, why is it important to have them, and to have functions(opcodes)-as-a-first-class-object? Can't we just use CALL and RET like normal people? The main reason is that custom opcodes are NOT always just macros; in many cases, Jasper Assembly will be compiled or interpreted onto a lower-level platform; the implementation of custom opcodes can be on the platform rather than in Jasper Assembly. So this allows for more efficient implementations of things, by using platform primitives, than you could get by just using CALL and RET. Otoh, you could just access underlying memory and CALL into that; the assembly language's CALL could even manage the argument-passing convention translation for you (e.g. the assembly language would define One True Argument-Passing Convention, and then it would register external functions and their preferred argument passing convention, then it would look up the external function's argument passing convention whenever you CALL it, and translate from the One True argument passing convention to that of the external function). I suppose any assembly language could replace all of its opcodes with just CALL with an additional operand that indexes into a jump table of system primitives. Which looks a lot like our opcode indirection...
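
here's a little Python sketch of that registration-and-translation idea (the convention names and the registry are invented for illustration; this isn't a committed design):

    # external functions are registered together with their own convention,
    # and CALL translates from the 'one true' convention (a single list of args)
    EXTERNALS = {}

    def register_external(name, convention, fn):
        EXTERNALS[name] = (convention, fn)

    def call_external(name, args):
        convention, fn = EXTERNALS[name]
        if convention == 'positional':      # spread the list into positional args
            return fn(*args)
        elif convention == 'single-tuple':  # callee wants one tuple argument
            return fn(tuple(args))
        else:
            raise ValueError('unknown convention: %r' % convention)

    register_external('add3', 'positional', lambda a, b, c: a + b + c)
    register_external('sum_tuple', 'single-tuple', lambda t: sum(t))
    print(call_external('add3', [1, 2, 3]))       # 6
    print(call_external('sum_tuple', [1, 2, 3]))  # 6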

which types to support in the 8-type set?

according to "operand size usage" in http://www.ece.northwestern.edu/~kcoloma/ece361/lectures/Lec04-mips.pdf , 32-bit ints and 64-bit floats are the most used widths for ints and floats

type ideas:

dynamic, 1-bit, uint32, int32, float64, int (arbitrary precision), rational, decimal, function, utf8, promise/thunk, array, plus the other ones from rpython, from haskell, from lua, etc

for ops with naturally more than 3 inputs, what should the convention be? partial application, or take all inputs from the stack, or pass a pointer to a list or table of inputs? do we support keyword arguments at this level, or is that more of a Jasper Core thing?

i guess the Nock operator is like: have two functions f and g, each of which take a single input (which is probably a pointer to a table of inputs). have an input 'state'. Do (f(state))(g(state)), that is, call f(state), get back a function pointer, call g(state), call the function pointer giving the result of g(state) as the argument.
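
in Python that call pattern is just (the state layout below is made up, purely illustrative):

    # the 'Nock-like' call pattern described above:
    # f(state) returns a function, g(state) returns its argument
    def nock_apply(f, g, state):
        return f(state)(g(state))

    # toy usage: f picks out a function from the state, g picks out data
    state = {'op': lambda x: x * 2, 'arg': 21}
    print(nock_apply(lambda s: s['op'], lambda s: s['arg'], state))  # 42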

idea: what about multiple return values? addition on fixed-width integers traditionally returns both a result and a carry bit; traditionally the carry bit is stored in a special 'status register' (conceptually; actually it may be located at the ALU). where do we put the carry bit? we could put it on the stack but then either the signature of fixed-width addition is different from arbitrary-precision addition, or there is junk on the stack that arbitrary-precision code must deal with. also, traditionally, i'm not sure but i think the carry bit served as a third input to the next addition, if the next one was the 'with carry' form of addition.

so traditional 'with carry' addition actually has three inputs and two outputs; in1, in2, carry-in, out1, carry-out, where carry-in and carry-out are just 1 bit.

traditionally there are 4 status flags important to the ALU

so i guess we could spend 4 of our 5 reserved bits in the long format on 2 more operands with no addr mode (addr mode is implicitly 'memory/register')?

and maybe the fifth bit on an addr mode for one of those, to choose between immediate and register (a special register mode in which reg 1 is top-of-stack, not the stack itself)?

fmt x (64 bits): 1 format bit 3 bit addr mode for type 8 bit type operand (incld. custom) 3 bit opcode addr mode 8 opcode bit operand (incld. custom) 41 addr mode + operand bits (1 custom addr mode bit + 3 bit addr mode + 8 bits operand x 3 operands + 1 addr mode bit + 2 bit operand + 2 bit operand) or (1 custom addr mode bit + 3 bit addr mode + 18 bits operand x 1 operands + 1 custom addr mode bit + 3 bit addr mode + 10 bits operand x 1 operands + 1 addr mode bit + 2 bit operand + 2 bit operand) or (1 custom addr mode bit + 3 bit addr mode + 32 bits operand x 1 operand + 1 addr mode bit + 2 bit operand + 2 bit operand)

for the short format, if in stack mode, the status result is placed on/taken from the top of the stack, and if in accumulator mode, it is read from/written to a second special register.

Could put PC, stack, accumulator into higher memory so that all of the first 4 regs are free for status. Or, could leave them there, with the special behavior that writing status to the PC skips if the status is 1, and reading status from/writing to the stack is a POP/PUSH (not an adjustment to the stack pointer). But now aren't there 4 status flags, not just 1? Do we really need more than 1? most systems have at least 4 flags: negative, zero, carry, overflow (NZCV). I think addition uses carry and overflow. i think overflow is just for signed quantities, and it records when the sign flips because you overflowed, e.g. if you have 1 byte, signed, and it represents -128:127, and you add 120 + 20, you'll get -116 with overflow. but it seems to me that you could just use one carry flag for this? todo
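
to make the carry-vs-overflow distinction concrete, here's textbook 8-bit add-with-carry in Python, returning the result plus N/Z/C/V (this is just the standard behavior, sketched for reference, not a committed Jasper design):

    # 8-bit add-with-carry: three inputs (a, b, carry-in), two outputs (result, flags)
    def adc8(a, b, carry_in=0):
        raw = (a & 0xFF) + (b & 0xFF) + carry_in
        result = raw & 0xFF
        n = (result >> 7) & 1                 # negative (sign bit of result)
        z = int(result == 0)                  # zero
        c = int(raw > 0xFF)                   # carry out of bit 7 (unsigned overflow)
        # signed overflow: inputs had the same sign but the result's sign differs
        v = int((~(a ^ b) & (a ^ result) & 0x80) != 0)
        return result, (n, z, c, v)

    print(adc8(120, 20))   # (140, (1, 0, 0, 1)): as a signed byte that's -116, V set but C clear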

So now the special registers are:

0: PC 1: other status-accessible register 2: stack 3: accumulator

another idea would be to just have special status registers, and special opcodes to read/write to them directly; this is how many processors work.

idea: btw the previous 'magic' when writing to PC and to stack could be a special addressing mode which is what you get if you select 'immediate' for an output. can we think of any 'magic' that should occur when writing to an ordinary address in this mode?

idea: a criterion for the short format is that the inner loop in basic arithmetic routines/numerical computation should be able to be efficiently written using the short format; so the opcodes chosen should support this. the idea being that the purpose of the short version is to help inner loops fit into small instruction caches.

the problems with ad-hoc polymorphism

One decision yet to be made is if we are going to have ad-hoc polymorphism, e.g. different implementations of the same opcode depending on the type field in the instruction.

hmm.. upon reflection this is a serious disadvantage of the addition of ad hoc polymorphism to bytecode..

if we have 6 bits of opcode and 6 bits of type, then the dispatch table alone is 4k, which is 2x larger than the entire SPIN interpreter for the Parallax Propeller MCU. Also some chips have L0 instruction caches of 4k (e.g. Qualcomm Snapdragon S4 Krait http://www.anandtech.com/show/4940/qualcomm-new-snapdragon-s4-msm8960-krait-architecture/2 ), we don't wanna burn the whole cache on a mostly-empty dispatch table, do we? Even Intel Sandy Bridge only has ~6k L0 cache: http://www.hardwaresecrets.com/printpage/Inside-the-Intel-Sandy-Bridge-Microarchitecture/1161 although this is just for microinstruction decoding, and the L1 instruction cache is 32k.
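
rough numbers, plus one obvious mitigation (a sparse table with a generic fallback; all names here are invented for illustration):

    # sizes in entries, not bytes; multiply by pointer width for actual memory use
    OPCODE_BITS, TYPE_BITS = 6, 6
    dense_entries = (1 << OPCODE_BITS) * (1 << TYPE_BITS)
    print(dense_entries)              # 4096 entries for a dense opcode x type table

    # a sparse table: only (opcode, type) pairs with a distinct implementation
    # get an entry, everything else falls back to a generic handler
    sparse = {('ADD', 'int32'): 'add_i32', ('ADD', 'float64'): 'add_f64'}
    def dispatch(opcode, typ):
        return sparse.get((opcode, typ), 'generic_' + opcode.lower())
    print(dispatch('ADD', 'int32'))   # add_i32
    print(dispatch('XOR', 'int32'))   # generic_xor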

is there any way out of the dilemma? we could:

i guess the best compromise would be to leave it up to the implementation:

however i'm seriously considering abandoning the whole idea of ad hoc polymorphism in the language. clearly since most opcodes have few or no polymorphisms, the most efficient thing is to just have different opcodes for different ad hoc polymorphisms. you could still support bit-length polymorphism via parametric polymorphism; e.g. you could have "add-integer (32-bits)" but there would be a different opcode for "add-float (32-bits)".

i think that's the way to go..

which suggests that there's some lower-level (or just core of the ISA) 'interface' that gets and puts streams of bits for you depending on the type..

hmm.. on second thought i don't see how to separate ad-hoc from parametric when the use of polymorphism is to present a higher-level API over heterogeneous lower-level platforms. if we're trying to abstract from linear memory layout then why privilege bit-width over e.g. endianness, whether two's complement or a sign bit is used to represent signed values, etc. One might object that bit-width has semantics (a maximum value) that endianness does not.

instruction encoding

i did a bunch of thinking about this; old ideas and justifications for the old ideas and for current ideas can all be found in jasperAssemblyOld1.txt. Not to say that this is even close to right, it may still change.

fixed encoding proposal

The current proposal is for a 'main' instruction encoding of fixed length of 64 bits, and a 'compact' encoding which is also 64 bits, but which packs in multiple instructions.

There are two current proposals for the main encoding.

fixed encoding proposal : main encoding proposal 1: 16 registers

fmt 0 (64 bits): 1 format bit 3 operand type application bits (see below) 12 bits for opcode, as: (1 custom addr mode bit + 3 bit addr mode + 8 bits operand) 12 bits for type and flags, as: (1 custom addr mode bit + 3 bit addr mode + 8 bits operand) 36 bits for operands and their addressing modes

note: when there are 3 operands, each operand is 8 bits (so, including the 4 addressing mode bits, we have 3 12-bit fields); when there are two, the first operand is 16 bits and the second is 12 bits; when there is 1 operand, it is 32 bits

operand type application bits: 3 bits choose between 8 choices; X means the type is applied to this operand; - means another operand is present, but its type is determined by the instruction:

--- X X- -X XX X-- -X- XX-

the operand type application bits, while usually redundant with the opcode, allow decoding of the operand fields to begin before looking up the opcode. If an opcode accepts more operands than are provided, the extra operands are passed as 0. An opcode has access to the operand type application bits if it would like to see which operands were provided by the program and which were assumed to be 0.

todo: we left 8 bits for both opcode and type for uniformity but perhaps it would be better to have these have different bit lengths. The implementation already has to deal with different lengths for the operands.

this is called the 16-register proposal because, if using the indexing addressing modes, the operand will be split in 2, leaving 4 bits each to specify base registers and index registers; so 16 registers will be able to be used as base or index.
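
a Python sketch of pulling the fields out of a 64-bit word under this proposal (assuming, arbitrarily, MSB-first packing in the order listed above and all three operands present; the field order and bit direction are my illustrative assumptions, not part of the proposal):

    def decode_fmt0(word):
        def take(nbits, pos):
            return (word >> (64 - pos - nbits)) & ((1 << nbits) - 1), pos + nbits
        pos = 0
        fmt, pos = take(1, pos)
        type_app, pos = take(3, pos)        # operand type application bits
        opcode_field, pos = take(12, pos)   # 1 custom + 3 addr mode + 8 operand
        type_field, pos = take(12, pos)     # 1 custom + 3 addr mode + 8 operand
        operands = []
        for _ in range(3):                  # 3 x 12 = 36 operand bits
            f, pos = take(12, pos)
            operands.append({'custom': f >> 11, 'addr_mode': (f >> 8) & 0b111,
                             'operand': f & 0xFF})
        return fmt, type_app, opcode_field, type_field, operands

    print(decode_fmt0(0x0123456789ABCDEF))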

note: we could free up 2 bits by disallowing custom addressing modes for opcodes, types, and flags, at the cost of some uniformity

how instructions that really take 3 inputs and/or yield 2 outputs work: use the status register for the extra input/output if it is a bool, or the accumulator otherwise (using the accumulator for this would prevent using this instruction in compact accumulator mode, though; see below).

now e.g. ADD uses the status register (location 1) as carry input, and then overwrites it with carry output (so ADD takes two input operands, one output operand, one implicit 1-bit input operand, and one implicit one-bit output operand).

fixed encoding proposal : main encoding proposal 2: 8 registers

this proposal is similar to the previous one, except that we use 6 bits for everything instead of 8 bits. This frees up 10 bits that we can use for flags; and it makes the dispatch table smaller for opcode + type (see 'the problems with ad-hoc polymorphism' above).

fmt 0 (64 bits): 1 format bit (determines whether this instruction is 'main format' or 'compact format') 3 operand format/type application bits (see above; ---, X, X-, -X, XX, X--, -X-, XX-) 10 bits flags: 1 bit custom addr mode + 3 bit type addr mode + 6 bit flag operand = 10 bits x 1 (possibly including 1 conditional execution bit? also should we really allow indirection for these flags? having optional laziness suggests yes, but having conditional execution suggests no) 10 bits type: 1 bit custom addr mode + 3 bit type addr mode + 6 bit type = 10 bits x 1 10 bits opcode: 1 bit custom addr mode + 3 bit addr mode + 6 bits operand = 10 bits x 1 30 bits operands: 1 bit custom addr mode + 3 bit addr mode + 6 bits operand = 10 bits x 3 operands

again, the last 30 bits are three 10-bit fields if there are 3 operands, but different when there are fewer; when there are 2 operands, the first one is 16 bits and the second is 6 bits; when there is 1 operand, it is 26 bits.

note: we could free up 3 bits by disallowing custom addressing modes for opcodes, types, and flags, at the cost of some uniformity

fixed encoding proposal : discarded main encoding proposal: each operand has a type

if we didn't have a format bit we could do that:

fmt x (64 bits): 2 bit type addr mode + 5 bit type + 1 bit custom addr mode + 2 bit addr mode + 6 bits operand = 16 bits x 4 = 64 bits

if we disallowed a type on the third operand, then we only need 3 types and 4 operands:

fmt x (64 bits): 7 reserved bits 2 bit type addr mode + 5 bit type = 7 bits x 3 operands + 1 bit custom addr mode + 2 bit addr mode + 6 bits operand = 9 bits x 4 operands = 57 bits

so we could get back to 3 bit addr modes:

fmt x (64 bits): 3 reserved bits 2 bit type addr mode + 5 bit type = 7 bits x 3 operands + 1 bit custom addr mode + 3 bit addr mode + 6 bits operand = 10 bits x 4 operands = 61 bits

fixed encoding proposal : compact encoding proposal

this proposal is unfinished but the crux of the idea is to not specify as many operands in the instruction, and to use either 'stack mode' or 'accumulator mode' to get the other operands (stack mode: pop input operands off the stack, push outputs onto the stack; accumulator mode: the second and third operands are both taken to be the accumulator register).

variable encoding proposal

thoughts on variable encoding proposal

one of the original ideas of Jasper Assembly was to allow variable bit-widths, rather than a standard bit-width. However, then i decided i wanted a fixed-length encoding because i wanted to be able to decode instructions in parallel, and also because i wanted a design constraint to enforce simplicity. Going thru the exercise above shows how demanding a fixed-length encoding leads to a host of other design decisions, which cause the resulting language to have a similar 'feel' to traditional assembly. For example:

And if you want to minimize encoding size, you get:

These decisions do seem to be demanded by efficiency, and after all, what's the point of a bytecode intermediate language for Jasper if it is not faster to execute than a higher-level 'core' language?

However, the ultimate point of the exercise of thinking about Jasper Assembly was not supposed to be efficiency, but elegance. Perhaps, having thought through the consequences of an efficient fixed-length encoding, we should now eschew it and design an assembly which is not as efficient but also not constrained by a fixed-width straitjacket.

Human natural language suggests that this might be a good idea. Human sentences have an unbounded number of modifiers, etc.

Could this be efficient? I'm not sure. Parsing a language is definitely computationally expensive. So perhaps we should conclude that in speech communications, humans appear to be bottlenecked by communications bandwidth, not by computational power within the hearer's brain. Alternately, perhaps the cultural process of evolving a language is more conducive to variable-length rather than fixed-length sentences (but i think that is unlikely because there are plenty of other specific, weird grammatical rules and structures that languages evolve).

Building on human languages, we could have phonemes, morphemes, words (lexemes), phrases, sentences. Phonemes seem related to a lower-protocol level (e.g. they are in speech but not 'really' in writing), so we'll neglect them. So we have morphemes, lexemes, phrases, and sentences. We could still potentially have fixed-length morphemes or fixed-length lexemes but variable length phrases and variable length sentences; i'm not sure if that would be a significant boon to parallel decoding.

However, in any case, the fact that there is an 'interpretation' step in between recognizing the basic units of the language, and actually doing something by following instructions, will probably slow things down. The thing about bytecode is that each instruction can be executed as soon as it is decoded. What we are talking about here is not taking action until each entire sentence is decoded. One of the gains in parallel decoding comes from the possibility of superscalar execution, including speculative execution stuff like speculative branching. It seems to me that this could no longer be done until the sentence level in this case. The human brain does not seem to me to be decoding multiple sentences in parallel, a warning sign that this may be difficult.

In other words, even if lexemes were fixed-length, what matters is the unit at which point we can actually execute something, which in that model would be the sentence. So really, 'sentences' seem to be the 'instructions' here; lexemes are just parts of instructions, and so we'd have variable-length instructions, even if lexemes were of fixed size.

If all we really need is for the next parallel decoding unit to be able to find the beginning of the next sentence before the previous decoding unit has looked at the previous sentence, all we need is some sort of unambiguous 'sentence delimiter' that signals the end of one sentence/the start of the next. I had thought of this earlier in the form of something like 'maybe a run of four 1s is an instruction delimiter' but discarded it because that meant run-length limited (RLL) coding of all values elsewhere, which i figured was inelegant and complicated to implement. But now that we are breaking sentences down into lexemes, the possibility exists for one lexeme (one lexemic 'opcode' if you will) to designate a sentence delimiter (analogous to a period in English, or a semicolon in many programming languages), and for other lexemes to indicate other things. This cannot be decoded in parallel as quickly as fixed-length sentences (analogous to programming code with one statement per line, when displayed left-justified on a screen; your eye can easily find the start of the next sentence without scanning), but at least there are not serial dependencies like in x86 (where you cannot just look for a delimiter lexeme to find the next instruction, you must begin to decode the previous instruction until you determine its length).

So to summarize:

the reason we want variable-length lexemes is that we want to be able to express arbitrarily large numeric literal values in a single lexeme.

now, this sounds like a high-level language which needs to be parsed, etc. Is there any reason to think of this as 'bytecode' or 'assembly'? Well, the least we can do is 'pre-parse' it by having meta-data such as having each phrase have a header with the length of the phrase, etc. Perhaps we should have a bunch of redundant 'parse-ish' data, sort of like having a string be both Pascal-style and C-style (both a header that tells its length, and a delimiter to terminate).

Since morphemes are fixed-length, anything that has a length header either has a maximal length, or must use some sort of more complicated encoding for large lengths. The latter sounds good to me; just use the convention that if the maximal length is indicated, that really means that the value is reached by concatenating this item and the next one. The next item has its own length header which may again be the maximal value. E.g. if the length header was 8 bits, then a string that started with a header of 255 means 'the 255 characters following this are part of the string; in position 256 is another length header for the next part of the string' (note that a length header might say zero).
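
a Python sketch of that continuation convention with an 8-bit length header (the exact header width and byte layout are just for illustration):

    # 'length header of 255 means the value continues with another headered chunk'
    def encode_chunked(data):
        out = bytearray()
        while True:
            chunk, data = data[:255], data[255:]
            out.append(len(chunk))
            out += chunk
            if len(chunk) < 255:        # a short (possibly zero-length) chunk ends the value
                return bytes(out)

    def decode_chunked(buf, pos=0):
        out = bytearray()
        while True:
            n = buf[pos]; pos += 1
            out += buf[pos:pos + n]; pos += n
            if n < 255:
                return bytes(out), pos  # also return where the next item starts

    msg = b'x' * 300
    assert decode_chunked(encode_chunked(msg))[0] == msg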

Following the path of redundancy, after the end of the string could also be a terminator.

Going along this further, we could also have explicit parallelizability annotations (see EPIC, Explicitly parallel instruction computing) that let the runtime know which parts of the code could be executed in parallel.

Further redundancy to make things easier for the runtime: the combination of instruction opcode/ad-hoc polymorphic variant could be statically resolved whenever possible to a specific jump address.

now as for the fixed-length morphemes. Perhaps there could be a unary coding of morpheme (and often lexeme) 'type' at the beginning. Some morpheme types:

those could be renamed in an easier-to-understand fashion..

do we really want both length headers and also delimiters at every level? at the lowest level (lexeme) this would add a lot of space, which will greatly increase program size and hurt cache efficiency and embedded systems use. note that in the common case we'll have short sentences with one- and two- lexeme phrases and short lexemes.

also note that since lexemes are unbounded in length, that either a long lexeme must be interrupted by continuing length headers for the phrase and sentence in which it is embedded, or the length headers for phrases and sentences must themselves be variable-length literals

so let's say that single-item elements are headerless and footerless (except for literals, which must have a header or a footer or both, because we don't want to put any constraints on the format of the data inside of them), to save space. This is easy to decode because if you are looking at what might be a header and it's not a header, then it's a single-item element.

3-format proposal

First two bits are format bits:

first bit: compact or non-compact encoding

second bit (only in non-compact encoding): if non-compact encoding, then is this a new instruction or a continuation?

if this is a new instruction, then next we have a count in elias gamma format that says how many 64-bit (32-bit?) regions must be concatenated to make this instruction.
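
for reference, Elias gamma coding of a positive integer n is floor(log2 n) zero bits followed by n in binary; a Python sketch (using '0'/'1' strings rather than packed bits, just for clarity):

    def elias_gamma_encode(n):
        assert n >= 1
        binary = bin(n)[2:]                      # n in binary, starts with a 1 bit
        return '0' * (len(binary) - 1) + binary

    def elias_gamma_decode(bits):
        zeros = 0
        while bits[zeros] == '0':
            zeros += 1
        n = int(bits[zeros:2 * zeros + 1], 2)
        return n, bits[2 * zeros + 1:]           # the value and the remaining bits

    print(elias_gamma_encode(1))            # '1'
    print(elias_gamma_encode(5))            # '00101'
    print(elias_gamma_decode('00101rest'))  # (5, 'rest')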

on each 64-bit (32-bit?) boundary, we again have the first two bits above, which in this case are 00. Then we continue with the payload.

it would be nice if we could make the non-extended but non-compact format fit in 32 bits instead of 64. but that's hard; we have 29 bits left after the above; if we need at least 5 fields, that's only 5 bits per field; 2 of these are addr mode; so only 3 bit operands and opcodes, plus 4 bits left over.

also it's not yet clear how to make the format very simple to implement and compact in the common case yet still have arbitrary lengths of things.

my best idea so far is to make operands really short and have them be all 1s if they need to be extended. instead of having elias gamma at the beginning, have a fixed length which again is 1111 if it is exceeded.

so e.g.

2 bits for format 2 bits for number of additional 32-bit fields opcode: 2 bits addr mode, 5 bits opcode type: 2 bits addr mode, 3 bits operand, 1 bit if the type of the other guy is different 3 x (2 bits addr mode + 3 bits operand)

seems to me that 'if the type of the other guy is different' we should just use a larger fixed-size instruction format.

so, first bit:

0 compact (16 bit format)

note: in compact format, for ea. operand include 1 bit saying: stack or accumulator; even better would be 2 bits: stack, stack-1, acc0, acc1

so:

compact 16-bit format:

format: 1 bit opcode: 5 bits type: 4 bits 3 x (2 bits per operand)

first two bits: 0?: compact; 10: 32-bit format; 11: 64-bit format or larger (in other words, so far we have unary)

in 32-bit format:

format: 2 bits opcode: 2 bits addr mode, 5 bits opcode type: 2 bits addr mode, 3 bits type 3 x (2 bits addr mode + 4 bits operand)

in 64-bit format:

format: 3 bits operand parsing format: 2 bits opcode: 4 bits addr mode, 6 bits opcode type1: 4 bits addr mode, 5 bits type type2: 4 bits addr mode, 5 bits type 3 x (4 bits addr mode + 6 bits operand) 1 bit to spare

and to reduce complexity, consider eliminating the 32-bit format, although it may be too annoying to have to emit packets of 4 16-bit instructions at once. if the 32-bit format is kept, then either we choose a base size of 32 bits, in which case we have 2 format bits in the middle of the 64-bit instruction (so that on every 32-bit boundary we have a format indicator), or we choose a base size of 64 bits, and require each 64-bit packet to be one of either 64 / 32-32 / 32-16-16 / 16-16-32 / 16-32-16 / 16-16-16-16 (incidentally in this case we save one bit per packet for the 16- and 32-bit containing formats by just having one 3-bit format indicator at the beginning of each 64-bit packet).

oh dear, this is looking rather like a fixed-length encoding again..

addressing modes

types

swagger types:

https://github.com/wordnik/swagger-spec/blob/master/versions/1.2.md

jasper data types

CLR 15 primitive types: bool, byte, sbyte, char, decimal, double, float, int, uint, long, ulong, object, short, ushort, string

moarvm primitive types: int, num, str, object

dis primitive types: byte, word, string, real, big (and some short versions of some of these)

instruction set

inspired by MIPS and DLX

add, sub, mul, div; and, or, xor, nor?; beq, bne; j, jal; mov; slt (set to 1 if less than), sgt, sle, sge, seq, sne; lsl, lsr, asr; lhi; trap; cvt; rfe

inspired by (sorta) intersection of AVR, PIC, ARM

todo make specific

from my book notes:

In summary, it seems like a reasonable 'common core' would consist of:

mov, jump, call, addition, subtraction, bitwise arithmetic (and/or/not/xor), rotate right, bit clears, NOP, a way to make some hardware-specific calls, branch on zero, branch on condition flag, ways to get and set the condition flags, an operand to specify a destination register for each instruction, load/store, relative jumps, <= etc comparisons/branching, LSL, LSR, ASR, carry/no-carry forms of addition and subtraction, access to the stack pointer, register indirect addressing for loads, PUSH, POP, single instruction bit set/clear, skip/branch-if-bit.

A slightly extended core would also have MUL, post-increment addressing, multiply-accumulate, division, increment, decrement, swap nibbles.

todo look at JVM and CIL

instruction ideas

some ops ideas:

dup, swap

mov, sto, load

clear (transfer zeros to destination), set (transfer ones to destination)

pushto, popfrom

add, sub, mul, div, mod, abs, neg

gt, geq, lt, leq, eq, neq, eq (struct), neq (struct)

6 shift and rotate from here? i thought two of these were supposed to be the same http://umcs.maine.edu/~cmeadow/courses/cos335/COA10.pdf

and, or, not, xor

conv

translate (see http://umcs.maine.edu/~cmeadow/courses/cos335/COA10.pdf)

in, out

CALL, RET

isa (Python's isinstance: http://docs.python.org/2/library/functions.html#isinstance )

in

Substitute (let, and letrec)

Map, fold

Solve (3sat but also system of equations, linear programming, prolog)

simplify (and simplify*, which does until fixpoint (here, fixpoint = normal form i guess))

mb also Reduce (and reduce*; or factor and factor*) and Expand (and expand*), which are two different ways to simplify in the domain of polynomials: expand writes as a sum, factor/reduce factorizes (note this expand is different from the one below)

eval (mb RUN or INTERPRET)

Fork

Match (regex but also grammar, return production)

Expand and expand* for grammar productions (e.g. given a string and a grammar, find all of the strings that can be reached by a single application of some production rule; and for *, iterate until fixpoint) (note this expand is different from the one above)

Replace

channel operations ( https://en.wikipedia.org/wiki/%CE%A0-calculus )

atomic read/write of a consecutive sequence of multiple bytes at once

for transactional memory: mark beginning of transaction (and specify fallback code to be executed if the transaction is aborted, which may just retry, or may retry a few times), mark end of transaction, manually abort transaction, test if currently in a transaction (i guess the test should ask about a specific txn, so that transactions are nestable)

as

idea: context-dependent or polymorphic computational primitives such as simplify, solve, interpret. Interpret: Program, context, language -> context. A context can be a state (assigning variables to values), or it can be something like a superclass that specifies defaults. For an example of the latter consider 'interpret' (=eval) in the context of a language; within the context of Python, 'eval' means one thing, within the context of Scheme, something else. You can think of this as just looking up the function in a nested namespace, e.g. like a closure or Lua upvals or Jasper semiglobals.
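
a Python sketch of that 'interpret: program, context, language -> context' shape, with the language acting as an outer namespace of defaults (everything here, including the toy 'languages', is made up for illustration):

    from collections import ChainMap

    python_like = {'eval': lambda ctx, expr: ctx.get(expr, 0) + 1}
    scheme_like = {'eval': lambda ctx, expr: ctx.get(expr, 0) * 2}

    def interpret(program, context, language):
        scope = ChainMap(context, language)    # context shadows the language defaults
        for expr in program:
            context['result'] = scope['eval'](context, expr)
        return context

    print(interpret(['x'], {'x': 3}, python_like))  # {'x': 3, 'result': 4}
    print(interpret(['x'], {'x': 3}, scheme_like))  # {'x': 3, 'result': 6}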

" Intel x86 provides process synchronization instructions XADD, CMPXCHG, CMPXCHG8B • Also bus control/synch operations ESC, LOCK, HALT, WAIT" -- http://umcs.maine.edu/~cmeadow/courses/cos335/COA10.pdf

x2 (multiply by 2; double) opcode idea from Mesa? And/or tree

ideas

instruction encoding for other forms of computation

idea: also specify how to use this for jasper graph data, grammar data/production rules/horn clauses, RDF, 3-SAT, etc.

natural language

the basic idea is somewhat like RDF, except instead of triples we have quintuples. instead of subject, object, verb, we have object1, object2, target, verb, type, and we have the potential for indirection in each of these.

if we are encoding RDF, we could presumably encode subject, object, verb, annotation, type.

hmm, thinking of natural language, it would be more 'natural' to split the type up and attach it to each operand. if we took one of the reserved bits we could have 3 bits per operand for type. but that's only 4 types (+ custom), and no addr mode! but it does avoid the decoding step where we must look up the verb's signature to see where the types apply. but we have to do that anyway in order to confirm that the third operand isn't an input, right? mb we should reserve some of the opcode bits to specify how many inputs/outputs we have and which ones. Also, in natural language, our type also corresponds to a verb's tense, so something is still needed there. So we'd really need to split it into four fields if we were doing that. Otoh mb we can safely assume that in any case, the target's type is determined by the input types and the verb type, so then we'd still only have a 3-way split. what we are doing isn't totally unnatural; it's like conjugating the verb to agree with the number of the subject and/or object in a language where the nouns are not inflected for number (not sure if that ever actually occurs in natural language though!).

idea: could use one or two of the reserved bits as 'ifs', like ARM, and perhaps one to negate that?

idea: can we expand this to have 'modifiers', e.g. adjectives, adverbs, other grammatical stuff?

ideas on this:

Phrases in register (array)?

addressing modes

idea: redefine sideeffectless to allow side-effects on the stack, and maybe on registers, only

interfaces?

memory management

idea: to the extent that we stick to 'linear' operations (no copying) and reversible ones (no destruction), we don't have to do any reference counting or garbage collection. hmm.. i wonder if that's related to why physics 'doesn't like' (generates more heat with) non-reversible (garbage-producing) operations.

reversible, linear, side-effectless, loop- and recursion-less (but not necessarily case-less) stack operations can be concisely encoded with zero operands and might be a good candidate for fmt zero or for a restricted addressing-mode language
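
a Python sketch of how cheap zero-operand ops could be to encode, packing eight 4-bit opcodes into one 32-bit word (the opcode names and the 4-bit width are arbitrary choices for illustration):

    OPS = {'NOP': 0, 'SWAP': 1, 'NEG': 2, 'INC': 3}

    def pack(ops):
        word = 0
        for i, op in enumerate(ops):
            word |= OPS[op] << (4 * i)
        return word

    def run_packed(word, stack):
        for i in range(8):                   # eight 4-bit slots per 32-bit word
            op = (word >> (4 * i)) & 0xF
            if op == OPS['SWAP']:
                stack[-1], stack[-2] = stack[-2], stack[-1]
            elif op == OPS['NEG']:
                stack[-1] = -stack[-1]
            elif op == OPS['INC']:
                stack[-1] += 1
        return stack

    print(run_packed(pack(['INC', 'NEG', 'SWAP']), [10, 3]))   # [-4, 10]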

misc

routing fabric should probably be assumed between the 64k local processors, e.g. the topology should perhaps not be directly exposed on the assembly level.

however could expose some amorphous computing primitives that let you work with a concept of nearby-ness.

idea: stack frame ptr register? probably not, this can just be a convention; unless we have a CALL command

idea: status register should be only 1 bit?

idea: when more than 3 operands are needed, use the status register and the accumulator this would allow the elimination of those 5 bits for status operands (which would also disallow those tricks where we increment the PC based on result code, resulting in a simple way to combine SKIP or IF-THEN with anything with a result code, or where we place the result code onto the stack, etc; and it disallows immediate mode status input operands; but then we don't need that 'magic' addressing mode, which makes implementation simpler)

idea: a write whose destination operand is in immediate mode can just mean discard

idea: enforce capabilities-based access rights idea: bounds-checking

idea: stack frame pointer, like x86?

idea: output on the stack: at the end of the function, if the stack is higher than the stack frame, then everything above the stack frame is an output

Toreads

www.cs.cornell.edu/talc/papers/tal-popl.pdf

http://en.wikipedia.org/wiki/High_Level_Assembly

http://en.wikipedia.org/wiki/Typed_assembly_language

http://research.microsoft.com/en-us/um/people/gbell/Computer_Structures_Principles_and_Examples/csp0023.htm

http://research.microsoft.com/en-us/um/people/gbell/Computer_Structures_Principles_and_Examples/csp0033.htm