Bayle Shanks's website: proj-oot-ootAssemblyNotes21

---

32bit ideas

so with the current plan, in 32-bit, we'll get 16 more 3-operand instructions plus 8 more 2-operand ones. This will allow us to make a lot more things 3-operand, so may as well rewrite the whole thing. So we got tons of space; after just moving stuff around (and freeing up loadpc), we have 8 3-operand opcodes, 9 2-operand opcodes, and 7 1-operand opcodes open. Ideas:

with addition of:

3-operand: 4 16-bit arithmetic, 5 f64 arith
2-operand: 4 f64 arith (including peek and poke for f64), 3 i16 arith, 2 stack ops
1-operand: 4 f64 arith, 1 syscall2 (instr-/misc), 2 i16 arith (peek and poke)
0-operand: 11 syscall2s, 1 divmod, 1 divmod i16
totals: 1 divmod, 9 16-bit arithmetic (plus 1 divmod i16), 13 f64, 12 syscall2s

note: the f64 instructions operate on 16 separate floating point registers

note: what about using half-precision (16-bit) floats instead? then we wouldn't need the separate registers. 64-bit floats ('doubles') could be available in OVM.

we could even just do unspecified 'native precision', but i'd prefer not to -- the point of native precision is to have a portable way to express computations, which may involve distances between pointers, and we don't know the pointer size so we can't know the integer size. but once we move outside of that, it's probably better to know how much precision you're dealing with.

but a problem with this is that i bet there are more architectures with f64 support than f16 support. In fact, it's worse than that; a quick check shows that ARM Cortex-M4F only supports f32 (single-precision floating point). Whereas JavaScript and Lua and Python only support f64.

so, i think we'll stick with f64.

note: what about sign-extension and zero-extension of integers of various bitwidths? what about coerce-i16-i and coerce-i-i16? i guess those aren't very useful without knowing the bitwidth of the native ints (which a program could get from sysinfo and then have a switch statement based on the result, but it's probably simpler not to)

note: instead of all the 3-operand i16 and f64 arithmetic, perhaps we'd like more kinds of compare-and-branch instructions on native ints and ptrs?

new instructions:

==
arithmetic on native ints	divmod-int
stack ops	pickk, rollk (note: these obsolete the 1-operand dup, swap, over instructions)
64-bit floating point (optional)	add-f64, mul-f64, div-f64, bne-f64, ble-f64, neg-f64, peek-f64, poke-f64, fclass-f64, cvt-f64-i16, cvt-i16-f64, coerce-f64-i16, coerce-i16-f64
16-bit integers	add-i16, mul-i16, cvt-i16-i, cvt-i-i16, neg-i16, peek-i16, poke-i16, bne-i16, ble-i16, divmod-i16
syscalls (optional)	exec, delete, get, put, spawn, pctrl, time, rand, environ, getpid, signal
==
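A rough sketch (in Python, not Boot code) of why pickk and rollk can stand in for dup, swap and over. The semantics are an assumption borrowed from Forth's PICK and ROLL, with k counted from the top of the stack starting at 0; the notes above don't pin this down, and the function names are just for illustration.

```python
def pick(stack, k):
    """Assumed pickk: copy the k-th item from the top (0 = top) and push the copy."""
    stack.append(stack[-1 - k])


def roll(stack, k):
    """Assumed rollk: remove the k-th item from the top (0 = top) and push it on top."""
    stack.append(stack.pop(-1 - k))


def demo():
    """Return the stack [1, 2, 3] (top is the last element) after each classic stack op."""
    results = []
    for op, k in ((pick, 0),   # dup
                  (pick, 1),   # over
                  (roll, 1),   # swap
                  (roll, 2)):  # rot
        s = [1, 2, 3]
        op(s, k)
        results.append(s)
    return results
```

Under these assumptions dup is 'pick 0', over is 'pick 1', swap is 'roll 1' and rot is 'roll 2'.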

new critical errors: divide-by-zero, (??: conversion-out-of-range? should that really be a critical error?)

note: does float divide by zero cause a critical error, or does it just create Inf? Can you choose?

todo: specify that native ints are 2s complement, and also that you can give native ints as inputs to -i16 operations on native ints, and the effect is to take just the lowest-order 16 bits of the native int. You can also give -i16s as input to native int operations, and the effect is to (zero-extend or sign-extend?) the i16. This allows us to use bne-int to compare i16s (but watch out comparing an i16 to an int; it may compare as unequal even if the low-order bits are the same, if int has any high-order bits set). However, since the i16 has a sign bit where longer ints have a '32768' bit, ble-int won't work right on i16s; it will interpret i16 -32768 as int 32768. So, we want a ble-i16, but we have no room for that. Alternately, we could change the 'i16's to 'u16's; but then (a) we want a sub-i16, and we have no room in the 3-operands for that, and also (b) if the native ints happen to be 16-bit then ble-int won't work again. So, maybe we should say that we CANNOT use bne-int or ble-int on i16s (it's a type error), and that we must use cvt-i16-i first, which i guess sign-extends the i16 to an int. Is this a significant enough hit to i16 efficiency such that we should provide bne-i16 and ble-i16, and move mul-f64 and div-f64 (or something else) to make room? Should we consume our last 3-operand RESERVED? ok so far i consumed the RESERVED and moved divmod-int.
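A small sketch (Python, with unbounded ints standing in for native ints wider than 16 bits) of the i16/int interactions discussed above. The helper names are made up for illustration; whether Boot zero-extends or sign-extends is exactly the open question, so both are shown.

```python
def low16(x):
    """Take the lowest-order 16 bits of a native int (what an -i16 op would see)."""
    return x & 0xFFFF


def zero_extend16(raw):
    """Treat the 16 raw bits as an unsigned value."""
    return raw & 0xFFFF


def sign_extend16(raw):
    """Treat the 16 raw bits as a 2s-complement value (assumed cvt-i16-i behaviour)."""
    raw &= 0xFFFF
    return raw - 0x10000 if raw & 0x8000 else raw


def ble_int(a, b):
    """Stand-in for ble-int: signed less-than-or-equal on native ints."""
    return a <= b


minus_one = low16(-1)                                 # raw bits 0xFFFF
wrong = ble_int(zero_extend16(minus_one), 1)          # compares 65535 <= 1
right = ble_int(sign_extend16(minus_one), 1)          # compares -1 <= 1
same_low_bits = low16(0x10005) == low16(5)            # low-order bits agree
bne_int_says_unequal = 0x10005 != 5                   # but bne-int sees the high bits
```

So with zero-extension (or no conversion at all) ble-int gets negative i16s wrong, while sign-extending first gives the expected answer.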

note: we clearly can't use most of the addr modes on float regs, should we use the bits differently? like, mb just offer register direct, immediate (which are still interpreted as signed integers), and constant, and allow 5 value bits instead of 4? Should we then have 32 f64 regs instead of 16 (eg RISC-V has 32)? Alternately, if the addr mode has any indirection, we treat this as a PEEK or a POKE to normal memory through the normal registers. I guess the latter is more consistent.

todo: instead of offering PEEK and POKE for -i16 and -f64, we could just provide bitwise truncation and sign- and zero-extension, and say, just use those with indirect addressing modes to load and store. That makes more sense since these loads and stores are to ordinary memory, right? Well, no, not with f64; how many native memory spots it will occupy is non-portable (it will be 1 on 64-bit machines and 4 on 16-bit machines). So, maybe do that with i16 but not f64?
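To make the portability problem concrete, here's a sketch (Python) of how many native memory slots one f64 would take for a given word size, and of splitting and reassembling it. The little-endian word order and the function names are assumptions for illustration only; nothing above fixes a layout.

```python
import struct


def f64_to_words(value, word_bits):
    """Split an IEEE-754 binary64 into native words, least-significant word first (assumed order)."""
    bits = int.from_bytes(struct.pack("<d", value), "little")
    count = -(-64 // word_bits)          # ceil(64 / word_bits)
    mask = (1 << word_bits) - 1
    return [(bits >> (i * word_bits)) & mask for i in range(count)]


def words_to_f64(words, word_bits):
    """Inverse of f64_to_words."""
    bits = 0
    for i, w in enumerate(words):
        bits |= w << (i * word_bits)
    bits &= (1 << 64) - 1
    return struct.unpack("<d", bits.to_bytes(8, "little"))[0]
```

With 16-bit words, f64_to_words(1.5, 16) has 4 entries; with 64-bit words it has 1. A program that stores f64s next to other data can't know its own layout without asking for the word size.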

todo: hmm aside from f64 and i16 we're looking pretty stable. But there are still some decisions to be made then regarding f64 and i16. Throwing in f64 and i16 adds some complexity, is it worth it? If so, i16 or u16? How do f64s get read from and stored into memory, and how many spaces do they take up? How about u16s? What happens if you attempt an int operation on an int16? Are there sign-extend and zero-extend ops for int16? How do you convert from ints to int16s and could this cause a critical error? How many f64 operations do we expose? Is f64 divide by 0 a critical error or does it produce inf, and can this be configured?

note: Will we be IEEE 754 compliant? I don't think so; it seems to me that IEEE-754-2008 may require SQRT and ABS and multiple rounding modes. Also, the RISC-V spec comments, "The C99 language standard effectively mandates the provision of a dynamic rounding mode register". Perhaps the RISC-V floating point operations would be the simplest way that would support the standard. I guess we could say we support it if OUR standard required various assembler intrinsics that computed in 'software' for what the VM itself doesn't do at runtime. I'd rather just keep BootX simple, though, and say that we don't support IEEE-754-2008, although we do provide a subset of the operations defined there. If we wanted to be as complex as RISC-V, why would we even create BootX at all? OVM will have more opcode space and can have all those other operations.

note: in fact, it seems that even Python doesn't support IEEE-754 out of the box: [1]. And Python is used a lot for numerical computing. My motto: if Python doesn't support some numerical thing, then we really don't need it (at least not at the OVM level; maybe Oot stdlib could have it).

The 32 three-operand instruction opcodes and mnemonics and operand signatures are (note: in 32-bit instructions, constants ('c') are 7 bits, not 3 or 4 bits, because the addressing mode is treated as part of the constant):
0. bne-int: c ii ii (branch-if-not-equal on ints)
1. bne-ptr: c ip ip (branch-if-not-equal on pointers)
2. jrel: c c c (unconditional relative jump by a constant signed amount)
3. ldi: oi c c (load immediate 8-bit int)
4. ld: c o si (load from memory address plus unsigned constant)
5. st: c so i (store from register to memory address plus unsigned constant)
6. addi-int: c io ii (in-place addition of ints and immediate constant)
7. addi-ptr-int: c iop ii (in-place addition of ints and immediate constant to ptr)
8. ble-int: c ii ii (branch if less-than-or-equal on ints)
9. ble-ptr: c ip ip (branch if less-than-or-equal on pointers)
10. add-int: oi ii ii (addition of ints)
11. add-ptr-int: op ip ii (add a int to a pointer)
12. CAS
13. bne-i16
14. annotate: c c c (can be ignored)
15. bitor: io ii ii (bitwise OR)
16. bitand: io ii ii (bitwise AND)
17. bitxor: io ii ii (bitwise XOR)
18. sub-ptr: op ip ip (subtraction of pointers)
19. mul-int: oi ii ii (integer multiply)
20. sll: io c ii (shift left logical (multiplication by 2^c (mod MAX_INT+1)))
21. srl: io c ii (shift right logical (division by 2^c, rounding towards zero))
22. sra: io c ii (shift right arithmetic (division by 2^c, rounding towards negative infinity))
23. ble-i16
24. add-i16: io16 ii16 ii16 (integer addition of 16-bits)
25. mul-i16: io16 ii16 ii16 (integer multiplication of 16-bits)
26. add-f64: iof64 if64 if64 (float64 addition)
27. mul-f64: iof64 if64 if64 (float64 multiplication)
28. div-f64: iof64 if64 if64 (float64 division)
29. bne-f64: c if64 if64 (branch-if-not-equal on float64)
30. ble-f64: c if64 if64 (branch-if-less-than-or-equal on float64)
31. instr-two: c ? ? (used to encode two-operand instructions)

The 16 two-operand instruction opcodes and mnemonics and operand signatures are:
0. cpy: o i (copy from register to register)
1. pop: o sim
2. push: som i
3. sysinfo (query system metadata): o c
4. neg: io ii (arithmetic negation)
5. bitnot: io ii (bitwise negation)
6. pickk: c sio (pick c on stack)
7. rollk: c sio (roll c on stack)
8. neg-f64: iof64 if64 (arithmetic negation of float64)
9. peek-f64: of64 ip64 (load f64 from external memory address)
10. poke-f64: op64 if64 (store f64 to external memory address)
11. cvt-i16-i: oi ii16 (convert int16 to int (sign-extend?))
12. cvt-i-i16: oi16 ii (convert int to int16 (how do we deal with ints bigger than 2^15-1 or smaller than -2^15? is that a critical error or do we mandate saturation or something like that?))
13. neg-i16: oi16 ii16
14. fclass-f64: oi if64 (classify a floating point number; see eg FCLASS in RISC-V, eg FCLASS.S in section 8.9)
15. instr-one: c ? (used to encode one-operand instructions)

The 16 one-operand instruction opcodes and mnemonics and operand signatures are:
0. jd: pc (dynamic jump)
1.
2.
3. cvt-f64-i16: if64 (convert float64 to int16, pushing 4 entries onto SMALLSTACK)
4. cvt-i16-f64: iof64 (convert int16 to float64, popping 4 entries from SMALLSTACK) (is this round? floor? what do we do when the f64 is larger or smaller than the largest or smallest int16 -- is this a critical error or do we saturate or something else?)
5. coerce-f64-i16: if64 (coerce float64 to int16, pushing 4 entries onto SMALLSTACK)
6. coerce-i16-f64: iof64 (coerce int16 to float64, popping 4 entries from SMALLSTACK)
7. malloc: op
8. mdealloc: ip
9. mrealloc: iop
10. peek-i16: ip (load i16 from external memory address, pushing onto SMALLSTACK)
11. poke-i16: iop (store i16 to external memory address, popping from SMALLSTACK)
12.
13.
14. syscall2
15. syscall: c (used to encode zero-operand misc instructions)

The 16 SYSCALL2 zero-operand instructions are:
0. exec
1.
2. get
3. put
4. spawn
5. pctrl (process control, eg join/wait, kill, etc -- or should these each be separate?)
6. time
7. rand
8. environ
9. getpid
10. signal (?? not sure if we want to do it this way -- signal handler setup)
11. create
12. delete
13.
14. divmod-i16: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)
15. divmod-int: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)
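A sketch (Python) of what divmod-int on SMALLSTACK might do: consume 2 items, then push the quotient and then the remainder, raising the divide-by-zero critical error mentioned earlier. Several things here are assumptions not settled in these notes: that the divisor is on top of the stack, that rounding is towards negative infinity (Python's divmod), and the CriticalError name.

```python
class CriticalError(Exception):
    """Hypothetical stand-in for a Boot critical error."""


def divmod_int(smallstack):
    """Pop divisor (assumed top) and dividend, push quotient, then remainder."""
    divisor = smallstack.pop()
    dividend = smallstack.pop()
    if divisor == 0:
        raise CriticalError("divide-by-zero")
    quotient, remainder = divmod(dividend, divisor)   # floor division (assumption)
    smallstack.append(quotient)
    smallstack.append(remainder)


stack = [17, 5]
divmod_int(stack)       # stack is now [3, 2]: quotient below, remainder on top
```

If truncation towards zero turns out to be the better fit for most hardware, only the divmod line changes; the stack discipline stays the same.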

The 16 INSTR-ZERO zero-operand instructions are:
0. halt (terminate program execution)
1. break (mark breakpoint for a debugger)
2. fence-seq
3. peek-i8
4. poke-i8
5. devop
6. read
7. write
8. open
9. close
10. seek
11. flush
12. poll
13. coerce-int-ptr: coerce int in T to ptr
14. log
15. library
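The tables above nest: three-operand opcode 31 (instr-two) escapes to the two-operand table, two-operand opcode 15 (instr-one) escapes to the one-operand table, and one-operand opcodes 14 (syscall2) and 15 (syscall) escape to the two zero-operand tables. A sketch of that classification in Python; it assumes each escape's sub-opcode sits in the first remaining operand field (the 'c' in the signatures), and the function and constant names are made up.

```python
INSTR_TWO = 31   # three-operand opcode that escapes to the two-operand table
INSTR_ONE = 15   # two-operand opcode that escapes to the one-operand table
SYSCALL2 = 14    # one-operand opcode that escapes to the SYSCALL2 table
SYSCALL = 15     # one-operand opcode that escapes to the INSTR-ZERO table


def classify(opcode, a, b, c):
    """Return (table name, opcode within that table, remaining operand fields)."""
    if opcode != INSTR_TWO:
        return ("three-operand", opcode, (a, b, c))
    if a != INSTR_ONE:
        return ("two-operand", a, (b, c))
    if b == SYSCALL:
        return ("instr-zero", c, ())
    if b == SYSCALL2:
        return ("syscall2", c, ())
    return ("one-operand", b, (c,))
```

For example, classify(31, 15, 7, 3) lands on one-operand opcode 7 (malloc) with a single operand field, and classify(31, 15, 14, 15) lands on SYSCALL2 entry 15 (divmod-int).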

an idea for the 64 8-bit instructions (we really only have 6 bits because there are 2 encoding format bits in the 8-bit instruction encoding):

8-bit instruction ideas: 0. ANNOTATE

1. SMALLSTACK SWAP 2. SMALLSTACK OVER 3. SMALLSTACK DROP 4. SMALLSTACK DUP

5. SMALLSTACK ROT

6. POP SMALLSTACK and PUSH to MEMSTACK 7. POP MEMSTACK and PUSH to SMALLSTACK

8. POP SMALLSTACK into ERR 9. POP SMALLSTACK into R4 10. POP SMALLSTACK into R5 11. PUSH ERR onto SMALLSTACK 12. PUSH R4 onto SMALLSTACK 13. PUSH R5 onto SMALLSTACK

14. LD through T into R4 (that is, load in register indirect mode with register T, and put it into R4) 15. LD through R4 and PUSH it onto SMALLSTACK 16. ST R4 through T 17. POP SMALLSTACK and ST it through R4 18. (POP SMALLSTACK, add it to the pointer in R4), LOAD from the addr in parens and PUSH it to SMALLSTACK 19. (POP SMALLSTACK, add it to the pointer in R4), POP SMALLSTACK and ST it to the addr in parens

20. PUSHPC onto MEMSTACK 21. POP MEMSTACK and JD

22. SMALLSTACK ADD-ptr-int 23. SMALLSTACK ADD-int 24. SMALLSTACK NEG 25. sub-ptr SMALLSTACK 26. MUL SMALLSTACK 27. SLL SMALLSTACK 28. SRL SMALLSTACK 29. SRA SMALLSTACK

30. SKIPNZ-int SMALLSTACK 31. SKIPZ-int SMALLSTACK 32. SKIPEQ-int SMALLSTACK 33. SKIPEQ-ptr SMALLSTACK 34. SKIPLE-int SMALLSTACK 35. SKIPLE-ptr SMALLSTACK 36. SKIPLT-int SMALLSTACK 37. SKIPLT-ptr SMALLSTACK 38. SKIPGE-int SMALLSTACK 39. SKIPGE-ptr SMALLSTACK

40. LD SMALLSTACK 41. ST SMALLSTACK

42. PUSH 0 onto SMALLSTACK 43. PUSH 1 onto SMALLSTACK

44. PUSH addr of MEMSTACK (R1) onto SMALLSTACK 45. POP from SMALLSTACK to addr of MEMSTACK (R1)

46. JREL +1 47. JREL +2 48. JREL +3 49. JREL +4 50. JREL +5 51. JREL +6 52. JREL -2 53. JREL -3 54. JREL -4 55. JREL -5 56. JREL -6 57. JREL -7

58. BITAND SMALLSTACK 59. BITOR SMALLSTACK 60. BITXOR SMALLSTACK 61. BITNOT SMALLSTACK

62. CAS SMALLSTACK 63. FENCE

(this fits in all of our compute operations, even SRA and ROT, except that we can only access the first 8 registers; and no syscalls, not even malloc or sysinfo; and no way to load immediate constants; and JREL is restricted to +-6 (that means we can go fwd or back skipping over 3 16-bit instructions) (JREL is +-1024 in the 16-bit ISA). And we even managed to fit in EQ, LT, and GE (but not NE).)

(in reality, not only will we profile and choose the 8-bit format to represent the most common instructions or instruction sequences, but we need to leave at least 8 of these free as 'custom instructions' for VMs like OVM implemented on top of BootX)

---

actually let's make 'native floats' of unspecified size, that fit in the same space as native ints and native pointers. Now we can get rid of the f64 registers.

---

ARM Cortex-M4 offers single-precision floating point, which is 32 bits, the same as the native pointer bitwidth (so it seems pretty feasible to have floats of the same precision as pointers)

---

how to represent program memory:

if we use pointers:

if we say Boot instructions take up one 'word' and that program memory is word-addressed, then, in BootX, we won't be able to adequately represent branches to 8-bit encoded instructions.

if we say Boot instructions take two words, then we are wasting tons of space if pointers are implemented naively.

can we just say that all memory is byte-addressed? No, unless we want to add a 'sizeof' -- because we don't know what the pointer size is, so this would leave us unable to traverse data structures with pointers.

So maybe we should just go back to using integers to refer to program locations.

We could have these integers be relative to the PC. But what about 'PUSHPC' and 'JD'? We need to be able to use a JD to go to a PUSHPC that was captured at some other time, when the PC was somewhere else. So maybe make them relative to the beginning of the program.
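A toy illustration (Python) of the PUSHPC/JD problem: a location saved as an offset from the PC at capture time resolves to the wrong place if the jump happens from somewhere else, while an offset from the beginning of the program resolves the same way everywhere. All names and the specific addresses are hypothetical.

```python
def capture_pc_relative(pc_at_capture, target):
    """Save a location as an offset from the PC at the time of capture."""
    return target - pc_at_capture


def jump_pc_relative(pc_at_jump, saved):
    """Resolve a PC-relative value against the PC at the time of the jump."""
    return pc_at_jump + saved


def capture_start_relative(target):
    """Save a location as an offset from the beginning of the program."""
    return target


def jump_start_relative(saved):
    """Resolve a start-relative value; the current PC doesn't matter."""
    return saved


def demo():
    """PUSHPC happens at location 100; the matching JD happens later at location 250."""
    rel = capture_pc_relative(100, 100)
    absolute = capture_start_relative(100)
    return jump_pc_relative(250, rel), jump_start_relative(absolute)
```

Here demo() gives (250, 100): the PC-relative value sends the JD back to itself instead of to location 100.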

Or, we could just say that PUSHPC and JD take pointers, and say that the platform word size can be gotten from SYSINFO, and if you want to do pointer arithmetic on the result of PUSHPC you can (or can't, if the implementation doesn't permit it) but you have to ask SYSINFO for the word size first.

---

https://github.com/WebAssembly/threads/blob/master/proposals/threads/Overview.md

---

actually now that CODEPTRs are opaque again, we need a way to do longer jumps again. So we probably need one of:

- LOADCODEPTR or CVT-INT-PTR, which takes an integer and interprets it as bytes from beginning of program
- a JMP instruction
- special instructions to do pointer arithmetic, in units of bytes, on CODEPTRs generated from LOADPC

(later) i chose CVT-INT-PTR. i'm not going to reintroduce the restriction that it comes right after a LOADI; to keep things simple, Boot doesn't have all those restrictions that make code analysis easier (eg there is no requirement that the map of SMALLSTACK be statically known). Maybe i'll add those in BootX.
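A sketch (Python) of the idea behind CVT-INT-PTR with opaque CODEPTRs: an integer is interpreted as a byte offset from the beginning of the program and wrapped in a value that supports no arithmetic, and JD accepts only such values. The class and function names, and the choice to reject negative offsets, are assumptions for illustration.

```python
class CodePtr:
    """Opaque code pointer: deliberately defines no arithmetic operators."""
    __slots__ = ("_byte_offset",)

    def __init__(self, byte_offset):
        self._byte_offset = byte_offset


def cvt_int_ptr(n):
    """Interpret n as a byte offset from the beginning of the program."""
    if n < 0:
        raise ValueError("offset before the beginning of the program")
    return CodePtr(n)


def jd(ptr):
    """Dynamic jump: return the new PC as a byte offset from program start."""
    if not isinstance(ptr, CodePtr):
        raise TypeError("JD needs a CODEPTR, not a plain int")
    return ptr._byte_offset


new_pc = jd(cvt_int_ptr(4096))   # a long jump: load an int, convert, jump
```

Trying cvt_int_ptr(4096) + 2 raises TypeError in this sketch, which mirrors the point that pointer arithmetic on CODEPTRs isn't available unless special instructions are added for it.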

---

how do other ISAs do integer multiply?

http://gem5.org/X86_microop_ISA Mul1s Signed multiply " ProdHi:ProdLo = Src1 * Src2

Multiplies the unsigned contents of the Src1 and Src2 registers and puts the high and low portions of the product into the internal registers ProdHi and ProdLo, respectively. (my note: they have another instruction for unsigned multiply with the exact same description.. i think they meant the SIGNED contents here) "

https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf “M” Standard Extension for Integer Multiplication and Division, Version 2.0

" MUL performs an XLEN-bit × XLEN-bit multiplication and places the lower XLEN bits in the destination register. MULH, MULHU, and MULHSU perform the same multiplication but return the upper XLEN bits of the full 2×XLEN-bit product, for signed × signed, unsigned × unsigned, and signed × unsigned multiplication respectively. If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies. "
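To check my reading of the RISC-V text, here's a small Python model of MUL, MULH, MULHU and MULHSU for an arbitrary XLEN. It's only a sketch of the semantics quoted above (register values are passed as unsigned bit patterns; the helper names are mine), not anything taken from a RISC-V implementation.

```python
def to_signed(x, xlen):
    """Interpret the low xlen bits of x as a 2s-complement number."""
    x &= (1 << xlen) - 1
    return x - (1 << xlen) if x >> (xlen - 1) else x


def mul(a, b, xlen):
    """Lower xlen bits of the product (the same for signed and unsigned inputs)."""
    return (a * b) & ((1 << xlen) - 1)


def mulh(a, b, xlen):
    """Upper xlen bits of signed x signed."""
    return ((to_signed(a, xlen) * to_signed(b, xlen)) >> xlen) & ((1 << xlen) - 1)


def mulhu(a, b, xlen):
    """Upper xlen bits of unsigned x unsigned."""
    mask = (1 << xlen) - 1
    return ((a & mask) * (b & mask)) >> xlen


def mulhsu(a, b, xlen):
    """Upper xlen bits of signed (a) x unsigned (b)."""
    mask = (1 << xlen) - 1
    return ((to_signed(a, xlen) * (b & mask)) >> xlen) & mask
```

With xlen=8, a=0x80, b=2: mul gives 0x00, mulh gives 0xff (the signed product is -256 = 0xff00) and mulhu gives 0x01 (the unsigned product is 256 = 0x0100). The low half doesn't depend on signedness, so an instruction that returns only the low bits needs no signed/unsigned variants.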

so it looks like i'm doing the right thing already? i should check with a simulator or something.

OK here's how 'imul', signed integer multiplication, works on my computer (x86-64). I made a file 'test.asm' with the following:

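section     .text
global      _start                              

_start:
  ; one-operand imul with an 8-bit operand: AX = AL * BL (signed)
  mov  al,0x01                                  ;1
  mov  bl,2
  imul bl

  mov  al,0x3f                                  ;63
  mov  bl,2
  imul bl

  mov  al,0x40                                  ;64
  mov  bl,2
  imul bl

  mov  al,0x7f                                  ;127, largest positive signed byte
  mov  bl,2
  imul bl

  mov  al,0x80                                  ;-128 as a signed byte
  mov  bl,2
  imul bl
  
  mov  al,0x81                                  ;-127 as a signed byte
  mov  bl,2
  imul bl

  mov  al,0xc0                                  ;-64 as a signed byte
  mov  bl,2
  imul bl

  mov  al,0xc1                                  ;-63 as a signed byte
  mov  bl,2
  imul bl
  
  mov  al,0xff                                  ;-1 as a signed byte
  mov  bl,2
  imul bl

  mov     eax,1                               ;sys_exit
  int     0x80                                ;exit status is taken from ebx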

then ran:

nasm -g -F dwarf -f elf64 test.asm; ld -o test test.o; ddd test

and opened Status-Registers, and set a breakpoint on _start, then did Run and then stepped through.
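For comparison while stepping, here's a Python model of what the one-operand 8-bit imul should leave in AX for each test value: AX = AL * operand with both treated as signed bytes, and CF/OF set when the product doesn't fit back into a signed byte. This is a sketch of the documented instruction semantics with names of my own choosing, not output captured from the debugger.

```python
def imul8(al, operand):
    """Model of `imul r/m8`: return (AX as a 16-bit pattern, CF/OF flag)."""
    def s8(x):
        x &= 0xFF
        return x - 0x100 if x & 0x80 else x

    product = s8(al) * s8(operand)
    ax = product & 0xFFFF
    overflow = not (-128 <= product <= 127)   # AX is not the sign-extension of AL
    return ax, overflow


# AL values used in test.asm, each multiplied by BL = 2
inputs = [0x01, 0x3F, 0x40, 0x7F, 0x80, 0x81, 0xC0, 0xC1, 0xFF]
expected = {al: imul8(al, 2) for al in inputs}
```

So for example 0x40 * 2 should give AX = 0x0080 with the flags set (128 doesn't fit in a signed byte), while 0xc0 * 2 should give AX = 0xff80 with the flags clear (-128 does fit).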

EAX and EBX are general-purpose 32-bit registers. The least-significant 16 bits of EAX and EBX can be used separately via register names AX and BX (i'll call these 'virtual registers'). AX is in turn divided into two 8-bit virtual registers, AH (its most-significant 8 bits) and AL (its least-significant 8 bits), and similarly BH and BL for BX. (see [2]