proj-oot-ootAssemblyNotes21

---

32bit ideas

so with the current plan, in 32-bit, we'll get 16 more 3-operand instructions plus 8 more 2-operand ones. This will allow us to make a lot more things 3-operand, so we may as well rewrite the whole thing. So we've got tons of space; after just moving stuff around (and freeing up loadpc), we have 8 3-operand opcodes, 9 2-operand opcodes, and 7 1-operand opcodes open. Ideas:

with addition of:

note: the f64 instructions operate on 16 separate floating point registers

note: what about using half-precision (16-bit) floats instead? then we wouldn't need the separate registers. 64-bit floats ('doubles') could be available in OVM.

we could even just do unspecified 'native precision', but i'd prefer not to -- the point of native precision is to have a portable way to express computations, which may involve distances between pointers, and we don't know the pointer size so we can't know the integer size. but once we move outside of that, it's probably better to know how much precision you're dealing with.

but a problem with this is that i bet there are more architectures with f64 support than f16 support. In fact, it's worse than that; a quick check shows that ARM Cortex-M4F only supports f32 (single-precision floating point). Whereas Javascript and Lua and Python only support f64.

so, i think we'll stick with f64.

note: what about sign-extension and zero-extension of integers of various bitwidths? what about coerce-i16-i and coerce-i-i16? i guess those aren't very useful without knowing the bitwidth of the native ints (which a program could get from sysinfo and then have a switch statement based on the result, but it's probably simpler not to)
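for concreteness, here's a minimal Python sketch of 16-bit zero- and sign-extension to a known wider width (the names and the int_bits parameter are illustrative, not part of any spec):

def zero_extend_16(x, int_bits):
    # keep the low 16 bits; all higher bits of the wider int stay clear
    return x & 0xFFFF

def sign_extend_16(x, int_bits):
    # replicate bit 15 into all higher bits of the int_bits-wide result
    x &= 0xFFFF
    if x & 0x8000:
        x |= ((1 << int_bits) - 1) & ~0xFFFF
    return x

assert sign_extend_16(0x8000, 32) == 0xFFFF8000
assert zero_extend_16(0x8000, 32) == 0x00008000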

note: instead of all the 3-operand i16 and f64 arithmetic, perhaps we'd like more kinds of compare-and-branch instructions on native ints and ptrs?

new instructions:

==
arithmetic on native ints: divmod-int
stack ops: pickk, rollk (note: these obsolete the 1-operand dup, swap, over instructions; see the sketch after this block)
64-bit floating point (optional): add-f64, mul-f64, div-f64, bne-f64, ble-f64, neg-f64, peek-f64, poke-f64, fclass-f64, cvt-f64-i16, cvt-i16-f64, coerce-f64-i16, coerce-i16-f64
16-bit integers: add-i16, mul-i16, cvt-i16-i, cvt-i-i16, neg-i16, peek-i16, poke-i16, bne-i16, ble-i16, divmod-i16
syscalls (optional): exec, delete, get, put, spawn, pctrl, time, rand, environ, getpid, signal
==
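here's a quick Python sketch of the pickk/rollk semantics (Forth-style, with index 0 = top of stack; the 0-indexing convention is my assumption):

def pick(stack, k):
    # copy the k-th item from the top onto the top
    stack.append(stack[-1 - k])

def roll(stack, k):
    # remove the k-th item from the top and push it on top
    stack.append(stack.pop(-1 - k))

s = [1, 2, 3]
pick(s, 0)        # dup:  [1, 2, 3, 3]
s = [1, 2, 3]
pick(s, 1)        # over: [1, 2, 3, 2]
s = [1, 2, 3]
roll(s, 1)        # swap: [1, 3, 2]

under this convention dup = pick 0, over = pick 1, and swap = roll 1, which is why the 1-operand forms become redundant.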

new critical errors: divide-by-zero (??: conversion-out-of-range? should that really be a critical error?)

note: does float divide by zero cause a critical error, or does it just create Inf? Can you choose?

todo: specify that native ints are 2s complement, and also that you can give native ints as inputs to -i16 operations, and the effect is to take just the lowest-order 16 bits of the native int. You can also give -i16s as input to native int operations, and the effect is to (zero-extend or sign-extend?) the i16. This allows us to use bne-int to compare i16s (but watch out comparing an i16 to an int; it may compare as unequal even if the low-order bits are the same, if the int has any high-order bits set). However, since the i16 has a sign bit where longer ints have a '32768' bit, ble-int won't work right on i16s; it will interpret i16 -32768 as int 32768. So, we want a ble-i16, but we have no room for that. Alternately, we could change the 'i16's to 'u16's; but then (a) we want a sub-i16, and we have no room in the 3-operands for that, and also (b) if the native ints happen to be 16-bit then ble-int won't work again. So, maybe we should say that we CANNOT use bne-int or ble-int on i16s (it's a type error), and that we must use cvt-i16-i first, which i guess sign-extends the i16 to an int. Is this a significant enough hit to i16 efficiency that we should provide bne-i16 and ble-i16, and move mul-f64 and div-f64 (or something else) to make room? Should we consume our last 3-operand RESERVED? ok, so far i consumed the RESERVED and moved divmod-int.
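to see the ble problem concretely (a Python demo; assumes i16s sit zero-extended in native registers, with a 32-bit native int for the example):

a = 0x8000    # the bit pattern of i16 -32768, zero-extended in a register
b = 0x0001    # i16 1
# ble-int compares the native bit patterns as signed 32-bit ints,
# so it sees 32768 and 1:
print(a <= b)    # False, even though as i16s, -32768 <= 1 is True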

note: we clearly can't use most of the addr modes on float regs, should we use the bits differently? like, mb just offer register direct, immediate (which are still interpreted as signed integers), and constant, and allow 5 value bits instead of 4? Should we then have 32 f64 regs instead of 16 (eg RISC-V has 32)? Alternately, if the addr mode has any indirection, we treat this as a PEEK or a POKE to normal memory through the normal registers. I guess the latter is more consistent.

todo: instead of offering PEEK and POKE for -i16 and -f64, we could just provide bitwise truncation and sign- and zero-extension, and say, just use those with indirect addressing modes to load and store. That makes more sense since these loads and stores are to ordinary memory, right? Well, no, not with f64; how many native memory spots it will occupy is non-portable (it will be 1 on 64-bit machines and 4 on 16-bit machines). So, maybe do that with i16 but not f64?

todo: hmm, aside from f64 and i16 we're looking pretty stable. But there are still some decisions to be made regarding f64 and i16. Throwing in f64 and i16 adds some complexity; is it worth it? If so, i16 or u16? How do f64s get read from and stored into memory, and how many spaces do they take up? How about u16s? What happens if you attempt an int operation on an int16? Are there sign-extend and zero-extend ops for int16? How do you convert from ints to int16s, and could this cause a critical error? How many f64 operations do we expose? Is f64 divide by 0 a critical error or does it produce inf, and can this be configured?

note: Will we be IEEE 754 compliant? I don't think so; it seems to me that IEEE-754-2008 may require SQRT and ABS and multiple rounding modes. Also, the RISC-V spec comments, "The C99 language standard effectively mandates the provision of a dynamic rounding mode register". Perhaps adopting the RISC-V floating point operations would be the simplest way to support the standard. I guess we could say we support it if our standard required various assembler intrinsics that compute in 'software' whatever the VM itself doesn't do at runtime. I'd rather just keep BootX simple, though, and say that we don't support IEEE-754-2008, although we do provide a subset of the operations defined there. If we wanted to be as complex as RISC-V, why would we even create BootX at all? OVM will have more opcode space and can have all those other operations.

note: in fact, it seems that even Python doesn't support IEEE-754 out of the box: [1]. And Python is used a lot for numerical computing. My motto: if Python doesn't support some numerical thing, then we really don't need it (at least not at the OVM level; maybe Oot stdlib could have it).

The 32 three-operand instruction opcodes and mnemonics and operand signatures are (note: in 32-bit instructions, constants ('c') are 7 bits, not 3 or 4 bits, because the addressing mode is treated as part of the constant):

0. bne-int: c ii ii (branch-if-not-equal on ints)
1. bne-ptr: c ip ip (branch-if-not-equal on pointers)
2. jrel: c c c (unconditional relative jump by a constant signed amount)
3. ldi: oi c c (load immediate 8-bit int)
4. ld: c o si (load from memory address plus unsigned constant)
5. st: c so i (store from register to memory address plus unsigned constant)
6. addi-int: c io ii (in-place addition of ints and immediate constant)
7. addi-ptr-int: c iop ii (in-place addition of ints and immediate constant to ptr)
8. ble-int: c ii ii (branch if less-than-or-equal on ints)
9. ble-ptr: c ip ip (branch if less-than-or-equal on pointers)
10. add-int: oi ii ii (addition of ints)
11. add-ptr-int: op ip ii (add an int to a pointer)
12. CAS
13. bne-i16
14. annotate: c c c (can be ignored)
15. bitor: io ii ii (bitwise OR)
16. bitand: io ii ii (bitwise AND)
17. bitxor: io ii ii (bitwise XOR)
18. sub-ptr: op ip ip (subtraction of pointers)
19. mul-int: oi ii ii (integer multiply)
20. sll: io c ii (shift left logical (multiplication by 2^c (mod MAX_INT+1)))
21. srl: io c ii (shift right logical (division by 2^c, rounding towards zero))
22. sra: io c ii (shift right arithmetic (division by 2^c, rounding towards negative infinity))
23. ble-i16
24. add-i16: io16 ii16 ii16 (integer addition of 16-bits)
25. mul-i16: io16 ii16 ii16 (integer multiplication of 16-bits)
26. add-f64: iof64 if64 if64 (float64 addition)
27. mul-f64: iof64 if64 if64 (float64 multiplication)
28. div-f64: iof64 if64 if64 (float64 division)
29. bne-f64: c if64 if64 (branch-if-not-equal on float64)
30. ble-f64: c if64 if64 (branch-if-less-than-or-equal on float64)
31. instr-two: c ? ? (used to encode two-operand instructions)

The 16 two-operand instruction opcodes and mnemonics and operand signatures are:

0. cpy: o i (copy from register to register)
1. pop: o sim
2. push: som i
3. sysinfo (query system metadata): o c
4. neg: io ii (arithmetic negation)
5. bitnot: io ii (bitwise negation)
6. pickk: c sio (pick c on stack)
7. rollk: c sio (roll c on stack)
8. neg-f64: iof64 if64 (arithmetic negation of float64)
9. peek-f64: of64 ip64 (load f64 from external memory address)
10. poke-f64: op64 if64 (store f64 to external memory address)
11. cvt-i16-i: oi ii16 (convert int16 to int (sign-extend?))
12. cvt-i-i16: oi16 ii (convert int to int16 (how do we deal with ints bigger than 2^15-1 or smaller than -2^15? is that a critical error or do we mandate saturation or something like that?))
13. neg-i16: oi16 ii16
14. fclass-f64: oi if64 (classify a floating point number; see eg FCLASS in RISC-V, eg FCLASS.S in section 8.9)
15. instr-one: c ? (used to encode one-operand instructions)
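for cvt-i-i16 (item 12 above), the two candidate overflow semantics look like this in Python (a sketch of the open design question, not a decision):

def cvt_i_i16_trap(x):
    # option 1: out-of-range conversion raises a critical error
    if not (-2**15 <= x <= 2**15 - 1):
        raise ValueError('critical error: conversion-out-of-range')
    return x

def cvt_i_i16_saturate(x):
    # option 2: clamp (saturate) to the representable i16 range
    return max(-2**15, min(2**15 - 1, x))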

The 16 one-operand instruction opcodes and mnemonics and operand signatures are:

0. jd: pc (dynamic jump)
1.
2.
3. cvt-f64-i16: if64 (convert float64 to int16, pushing 4 entries onto SMALLSTACK)
4. cvt-i16-f64: iof64 (convert int16 to float64, popping 4 entries from SMALLSTACK) (is this round? floor? what do we do when the f64 is larger or smaller than the largest or smallest int16 -- is this a critical error or do we saturate or something else?)
5. coerce-f64-i16: if64 (coerce float64 to int16, pushing 4 entries onto SMALLSTACK)
6. coerce-i16-f64: iof64 (coerce int16 to float64, popping 4 entries from SMALLSTACK)
7. malloc: op
8. mdealloc: ip
9. mrealloc: iop
10. peek-i16: ip (load i16 from external memory address, pushing onto SMALLSTACK)
11. poke-i16: iop (store i16 to external memory address, popping from SMALLSTACK)
12.
13.
14. syscall2
15. syscall: c (used to encode zero-operand misc instructions)
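the '4 entries on SMALLSTACK' bookkeeping for the f64 coerce ops is easier to see in a sketch (Python; least-significant-word-first order is my assumption -- the spec would have to pin down the word order):

import struct

def f64_to_words(x):
    # reinterpret the 64 bits of an f64 as 4 16-bit words
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    return [(bits >> (16 * i)) & 0xFFFF for i in range(4)]

def words_to_f64(words):
    bits = sum(w << (16 * i) for i, w in enumerate(words))
    (x,) = struct.unpack('<d', struct.pack('<Q', bits))
    return x

assert words_to_f64(f64_to_words(3.14)) == 3.14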

The 16 SYSCALL2 zero-operand instructions are:

0. exec
1.
2. get
3. put
4. spawn
5. pctrl (process control, eg join/wait, kill, etc -- or should these each be separate?)
6. time
7. rand
8. environ
9. getpid
10. signal (?? not sure if we want to do it this way -- signal handler setup)
11. create
12. delete
13.
14. divmod-i16: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)
15. divmod-int: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)

The 16 INSTR-ZERO zero-operand instructions are:

0. halt (terminate program execution)
1. break (mark breakpoint for a debugger)
2. fence-seq
3. peek-i8
4. poke-i8
5. devop
6. read
7. write
8. open
9. close
10. seek
11. flush
12. poll
13. coerce-int-ptr: coerce int in T to ptr
14. log
15. library

an idea for the 64 8-bit instructions (we really only have 6 bits because there are 2 encoding format bits in the 8-bit instruction encoding):

8-bit instruction ideas:

0. ANNOTATE
1. SMALLSTACK SWAP
2. SMALLSTACK OVER
3. SMALLSTACK DROP
4. SMALLSTACK DUP
5. SMALLSTACK ROT
6. POP SMALLSTACK and PUSH to MEMSTACK
7. POP MEMSTACK and PUSH to SMALLSTACK
8. POP SMALLSTACK into ERR
9. POP SMALLSTACK into R4
10. POP SMALLSTACK into R5
11. PUSH ERR onto SMALLSTACK
12. PUSH R4 onto SMALLSTACK
13. PUSH R5 onto SMALLSTACK
14. LD through T into R4 (that is, load in register indirect mode with register T, and put it into R4)
15. LD through R4 and PUSH it onto SMALLSTACK
16. ST R4 through T
17. POP SMALLSTACK and ST it through R4
18. (POP SMALLSTACK, add it to the pointer in R4), LOAD from the addr in parens and PUSH it to SMALLSTACK
19. (POP SMALLSTACK, add it to the pointer in R4), POP SMALLSTACK and ST it to the addr in parens
20. PUSHPC onto MEMSTACK
21. POP MEMSTACK and JD
22. SMALLSTACK ADD-ptr-int
23. SMALLSTACK ADD-int
24. SMALLSTACK NEG
25. sub-ptr SMALLSTACK
26. MUL SMALLSTACK
27. SLL SMALLSTACK
28. SRL SMALLSTACK
29. SRA SMALLSTACK
30. SKIPNZ-int SMALLSTACK
31. SKIPZ-int SMALLSTACK
32. SKIPEQ-int SMALLSTACK
33. SKIPEQ-ptr SMALLSTACK
34. SKIPLE-int SMALLSTACK
35. SKIPLE-ptr SMALLSTACK
36. SKIPLT-int SMALLSTACK
37. SKIPLT-ptr SMALLSTACK
38. SKIPGE-int SMALLSTACK
39. SKIPGE-ptr SMALLSTACK
40. LD SMALLSTACK
41. ST SMALLSTACK
42. PUSH 0 onto SMALLSTACK
43. PUSH 1 onto SMALLSTACK
44. PUSH addr of MEMSTACK (R1) onto SMALLSTACK
45. POP from SMALLSTACK to addr of MEMSTACK (R1)
46. JREL +1
47. JREL +2
48. JREL +3
49. JREL +4
50. JREL +5
51. JREL +6
52. JREL -2
53. JREL -3
54. JREL -4
55. JREL -5
56. JREL -6
57. JREL -7
58. BITAND SMALLSTACK
59. BITOR SMALLSTACK
60. BITXOR SMALLSTACK
61. BITNOT SMALLSTACK
62. CAS SMALLSTACK
63. FENCE

(this fits in all of our compute operations, even SRA and ROT, except that we can only access the first 8 registers; and there are no syscalls, not even malloc or sysinfo; and no way to load immediate constants; and JREL is restricted to +-6 (that means we can go fwd or back skipping over 3 16-bit instructions) (JREL is +-1024 in the 16-bit ISA). And we even managed to fit in EQ, LT, and GE (but not NE).)

(in reality, not only will we profile and choose the 8-bit format to represent the most common instructions or instruction sequences, but we need to leave at least 8 of these free as 'custom instructions' for VMs like OVM implemented on top of BootX)

---

actually let's make 'native floats' of unspecified size, that fit in the same space as native ints and native pointers. Now we can get rid of the f64 registers.

---

ARM Cortex-M4 offers single-precision (32-bit) floating point, which is the same as the native pointer bitwidth (so it seems pretty feasible to have floats of the same precision as pointers)

---

how to represent program memory:

if we use pointers:

if we say Boot instructions take up one 'word' and that program memory is word-addressed, then, in BootX, we won't be able to adequately represent branches to 8-bit encoded instructions.

if we say Boot instructions take two words, then we are wasting tons of space if pointers are implemented naively.

can we just say that all memory is byte-addressed? No, unless we want to add a 'sizeof' -- because we don't know what the pointer size is, so this would leave us unable to traverse data structures with pointers.

So maybe we should just go back to using integers to refer to program locations.

We could have these integers be relative to the PC. But what about 'PUSHPC' and 'JD'? We need to be able to use a JD to go to a PUSHPC that was captured at some other time, when the PC was somewhere else. So maybe make them relative to the beginning of the program.

Or, we could just say that PUSHPC and JD take pointers, and say that the platform word size can be gotten from SYSINFO, and if you want to do pointer arithmetic on the result of PUSHPC you can (or can't, if the implementation doesn't permit it) but you have to ask SYSINFO for the word size first.

---

https://github.com/WebAssembly/threads/blob/master/proposals/threads/Overview.md

---

actually now that CODEPTRs are opaque again, we need a way to do longer jumps again. So we probably need one of:

- LOADCODEPTR or CVT-INT-PTR, which takes an integer and interprets it as bytes from the beginning of the program
- a JMP instruction
- special instructions to do pointer arithmetic, in units of bytes, on CODEPTRs generated from LOADPC

(later) i chose CVT-INT-PTR. i'm not going to reintroduce the restriction that it comes right after a LOADI; to keep things simple, Boot doesn't have all those restrictions that make code analysis easier (eg there is no requirement that the map of SMALLSTACK be statically known). Maybe i'll add those in BootX.

---

how do other ISAs do integer multiply?

http://gem5.org/X86_microop_ISA:

"Mul1s: Signed multiply. ProdHi:ProdLo = Src1 * Src2

Multiplies the unsigned contents of the Src1 and Src2 registers and puts the high and low portions of the product into the internal registers ProdHi and ProdLo, respectively." (my note: they have another instruction for unsigned multiply with the exact same description... i think they meant the SIGNED contents here)

https://content.riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf “M” Standard Extension for Integer Multiplication and Division, Version 2.0

" MUL performs an XLEN-bit × XLEN-bit multiplication and places the lower XLEN bits in the destination register. MULH, MULHU, and MULHSU perform the same multiplication but return the upper XLEN bits of the full 2×XLEN-bit product, for signed × signed, unsigned × unsigned, and signed × unsigned multiplication respectively. If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies. "

so it looks like i'm doing the right thing already? i should check with a simulator or something.

OK here's how 'imul', signed integer multiplication, works on my computer (x86-64). I made a file 'test.asm' with the following:

section     .text
global      _start                              

_start:
  mov  al,0x01       ; multiplicand in AL
  mov  bl,2          ; multiplier in BL
  imul bl            ; one-operand form: AX (AH:AL) = AL * BL

  mov  al,0x3f
  mov  bl,2
  imul bl

  mov  al,0x40
  mov  bl,2
  imul bl

  mov  al,0x7f
  mov  bl,2
  imul bl

  mov  al,0x80
  mov  bl,2
  imul bl
  
  mov  al,0x81
  mov  bl,2
  imul bl

  mov  al,0xc0
  mov  bl,2
  imul bl

  mov  al,0xc1
  mov  bl,2
  imul bl
  
  mov  al,0xff
  mov  bl,2
  imul bl

  mov     eax,1                               ;sys_exit
  int     0x80                                

then ran:

nasm -g -F dwarf -f elf64 test.asm; ld -o test test.o; ddd test

and opened Status-Registers, and set a breakpoint on _start, then did Run and then stepped through.

EAX and EBX are general-purpose 32-bit registers. The least-significant 16 bits of EAX and EBX can be used separately via register names AX and BX (i'll call these 'virtual registers'). AX is further divided into 2 8-bit virtual registers, AH (the high byte of AX) and AL (the low byte), and similarly BH and BL for BX. (see [2] for a diagram).

mov, imul etc have variants for different bitwidths, which (at least in nasm) appear to be automatically selected based on what kind of (virtual) register you give them.

"mov al,0x3F" moves immediate constant 0x3F (63) into AL.

"mov bl,2" moves immediate constant 2 into BL.

"imul bl" does signed multiply: al = al * bl. https://c9x.me/x86/html/file_module_x86_id_138.html gives details on this command. Notably, "For the one operand form of the instruction, the CF and OF flags are set when significant bits are carried into the upper half of the result and cleared when the result fits exactly in the lower half of the result".

here are the results of multiplying various things in the above fashion:
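the register dump itself isn't reproduced here, but the expected AX (AH:AL) values are straightforward to compute by hand (they match the Python output further below):

al     bl   AX      (AH, AL) as signed bytes
0x01   2    0x0002  (0, 2)
0x3f   2    0x007e  (0, 126)
0x40   2    0x0080  (0, -128)
0x7f   2    0x00fe  (0, -2)
0x80   2    0xff00  (-1, 0)
0x81   2    0xff02  (-1, 2)
0xc0   2    0xff80  (-1, -128)
0xc1   2    0xff82  (-1, -126)
0xff   2    0xfffe  (-1, -2)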

So it works as i expected; the most-significant-byte concatenated with the least-significant-byte contains the 16-bit signed bit pattern of the result. This means that only the MSB contains the sign information, and the LSB (least significant byte), taken by itself, cannot be interpreted as signed (ex 127*2 = 254, so the LSB there is 0xfe, which is unsigned 254). It's not necessary to know about the CF,OF flags to understand the result.

The MAX_INT-bitwidth analog of these would probably be useful test cases for our mul-int function.

---

so does the current version of the python mul-int work right? (no it didn't, but i fixed it; below is the new version and its output)

MAX_INT = 127
INT_BITS = 8

def m(in0, in1):
    # signed multiply: return (most significant word, least significant word),
    # each as a signed INT_BITS-wide value
    result = in0 * in1
    if result > 0:
        result_twos_complement = result
    else:
        # two's complement of the full 2*INT_BITS-wide product
        result_twos_complement = ((MAX_INT+1)*2)**2 + result
    result_most_significant_word_twos_complement = result_twos_complement >> INT_BITS
    result_least_significant_word_twos_complement = result_twos_complement & ((MAX_INT+1)*2 - 1)
    if result_most_significant_word_twos_complement <= MAX_INT:
        result_most_significant_word = result_most_significant_word_twos_complement
    else:
        result_most_significant_word = -((MAX_INT+1)*2 - result_most_significant_word_twos_complement)
    if result_least_significant_word_twos_complement <= MAX_INT:
        result_least_significant_word = result_least_significant_word_twos_complement
    else:
        result_least_significant_word = -((MAX_INT+1)*2 - result_least_significant_word_twos_complement)
    return result_most_significant_word, result_least_significant_word

print([m(in0, in1) for (in0, in1) in [(1,2), (63,2), (64,2), (127,2), (-128,2), (-127,2), (-64,2), (-63,2), (-1,2)]])

[(0, 2), (0, 126), (0, -128), (0, -2), (-1, 0), (-1, 2), (-1, -128), (-1, -126), (-1, -2)]

(in comparing to the above, remember that our results are the signed interpretation of the two bitstrings making up the result words; so eg 0x81 * 2 = 0xff02; 0xff is represented by -1, 0x02 is represented by 2, so (-1, 2) is correct)

so it's correct.

btw, there is a range in which this provides the correct answer even if the result_most_significant_word is simply discarded and the result_least_significant_word is (incorrectly) interpreted as a complete signed bitstring:

when -128 <= result <= 127. So, generalizing this, -(MAX_INT+1) <= result <= MAX_INT. Which is nice.
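that claim is easy to check exhaustively against the m() above (a quick verification sketch, not part of the VM):

for a in range(-(MAX_INT+1), MAX_INT+1):
    for b in range(-(MAX_INT+1), MAX_INT+1):
        msw, lsw = m(a, b)
        assert (lsw == a * b) == (-(MAX_INT+1) <= a * b <= MAX_INT)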

---

[3]

" All processors support

...

LC4/MIPS/x86 Operations and Datatypes

...

How Much Memory? Address Size

LC4/MIPS/x86 Registers

...

All ISAs moving to 64 bits (if not already there) ...

Memory Addressing

...

MIPS implements only displacement

...

I-type instructions: 16-bit displacement ...VAX experiment showed 1% accesses use displacement >16

MIPS : Displacement: R1+offset (16-bit): Experiments showed this covered 80% of accesses on VAX

...

SPARC adds Reg+Reg

...

LC4

---

" https://ir.canterbury.ac.nz/bitstream/handle/10092/9405/jaggar_thesis.pdf;sequence=1

RISC Development

In 1975 IBM began a project to "achieve significantly better cost/performance for High Level Language programs than that attainable by existing systems" [Radi82]. The 801 project ...

Only two addressing modes were provided, base register plus immediate index and base register plus register index. The result of the base plus index calculation could be stored back into the base register after each memory access, providing an "auto-increment" facility.

...

((ARM))

a) Base Register plus Offset. The value of a base register and an immediate value or the value in an offset register are combined to form the memory address. The immediate is a twelve bit unsigned integer which may be added or subtracted from the base register (effectively yielding a thirteen bit signed immediate offset). The value of the offset register may be shifted in a similar manner to the second operand of a data processing instruction, (although only by an immediate amount).

b) Base Register Plus Offset with Pre Increment (or Decrement). These modes are similar to Base Register plus Offset, but the result of the base register and offset addition (or subtraction) is written back to the base register. This mode is useful for accessing arrays of data. The shift applied to a register offset can be used to directly scale the array index.

c) Base Register Plus Offset with Post Increment (or Decrement). These modes are similar to the above except the Base plus offset calculation is not performed or written back to, until after the memory access has been made. These modes are also useful for (scaled) array operations. It should be noted that these addressing mode calculations only use existing hardware used for normal addition and subtraction data processing instructions, so they only add decoding hardware to an implementation. A flag in the instruction word indicates that a single byte should be loaded from memory rather than a full word; there is no single instruction to load a 16 bit quantity or to load and sign extend a byte "

---

so i guess the only addr mode change i've come up with recently is for the 32-bit format; we should add in an immediate to autoincrement.

---

well, i guess if we also added in some indexed LD/ST instructions in 64-bit mode, then we could add a scaling immediate to them? that's kind of complex, though, so probably not.

---

another way to have the CALL3 mechanism specify where a return value goes is to have RET read the program code; CALL could push its own address (rather than that of the following instruction) onto the call stack, and then RET could read the CALL instruction to determine where to put the return data.

this has the disadvantage that this procedure, meant for interpretation, is quite different from what would happen during compilation; there, the CALL would be expanded to two instructions; the second instruction would pop the return value from SMALLSTACK and place it where the original CALL wanted it.

one thinks, maybe there should just be an assembler macro instead of this stuff. Indeed, why not just leave out the return value assignment part of the CALL? After all, we're going to have to 'execute' another (possibly virtual) 'instruction' upon return one way or another. So just return the return value on SMALLSTACK, and don't have any return operand in the CALL3. Instead, we have 3 input arguments to the function instead of just 2.
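in interpreter terms, that convention might look something like this (a Python sketch; the VM plumbing and names are made up, and i'm assuming the return address is slipped onto SMALLSTACK underneath the arguments, as in the CALL idea elsewhere in these notes):

class VM:
    def __init__(self):
        self.pc = 0
        self.smallstack = []

def call3(vm, nargs, target):
    # slip the return address under the nargs arguments already on SMALLSTACK
    vm.smallstack.insert(len(vm.smallstack) - nargs, vm.pc + 1)
    vm.pc = target

def ret(vm):
    # the single return value is on top; the return address is right under it.
    # the caller pops the value off SMALLSTACK and places it wherever it wants.
    result = vm.smallstack.pop()
    vm.pc = vm.smallstack.pop()
    vm.smallstack.append(result)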

---

what i have been calling 'register indirect' is a little confusing, because 'indirect' can also be used to mean a double indirection through main memory, that is, the effective address is the pointer which is contained in an absolute memory address in the operand.

(i guess if you say 'REGISTER direct' that's unambiguous, though?)

---

mb for the 8-bit instructions, focus on loads and stores and branches and calls, and provide those loads and stores with various 'addressing modes' based on two accumulator-ish registers and two index-ish registers (think 6502)

, because "The dynamic data reinforce the conclusions from the static analysis that a good measure of the overall efficiency of an architecture is best found in the loads, stores, conditional branches, subroutine calls, and addressing modes. This coincides with the modern view that computers spend most of their time in moving data around and making decisions based on the data rather than in number crunching"

(but also provide stack ops and stack addr modes)

provide addition but not too much other arithmetic there, as arithmetic is relatively uncommon both dynamically and statically. Provide addr modes including displacement by small constants, PC-displacement, indexed addr modes. Provide loads of small immediate constants, and provide MOVs.

---

we might also reach for scaled, which is also very general and sorta common: "Scaled: R1=mem[R2+R3*immed1+immed2]" [5] -- the most common addr mode after displacement (if you include displacement=0) -- of course, who knows how many of those were just 'index-base' (immed1 = 1). Is this practical for us? In 64-bit, sorta -- we have 8 operand bits, so we can split it into 4 groups of 2. But that only lets R2 and R3 range over 4 regs. Much better would probably be to get rid of immed2, and distribute those 2 bits to R2 and R3 -- now R2 and R3 can range over 8 regs. I think this may be worth it (to replace indirect-indexed/index-base), because index-base is less useful if the thing being iterated over has width greater than 1 -- so we'd have 3 bits for R2, 3 for R3, and 2 for immed1. In the 32-bit format i think we have to settle for immed1 = immed2 = 0.

nah, it's more valuable to access all 16 registers.

later: actually i dunno in some benchmarks (eg [6] figure 15), scaled indexed is more common than indexed

(although in [7] Table 6, indexed is more popular than scaled indexed)

eh, mb leave it in, in 64-bit mode only. So there we have 3 bits for the first register (holding the base address), 3 bits for the second register (holding the integer index), and 2 bits for the scale, which is multiplied by the integer index.
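as a decoding sketch (Python; i'm assuming the 2-bit field is a shift amount, giving scales 1/2/4/8 like x86 -- a raw 0..3 multiplier is the other possible reading):

def scaled_effective_address(regs, operand_bits):
    base = (operand_bits >> 5) & 0x7    # 3 bits: base-address register
    index = (operand_bits >> 2) & 0x7   # 3 bits: integer index register
    scale = operand_bits & 0x3          # 2 bits: scale exponent
    return regs[base] + (regs[index] << scale)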

---

" comp.arch › Why did ARM64 drop predication? 4 posts by 4 authors Ivan Godard 9/28/13 I'm only venturing an ignorant guess, but perhaps it was because predication adds another source dependency, which may have been too much for the out-of-order dispatcher to handle.

I don't see as encoding entropy could have played a part - surely there was enough room for the extra register number. But the guess above seems somewhat weak too - anybody have a clue? Click here to Reply Bruce Hoult 9/28/13 On Sunday, September 29, 2013 3:50:53 PM UTC+13, Ivan Godard wrote: > I'm only venturing an ignorant guess, but perhaps it was because > predication adds another source dependency, which may have been too much > for the out-of-order dispatcher to handle.

Exactly.

There are still a number of conditional instructions, e.g.

CSEL Wd = if cond then Wn else Wm

That lets you use the superscalar execution to calculate arbitrary results in parallel and then choose which one to keep. Of course most modern CPUs have that, including x86.

But also:

CSINC d = if cond then n else m+1
CSINV d = if cond then n else !m
CSNEG d = if cond then n else -m

Plus shortcut versions when n=m or to set d to 0 or 1/-1 from the flags.

((my note: also CSET Conditional set and CSETM Conditional set mask))

Other than simple cases such as incrementing a variable or not, predicated execution is useful mostly in simple in-order implementations. Once you have even a little of superscalar you're just as well off to actually execute the instructions and select the results as to turn them into NOPs. "

---

in the above:

CSEL Wd, Wn, Wm, cond: "Conditional Select returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register."
CSINC d = if cond then n else m+1
CSINV d = if cond then n else !m
CSNEG Wd, Wn, Wm, cond = if cond then Wn else -Wm

other conditional things:

CNEG: Conditional negate
CCMN (immediate): Conditional compare negative (immediate), setting condition flags to result of comparison or an immediate value
CCMN (register): Conditional compare negative (register), setting condition flags to result of comparison or an immediate value
CCMP (immediate): Conditional compare (immediate), setting condition flags to result of comparison or an immediate value
CCMP (register): Conditional compare (register), setting condition flags to result of comparison or an immediate value
CINC: Conditional increment
CINV: Conditional invert

you can see what these are at http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100069_0609_00_en/pge1427897722020.html

(already copied this to ootAssemblyOpsNotes2)

---

" It's crystal clear that, for example, almost the entire (( C )) code base is compatible with signed addition defining overflow the same way that every modern CPU handles it. " -- D. J. Bernstein [8]

---

i noticed three exciting things today. From least to most:

1) i was a little worried about how we could just give a pointer to a function entrypoint for first-class functions -- surely some implementations will need some metadata. But i realized that we can use ANNOTATE opcodes to store that stuff at the function entrypoint.

2) recently i have been forgetting what the original purpose of 'stack mode' was supposed to be. In essence there are TWO stack modes: one of them fixes one stack and then uses it like registers, with the argument saying which location on the stack to access; the other one uses the argument to choose a stack, and pushes and pops to it. The second one is predecrement/postincrement. The first one is 'stack mode' or, as i now call it, 'smallstack mode'. So smallstack really is like a second bank of 16 additional registers. Also, if you think of it that way, you don't have to care about the situation when the stack is empty; T just acts like a register regardless, it's just that this register will be clobbered the next time you push to the stack.

3) since we have 5 operand value bits for the opcode, even though we want to use non-immediate addressing modes for CALL3 (one-instruction function calls), this makes no sense when the operand value is above 15, because we only have 16 registers (and 16 stack locations). So we can use those addr_mode/operand_value combinations as more instruction opcodes! This gives us a total of 128 3-operand opcodes instead of just 32. Wow! We can fit pretty much everything we want in there, probably even most of the arithmetic for 8/16/32/64 unsigned/signed and 16/32/64 floats plus higher-order functions on first-class functions! (well, i'm exaggerating a little bit; i think the JVM has almost 256 opcodes, so we can't expect to have room for as much as the JVM; also if we have 128 opcodes then we're almost as complicated as RISC-V, so is it really worth it to even be having a BootX in that case? Also in a way the stakes are higher with 128 opcodes, because we'll feel pretty stupid if we can't do as much as, say, Lua with less than 40 opcodes, but otoh we may not want to add in high-level data structures yet)

---

instruction ideas:

16. call: c sio icp (insert PC into indicated stack underneath c arguments, then jmp to codepointer)
17. ccall: c sio icp (pop item from SMALLSTACK; if it's nonzero, execute CALL)
18. ccall2: ii sio icp (if item is nonzero, push PC to indicated stack, then jmp to codepointer)

---

I guess a key question is, do we want BootX to be just an assembly language or do we want it to include higher-level concepts like OVM (and JVM and Lua). And i think that the answer is that, with so many opcodes available, we should include higher-level things.

Since we have 7-bit subopcodes, we have 127 3-operand instructions but also 127 2-operand-instrs and 127 1-operand-instrs and 127 0-operand instrs

so maybe this is enough to fit in higher-order functions, stack primitives, combinatorial primitives, data structures, etc.

i guess we should start with stuff in:

when we have all that, we can think about:

and then add other stuff?

---

Maybe have a BootX-to-Boot compiler to compile away the complicated instructions

---

Some data types that we need:

Bytes
ASCII strings
Unicode strings
Multidim arrays
Mutable lists
Dicts
Bits
Anything else? Trees? Logic? Objects?

---

hmm after looking at how quickly those 128 3-operand instructions were taken up by a few quick ideas, maybe we won't have enough for HLL data structures after all (unless we put them in the 0-operand instructions). I guess i really need to try to stuff in parts of (RISC-V, WASM, JVM low-level, LLVM/PNaCl, GNU lightning) first and then see how much is left before we can make this decision.

maybe start by making a concordance of RISC-V and WASM and then implementing the things in the intersection or nearly so?

also, we can't really declare victory even on a sketch of Boot until we have a sketch of an LLVM backend and an LLVM frontend.

y'know, actually we might want to be even more pragmatically goal-driven here. The goal of Boot was to be easy to implement, and i think the code sketch shows that it probably will be. The actual goal of BootX is to be a portable, good compile target for OVM, but since OVM's design is itself in its early stages, maybe for design purposes we should pretend that the goal of BootX is to be easy to be either a compile target or a frontend for any of: RISC-V, WASM, LLVM.

(why a compile target for these things, even though our actual intent is just for it to be a frontend? Because we can assume that if these things include an instruction, it's probably because others have found that that instruction is useful for HLL language implementations; so making it a compile target forces us to include those things too. Also, since BootX is supposed to be a compile target for OVM but we don't have OVM yet, having BootX be a compile target for anything might help. Otoh i don't really want to implement RISC-V or WASM, so mb just a compile target for LLVM)

So maybe i should start by sketching and implementing a BootX that only has instructions that directly map to Boot, and then try to write frontends and backends for those three things and along the way add the instructions (like the floating point instructions that i've already added) that seem useful for that.

Then we can see if there is any room left for higher-level stuff.

Also, a next step is to include compiler support for treating most BootX instructions as macros for Boot, in order to allow BootX programs to compile to Boot programs (so that porters need only port Boot).

(again, why not just use RISC-V, WASM, or LLVM? because i fear none of these are easily portable enough; also i fear that LLVM is not expressive enough to deal with weird control flows)

---

BootX can't be full OVM because it only has 16 registers, and OVM needs enough registers to serve as HLL local variables

---

Actually we really shouldn't just specify that native integers ('int's) must always use modular arithmetic -- since the point here is to be a glue language, we need to be able to represent bigints in case the target platform uses those.

---

deprecated ideas for instructions that combine a comparison with a conditionally-executed select/mov (csel):

16. stackif-eq: sio ii ii (pop two items from stack op2; if op1 == op0, then push the first item back onto smallstack, otherwise push the second item back onto smallstack)
17. stackif-ne: sio ii ii (pop two items from stack op2; if op1 != op0, then push the first item back onto smallstack, otherwise push the second item back onto smallstack)
18. stackif-le: sio ii ii (pop two items from stack op2; if op1 <= op0, then push the first item back onto smallstack, otherwise push the second item back onto smallstack)
19. stackif-lt: sio ii ii (pop two items from stack op2; if op1 < op0, then push the first item back onto smallstack, otherwise push the second item back onto smallstack)
20. stackif2-eq: sio ii ii (pop two items from stack op2; if item2 == item1, then push op0 onto smallstack, otherwise push op1 onto smallstack)
21. stackif2-ne: sio ii ii (pop two items from stack op2; if item2 != item1, then push op0 onto smallstack, otherwise push op1 onto smallstack)
22. stackif2-le: sio ii ii (pop two items from stack op2; if item2 <= item1, then push op0 onto smallstack, otherwise push op1 onto smallstack)
23. stackif2-lt: sio ii ii (pop two items from stack op2; if item2 < item1, then push op0 onto smallstack, otherwise push op1 onto smallstack)
24. stackif3-eq: o ii ii (pop two items from SMALLSTACK; if item2 == item1, then op2 = op0, otherwise op2 = op1)
25. stackif3-ne: o ii ii (pop two items from SMALLSTACK; if item2 != item1, then op2 = op0, otherwise op2 = op1)
26. stackif3-le: o ii ii (pop two items from SMALLSTACK; if item2 <= item1, then op2 = op0, otherwise op2 = op1)
27. stackif3-lt: o ii ii (pop two items from SMALLSTACK; if item2 < item1, then op2 = op0, otherwise op2 = op1)
28. stackif4-eq: o ii ii (pop two items from SMALLSTACK; if op1 == op0, then op2 = item1, otherwise op2 = item2)
29. stackif4-ne: o ii ii (pop two items from SMALLSTACK; if op1 != op0, then op2 = item1, otherwise op2 = item2)
30. stackif4-le: o ii ii (pop two items from SMALLSTACK; if op1 <= op0, then op2 = item1, otherwise op2 = item2)
31. stackif4-lt: o ii ii (pop two items from SMALLSTACK; if op1 < op0, then op2 = item1, otherwise op2 = item2)

my conclusion is that if we had 5 data operands instead of 3, this would make sense; all of these could be replaced by 4 instructions:

cmpsel-eq: o i i ii ii: if op0 == op1 then op4 = op2, otherwise op4 = op3

(also cmpsel-ne, cmpsel-le, cmpsel-lt)

but with only 3 data operands, these 4 instructions must be expressed instead as a mass of opcodes wasted on confusing permutations of stuff that should be in operands. And there is a good alternative; just split the instruction into two: a comparison operation that outputs a bool, and a conditional select that takes that bool as input. Even with this, there still aren't quite enough data operands (you'd like 4: the output, the two inputs being conditionally selected, and the bool), but that's still better than wanting 5 operands; and there is an obvious solution, which is what i did: have a CSEL that assumes that the bool is on top of SMALLSTACK
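in sketch form, the split looks like this (Python pseudocode for the semantics; the register/stack plumbing is invented):

def cmp_le(vm, a, b):
    # comparison half: push a bool onto SMALLSTACK
    vm.smallstack.append(1 if a <= b else 0)

def csel(vm, out, if_true, if_false):
    # selection half: consume the bool from the top of SMALLSTACK
    vm.regs[out] = if_true if vm.smallstack.pop() else if_false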

---

teddyknox on Mar 13, 2017, on: LLVM 4.0.0

What differentiates LLVM IR from, say, JVM bytecode? I'm curious because there's a stalled out GNU project under GCC called GCJ that would compile JVM bytecode to native. I wonder if the issue became that statically linking in the JVM in the binary resulted in a lot of bloat, or something more intrinsic to the suitability of JVM bytecode as a platform-independent IR...

drdrey on Mar 13, 2017

Off the top of my head:

[1] https://docs.oracle.com/javase/specs/jvms/se8/jvms8.pdf

pjmlp on Mar 13, 2017

> JVM bytecode is truly portable, whereas target ABI details leak into LLVM IR. A biggie is 32 bit vs 64 bit LLVM IR.

From a few LLVM meetings and Apple's work on bitcode packaging, I think there is some work to make LLVM IR actually architecture independent.

gsnedders on Mar 13, 2017

PNaCl is noteworthy here too, insofar as its IR is very much based on LLVM IR (but portable and stable).

mhh__ on Mar 13, 2017

LLVM IR is an IR format, :). Basically LLVM is just an abstract RISC, whereas the JVM is a lot of that with a truck load of very high level instructions. One could implement these as a superset of LLVM, but that's not what LLVM is. You, mostly, can JIT LLVM IR and use it as a generic bytecode vm: but it's really designed for static compilation.

makapuf on Mar 13, 2017

When will we see processors with a subset of llvm ir in hardware?

pcwalton on Mar 13, 2017

That would be a terrible idea. LLVM IR has an inefficient in-memory representation; every IR node is separately heap-allocated via new and chained together via pointers. This is probably a suboptimal design for the compiler, but it would go from suboptimal to disastrous if directly used for execution.

oconnor0 on Mar 14, 2017

I don't think an implementation of LLVM IR for execution would require the same in-memory representation.

makapuf on Mar 14, 2017

Exactly, my point was more that as we're having C-optimized processors and microcontrollers or even java or lisp based ones, maybe once there is much software readily compileable with llvm, architectures could be optimized for it (not directly porting it, just having a tiny final llvm-based microcode step). By example of course you can't have infinite registers as ssa. But it can influence your instruction set.

pcwalton on Mar 14, 2017

That's not the only reason why you wouldn't want to run LLVM IR directly (if it were possible). You still have the types, which are useless at runtime, and the unlimited register file to deal with.

You could make an ISA which is similar to LLVM IR, but there'd be little point when RISC-V (or even AArch64) already exists.

mhh__ on Mar 13, 2017

Likely never, LLVM IR uses SSA form. This means that optimisations are easier, but the "assembly" is significantly higher level than assembly a la MIPS. IR is for doing optimisations not executing code (although LLVM does have interpreters if that's what ya need)

int_19h on Mar 14, 2017

A more interesting question is, when will we see operating systems using LLVM IR (or similar; some future version of WebAssembly, perhaps?) for binaries on disk, dynamically compiling them and caching the result for the current platform as needed.

mhh__ on Mar 14, 2017

In principle that could happen, but LLVM IR is really not designed for anything other than being transformed by LLVM. One could define an abstract risc machine, to be jitted at either side. LLVM is not quite suitable for this purpose: It assumes quite a few details about the target. Also, this requires a huge amount of co-operation. So far this has only happened in the browser with e.g. ECMAScript standardization, asm.js and WebAsm. The JVM tried to do this, but it's not a good compilation target for languages like C/C++. Therefore, I think this will happen eventually: The web browsers will develop the tools and specifications to make this stuff, then it will get broken off and used outside of the web (I hope, god forbid all software has to be distributed via the web using overlyHypedWebScale.js v2)

pjmlp on Mar 14, 2017

iOS 9, ChromeOS PNaCl?

Although it is not really what you are describing.

---

https://www.eejournal.com/article/fifty-or-sixty-years-of-processor-developmentfor-this/

" DEC engineers developed the VAX ISA during a time when assembly-language coding dominated, partly because of engineering inertia (“we’ve always done it that way”) and partly because the rudimentary high-level compilers of the day produced machine code that did not compare well against tight, hand-coded assembly language. The VAX ISA’s instructions supported a huge number of programmer-friendly addressing modes and included individual machine instructions that performed complex operations such as queue insertion/deletion and polynomial evaluation. VAX engineers delighted in designing hardware that eased the programmer’s burden. Microcoding made it easy to add new instructions to an ISA, and the 99-bit-wide VAX microprogram control store ballooned to 4096 words.

This focus on creating an ever-expanding number of instructions to ease the assembly-language programmer's burden proved to be a real competitive advantage for DEC's VAX. Programmers loved computers that made their jobs easier. For many computer historians, the VAX 11/780 marks the birth of CISC (complex instruction set computing). "

---

[9]

LC4/MIPS/x86 Addressing Modes

Control Transfers I: Computing Targets

LC4, MIPS, x86 Control Transfers

Later: ISA Include Support For...

The RISC Tenets

CISCs and RISCs

"x86 one of the worst designed ISAs EVER, but survives"

---