proj-oot-old-ootAssemblyNotes21Old


should we specify a fixed calling convention for LIBRARY? Perhaps just have all arguments and all return values on MEMSTACK? But then LIBRARY is essentially variadic, and this means that, unless the implementation is told the signatures of all LIBRARY calls, it can't statically know the stack map for DATASTACK. Is this a problem? Does the JVM have two stacks (call stack, operand stack) or one? If two, do JVM stack maps cover the call stack, or just the operand stack? i think what the JVM does is have the 'operand stack' within each single stack frame of the call stack, so there is just one stack, and stack maps are per stack frame. So, yes, the JVM stack maps do map the call stack frames, and so a variadic LIBRARY would be a problem, or at least would require a way to tell any program-verification toolchain the signatures of the various LIBRARY calls. so what does the JVM do with JNI? i think it does need the signatures; see [1]

decision: no, leave that for further up the toolchain -- the spirit of Boot is not to impose all sorts of restrictions for the sake of program analysis.

---

what i'm calling 'indirect indexed' is also called 'index-base' by some [2]

---

also we may need to add an 'absolute' addr mode -- one paper notes that this is essential for accessing global variables. BUT i think the 'constant' addr mode may work for this; or at the least, the constant addr mode can provide the base address for the global variable table, and we can add offsets.

also many other architectures allow displacements of 8 or 16 bits, which at first blush is much more than the 2 bits that we have. But in LOADs and STOREs, which is where it probably matters most, we actually get 7 bits (4 operand bits + 3 addr mode bits, since this operand is a constant) in 32-bit mode, and 12 bits (8 + 4 addr mode bits) in 64-bit mode, so we're probably good.
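A quick sanity check of the displacement ranges those field widths give us (the widths are the ones claimed above; the helper name is just illustrative):

```python
def unsigned_range(bits):
    """Largest value representable in an unsigned field of `bits` bits."""
    return (1 << bits) - 1

# 32-bit mode: 4 operand bits + 3 addr-mode bits reused as constant = 7 bits
print(unsigned_range(7))   # 127
# 64-bit mode: 8 operand bits + 4 addr-mode bits = 12 bits
print(unsigned_range(12))  # 4095
```

So even in 32-bit mode the displacement reaches 0..127, comfortably past the 8-bit-displacement architectures' signed range.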

---

mb we want predecrement instead of postincrement. If you think about it, in a performance-sensitive for loop, you want to count down from n to 0, because then within the loop you only have to compare against 0, and you can forget about n. also:

in the M6809 addr mode dynamic frequency data ("Table IX - Dynamic indexed addressing statistics"), increment was much more common than decrement in some benchmarks.

in "Table 16 - 68020 Addressing Mode Use", decrement was more common.

eh, i think we'll stick with increment. There are many times when you know the base address but not (without further computation) the end address of some data structure.
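The count-down-to-zero idiom mentioned above, as a minimal sketch (the function is just an illustration of the loop shape, not anything from the ISA):

```python
def sum_first(n):
    """Sum 1..n by counting down, so the loop compares against 0 only."""
    total = 0
    i = n
    while i != 0:   # compare-with-zero; n itself is no longer needed
        total += i
        i -= 1      # decrement-style update
    return total

print(sum_first(5))  # 15
```

The point is that once the counter is initialized, `n` is dead, whereas a count-up loop keeps `n` live for the comparison on every iteration.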

---

idea for 32-bit BootX? before taking out the -i16 instructions:

The 32 three-operand instruction opcodes, mnemonics, and operand signatures are (note: in 32-bit instructions, constants ('c') are 7 bits, not 3 or 4 bits, because the addressing mode is treated as part of the constant):

0. bne-int: c ii ii (branch-if-not-equal on ints)
1. bne-ptr: c ip ip (branch-if-not-equal on pointers)
2. jrel: c c c (unconditional relative jump by a constant signed amount)
3. ldi: oi c c (load immediate 8-bit int)
4. ld: c o si (load from memory address plus unsigned constant)
5. st: c so i (store from register to memory address plus unsigned constant)
6. addi-int: c io ii (in-place addition of ints and immediate constant)
7. addi-ptr-int: c iop ii (in-place addition of immediate int constant to ptr)
8. ble-int: c ii ii (branch if less-than-or-equal on ints)
9. ble-ptr: c ip ip (branch if less-than-or-equal on pointers)
10. add-int: oi ii ii (addition of ints)
11. add-ptr-int: op ip ii (add an int to a pointer)
12. CAS
13. bne-i16
14. annotate: c c c (can be ignored)
15. bitor: io ii ii (bitwise OR)
16. bitand: io ii ii (bitwise AND)
17. bitxor: io ii ii (bitwise XOR)
18. sub-ptr: op ip ip (subtraction of pointers)
19. mul-int: oi ii ii (integer multiply)
20. sll: io c ii (shift left logical (multiplication by 2^c (mod MAX_INT+1)))
21. srl: io c ii (shift right logical (division by 2^c, rounding towards zero))
22. sra: io c ii (shift right arithmetic (division by 2^c, rounding towards negative infinity))
23. ble-i16
24. add-i16: io16 ii16 ii16 (integer addition of 16-bits)
25. mul-i16: io16 ii16 ii16 (integer multiplication of 16-bits)
26. add-f: iof if if (float addition)
27. mul-f: iof if if (float multiplication)
28. div-f: iof if if (float division)
29. bne-f: c if if (branch-if-not-equal on float)
30. ble-f: c if if (branch-if-less-than-or-equal on float)
31. instr-two: c ? ? (used to encode two-operand instructions)
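To make the operand signatures concrete, here's a minimal sketch of the semantics of two of the ops above on a toy register file. The 32-bit wrap-around width and the register-file representation are my illustrative assumptions, not anything the spec pins down:

```python
MASK32 = 0xFFFFFFFF  # assumed word width for this sketch

def add_int(regs, o, a, b):
    """add-int: oi ii ii -- integer addition, wrapping at 32 bits."""
    regs[o] = (regs[a] + regs[b]) & MASK32

def sll(regs, o, c, a):
    """sll: io c ii -- shift left logical by constant c (mod 2^32)."""
    regs[o] = (regs[a] << c) & MASK32

regs = [0] * 8
regs[1], regs[2] = 7, 5
add_int(regs, 0, 1, 2)
print(regs[0])       # 12
sll(regs, 3, 2, 0)
print(regs[3])       # 48, i.e. 12 * 2^2
```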

The 16 two-operand instruction opcodes, mnemonics, and operand signatures are:

0. cpy: o i (copy from register to register)
1. pop: o sim
2. push: som i
3. sysinfo (query system metadata): o c
4. neg: io ii (arithmetic negation)
5. bitnot: io ii (bitwise negation)
6. pickk: c sio (pick c on stack)
7. rollk: c sio (roll c on stack)
8. neg-f: iof if (arithmetic negation of float)
9.
10.
11. cvt-i16-i: oi ii16 (convert int16 to int (sign-extend?))
12. cvt-i-i16: oi16 ii (convert int to int16 (how do we deal with ints bigger than 2^15-1 or smaller than -2^15? is that a critical error or do we mandate saturation or something like that?))
13. neg-i16: oi16 ii16
14. fclass-f: oi if (classify a floating point number; see eg FCLASS in RISC-V, eg FCLASS.S in section 8.9)
15. instr-one: c ? (used to encode one-operand instructions)
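A minimal sketch of the stack semantics that pickk and rollk presumably intend, assuming the usual Forth PICK/ROLL convention (that convention is my assumption; the note above doesn't define them):

```python
def pick(stack, c):
    """Copy the item c below the top of stack onto the top."""
    stack.append(stack[-1 - c])

def roll(stack, c):
    """Rotate the item c below the top of stack up to the top."""
    stack.append(stack.pop(-1 - c))

s = [10, 20, 30]   # top of stack is the last element
pick(s, 2)         # copy the item 2 below the top (10)
print(s)           # [10, 20, 30, 10]

s2 = [1, 2, 3]
roll(s2, 2)        # rotate the item 2 below the top (1) to the top
print(s2)          # [2, 3, 1]
```

Under this reading, `pick 0` is DUP and `roll 1` is SWAP.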

The 16 one-operand instruction opcodes, mnemonics, and operand signatures are:

0. jd: pc (dynamic jump)
1.
2.
3. cvt-f-i16: if (convert float to int16, pushing 4 entries onto SMALLSTACK) (is this round? floor? what do we do when the f is larger or smaller than the largest or smallest int16 -- is this a critical error or do we saturate or something else?)
4. cvt-i16-f: iof (convert int16 to float, popping 4 entries from SMALLSTACK)
5. coerce-f-i16: if (coerce float to int16, pushing 4 entries onto SMALLSTACK)
6. coerce-i16-f: iof (coerce int16 to float, popping 4 entries from SMALLSTACK)
7. malloc: op
8. mdealloc: ip
9. mrealloc: iop
10. peek-i16: ip (load i16 from external memory address, pushing onto SMALLSTACK)
11. poke-i16: iop (store i16 to external memory address, popping from SMALLSTACK)
12.
13.
14. syscall2
15. syscall: c (used to encode zero-operand misc instructions)
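One possible answer to the out-of-range question raised for the int16 conversions above is saturation (clamping) rather than trapping. A minimal sketch of that choice, which is only one of the options the note considers:

```python
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1  # -32768, 32767

def cvt_i_i16_saturating(x):
    """Clamp an int into the int16 range instead of raising an error."""
    return max(INT16_MIN, min(INT16_MAX, x))

print(cvt_i_i16_saturating(100000))   # 32767
print(cvt_i_i16_saturating(-100000))  # -32768
print(cvt_i_i16_saturating(1234))     # 1234 (in-range values pass through)
```

The alternative (a critical error on overflow) is simpler to specify but pushes range checks onto every caller.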

The 16 SYSCALL2 zero-operand instructions are:

0. exec
1.
2. get
3. put
4. spawn
5. pctrl (process control, eg join/wait, kill, etc -- or should these each be separate?)
6. time
7. rand
8. environ
9. getpid
10. signal (?? not sure if we want to do it this way -- signal handler setup)
11. create
12. delete
13.
14. divmod-i16: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)
15. divmod-int: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)
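A minimal sketch of how divmod-int could behave on SMALLSTACK: consume two items and push the quotient, then the remainder on top. The operand order and the rounding rule (here: truncation toward zero, as in C) are my assumptions; the note doesn't pin them down:

```python
def divmod_int(smallstack):
    """Pop divisor then dividend; push quotient, then remainder on top."""
    b = smallstack.pop()      # assumed: divisor on top
    a = smallstack.pop()
    q = int(a / b)            # truncate toward zero (C-style)
    r = a - q * b             # remainder consistent with that quotient
    smallstack.append(q)      # quotient first...
    smallstack.append(r)      # ...then remainder ends up on top

s = [7, 2]
divmod_int(s)
print(s)  # [3, 1]
```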

The 16 INSTR-ZERO zero-operand instructions are:

0. halt (terminate program execution)
1. break (mark breakpoint for a debugger)
2. fence-seq
3. peek-i8
4. poke-i8
5. devop
6. read
7. write
8. open
9. close
10. seek
11. flush
12. poll
13. cvt-int-codeptr: coerce int in T to ptr
14. log
15. library

---

in 32-bit format:

i'm thinking of breaking regularity and adding more instruction formats, with long immediates (or maybe just more LOADI-in-Boot-like instructions with concatenated arguments): a JMP with a large immediate jump target (PC-relative, or relative to program location zero?), and a LD/ST with a large immediate (absolute) address target (relative to start of global memory, or implementation-dependent?). Ppl say [3] that even 12-bit immediates are considered too narrow; in 32-bit format we'd have 14 bits this way.

We can get rid of BNE-F and BLE-F and replace them with an FCMP (like RISC-V has). We'd need to get rid of one more, maybe MUL-F? that kinda hurts, but is probably worth it... also i'm thinking that LOADI should be a different format, not just concatenated, because it should use its addr mode bits as part of its value.

I'm thinking that in code space (which is in bytes), 0 is the start of code, and in data space (which is in units of slots/native ptr sizes), 0 is the start of the data segment. The interpretation of negative addresses is platform-dependent (idea: some platforms could use each negative address in code space as a separate entry point to a library, allowing for 8k entry points); that means that, even with a concatenated format, we only have 14-bit immediates for LD/ST/LOADI, so without the sign bit that's 13 bits, or 8k. I guess we need another format for the JMP, which has 21 bits, or 20 bits without the sign (so 1 MiB?); that's pretty good.
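Checking the arithmetic behind the ranges claimed above (widths as stated in the note):

```python
# 14-bit signed immediate for LD/ST/LOADI: 13 magnitude bits -> 8k
print(2 ** 13)   # 8192
# 21-bit signed immediate for JMP: 20 magnitude bits -> 1 MiB of reach
print(2 ** 20)   # 1048576
```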

done

---

in 32-bit format:

also CALL, RET, and conditional CCALL, CRET. CSEL generalizes conditional load, b/c we have register indirect addr mode.

done

---

in 32-bit format:

actually the conditionals are probably pretty useful, mb we should make them a high priority. 3-operand CSEL subsumes 2-operand CCPY, so mb make room for 3-operand CSEL. mb mv the rarely used mul-f and div-f down to zero-operand. Also remember that we need a 2-operand CMP, and we probably also want to add in one- or two-operand CINC, and mb CNEG, CINV.

done

---

in 32-bit format:

if we want CSEL and CCPY, we need either a 3-operand CMP, or a 2-operand CMP-to-SMALLSTACK (actually more than one: CMP-to-SMALLSTACK-LE, CMP-to-SMALLSTACK-NE), or a zero-operand CMP-SMALLSTACK, or something similar. Mb move the floating point ops here down to one-operand or zero-operand instructions, and use SMALLSTACK for the other operand(s). Since we only have -SMALLSTACK versions of CSEL and CCPY anyways, a 2-operand CMP-to-SMALLSTACK is sufficient; we don't need a 3-operand CMP.
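A minimal sketch of the division of labor described above: a 2-operand CMP pushes a flag onto SMALLSTACK, and CSEL pops that flag to choose between two sources, so no 3-operand CMP is needed. Names and exact semantics here are my assumptions:

```python
def cmp_to_smallstack_le(smallstack, a, b):
    """2-operand compare: push 1 if a <= b, else 0, onto SMALLSTACK."""
    smallstack.append(1 if a <= b else 0)

def csel(smallstack, if_true, if_false):
    """Conditional select: pop the flag and pick one of two values."""
    return if_true if smallstack.pop() else if_false

s = []
cmp_to_smallstack_le(s, 3, 7)
print(csel(s, "low", "high"))  # low
```

This is the same factoring AArch64 uses (flags set by a compare, consumed by CSEL), just with a stack slot standing in for the condition flags.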

done

---

https://stackoverflow.com/questions/22168992/why-are-conditionally-executed-instructions-not-present-in-later-arm-instruction

" ...modern systems have better branch predictors and compilers are much more advanced so their cost on instruction encoding space is not justified...

    The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.

And it continues

    A very small set of “conditional data processing” instructions are provided. These instructions are unconditionally executed but use the condition flags as an extra input to the instruction. This set has been shown to be beneficial in situations where conditional branches predict poorly, or are otherwise inefficient.

Another paper titled Trading Conditional Execution for More Registers on ARM Processors claims:

    ... conditional execution takes up precious instruction space as conditions are encoded into a 4-bit condition code selector on every 32-bit ARM instruction. Besides, only small percentages of instructions are actually conditionalized in modern embedded applications, and conditional execution might not even lead to performance improvement on modern embedded processors.

...

In addition, predication does not play well with out-of-order execution: it can require 4 data flow source operands (predicate, current value of destination register [needed if predicate is false], and two source register values) which must be checked for availability. AArch64's predicated instructions only require three sources (which is more likely to be supported by the OoO machinery [e.g., to support FMA] and is more friendly to a cracking into 2-source µops [like Alpha 21264 did for CMOV]). -- paul-a-clayton

...

I couldn't even find a conditional branch to register or conditional return, and no conditional loads. -- ant6n

...

For AArch64 the number of registers has been doubled compared to 32-bit ARM, but again you don't have any remaining bits for the new 3 high bits of the registers. If you want to use the old encoding then you must "borrow" either from the narrow 12-bit immediate or the 4-bit condition. 12-bit is too small compared to other RISC architectures such as MIPS and reducing it making everything worse, so removing the condition is a better choice

...

Conditional execution is a good choice in implementation of many auxiliary or bit-twiddling routines, such as sorting, list or tree manipulation, number to string conversion, sqrt or long division. We could add UART drivers and extracting bit fields in routers. Those have a high branch to non-branch ratio with somewhat high unpredictability too.

However, once you get beyond the lowest level of services (or increase the abstraction level by using a higher level language), the code looks completely different: code blocks inside different branches of conditions consists more of moving data and calling sub-routines. Here the benefits of those extra 4 bits rapidly fade away. It's not only personal development but cultural: Culturally programming has grown from unstructured (Basic, Fortran, Assembler) towards structural. Different programming paradigms are supported better also in different instruction set architectures.

A technological compromise could have been the possibility to compress the five bit 'cond.S' field to four or three most frequently used combinations.

...

It's somewhat misleading to say that conditional execution is not present in ARMv8. The issue is to understand why you don't want to execute some instructions. Perhaps in the early ARM days, the actual non-execution of instructions mattered (for power or whatever) but today the significance of this feature is that it allows you to avoid branches for small dumb jumps, for example code like a=(b>0? 1: 2). This sort of thing is more common than you might imagine --- conceptually it's things like MAX/MIN or ABS (though for some CPUs there may be instructions to do these particular tasks).

In ARMv8, while there are not general conditionally executed instructions there are a few instructions that perform the specific task I am describing, namely allowing you to avoid branching for short dumb jumps; CSEL is the most obvious example, though there are other cases (e.g. conditional setting of conditions) to handle other common patterns (in that case the pattern of C short-circuited expression evaluation).

IMHO what ARM has done here is what makes the most sense. They've extracted the feature of conditional execution that remains valuable on modern CPUs (avoid many branches) while changing the details of the implementation to match the micro-architecture of modern CPUs.

"

---

[4]