proj-oot-ootAssemblyNotes15

Difference between revision 7 and current revision

No diff available.

pcwalton 13 hours ago [-]

Also worth noting: x86 supports four hardware breakpoints, which are sometimes more convenient, since they don't involve overwriting any instructions. (They're also more flexible because they can be used to trap on memory reads, etc.) https://en.wikipedia.org/wiki/X86_debug_register

reply

EvanAnderson? 2 hours ago [-]

I remember the first time I read about the hardware-based debugging functionality in the 80386 and being in awe. I'd programmed 6502 and 8086 assembler and was just blown-away by the treasure-trove of debugging functionality the '386 had available to it. It seemed like it was just short of an in-circuit emulator and so very, very cool.

reply

compiler-guy 59 minutes ago [-]

It's not just x86, pretty much every modern architecture has hardware breakpoint registers, data watch point registers and so on.

reply

wazari972 2 hours ago [-]

I thought that these HW 'breakpoint' registers were (only) used to implement watchpoint, that is, read/write operations on data. Can they be used to implement 'instruction breakpoints'?

reply

pcwalton 1 hour ago [-]

You can set the "break on" bitfield to "execute" (0) to have the debug register function as a normal breakpoint.

reply

jasonzemos 1 hour ago [-]

On register %dr7 there are 4-bit switches for each of the 4 breakpoints. The first two bits can trap on Instruction (0b00), Write (0b01), and ReadWrite? (0b11). The second two bits specify the alignment of Byte (0b00), Short (0b01), Long (0b10), and Int (0b11).

reply

--- https://m.phys.org/news/2017-01-middle-popular-yields-more-efficient-parallel.html

" Cilk adds just two commands to C: "spawn," which initiates a fork, and "sync," which initiates a join. That makes things easy for programmers writing in Cilk but a lot harder for Cilk's developers. ... All previous compilers for fork-join languages are adaptations of serial compilers and add the runtime invocations in the front end, before translating a program into an intermediate representation, and thus before optimization. In their paper, the researchers give an example of what that entails. Seven concise lines of Cilk code, which compute a specified term in the Fibonacci series, require the compiler to add another 17 lines of runtime invocations. The middle end, designed for serial code, has no idea what to make of those extra 17 lines and throws up its hands.

The only alternative to adding the runtime invocations in the front end, however, seemed to be rewriting all the middle-end optimization algorithms to accommodate the fork-join model. And to many—including Leiserson, when his group was designing its first Cilk compilers—that seemed too daunting.

Schardl and Moses's chief insight was that injecting just a little bit of serialism into the fork-join model would make it much more intelligible to existing compilers' optimization algorithms. Where Cilk adds two basic commands to C, the MIT researchers' intermediate representation adds three to a compiler's middle end: detach, reattach, and sync.

The detach command is essentially the equivalent of Cilk's spawn command. But reattach commands specify the order in which the results of parallel tasks must be recombined. That simple adjustment makes fork-join code look enough like serial code that many of a serial compiler's optimization algorithms will work on it without modification, while the rest need only minor alterations. "

https://cpc2016.infor.uva.es/wp-content/uploads/2016/06/CPC2016_paper_12.pdf

" Tapir introduces three new instructions to LLVM’s IR in order to represent parallel tasks. These instructions are detach , reattach and sync . "

detach label <detached>, label <continuation> reattach label <continuation> sync

http://wmoses.scripts.mit.edu/docs/parallelir.pdf

" Namely, all of these frameworks share two fundamental con- cepts: a detach operation and a synchronize operation. In this sense, a detach operation represents marking code to be run in par- allel. In OpenMP?, detach can be thought of entering a parallel con- text such as a parallel for loop. In Cilk, detach can be though of a spawn operation, assuming all function arguments have been eval- uated. Using PThreads, detach represents running a thread on a specific function. The synchronize operation represents taking code that was pre- viously detached and waiting for it to complete "

" The new detach instruction is a block terminator. This func- tion acts much like a conditional break, as it is provided to blocks which it can branch to. However, unlike the conditional break, the detach instruction represents branching to both blocks simultane- ously. For clarity, we will denote the first argument of the instruc- tion as the continuation while the second argument represents the spawned block. Additionally, a new reattach instruction will be created. This instruction is also a block terminator. It takes one argument – the block representing the continuation. Theoretically this is not nec- essary, however, its inclusion greatly simplifies optimization passes and compilation. The reattach has one successor – the continu- ation. However, unlike successors produced by a normal branch instruction, successors to a reattach instruction cannot use SSA registers inside of the reattach ed block –even in PHI nodes. This is done to represent the fact that the detach ’ed block is not guar- enteed to have executed before the continuation. However, any reg- isters before the detach statement can be used in either the contin- uation or the detach ed block without the use of PHI nodes since it is guarenteed that they were executed before entering the block. When a reattach instruction is executed in a serial context (e.g. no parallelism is available on the machine), it simply acts as an un- conditional branch to the continuation – thus representing the serial ellision. When executed in a parallel context, however, a reattach instruction is used to signal the end of execution for the detach -ed thread. Any detach ’s to function calls using the old syntax can be con- verted to use this new syntax as Figure 3 illustrates. In order to maintain the correctness as shown above, we give this new detach function-like properties, where it can consider any memory accesses in the detached code to be its arguments. "

http://wsmoses.com/tapir.pdf

" Conceptu- ally, the detach instruction is similar to a function call, and the reattach instruction is similar to a return. The three in- structions have the following syntax: detach label b , label c reattach label c sync where b , c ∈ V . The label keywords indicate that b and c are (labels of) basic blocks in V . A detach instruction takes a detached block b and a con- tinuation block c as its arguments, and it allows b and c to operate in parallel. A detach instruction in a block a ter- minates a and spawns a parallel task starting with b which can operate in parallel with the continuation block c . The CFG contains a detach edge ( a , b ) ∈ E and a continue edge ( a , c ) ∈ E . The block b must be a single-entry block for which every exit from the sub-CFG reachable from b — the parallel task spawned by a — is terminated by a reattach whose continuation c matches the continuation in the corre- sponding detach instruction — the detach instruction reat- tached by the reattach . For each such reattach instruc- tion, the CFG contains the reattach edge ( b 0 , c ) ∈ E , where b 0 is the block terminated by the reattach . In a sense, detach works like a function call to b that resumes at c after the subcomputation rooted at b returns, whereas reattach acts like a return. But unlike a function call and return, b is a block within the CFG of the function containing it, rather than a di ff erent function in the program, and b and c can operate in parallel. A detach does not require that b and c execute in parallel, however. It simply allows the runtime system to schedule them for parallel execution, if it so desires. Tapir assumes that every CFG G = ( V , E , v 0 ) obeys the following invariants: 1. A reattach instruction reattaches one detach instruc- tion. 2. For each reattach instruction j that reattaches a detach instruction i , every path from v 0 to the block terminated by j passes through the detach edge of i . 3. Every path starting from the detached block of a detach instruction i must reach a block containing a reattach instruction that reattaches i . 4. If a path from a detach instruction i to a reattach instruction j that reattaches i passes through the detach edge of another detach instruction i 0 , then it must also pass through a reattach instruction j 0 that reattaches i 0 . 5. Every cycle containing a detach instruction i must pass through a reattach instruction that reattaches i . 6. Any immediate successor of a reattach instruction can- not contain “ φ instructions.” "

---

instead of hard limits on length of custom instruction code and nesting of custom instructions, we could just say 'whatever is fine, as long as no fully expanded custom instruction is more than 64k'. But that would mean that if someone is programming custom instructions and using a library of someone else's custom instructions nested within them, that changes in the implementation of the library could make the top-level illegal. So mb better to have hard limits. But that's such a waste, because the limit will almost never be achieved, so we'd be wasting a lot of potential custom instruction length.

How about: hard limits on every layer except the top one; the top one's limit is computed assuming that the layers beneath it are all their own max, but by using more or less of them its own limit changes. This way, the top one's limit is not dependent on how the ones beneath are implemented. This means that the 'layer number' of a custom instruction is part of its fixed interface signature.

But in fact, we could do the same thing for every intermediate layer. The various intermediate layers could have different max expanded code size budgets, monotonically nondecreasing. Presumably the max expanded code size budget of each layer would be powers of two.

But in this case, we could simplify it by throwing out the idea of a fixed numbering of layers, and just specify the max expanded code size budget of any custom instruction as part of its interface signature/API. These could be constrained to be powers of two, and the total max is 64k (2^16). So there are 16 choices (so this parameter can be specified in only 4 bits; maybe the other 4 bits could be flags like 'deterministic', 'side-effectless', 'idempotent', 'threadsafe', etc).

Alternately, the other 4 bits could indicate how much hiddenstack is needed/consumed by this custom instruction, which the compiler should track since there is a small, hard limit on hiddenstack size.

What should the stack size limit be? These 4 bits suggest 16. 16 is also nice because we can't have more than 16 layers of custom instructions. https://people.ece.cornell.edu/land/courses/ece5760/DE2/Stack_cpu.html mentions 16 and 8 as sizes for small stacks. https://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html#212 mentions 16 as the size for a stack buffer (" In the case of a stack that is only used as an expression evaluation stack, "large" may only be approximately 16 elements, since single expressions seldom nest very deeply (Haley 1962). "). Let's say 16.

---

note: we used to have 12-bit operands (4k variables). Is this enough to encode most HLL programs? Probably; MS suggests using no more than 64 locals because the CLR runtime won't consider enregistering anymore [1]. So that's only 6 bits of local variables, and we'd be giving 12 bits for all variables.

but currently we have 8-bit operands (256 vars). This is only 4x the 64 suggested locals in CLR.

---

note: since we are throwing out stack addressing and address modes in SHORT, the hiddenstack is no longer enough for custom instructions. We have three alternatives:

discussion on what to do about the lack of direct access to the hiddenstack TOS:

one of the simplest things to implement is to mmap the hiddenstack TOS to memory location 6. The disadvantage is that this is a special case that doesn't expand our expressive power except by just enough for this problem. We could also add TOS mmaps for datastack and callstack, but then we'd have more than 8 special purpose registers, and only 8 GPRs. An obvious choice is to make the hiddenstack and datastack TOSs mmapped, but not callstack.

btw, somewhat tangentially, we probably need a RET instruction if that CALL instruction is to be useful in SHORT. Otherwise you have to return with a temporary location still in use: (a) pop the callstack (or possibly, the entire regreg saved state) to a temporary variable, (b) set the PC to the popped value.

btw, i got rid of swap, but didn't i figure out that we need it, if there is no scratch register within hidden functions? how else would you swap between datastack and callstack?

if we have a 'scratch' register reserved for custom instructions, that's nice i guess, but we incur the cost of saving and restoring this register in between calls to custom instructions. Although i suppose we could have a 'callee saves' policy there, rather than doing it automatically. I guess if this is not automatic, then it is also simple to implement, i guess even simpler than the mmap.

re-introducing 1-bit address modes into SHORT is the most powerful option, and we would get to eliminate the LOAD/STORE instructions, which means we'd have room for a RET and SWAP. But it also introduces the most theoretical complexity, and also it means the SHORT instruction set is no longer RISC.

i guess based on implementation and conceptual simplicity for SHORT the best answer is a reserved scratch register for use by custom instructions.


hmm should we try for 12 instructions instead of 16?

(b/c base 12 = 2*2*3)

currently we have:

ANNOTATE_OPCODE = 0 LOADI_OPCODE = 1 CPY_OPCODE = 2 ADD_U16_U16_OPCODE = 3 SUB_U16_U16_OPCODE = 4 LEQ_U16_U16_OPCODE = 5 ADD_PTR_U16_OPCODE = 6 SUB_PTR_U16_OPCODE = 7 CALL3_OPCODE = 8 LEA_OPCODE = 9 SYSCALL_OPCODE = 10 JZREL_OPCODE = 11 JNZREL_OPCODE = 12 J_OPCODE = 13 JI_OPCODE = 14 SWAP_OPCODE = 15

initial thoughts:

let's get rid of SWAP at least. Probably also LEA.

also, i'm glad that i thought through SHORT, but now, instead of officially specifying both SHORT and MEDIUM, mb just list SHORT as 'reserved for future use' and don't include it in the spec (except that it still is called 'SHORT' and still has format bits 00 and still is a 64-bit fixed length package). No, as the next paragraph indicates, we are still making changes to everything based on SHORT.

also remember to look at ootAssemblyNotes14, which suggests a different instruction set with less 'signatures' and which also removes the need for addr modes in SHORT, at the expense of making the PC writable. The more i look at that list (today), the more i kind of like it. Perhaps we could still provide the J instructions in MEDIUM mode and make PC read-only in MEDIUM mode? But would that require signatures again?

in that list there are SKIPZ and SKIPNZ. But since we have two input operands, we can use the second input to specify how many to skip. In which case we may as well allow short backwards jumps too and call this JZREL and JNZREL, right? wait no, we had to leave it SKIP because of custom instruction expansion; remember that we were going to let the implementation use the two unused operands to turn each SKIP into a jump of up to 64k, allowing for depth 2 of custom instruction nesting if each custom instruction is up to ~256 instructions. Since the PC is writable in SHORT, we can already do a JREL by just ADDing or SUBing to the PC.

also is CALL3 really CALL3? Because one input operand must be used to give the location of the function reference in memory, so really we just give the function one input and get one output, right?

hmm.. we could actually replace SYSCALL with J (jump to label, that is jump to entry in constant jump table), and just define the first n entries of the jump table to be syscalls. But no, part of the point of syscalls is that they accept an input and give an output, like call3s. Well, how about we rename CALL3 to CALLI (call indirect), and replace SYSCALL with CALL? The low CALLs are the syscalls (at least 16 of them, because in SHORT we want to reach at least 16 syscalls, but probably many more; how about reserving 256? but no, currently our MEDIUM operands are only 8 bits, so that would be all of them, so maybe this should just be SYSCALL and not CALL). This isn't that useful, though, because we already have custom instructions; can't they do everything that CALL does, and more? Will have to think about this more.

as nice as REGREG is, it should probably be a syscall that has an argument saying how many registers are being saved.

registers/memory map with a 0 register prepended: 0: 0 register 1: datastack 2: callstack 3: hiddenstack 4: first-class function (? do we still need a register for this now that SHORT operands can access 16 memory locations instead of 4? probably not) 5: err 6: GPR (now that SHORT has 4 operand bits, we probably don't need a GPR here) 7: PC 8: regreg (this is used to tell the implementation to copy registers to memory so that the copy will persist across function calls)

ok so it's been 5 months and i don't remember the details. Must re-read at least ootAssemblyNotes14, ootAssemblyNotes15, and ootAssemblyThoughts. But after only glancing at some stuff, for future reference, i think the current new plan is:

16 instructions, all can be signature 'out in in' (except for maybe loadi?): annotate (note: annotate with all zero operands is NOP) loadi load store cpy skipz (note: this skips the rest of this 64-bit instruction pack, if we're in a SHORT, and the entire next 64-bit instruction/instruction pack, which might be either one instruction (if it is MEDIUM) or 4 (if it is SHORT)) (note: the skips only use one operand, the other two are reserved for the implementation for use in storing an adjusted target relative jump during custom instruction expansion) skipnz calli (call3?) pop (note: pop and push take stack pointers (pop's at in and push's at out?) and i guess also take a number saying how far down in the stack they operate) push add-u16 sub-u16 leq-u16 add-ptr (i guess that when navigating a struct this should go to the next struct field, even if this means you are moving different amounts of bytes as you move over different fields) sub-ptr syscall

(how shall we use the extra operands in load, store, cpy?)

no addr modes in SHORT.

SHORT encoding format: 1 form bit + 4 packed 16-bit instructions (actually 1 15-bit instruction + 3 16-bit instructions): 1 form bit (0) + (3 opcode bits + 3 x (4 operand bits) ) + 3 x (4 opcode bits + 3 x (4 operand bits)). (so the first instruction in every SHORT must be one of the first 8 primitives (or is it, an even-opcode primitive?); the others must be one of the first 16 primitives; note that the first instruction can be decoded as normal). Encodings are little-endian, and bits are from LSB to MSB.

MEDIUM encoding format: same as before. 2 form bits (10), 2 reserved bits, 5 x (4 addr mode bits + 8 data bits). field1 is the opcode. field0 is a type argument to allow dynamic types to be passed in calls to polymorphic functions.

writable PC in SHORT

read-only PC in MEDIUM

addr modes in MEDIUM. But no STACK addr modes: 000: immediate 001: register direct 010: register indirect 011: RESERVED 100: constant table 101: boxed register direct 110: boxed register indirect 111: RESERVED 1xxxx: RESERVED

some non-primitive instructions (do we allow 'inout inout inout' signatures here? do we have special signatures for the jump instructions?): MOV COW AND OR NOT NEG MUL J JI LEA SWAP

registers/low memory map: 0: 0 register 1: PC 2: err/status (usually zero for no error, nonzero for error, but also used for carry) 3: scratch register (by convention, this may be overwritten by some custom instructions, and should be callee-saved, if needed, onto the hiddenstack before invoking any custom instruction) 4: hiddenstack (hard limit on hiddenstack stack size: 15) 5: datastack 6: callstack 7: reserved

note: we took out 'regreg' so now we want two syscalls to load and save register state

note: we want syscalls to load from the constant table and from the jump table

(assuming that we no longer need a first-class function register (see above); must think about that more, but note that it's not even referenced in pymovm.__init__.py).

some ways that this differs from typical bare-bones assembly:

declarations of custom instructions include the following as part of the signature (note that each of these fit in a nibble, so both of them together can be packed into a representation within one byte):

The compiler can check custom instructions to make sure that hiddenstack overflow and underflow is impossible. Custom instructions must leave the hiddenstack pointer in the place where they found it. Custom instructions cannot call each other in a recursive loop. The compiler can check that the code length budget for custom instructions is not overflowed (this allows an implementation to expand custom instructions into a sequence of primitive instructions, and change SKIPs into internally-implemented JRELs which jump by no more than 64k, which can be held in the 16 bits of the two unused operands in a SKIP in MEDIUM encoding).

should put the stuff in this section into an updated README after the code is updated to make it true:

---

interesting idea for even easier initial bootstrapping on a new platform: write a full OVM interpreter completely in SHORT (or at least compile one to that). Now for an initial bootstrap, someone only has to implement a SHORT interpreter. So they don't have to think about addressing modes or custom instructions or polymorphism. Further simplifications along these lines may be possible.

Of course, if they want good performance or good host platform integration, they will need to write their own full OVM interpreter in the host language, but at least this way they don't have to. And it doesn't add complexity, since SHORT is a proper subset of OVM anyway.

Ooo, i think i like that.

We might rename SHORT to OVMCORE to make this clearer, although that may be too easily be confused with Oot Core. Maybe go back to OotB? or SOot? I guess "Soot" is the easiest and most distinct to remember; so we'd have "Soot" (SHORT Oot), then OotB? (Oot Bytecode, the language than Ovm executes), then Oot Core, then Oot.

so i guess now i'm going the other direction from something i said before: instead of not specifying Oot SHORT yet, we very much specify that. Of course, this specification is tentative and unreleased and liable to change at any time as we gain experience from implementing the full Ovm.

actually i'll call it Boot, not Soot. This way Oot Core could be called Soot later. The Ovm language can just be called Ovm.

I started drafting an initial spec: https://gitlab.com/oot/boot/blob/master/doc/boot_reference.txt

---

in order to make Soot simpler, maybe the skips will just skip one soot instruction, regardless of quad. This provides one more way to have errors in OVM, but to do otherwise would mean to (a) have to plan for an 8x reduction in custom instruction length budgets (from 64k to 8k) due to the possibility of skipping over 7 SHORT instructions that have been represented in the MEDIUM encoding, (b) make translation from SHORT to MEDIUM more complicated, since a SKIP in SHORT couldn't be represented by a SKIP in MEDIUM.

---

hmm it's pretty obnoxious not to be able to JREL. How about we allow JREL and give up some of our total 'custom instruction length budget' in order to allow skips. Our total budget is 64k. If we skip more than one, does that just divide the budget, or is it worse than that? I think it just divides it, because if the outermost layer of instructions can still make it over the gap, then all of the inner ones can too, because that means the total gap is less than 64k.

in SHORT the most we'd want to skip is 16, which would mean our total budget would go to 4k instead of 64k. But actually we want negative skips too, so 8 at most, so our total budget is 8k. That seems reasonable; that's saying that no custom instruction can ever expand to more than 8k of SHORT instructions, which sounds like quite a lot to fit into one instruction.

---

todo think about and add some of the vmcalls i thought of before (i assume they are in ootAssemblyThoughts or ootAssumblyNotes14 or something like that) to boot_reference, add stuff from similar kernels that include things like file I/O eg http://www.shenlanguage.org/learn-shen/shendoc.htm#The%20Primitive%20Functions%20of%20K%20Lambda

---

3 operand get/set for use with maps/arrays/OOP? could replace PUSH and POP by changing k argument to register direct but then would have to LOADI all the darn time when doing stack manipulation

ok moved PUSH and POP to 2-operand instruction list and removed their k argument

---

'read' and 'wait for event' are similar if the file being 'read' is an event stream. Also similar to both of these are iterators with iterator.next().

---

'poll' can be used as 'sleep' too, b/c could wait on an empty set of events with a timeout

---

instead of having a 'poll' (or epoll or kqueue/kevent or completion ports) to wait on multiple events at once, could simply create a virtual stream of the union of events from all of the things you're waiting for; then this virtual stream can be 'read' via the file I/O system. So if you have a way to multiplex/union event streams together like that, then all you need afterwards is a 'wait', not a 'poll'.

---

for awhile i had 'get' and 'put' in the Boot instruction set. But then i thought:

get and put are useful for the extensible Ovm future, but how do they actually work right now? are these vmcalls? arrays? maps? objects? Do we really want a get and a put here? Are they used for channels/files?

i decided that the most sensible non-polymorphic get and put would just be load and store with offsets, which we already have. Although we could have load and store with the offset being a compile-time constant, or stored in a register, i guess we should have both. But save the words get and put for polymorphic stuff later in OotB?