Bayle Shanks's website: proj-oot-ootAssemblyNotes15

pcwalton 13 hours ago [-]

Also worth noting: x86 supports four hardware breakpoints, which are sometimes more convenient, since they don't involve overwriting any instructions. (They're also more flexible because they can be used to trap on memory reads, etc.) https://en.wikipedia.org/wiki/X86_debug_register

EvanAnderson? 2 hours ago [-]

I remember the first time I read about the hardware-based debugging functionality in the 80386 and being in awe. I'd programmed 6502 and 8086 assembler and was just blown-away by the treasure-trove of debugging functionality the '386 had available to it. It seemed like it was just short of an in-circuit emulator and so very, very cool.

compiler-guy 59 minutes ago [-]

It's not just x86, pretty much every modern architecture has hardware breakpoint registers, data watch point registers and so on.

wazari972 2 hours ago [-]

I thought that these HW 'breakpoint' registers were (only) used to implement watchpoint, that is, read/write operations on data. Can they be used to implement 'instruction breakpoints'?

pcwalton 1 hour ago [-]

You can set the "break on" bitfield to "execute" (0) to have the debug register function as a normal breakpoint.

jasonzemos 1 hour ago [-]

On register %dr7 there are 4-bit switches for each of the 4 breakpoints. The first two bits can trap on Instruction (0b00), Write (0b01), and ReadWrite? (0b11). The second two bits specify the alignment of Byte (0b00), Short (0b01), Long (0b10), and Int (0b11).

--- https://m.phys.org/news/2017-01-middle-popular-yields-more-efficient-parallel.html

" Cilk adds just two commands to C: "spawn," which initiates a fork, and "sync," which initiates a join. That makes things easy for programmers writing in Cilk but a lot harder for Cilk's developers. ... All previous compilers for fork-join languages are adaptations of serial compilers and add the runtime invocations in the front end, before translating a program into an intermediate representation, and thus before optimization. In their paper, the researchers give an example of what that entails. Seven concise lines of Cilk code, which compute a specified term in the Fibonacci series, require the compiler to add another 17 lines of runtime invocations. The middle end, designed for serial code, has no idea what to make of those extra 17 lines and throws up its hands.

The only alternative to adding the runtime invocations in the front end, however, seemed to be rewriting all the middle-end optimization algorithms to accommodate the fork-join model. And to many—including Leiserson, when his group was designing its first Cilk compilers—that seemed too daunting.

Schardl and Moses's chief insight was that injecting just a little bit of serialism into the fork-join model would make it much more intelligible to existing compilers' optimization algorithms. Where Cilk adds two basic commands to C, the MIT researchers' intermediate representation adds three to a compiler's middle end: detach, reattach, and sync.

The detach command is essentially the equivalent of Cilk's spawn command. But reattach commands specify the order in which the results of parallel tasks must be recombined. That simple adjustment makes fork-join code look enough like serial code that many of a serial compiler's optimization algorithms will work on it without modification, while the rest need only minor alterations. "

https://cpc2016.infor.uva.es/wp-content/uploads/2016/06/CPC2016_paper_12.pdf

" Tapir introduces three new instructions to LLVM’s IR in order to represent parallel tasks. These instructions are detach , reattach and sync . "

detach label <detached>, label <continuation> reattach label <continuation> sync

http://wmoses.scripts.mit.edu/docs/parallelir.pdf

" Namely, all of these frameworks share two fundamental con- cepts: a detach operation and a synchronize operation. In this sense, a detach operation represents marking code to be run in par- allel. In OpenMP?, detach can be thought of entering a parallel con- text such as a parallel for loop. In Cilk, detach can be though of a spawn operation, assuming all function arguments have been eval- uated. Using PThreads, detach represents running a thread on a specific function. The synchronize operation represents taking code that was pre- viously detached and waiting for it to complete "

" The new detach instruction is a block terminator. This func- tion acts much like a conditional break, as it is provided to blocks which it can branch to. However, unlike the conditional break, the detach instruction represents branching to both blocks simultane- ously. For clarity, we will denote the first argument of the instruc- tion as the continuation while the second argument represents the spawned block. Additionally, a new reattach instruction will be created. This instruction is also a block terminator. It takes one argument – the block representing the continuation. Theoretically this is not nec- essary, however, its inclusion greatly simplifies optimization passes and compilation. The reattach has one successor – the continu- ation. However, unlike successors produced by a normal branch instruction, successors to a reattach instruction cannot use SSA registers inside of the reattach ed block –even in PHI nodes. This is done to represent the fact that the detach ’ed block is not guar- enteed to have executed before the continuation. However, any reg- isters before the detach statement can be used in either the contin- uation or the detach ed block without the use of PHI nodes since it is guarenteed that they were executed before entering the block. When a reattach instruction is executed in a serial context (e.g. no parallelism is available on the machine), it simply acts as an un- conditional branch to the continuation – thus representing the serial ellision. When executed in a parallel context, however, a reattach instruction is used to signal the end of execution for the detach -ed thread. Any detach ’s to function calls using the old syntax can be con- verted to use this new syntax as Figure 3 illustrates. In order to maintain the correctness as shown above, we give this new detach function-like properties, where it can consider any memory accesses in the detached code to be its arguments. "

http://wsmoses.com/tapir.pdf

" Conceptu- ally, the detach instruction is similar to a function call, and the reattach instruction is similar to a return. The three in- structions have the following syntax: detach label b , label c reattach label c sync where b , c ∈ V . The label keywords indicate that b and c are (labels of) basic blocks in V . A detach instruction takes a detached block b and a con- tinuation block c as its arguments, and it allows b and c to operate in parallel. A detach instruction in a block a ter- minates a and spawns a parallel task starting with b which can operate in parallel with the continuation block c . The CFG contains a detach edge ( a , b ) ∈ E and a continue edge ( a , c ) ∈ E . The block b must be a single-entry block for which every exit from the sub-CFG reachable from b — the parallel task spawned by a — is terminated by a reattach whose continuation c matches the continuation in the corre- sponding detach instruction — the detach instruction reat- tached by the reattach . For each such reattach instruc- tion, the CFG contains the reattach edge ( b 0 , c ) ∈ E , where b 0 is the block terminated by the reattach . In a sense, detach works like a function call to b that resumes at c after the subcomputation rooted at b returns, whereas reattach acts like a return. But unlike a function call and return, b is a block within the CFG of the function containing it, rather than a di ff erent function in the program, and b and c can operate in parallel. A detach does not require that b and c execute in parallel, however. It simply allows the runtime system to schedule them for parallel execution, if it so desires. Tapir assumes that every CFG G = ( V , E , v 0 ) obeys the following invariants: 1. A reattach instruction reattaches one detach instruc- tion. 2. For each reattach instruction j that reattaches a detach instruction i , every path from v 0 to the block terminated by j passes through the detach edge of i . 3. Every path starting from the detached block of a detach instruction i must reach a block containing a reattach instruction that reattaches i . 4. If a path from a detach instruction i to a reattach instruction j that reattaches i passes through the detach edge of another detach instruction i 0 , then it must also pass through a reattach instruction j 0 that reattaches i 0 . 5. Every cycle containing a detach instruction i must pass through a reattach instruction that reattaches i . 6. Any immediate successor of a reattach instruction can- not contain “ φ instructions.” "

---

instead of hard limits on length of custom instruction code and nesting of custom instructions, we could just say 'whatever is fine, as long as no fully expanded custom instruction is more than 64k'. But that would mean that if someone is programming custom instructions and using a library of someone else's custom instructions nested within them, that changes in the implementation of the library could make the top-level illegal. So mb better to have hard limits. But that's such a waste, because the limit will almost never be achieved, so we'd be wasting a lot of potential custom instruction length.

How about: hard limits on every layer except the top one; the top one's limit is computed assuming that the layers beneath it are all their own max, but by using more or less of them its own limit changes. This way, the top one's limit is not dependent on how the ones beneath are implemented. This means that the 'layer number' of a custom instruction is part of its fixed interface signature.

But in fact, we could do the same thing for every intermediate layer. The various intermediate layers could have different max expanded code size budgets, monotonically nondecreasing. Presumably the max expanded code size budget of each layer would be powers of two.

But in this case, we could simplify it by throwing out the idea of a fixed numbering of layers, and just specify the max expanded code size budget of any custom instruction as part of its interface signature/API. These could be constrained to be powers of two, and the total max is 64k (2^16). So there are 16 choices (so this parameter can be specified in only 4 bits; maybe the other 4 bits could be flags like 'deterministic', 'side-effectless', 'idempotent', 'threadsafe', etc).

Alternately, the other 4 bits could indicate how much hiddenstack is needed/consumed by this custom instruction, which the compiler should track since there is a small, hard limit on hiddenstack size.

What should the stack size limit be? These 4 bits suggest 16. 16 is also nice because we can't have more than 16 layers of custom instructions. https://people.ece.cornell.edu/land/courses/ece5760/DE2/Stack_cpu.html mentions 16 and 8 as sizes for small stacks. https://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html#212 mentions 16 as the size for a stack buffer (" In the case of a stack that is only used as an expression evaluation stack, "large" may only be approximately 16 elements, since single expressions seldom nest very deeply (Haley 1962). "). Let's say 16.

---

note: we used to have 12-bit operands (4k variables). Is this enough to encode most HLL programs? Probably; MS suggests using no more than 64 locals because the CLR runtime won't consider enregistering anymore [1]