Bayle Shanks's website: proj-oot-ootAssemblyOpsNotes3

You can have a sequential reduce, but also a parallel one (if the operation is associative, I guess)
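A minimal sketch of why associativity matters here, in C++. The names `reduce_sequential` and `reduce_chunked` are hypothetical, and the chunked version additionally assumes the operation has an identity element. The chunks are folded one after another in this sketch, but no chunk reads another chunk's partial result, so they could be handed to separate threads.

```cpp
#include <cstddef>
#include <vector>

// Sequential left fold: (((init op x0) op x1) op x2) ...
template <typename T, typename Op>
T reduce_sequential(const std::vector<T>& xs, T init, Op op) {
    T acc = init;
    for (const T& x : xs) acc = op(acc, x);
    return acc;
}

// Chunked reduce: fold each chunk independently, then fold the partial results.
// This regroups the operations, so it only matches the sequential result when
// `op` is associative (and `identity` really is an identity element for it).
template <typename T, typename Op>
T reduce_chunked(const std::vector<T>& xs, T identity, Op op, std::size_t chunks) {
    if (chunks == 0) chunks = 1;
    std::vector<T> partial(chunks, identity);
    const std::size_t n = xs.size();
    for (std::size_t c = 0; c < chunks; ++c) {  // iterations are independent
        const std::size_t begin = n * c / chunks;
        const std::size_t end = n * (c + 1) / chunks;
        for (std::size_t i = begin; i < end; ++i) partial[c] = op(partial[c], xs[i]);
    }
    return reduce_sequential(partial, identity, op);
}
```

With addition both versions agree; with subtraction (not associative) they generally do not.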

---

isnan

isinf

fcmp

our totalOrder comparison op on floats
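A sketch of one common way to get a total order on floats, assuming "totalOrder" here means something like the IEEE 754 totalOrder predicate and that `float` is IEEE 754 binary32. The function names are hypothetical. The trick is to reinterpret the bits as a signed integer and flip the magnitude bits of negative values, after which ordinary integer comparison gives -NaN < -inf < ... < -0 < +0 < ... < +inf < +NaN.

```cpp
#include <cstdint>
#include <cstring>

// Map a float to an integer key whose signed ordering is a total order.
// Assumes float is IEEE 754 binary32.
std::int32_t total_order_key(float x) {
    std::int32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined type pun
    // Negative floats are sign-magnitude: a larger magnitude should sort lower,
    // so flip the 31 magnitude bits when the sign bit is set.
    if (bits < 0) bits ^= 0x7fffffff;
    return bits;
}

// True if a sorts strictly before b in the total order.
bool total_order_less(float a, float b) {
    return total_order_key(a) < total_order_key(b);
}
```

Unlike `fcmp`-style comparisons, this distinguishes -0 from +0 and gives NaNs a definite position.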

---

frame_type_from_frame_pointer

---

gc_pin gc_unpin

---

https://en.wikipedia.org/wiki/Threading_Building_Blocks#Library_contents

---

static_assert

---

    'add_amo', 
    'ap_amo', 
    'and_amo', 
    'or_amo', 
    'xor_amo',
    'add_amo_sc', 
    'ap_amo_sc', 
    'and_amo_sc', 
    'or_amo_sc', 
    'cas_sc',

add_amo $dest &memsrc1 $src2: Atomically, the integer value at memory location &memsrc1 has $src2 added to it, and $dest = the old value that was at memory location &memsrc1.
ap_amo $dest &memsrc1 $src2: Like add_amo, except that the value at memory location &memsrc1 is a pointer.
and_amo: like add_amo, except that the operation is bitwise AND rather than addition
or_amo: like add_amo, except that the operation is bitwise OR rather than addition
xor_amo: like add_amo, except that the operation is bitwise XOR rather than addition
{add, ap, and, or, xor}_amo_sc: Like add_amo, ap_amo, and_amo, or_amo, xor_amo, except that in addition these are sequentially consistent synchronization operations (todo what does that mean exactly? do they only synchronize with each other? or, with anything at that memory location?).

Those AMO operations without the _sc suffix do not imply any synchronization. This makes them suited for parallel reduction.
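A rough sketch of the intended semantics in terms of C++ `std::atomic`. Mapping "no synchronization implied" to `memory_order_relaxed` and `_sc` to `memory_order_seq_cst` is an assumption made for illustration, not a settled definition (see the todo above); the 64-bit integer width is also an assumption.

```cpp
#include <atomic>
#include <cstdint>

using Word = std::int64_t;  // assumed word size

// add_amo $dest &memsrc1 $src2: atomically add, return the old value.
// Relaxed ordering: atomic, but no ordering with other memory accesses.
inline Word add_amo(std::atomic<Word>& mem, Word src2) {
    return mem.fetch_add(src2, std::memory_order_relaxed);
}
inline Word and_amo(std::atomic<Word>& mem, Word src2) {
    return mem.fetch_and(src2, std::memory_order_relaxed);
}
inline Word or_amo(std::atomic<Word>& mem, Word src2) {
    return mem.fetch_or(src2, std::memory_order_relaxed);
}
inline Word xor_amo(std::atomic<Word>& mem, Word src2) {
    return mem.fetch_xor(src2, std::memory_order_relaxed);
}

// add_amo_sc: same operation, but sequentially consistent.
inline Word add_amo_sc(std::atomic<Word>& mem, Word src2) {
    return mem.fetch_add(src2, std::memory_order_seq_cst);
}
```

For a parallel reduction, many threads could call `add_amo` on one accumulator and synchronize once at the end (for example at a join) before reading the total.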

---

" Generally, compiler optz can go full bore between sync/special/fence or sync IDs.
Some optz can be done w.r.t. global shmem objects.

Programmer supplied, standardized safety nets:

“Don’t know; Assume worst” – Starting method?
Over-marking SYNCs is overly-conservative

Programming Model Support:

doall – no deps between iterations – (HPF/F95 – forall, where)
SIMD (CUDA) – Implied multithread access w/o sync or IF cond
Data type - volatile - C/C++
Directives – OpenMP:
- #pragma omp parallel Sync Region
- #pragma omp shared(A) Data Type
Library – (eg, MPI, OpenCL, CUDA) " -- [1]

---

mutex rwlock

have some locks that any thread can unlock, and other locks that only the locking thread can unlock

---

mb provide a primitive that 'atomically' acquires a set of locks at once -- then until those locks are unlocked you can't acquire any more locks
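For comparison, C++ already has something in this direction: `std::lock` acquires several mutexes as a group using a deadlock-avoidance algorithm. It is not literally atomic, and it does not enforce the "no more locks until these are released" rule; that part would have to be a convention or a runtime check. A minimal sketch, where the two accounts and their locks are hypothetical names for illustration:

```cpp
#include <mutex>

std::mutex account_a_lock;  // hypothetical locks guarding the two balances
std::mutex account_b_lock;
int account_a = 100;
int account_b = 0;

// Move `amount` from a to b while holding both locks.
void transfer(int amount) {
    // Acquire both locks as a group; std::lock avoids deadlock regardless of
    // the order in which other threads name the same mutexes.
    std::lock(account_a_lock, account_b_lock);
    // Adopt the already-held locks so they are released on scope exit.
    std::lock_guard<std::mutex> guard_a(account_a_lock, std::adopt_lock);
    std::lock_guard<std::mutex> guard_b(account_b_lock, std::adopt_lock);
    account_a -= amount;
    account_b += amount;
}
```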

---

Oot should provide combination fork exec as a primitive (but maybe not provide fork as a primitive, it's too complicated due to issues like fork one and mutexes)

---

https://cseweb.ucsd.edu/~wgg/CSE131B/oberon2.htm#Sec10.3

---

"We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost" -- [2]

(bitwise, on 4k-sized blocks (DRAM rows))

" Table 1: Summary of Supported PIM Operations

    Operation                 R  W  Input     Output    Applications
    8-byte integer increment  O  O  0 bytes   0 bytes   AT
    8-byte integer min        O  O  8 bytes   0 bytes   BFS, SP, WCC
    Floating-point add        O  O  8 bytes   0 bytes   PR
    Hash table probing        O  X  8 bytes   9 bytes   HJ
    Histogram bin index       O  X  1 byte    16 bytes  HG, RP
    Euclidean distance        O  X  64 bytes  4 bytes   SC
    Dot product               O  X  32 bytes  8 bytes   SVM
"

"Key to practicality: single-cache-block restriction Each PEI ((PIM-Enabled Instruction)) can access at most one last-level cache block. Similar restrictions exist in atomic instructions"

-- PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture [3]

---

dup swap drop over rot select register swap

---

" There is demand for a lot of new functionality in MCUs these days. “I look at a lot of designs and I am seeing more MCUs containing a network access controller (NAC),” says Mentor’s Zarrinfar. “This is a networking solution that regulates how secure devices connect when they first attempt to access a network. I frequently see hash circuitry or AES encryption blocks. Security is a very wide range of concerns and could involve fingerprinting, or things within the supply chain to make sure that chips are not sold to the wrong people.” "

---

ptrace

---

the stuff on pages 31 and 32 of:

lisp the ultimate opcode https://dspace.mit.edu/handle/1721.1/5731

constant list (evaluating this returns itself)
constant symbol (evaluating this returns itself)
variable reference (evaluating this returns the current value of the variable)
constant closure (a closure is a pair (procedure, environment)) (evaluating this returns itself)
procedure (evaluating this produces a closure of this procedure with the current environment)
conditional
procedure call (see below)
quoted constant

primitive operations available to the procedure called using 'procedure call':

CAR
CDR
CONS
ATOM
PROGN
LIST
FUNCALL (note: actually calls a given closure, not a function)

---

https://en.wikipedia.org/wiki/Comparison_of_programming_languages_(basic_instructions) https://en.wikipedia.org/wiki/Template:Programming_language_comparisons

---

"deferred load instructions to account for the lag time between requesting a piece of data and actually being able to use it"

---

PC-relative load, store? or at least, larger immediate offsets

---

" A pipelined (two-cycle) Saturating Arithmetic Unit (SAU).

    Supports all packed and unpacked saturating and halving arithmetic instructions.

An IEEE 754 compliant(ish) FPU.

    The following single-cycle FPU instructions are implemented:
        FMIN, FMAX
        FSEQ, FSNE, FSLT, FSLE, FSUNORD, FSORD
    The following three-cycle FPU instructions are implemented:
        ITOF, UTOF, FTOI, FTOU, FTOIR, FTOUR
    The following four-cycle FPU instructions are implemented:
        FADD, FSUB, FMUL
    Both packed and unpacked FPU operations are implemented."

-- https://github.com/mrisc32/mrisc32-a1

---

this is cool but probably too unpopular for us:

https://github.com/mrisc32/mrisc32/blob/master/doc/SHUF.md

"One of my favorite instructions in the MRISC32 ISA is the shuf instruction" -- https://www.bitsnbites.eu/some-features-of-the-mrisc32-isa/

---

a CMP operation that populates R1 (or anything?) with a 'status word'/'flag word'
branch-if-zero-after-AND-with-immediate-mask to look at a status word, AND it with the given immediate mask, and then branch if the result is zero. Also the companion instruction that branches if the result is not zero. Also the companion 2 instructions with OR instead of AND. This allows you to make use of the status word for branching. It also allows you to test bits.
mb simpler would just be a BITT and BITF instruction pair that tests a single immediately specified bit of a given register, and branches if it is true/false
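A small sketch of these semantics, modeled as C++ functions that return whether the branch would be taken. The flag layout (`FLAG_EQ`, `FLAG_LT`, `FLAG_GT`) and all function names are hypothetical, chosen only for illustration.

```cpp
#include <cstdint>

// Hypothetical status-word layout.
constexpr std::uint32_t FLAG_EQ = 1u << 0;
constexpr std::uint32_t FLAG_LT = 1u << 1;
constexpr std::uint32_t FLAG_GT = 1u << 2;

// CMP: produce a status word describing how a relates to b.
std::uint32_t cmp_status(std::int32_t a, std::int32_t b) {
    std::uint32_t status = 0;
    if (a == b) status |= FLAG_EQ;
    if (a < b) status |= FLAG_LT;
    if (a > b) status |= FLAG_GT;
    return status;
}

// branch-if-zero-after-AND-with-immediate-mask, and its companion.
bool branch_if_and_zero(std::uint32_t status, std::uint32_t mask) {
    return (status & mask) == 0;
}
bool branch_if_and_nonzero(std::uint32_t status, std::uint32_t mask) {
    return (status & mask) != 0;
}

// BITT / BITF: test one immediately specified bit (bit must be < 32).
bool bitt(std::uint32_t reg, unsigned bit) { return ((reg >> bit) & 1u) != 0; }
bool bitf(std::uint32_t reg, unsigned bit) { return !bitt(reg, bit); }
```

For example, "branch if less-or-equal" becomes `branch_if_and_nonzero(status, FLAG_LT | FLAG_EQ)`, and BITT is the special case of a one-bit mask.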

https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d9982f7618ef68 says "No condition codes, instead...compare-and-branch instructions...(Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags)"

so what if we force both CMP and BITT and BITF to only use R1 as a status register? Does that solve it? e didn't explain why using a GPR is bad. mb i should ask.

---

double-wide CAS

---

https://developer.arm.com/documentation/dui0801/g/A64-SIMD-Vector-Instructions/PMULL--PMULL2--vector- is used in this guy's implementation of CRC32 that is faster than ARM's native CRC32 instruction (by parallelizing more, i think):

https://news.ycombinator.com/item?id=31471869

https://dougallj.wordpress.com/2022/05/22/faster-crc32-on-the-apple-m1/

---

tail calls: maybe return_call, return_call_indirect, like in https://news.ycombinator.com/item?id=32069418

---

" What we should have is multi memory support and memory read permissions, because without them WASM lacks basic security primitives.[1] https://www.usenix.org/conference/usenixsecurity20/presentation/lehmann

Reference types maybe. But I've yet to see a convincing argument as to why they are absolutely necessary. " -- [4]

---

wasm syscalls(?) in golang-on-wasm implementation: wasmMove wasmZero wasmDiv wasmTruncS wasmTruncU exitThread osyield usleep currentMemory growMemory wasmExit wasmWrite nanotime walltime scheduleCallback clearScheduledCallback getRandomData -- https://github.com/neelance/go/blob/13b0931dc3fa8c8a6ab403dbdae348978a53c014/src/runtime/sys_wasm.s

---

https://github.com/golang/go/blob/2540f4e49d47f951de6c7697acdc510bcb7b3ed1/src/cmd/compile/internal/ssagen/ssa.go lists various operations that need low-level support, like basic math functions and atomics

https://github.com/golang/go/blob/0d0193409492b96881be6407ad50123e3557fdfb/src/runtime/asm_wasm.s 's InitTables() has some others, like hashing

---

https://en.wikipedia.org/wiki/C-element

---

"If you are interested in physically-plausible and effective bottom-up redesigns of the computational stack, study the Scheme-79 chip and related work. Or better still, consider what can be built from the asynchronous Muller C-gate, and how well the latter plays with a pure-dataflow computational paradigm." -- [5]

---

'mv's in addition to 'cp's -- also 'move' variants of load and store

---

" This patchset is the first step to open-source this work. As explained in the linked pdf and video, SwitchTo API has three core operations: wait, resume, and swap (=switch). So this patchset adds a FUTEX_SWAP operation that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation on top of which user-space threading libraries can be built.

    Another common use case for FUTEX_SWAP is message passing a-la RPC between tasks: task/thread T1 prepares a message, wakes T2 to work on it, and waits for the results; when T2 is done, it wakes T1 and waits for more work to arrive. Currently the simplest way to implement this is

    a. T1: futex-wake T2, futex-wait
    b. T2: wakes, does what it has been woken to do
    c. T2: futex-wake T1, futex-wait

    With FUTEX_SWAP, steps a and c above can be reduced to one futex operation that runs 5-10 times faster.

A 5~10x speed improvement with FUTEX_SWAP certainly sounds compelling as does the information shared way back at LPC 2013 via the video below and the PDF slides. "

---

some concurrency assembly examples in ARM from https://assets.bitbashing.io/papers/concurrency-primer.pdf :

memory barriers:

int getFoo() { return foo; }

    getFoo:
        ldr r3, <&foo>
        dmb
        ldr r0, [r3, #0]
        dmb
        bx lr

(dmb is memory barrier)

ll/sc:

void incFoo() { ++foo; }

compiles to:

    incFoo:
        ldr r3, <&foo>
        dmb
    loop:
        ldrex r2, [r3]      @ LL foo
        adds r2, r2, #1     @ Increment
        strex r1, r2, [r3]  @ SC
        cmp r1, #0          @ Check the SC result.
        bne loop            @ Loop if the SC failed.
        dmb
        bx lr

(ldrex/strex is ll/sc)

e notes that ldrex/strex is allowed to fail spuriously (that way, they can implement it at cache line granularity)

cas_strong vs cas_weak: cas_weak can fail spuriously

making an rmw op out of a cas primitive:

    void atomicMultiply(int by) {
        int expected = foo;
        while (!foo.compare_exchange_weak(expected, expected * by)) {
            // Empty loop.
            // (On failure, expected is updated with foo's most recent value.)
        }
    }
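The same pattern generalizes to any read-modify-write operation. A minimal sketch, where `fetch_update` is a hypothetical helper name (not a standard library function) that applies an arbitrary function `f` via a CAS loop and returns the old value:

```cpp
#include <atomic>

// Build an arbitrary atomic read-modify-write out of compare-and-swap.
// Returns the value that was stored immediately before the update.
template <typename T, typename F>
T fetch_update(std::atomic<T>& target, F f) {
    T expected = target.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously; that is fine inside a loop.
    // On any failure, `expected` is refreshed with the current value and
    // f is re-applied to it on the next attempt.
    while (!target.compare_exchange_weak(expected, f(expected))) {
    }
    return expected;  // unchanged on success, so this is the old value
}
```

Because `f` may run more than once, it should be a pure function of its argument. The weak form suits a loop like this; the strong form is the natural choice when a single attempt must not fail spuriously.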

armv8 processors offer dedicated load-acquire and store-release instructions: lda and stl
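At the source level, these correspond to acquire loads and release stores. A minimal sketch of the usual message-passing pattern in C++; `payload`, `ready`, `publish`, and `try_consume` are hypothetical names, and whether a compiler actually emits the dedicated instructions depends on the target and compiler.

```cpp
#include <atomic>

int payload = 0;                  // ordinary (non-atomic) data
std::atomic<bool> ready{false};   // flag that publishes the data

// Producer: write the data, then store-release the flag.
// Writes before the release store cannot be reordered after it.
void publish(int value) {
    payload = value;
    ready.store(true, std::memory_order_release);
}

// Consumer: load-acquire the flag; if it is set, the payload write is visible.
// Reads after the acquire load cannot be reordered before it.
bool try_consume(int& out) {
    if (!ready.load(std::memory_order_acquire)) return false;
    out = payload;
    return true;
}
```

This is cheaper than bracketing each access with a full barrier such as `dmb`, because each side only constrains reordering in one direction.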

-- https://assets.bitbashing.io/papers/concurrency-primer.pdf

---

https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-987.pdf Capability Hardware Enhanced RISC Instructions: CHERI Instruction-Set Architecture (Version 9) page 97 3.7 Capability-Aware Instructions

much easier to read in the PDF, which has hyperlinks for the instructions, but here's a plaintext copy:

" Retrieve capability fields: These instructions extract specific capability-register fields and move their values into general-purpose (integer) registers: CGetBase, CGetFlags, CGetHigh, CGetLen, CGetOffset, CGetPerm, CGetSealed, CGetTag, CGetTop, and CGetType.

Capability move: This instruction moves a capability from one register to another without change: CMove.

Manipulate capability fields: These instructions modify capability-register fields, setting them to values moved from integer registers, subject to constraints such as monotonicity and representability: CAndPerm, CClearTag, CIncOffset, CIncOffsetImm, CSetAddr, CSetBounds, CSetBoundsExact, CSetBoundsImm, CSetFlags, CSetHigh, and CSetOffset.

Capability pointer comparison: These instructions provide pointer comparison: CSetEqualExact and CTestSubset.

Load or store via a capability: These instructions access memory via an explicitly named capability register, and will ideally correspond to a full range of contemporary indexing modes present in the baseline ISA – for example, allowing aligned or unaligned access to zero-extended and sign-extended integers of varying widths, as well as loading and storing of capabilities themselves. Further, software stacks dependent on atomic operations on pointers will require a suitable suite of atomic operations loading, modifying, and storing capabilities – e.g., load-linked, store-conditional instructions, or atomic test-and-set instructions, depending on the underlying architecture. CHERI-RISC-V adds CLC and CSC to load and store capabilities as well as a new instruction decoding mode in which existing memory access instructions use capability registers as the base address instead of integer registers. CHERI-RISC-V also adds new instructions which explicitly use a capability register as the base address regardless of decoding mode including L[BHWD][U].CAP, LC.CAP, S[BHWD][U].CAP, and SC.CAP. These correspond in semantics to the similar baseline ISA instructions, but are constrained by the properties of the named capability including tag check, permissions, bounds, seal check, and so on; if capability protections would be violated, then an exception will be thrown. Capability restrictions can be used to implement spatial safety via permissions and bounds. Additionally, the CLoadTags instruction provides direct, read-only access to capability tags; see Section 9.26.

Program-Counter Capability: Generated code makes frequent reference to PCC in common position-independent code structures, such as references to the Global Offset Table (GOT) or Program Linkage Table (PLT). CHERI-RISC-V extends the base AUIPC instruction with AUIPCC that adds an offset to PCC.

Capability jumps: Capability-based code pointers allow the implementation of control-flow robustness by limiting the permissions and bounds on jump targets (e.g., preventing store, and limiting fetchable instructions). Depending on the underlying ISA, different jump variations may be required – for example, adding capability variants of jump-and-link register, jump register, and so on, including: JALR.CAP and CJALR.

Capability sealing: The CSeal and CUnseal instructions seal or unseal capabilities given a suitable authorizing capability (i.e., one with the PERMIT_SEAL or PERMIT_UNSEAL permission as appropriate). Sealed capabilities allow software to implement encapsulation, such as is required for software compartmentalization. The CSealEntry instruction constructs hardware-interpreted sealed entry (‘sentry’) capabilities; see Section 3.9.

Protection-domain switching: The CInvoke instruction is a primitive upon which protection-domain switching can be implemented. CInvoke has a jump-based semantic that unseals its sealed code and data capability-register operands. This allows software-controlled non-monotonicity by granting access to additional state via unsealing.

Fast register clear: The CClear and FPClear instructions clear a range of capability or floating-point registers to support fast protection-domain transition.

Special capability registers: Special capability registers are read and written via CSpecialRW.

Tag loading and rederivation: Certain system operations, such as process or virtual-machine checkpointing and memory compression, require that tagged memory have its tags saved and then restored. Memory locations can be iteratively loaded into capability registers to check for tags; tags can then be later restored by manually rederived manually using instructions such as CAndPerm and CSetBounds. However, these instruction sequences are complex and can incur substantial overhead when used during bulk restoration. The CLoadTags instruction allows tags to be loaded for a cache line of memory (non-temporally), and the CBuildCap, CCopyType, and CCSeal instructions allow tags to be efficiently restored.

Compartment identifiers: CHERI protection domains, when constructed purely of graphs of capabilities, do not allow the microarchitecture to explicitly identify one domain from another. In order to allow tagging of microarchitectural state, such as branch-predictor entries, to avoid side channels, instructions are present to allow software to explicitly identify compartment boundaries where confidentiality requirements preclude more extensive microarchitectural sharing: CGetCID and CSetCID.

Capability Address and Length Rounding Instructions: Capability compression requires stronger alignment as allocation sizes increase as described in Section 3.5.5. CRAM and CRRL can be used by allocators to enforce non-overlapping bounds for distinct allocations "

---

see "Figure 4: The names of type-specialized instructions follow this format." in https://developers.redhat.com/articles/2022/11/22/how-i-developed-faster-ruby-interpreter

we see 2 data types (integer, floating point), 6 comparison operations (eq, neq, lt, le, gt, ge), 5 arith instructions that can be applied to both integers and fp (add, minus, mult, div, mod), and 4 integer-only operations (or, and, aref, aset; aref and aset appear to be get and set for items within arrays)

note that https://sonots.github.io/ruby-capi/d8/dd1/id_8c_source.html does show some other stuff (pow, dot2, dot3, shifts, match, nmatch, colon2, anddot); not sure if all of those are the same kind of thing

---