proj-oot-ootCoreNotes1

i like Parrot's custom bytecodes

---

types of memory barriers

load/load, load/store, store/load, store/store

"Sparc V9 has a “membar” instruction that takes a 4-element bit vector. The four categories of barrier can be specified individually"

also http://denovo.cs.illinois.edu/Pubs/10-cacm-memory-models.pdf wants a way to specify a barrier that only applies to labeled 'sync operations'

so varieties of memory barriers:

{load, store, *}/{load, store, *}; applied to only sync operations, to only data operations, or to both (which implies that each load/store must be marked as either sync or data); and whether MMIO (memory-mapped I/O) accesses are also affected ( http://lwn.net/Articles/283776/ ), or whether ONLY MMIO is affected. So for example a memory barrier might say "load-sync/*-{sync,data}, cpu+mmio", meaning "don't move any load or store (whether sync or data) that comes after this instruction before any sync load that comes before it, for both CPU and MMIO accesses".
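One hypothetical way to encode that cross-product as a bit vector, in the spirit of SPARC membar's 4-bit mask; all names here are invented for illustration:

```python
# A hypothetical bit-vector encoding of the barrier varieties above, in the
# spirit of SPARC membar's 4-bit mask; all names here are made up.
LOAD_LOAD   = 1 << 0   # earlier loads ordered before later loads
LOAD_STORE  = 1 << 1   # earlier loads ordered before later stores
STORE_LOAD  = 1 << 2   # earlier stores ordered before later loads
STORE_STORE = 1 << 3   # earlier stores ordered before later stores
SYNC_ONLY   = 1 << 4   # applies only to accesses marked 'sync'
MMIO        = 1 << 5   # also orders memory-mapped I/O

def barrier_name(mask):
    """Decode a barrier mask into a readable name (stand-in for emitting an instruction)."""
    bits = [('load/load', LOAD_LOAD), ('load/store', LOAD_STORE),
            ('store/load', STORE_LOAD), ('store/store', STORE_STORE),
            ('sync-only', SYNC_ONLY), ('mmio', MMIO)]
    return '|'.join(name for name, bit in bits if mask & bit)

# the example barrier from the text: earlier sync loads may not be passed by
# any later loads or stores, for both CPU and MMIO accesses
example = LOAD_LOAD | LOAD_STORE | SYNC_ONLY | MMIO
```

This is only a naming scheme, not a memory model; a real ISA would also have to pin down what 'sync' marking means per access.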

isync -- complete all previous instructions up to what is needed before you perform a context switch

incoherent instruction cache flush for self-modifying code

dependency barrier (causes dependencies to be ordered, only needed on Alpha)

command to wait until a write is 'visible' to all other CPUs, and command to wait until all writes from other CPUs are visible here; vs. commands that enforce the same barriers but only w/r/t local reordering. These two 'wait' commands are similar to CRF's Commit and Reconcile, I guess. Note that in my formulation, the sender must commit and the receiver must reconcile to ensure transmission; otherwise you only get eventual consistency.

i'll define eventual consistency to say that FIFO consistency may be transiently violated, but if you wait long enough and the sender doesn't change the value any more, you'll eventually see their latest value, forever after

general marker for sections of code dealing with threadlocal vs. shared memory? or is this part of the load/store stuff?

note: acquire/release can be implemented in terms of the above;

" Acquire semantics is a property which can only apply to operations which read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.

    Release semantics is a property which can only apply to operations which write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order." -- http://preshing.com/20120913/acquire-and-release-semantics/#comment-20810

a release is load/store + store/store, and an acquire is load/load + load/store.
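That decomposition can be written down directly; a minimal sketch, assuming made-up mask constants for the four barrier categories:

```python
# Acquire/release decomposed into the two-sided barrier varieties, per the
# note above; the mask constants are made up for illustration.
LOAD_LOAD, LOAD_STORE, STORE_LOAD, STORE_STORE = 1, 2, 4, 8

# a read-acquire may not be reordered with any later read or write
ACQUIRE = LOAD_LOAD | LOAD_STORE
# a write-release may not be reordered with any earlier read or write
RELEASE = LOAD_STORE | STORE_STORE
```

Note that neither side needs store/load, which is why acquire/release is cheaper than a full barrier on most hardware.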

http://fileadmin.cs.lth.se/cs/Education/EDAN25/F06.pdf 56/70 "Example of Cumulative Ordering"; on Power, there is also an option for some barriers to be 'cumulative' and some to not, which just means transitive, e.g. causal transitivity

???

"

And a couple of implicit varieties:

 (5) LOCK operations.
     This acts as a one-way permeable barrier.  It guarantees that all memory
     operations after the LOCK operation will appear to happen after the LOCK
     operation with respect to the other components of the system.
     Memory operations that occur before a LOCK operation may appear to happen
     after it completes.
     A LOCK operation should almost always be paired with an UNLOCK operation.
 (6) UNLOCK operations.
     This also acts as a one-way permeable barrier.  It guarantees that all
     memory operations before the UNLOCK operation will appear to happen before
     the UNLOCK operation with respect to the other components of the system.
     Memory operations that occur after an UNLOCK operation may appear to
     happen before it completes.
     LOCK and UNLOCK operations are guaranteed to appear with respect to each
     other strictly in the order specified.
     The use of LOCK and UNLOCK operations generally precludes the need for
     other sorts of memory barrier (but note the exceptions mentioned in the
     subsection "MMIO write barrier").

" -- https://www.kernel.org/doc/Documentation/memory-barriers.txt

"

 (*) There is no guarantee that a CPU will see the correct order of effects
     from a second CPU's accesses, even _if_ the second CPU uses a memory
     barrier, unless the first CPU _also_ uses a matching memory barrier (see
     the subsection on "SMP Barrier Pairing")." -- https://www.kernel.org/doc/Documentation/memory-barriers.txt

--

some interesting ideas from various weird x86 features:

https://en.wikipedia.org/wiki/Control_register#CR3

---

x86 has a 'REP' (repeat) prefix for some instructions: http://faydoc.tripod.com/cpu/movsb.htm , which is essentially a hardware for loop: the prefixed instruction is repeated (E)CX times
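A rough model of REP MOVSB's semantics as a loop, as a sketch (memory modeled as a bytearray, direction flag assumed clear):

```python
# Rough semantics of REP MOVSB modeled as a loop: copy one byte per
# iteration, (E)CX times, advancing both pointers (direction flag clear).
def rep_movsb(mem, esi, edi, ecx):
    while ecx:
        mem[edi] = mem[esi]
        esi += 1
        edi += 1
        ecx -= 1
    return esi, edi   # the real instruction leaves updated pointers in ESI/EDI

mem = bytearray(b'hello\x00\x00\x00\x00\x00')
rep_movsb(mem, 0, 5, 5)   # copy 5 bytes from offset 0 to offset 5
```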

--

http://www.cc.gatech.edu/~rama/Beehive/papers/git.cc.91.51.pdf table 1: hardware primitives (paraphrased):

    read without coherence
    write without coherence
    read-global: read data from shared memory, bypassing cache
    write-global: globally perform the write
    read-update: read data from shared memory and request future updates
    reset-update: cancel the request for updates from shared memory
    flush-buffer: stall this processor until all requests in the write-buffer are globally performed
    read-lock: request a shared lock for a memory block
    write-lock: request an exclusive lock for a memory block
    unlock: release the lock

---

" Popek and Goldberg summarized the concept in 1974: "Formal Requirements for Virtualizable Third Generation Architectures", Communications of the ACM 17.

    Equivalence/Fidelity: A program running under the hypervisor should exhibit a behaviour essentially identical to that demonstrated when running on an equivalent machine directly.
    Resource control/Safety: The hypervisor should be in complete control of the virtualized resources.
    Efficiency/Performance: A statistically dominant fraction of machine instructions must be executed without hypervisor intervention.

" -- ARMv7-A Architecture Overview, David Brash, Architecture Program Manager, ARM Ltd. www.linaro.org/documents/download/d7fe510d8eb46775afc3953d217b15224fbb93086598a (good read, slides with various details about ARM virtualization)

notes:

--

See also ootLowEndTargets

--

" Synchronization Instructions 1990-1993-2005-...

    atomic RMW
        compare-and-swap (CAS), e.g. x86 CMPXCHG
        x86 LOCK prefix
        fetch-and-add, e.g. x86 XADD
        double width compare-and-swap (DCAS), e.g. x86 CMPXCHG8B, CMPXCHG16B 
    "RISC" load-linked / store-conditional (LL/SC) 
    MONITOR/MWAIT 

Advanced topics:

    multi-location atomicity:
        compare-and-swap 2 locations (CAS2) - AFAIK not implemented by anyone
        KCSS
        LLn/SCn
        transactional memory (TM) " -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010
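As an illustration of why CAS is a sufficient primitive, here is the classic CAS retry loop building fetch-and-add out of compare-and-swap; the Cell class is a stand-in for a shared memory word:

```python
# Sketch: fetch-and-add built from compare-and-swap -- the classic CAS retry
# loop. Cell stands in for a shared memory word; real CAS is a hardware primitive.
class Cell:
    def __init__(self, v):
        self.v = v
    def cas(self, expected, new):
        """Atomically: if value == expected, store new and return True."""
        if self.v == expected:
            self.v = new
            return True
        return False

def fetch_and_add(cell, delta):
    while True:
        old = cell.v                      # load current value
        if cell.cas(old, old + delta):    # retry if another writer raced us
            return old                    # return the pre-add value, like XADD

c = Cell(40)
old = fetch_and_add(c, 2)
```

LL/SC versions look the same except the load is "linked" and the store fails on any intervening write, which avoids CAS's ABA problem.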

" Generic Fence Options

Many different types of fences can be expressed as combinations of the above features.

    Do not perform or make visible subsequent instructions of type
        All memory operations
        Loads
        Stores
        Fences
        Instruction Fetches
        Any specially marked versions of the above 
    Until all earlier instructions of type
        All memory operations
        Loads
        Stores
        Fences
        Instruction Fetches
        Any specially marked versions of the above 
    Have been completed 

Further options:

    In particular address ranges or types of memory
    Make visible / flush to certain points - e.g. flush into L2, but not further 

Split Fences

TBD: my own work from the 80s

Recent 2009 papers Ordering vs. Reliability Fences

There are at least two different usage models for fences - or the related flush instructions.

Ordering Fences Folks who want to obtain semantic guarantees about memory ordering, and hence parallel program correctness. In general, these folks don't care if a fence instruction flushes a queue of memory operations, so long as it appears to do so.

Reliability Fences Folks who, e.g., want to ensure that all pending stores in store buffers have been flushed to battery backed up RAM. Or ensure that there are no loads in flight when they are trying to un-hotplug a module.

Strictly speaking the latter are flush operations, not fence operations. But the terms flush and fence are often mixed up, since the earliest fence operations were often implemented by flushing buffers. " -- https://www.semipublic.comp-arch.net/wiki/Fence_Instructions

"

Basic vector instructions

    vector-vector instructions 

Less common

    scalar vector: vreg[i] += scalar, for all i " -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010

" DSP ISA extensions

 Branchless code instructions
    conditional move (CMOV)
    conditional select (CSELECT)
    conditional load, conditional store
        do you fault if the condition/predicate is false? 
    predication 

Specific instructions - in integer, scalar FP, packed, etc. versions

    absolute value (ABS)
    maximum (MAX), minimum (MIN)
        minimum and maximum (MINMAX)
        sort instruction 

" -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010

" Security and DRM ISA Extensions 2000-2005

Microsoft [(Next Generation Secure Computing Base)] inspired hardware support efforts at CPU manufacturers:

Intel Trusted Execution Technology (TXT or TET), (originally Lagrande Technology (LT)). SINIT instruction

AMD: TBD?

Similar features were found on non-Wintel processors, such as ARM and MIPS, all seeking to support DRM.

In some ways beaten to product line by virtual machine (VM) support. " --https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010

" Random number generator (RNG) ISA support Virtual Machines renaissance 2002-... Cryptography ISA extensions 2008-... " -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010

"

The Intel AESNI instructions look like multiply or divide-step instructions

    AESENC: perform one round of an AES encryption flow
    AESENCLAST: perform the last round of an AES encryption flow
    AESDEC: perform one round of an AES decryption flow
    AESDECLAST: perform the last round of an AES decryption flow
    AESKEYGENASSIST: assist in AES round key generation
    AESIMC: assist in AES Inverse Mix Columns

And, as of 2010, multiply-step and divide-step instructions look as if they were a dead end, a solution that did not last long. I suspect that memory to memory engines may be the ultimate direction for cryptographic instructions. APIs such as

ENCRYPT <parameter block>,<state block>,<from, to address ranges>
DECRYPT <parameter block>,<state block>,<from, to address ranges>
HASH <parameter block>,<state block>,<from, to address ranges>

" -- https://www.semipublic.comp-arch.net/wiki/ISA_Trends,_1980-2010

" Parsing instructions

See SSE4.2 String / Text / XML instructions PCMP(ie)STR(im)

Motion Estimation

MPSAD (Multiple packed sum of absolute difference), vector minimum entry and index

GPU ISA features "

pcmpstr: http://halobates.de/pcmpstr-js/pcmp.html

http://halobates.de/blog/p/207

https://www.semipublic.comp-arch.net/wiki/X86_ISA#SSE4.2_String_.2F_Text_.2F_XML_instructions

 Vector Conditional Moves

SSE4.1's BLEND instructions are misnamed - really are vector select under mask.

AMD's XOP has VPCMOV Vector Conditional Move and VPPERM, vector packed permute bytes.

VPCMOV is a bit level select: three operands, the select mask, and the 2 inputs. Can implement byte or element select by appropriate extension from a bitmask to an element mask, 0/-1.
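A sketch of that bit-level select, and of widening a bitmask to a 0/-1-per-byte element mask (function names are mine):

```python
# VPCMOV-style bit-level select: take bits from x where the mask bit is 1,
# from y where it is 0. Byte select falls out by widening each mask bit to
# 0x00/0xFF per byte, as the text describes. Function names are mine.
def bit_select(mask, x, y, width=32):
    full = (1 << width) - 1
    return (x & mask) | (y & ~mask & full)

def byte_mask(bits, nbytes):
    """Extend a per-byte bitmask into a 0x00/0xFF-per-byte element mask."""
    m = 0
    for i in range(nbytes):
        if (bits >> i) & 1:
            m |= 0xFF << (8 * i)
    return m

sel = byte_mask(0b0101, 4)                   # bytes 0 and 2 come from x
r = bit_select(sel, 0x11223344, 0xAABBCCDD)  # mix the two inputs bytewise
```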

VPPERM is more than an element wise conditional move - it is a vector permute, on steroids.

Permutation

Permutation instructions are a perennial bugbear.

Hard to get a fully general permute. Many special cases:

XOP

    VPSHUFHW, VPSHUFD, VPSHUFLW, VPSHUFW
    permute VPERMIL2PS, VPERMIL2PD 
 Dot Product 

https://www.semipublic.comp-arch.net/wiki/X86_ISA

 Prefetch and Hint Instructions 
 Packed Vector Shuffles and Permutes

Packed vector Permute instructions have a long and annoying history.

Arguably, instruction sets such as Altivec jumped to the ideal: an arbitrary byte oriented permute instruction. However

    what about 4 bit and 2 bit numbers - especially useful for genomics
    where does the permutation vector live?
        in register - inconvenient
        in constants - funny sized constant 
    what about the hardware cost? 

Equivalently, scatter/gather - whether register to memory, or register to register.

Considerations such as these have motivated a long list of special case permutation instructions.

    SSE: SHUFPS, UNPCKHPS, UNPCKLPS, PSHUFW
    SSE5: MOVDDUP, MOVSHDUP, MOVSLDUP 
 System, Virtual Machine & Security Instructions

It would take too long to explicitly describe these, so I will just list them as described in wikipedia, http://en.wikipedia.org/wiki/X86_instruction_listings

Added with AMD-V CLGI, SKINIT, STGI, VMLOAD, VMMCALL, VMRUN, VMSAVE (SVM instructions of AMD-V)

Added with Intel VT-x VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMLAUNCH, VMRESUME, VMXOFF, VMXON

Other Timer extensions

There is a perennial problem between absolute and process- or thread- or whatever virtual time - excluding interrupts, etc.

I (Glew) have repeatedly proposed instructions that read the timer and, atomically, add in one of several possible offsets from registers or structs. - thereby provided multiple virtual times. Possibly scaled. Along with instructions to manipulate such virtual times.

VPPERM

AMD XOP's VPPERM is more than an element wise conditional move - it is a vector permute, on steroids.

VPPERM is a permutation operator, with additional features

    concatenates two vector operands, like a funnel shift
    byte permutation: 5 bit selection, 32B, 256b (2 concatenated 128b)
    extra per byte operation:
        select source byte
        invert source byte
        bit reverse source byte
        bit reverse inverted source byte
        00h
        FFh
        most significant bit replicated
        inverted most significant bit replicated 

This just fits in 128 bits; might not extend to 256, 512 bits.

VPERMIL2PS or PD

Floating point, not quite as powerful as VPPERM, but still pretty major.

Selects from one of two source operands, or 0, depending on a few bits of a 3rd selection operand, and an immediate.

Can be considered a special case of the general hardware of VPPERM, convenient for comparison results produced by the VPCOMW, etc., instructions.

Vector Shifts

AMD XOP and Intel AVX contain vector element shifts - where the size is specified by the opcode, and the shiftcounts are specified either by a vector, with a different shiftcount per element, or else by a constant or scalar, with the same shiftcount per element.

TBD: make the size dynamically specifiable? Restrict to power of two, or make any integer number of bits?

Elsewhere I have discussed wide integer shifts, treating a packed vector register as a very long integer.

Vector element funnel shifts occur to me as a natural step: Take two vectors, concatenate on an element basis, and shift by a shift count e.g. scalar or in a corresponding vector register.
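A sketch of that element-wise funnel shift (names and widths are illustrative):

```python
# Sketch of the element-wise funnel shift described above: concatenate the
# corresponding elements of two vectors and shift the double-width value,
# keeping the low element; per-element shift counts come from a third vector.
def funnel_shift_right(hi, lo, n, width=32):
    mask = (1 << width) - 1
    return (((hi << width) | lo) >> n) & mask

def vfunnel_shift_right(vhi, vlo, vcnt, width=32):
    # each lane gets its own count, as in AMD XOP / AVX variable vector shifts
    return [funnel_shift_right(h, l, n, width) for h, l, n in zip(vhi, vlo, vcnt)]

r = vfunnel_shift_right([0x1, 0x0], [0x80000000, 0xF0], [4, 4])
```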

 Instruction Prefixes

Not so much an instruction operation feature, as an extension feature.

x86 has had instruction prefixes since its beginning

    LOCK
    segment prefixes
    REP/REPNE 

AMD-64 introduced the REX prefix

    REX 

XOP introduces a 3 byte prefix

    3 byte 8F RXB.m5 Wv4Lpp 

AVX (and LRBNI) introduce the VEX prefix

    VEX 

Along the way, the C4 prefix was used

    C4 
 FMA

What: FMA3 / FMA4

 Carry-less Divide

What: Carry-free division (?Glewism) What: Reduction of an Integer Modulo a Polynomial Where: not yet (except CRC) When: future, if at all (likelihood high?)

Computations such as GCM in AES and Elliptical Cryptography frequently require the reduction of an integer, interpreted as a polynomial SUM of xi AND val_i, modulo another polynomial, in the Galois field using XOR.

Currently (2010) this is implemented using shifts and shuffles without hardware support - with the exception of the CRC instruction, which has a hardwired polynomial.

Glew conjecture: hardware support for this should be forthcoming.
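For concreteness, the shift-and-XOR software fallback for reduction modulo a polynomial over GF(2) looks something like this (a sketch; CRC hardwires the modulus, here it is an operand):

```python
# Reduce a bit string, interpreted as a polynomial over GF(2), modulo another
# polynomial, using only shifts and XORs -- the software fallback described
# above. Each XOR aligns m's top bit with a's top bit and clears it.
def poly_mod(a, m):
    md = m.bit_length()
    while a.bit_length() >= md:
        a ^= m << (a.bit_length() - md)
    return a

r = poly_mod(0b1111, 0b1011)   # (x^3+x^2+x+1) mod (x^3+x+1) = x^2
```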

 Carry-less Multiply

What: Carryless Multiply (CLMUL) Where: Intel x86 nhm When: Proposed 2008 / shipped in Nhm 2010

Part of AESNI, although I break it out as more generic. Used in the Galois Counter Mode (GCM) of AES.

GCM requires carry-less multiplication of 128 bit operands, producing a 255 bit result.
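A software sketch of carry-less multiply, i.e. schoolbook multiplication with XOR in place of ADD so no carries propagate -- which is why two 128-bit operands give at most a 255-bit product:

```python
# Carry-less multiply done in software, the operation CLMUL accelerates:
# schoolbook multiply with XOR in place of ADD, so no carries propagate.
# Two n-bit operands give at most a (2n-1)-bit product.
def clmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a    # XOR instead of ADD
        a <<= 1
        b >>= 1
    return r

p = clmul(0b101, 0b11)   # (x^2+1)(x+1) = x^3+x^2+x+1
```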

Vector-to-vector element-by-element shifts

256 bit AVX vectors

    left shift logical (dword/qword)
    right shift logical (dword, qword)
    right shift arithmetic (dword) 

Curiously, the qword (64b) version of vector element by element right shift arithmetic is missing. Why? Cost?

Floating Point Multiply Accumulate

Bit manipulation

    ANDN - And Not: Z := ~X & Y
    BLSR - Reset Lowest Bit: Y := X & (X-1)
    BLSMSK - Get Mask Up to Lowest Set Bit: Y := X ^ (X-1)
    LZCNT - Leading Zero Count
    TZCNT - Trailing Zero Count
    BZHI - Zero High Bits starting with Specified Bit Position
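Those identities can be checked directly; a quick sketch:

```python
# The bit-manipulation identities listed above, written out directly.
def blsr(x):
    return x & (x - 1)      # reset lowest set bit
def blsmsk(x):
    return x ^ (x - 1)      # mask up to and including lowest set bit
def andn(x, y, width=32):
    return ~x & y & ((1 << width) - 1)   # masked to a fixed width

v = 0b101100
```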

 Gather

All combinations of

    Gather datatype: SP FP, DP FP, Dword (32b), Qword (64b)
    Index datatype: Dword (32b), Qword (64b) 


VPERM*, on dword (32b) and qword (64b)


Highlights from the point of view of instruction novelty (not necessarily absolute, but newness to x86):

    vadcpi - vector add with carry in and carry out
    vaddsetcpi - vector add and set mask to carry
    vsbbpi - subtract vectors with borrow in and borrow out
    vsbbrpi - reverse subtract with borrow
    vaddsetsp - vector add and set mask to sign 
    vclampzp - clamp vector elements between a value and zero 
    vcvtps2srgb8 - convert vector to sRGB8 vector 
    vector scatter/gather - index (address) vectors
    vexpand / vcompress - under mask? 
    bitinterleave: 1:1, 2:1, vector/scalar 
    bsfi/bsri - bitscan forward / reverse, initialized
        start in the middle, after an earlier bsf/bsr 

Some of Larrabee's most powerful features involved operand broadcast and swizzle, and more powerful conversions on memory operands. Full permute was not allowed, but a nice selection of permutes within a 128 bit "lane" of 4 32b elements.

    identity dcba
    pair swap cdab
    two away badc
    broadcast aaaa, bbbb, cccc, dddd,
    rotate, as in cross product dacb 
 AVX, SSE5, and XOP

Between 2007 and 2011 Intel and AMD have been thrashing about packed vector extensions such as Intel AVX, and AMD SSE5 and XOP, with Intel LRBNI in the background.

This wiki page is not intended to be a history of this thrashing. This wiki page is meant only to capture the key innovations or technical aspects, forgetting the marketing spin. These aspects will be separate sections.

Key instruction set packages in AVX, SSE5, XOP (and LRBNI, to some extent):

    non-destructive 3 operand A=BopC and 4-operand A = B op1 C op2 D
    FMA3 and FMA4 as aspects
    permute
    conditional move (in packed vector)
        wide integer (in packed vector register) select under mask 
    vector shift/rotate (with per element counts)
    vector integer compare
    vector integer multiply accumulate
        w/wo saturation and width changes 
    FP16 

XOP

    horizontal integer add/subtract
        add adjacent pairs of values 
    FP fraction extract
    Vector Conditional Moves 

Intel AVX FMA has the following:

    double/single scalar/packed
    fused multiply add/subtract, operand optionally negated - FMADD, FMSUB, FNMADD, FNMSUB
    FMADDSUB - fused multiply alternating add/subtract 

SSE4.1

What: Intel x86 SSE4.1 Where: Penryn When: 2007?

SSE4.1 with Intel Core 2 (Penryn) is notable for

    Motion estimation instructions MPSADBW and PHMINPOSUW
    Dot product instructions DPPS, DPPD
    Streaming load, MOVNTDQA
    BLENDPS, etc. instructions
        misnomer: really are a conditional move or vector select under bitmask. 
    packed integer min/max, e.g. PMINUB
    insert/extract instructions
    packed integer format conversions
    quadword (128b) equality
    hole filling: PACKUSDW 

SSE4.2

What: Intel x86 SSE4.2 Where: Nehalem Core i7 When: 2009

SSE4.2

    string instructions (compare)
        Explicit vs Implicit Length
        returning index or mask
        byte specifying formats 
    PCMPGT - 128 bit greater than compare
        filling in, partially, the hole whereby there were 128 bit equality checks but not direction checks 
    CRC32
    POPCNT
        just a 1 operand form - no ANDing or XORing 

SSE4a

What: AMD x86 SSE4a Where: Barcelona When: 2007

    LZCNT - leading zero count
    POPCNT
    EXTRQ/INSERTQ
    MOVNTSD / MOVNTSS - scalar streaming store instructions 
 SSE

What: Intel SSE When: Pentium III, 1999

From wikipedia: http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

    SSE introduced both scalar and packed floating point instructions. Floating point instructions
        Memory-to-Register / Register-to-Memory / Register-to-Register data movement
            Scalar – MOVSS
            Packed – MOVAPS, MOVUPS, MOVLPS, MOVHPS, MOVLHPS, MOVHLPS 
        Arithmetic
            Scalar – ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, MAXSS, MINSS, RSQRTSS
            Packed – ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, MAXPS, MINPS, RSQRTPS 
        Compare
            Scalar – CMPSS, COMISS, UCOMISS
            Packed – CMPPS 
        Data shuffle and unpacking
            Packed – SHUFPS, UNPCKHPS, UNPCKLPS 
        Data-type conversion
            Scalar – CVTSI2SS, CVTSS2SI, CVTTSS2SI
            Packed – CVTPI2PS, CVTPS2PI, CVTTPS2PI 
        Bitwise logical operations
            Packed – ANDPS, ORPS, XORPS, ANDNPS 
    Integer instructions
        Arithmetic
            PMULHUW, PSADBW, PAVGB, PAVGW, PMAXUB, PMINUB, PMAXSW, PMINSW 
        Data movement
            PEXTRW, PINSRW 
        Other
            PMOVMSKB, PSHUFW 
    Other instructions
        MXCSR management
            LDMXCSR, STMXCSR 
        Cache and Memory management
            MOVNTQ, MOVNTPS, MASKMOVQ, PREFETCH0, PREFETCH1, PREFETCH2, PREFETCHNTA, SFENCE 

Mostly unremarkable, although I may mention some, such as packed average, in more detail.

 Meta-organization
    Packages or Families of Instruction set extensions
        #AESNI
        #16_bit_FP
        SSE, SSE2, SSSE3, SSE4, SSE5, AVX, LRBNI, XOP 
    Cryptography
        #AESNI
        #Carry-less_Multiply
        #Carry-less_Divide 
    Vectors, Packed
        #Vector_Shifts
        #Add-Subtract
        #Vector_Horizontal_or_Pairwise_Operations
        Related
            #FMA 
        #16_bit_FP 
    Synchronization
        #Monitor_and_Mwait 
    System instructions
        Virtual machines and security, Intel and AMD's different flavors. 

" Bounds checking

HardBound: Architectural Support for Spatial Safety of the C Programming Language. Joe Devietti, Colin Blundell, Milo M. K. Martin, and Steve Zdancewic. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2008.

    http://www.cis.upenn.edu/acg/talks/asplos08_hardbound_talk.pdf
    http://www.cis.upenn.edu/acg/papers/asplos08_hardbound.pdf 

Although Milo disavows this heatedly, HardBound is really an example of a capability instruction set extension. It is really an attempt to make accessible to regular programs some of the features of secure pointers in instruction sets such as that of the IBM AS400. HardBound differs, however, in not using fat pointers, but in placing the capability descriptors in memory locations separate from the pointers they are securing, permitting unchanged memory layout for data type binary compatibility.

HardBound requires the compiler to identify where pointers are created, and what they are allowed to point to:

setbound instruction

Hardware checks and propagates the pointers. All registers and memory have metadata, describing, in this case, the bounds of the pointer. "

https://www.semipublic.comp-arch.net/wiki/Stream_descriptor#Stream_descriptor

"

Stream descriptor

A stream descriptor is a description, in a register or memory data structure, that describes a memory access pattern.

For example, a 1D stream descriptor might include

    data item description - size, type
    base
    stride
    length
    current position 

A 2D stream descriptor might include

    data item description - size, type
    base
    horizontal stride, length
    vertical stride, length
    current position, row and column 

Obviously extendable to N dimensions.

In addition to arrays, there may be other types of stream descriptor

A scatter/gather descriptor

    a 1D descriptor of a vector of addresses
    with an indication that each address should be dereferenced 

A linked list descriptor

    Base
    a skeleton structure descriptor
        payload to be accessed while traversing the list - offset, type
        position within a list element of pointer to next element 

A tree or B-tree descriptor

    Base
    a skeleton structure descriptor
        payload
        key
        pointers 

For the case of a tree or B-tree descriptor, the access pattern may (a) traverse all of a tree or subtree or (b) traverse only a path seeking an individual element

Explicit versus Implicit Stream Descriptors

A stream descriptor can be explicit - e.g. it may be bound to a register, so that a memory access instruction might look like

  destreg := LOAD [stream_descriptor_reg]
                Issue: what about index and offset addressing modes? Nicest if there are instructions "load/store from/to stream" 

"
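A sketch of the 2D stream descriptor above as a record plus a "next address" operation (field and method names are mine):

```python
# Sketch of the 2D stream descriptor from the quote as a record plus a
# "next address" operation; field and method names are mine.
from dataclasses import dataclass

@dataclass
class Stream2D:
    base: int
    item_size: int    # data item description (size in bytes)
    h_stride: int
    h_length: int
    v_stride: int
    row: int = 0      # current position
    col: int = 0

    def next_addr(self):
        """Address of the current element; then advance row-major."""
        addr = self.base + self.row * self.v_stride + self.col * self.h_stride
        self.col += 1
        if self.col == self.h_length:
            self.col = 0
            self.row += 1
        return addr

s = Stream2D(base=0x1000, item_size=4, h_stride=4, h_length=3, v_stride=64)
addrs = [s.next_addr() for _ in range(4)]   # walks a 3-wide row, then steps down
```

A "load from stream" instruction would then be: compute next_addr, load item_size bytes, leave the updated position in the descriptor.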

"

Data Manipulation Overhead

Users of SIMD packed vector instruction sets quickly come to realize that the actual memory access and computation is only a small part of the task of programming code to use such instructions. Often more than half of the instructions in such kernels are data manipulation overhead:

    unpacking 8 bit numbers into 16 bit numbers, or, worse, even fancier unpackings such as 5:6:5
    transposing rows and columns
    chunky versus planar data, i.e. SOA versus AOS 

Explicit support for streams via stream descriptors can eliminate much of this overhead, avoiding explicit instructions.

The CSI Multimedia Architecture goes further: not only does it hide the data packing and unpacking, but it also seeks to hide the SIMD packed vector nature of the hardware datapath. Instead of the programmer loading from a stream into a register of N bits, operating, and then storing, this paper proposes complex streaming instructions that are implemented N bits at a time in such a way that the code does not need to be changed when the microarchitecture changes the width of N.

Others have attempted to go even further, providing ways of expressing streaming kernels using simple instructions that automatically get mapped onto a fixed length datapath. "

"

The CSI Multimedia Architecture, IEEE T. VLSI, Jan 2005. Stream descriptors, plus optimizations to allow implicit SIMDification.

    iWarp - stream descriptors bound to registers
    Wm - William Wulf's processor, first instance I know of stream descriptors bound to registers "

" Other security related instruction set extensions

    separating the return address stack from data stack "

" More Types of Arithmetic Matrix Arithmetic

    vector matrix
    matrix matrix
    transpose 

Historically held back by register size; now, as we approach 512 bits = 16 32 bit numbers = 4x4 32b matrix ...

Copious possibilities for new instructions...

Complex Arithmetic

Quaternion Arithmetic

??? In for a penny, in for a pound (dropped off a bridge in Dublin) ???

LNS (logarithmic number systems)

In particular, instruction set support for the Gaussian logarithm, which allows computation of A+B in log form, i.e. computing log(A+B) given log(A) and log(B).

LNS often is coupled to irregular LUT support.

Lightweight threading instruction set extensions

Q: is MIPS MT such?

Message passing instruction set extensions

    send/receive, put/get
    PGAS (partitioned global address space) 

See Intel GenX Instruction Set#Message Gateway, although these messages seem to be used for communication within a GPU, at small scale, not at a wide scale between processors or systems. "
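The Gaussian logarithm mentioned in the LNS note above can be sketched as log(A+B) = log(A) + log1p(exp(log(B) - log(A))); in LNS hardware, the log1p(exp(.)) correction term is what the irregular LUT would approximate:

```python
# The Gaussian-logarithm trick: compute log(A+B) while staying in the log
# domain, via log(A+B) = la + log1p(exp(lb - la)).
import math

def log_add(la, lb):
    if lb > la:
        la, lb = lb, la    # keep the correction term's argument <= 0
    return la + math.log1p(math.exp(lb - la))

r = log_add(math.log(3.0), math.log(5.0))   # log(3 + 5)
```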

" Bit matrix multiply (BMM)

...

 Form of Matrix Multiplication

One of the most natural forms is "conventional" AND-OR BMM:

for(i=0;i<63;i++) {
    for(tmp=0, j=0;j<63;j++) {
        tmp |= a.bit[i] * b[i][j];
    }
    result.bit[i] = tmp;
}

Another much loved form is AND-XOR (also called AND parity).

Yet another form is AND-POPCNT

for(i=0;i<63;i++) {
    for(tmp=0, j=0;j<63;j++) {
        tmp += a.bit[i] * b[i][j];
    }
    result[i] = tmp;
}

although here the result is not 64 bits, but at least 64*6 bits (typically 64*8).

Uses

The AND-OR form is a fairly general and standard way of expressing logic functions - sum-of-products. As is, for that matter, the OR-AND form - product-of-sums. Other forms may also allow expression of all logic functions, although they may be less familiar.

The AND-OR form can be used to accomplish arbitrary permutations, using matrices that have only a single bit per row and column. As can, for that matter, the AND-XOR form, although it may be necessary to invert after the permutation.

The AND-XOR form can be used for information theoretic calculations, as can POPCNT.

The AND-XOR form can be used as a fairly general form of XOR based hash function, as described by Vandierendonck and De Bosschere in "XOR-Based Hash Functions".

Hardware

The AND-OR form is obviously easy to do with wired logic, and not so bad even with limited fan-in.

The XOR form is more expensive, requiring an XOR tree. POPCNT still more so, typically requiring an 8:3 counter (of which the XOR form needs only the lowest bit).

Because 8 levels of XOR can be hard, this may motivate wider, flatter formulations.

...

When writing this, I found the recent publication (2008):

    http://palms.princeton.edu/system/files/hilewitz_asap_08.pdf 
        It is nice to see this "lore" of (super)computer architecture leaking into academia"
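A runnable sketch of the AND-OR and AND-XOR BMM forms from the quote, on small n x n matrices (the indexing convention here is mine), including the single-bit-per-row-and-column permutation use:

```python
# AND-OR and AND-XOR bit-matrix-multiply on small n x n matrices: a is an
# n-bit vector, b_rows is a list of row bit-vectors (bit i of b_rows[j] is
# b[j][i]). Indexing convention is mine.
def bmm(a, b_rows, n, use_xor=False):
    result = 0
    for i in range(n):
        tmp = 0
        for j in range(n):
            term = ((a >> j) & 1) & ((b_rows[j] >> i) & 1)   # AND stage
            tmp = (tmp ^ term) if use_xor else (tmp | term)  # reduce stage
        result |= tmp << i
    return result

# a matrix with a single bit per row and column permutes the input bits
perm = [0b0010, 0b0001, 0b1000, 0b0100]   # 0->1, 1->0, 2->3, 3->2
out = bmm(0b0110, perm, 4)                # bits 1,2 move to bits 0,3
```

For permutation matrices the OR and XOR reductions agree, since each output bit sees at most one nonzero term.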

" Message Gateway

TBD

256 bit messages - GenX register size

    OpenGateway
    CloseGateway
    ForwardMsg
    BarrierMsg
        counter based barriers implemented in the "Message Gateway" messaging unit, delaying response until all threads join 
    UpdateGatewayState
    MMIOReadWrite
        provides a bridge from message passing to MMIO 

The OpenGateway/CloseGateway pair amounts to the "channel" I have described in Message Passing Instructions. Indeed, GenX uses exactly the term "channel". "

---

insight:

things like 'sin' that aren't needed on embedded systems, but that might have hardware acceleration on some platforms, could go in a std library rather than being core instructions.

--