proj-plbook-plChGpuLangs

Table of Contents for Programming Languages: a survey

GPGPU

GPGPU (General-purpose computing on graphics processing units).

http://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units

http://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units#GPU_programming_concepts

http://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units#GPU_methods

OpenCL

http://en.wikipedia.org/wiki/OpenCL

"OpenCL? C, for example, is quite similar to C, but has a lot of small differences, including primitive vector types, large collections of built-in functions, and support for multiple address spaces." -- [1]

GPU instruction sets: CUDA and CUDA PTX

CUDA PTX:

GPU instruction sets: HLSL Shader Model 5

GLSL

GLSL has "compute shaders" for GPGPU stuff, but OpenCL? or CUDA is still recommended over GLSL for general computation. [http://stackoverflow.com/questions/15868498/what-is-the-difference-between-opencl-and-opengls-compute-shader ] [2] [3] [4] [5] [6]

WebGL uses GLSL for shading:

Does this site use GLSL?: https://www.shadertoy.com/

More on OpenGL?:

Brook

http://graphics.stanford.edu/projects/brookgpu/lang.html

Harlan

https://github.com/eholk/harlan/wiki/Language-Overview

Micro automata processor

Computes NFA (non-deterministic finite automata) in parallel.

" There are three major components on the AP: State-Transition-Element (STE), Counter Element and Boolean Elements, among which STE is the core component. One STE can match an 8-bit user-specified symbol in a clock cycle and STEs can connect to each other via edges. Each STE has two states: activated and matched. Only activated STEs will be able to accept the next input symbol to perform a match against the user-specified symbol within that STE. Once the symbol on an STE is matched, the STEs connected to it will be activated to accept the next input symbol and match that against their user-specified symbols.

Besides STEs, the Counter Element is used to count numbers. It requires a user-specified threshold. Once the threshold is reached, the counter can produce a report or activate STEs that are connected to it. There are also Boolean Elements that function as logic gates such as AND, OR, NOR, etc.

Users can design their Automata structures using an XML-like language, Automata Network Markup Language (ANML). The ANML code is then compiled and loaded onto the processor.

Once the design code is loaded onto the chip, then a scanning-matching task is performed. The processor will take the input data as a stream and match them against the design at a rate of 128 MBps. The starting STEs can accept either the entire input data, or only the start of data. The reporting STEs will report if they are activated and matched. The output from the Emulator contains an offset number on which a reporting element is reported, as well as the ID of the reporting element for all the reported STEs.

...

One single AP chip contains two independent half-cores, each of which has 24,576 STEs. Thus a chip contains a total of 49,152 STEs among which 6144 can report. One chip also has 768 Counter Elements and 2304 Combinatorial Elements. There are 32 chips on an AP board which has 1,572,864 STEs that can work in parallel. This enables highly parallel operations [18]. The designs presented in this work can easily fit inside a single AP chip. " -- [7]

" The AP chip has three types of functional elements - the state transition element (STE), the counters, and the Boolean elements [5]. The state transition element is the central feature of the AP chip and is the element with the highest population density. Counters and Boolean elements are critical features to extend computational capabilities beyond NFAs. STEs have two exclusive states - activated and deactivated. An STE is activated if the incoming signal is high. If there are many incoming connections, an implicit rule of \or" will be applied to all these incoming connections. All activated STEs in one automaton will accept the same symbol of input stream at a given cycle. Each STE can be con gured to match a set of any 8-bit symbols. When an activated STE matches a symbol of input stream, the output connections (if any) will produce a high signal. An STE can be con gured as a self-activation STE by connecting an output wire back to its incoming port. The Boolean element can be con gured as \or", \and", \nor", \nand", \inv", \POS" (product-of-sum), \SOP" (sum-of-product), \NPOS" (not POS), \NSOP" (not SOP). The last four are Booleans element with multiple (6) input terminals, say I0; I1; I2; I3; I4; I5 and one output. An SOP is the sum (OR) of product (AND) terms. A POS is the product (AND) of sum (OR) terms. An NSOP is an SOP with its activation value inverted. An NPOS is POS with its activation value inverted. For example, the \POS", output= (I0 or I1) and (I2 or I3) and (I4 or I5) ; the \SOP",output= (I0 and I1) or (I2 and I3) or (I4 and I5). A counter has two input ports - counting port (a small \C" on the counter in the following automaton illus- trations) and reset port (a small \R" on the counter in the following automaton illustrations). When the counting port receives a high signal at a cycle, the counter will count one incrementally (only incrementally). A counter will be reset if it receives high signal from its reset port. A counter needs to be con gured with an integer threshold. When the threshold is reached, a counter will assert its output. Depending on the operation mode con gured, one counter can have three di erent output behaviors when the threshold is reached: latch mode, i.e., holds high output forever; pulse mode, i.e., produces high signal at that cycle only; and roll mode, i.e., produces a high signal at that cycle and reset. One counter can count up to 2^12

... Micron's current generation AP - D480 chip is built on 45nm technology running at an input symbol (8-bit) rate of 133 MHz. The D480 chip has two half-cores and each half-core has 96 blocks. Each block has 256 STEs, 4 counters and 12 Boolean elements. In total, one D480 chip has 49,152 processing state elements, 2,304 programmable Boolean elements, and 768 counter elements [5]. Each AP board can have up to 48 AP chips that can perform pattern matching in parallel [6]. Each AP chip has a worst case power consumption of 4W [5]. The power consumption of a 48-core AP board is expected to be similar to a high-end GPU card.

The AP takes input streams of 8-bit symbols. Each AP chip is capable of processing up to multiple separate data streams concurrently, although we do not use this feature for this work. The data processing and data transfer are implicitly overlapped by using the input double-bu ering of the AP chip. Any STE can be con gured to accept the rst symbol in the stream (called start-of-data mode, a small \1" in the left-upper corner of STE in the following automaton illustrations), to accept every symbol in the input stream (called all-input mode, a small \ 1 " in the left-upper corner of STE in the following automaton illustrations) or to accept a symbol only upon activation. The all-input mode will consume one extra STE. Any type of element on the AP chip can be con gured as a reporting element; one reporting element generates a one-bit signal when the element matches an input symbol. One AP chip has up to 6144 reporting elements. If any reporting element reports at a cycle, the chip will generate an output vector which contains signals of \1" corresponding to the elements that report at that cycle and \0"s for reporting elements that do not report. If too many output vectors are generated, the output bu er can ll up and stall the chip. Thus, minimizing output vectors is an important consideration for performance improvement.

3.4 Programming and recon guration Automata Network Markup Language (ANML) is an XML language for describing the composition of automata networks. ANML is the basic way to program automata on the AP chip. Besides ANML, Micron provides a graphical user interface tool called the AP Workbench for quick automaton designing and debugging. A \macro" is a container of automata for encapsulating a given functionality, similar to a function or subroutine in common programming languages. A macro can be de ned with parameters of symbol sets of STEs and counter thresholds which can be instantiated with actual arguments. A macro also can have input ports and output ports to connect the wires outside the macro. Micron's AP SDK also provides C and Python interfaces to build automata, create input streams, parse output and manage computational tasks on the AP board "

" Direct implementation of automata in hardware has the potential to be more efficient than software executing on a von Neumann architecture. Hardware-based automata can effect simultaneous, parallel exploration of all possible valid paths in an NFA, thereby achieving the processing complex- ity of a DFA without being subject to DFA state explosion. " -- An Efficient and Scalable Semiconductor

(of course the hardware-based automata has a fixed upper limit on the number of states)

regexes can be directly implemented; see Table 3 in An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing by Paul Dlugosch, Dave Brown, Paul Glendenning, Michael Leventhal, and Harold Noyes, PDF page 8. They can fit about 1k regexs and simultaneously apply them to inputs at about 1 Gbps.

They can also store a lookup tree in the chip; to index into the tree, you re-broadcast the query n times, where n is the depth of the tree.

The paper also reviews related work including FPGA implementations of regexs.

alternate link to paper:

it's an example of MISD (multiple instruction, single data)

it's basically a repurposed memory chip:

"Paul Dlugosch is director of Automata Processor Development in the Architecture Development Group of Micron’s DRAM division. “One thing people don’t understand well, aside from those memory researchers or people in this industry, is that any memory device is by nature a very highly parallel device. In fact, he says, “most of the power of that parallelism is left on the table and unused.”" -- http://www.hpcwire.com/2013/11/22/micron-exposes-memorys-double-life-automata-processor/

figure in http://www.hpcwire.com/2013/11/22/micron-exposes-memorys-double-life-automata-processor/ shows that a Automata processor with ~300W outperformed a $20k >2000W 48 CPU HPC cluster, taking 14 minutes to do a bioinformatics "Planted Motif Search" where the HPC cluster took 47 hours.

"To be clear, the Automata Processor is not a standalone processor...Keep in mind that the first generation Automata Processor uses a physical interface based on the DDR3 protocol specification.

This means that the way an Automata Processor interfaces to a 'controller' is very similar to how a DDR3 memory component (or DIMM module) would interface to a CPU. The Automata Processor array must have information streamed to it (DDR3 writes) and periodically, the Automata Processor must provide analytic results (DDR3 reads). In a complete system, some type of CPU/NPU or perhaps FPGA must implement these read write operations."

-- http://www.techradar.com/us/news/world-of-tech/all-you-need-to-know-about-automata-micron-s-revolutionary-processor-1237982

supplement to the above linked paper: https://www.micron.com/~/media/Documents/Products/Other%20Documents/automata_processing_technical_paper_supplementary_material.pdf

---

here's another 'hetrogeneous computing' hardware design in which each 'pixel' gets a small general purpose processor; the idea is called 'pixel planes':

http://www.cs.unc.edu/~jp/Siggraph89Pxpl5.pdf

http://www.cs.unc.edu/techreports/85-001.pdf

---

SPIR-V

https://www.khronos.org/spir/ https://en.wikipedia.org/wiki/Standard_Portable_Intermediate_Representation

NIR

Triton

https://www.openai.com/blog/triton/ https://dl.acm.org/doi/abs/10.1145/3315508.3329973 http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf

WebGPU

https://surma.dev/things/webgpu/ discussion:

Use.GPU

https://usegpu.live/ https://acko.net/blog/the-case-for-use-gpu/ https://lobste.rs/s/yb8d9m/case_for_use_gpu https://gitlab.com/unconed/use.gpu

A datalog engine for GPUs

https://www.dcc.fc.up.pt/~ricroc/homepage/publications/leap/2013-WFLP.pdf

JAX

https://jax.readthedocs.io/en/latest/notebooks/quickstart.html

GPU language comparisons