proj-oot-lowEndTargets-gpus

Embedded and low-end hardware survey: Contemporary MCUs

could also look at the resources available to each processor, or even each 'thread' within the GPU, in low-end GPGPU systems. Intel's integrated graphics (i.e. GPUs integrated into the package or die of the CPU) apparently started supporting the OpenCL standard with the Ivy Bridge system generation, of which a lower-end part was the Intel HD Graphics 2500

NOTE: I DON'T KNOW ANYTHING ABOUT GPUS OR GPGPU OR OPENCL YET; MY UNDERSTANDING OF THESE NUMBERS IS SEVERELY LACKING, SO SOME OF THE FOLLOWING INTERPRETATIONS MAY BE WRONG!

https://compubench.com/device-info.jsp?config=12921360 (Intel HD Graphics 2500):

CL_DEVICE_IMAGE2D_MAX_HEIGHT 16384
CL_DEVICE_IMAGE2D_MAX_WIDTH 16384
CL_DEVICE_IMAGE3D_MAX_DEPTH 2048
CL_DEVICE_IMAGE3D_MAX_HEIGHT 2048
CL_DEVICE_IMAGE3D_MAX_WIDTH 2048
CL_DEVICE_LOCAL_MEM_SIZE 65536
CL_DEVICE_MAX_COMPUTE_UNITS 6
CL_DEVICE_MAX_PARAMETER_SIZE 1024
CL_DEVICE_MAX_READ_IMAGE_ARGS 128
CL_DEVICE_MAX_SAMPLERS 16
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS 3
CL_DEVICE_MAX_WORK_ITEM_SIZES (256, 256, 256)

And for the ARM Mali-T604, which is in the Google Nexus 10,

https://jogamp.org/bugzilla/attachment.cgi?id=581&action=edit

CL_DEVICE_LOCAL_MEM_SIZE: 32768
CL_DEVICE_MAX_COMPUTE_UNITS: 4
CL_DEVICE_IMAGE2D_MAX_HEIGHT: 65536
CL_DEVICE_IMAGE2D_MAX_WIDTH: 65536
CL_DEVICE_IMAGE3D_MAX_DEPTH: 65536
CL_DEVICE_IMAGE3D_MAX_HEIGHT: 65536
CL_DEVICE_IMAGE3D_MAX_WIDTH: 65536
CL_DEVICE_MAX_PARAMETER_SIZE: 1024
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_SAMPLERS: 16
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: [256, 256, 256]
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
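
as an aside, these CL_DEVICE_* values are what you get back from OpenCL's clGetDeviceInfo; a minimal host-side C sketch of querying a few of them (assuming one platform, one GPU device, and no error checking) might look like:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_uint compute_units;
        cl_ulong local_mem;
        size_t max_wg_size;
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem), &local_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg_size), &max_wg_size, NULL);

        printf("compute units: %u\n", compute_units);
        printf("local mem: %lu bytes\n", (unsigned long)local_mem);
        printf("max work-group size: %zu\n", max_wg_size);
        return 0;
    }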

http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf :

" The maximum number of threads (256) stems from the fact that each thread can use 4 registers from a register bank that contains 1024 registers. The larger the number of registers used by the kernel, the fewer the concurrent threads. So if a kernel uses 8 registers, only a maximum of 128 threads can run in parallel. If there are enough threads to hide latency there should be no performance implication of using more registers. "

(so i guess if you wanted 128 registers per thread, you could have only 8 concurrent threads)
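
a hedged sketch of that register/thread tradeoff as arithmetic (the 1024-register bank and the 256-thread cap are the Mali-T604 figures quoted above, not a general rule):

    /* how many threads can run concurrently if each reserves
       regs_per_thread registers from a shared 1024-register bank,
       capped at the 256-thread hardware limit quoted above */
    unsigned max_concurrent_threads(unsigned regs_per_thread) {
        unsigned bank_size = 1024;       /* registers per register bank */
        unsigned hw_thread_limit = 256;  /* CL_DEVICE_MAX_WORK_GROUP_SIZE */
        unsigned n = bank_size / regs_per_thread;
        return n < hw_thread_limit ? n : hw_thread_limit;
    }
    /* max_concurrent_threads(4)   == 256
       max_concurrent_threads(8)   == 128
       max_concurrent_threads(128) == 8   */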

according to http://www.realworldtech.com/ivy-bridge-gpu/4/ , Ivy Bridge has 32KB L1 icaches, 8 threads per core (is this comparable to the 10 EUs in Haswell, see below? probably not, see http://en.wikipedia.org/wiki/Intel_HD_and_Iris_Graphics#Ivy_Bridge; maybe it is comparable to the SIMD parallelism, see below, in which case i'd hesitate to call it a 'thread'; or maybe it is the number of sub-slices per slice), and 1k general registers (32KB of memory)

see also http://events-tce.technion.ac.il/files/2013/07/Michael.pdf

" Vertex shading commonly uses SIMD4x2, with 4 data elements from 2 vertices. Pixel shading is SIMD1x8 or SIMD1x16 (aka SIMD8 or SIMD16), operating on a single color from 8 or 16 pixels simultaneously. Media shaders are similar to pixel shaders, except they are packed even more densely with 8-bit data, rather than the 32-bit data used in graphics shaders. To support all these different execution modes, the GRF is incredibly versatile.

Registers are each 256-bits wide, which is perfectly suited for SIMD2x4 or SIMD8. In a 16B aligned mode, instructions operate on 4-component RGBA data, with source swizzling and destination masking. In a 1B aligned mode, instructions use region-based addressing to perform a 2-dimension gather from the register file and swizzling and destination masking are disabled. This is critical for good media performance, where 1B data is packed together for maximum density. Collectively, these two addressing modes also simplify converting from AOS to SOA data structures.

Each thread is allocated 128 general purpose registers, so the GRF has expanded to 32KB to handle 8 threads. The GRF has also been enhanced to handle larger 8B accesses that are necessary for double precision computation. "

in other words, each thread has 4KB worth of register space available to it, but only in the form of 128 256-bit registers.

http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2

looks like the fundamental component is the EU (execution unit), which is grouped into 'sub-slices' (Intel), GCN compute units (AMD), or Kepler SMXs (Nvidia). There is a table on http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2 showing their capabilities in the Haswell system generation. They are SIMD units whose parallelism ranges from 8-wide (actually, dual-issue 2x4) to 32-wide; and each subslice/CU/SMX has from 4-10 EUs (4-10 SIMD units), for a total of 64-192 ALUs per subslice/CU/SMX

there are up to 4 sub-slices in total, or 40 EUs, in Haswell (for Intel HD 2000 it was 6 EUs; for Intel HD 3000, 12 EUs), with peak FP ops per core/EU at around 16.

note that EUs are also called 'cores'?

so interesting numbers relating to amount of memory available per core are:

so, assuming a word size of 256 bits, there we see numbers from 128 words (4KB) to 512 words (16KB), and up to 2048 (the CL_DEVICE_LOCAL_MEM_SIZE of the Intel HD 2500, divided by 32 bytes per word)

and numbers relating to parallelism hardware are:

so there we see numbers on the order of 8

and numbers relating to parallelism image size are:

so there we see numbers greater than 2048

and other numbers relating to parallelism in the OpenCL environment are:

so there we see one 8 and a bunch of things close to 256

in summary, we see the following critical values (approximate):

http://stackoverflow.com/questions/3957125/questions-about-global-and-local-work-size says that the actual software-exposed parallelism is the 'global work size', but i don't understand how to find the max global work size for a given GPU. Maybe you can't do that without actually creating the kernel, because it depends on exactly how much memory (even how many registers) the kernel is using, etc. http://www.khronos.org/message_boards/showthread.php/9207-Determine-global_work_size suggests this is on the order of 2^17. See also http://www.khronos.org/message_boards/showthread.php/6060-clEnqueueNDRangeKernel-max-global_work_size , https://devtalk.nvidia.com/default/topic/477978/questions-about-global-and-local-work-size/.
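
for reference, the global and local work sizes are just arguments to clEnqueueNDRangeKernel; a fragment (not a complete program; 'queue' and 'kernel' are assumed to exist already, and the 2^17 figure is only what the forum posts above report working, not a queried limit):

    size_t global_work_size[1] = { 1 << 17 };  /* total work-items (2^17) */
    size_t local_work_size[1]  = { 256 };      /* work-items per work-group */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        1,                 /* work dimensions */
                                        NULL,              /* global offset */
                                        global_work_size,
                                        local_work_size,
                                        0, NULL, NULL);    /* no wait events */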

http://people.maths.ox.ac.uk/gilesm/cuda/new_lectures/lec1.pdf says that (higher-end?) GPUs typically have on the order of 1024 (2^10) cores and 128-384 registers per core, 8192 threads (but i think it says these are 32-way SIMD, so really that's 32 independent threads?), 64k shared memory, and can support at least ~2^17 global work size.

On the number of 'cores', http://www.cs.nyu.edu/~lerner/spring12/Preso07-OpenCL.pdf cautions: "An important caveat to keep in mind is that the marketing numbers for core count for NVIDIA and ATI aren't always a good representation of the capabilities of the hardware. For instance, on NVIDIA's website a Quadro 2000 graphics card has 192 'Cuda Cores'. However, we can query the lower-level hardware capabilities using the OpenCL API and what we find is that in reality there are actually 4 compute units, all consisting of 12 stream multiprocessors, and each stream multiprocessor is capable of 4-wide SIMD, 192 = 4*12*4. In the author's opinion this makes the marketing material confusing, since you wouldn't normally think of a hardware unit capable only of executing floating point operations as a 'core'. Similarly, the marketing documentation for a HD6970 (very high end GPU from ATI at time of writing) shows 1536 processing elements, while in reality the hardware has 24 compute units (SIMD engines), and 16 groups of 4-wide processing elements per compute unit. 1536 = 24*16*4."

http://www.cs.nyu.edu/~lerner/spring12/Preso07-OpenCL.pdf also says "Clearly, for small matrices the overhead of OpenCL execution dominates the performance benefits of massively concurrent execution. For our measurements, below a matrix dimension of roughly 150 x 150 the simple multithreaded CPU code outperforms OpenCL"

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0538e/BABGJBFI.html says "The optimal global work size must be large if you want to ensure high performance. Typically the number is several thousand but the ideal number depends on the number of shader cores in your device. To calculate the optimal global work size use the following equation: global work size = <maximum_work-group_size> * <number of shader cores> * <constant>, where constant is typically 4 or 8 for the Mali-T604 GPU"

The Mali-T604 has four shader cores (the later Mali-T658 supports up to eight): "The Mali-T658 design supports up to 8 shader cores, compared with the Mali-T604's four shader cores, and ARM has also doubled the number ..." -- "ARM announces 8-way graphics core", EE Times, Nov 10, 2011, www.eetimes.com/document.asp?doc_id=1270826

so that's 256*4*4 = 4096, which is close to the 2048 number we keep seeing

random note: OpenCL allows the specification of a dependency graph of 'tasks', but 'kernels' executing in different tasks, although they may actually execute in parallel, cannot synchronize and therefore should not try to share memory (http://www.rtcmagazine.com/articles/view/102645)
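
concretely, that dependency graph is expressed with events at enqueue time; a sketch (assuming 'queue', 'kernel_a', 'kernel_b', and a global size array 'gws' already exist):

    /* kernel_b is ordered after kernel_a via an event; beyond this
       ordering the two kernels must not rely on synchronizing with
       each other through shared memory */
    cl_event a_done;
    clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, gws, NULL,
                           0, NULL, &a_done);     /* produces event */
    clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL, gws, NULL,
                           1, &a_done, NULL);     /* waits on kernel_a */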

so what about the maximum local work-group size? From the CL_DEVICE_* values, both the Intel HD 2500 and the ARM Mali-T604 seem to allow 256. But http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf says that this holds only if each thread uses just 4 registers; so i guess that means that if we use 128 registers per thread, the max local work-group size is only 8. So let's put this at 'at least 8', ranging from 8 to 256.
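
note that the per-kernel limit (which folds in things like register pressure) can be queried once the kernel is built, via clGetKernelWorkGroupInfo; a sketch, assuming 'kernel' and 'device' already exist and omitting error checking:

    size_t kernel_wg_size;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_wg_size), &kernel_wg_size, NULL);
    /* kernel_wg_size may be well below CL_DEVICE_MAX_WORK_GROUP_SIZE
       (256 above) if the kernel is register-hungry */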

Compare all this to Cell SPEs http://en.wikipedia.org/wiki/Cell_(microprocessor):

Interestingly, the OpenCL work-group size is the limit on the number of processes that can synchronize and pass information between each other using the shared local memory (GPU-local memory, but not per-processor). This is roughly the same order of magnitude that is achieved in multiprocessor systems with a shared memory (e.g. dual-core (4 threads with hyperthreading), quad-core (8 threads), the 8 SPEs in the Cell processor, the 32-48 cores in the initial Larrabee ( http://en.wikipedia.org/wiki/Larrabee_%28microarchitecture%29 ) (maybe multiply that by 4 because each Larrabee core could run 4 threads, so 128-192 threads), etc). Note that with larger numbers of registers in use, at least on the ARM Mali-T604, this number drops below 256; in fact 256 is reached only when the number of registers used is very small (4): if 8 registers are used, you have 128 threads; if 16 registers, 64 threads; if 32 registers, 32 threads; etc. So this suggests that in general these numbers may be the orders of magnitude to which coherent shared-memory multiprocessing scales: between 8 and 256 independent threads. Beyond this, perhaps supporting atomic, highly consistent (are we talking sequential consistency?) shared memory access becomes inefficient, and it becomes more efficient to fall back to a more relaxed memory consistency model, or to alternative paradigms such as task dependencies and message-passing for IPC, as well as non-communicating data-parallelism for operations which don't need further IPC ("kernels", i guess).
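
for concreteness, here is what 'synchronize through shared local memory within a work-group' looks like in OpenCL C (a made-up neighbor-sum kernel, just to show __local plus barrier; work-items in different work-groups cannot synchronize this way):

    __kernel void neighbor_sum(__global const float *in, __global float *out) {
        __local float tile[256];           /* shared by one work-group */
        int lid = get_local_id(0);
        int gid = get_global_id(0);
        tile[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);       /* all work-items in the group meet here */
        int left = (lid == 0) ? lid : lid - 1;
        out[gid] = tile[lid] + tile[left];
    }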

http://www.slideshare.net/mikeseven/imaging-on-embedded-gp-us-bamm-meetup-20131219 similarly says that although there is a 256-thread limit, it's 64 in practice

http://www.slideshare.net/mikeseven/imaging-on-embedded-gp-us-bamm-meetup-20131219 says the Adreno 330 GPU on the Qualcomm MSM8974 has 128-bit registers, 8k of local memory per core, a 512 work-item maximum, and 1.5MB of on-chip RAM, and says the ARM Mali-T604 has 32k of local memory per core.

Parallax Propeller 2: http://www.rayslogic.com/Propeller2/Propeller2.htm -- 8 cogs. "P2 has 128 kB of HUB RAM, P1 has 32 kB. Both P2 and P1 have 512 longs of COG RAM, but P2 has an additional 256 longs of stack RAM in each cog."

so we can distinguish a few architectural levels here:

note: the number of data items in shared 'local' memory can be relied upon to be at least 512 ('local' meaning not local to each processor, but local to the GPU as opposed to ordinary main memory): that's the 16k minimum value for CL_DEVICE_LOCAL_MEM_SIZE in OpenCL 1.0, divided by the 256-bit (32-byte) register width of Intel HD. However, OpenCL 1.2 raises the minimum to 32k, and in fact the devices i looked at above all seem to have at least that much; that's also a common cache size these days, and many devices have 128-bit registers instead of 256-bit; so each individual device seems to have space for at least 2048 items in local memory, which is another number we saw a lot in the Intel HD OpenCL environment. And even pretty old computers had 4k of memory, which is 2048 words if you have 16-bit words. So let's assume 2048.

so we can rely on at least approximately:

now 128 registers per thread seems excessive; often we'll want more synchronizable parallelism in exchange for less memory per thread, so let's cut that down:

in summary, we can rely on at least approximately:

in bits:

rounding up (to get bit width for addressing for a fixed-length instruction set) and down to powers of 2 (to get the minimal numbers that we can rely upon), in bits:

so max bit width for fixed length addressing:

note: this argues for:

and min that we can rely upon:

in summary, we can rely on at least approximately:

CONCLUSION/NOTE: BASED ON THESE NUMBERS, IT LOOKS LIKE CONNECTION MACHINE-LIKE, ACTIVE-DATA-STYLE PROGRAMMING COULD BE EMULATED BY PRESENT-DAY CONSUMER GPUs!! Even though there are only 2 CPUs in lower-end consumer computers and ~8 GPU units, there's no need to wait for computers with 64k CPUs: since in the paradigm we are targeting, (virtual) processors only need to synchronize with their immediate neighbors, all 64k of them don't need to share a synchronizable local memory, so we can emulate them using data-parallelism.
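
a hedged OpenCL C sketch of that emulation idea: one work-item per virtual processor, each reading only its immediate neighbors, with neighbor 'synchronization' provided by running one (double-buffered) kernel launch per time step rather than by shared local memory, so the virtual-processor count is not bounded by the work-group size (the update rule and buffers are made up for illustration):

    __kernel void ca_step(__global const float *cur, __global float *next, int n) {
        int i = get_global_id(0);           /* virtual processor index */
        if (i >= n) return;
        int l = (i + n - 1) % n;            /* left neighbor (wraps around) */
        int r = (i + 1) % n;                /* right neighbor (wraps around) */
        next[i] = 0.25f * cur[l] + 0.5f * cur[i] + 0.25f * cur[r];
    }

the host would enqueue this kernel once per step with a global work size of n (e.g. 64k), swapping 'cur' and 'next' between steps.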

SUGGEST RE-TARGETING OOT AT OPENCL

---

interesting history/economics on integrated graphics: http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested

---

GPU program size limits:

https://www.opengl.org/discussion_boards/showthread.php/182182-Shader-Program-size-limit https://www.reddit.com/r/opengl/comments/2n8mjz/size_limits_for_shaders_on_various_gpus/ https://stackoverflow.com/questions/2617957/glsl-maximum-number-of-instructions

looks like on OpenGL 3.x (SM4), we can have 64k instructions in a pixel shader (but fewer in vertex shaders, or on OpenGL 2.x).
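
note: core OpenGL does not expose a queryable GLSL instruction-count limit, so the figures above come from vendor docs and forum reports; what you can query are related resource limits, e.g. this sketch (assuming a current GL context and headers that define the GL 2.0+ enums):

    #include <stdio.h>
    #include <GL/gl.h>

    void query_shader_limits(void) {
        GLint frag_uniform_components = 0, vert_uniform_components = 0;
        glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_COMPONENTS, &frag_uniform_components);
        glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS, &vert_uniform_components);
        printf("fragment uniform components: %d\n", frag_uniform_components);
        printf("vertex uniform components: %d\n", vert_uniform_components);
        /* GLSL instruction limits themselves are only revealed indirectly,
           by whether a given shader compiles and links on a given driver */
    }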

---

https://www.digikey.com/en/supplier-centers/a/adapteva

Epiphany (18 cores in their Parallella board, 1024 cores in the Epiphany-V chip)

The Epiphany is presumably scalable -- they claim it should scale to a billion processors -- but so far there hasn't been evidence of more than 1024. Supposedly even the Parallela boards can be connected to each other: "The architecture also allows parallella boards to be combined into a cluster with a fast inter-chip 'eMesh' interconnect, extending the logical grid of cores (creating almost unlimited scaling potential)."

The Epiphany-V local stores are 64k.

---