proj-oot-lowEndTargets-gpus

Embedded and low-end hardware survey: Contemporary MCUs

could also look at the resources available to each processor, or even each 'thread' within the GPU, in low-end GPGPU systems. Intel's integrated graphics (i.e. GPUs integrated into the package or die of the CPU) apparently started supporting the OpenCL standard with the Ivy Bridge system generation, of which a lower-end part was the Intel HD Graphics 2500

NOTE: I DON'T KNOW ANYTHING ABOUT GPUS OR GPGPU OR OPENCL YET; MY UNDERSTANDING OF THESE NUMBERS IS SEVERELY LACKING, SO SOME OF THE FOLLOWING INTERPRETATIONS MAY BE WRONG!

https://compubench.com/device-info.jsp?config=12921360 (Intel HD Graphics 2500):

CL_DEVICE_IMAGE2D_MAX_HEIGHT 16384
CL_DEVICE_IMAGE2D_MAX_WIDTH 16384
CL_DEVICE_IMAGE3D_MAX_DEPTH 2048
CL_DEVICE_IMAGE3D_MAX_HEIGHT 2048
CL_DEVICE_IMAGE3D_MAX_WIDTH 2048
CL_DEVICE_LOCAL_MEM_SIZE 65536
CL_DEVICE_MAX_COMPUTE_UNITS 6
CL_DEVICE_MAX_PARAMETER_SIZE 1024
CL_DEVICE_MAX_READ_IMAGE_ARGS 128
CL_DEVICE_MAX_SAMPLERS 16
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS 3
CL_DEVICE_MAX_WORK_ITEM_SIZES (256, 256, 256)

And for the ARM Mali-T604, which is in the Google Nexus 10,

https://jogamp.org/bugzilla/attachment.cgi?id=581&action=edit

CL_DEVICE_LOCAL_MEM_SIZE: 32768
CL_DEVICE_MAX_COMPUTE_UNITS: 4
CL_DEVICE_IMAGE2D_MAX_HEIGHT: 65536
CL_DEVICE_IMAGE2D_MAX_WIDTH: 65536
CL_DEVICE_IMAGE3D_MAX_DEPTH: 65536
CL_DEVICE_IMAGE3D_MAX_HEIGHT: 65536
CL_DEVICE_IMAGE3D_MAX_WIDTH: 65536
CL_DEVICE_MAX_PARAMETER_SIZE: 1024
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_SAMPLERS: 16
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: [256, 256, 256]
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
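
as an aside, these CL_DEVICE_* values are what you get back from OpenCL's clGetDeviceInfo; a minimal host-side C sketch of querying a few of them (assuming one platform, one GPU device, and no error checking) might look like:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_uint compute_units;
        cl_ulong local_mem;
        size_t max_wg_size;
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem), &local_mem, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg_size), &max_wg_size, NULL);

        printf("compute units: %u\n", compute_units);
        printf("local mem: %lu bytes\n", (unsigned long)local_mem);
        printf("max work-group size: %zu\n", max_wg_size);
        return 0;
    }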

http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf :

" The maximum number of threads (256) stems from the fact that each thread can use 4 registers from a register bank that contains 1024 registers. The larger the number of registers used by the kernel, the fewer the concurrent threads. So if a kernel uses 8 registers, only a maximum of 128 threads can run in parallel. If there are enough threads to hide latency there should be no performance implication of using more registers. "

(so i guess if you wanted 128 registers per thread, you could have only 8 concurrent threads)
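
a hedged sketch of that register/thread tradeoff as arithmetic (the 1024-register bank and the 256-thread cap are the Mali-T604 figures quoted above, not a general rule):

    /* how many threads can run concurrently if each reserves
       regs_per_thread registers from a shared 1024-register bank,
       capped at the 256-thread hardware limit quoted above */
    unsigned max_concurrent_threads(unsigned regs_per_thread) {
        unsigned bank_size = 1024;       /* registers per register bank */
        unsigned hw_thread_limit = 256;  /* CL_DEVICE_MAX_WORK_GROUP_SIZE */
        unsigned n = bank_size / regs_per_thread;
        return n < hw_thread_limit ? n : hw_thread_limit;
    }
    /* max_concurrent_threads(4)   == 256
       max_concurrent_threads(8)   == 128
       max_concurrent_threads(128) == 8   */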

according to http://www.realworldtech.com/ivy-bridge-gpu/4/ , Ivy Bridge has 32KB L1 icaches, 8 threads per core (is this comparable to the 10 EUs in Haswell, see below? probably not, see http://en.wikipedia.org/wiki/Intel_HD_and_Iris_Graphics#Ivy_Bridge; maybe it is comparable to the SIMD parallelism, see below, in which case i'd hesitate to call it a 'thread'; or maybe it is the number of sub-slices per slice), and 1k general registers (32KB of memory)

see also http://events-tce.technion.ac.il/files/2013/07/Michael.pdf

" Vertex shading commonly uses SIMD4x2, with 4 data elements from 2 vertices. Pixel shading is SIMD1x8 or SIMD1x16 (aka SIMD8 or SIMD16), operating on a single color from 8 or 16 pixels simultaneously. Media shaders are similar to pixel shaders, except they are packed even more densely with 8-bit data, rather than the 32-bit data used in graphics shaders. To support all these different execution modes, the GRF is incredibly versatile.

Registers are each 256-bits wide, which is perfectly suited for SIMD2x4 or SIMD8. In a 16B aligned mode, instructions operate on 4-component RGBA data, with source swizzling and destination masking. In a 1B aligned mode, instructions use region-based addressing to perform a 2-dimension gather from the register file and swizzling and destination masking are disabled. This is critical for good media performance, where 1B data is packed together for maximum density. Collectively, these two addressing modes also simplify converting from AOS to SOA data structures.

Each thread is allocated 128 general purpose registers, so the GRF has expanded to 32KB to handle 8 threads. The GRF has also been enhanced to handle larger 8B accesses that are necessary for double precision computation. "

in other words, each thread has 4KB worth of register space available to it, but only in the form of 128 256-bit registers.

http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2

looks like the fundamental component is the EU (execution unit), which is grouped into 'sub-slices' (Intel), GCN compute units (AMD), or Kepler SMXs (Nvidia). There is a table on http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2 showing their capabilities in the Haswell system generation. They are SIMD units whose parallelism ranges from 8-wide (actually, dual-issue 2x4) to 32-wide; and each subslice/CU/SMX has from 4-10 EUs (4-10 SIMD units), for a total of 64-192 ALUs per subslice/CU/SMX

there are up to 4 sub-slices in total, or 40 EUs, in Haswell (for Intel HD 2000 it was 6 EUs; for Intel HD 3000, 12 EUs), with peak FP ops per core/EU at around 16.

note that EUs are also called 'cores'?

so interesting numbers relating to amount of memory available per core are:

so, assuming a word size of 256 bits, there we see numbers from 128 words (4KB) to 512 words (16KB), and up to 2048 (the CL_DEVICE_LOCAL_MEM_SIZE of the Intel HD 2500, divided by 32 bytes per word)

and numbers relating to parallelism hardware are:

so there we see numbers on the order of 8

and numbers relating to parallelism image size are:

so there we see numbers greater than 2048

and other numbers relating to parallelism in the OpenCL environment are:

so there we see one 8 and a bunch of things close to 256

in summary, we see the following critical values (approximate):

http://stackoverflow.com/questions/3957125/questions-about-global-and-local-work-size says that the actual software-exposed parallelism is the 'global work size', but i don't understand how to find the max global work size for a given GPU. Maybe you can't do that without actually creating the kernel, because it depends on exactly how much memory (even how many registers) the kernel is using, etc. http://www.khronos.org/message_boards/showthread.php/9207-Determine-global_work_size suggests this is on the order of 2^17. See also http://www.khronos.org/message_boards/showthread.php/6060-clEnqueueNDRangeKernel-max-global_work_size , https://devtalk.nvidia.com/default/topic/477978/questions-about-global-and-local-work-size/.
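
for reference, the global and local work sizes are just arguments to clEnqueueNDRangeKernel; a fragment (not a complete program; 'queue' and 'kernel' are assumed to exist already, and the 2^17 figure is only what the forum posts above report working, not a queried limit):

    size_t global_work_size[1] = { 1 << 17 };  /* total work-items (2^17) */
    size_t local_work_size[1]  = { 256 };      /* work-items per work-group */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        1,                 /* work dimensions */
                                        NULL,              /* global offset */
                                        global_work_size,
                                        local_work_size,
                                        0, NULL, NULL);    /* no wait events */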

http://people.maths.ox.ac.uk/gilesm/cuda/new_lectures/lec1.pdf says that (higher-end?) GPUs typically have on the order of 1024 (2^10) cores and 128-384 registers per core, 8192 threads (but i think it says these are 32-way SIMD, so really that's 32 independent threads?), 64k shared memory, and can support at least ~2^17 global work size.

On the number of 'cores', http://www.cs.nyu.edu/~lerner/spring12/Preso07-OpenCL.pdf cautions: "An important caveat to keep in mind is that the marketing numbers for core count for NVIDIA and ATI aren't always a good representation of the capabilities of the hardware. For instance, on NVIDIA's website a Quadro 2000 graphics card has 192 'Cuda Cores'. However, we can query the lower-level hardware capabilities using the OpenCL API and what we find is that in reality there are actually 4 compute units, all consisting of 12 stream multiprocessors, and each stream multiprocessor is capable of 4-wide SIMD, 192 = 4*12*4. In the author's opinion this makes the marketing material confusing, since you wouldn't normally think of a hardware unit capable only of executing floating point operations as a 'core'. Similarly, the marketing documentation for a HD6970 (very high end GPU from ATI at time of writing) shows 1536 processing elements, while in reality the hardware has 24 compute units (SIMD engines), and 16 groups of 4-wide processing elements per compute unit. 1536 = 24*16*4."

http://www.cs.nyu.edu/~lerner/spring12/Preso07-OpenCL.pdf also says "Clearly, for small matrices the overhead of OpenCL execution dominates the performance benefits of massively concurrent execution. For our measurements, below a matrix dimension of roughly 150 x 150 the simple multithreaded CPU code outperforms OpenCL"

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0538e/BABGJBFI.html says "The optimal global work size must be large if you want to ensure high performance. Typically the number is several thousand but the ideal number depends on the number of shader cores in your device. To calculate the optimal global work size use the following equation: global work size = <maximum_work-group_size> * <number of shader cores> * <constant>, where constant is typically 4 or 8 for the Mali-T604 GPU"

The Mali-T604 has four shader cores (the later Mali-T658 supports up to eight): "The Mali-T658 design supports up to 8 shader cores, compared with the Mali-T604's four shader cores, and ARM has also doubled the number ..." -- "ARM announces 8-way graphics core", EE Times, Nov 10, 2011, www.eetimes.com/document.asp?doc_id=1270826

so that's 256*4*4 = 4096, which is close to the 2048 number we keep seeing

random note: OpenCL allows the specification of a dependency graph of 'tasks', but 'kernels' executing in different tasks, although they may actually execute in parallel, cannot synchronize and therefore should not try to share memory (http://www.rtcmagazine.com/articles/view/102645)
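
concretely, that dependency graph is expressed with events at enqueue time; a sketch (assuming 'queue', 'kernel_a', 'kernel_b', and a global size array 'gws' already exist):

    /* kernel_b is ordered after kernel_a via an event; beyond this
       ordering the two kernels must not rely on synchronizing with
       each other through shared memory */
    cl_event a_done;
    clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, gws, NULL,
                           0, NULL, &a_done);     /* produces event */
    clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL, gws, NULL,
                           1, &a_done, NULL);     /* waits on kernel_a */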

so what about the maximum local work-group size? From the CL_DEVICE_* values, both the Intel HD 2500 and the ARM Mali-T604 seem to allow 256. But http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf says that this holds only if each thread uses just 4 registers; so i guess that means that if we use 128 registers per thread, the max local work-group size is only 8. So let's put this at 'at least 8', ranging from 8 to 256.
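
note that the per-kernel limit (which folds in things like register pressure) can be queried once the kernel is built, via clGetKernelWorkGroupInfo; a sketch, assuming 'kernel' and 'device' already exist and omitting error checking:

    size_t kernel_wg_size;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_wg_size), &kernel_wg_size, NULL);
    /* kernel_wg_size may be well below CL_DEVICE_MAX_WORK_GROUP_SIZE
       (256 above) if the kernel is register-hungry */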

Compare all this to Cell SPEs http://en.wikipedia.org/wiki/Cell_(microprocessor):

Interestingly, the OpenCL work-group size is the limit on the number of processes that can synchronize and pass information between each other using the shared local memory (GPU-local memory, but not per-processor). This is roughly the same order of magnitude that is achieved in multiprocessor systems with a shared memory (e.g. dual-core (4 threads with hyperthreading), quad-core (8 threads), the 8 SPEs in the Cell processor, the 32-48 cores in the initial Larrabee ( http://en.wikipedia.org/wiki/Larrabee_%28microarchitecture%29 ) (maybe multiply that by 4 because each Larrabee core could run 4 threads, so 128-192 threads), etc). Note that with larger numbers of registers in use, at least on the ARM Mali-T604, this number drops below 256; in fact 256 is reached only when the number of registers used is very small (4): if 8 registers are used, you have 128 threads; if 16 registers, 64 threads; if 32 registers, 32 threads; etc. So this suggests that in general these numbers may be the orders of magnitude to which coherent shared-memory multiprocessing scales: between 8 and 256 independent threads. Beyond this, perhaps supporting atomic, highly consistent (are we talking sequential consistency?) shared memory access becomes inefficient, and it becomes more efficient to fall back to a more relaxed memory consistency model, or to alternative paradigms such as task dependencies and message-passing for IPC, as well as non-communicating data-parallelism for operations which don't need further IPC ("kernels", i guess).
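
for concreteness, here is what 'synchronize through shared local memory within a work-group' looks like in OpenCL C (a made-up neighbor-sum kernel, just to show __local plus barrier; work-items in different work-groups cannot synchronize this way):

    __kernel void neighbor_sum(__global const float *in, __global float *out) {
        __local float tile[256];           /* shared by one work-group */
        int lid = get_local_id(0);
        int gid = get_global_id(0);
        tile[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);       /* all work-items in the group meet here */
        int left = (lid == 0) ? lid : lid - 1;
        out[gid] = tile[lid] + tile[left];
    }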

http://www.slideshare.net/mikeseven/imaging-on-embedded-gp-us-bamm-meetup-20131219 similarly says that although there is a 256-thread limit, it's 64 in practice

http://www.slideshare.net/mikeseven/imaging-on-embedded-gp-us-bamm-meetup-20131219 says the Adreno 330 GPU on the Qualcomm MSM8974 has 128-bit registers, 8k of local memory per core, a 512 work-item maximum, and 1.5MB of on-chip RAM, and says the ARM Mali-T604 has 32k of local memory per core.

Parallax Propeller 2: http://www.rayslogic.com/Propeller2/Propeller2.htm -- 8 cogs. "P2 has 128 kB of HUB RAM, P1 has 32 kB. Both P2 and P1 have 512 longs of COG RAM, but P2 has an additional 256 longs of stack RAM in each cog."

so we can distinguish a few architectural levels here:

note: the number of data items in shared 'local' memory can be relied upon to be at least 512 ('local' meaning not local to each processor, but local to the GPU as opposed to ordinary main memory): that's the 16k minimum value for CL_DEVICE_LOCAL_MEM_SIZE in OpenCL 1.0, divided by the 256-bit (32-byte) register width of Intel HD. However, OpenCL 1.2 raises the minimum to 32k, and in fact the devices i looked at above all seem to have at least that much; that's also a common cache size these days, and many devices have 128-bit registers instead of 256-bit; so each individual device seems to have space for at least 2048 items in local memory, which is another number we saw a lot in the Intel HD OpenCL environment. And even pretty old computers had 4k of memory, which is 2048 words if you have 16-bit words. So let's assume 2048.

so we can rely on at least approximately:

now 128 registers per thread seems excessive; often we'll want more synchronizable parallelism in exchange for less memory per thread, so let's cut that down:

in summary, we can rely on at least approximately:

in bits:

rounding up (to get bit width for addressing for a fixed-length instruction set) and down to powers of 2 (to get the minimal numbers that we can rely upon), in bits:

so max bit width for fixed length addressing:

note: this argues for:

and min that we can rely upon:

in summary, we can rely on at least approximately:

CONCLUSION/NOTE: BASED ON THESE NUMBERS, IT LOOKS LIKE CONNECTION MACHINE-LIKE, ACTIVE-DATA-STYLE PROGRAMMING COULD BE EMULATED BY PRESENT-DAY CONSUMER GPUs!! Even though there are only 2 CPUs in lower-end consumer computers and ~8 GPU units, there's no need to wait for computers with 64k CPUs: since in the paradigm we are targeting, (virtual) processors only need to synchronize with their immediate neighbors, all 64k of them don't need to share a synchronizable local memory, so we can emulate them using data-parallelism.
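
a hedged OpenCL C sketch of that emulation idea: one work-item per virtual processor, each reading only its immediate neighbors, with neighbor 'synchronization' provided by running one (double-buffered) kernel launch per time step rather than by shared local memory, so the virtual-processor count is not bounded by the work-group size (the update rule and buffers are made up for illustration):

    __kernel void ca_step(__global const float *cur, __global float *next, int n) {
        int i = get_global_id(0);           /* virtual processor index */
        if (i >= n) return;
        int l = (i + n - 1) % n;            /* left neighbor (wraps around) */
        int r = (i + 1) % n;                /* right neighbor (wraps around) */
        next[i] = 0.25f * cur[l] + 0.5f * cur[i] + 0.25f * cur[r];
    }

the host would enqueue this kernel once per step with a global work size of n (e.g. 64k), swapping 'cur' and 'next' between steps.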

SUGGEST RE-TARGETING OOT AT OPENCL

---

interesting history/economics on integrated graphics: http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested

---

GPU program size limits:

https://www.opengl.org/discussion_boards/showthread.php/182182-Shader-Program-size-limit https://www.reddit.com/r/opengl/comments/2n8mjz/size_limits_for_shaders_on_various_gpus/ https://stackoverflow.com/questions/2617957/glsl-maximum-number-of-instructions

looks like on OpenGL 3.x (SM4), we can have 64k instructions in a pixel shader (but fewer in vertex shaders, or on OpenGL 2.x).
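
note: core OpenGL does not expose a queryable GLSL instruction-count limit, so the figures above come from vendor docs and forum reports; what you can query are related resource limits, e.g. this sketch (assuming a current GL context and headers that define the GL 2.0+ enums):

    #include <stdio.h>
    #include <GL/gl.h>

    void query_shader_limits(void) {
        GLint frag_uniform_components = 0, vert_uniform_components = 0;
        glGetIntegerv(GL_MAX_FRAGMENT_UNIFORM_COMPONENTS, &frag_uniform_components);
        glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS, &vert_uniform_components);
        printf("fragment uniform components: %d\n", frag_uniform_components);
        printf("vertex uniform components: %d\n", vert_uniform_components);
        /* GLSL instruction limits themselves are only revealed indirectly,
           by whether a given shader compiles and links on a given driver */
    }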

---

https://www.digikey.com/en/supplier-centers/a/adapteva

Epiphany (18 cores in their Parallella board, 1024 cores in the Epiphany-V chip)

The Epiphany is presumably scalable -- they claim it should scale to a billion processors -- but so far there hasn't been evidence of more than 1024. Supposedly even the Parallela boards can be connected to each other: "The architecture also allows parallella boards to be combined into a cluster with a fast inter-chip 'eMesh' interconnect, extending the logical grid of cores (creating almost unlimited scaling potential)."

The Epiphany-V local stores are 64k.

---