could also look at the resources available to each processor, or even each 'thread', within the GPU in low-end GPGPU systems. Intel's integrated graphics (i.e. GPUs integrated into the package or die of the CPU) apparently began supporting the OpenCL standard with the Ivy Bridge generation, of which a lower-end GPU variant was the Intel HD Graphics 2500.
NOTE: I DON'T KNOW ANYTHING ABOUT GPUS OR GPGPU OR OPENCL YET; MY UNDERSTANDING OF THESE NUMBERS IS SEVERELY LACKING, SO SOME OF THE FOLLOWING INTERPRETATIONS MAY BE WRONG!
For the Intel HD 2500, https://compubench.com/device-info.jsp?config=12921360 reports:
  CL_DEVICE_IMAGE2D_MAX_HEIGHT 16384
  CL_DEVICE_IMAGE2D_MAX_WIDTH 16384
  CL_DEVICE_IMAGE3D_MAX_DEPTH 2048
  CL_DEVICE_IMAGE3D_MAX_HEIGHT 2048
  CL_DEVICE_IMAGE3D_MAX_WIDTH 2048
  CL_DEVICE_LOCAL_MEM_SIZE 65536
  CL_DEVICE_MAX_COMPUTE_UNITS 6
  CL_DEVICE_MAX_PARAMETER_SIZE 1024
  CL_DEVICE_MAX_READ_IMAGE_ARGS 128
  CL_DEVICE_MAX_SAMPLERS 16
  CL_DEVICE_MAX_WORK_GROUP_SIZE 256
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES (256,256,256)
And for the ARM Mali-T604, which is the GPU in the Google Nexus 10,
https://jogamp.org/bugzilla/attachment.cgi?id=581&action=edit
  CL_DEVICE_LOCAL_MEM_SIZE: 32768
  CL_DEVICE_MAX_COMPUTE_UNITS: 4
  CL_DEVICE_IMAGE2D_MAX_HEIGHT: 65536
  CL_DEVICE_IMAGE2D_MAX_WIDTH: 65536
  CL_DEVICE_IMAGE3D_MAX_DEPTH: 65536
  CL_DEVICE_IMAGE3D_MAX_HEIGHT: 65536
  CL_DEVICE_IMAGE3D_MAX_WIDTH: 65536
  CL_DEVICE_MAX_PARAMETER_SIZE: 1024
  CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
  CL_DEVICE_MAX_SAMPLERS: 16
  CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
  CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
  CL_DEVICE_MAX_WORK_ITEM_SIZES: [256, 256, 256]
  CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
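(these CL_DEVICE_* limits come from clGetDeviceInfo; here is a minimal, untested sketch of querying a few of them yourself, assuming the standard OpenCL 1.x C host API and the <CL/cl.h> header (on Apple platforms the header is <OpenCL/opencl.h>); error checking omitted)

  #include <stdio.h>
  #include <CL/cl.h>

  int main(void) {
      cl_platform_id platform;
      cl_device_id device;
      /* take the first platform and its first GPU device */
      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

      cl_uint compute_units;
      cl_ulong local_mem;
      size_t max_wg_size;
      size_t max_item_sizes[3];   /* both devices above report 3 dimensions */

      clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                      sizeof(compute_units), &compute_units, NULL);
      clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                      sizeof(local_mem), &local_mem, NULL);
      clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                      sizeof(max_wg_size), &max_wg_size, NULL);
      clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                      sizeof(max_item_sizes), max_item_sizes, NULL);

      printf("CL_DEVICE_MAX_COMPUTE_UNITS: %u\n", compute_units);
      printf("CL_DEVICE_LOCAL_MEM_SIZE: %lu\n", (unsigned long)local_mem);
      printf("CL_DEVICE_MAX_WORK_GROUP_SIZE: %zu\n", max_wg_size);
      printf("CL_DEVICE_MAX_WORK_ITEM_SIZES: (%zu,%zu,%zu)\n",
             max_item_sizes[0], max_item_sizes[1], max_item_sizes[2]);
      return 0;
  }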
http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf :
" The maximum number of threads (256) stems from the fact that each thread can use 4 registers from a register bank that contains 1024 registers. The larger the number of registers used by the kernel, the fewer the concurrent threads. So if a kernel uses 8 registers, only a maximum of 128 threads can run in parallel. If there are enough threads to hide latency there should be no performance implication of using more registers. "
(so i guess if a kernel used 128 registers per thread, you could only have 8 concurrent threads)
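a tiny sketch of that trade-off, using the 1024-register bank figure from the FAQ quoted above (purely illustrative arithmetic, not an API call):

  #include <stdio.h>

  int main(void) {
      const int bank_registers = 1024;   /* registers in the Mali-T604 bank, per the ARM FAQ */
      int regs_per_thread;
      /* concurrent threads = bank size / registers used per thread */
      for (regs_per_thread = 4; regs_per_thread <= 128; regs_per_thread *= 2) {
          printf("%3d registers/thread -> %4d concurrent threads\n",
                 regs_per_thread, bank_registers / regs_per_thread);
      }
      return 0;
  }
  /* prints 4 -> 256, 8 -> 128, ..., 128 -> 8 */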
according to http://www.realworldtech.com/ivy-bridge-gpu/4/ , Ivy Bridge has 32 KB L1 icaches, 8 threads per core (is this comparable to the 10 EUs per sub-slice in Haswell, see below? probably not, see http://en.wikipedia.org/wiki/Intel_HD_and_Iris_Graphics#Ivy_Bridge; maybe this is comparable to the SIMD parallelism, see below, in which case i'd hesitate to call it a 'thread'; or maybe this is the number of sub-slices per slice), and 1K general registers (32 KB of register memory)
see also http://events-tce.technion.ac.il/files/2013/07/Michael.pdf
" Vertex shading commonly uses SIMD4x2, with 4 data elements from 2 vertices. Pixel shading is SIMD1x8 or SIMD1x16 (aka SIMD8 or SIMD16), operating on a single color from 8 or 16 pixels simultaneously. Media shaders are similar to pixel shaders, except they are packed even more densely with 8-bit data, rather than the 32-bit data used in graphics shaders. To support all these different execution modes, the GRF is incredibly versatile.
Registers are each 256-bits wide, which is perfectly suited for SIMD2x4 or SIMD8. In a 16B aligned mode, instructions operate on 4-component RGBA data, with source swizzling and destination masking. In a 1B aligned mode, instructions use region-based addressing to perform a 2-dimension gather from the register file and swizzling and destination masking are disabled. This is critical for good media performance, where 1B data is packed together for maximum density. Collectively, these two addressing modes also simplify converting from AOS to SOA data structures.
Each thread is allocated 128 general purpose registers, so the GRF has expanded to 32KB to handle 8 threads. The GRF has also been enhanced to handle larger 8B accesses that are necessary for double precision computation. "
in other words, each thread has 4 KB worth of registers available to it, but only in the form of 128 256-bit registers.
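a sketch of that arithmetic (reading the 32 KB GRF in the quote as being per EU is my interpretation):

  #include <stdio.h>

  int main(void) {
      const int regs_per_thread = 128;
      const int bytes_per_reg   = 256 / 8;   /* 256-bit registers = 32 bytes */
      const int threads_per_eu  = 8;
      int bytes_per_thread = regs_per_thread * bytes_per_reg;    /* 4096 = 4 KB */
      int grf_bytes        = bytes_per_thread * threads_per_eu;  /* 32768 = 32 KB */
      printf("%d bytes of registers per thread, %d-byte GRF per EU\n",
             bytes_per_thread, grf_bytes);
      return 0;
  }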
http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2
looks like the fundamental component is the EU, which is grouped into 'sub-slices' (Intel), 'GCN compute units' (AMD), or 'Kepler SMXs' (Nvidia). there is a table on http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2 showing their capabilities in the Haswell system generation. They are SIMD units whose parallelism ranges from 8-wide (actually, 2x4 dual issue) to 32-wide, and each sub-slice/compute unit/SMX has from 4-10 EUs (4-10 SIMD units), for a total of 64-192 ALUs per sub-slice/compute unit/SMX
there are up to 4 sub-slices total, or 40 EUs, in Haswell (the Intel HD 2000 had 6 EUs and the Intel HD 3000 had 12 EUs), with peak FP ops per core/EU at around 16.
note that EUs are also called 'cores'?
so the interesting numbers relating to the amount of memory available per core are: assuming a word size of 256 bits (32 bytes), we see numbers from 128 words (4 KB) to 512 words (16 KB), and up to 2048 words (the 65536-byte CL_DEVICE_LOCAL_MEM_SIZE of the Intel HD 2500, divided by 32 bytes per word).
numbers relating to the parallelism hardware are on the order of 8.
numbers relating to maximum image size are greater than 2048.
other numbers relating to parallelism in the OpenCL environment are one 8 and a bunch of things close to 256.
in summary, we see the following critical values (approximate): around 8 for hardware parallelism per unit, around 128-2048 32-byte words (4-64 KB) of memory per core, around 256 for work-group and work-item sizes, and 2048 or more for image dimensions.
http://stackoverflow.com/questions/3957125/questions-about-global-and-local-work-size says that the actual software-exposed parallelism is the 'global work size', but i don't understand how to find the max global work size for a given GPU. Maybe you can't do that without actually creating the kernel, because it depends on exactly how much memory (even how many registers) the kernel is using, etc. http://www.khronos.org/message_boards/showthread.php/9207-Determine-global_work_size suggests this is on the order of 2^17. See also http://www.khronos.org/message_boards/showthread.php/6060-clEnqueueNDRangeKernel-max-global_work_size , https://devtalk.nvidia.com/default/topic/477978/questions-about-global-and-local-work-size/.
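consistent with the idea that the limit depends on the actual kernel, OpenCL does let you query a per-kernel work-group limit once the kernel exists, via clGetKernelWorkGroupInfo; a minimal sketch (untested, error checking omitted, assuming 'queue', 'kernel' and 'device' were created elsewhere), where the global work size is just whatever you pass to clEnqueueNDRangeKernel, rounded up to a multiple of the local size:

  #include <CL/cl.h>

  cl_int enqueue_1d(cl_command_queue queue, cl_kernel kernel,
                    cl_device_id device, size_t n)
  {
      size_t local_size = 0;
      /* max work-group size for THIS kernel on THIS device (depends on its
         register/local-memory usage) */
      clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(local_size), &local_size, NULL);

      /* round the requested global work size up to a multiple of local_size;
         the kernel itself should guard with if (get_global_id(0) < n) */
      size_t global_size = ((n + local_size - 1) / local_size) * local_size;

      return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size, 0, NULL, NULL);
  }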
http://people.maths.ox.ac.uk/gilesm/cuda/new_lectures/lec1.pdf says that (higher-end?) GPUs typically have on the order of 1024 (2^10) cores and 128-384 registers per core, 8192 threads (but i think it says these are 32-way SIMD, so really that's 32 independent threads?), 64k shared memory, and can support at least ~2^17 global work size.
On the number of 'cores', http://www.cs.nyu.edu/~lerner/spring12/Preso07-OpenCL.pdf cautions: "An important caveat to keep in mind is that the marketing numbers for core count for NVIDIA and ATI aren't always a good representation of the capabilities of the hardware. For instance, on NVIDIA's website a Quadro 2000 graphics card has 192 'Cuda Cores'. However, we can query the lower-level hardware capabilities using the OpenCL API and what we find is that in reality there are actually 4 compute units, all consisting of 12 stream multiprocessors, and each stream multiprocessor is capable of 4-wide SIMD, 192 = 4*12*4. In the author's opinion this makes the marketing material confusing, since you wouldn't normally think of a hardware unit capable only of executing floating point operations as a 'core'. Similarly, the marketing documentation for a HD6970 (very high end GPU from ATI at time of writing) shows 1536 processing elements, while in reality the hardware has 24 compute units (SIMD engines), and 16 groups of 4-wide processing elements per compute unit. 1536 = 24*16*4."
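just to make the factorizations in that quote explicit (the layouts are as described in the quoted paper, not something i've verified):

  #include <stdio.h>

  int main(void) {
      /* NVIDIA Quadro 2000: 4 compute units * 12 stream multiprocessors * 4-wide SIMD */
      printf("Quadro 2000: %d 'CUDA cores'\n", 4 * 12 * 4);       /* 192 */
      /* ATI HD6970: 24 compute units * 16 groups * 4-wide processing elements */
      printf("HD6970: %d 'processing elements'\n", 24 * 16 * 4);  /* 1536 */
      return 0;
  }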
http://www.cs.nyu.edu/~lerner/spring12/Preso07-OpenCL.pdf also says "Clearly, for small matrices the overhead of OpenCL execution dominates the performance benefits of massively concurrent execution. For our measurements, below a matrix dimension of roughly 150 x 150 the simple multi-threaded CPU code outperforms OpenCL"
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0538e/BABGJBFI.html says "The optimal global work size must be large if you want to ensure high performance. Typically the number is several thousand but the ideal number depends on the number of shader cores in your device. To calculate the optimal global work size use the following equation: global work size = <maximum_work-group_size> * <number of shader cores> * <constant>, where constant is typically 4 or 8 for the Mali-T604 GPU"
The Mali-T604 seems to have 4-8 shader cores: "ARM announces 8-way graphics core" (EE Times).
so that's 256 (max work-group size) * 4 (shader cores) * 4 (constant) = 4096, which is close to the 2048 number we keep seeing
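the ARM heuristic above, written out as a toy sketch (the 256 / 4 / 4 values are just the Mali-T604 numbers from the preceding lines):

  #include <stdio.h>

  int main(void) {
      const size_t max_work_group_size = 256;  /* CL_DEVICE_MAX_WORK_GROUP_SIZE */
      const size_t shader_cores        = 4;    /* Mali-T604 has 4-8 shader cores */
      const size_t constant            = 4;    /* ARM suggests 4 or 8 for the T604 */
      /* global work size = max work-group size * number of shader cores * constant */
      printf("suggested global work size: %zu\n",
             max_work_group_size * shader_cores * constant);   /* 4096 */
      return 0;
  }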
random note: OpenCL?