great article on Julia GPU programming:
[1]
some notes:
- main package is GPUArrays
- lots of great sample code, including profiling example code
- "One might think that the GPU performance suffers from being written in a dynamic language like Julia, but Julia's GPU performance should be pretty much on par with the raw performance of CUDA or OpenCL. Tim Besard did a great job at integrating the LLVM Nvidia compilation pipeline to achieve the same – or sometimes even better – performance as pure CUDA C code. Tim published a pretty detailed blog post in which he explains this further. CLArrays approach is a bit different and generates OpenCL C code directly from Julia, which has the same performance as OpenCL C!"
- "GPUs have their own memory space with video memory (VRAM), different caches, and registers. Whatever you do, any Julia object must get transferred to the GPU before you can work with it. Not all types in Julia work on the GPU."
- a plain-data type is called an 'isbits' type: an immutable type (e.g. an immutable struct or tuple) whose fields are all themselves isbits. Anything containing a heap-allocated reference (an Array, a String, a mutable struct, ...) is not isbits
- types with the 'isbits' property "can be used without constraints on the GPU"
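A quick way to check the property in plain Julia (no GPU needed) is `Base.isbitstype`; a minimal sketch:

```julia
struct Point        # immutable struct with only isbits fields
    x::Float32
    y::Float32
end

isbitstype(Point)             # true: plain data, GPU-friendly
isbitstype(Tuple{Int, Point}) # true: tuples of isbits types are isbits
isbitstype(Vector{Float32})   # false: an Array is a heap-allocated reference
isbitstype(String)            # false: heap-allocated
```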
- GPUArray constructors: array literals, fill, rand, ranges (1:10 syntax)
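A sketch of those constructors, assuming the CuArrays backend used in the article (CLArrays works the same way); the `rand` signature here is my assumption from the article's style, the `fill` form is the one quoted below:

```julia
using CuArrays  # or CLArrays; both are GPUArrays backends

a = CuArray([1f0, 2f0, 3f0])    # upload a CPU array literal
b = fill(CuArray, 0f0, (4, 4))  # 4x4 array of Float32 zeros
c = CuArray(1:10)               # from a range, using 1:10 syntax
r = rand(CuArray, Float32, 32)  # 32 random Float32s (assumed signature)
```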
- "Julia's fusing dot broadcasting notation. This notation allows you to apply a function to each element of an array, and create a new array out of the return values of f. This functionality is usually referred to as a map. The broadcast refers to the fact that arrays with different shapes get broadcasted to the same shape." example:

```julia
x = zeros(4, 4) # 4x4 array of zeros
y = zeros(4)    # 4 element array
z = 2           # a scalar
```
- y's 1st dimension gets repeated for the 2nd dimension in x
- and the scalar z gets repeated for all dimensions
- `x .+ y .+ z` is equivalent to `broadcast(+, broadcast(+, x, y), z)`
more: https://julialang.org/blog/2018/05/extensible-broadcast-fusion
https://julia.guide/broadcasting
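The shapes above work out like this in plain Julia (runs on CPU arrays; a minimal sketch):

```julia
x = zeros(4, 4)  # 4x4 array of zeros
y = ones(4)      # 4 element array
z = 2            # a scalar

# y is treated as a column and repeated along x's 2nd dimension,
# z is repeated for all dimensions; equivalent to
# broadcast(+, broadcast(+, x, y), z)
result = x .+ y .+ z   # 4x4 array, every element == 3.0
```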
- "This means any Julia function that runs without allocating heap memory (only creating isbits types), can be applied to each element of a GPUArray and multiple dot calls will get fused into one kernel call. As kernel call latency is high, this fusion is a very important optimization."
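For example, a scalar function that allocates no heap memory can be broadcast over a GPUArray, and chained dot calls fuse into a single kernel launch. A sketch, assuming the CuArrays backend:

```julia
using CuArrays

# heap-allocation-free scalar function: only isbits values involved
sigmoid(x) = 1f0 / (1f0 + exp(-x))

xs = CuArray(rand(Float32, 1024))
ys = similar(xs)

# the whole right-hand side fuses into one kernel call
ys .= sigmoid.(xs) .* 2f0 .+ 1f0
```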
- " Some more operations supported by GPUArrays:
Conversions and copy! to CPU arrays
multi dimensional indexing and slicing (xs[1:2, 5, :])
permutedims
Concatenation (vcat(x, y), cat(3, xs, ys, zs))
map, fused broadcast (zs .= xs.^2 .+ ys .* 2)
fill(CuArray, 0f0, dims), fill!(gpu_array, 0)
Reduction over dimensions (reduce(+, xs, dims = 3), sum(x -> x^2, xs, dims = 1))
Reduction to scalar (reduce(*, xs), sum(xs), prod(xs))
Various BLAS operations (matrix*matrix, matrix*vector)
FFTs, using the same API as julia's FFT" (note: lots of hyperlinks in there in the original)
- to launch an arbitrary (GPU-compatible) kernel, use gpu_call. The kernel is called many times in parallel with the given arguments; it doesn't have to be just a map, since it can index into the whole array, and into multiple arrays if you pass them as arguments. The `A::GPUArray` argument isn't just for sizing the thread count; per the docs, it dispatches to the correct backend and supplies default launch parameters: "gpu_call. It can be called as gpu_call(kernel, A::GPUArray, args) and will call kernel with the arguments (state, args...) on the GPU. State is a backend specific object to implement functionality like getting the thread index. A GPUArray needs to get passed as the second argument to dispatch to the correct backend and supply the defaults for the launch parameters."
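A sketch of a gpu_call kernel in the spirit of that API; `linear_index(state)` is assumed here to be the backend helper for getting this thread's global index (name per the GPUArrays docs of that era):

```julia
using GPUArrays, CuArrays

# the kernel receives the backend-specific `state` first, then our args
function vadd_kernel(state, a, b, c)
    i = linear_index(state)   # this thread's global index (assumed helper)
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end

a = CuArray(rand(Float32, 1024))
b = CuArray(rand(Float32, 1024))
c = similar(a)

# second argument (a GPUArray) picks the backend and default launch parameters
gpu_call(vadd_kernel, c, (a, b, c))
```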
---
" I had some extended notes here about "less-mainstream paradigms" and/or "things I wouldn't even recommend pursuing", but on reflection, I think it's kinda a bummer to draw too much attention to them. So I'll just leave it at a short list: actors, software transactional memory, lazy evaluation, backtracking, memoizing, "graphical" and/or two-dimensional languages, and user-extensible syntax. If someone's considering basing a language on those, I'd .. somewhat warn against it. Not because I didn't want them to work -- heck, I've tried to make a few work quite hard! -- but in practice, the cost:benefit ratio doesn't seem to turn out really well. Or hasn't when I've tried, or in (most) languages I've seen. " [2]
---
" Heterogeneous memory and parallelism
These are languages that try to provide abstract "levels" of control flow and data batching/locality, into which a program can cast itself, to permit exploitation of heterogeneous computers (systems with multiple CPUs, or mixed CPU/GPUs, or coprocessors, clusters, etc.)
Languages in this space -- Chapel, Manticore, Legion -- haven't caught on much yet, and seem to be largely overshadowed by manual, not-as-abstract or not-as-language-integrated systems: either cluster-specific tech (like MPI) or GPU-specific tech like OpenCL/CUDA. But these still feel clunky, and I think there's a potential for the language-supported approaches to come out ahead in the long run. " [3]