Bayle Shanks's website: proj-oot-ootConcurrencyNotes8

https://tokio.rs/blog/2019-10-scheduler/ https://news.ycombinator.com/item?id=21249708

---

https://blog.rust-lang.org/2019/11/07/Async-await-stable.html https://github.com/rust-lang/async-book https://rust-lang.github.io/compiler-team/working-groups/async-await/

" async fn first_function() -> u32 { .. } ... let result: u32 = future.await; ... This example shows the first difference between Rust and other languages: we write future.await instead of await future. This syntax integrates better with Rust's ? operator for propagating errors (which, after all, are very common in I/O). You can simply write future.await? to await the result of a future and propagate errors. It also has the advantage of making method chaining painless. Zero-cost futures

The other difference between Rust futures and futures in JS and C# is that they are based on a "poll" model, which makes them zero cost. In other languages, invoking an async function immediately creates a future and schedules it for execution: awaiting the future isn't necessary for it to execute. But this implies some overhead for each future that is created.

In contrast, in Rust, calling an async function does not do any scheduling in and of itself, which means that we can compose a complex nest of futures without incurring a per-future cost. As an end-user, though, the main thing you'll notice is that futures feel "lazy": they don't do anything until you await them. " -- https://blog.rust-lang.org/2019/11/07/Async-await-stable.html

Dowwie 6 hours ago [-]

This is a major milestone for Rust usability and developer productivity.

It was really hard to build asynchronous code until now. You had to clone objects used within futures. You had to chain asynchronous calls together. You had to bend over backwards to support conditional returns. Error messages weren't very explanatory. You had limited access to documentation and tutorials to figure everything out. It was a process of walking over hot coals before becoming productive with asynchronous Rust.

fooyc 5 hours ago [-]

This is a big improvement, however this is still explicit/userland asynchronous programming: If anything down the callstack is synchronous, it blocks everything. This requires every components of a program, including every dependency, to be specifically designed for this kind of concurency.

Async I/O gives awesome performance, but further abstractions would make it easier and less risky to use. Designing everything around the fact that a program uses async I/O, including things that have nothing to do with I/O, is crazy.

Programming languages have the power to implement concurrency patterns that offer the same kind of performances, without the hassle.

brandonbloom 2 hours ago [-]

Yup. See https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/ for a pretty good explanation of the async/await problem.

...

An alternative design would have kept await, but supported async not as a function-modifier, but as an expression modifier. Unfortunately, as the async compiler transform is a static property of a function, this would break separate compilation. That said, I have to wonder if the continuously maturing specialization machinery in Rustc could have been used to address this. IIUC, they already have generic code that is compiled as an AST and specialized to machine code upon use, then memoized for future use. They could specialize functions by their async usage and whether or not any higher-order arguments are themselves async. It might balloon worst-case compilation time by ~2X for async/non-async uses (more for HOF), but that's probably better than ballooning user-written code by 2X.

ameixaseca 1 hour ago [-]

Async adds a runtime requirement, so Rust cannot just take the same approach as Go. You only have a scheduler if you instantiate one. And having a runtime or not has nothing to do with the C abstract machine, but with precise control on what the code does.

For instance, Go does not give you the option to handle this yourself: you cannot write a library for implementing goroutines or a scheduler, since it's embed in every program. That's why it's called runtime. In Rust, every bit of async implementation (futures, schedulers, etc) is a library, with some language support for easing type inference and declarations. This should already tell you why they took this approach.

Regarding async/await and function colors (from the article you posted), I would much rather prefer Rust to use an effect system for implementing this. However, since effects are still much into research and there is no major language which is pushing on this direction (maybe OCaml in a few years?) it seems like a long shot for now.

MuffinFlavored? 5 hours ago [-]

For JavaScript? developers expecting to jump over to Rust and be productive now that async/await is stable:

I'm pretty sure the state of affairs for async programming is still a bit "different" in Rust land. Don't you need to spawn async tasks into an executor, etc.?

Coming from JavaScript?, the built in event-loop handles all of that. In Rust, the "event loop" so to speak is typically a third party library/package, not something provided by the language/standard itself.

steveklabnik 5 hours ago [-]

> I'm pretty sure the state of affairs for async programming is still a bit "different" in Rust land.

There are some differences, yes.

> Don't you need to spawn async tasks into an executor, etc.?

Correct, though many executors have added attributes you can tack onto main that do this for you via macro magic, so it'll feel a bit closer to JS.

ComputerGuru? 7 hours ago [-]

I’ve been playing with async await in a polar opposite vertical than its typical use case (high tps web backends) and believe this was the missing piece to further unlock great ergonomic and productivity gains for system development: embedded no_std.

Async/await lets you write non-blocking, single-threaded but highly interweaved firmware/apps in allocation-free, single-threaded environments (bare-metal programming without an OS). The abstractions around stack snapshots allow seamless coroutines and I believe will make rust pretty much the easiest low-level platform to develop for.

vanderZwan 7 hours ago [-]

Have you ever heard of Esterel or Céu? They follow the synchronous concurrency paradigm, which apparently has specific trade-offs that give it great advantages on embedded (IIRC the memory overhead per Céu "trail" is much lower than for async threads (in the order of bytes), fibers or whatnot, but computationally it scales worse with the nr of trails).

Céu is the more recent one of the two and is a research language that was designed with embedded systems in mind, with the PhD? theses to show for it [2][3].

I wish other languages would adopt ideas from Céu. I have a feeling that if there was a language that supports both kinds of concurrency and allows for the GALS approach (globally asynchronous (meaning threads in this context), locally synchronous) you would have something really powerful on your hands.

EDIT: Er... sorry, this may have been a bit of an inappropriate comment, shifting the focus away from the Rust celebration. I'm really happy for Rust for finally landing this! (but could you pretty please start experimenting with synchronous concurrency too? ;) )

[0] http://ceu-lang.org/

[1] https://en.wikipedia.org/wiki/Esterel

[2] http://ceu-lang.org/chico/ceu_phd.pdf

[3] http://sunsite.informatik.rwth-aachen.de/Publications/AIB/20...

Kaladin 6 hours ago [-]

Is there any resources you could point to learn more about how to use this async programming with?

Dowwie 6 hours ago [-]

https://rust-lang.github.io/async-book

https://book.async.rs

https://tokio.rs

---

5.2 Using virtual machine abstraction to supportconcurrency

One abstract extension to a VM that was found to providegreatoverallvaluetothesystemisnonpreemptive,user-levelconcurrency. As shown in Fig.11, a VM that maintains itsown concurrency has advantages over a real-time operatingsystem.SwitchingcontextsinaVMcanbedoneinonevirtualinstruction? whereas a native RTOS-driven system requiressaving and restoring contexts, which may involve dozens of variables and the saving of a task state. Also, a concurrentVM is more controlled than a real-time operating systemtask switch, which can occur at almost any time requiringcritical regions of code to protect against race conditionsand errors. ...

.2.1 Implementing a task instruction to support concurrency New instructions were added to the Pascal Virtual Machineto support multitasking and message passing as high levelabstract instructions. A new task instruction was added tothe interpreter to support multitasking along with the neces-sary Pascal language extension as follows:

status := task(function); function: TASKINIT=0: status: 1=parent, 2=child TASKPEND=1: status: is new task ID TASKWHO=2: status: is my task ID TASKKILL=3: No status

With this new special function call, new tasks could becreated using the subfunction TASKINIT which performs asimilar function as the UNIX fork() command within a sin-glePCodeinstruction.ModelingtaskcreationaftertheUNIXforkcommandwasamatterofconvenienceforthisresearch?,although itdoes have some advantages.

...

Theinterpreter was modiﬁed to support up to ﬁve tasks (an arbi-trary choice that depends on the memory available). When anew task is created, an exact duplicate of the working stack of the parent task is created and a new entry is made in aTaskState table for the new child task. The return value fromthis function call indicates whether the currently executingtask is still the parent task (1) or the child task (0) that is anexact copy.Thereturnvaluecanthenbeusedtocauseeachofthetasksto follow their own execution path depending on whether itis the parent or the child task. As shown in Fig. 12, when a new task is created only four values need to be rememberedto save and restore the task; namely the PC, SP, MP, andEP registers. PC is the program counter, which is saved toremember what instruction this task was executing when itwas suspended so that it can pick up where it left off whenit starts running again. SP is the stack pointer which indi-cates where the current top of the stack is for this task. Sinceeach new task has its own copy of the stack, each task mustremember where that is in memory for its own stack. MP isthe mark pointer which indicates where the base of the cur-rentstackframeis.EPistheextremepointerwhichindicatesthetopofthecurrentstackframesothattheVMknowswhereto? put the next stack frame if another procedure or functionis called from the current context. There is no need to saveany working registers because the PMachine is stack basedand nothing else is stored in registers between instructionsbesideswhatisonthestack,makingtaskswitchingrelativelyfast and efﬁcient.Another subfunction, TASKPEND, is to allow the cur-rentlyexecutingtasktovoluntarilygiveupthevirtualproces-sorandallowadifferenttasktorun. ...

TASKKILL can becalledfor a task to permanently kill itself and allow other tasks torun instead

5.3 Implementing a mail instruction to support concurrencyA second instruction was implemented to support messagepassing between tasks in the VM. The mail() instruction wascreated to handle this functionality:

Status := mail(function, parameter1, parameter2, parameter3); Function: MAILSEND=0 parameter1: destination task ID parameter2: length of message to send parameter3: buffer address of message to send

   MAILRCV=1
     parameter1: my task ID
     parameter2: size of message buffer for receiving
     parameter3: buffer address where to put received message

... In this implementation, a callby a task to receive mail will cause the task to pend if nomail is available. The task then becomes ready to run againwhen mail arrives. The task that sent the mail continues torun until it voluntarily gives up the processor or pends onits own mailbox. This implementation was chosen becausethe native operating system used in the timing comparisonperforms this way allowing the timing comparison to be fair.Another faster implementation might be to allow the receiv-ing task to wake up immediately which would make the sys-tem appear to switch tasks faster but is not commonly howembedded operating systems work.

...

Others showed that VMs are capable of doing things thatnative machine hardware is not effective at doing by imple-menting VMs that do not emulate real machines. The Esterel and the Harrison VMs are good examples of this [10,41]. Additionally,other researchshowed thatabstractionisawayto increase performance in a VM as shown by the Carbayo-nia and Hume VMs [18,19].

10. Craig ID (2005) Virtual Machines. Springer, New York 41. Plummer B, Khajanchi M, Edwards SA (2006) An Esterel VirtualMachine? for Embedded Systems. In: Proceedings of synchronouslanguages, applications, and programming (SLAP 2006) 18. Gutierrez DA, Soler FO (2008) Applying lightweight ﬂexible vir-tual machines to extensible embedded systems. In: Proceedings of the1stworkshoponisolationandintegrationinembeddedsystems.ACM, Glasgow, pp 23–28 19. Hammond K, Michaelson G (2003) Hume: a domain-speciﬁc lan-guage for real-time embedded systems. In: Proceedings of the 2ndinternational conference on generative programming and compo-nent engineering. Springer, Erfurt, pp 37–56 "

-- https://www.academia.edu/26447075/A_real-time_virtual_machine_implementation_for_small_microcontrollers

---

" Procedures

Procedures can be invoked in four ways – normal, call, process, and run.

The normal invocation invokes a procedure in the normal way any language invokes a routine, by suspending the calling routine until the invoked procedure returns.

The call mechanism invokes a procedure as a coroutine. Coroutines have partner tasks, where control is explicitly passed between the tasks by means of a CONTINUE instruction. These are synchronous processes.

The process mechanism invokes a procedure as an asynchronous task and in this case a separate stack is set up starting at the lexical level of the processed procedure. As an asynchronous task, there is no control over exactly when control will be passed between the tasks, unlike coroutines. Note also that the processed procedure still has access to the enclosing environment and this is a very efficient IPC (Inter Process Communication) mechanism. Since two or more tasks now have access to common variables, the tasks must be synchronized to prevent race conditions, which is handled by the EVENT data type, where processes can WAIT on an event until they are caused by another cooperating process. EVENTs also allow for mutual exclusion synchronization through the PROCURE and LIBERATE functions. If for any reason the child task dies, the calling task can continue – however, if the parent process dies, then all child processes are automatically terminated. On a machine with more than one processor, the processes may run simultaneously. This EVENT mechanism is a basic enabler for multiprocessing in addition to multitasking. Run invocation type The last invocation type is run. This runs a procedure as an independent task which can continue on after the originating process terminates. For this reason, the child process cannot access variables in the parent's environment, and all parameters passed to the invoked procedure must be call-by-value. " -- https://en.wikipedia.org/wiki/Burroughs_large_systems

---

orisho 14 hours ago [-

The problem that's described here - "green" threads being CPU bound for too long and causing other requests to time out is one that is common to anything that uses an event loop and is not unique to gevent. node.js also suffers from this.

Rachel says that a thread is only ever doing one thing at a time - it is handling one request, not many. But that's only true when you do CPU bound work. There is no way to write code using blocking IO-style code without using some form of event loop (gevent, async/await). You cannot spin up 100K native threads to handle 100K requests that are IO bound (which is very common in a microservice architecture, since requests will very quickly block on requests to other services). Or well, you can, but the native thread context switch overhead is very quickly going to grind the machine to a halt as you grow.

I'm a big fan of gevent, and while it does have these shortcomings - they are there because it's all on top of Python, a language which started out with the classic async model (native threads), rather than this model.

Golang, on the other hand, doesn't suffer from them as it was designed from the get-go with this threading model in mind. So it allows you to write blocking style code and get the benefits of an event loop (you never have to think about whether you need to await this operation). And on the other hand, goroutines can be preempted if they spend too long doing CPU work, just like normal threads.

jashmatthews 5 hours ago [-]

> You cannot spin up 100K native threads to handle 100K requests that are IO bound (which is very common in a microservice architecture, since requests will very quickly block on requests to other services). Or well, you can, but the native thread context switch overhead is very quickly going to grind the machine to a halt as you grow.

Why? Bigger MySQL? servers regularly use 10k native pthreads which spend most of their time waiting on IO.

Context switch overhead isn’t linear with the number of threads! The Linux scheduler runs O(logN). Moving from 10k MySQL? threads to 1 thread per core will give you less than 10% extra throughput.

Doing IO means context switching. Goroutines don’t get a free pass here.

orisho 5 hours ago [-]

There are other costs to regular context switching as opposed to goroutines/greenlets (the green threads that gevent uses). I don't remember the details but specific attention was paid to the point of making context switching and other resource consumption by these green threads cheaper than native threads, so I suggest reading about it in greenlet/Golang docs :) You can also try searching for C10K which was the term people used to discuss how to achieve 10K connections, and is often associated with cooperative threading.

For example, the cost of the context switch itself (storing all registers) is more significant with native threads.

Just try spinning up 100K threads that each print a line and then sleep for 10ms, and see how high your CPU usage gets.

Also, doing IO does not necessarily mean context switching - it means calling into the kernel (system calls). If you use an async IO operation (read/write from a socket) and then continue to the next thread, by the time you're done with all ready threads, you're likely to have some sockets ready to read from again, so you might not context switch at all. Kernel developers are working on even reducing the need for syscalls with io_uring, which is designed to allow you to perform IO without system calls.

fulafel 2 hours ago [-]

There shouldn't be any 100k hard limit for threads at least in Linux, though you need enough memory for 100k stacks of course. You need to increase some default limits for it though (https://stackoverflow.com/a/26190804)

Assuming a generous(?) 20 kB per thread in stack and other corresponding OS bookkeeping inforation you could have 1k threads in 20 MB, or 1M threads in 20 GB.

Doing 100 Hz timer wakeups and IOs concurrently in 100k threads makes 10 M wakeups/second, that takes a chunk of CPU independent of green / native threads choice. Performance vs kernel threads will depend on the green threads implementation.

adev_ 4 hours ago [-]

> You cannot spin up 100K native > threads to handle 100K requests that are IO bound (which is very common in a microservice architecture, since requests will very quickly block on requests to other services)

The Varnish server does that all the time and it is not a problem.

Modern OS handle a large number of threads much better than 10 years ago.

In case of python however, you can not do that due to the GIL and the fact python handle native threading terribly.

mborch 3 hours ago [-]

From the docs:

> We rarely recommend running with more than 5000 threads. If you seem to need more than 5000 threads, it’s very likely that there is something not quite right about your setup, and you should investigate elsewhere before you increase the maximum value.

adev_ 2 hours ago [-]

Yes, and still 5000 will be more than enough to serve most of production load. Even on high traffic websites.

rossmohax 13 hours ago [-]

Beam VM (Erlang/Elixir) is another example which copes with the problem very well due to preemption.

Floegipoky 12 hours ago [-]

To expand on this, by allowing preemption on function calls and simply not providing a loop mechanism, you can guarantee an upper bound on how long a process may need to wait until it can be scheduled.

cbsmith 11 hours ago [-]

Guarantee is a strong term. If you have an infinite number of processes to running, there aren't guarantees on how long until it is scheduled.

Similarly, with cooperative multi-tasking as in Gevent, you can manipulate scheduling to try to provide better guarantees about wait times. It's just... you can't ignore the problem.

akvadrako 3 hours ago [-]

Actually Linux can handle a few million threads just fine - you only need to increase the max pid.

You should just update your reasoning and say you can’t have 100M threads waiting for blocking IO.

---

https://toroid.org/unix-pipe-implementation

---

 sandGorgon 17 hours ago [-]

I see a lot of comments that are talking about how python does not have a go-like concurrency story.

Fyi - python ASGI frameworks like fastapi/Starlette are the same developer experience as go. They also compete on techempower benchmarks. Also used in production by Uber, Microsoft,etc.

A queue based system is used for a very different tradeoff of persistence vs concurrency. It's similar to saying that the usecase for Kafka doesn't exist because go can do concurrency.

Running "python -m asyncio" launches a natively async REPL. https://www.integralist.co.uk/posts/python-asyncio/#running-...

---

 fluffything 58 days ago [–]

I write a lot of compute kernels in CUDA, and my litmus test is prefix sum, for two reasons.

First, you can implement it in pure CUDA C++, and max out the memory bandwidth of any nvidia or AMD GPU. The CUB library provides a state of the art implementation (using decoupled-lookback) that one can compare against new programming languages.

Second, it is one of the most basic parallel algorithm building blocks. Many parallel algorithms use it, and many parallel algorithms are "prefix-sum-like". If I am not able to write prefix sum from scratch efficiently in your programming language / library, I can't use it.

Every time someone shows a new programming language for compute, the examples provided are super basic (e.g. `map(...).fold(...)`), but I have yet to see a new programming language that can be used to implement the 2-3 most fundamental parallel algorithms from any parallel algorithms graduate course. For example, Futhark provides a prefix sum intrinsic, that just calls CUB - if you want to implement a prefix-sum-like algorithm, you are out of luck. In WGPU, it appears that prefix-sum will be an intrinsic of WHSL, which sounds like you would be out-of-luck too.

You mentioned WGPU and Vulkan. Do you know how to implement prefix-sum from scratch on these? If so, do you know how the performance compare against CUB?

Athas 57 days ago [–]

> For example, Futhark provides a prefix sum intrinsic, that just calls CUB - if you want to implement a prefix-sum-like algorithm, you are out of luck.

Futhark doesn't call CUB (it also targets OpenCL?), and here is an asymptotically work-efficient prefix sum in pure Futhark: https://gist.github.com/athas/611e6b9a76b382ec079f979ec7fb85...

It's not as fast as a hand-written scan in CUDA, nor as fast as the built-in scan in Futhark, but you can do it. Futhark isn't a low-level or GPU-specific language, so you can't take direct advantage of hardware-specific features.

Generally, Futhark doesn't compete performance-wise primitive-for-primitive, but can do application-level transformations that are infeasible for libraries, and impractical to do by hand. For example, you can 'map' a Futhark 'scan' over a two-dimensional array and it will just work pretty well, while there's no way to turn the CUB scan into a regular segmented scan, unless the original author thought of it. Similarly, if you are 'scan'ing the same array twice in a Futhark program, but with different operators or pre-scan transforms, the compiler will fuse the two operations into one, to reduce the number of accesses.

fluffything 57 days ago [–]

> Futhark doesn't call CUB (it also targets OpenCL?),

If the builtin scan doesn't call CUB, how fast is a prefix sum in Futhark compared to Cuda using CUB ?

> For example, you can 'map' a Futhark 'scan' over a two-dimensional array and it will just work pretty well,

What does "pretty well" mean? 95% of peak device throughput ? 60% ?

When a compute device costs 8.000$, an efficiency of 90% means I'm throwing 800$ out of the window.

Athas 57 days ago [–]

> If the builtin scan doesn't call CUB, how fast is a prefix sum in Futhark compared to Cuda using CUB ?

Last I checked, it's about 50% the performance of CUB, but a sibling post describes a current effort to implement the decoupled lookback algorithm in Futhark's CUDA backend.

> What does "pretty well" mean? 95% of peak device throughput ? 60% ?

It's as fast as Futhark's single-dimensional scan. In exotic cases (when a single segment or "scan instance" fits in a thread block) it can be even faster.

> When a compute device costs 8.000$, an efficiency of 90% means I'm throwing 800$ out of the window.

Certainly! But people run much less than 90% efficient code on $8000 CPUs all the time, so I think there is definitely room for a high-level programming language that runs on GPUs. It's the usual performance-versus-productivity tradeoff, although Futhark's ambition (and to a large extent, reality) puts more emphasis on the performance side of things than when people usually use that argument.

While Futhark can in principle not beat hand-written code, I will note that empirically it is not unusual for us to see performance increases when we take published GPU benchmark code and rewrite it straightforwardly in Futhark. Much GPU code out there is not perfectly written at the level of CUB, and while the Futhark compiler doesn't exactly generate highly inspired code, it gets the basics consistently right.

fluffything 56 days ago [–]

You claim there is a trade-off between usability and performance for GPUs. The reason Rust is so popular is that people used to believe the same thing for CPU programming languages, yet Rust showed that you can have a language with the abstraction power and safety of Scala, Haskell, etc. and the performance of C. This was believed to be impossible, turns out it wasn't.

In the last NERSC survey, 75% of parallel models in use in US national HPC centers are MPI and/or OpenMP?, ~22% is CUDA C and CUDA C++, and all others are the rest (~3%).

CUDA C++ is the best we have because it let's you write code that extracts all the performance from the device, while also allowing one to expose that code using very high-level APIs (Thrust, CUTLASS, CUB, AmgX?, ...), and even RAPIDS, CuPY?, etc.

While I think it is very useful to research how "high-level" a PL for GPUs can be (Futhark, Taichi), a practical language (that actually gets used widely on HPC) must do so without sacrificing low-levelness.

A GPU programming for the masses needs to provide or improve on the performance of CUDA C, its abstraction capabilities, and its safety.

While I think it is very useful to research how "high-level" a PL for GPUs can be (Futhark, Taichi), or how "middle" a middle ground can be (OpenCL?, SyCL?, etc.), a practical language that actually gets used widely on HPC must be better than what we have without sacrificing anything that we have.

The hardest thing about CUDA is writing code that uses the memory hierarchy well. In my experience, this is the hardest thing to do well portably, and the hardest thing to abstract, which is why pretty much all device-only high level libraries like CUB or CUTLASS expose the memory hierarchy on their APIs, and this is, at least today, required to get good performance.

I find languages like sequoia quite interesting, because they explore how to improve on this particular problem which, right now, it appears that any CUDA replacement would need to solve better than CUDA. Languages like regent/legion, hpx, futhark, taichi, etc. focus more on just trading off low-level performance for high-level abstractions. This research is definitely useful to explore what other things a CUDA killer could do better, but I don't think any of these would end up taking a little bit of CUDA's market share. At best, they might take a bit of OpenMP?'s marketshare, but that seems unlikely.

9q9 54 days ago [–]

Why do you think most languages avoid exposing the memory hierarchy? What is the core problem that has not yet been solved?

fluffything 53 days ago [–]

For programs running on the CPU, the programmer cannot manually control memory transfers across the memory hierarchy, so adding that ability to your programming language does not solve any problems.

So I'd say most languages don't expose it because there is no need for it.

Many languages for GPUs do expose the memory hierarchy to programmers (e.g. CUDA, OpenCL?, even OpenAcc?).

> What is the core problem that has not yet been solved?

That using the memory hierarchy efficiently and correctly when writing GPU programs in those languages is hard and error prone. It is trivial to write code that performs horribly and/or has data-races and other forms of UB in CUDA, e.g., when dealing with global/shared/local memory.

Sequoia attempted to split the kernel in the algorithms at the different levels of the memory hierarchy, and the memory owned by the kernel at the different levels, as well as how to split this memory as you "refine" the kernel, e.g., from kernel (global) -> thread blocks (global, shared, constant) -> warps (global, shared, constant, shared registers) -> threads (global, shared, constant, shared registers, local memory).

For many algorithms (e.g. GEMM, convolutions), how you partition global memory into thread blocks, and which parts of global memory one loads into shared memory and how, has a huge impact on performance.

9q9 53 days ago [–]

Interestingly, a lot of programmers (who program for the CPU) worry about caches a great deal, despite:

Programmers being unable to control caches, at least directly, and reliably.
Languages (e.g. C/C++) having no direct way of expressing memory constraints.

This suggests to me that even in CPU programming there is something important missing, and I imagine that a suitable explict representation of the memory hierarchy might be it. A core problem is that its unclear how to abstract a program so it remains perfomant over different memory hierarchies.

fluffything 52 days ago [–]

People that worry about caches for CPUs, can't control them directly, so they need to instead, e.g., "block"/"tile" their loops in such a way that their caches are used efficiently, which is hard.

On different CPUs (e.g. with different cache sizes), these loops need to be tiled differently. If you want a single binary to perform well across the board, it just need to support using the different tile-sizes depending on the hardware.

Usually, this is however not enough. For example, you have some data in memory (a vector of f16s on an intel CPU), and for operating on them, you need to decompress that first into a vector of f32s.

You probably want to decompress to fill the cache, operate, recompress, to save memory and memory bandwidth. For that you need a "scratchpad" (or __shared__ memory in CUDA), e.g., for the different levels of the cache hierarchy.

The compiler needs to know, for the different architectures, what their L1/L2/L3/shared/constant/texture/.... memory sizes are, and either fill these sizes for you to use the whole cache, or let the user pick them, making the program fail if run on hardware where this isn't correct. NVCC (and pretty much every production compiler) can bundle code for different architectures within a single binary, and pick the best at run-time.

So if your L3 can be 4,8,16, or 32 Mb, your compiler can bundle 4 copies of your kernel, query the CPU cache size at initialization, and be done with it.

The way in which you abstract the memory hierarchy is by just partitioning your problem's memory.

If you have a f16s vector in global memory, you might want to partition it into N different chunks, e.g., recursively, until if a chunk where to be decompressed into f32s, that decompressed chunk would fit into a faster memory (e.g. L3). At that point you might want to do the decompression, and continue partitioning the f32s up to some threshold (e.g. the L1), on which you operate.

That is, a kernel has multiple pieces:

a state initialization (e.g. owns the global memory)
partitioning: how to partition that into smaller subproblems
the merge: what gets computed at each level of the partitioning (e.g. how are the results of the partitions merged together to compute the final result)
the leaf computation: what gets computed at the "finest" level

The programmer doesn't know a priori how many partitions would be necessary, that would depend on the memory hierarchy. But it does know everything else.

For example, for a sum reduction:

the programmer knows how to perform a leaf computation: by summing a whole leaf partition and putting the result somewhere in the next larger memory level.
a CUDA or CPU thread can sum N elements in parallel using SIMD, and put the result in inter-warp shared memory or the L1
the programmer knows how to merge computations: by summing the results of the current level, and putting them in the next level of the hierarchy (inter-warp -> inter-block -> inter-grid, or L1 -> L2 -> L3 -> RAM)
in a GPU / CPU, depending on the actual level, the compiler can use different operations (e.g. SIMD shuffles, inter warp shuffles with warp-level synchronization, inter block shuffles with block-level synchronization, local sum + atomic memory operation to write the result, etc.)
the programmer knows how to partition an input vector into N pieces (N CUDA threads, N warps, N blocks, N grids, N cpu threads, N numa domains, etc.)

A good programming model needs to allow the user to express what they know, and abstract away what they cannot know (how big the caches are, how many cache levels, how many threads of execution, etc.)

9q9 52 days ago [–]

Thanks. Amazing overview.

   abstract away what they 
   cannot know

Here I'm a bit surprised. For theoretical reasons (I've never gotten my hands dirty with GPU implementations, I'm afraid to admit), I would have expected that it is highly useful to have a language interface that allows run-time (or at least compile-time) reflection on:

Number of levels (e.g. how many layers of caches).
Size of levels.
Cost of memory access at each level.
Cost of moving data from a level to the next level up/down.

The last 3 numbers don't have to be absolute, I imagine, but can be relative, e.g.: size(Level3) = 32 * size(Level2). This data would be useful to decide how to partition compute jobs as you describe, and do so in a way that is (somewhat) portable. There are all manner of subtle issues, e.g. what counts as cost of memory access and data movements (average, worst case, single byte, DMA ...), and what the compiler and/or runtime should do (if anything) if they are are violated. In abstract terms: what is the semantics of the language representation of the memory hierarchy. Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array, and have DMA-like send/receives to move blocks of data between levels. Is that a good idea?

Equally subtle, and I already alluded to this above, is when to make this information available. Since the processor doesn't change during computing, I imagine that using a multi-stage meta-programming setup (see e.g. [1] for a rationale), might be the right framework: you have a meta-program, specialising the program doing the compute you are interested in. C++ use template for program specialisation, but C++'s interface for meta-programming is not easy to use. It's possible to do much better.

As you wrote above, programming "in those languages is hard and error prone", and the purpose of language primitives is to catch errors early. What errors would a compiler / typing system for such a language catch, ideally without impeding performance?

[1] Z. DeVito?, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf

fluffything 50 days ago [–]

> would have expected that it is highly useful to have a language interface that allows run-time (or at least compile-time) reflection

That's an interesting thought. I'm not sure I agree, maybe?

The user job is to express how to partition the problem. To do that properly, they need to know, e.g., "how many bytes can I fit in the next partition per unit-of-compute", so the langue/run-time has to tell them that.

I don't know if knowing the costs of memory access at each level or across levels is useful. It is a reasonable assumption that, at the next level, memory accesses are at least one order of magnitude faster, and that synchronizing across the hierarchy is very expensive and should be minimized. That's a good assumption to write your programs with, and knowing actual costs do not really help you there.

> Another subtle but important question is: what primitives should a language provide to access a memory level, and which ones to move between levels. An obvious choice is to treat each level as an array

I think CUDA's __shared__ memory is quite good. E.g. per kernel, a way to obtain memory in the level of the hierarchy that the kernel is currently working on. Nobody has extended CUDA to multiple levels because there hasn't been a need, but I expect these programs to be "recursive" in the sense that they recursively partition a problem, and for __shared__ memory on each recursion level to give you memory at a different level of the hierarchy.

To move memory across levels of the hierarchy, you would just use raw pointers, with appropriate reads/writes (e.g. atomic ones). The exact instructions that get emitted would then depend on which level of the hierarchy you are. For that the compiler needs to know something about the architecture it is compiling for, but that seems unavoidable.

> [1] Z. DeVito?, J. Hegarty, A. Aiken, P. Hanrahan, J. Vitek, Terra: A Multi-Stage Language for High-Performance Computing. https://cs.stanford.edu/~zdevito/pldi071-devito.pdf

Thanks, I actually hadn't read the Terra paper. I'll try to get to it this weekend. I think that using a Lua-like language as the meta-programming for your language (e.g. Regent) is an interesting approach, but it is not necessary.

For example, Scheme, Lisp, Haskell, D and Rust have shown that you can do meta-programming quite well in the same language you do normal programming in, without having to learn a completely different "meta-language".

I particularly like Rust procedural macros. It is normal run-time Rust code that just get first compiled, and then executed (with your source code as input) during compilation, with some twist to make that quite fast (e.g. the compiler parses an AST, and the proc macros do AST->AST folds).

If one wants to delay that until run-time, one should just do so (e.g. you just embed your compiler in your app, and when your proc macros run, its "run-time"). No need to layer multiple languages, but maybe the Terra paper convices me otherwise.

9q9 50 days ago [–]

   Terra paper

I didn't mean to push Terra in particular, and I agree that doing meta-programming is easier in the same language you do normal programming in. I just mentioned the Terra paper since it provided a rationale for using meta-programming in high-performance programming (I could have pointed to others making the same argument).

Rust's procedural macros are standard compile-time meta programming.

   proc macros do AST->AST folds)

Last time I looked (I have not used Rust for a while) procedural macros operate over tokens, not ASTs. Maybe that changed?

steveklabnik 50 days ago [–]

It hasn’t changed; we didn’t want to freeze the AST forever, so we went with the token approach. There are libraries to form an AST though.

---

https://docs.google.com/presentation/d/1FRH81IW9RffkJjm6ILFZ7raCgFAUPXYYFXfiyKmhkx8/edit#slide=id.g6ef30904b7_0_5

Exotic GPU languages

Halide A DSL layered on C++, mostly imaging applications Futhark A functional array language in the ML family Co-dfns Based on Dyalog APL Julia Layered above CUDA, provides interactivity In the works: Tensorflow MLIR

Future prospects

SPIR-V is becoming a solid foundation Tools and languages can be built on top We will be able to rely on WebGPU? functionality Manufacturers will compete on ability to run WebGPU? workloads well Test suite Don’t underestimate this - GPU drivers are notoriously buggy New languages will emerge I don’t want a “sufficiently smart compiler” I want lower friction Deployment will become widespread Driven by machine learning workloads Prediction: people will migrate compute workloads to Vulkan and WebGPU?

comments also mention taichi

throwlaplace 58 days ago [–]

of the links you put on the last page (or other gpu languages you might not have included) what's the most mature way to write decent gpu code right now (even if not necessarily optimal). i'm right this moment working on a project where i'm trying to speed up a blob detection algorithm by porting it to gpu. i'm using pytorch because the primitives fit naturally but if that weren't the case i think i would be kind of at a loss for how to do it (i do not know, as of yet, how to write cuda kernels).

btw the Julia link is dead

raphlinus 58 days ago [–]

It depends on what you're trying to do. If you want to get stuff done now then just get an Nvidia card and write CUDA. The tools and knowledge community are vastly ahead of other approaches.

If you're doing image-like stuff, Halide is good, there's lots of stuff shipping on it.

One other project I'd add is emu, which is built on wgpu. I think that's closest to the future I have in mind, but still in early stages.

https://github.com/calebwin/emu

ZeroCool?2u 58 days ago [–]

If you're interested in getting started writing GPU code and you're currently dealing with PyTorch? a good place to start might be using NUMBA for CUDA programming. It lets you do in Python what you'd normally do in C/C++. It's great, because setup is minimal and it has some good tutorial materials. Plus, you're mostly emulating what you'd end up doing in C/C++, so conceptually the code is very similar.

When I was at GTC 2018 this is one of the training sessions I went to and you can now access that here[1]. The official NUMBA for CUDA docs are great too[2]. Also, you can avoid a lot of installation/dependencies headaches by using anaconda/miniconda if you're doing this stuff locally.

[1]: https://github.com/ContinuumIO/gtc2018-numba

[2]:https://numba.pydata.org/numba-doc/latest/cuda/overview.html

jjoonathan 58 days ago [–]

I have a love/hate relationship with numba. On one hand, the minimal setup is excellent. On the other hand, it has so many limitations and rough patches that occasionally we have been forced to abandon the numba code and reimplement in C -- by which time we've spent more time fighting numba than the C reimplementation takes. This has happened twice out of 5 major recent attempts to use it.

Numba CPU is pretty good. I'd recommend it, modulo some expectation control:

In order to actually compile a function, it must be written in a very minimal python subset that feels like C
Numpy is only partially supported in this subset. Expect to manually reimplement ~1/3 of your simple numpy code and all of your fancy numpy code.

Numba GPU is not good. I'd avoid by default, unless you're looking for a project to contribute to:

When we first installed, it was trivially broken by the removal of a long-deprecated CUDA API that had happened a year prior. It was an easy fix, we just chopped the reference out of numba, but it was a bad sign that foretold problems to come.
It has lingering subtle bugs. One time we passed in a float array, it made us declare the float type three times (silly but not bad), and then it ignored the types and silently binary cast to double (bad). We found a "ghost town" thread where others had the issue.
It has severe structural limitations. To a first order, GPU performance is all about avoiding stalls, which are usually triggered by CPU-GPU copies, and Numba's limited buffer abstractions sometimes make that difficult (you have to destructure your code) or impossible.
It doesn't play well with nvidia's perf tools. If you're doing GPU, you're after perf, and GPUs have lots of tricky caveats that can accidentally tank your perf, so this is a much bigger deal than jupyter (until recently) missing a debugger.
If you're wondering if numba supports CUDA API XYZ, the answer is no.

ZeroCool?2u 58 days ago [–]

Yes, I'd agree with all of your points really. You can see in my reply below that I was mostly advocating for it's use in learning the concepts behind GPU programming, so you can skip delving into C/C++ if you're uncomfortable with that.

If I had to pick something today to use for arbitrary GPU acceleration in Python I'd almost certainly opt for JAX instead, but I haven't seen the kind of tutorials in JAX that exist for NUMBA CUDA.

https://developer.nvidia.com/udacity-cs344-intro-parallel-programming

sprash 58 days ago [–]

Halfway through you were bringing up hopes for a great new way to do 2D rendering on GPUs but then there is no show of actual implementation. Rather anticlimactic!

Even decades old Cairo is beating VulKan?/GPU based renderers like Skia in some tasks. Can it be that hardware intended for ML or 3D-graphics is not suited at all for 2D rendering (parallelism vs. pipelining)? How will WebGPU? solve those problems?

To me this talk raises more questions than it answers...

vmchale 58 days ago [–]

My experience with Futhark (and to some extent accelerate) is that they're not terribly hard to program in applicable cases.

I'm not sure Futhark-generated code is as fast as specialist code but it's definitely a speedup compared to even skilled CPU implementations.

streb-lo 58 days ago [–]

I'm not well versed in this area at all -- but something I find fascinating is what I have heard of the early days of GPU compute. Before CUDA and the like, you would apparently take your data you wanted to compute on, structure it as an image/texture and use the graphics API to ham fist your way to the result you wanted. No idea if that's true, but pretty neat if it is.

raphlinus 58 days ago [–]

That works well for certain kinds of computations, basically where each thread can compute a "pixel" independently. Modern GPU compute has threadgroup local storage, subgroups, atomic operations, etc. These unlock huge categories of workloads that would not fit nicely in the rasterization pipeline.

eggy 57 days ago [–]

I loved his commentary on Co-dfns for APL at 54:38 in the video. I program in J[0], and dabble in Dyalog APL. Aaron Hsu's work on the gpu compiler, he's the implementor of Co-dfns, is nothing short of mind-blowing[1]. His arguments for using APL vs. more mainstream languages are cogent and persuasive. The paradigm shift from C-like or even Haskell (ML) type languages is too big a gap for most mainstream programmers to accept. My opinion is that aside from Python's clean, indented syntax, the real reason it took off for DL/ML, was Numpy and Pandas. Pandas was heavily influenced by the array languages per its creator Wes McKinney?. Besides, I just like playing with J.

I have looked briefly at Futhark, and the apltail compiler[2], but I am trying to focus on the APL family, because my time is limited. I am a GPU dilettante who has tried basic CUDA in C. I tried Rust, but Zig [3] is working out better for me given I make more progress with less effort, using C APIs are effortless, but my Rust difficulties may just be my bias with PLs. I find Haskell easier than Rust.

I just read an HPC article today about a survey/questionnaire about the languages used among 57 experts. It's still predominantly C++, but with a lot pain expressed by the experts/users. I agree SPIR-V sounds promising, and I hope to check it out. Just like DL, I think people don't realize it needs to be knowledge domain and algorithms, or experts and software. Somebody has to setup the computation based on knowledge domain and software knowledge. This shows itself to me when I wind up running somebody's program, because I don't have my own specific problem I'd like to try and implement as a learning exercise and knowledge exercise.

Great talk! I found it very understandable and paced just right!

[0] https://jsoftware.com

[1] https://scholarworks.iu.edu/dspace/handle/2022/24749

[2] https://futhark-lang.org/blog/2016-06-20-futhark-as-an-apl-c...

[3] https://ziglang.org

dnautics 56 days ago [–]

hey eggy I didn't know you were interested in zig as well, drop a line, contact's in my about profile.

https://github.com/Co-dfns/Co-dfns

High-performance, Reliable, and Parallel APL

Co-dfns Compiler

The Co-dfns project aims to provide a high-performance, high-reliability compiler for a parallel extension of the Dyalog dfns programming language. The dfns language is a functionally oriented, lexically scoped dialect of APL. The Co-dfns language extends the dfns language to include explicit task parallelism with implicit structures for synchronization and determinism.

https://dl.acm.org/doi/pdf/10.1145/2627373.2627384

geokon 58 days ago [–]

I have a more theoretical followup then :)

So you spend a lot of time in the talk massaging your sequential CPU task into an appropriately GPU task. It's clearly tricky and involves reworking your algorithms to leverage the parallelism and the memory "features" of the GPU. But through rewriting, have you actually substantively hurt the performance of the sequential program?

The big picture question is : is the reverse problem of going from a GPU program to a CPU problem ever problematic? I mean in a algorithmic sense - without looking at micro optimizations for leveraging CPU SIMD instructions or whatever. Or are you always going to be pretty okay with running your shaders sequentially one after another on your big luxurious CPU memory pool?

and ultimately is there anything stopping you from compiling SPIR-V for a CPU? Could you not autgenerate in the kernel dispatcher just a switch that'll branch and run a precompiled on-CPU kernel if no driver is found? Then you'd finally really get compile-once-run-everywhere GPU code

I guess since it's not being done then I'm missing something haha :) Maybe you are going to often hit scenarios where you'd say "No, if I'm going to run this on the CPU I need to fundamentally rewrite the algorithm"

raphlinus 58 days ago [–]

These are indeed interesting questions, thanks for asking them.

If you have a workload that runs really well on GPU, then it can be adapted to run on CPU. There are automated tools for this, but it's not mainstream. To answer one of your questions, spirv-cross has a C++ generator.

There are a bunch of things that are in the space between GPU and traditional scalar CPU. One of the most interesting is Intel's ispc. The Larabee project was also an attempt to build some GPU features into a CPU, and that is evolving into AVX-512. The new mask features are particularly useful for emulating the bitmap of active threads, as this is something that's hard to emulate with traditional SIMD.

I think it would be a very interesting project to build a compiler from SPIR-V compute workloads to optimized multithreaded SIMD, and there are projects exploring that: https://software.intel.com/en-us/articles/spir-v-to-ispc-con...

The main reason this hasn't happened is that when doing optimization it's always better to target the actual hardware than go through layers of translation. If you want to run a neural network on CPU, you're definitely going to get better performance out of matrix multiply code tuned for the specific SIMD implementation than something adapted from GPU. But I think there may still be a niche, especially if it's possible to get "pretty good" results.

For machine learning and imaging workloads in particular, there's probably a lot more juice in having a high level (Halide) or medium level (MLIR) representation, and targeting both GPU and CPU as backends that do optimization specific to their targets.

I'm really interested to see how this space evolves, it feels like there are major opportunities, while the scalar CPU side feels largely mined out to me.

SomeoneFromCA? 58 days ago [–]

Here is the question, though: would you consider Intel HD a native part of CPU or a "some GPU". I am still quite perplexed, why everyone targets CUDA and no one tries to use ubiquitous Intel HD for AI computations. I mean, Intel CPUs are everywhere (I know, know, Ryzens are chaging the situation but still..), and almost always they have GPUs onboard.

raphlinus 57 days ago [–]

They definitely qualify as GPU. I think the main reason people aren't using them is that tools are primitive. Also, historically they've been pretty anemic in horsepower, though are getting better. Still not competitive with high end discrete graphics cards, but absolutely getting there on the low end.

Intel GPU has three hidden strengths:

CPU <-> GPU communication is cheaper, because they can actually share memory (a separate copy to staging buffers is not needed).
Latency is potentially lower, although it's not clear to me yet that driver software can take advantage of the potential offered by the hardware. (More empirical measurement is needed)
Subgroup operations on Intel appear to be wicked-fast, shuffle in particular.

Long story short, I think there is opportunity here that most people aren't exploiting.

modeless 57 days ago [–]

Yes, you bundle SwiftShader? with your application.

flumpcakes 57 days ago [–]

Hi,

If you want to target GPU in some form to increase performance (open source standards that currently fill this niche are OpenCL? for compute, OpenGL?/Vulkan for graphics) but you want to use the same code for CPU and GPU there is a solution!

There's a new standard called SYCL.

Currently it is based on C++11 and there are implementations that target OpenGL? and CUDA.

pjmlp 57 days ago [–]

Except that one needs to rely on tooling from CodePlay? or Intel.

NVidia isn't that keen in supporting OpenCL? beyond 1.2 and is pushing for OpenACC? in what concerns GPGPU outside CUDA on their hardware.

VHRanger 57 days ago [–]

OpenCL? can target the CPU

raphlinus 57 days ago [–]

Indeed. I think it's fair to say that OpenCL? is excellent for targeting a very wide range of devices, with the exception of most high performance GPUs.

vmchale 58 days ago [–]

Question: what is the state of GPU compute w.r.t. Rust? Are the libraries on par with accelerate or (not exactly comparable) Futhark?

raphlinus 58 days ago [–]

There's a ton of low and mid level infrastructure for GPU being written in Rust. That includes the gfx-hal work, the wgpu implementation, and more experimental languages such as emu.

But it's still early days, and I don't think there's a complete solution comparable to either of those yet.

VHRanger 57 days ago [–]

Expect a lot of rough patches

---

https://github.com/luncliff/coroutine

---

https://kripken.github.io/blog/wasm/2019/07/16/asyncify.html

Pause and Resume WebAssembly? with Binaryen's Asyncify

Jul 16, 2019

Pausing and resuming code can be useful for various things, like implementing coroutines, async/await, limiting how much CPU time untrusted code gets, and so forth. If you are customizing a WebAssembly? VM then you have various ways to instrument the compiled code to do this. On the other hand, if you want to do this in “userspace”, that is, if you are running inside of a standard WebAssembly? VM and you want to transform some WebAssembly? so that it can be paused and resumed, then Binaryen’s Asyncify provides a solution. For example, you can use it to pause and resume WebAssembly? that you run on the Web. If you have some synchronous code that you need to be asynchronous, but it’s hard to rewrite, then Asyncify can do that for you automatically!

Asyncify is implemented as a Binaryen pass. It provides low-level support for pausing and resuming WebAssembly? by instrumenting the code and providing special functions that control unwinding and rewinding the call stack, preserving all the local state while doing so. We’ll see examples below of how to do that - basically, you can start with normal synchronous WebAssembly? code, run Asyncify on it, and then easily control unwinding and rewinding the call stack from either WebAssembly? or JavaScript?.

Hopefully WebAssembly? will support coroutines natively in the future. Another possibility here is to use threads (by blocking on another thread), but that can only work in browsers that support threads, and only in a worker. Asyncify is another option in the meantime, which works everywhere but adds some amount of overhead. We’ll look into performance in depth later on - in many cases that overhead is surprisingly low!

Let’s start with three examples of how to use Asyncify: in pure WebAssembly?, in JS plus WebAssembly?, and in C using Emscripten. Example: Pure WebAssembly?

To use Asyncify in pure WebAssembly?, define imports to the asyncify.* API. Asyncify will turn those into direct calls to the implementations as it adds them. That API lets you control unwinding and rewinding the call stack:

    asyncify.start_unwind: Starts to unwind the call stack. Receives a pointer to a data structure that will store the information about the call stack and local state, see below.
    asyncify.stop_unwind: Stops unwinding the call stack. Must be called after you reach the end of the stack you want to unwind.
    asyncify.start_rewind: Starts to rewind the call stack. Receives a pointer to the data structure used earlier to unwind, and prepares to rewind to that exact position.
    asyncify.stop_rewind: Stops rewinding the call stack. Must be called after you reach the top of the stack you rewound, that is, when you finish returning to the previous position.

...

How Asyncify works

(You don’t need to understand how Asyncify works in order to use it. If you’re not interested in that you can skip this section.)

The basic capabilities we need in something like Asyncify are to unwind and rewind the call stack, to jump back to the right place in the middle of the function in each case, and to preserve locals while doing so.

...

The big question is then what to do about control flow. Asyncify does something like this to that last code snippet (again, in JS-like pseudocode):

function caller() { let x; if (rewinding) { .. restore x .. } if (normal) { x = foo(); } if (normal

.. check call index ..) {

    sleep(100);
    if (unwinding) {
      .. save call index and x ..
      return;
    }
  }
  return x;}

This may look confusing at first, but it’s really pretty simple. The identifiers “normal”, “rewinding”, “unwinding” mean that we check if we are running normally, rewinding the call stack, or unwinding it. If we are rewinding, we start by restoring the local state. We also guard most code with checks on whether we are running normally - if we are rewinding, we must skip that code, since we’ve already executed it. When we reach a call that might unwind the stack, we have a “call index” for it so that we know which call to return to inside each function. If the call starts an unwinding, we save the call index and local state and exit immediately. Or if we are rewinding, then the call will set the state to normal, and then we will simply continue to execute on from the right place.

The key principle here is that we don’t do any complex CFG transformations. Instead, Asyncify’s approach is to skip code while rewinding, always moving forward so that we eventually get to the right place (this algorithm is called “Bysyncify” internally in Binaryen). Importantly, we can put those ifs around whole clumps of code that can’t unwind, like a loop without a call:

if (normal) { for (let i = 0; i < 1000; i++) { total += i; } }

The entire loop is inside the if, which means that it runs at full speed! This is a big part of why Asyncify is faster than you’d expect. "

---

" async / promise support #1171 Open nikhedonia opened this issue on Jan 7, 2018 · 7 comments Comments @nikhedonia nikhedonia commented on Jan 7, 2018 •

Many JavaScript? libraries are async and I have not seen that WebAssembly? supports promises yet. In order to easily write performant applications that leverage the JS ecosystem I'd suggest to add support of promises in imported functions.

For instance this should work:

const importObject = { returnOneAsync: () => new Promise(done => done(1)) };

extern "C" int returnOneAsync();

int main(){ int x = returnOneAsync(); should suspend main until promise resolved. return x+x; }

@rossberg Member rossberg commented on Jan 8, 2018

Any support of JS promises would have to be part of the JS host bindings proposal.

But I'm afraid it cannot possibly work like your example suggests, since that would require C itself to understand async execution and suspension, which it doesn't. Also, there is nothing async in your example, it just creates a promise, so the equivalent code wouldn't even suspend in JS itself, but just produce a string like "[object Promise][object Promise]". @nikhedonia Author nikhedonia commented on Jan 8, 2018 •

    Also, there is nothing async in your example

Well, the value is not available until the eventloop processes the promise. The idea is to suspend the execution of the webassembly engine until the value is available. The provided example is of course very trivial. A more concrete example could be making a db query over the network: getAgeOf: (name) => db.find({name}).then(x=>x.age)

    C itself to understand async execution and suspension, which it doesn't

That's incorrect. There are various coroutine implementations for C. Furthermore LLVM-5.0 and upwards supports coroutines and async/await.

https://llvm.org/docs/Coroutines.html

And as an addendum to C++17 there is the coroutine-ts that has been implemented since clang-5.0. There are couple of experimental libraries using it already: https://github.com/lewissbaker/cppcoro#generator @rossberg Member rossberg commented on Jan 8, 2018

Implementations of coroutines in C do not magically interoperate with JavaScript?'s event loop. Also, they usually rely on implementation-specific behaviour. To reliably and portably implement something like coroutines or async you'd need a form of delimited continuations in the language. The closest C has to offer is longjmp, which isn't enough and cannot currently be implemented in Wasm itself (with Emscripten, longjmp is implemented in JS with JS exceptions). Wasm cannot currently express suspension, although there are plans to add something along these lines eventually. C++ coroutines aren't standard yet and cannot be handled by Emscripten AFAICT. @nikhedonia Author nikhedonia commented on Jan 8, 2018 •

I see where you are coming from. I think I haven't expressed my idea well enough. I'm not suggesting to to support coroutines nor promises in C/C++ or realise that in the wasm bytecode.

    But I'm afraid it cannot possibly work like your example suggests, since that would require C itself to understand async execution and suspension, which it doesn't

Why do you think it is required for C to understand async execution?

What's the technical difficulty to suspend the wasm execution engine until the promise is resolved?

I understand that wasm is supposed to be ahead-of-time compiled to x86 but it does not mean you have no control over the flow of execution. An common way to stop the execution is to use hardware interrupts and syscalls like ptrace; that's how debugger work. We know exactly which functions are imported and we could inject the appropriate (x86) opcodes to accomplish that. Although this will slowdown the execution of those functions - it wouldn't matter as we are waiting in those cases for the promise to resolve anyway. That is how we implemented suspend/debug/resume in our LLVM based C++JIT/AoT?.

I guess this issue should be filed in the host-bindings repo... @rossberg Member rossberg commented on Jan 9, 2018

I think you misunderstand the computational model of the web. JavaScript? isn't multi-threaded, so suspending a computation would halt the engine, and in fact the entire renderer hosting it. The page would freeze and no other code could run (other than separate workers). You'd need a way of reifying and later restoring continuations to make suspension fly, which is not something engines currently can do. " -- https://github.com/WebAssembly/design/issues/1171

---

https://llvm.org/docs/Coroutines.html

---

dynamite-ready 6 hours ago [–]

The overlap between K8s and BEAM is a good question. Even amongst experienced BEAM (especially Erlang) programmers, there's a lot of conflicting information. From my limited understanding, Kubernetes is comparatively complicated, and can hamstring BEAM instances with port restrictions.

On the other hand, there's a rarely documented soft limit on communication between BEAM nodes (informally, circa 70 units, IIRC). Above this limit, you have to make plans based on sub-clusters of nodes, though I have certainly not worked at that level of complexity.

Would be interesting to hear what other people think about this specific subject.

di4na 5 hours ago [–]

This limitation is far higher for the past few years and there is more work from the OTP team to raise it.

Also there all the hooks needed to adapt and change it as you grow.

I would argue that when you reach above a hundred nodes, you will need to optimise yourself in any tech though

josevalim 3 hours ago [–]

FWIW, the soft limit for 70 units is about using the global module, which provides consistent state in the cluster (i.e. when you do a change, it tries to change all nodes at once).

The default Erlang Distribution was shown to scale up to ~300 nodes. After that, using sub-clusters is the way to go and relatively easy to setup: it is a matter of setting the "-connect_all false" flag and calling Node.connect/2 based on the data being fed by a service discovery tool (etcd, k8s, aws config, etc).

PS: this data came from a paper. I am struggling to find it right now but I will edit this once/if I do.

---

https://speakerdeck.com/mrb/distributed-systems-archaeology?slide=42

https://speakerdeck.com/mrb/distributed-systems-archaeology?slide=43

recommends the following two books which he seems to think is interesting at the intersection of distributed systems and programming language design:

Van Roy and Haridi, 2004. Concepts, techniques, and models of computer programming Varela, 2013. Programming distributed computing systems: a foundational approach

---

here's something cheap enough that you could actually have 64k of them for a reasonable price (if assembly and interconnection and everything else were free!):

https://jaycarlson.net/2019/09/06/whats-up-with-these-3-cent-microcontrollers/

however, this is 8-bit stuff, folks. 4k ROM, 256 bytes of RAM, and that's the high-end product ($0.1061 @ 100). 4 registers: 1 PC, 1 flag, 1 stack pointer, 1 accumulator. Not sure if even a Boot interpreter would fit in ROM, certaintly not in RAM. 4k of ROM is only 1k 32-bit words, and 256 bytes of RAM is only 64 32-bit words.

64*64k = 4194304 = 4MiB? words of RAM total across all 64k processors (before accounting for RAM taken up by the Boot implementation).

64k*.1061 = 216*.1061 = $7000

---

So i guess we should really should probably look at:

Tensorflow
Keras
GPUs

as the target platform for massive concurrency before thinking too hard about hardware. In 2020 an Nvidia Ampere GPU is about $12.5k. That has 108 Streaming Multiprocessors (SM), 4 Tensor Cores per SM, 40GB of memory, 192 KB l1 cache per SM. There are 32 threads in a 'warp'. Each tensor core also has an L0 instruction cache, but i don't know its size. Each tensor core has 16384 32-bit registers.

The "GPU boost clock" is 1410 MHz.

https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf

So.. i'm not really sure but i guess that means, if you think of each 'warp' as if it were a virtual processor, you effectively have 108*4*32 = 13824 warps, each with 192k/32 = 6k L1 icache, and 16384/32 = 512 32-bit registers.

So that's not quite 64k but if you multiplexed 5 threads onto each warp, you'd have > 64k virtual threads:

about 64KiB? threads (69120 threads)
about 1KiB? L1 cache per thread (1228 bytes)
over 64 registers per thread (about 102)
clock speed of about 282 MHz

Of course this is ignoring the massive SIMD parallelism that is a key part of these products.

which is a bit more than the $.10 MCUs in the previous section gives you per processor, and at a similar price ($12500 is probably cheaper than $7000 for the MCUs plus the cost to put them together, not to mention the hassle and expense of engineering design). I expect that ultimately a custom designed thing could be much better (could do 'neuromorphic' stuff, making use of low-power, unreliable, slow, sparse but high connectivity to bring down cost and improve overall connectivity) but i don't have the money to manufacture that.

Apparently competitor lines include the AMD Radeon Instinct and the Intel Xeon Phi.

the upshot is that this gives us the sort of specs we should target for 64k threads:

64k threads
1k icache
64 registers

well, i don't think this changes much for us:

we already are thinking of millions of threads, so 64k is less stringent than that
i don't think we'll be able to fit our hot code in 1k icache, so we'll give up on that. But it does highlight the importance of keeping our hot code as small as possible; this is a lot less than the ~64k icache in e.g. AMD zen2 desktop CPUs. It strengthens my belief that
64 registers is more than most ISAs, and less than the 255 virtual registers afforded by 8-bit operand fields. So for Boot we want much less, and in LOVM we may just offer 255 virtual registers. But it does underline previous suggestions that 64 is potentially a good limit for # of registers or # of local variables, when a moderate limit is desired.
otoh didn't the Lua VM have some sort of close-to-64 local variable limit? No, i'm thinking of this: "By default, Lua has a maximum stack frame size of 250. This is encoded as MAXSTACK inllimits.h. The maximum stack frame size in turn limits the maximum number of localsper function, which is set at 200, encoded as LUAI_MAXVARS in luaconf.h. Other limitsfound in the same file include the maximum number of upvalues per function (60), encodedas LUAI_MAXUPVALUES, call depths, the minimum C stack size, etc. Also, with an sBx fieldof 18 bits, jumps and control structures cannot exceed a jump distance of about 131071." [1]
282 MHz is still much faster than we need for 'brain-like performance'; neural spike refractory period is about 2ms, so that 0.5KHz, so this is 564,000 times that. Of course it probably takes many CPU instructions to do what a neuron does in the time period of 1 spike, so that's not as good as 564,000 neurons. But if it did:
216*564000 = 36962304000 (over 36 billion). The brain has ~86 billion. So we're getting there. A mouse brain only has ~75 million neurons, so if it takes 500 CPU instructions to do what a neuron does in 2ms, then this thing would be about as powerful as a mouse brain (before accounting for timeslicing and any other overhead). Which is more than enough to be useful.
that line of reasoning suggests that we target a further x1KiB? multiplexing. So 64M of threads, not 64k threads. We don't have nearly enough registers or icache for that to smoothly translate, however.

---

dom96 21 hours ago [–]

> In the next round of languages, we'll probably have to deal directly with non-shared memory. Totally shared memory in multiprocessors is an illusion maintained by elaborate cache interlocking and huge inter-cache bandwidth. That has scaling limits. Future languages will probably have to track which CPUs can access which data. "Thread local" and "immutable" are a start.

Interesting prediction. It is precisely the model of threading which Nim follows, global variables are thread local and sharing them is restricted. I'm not sure if the Nim designer thought of it in the terms you've described, but I will certainly point this out to him :)

nicoburns 20 hours ago [–]

It's the route JavaScript? is taking too. JavaScript? has traditionally been single threaded, so all the existing types are single-thread only. Multi-threading is being introduced (still very experimental atm) and the only way to have shared memory is through dedicated "shared" types.

---

" Trying to map a high-level language to an architecture with different semantics is difficult, but not impossible. It's sometimes possible, for example, to extract parallelism from purely sequential code. Consider something like this loop in C:

for (int i=0 ; i<100 ; i++) { j[i] = sin(k[i]); }

At first glance, you might think that you could do each loop iteration entirely in parallel. If you were generating code for a vector unit, you might rewrite this as 25 loop iterations, each doing a vector-sine. For a GPU, you could generate a small kernel that did a single sine operation, and then run it on 100 inputs all in parallel. (For such a simple program, the cost of copying data to and from the GPU would make this a silly thing to do.)

But what happens if k and j are both pointers to the same array, with j being offset forward by one element? In this case, the loop will fill each entry in the array with the sine of the previous element. Because this result is entirely valid in C, the compiler can't optimize this code easily. It could bracket the optimized version in a test that the two ranges didn't overlap, generating code for both cases, but this situation gets particularly complex when you consider running code on multiple cores, especially when they have different instruction sets.

There's a very simple solution: Ask the programmer to provide a bit more information. This is the approach that OpenMP? uses for concurrency. OpenMP? defines a set of annotations that you can add to a program to specify concurrency potential. In the example above, you could specify that j and k are not aliased, and the compiler would then be able to make the loop parallel. " -- [2]

---

Future Challenges

At the moment, CPU and GPU code are generally separated in a program, for these simple reasons: Both CPU and GPU have their own (local) memory. Also, while both are fast, copying between CPU and GPU is expensive.

This relationship is likely to change over the next few years....

Unfortunately, OpenCL? isn't the solution. Because it's designed entirely around the copy-execute-copy model, it adds a lot of overhead (in terms of source-code complexity) if you don't need the copy. For low-level programming, expect to see languages like C introduce something like futures. These are already in the C++0x specification, making it trivial to implement asynchronous calls, synchronizing when you try to access the result. " [3]

---

https://en.wikipedia.org/wiki/Setcontext

---