proj-oot-ootConcurrencyNotes9

don't know where to put this, but it talks about CLOS networks, crossbars, and butterfly networks, which is a topic i had some lost notes on years ago. I'm not so much interested in this post itself; it's just a reminder to look up those terms again someday:

brandmeyer on Feb 20, 2019 [–]

I went on a bit of a research expedition to see if there was something that scaled better than a general permutation instruction for SIMD machines. General permute scales at O(log(N)) in gate delay+ and O(N^2 * log(N)) in area, where N is the vector length. It's a full crossbar, but the fanout on the wires adds an additional log(N) factor in buffers.

For a while, it seemed like a set of instructions based on a rank-2 CLOS network (aka butterfly network) would get the job done. It scales at O(log(N)) in gate delay+ and O(N * log(N)) in area, and is very capable. Fanout is O(1). You can do all kinds of inter- and intra-lane swaps, rotations, and shifts with it. You can even do things like expand packed 16-bit RGB into 32-bit uniform RGBA.

But things like that SMH algorithm are definitely out of scope: Each input bit can only appear in at most one output location with the butterfly. So the cost to replicate a prefix scales at O(repetitions), which is unfortunate. Some algorithms based on using general shuffle are also relying on the use of PSHUFB's functionality as a complete lookup table, which the butterfly network can't do, either.

My conclusion was that you're basically stuck with a general permute instruction for a modern SIMD ISA, scaling be damned.

+ The latency scaling is somewhat misleading thanks to wire delay - they are both O(N) in wire delay.

-- https://news.ycombinator.com/item?id=19210123
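
a toy software model of the butterfly idea, to remind myself what it does (my own sketch in Rust, not from the post; the control-bit layout is made up): log2(N) stages, and at stage s lane i may swap with lane i XOR 2^s. Each input ends up in exactly one output slot, which is why it can't replicate a prefix the way a general permute can:

// Toy model of a butterfly (rank-2 Clos) network on N = 2^m lanes.
// controls[s] holds one bit per 2x2 switch for stage s.
fn butterfly_permute(data: &mut [u32], controls: &[Vec<bool>]) {
    let n = data.len();
    assert!(n.is_power_of_two());
    assert_eq!(controls.len(), n.trailing_zeros() as usize);
    for (s, stage) in controls.iter().enumerate() {
        let dist = 1usize << s;
        let mut switch = 0;
        for i in 0..n {
            if i & dist == 0 {
                // lanes i and i ^ dist form one 2x2 switch at this stage
                if stage[switch] {
                    data.swap(i, i ^ dist);
                }
                switch += 1;
            }
        }
    }
}

fn main() {
    // crossing every switch at every stage sends index i to i XOR 0b111,
    // i.e. it reverses the vector; note that no element is ever duplicated
    let mut v: Vec<u32> = (0..8).collect();
    let controls = vec![vec![true; 4], vec![true; 4], vec![true; 4]];
    butterfly_permute(&mut v, &controls);
    println!("{:?}", v); // [7, 6, 5, 4, 3, 2, 1, 0]
}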


dragontamer on Feb 20, 2019 [–]

GPU Coders haven't changed their code in the last 10 years, even as NVidia changed their architecture repeatedly.

PTX Assembly from NVidia still runs on today's architectures. I think this variable-length issue they focus on so much is a bit of a red-herring: NVidia always was 32-way SIMD but the PTX Code remains portable nonetheless.

The power is that PTX Assembly (and AMD's GCN Assembly) has a scalar-model of programming, but its execution is vectorized. So you write scalar code, but the programmer knows (and assumes it to be) in a parallel context. EDIT: I guess PTX is technically interpreted: the number of registers is not fixed, etc. etc. Nonetheless, the general "SIMD-ness" of PTX is static, and has survived a decade of hardware changes.

There are a few primitives needed for this to work: OpenCL's "Global Index" and "Local Index" for example. "Global Index" is where you are in the overall workstream, while "Local Index" is useful because intra-workgroup communications are VERY VERY FAST.

And... that's about it? Really. I guess there are a bunch of primitives (the workgroup swizzle operations, "ballot", barrier, etc. etc.), but the general GPU model is actually kinda simple.


I see a lot of these CPU architecture changes, but none of them really seem to be trying to learn from NVidia or AMD's model. A bit of PTX-assembly or GCN Assembly probably would do good to the next generation of CPU Architects.
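
a rough CPU-side sketch of the two indices dragontamer mentions (my own illustration, not any real GPU API; GROUP_SIZE and the thread-per-workgroup emulation are invented): every work-item runs the same scalar body and differs only in its global/local index:

use std::thread;

const GROUP_SIZE: usize = 4; // stand-in for a warp/workgroup size
const NUM_GROUPS: usize = 2;

fn main() {
    let mut out = vec![0.0f32; GROUP_SIZE * NUM_GROUPS];
    // emulate each workgroup with a thread; a GPU runs the lanes in lockstep
    thread::scope(|s| {
        for (group_id, chunk) in out.chunks_mut(GROUP_SIZE).enumerate() {
            s.spawn(move || {
                for local_id in 0..GROUP_SIZE {
                    // "Global Index": where you are in the overall workstream
                    let global_id = group_id * GROUP_SIZE + local_id;
                    // the "kernel": the same scalar body for every work-item,
                    // parameterized only by its indices
                    chunk[local_id] = (global_id * global_id) as f32;
                }
            });
        }
    });
    println!("{:?}", out);
}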

Const-me on Feb 21, 2019 [–]

GPU programming model is simple because it's limited. Low single-threaded performance, high latency masked by threads switching, inefficient branching, limited memory model (can't allocate RAM, write access is very limited).

If you're happy with these limitations, write OpenCL code and run it on CPU. Will work much faster than scalar code but likely slower than a GPU would.

the rest of this subthread goes very deep but is interesting:

https://news.ycombinator.com/item?id=19212817

---

friendlysock 4 days ago:

Digging into IPC a bit, I feel like Windows actually had some good stuff to say on the matter.

I think the design space looks something like:

    Messages vs streams (here is a cat picture vs here is a continuing generated sequence of cat pictures)
    Broadcast messages vs narrowcast messages (notify another app vs notify all apps)
    Known format vs unknown pile of bytes (the blob i’m giving you is an image/png versus lol i dunno here’s the size of the bytes and the blob, good luck!)
    Cancellable/TTL vs not (if this message is not handled by this time, don’t deliver it)
    Small messages versus big messages (here is a thumbnail of a cat versus the digitized CAT scan of a cat)

I’m sure there are other axes, but that’s maybe a starting point. Also, fuck POSIX signals. Not in my OS.

---

https://github.com/StephenCleary/AsyncEx

--- ruby ractor

https://bugs.ruby-lang.org/issues/17100

---

ruby ractors

"

Ractor (experimental)

Ractor is an Actor-model like concurrent abstraction designed to provide a parallel execution feature without thread-safety concerns.

You can make multiple ractors and you can run them in parallel. Ractor enables you to make thread-safe parallel programs because ractors cannot share normal objects. Communication between ractors is supported by message passing.

To limit sharing of objects, Ractor introduces several restrictions to Ruby’s syntax (without multiple Ractors, there is no restriction).

The specification and implementation are not mature and may be changed in the future, so this feature is marked as experimental and shows an “experimental feature” warning on the first Ractor.new.

The following small program calculates n.prime? (n is a relatively big integer) in parallel with two ractors. You can confirm that, on a parallel machine, it runs about 2x faster than the sequential program.

require 'prime'

# n.prime? with sent integers in r1, r2 run in parallel
r1, r2 = *(1..2).map do
  Ractor.new do
    n = Ractor.recv
    n.prime?
  end
end

# send parameters
r1.send 2**61 - 1
r2.send 2**61 + 15

# wait for the results of expr1, expr2
p r1.take #=> true
p r2.take #=> true

See doc/ractor.md for more details.

Fiber Scheduler

Fiber#scheduler is introduced for intercepting blocking operations. This allows for light-weight concurrency without changing existing code. Watch “Don’t Wait For Me, Scalable Concurrency for Ruby 3” for an overview of how it works.

Currently supported classes/methods:

    Mutex#lock, Mutex#unlock, Mutex#sleep
    ConditionVariable#wait
    Queue#pop, SizedQueue#push
    Thread#join
    Kernel#sleep
    Process.wait
    IO#wait, IO#read, IO#write and related methods (e.g. #wait_readable, #gets, #puts and so on).
    IO#select is not supported.

(Explain Async gem with links). This example program will perform several HTTP requests concurrently:

(Explain this:)

    async is outer gem
    async uses this new feature

require 'async'
require 'net/http'
require 'uri'

Async do
  ["ruby", "python", "c"].each do |topic|
    Async do
      Net::HTTP.get(URI "https://www.google.com/search?q=#{topic}")
    end
  end
end

" -- [1]

---

https://tokio.rs/tokio/tutorial/hello-tokio

---

https://en.wikipedia.org/wiki/Petri_net

---

https://github.com/jimblandy/context-switch

" Comparison of Rust async and Linux thread context switch time and memory use

These are a few programs that try to measure context switch time and task memory use in various ways. In summary:

    A context switch takes around 0.2µs between async tasks, versus 1.7µs between kernel threads. But this advantage goes away if the context switch is due to I/O readiness: both converge to 1.7µs. The async advantage also goes away in our microbenchmark if the program is pinned to a single core. So inter-core communication is something to watch out for.
    Creating a new task takes ~0.3µs for an async task, versus ~17µs for a new kernel thread.
    Memory consumption per task (i.e. for a task that doesn't do much) starts at around a few hundred bytes for an async task, versus around 20KiB (9.5KiB user, 10KiB kernel) for a kernel thread. This is a minimum: more demanding tasks will naturally use more.
    It's no problem to create 250,000 async tasks, but I was only able to get my laptop to run 80,000 threads (4 core, two way HT, 32GiB).

These are probably not the limiting factors in your application, but it's nice to know that the headroom is there. "
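
a quick way to reproduce the task-creation comparison for myself (my own sketch, not the benchmark code from that repo; assumes a Cargo project with tokio = { version = "1", features = ["full"] }):

use std::time::Instant;

fn main() {
    const N: usize = 10_000;

    // spawn and join N trivial kernel threads
    let t = Instant::now();
    let threads: Vec<_> = (0..N).map(|_| std::thread::spawn(|| ())).collect();
    for h in threads {
        h.join().unwrap();
    }
    println!("threads: {:?} per spawn+join", t.elapsed() / N as u32);

    // spawn and await N trivial async tasks on a Tokio runtime
    let rt = tokio::runtime::Runtime::new().unwrap();
    let t = Instant::now();
    rt.block_on(async {
        let tasks: Vec<_> = (0..N).map(|_| tokio::spawn(async {})).collect();
        for h in tasks {
            h.await.unwrap();
        }
    });
    println!("async tasks: {:?} per spawn+await", t.elapsed() / N as u32);
}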

---

c-cube 29 hours ago:

The classic “what color is your function” blog post describes what is, I think, such a pain? You have to choose in your API whether a function can block or not, and it doesn’t compose well.

kevinc 29 hours ago:

I read that one, and I took their point. All this tends to make me wonder if Swift (roughly, Rust minus borrow checker plus Apple backing) is doing the right thing by working on async/await now.

But so far I don’t mind function coloring as I use it daily in TypeScript?. In my experience, functions that need to be async tend to be the most major steps of work. The incoming network request is async, the API call it makes is async, and then all subsequent parsing and page rendering aren’t async, but can be if I like.

Maybe, like another commenter said, whether async/await is a net positive has more to do with adapting the language to a domain that isn’t otherwise its strong suit.

kristoff 29 hours ago:

You might be interested in knowing that Zig has async/await but there is no function coloring problem.

https://kristoff.it/blog/zig-colorblind-async-await/

kevinc, edited 28 hours ago:

Indeed this is an interesting difference at least in presentation. Usually, async/await provides sugar for an existing concurrency type like Promise or Task. It doesn’t provide the concurrency in the first place. Function colors are then a tradeoff for hiding the type, letting you think about the task and read it just like plain synchronous code. You retain the option to call without await, such that colors are not totally restrictive, and sometimes you want to use the type by hand; think Promise.all([…]).

Zig seems like it might provide all these same benefits by another method, but it’s hard to tell without trying it. I also can’t tell yet if the async frame type is sugared in by the call, or by the function definition. It seems like it’s a sort of generic, where the nature of the call will specialize it all the way down. If so, neat!

kristoff 20 hours ago:
    It seems like it’s a sort of generic, where the nature of the call will specialize it all the way down. If so, neat!

That’s precisely it!

...Part of the complexity of async/await in Zig is that a single library implementation can be used in both blocking and evented mode, so in the end it should never be the case that you can only find an async version of a client library, assuming authors are willing to do the work, but even if not, support can be added incrementally by contributors interested in having their use case supported.

---

---

 I don’t know how async/await works exactly, but it definitely has a clear use case.
 ..

I’d even go further: given the performance of our machines (high latencies and high throughputs), I believe non-blocking I/O at every level is the only reasonable way forward. Not just for networking, but for disk I/O, filling graphics card buffers, everything. Language support for this is becoming as critical as generics themselves. We laughed “lol no generics” at Go, but now I do believe it is time to start laughing “lol no async I/O” as well. The problem now is to figure out how to do it. [8]

---

adaszko 31 hours ago:

This is amazing. I had similar feelings (looking previously at JS/Scala futures) when the plans for async/await were floating around but decided to suspend my disbelief because of how good previous design decisions in the language were. Do you think there’s some other approach to concurrency fit for a runtime-less language that would have worked better?

spacejam, edited 30 hours ago:

My belief is generally that threads as they exist today (not as they existed in 2001 when the C10K problem was written, but nevertheless keeps existing as zombie perf canon that no longer refers to living characteristics) are the nicest choice for the vast majority of use cases, and that Rust-style executor-backed tasks are inappropriate even in the rare cases where M:N pays off in languages like Go or Erlang (pretty much just a small subset of latency-bound load balancers that don’t perform very much CPU work per socket). When you start caring about millions of concurrent tasks, having all of the sources of accidental implicit state and interactions of async tasks is a massive liability.

I think The ADA Ravenscar profile (see chapter 2 for “motivation” which starts at pdf page 7 / marked page 3) and its successful application to safety critical hard real time systems is worth looking at for inspiration. It can be broken down to this set of specific features if you want to dig deeper. ADA has a runtime but I’m kind of ignoring that part of your question since it is suitable for hard real-time. In some ways it reminds me of an attempt to get the program to look like a pretty simple petri net.

I think that message passing and STM are not utilized enough, and when used judiciously they can reduce a lot of risk in concurrent systems. STM can additionally be made wait-free and thus suitable for use in some hard real-time systems.

I think that Send and Sync are amazing primitives, and I only wish I could prove more properties at compile time. The research on session types is cool to look at, and you can get a lot of inspiration about how to encode various interactions safely in the type system from the papers coming out around this. But it can get cumbersome and thus create more risks to the overall engineering effort than it solves if you’re not careful.

A lot of the hard parts of concurrency become a bit easier when we’re able to establish maximum bounds on how concurrent we’re going to be. Threads have a little bit more of a forcing function to keep this complexity minimized due to the fact that spawning is fallible due to often under-configured system thread limits. Having fixed concurrency avoids many sources of bugs and performance issues, and enables a lot of relatively unexplored wait-free algorithmic design space that gets bounded worst-case performance (while still usually being able to attempt a lock-free fast path and only falling back to wait-free when contention picks up). Structured concurrency often leans into this for getting more determinism, and I think this is an area with a lot of great techniques for containing risk.

In the end we just have code and data and risk. It’s best to have a language with forcing functions that pressure us to minimize all of these over time. Languages that let you forget about accruing data and code and risk tend to keep people very busy over time. Friction in some places can be a good thing if it encourages less code, less data, and less risk.

c-cube 29 hours ago:

I like rust and I like threads, and do indeed regret that most libraries have been switching to async-only. It’s a lot more complex and almost a new sub-language to learn.

That being said, I don’t see a better technical solution for rust (i.e. no mandatory runtime, no implicit allocations, no compromise on performance) for people who want to manage millions of connections. Sadly a lot of language design is driven by the use case of giant internet companies in the cloud and that’s a problem they have; not sure why anyone else cares. But if you want to do that, threads start getting in the way at 10k threads-ish? Maybe 100k if you tune linux well, but even then the memory overhead and latency are not insignificant, whereas a future can be very tiny.

Ada’s tasks seem awesome but to the best of my knowledge they’re for very limited concurrency (i.e the number of tasks is small, or even fixed beforehand), so it’s not a solution to this particular problem.

Of course async/await in other languages with runtimes is just a bad choice. Python in particular could have gone with “goroutines” (for lack of a better word) like stackless python already had, and avoid a lot of complexity. (How do people still say python is simple?!). At least java’s Loom project is heading in the right direction.

---

ngrilly 24 hours ago:

The green process abstraction seems to work well enough in Erlang to serve tens of thousands of concurrent connections. Why do you think the async/await abstraction won’t work for Rust? (I understand they are very different solutions to a similar problem.)

c-cube 16 hours ago:

Not who you’re asking, but the reason why rust can’t have green threads (as it used to have pre-1.0, and it was scraped), as far as I undertand:

Rust is shooting for C or C++-like levels of performance, with the ability to go pretty close to the metal (or close to whatever C does). This adds some constraints, such as the necessity to support some calling conventions (esp. for C interop), and precludes the use of a GC. I’m also pretty sure the overhead of the probes inserted in Erlang’s bytecode to check for reduction counts in recursive calls would contradict that (in rust they’d also have to be in loops, btw); afaik that’s how Erlang implements its preemptive scheduling of processes. I think Go has split stacks (so that each goroutine takes less stack space) and some probes for preemption, but the costs are real and in particular the C FFI is slower as a result. (saying that as a total non-expert on the topic).

I don’t see why async/await wouldn’t work… since it does; the biggest issues are additional complexity (a very real problem), fragmentation (the ecosystem hasn’t converged yet on a common event loop), and the lack of real preemption which can sometimes cause unfairness. I think Tokio hit some problems on the unfairness side.

notriddle, edited 1 hour ago:

The biggest problem with green threads is literally C interop. If you have tiny call stacks, then whenever you call into C you have to make sure there’s enough stack space for it, because the C code you’re calling into doesn’t know how to grow your tiny stack. If you do a lot of C FFI, then you either lose the ability to use small stacks in practice (because every “green” thread winds up making an FFI call and growing its stack) or implementing some complex “stack switching” machinery (where you have a dedicated FFI stack that’s shared between multiple green threads).

Stack probes themselves aren’t that big of a deal. Rust already inserts them sometimes anyway, to avoid stack smashing attacks.

In both cases, you don’t really have zero-overhead C FFI any more, and Rust really wants zero-overhead FFI.

    I think Go has split stacks (so that each goroutine takes less stack space)

No they don’t any more. Split Stacks have some really annoying performance cliffs. They instead use movable stacks: when they run out of stack space, they copy it to a larger allocation, a lot like how Vec works, with all the nice “amortized linear” performance patterns that result.

andyc 19 hours ago:

Two huge differences:

    Erlang’s data structures are immutable (and it has much slower single threaded speed).
    Erlang doesn’t have threads like Rust does.

That changes everything with regard to concurrency, so you can’t really compare the two. A comparison to Python makes more sense, and Python async has many of the same problems (mutable state, and the need to compose with code and libraries written with other concurrency models)

---

https://github.com/jimblandy/context-switch/issues/1

(linked from https://lobste.rs/s/eppfav/why_i_rewrote_my_rust_keyboard_firmware#c_eibcnp )

---

https://news.ycombinator.com/item?id=26406989

in this discussion on a post about complaints with Rust async, some commentators, such as newpavlov, give what they think would have been a better solution. And the_duke says that "Completion is the right choice for languages with a heavy runtime", which is what we are aiming for

newpavlov describes the completion model as:

" In a completion-based model (read io-uring, but I think IOCP behaves similarly, though I am less familiar with it) it's a runtime who "notifies" tasks about completed IO requests. In io-uring you have two queues represented by ring buffers shared with OS. You add submission queue entries (SQE) to the first buffer which describe what you want for OS to do, OS reads them, performs the requested job, and places completion queue events (CQEs) for completed requests into the second buffer.

So in this model a task (Future in your terminology) registers SQE (the registration process may be proxied via user-space runtime) and suspends itself. Let's assume for simplicity that only one SQE was registered for the task. After OS sends CQE for the request, runtime finds a correct state transition function (via meta-information embedded into SQE, which gets mirrored to the relevant CQE) and simply executes it, the requested data (if it was a read) will be already filled into a buffer which is part of the FSM state, so no need for additional syscalls or interactions with the runtime to read this data!

If you are familiar with embedded development, then it should sound quite familiar, since it's roughly how hardware interrupts work as well! You register a job (e.g. DMA transfer), dedicated hardware block does it, and notifies a registered callback after the job was done. Of course, it's quite an oversimplification, but fundamental similarity is there. "
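
to keep the shape of that in my head: a toy model of the submission/completion flow using plain channels and a thread standing in for the kernel (all names invented; real io-uring shares ring buffers with the OS, and the runtime would resume a suspended state machine instead of blocking the way main does here):

use std::sync::mpsc;
use std::thread;

struct Sqe { user_data: u64, path: String }                     // "please read this file"
struct Cqe { user_data: u64, result: std::io::Result<Vec<u8>> }

fn main() {
    let (sq_tx, sq_rx) = mpsc::channel::<Sqe>();
    let (cq_tx, cq_rx) = mpsc::channel::<Cqe>();

    // the "kernel": drains the submission queue, does the work, posts completions
    thread::spawn(move || {
        for sqe in sq_rx {
            let result = std::fs::read(&sqe.path); // the data is ready *before* the CQE arrives
            cq_tx.send(Cqe { user_data: sqe.user_data, result }).unwrap();
        }
    });

    // the "runtime": submit a request tagged with user_data, then wait for completions
    sq_tx.send(Sqe { user_data: 1, path: "/etc/hostname".into() }).unwrap();
    drop(sq_tx); // no more submissions, so the "kernel" loop will end

    for cqe in cq_rx {
        // user_data tells the runtime which suspended task/state machine to step
        println!("task {} completed: {:?} bytes read",
                 cqe.user_data,
                 cqe.result.map(|b| b.len()));
    }
}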

---

" > IOCP models tend to heavily rely on callbacks and closures

While perhaps higher level libraries are written that way, I can’t think of a reason why the primitive components of IOCP require callbacks and closures. The “poll for io-readiness and then issue non-blocking IO” and “issue async IO and then poll for completion” models can be implemented in a reactor pattern in a similar manner. It is just a question of whether the system call happens before or after the reactor loop.

EDIT: Reading some of the other comments and thinking a bit, one annoying thing about IOCP is the cancelation model. With polling IO readiness, it is really easy to cancel IO and close a socket: just unregister from epoll and close it. With IOCP, you will have to cancel the in-flight operation and wait for the completion notification to come in before you can close a socket (if I understand correctly). "

---

jedisct1 18 hours ago [–]

And Zap (scheduler for Zig) is already faster than Tokio.

Zig and other recent languages have been invented after Rust and Go, so they could learn from them, while Rust had to experiment a lot in order to combine async with borrow checking.

So, yes, the async situation in Rust is very awkward, and doing something beyond a Ping server is more complicated than it could be. But that’s what it takes to be a pioneer.


---

speaking about issues with Rust's async/await:

" FWIW, I'd bet almost anything that this problem isn't solvable in any general way without linear types, at which point I bet it would be a somewhat easy modification to what Rust has already implemented. (Most of my development for a long time now has been in C++ using co_await with I/O completion and essentially all of the issues I run into--including the things analogous to "async Drop", which I would argue is actually the same problem as being able to drop a task itself--are solvable using linear types, and any other solutions feel like they would be one-off hacks.) Now, the problem is that the Rust people seem to be against linear types (and no one else is even considering them), so I'm pretty much resigned that I'm going to have to develop my own language at some point (and see no reason to go too deep into Rust in the mean time) :/. "

from this thread: https://news.ycombinator.com/item?id=26407507

---

more on how Rust didn't choose a completion-based future model because (a) it doesn't want allocations, and (b) it wants to be able to drop stuff rather than poll futures to completion before dropping them:

"

At least part of the goal here must be to avoid allocations and reference counting. If you don't care about that, then the design could have been to 'just' pass around atomically-reference-counted buffers everywhere, including as the buffer arguments to AsyncRead/AsyncWrite. That would avoid the need for AsyncBufRead to be separate from AsyncRead. It wouldn't prevent some unidiomaticness from existing – you still couldn't, say, have an async function do a read into a Vec, because a Vec is not reference counted – but if the entire async ecosystem used reference counted buffers, the ergonomics would be pretty decent.

But we do care about avoiding allocations and reference counting, resulting in this problem. However, that means a completion-based model wouldn't really help, because a completion-based model essentially requires allocations and reference counting for the futures themselves.

To me, the question is whether Rust could have avoided this with a different polling-based model. It definitely could have avoided it with a model where the allocations for async functions are always managed by the system, just like the stacks used for regular functions are. But that would lose the elegance of async fns being 'just' a wrapper over a state machine. Perhaps, though, Rust could also have avoided it with just some tweaks to how Pin works [1]… but I am not sure whether this is actually viable. If it is, then that might be one motivation for eventually replacing Pin with a different construct, albeit a weak motivation by itself.

[1] https://www.reddit.com/r/rust/comments/dtfgsw/iou_rust_bindi...


withoutboats2 19 hours ago [–]

> I am not sure whether this is actually viable.

Having investigated this myself, I would be very surprised to discover that it is.

The only viable solution to make AsyncRead zero cost for io-uring would be to have required futures to be polled to completion before they are dropped. So you can give up on select and most necessary concurrency primitives. You really want to be able to stop running futures you don't need, after all.

If you want the kernel to own the buffer, you should just let the kernel own the buffer. Therefore, AsyncBufRead. This will require the ecosystem to shift where the buffer is owned, of course, and that's a cost of moving to io-uring. Tough, but those are the cards we were dealt.


" -- https://news.ycombinator.com/item?id=26407958

---

some discussion on Rust async/await:

https://news.ycombinator.com/item?id=26406989

i only read the first few comments so far

---

Animats 12 hours ago [–]

"But the biggest potential is in ability to fearlessly parallelize majority of Rust code, even when the equivalent C code would be too risky to parallelize. In this aspect Rust is a much more mature language than C."

Yes. Today, I integrated two parts of a 3D graphics program. One refreshes the screen and lets you move the viewpoint around. The other loads new objects into the scene. Until today, all the objects were loaded, then the graphics window went live. Today, I made those operations run in parallel, so the window comes up with just the sky and ground, and over the next few seconds, the scene loads, visibly, without reducing the frame rate.

This took about 10 lines of code changes in Rust. It worked the first time it compiled.


phkahler 5 hours ago [–]

>> One refreshes the screen and lets you move the viewpoint around. The other loads new objects into the scene.

How did you do that in Rust? Doesn't one of those have to own the scene at a time? Or is there a way to make that exclusive ownership more granular?


brink 4 hours ago [–]

The simplest (and often best) option is to use the Arc<Mutex<MyStruct>> pattern.

The Arc is an atomic reference counter that allows multiple ownership. And the nested Mutex enforces only one mutable borrow at a time.

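a minimal sketch of that pattern (Scene and the timings are invented; this is just the shape of it, not Animats' actual code): the loader thread and the render loop share one Arc<Mutex<Scene>> and each takes the lock briefly:

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

#[derive(Default)]
struct Scene {
    objects: Vec<String>,
}

fn main() {
    let scene = Arc::new(Mutex::new(Scene::default()));

    // loader thread: adds objects to the scene over time
    let loader_scene = Arc::clone(&scene);
    let loader = thread::spawn(move || {
        for name in ["sky", "ground", "house", "tree"] {
            loader_scene.lock().unwrap().objects.push(name.to_string());
            thread::sleep(Duration::from_millis(50)); // pretend disk/network I/O
        }
    });

    // "render loop": takes the lock briefly each frame and draws whatever is loaded
    for _frame in 0..10 {
        let n = scene.lock().unwrap().objects.len();
        println!("frame renders {n} objects");
        thread::sleep(Duration::from_millis(25));
    }

    loader.join().unwrap();
}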

adamnemecek 1 hour ago [–]

I'm not sure what your architecture looks like, but you might not even need to lock things. I find that using mpsc channels allows me to get around like 60% of locking. Essentially, you have some sort of main loop, then you spawn a thread, load whatever you need there and then send it to the main thread over mpsc. The main thread handles it on the next iteration of the main loop.

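and a sketch of the mpsc version of the same idea (again, names and timings invented): the loader owns each object until it's ready, then hands it to the main loop, so nothing needs a lock:

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // loader thread: builds objects and sends them when done
    thread::spawn(move || {
        for name in ["house", "tree", "cat"] {
            thread::sleep(Duration::from_millis(50)); // pretend loading work
            tx.send(name.to_string()).unwrap();
        }
    });

    // main loop: each frame, drain whatever arrived since last time, never blocking
    for _frame in 0..10 {
        for obj in rx.try_iter() {
            println!("adding {obj} to the scene");
        }
        thread::sleep(Duration::from_millis(25)); // render the frame
    }
}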

---

matklad 10 hours ago:

Strong plus +1. I wish, for application development, it was possible to remove lifetimes, borrow-checker and manual memory management (which are not that useful for this domain) and to keep only fearless concurrency (which is useful even (more so?) in high-level languages). Alas, it seems that Rust’s thread-safety banana is tightly attached to the rest of the jungle.

kprotty 5 hours ago:

FWIW. Ponylang has exactly this:

    no lifetimes, data is either known statically via simple escape analysis or traced at runtime
    no borrow checker, although it does have a stricter reference capability system
    no manual memory management: uses a deterministically invoked (excluding shared, send data) Mark-and-no-sweep GC
    has the equivalent Send and Sync traits in its reference capability system which provide the same static guarantees

As with everything, though, it has its own trade-offs, whether in its ref-cap system, the lack of explicit control in the system, etc.

---

https://matklad.github.io//2021/03/12/goroutines-are-not-significantly-smaller-than-threads.html

disputes the common wisdom that greenthreads take up a lot less memory than threads (he finds greenthreads take up about .25 the memory of normal threads)

discussion: https://news.ycombinator.com/item?id=26440334

---

.NET concurrency:

https://github.com/dotnet/orleans https://microsoft.github.io/coyote/

---

"

Message Passing Interface (MPI)

Programs that need to spread their computation amongst many compute cores, like climate models, often use the Message Passing Interface (MPI) library. The MPI library can be seen as a gateway to using more computing power. Developers in High Performance Computing (HPC) know that MPI is god and all MPIs children were born in its image. One time I heard two HPC developers say “this is the way”, like the Mandalorian, in reference to their agreement to use MPI for a C++ project prototype. ... If you want to read more on the impact of MPI in HPC in general, I highly recommend Johnathan Dursi’s post on why “MPI is killing HPC”. https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html " -- [9]

---

[10]

" Why MPI was so successful ... It started with routines for explicitly sending and receiving messages, very useful collective operations (broadcast, reduce, etc.), and routines for describing layout of data in memory to more efficiently communicate that data. It eventually added sets of routines for implicit message passing (one-sided communications) and parallel I/O, but remained essentially at the transport layer, with sends and receives and gets and puts operating on strings of data of uniform types.

Why MPI is the wrong tool for today

Not only has MPI stayed largely the same in those 25 years, the idea that “everyone uses MPI” has made it nearly impossible for even made-in-HPC-land tools like Chapel or UPC to make any headway, much less quite different systems like Spark or Flink, meaning that HPC users are largely stuck with using an API which was a big improvement over anything else available 25 years ago, but now clearly shows its age. Today, MPI’s approach is hardly ever the best choice for anyone. MPI is at the wrong level of abstraction for application writers

Programming at the transport layer, where every exchange of data has to be implemented with lovingly hand-crafted sends and receives or gets and puts, is an incredibly awkward fit for numerical application developers, who want to think in terms of distributed arrays, data frames, trees, or hash tables. Instead, with MPI, the researcher/developer needs to manually decompose these common data structures across processors, and every update of the data structure needs to be recast into a flurry of messages, synchronizations, and data exchange. ... At the end of this post are sample programs, written as similarly as possible, of solving the problem in MPI, Spark, and Chapel. I’d encourage you to scroll down and take a look. The lines of code count follows:

    Framework      Lines   Lines of Boilerplate
    MPI+Python      52      20+
    Spark+Python    28       2
    Chapel          20       1

...

In Chapel, the basic abstraction is of a domain – a dense array, sparse array, graph, or what have you – that is distributed across processors. In Spark, it is a resilient distributed dataset, a table distributed in one dimension. Either of those can map quite nicely onto various sorts of numerical applications. In MPI, the “abstraction” is of a message. And thus the huge overhead in lines of code.

...

MPI is less than you need at extreme levels of parallelism

...

fault-tolerance, and an ability to re-balance the computation on the altered set of resources, becomes essential....If the highest-level abstraction a library supports is the message, there is no way that the library can know anything about what your data structures are or how they must be migrated.

Fault-tolerance and adaptation are of course genuinely challenging problems; but (for instance) Charm++ (and AMPI atop it) can do adaptation, and Spark can do fault tolerance. But that’s because they were architected differently.

"

---

https://www.datadoghq.com/blog/engineering/introducing-glommio/

https://github.com/TimelyDataflow/differential-dataflow

---

" Querying a database

In Node, let's say you want to query your database in a REPL. Here's what it looks like:

> const { Client } = require("pg");
undefined
> client = new Client(/* connection string */);
undefined
> client.query("select now()");
Promise { <pending> }
>

Something about this always felt so depressing. Rationally, I could justify it: nothing lost, nothing gained. If I wanted to board Node's async rocket to the moon, I had to accept inferior ergonomics in a situation like this. I get a promise, not a result, so I need to add additional logic to handle the promise and get a result.

And if only life were so simple:

> await client.query("select now()");
REPL11:1
await client.query("select now()");
^^^^^
SyntaxError: await is only valid in async function

Alas, begrudged acceptance, we all got used to "the new way of doing things here."

At this moment, thanks to the proliferation of async/await, I can no longer remember the API for a Promise instance. So I'll just regress all the way to a callback. Which is fortunately possible, because JavaScript's "no man left behind" principle ensures callbacks will be well-supported for my grandchildren:

> client.query('select now()', (err, res) => console.log(res))
undefined

Still no result. Five minutes of head scratching and banging ensues until I realize – for the love of von Neumann – I've forgotten to call client.connect(). If you don't call client.connect() before calling client.query(), the pg client will silently push the query to an internal queue. This would be more infuriating if it wasn't so understandable – remember the flawed foundation we're building on here.

So, finally, I call connect() and then query() and I get a result (somewhere in there...):

> client.connect()
undefined
> client.query('select now()', (err, res) => console.log(res))
Result {
  command: 'SELECT',
  rowCount: 1,
  oid: null,
  rows: [ { now: 2021-03-20T19:32:42.621Z } ],
  fields: [ Field { name: 'now', tableID: 0, columnID: 0, dataTypeID: 1184,
      dataTypeSize: 8, dataTypeModifier: -1, format: 'text' } ],
  _parsers: [ [Function: parseDate] ],
  _types: TypeOverrides { _types: { getTypeParser: [Function: getTypeParser],
      setTypeParser: [Function: setTypeParser], arrayParser: [Object],
      builtins: [Object] }, text: {}, binary: {} },
  RowCtor: null,
  rowAsArray: false
}

I'll never forget, after years of only writing Node programs, the moment I first made a SQL query in Elixir's REPL, iex.

I start a connection to a remote Postgres database:

iex()> opts = # connection opts
iex()> {:ok, conn} = Postgrex.start_link(opts)
{:ok, #PID<0.1330.0>}
iex()> ▊

I type in my query:

iex()> Postgrex.query(conn, "select now()")▊

I press the Enter key:

iex()> Postgrex.query(conn, "select now()") ▊

And you know what the REPL does next? It hangs. For a tiny fraction of a second, the beautiful little bugger just hangs. I'll never forget, it looked just like this, for a few milliseconds:

iex()> Postgrex.query(conn, "select now()") ▊

I very perceptibly gasped. And then, my result:

iex()> Postgrex.query(conn, "select now()")
{:ok,
 %Postgrex.Result{
   columns: ["now"],
   command: :select,
   connection_id: 88764,
   messages: [],
   num_rows: 1,
   rows: [[~U[2021-03-20 19:40:13.378111Z]]]
 }}
iex()> ▊

How can Elixir get away with this? An I/O operation like this is the precise location you want async, right? Did I turn off async in my REPL somehow?? Is Elixir not async???

No. Elixir can get away with this because it's built on top of Erlang/OTP. And Erlang/OTP got concurrency so right.

Concurrency - and the processes that support it - have been a part of OTP since the beginning. And not as a feature, but as a part of its raison d'etre.

When I run Postgrex.start_link above, the function returns to me a pid, which I store in the variable conn. A pid is an address. In this case, Postgrex has started a process that manages the connection to my Postgres database. That process is running somewhere in the background. The pid is a pointer to that process.

When I run Postgrex.query(conn, statement), the first argument I pass to query/2 is the pid of the connection process. The REPL - the process I'm typing in - sends the statement as a message to the connection process.

The human metaphor of two friends passing messages back and forth to each other applies perfectly. Importantly for me the programmer, the REPL cheerily waits until it hears back from the connection process. The call to query/2 is a synchronous one. It's synchronous because it can be.

In fact, whenever an individual process does anything, it is always synchronous. On a local level, an Elixir/Erlang programmer is always thinking about synchronous, functional reductions. Including when sending and receiving messages to other processes. (And is all the while blissfully free of a red/blue function dichotomy.)

In Elixir and Erlang, concurrency doesn't happen at the function layer. It happens at the module layer. You can instantiate a module as a process, and now it's running concurrently with other processes. Each process maintains its own state. And can pass messages back and forth with other processes. " -- [11]

---

marius 20 hours ago:
    Although there are plenty of downsides of threads in most environments

How come ? After all, threads are the basic multithread building block exposed directly by the OS.

andyc, edited 9 hours ago:

I should have said synchronous / “straight line” code – saying “threads” sort of confuses the issue. You can have straight line code with process-level concurrency (but no shared state, which is limiting for certain apps, but maybe not as much as you think)

It’s very easy to make an argument that threads exposed by the OS (as opposed to goroutines or Erlang processes) are a big trash fire of design. Historically that’s true; it’s more a product of evolution than design.

One reason is that global variables are idiomatic in C, and idiomatic in the C standard library (e.g. errno, which is now a thread local). Localization also uses global variables, which is another big trash fire I have been deep in: https://twitter.com/oilshellblog/status/1374525405848240130

Another big reason is that when threads were added to Unix, syscalls and signals had to grow semantics with respect to threads. For example select() and epoll(). In some cases there is no way to reconcile it, e.g. fork() is incompatible with threading in fundamental ways.

The other reason I already mentioned is that once you add threads, timeouts and cancellation should be handled with every syscall in order for you to write robust apps. (I think Go and node.js do a good job here. In C and C++ you really need layers on top; I think ZeroMQ gives you some of this.)

So basically when you add threads, EVERYTHING about language has to change: data structures and I/O. And of course C didn’t have threads originally. Neither did C++ for a long time; I think they have a portable threading API now, but few people use that.

The original concurrency primitive exposed by Unix is processes, not threads. You can say that a thread is a process that allows you to write race conditions :)

From the kernel point of view they’re both basically context switches, except that in the threading case you don’t change the address space. Thus you can race on the entire address space of the process, which is bad. It’s a mechanism that’s convenient for kernel implementers, but impoverished for apps.

OS threads are pretty far from what you need for application programming. You need data structures and I/O too and that’s what Go, Erlang, Clojure, etc. provide.

---

hauleth 2 hours ago:
    Are all JS frameworks basically the same as far as concurrency goes?

Pretty much. The only way to have concurrency in JavaScript is to use async. JS by definition is single-threaded with async capabilities.

    If my main erlang process needs to wait for the ps process to finish, what’s the advantage of having a separate process?

Error handling mostly. In Erlang, failure of one process does not propagate to others (unless explicitly requested).

    isn’t the whole point of multithreading so that your main thread can do stuff while waiting for the DB to finish?

Erlang processes weren’t created for “multithreading” but for error isolation. The parallelisation was added later. But returning to your question about “doing other stuff while waiting for DB to finish” - you still can do so, just spawn other process (it is super cheap, as Erlang processes are different from OS processes). You still will need some synchronisation at some point, in Erlang each process is synchronous and linear, and it relies on messages to do communication between processes. So “behind the scenes” in function Postgrex.query/2 (in Erlang functions are defined by module name, function name, and arity, you can have multiple functions named identically with different arity) is something like (quasi-simplified as there is more behind scenes, for example connection pool, and most of the code there will be hidden behind functions, but in it’s “core” it reduces to that):

def query(pid, query) do
  ref = generate_unique_reference()

  # Send message to process identified by `pid` with request to execute `query`.
  # `ref` is used to differentiate between different queries. `self/0` returns PID of
  # current process; we need to send it so the Postgrex process knows whom to
  # send the response to.
  send(pid, {:execute, self(), ref, query})

  # Wait for response from the Postgrex process
  receive do
    {:success, ^ref, result} -> {:ok, result}
    {:failure, ^ref, reason} -> {:error, reason}
  after
    5000 -> throw TimeoutError # if we do not get a response in 5s, then throw error
  end
end

So in theory, you could do different things waiting for the response from the DB, this is for example how sockets are implemented in gen_tcp and gen_udp, these just send messages to the owning process, and owning process can do different things “in the meantime”. In this case, just for convenience of developer, all that message passing back and forth, is hidden behind utility function. However in theory you can do so in fully asynchronous way. This is almost literally how gen_server (short for generic server) works.

---

an article about how the details of Rust's async stuff is currently confusing: https://fasterthanli.me/articles/pin-and-suffering

consensus in the comments seems to be that Rust's async stuff is new and so has some 'rough edges' that will be improved over time, and that most people shouldn't use it yet (altho some say that unless you are in an unfortunate situation you don't actually end up running into most of these rough edges): https://lobste.rs/s/f98dpj/pin_suffering

one guy pointed out that Python's asyncio is also confusing:

https://lucumr.pocoo.org/2016/10/30/i-dont-understand-asyncio/

that article has a lot of specific suggestions of how the Python design could have been made better; could be useful to read.

i wonder which languages get async right? Erlang/Elixir apparently doesn't have the what-color-is-my-function problem. People seem to like C#'s async.

---

" Deterministic simulation

As we wrote about previously, we have the ability to write 100% reproducible distributed systems tests by running our whole system on a single thread and randomizing the order in which entities execute events starting from a random seed. Simulation is a major, codebase-spanning capability that heavily utilizes the aforementioned techniques of dependency injection and redefs. For example:

    Any part of the system that in production would be a unique thread is coded in terms of executor services. To get an executor service for that particular part of the system, it requests one from an "executor service factory". In production, this returns new threads. In simulation, however, we override that component to provide executor services from our single-threaded, globally managed source.
    Much of our system relies on time (e.g. timeouts), so time is abstracted away from our implementation. Any part of the system that is interested in time consults a "time source" dependency. In production this is the system clock, but in simulation the component is overridden with a "simulated time source" that can be explicitly controlled within our simulation tests.
    Promises are used quite a bit throughout the codebase to manage asynchronous, non-blocking behavior. Simulation uses with-redefs to layer in additional functionality into promises useful for stepping through simulation.

"

[12]
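
the "time source" idea translates pretty directly to trait-based dependency injection; a sketch in Rust under invented names (Clock, SimulatedClock), nothing to do with their actual Clojure code:

use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

trait Clock: Send + Sync {
    fn now_millis(&self) -> u64;
}

// production: the real clock
struct SystemClock { start: Instant }
impl Clock for SystemClock {
    fn now_millis(&self) -> u64 { self.start.elapsed().as_millis() as u64 }
}

// simulation: time only moves when the test says so
#[derive(Default)]
struct SimulatedClock { millis: AtomicU64 }
impl SimulatedClock {
    fn advance(&self, d: Duration) {
        self.millis.fetch_add(d.as_millis() as u64, Ordering::SeqCst);
    }
}
impl Clock for SimulatedClock {
    fn now_millis(&self) -> u64 { self.millis.load(Ordering::SeqCst) }
}

// code under test only ever sees the trait
fn has_timed_out(clock: &dyn Clock, started_at: u64, timeout: Duration) -> bool {
    clock.now_millis() - started_at >= timeout.as_millis() as u64
}

fn main() {
    let real = SystemClock { start: Instant::now() };
    println!("real clock: {} ms since start", real.now_millis());

    let sim = SimulatedClock::default();
    let t0 = sim.now_millis();
    assert!(!has_timed_out(&sim, t0, Duration::from_secs(5)));
    sim.advance(Duration::from_secs(6)); // the test drives the "simulated time source"
    assert!(has_timed_out(&sim, t0, Duration::from_secs(5)));
    println!("timeout logic exercised deterministically");
}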

---

apparently the received wisdom is that kqueue is better than epoll. This argument disputes that:

https://ariadne.space/2021/06/06/actually-bsd-kqueue-is-a-mountain-of-technical-debt/

the commentators seem to disagree tho and think that kqueue is still better:

https://lobste.rs/s/4qi5hg/actually_bsd_kqueue_is_mountain

---

https://aykevl.nl/2019/02/tinygo-goroutines

---

https://github.com/TimelyDataflow/timely-dataflow

---

" Promises All The Way Down

Node was designed before JavaScript had the concept of Promises or async/await. Node's counterpart to promises was the EventEmitter, which important APIs are based around, namely sockets and HTTP. Setting aside the ergonomic benefits of async/await, the EventEmitter pattern has an issue with back-pressure. Take a TCP socket, for example. The socket would emit "data" events when it received incoming packets. These "data" callbacks would be emitted in an unconstrained manner, flooding the process with events. Because Node continues to receive new data events, the underlying TCP socket does not have proper back-pressure, the remote sender has no idea the server is overloaded and continues to send data. To mitigate this problem, a pause() method was added. This could solve the problem, but it required extra code; and since the flooding issue only presents itself when the process is very busy, many Node programs can be flooded with data. The result is a system with bad tail latency.

In Deno, sockets are still asynchronous, but receiving new data requires users to explicitly read(). No extra pause semantics are necessary to properly structure a receiving socket. This is not unique to TCP sockets. The lowest level binding layer to the system is fundamentally tied to promises - we call these bindings "ops". All callbacks in Deno in some form or another arise from promises.

Rust has its own promise-like abstraction, called Futures. Through the "op" abstraction, Deno makes it easy to bind Rust future-based APIs into JavaScript promises. " [13]

---

https://news.ycombinator.com/item?id=22731248

jcranmer on March 30, 2020 [–]

The Rust async/await amounts to the following:

That's it. There is no runtime provided, not even optionally, in the standard library (let alone core). A fair amount of this post is giving instructions for how to build the runtime to execute asynchronous tasks.

One important thing to note is that Rust uses a polling-based approach to implement tasks. That is, the core interface to a task is to run it and have it return the result or indicate that it is waiting on something else. In the case of the latter, there is a callback mechanism (the Waker object) that the task is responsible for calling when it has more information to make progress--and the caller is responsible for calling the task again when that happens. Many (most?) other implementations instead use a resolution model, where the caller provides a function that is called with the result, and when the function gets called (particularly in the case of tasks that can execute quickly) differs from implementation to implementation. Rust's model definitely maps much more easily to a kernel, since the dominant pattern will be to drain a buffer, poke a device, and then wait for an interrupt to happen (the interrupt handler calling the wake function).
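
a minimal hand-rolled future showing that poll/Waker contract (my own sketch; the sleeping thread stands in for "wait for an interrupt", and it assumes the futures crate just for block_on):

use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};
use std::thread;
use std::time::Duration;

#[derive(Default)]
struct Shared {
    done: bool,
    waker: Option<Waker>,
}

// a future that completes when a background "interrupt" fires
struct Timer(Arc<Mutex<Shared>>);

impl Future for Timer {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut s = self.0.lock().unwrap();
        if s.done {
            Poll::Ready(())
        } else {
            // not ready: stash the Waker so the completion side can call it,
            // then tell the executor we're Pending
            s.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

fn timer(d: Duration) -> Timer {
    let shared = Arc::new(Mutex::new(Shared::default()));
    let s2 = Arc::clone(&shared);
    // stand-in for the interrupt handler: mark done, then wake the task
    thread::spawn(move || {
        thread::sleep(d);
        let mut s = s2.lock().unwrap();
        s.done = true;
        if let Some(w) = s.waker.take() {
            w.wake();
        }
    });
    Timer(shared)
}

fn main() {
    futures::executor::block_on(timer(Duration::from_millis(100)));
    println!("woke up");
}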

Matthias247 on March 31, 2020 [–]

> Rust's model definitely maps much more easily to a kernel, since the dominant pattern will be to drain a buffer, poke a device, and then wait for an interrupt to happen (the interrupt handler calling the wake function).

What is nice about Rusts model is that it prevents the continuation from running inline the event handler which signaled the completion. That avoids a ton of reentrancy issues and questions around "where does my code actually run"? That indeed makes it also interesting to use for interrupt handlers, since it is guaranteed that the code which waits for the interrupt to happen runs purely in the executors thread and will not accidentally take over the interrupt handler.

---

zackmorris on March 30, 2020 [–]

I've shied away from async/await because I haven't seen a good writeup on how to make it deterministic. Come to think of it, all of the times I've encountered it, there was no way to analyze the codepaths and prove that every exceptional situation and failure mode was handled. Maybe I missed something along the way?

So my feeling about it is that it may turn out to be an evolutionary dead end, or anti-pattern at the least. My gut feeling is that async/await is functionally equivalent to the synchronous/blocking/coroutine/channel system of other languages like Go. Could we write a thin wrapper that converts async/await to that or vice versa?

This is the primary reason why I've stuck to synchronous/blocking PHP with all of its flaws instead of Node.js. I think this is a fundamental thing that shouldn't be glossed over and accepted so readily into other languages.

steveklabnik on March 30, 2020 [–]

What do you mean by "deterministic" here?

> Come to think of it, all of the times I've encountered it, there was no way to analyze the codepaths and prove that every exceptional situation and failure mode was handled.

There is nothing special about async/await with regards to this in Rust, at least if I'm understanding you correctly. Async functions can return Results like any other function for recoverable errors, and can panic like any other function for un-recoverable errors.

> My gut feeling is that async/await is functionally equivalent to the synchronous/blocking/coroutine/channel system of other languages like Go.

It depends on what level of abstraction you're talking about. For Rust, which cares a lot about details, they're not. I gave a talk comparing and contrasting all this stuff here: https://www.infoq.com/presentations/rust-2019/

zackmorris on March 31, 2020 [–]

Well, I'm coming from the Javascript side, where people use promises a lot for async, but it's almost impossible to trace execution in the debugger. It's usually immediately obvious to me that many exceptions and failure modes have been missed, but I find it difficult to reason about large promise chains and I usually have no idea where I would add more error handling. It tends towards a big ball of spaghetti.

Then if something does fail (wifi blips out without a retry, who knows) the page just hangs with no indication of what went wrong.

Contrast this with Unix-style synchronous blocking code piping data between threads with no shared state. Since every step of execution happens in a single thread, blocking until a pipe returns data, it's trivial to step through the code.

Async/await really becomes a problem in Javascript though because they made it a language keyword in Node.js. So you never know if you are dealing with a sync function or async function. Eventually the entire program has async in front of everything and ends up being written just exactly like sync blocking (but with superfluous syntactic sugar everywhere). So I question what was really gained there. IMHO it just doubles the workload in the mind since now we have to juggle both types of functions and explore those permutations. That's the last thing we need when we're trying to brainstorm a solution from a large possible search space.

I'm also looking at this from a one-shot functional programming perspective. I realize that sync blocking code blocks the main thread, which is usually the GUI rendering loop. There are ways around that though. The best solution I've seen so far is Apple's Grand Central Dispatch:

https://en.wikipedia.org/wiki/Grand_Central_Dispatch

https://www.raywenderlich.com/5370-grand-central-dispatch-tu...

https://www.swiftbysundell.com/articles/a-deep-dive-into-gra...

Basically it works by being sync blocking and providing simple ways to run closures on other threads. I find it much easier to debug than async/await though. Rust probably already has all of that functionality, so I don't understand the advantage of copying Javascript model with all of its caveats. I think what's happening is that people just want to run with the context they're already familiar with.

steveklabnik on March 31, 2020 [–]

Ah, so, async/await, while conceptually similar, isn't the same across languages. Rust's semantics here are very different than JS's. I think Rust's implementation would address all of your issues, at least conceptually.

> I don't understand the advantage of copying Javascript model with all of its caveats. I think what's happening is that people just want to run with the context they're already familiar with.

With respect, I think it's because you don't understand it in Rust yet. It was not directly copied from JS, and was certainly not due to familiarity. Read the post! It's very good :)

jcranmer on March 30, 2020 [–]

There are two important things that are often conflated with async/await: the programming model, and the executor model.

The programming model of async/await is essentially syntactic sugar for a state machine, where each of the states are determined implicitly by the site of the call to await, and the ancillary storage of the state machine is managed implicitly by the compiler. It's basically no different from a generator pattern (and is usually implemented on top of the same infrastructure) [1].

Since the result of calling an async function is a task object that doesn't do anything until you call methods on it (including implicitly via await), most languages integrate these task objects with their event model. This model is completely orthogonal to the feature, and often exists (at least in the library, if not the language) before async/await is added to the language. JS has a built-in event loop that predates async/await by several years, whether programming in client-side browsers or using something like Node.js. Rust is unusual in that it prescribes absolutely nothing for the executor model; all it defines in its standard library is the interface to drive the tasks manually.

I doubt async/await is an evolutionary dead end. It's the third kind of coroutine to hit it big in mainstream languages, the first two being the zero-cost exception handling mechanism (i.e., try/catch) and the generator (i.e. yield). The biggest criticism of zero-cost exceptions essentially amounts to their types not being made explicit, but the design of async/await makes their effects on the type of coroutine quite apparent, so it's not going to run into the same problems.

[1] There is a difference in the pattern, which is that you're probably going to get back different kinds of objects implementing different interfaces, which is a bigger deal if you manually drive the interfaces. Structurally--that is, in terms of the actual machine code running--there is likely to be no difference.

zackmorris on March 31, 2020 [–]

Thank you, that was very insightful. I've felt that way about generators, but it never occurred to me that try/catch could be viewed similarly to a coroutine (because it's basically using goto to jump to alternate states too).

Edit: I forgot to mention that I came to your same conclusion a few years ago about async/await being equivalent to a state machine. It happened a different way for me, when I realized that a coroutine performs the same computation as a state machine, but is much easier to trace. I was working on a game, and made a rather large state machine with perhaps 20-50 states and some side effects where those states interacted with game state. But it simplified down to just a few lines of ordinary flow control code when I tried it as coroutines with message passing. So to me, async/await carries the burden of state machine complexity but loses the simplicity of coroutine (and sync/blocking obviously) traceability. So I just don't see a compelling reason to use it.

lmm on March 31, 2020 [–]

async/await-based code is deterministic until you deliberately introduce the ability for tasks to race. Look at Noether for a language design that explicitly makes those steps. Whereas if you have channels (and don't have some kind of single ownership) then you have nondeterministic races right from the start. So I rather doubt that any such thin wrapper could be formed.

Certainly as a human reader, Future-based code (which async/await is sugar for) is the easiest form of concurrency/parallelism to reason about: yes you have small tasks that might execute in parallel, but all your functions still return values, function composition still works the way you'd expect, there's no way for parallel tasks to silently interfere with each other in some hidden way. The control flow is still what it looks like (unlike with general coroutines). Of course if you can afford to avoid parallelism entirely you probably should, but async/await is the least intrusive/most understandable way to manage it if you do need it.

---

probably a good read: https://os.phil-opp.com/async-await/ https://news.ycombinator.com/item?id=22727985 (discussion probably a good read too)

---

i just finished reading Russ Cox's 2-part series on hardware and then programming language memory models, and his 3rd post in the same series giving background on his upcoming proposal for improvements to Golang's memory model: https://research.swtch.com/mm

my take-aways for Cox's opinions on what a programming language should specify:

I think in low-level languages like Boot, we should guarantee DRF-SC, and allow weak atomics ('coherence' (acquire/release) and unsynchronized (relaxed)), and have undefined behavior with races. The presence of weak atomics implies a complicated memory model.

(actually, is 'coherence' really a good name for acquire/release? All hardware already has coherence and yet acquire/release synchronization still exists, right? So it must provide something else besides just coherence?)

In high-level languages like Oot, we should guarantee DRF-SC, and only provide atomics which are sequential consistency synchronizing (but if more perf is needed, provide additional high level std library primitives that take advantage of weaker sync under the hood). The atomicity should be in the type, not at the point of access. I'm not sure what to do with races in a high-level language. I think we shouldn't have undefined behavior. We should rule out acausal/out-of-thin-air-reads informally but not bother to try formally, as that is still a research question. I guess we can follow Golang's example and define a happens-before ordering the usual way: https://golang.org/ref/mem

i guess then we define data races as a write and (a read or a write) that are unordered by happens-before, where at least one of the two accesses is not atomic; and we say that if it's two writes, the location can end up holding any value afterwards, and if it's a write and a read, the read can return any value.

as for out-of-thin-air-reads, maybe say that each value has a chain (tree, really) of provenance, and all of the sources of this tree must be one of: constants in the program, I/O reads, or program-requested nondeterministic values. (that also implies that we should 0-initialize memory).
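re: the acquire/release question above: hardware coherence only keeps each individual memory location consistent; what acquire/release adds is ordering of the other memory operations around the atomic, which is what makes publishing non-atomic data to another thread safe. a small Rust sketch (if the Release/Acquire pair below were downgraded to Relaxed, the dereference would be a data race, i.e. UB in C/C++ terms):

    use std::sync::atomic::{AtomicPtr, Ordering};
    use std::{ptr, thread};

    static SLOT: AtomicPtr<u64> = AtomicPtr::new(ptr::null_mut());

    fn main() {
        let producer = thread::spawn(|| {
            let p = Box::into_raw(Box::new(42u64)); // plain (non-atomic) write of 42
            SLOT.store(p, Ordering::Release); // publish the pointer
        });

        let consumer = thread::spawn(|| loop {
            let p = SLOT.load(Ordering::Acquire);
            if !p.is_null() {
                // Acquire pairs with the Release store above, so the write of 42
                // that happened before the publish is guaranteed visible here.
                println!("{}", unsafe { *p });
                unsafe { drop(Box::from_raw(p)) };
                break;
            }
        });

        producer.join().unwrap();
        consumer.join().unwrap();
    }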

---

https://rachelbythebay.com/w/2020/03/07/costly/ describes a problem with Python greenthreads (Python, Gunicorn, Gevent, serving web requests) where all of the greenthreads are woken up at once whenever a request comes in; and, worse, one of them might time out and close the connection to the client after another one has started serving the request (which could be a problem if the request is an API call that is not idempotent).

https://news.ycombinator.com/item?id=22514388 says ok, greenthreads are good for I/O bound stuff but this is CPU bound, so no surprise -- any system with an event loop and non-preemptible greenthreads will have a problem if one of the greenthreads becomes CPU-bound and blocks the event loop ([14] agrees). Also, Golang takes care of this situation ("I'm a big fan of gevent, and while it does have these shortcomings - they are there because it's all on top of Python, a language which started out with the classic async model (native threads), rather than this model. Golang, on the other hand, doesn't suffer from them as it was designed from the get-go with this threading model in mind. So it allows you to write blocking style code and get the benefits of an event loop (you never have to think about whether you need to await this operation). And on the other hand, goroutines can be preempted if they spend too long doing CPU work, just like normal threads.").

https://news.ycombinator.com/item?id=22514785 says that " Beam VM (Erlang/Elixir) is another example which copes with the problem very well due to preemption. ".

Someone else asks "How come there's no work stealing? Green threads are supposed to be backed by some N:M thread pool, no?" [15] and others point out that Python has a GIL anyway and isn't really designed for this sort of thing. Someone points out there's a fancy asyncio.run_in_executor for this sort of situation [16].
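the same failure mode, and the usual escape hatch, sketched with Rust's tokio (assumes the tokio crate with its usual features): a CPU-bound loop inside an async task never yields, so on a cooperative executor nothing else runs until it finishes -- same as a greenlet hogging the gevent loop. spawn_blocking hands the work to a separate blocking-thread pool so the event loop keeps servicing other tasks:

    use std::time::{Duration, Instant};

    fn busy_work() -> u64 {
        // stand-in for a CPU-bound request handler
        (0..200_000_000u64).fold(0, |a, b| a.wrapping_add(b))
    }

    #[tokio::main(flavor = "current_thread")]
    async fn main() {
        let t0 = Instant::now();

        // a ticker that should fire every 10ms if the event loop is healthy
        let ticker = tokio::spawn(async move {
            for _ in 0..5 {
                tokio::time::sleep(Duration::from_millis(10)).await;
                println!("tick at {:?}", t0.elapsed());
            }
        });

        // BAD: runs on the executor thread and starves the ticker:
        // let sum = busy_work();

        // BETTER: run CPU-bound work on the blocking pool:
        let sum = tokio::task::spawn_blocking(busy_work).await.unwrap();

        println!("sum = {sum}");
        ticker.await.unwrap();
    }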

---

vlmutolo 28 minutes ago

It’s not a problem in practice. Rust’s futures model handles io-uring just fine. There was some debate over how to handle “cancellations” of futures, e.g. when Rust code wants to just forget that it asked the OS for bytes from a TCP socket. But the “ringbahn” research prototype found a clean solution to that problem.

Actually, that entire blog is a wealth of information about Rust futures.

---

https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/

" Timeout provides a way to auto-terminate a potentially long-running operation if it hasn’t finished in a fixed amount of time.

require 'timeout'

status = Timeout::timeout(5) {
  # Something that should be interrupted if it takes more than 5 seconds...
}

...

Its implementation seems kind of clever at first. You can read the code here. It starts a new thread, which saves a reference to the original thread in x, sleeps for 5 seconds, then raises an exception on the original thread when it's done, interrupting whatever it was doing.

...

Nobody writes code to defend against an exception being raised on literally any line. That’s not even possible. So Thread.raise is basically like a sneak attack on your code that could result in almost anything. "

---

"Brinch Hansen, looking for the 'right' primitives for concurrency" [17]

hedgehog 10 months ago [–]

This is great. His Edison language is an interesting early exercise in building the minimal language with good support for concurrency. Prior to this dump there wasn't much information about Edison online, but now you can get the full write-ups and code:

http://pascal.hansotten.com/per-brinch-hansen/edison-a-multi...

http://pascal.hansotten.com/per-brinch-hansen/

https://news.ycombinator.com/item?id=24913959

http://www.cs.otago.ac.nz/cosc441/archprog-brinch_hansen.pdf THE ARCHITECTURE OF CONCURRENT PROGRAMS

https://stacks.stanford.edu/file/druid:bj699xg5754/bj699xg5754.pdf

---

project verona "To provide temporal memory safety (no use after free bugs), you need to know when an object will never be accessed again... In Project Verona, we view giving up concurrent mutation as a necessary step for scalable memory management. By eliminating concurrent mutation, we cannot implement concurrency as a library. There are two alternatives here, expose "unsafe" to enable unsafe libraries for concurrency (e.g. Rust), or provide a concurrency model as part of the language (e.g. Pony).

The former means the language can rely on fewer invariants as it does not understand how the code inside the unsafe blocks is providing concurrency. The latter means you need an amazing concurrency story as you can only have one.

Concurrent Ownership

In Project Verona, we are introducing a new model of concurrent programming: concurrent owners, or cowns for short. A cown, pronounced like "cone", encapsulates some set of resources (e.g. regions of memory) and ensures they are accessed by a single thread of execution at a time.

In Verona, we can wrap an object in a cown, and that will make it concurrent.

// x is some isolated object graph
var c = cown.create(x)
// c is a cown that mediates access to x. We have lost direct access to x here

Once we have wrapped an object in a cown, we can only access it by scheduling work on it. In Verona, that is done with the keyword when

when (var x = c)
{
  // Access internals of cown(c) using name x in here
  Builtin.print("Hello\n")
}
Builtin.print("Goodbye\n")
" -- https://github.com/microsoft/verona/blob/master/docs/explore.md
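a very rough Rust analogue of the cown/when behavior, just to pin down the semantics (this is NOT Verona: real cowns come with region/isolation guarantees and an ordering runtime that a Mutex plus a spawned thread can't express; here `when_` just schedules a closure with exclusive access to the wrapped state):

    use std::sync::{Arc, Mutex};
    use std::thread::{self, JoinHandle};

    struct Cown<T>(Arc<Mutex<T>>);

    impl<T: Send + 'static> Cown<T> {
        fn create(x: T) -> Self {
            Cown(Arc::new(Mutex::new(x)))
        }

        // schedule `body` to run with exclusive access to the contents,
        // asynchronously with respect to the caller (like `when`)
        fn when_<F>(&self, body: F) -> JoinHandle<()>
        where
            F: FnOnce(&mut T) + Send + 'static,
        {
            let state = Arc::clone(&self.0);
            thread::spawn(move || {
                let mut guard = state.lock().unwrap();
                body(&mut guard);
            })
        }
    }

    fn main() {
        let c = Cown::create(vec![1, 2, 3]); // we no longer touch the Vec directly
        let handle = c.when_(|x| {
            x.push(4);
            println!("Hello, len = {}", x.len());
        });
        println!("Goodbye"); // may print before or after "Hello": `when` is async
        handle.join().unwrap();
    }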

---

Eiffel's SCOOP system:

http://cme.ethz.ch/scoop/ https://en.wikipedia.org/wiki/SCOOP_(software)

---

https://fasterthanli.me/articles/understanding-rust-futures-by-going-way-too-deep

---

https://hyperfiddle.notion.site/Reactive-Clojure-You-don-t-need-a-web-framework-you-need-a-web-language-44b5bfa526be4af282863f34fa1cfffc

---

on concurrency (threads) being hard:

"Stuff like actors with a message inbox (Erlang/Elixir's preemptive green threads) or parallel iterators transparently multiplexing work on all CPU cores (Rust's `rayon` comes to mind) are much better abstractions for us the poor humans to think in terms with. Golang's goroutines are... okay. Far from amazing. Still a big improvement over multithreaded code as you pointed out though, I fully agree with that." [18]

nickbauman on Jan 29, 2020 [–]

    Multithreaded programming is still one of the most problematic activities even for senior programmers

I agree wholeheartedly. FWIW I've found higher level abstractions built into the language itself, such as atoms and refs in Clojure, go a long way toward mitigating the problem once and for all. This gives you an actual harpoon to hunt Amdahl's* Great White Whale of Parallelism. Instead of a toothpick (C++) or a pocket knife (Java) to do it.

The trouble is we need to stop hunting whales if we're going to move the needle in a permanent and significant way...

---

dijit on Jan 28, 2020 [–]

Your comment about threading in win32 being lackluster is surprising. It was my understanding that while the OS doesn't support fork(), that it /does/ support pthreads, and that as soon as you go outside of the process itself, Windows has "better for developers" system calls. I'm speaking specifically about I/O Completion ports.

Working with epoll or select/poll is really nasty when dealing with "some data" that's coming on a socket, whereas IOCP allows you to tell the kernel "as soon as you see some data destined for my thread, just populate it in the memory here, tell me when something is there and then tell me how much is there"; for high performance C++ programmers this is basically invaluable and as far as I understand there's no direct competitor in UNIX and unix-like land. (although; according to the C++ devs where I work, "kqueues are less braindead than epoll")

oppositelock on Jan 28, 2020 [–]

I never said it was lackluster, I said that it was equivalent. The differences lie around how asynchronous thread signals are delivered and handled. Win32 is less flexible in that regard, and when you're writing any kind of cross platform threaded software which includes Win32, you must make accommodations for this difference. Pthreads can't be fully emulated on Win32, so you have to have your pthread path, and your win32 path.

And yes, you correctly point out that on *nix and Win32, interaction with the kernel is often different. Win32 does a lot right, I won't ever speak negatively of it, it is the desktop OS of choice by a huge margin. It's just that *nix and Win32 are so different from each other, that you can't generally write a single implementation of your low level thread interaction for both classes of platforms.

dboreham on Jan 28, 2020 [–]

If you were using signals on Win32, that's the problem.

gpderetta on Jan 29, 2020 [–]

I'm not a win32 programmer, but I think that for a long time Windows didn't have OS-provided condition variables (events are not always an adequate replacement and had nasty footguns around signaling); it turns out that it is hard to implement a cond var with POSIX semantics, and if you managed to get a third-party implementation it was often buggy. I think they were eventually added in Windows Vista.

[19]

---

on the difficulty of debugging event loops due to callbacks:

"use tornado (event-based Python webserver library) extensively at work - when something goes wrong you don't get a nice stack trace, you get some random sampling of callback spaghetti."

---

milesvp on Jan 28, 2020 [–]

A shame that no one has mentioned cache invalidation as a further reason threaded programming is hard. One of my biggest takeaways from Martin Thompson’s talk on mechanical sympathy is that the first thing he tries when brought in as a performance consultant is to turn off threading. He mentions locking as a performance problem, but says that these days cache locality can be the key to speeding up slow applications.

amelius on Jan 28, 2020 [–]

Yes but the cache problem also exists with asynchronous programming.

milesvp on Jan 28, 2020 [–]

True. I wonder how much event systems by their very nature simply destroy cache locality. Still, it's likely much easier to reason about cache hits, and build event handlers such that they remain local for the duration, as opposed to threading, where it's all but impossible to predict what the cache will look like.

---

cancel scopes https://vorpus.org/blog/timeouts-and-cancellation-for-humans/#cancel-tokens-are-level-triggered-and-can-be-scoped-to-match-your-program-s-needs discussion: https://news.ycombinator.com/item?id=22590406

---

ocaml multicore

https://discuss.ocaml.org/t/the-road-to-ocaml-5-0/8584

"Parallelism through Domains [1 73] Concurrency through Effect Handlers [2 68] (without syntactic support and exposed as functions from the standard library [5])"

---

What Memory Model Should the Rust Language Use? recommends:

---

"colorless async"

https://news.ycombinator.com/item?id=29049881 :

https://brunosutic.com/blog/async-ruby

"

sizediterable 9 days ago [–]

> any blocking operation (a method where Ruby interpreter waits) is compatible with Async and will work asynchronously within Async code block with Ruby 3.0 and later.

That's pretty magical. Does this mean it would be possible to implement structured concurrency [0], but without the function coloring problem?

Regardless, I think I prefer how this looks from a code read/writeability perspective compared to Zig's or Swift's (and potentially future Rust's [2]) approaches.

[0] https://vorpus.org/blog/notes-on-structured-concurrency-or-g...

[1] https://kristoff.it/blog/zig-colorblind-async-await/

[2] https://blog.yoshuawuyts.com/async-overloading/

brunosutic 9 days ago [–]

Async Ruby is colorless!

It's obvious from the examples provided in the post. For example, you can use the method `URI.open` both synchronously and asynchronously.

It's something I didn't want to mention in the article because it's a relatively advanced async concept, but yea - Async Ruby is already colorless and it's great.

gpderetta 9 days ago [–]

IIRC ruby has first class continuations, so it doesn't have to deal with any async/await nonsense.

brunosutic 9 days ago [–]

That's right - Async Ruby uses fibers (stackful coroutines) as a concurrency primitive. There's a lot to say about this, but the end result is that we get to write simple synchronous code and it can run asynchronously.

" bradgessler 9 days ago [–]

What is “colorless” supposed to mean? This is the first time I’ve seen it used.

justsomeuser 9 days ago [–]

It references the function return type:

Colored = Promise<T>

Colorless = T

In "colorless" async/await, async functions are implicit (no async keyword), and when called, the compiler auto-unwraps the promise, so everything looks like "sync" code but any function stack could "save and resume" at an "async point".

vmarquet 9 days ago [–]

I think it's a reference to this blog post https://journal.stuffwithstuff.com/2015/02/01/what-color-is-...

auno 8 days ago [–]

Which in turn is making a reference to an even older post at https://ansuz.sooke.bc.ca/entry/23.

"

"

DangitBobby? 9 days ago [–]

As someone who operates mainly in Python, I am so jealous. As far as I'm aware [1], you have to re-write your tooling to take advantage of async in Python. Does anyone have any insight into why Python async doesn't work the same way? Does it come down to fundamental language differences?

1. https://stackoverflow.com/a/63179518/4728007

brunosutic 9 days ago [–]

> Does it come down to fundamental language differences?

No, I don't think there's anything fundamentally different.

Ruby 3.0 implements a "fiber scheduler" feature that enables "colorless Async". Fiber scheduler is an obscure Ruby feature, but the end result is brilliant. It was also a huge amount of work.

Side note: Fiber scheduler was implemented by the same guy who created Async Ruby - Samuel Williams. This guy is the mastermind (and master-coder) behind this project.

pansa2 9 days ago [–]

> Does anyone have any insight into why Python async doesn't work the same way?

Ruby async seems to be implemented using stackful coroutines. IIRC Guido has been opposed to adding these to core Python, preferring stack-less coroutines because they require every yield-point to be explicit (i.e. marked with `await`).

There are libraries for Python that support stackful coroutines, such as gevent.

"

"

azth 9 days ago [–]

The first reference is the same route that Java is taking with its green thread implementation by means of Project Loom[0].

[0] https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.... https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.html#structured-concurrency

"
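to make the "color" terminology above concrete, in Rust terms (Rust is a colored language, so it shows exactly the friction being discussed; sketch only):

    // "colorless" (returns T) vs "colored" (returns a Future) versions of
    // the same thing don't compose freely.
    async fn fetch_answer_async() -> u32 {
        42 // imagine awaiting a socket here
    }

    fn fetch_answer_sync() -> u32 {
        42
    }

    fn plain_caller() -> u32 {
        let a = fetch_answer_sync(); // fine
        // let b = fetch_answer_async().await; // ERROR: `.await` only works
        //                                     // inside another async fn/block
        a
    }

    async fn async_caller() -> u32 {
        // async code can call both, but doing so colors this fn too
        fetch_answer_sync() + fetch_answer_async().await
    }

    fn main() {
        // crossing back from async to sync needs an executor at the boundary
        // (e.g. futures::executor::block_on or a tokio runtime)
        println!("{}", plain_caller());
        let _unpolled = async_caller(); // inert until something polls it
    }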

---

https://tokio.rs/blog/2021-12-announcing-tokio-console

---

Isaacy 24 hours ago

It’s possible to have an “async” system for systems programming that isn’t colored. Zig solved this problem. In the case of Zig, I can write a single function to be called from the Erlang virtual machine’s FFI either in an async fashion, to use the VM’s preemptibility, or I can call the same function to be run in its own thread (this is non-async), so you can explore the tradeoffs without having to write your function twice.

smaddox 21 hours ago

I really like what Andrew did with color-free async in Zig, but it’s not a panacea. There are limitations. You can’t use multiple different event loops (one blocking, one non-blocking) in the same binary, for example. Unless I’ve misunderstood something.

Isaacy 13 hours ago

You absolutely can use multiple event loops in the same binary. It isn’t obvious how to do it cleanly, though, if you ask me; or maybe I just haven’t hit upon “the correct pattern” for it.

---

https://github.com/athas/raytracers

---

https://github.com/mukul-rathi/bolt https://mukulrathi.com/create-your-own-programming-language/intro-to-compiler/ https://mukulrathi.com/create-your-own-programming-language/concurrency-runtime-language-tutorial/

---

people talking about async/await gotchas (in general? in specific languages?) https://news.ycombinator.com/item?id=30406908

---

https://lobste.rs/s/mw1f3s/i_believe_zig_has_function_colors

---

https://mccue.dev/pages/5-2-22-go-concurrency-in-java

---

i don't think this is relevant to us but i'm leaving it here just in case: Stacked Futures and why they are impossible https://lobste.rs/s/mar5gp/stacked_futures_why_they_are_impossible

---

"

0x2ba22e11 6 days ago

One thing I’d like to tentatively raise, because I suspect you know it already but it might be useful to know the vocabulary if you don’t, is that for making things go fast and parallel in an automatic way in Haskell, people tend to get much better results by implementing Applicative rather than Monad. The implementation can see all the steps that are going to happen up front when it’s Applicative, so there’s a lot more freedom to work with, unlike Monad where it doesn’t know what’s going to happen next until the user function actually runs. I suspect that your “monad-ish” thing is probably already more similar to Applicative than Monad.

e.g. There are some write-ups out of (ick) Facebook about how they have DSLs for making calls to lots of backend APIs in parallel, with the code still looking somewhat imperative-ish (at least by Haskell standards), by implementing an Applicative instance and not using Monad.

"

---

it seems like in Erlang, hot code reloading is related to the capability to send functions over channels (which i believe requires that the runtime have the ability to serialize functions)

" Joe Armstrong wrote a blog post titled My favorite Erlang Program https://joearms.github.io/published/2013-11-21-My-favorite-erlang-program.html , which showed a very simple universal server written in Erlang:

universal_server() ->
    receive
        {become, F} ->
            F()
    end.

You could then write a small program that could fit this function F:

factorial_server() ->
    receive
        {From, N} ->
            From ! factorial(N),
            factorial_server()
    end.

factorial(0) -> 1;
factorial(N) -> N * factorial(N-1).

If you had an already running universal server, such as you would by having called Pid = spawn(fun universal_server/0), you could then turn that universal server into the factorial server by calling Pid ! {become, fun factorial_server/0}. " -- [20]
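within a single process you can get a similar "become" pattern in Rust by sending boxed closures over a channel -- no code serialization, so none of Erlang's distribution or hot-reload story, but it shows the "send functions over channels" capability the note above points at:

    use std::sync::mpsc;
    use std::thread;

    type Become = Box<dyn FnOnce() + Send + 'static>;

    fn universal_server(rx: mpsc::Receiver<Become>) {
        if let Ok(f) = rx.recv() {
            f() // become whatever we were sent
        }
    }

    fn factorial(n: u64) -> u64 {
        (1..=n).product()
    }

    fn main() {
        let (tx, rx) = mpsc::channel::<Become>();
        let server = thread::spawn(move || universal_server(rx));

        // turn the universal server into a factorial server
        let (req_tx, req_rx) = mpsc::channel::<(mpsc::Sender<u64>, u64)>();
        tx.send(Box::new(move || {
            for (reply_to, n) in req_rx {
                reply_to.send(factorial(n)).unwrap();
            }
        }))
        .unwrap();

        let (reply_tx, reply_rx) = mpsc::channel();
        req_tx.send((reply_tx, 10)).unwrap();
        println!("10! = {}", reply_rx.recv().unwrap());

        drop(req_tx); // let the factorial server's receive loop end
        server.join().unwrap();
    }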

---

https://www.1024cores.net/home

---

https://github.com/effil/effil

---