proj-oot-ootConcurrencyNotes10

ghoward 24 days ago

prev [–]

I think you are absolutely right about everything you mentioned, except for one:

> I am not convinced one can make something solving all the things Rust solves which is substantially simpler as language.

You could be right about this, but boy, I hope you are wrong. And I'm trying to prove that it's possible.

I think the answer is structured concurrency [1], and here's why:

1. Structured concurrency can replace everything async/await can do. At least, I believe so, kind of like it's believed that structured programming can replace any unstructured use of goto.

2. If you extend structured concurrency a bit so that the thread tree is actually a DAG [2], you basically get a borrow checker for free because every item is rooted in a thread stack somewhere, and where it is rooted is always before everything that it could own and after everything that could own it.

3. Making the thread tree a DAG also means it's easy to implement RAII and have it work, meaning no use-after-free, no double frees, etc.

Put that on top of a language that has memory safety by default, and you will have everything that Rust gives you without the complexity.

This isn't theoretical; I have implemented this in C. [3] (See [4] for a use of it.) I have even implemented bounds checking in C. [5]

However, I'm also implementing a programming language with these ideas (see [6] for the example file and [7] within that file for how structured concurrency might be exposed in the language).

As for readability, I do think that what I have is readable. However, I appreciate comments; readability is important to me too, but I've spent so long with this work that I am blind to what is actually readable to other people.

[1]: https://gavinhoward.com/2019/12/structured-concurrency-definition/

[2]: https://lobste.rs/s/8msejg/notes_on_structured_concurrency_go#c_wkxbvb

[3]: https://git.yzena.com/Yzena/Yc/src/branch/master/include/yc/threadset.h#L104-L115

[4]: https://git.yzena.com/Yzena/Yc/src/branch/master/src/rig/build.c#L972-L1020

[5]: https://git.yzena.com/Yzena/Yc/src/branch/master/include/yc/array.h#L208-L263

[6]: https://git.yzena.com/Yzena/Yc/src/branch/master/src/yao/example.y

[7]: https://git.yzena.com/Yzena/Yc/src/branch/master/src/yao/example.y#L757-L765

---

interesting post about how the new Vale language has seamless fearless structured concurrency. It sounds like something we might want to do, too:

https://verdagon.dev/blog/seamless-fearless-structured-concurrency

---

https://verdagon.dev/blog/seamless-fearless-structured-concurrency

"

Structured Concurrency ... With structured concurrency, we can do that with just one line, using OpenMP. Let's add a #pragma omp parallel for to our original program: ...

    // Launch some threads and run in parallel!
    #pragma omp parallel for
    for (int i = 0; i < 5; i++) {
        results[i] = pow(i, exponent); // some expensive calculation
    }

... "Seamless" Structured Concurrency: OpenMP is a really amazing tool for C structured concurrency because it's seamless:

    The threads can access any data from the surrounding scope. (Notice how all of our threads are accessing results and exponent.)
    It's easy; if we want to write a parallel loop, we don't have to refactor our callers or rearchitect our program to enable it. We just start building!

In other words, seamless concurrency is the ability to read existing data concurrently without refactoring existing code.
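(my note: a rough Rust analogue of the OpenMP example above, assuming Rust 1.63+ for std::thread::scope; scoped threads can borrow `results` and `exponent` straight from the enclosing stack frame, though Rust makes us hand each thread a disjoint &mut element instead of letting every thread index the shared array.)

    use std::thread;

    fn main() {
        let exponent = 3u32;
        let mut results = [0u64; 5];

        thread::scope(|s| {
            for (i, slot) in results.iter_mut().enumerate() {
                s.spawn(move || {
                    // reads `exponent` from the surrounding scope,
                    // writes only this thread's own element of `results`
                    *slot = (i as u64).pow(exponent);
                });
            }
        }); // every spawned thread is joined here, before `results` is used again

        println!("{results:?}");
    }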

...

Concurrency is "fearless" if data races are impossible. ...

 Luckily, we've figured out how to avoid these problems, with some "concurrency ground rules":
    Multiple threads can read the same data, if nobody can modify it.
    If a thread can read data that another thread can modify, the data must be wrapped in a mutex or atomically updated.
    If data is only visible to one thread, it can access that data freely.

We usually need proper fear, respect, and discipline to stick to these rules. But when a language enforces these rules, we have fearless concurrency.
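(my note: a minimal Rust sketch of rule 2 above, just to make it concrete: data that one thread reads while another may modify gets wrapped in a mutex, and the compiler will not let you touch the contents without taking the lock.)

    use std::sync::{Arc, Mutex};
    use std::thread;

    fn main() {
        let counter = Arc::new(Mutex::new(0u32));

        let handles: Vec<_> = (0..4)
            .map(|_| {
                let counter = Arc::clone(&counter);
                thread::spawn(move || {
                    // the lock gives exclusive access for the duration of the update
                    *counter.lock().unwrap() += 1;
                })
            })
            .collect();

        for h in handles {
            h.join().unwrap();
        }
        println!("{}", *counter.lock().unwrap()); // prints 4
    }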

...

Message Passing with Pony

Pony doesn't let us access data from the outside scope, like in our above C examples. Instead, we "send" data between "actors", which are similar to threads.

We can send data if either of these is true:

    We know we have the only reference to the data (the reference has the iso permission). 7
    We know it is deeply immutable (the reference has the val permission).

If we have a val or iso reference to an object, we can send it to another actor.

Pony has fearless concurrency because in this system, data races are impossible; we can never have a modifiable chunk of data that's visible to multiple threads.

Key takeaways from Pony:

    A type system can track whether something's immutable.
    Immutable objects can be shared with multiple threads without risk of data races. 8

...

Structured fearless concurrency

Rust has fearless concurrency that feels a lot like Pony's:

    Similar to iso, Rust's borrow checker enforces that we have the only reference to an object when we want to send it to another thread.
    Similar to val, Rust can share things that are immutable.
        For example, we can share Arc<Vec<int>>, which is an atomically reference counted, immutable vector of ints.
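(my note: a minimal Rust sketch of that val-style sharing: an atomically reference-counted vector that the threads only ever read, so no data race is possible.)

    use std::sync::Arc;
    use std::thread;

    fn main() {
        let data: Arc<Vec<i32>> = Arc::new(vec![1, 2, 3, 4, 5]);

        let handles: Vec<_> = (0..3)
            .map(|_| {
                let data = Arc::clone(&data);
                // each thread gets a shared, read-only view of the same vector
                thread::spawn(move || data.iter().sum::<i32>())
            })
            .collect();

        for h in handles {
            println!("sum = {}", h.join().unwrap());
        }
    }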

...

But alas, Rust doesn't have seamless structured concurrency, because it often can't access variables defined outside the task's scope. 10

...

How do we combine them?

We can make fearless and seamless structured concurrency by:

    1. Start with C's seamless structured concurrency.
    2. Only allow reading values created outside the parallel block.
    (my note: #2 means that you can read and write values created inside the parallel block, but you can only
     read values created outside of the parallel block)
    3. Relax #2's restrictions in data-race-free ways (we'll explore this in Part 2).

...

 As stated above, to enable fearless structured concurrency, we need to express that a reference points into a read-only region. 17 We can think of a region as a scope.

When we have a reference to an object in a read-only region, we:

    Can't modify it.
    Can load a member (or element) from it. The type system will remember the element is from a read-only region.
    Can pass it to a function, if the function sees the region as read-only.

...

Is there a catch?

There are a couple drawbacks:

    We can only read the data outside the parallel block. 21
        We relax this restriction in Part 2, using mutexes, channels, splitting, atomics, isolated sub-regions, and something called "region shifting".
    All globals must be channels, mutexes, atomics, or immutable. 22

---

do we want Oot to be able to express/make use of (without using some library):

Our goal of not prioritizing performance argues against many of those; in many cases, you can express the desired behavior with similar simplicity without them, and it's only for performance reasons that you would want them. So a key question is: which of those have expressivity/readability benefits in addition to performance benefits?

Since we aim to be a language that is good for concurrency and for "brain-like" algorithms, would omitting some of these make it (a) so non-performant that it wouldn't be good for concurrency, or (b) shift the relative performance between choices of algorithm so much that in many cases you would prefer some traditional algorithms over brain-like algorithms, whereas with a better system you would choose the brain-like algorithms?

One question here is, which concurrency-related properties are global properties? Recall that one way to distinguish expressivity from syntactic sugar is that a construct which genuinely increases expressivity can only be translated away by a global transformation of the code, whereas syntactic sugar is only a local transformation.

also, if some sort of concurrency "safety" (or simpler-to-reason-about-code) property is global, then we'd better focus on it more than if it's local, because if we want that property it's hard to bolt on later.

I feel like the way to organize the zoo of concurrency paradigms might be to organize them along axes of whether they do or don't provide various global properties.

---

" Of course, there are tantalizingly simple rules for avoiding deadlock. For example, always acquire locks in the same order [32]. However, this rule is very difficult to apply in practice because no method signature in any widely used programming language indicates what locks the method 6See http://ptolemy.eecs.berkeley.edu 8 acquires. You need to examine the source code of all methods that you call, and all methods that those methods call, in order to confidently invoke a method. Even if we fix this language problem by making locks part of the method signature, this rule makes it extremely difficult to implement symmetric accesses (where interactions can originate from either end). And no such fix gets around the problem that reasoning about mutual exclusion locks is extremely difficult. " -- The Problem with Threads

---

i think an important point that The Problem with Threads made was that sometimes it's better to have constructs that live outside of any single thread, and talk about the interaction between threads, for example the "merge" construct in their Figure 3 (page 11) (and the unlabeled 'tee' after it). They call this sort of thing a 'coordination language'.

---

"std::scoped_lock in C++17 basically eliminates the possibility of deadlocks" [1]

---

"Cache misses or mutex-locks is the choice, there are many concurrent data types that do not lock but that instead copy data. " [2]

--

" What problems do you have with C++? Pre-C++11, threading was basically unusable, but since then with std::async, STL mutexes, RAII lock guards, condition variables, atomic types, etc, writing multithreading code in C++ is downright pleasant." [3]

---

https://www.taichi-lang.org/ https://docs.taichi-lang.org/blog/accelerate-python-code-100x

---

https://github.com/google/nsync

---

https://www.google.com/search?client=firefox-b-1-d&q=mike+burrows+chubby

---

https://github.com/sourcegraph/conc https://news.ycombinator.com/item?id=34344514

---

https://textual.textualize.io/blog/2023/02/11/the-heisenbug-lurking-in-your-async-code/

---

[4] Threads Cannot be Implemented as a Library Hans-J. Boehm

if the compiler doesn't know about threads, it may make some incorrect optimizations:

---

https://tonyg.github.io/squeak-actors/ rec. by https://lobste.rs/s/gpar6n/weaknesses_smalltalk_are_strengths#c_56ucht https://eighty-twenty.org/2019/01/30/actors-for-squeak

---

https://syndicate-lang.org/

function chat(initialNickname, sharedDataspace, stdin) {
  spawn 'chat-client' {
    field nickName = initialNickname;

    at sharedDataspace assert Present(this.nickname);
    during sharedDataspace asserted Present($who) {
      on start console.log(`${who} arrived`);
      on stop  console.log(`${who} left`);
      on sharedDataspace message Says(who, $what) {
        console.log(`${who}: ${what}`);
      }
    }
    on stdin message Line($text) {
      if (text.startsWith('/nick ')) {
        this.nickname = text.slice(6);
      } else {
        send sharedDataspace message Says(this.nickname, text);
      }
    }
  }}

---

https://pboyd.io/posts/go-concurrency-fan-out-fan-in/ https://lobste.rs/s/rr73pz/go_concurrency_fan_out_fan

https://github.com/carlmjohnson/flowmatic

---

snej 28 days ago

link
            Go has this standard Context type, which is by convention passed down the call stack to convey bits of state, as an alternative to things like thread-local variables. Like many features of Go it’s annoyingly verbose, but it has the useful feature that as a caller you can set a timeout or expiration time on it, and as a caller there are various affordances for detecting or receiving a channel message when it expires. I think that’s quite a good idea.

---

oivey 19 hours ago

root parent next [–]

If subprocesses die (segfault maybe) it isn't uncommon for them to not be cleaned up and/or cause the parent process to hang while it waits for the zombie to respond. That's one I experienced last week on Python 3.9. A thread that experienced that would likely kill the parent process or maybe even exit with a stacktrace. Way easier to debug, and doesn't require me to search through running tasks and manually kill them after each debug cycle.

My impression is that the multiprocessing module is a heroic effort, but unfortunately making the whole system work transparently across multiple OSs and architectures is a nearly insurmountable problem.

reply

empthought 19 hours ago

root parent next [–]

You may be interested in the concurrent.futures library, available for over a decade now. It keeps you from shooting yourself in the foot like that.

https://docs.python.org/3/library/concurrent.futures.html

reply

KolenCh 18 hours ago

root parent next [–]

Why do you think it would help?

It provides a nice interface but is using multiprocessing or multithreading under the hood depending on which executor you use:

> The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.

reply

empthought 17 hours ago

root parent next [–]

Your trouble seems to involve not understanding how to set up signal handlers, which ProcessPoolExecutor handles for you and exposes via a BrokenProcessPool exception.

reply

wiseowise 14 hours ago

root parent next [–]

> Derived from BrokenExecutor (formerly RuntimeError), this exception class is raised when one of the workers of a ProcessPoolExecutor has terminated in a non-clean fashion (for example, if it was killed from the outside).

What if it hangs?

reply

empthought 11 hours ago

root parent next [–]

That isn’t the scenario originally described, but there is a timeout parameter in future.result().

reply

---

HippoBaro 1 day ago

parent next [–]

The argument here is that Rust chose to implement coroutines the wrong way. It went the route of stackless coroutines that need async/await and colored functions. This creates all the friction the article laments over.

But it also praises Go for its implementation, which is also based on a coroutine of a different kind. Stackful coroutines, which do not have any of these problems.

Rust considered using those (and, at first, that was the project's direction). Ultimately, they went to the stackless operational model because stackful coroutines require a runtime that preempts coroutines (to do essentially what the kernel does with threads). This was deemed too expensive.

Most people forget, however, that almost no one is using runtime-free async Rust. Most people use Tokio, which is a runtime that does essentially everything the runtime they were trying to avoid building would have done.

So we are left in a situation where most people using async Rust have the worst of both worlds.

That being said, you can use async Rust without an async runtime (or rather, an extremely rudimentary one with extremely low overhead). People in the embedded world do. But they are few, and even they often are unconvinced by async Rust for their own reasons.

reply

withoutboats3 1 day ago

root parent next [–]

Rust chose to drop the green thread library so that it could have no runtime, supporting valuable use cases for Rust like embedding a Rust library into a C binary, which we cared about. Go is not really usable for this (technically it's possible, but it's ridiculous for exactly this reason). So those sorts of users are getting a lot of benefit from Rust not having a green threading runtime. As are any users who are not using async for whatever reason.

However, async Rust is not using stackless coroutines for this reason - it's using stackless coroutines because they achieve a better performance profile than stackful coroutines. You can read all about it on Aaron Turon's blog from 2016, when the futures library was first released:

http://aturon.github.io/blog/2016/08/11/futures/

http://aturon.github.io/blog/2016/09/07/futures-design/

It is not the case that people using async Rust are getting the "worst of both worlds." They are getting better performance by default and far greater control over their runtime than they would be using a stackful coroutine feature like Go provides. The trade off is that it's a lot more complicated and has a bunch of additional moving parts they have to learn about and understand. There's no free lunch.

reply

-- https://news.ycombinator.com/item?id=37435515&p=2

---

" This patchset is the first step to open-source this work. As explained in the linked pdf and video, SwitchTo? API has three core operations: wait, resume, and swap (=switch). So this patchset adds a FUTEX_SWAP operation that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation on top of which user-space threading libraries can be built.

    Another common use case for FUTEX_SWAP is message passing a-la RPC between tasks: task/thread T1 prepares a message, wakes T2 to work on it, and waits for the results; when T2 is done, it wakes T1 and waits for more work to arrive. Currently the simplest way to implement this is
    a. T1: futex-wake T2, futex-wait
    b. T2: wakes, does what it has been woken to do
    c. T2: futex-wake T1, futex-wait
    With FUTEX_SWAP, steps a and c above can be reduced to one futex operation that runs 5-10 times faster.

A 5~10x speed improvement with FUTEX_SWAP certainly sounds compelling as does the information shared way back at LPC 2013 via the video below and the PDF slides.

"

---


"There’s an important distinction between a future—which does nothing until awaited—and a task, which spawns work in the runtime’s thread pool… returning a future that marks its completion." -- https://bitbashing.io/async-rust.html

---

throw10920 1 day ago

parent prev next [–]

> Async contamination

I've always wondered why the "color" of a function can't be a property of its call site instead of its definition. That would completely solve this problem - you declare your functions once, colorlessly, and then can invoke them as async anywhere you want.

reply

lmm 1 day ago

root parent next [–]

> I've always wondered why the "color" of a function can't be a property of its call site instead of its definition. That would completely solve this problem - you declare your functions once, colorlessly, and then can invoke them as async anywhere you want.

If you have a non-joke type system (which is to say, Haskell or Scala) you can. I do it all the time. But you need HKT and in Rust each baby step towards that is an RFC buried under a mountain of discussion.

reply

OvermindDL1 1 day ago

root parent next [–]

You can do it without HKTs with an effects system, which you can think of as another kind of generics that causes the function to be sliced in different ways depending on how it's called. There is movement in Rust to try to do this, but I wish it was done before async was implemented considering async could be implemented within it...

reply

ditsuke 1 day ago

root parent prev next [–]

The rust guys are working on this very problem with the keyword generics proposal https://blog.rust-lang.org/inside-rust/2022/07/27/keyword-ge...

reply

-- https://news.ycombinator.com/item?id=37435515

---

[5]

---

https://bitbashing.io/async-rust.html mentions "Because Rust coroutines are stackless, the compiler turns each one into a state machine that advances to the next .await.8 But this makes any recursive async function a recursively-defined type! A user just trying to call a function from itself is met with inscrutable errors until they manually box it or use a crate that does the same.", with footnote 8:

"Learn more in Without Boats’s Futures and Segmented Stacks or the C++ paper P1364: Fibers under the magnifying glass." https://without.boats/blog/futures-and-segmented-stacks/ says " Implementing the stack ... what do you do when the stack runs out of space? There are three basic options: Stack overflow: The stack has a preallocated amount of space, and when it runs out, an error is raised. Segmented stacks: When the stack runs out of space, a new segment is allocated for more space, which is cleaned up after it is no longer needed. Functionally, the stack is linked list. Copying stacks: When the stack runs out of space, a new, larger stack is allocated, and the contents of the current stack are copied to the new stack. Functionally, the stack is like a Rust Vec type.

OS threads use the first strategy. However, this is where the memory overhead problem first appears. If you can only use the preallocated space, it needs to be enough space for program execution outside of pathological cases. Therefore, an OS thread normally allocates a very large stack. Since most tasks will never approach the maximum size of an OS thread, this is wasted space if you use a 1 task to 1 thread system.

The second option seems very appealing, but it runs into certain serious performance problems. The problem in particular is that creating a new segment is much more expensive than just pushing a stack frame onto a stack would be, but you don’t know where segments are going to need to be created. It’s possible that a function in a hot loop will straddle the segment boundary, requiring a new segment to be allocated and freed every iteration of the loop.

For this reason, both Rust and Go ultimately abandoned segmented stacks. (These two links are also a great resource for learning more about the way these three strategies are implemented, for what it's worth.)

Go went with the third option. In order to copy stacks around, Go’s runtime needs to be able to rewrite pointers that point into the stack, which it can do because it already has a tracing garbage collector. Such a solution wouldn’t work in Rust, which is why Rust moved its greenthreads toward option 1, and then eventually got rid of them. In the long term, Rust’s greenthreads evolved into the futures model, which solves the problem rather differently.

Futures as a perfectly sized stack

One of the real breakthroughs of Rust’s futures model was its so-called “perfectly sized stack.” The state of a Rust future contains all of the state that future needs as it is suspended across yield points. Therefore, futures do not need to worry about running out of space in their “stack,” because they always allocate exactly the maximum amount of space they could ever need. Thus we sidestep the question entirely. However, this does not come without its own problems.

The first is the problem of recursive async functions. The compiler cannot determine how many times a recursive async function will recur, and therefore it cannot determine how much state that future will need to store, which corresponds to how deeply the function recurses. Therefore, async functions cannot be recursive without heap allocating the state of the recursion at some point. Whereas with threads the thread will happily use more stack space as you recur, ultimately handling stack overflows with whatever mechanism the runtime has implemented (e.g. an error or a reallocation), with futures you will encounter a compiler error.

...

It’s worth thinking about what’s happened here, though: essentially, you have created a segmented stack in your future. The point where you heap allocate the recursive future is the segment point in your stack; when your code reaches that point, it performs a new allocation for more future state, instead of having it allocated inline. This has all the potential performance pitfalls of segmented stacks (it’s more expensive to heap allocate than to use the space that’s already allocated), but with one important difference: instead of the segment allocation occurring at a random, unpredictable time, you explicitly denote in your code that you want the allocation to occur. "
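a minimal sketch of the recursion problem described above and the usual fix (plain std Rust, no crates; actually running it still needs some executor):

    use std::future::Future;
    use std::pin::Pin;

    // The direct version is rejected: the generated future type would contain
    // itself, so the compiler cannot compute its size.
    //
    // async fn countdown(n: u32) {
    //     if n > 0 { countdown(n - 1).await; }
    // }

    // Boxing the recursive call heap-allocates that part of the "perfectly sized
    // stack", i.e. an explicitly placed stack segment.
    fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()> + Send>> {
        Box::pin(async move {
            if n > 0 {
                countdown(n - 1).await;
            }
        })
    }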

---

[6]

---

My summary of https://bitbashing.io/async-rust.html https://news.ycombinator.com/item?id=37435515 https://lobste.rs/s/cryfiu/async_rust_is_bad_language https://notgull.net/why-you-want-async/

:

Rust’s async features are thought to be a flexible, performant solution to the desire for green threading. However, what Rust itself provides are just building blocks that you can use to build async systems in libraries, and the most common such library is tokio, and in order to use tokio the type signatures of the functions that you want to run async-ly, and any functions that call them, have to have certain properties. For this reason, a lot of OTHER libraries have type signatures compatible with tokio’s requirements. This makes the ecosystem harder to learn, because in order to learn those other libraries, you have to learn a little bit about this async stuff even when you don’t need async yourself. Comparisons are made with other languages which have easier-to-learn green threading, but at the expense of being less flexible, or less performant, or less safe. A debate has arisen between people who think that this additional ecosystem complexity is not worth it (because it complicates life for those who don’t need to use green threads, or who need green threads but don’t need something as flexible and performant as this solution), and those who think that it is (because a lot of important projects do need green threads, and this is a flexible, performant solution for them).

---

linkdd 3 days ago

link flag
    The famous color article implied that async should be an implementation detail to hide, but that’s merely a particular design choice, which chooses implicit magic over clarity and guarantees.

async/await is the callee telling the caller how it wants to be executed/scheduled.

Languages with no async/await let the caller decide how the callee should be executed/scheduled.

In Erlang/Elixir, any function can be given to spawn(), or Task.start(). In Go, any function can be given to go ....

It’s up to the caller to determine how the function should be scheduled. async/await tend to contaminate your whole code base. I’ve seen many Python and some Rust+tokio codebases where everything up to main() was an async function. At that point, can we just get rid of the keyword?

    13
    kornel edited 3 days ago | link | flag | 

You’ve given examples of higher-level languages with a fat runtime. This includes golang, which requires special care around FFI.

This “implicit async is fine, you don’t need the await syntax” is another variant of “implicit allocations/GC are fine, you don’t need lifetime syntax”.

True in most cases, but Rust is just not that kind of language, on purpose. It’s designed to “contaminate” codebases with very explicit code, and give low-level control and predictable performance of everything.

---

https://paste.sr.ht/~icefox/16cef4589da456a852eba97888e73b8c0c239972

" sourcehut

Log in — Register

    ~icefox/coro.exs
    view paste
  1. coro.exs -rw-r--r-- 1.6 KiB? View raw

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59

defmodule Coro do
  @doc "Create a new coroutine"
  @spec create((pid -> any())) :: pid
  def create(f) do
    IO.puts "create"
    coro_driver = fn ->
      # Block until we are resumed
      receive do
        {:resume, resumer} ->
          IO.puts("Resumed for the first time")
          f.(resumer)
        other -> raise "Invalid initial resume"
      end
    end
    spawn(coro_driver)
  end

  @doc "Yield to the thing that resumed this coro"
  @spec yield(pid, any) :: nil
  def yield(resumer, value) do
    # Having to pass the resumer explicitly is wonkity but it's a start
    IO.puts "yield"
    send(resumer, {:yield, self(), value})
    # Block waiting for resumption
    receive do
      {:resume, resumer} -> IO.puts("wossname")
      other -> raise "Invalid resume"
    end 
  end
  @doc "Resume the given coro"
  @spec resume(pid) :: any
  def resume(coro) do
    IO.puts "resume"
    # We need to monitor a process to detect if it quits without yielding to us,
    # see https://www.erlang.org/doc/reference_manual/processes#monitors
    :erlang.monitor(:process, coro)
    send(coro, {:resume, self()})
    receive do
      # The coro yielded to us
      {:yield, pid, value} -> 
        IO.puts("Coro #{inspect pid} yielded to us with value #{inspect value}")
      # The coro exited without yielding
      {:DOWN, ref, _process, _pid2, reason} ->
        IO.puts "Coro exited normally #{inspect reason}"
        :erlang.demonitor(ref)
      other -> raise "aw heck: #{inspect other}"
    end
  end
end

x = Coro.create(fn resumer ->
  IO.puts "Hello world from inside a coro"
  Coro.yield(resumer, :hi)
  3
end)

Coro.resume(x)
Coro.resume(x)

"

---

11 benwaffle 3 days ago

link flag

That’s where I think Go really outshines Rust. Thanks to green threads, all I/O code in Go looks blocking, and reads like simple, imperative code, but is actually translated to non-blocking under the hood.

    50
    david_chisnall 3 days ago | link | flag | 

I think this is where Rust really outshines Go. In Rust, all of these problems are exposed as things in the type system that cause the compiler to shout at you. In Go, they’re undefined behaviour that causes weird nondeterminism at run time.

If you capture a pointer to an object in a goroutine and modify it in another, that's UB in Go unless you have some locking around it. The compiler will accept it because Go has no notion of transfer of ownership at the type level and so there's no way it can tell the safe cases apart. If the object contains a slice as a field and both goroutines assign to it, you can end up with the bounds of one and the base of another, and now some unrelated code that writes through that slice will corrupt memory.

If you capture a pointer in an async context in Rust and try to modify it outside it, the compiler tells you it doesn't have the Send trait and refuses to compile. You fix the bug at build time. It might be hard, but it's easier than tracking down a heisenbug caused by a data race.

    5
    shanemhansen 3 days ago | link | flag | 

go build -race

go test -race

    17
    calvin 2 days ago | link | flag | 

This is like replying to “Rust’s borrow checker helps prevents leaks at compile time” with valgrind --leak-check=yes ./a.out.

---

annotations to allow (some? complete?) static analysis for deadlock prevention:

https://abseil.io/docs/cpp/guides/synchronization#thread-annotations

---

" For 2) runtime (libasync.so) implementations would have to cover a lot of aspects they may not need (async compute-focused runtimes like bevy don't need timers, priorities, or even IO) and expose a restrictive API (what's a good generic model for a runtime IO interface? something like io_uring, dpdk, or epoll? what about userspace networking as seen in seastar?). A pluggable runtime mainly works when the language has a smaller scope than "systems programming" like Ponylang or Golang. " -- https://news.ycombinator.com/item?id=37451204

---


 peterbourgon edited 5 days ago | link | flag |
    all have been for parallel-processing a loop - which doesn’t let goroutines outlive the calling function

I maintain a “pithy” Go style guide, and this is almost verbatim what I recommend in one of the concurrency sections.

    Goroutines should have well-defined lifetimes
        Goroutines should almost never outlive the function they’re created from
        Avoid “fire and forget” goroutines – know how to stop the goroutine, and verify it’s stopped
        Avoid spaghetti synchronization: sync.Atomic as a state bit, starting/stopping/done channels, etc.
        …

This is closely related to a section about API design.

    Write synchronous APIs
        By default, do work in regular blocking functions
        Let your callers add concurrency if they want to
        No: Start(), Stop(), Wait(), Done()
        Yes: Run(context.Context) error
        Model periodic tasks as sync methods that should be regularly called, not autonomous goroutines
        WaitGroups, Mutexes, etc. as parameters or return values are almost always a design error

---

https://without.boats/blog/thread-per-core/ https://news.ycombinator.com/item?id=37790745

Thread-per-core (without.boats)

" I want to address a controversy that has gripped the Rust community for the past year or so: the choice by the prominent async “runtimes” to default to multi-threaded executors that perform work-stealing to balance work dynamically among their many tasks. Some Rust users are unhappy with this decision, so unhappy that they use language I would characterize as melodramatic:

    The Original Sin of Rust async programming is making it multi-threaded by default. If premature optimization is the root of all evil, this is the mother of all premature optimizations, and it curses all your code with the unholy Send + 'static, or worse yet Send + Sync + 'static, which just kills all the joy of actually writing Rust.... What these people advocate instead is an alternative architecture that they call “thread-per-core.” They promise that this architecture will be simultaneously more performant and easier to implement. In my view, the truth is that it may be one or the other, but not both.

(Side note: Some people prefer instead just running single threaded servers, claiming that they are “IO bound” anyway. What they mean by IO bound is actually that their system doesn’t use enough work to saturate a single core when written in Rust: if that’s the case, of course write a single threaded system. We are assuming here that you want to write a system that uses more than one core of CPU time.) ... Thread-per-core

One of the biggest problems with “thread-per-core” is the name of it. All of the multi-threaded executors that users are railing against are also thread-per-core, in the sense that they create an OS thread per core and then schedule a variable number of tasks (expected to be far greater than the number of cores) over those threads. As Pekka Enberg tweeted in response to a comment I made about thread per core:

    Thread per core combines three big ideas: (1) concurrency should be handled in userspace instead of using expensive kernel threads, (2) I/O should be asynchronous to avoid blocking per-core threads, and (3) data is partitioned between CPU cores to eliminate synchronization cost and data movement between CPU caches. It’s hard to build high throughput systems without (1) and (2), but (3) is probably only needed on really large multicore machines.

Enberg’s paper on performance, which is called “The Impact of Thread-Per-Core Architecture on Application Tail Latency” (and which I will return to in a moment), is the origin of the use of the term “thread-per-core” in the Rust community. ... The distinction being made is really between two optimizations you can make once you have a thread-per-core architecture, and which are in tension: work-stealing tasks between your threads and sharing as little state as possible between them. ... No one would dispute that carefully architecting your system to avoid moving data between CPU caches will achieve better performance than not doing that, but I have a hard time believing that someone who’s biggest complaint is adding Send bounds to some generics is engaged in that kind of engineering. "

"

duped 1 day ago

next [–]

Personally I feel like this post misses the forest for the trees.

The debate isn't about thread-per-core work stealing executors, it's whether async/await is a good abstraction for it in Rust. And the more async code I write the more I feel that it's leaky and hard to program against.

The alternative concurrency model people want is structured concurrency via stackful coroutines and channels on top of a work stealing executor.

Until someone does the work to demo that and compare it to async/await with futures I don't think there's any productive discussion to be had. People who don't like async are going to avoid it and people who don't care about making sure everything and its mother is Send + Sync + 'static are going to keep on doing it.

reply

jandrewrogers 1 day ago

prev next [–]

The original problem thread-per-core was invented to solve ~15 years ago was scalability and efficiency of compute on commodity many-core servers. Contrary to what some have suggested, thread-per-core was expressly about optimizing for CPU bound workloads. It turned out to be excellent for high-throughput I/O bound workloads later, albeit requiring more sophisticated I/O handling. When I read articles like this, it looks like speed-running the many software design mistakes that were made when thread-per-core architectures were introduced. To be fair, thread-per-core computer science is poorly documented, having originated primarily in HPC.

This article focuses on a vexing problem of thread-per-core architectures: balancing work across cores. There are four elementary models for this, push/pull of data/load. Work-stealing is essentially the "load pull" model. This only has low overhead if you almost never need to use it e.g. if the work is naturally balanced in a way that few real-world problems actually are. For workloads where dynamic load skew across cores is common, which is the more interesting problem, work-stealing becomes a performance bottleneck due to coordination overhead. Nonetheless, it is easy to understand so people still use work-stealing when the workload is amenable to it, it just doesn’t generalize well. There are a few rare types of workloads (not mentioned in the article) where it is probably the best choice. The model with the most gravity these days seems to be "data push", which is less intuitive but requires much less thread coordination. The "data push" model has its own caveats — there are workloads for which it is poor — but it generalizes well to most common workloads.

Thread-per-core architectures are here to stay -- they cannot be beat for scalability and efficiency. However, I have observed that most software engineers have limited intuition for what a modern and idiomatic thread-per-core design looks like, made worse by the fact that there are relatively few articles or papers that go deep on this subject.

 asd4 1 day ago | prev | next [–]

"What they mean by IO bound is actually that their system doesn’t use enough work to saturate a single core when written in Rust: if that’s the case, of course write a single threaded system."

Many of the applications I write are like this, a daemon sitting in the background reacting to events. Making them single threaded means I can get rid of all the Arc and Mutex overhead (which is mostly syntactic at that point, but makes debugging and maintenance easier). Being able to do this is one of the things I love about Rust: only pay for what you need.

The article that this one is responding to calls out tokio and other async libraries for making it harder to get back to a simple single threaded architecture. Sure there is some hyperbole but I generally agree with the criticism.

Making everything more complex by default because its better for high throughput applications seems to be opposite of Rust's ideals.

reply

"

---

"

Sytten 30 days ago

parent context favorite on: Maybe Rust isn’t a good tool for massively concurr...

A lot of that pain could have been avoided if the language had better primitives for async in the std or in the futures crate. Like a trait that executor must implement and a "default" blocking executor to execute async code from sync.

Right now even building a library that support multiple async runtimes is a PITA, I have done it a couple times. So you end up supporting either just tokio and maybe async-std. " -- https://news.ycombinator.com/item?id=37436673

---

https://without.boats/blog/why-async-rust/

---

so:

the exterior/interior iterator thing in https://without.boats/blog/why-async-rust/ is interesting. Elsewhere, graydon makes the point that you want to optimize (and vectorize) iteration, so interior is better for that. But in https://without.boats/blog/why-async-rust/ even the interior iterator thing is for extensible collections; probably most of the optimization benefits come from special-casing some core language iteration constructions (like for loops and 'map's (as in applymap)) and data structures (like arrays and dicts) in the language implementation, so that, for example, applying the increment operator to each element of an array of ints is very fast.

but i guess you still want an interface that custom-written collections (trees, etc) can have to allow iterating over them. The exterior iterator API seems 'simpler' to my poor mind. And that's the Python Iterator protocol, and Python is good. But graydon is pretty smart. So i'm not sure what i think is the best interface there.
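my note: a tiny Rust contrast of the two calling conventions. Exterior: the caller pulls items one at a time via next(). Interior: the iterator drives the traversal and the caller just supplies a closure, which gives the implementation more freedom to specialize the loop.

    struct Counter {
        n: u32,
    }

    impl Iterator for Counter {
        type Item = u32;
        // exterior protocol: state lives in the iterator, the caller calls next()
        fn next(&mut self) -> Option<u32> {
            if self.n < 3 {
                self.n += 1;
                Some(self.n)
            } else {
                None
            }
        }
    }

    fn main() {
        // exterior style: the for loop (the caller) pulls each value
        let c = Counter { n: 0 };
        for x in c {
            println!("exterior: {x}");
        }

        // interior style: the iterator applies the closure itself
        let c = Counter { n: 0 };
        c.for_each(|x| println!("interior: {x}"));
    }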

---

hdevalence 2 days ago

next [–]

A lot of discussion about async Rust assumes that the reason one would want to use async/Futures is for performance and scalability reasons.

Personally, though, I would strongly prefer to use async rather than explicit threading even for cases where performance wasn’t the highest priority. The conceptual model is just better. Futures allow you to cleanly express composition of sub-tasks in a way that explicit threading doesn’t:

https://monkey.org/~marius/futures-arent-ersatz-threads.html

reply

ori_b 2 days ago

parent next [–]

Futures are nicely equivalent to one shot channels with threads.

reply

pornel 2 days ago

root parent next [–]

Not quite: Rust futures also have immediate cancellation and easy timeouts that can be imposed externally on any future.

In threads that perform blocking I/O you don't get that, and need to support timeouts and cancellation explicitly in every blocking call.

reply
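my note: a sketch of what pornel means, assuming the tokio runtime; slow_query is a made-up stand-in for any future. The caller imposes the timeout from outside, and when it fires the inner future is simply dropped (cancelled) without having to cooperate.

    use std::time::Duration;
    use tokio::time::{sleep, timeout};

    async fn slow_query() -> u32 {
        sleep(Duration::from_secs(10)).await;
        42
    }

    #[tokio::main]
    async fn main() {
        // wrap *any* future in a timeout imposed by the caller
        match timeout(Duration::from_millis(100), slow_query()).await {
            Ok(v) => println!("got {v}"),
            Err(_elapsed) => println!("timed out; the query future was dropped"),
        }
    }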

---

sounds like Rust async will be greatly improved with “keyword generics” ("where you can make a function generic over the presence or absence of async") and/or "async traits"

---

in Rust async, here's some comments about something called "cancel safety" from https://lobste.rs/s/6fjkeh/why_async_rust#c_wuwsgs:

My strongest concern with async Rust is actually around cancel safety, which I feel is not presently represented well either in the type system, or in many of the patterns used in prominent futures libraries.

The easiest one to understand is, I think, the tokio async Mutex, which is not really a mutex at all in the classic sense, because it is automatically unlocked on cancellation. In the case where this occurs, there is no poisoning. Cancellation occurs outside of the control of the task that held the mutex, so there’s really nothing you can do to protect yourself completely, and thus you cannot depend on it to protect a critical section. I think if it had been called something else like a governor for arranging a work queue with concurrency of 1 it would more accurately describe what it does and be less prone to misuse.

None of this stops me from using async Rust, FWIW, which I think is pretty great otherwise. We have some folks working on some cancel safe futures code at Oxide: https://github.com/oxidecomputer/cancel-safe-futures

We also have someone investigating async runtimes for embedded environments: https://github.com/cbiffle/lilos

    ~
    cgenschwap 32 hours ago | link | flag | 

Cancel safety is a good one. Do you know if there is any work at a language level to address it? I haven’t really dug into the details, and I’ll have to read through the cancel-safe-futures because I was under the impression that the primary issue was not being able to differentiate a drop from completing vs. canceling. I’m realizing I should probably have a better understanding of the constraints here!

Thanks for linking lilos, you folks at Oxide are doing some cool stuff :)

    9
    sunshowers 28 hours ago | link | flag | 

Primary author of cancel-safe-futures here.

Not being able to distinguish between cancellation versus completion is definitely part of the issue here! Having that would enable poisoning mutexes on premature drop, for example.

But that wouldn’t address all the issues with cancellations. For example, if you’re flushing two buffered writers concurrently, using try_join is bad because if one of them fails, the code wouldn’t try and flush the other future as far as possible. To address that I wrote join_then_try: https://docs.rs/cancel-safe-futures/latest/cancel_safe_futures/macro.join_then_try.html#why-use-join_then_try

---

(on Rust async)

    api 22 hours ago | link
    My biggest gripe about async is actually tokio… or rather the fact that you pretty much must use tokio. By deciding not to put an async runtime into std and not to put a fully fleshed out framework for generalizing async runtimes into std either, Rust created a situation where you end up with one de-facto standard runtime that everyone uses. You get the worst of both worlds then: it’s not standard, but you have no choice unless you want to fork a ton of code.
    Tokio is not even the best runtime. Smol is much better with its use of lifetimes and structured concurrency. The fact that dropping a JoinHandle in tokio is a no-op is just awful. It’s also not the fastest runtime, though the difference these days is not large.
         
        pimeys 19 hours ago | link
        There are a few traits that would already be very beneficial in std, such as AsyncRead, AsyncWrite and, as a bonus, AsyncDrop. If we had the first two and a common way of spawning, it would be much easier to write runtime-independent libraries.
        Waiting for the next article from boats, if they would address some of these issues.

---

samsquire edited 23 hours ago (unread)

link flag
        Thank you for this David.
        I am a hobbyist in multithreading and parallelism and I really appreciate this research.
        If I understand correctly, you achieve thread safety by only scheduling behaviours when it can be guaranteed without conflict or interference and you use graphs to do so. Your “when” statement seems to set up these preconditions and relationships and those graphs. (I use graphs for parallelism in devops-pipeline.com but I warn you, my implementation is trivial: it’s just a “join” on graph ancestors)
        I tried hard to achieve parallelism while running at full speed in all threads with lock free algorithms but there are two problems: you cannot scale single threaded mutation to the same memory location because synchronization AND contention of locks slow things below even single threaded mutation performance, so adding cores does not buy you anything because memory mutation is your bottleneck.
        If you have a single integer and you want to scale updates to that integer from multiple threads, you need to split the integer per thread. You can have an eventually consistent view for a snapshot of total thread state if you like.
        I am using scheduling to achieve mutual exclusion with my nonblocking barrier primitive in C (the code is zero clause BSD licenced). I use my own lock free algorithm inspired by a lock free ringbuffer (whose ringbuffer was written by Alexander Krizhanovsky) and Wait-Free Queues With Multiple Enqueuers and Dequeuer. I rely on happens-before relationships for safety.
        I send data in bulk, as in bulk synchronous parallel, every thread sends to every other thread on a schedule. This is “stop the world synchronization”. Or turning every mutation into a queue.
        On a 6 core 10th generation Intel i7 NUC machine with 12 hardware threads I get 200-400 million requests per second of sending data between cores depending on buffer size. On a 96 core c5.24xlarge machine in AWS I get 1.6 billion intrathread sends per second. I haven’t tested latency. I have read that people were able to get 500 million messages a second with LMAX Disruptor on .NET.
        I think I can do lock insertions automatically on a dining philosophers problem with my “statelines” and multiplexing idea.
        Regarding bank simulation, I get 300-600 million requests per second in Java by sharding money across buckets: that is rather than sharding by account, I split an £1000 total balance into £125 ×8 buckets per thread if there were 8 threads.
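(my note: a minimal Rust sketch of the splitting idea above: each thread updates only its own shard, and a consistent total is read afterwards by summing the shards.)

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::thread;

    fn main() {
        const THREADS: usize = 8;
        let shards: Vec<AtomicU64> = (0..THREADS).map(|_| AtomicU64::new(0)).collect();

        thread::scope(|s| {
            for shard in &shards {
                s.spawn(move || {
                    for _ in 0..1_000_000 {
                        // only this thread ever writes this shard, so there is no
                        // cross-core contention on a single memory location
                        shard.fetch_add(1, Ordering::Relaxed);
                    }
                });
            }
        });

        let total: u64 = shards.iter().map(|shard| shard.load(Ordering::Relaxed)).sum();
        println!("total = {total}");
    }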

---