great article on Julia GPU programming:
[1]
some notes:
- main package is GPUArrays
- lots of great sample code, including profiling example code
- "One might think that the GPU performance suffers from being written in a dynamic language like Julia, but Julia's GPU performance should be pretty much on par with the raw performance of CUDA or OpenCL. Tim Besard did a great job at integrating the LLVM Nvidia compilation pipeline to achieve the same – or sometimes even better – performance as pure CUDA C code. Tim published a pretty detailed blog post in which he explains this further. CLArrays approach is a bit different and generates OpenCL C code directly from Julia, which has the same performance as OpenCL C!"
- "GPUs have their own memory space with video memory (VRAM), different caches, and registers. Whatever you do, any Julia object must get transferred to the GPU before you can work with it. Not all types in Julia work on the GPU."
- a plain-data type is called an 'isbits' type: an immutable type (e.g. an immutable struct or tuple) whose fields are all themselves isbits. Anything containing a heap-allocated reference (an Array, a String, a mutable struct, ...) is not isbits
- types with the 'isbits' property "can be used without constraints on the GPU"
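A quick way to check the property in plain Julia (no GPU needed) is `Base.isbitstype`; a minimal sketch:

```julia
struct Point        # immutable struct with only isbits fields
    x::Float32
    y::Float32
end

isbitstype(Point)             # true: plain data, GPU-friendly
isbitstype(Tuple{Int, Point}) # true: tuples of isbits types are isbits
isbitstype(Vector{Float32})   # false: an Array is a heap-allocated reference
isbitstype(String)            # false: heap-allocated
```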
- GPUArray constructors: array literals, fill, rand, ranges (1:10 syntax)
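A sketch of those constructors, assuming the CuArrays backend used in the article (CLArrays works the same way); the `rand` signature here is my assumption from the article's style, the `fill` form is the one quoted below:

```julia
using CuArrays  # or CLArrays; both are GPUArrays backends

a = CuArray([1f0, 2f0, 3f0])    # upload a CPU array literal
b = fill(CuArray, 0f0, (4, 4))  # 4x4 array of Float32 zeros
c = CuArray(1:10)               # from a range, using 1:10 syntax
r = rand(CuArray, Float32, 32)  # 32 random Float32s (assumed signature)
```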
- "Julia's fusing dot broadcasting notation. This notation allows you to apply a function to each element of an array, and create a new array out of the return values of f. This functionality is usually referred to as a map. The broadcast refers to the fact that arrays with different shapes get broadcasted to the same shape." example:

```julia
x = zeros(4, 4) # 4x4 array of zeros
y = zeros(4)    # 4 element array
z = 2           # a scalar
```
- y's 1st dimension gets repeated for the 2nd dimension in x
- and the scalar z gets repeated for all dimensions
- `x .+ y .+ z` is equivalent to `broadcast(+, broadcast(+, x, y), z)`
more: https://julialang.org/blog/2018/05/extensible-broadcast-fusion
https://julia.guide/broadcasting
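The shapes above work out like this in plain Julia (runs on CPU arrays; a minimal sketch):

```julia
x = zeros(4, 4)  # 4x4 array of zeros
y = ones(4)      # 4 element array
z = 2            # a scalar

# y is treated as a column and repeated along x's 2nd dimension,
# z is repeated for all dimensions; equivalent to
# broadcast(+, broadcast(+, x, y), z)
result = x .+ y .+ z   # 4x4 array, every element == 3.0
```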
- "This means any Julia function that runs without allocating heap memory (only creating isbits types), can be applied to each element of a GPUArray and multiple dot calls will get fused into one kernel call. As kernel call latency is high, this fusion is a very important optimization."
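For example, a scalar function that allocates no heap memory can be broadcast over a GPUArray, and chained dot calls fuse into a single kernel launch. A sketch, assuming the CuArrays backend:

```julia
using CuArrays

# heap-allocation-free scalar function: only isbits values involved
sigmoid(x) = 1f0 / (1f0 + exp(-x))

xs = CuArray(rand(Float32, 1024))
ys = similar(xs)

# the whole right-hand side fuses into one kernel call
ys .= sigmoid.(xs) .* 2f0 .+ 1f0
```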
- " Some more operations supported by GPUArrays:
Conversions and copy! to CPU arrays
multi dimensional indexing and slicing (xs[1:2, 5, :])
permutedims
Concatenation (vcat(x, y), cat(3, xs, ys, zs))
map, fused broadcast (zs .= xs.^2 .+ ys .* 2)
fill(CuArray, 0f0, dims), fill!(gpu_array, 0)
Reduction over dimensions (reduce(+, xs, dims = 3), sum(x -> x^2, xs, dims = 1))
Reduction to scalar (reduce(*, xs), sum(xs), prod(xs))
Various BLAS operations (matrix*matrix, matrix*vector)
FFTs, using the same API as julia's FFT" (note: lots of hyperlinks in there in the original)
- to launch an arbitrary (GPU-compatible) kernel, use gpu_call. The kernel is called many times in parallel with the given arguments; it doesn't have to be just a map, since it can index into the whole array, and into multiple arrays if you pass them as arguments. The `A::GPUArray` argument isn't just for sizing the thread count; per the docs, it dispatches to the correct backend and supplies default launch parameters: "gpu_call. It can be called as gpu_call(kernel, A::GPUArray, args) and will call kernel with the arguments (state, args...) on the GPU. State is a backend specific object to implement functionality like getting the thread index. A GPUArray needs to get passed as the second argument to dispatch to the correct backend and supply the defaults for the launch parameters."
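A sketch of a gpu_call kernel in the spirit of that API; `linear_index(state)` is assumed here to be the backend helper for getting this thread's global index (name per the GPUArrays docs of that era):

```julia
using GPUArrays, CuArrays

# the kernel receives the backend-specific `state` first, then our args
function vadd_kernel(state, a, b, c)
    i = linear_index(state)   # this thread's global index (assumed helper)
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end

a = CuArray(rand(Float32, 1024))
b = CuArray(rand(Float32, 1024))
c = similar(a)

# second argument (a GPUArray) picks the backend and default launch parameters
gpu_call(vadd_kernel, c, (a, b, c))
```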
---
" I had some extended notes here about "less-mainstream paradigms" and/or "things I wouldn't even recommend pursuing", but on reflection, I think it's kinda a bummer to draw too much attention to them. So I'll just leave it at a short list: actors, software transactional memory, lazy evaluation, backtracking, memoizing, "graphical" and/or two-dimensional languages, and user-extensible syntax. If someone's considering basing a language on those, I'd .. somewhat warn against it. Not because I didn't want them to work -- heck, I've tried to make a few work quite hard! -- but in practice, the cost:benefit ratio doesn't seem to turn out really well. Or hasn't when I've tried, or in (most) languages I've seen. " [2]
---
" Heterogeneous memory and parallelism
These are languages that try to provide abstract "levels" of control flow and data batching/locality, into which a program can cast itself, to permit exploitation of heterogeneous computers (systems with multiple CPUs, or mixed CPU/GPUs, or coprocessors, clusters, etc.)
Languages in this space -- Chapel, Manticore, Legion -- haven't caught on much yet, and seem to be largely overshadowed by manual, not-as-abstract or not-as-language-integrated systems: either cluster-specific tech (like MPI) or GPU-specific tech like OpenCL/CUDA. But these still feel clunky, and I think there's a potential for the language-supported approaches to come out ahead in the long run. " [3]