great article on Julia GPU programming:
some notes:
more: https://julialang.org/blog/2018/05/extensible-broadcast-fusion
https://julia.guide/broadcasting
Conversions and copy! to CPU arrays
multi dimensional indexing and slicing (xs[1:2, 5, :])
permutedims
Concatenation (vcat(x, y), cat(3, xs, ys, zs))
map, fused broadcast (zs .= xs.^2 .+ ys .* 2)
fill(CuArray, 0f0, dims), fill!(gpu_array, 0)
Reduction over dimensions (reduce(+, xs, dims = 3), sum(x -> x^2, xs, dims = 1))
Reduction to scalar (reduce(*, xs), sum(xs), prod(xs))
Various BLAS operations (matrix*matrix, matrix*vector)
FFTs, using the same API as julia's FFT" (note: lots of hyperlinks in there in the original)
---
" I had some extended notes here about "less-mainstream paradigms" and/or "things I wouldn't even recommend pursuing", but on reflection, I think it's kinda a bummer to draw too much attention to them. So I'll just leave it at a short list: actors, software transactional memory, lazy evaluation, backtracking, memoizing, "graphical" and/or two-dimensional languages, and user-extensible syntax. If someone's considering basing a language on those, I'd .. somewhat warn against it. Not because I didn't want them to work -- heck, I've tried to make a few work quite hard! -- but in practice, the cost:benefit ratio doesn't seem to turn out really well. Or hasn't when I've tried, or in (most) languages I've seen. " [2]
---
" Heterogeneous memory and parallelism
These are languages that try to provide abstract "levels" of control flow and data batching/locality, into which a program can cast itself, to permit exploitation of heterogeneous computers (systems with multiple CPUs, or mixed CPU/GPUs, or coprocessors, clusters, etc.)
Languages in this space -- Chapel, Manticore, Legion -- haven't caught on much yet, and seem to be largely overshadowed by manual, not-as-abstract or not-as-language-integrated systems: either cluster-specific tech (like MPI) or GPU-specific tech like OpenCL/CUDA. But these still feel clunky, and I think there's a potential for the language-supported approaches to come out ahead in the long run. " [3]
---
msangi on Aug 19, 2017 [-]
It's interesting that he doesn't want to draw too much attention to actors while they are prominent in Chris Lattner's manifesto for Swift [1]
[1] https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9...
mcguire on Aug 19, 2017 [-]
Actors are a bit of an off-the-wall paradigm (https://patterns.ponylang.org/async/actorpromise.html), and I'm (as an old network protocol guy) not sure I'm happy with the attempts to make message passing look like sequential programming (like async/await).
I kinda see where Graydon is coming from. I have this broken half-assed Pony parser combinator thing staring at me right now.
---
kernelbandwidth on Aug 20, 2017 [-]
Would you call Erlang/Elixir actor-based or just having actor-like features?
Actors in Erlang seem more emergent to me than fundamental. To first order, Go and Erlang use similar concurrency strategies with cheap green threads and channels for communication. The main difference being that Go has channels as separate objects that can be passed around (leaving the goroutine anonymous), while Erlang fuses the channels into the process. In this respect, they both have similar levels of "actor-ness" in my mind. The biggest upside that I can see to channels being separate is that a goroutine can have multiple channels, they can be sent around (though so can PIDs), etc. This matters a lot because the channels in Go are statically typed, while Erlang/Elixir's dynamic typing makes this less meaningful since you can send anything on the same channel.
Of course, I suppose if one defines an actor as a sequential threaded process with a built in (dynamically-typed) channel, then Erlang is actor-based, so maybe I contradicted myself. In Erlang/OTP, the actor part is important to the supervision tree error handling strategy, which I think is the biggest upside, but it's not obvious to me that you need the channel and the process totally fused together to handle that.
Specifically in response to your "cheap concurrency good, actors bad" thought, my hypothesis is that actors are at least the more natural strategy in a dynamically typed language, which is why they seem to work well in Erlang/Elixir, but don't see much use in languages like Haskell (or Scala, it seems, where I at least thought the Akka actors were kind of awful). Meanwhile, channels seem to fit better with static typing, though I can't quite put my finger on why.
jerf on Aug 21, 2017 [-]
"Would you call Erlang/Elixir actor-based or just having actor-like features?"
First, fair question, and I appreciate the nature of your follow-on discussion and musings.
IMHO, one of the lessons of Go for other languages is just how important the culture of a language can be. I say that because on a technical level, Go does basically nothing at all to "solve" concurrency. When you get down to it, it's just another threaded language, with all the concurrency issues thereto. An Erlanger is justified in looking at Go's claim to be "good at concurrency" and wondering "Uh... yeah... how?"
And the answer turns out to be the culture that Go was booted up with moreso than the technicals. When you have a culture of writing components to share via communication rather than sharing memory, and, even for the bits that do share memory, of trying to isolate those into very small elements rather than having these huge conglomerations of locks for which half-a-dozen must be taken very carefully to do anything, you end up with a mostly-sane concurrency experience rather than a nightmare. Technically you could have done that with C++ in the 90s, it's just that nobody did, and none of the libraries would have helped you out.
That did not directly bear on your question. I mention that because I think that while you are correct that Erlang is technically not necessarily actor-oriented, the culture is. OTP pushes you pretty heavily in the direction of actors. Where in Go a default technique of composing two bits of code is to use OO composition, in Erlang you bring them both up as actors using gen_* and wire them together.
"my hypothesis is that actors are at least the more natural strategy in a dynamically typed language, which is why they seem to work well in Erlange/Elixir, but don't see much use in languages like Haskell"
I can pretty much prove that they don't: https://github.com/thejerf/suture It's the process that needs monitoring, and that process may have 0-n ways to communicate. But that's not a criticism of Erlang, as I think that's actually what it does and it just happens to have a fused message box per process.
"my hypothesis is that actors are at least the more natural strategy in a dynamically typed language, which is why they seem to work well in Erlange/Elixir, but don't see much use in languages like Haskell"
An intriguing hypothesis I'll have to consider. Thank you.
kernelbandwidth on Aug 21, 2017 [-]
I had not really considered the design patterns of the culture vs the design patterns of the language; this is a very good point.
> I can pretty much prove that they don't: https://github.com/thejerf/suture It's the process that needs monitoring, and that process may have 0-n ways to communicate. But that's not a criticism of Erlang, as I think that's actually what it does and it just happens to have a fused message box per process.
One, very neat library. Two, while I agree this proves the point that the actor model is not needed in the language to build a process supervisor, I think that your Go Supervisor looks a lot like an actor, at least in the way Erlang/Elixir uses them. From what I can see, the Supervisor itself works by looping over channel receives and acting on them. The behavior of the Supervisor lives in a separate goroutine, and you pass around an object that can send messages to this inner behavior loop via some held channels. So basically the object methods provide a client API and the inner goroutine plays the role of a server, in the same separation of responsibilities that gen_* uses.
If we squint a little bit, actors actually look a lot like regular objects with a couple of specific restrictions: all method calls are automatically synchronized at the call/return boundaries (in Go, this is handled explicitly by the channel boundaries instead), no shared memory is allowed, and data fields are always private. I'm sure this wouldn't pass a formal description, but this seems like a pragmatically useful form.
I agree that Go is less actor-oriented than Erlang/Elixir, but given how often I've seen that pattern you used in the Supervisor (and it's one I have also naturally used when writing Go) I'd argue that "Actor" is a major Go design pattern, even if it doesn't go by that name. The difference, then, is how often one pulls out the design pattern. I think the FP aspect pushes Erlang/Elixir in that direction more, as this "Actor" pattern has a second function there -- providing mutable state -- that Go allows more freely.
This discussion has really made me think, thanks. I think you're right that actor-like features are valuable and that the Actor Model in the everything-is-an-actor sense is not itself the value (or even a positive).
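(note: a minimal Go sketch of the client-API / server-goroutine pattern described above; the Counter type and its methods are hypothetical illustrations, not taken from suture. The state lives only in the serving goroutine, roughly the way a gen_server keeps its state in its process loop.)

package main

import "fmt"

// Counter is an "actor-like" object: its state lives in a goroutine that
// loops over channel receives, and the methods are a client API that sends
// messages to that inner server loop.
type Counter struct {
	incr chan int
	get  chan chan int
}

func NewCounter() *Counter {
	c := &Counter{incr: make(chan int), get: make(chan chan int)}
	go c.serve() // the "server" side, analogous to a gen_server loop
	return c
}

func (c *Counter) serve() {
	n := 0 // state is confined to this goroutine; no locks needed
	for {
		select {
		case d := <-c.incr:
			n += d
		case reply := <-c.get:
			reply <- n
		}
	}
}

// Client API: callers never touch the state directly.
func (c *Counter) Add(d int) { c.incr <- d }

func (c *Counter) Value() int {
	reply := make(chan int)
	c.get <- reply
	return <-reply
}

func main() {
	c := NewCounter()
	c.Add(3)
	c.Add(4)
	fmt.Println(c.Value()) // prints 7
}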
---
skybrian 32 days ago [-]
It's kind of weird to say m:n threading is terrible when it seems to work okay in Go.
It wasn't the right choice for Rust, but that doesn't mean Go's choice was wrong.
Rusky 32 days ago [-]
Go solved the main downside the article cites with M:N threading- it replaced split stacks with growable stacks, via a precise garbage collector.
kibwen 32 days ago [-]
That solved the hot split problem, but that was more of a once-in-a-while gotcha. The main downside for Rust specifically was the complications that arise when doing any sort of C interop in the presence of growable stacks. This mattered more for Rust than Go, since Rust is relatively more intended to freely intermingle with C code (and other languages, via C FFI), whereas the happy path for Go tends to involve exclusively Go code in a given process (even going so far as to reimplement their own libc in Go), and use of cgo is discouraged by the community (https://dave.cheney.net/2016/01/18/cgo-is-not-go).
---
[4] is an interesting post on some details of the Windows kernel architecture and its scheduler. My takeaways for Oot:
---
" Web Workers do not share mutable data between them, instead relying on message-passing for communication. In fact, Chrome allocates a new V8 engine for each of them (called isolates). Isolates share neither compiled code nor JavaScript? objects, and thus they cannot share mutable data like pthreads.
WebAssembly? threads, on the other hand, are threads that can share the same Wasm memory. The underlying storage of the shared memory is accomplished with a SharedArrayBuffer?, a JavaScript? primitive that allows sharing a single ArrayBuffer?'s contents concurrently between workers. Each WebAssembly? thread runs in a Web Worker, but their shared Wasm memory allows them to work much like they do on native platforms. This means that the applications that use Wasm threads are responsible for managing access to the shared memory as in any traditional threaded application. There are many existing code libraries written in C or C++ that use pthreads, and those can be compiled to Wasm and run in true threaded mode, allowing more cores to work on the same data simultaneously. "
"threads are essentially atomic memory accesses and wait/notify, nothing for starting a worker inside of WASM or anything, that is still in JS land. "
---
more on network and IPC message sizes:
" 6.4.1 Network Message Size The network uses flits with a payload size of 58 bits, when including the header this gives a total flit size of 72 bits (this size was chosen as it matches the size of flits sent by the serial link). Whilst Mamba and MIPS64 use an identical network the size of messages sent differs between them. In Mamba all messages sent are 134 bits whilst in MIPS64 messages come in 4 separate categories each with a different size. These are given in table 6.2. It should be noted that no particular effort was made to optimise the messages sizes. Though a request in Mamba will always require more information than a request in MIPS64 as in Mamba any request has a response address referring to a particular word in memory, whilst in MIPS64 the request merely needs to state which node and context the response should be sent back to.
Message type              Size in bits
MIPS64 Read Request       44
MIPS64 Write Request      95
MIPS64 Read/CAS Response  67
MIPS64 CAS Request        170
Mamba All Messages        134
Table 6.2: Network message sizes "
so here we see numbers around 64 bits, and numbers in between 128 and 256 bits.
this suggests that we should have message sizes of 64 bits or 256 bits
but i still like the idea of a 16bit payload and then 64bit packets (16 bits for each of: destination address, Port (or maybe source address), message type, payload)
This suggests that we have 3 different message types (small, medium, large: 16, 64, ?? bit payloads; for 'large' (??) we could have 64kbits, but 8k bytes is rather large, so perhaps make that 4k or 1k bytes; or could just do 16,64,256 bits and let 1k bytes etc (which is what we had before) be higher level)
obviously 16-bit payloads are useful b/c Oot is built around 16-bit values, and 64 bits are useful because they are the largest standardish value size in contemporary programming.
Note: ed25519 signatures are 512 bits, although their public keys are 256 bits
note: Cache lines are 64 bytes
since we are talking about 8k bytes etc, i think of 8k l1 caches, and of 4k page size. From stackoverflow, why 4k page size: "4 KiB is widely popular page granularity in other architectures too. One could argue that this size comes from the division of a 32-bit virtual address into two 10-bit indexes in page directories/tables and the remaining 12 bits give the 4 KiB page size"
note: Max safe udp payload is 508 bytes
So what about: 256 bytes = 2048 bits for the large msg size payload
So we could have 4 types of messages, with payloads of size in bits: 0,16,64,2048 (0-sized payloads are signals)
the 16, 64 message types could be fixed size. The 0 message type (signal) has no payload at all. The 2048 message type could be variable size (max of 2048 bits = 256 bytes).
This has the advantages of having:
Note that 256 byte max is a 16x reduction over my previous proposal (4k)
how many types of signals should we have? i kind of want a rich choice of signal types to allow them to be used for application communication. so maybe 256? That has the benefit that one byte can just hold the signal type. But, that means that to store incoming signals for each process, we need 256 bits (32 bytes), which seems excessive (since we are trying to have zillions of processes here). Also the signal handler table would get long (at least 256 entries times at least 16-bit pointers = at least 512 bytes). So how about 64 signal types? That lets us store incoming signals for each process in 8 bytes (plus we have an 8 byte signal mask), which seems reasonable and has the additional benefit of fitting within one fundamental value on modern processors (so most implementations will be able to change them atomically, not that that matters much since we'll probably only need to change one bit at a time, which can be done atomically anyway by changing one byte). The signal handler table is now at least 128 bytes, which is still pretty large but is more manageable. Also i don't know why we'd need to, but we can send the signal state using the payload of just one of our 64-bit fixed size messages. Hmm, feels like that could have some benefit...
Alternately, if we wanted non-rich signals, we could just have 16 of them. That would have the benefit that the signal handler table is reasonably small (32 bytes).
We could mix-and-match a little; have 64 signal types but allow only 16 distinct signal handlers (so, since we have 64 signal types, the last signal handler handles the 16th signal type as well as the other 48). Hmm, i like that. Reserve a few of the signal types for the implementation, of course. At least 4 (out of these, maybe 1 of them is one of the special 16 signals that have per-process handler pointers).
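(note: a minimal Go sketch of the 64-signal-type / 16-handler idea above; every name and number here is just the hypothetical design being discussed, not an existing API. The pending set is one 64-bit word per process, and a bit is set with an atomic read-modify-write.)

package main

import (
	"fmt"
	"sync/atomic"
)

const (
	numSignalTypes = 64 // one bit per type, so the pending set fits in a uint64 (8 bytes)
	numHandlers    = 16 // handlers 0..14 serve signals 0..14; handler 15 serves signals 15..63
)

type process struct {
	pending  uint64                      // incoming-signal bitmap, 8 bytes per process
	mask     uint64                      // blocked-signal bitmap (unused in this sketch)
	handlers [numHandlers]func(sig uint) // 16-entry handler table
}

// raise marks signal sig as pending using an atomic compare-and-swap loop.
func (p *process) raise(sig uint) {
	for {
		old := atomic.LoadUint64(&p.pending)
		if atomic.CompareAndSwapUint64(&p.pending, old, old|1<<sig) {
			return
		}
	}
}

// dispatch grabs and clears the pending set, then runs one handler per pending signal.
func (p *process) dispatch() {
	ready := atomic.SwapUint64(&p.pending, 0)
	for sig := uint(0); sig < numSignalTypes; sig++ {
		if ready&(1<<sig) == 0 {
			continue
		}
		h := sig
		if h >= numHandlers {
			h = numHandlers - 1 // signals 15..63 all fall through to the last handler
		}
		if f := p.handlers[h]; f != nil {
			f(sig)
		}
	}
}

func main() {
	p := &process{}
	p.handlers[numHandlers-1] = func(sig uint) { fmt.Println("got signal", sig) }
	p.raise(42)
	p.dispatch() // prints: got signal 42
}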
but wait: there was a good reason for our previous choice of 4k max message size: 4k * 16-bit words = 64k bits. Since we had 64k addressable nodes per 'chip', this is needed to allow a bitmap with one bit per node to be transmitted in a single message.
also, having 256 byte messages doesn't fit so well with our having 16-bit words (instead of 8-bit bytes).
so maybe bump the max message size back up to 4k.
also, if there isn't any API/semantic difference besides size between 16-bit and 64-bit fixed size messages, let's just pick one size: 64-bit
---
just a note: why am i obsessed with the quixotic 16-bit words and other 'keeping things small' limits when efficiency is a non-goal of Oot? It's because of the 'brain-like massively parallel' goal:
1) i feel like the mid-to-low level elements of the brain don't pass around digital values with high precision (although this could be wrong, when you consider the analog nature of waveform timing). Eg today's neural net processors see gains by using 8-bit or 16-bit floating point
2) if we are going to have zillions of CPUs with local memories on consumer-affordable hardware, we have to make each one very cheap: very simple processors and very small local memories
(why have fixed sizes at all? originally i wanted everything to be variable sized, but i realized that that adds to implementation complexity, which makes everything less likely to be widely ported)
one issue i have is: in order to keep things small, i've been limiting one 'chip' to 64k processors. But there are many more than 64k processors in the brain (even if a 'processor' is more than one neuron, i think the visual system has more than 256x256 pixels/resolution, although i'm not positive since the eye moves around a lot to collect data). So should we have more than 64k? Or should we regard one 'chip' more like one cortical column? Wikipedia says " Various estimates suggest there are 50 to 100 cortical minicolumns in a hypercolumn, each comprising around 80 neurons", so a "hypercolumn" would have up to 80*100=8,000; so it would be safe to round up to 64k so that we have some 'implementation margin'. Otoh individual neurons are probably way too low level for us; one thread is probably like a minicolumn, so this would give us only 100 threads per hypercolumn. Another complexity is that the brain probably doesn't have 'addressing precision' to each of the 100 billion or so neurons in the brain (between 2^36 and 2^37); so even if the brain has a bunch of neurons, its node addresses aren't that big; which suggests that a 16-bit address space is indeed big enough for our purposes (either b/c each chip is like only a small local network of a few nearby neurons; or because each node on the chip is like a huge subnetwork of which there are fewer than 64k distinct subnetworks in our brain).
---
so how many bits should each header field be in our 64-bit-payload message? before when i had a 16-bit payload, 16-bit header fields made sense. But if our chip size is only 64k nodes, then 64-bit header fields may not make sense. Otoh, the 'one issue i have is' discussion in the previous section makes me think that there may be a use for >16-bit addressing. And as for the type field, it's not like our tiny processes are going to have a table with 64k entries anyways for the 2^16 possible types. So let's say that we have 3 64-bit header fields, too. This makes the total packet 256 bits (so it still fits in one cache line, and easily within one safe UDP payload).
might want to split the 64-bit type into a 32-bit port and a 32-bit type.
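(note: a hypothetical Go sketch of that 256-bit packet: three 64-bit header fields plus a 64-bit payload, with the type field split into a 32-bit port and a 32-bit type; the field names are made up for illustration)

package main

import (
	"encoding/binary"
	"fmt"
)

// message is the 4 x 64-bit packet sketched above: 32 bytes total, i.e. half
// a 64-byte cache line and far below a 508-byte safe UDP payload.
type message struct {
	Dest    uint64 // destination node address (leaves room for >16-bit addressing)
	Source  uint64 // source address (or source address plus reply port)
	Port    uint32 // upper half of the old 64-bit type field
	Type    uint32 // lower half: message type
	Payload uint64 // 64-bit fixed-size payload
}

// encode packs the message into 32 bytes (256 bits), big-endian.
func (m message) encode() [32]byte {
	var b [32]byte
	binary.BigEndian.PutUint64(b[0:], m.Dest)
	binary.BigEndian.PutUint64(b[8:], m.Source)
	binary.BigEndian.PutUint32(b[16:], m.Port)
	binary.BigEndian.PutUint32(b[20:], m.Type)
	binary.BigEndian.PutUint64(b[24:], m.Payload)
	return b
}

func main() {
	m := message{Dest: 1, Source: 2, Port: 3, Type: 4, Payload: 0xdeadbeef}
	fmt.Printf("%x\n", m.encode())
}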
---
mb 0 bits, 1 bit, 1 byte, 2 bytes (16 bits), 8 bytes (64 bits), 64 bytes (cache line), 256 bytes (next power of 2 below safe/probably not fragmented udp payload size, also 4^4), 508 bytes (safe/probably not fragmented udp payload size), 4k (allocation block, draw row), or 64k (2^16) are the reasonable choices for small message sizes
---
" async/await in Rust produces a single value, a Future. Executing that future requires calling its poll method repeatedly. That can be extremely simple, or it can be more complex. We call "a thing that calls poll repeatedly" an executor, and they can be written without an OS or even the standard library, just fine.
The largest, most well known executor is Tokio, which is built on top of OS primitives. If you're writing your own OS, you'd use those primitives. "
---
ajross 2 days ago [-]
How did adding exceptions to the language make C++ into C++? RTTI? Multiple virtual inheritance? Default member function generation?
They didn't, not individually. They were popular and uncontroversial (rather less controversial than promise apis, even). And yet...
The point is that Rust seems to be charging blindly down the same road, not that any one feature is going to blow it all up. Frankly IMHO Rust is already harder to learn for a median programmer than C++ is, even if it makes more aesthetic sense to an expert.
---
infogulch 30 days ago [-]
I really really hope Go 2 can do something about `context`. Context is the biggest hidden wart of Go. We need the capabilities of context in different packaging.
atombender 30 days ago [-]
This doesn't seem to be a popular opinion, but I agree. It's such a pervasive functionality in concurrent programs that it really should be a built-in aspect of a goroutine.
The problem with context isn't necessarily the interface, it is that it is "viral". If you need context somewhere along a call chain, it infects more than just the place you need it — you almost always have to add it upwards (so the needed site gets the right context) and downwards (if you want to support cancellation/timeout, which is usually the point of introducing a context).
Context's virality also applies to backwards compatibility. There have been discussions of adding context to io.Reader and io.Writer, for example, but there's no elegant way to retrofit them without creating new interfaces that support a context argument. This problem applies to any API; you may not expect your API to require a context today, but it might need one tomorrow, which would require a breaking API change. Given that it's impossible to predict, you might want to pre-emptively add context as an argument to all public APIs, just to be safe. Not good design.
Cancellation/timeout is arguably so core to the language that it should be an implicit part of the runtime, just like goroutines are. It would be trivial for the runtime to associate a context with a goroutine, and have functions for getting the "current" context at any given time. (Erlang got this right, by allowing processes to be outright killed, but it's probably too late to redesign Go to allow that.)
(I'm ignoring the key/value system that comes with the Context interface, because I think it's less core. It certainly seems less used than the other mechanisms. For example, Kubernetes, one of the largest Go codebases, doesn't use it.)
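(note: a minimal Go sketch of the virality being described: only the innermost function actually needs cancellation, but every function between it and the caller has to grow a ctx parameter so the context can reach it; the function names are made up)

package main

import (
	"context"
	"fmt"
	"time"
)

// Only doWork needs cancellation, but handle and process must both accept
// and forward the context so that it can reach doWork.
func handle(ctx context.Context) error  { return process(ctx) }
func process(ctx context.Context) error { return doWork(ctx) }

func doWork(ctx context.Context) error {
	select {
	case <-time.After(50 * time.Millisecond): // stand-in for the real work
		return nil
	case <-ctx.Done(): // honor cancellation/timeout
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	fmt.Println(handle(ctx)) // prints: context deadline exceeded
}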
nathanaldensr 30 days ago [-]
Forgive me since I've never used Go, but this sounds similar to having to add the async keyword to methods in C# all the way up the stack once you attempt to use an async method, as well as having to pass CancellationTokens down the stack to support cancellation. I've noticed it pollutes code with a lot of ceremony that I wish had been added to the runtime itself. Is this what you're talking about?
infogulch 30 days ago [-]
Yeah, context is basically a cancellation token with the same downsides and a bunch of unrelated features piled on (because it came from a bunch of people needing to work around limitations getting together and deciding to put all their hacks in one place, but I digress). But from a certain perspective all functions in go are async by default, so we get to dodge that one.
atombender 30 days ago [-]
Absolutely. Contexts are similar to your CancellationToken; a context contains a channel you can listen to, just like CancellationToken's WaitHandle. They're slightly simpler in that I believe CancellationToken supports registering callbacks, which contexts don't.
Go doesn't actually have async support in the sense of promises/futures (as seen in C#, JavaScript, Rust, etc.). The entire language is built around the idea that I/O is synchronous, and concurrency is achieved by spawning more goroutines. So if you have two things that you want to run simultaneously, you spawn two goroutines and wait for them. (Internally, the Go runtime uses asynchronous I/O to achieve concurrency.)
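(note: the "spawn two goroutines and wait for them" idea above, as a minimal sketch using sync.WaitGroup)

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	results := make([]string, 2)

	wg.Add(2)
	go func() { defer wg.Done(); results[0] = "first result" }()  // each goroutine writes to its own slot,
	go func() { defer wg.Done(); results[1] = "second result" }() // so no extra synchronization is needed

	wg.Wait() // block until both goroutines are done
	fmt.Println(results)
}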
Groxx 30 days ago [-]
>Go doesn't actually have async support
I usually phrase this part of Go as: no async/await, it only has threads. But no thread handles. Everything is potentially parallel under the hood and all coordination requires external constructs like WaitGroups / chans / etc.
async/await has major complications like changing call syntax through the whole chain, so I actually prefer it this way. the lack of thread handle objects (e.g. a return value from `go func()`) is strange IMO tho.
int_19h 30 days ago [-]
The main advantage of the async/await model is that it's just syntactic sugar on top of CPS, so you can bolt it onto any language that is capable of handling callbacks (even C!). For a good example of that, consider WinRT - you can write an async method there in C#, the task that it returns can pass through a bunch of C++ frames, and land up in JS code that can then await it - and all that is handled via a common ABI that is ultimately defined in C terms.
Conversely, goroutines require Go stack to be rather different from native stack, which complicates FFI.
Groxx 30 days ago [-]
it's true that it's "just syntactic sugar", but in most languages it has call-site contracts that are either part of the signature (`await x()` to unpack the future/coroutine) or implicit (`x()` in an event loop host). and to change something deep in the stack means changing every call site everywhere.
that's a huge burden on a library (and thus the entire language ecosystem). it splits the world.
to avoid that, you basically need language-level support, so everything is awaitable... at which point you're at the same place as threads (but now they're green), where you cannot know if your callees change their behavior. which is both a blessing and a curse.
---
tl;dr yes but no. do you want your callee's parallelism to be invisible? you can't have both. (afaik. I'd love to see a counter-example if you know of one. there was one very-special-case language that made parallelism a thing you did to code rather than the code doing, but I can't find it at the moment. it only worked on image convolutions at the time.)
int_19h 29 days ago [-]
Right. You can have callee's parallelism be invisible - but only by adding it to your language (can't be done as a library) and making it be similarly invisibly parallel. And even that only works so long as everybody adopts the same system - goroutines don't play well with Ruby fibers, for example.
With the async/await model, you can immediately use it with any language that has callbacks, and then the languages can gradually add async/await on their own schedules.
I would dare say that the async/await model has proven far more successful in practice. It came to the table later than green threads etc (if you consider syntactic sugar a part of it - CPS itself was around for much longer, of course). And yet it was picked up surprisingly fast - and I think the way in which you can bolt it onto the existing language is precisely why. Conversely, every language and VM that has some form of green threads, seems to insist on doing their own that aren't compatible with anything else out there - and then you get insular ecosystems and FFI hell.
Maybe if some OS offered green threads as a primitive, it would have been different. But then again, Win32 has had fibers since mid-90s, and nobody picked that up.
Groxx 29 days ago [-]
thread interop may be a major contributor to "far more successful in practice", because yea - I agree, it's far more common. I don't know that side of things all that well :
having spent a fairly significant amount of time in an event-loop system with async/await tho (python coroutines): I don't know if that's a good thing. getting your head around "thou must never block the event loop, lest ye unceremoniously bring the system to its knees" / never using locks / never confusing your loops / etc requires both significant education and significant care, and when you get it wrong or performance degrades it can be truly horrific to debug.[1] it's nice that it tends to have fewer data races though.
green thread systems though are trivial - your stack looks normal, your tracing tools look normal, your strategies are basically identical (since they tend to context switch at relatively fine-grained points, so your only real concern is heavy func-call-less computation, which is very rare and easily identified). since I don't have to deal with thread interop[2] I'll take that every single time over async/await.
---
[1] I've helped teams which had already spent weeks or months failing to make progress, only to discover what would be an obvious "oops" or compile-time error elsewhere. some languages do this much better, from what I've seen, but CPS javascript and python coroutines and other ones I've used have been awful experiences. basically, again, language-level support is needed, so I broadly still disagree on "just syntactic sugar" for it to be even remotely acceptable.
[2] ....though cgo has been a nightmare. nearly everyone uses it wrong. so I 100% believe that I could switch sides on this in time :)
Groxx 30 days ago [-]
Halide! That's the one I was thinking of: https://www.youtube.com/watch?v=3uiEyEKji0M
I'd love to find out if this could be a more generally-applicable technique, but it's fascinating regardless.
dolmen 30 days ago [-]
I have written the 'contextio' package that may be useful to you. It provides io.Reader and io.Writer wrappers that handle context cancellation. This allows transparently adding optional context cancellation awareness to routines that work with I/O without injecting context as an argument.
Doc: https://godoc.org/github.com/dolmen-go/contextio Example: https://godoc.org/github.com/dolmen-go/contextio#example-pac...
atombender 30 days ago [-]
Cool, thanks!
smarterclayton 30 days ago [-]
We do use it but mostly in middleware.
However, the reason we don’t use that more is partially because of the viral nature of context - it came late in kube lifecycle, so we didn’t ensure it everywhere and now it’s a lot harder to wire (clients have been iterated on for a while).
I have a closed PR from 2015 to kube that added per request ID tracking that we closed as “wait until we have context everywhere” and we’re still waiting.
riwsky 30 days ago [-]
I agree heartily that context is viral in APIs, but would argue that that's essential to the nature of context. Accordingly: implicit association of context with a goroutine would introduce a complementary API virality issue: you now need to worry about whether anything above or below you starts to delegate their work to separate goroutines.
atombender 30 days ago [-]
Not sure what you mean here. How does an implicit context change any semantics? You would still be able to override which context is given to goroutines you spawn. As a developer, you'd have to be aware of the implicitness, that's the only difference.
riwsky 30 days ago [-]
I agree it doesn't change the semantics, and that you can express the same set of programs (given that you still let people explicitly handle contexts, send them over channels, etc). I just mean to highlight that removing context-usage information from go function signatures does not mean that usage or non-usage of context isn't part of a function's API—it's just now global state, and needs to consider its interactions with everything else in the same routine (instead of everything lexically in scope).
infogulch 30 days ago [-]
Because I think discussing actual solutions is better than just complaining, this is my current favorite design document to address context:
Go Context Scoping [1] by Eyal Posener. It even has a working PoC implementation [2] which is pretty ergonomic and could become even moreso with language integration.
I think this concept solves most of the problems we currently face with context. As a consequence of making context available per goroutine it basically becomes an implementation of goroutine-local storage. But this is more because context.WithValues exists in the first place than because of this proposal. In fact context has effectively become the de-facto GLS anyway, except it makes everyone's code ugly to do it.
[1]: https://posener.github.io/context-scoping/
[2]: https://github.com/posener/context
interesthrow2 30 days ago [-]
Not making Go routines values like Ada Tasks is what led to awkward solutions shouldered by the developer such as "context". Too many times Go developers were told to solve issues in user-land; this is the consequence of that.
Teckla 30 days ago [-]
Having to manually pass context through so many functions in code bases is definitely not ideal.
infogulch 30 days ago [-]
It doubles the surface area of every library that deals with anything related to IO, and forces middle libraries that don't and shouldn't care about context to double their surface area just to support connecting their consumers to their upstream providers. "not ideal" is an understatement.
dolmen 30 days ago [-]
My contextio package might be useful. Check example: https://godoc.org/github.com/dolmen-go/contextio#example-package--Copy
kochthesecond 30 days ago [-]
Coming from languages that lack such a «convention», I quite like Context. Trying to implement something similar in java or even node is also a giant pain.
Rapzid 30 days ago [-]
It's not the fastest thing in the world, but task-local storage in C# is pretty great. Miss it much in TypeScript.
---
Ravenscar concurrency profile
" In Ada, creating tasks, synchronizing them, sharing access to resources, are part of the language...For real-time and embedded applications, Ada defines a profile called Ravenscar. It's a subset of the language designed to help schedulability analysis, it is also more compatible with platforms such as micro-controllers that have limited resources....One of the advantages of having tasking as part of the language standard is the portability, you can run the same Ravenscar application on Windows, Linux, MacOs? or an RTOS like VxWorks?. GNAT also provides a small stand alone run-time that implements the Ravenscar tasking on bare metal. This run-time is available, for instance, on ARM Cortex-M micro-controllers. It's like having an RTOS in your language.
...
Tasks
you can declare and implement a single task:
-- Task declaration
task My_Task;

-- Task implementation
task body My_Task is
begin
   -- Do something cool here...
end My_Task;

If you have multiple tasks doing the same job or if you are writing a library, you can define a task type ... One limitation of Ravenscar compared to full Ada, is that the number of tasks has to be known at compile time.
Time
...
a definition of the Time type which represents the time elapsed since the start of the system
a definition of the Time_Span type which represents a period between two Time values
a function Clock that returns the current time (monotonic count since the start of the system)
Various sub-programs to manipulate Time and Time_Span values
The Ada language also provides an instruction to suspend a task until a given point in time: delay until.
...
Scheduling
Ravenscar has priority-based preemptive scheduling. A priority is assigned to each task and the scheduler will make sure that the highest priority task - among the ready tasks - is executing.
A task can be preempted if another task of higher priority is released, either by an external event (interrupt) or at the expiration of its delay until statement (as seen above).
If two tasks have the same priority, they will be executed in the order they were released (FIFO within priorities).
Task priorities are static, however we will see below that a task can have its priority temporary escalated.
The task priority is an integer value between 1 and 256, higher value means higher priority. It is specified with the Priority aspect:
task My_Low_Priority_Task
   with Priority => 1;

task My_High_Priority_Task
   with Priority => 2;

Mutual exclusion and shared resources
In Ada, mutual exclusion is provided by the protected objects.
At run-time, the protected objects provide the following properties:
There can be only one task executing a protected operation at a given time (mutual exclusion)
There can be no deadlock
In the Ravenscar profile, this is achieved with Priority Ceiling Protocol.
A priority is assigned to each protected object, any tasks calling a protected sub-program must have a priority below or equal to the priority of the protected object.
When a task calls a protected sub-program, its priority will be temporarily raised to the priority of the protected object. As a result, this task cannot be preempted by any of the other tasks that potentially use this protected object, and therefore the mutual exclusion is ensured.
The Priority Ceiling Protocol also provides a solution to the classic scheduling problem of priority inversion.
...
Synchronization
Another cool feature of protected objects is the synchronization between tasks.
It is done with a different kind of operation called an entry.
An entry has the same properties as a protected procedure except it will only be executed if a given condition is true. A task calling an entry will be suspended until the condition is true.
This feature can be used to synchronize tasks.
...
Interrupt Handling
Protected objects are also used for interrupt handling. Private procedures of a protected object can be attached to an interrupt using the Attach_Handler aspect.
" --- There's a mini-RTOS in my language by Fabien Chouteau
---
parallel reduce (fold) is really about the associativity of the operation;
((((a+b)+c)+d)+e) = (a+(b+c))+(d+e) which is why you can start by doing b+c at the same time as d+e (balancing the expression tree somewhat so that there are as many 'leaf nodes' as possible at each step, since each 'leaf node' can be executed in parallel)
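(note: a minimal Go sketch of this: since + is associative, the two halves of the slice can be folded concurrently and the partial results combined afterwards)

package main

import "fmt"

// sum folds the slice sequentially.
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

// parallelSum relies only on associativity: fold each half in its own
// goroutine (here, one extra goroutine), then combine the partial sums.
func parallelSum(xs []int) int {
	mid := len(xs) / 2
	left := make(chan int)
	go func() { left <- sum(xs[:mid]) }()
	right := sum(xs[mid:])
	return <-left + right
}

func main() {
	xs := []int{1, 2, 3, 4, 5}
	fmt.Println(parallelSum(xs)) // prints 15, same as ((((1+2)+3)+4)+5)
}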
---
https://preshing.com/20150402/you-can-do-any-kind-of-atomic-read-modify-write-operation/
" A novice programmer might look at the above list of functions and ask, “Why does C++11 offer so few RMW operations? Why is there an atomic fetch_add, but no atomic fetch_multiply, no fetch_divide and no fetch_shift_left?” There are two reasons:
Because there is very little need for those RMW operations in practice. Try not to get the wrong impression of how RMWs are used. You can’t write safe multithreaded code by taking a single-threaded algorithm and turning each step into an RMW.
Because if you do need those operations, you can easily implement them yourself. As the title says, you can do any kind of RMW operation!...
Out of all the available RMW operations in C++11, the only one that is absolutely essential is compare_exchange_weak. Every other RMW operation can be implemented using that one. It takes a minimum of two arguments: "
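(note: the same trick sketched in Go, whose sync/atomic package likewise exposes only a small set of RMW operations: a fetch-and-multiply (not a real API, built here for illustration) made from nothing but a compare-and-swap retry loop)

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fetchMultiply atomically performs *addr *= factor and returns the old value,
// using only a compare-and-swap retry loop.
func fetchMultiply(addr *int64, factor int64) int64 {
	for {
		old := atomic.LoadInt64(addr)
		if atomic.CompareAndSwapInt64(addr, old, old*factor) {
			return old // CAS succeeded: no one else touched *addr in between
		}
		// CAS failed because another goroutine changed *addr; retry with a fresh value.
	}
}

func main() {
	var x int64 = 1
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); fetchMultiply(&x, 2) }()
	}
	wg.Wait()
	fmt.Println(atomic.LoadInt64(&x)) // prints 1024
}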
---
wahern 4 hours ago [-]
> Co-routines are very useful and likely underused, but sometimes you are actually better off being able to pass the control to a given thread directly, other than having a scheduler involved.
That's almost the very definition of a coroutine--explicit transfer of control. In symmetric coroutines you must specify a coroutine for both yield and resume; in asymmetric coroutines you specify what to resume to but yield implicitly returns to whatever resumed the current coroutine. In either case the actual control flow transfer is explicitly invoked.
The term thread is more ambiguous, but it almost always implies control transfers--both the timing and target of control transfer--are implicit and not directly exposed to application logic. (Automagic control transfer might be hidden within commonly used functions (e.g. read and write), injected by the compiler (Go does this), or triggered by hardware.)
You can synthesize a threading framework with both asymmetric and symmetric stackful[1] coroutines by simply overloading the resume and yield operations to transfer control to a scheduler, and then hiding implicit resume/yield points within commonly used functions or by machine translation of the code. In languages where "yield" and "resume" are exposed as regular functions this is especially trivial. Stackful coroutines (as opposed to stackless, which are the most commonly provided type of coroutine) are a powerful enough primitive that building threads is relatively trivial, which is why the concepts are easy to conflate, but they shouldn't be confused.
LISP-y languages blur some of these distinctions as libraries can easily rewrite code; they can inject implicit control transfer and stack management in unobtrusive ways.[2] This isn't possible to the same extent in languages like C, C++, or Rust; lacking a proper control flow primitive (i.e. stackful coroutine) their "threading" frameworks[3] are both syntactically and semantically leaky.
[1] By definition a thread preserves stack state--recursive function state--and this usually implies that stack management occurs at a very low-level in the execution environment, but in any case largely hidden from the logical application code.
[2] OTOH, this is usually inefficient--stack management is a very performance critical aspect of the runtime. For example, Guile, a Scheme implementation, now provides a stackful coroutine primitive. For a good discussion of some of these issues, see https://wingolog.org/archives/2017/06/27/growing-fibers
[3] Specifically the frameworks that attempt to make asynchronous I/O network programming simple and efficient. So-called native threads are a different matter as both stack management and control transfer are largely implemented outside the purview of those languages, very much like how native processes are implemented. If you go back far enough in the literature, especially before virtual memory, the distinctions between process and thread fall away. Nowadays threads are differentiated from processes by sharing the same memory/object space.