don't know where to put this but it talks about Clos networks, crossbars, and butterfly networks, topics I had some lost notes on years ago. I'm not so much interested in this post itself; it's just a reminder to look up those terms again someday:
brandmeyer on Feb 20, 2019
I went on a bit of a research expedition to see if there was something that scaled better than a general permutation instruction for SIMD machines. General permute scales at O(log(N)) in gate delay+ and O(N^2 * log(N)) in area, where N is the vector length. It's a full crossbar, but the fanout on the wires adds an additional log(N) factor in buffers.
For a while, it seemed like a set of instructions based on a rank-2 Clos network (aka butterfly network) would get the job done. It scales at O(log(N)) in gate delay+ and O(N * log(N)) in area, and is very capable. Fanout is O(1). You can do all kinds of inter- and intra-lane swaps, rotations, and shifts with it. You can even do things like expand packed 16-bit RGB into 32-bit uniform RGBA.
But things like that SMH algorithm are definitely out of scope: each input bit can appear in at most one output location with the butterfly. So the cost to replicate a prefix scales at O(repetitions), which is unfortunate. Some algorithms based on general shuffle also rely on PSHUFB's functionality as a complete lookup table, which the butterfly network can't do either.
My conclusion was that you're basically stuck with a general permute instruction for a modern SIMD ISA, scaling be damned.
+ The latency scaling is somewhat misleading thanks to wire delay - they are both O(N) in wire delay.
-- https://news.ycombinator.com/item?id=19210123
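my own sketch (Rust, not from the comment) of the butterfly routing being described: log2(N) stages of 2x2 switches between lanes whose indices differ in one bit, with hypothetical per-switch control bits. It also shows why each input lands in at most one output lane, so prefix replication and PSHUFB-style table lookups are out of reach:

// An 8-lane butterfly: each of the log2(8) = 3 stages conditionally swaps
// lanes whose indices differ in exactly one bit.
fn butterfly(mut lanes: [u32; 8], controls: [[bool; 4]; 3]) -> [u32; 8] {
    for stage in 0..3 {
        let stride = 1usize << stage; // lanes i and i + stride form one 2x2 switch
        let mut pair = 0;
        for i in 0..8 {
            if i & stride == 0 {
                if controls[stage][pair] {
                    lanes.swap(i, i + stride); // "cross" setting; otherwise pass through
                }
                pair += 1;
            }
        }
    }
    lanes // every input reached exactly one output; nothing was duplicated
}

fn main() {
    // With every control bit set, the three stages reverse the lane order.
    let out = butterfly([0, 1, 2, 3, 4, 5, 6, 7], [[true; 4]; 3]);
    println!("{:?}", out); // [7, 6, 5, 4, 3, 2, 1, 0]
}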
dragontamer on Feb 20, 2019
GPU coders haven't changed their code in the last 10 years, even as NVidia changed its architecture repeatedly.
PTX assembly from NVidia still runs on today's architectures. I think this variable-length issue they focus on so much is a bit of a red herring: NVidia always was 32-way SIMD, but the PTX code remains portable nonetheless.
The power is that PTX assembly (and AMD's GCN assembly) has a scalar model of programming, but its execution is vectorized. So you write scalar code, but the programmer knows (and assumes) it to be in a parallel context. EDIT: I guess PTX is technically interpreted: the number of registers is not fixed, etc. etc. Nonetheless, the general "SIMD-ness" of PTX is static, and has survived a decade of hardware changes.
There are a few primitives needed for this to work: OpenCL's "Global Index" and "Local Index", for example. "Global Index" is where you are in the overall workstream, while "Local Index" is useful because intra-workgroup communications are VERY VERY FAST.
And... that's about it? Really. I guess there are a bunch of primitives (the workgroup swizzle operations, "ballot", barrier, etc. etc.), but the general GPU model is actually kinda simple.
I see a lot of these CPU architecture changes, but none of them really seem to be trying to learn from NVidia's or AMD's model. A bit of PTX assembly or GCN assembly would probably do the next generation of CPU architects some good.
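my toy CPU-side illustration (plain Rust, not real PTX/OpenCL API) of the global-index/local-index model mentioned above: the "kernel body" is scalar code indexed by its global id, and the loops stand in for the parallel dispatch:

fn main() {
    let input: Vec<f32> = (0..32).map(|i| i as f32).collect();
    let mut output = vec![0.0f32; input.len()];

    let workgroup_size = 8;
    for group_id in 0..input.len() / workgroup_size {
        for local_id in 0..workgroup_size {
            // Same relationship OpenCL exposes: global id = group id * group size + local id.
            let global_id = group_id * workgroup_size + local_id;
            // "Kernel body": written as scalar code, one invocation per work-item.
            // local_id is what you'd use for fast intra-workgroup communication.
            output[global_id] = input[global_id] * 2.0;
        }
    }
    assert_eq!(output[5], 10.0);
}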
Const-me on Feb 21, 2019
The GPU programming model is simple because it's limited. Low single-threaded performance, high latency masked by thread switching, inefficient branching, limited memory model (can't allocate RAM, write access is very limited).
If you're happy with these limitations, write OpenCL code and run it on a CPU. It will work much faster than scalar code, but likely slower than a GPU would.
the rest of this subthread goes very deep but is interesting:
https://news.ycombinator.com/item?id=19212817
---
friendlysock 4 days ago
Digging into IPC a bit, I feel like Windows actually had some good stuff to say on the matter.
I think the design space looks something like:
Messages vs streams (here is a cat picture vs here is a continuing generated sequence of cat pictures)
Broadcast messages vs narrowcast messages (notify another app vs notify all apps)
Known format vs unknown pile of bytes (the blob i’m giving you is an image/png versus lol i dunno here’s the size of the bytes and the blob, good luck!)
Cancellable/TTL vs not (if this message is not handled by this time, don’t deliver it)
Small messages versus big messages (here is a thumbnail of a cat versus the digitized CAT scan of a cat)
I’m sure there are other axes, but that’s maybe a starting point. Also, fuck POSIX signals. Not in my OS.
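a rough sketch (mine) of those axes as Rust types, just to pin the design space down; every name is made up for illustration and not any real OS IPC API:

#![allow(dead_code)]
use std::time::Instant;

enum Payload {
    Message(Vec<u8>),                            // "here is a cat picture"
    Stream(std::sync::mpsc::Receiver<Vec<u8>>),  // "a continuing sequence of cat pictures"
}

enum Audience {
    Narrowcast { target_app: String },  // notify another app
    Broadcast,                          // notify all apps
}

enum Format {
    Known { mime: String },  // "this blob is an image/png"
    UnknownBytes,            // "lol i dunno, here's the size and the blob, good luck"
}

struct IpcEnvelope {
    payload: Payload,
    audience: Audience,
    format: Format,
    deadline: Option<Instant>,  // cancellable/TTL: don't deliver after this time
    size_hint: usize,           // thumbnail of a cat vs. the CAT scan of a cat
}

fn main() {
    let msg = IpcEnvelope {
        payload: Payload::Message(vec![0xCA, 0x7F]),
        audience: Audience::Broadcast,
        format: Format::Known { mime: "image/png".into() },
        deadline: None,
        size_hint: 2,
    };
    let _ = msg; // just a type sketch; no delivery mechanism here
}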
---
https://github.com/StephenCleary/AsyncEx
--- ruby ractor
https://bugs.ruby-lang.org/issues/17100
---
ruby ractors
"
Ractor (experimental)
Ractor is an Actor-model like concurrent abstraction designed to provide a parallel execution feature without thread-safety concerns.
You can make multiple ractors and run them in parallel. Ractor enables you to make thread-safe parallel programs because ractors cannot share normal objects. Communication between ractors is supported by message passing.
To limit sharing of objects, Ractor introduces several restrictions to Ruby’s syntax (without multiple Ractors, there is no restriction).
The specification and implementation are not mature and may change in the future, so this feature is marked as experimental and shows the “experimental feature” warning on the first Ractor.new.
The following small program calculates n.prime? (where n is a relatively big integer) in parallel with two ractors. You can confirm that on a parallel computer the program runs about two times faster than the sequential version.
require 'prime'
See doc/ractor.md for more details.
Fiber Scheduler
Fiber#scheduler is introduced for intercepting blocking operations. This allows for light-weight concurrency without changing existing code. Watch “Don’t Wait For Me, Scalable Concurrency for Ruby 3” for an overview of how it works.
Currently supported classes/methods:
Mutex#lock, Mutex#unlock, Mutex#sleep
ConditionVariable#wait
Queue#pop, SizedQueue#push
Thread#join
Kernel#sleep
Process.wait
IO#wait, IO#read, IO#write and related methods (e.g. #wait_readable, #gets, #puts and so on).
IO#select is not supported.
(Explain Async gem with links.) This example program will perform several HTTP requests concurrently:
(Explain this:)
async is an outer gem
async uses this new feature

require 'async'
require 'net/http'
require 'uri'

Async do
  ["ruby", "python", "c"].each do |topic|
    Async do
      Net::HTTP.get(URI "https://www.google.com/search?q=#{topic}")
    end
  end
end
" -- [1]
---
https://tokio.rs/tokio/tutorial/hello-tokio
---
https://en.wikipedia.org/wiki/Petri_net
---
https://github.com/jimblandy/context-switch
" Comparison of Rust async and Linux thread context switch time and memory use
These are a few programs that try to measure context switch time and task memory use in various ways. In summary:
A context switch takes around 0.2µs between async tasks, versus 1.7µs between kernel threads. But this advantage goes away if the context switch is due to I/O readiness: both converge to 1.7µs. The async advantage also goes away in our microbenchmark if the program is pinned to a single core. So inter-core communication is something to watch out for.
Creating a new task takes ~0.3µs for an async task, versus ~17µs for a new kernel thread.
Memory consumption per task (i.e. for a task that doesn't do much) starts at around a few hundred bytes for an async task, versus around 20KiB (9.5KiB user, 10KiB kernel) for a kernel thread. This is a minimum: more demanding tasks will naturally use more.
It's no problem to create 250,000 async tasks, but I was only able to get my laptop to run 80,000 threads (4 core, two way HT, 32GiB).
These are probably not the limiting factors in your application, but it's nice to know that the headroom is there. "
---
c-cube 29 hours ago
The classic “what color is your function” blog post describes what is, I think, such a pain? You have to choose in your API whether a function can block or not, and it doesn’t compose well.
kevinc 29 hours ago
I read that one, and I took their point. All this tends to make me wonder if Swift (roughly, Rust minus borrow checker plus Apple backing) is doing the right thing by working on async/await now.
But so far I don’t mind function coloring as I use it daily in TypeScript. In my experience, functions that need to be async tend to be the major steps of work. The incoming network request is async, the API call it makes is async, and then all subsequent parsing and page rendering aren’t async, but can be if I like.
Maybe, like another commenter said, whether async/await is a net positive has more to do with adapting the language to a domain that isn’t otherwise its strong suit.
kristoff 29 hours ago
You might be interested in knowing that Zig has async/await but there is no function coloring problem.
https://kristoff.it/blog/zig-colorblind-async-await/
kevinc edited 28 hours ago
Indeed this is an interesting difference at least in presentation. Usually, async/await provides sugar for an existing concurrency type like Promise or Task. It doesn’t provide the concurrency in the first place. Function colors are then a tradeoff for hiding the type, letting you think about the task and read it just like plain synchronous code. You retain the option to call without await, such that colors are not totally restrictive, and sometimes you want to use the type by hand; think Promise.all([…]).
Zig seems like it might provide all these same benefits by another method, but it’s hard to tell without trying it. I also can’t tell yet if the async frame type is sugared in by the call, or by the function definition. It seems like it’s a sort of generic, where the nature of the call will specialize it all the way down. If so, neat!
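the same point in Rust terms, as a sketch (uses the futures crate's block_on and join_all; the fetch function is made up): calling an async fn without .await hands you the future as a value you can combine by hand, roughly the Promise.all move:

use futures::executor::block_on;
use futures::future::join_all;

async fn fetch(topic: &str) -> String {
    format!("results for {topic}") // stand-in for a real async request
}

fn main() {
    block_on(async {
        // Sugared: reads like synchronous code.
        let one = fetch("ruby").await;

        // By hand: keep the un-awaited futures as values and join them explicitly.
        let pending = vec![fetch("python"), fetch("c")];
        let rest = join_all(pending).await;

        println!("{one} / {rest:?}");
    });
}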
kristoff 20 hours ago
> It seems like it’s a sort of generic, where the nature of the call will specialize it all the way down. If so, neat!
That’s precisely it!
...Part of the complexity of async/await in Zig is that a single library implementation can be used in both blocking and evented mode, so in the end it should never be the case that you can only find an async version of a client library, assuming authors are willing to do the work. Even if not, support can be added incrementally by contributors interested in having their use case supported.
---
I don’t know how async/await works exactly, but it definitely has a clear use case. ..
I’d even go further: given the performance of our machines (high latencies and high throughputs), I believe non-blocking I/O at every level is the only reasonable way forward. Not just for networking, but for disk I/O, filling graphics card buffers, everything. Language support for this is becoming as critical as generics themselves. We laughed “lol no generics” at Go, but now I do believe it is time to start laughing “lol no async I/O” as well. The problem now is to figure out how to do it. [8]
---
adaszko 31 hours ago
This is amazing. I had similar feelings (looking previously at JS/Scala futures) when the plans for async/await were floating around, but decided to suspend my disbelief because of how good previous design decisions in the language were. Do you think there’s some other approach to concurrency fit for a runtime-less language that would have worked better?
spacejam edited 30 hours ago
My belief is generally that threads as they exist today (not as they existed in 2001 when the C10K problem was written, though that view nevertheless keeps existing as zombie perf canon that no longer refers to living characteristics) are the nicest choice for the vast majority of use cases, and that Rust-style executor-backed tasks are inappropriate even in the rare cases where M:N pays off in languages like Go or Erlang (pretty much just a small subset of latency-bound load balancers that don’t perform very much CPU work per socket). When you start caring about millions of concurrent tasks, having all of the sources of accidental implicit state and interactions of async tasks is a massive liability.
I think the Ada Ravenscar profile (see chapter 2 for “motivation”, which starts at PDF page 7 / marked page 3) and its successful application to safety-critical hard real-time systems is worth looking at for inspiration. It can be broken down into this set of specific features if you want to dig deeper. Ada has a runtime, but I’m kind of ignoring that part of your question since it is suitable for hard real-time. In some ways it reminds me of an attempt to get the program to look like a pretty simple Petri net.
I think that message passing and STM are not utilized enough, and when used judiciously they can reduce a lot of risk in concurrent systems. STM can additionally be made wait-free and thus suitable for use in some hard real-time systems.
I think that Send and Sync are amazing primitives, and I only wish I could prove more properties at compile time. The research on session types is cool to look at, and you can get a lot of inspiration about how to encode various interactions safely in the type system from the papers coming out around this. But it can get cumbersome and thus create more risks to the overall engineering effort than it solves if you’re not careful.
A lot of the hard parts of concurrency become a bit easier when we’re able to establish maximum bounds on how concurrent we’re going to be. Threads have a bit more of a forcing function to keep this complexity minimized, because spawning is fallible thanks to often under-configured system thread limits. Having fixed concurrency avoids many sources of bugs and performance issues, and enables a lot of relatively unexplored wait-free algorithmic design space that gets bounded worst-case performance (while still usually being able to attempt a lock-free fast path and only falling back to wait-free when contention picks up). Structured concurrency often leans into this for getting more determinism, and I think this is an area with a lot of great techniques for containing risk.
In the end we just have code and data and risk. It’s best to have a language with forcing functions that pressure us to minimize all of these over time. Languages that let you forget about accruing data and code and risk tend to keep people very busy over time. Friction in some places can be a good thing if it encourages less code, less data, and less risk.
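a minimal sketch of the "fixed concurrency" idea in Rust (my code, nothing from the comment): a bounded pool of worker threads fed over a channel, so the maximum parallelism is chosen up front instead of growing with the workload:

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn main() {
    const WORKERS: usize = 4; // the bound, decided once at startup

    let (tx, rx) = mpsc::channel::<u64>();
    let rx = Arc::new(Mutex::new(rx)); // share the single receiver among the workers

    let handles: Vec<_> = (0..WORKERS)
        .map(|id| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Take the next job; when the channel closes, drain and stop.
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break,
                };
                println!("worker {id} handled job {job}");
            })
        })
        .collect();

    for job in 0..20 {
        tx.send(job).unwrap();
    }
    drop(tx); // closing the channel is the shutdown signal

    for h in handles {
        h.join().unwrap();
    }
}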
c-cube 29 hours ago
I like Rust and I like threads, and do indeed regret that most libraries have been switching to async-only. It’s a lot more complex and almost a new sub-language to learn.
That being said, I don’t see a better technical solution for Rust (i.e. no mandatory runtime, no implicit allocations, no compromise on performance) for people who want to manage millions of connections. Sadly a lot of language design is driven by the use case of giant internet companies in the cloud, and that’s a problem they have; not sure why anyone else cares. But if you want to do that, threads start getting in the way at 10k threads-ish? Maybe 100k if you tune Linux well, but even then the memory overhead and latency are not insignificant, whereas a future can be very tiny.
Ada’s tasks seem awesome, but to the best of my knowledge they’re for very limited concurrency (i.e. the number of tasks is small, or even fixed beforehand), so it’s not a solution to this particular problem.
Of course async/await in other languages with runtimes is just a bad choice. Python in particular could have gone with “goroutines” (for lack of a better word) like Stackless Python already had, and avoided a lot of complexity. (How do people still say Python is simple?!) At least Java’s Loom project is heading in the right direction.
---
ngrilly 24 hours ago
The green process abstraction seems to work well enough in Erlang to serve tens of thousands of concurrent connections. Why do you think the async/await abstraction won’t work for Rust? (I understand they are very different solutions to a similar problem.)
c-cube 16 hours ago
Not who you’re asking, but the reason why Rust can’t have green threads (as it used to have pre-1.0, when they were scrapped), as far as I understand:
Rust is shooting for C or C++-like levels of performance, with the ability to go pretty close to the metal (or close to whatever C does). This adds some constraints, such as the necessity to support some calling conventions (esp. for C interop), and precludes the use of a GC. I’m also pretty sure the overhead of the probes inserted in Erlang’s bytecode to check for reduction counts in recursive calls would contradict that (in Rust they’d also have to be in loops, btw); afaik that’s how Erlang implements its preemptive scheduling of processes. I think Go has split stacks (so that each goroutine takes less stack space) and some probes for preemption, but the costs are real and in particular the C FFI is slower as a result. (Saying that as a total non-expert on the topic.)
I don’t see why async/await wouldn’t work… since it does; the biggest issues are additional complexity (a very real problem), fragmentation (the ecosystem hasn’t converged yet on a common event loop), and the lack of real preemption which can sometimes cause unfairness. I think Tokio hit some problems on the unfairness side.
notriddle edited 1 hour ago
The biggest problem with green threads is literally C interop. If you have tiny call stacks, then whenever you call into C you have to make sure there’s enough stack space for it, because the C code you’re calling into doesn’t know how to grow your tiny stack. If you do a lot of C FFI, then you either lose the ability to use small stacks in practice (because every “green” thread winds up making an FFI call and growing its stack) or you implement some complex “stack switching” machinery (where you have a dedicated FFI stack that’s shared between multiple green threads).
Stack probes themselves aren’t that big of a deal. Rust already inserts them sometimes anyway, to avoid stack smashing attacks.
In both cases, you don’t really have zero-overhead C FFI any more, and Rust really wants zero-overhead FFI.
> I think Go has split stacks (so that each goroutine takes less stack space)
No they don’t any more. Split Stacks have some really annoying performance cliffs. They instead use movable stacks: when they run out of stack space, they copy it to a larger allocation, a lot like how Vec works, with all the nice “amortized linear” performance patterns that result.
andyc 19 hours ago
Two huge differences:
Erlang’s data structures are immutable (and it has much slower single-threaded speed).
Erlang doesn’t have threads like Rust does.
That changes everything with regard to concurrency, so you can’t really compare the two. A comparison to Python makes more sense, and Python async has many of the same problems (mutable state, and the need to compose with code and libraries written with other concurrency models).
---
https://github.com/jimblandy/context-switch/issues/1
(linked from https://lobste.rs/s/eppfav/why_i_rewrote_my_rust_keyboard_firmware#c_eibcnp )
---
https://news.ycombinator.com/item?id=26406989
in this discussion on a post about complaints with Rust async, some commenters, such as newpavlov, give what they think would have been a better solution. And the_duke says that "completion is the right choice for languages with a heavy runtime", which is what we are aiming for
newpavlov describes the completion model as:
" In a completion-based model (read io-uring, but I think IOCP behaves similarly, though I am less familiar with it) it's a runtime who "notifies" tasks about completed IO requests. In io-uring you have two queues represented by ring buffers shared with OS. You add submission queue entries (SQE) to the first buffer which describe what you want for OS to do, OS reads them, performs the requested job, and places completion queue events (CQEs) for completed requests into the second buffer.
So in this model a task (Future in your terminology) registers SQE (the registration process may be proxied via user-space runtime) and suspends itself. Let's assume for simplicity that only one SQE was registered for the task. After OS sends CQE for the request, runtime finds a correct state transition function (via meta-information embedded into SQE, which gets mirrored to the relevant CQE) and simply executes it, the requested data (if it was a read) will be already filled into a buffer which is part of the FSM state, so no need for additional syscalls or interactions with the runtime to read this data!
If you are familiar with embedded development, then it should sound quite familiar, since it's roughly how hardware interrupts work as well! You register a job (e.g. DMA transfer), dedicated hardware block does it, and notifies a registered callback after the job was done. Of course, it's quite an oversimplification, but fundamental similarity is there. "
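my attempt at a conceptual Rust sketch of that description (a toy model only; not the real io-uring ABI or any crate's API): the task parks a state machine keyed by the user_data it put in the SQE, and the completion handler just runs the matching state transition, with the buffer already filled:

#![allow(dead_code)]
use std::collections::HashMap;

struct Sqe { user_data: u64, op: Op }       // submission queue entry
struct Cqe { user_data: u64, result: i32 }  // completion queue event
enum Op { Read { buf_len: usize } }

// Per-task state machine: the buffer lives inside the task's state, so when
// the CQE arrives the data is already in place and no extra syscall is needed.
struct Task { buf: Vec<u8>, state: State }
enum State { Waiting, Done(i32) }

struct Runtime {
    tasks: HashMap<u64, Task>,  // keyed by the user_data echoed back in the CQE
    submissions: Vec<Sqe>,
}

impl Runtime {
    fn submit_read(&mut self, id: u64, buf_len: usize) {
        self.tasks.insert(id, Task { buf: vec![0; buf_len], state: State::Waiting });
        self.submissions.push(Sqe { user_data: id, op: Op::Read { buf_len } });
        // ...the task now suspends until its completion arrives...
    }

    fn on_completion(&mut self, cqe: Cqe) {
        // Find the task via the user_data mirrored from its SQE and run its
        // state transition; the "kernel" already filled task.buf.
        if let Some(task) = self.tasks.get_mut(&cqe.user_data) {
            task.state = State::Done(cqe.result);
        }
    }
}

fn main() {
    let mut rt = Runtime { tasks: HashMap::new(), submissions: Vec::new() };
    rt.submit_read(1, 4096);
    // Pretend the kernel did the read and posted a CQE for user_data 1.
    rt.on_completion(Cqe { user_data: 1, result: 4096 });
}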
---
" > IOCP models tend to heavily rely on callbacks and closures
While perhaps higher level libraries are written that way, I can’t think of a reason why the primitive components of IOCP require callbacks and closures. The “poll for io-readiness and then issue non-blocking IO” and “issue async IO and then poll for completion” models can be implemented in a reactor pattern in a similar manner. It is just a question of whether the system call happens before or after the reactor loop.
EDIT: Reading some of the other comments and thinking a bit, one annoying thing about IOCP is the cancelation model. With polling IO readiness, it is really easy to cancel IO and close a socket: just unregister from epoll and close it. With IOCP, you will have to cancel the in-flight operation and wait for the completion notification to come in before you can close a socket (if I understand correctly). "
---
jedisct1 18 hours ago
And Zap (scheduler for Zig) is already faster than Tokio.
Zig and other recent languages were invented after Rust and Go, so they could learn from them, while Rust had to experiment a lot in order to combine async with borrow checking.
So, yes, the async situation in Rust is very awkward, and doing something beyond a Ping server is more complicated than it could be. But that’s what it takes to be a pioneer.
---
speaking about issues with Rust's async/await:
" FWIW, I'd bet almost anything that this problem isn't solvable in any general way without linear types, at which point I bet it would be a somewhat easy modification to what Rust has already implemented. (Most of my development for a long time now has been in C++ using co_await with I/O completion and essentially all of the issues I run into--including the things analogous to "async Drop", which I would argue is actually the same problem as being able to drop a task itself--are solvable using linear types, and any other solutions feel like they would be one-off hacks.) Now, the problem is that the Rust people seem to be against linear types (and no one else is even considering them), so I'm pretty much resigned that I'm going to have to develop my own language at some point (and see no reason to go too deep into Rust in the mean time) :/. "
from this thread: https://news.ycombinator.com/item?id=26407507
---
more on how Rust didn't choose a completion-based future model because (a) it doesn't want allocations, and (b) it wants to be able to drop stuff rather than poll futures to completion before dropping them:
"
At least part of the goal here must be to avoid allocations and reference counting. If you don't care about that, then the design could have been to 'just' pass around atomically-reference-counted buffers everywhere, including as the buffer arguments to AsyncRead/AsyncWrite. That would avoid the need for AsyncBufRead to be separate from AsyncRead. It wouldn't prevent some unidiomaticness from existing – you still couldn't, say, have an async function do a read into a Vec, because a Vec is not reference counted – but if the entire async ecosystem used reference counted buffers, the ergonomics would be pretty decent.
But we do care about avoiding allocations and reference counting, resulting in this problem. However, that means a completion-based model wouldn't really help, because a completion-based model essentially requires allocations and reference counting for the futures themselves.
To me, the question is whether Rust could have avoided this with a different polling-based model. It definitely could have avoided it with a model where the allocations for async functions are always managed by the system, just like the stacks used for regular functions are. But that would lose the elegance of async fns being 'just' a wrapper over a state machine. Perhaps, though, Rust could also have avoided it with just some tweaks to how Pin works [1]… but I am not sure whether this is actually viable. If it is, then that might be one motivation for eventually replacing Pin with a different construct, albeit a weak motivation by itself.
[1] https://www.reddit.com/r/rust/comments/dtfgsw/iou_rust_bindi...
withoutboats2 19 hours ago
> I am not sure whether this is actually viable.
Having investigated this myself, I would be very surprised to discover that it is.
The only viable solution to make AsyncRead zero cost for io-uring would be to require futures to be polled to completion before they are dropped. So you can give up on select and most necessary concurrency primitives. You really want to be able to stop running futures you don't need, after all.
If you want the kernel to own the buffer, you should just let the kernel own the buffer. Therefore, AsyncBufRead. This will require the ecosystem to shift where the buffer is owned, of course, and that's a cost of moving to io-uring. Tough, but those are the cards we were dealt.
" -- https://news.ycombinator.com/item?id=26407958
---
some discussion on Rust async/await:
https://news.ycombinator.com/item?id=26406989
i only read the first few comments so far
---
Animats 12 hours ago
"But the biggest potential is in ability to fearlessly parallelize majority of Rust code, even when the equivalent C code would be too risky to parallelize. In this aspect Rust is a much more mature language than C."
Yes. Today, I integrated two parts of a 3D graphics program. One refreshes the screen and lets you move the viewpoint around. The other loads new objects into the scene. Until today, all the objects were loaded, then the graphics window went live. Today, I made those operations run in parallel, so the window comes up with just the sky and ground, and over the next few seconds, the scene loads, visibly, without reducing the frame rate.
This took about 10 lines of code changes in Rust. It worked the first time it compiled.
phkahler 5 hours ago
>> One refreshes the screen and lets you move the viewpoint around. The other loads new objects into the scene.
How did you do that in Rust? Doesn't one of those have to own the scene at a time? Or is there a way to make that exclusive ownership more granular?
brink 4 hours ago
The simplest (and often best) option is to use the Arc<Mutex<MyStruct>> pattern.
The Arc is an atomic reference counter that allows multiple ownership, and the nested Mutex enforces only one mutable borrow at a time.
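a minimal sketch of that pattern for the scene case above (Scene and its field are made up; the real program surely differs):

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

struct Scene { objects: Vec<String> }

fn main() {
    let scene = Arc::new(Mutex::new(Scene { objects: Vec::new() }));

    // Loader thread: adds objects to the scene over time.
    let loader_scene = Arc::clone(&scene);
    let loader = thread::spawn(move || {
        for name in ["sky", "ground", "house", "tree"] {
            loader_scene.lock().unwrap().objects.push(name.to_string());
            thread::sleep(Duration::from_millis(50)); // simulate slow loading
        }
    });

    // "Render loop": briefly locks the scene each frame to read what's loaded so far.
    for _frame in 0..10 {
        let count = scene.lock().unwrap().objects.len();
        println!("frame sees {count} objects");
        thread::sleep(Duration::from_millis(20));
    }

    loader.join().unwrap();
}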
adamnemecek 1 hour ago
I'm not sure how your architecture looks, but you might not even need to lock things. I find that using mpsc channels lets me avoid around 60% of the locking. Essentially, you have some sort of main loop, then you spawn a thread, load whatever you need there, and then send it to the main thread over mpsc. The main thread handles it on the next iteration of the main loop.
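a sketch of that channel pattern (made-up types again): the loader thread sends finished objects to the main loop, which folds them in on its next iteration, no lock required:

use std::sync::mpsc;
use std::thread;

struct LoadedObject { name: String }

fn main() {
    let (tx, rx) = mpsc::channel::<LoadedObject>();

    // Loader thread: does the slow work and sends finished objects back.
    thread::spawn(move || {
        for name in ["house", "tree", "car"] {
            tx.send(LoadedObject { name: name.to_string() }).unwrap();
        }
    });

    // Main loop: each iteration drains whatever has arrived so far.
    let mut scene: Vec<LoadedObject> = Vec::new();
    while scene.len() < 3 {
        while let Ok(obj) = rx.try_recv() {
            println!("adding {} to the scene", obj.name);
            scene.push(obj);
        }
        // ...render a frame here; this toy version just spins until done...
    }
}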
---
matklad 10 hours ago
Strong +1. I wish, for application development, it were possible to remove lifetimes, the borrow checker and manual memory management (which are not that useful for this domain) and to keep only fearless concurrency (which is useful even (more so?) in high-level languages). Alas, it seems that Rust’s thread-safety banana is tightly attached to the rest of the jungle.
kprotty 5 hours ago
FWIW, Ponylang has exactly this:
no lifetimes, data is either known statically via simple escape analysis or traced at runtime
no borrow checker, although it does have a stricter reference capability system
no manual memory management: uses a deterministically invoked (excluding shared, send data) Mark-and-no-sweep GC
has the equivalent of the Send and Sync traits in its reference capability system, which provide the same static guarantees.
As with everything though, it has its own trade-offs, whether in its ref-cap system, lack of explicit control in the system, etc.
---
https://matklad.github.io//2021/03/12/goroutines-are-not-significantly-smaller-than-threads.html
disputes the common wisdom that green threads take up a lot less memory than threads (he finds green threads take up about a quarter of the memory of normal threads)
discussion: https://news.ycombinator.com/item?id=26440334
---
.NET concurrency:
https://github.com/dotnet/orleans https://microsoft.github.io/coyote/
---
"
Message Passing Interface (MPI)
Programs that need to spread their computation amongst many compute cores, like climate models, often use the Message Passing Interface (MPI) library. The MPI library can be seen as a gateway to using more computing power. Developers in High Performance Computing (HPC) know that MPI is god and all MPI's children were born in its image. One time I heard two HPC developers say “this is the way”, like the Mandalorian, in reference to their agreement to use MPI for a C++ project prototype. ... If you want to read more on the impact of MPI in HPC in general, I highly recommend Jonathan Dursi’s post on why “MPI is killing HPC”. https://www.dursi.ca/post/hpc-is-dying-and-mpi-is-killing-it.html " -- [9]
---
" Why MPI was so successful ... It started with routines for explicitly sending and receiving messages, very useful collective operations (broadcast, reduce, etc.), and routines for describing layout of data in memory to more efficiently communicate that data. It eventually added sets of routines for implicit message passing (one-sided communications) and parallel I/O, but remained essentially at the transport layer, with sends and receives and gets and puts operating on strings of data of uniform types.
Why MPI is the wrong tool for today
Not only has MPI stayed largely the same in those 25 years, the idea that “everyone uses MPI” has made it nearly impossible for even made-in-HPC-land tools like Chapel or UPC to make any headway, much less quite different systems like Spark or Flink, meaning that HPC users are largely stuck with using an API which was a big improvement over anything else available 25 years ago, but now clearly shows its age. Today, MPI’s approach is hardly ever the best choice for anyone.
MPI is at the wrong level of abstraction for application writers
Programming at the transport layer, where every exchange of data has to be implemented with lovingly hand-crafted sends and receives or gets and puts, is an incredibly awkward fit for numerical application developers, who want to think in terms of distributed arrays, data frames, trees, or hash tables. Instead, with MPI, the researcher/developer needs to manually decompose these common data structures across processors, and every update of the data structure needs to be recast into a flurry of messages, synchronizations, and data exchange. ... At the end of this post are sample programs, written as similarly as possible, of solving the problem in MPI, Spark, and Chapel. I’d encourage you to scroll down and take a look. The lines of code count follows:

Framework     Lines   Lines of boilerplate
MPI+Python    52      20+
Spark+Python  28      2
Chapel        20      1
...
In Chapel, the basic abstraction is of a domain – a dense array, sparse array, graph, or what have you – that is distributed across processors. In Spark, it is a resilient distributed dataset, a table distributed in one dimension. Either of those can map quite nicely onto various sorts of numerical applications. In MPI, the “abstraction” is of a message. And thus the huge overhead in lines of code.
...
MPI is less than you need at extreme levels of parallelism
...
fault-tolerance, and an ability to re-balance the computation on the altered set of resources, become essential. ... If the highest-level abstraction a library supports is the message, there is no way that the library can know anything about what your data structures are or how they must be migrated.
Fault-tolerance and adaptation are of course genuinely challenging problems; but (for instance) Charm++ (and AMPI atop it) can do adaptation, and Spark can do fault tolerance. But that’s because they were architected differently.
"
---
https://www.datadoghq.com/blog/engineering/introducing-glommio/
https://github.com/TimelyDataflow/differential-dataflow
---
" Querying a database
In Node, let's say you want to query your database in a REPL. Here's what it looks like:
> const { Client } = require("pg");
undefined
> client = new Client(/* connection string */);
undefined
> client.query("select now()");
Promise { <pending> }
>
Something about this always felt so depressing. Rationally, I could justify it: nothing lost, nothing gained. If I wanted to board Node's async rocket to the moon, I had to accept inferior ergonomics in a situation like this. I get a promise, not a result, so I need to add additional logic to handle the promise and get a result.
And if only life were so simple:
> await client.query("select now()");
REPL11:1
await client.query("select now()");
^^^^^
SyntaxError: await is only valid in async function
Alas, with begrudging acceptance, we all got used to "the new way of doing things here."
At this moment, thanks to the proliferation of async/await, I can no longer remember the API for a Promise instance. So I'll just regress all the way to a callback. Which is fortunately possible, because JavaScript's "no man left behind" principle ensures callbacks will be well-supported for my grandchildren:
> client.query('select now()', (err, res) => console.log(res))
undefined
Still no result. Five minutes of head scratching and banging ensues until I realize – for the love of von Neumann – I've forgotten to call client.connect(). If you don't call client.connect() before calling client.query(), the pg client will silently push the query to an internal queue. This would be more infuriating if it wasn't so understandable – remember the flawed foundation we're building on here.
So, finally, I call connect() and then query() and I get a result (somewhere in there...):
> client.connect()
undefined
> client.query('select now()', (err, res) => console.log(res))
Result {
  command: 'SELECT',
  rowCount: 1,
  oid: null,
  rows: [ { now: 2021-03-20T19:32:42.621Z } ],
  fields: [
    Field {
      name: 'now',
      tableID: 0,
      columnID: 0,
      dataTypeID: 1184,
      dataTypeSize: 8,
      dataTypeModifier: -1,
      format: 'text'
    }
  ],
  _parsers: [ [Function: parseDate] ],
  _types: TypeOverrides {
    _types: {
      getTypeParser: [Function: getTypeParser],
      setTypeParser: [Function: setTypeParser],
      arrayParser: [Object],
      builtins: [Object]
    },
    text: {},
    binary: {}
  },
  RowCtor