Bayle Shanks's website: proj-plbook-plPartConcurrency2

continued from [1]

Chapter: scalability

scalability of message passing

depends on model of time

scalability of shared memory

building shared memory out of message passing

scalability of routing

and connectivity

small world networks

Chapter: concurrent control flow constructs

map reduce

map associative reduce

amorphous medium

imagine that there are a zillion nodes in some unknown topology. Notes can directly communicate with their neighbors. One might imagine that this is a discrete approximation to a continuous fluid of nodes. Primitives provided include:

each node continuously broadcasts the contents of a special 'sharing' hashtable to each of its neighbors
nodes can be labeled as to their membership or non-membership of one or more discrete regions, where a region is defined by a name and a membership test
gossip: an idempotent merge function that merges local state with neighbor information to produce a result is applied to a region to produce a shared value over the region (e.g. max or union)
consensus: non-failing nodes participating in a consensus all choose the same value
reduction: non-failing nodes participating in a reduction all share a value which is an aggregate function of values held by the participants (e.g. sum or average)
read/write atomic objects: implemented using consensus and reduction
active gradient: maintains a hop-count radiating from a source or sources
persistent node: a robust virtual node implemented by combining many nodes into a redundant assemblage
homeostasis and repair: (AVOID condition repair) executes 'repair' whenever condition is true; this is used in place of exception handler in a programming pattern of maintaining homeostasis, under the assumption that failures of one or a few nodes are common

http://groups.csail.mit.edu/mac/projects/amorphous/papers/lsmas-final.pdf

a similar system which is explicitly for the purpose of sensor networks is Regiment: http://www.cs.harvard.edu/~mdw/papers/regiment-ipsn07.pdf

Chapter: todo dont know where to put these yet

Bulk synchronous parallel (BSP)

https://en.wikipedia.org/wiki/Bulk_synchronous_parallel

tuple spaces

http://software-carpentry.org/blog/2011/03/tuple-spaces-or-good-ideas-dont-always-win.html

(what is the event (re)ordering model of tuple spaces?)

http://www.diku.dk/~vinter/cc/Lecture10TupleSpaces.pdf

serializable continuations

other

" General purpose cores Flexible multithreading 2. Vector hardware Vector parallelism 3. GPUs Restrictive data parallelism 4. FPGAs Customized dataflow 5. Custom accelerators Various forms "

misc

an interesting thing that i am seeing all over is the application of the 'referential transparency'/'purely functional' idea to larger systems, in a few specific ways. First, the software is architected as a stateless (purely functional) computation that is given events and access to external state as input and spits out actions as output. Second, each memory location in the external state doesn't just hold the current value, but also all past values, with some sort of timestamp (or equivalently (in this model), integer sequence number). Not all of these examples do exactly that, but they are all related:

Examples:

urbit
the propagator model [2] [3] (note: new information is merged, not appended; but append is one special case)
time-dependent facts in datomic
appendonly techniques for databases and filesystems (journaling)
ipfs
monads
fluents in situation calculus

Criticism of this pattern, in databases: http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/

the propagator model

"The Propagator Programming Model is built on the idea that the basic computational elements are autonomous machines interconnected by shared cells through which they communicate. Each machine continuously examines the cells it is interested in, and adds information to some based on computations it can make from information from the others. Cells accumulate information from the propagators that produce that information. The key idea here is additivity.

...

The ingredients of a propagator network are cells and propagators. The cells' job is to remember things; the propagators' job is to compute. The analogy is that propagators are like the procedures of a traditional programming language, and cells are like the memory locations; the big difference is that cells accumulate partial information (which may involve arbitrary internal computations), and can therefore have many propagators reading information from them and writing information to them.

...

 Cells must support three operations:

    add some content

    collect the content currently accumulated

    register a propagator to be notified when the accumulated content changes

When new content is added to a cell, the cell must merge the addition with the content already present. When a propagator asks for the content of a cell, the cell must deliver a complete summary of the information that has been added to it.

The merging of content must be commutative, associative, and idempotent. The behavior of propagators must be monotonic with respect to the lattice induced by the merge operation. "

Links:

channel mutex semaphore condition variable

---

"Transactional memory support is what it sounds like: hardware support for transactions. This is through three new instructions, xbegin, xend, and xabort." [4]

---

" Of all the ways of doing concurrency, callbacks are by far the worst, Twisted was plagued by them and is the main reason why it failed, and that was with a much more sane and reasonable language like Python (stackless Python was a much better alternative and used a model similar to Go’s CSP).

And the sad thing is that there are much better alternatives around with much more sound models and environments, Erlang and Go are the two obvious examples, and that is for the highly specialized situations where you have great concurrency needs, for any other problem anything else will be much better than Node.js, even PHP.

– uriel, in response to Why Node.JS is absolutely terrible, by Hasen el Judy " -- http://harmful.cat-v.org/software/node.js

---

https://en.wikipedia.org/wiki/Systolic_array

---

summary of https://glyph.twistedmatrix.com/2014/02/unyielding.html :

threads with shared state are dangerous, because threads can interleave in any way (or even execute in parallel). This makes it hard to reason locally about code.
concurrency paradigms such as explicit, cooperative coroutines (non-preemptive multitasking), where each thread executes in isolation until it yields, and any function that might yield must be syntactically marked, is better, because you only have to worry about other threads executing at the yield.
greenthreads are presented as one solution to the problems of threading. But greenthreads with shared state have the same disadvantage noted above as ordinary threading; it's hard to reason locally about code because, even if individual instructions cannot execute in parallel, it is still true that different threads may interleave arbitrarily
note that this is also an argument for why systems like async/await should indeed mark things as async, rebutting http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/
some alterative concurrent systems include:
- threading with shared state
- callbacks, eg Twisted’s Deferred, JavaScript’s? Promises/A[+], E’s Promises
- futures/promises (managed callbacks), eg Twisted’s Deferred, JavaScript’s? Promises/A[+], E’s Promises
- explicit coroutines (cooperative multitasking with explicit yield, where any function that might call a yield is syntactically marked), eg Java’s “green threads”, Twisted’s Corotwine, eventlet, gevent
- implicit coroutines (eg Java’s “green threads”, Twisted’s Corotwine, eventlet, gevent)

" shared-state multithreading...are a bad idea...make local reasoning difficult...in a single-tasking, nonconcurrent system...When you’re looking at a routine that manipulates some state..., To imagine the different states, you need only to read the routine and imagine executing its instructions in order from top to bottom. This means that the number of instructions you must consider is n, where n is the number of instructions in the routine. By contrast, in a system with arbitrary concurrent execution - one where multiple threads might concurrently execute this routine with the same state - you have to read the method in every possible order, making the complexity nn. ... Those of you who actually use threads to write real software are probably objecting at this point. “Nobody would actually try to write free-threading code like this,” I can hear you complain, “Of course we’d use a lock or a queue to introduce some critical sections if we’re manipulating state.”

Mutexes can help mitigate this combinatorial explosion, but they can’t eliminate it, and they come with their own cost; you need to develop strategies to ensure consistent ordering of their acquisition. Mutexes should really be used to build queues, and to avoid deadlocks those queues should be non-blocking but eventually a system which communicates exclusively through non-blocking queues effectively becomes a set of communicating event loops, and its problems revert to those of an event-driven system; it doesn’t look like regular programming with threads any more. ... But even if you build such a system, if you’re using a language like Python (or the ones detailed above) where modules, classes, and methods are all globally shared, mutable state, it’s always possible to make an error that will affect the behavior of your whole program without even realizing that you’re interacting with state at all. ... What are you going to do instead?

There’s a lot of debate over the best way to do “asynchronous” programming - that is to say, “not threads”, four options are often presented.

    Straight callbacks: Twisted’s IProtocol, JavaScript’s on<foo> idiom, where you give a callback to something which will call it later and then return control to something (usually a main loop) which will execute those callbacks,
    “Managed” callbacks, or Futures: Twisted’s Deferred, JavaScript’s Promises/A[+], E’s Promises, where you create a dedicated result-that-will-be-available-in-the-future object and return it for the caller to add callbacks to,
    Explicit coroutines: Twisted’s @inlineCallbacks, Tulip’s yield from coroutines, C#’s async/await, where you have a syntactic feature that explicitly suspends the current routine,
    and finally, implicit coroutines: Java’s “green threads”, Twisted’s Corotwine, eventlet, gevent, where any function may switch the entire stack of the current thread of control by calling a function which suspends it.

One of these things is not like the others; one of these things just doesn’t belong.

...

Options 1-3 are all ways of representing the cooperative transfer of control within a stateful system. They are a semantic improvement over threads. Callbacks, Futures, and Yield-based coroutines all allow for local reasoning about concurrent operations.

So why does option 4 even show up in this list? ... it’s a bit of a distraction from the much bigger advantage of event-driven programming, which is simply that it’s easier to write programs at scale, in both senses (that is, programs containing lots of code as well as programs which have many concurrent users). ... A system that presents “implicit coroutines” – those which may transfer control to another concurrent task at any layer of the stack without any syntactic indication that this may happen – are simply the dubious optimization by itself. ... When you look at the implementation of a potentially concurrent routine written using callbacks or yielding coroutines, you can visually see exactly where it might yield control, either to other routines, or perhaps even re-enter the same routine concurrently. If you are using callbacks – managed or otherwise – you will see a return statement, or the termination of a routine, which allows execution of the main loop to potentially continue. If you’re using explicit coroutines, you’ll see a yield (or await) statement which suspends the coroutine ... ((example bug:)) ... As it happens, this is the same variety of example Guido van Rossum gives when he describes why chose to use explicit coroutines instead of green threads for the upcoming standard library asyncio module, born out of the “tulip” project, so it's happened to more than one person in real life. ... Let’s say we have this program:

def transfer(amount, payer, payee, server): if not payer.sufficient_funds_for_withdrawl(amount): raise InsufficientFunds?() log("{payer} has sufficient funds.", payer=payer) payee.deposit(amount) log("{payee} received payment", payee=payee) payer.withdraw(amount) log("{payer} made payment", payer=payer) server.update_balances([payer, payee])

In a world without concurrency, this is of course correct. ...

 But if we were to run transfer with the same two accounts in an arbitrary number of threads simultaneously, it is (obviously, I hope) wrong. One thread could update a payer’s balance below the funds-sufficient threshold after the check to see if they’re sufficient, but before issuing the withdrawl.

So, let’s make it concurrent, in the PEP 3156 style. That update_balances routine looks like it probably has to do some network communication and block, so let’s consider that it is as follows:

@coroutine def transfer(amount, payer, payee, server): if not payer.sufficient_funds_for_withdrawl(amount): raise InsufficientFunds?() log("{payer} has sufficient funds.", payer=payer) payee.deposit(amount) log("{payee} received payment", payee=payee) payer.withdraw(amount) log("{payer} made payment", payer=payer) yield from server.update_balances([payer, payee])

...now let’s make another, subtler code change: our hypothetical operations team has requested that we put all of our log messages into a networked log-gathering system for analysis. A reasonable request, so we alter the implementation of log to write to the network.

Now, what will we have to do to modify the green-threaded version of this code? Nothing! ... ((but, by contrast, in the cooperative coroutine version:)) In order to update this routine for a non-blocking version of log, we had to type a yield keyword between the sufficient_funds_for_withdrawl check and the withdraw call, between the deposit and the withdraw call, and between the withdraw and update_balances call. If we know a little about concurrency and a little about what this program is doing, we know that every one of those yield froms are a potential problem. If those log calls start to back up and block, a payer may have their account checked for sufficient funds, then funds could be deducted while a log message is going on, leaving them with a negative balance....the mechanical act of typing these out is an opportunity to notice that something’s wrong, both now and later. Even if we get all the way through making the changes without realizing the problem, when we notice that balances are off, we can look only (reasoning locally!) at the transfer routine and realize, when we look at it, based on the presence of the yield from keywords, that there is something wrong with the transfer routine itself, regardless of the behavior of any of the things it’s calling.

 Tedious as it may be, the asynchronousness of an individual function is, in fact, something that all of its callers must be aware of, just as they must be aware of its arguments and its return type.

In fact you are changing its return type: in Twisted, that return type would be Deferred, and in Tulip, that return type is a new flavor of generator. This new return type represents the new semantics that happen when you make a function start having concurrency implications.

Haskell does this as well, by embedding the IO monad in the return type of any function which needs to have side-effects. This is what certain people mean when they say Deferreds are a Monad.

The main difference between lightweight and heavyweight threads is that it is that, with rigorous application of strict principles like “never share any state unnecessarily”, and “always write tests for every routine at every point where it might suspend”, lightweight threads make it at least possible to write a program that will behave deterministically and correctly, assuming you understand it in its entirety. When you find a surprising bug in production, because a routine that is now suspending in a place it wasn’t before, it’s possible with a lightweight threading system to write a deterministic test that will exercise that code path. With heavyweight threads, any line could be the position of a context switch at any time, so it’s just not tractable to write tests for every possible order of execution. " -- https://glyph.twistedmatrix.com/2014/02/unyielding.html

another post making similar points is https://glyph.twistedmatrix.com/2012/01/concurrency-spectrum-from-callbacks-to.html

---

" smsm42 4 days ago [-]

> the mental model works better for asynchrony; instead of describing a series of steps to follow, and treating the interrupt as the exception, you describe the processes that should be undertaken in certain circumstances

I could never understand how this works better as a mental model. Say you ask somebody to buy you a gadget in a store they don't know. What do you do tell them:

a) "drive in your car on this street, turn left on Prune street, turn right on Elm street, the store will be after the second light. Go there, find "Gadgets" isle, on the second shelf in the middle there would be a green gadget saying "Magnificent Gadget", buy it and bring it home"

or:

b) when you find yourself at home, go to car. When you find yourself in the car, if you have a gadget, drive home, otherwise if you're on Elm street, drive in direction of Prune Street. If you're in the crossing of Elm street and Prune street, turn to Prune street if you have a gadget but to Elm street if you don't. When you are on Prune street, count the lights. When the light count reaches two, if you're on Prune street, then stop and exit the vehicle. If you're outside the vehicle and on Prune street and have no gadget, locate store and enter it, otherwise enter the vehicle. If you're in the store and have no gadget then start counting shelves, otherwise proceed to checkout. Etc. etc. - I can't even finish it!

I don't see how "steps to follow" is not the most natural mental model for humans to achieve things - we're using it every day! We sometimes do go event-driven - like, if you're driving and somebody calls, you may perform event-driven routine "answer the phone and talk to your wife" or "ignore the call and remember to call back when you arrive", etc. But again, most of these routines will be series of steps, only triggered by an event.

" -- [5]

---

" Communicating sequential processes (Go, PHP+MySql?) makes IO have a modestly simpler synchronous syntax at the cost of communicating between I/O operations much more complex (sending a message to a port or performing some sort of transaction instead of just assigning to a value). It's a tradeoff. " -- [6]

---

difference between actors and CSP:

both are 'message passing' models as opposed to 'shared state'

"The difference between the CSP program and the Actor system is that the Actor Z does not get messages using a guarded choice command from multiple channels. Instead it processes messages in arrival ordering, and by the laws for Actor systems, the stop message is guaranteed to arrive." -- [7]

CSP processes are anonymous, while actors have identities.
CSP message-passing fundamentally involves a rendezvous between the processes involved in sending and receiving the message, i.e. the sender cannot transmit a message until the receiver is ready to accept it. In contrast, message-passing in actor systems is fundamentally asynchronous, i.e. message transmission and reception do not have to happen at same time, and senders may transmit messages before receivers are ready to accept them. These approaches may be considered duals of each other, in the sense that rendezvous-based systems can be used to construct buffered communications that behave as asynchronous messaging systems, while asynchronous systems can be used to construct rendezvous-style communications by using a message/acknowledgement protocol to synchronize senders and receivers.
CSP uses explicit channels for message passing, whereas actor systems transmit messages to named destination actors. These approaches may also be considered duals of each other, in the sense that processes receiving through a single channel effectively have an identity corresponding to that channel, while the name-based coupling between actors may be broken by constructing actors that behave as channels. " -- https://en.wikipedia.org/wiki/Communicating_sequential_processes#Comparison_with_the_Actor_Model

CSP:

Process 1 -> channel -> Process 2
- in more detail: Process 1 -> write end of channel -> read end of channel -> Process 2
Communication is addressed to the channel, not the recipient process
When writing, Process 1 blocks until Process 2 reads the message (some implementations optionally relax this, eg Golang's)

Actors:

Actor 1 -> mailbox -> Actor 2
Communication is addressed to the recipient process, not the mailbox
The mailbox is an asynchronous queue; Actor 1 does not block and wait for Actor 2 to get the message

then there's also pi-calculus "The π-calculus, partially inspired by the Actor model as described by Milner above, introduced dynamic topology into the process calculi by allowing dynamic creation of processes and for the names to be passed among different processes. However, the goal of Milner and Hoare to attain an algebraic calculus led to a critical divergence from the Actor model: communication in the process calculi is not direct as in the Actor model but rather indirectly through channels " [8]

CSP link:

http://arild.github.io/csp-presentation/#

https://en.wikipedia.org/wiki/Communicating_sequential_processes#Primitives

"The concurrency primitives of CSP were input, output, guarded commands, and parallel composition whereas the Actor model is based on asynchronous one-way messaging." [9]

---

Actor implementation: Akka

---

Petri nets

---

https://en.wikipedia.org/wiki/Calculus_of_communicating_systems

---

"Leading examples of process calculi include CSP, CCS, ACP, and LOTOS.[1] More recent additions to the family include the π-calculus, the ambient calculus, PEPA, the fusion calculus and the join-calculus." [10]

"CCS, CSP, and ACP constitute the three major branches of the process calculi family: the majority of the other process calculi can trace their roots to one of these three calculi" [11]

"Various process calculi have been studied and not all of them fit the paradigm sketched here. The most prominent example may be the ambient calculus. " [12]

---

"The use of channels for communication is one of the features distinguishing the process calculi from other models of concurrency, such as Petri nets and the Actor model (see Actor model and process calculi)." [13]

---

" Here is how I think Erlang works. I believe Akka is very similar.

Each process has a single mailbox. Messages are put into the receiver's mailbox by the sender, and fetched by the receiver using pattern matching. This matching process can change message ordering in the sense that the oldest message in a mailbox may not match, but a younger one does. In this case the younger one is consumed first. Other than that, message ordering is preserved.

With this in mind, the asynchronous π -calculus extended with input pattern matching input from buffer describes Erlang (and hence Akka) semantics accurately, although one needs to do a bit of work in the encoding, since the π-calculus doesn't have the restriction to single channels per process. However, one usually doesn't want an encoding, but rather a calculus that models the target system directly. Such a calculus exists, and it's called Featherweight Erlang. It is by Mostrous and Vasconcelos. Their paper focusses on typing, but you can ignore that and just look at the untyped calculus in Section 3. " -- [14]

---

https://en.wikipedia.org/wiki/Parallel_programming_model

---

https://en.wikipedia.org/wiki/Barrier_(computer_science)

---

https://en.wikipedia.org/wiki/Rendezvous_(Plan_9)

---

" There is one important downside to the CSP way ((compared to Actors)) of networking concurrent behaviour:

    Any network of goroutines that has some mutual dependency may deadlock. The developer must be aware of this and code so that deadlock is eliminated; this involves identifying mutual dependencies and altering them in some way (there are several well-known techniques, the easiest of these being to use a pure client-server pattern, which is known to be deadlock-free).

But consider what happens if a deadlocking network of goroutines is converted to Akka. The non-blocking communications means that deadlock won't happen - but how can we be sure that the queues will not overflow? There is a direct equivalence between deadlock and exhaustion of not-so-infinite buffers. " -- https://www.quora.com/How-are-Akka-actors-different-from-Go-channels

---

https://en.wikipedia.org/wiki/Systolic_array

---

https://en.wikipedia.org/wiki/Futex

---

some history of CSP:

https://swtch.com/~rsc/thread/

---

" riffraff 1605 days ago [-]

AFAIR kilim guarantees statically checked safety, exactly because "messages" that can be sent across actor borders have guaranteed properties of being aliaseable^mutable. "

---

" A Fiber is a lightweight thread that uses cooperative multitasking instead of preemptive multitasking. A running fiber must explicitly "yield" to allow another fiber to run, which makes their implementation much easier than kernel or user threads.

A Coroutine is a component that generalizes a subroutine to allow multiple entry points for suspending and resuming execution at certain locations. Unlike subroutines, coroutines can exit by calling other coroutines, which may later return to the point where they were invoked in the original coroutine.

A Green Thread is a thread that is scheduled by a virtual machine (VM) instead of natively by the underlying operating system. Green threads emulate multithreaded environments without relying on any native OS capabilities, and they are managed in user space instead of kernel space, enabling them to work in environments that do not have native thread support. " -- http://programmers.stackexchange.com/questions/254140/is-there-a-difference-between-fibers-coroutines-and-green-threads-and-if-that-i

---

concurrency vs parallel

Different people have different definitions. One common one is that concurrent processes may have their steps arbitrarily interleaved but may not necessarily ever have one step from each process executing simultaneously, whereas parallelism admits actual simultaneous execution.

An example is seen in the case of multitasking on a single CPU vs. on multiple CPUs. On a computer with a single CPU, you can still 'multitask', that is, run two separate threads of execution, by first executing the first task a little bit, then executing the second task a little bit, then executing a little more of the first task, etc. The two tasks are executing concurrently, but not in parallel; there is no chance that a step in the first task and a step in the second change will actually execute simultaneously, because you only have one CPU. By contrast, on a machine with two CPUs, you could have one CPU executing the first thread while the second CPU is executing the second thread. This means that the two threads can actually execute simultaneously.

Other people might have different definition of these terms. One alternate definition says concurrency is a property of the program (the program is broken into multiple threads which can tolerate interleaved or simultaneous execution), while parallelism is a property of the hardware (there are actually multiple CPUs). Some people use the phrase 'truly concurrent' to denote simultaneous execution (as opposed to merely interleaved execution).

---

    Process: OS-managed (possibly) truly concurrent, at least in the presence of suitable hardware support. Exist within their own address space.
    Thread: OS-managed, within the same address space as the parent and all its other threads. Possibly truly concurrent, and multi-tasking is pre-emptive.
    Green Thread: These are user-space projections of the same concept as threads, but are not OS-managed. Probably not truly concurrent, except in the sense that there may be multiple worker threads or processes giving them CPU time concurrently, so probably best to consider this as interleaved or multiplexed.
    Protothreads: I couldn't really tease a definition out of these. I think they are interleaved and program-managed, but don't take my word for it. My sense was that they are essentially an application-specific implementation of the same kind of "green threads" model, with appropriate modification for the application domain.
    Fibers: OS-managed. Exactly threads, except co-operatively multitasking, and hence not truly concurrent.
    Coroutines: Exactly fibers, except not OS-managed.
    Goroutines: They claim to be unlike anything else, but they seem to be exactly green threads, as in, process-managed in a single address space and multiplexed onto system threads. Perhaps somebody with more knowledge of Go can cut through the marketing material.

" -- [15]

"Threads scheduled by user-mode code are what’s known as lightweight threads; in Quasar (as well as in some other languages) they are called fibers."- -http://blog.paralleluniverse.co/2014/02/06/fibers-threads-strands/

(as you can see, different people define 'fibers' differently)

---

a quick overview of some of the stuff in here:

Concurrency paradigms can be deterministic or non-deterministic

A race condition is (often undesirable) non-deterministic program behavior resulting from concurrency. When programming in a non-deterministic concurrency paradigm, race conditions are possible, and the programmer must use patterns, analysis, and discipline to avoid, detect, and eliminate undesirable race conditions.

Threads and stuff like them (there is no consensus on terminology):

shared memory vs none; scheduled vs yield to target; preemptive vs cooperative (have to yield) vs semipreemptive (wont preempt in tight loop). preemptive or semipreemptive implies scheduled.

Processes: no shared memory between different processes, scheduled, preemptive

Threads: mb shared memory, scheduled, preemptive. Defined this way, processes are special cases of threads (threads with no shared memory), note however that in many systems you can have many threads within each process.

Fibers: scheduled? (different people disagree on the definition here; many define it the same as a coroutine, but for some, something that explicitly yields to a scheduler is a fiber), cooperative (have to yield) [16] [17] [18]

coroutines: unscheduled cooperative (have to yield)

strand: either a thread or a fiber

These things can be implemented at the OS level or at the application level. The application-level versions of these constructs are often more efficient (in time and memory) than the OS-level versions. Application-level threads or fibers (strands) can be mapped onto OS ones many:1 (all application-level strands are in one OS thread) or 1:1 (each application-level strand gets its own OS-level thread) or M:N (multiple application-level strands per OS-level thread, but also multiple OS-level threads. These systems sometimes also support migrating application-level strands between OS-level threads, often for the purpose of making sure that one blocking strand doesn't block other strands).

Both of the words 'process' and 'thread' are also used more abstractly to denote different subprograms that are running concurrently.

Some operations are 'blocking'; in general, operations can be categorized as 'blocking' or 'non-blocking'. 'Blocking' means that the operation is waiting for something from the outside ('outside' can mean the world outside the computer, or it can just mean some event from another thread). Blocking operations are usually categorized as either I/O (wait for something to be written to or read from the outside world) or synchronization (wait for some event to occur or some condition to be true other threads and/or shared resources). One says that a thread 'blocks' when it is stuck waiting on a blocking operation. Some languages and libraries support 'non-blocking I/O', which means that instead of blocking the thread, the I/O operation is started in the background while the thread goes on to do other things (logically, at least; sometimes this is actually implemented by spawning a new thread to do the blocking I/O operation). Non-blocking I/O is sometimes referred to as "nio" or "async I/O" or "aio".

Some people talk of 'concurrency' vs 'parallelism'. There is no consensus on the meanings of these terms, however, one meaning is that 'parallelism' means different processes which are running at the same time on different CPUs, so one step from one process might be executing simultaneously with a step from a different process; in contrast, 'concurrency' also includes the case in which a single CPU is multitasking between two processes, so they are never actually running at the same time, but rather they are interleaved.

Some language implementations, such as Python or Ruby, support concurrency but not parallelism (in the above sense) between multiple threads; so the interpreter will only allow one thread to be running at any one moment in time, even if multiple CPUs are available (however, often these language implementations do still allow a program to contain multiple processes, that is, subprograms which do not share memory). In such a situation, concurrency (multiple threads multitasked onto one CPU) is still useful for dealing with I/O, because while one thread is paused waiting for the I/O operation to complete, other threads can be doing other things. One says that a program whose performance is bottlenecked by I/O is 'I/O bound'; these programs will benefit from multitasking even without multiple CPUs. By contrast, programs which are 'compute-bound' are those whose performance is bottlenecked by the speed of the CPU (and RAM); those programs will not benefit much from multitasking on a single CPU but would benefit from parallel execution across many CPUs.

A hierarchy of three programming constructs for reactor-pattern concurrency with particular relevance to (non)-blocking I/O:

The reactor pattern is when you run a program in a single thread, but use only non-blocking operations (such as non-blocking I/O). You switch between logical subprograms so that when one logical subprogram would otherwise 'block' to wait for the result of a blocking operation, instead you just proceed with executing another logical subprogram. One thing to keep in mind with this pattern is that if one of these subprograms does execute a blocking operation, or if it remains running via a loop, then the other subprograms won't run at all during this time because they are all sharing the same thread.

The following can all be implemented in languages which support only 'concurrency' (multitasking onto a single CPU) even if not parallelism (simultaneous execution on many CPUs). The following starts with most primitive (callbacks) (primitive in this hierarchy; the actual implementation involves still lower layers of more primitive concurrency constructs), and then moves up to things built on top of this primitive. That is, callbacks are most primitive here, and promises are implemented on top of callbacks, and async/await is implemented on top of callbacks.

Callbacks. To do a blocking operation, you put in a request for the operation, and as part of the request, you give a callback which will be called when the I/O is complete. In some models, the callbacks can be called anytime, effectively pre-empting other code. In other models, calls to callbacks are queued up and only executed when other computation has completed or at explict 'yield' point in the code (note, though, that one can model the former as just the latter with a 'yield' in between every statement).

Futures/promises. To do a blocking operation, you call a function which returns, not a result, but a 'promise' object. The promise object provides the result value of the operation after it completes. Typically the promise object also provides a way to check if the operation is done yet. Typically the promise object also provides a way to register a callback which will be called when the operation completes (this allows you to 'chain together' a sequence of blocking operations). There is no consensus on what, if anything, is the difference between the words 'future' and 'promise', or even if they are synonyms or not. Promises can be implemented on top of callbacks (they are not equivalent though; a callback might be called many times, whereas the callback registered on a promise will only be called once; although the idea can be extended to Observables, in which there is a stream of zero or more events each of which will cause the callback to be called).

Async/await. These introduce two new keywords, 'async' and 'await', and uses coroutines. When you get a promise object, and you want to wait until the operation completes and then do something else, you 'await' on the promise object. Logically, this lets you write code sequentially as if you are blocking; but actually, the thread it not blocked, because 'await' implicitly yields control to another coroutine while it waits for the operation to complete. Any function containing an 'await' must be marked with the 'async' keyword, and any function that calls any other 'async' function itself must be marked as 'async'. This is important for readability because it makes it clear that calling any function marked 'async' might cause the current coroutine to yield control to another one; when that is possible, the programmer may have to take extra precautions to avoid race conditions.

An alternative to reactor-pattern concurrency is to use threads or processes. OS-level threads and processes are often too slow or memory-intensive to scale up as well as reactor-pattern concurrency, but application-level threads or processes (such as Golang's goroutines or Erlang's processes) are more competitive, especially if the application-level threading framework is M:N and automatically migrates application-level threads so that an application-level thread that blocks doesn't block other application-level threads. A downside of this compared to reactor-pattern concurrency is that operations from different threads may be arbitrarily (or almost arbitrarily) interleaved, meaning that a lot of care must be exercised to avoid race conditions, especially when accessing shared memory. By contrast, reactor-pattern concurrency often only has a few, explicit points when one code path may yield to another one, which makes it easier to reason about race conditions. Process-based systems (no shared memory) ameliorate this downside (but race conditions may still be possible in the form of deadlock and similar).

fork/clone

SIMD, MIMD, etc (Flynn's taxonomy)

shared memory: memory consistency models consistency, eventual consistency, last-write-wins consistency the hard part is resolving inconsistencies logical clocks, lamport clocks, vector clocks, CAP theorem CRDTs

avoiding sharing with shared memory:

uniqueness types
immutable data
threadlocals

atomic operations: these are the actual low-level primitives, supported by hardware, upon which other concurrency constructs are built. consensus number CAS LL/SC atomic reads/writes (single byte, multibyte/word) vector instructions in ISA fences, memory barriers see also memory consistency models

critical sections, mutexes, condition variables, monitors, semaphores, locks, read/write locks, synchronization barriers, monitors locks are not composable read-copy-update

deadlock livelock

wait-free algorithms (lock-free)

dataflow (spreadsheet-like) 'labeled lines' 'active data'

Glitch

map/reduce

process calculi (often based on channels)

Message passing: CSP and Actors. Channels, mailboxes.

select/poll with pattern matching

propagator model

consensus algorithms zero-trust, eg bitcoin

todo systolic array

join-calculus style function calls

todo petri nets

in hardware: metastability arbiters clocks, clock domains, clockless/async architectures todo what are those other things with the grids, etc? cp from other file

GPUs parallel loops

todo https://en.wikipedia.org/wiki/Bulk_synchronous_parallel (BSP)

tuple spaces

events, pub/sub

amorphous medium

optimistic vs pessimistic concurrency

worker pools

signals

transactions STM ACID

pipe, socket blocking channel vs channel with fixed-length queue vs channel with 'infinite' queue

shared queue

---

some general groupings of the above:

you have different 'processes' or 'threads' of execution (subprograms executing linearly).

Fundamentally, a one-CPU doing batch processing can be modeled as a Turing machine (or something equivalent like lambda calculus). But this leaves out some phenomena that are relevant in the real world. Two closely related things that the Turing machine model leaves out are:

interactivity: a Turing machine models batch computation, but a real computer can interact with the external world via inputs and outputs in the middle of executing its program
concurrency: a real computer can have multiple CPUs which are operating simultaneously

These are related in that a model for interactivity also seems to be a solution to concurrency; you could model a multi-CPU computer as a set of interactive Turing machines that can interact with each other and with a shared memory. There is not yet universal consensus on what this model should be. Process calculi are contenders in this space. When you start thinking about this stuff, two concepts come up again and again:

time: it takes time for messages to propagate. This means that a CPU which observes two events in sequence cannot always rely upon other CPUs observing the same ordering of those events. Also, two CPUs might operate at different speeds even if they start out running the same program at the same time; this means that it is complicated for one CPU to infer another CPU's current state from information that the other CPU emitted in the past. This sort of thing is modeled in memory consistency models and in temporal logics, amongst other places.
wholistic properties: we sometimes need to talk about properties of the state of the entire system, properties which have to do with the relation between CPUs and cannot be reduced to statements about the state of single CPUs in isolation. Similarly, some computational operations' importance lies only in their effects on the relationship between CPUs; 'synchronization' operations are like this.

These models can be further expanded to distributed systems models. A key additional property of distributed system models is:

unreliability: messages might be lost in transit or corrupted. Some CPUs may unpredictably go offline.

Note: NFAs (and non-deterministic automata in general) only capture a very constrained kind of concurrency. They can be thought of as concurrently exploring all of the nondeterministic paths, but the 'virtual CPUs' exploring these paths aren't allowed to communicate with each other. This models 'embarrassingly parallel' computations.

in hardware: metastability arbiters clocks, clock domains, clockless/async architectures todo crossbar switches, clos, etc; cp from other file; that file also mentioned SIMD i think

they send messages to each other (message passing, channels, (named destinations/processes?: addressing, named channels?), routing (implementation detail), process calculi, pipes, sockets, signals, events, pub/sub, select/epoll)

Sometimes you would like them to pretend to share memory. But true shared memory is an infeasible abstraction (and note that even a single computer with multiple CPUs is a distributed system in some sense; not in the sense that messages between CPUs have a significant probability of being lost, but in the sense that, because the CPUs are physically distant from each other and from main memory, it takes nonzero time for a signals to travel between CPUs or between a CPU and memory). So we have approximations (memory consistency models, eventual consistency, last-write-wins consistency, CAP theorem, tuple spaces, fences/memory barriers) and ways of working with these approximations (logical clocks, CRDTs, Glitch, Propagators, consensus algorithms).

Sometimes you need to synchronize (constrain the relationship between processes, or in more detail, constrain the state of the total system in a way that cannot be expressed as a constraint that applies to each process separately without reference to any other process). We have synchronization primitives in shared memory (CAS, LLSC).

We have synchronization constructs, some of which may be primitives in various models: critical sections, mutexes, condition variables, monitors, semaphors, locks, read/write locks, synchronization barriers

At higher-levels, shared-memory synchronization is included into memory consistency models (transactions, STM, ACID).

And we have shared data structures:

read-copy-update
queues

and ways to avoiding sharing with shared memory (so that you can analyze programs to avoid races while still using shared memory):

data:
- uniqueness types
- immutable data
- threadlocals (non-shared memory)
control:
- purely functional subroutines

We have ways of specifying parallel control flow.

Processes themselves are first-order (processes, threads, process calculi, fork/clone, join), and can be emulated to an extent within one process via interleaving (reactor model, schedulers, preemption, fibers, coroutines). This is useful not only for implementation or efficiency, but also because it constraints when and how steps from different processes may be simultaneous or interleaved.

At higher levels, these combine with data and synchronization. join-calculus calls, callbacks, promises, async/await, blocking channels.

There are also ways of giving orders to first-order processses without treating them as first-class (sorta like declarative programming is computation without specifying the exact computation, here we might specify the computation but try to abstract away the concurrency): worker pools, SIMD and/or parallel map, map-reduce, dataflow

There are guarantees and other properties of parallel and distributed algorithms/systems (race conditions, wait-freedom, deadlock, livelock, liveness, progress, CAP). And properties of paradigms (deterministic, non-deterministic, composability, SIMD, MIMD). And mathematical models like Turing machines, but for parallelism (process calculi, petri nets). In order to reason about concurrent processes, we have temporal logics [19].

some todos: https://en.wikipedia.org/wiki/Bulk_synchronous_parallel petri nets, amorphous medium, ambient calculus, systolic arrays

---

" The basic operation of an Actor is easy to understand: like a thread, it runs concurrently with other Actors. However, unlike threads it is not pre-emptable. Instead, each Actor has a mailbox and can call a routine named “receive” to check its mailbox for new messages. The “receive” routine takes a filter, and if no messages in an Actor’s mailbox matches the filter, the Actor sleeps until it receives new messages, at which time it’s rescheduled for execution. ... each actor can have no, one or multiple addresses ... the actor models makes no guarantees on the ordering of messages. Queuing and dequeuing of messages in a mailbox are atomic operations, so there cannot be a race condition ... There is no shared state and the interaction between actors is purely based on asynchronous messages. ...

" -- [20]

---

This paper has constructions and proofs of consensus numbers of various primitives:

Wait-Free Synchronization by MAURICE HERLIHY, 1991.

https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf

" First, we introduce a simple and general technique, based on reduction to a concensus protocol, for proving statements of the form, “there is no wait-free implementation of X by Y.” We derive a hierarchy of objects such that no object at one level has a wait-free implementation in terms of objects at lower levels. In particular, we show that atomic read/write registers, which have been the focus of much recent attention, are at the bottom of the hierarchy: thay cannot be used to construct wait-free implementations of many simple and familiar data types. Moreover, classical synchronization primitives such astest&set and fetch&add, while more powerful than read and write, are also computationally weak, as are the standard message-passing primitives. Second, nevertheless, we show that there do exist simple universal objects from which one can construct a wait-free implementation of any sequential object. " -- https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf

things that are proven to have an infinite consensus number in this paper are:

compare-and-swap
FIFO queue augmented with an additional operation, peek, which returns but does not remote the first item in the queue
linked list with fetch&cons, which "atomically threads at item onto the front of a linked list" ( "Returns old cdr and adds new car on to the head of a list. Use preceding protocol where peek() = tail(car::cdr)." [21])
an array of atomic read/write memory locations with atomic memory-to-memory MOV between them
an array of atomic read/write memory locations with atomic memory-to-memory SWAP between them
(proven elsewhere, i think?) sticky bits: a bit that is initially 'unset' and then can only be written to once. (note that the context here is within one execution of a 'consensus' algorithm) (note the similarity to electrical engineering's "arbiters")

As [22] explains starting with slide 27, this hierarchy is not robust, but as [23] explains, a similar hierarchy, h^r_m, was proposed by Jayanti, and although it has not been proven robust, it has been proven robust in for 'well-behaved' cases.

Other links about consensus number and wait-freedom:

---

The paper which coined 'wait-free' and 'consensus number', Wait-Free Synchronization by MAURICE HERLIHY, 1991, presented a method for creating a wait-free implementation of any data structure. However, it is O(N^3) in space and 'very slow', 'preventing parallelism' ([24], Encyclopedia of Algorithms, Mark Moir).

---

lock-free and wait-free links:

---

is there a library of wait-free data structures somewhere? i see a bunch of libraries mixing lock-free and wait-free.

Here's some wait-free algorithms and implementations that i've heard of (note: some data structures provide wait-freedom only on some operations):

wait-free stack:

queues and ring buffers:

partial: "wait-free enqueue operation with a multiconsumer-blocking dequeue operation, as seen in the URCU (userspace read-copy-update) library (http://lttng.org/urcu)" [25]
partial: "bounded-FIFO (first in first out) with a single-producer wait-free enqueue and multi-consumer lock-free dequeue, as seen in the Concurrency Kit library (http://concurrencykit.org/)." [26]
wait-free ring single-producer single-consumer ring buffer: Concurrency Kit's ck_ring_t (also lock-free for single-producer, multi-consumer)
https://github.com/rigtorp/SPSCQueue "A bounded single-producer single-consumer wait-free and lock-free queue written in C++11"
boost::lockfree::spsc_queue is "a wait-free single-producer/single-consumer queue (commonly known as ringbuffer)"
Wait-Free Queues With Multiple Enqueuers and Dequeuers

Hash maps:

https://github.com/khizmax/libcds implements "FeldmanHashMap?, FeldmanHashSet?: [2013] Steven Feldman, Pierre LaBorde?, Damian Dechev "Concurrent Multi-level Arrays: Wait-free Extensible Hash Maps". Supports thread-safe bidirectional iterators"

Terval:

https://github.com/ucf-cs/tervel (and http://ucf-cs.github.io/Tervel/index.html )
- Multi-Word Compare-and-Swap (MCAS)
- vector
- stack
- ring buffer
- hash map
- related thesis (Feldman): http://stars.library.ucf.edu/cgi/viewcontent.cgi?article=2367&context=etd

Note: As https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom notes, there are also general constructions of wait-free datastructures out of lock-free ones.

perhaps these links will lead to more libraries (i already skimmed them, but not too carefully):

---

http://www.1024cores.net/home/lock-free-algorithms

" So what primitives are in your arsenal for implementation of advanced synchronization algorithms?

Compare-And-Swap Perhaps, it's the most famous primitive, it's other names are CAS, compare-and-exchange, compare-and-set, std::atomic_compare_exchange, InterlockedCompareExchange?, __sync_val_compare_and_swap, LOСK? CMPXCHG and other. It's an instance of so-called atomic RMW (read-modify-write) operation. It's pseudo-code is: T compare-and-swap(T* location, T cmp, T xchg) { do atomically { T val = *location; if (cmp == val) *location = xchg; return val; } } That is, it stores a new value (xchg) into a memory location only if it contains an expected value (cmp), in either case it returns a value that was stored in the location when the operation begins. And all that is done atomically on hardware level.

Fetch-And-Add Also atomic RMW operation, and also conducted atomically in hardware. Aka atomic_fetch_add, InterlockedExchangeAdd?, LOСK? XADD. Below is the pseudo-code: T fetch-and-add(T* location, T x) { do atomically { T val = *location; *location = val + x; return val; } } There are also variations like fetch-and-sub, fetch-and-and, fetch-and-or, fetch-and-xor.

Exchange Atomic RMW. Aka atomic_exchange, XCHG. Dead simple, but not less useful: T exchange(T* location, T x) { do atomically { T val = *location; *location = x; return val; } }

Atomic loads and stores They are not RMW (read-modify-write) operations, they are just independent atomic loads and stores. They are frequently unfairly underestimated. However, they play fundamental role in synchronization algorithms, and they are what you should generally strive for - atomic loads and stores are better/cheaper/faster than atomic RMW operations.

Mutexes and the company Why not? The most stupid thing one can do is try to implement everything in a non-blocking style (of course, if you are not writing infantile research paper, and not betting a money). Generally it's perfectly Ok to use mutexes/condition variables/semaphores/etc on cold-paths. For example, during process or thread startup/shutdown mutexes and condition variables is the way to go.

---

" what is the most important thing regarding synchronization algorithm's performance and scalability? I frequently hear the answer that it's a number of atomic RMW (read-modify-write) instructions (like Compare-And-Swap or Fetch-And-Add) per operation. It's dead wrong. The most important thing is amount of write sharing per operation. Numerical measure of write sharing is number cache-line transfers per operation, and the ideal value is 0. If there is 0 cache-line transfers per operations amortized (we are perfectly Ok with amortization here), then the algorithm is perfectly scalable. Anything other than 0, even 1, is a serious problem for scalability. ... First, if there is write sharing system ungracefully degrades, the more threads we add the slower it becomes.

Second, if there is no write sharing system linearly scales. Yes, atomic RMW operations are slower than plain stores and loads, but they do scale linearly in itself (by the way, cost of atomic RMW operations becomes smaller and smaller with each new processor generation, there are no fundamental reasons why they must be any slower and similar non-atomic read-modify-write sequence).

Third, loads are always scalable. Several threads are able to read a memory location simultaneously. Read-only accesses are your best friends in a concurrent environment. "

-- http://www.1024cores.net/home/lock-free-algorithms/first-things-first

---

https://en.wikipedia.org/wiki/Seqlock

---

"no matter your concurrent programming model du jour, three fundamental concepts crop up again and again: isolation (of state), atomicity (of state transitions), and consistency (of those atomic transitions)." -- http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/

---

regarding locks: "And if you care about performance, you are also going to need to think about hardware esoterica such as CMPXCHG, spin waiting, cache contention, optimistic techniques with version counters and memory models, ABA, and so on." -- http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/

---

Time of check to time of use (TOCTOU) bugs: "changes in a system between the checking of a condition (such as a security credential) and the use of the results of that check. This is one example of a race condition." [27]. Example:

" async bool IsRed?(AsyncColor? c) { return (await c.R > 0 && await c.G == 0 && await c.B == 0); }

This rather simple (and silly) function checks to see if an AsyncColor? is “red”; to do so, it reads the R, G, and B properties. For whatever reason, they are asynchronous, so we must await between accesses. If AsyncColor? is a mutable object, well, guess what – these values might change after we’ve read them, opening up a possible TOCTOU bug. For instance, imagine a caller’s surprise when IsRed? may have lied to it:

AsyncColor? c = ...; await IsRed?(c); assert(await c.R > 0);

That assertion can very well fire. Even this callsite has a TOCTOU bug of its own, since c.R might be >0 at the end of IsRed’s? return, but not after the assert expression’s own await has completed. " -- http://joeduffyblog.com/2016/11/30/15-years-of-concurrency/

---

another (or just an old?) idea for ways of categorizing concurrency:

level of close-to-the-metal
over space or not (systems distributed over a network have an additional error possibility; that the other system either didn't get the message, or did but the reply was lost, or did but crashed before replying)
over time or not (memory reordering consistency models, eventual consistency dbs)
control (eg fork) vs data (eg parallel map)

---

Concurrent Collections/tstreams

https://en.wikipedia.org/wiki/Concurrent_Collections / tstreams is a deterministic parallel programming model for pure computations.

You specify the dependency graph between operations using three primitive types: items (data items, which must be immutable), steps (operations, which must be pure functions), and tags (to distinguish instances; eg if you are specifying the workflow for a restaurant, where pies, once ordered by a guest, must be prepared, and then baked, then pies, which are each associated with different, individual guests, are each tagged with a guest_ID). You then specify ordering relationships between steps and data items (eg the operation bake_pie requires a prepared_pie, which is produced by prepare_pie). In CnC? there is also a controller/controlee reliationship which is similar to function calling and which is used to create a new tag and initiate an assembly line of operations to produce the results of some operation for that tag, after satisfying prerequisites (see the "Waiter controlling cooks" section of the CnC tutorial for an example).

Links:

What every systems programmer should know about lockless concurrency

---

CS 149: Parallel Computing This course is an introduction to parallelism and parallel programming. Most new computer architectures are parallel; programming these machines requires knowledge of the basic issues of and techniques for writing parallel software. Topics: varieties of parallelism in current hardware (e.g., fast networks, multicore, accelerators such as GPUs, vector instruction sets), importance of locality, implicit vs. explicit parallelism, shared vs. non-shared memory, synchronization mechanisms (locking, atomicity, transactions, barriers), and parallel programming models (threads, data parallel/streaming, MapReduce?, Apache Spark, SPMD, message passing, SIMT, transactions, and nested parallelism). Significant parallel programming assignments will be given as homework. The course is open to students who have completed the introductory CS course sequence through 110 and have taken CS 143. Terms: Win

Credit/No Credit Instructors: Olukotun, O. (PI) ; Zaharia, M. (PI) ; Aberger, C. (TA) ; Nötzli, A. (TA) ... more » Schedule for CS 149

Units: 3-4

UG Reqs: GER:DB-EngrAppSci?

Grading: Letter or

---

preemptive threading vs single-threaded nonblocking (async) I/O with callbacks vs single-threaded nonblocking (async) I/O with coroutines:

" 1) ((preemptive threading)) Use preemptive system threads that can execute in parallel. A task requiring simultaneous waiting is given an operating system thread of its own so it can block without stopping the entire program. But threads require significant memory and other resources per thread. Also, the operating system can arbitrarily interleave the execution of system threads, requiring the programmer to carefully protect shared resources with locks and condition variables, which is exceedingly error-prone.

2) ((single-threaded nonblocking (async) I/O with callbacks)) Have a single-threaded program, where that single thread runs an event loop whose job is to react to external events by invoking a callback function that has been registered. While it doesn't require the same kind of complex synchronization that preemptive threads do, the inverted control structure of this approach requires your own control flow to thread awkwardly through the system's event loop, leading to a maze of event callbacks. " -- [28]

advantages of coroutines:

KayEss? on May 26, 2017 [-]

The implementation of the coroutines might be complex, but their use is very straightforward -- the code looks almost exactly the same as the blocking code would, just with the awaits in there.

As for the number of threads, when I use Boost ASIO with coroutines I often end up with multiple threads servicing a pool of coroutines, so if there is shared state between them then there is still synchronisation. I use channels implemented on top of eventfd to help with that.

When I converted some network code from callbacks to coroutines the code ended up about 1/3 as long and was far simpler and easier to understand. It also fixed several bugs I simply couldn't find.

The reality is that the callbacks are far more complex to use than the coroutines.

vvanders on May 26, 2017 [-]

Yup, we used to let our designers loose on coroutines as a way to script Game AI. If designers can use them safely then I'm sure the average programmer can too :).

butterisgood on May 26, 2017 [-] ... Having written heavily callback-driven code. Threading contexts manually, thinking about object lifetimes, I can say that coroutines have a much cleaner flow to them and can make it easier to write and later read and debug programs than crazy callbacks and contexts.

Object lifetimes are one thing coroutines help with in a big way. I think it's why Rust is so attractive to so many.

mhink on May 26, 2017 [-] ... This is the biggest reason I vastly prefer using coroutines in my Javascript projects (by way of `redux-saga` if there's heavy lifting, or just async/await otherwise).

" -- [29] and [30] and others in the thread [31]

" dom0 on May 26, 2017 [-]

2) ((single-threaded code with nonblocking I/O and callbacks)) is concurrency, 1) ((multiple processes executing at the same time)) is parallelism; there are various programming models for multiple threads. The basic approach of mutual exclusion (=shared memory, locks, barriers) is just one of them, another approach would be message passing / actors (queues, ZeroMQ? etc.). You can mix the two, and the latter typically is implemented on top of the former, because hardware does not have message passing facilities. Some other approaches don't really fit that scheme (e.g. TBB).

Choosing an approach in this field is not always obvious, though sometimes it is. E.g. many stupid server workloads are bound on I/O concurrency, so green threads are the best fit.

However, as soon as CPU scaling is required, multiple independently schedulable kernel tasks are required -- you either have to fork/CreateProcess? or spawn threads. This does not necessarily imply that larger parts of the application have to be aware of this. E.g. it's entirely possible to plug a behind-the-scenes CPU thread into an event loop. " -- [32]

---

Some types of concurrency: Shared memory, message passing, data parallel

---

POSIX pthreads issues fork vs forkall

'fork' duplicates the current process but only duplicates the thread calling 'fork'; the duplicate doesn't contain the other threads. If you want all of the threads, some systems provide a non-standard 'forkall' in addition to 'fork', but note that this can cause problems if the forker isn't aware of what the other threads are doing, e.g. if another thread is in the middle of file I/O, now both copies will try to complete that file I/O.

Copying only one thread could be a problem if the child process tries to acquire a mutex that was held by a thread that wasn't duplicated; in POSIX only the thread that locked a mutex can unlock it, so such mutexes will never be unlocked. Many POSIX library functions such as malloc and printf may hold mutexes internally. Therefore, after forking, a child process should only use Async Signal Safe POSIX library calls.

The most common use of 'fork' is as part of the fork-exec pattern, that is, the fork is immediately followed by an 'exec' (which stops executing this binary executable and starts executing another binary executable instead), and the only purpose was to create a new process for the 'exec'. In this case, no problem with copying only one thread. But if you actually wanted to keep the current program running, then you have to worry more.

There is a way to register a handler upon forking, pthread_at, but it is called in the context of the thread that issued the fork(), so it can't help with other threads' mutexes (this has been called a defect in the POSIX standard One way around this is to use only pthread semaphores instead of mutexes; semaphores can be unlocked by any thread, not just the one that called them. This doesn't help with mutexes used internally in the standard library, however. See the section 'Fork model' of the Metacall README [https://github.com/metacall/core#57-fork-model for more on this.

---

https://link.springer.com/chapter/10.1007%2F978-3-540-30186-8_11 "Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS"

---

good sort-of intro to a bunch of stuff

https://www.cs.rochester.edu/u/scott/courses/458/notes/coherence_etc

---

one motivation for C-style catch-fire undefined behavior of data races:

" unsigned x; If (x < 3) {... async x change switch(x) { case 0: ... case 1: ... case 2: ...}}

Assume switch statement compiled as branch table.
May assume x is in range.
synchronous change to x causes wild branch (not just wrong value) " -- [33]

So, to prohibit the wild branch from switch statement scenario, this requires prohibition of all async updates of ordinary variables (that is, the compiler should be able to assume that, in any given thread, ordinary variables don't change unless written to in that thread).

---

http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

---

jiggawatts 5 days ago

parent [–]

on: .NET 5.0

> [Async has] Huge disadvantages with absolutely zero performance gains for 95% of programmers.

One curious thing is that Java has elected to use lightweight threads instead of async methods: https://www.javacodegeeks.com/2019/12/project-loom.html

Their arguments are compelling:

Works with existing debuggers and debugging techniques.
Stack traces aren't polluted with irrelevant garbage.
Doesn't create method apartheid, where async and non-async can't "mix".
Doesn't force library authors to "buy into" the async model.

int_19h 5 days ago [–]

Cannot interop with other languages/VMs that do async differently.

The nice thing about promise-based async is that it's very easy to map to straight C ABI callbacks, which means that it can be fully cross-language. Green threads are runtime-specific.

Ironlink 5 days ago [–]

"Cannot" seems a bit harsh. CompletableFuture? isn't going away, and is made for callback situations.

With green threads being "free" (you can have millions), it might be reasonable to just block your green thread until the callback arrives.

reply " -- [34]

---

https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/

the myths are

" “different cores can have different/stale values in their individual caches”. Or that the reason we need volatiles in languages like Java, is to “prevent shared-data from being cached locally”, and force them to be “read/written all the way to main memory”. "

the article also concisely explains the MESI protocol and gives a few examples

---

https://aykevl.nl/2019/02/tinygo-goroutines

" Go uses goroutines for mostly-cooperative multitasking. In general, each goroutine has a separate stack where it can store things like temporary values and return addresses. At the moment the main Go compiler starts out with a 2kB stack for each goroutine and grows it as needed.

TinyGo? is different. The system it uses for goroutines is based on the async/await model like in C#, JavaScript? and now also C++. In fact, we're borrowing the C++ implementation that's used in Clang/LLVM. The big difference here is that TinyGo? inserts async/await keywords automatically. ... Luckily, you don't have to worry about the color of your function in Go, but I've chosen to use this as an implementation strategy for TinyGo?. The reason is that TinyGo? also wants to support WebAssembly? and WebAssembly? does not support efficient stack switching like basically every other ISA does: the stack has been hidden entirely for security reasons. So I've decided to use coroutines1 instead of actually separate stacks. "

---

https://news.ycombinator.com/item?id=22731248

jcranmer on March 30, 2020 [–]

The Rust async/await amounts to the following:

Magic keywords (i.e., async and await) to manage the state machine aspect of the function for you.
A standard interface for describing an asynchronous task (core::future::Future). The magic keywords create an implementation of this interface.
An interface for saying that task progress can be made (core::task::Waker).

That's it. There is no runtime provided, not even optionally, in the standard library (let alone core). A fair amount of this post is giving instructions for how to build the runtime to execute asynchronous tasks.

One important thing to note is that Rust uses a polling-based approach to implement tasks. That is, the core interface to a task is to run it and have it return the result or indicate that is waiting on something else. In the case of the latter, there is a callback mechanism (the Waker object) that the task is responsible for calling when it has more information to make progress--and the caller is responsible for calling the task again when that happens. Many (most?) other implementations instead use a resolution model, where the caller provides a function that is called with the result, and when the function gets called (particularly in the case of tasks that can execute quickly) differs from implementation to implementation. Rust's model definitely maps much more easily to a kernel, since the dominant pattern will be to drain a buffer, poke a device, and then wait for an interrupt to happen (the interrupt handler calling the wake function).

Matthias247 on March 31, 2020 [–]

> Rust's model definitely maps much more easily to a kernel, since the dominant pattern will be to drain a buffer, poke a device, and then wait for an interrupt to happen (the interrupt handler calling the wake function).

What is nice about Rusts model is that it prevents the continuation from running inline the event handler which signaled the completion. That avoids a ton of reentrancy issues and questions around "where does my code actually run"? That indeed makes it also interesting to use for interrupt handlers, since it is guaranteed that the code which waits for the interrupt to happen runs purely in the executors thread and will not accidentally take over the interrupt handler.

---

https://os.phil-opp.com/async-await/ Writing an OS in Rust Async/Await "In this post we explore cooperative multitasking and the async/await feature of Rust. We take a detailed look how async/await works in Rust, including the design of the Future trait, the state machine transformation, and pinning. We then add basic support for async/await to our kernel by creating an asynchronous keyboard task and a basic executor."

---

looks like a great intro:

https://begriffs.com/posts/2020-03-23-concurrent-programming.html discussion: https://news.ycombinator.com/item?id=22672128

---

structuredconcurrency #asyncawait #asynclet #taskgroup #swift

https://oleb.net/2021/structured-concurrency/

---

Parallel Programming and Parallel Algorithms, "Computer Architecture" Chapter 5, M.Zargham

---

continued at [35]