Table of Contents for Programming Languages: a survey

Chapter : introduction to concurrency

"For the past 30 years, computer performance has been driven by Moore's Law; from now on, it will be driven by Amdahl's Law. Writing code that effectively exploits multiple processors can be very challenging." -- Doron Rajwan

see also "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", by Herb Sutter

"Concurrency is the next major revolution in how we write software" -- Herb Sutter

"Commonly used programming models are prone to subtle, hard to reproduce bugs, and parallel programs are notoriously hard to test due to data races, non-deterministic interleavings, and complex memory models,"

optimistic vs pessimistic concurrency

memory models (architecture)

(contrast with programming language 'memory models', which we discuss later under the name 'memory order')

shared memory vs "shared nothing"

shared memory



(mb should put this one into a separate section about non-uniform memory models)

deterministic paradigms

nondeterministic paradigms



note: sometimes the words 'synchronous' and 'synchronization' are used to mean 'two threads observe the same ordering of some events', rather than 'two threads are doing the same thing at the same time'.



defn data race

when the final result of the computation may differ depending on which ordering occurs out of a nondeterministic possible set of orderings of events? todo is that right?

e.g. two threads try to add to a counter simultaneously. One update gets lost todo expand example.

similar example: two ATMs try to withdraw from the same bank account at the same time. both succeed; 'double spend'; global invariant not maintained.

or mb defn is: two or more operations access one memory location concurrently, and at least one of them is a write

" assume the language allows distinguishing between synchronization and ordinary (non-synchronization or data) operations (see below). We say that two memory operations conflict if they access the same memory location (e.g., variable or array element), and at least one is a write. We say that a program (on a particular input) allows a data race if it has a sequentially consistent execution (i.e., a program-ordered interleaving of operations of the individual threads) in which two conflicting ordinary operations execute “simultaneously”. For our purposes, two operations execute “simultaneously” if they occur next to each other in the interleaving and correspond to different threads. Since these operations occur adjacently in the interleaving, we know that they could equally well have occurred in the opposite order; there are no intervening operations to enforce the order.... it is also possible to define data races as conflicting accesses not ordered by synchronization, as is done in Java. These definitions are essentially equivalent." -- Adve, Boehm. Memory Models: A Case for Rethinking Parallel Languages and Hardware

" First of all, let’s make sure we know what we’re talking about. In current usage a data race is synonymous with a low-level data race, as opposed to a high-level race that involves either multiple memory locations, or multiple accesses per thread. Everybody agrees on the meaning of data conflict, which is multiple threads accessing the same memory location, at least one of them through a write. But a data conflict is not necessarily a data race. In order for it to become a race, one more condition must be true: the access has to be “simultaneous.” ... In fact, most languages try to provide the so called DRF (Data Race Free) guarantee, which states that all executions of data-race-free programs are sequentially consistent. Don’t be alarmed by the apparent circularity of the argument: you start with sequentially consistent executions to prove data-race freedom and, if you don’t find any data races, you conclude that all executions are sequentially consistent. But if you do find a data race this way, then you know that non-sequentially-consistent executions are also possible. " --

defn synchronization

when there is a data race, you must 'synchronize', here meaning use some mechanism to constrain the nondeterminism in orderings (possibly all the way to determinism) (todo is that defn right?)

to implement synchronization over shared memory you need some sort of read-modify-write primitive, in general one with infinite consensus number (see atomics, above)

definition liveness

threads should make forward progress, not get 'stuck' (todo is that defn right?)

(todo group under above)

(todo which are built in terms of the others)

(todo which are more or less efficient)

Chapter : coarse-grained control flows (threads and processes) :



green thread

coroutines, cooperative multitasking vs preemptive

green thread hierarchy

  with greenthreads/coroutines/fibers you can often feasibly have many more concurrent threads than with OS threads.
 e.g. erlang, e.g.

subroutine call

coroutine (synchronous green thread)

async green thread (a subroutine call or coroutine at the runtime level) could share scoped variables via a saguaro stack the runtime can transform code written like blocking I/O into async I/O by making an async I/O call then yielding the greenthread and not scheduling it again until the async I/O call completes

green process (separate address space, a subroutine call or coroutine at the runtime level) erlang todo are these async?

thread (shared address space, OS scheduler)


"The only other languages I know (edit: besides Go) with built-in lightweight thread schedulers are Erlang/Elixir and Haskell, with the former lacking static typing and the latter lacking management-willing-to-use-ability." --


and join

forking at the OS level

random interesting stuff on 'fork':

" The autoconf-generated configure can be slow because it executes programs such as a C compiler many times in order to test whether various libraries, header files, and language features are present. This particularly affects Cygwin, which, due to its lack of a native fork system call, may execute configure scripts considerably slower than Linux.[6] "



the thread blocks until the expression returns True

some languages (eg SPIN Promela) use bare expressions for blocking: when a bare expression, such as 'finish == 2', is encountered, the thread blocks until the expression returns True




worker pool

deferred task queues

Chapter: fine-grained control flows

(todo: am i using the phrase fine-grained parallelism correctly? does GPGPU stuff fit here?)

parallelism annotation

e.g. Haskell's par

data parallelism

does this belong in this chapter? i think so...

pmap, preduce, pscan

parallel data types which are automatically evaluated in parallel when a function like 'map' is applied to them

compiler implementation of nested data parallelism


Flynn's taxonomy: SISD SIMD MISD MIMD SPMD


a turing-universal SIMD mechanism can emulate MIMD, but only without IPC and communication with main memory


and data and code parallelism

active data

book rec: connection machine

Chapter: message passing

message passing



and channel... ?




Example of selective receive in Ada's "rendezvous-based" concurrency: In Ada, "it allows you to treat a thread as an object with methods; calling a method on the object sends a message to the thread.", eg:

        when usage < capacity =>
          accept Push(value: in integer) do
            data[usage] := value;
            usage := usage + 1;
        when usage > 0 =>
          accept Pop(value: out integer) do
            usage := usage - 1;
            value := data[usage];
      end select;
    end loop;



synchrononus (unbuffered) and async (buffered)

In Golang, closing a closed channel panics (therefore, close is not idempotent). A send to or a receive from a nil channel blocks forever, a send to a closed channel panics, and a receive from a closed channel returns zero [2]. In combination with Golang's select statement, this can lead to useful idioms [3].

The Golang feature proposal [4] opines that this works okay when you have 1 sender and N receivers on a channel but that it's not great for having multiple senders on the channel. It suggests providing an operation to attempt to send to a channel that doesn't panic when the channel is closed, and suggests reference-counting both receivers and senders (separately) of each channel channels and auto-closing them when they have no senders or no receivers. Replies by a Golang designer [5] [6] like the reference counting idea but are skeptical of having close by receivers function as a signal to senders, for reasons i don't fully understand. In the special case of buffered channels, one reason is that items that the sender has already sent, that are sitting in the buffer when the channel is closed, may never arrive.

Golang's 'select' only works on channels and needs the programmer to know how many inputs it is selecting between in advance (unlike epoll) (i think Golang provides 'reflect.Select' for when you don't know how many things you have to select on?). Some have proposed that 'select' should work on any custom synchronization primitive, such as condition variables [7].

Some have proposed a way to make arbitrarily buffered channels (buffered channels with no fixed buffer size limit) in order to make channels that don't block [8]. Another (? is this distinct from all of the above suggestions?) suggestion is to allowing the 'dup' syscall to apply to channels, and then allowing someone to 'close' the dup'd channel without affecting the original one [9].

alternative: callbacks


Chapter: process calculi

especially related to channels/message passing.

communicating sequential processes (CSP) (process calculus)

pi calculus (process calculus)

In Pi Calculus, a channel can itself be sent over a channel.

Chapter: sequencing around communication

receiving data


always use timeouts!


and futures etc


con: "...the cumbersome use of callbacks. This, of course, is a problem with many asynchronous computations: you gain concurrency but lose the natural code flow. It’s hard for a programmer to reason about which line of code executes on which thread, and passing information from one callback to another is cumbersome as well." --

TJ Holowaychuk describes how callbacks can be more error-prone than some other concurrency techniques:

"In Go when my code is done, it’s done, you can’t get re-execute the statement. This is not true in Node, you could think a routine is completely finished, until a library accidentally invokes a callback multiple times, or doesn’t properly clear handlers, and cause code to re-execute." [10]

todo: i think: With callbacks, programmers can forget to propagate errors, leading to an error crashing a thread without the rest of the program noticing.


An alternative to callbacks that allows you to write the code in an easier-to-read manner.

'promises' are used somewhat interchangably, although some languages have both 'promises' and 'futures' and distinguish between them

implicit (it's just a value from the program's point of view) vs. explicit (it's a structure that you query to see if it's been bound yet)

in some languages, the writable side of the future is separated from the readable side (like a pipe). This allows you to e.g. copy the readable side and pass it to 5 consumers, and pass the writable side to 1 producer, without fearing that one of the consumers will overstep their authority and try to write to the future.

in clojure, it's easy to make futures behave like ordinary variables, because stateful variables must be 'deref'd anyway

clojure has something called a 'promise' and something called a 'future'. The difference is that with a clojure 'future', at the time of future creation you bind an expression that will be evaluated to give the result; whereas with a clojure 'promise', the promise is just a structure that you later bind a result to, e.g. the computation that will produce the result need not be determined yet at the time of promise creation. (see ). Promises are read-only. (this terminology is flipped in the C++ Folly Futures library [11])

in c++, futures are (see ) the read side and promises are the write side.

A hardware version is full/empty bits, found for example in the : "For example, an array A is initially written with "empty" bits, and any thread reading a value from A blocks until another thread writes a value.". In the Cray MTA, the blocking on empty bits eventually times out. Opcodes are provided which ignore full/empty status.

There is a standard for promises in Javascript, . It was referenced by , so perhaps it is canonical.

facebook's C++ Folly Futures library has a good explanation of various things (if you can read C++):

some constructs in the Folly library mentioned in that introduction are:

Rust futures:


C# await:

the idea is that if you have a promise, you give that promise to the 'await' keyword, which puts the current block of code 'on hold' until the promise completes, and in the meantime yields control to YOUR caller.

Python 3.5's async/await:

Javascript's (contrasted with callbacks, and (raw) promises):

labeled lines

like neuro -- always have an answer even if your inputs aren't ready (the answer can be 'i don't know')

sending data

yield (and generators)

(just like in the serial case with coroutines)


(just like in the serial case)

send a message thru a channel

alternative: callbacks


alter a 'shared' memory


generalize to unordered eventcounts (cite Unify: A scalable, loosely-coupled, distributed shared memory multicomputer; that article reserves the term 'eventual consistency' when there is a maximum time duration until consistency)

condition variable

supports 3 operations

wait: go to sleep until you get woken (see below)

signal: wake one thread waiting on condVar

broadcast: wake all threads waiting on condVar

example usage: producer/consumer: Every now and then, the producer sends some data, and then the consumer is supposed to do something with it. It would be a waste of time for the consumer to busy wait for new data. Solution: The producer signals the condition variable when they produce something. the consumer sleeps on the condition then wakes up and processes what the producer produced.

todo should this also be mentioned in 'locks and friends'?


? or should this be in a different chapter

event-driven programming, pub/sub

? or should this be in a different chapter

streams and reactive programming

'observable' = 'stream'

you can treat streams like lazy lists; you can filter and map over streams

a promise is a stream with only one event to be emitted on it

to apply an async operation (which returns a 'promise' type of stream) to a stream, you flatmap that operation over the stream; this (a) maps the operation over the stream, transforming each event (input to the async operation) into a promise stream (output of the async operation), (b) takes the resulting stream of streams and flattens (merges) it into one stream (since each promise stream only had one event on it, each one will contribute one event to the final, flattened stream)


spreadsheet-like dependent updates

? should this be subsumed under dataflow or event-driven programming? ? maybe not: a key feature two-way (symmetric) updates or even cyclic dependencies


Chapter: models of concurrent time

motivating problem:

Logical time and logical clocks

Usually we track time with physical time, e.g. we assign a global real number representing time to each event. But one can also only consider the ordering of events; in computation, often it does not matter if step 3 occurred at 13:30 and step 4 occurred at 14:53, just that step 4 came after step 3. So we might consider a discrete version of time.

Sometimes it is unclear whether some events came before or after others. For example, you may a distributed system with a bunch of processor nodes whose clocks are not synchronized very precisely. Or, you may be working with 'events' that are actually intervals, for example, 'processor A executes subroutine X' and 'processor B executes subroutine Y', in which case two 'events' can overlap. Or, you may be working with events whose ordering relative to each other is known in some cases but unknown in others; for example two concurrent threads that take an action A1, then wait for the other at a synchronization barrier, then take another action A2; you know that threads 1's A1 preceded thread 2's A2, but you don't know if thread 1's A1 preceded thread 2's A1.

For these reasons we often model logical time not as a totally ordered structure (similar to the natural numbers), but rather as a partial order.

In many applications we assume that each individual process's logical time is totally ordered (that is, that each individual process knows, for any pair of events that it observed or generated, which one came first; therefore all the events observed or generated by an individual process can be arranged in order as a line), but that the global logical time of a concurrent distributed system is not.

We define a logical time by defining an ordering relation on events, such as 'happens-before' or 'caused' or 'might-have-caused'.

A logical clock is a function C whose domain is events and whose range is some representation of logical time, that is, C maps events to times, or said another way, for any event e, C(e) tells us what time that event occurred at. We call the value returned by C(e) a 'timestamp'. We place a requirement on C that relates the ordering of the timestamps to the ordering of events; "if e1 < e2, then C(e1) < C(e2)".

See also 'logical clocks for consistency', below, in which we discuss schemes by which processes communicate with each other about their logical clocks for the purpose of combining their local logical time into a global logical time.


event (re)ordering models

event (re)ordering models is my own term; "memory model" in programming languages; "consistency model" in theory

for some authors, in shared memory, a "coherence model" is about event (re)ordering for events pertaining to a single item in memory; a "consistency model" is about event (re)ordering for events across different items. Typically, stronger guarantees are provided for coherence than for consistency; typically, coherence is taken to mean that every observer is guaranteed to observe the same sequence of values for any single item. For other authors, "coherence" and "consistency" are synonyms.

partial and total orders

one technique is to attach integer timestamps to each event. Each processor sends out each event with a unique timestamp which is greater than all timestamps it has ever sent or observed (Lamport).

relaxed (totally ordered only within each variable), release (acquire-release) (partially ordered), sequentially consistent (totally ordered)

'serializability' from database literate, where event = transaction compressed to a point: "a transaction schedule is serializable if its outcome (e.g., the resulting database state) is equal to the outcome of its transactions executed serially, i.e., sequentially without overlapping in time" --

others from (should we include this?): gives an example where it's hard to explain a delay in terms of a single reordering (but you can explain it when you realize that different processes can observe different reorderings):

P1: A=1; t1=A; t2=B P2: B=1; t3=B; t4=A assert(!(t2 == 0 and t4==0)

the assertion can fail because it may take a long time for the writes of 1 to A and B to propagate inter-processor, even though those same writes were read locally immediately. gives an example that can fail on read/read or write/write reordering (i think) but not on write/read reordering (x/y being read as "reorder the x, which in the program code is before the y, to after the y"):

P1: A = 1; flag = 1;

P2: while (flag == 0); assert(A == 1);

Example from of something that fails under processor consistency but not (i think) under casual ordering:

P1: A = 1;

P2: while (A==0); B = 1;

P3: while (B==0); print A;

everyone likes this example:

P1: A = 1; print B;

P2: B = 1; print A;

and this similar one, which is part of Dekker's (sp?) algorithm for critical sections:

P1: A = 1; if (B == 0) {enter critical section}

P2: B = 1; if (A == 0) {enter critical section}

note that if there are any delays, or write/read reorderings, both threads may enter the critical section

Example of counterintuitive things with 'C relaxed' ordering:

" For example, with x and y initially zero,

Thread 1: r1 = atomic_load_explicit(y, memory_order_relaxed); atomic_store_explicit(x, r1, memory_order_relaxed); Thread 2: r2 = atomic_load_explicit(x, memory_order_relaxed); atomic_store_explicit(y, 42, memory_order_relaxed);

is allowed to produce r1 == r2 == 42. "

consistency models in which we distinguish 'synchronization operations' and give them stronger consistency properties ( ; todo i dont quite understand these):

"an operation is defined as a synch operation if it forms a race with any operation in any sequentially consistent execution" --

" Background: Most large-scale distributed systems (i.e ., databases) apply replication for scalability, but can support only weak consistency:

    DNS: Updates are propagated slowly, and inserts may not be immediately visible.
    NEWS: Articles and reactions are pushed and pulled throughout the Internet, such that reactions can be seen before postings.
    Lotus Notes: Geographically dispersed servers replicate documents, but make no attempt to keep (concurrent) updates mutually consistent.
    WWW: Caches all over the place, but there need be no guarantee that you are reading the most recent version of a page." -- (should we include this?):

client-central consistency properties: Assume there is a 'client' who migrates across processes in between each read and write and who accesses shared memory from whichever process they currently reside in. We want to provide guarantees for this client but only regarding their own data.

" For concurrent operations, the absence of synchronized clocks prevents us from establishing a total order. We must allow each operation to come just before, or just after, any other in-flight writes. If we write a, initiate a write of b, then perform a read, we could see either a or b depending on which operation takes place first. After operations complete, visibility is guaranteed

On the other hand, we obviously shouldn’t interact with a value from the future–we have no way to tell what those operations will be. The latest possible state an operation can see is the one just prior to the response received by the client. " -- [12]


here's a map of some consistency models (adapted from fig. 2 of Highly Available Transactions: Virtues and Limitations by Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, Ion Stoica, drawn by Kyle Kingsbury):

In the above figure, the pink background indicates models such that no distributed system that implements them can satisfy the availability criterion ("every user that can contact a correct (non-failing) server eventually receives a response from that server, even in the presence of arbitrary, indefinitely long network partitions between servers" -- [13]). The blue background indicates models that are compatible with availability but only if client nodes must always talk to the same server node ('sticky availability'; its definition is more complicated if different server nodes deal with different subsets of the data, see [14] section 4.1). The green background indicates models that are compatible with availability without this requirement.

Some definitions of some things in this figure (these definitions are mostly quoted from [15])

"In the modern distributed systems context, “sequential consistency” means something slightly different (and weaker) than what architects intend. What architects call “sequential consistency” is more like what distributed systems folks would call “linearizability”." [17]


what else? apparently Chu spaces can model this stuff?

example from

So, you have a multinode database. The database works by having 'primaries' and secondaries; each piece of data is owned by a single 'primary' node. All write operations must go to the relevant primary node. A read from a primary node of data owned by it is guarantees to return the result of the most recent previous write operations.

In the face of a network partition, (a) a write only succeed (and only take effect) if they have been acknowledged by a majority of nodes, and (b) if the primary ends up in the minority partition, then a new primary will be elected in the majority partition.

What if the old primary in the minority partition still thinks it's the primary after the new primary has been elected? This will still not allow writes to the minority partition; because writes only succeed if they have been acknowledged by a majority of nodes, and the old primary is stuck in a minority partition and so can't get that many acknowledgments.

However, a client could write to the new primary node and then read from the old primary node. Since the old primary node is still telling clients that it is the primary, the client might think that the read gives the aforementioned guarantee that it must return the most recent previous write operation. In other words, the client could be getting 'stale reads' while thinking it has a guarantee that reads will not be stale.

'Read Your Writes' is not guaranteed here:

'Monotonic Reads' is not guaranteed here:

(this also means that the PRAM consistency property does not hold, because PRAM implies both Read Your Writes (RYW) and Monotonic Reads (MR) (ie PRAM is strictly stronger than RYW and strictly stronger than MR), and furthermore, it implies that none of causal consistency, sequential consistency, linearizable consistency hold, because causal consistency is strictly stronger than PRAM, and each of the others are strictly stronger than the previous).

" For instance, let’s say two documents are always supposed to be updated one after the other–like creating a new comment document, and writing a reference to it into a user’s feed. Stale reads allow you to violate the implicit foreign key constraint there: the user could load their feed and see a reference to comment 123, but looking up comment 123 in Mongo returns Not Found.

A user could change their name from Charles to Désirée, have the change go through successfully, but when the page reloads, still see their old name. Seeing old values could cause users to repeat operations that they should have only attempted once–for instance, adding two or three copies of an item to their cart, or double-posting a comment. Your clients may be other programs, not people–and computers are notorious for retrying “failed” operations very fast.

Stale reads can also cause lost updates. For instance, a web server might accept a request to change a user’s profile photo information, write the new photo URL to the user record in Mongo, and contact the thumbnailer service to generate resized copies and publish them to S3. The thumbnailer sees the old photo, not the newly written one. It happily resizes it, and uploads the old avatar to S3. As far as the backend and thumbnailer are concerned, everything went just fine–but the user’s photo never changes. We’ve lost their update. " -- [18]

example from

Consider the example from the previous section, now drop the requirement that writes only succeed if they have been acknowledged by a majority of nodes. Now writes to the minority partition are visible to later reads from nodes in the minority partition.

" For instance, suppose we have a user registration service keyed by a unique username. Now imagine a partition occurs, and two users–Alice and Bob–try to claim the same username–one on each side of the partition. Alice’s request is routed to the majority primary, and she successfully registers the account. Bob, talking to the minority primary, will see his account creation request time out. The minority primary will eventually roll his account back to a nonexistent state, and when the partition heals, accept Alice’s version.

But until the minority primary detects the failure, Bob’s invalid user registration will still be visible for reads. After registration, the web server redirects Alice to /my/account to show her the freshly created account. However, this HTTP request winds up talking to a server whose client still thinks the minority primary is valid–and that primary happily responds to a read request for the account with Bob’s information.

Alice’s page loads, and in place of her own data, she sees Bob’s name, his home address, his photograph, and other things that never should have been leaked between users.

You can probably imagine other weird anomalies. Temporarily visible duplicate values for unique indexes. Locks that appear to be owned by two processes at once. Clients seeing purchase orders that never went through.

Or consider a reconciliation process that scans the list of all transactions a client has made to make sure their account balance is correct. It sees an attempted but invalid transaction that never took place, and happily sets the user’s balance to reflect that impossible transaction. The mischievous transaction subsequently disappears on rollback, leaving customer support to puzzle over why the numbers don’t add up.

Or, worse, an admin goes in to fix the rollback, assumes the invalid transaction should have taken place, and applies it to the new primary. The user sensibly retried their failed purchase, so they wind up paying twice for the same item. Their account balance goes negative. They get hit with an overdraft fine and have to spend hours untangling the problem with support. " -- [19]

CAP theorem

The 'CAP' in CAP theorem is an acronym for Consistency, Availability, and Partition tolerance, defined as follows.

" given consistency, availability, and partition tolerance, any given system may guarantee at most two of those properties. While Eric Brewer’s CAP conjecture was phrased in these informal terms, the CAP theorem has very precise definitions:

    Consistency means linearizability, and in particular, a linearizable register. Registers are equivalent to other systems, including sets, lists, maps, relational databases, and so on, so the theorem can be extended to cover all kinds of linearizable systems.
    Partition tolerance means that partitions can happen. Providing consistency and availability when the network is reliable is easy. Providing both when the network is not reliable is provably impossible. If your network is not perfectly reliable–and it isn’t–you cannot choose CA. This means that all practical distributed systems on commodity hardware can guarantee, at maximum, either AP or CP.
    Availability means that every request to a non-failing node must complete successfully. Since network partitions are allowed to last arbitrarily long, this means that nodes cannot simply defer responding until after the partition heals." -- [20]

(CA, AP, and CP refer to the property of satisfying two of the the three CAP properties; CA means namely Consistency+Availability, AP means Availability+Partition tolerance, CP means Consistency+Partition tolerance)

The theorem:

event (re)ordering primitives

fences , memory barriers

acquire/release ordering vs. sequentially consistent ordering:

acquire/release in terms of memory barriers: " Acquire semantics is a property which can only apply to operations which read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.

    Release semantics is a property which can only apply to operations which write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order." --

a release is load/store + store/store, and and acquire is load/load, load/store.

in C (at least in C++, todo) variable declared 'atomic' are automatically also considered 'volatile', i.e. the processor assumes that other threads may be reading/writing them (cite )


branching time



Chapter : shared memory

note: when i say 'shared' here i am including a distributed or remote memory resource that multiple processes can write to, e.g. a database. this is very different from how the term 'shared memory' is usually used so mb i should change that. see

in fact, when thinking about pitfalls of shared memory, i find it easiest if i imagine the 'shared memory' to be a virtual construct which is actually realized by message passing between separate processes with their own caches.

pros and cons



ACID properties

hardware shared memory


cache coherence

consistency model

PGAS Partitioned Global Address Space (non-uniform memory with processor affinity)


distributed shared memory


atomic primitives

atomic primitives

Common atomic primitives:

Simple primitives:

Read-modify-write primitives:

Atomic primitives can be ranked by their consensus number, which says how many concurrent processes can be brought to consensus in a wait-free manner using the primitive (todo: i have only the vaguest idea what this sentence means) (Herlihy, Maurice (January, 1991). "Wait-free synchronization". ACM Trans. Program. Lang. Syst. 13 (1): 124–149. doi:10.1145/114005.102808). From

So, the only primitives from the first list above with infinite consensus number are Compare-and-swap and Load-Link/Store-Conditional. I believe that either of these can be used to implement the others, but i'm not positive. Both of these are common: "As of 2013, most multiprocessor architectures support CAS in hardware." ; "All of Alpha, PowerPC?, MIPS, and ARM provide LL/SC instructions" -- and claims that "As of 2013, the compare-and-swap operation is the most popular synchronization primitive for implementing both lock-based and nonblocking concurrent data structures."

When using compare-and-swap, you have to be wary of the ABA problem, in which the programmer erroneously assumes that if some memory location was A in the past and is A now, then it has not been modified in between the past and now -- in reality it's possible that its value was changed to a new value B in the meantime, but then was changed back to A. Therefore, if you store the old value A of a memory location, and then later compare-and-swap, comparing to the old value A, it is possible that the CAS will succeed even if the memory location was modified in the meantime, provided its value has been changed back to A before the CAS.

One workaround for the ABA problem is to add a counter to the memory location [21] [22]; for example, a 32-bit value V can be augmented by another 32 counter bits C to yield a 64-bit value representing the tuple (V, C). Each time the algorithm writes a new V to the memory location, it also increments C; now even if the V which is written is identical to an older V, the C will probably be different. The only problem is if the addition wraps around and exactly 2^32 updates have occurred in between accesses from some observer. One solution might be to not wrap around counter increments, and require a global synchronization if the counter ever reaches 2^32. (how many bits are needed for the counter? [23] suggests that 32 bits are usually sufficient but that 16 bits are often not).

An issue in the use of atomics is that the hardware generally only provides atomics for specific bit-widths. E.g. there might be a way to atomically write to a single byte, but not a way to atomically replace all of the values in a 200-byte array with others.

If a programmer uses a high-level language that provides atomic variables that support atomic assignment, they must be careful to understand what is and isn't atomic. For example, in many languages, if a is an atomic variable, "a = a + 1" is not, as a whole, atomic. This would often be compiled into an atomic read from the memory location containing "a" into a register, followed by an increment of that register, followed by an atomic write of the value in that register into the memory location containing "a".


" 26 has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

bool x= false; bool y= false; int z= 0;

a() { x= true; } b() { y= true; } c() { while (!x); if (y) z++; } d() { while (!y); if (x) z++; }

kick off a, b, c, d, join all threads assert(z!=0);

Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.


With ack_rel, c() can perceive that x=true; in a() happens before y=true; in b() at the same time d() can perceive that y=true; happens before x=true; (due to lack of "global ordering".) In particular c() can perceive x==true and y==false at the same time d() can perceive y==true and x==false. So z might not be incremented by either of c() or d(). With seq_cst, if c() perceives x=true; happens before y=true;, so does d(). – nodakai Jan 20 '16 at 11:03


@nodakai, Your explanation is accurate but I think the phrase "happens before" can be misleading since the crux of the issue with acquire-release is that neither of the writes happens-before the other. – jhoffman0x May 1 '18 at 3:50 "

locks and friends

mutex (lock)

build out of critical sections or atomic primitives

can be used to implement critical sections



todo: is there any difference between 'lock' and 'mutex'?

if you can do this you are good (theorem?):


synchronization barriers

critical sections

critical sections: a region of code (shared across threads) that only one thread can be in at any given time

critical sections can be built out of semaphores, or vice versa (i think; see )

or can be built out of atomic primitives


generalization of mutex; n threads may hold the semaphore at once (e.g. with a mutex n = 1). the semaphore is a finite discrete resource.

" Semaphores vs. mutexes

A mutex is essentially the same thing as a binary semaphore and sometimes uses the same basic implementation. The differences between them are in how they are used. While a binary semaphore may be used as a mutex, a mutex is a more specific use-case, in that only the process that locked the mutex is supposed to unlock it. This constraint makes it possible to implement some additional features in mutexes:

Serializing tokens


" Tokens are similar to mutexes in that they can, if used correctly, prevent multiple threads from accessing a shared resource at the same time. Unlike mutexes, however, they do NOT exclude other threads from accessing the resource while they are blocked or asleep. "

so if you acquire token A, then do something, then acquire token B, then you might block while trying to acquire token B, and while you are blocked, the resource controlled by token A becomes unlocked and other processes can use it/modify it under you (so after you do any blocking operation, such as acquiring token B, you'd better check to make sure that any previously acquired resources haven't been modified by someone else in the meantime, if that matters to you; the best idiom is to try to acquire all tokens that you will need at once at the beginning of your function)

readers-writer lock

generalization of semaphore:

when the writer lock is held, no one else can acquire either the writer lock or a reader lock

when a reader lock is held, anyone else can acquire another reader lock, but the writer lock cannot be acquired

when a thread is blocked trying to acquire the writer lock, no reader lock can be acquired


the monitor is an object whose methods are all critical sections, e.g. there is a mutex associated with each monitor which must be acquired when you call a method, and which is released before the method returns

built out of mutexes

other data structures



side note: there is an interesting connection between reversible computing and transactions: anything done within a transaction must be reversible (undo-able; rollback-able)



"three fundamental concepts crop up again and again: isolation (of state), atomicity (of state transitions), and consistency (of those atomic transitions) ...

The canonical syntactic footprint of TM is also beautiful and simple. You say:

atomic { ... concurrency-safe code goes here ... }

And everything in that block is magically concurrency-safe. (Well, you still need to ensure the consistency part, but isolation and atomicity are built-in. Mix this with Eiffel- or Spec#-style contracts and assertions like those in .NET 4.0, run at the end of each transaction, and you’re well on your way to verified consistency also. The ‘check E’ work in Haskell was right along these lines.) " ---

Implementation, and some design decisions:

" Turtles, but How Far Down? Or, Bounded vs. Unbounded Transactions

Not all transactions are equal. There is a broad spectrum of TMs, ranging from those that are bounded to updating, say, 4 memory locations or cache lines, to those that are entirely unbounded. ... Models can be pulled along other axes, however, such as whether memory locations must be tagged in order to be used in a transaction or not, etc. Haskell requires this tagging (via TVars) so that side-effects are evident in the type system as with any other kind of monad.

We quickly settled on unbounded transactions. ... hindsight, this was a critical decision that had far-reaching implications. And to be honest, I now frequently doubt that it was the right call. We had our hearts in the right places, and the entire industry was trekking down the same path at the same time (with the notable exception of Haskell). But with the wisdom of age and hindsight, I do believe limited forms of TM could be wildly successful at particular tasks and yet would have avoided many of the biggest challenges with unbounded TM. ... The model of unbounded transactions is the hard part. You surround any arbitrary code with ‘atomic { … }’ and everything just works. It sounds beautiful. But just think about what can happen within a transaction: memory operations, calls to other methods, P/Invokes to Win32, COM Interop, allocation of finalizable objects, I/O calls (disk, network, databases, console, …), GUI operations, lock acquisitions, CAS operations, …, the list goes on and on. ... Three main approaches were seriously considered:

    IL rewriting. Use a tool that passes over the IL post-compilation to inject transaction calls.
    Hook the (JIT) compiler. The runtime itself would intrinsically know how to inject such calls.
    Library-based. All transactional operations would be explicit calls into the TM infrastructure.... we chose approach #2 for our “real” prototype, and never looked back. ...

Throughout, it became abundantly clear that TM, much like generics, was a systemic and platform-wide technology shift. It didn’t require type theory, but the road ahead sure wasn’t going to be easy. ... And we also hit some tough snags early on. Some were trivial, like what happens when an exception is thrown out of an atomic block. Of course that exception was likely constructed within the atomic block (‘throw new SomeException?()’ being the most common form of ‘throw’), so we decided we probably need to smuggle at least some of that exception’s state out of the transaction. Like its stack trace. And perhaps its message string. I wrote the initial incarnation of the CLR exception subsystem support, and stopped at shallow serialization across the boundary. But this was a slippery slope, and eventually the general problem was seen, leading to more generalized nesting models (which I shall describe briefly below). Another snag, which was quite non-trivial, was the impact to the debugging experience. Depending on various implementation choices – like in-place versus buffered writes – you may need to teach the debugger about TM intrinsically. And some of the challenges were fundamental to building a first class TM implementation. Clearly the GC needed to know about TM and its logging, because it needs to keep both the “before” and “after” state of the transaction alive, in case it needed to roll back. The JIT compiler was very quickly splayed open and on the surgery table. And so on. ... There are other overheads that are not so obvious. Optimistic reads mandate that there is a version number for each location somewhere, and pessimistic writes mandate that there is a lock for each location somewhere. ... We quickly settled an approach much like Harris’s (and, at the time, pretty much the industry/research standard): optimistic reads, in-place pessimistic writes, and automatic retry. ... You’re in one atomic block and then enter into another one. What happens if that inner transaction commits or rolls back, before the fate of the outer transaction is known? Intuition guided us to the following answer:

    If the inner transaction rolls back, the outer transaction does not necessarily do so. However, no matter what the outer transaction does, the effects of the inner will not be seen.
    If the inner transaction commits, the effects remain isolated in the outer transaction. It “commits into” the outer transaction, we took to saying. Only if the outer transaction subsequently commits will the inner’s effects be visible; if it rolls back, they are not.... " --

" A Better Condition Variable

Here’s a brief aside on one of TM’s bonus features.

Some TM variants also provide for “condition variable”-like facilities for coordination among threads. I think Haskell was the first such TM to provide a ‘retry’ and ‘orElse’ capability. When a ‘retry’ is encountered, the current transaction is rolled back, and restarted once the condition being sought becomes true. How does the TM subsystem know when that might be? This is an implementation detail, but one obvious choice is to monitor the reads that occurred leading up to the ‘retry’ – those involved in the evaluation of the predicate – and once any of them changes, to reschedule that transaction to run. Of course, it will reevaluate the predicate and, if it has become false, the transaction will ‘retry’ again.

A simple blocking queue could be written this way. For example:

object TakeOne?() { atomic { if (Count == 0) { retry; }

        return Pop();

If, upon entering the atomic block, Count is witnessed as being zero, we issue a retry. The transaction subsystem notices we read Count with a particular version number, and then blocks the current transaction until Count’s associated version number changes. The transaction is then rescheduled, and races to read Count once again. After Count is seen as non-zero, the Pop is attempted. The Pop, of course, may fail because of a race – i.e. we read Count optimistically without blocking out writers – but the usual transaction automatic-reattempt logic will kick in to mask the race in that case.

The ‘orElse’ feature is a bit less obvious, though still rather useful. It enables choice among multiple transactions, each of which may end up issuing a ‘retry’. I don’t think I’ve seen it in any TMs except for ours and Haskell’s.

To illustrate, imagine we’ve got 3 blocking queues like the one above. Now imagine we’d like to take from the first of those three that becomes non-empty. ‘orElse’ makes this simple:

BlockingQueue? bq1 = ..., bq2 = ..., bq3 = ...;

atomic { object obj = orElse { bq1.TakeOne?(), bq2.TakeOne?(), bq3.TakeOne?() }; }

While ‘orElse’ is perhaps an optional feature, you simply can’t write certain kinds of multithreaded algorithms without ‘retry’. Anything that requires cross-thread communication would need to use spin variables. " --

" Deliberate Plans of Action: Policy

I waved my hands a bit above perhaps without you even knowing it. When I talk about optimistic, pessimistic, and automatic retry, I am baking in a whole lot of policy. ... There was a similar problem when deciding to back off outer layers of nesting, and in fact this becomes more complicated when deadlocks are involved. Imagine:

atomic { atomic { x++; y++; atomic { atomic { y++; x++; } } } }

This deadlock-prone example is tricky because rolling back the inner-most transactions won’t be sufficient to break the deadlock that may occur. Instead the TM policy manager needs to detect that multiple levels of nesting are involved and must be blown away in order to unstick forward progress. " --

Cons: " Disillusionment Part I: the I/O Problem...what about a read or write from a single block or entire file on the filesystem? Or output to the console? Or an entry in the Event Log? What about a web service call across the LAN? Allocation of some native memory?...The answer seemed clear. At least in theory. The transaction literature, including Reuter and Gray’s classic, had a simple answer for such things: on-commit and on-rollback actions, to perform or compensate the logical action being performed, respectively. (Around this same time, the Haskell folks were creating just this capability in their STM, where ‘onCommit’ and ‘onRollback’ would take arbitrary lambdas to execute the said operation at the proper time.) " --

" For example, how would you treat a lock block that was called from within a transaction? (You might say “that’s just not supported”, but when adding TM to an existing, large ecosystem of software, you’ve got to draw the compatibility line somewhere. If you draw it too carelessly, large swaths of existing software will not work; and in this case, that often meant that we claimed to provide unbounded transactions, and yet we would be placing bounds on them such that a lot of existing software could not be composed with transactions. Not good.) " --

" Disillusionment Part II: Weak or Strong Atomicity?

All along, we had this problem nipping at our heels. What happens if code accesses the same memory locations from inside and outside a transaction? We certainly expected this to happen over the life of a program: state surely transitions from public and shared among threads to private to a single thread regularly. But if some location were to be accessed transactionally and non-transactionally concurrently, at once, we’d (presumably) have a real mess on our hands. A supposedly atomic, isolated, etc. transaction would no longer be protected from the evils of racey code.

For example:

atomic { // Tx0     x++; // No-Tx

... Another approach was static analysis. We could require transactional locations to be tagged, for example. This had the unfortunate consequence of making reusable data structures less, well, reusable. Collections for example presumably need to be usable from within and outside transactions alike. ...

bool itIsOwned = false;
MyObj x = new MyObj();


Disillusionment Part III: the Privatization Problem
atomic { // Tx0                         atomic { // Tx1
    // Claim the state for my use:          if (!itIsOwned)
    itIsOwned = true;                           x.field += 42;
}                                       }

int z = x.field;

The Tx0 transaction changes itIsOwned to true, and then commits. After it has committed, it proceeds to using whatever state was claimed (in this case an object referred to by variable x) outside of the purview of TM. Meanwhile, another transaction Tx1 has optimistically read itIsOwned as false, and has gone ahead to use x. An update in-place system will allow that transaction to freely change the state of x. Of course, it will roll back here, because isItOwned changed to true. But by then it is too late: the other thread using x outside of a transaction will see constantly changing state – torn reads even – and who knows what will happen from there. A known flaw in any weakly atomic, update in-place TM.

If this example appears contrived, it’s not. It shows up in many circumstances. The first one in which we noticed it was when one transaction removes a node from a linked list, while another transaction is traversing that same list. If the former thread believes it “owns” the removed element simply because it took it out of the list, someone’s going to be disappointed when its state continues to change.

This, we realized, is just part and parcel of an optimistic TM system that does in-place writes. I don’t know that we ever fully recovered from this blow. It was a tough pill to swallow. After that meeting, everything changed: a somber mood was present and I think we all needed a drink. Nevertheless we plowed forward.

We explored a number of alternatives. And so did the industry at large, because that intern in question published a paper on the problem. One obvious solution is to have a transaction that commits a change to a particular location wait until all transactions that have possibly read that location have completed – a technique we called quiescence. We experimented with this approach, but it was extraordinarily complicated, for obvious reasons.

We experimented with blends of pessimistic operations instead of optimistic, alternative commit protocols, like using a “commit ticket” approach that serializes transaction commits, each of which tended to sacrifice performance greatly. Eventually the team decided to do buffered writes instead of in-place writes, because any concurrent modifications in a transaction will simply not modify the actual memory being used outside of the transaction unless that transaction successfully commits.

This, however, led to still other problems, like the granular loss of atomicity problem. Depending on the granularity of your buffered writes – we chose object-level – you can end up with false sharing of memory locations between transactional and non-transactional code. Imagine you update two separate fields of an object from within and outside a transaction, respectively, concurrently. Is this legal? Perhaps not. The transaction may bundle state updates to the whole object, rather than just one field.

All these snags led to the realization that we direly needed a memory model for TM. " --


non-blocking algorithms

(i think this concept is only relevant for shared memory, but am i right?)




wait free algorithms

" Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes...It was shown in the 1980s[4] that all algorithms can be implemented wait-free, and many transformations from serial code, called universal constructions, have been demonstrated. However, the resulting performance does not in general match even naïve blocking designs. Several papers have since improved the performance of universal constructions, but still, their performance is far below blocking designs.

Several papers have investigated the hardness of creating wait-free algorithms. For example, it has been shown[5] that the widely available atomic conditional primitives, CAS and LL/SC, cannot provide starvation-free implementations of many common data structures without memory costs growing linearly in the number of threads. But in practice these lower bounds do not present a real barrier as spending a word per thread in the shared memory is not considered too costly for practical systems.

Until 2011, wait-free algorithms were rare, both in research and in practice. However, in 2011 Kogan and Petrank[6] presented a wait-free queue building on the CAS primitive, generally available on common hardware. Their construction expands the lock-free queue of Michael and Scott,[7] which is an efficient queue often used in practice. A follow-up paper by Kogan and Petrank[8] provided a methodology for making wait-free algorithms fast and used this methodology to make the wait-free queue practically as fast as its lock-free counterpart. " --



Let's say you have a linked list being read by a bunch of concurrent readers processes and you want to delete one of the items in the middle.

One way would be to say, anyone accessing the list, read or write, must first acquire a lock.

Another way, the read-copy-update (RCU) way, would be to say, anyone reading the list must first make an entry in a shared log saying that they are about to read it. When they are done, they make another entry in the shared log saying they are done. When you want to delete a node, first you atomically change the 'next' pointer at the previous node to leave out the node to be deleted. Now you look at the log and identify all of the readers who have started reading and haven't finished. When all of these have finished, you can delete the node. Note that while you're waiting, other readers can start reading, and this doesn't delay you further, because you know they will not see the node that is about to be deleted, because you already changes the 'next' pointer.




lamport's observation that it seems to be a physical fact that arbiters take longer to arbitrate if the events they are given arrive closer together in time

avoiding sharing with shared memory

uniqueness types

immutable data


Chapter: consistency

Eventual consistency

also called best-effort consistency (cite Unify: A scalable, loosely-coupled, distributed shared memory multicomputer; that article reserves the term 'eventual consistency' when there is a maximum time duration until consistency)

Last write wins

Logical clocks consistency schemes

Vector clocks

Consider the following example (slightly modified from

Alice, Ben, Cathy, and Dave are planning to meet next week for dinner. Dave Alice emails everyone and suggests they meet on Wednesday. Ben emails Dave and says he can do any day, but he'd prefer Tuesday or Thursday. Dave replies privately to Ben that he'd prefer Tuesday, so unless someone can't make it, it'll be Tuesday. Then Cathy emails Dave and says she absolutely can't make any day but Thursday, so Dave replies privately to Cathy that it'll be Thursday. Then Dave gets sick and stops checking his email. Alice emails everyone again to find out what day was decided, but she gets mixed messages. Cathy claims to have settled on Thursday with Dave, and Ben claims to have settled on Tuesday with Dave. Dave can't be reached, and Ben and Cathy's computers' clocks are unreliable, so their email timestamps can't be used to determine which email from Dave was sent latest. So none of Alice, Ben, and Cathy know which day has been chosen (assume for the purpose of the example that they can't just work it out themselves).

Consider the decision about which day it is to be a document with different versions. We want to know which version is most recent. The vector clock solution is this: a list of clocks is appended to each message. A 'clock' is just an integer. There is one clock for each person who has seen any previous version of the message. Whenever a person is about to send a message, they increase their clock.

So in the examples, the messages are:

Dave (to all): "Dinner? Vector clock: {Dave: 1}" Alice (to all): "Yeah how about Wednesday? Vector clock: {Alice: 1, Dave: 1}" Ben (to Dave and Alice): "I can do Wednesday, but Tuesday or Thursday would be a bit better. Vector clock: {Alice: 1, Ben: 1, Dave: 1}" Dave (to Ben): "I like Tuesday too if no one else minds. Vector clock: {Alice: 1, Ben: 1, Dave: 2}" Cathy (to Dave): "I can only do Thursday. Is that okay? Vector clock: {Alice: 1, Cathy: 1, Dave: 1}" Dave (to Cathy): "OK. Vector clock: {Alice: 1, Ben: 1, Cathy: 1, Dave: 3}"

Now even if Dave goes silent, Ben and Cathy can compare the latest messages they've seen, which would be their replies from Dave, and determine that the latest version, that is, the version which is a descendent of every earlier version, is Dave's message to Cathy confirming Thursday. They can agree that since Dave has seen everything that everyone had said before replying to Cathy that Thursday would be okay, that Thursday is reasonable.

Vector clocks can also distinguish the case in which there is no single 'latest version' (multiple conflicting versions). For example, if instead of emailing Dave about Tuesday and Thursday, Ben had emailed only Alice, and she had replied, we would have had:

Dave (to all): "Dinner? Vector clock: {Dave: 1}" Alice (to all): "Yeah how about Wednesday? Vector clock: {Alice: 1, Dave: 1}" Ben (to Alice): "I can do Wednesday, but Tuesday or Thursday would be a bit better. Vector clock: {Alice: 1, Ben: 1, Dave: 1}" Alice (to Ben): "I like Tuesday too if no one else minds. Vector clock: {Alice: 2, Ben: 1, Dave: 1}" Cathy (to Dave): "I can only do Thursday. Is that okay? Vector clock: {Alice: 1, Cathy: 1, Dave: 1}" Dave (to Cathy): "OK. Vector clock: {Alice: 1, Cathy: 1, Dave: 2}"

Now if Ben and Cathy compare the latest messages they've seen, they see:

"I like Tuesday too if no one else minds. Vector clock: {Alice: 2, Ben: 1, Dave: 1}"


"OK. Vector clock: {Alice: 1, Cathy: 1, Dave: 2}"

Neither of these has the property that the clocks in its vector clock are all greater than or equal to the corresponding clocks in the other. So neither is a descendent of the other; we have a version conflict.

One problem with vector clocks is that they are not scalable, because each message must be padded with metadata whose length is worst-case proportional to the total number of processes in the system. Charron-Bost proved that there is no scalable system (one using clocks with less entries than the total number of processes in the system) that can correctly determine whether or not one message is a descendent of the other in all cases (cite Bernadette Charron-Bost. Concerning the Size of Logical Clocks in Distributed Systems. Information Processing Letters 01/1991; 39:11-16).

Another problem is that if someone is also interacting with some of the same people on a different document, e must remember the last clock number that he sent out for each conversation.

(note: technically speaking i think vector clocks are supposed to be incremented by the receiver, not the sender but in the examples above i had the sender increment)

Lamport clocks

Lamport clocks are like vector clocks where there is only one integer attached to each message. Like vector clocks, each process also keeps track of one integer. They work like this:

In the previous example, this would be

Dave (to all): "When shall we meet? Lamport clock: 0" Alice (to all): "Let's meet on Wednesday. Lamport clock: 1" Ben (to Dave): "I can do Wednesday, but Tuesday or Thursday would be a bit better. Lamport clock: 2" Dave (to Ben): "I like Tuesday too if no one else minds. Lamport clock: 3" Cathy (to Dave): "I can only do Thursday. Is that okay? Lamport clock: 2" Dave (to Cathy): "OK Thursday it is then. Lamport clock: 4"

Lamport clocks have the property that if one message is a descendent of another. the descendent will always have a greater clock value than its ancestor. But if the messages are incomparable (neither is a descendent of another), the Lamport clock might be equal for both of them, or it might be greater for either one of them.

Lamport clocks are scalable, unlike vector clocks. The functionality that vector clocks have that this doesn't is that you can't always tell if there is no single latest version (if there is a conflict). If there is no version which is a descendent of all other versions, it is still possible for one message to have a greater Lamport clock value.

Other clock-ish schemes

"Plausible clocks" are clock schemes in which, if one message is a descendent of the other, the clock scheme says so; but it might also say that even if the two messages are incomparable (incomparable here is sometimes called "concurrent").

"Matrix clocks" are when each process remembers, for each other process that it receives messages from, the latest vector clock that it saw in any message from that other process. In other words, it keeps track not only of what it has seen, but also of what other processes have seen.

"Version vectors" are like vector clocks but for synchronization of documents; when two parties synch, they set the clocks in the document that they synched to the max of their individual clocks.

You can 'prune' the vectors in vector clocks; in the vector, for each clock, have a 'last modified' timestamp (it's okay that the processes' clocks aren't synchronized, this timestamp is only an optimization, if it's wrong it won't affect correctness). Now when the vector clock gets 'too big', delete the oldest counters from the vector. If you compare the resulting document to a very old ancestor, the new document won't appear to be an ancestor because it is missing some clocks that the old one has; so this system will sometimes mistakenly tell you that there is no single latest version even when there is. But it will never tell you that there is a latest version when there isn't one.

Interval tree clocks


The hard part is conflict resolution

If one is using vector clocks as part of providing some sort of consistency, all the clocks do is to detect conflicts. You still have to find some way to resolve those conflicts, and how conflicts should be resolved is an application specific question.


" if any two versions of the document can be merged associatively, commutatively, and idempotently; e.g. they form a CRDT " -- [25]

Some CRDTs:


Gossip protocols

Chapter: distributed systems

types of network failures you have to worry about a lot









CAP theorem

The CAP theorem says that you cannot have a database that guarantees consistency and availability and partition tolerance.

What the theorem actually says

Since the meaning of the terms in the theorem are often misunderstood, we provide excerpts from the paper Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services

Partially synchronous model: "every node has a clock, and all clocks increase at the same rate. However, the clocks themselves are not synchronized, in that they may display different values at the same real time...A local timer can be used to schedule an action to occur a certain interval of time after some other event. Furthermore, assume that every message is either delivered within a given, known time..., or it is lost. Also, every node processes a received message within a given, known time..., and local processing takes zero time"


"Atomic [5], or linearizable [4], consistency ...Under this consistency guarantee, there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time. One important property of an atomic read/write shared memory is that any read operation that begins after a write operation completes must return that value, or the result of a later write operation...Discussing atomic consistency is somewhat different than talking about an ACID database, as database consistency refers to transactions, while atomic consistency refers only to a property of a single request/response operation sequence. And it has a different meaning than the Atomic in ACID, as it subsumes the database notions of both Atomic and Consistent"

Availability: "every request received by a non-failing node in the system must result in a response"

Partition Tolerance: The above definitions of availability and atomicity are qualified by the need to tolerate partitions...When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost"

" It is impossible in the partially synchronous network model to implement a read/write data object that guarantees the following properties:

in all executions (even those in which messages are lost).

It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties:

Summary of the proof (of a different but similar theorem in the same paper): "The basic idea of the proof is to assume that all messages between G1 and G2 are lost. Then if a write occurs in G1, and later a read occurs in G2, then the read operation cannot return the results of the earlier write operation."

What it does not say

CAP does not say that no nodes from a distributed consistent database can continue to serve in the case of a network partition. As Nick Lavezzo points out, nodes in a minority partition could simply detect that they have been partitioned and respond to all requests with an error message. If a majority partition exists with writable copies of all data, then clients who can connect to the majority partition can still read and write data. The CAP theorem is simply saying that not ALL nodes can continue to accept updates during a partition if consistency is to be maintained.

The CAP theorem does not say that a distributed database cannot be consistent and available during times when there is no partition.

The CAP theorem does not say that a distributed database must choose, in the event of a partition, C or A for all operations; the choice may be made differently for each operation, and may depend on the user or data involved. And "all three properties are more continuous than binary. Availability is obviously continuous from 0 to 100 percent, but there are also many levels of consistency, and even partitions have nuances, including disagreement within the system about whether a partition exists." ([27]).

The CAP theorem does not say that nodes cannot respond to requests at all, or cannot respond selectively to read requests during a partition. Nodes are still permitted to return error messages, to return stale data, or to accept read requests but refuse writes. As Brewer says, the Availability in CAP means availability for updates.




zero-trust global consensus algorithms e.g. bitcoin (but this is getting far away from programming language constructs...)

Chapter: scalability

scalability of message passing

depends on model of time

scalability of shared memory

building shared memory out of message passing

scalability of routing

and connectivity

small world networks

Chapter: concurrent control flow constructs

map reduce

map associative reduce

amorphous medium

imagine that there are a zillion nodes in some unknown topology. Notes can directly communicate with their neighbors. One might imagine that this is a discrete approximation to a continuous fluid of nodes. Primitives provided include:

a similar system which is explicitly for the purpose of sensor networks is Regiment:

Chapter: todo dont know where to put these yet

Bulk synchronous parallel (BSP)

tuple spaces

(what is the event (re)ordering model of tuple spaces?)

serializable continuations


" General purpose cores Flexible multithreading 2. Vector hardware Vector parallelism 3. GPUs Restrictive data parallelism 4. FPGAs Customized dataflow 5. Custom accelerators Various forms "

Chapter : Safety for concurrency

caller blocking allows shared memory

If thread A calls a function F in thread B, and blocks until the function in thread B returns control, then the caller can pass in pointers as some of the parameters to F; there is no danger of both threads accessing these pointers simultaneously because thread A is blocked until F completes, provided that F is not allowed to create aliases of these pointers and store them in a memory location that survives past the completion of F (which can be accomplished by a type system that supports uniqueness types) [28].

type systems for concurrency

todo unpack comment by Alexandrescu in [29]

milewski's owner types

and uniqueness types

linear logic

..and that guy's system for uniqueness of each end of a channel

disciplined parallelism

View memory models (event (re)ordering models) in terms of: if you have a shared memory with this model, what is the set of rules that you can follow when programming such that, if you follow those rules, you can imagine that the memory is sequentially consistent?

A data-race-free memory model guarantees sequential consistency for any data-race-free program. A data-race-free memory model might also guarantee sequential consistency for a program with races, as long as the variables participating in races are identified as such (the terminology is that such variables are not 'data variables', but 'synchronization variables').

Chapter : languages and libraries


Languages i've heard about in multiple places todo:


Others (for me to look at todo):

" here are already a large num- ber of research and commercial projects developing new disciplined parallel programming models for determinis- tic and non-deterministic algorithms [5]; e.g., Ct [24], CnC? [17], Cilk++ [11], Galois [33], SharC? [7], Kendo [44], Prometheus [6], Grace [10], Axum [26], and DPJ [14]. Most of these, including all but one of the commercial sys- tems, guarantee the absence of data races for programs that type-check, satisfying the first requirement of our work im- mediately. Moreover, most of these also enforce a require- ment of structured parallel control (e.g., a nested fork join model, pipelining, etc.), which is much easier to reason about than arbitrary (unstructured) thread synchronization. "

" 7. Related Work Type and Effect Systems: Several researchers have described ef- fect systems for enforcing a locking discipline in nondeter ministic programs that prevents data races and deadlocks [5, 20, 34] o r guar- antees isolation for critical sections [29]. Matsakis et al . [41] have recently proposed a type system that guarantees race-freed om for locks and other synchronization constructs using a constru ct called an “interval” for expressing parallelism. While there is so me over- lap with our work in the guarantees provided (race freedom, d ead- lock freedom, and isolation), the mechanisms are very diffe rent (ex- plicit synchronization vs. atomic statements supported by STM). Further, these systems do not provide determinism by defaul t. Fi- nally, there is no other effect system we know of that provide s both race freedom and strong isolation together ...

Beckman et al. [13] show how to use access permissions to re- move STM synchronization overhead. While the goals are the s ame as ours, the mechanisms are different (alias control vs. typ e and effect annotations). The two mechanisms have different tra deoffs in expressivity and power: for example, Beckman et al.’s met hod can eliminate write barriers only if an object is accessed th rough a unique reference, whereas our system can eliminate barrie rs for access through shared references, so long as the access does not cause interfering effects. However, alias restrictions ca n express some patterns (such as permuting unique references in a data struc ture) that our system cannot. As future work, it would be inte resting to explore these tradeoffs further ... Nondeterministic Parallel Programming: Several research efforts are developing parallel models for nondeterminist ic codes with irregular data access patterns, such as Delaunay mesh r efine- ment. Galois [36] provides a form of isolation, but with iter ations of parallel loops (instead of atomic statements) as the isolat ed compu- tations. Concurrency is increased by detecting conflicts at the level of method calls, instead of reads and writes, and using seman tic commutativity properties. Lublinerman et al. [39] have pro posed object assemblies as an alternative model for expressing irregular, graph-based computations ... Kulkarni et al. [35] have recently proposed task types as a way of enforcing a property they call pervasive atomicity . This work shares with ours the broad goal of reducing the number of concurrent interleavings the programmer must consider. Ho wever, Kulkarni et al. adopt an actor-inspired approach, in which d ata is non-shared by default, and sharing musk occur through speci al “task objects.” This is in contrast to our approach of allowi ng familiar shared-memory patterns of programming, but using effect annotations to enforce safety properties. Finally, none of the work discussed above provides any deterministic-by-default gu arantee. "


Concurrency in electronics

clock domains


flip-flops ("latches"): setup and hold time; metastability; putting two flip-flips in a row


boundary between two clock domains






Byzantine faults

what is consensus good for?

a) " ongardie 19 hours ago

You end up needing consensus for a lot of fault-tolerant systems that need to provide consistent results. For example, if your system allows users to choose their own usernames, and you're trying to guarantee that all usernames are unique, and your system needs to automatically deal with server failures, then I think you also need consensus.

Another way to think about it is that consensus gets you the equivalent of a compare-and-swap operation in a distributed setting. Just as compare-and-swap is useful for building synchronization primitives with shared memory, consensus is useful for building synchronization primitives across a network.


reply "

b) "Consensus protocols are the basis for the state machine replication approach to distributed computing" -- [30] "a replicated log for consensus." -- [31]

" Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers are available; for example, a cluster of 5 servers can continue to operate even if 2 servers fail. If more servers fail, they stop making progress (but will never return an incorrect result).

Consensus typically arises in the context of replicated state machines, a general approach to building fault-tolerant systems. Each server has a state machine and a log. The state machine is the component that we want to make fault-tolerant, such as a hash table. It will appear to clients that they are interacting with a single, reliable state machine, even if a minority of the servers in the cluster fail. Each state machine takes as input commands from its log. In our hash table example, the log would include commands like set x to 3. A consensus algorithm is used to agree on the commands in the servers' logs. The consensus algorithm must ensure that if any state machine applies set x to 3 as the nth command, no other state machine will ever apply a different nth command. As a result, each state machine processes the same series of commands and thus produces the same series of results and arrives at the same series of states. " -- [32]


an interesting thing that i am seeing all over is the application of the 'referential transparency'/'purely functional' idea to larger systems, in a few specific ways. First, the software is architected as a stateless (purely functional) computation that is given events and access to external state as input and spits out actions as output. Second, each memory location in the external state doesn't just hold the current value, but also all past values, with some sort of timestamp (or equivalently (in this model), integer sequence number). Not all of these examples do exactly that, but they are all related:


Criticism of this pattern, in databases:

the propagator model

"The Propagator Programming Model is built on the idea that the basic computational elements are autonomous machines interconnected by shared cells through which they communicate. Each machine continuously examines the cells it is interested in, and adds information to some based on computations it can make from information from the others. Cells accumulate information from the propagators that produce that information. The key idea here is additivity.


The ingredients of a propagator network are cells and propagators. The cells' job is to remember things; the propagators' job is to compute. The analogy is that propagators are like the procedures of a traditional programming language, and cells are like the memory locations; the big difference is that cells accumulate partial information (which may involve arbitrary internal computations), and can therefore have many propagators reading information from them and writing information to them.


 Cells must support three operations:
    add some content
    collect the content currently accumulated
    register a propagator to be notified when the accumulated content changes

When new content is added to a cell, the cell must merge the addition with the content already present. When a propagator asks for the content of a cell, the cell must deliver a complete summary of the information that has been added to it.

The merging of content must be commutative, associative, and idempotent. The behavior of propagators must be monotonic with respect to the lattice induced by the merge operation. "


toread electronics links

Concurrency in various languages and in standard libraries of various languages

channel mutex semaphore condition variable

Chapter: Links


possible links

not sure if these are essential enough to merit a link, todo check them out:

PPT] . Has information on big-O notation traffic costs of various synchronization primitives.


"Transactional memory support is what it sounds like: hardware support for transactions. This is through three new instructions, xbegin, xend, and xabort." [39]


" Of all the ways of doing concurrency, callbacks are by far the worst, Twisted was plagued by them and is the main reason why it failed, and that was with a much more sane and reasonable language like Python (stackless Python was a much better alternative and used a model similar to Go’s CSP).

And the sad thing is that there are much better alternatives around with much more sound models and environments, Erlang and Go are the two obvious examples, and that is for the highly specialized situations where you have great concurrency needs, for any other problem anything else will be much better than Node.js, even PHP.

– uriel, in response to Why Node.JS is absolutely terrible, by Hasen el Judy " --



summary of :

" shared-state multithreading...are a bad idea...make local reasoning a single-tasking, nonconcurrent system...When you’re looking at a routine that manipulates some state..., To imagine the different states, you need only to read the routine and imagine executing its instructions in order from top to bottom. This means that the number of instructions you must consider is n, where n is the number of instructions in the routine. By contrast, in a system with arbitrary concurrent execution - one where multiple threads might concurrently execute this routine with the same state - you have to read the method in every possible order, making the complexity nn. ... Those of you who actually use threads to write real software are probably objecting at this point. “Nobody would actually try to write free-threading code like this,” I can hear you complain, “Of course we’d use a lock or a queue to introduce some critical sections if we’re manipulating state.”

Mutexes can help mitigate this combinatorial explosion, but they can’t eliminate it, and they come with their own cost; you need to develop strategies to ensure consistent ordering of their acquisition. Mutexes should really be used to build queues, and to avoid deadlocks those queues should be non-blocking but eventually a system which communicates exclusively through non-blocking queues effectively becomes a set of communicating event loops, and its problems revert to those of an event-driven system; it doesn’t look like regular programming with threads any more. ... But even if you build such a system, if you’re using a language like Python (or the ones detailed above) where modules, classes, and methods are all globally shared, mutable state, it’s always possible to make an error that will affect the behavior of your whole program without even realizing that you’re interacting with state at all. ... What are you going to do instead?

There’s a lot of debate over the best way to do “asynchronous” programming - that is to say, “not threads”, four options are often presented.

    Straight callbacks: Twisted’s IProtocol, JavaScript’s on<foo> idiom, where you give a callback to something which will call it later and then return control to something (usually a main loop) which will execute those callbacks,
    “Managed” callbacks, or Futures: Twisted’s Deferred, JavaScript’s Promises/A[+], E’s Promises, where you create a dedicated result-that-will-be-available-in-the-future object and return it for the caller to add callbacks to,
    Explicit coroutines: Twisted’s @inlineCallbacks, Tulip’s yield from coroutines, C#’s async/await, where you have a syntactic feature that explicitly suspends the current routine,
    and finally, implicit coroutines: Java’s “green threads”, Twisted’s Corotwine, eventlet, gevent, where any function may switch the entire stack of the current thread of control by calling a function which suspends it.

One of these things is not like the others; one of these things just doesn’t belong.


Options 1-3 are all ways of representing the cooperative transfer of control within a stateful system. They are a semantic improvement over threads. Callbacks, Futures, and Yield-based coroutines all allow for local reasoning about concurrent operations.

So why does option 4 even show up in this list? ... it’s a bit of a distraction from the much bigger advantage of event-driven programming, which is simply that it’s easier to write programs at scale, in both senses (that is, programs containing lots of code as well as programs which have many concurrent users). ... A system that presents “implicit coroutines” – those which may transfer control to another concurrent task at any layer of the stack without any syntactic indication that this may happen – are simply the dubious optimization by itself. ... When you look at the implementation of a potentially concurrent routine written using callbacks or yielding coroutines, you can visually see exactly where it might yield control, either to other routines, or perhaps even re-enter the same routine concurrently. If you are using callbacks – managed or otherwise – you will see a return statement, or the termination of a routine, which allows execution of the main loop to potentially continue. If you’re using explicit coroutines, you’ll see a yield (or await) statement which suspends the coroutine ... ((example bug:)) ... As it happens, this is the same variety of example Guido van Rossum gives when he describes why chose to use explicit coroutines instead of green threads for the upcoming standard library asyncio module, born out of the “tulip” project, so it's happened to more than one person in real life. ... Let’s say we have this program:

def transfer(amount, payer, payee, server): if not payer.sufficient_funds_for_withdrawl(amount): raise InsufficientFunds?() log("{payer} has sufficient funds.", payer=payer) payee.deposit(amount) log("{payee} received payment", payee=payee) payer.withdraw(amount) log("{payer} made payment", payer=payer) server.update_balances([payer, payee])

In a world without concurrency, this is of course correct. ...

 But if we were to run transfer with the same two accounts in an arbitrary number of threads simultaneously, it is (obviously, I hope) wrong. One thread could update a payer’s balance below the funds-sufficient threshold after the check to see if they’re sufficient, but before issuing the withdrawl.

So, let’s make it concurrent, in the PEP 3156 style. That update_balances routine looks like it probably has to do some network communication and block, so let’s consider that it is as follows:

@coroutine def transfer(amount, payer, payee, server): if not payer.sufficient_funds_for_withdrawl(amount): raise InsufficientFunds?() log("{payer} has sufficient funds.", payer=payer) payee.deposit(amount) log("{payee} received payment", payee=payee) payer.withdraw(amount) log("{payer} made payment", payer=payer) yield from server.update_balances([payer, payee]) let’s make another, subtler code change: our hypothetical operations team has requested that we put all of our log messages into a networked log-gathering system for analysis. A reasonable request, so we alter the implementation of log to write to the network.

Now, what will we have to do to modify the green-threaded version of this code? Nothing! ... ((but, by contrast, in the cooperative coroutine version:)) In order to update this routine for a non-blocking version of log, we had to type a yield keyword between the sufficient_funds_for_withdrawl check and the withdraw call, between the deposit and the withdraw call, and between the withdraw and update_balances call. If we know a little about concurrency and a little about what this program is doing, we know that every one of those yield froms are a potential problem. If those log calls start to back up and block, a payer may have their account checked for sufficient funds, then funds could be deducted while a log message is going on, leaving them with a negative balance....the mechanical act of typing these out is an opportunity to notice that something’s wrong, both now and later. Even if we get all the way through making the changes without realizing the problem, when we notice that balances are off, we can look only (reasoning locally!) at the transfer routine and realize, when we look at it, based on the presence of the yield from keywords, that there is something wrong with the transfer routine itself, regardless of the behavior of any of the things it’s calling.

 Tedious as it may be, the asynchronousness of an individual function is, in fact, something that all of its callers must be aware of, just as they must be aware of its arguments and its return type.

In fact you are changing its return type: in Twisted, that return type would be Deferred, and in Tulip, that return type is a new flavor of generator. This new return type represents the new semantics that happen when you make a function start having concurrency implications.

Haskell does this as well, by embedding the IO monad in the return type of any function which needs to have side-effects. This is what certain people mean when they say Deferreds are a Monad.

The main difference between lightweight and heavyweight threads is that it is that, with rigorous application of strict principles like “never share any state unnecessarily”, and “always write tests for every routine at every point where it might suspend”, lightweight threads make it at least possible to write a program that will behave deterministically and correctly, assuming you understand it in its entirety. When you find a surprising bug in production, because a routine that is now suspending in a place it wasn’t before, it’s possible with a lightweight threading system to write a deterministic test that will exercise that code path. With heavyweight threads, any line could be the position of a context switch at any time, so it’s just not tractable to write tests for every possible order of execution. " --

another post making similar points is


" smsm42 4 days ago [-]

> the mental model works better for asynchrony; instead of describing a series of steps to follow, and treating the interrupt as the exception, you describe the processes that should be undertaken in certain circumstances

I could never understand how this works better as a mental model. Say you ask somebody to buy you a gadget in a store they don't know. What do you do tell them:

a) "drive in your car on this street, turn left on Prune street, turn right on Elm street, the store will be after the second light. Go there, find "Gadgets" isle, on the second shelf in the middle there would be a green gadget saying "Magnificent Gadget", buy it and bring it home"


b) when you find yourself at home, go to car. When you find yourself in the car, if you have a gadget, drive home, otherwise if you're on Elm street, drive in direction of Prune Street. If you're in the crossing of Elm street and Prune street, turn to Prune street if you have a gadget but to Elm street if you don't. When you are on Prune street, count the lights. When the light count reaches two, if you're on Prune street, then stop and exit the vehicle. If you're outside the vehicle and on Prune street and have no gadget, locate store and enter it, otherwise enter the vehicle. If you're in the store and have no gadget then start counting shelves, otherwise proceed to checkout. Etc. etc. - I can't even finish it!

I don't see how "steps to follow" is not the most natural mental model for humans to achieve things - we're using it every day! We sometimes do go event-driven - like, if you're driving and somebody calls, you may perform event-driven routine "answer the phone and talk to your wife" or "ignore the call and remember to call back when you arrive", etc. But again, most of these routines will be series of steps, only triggered by an event.


" -- [40]


" Communicating sequential processes (Go, PHP+MySql?) makes IO have a modestly simpler synchronous syntax at the cost of communicating between I/O operations much more complex (sending a message to a port or performing some sort of transaction instead of just assigning to a value). It's a tradeoff. " -- [41]


difference between actors and CSP:

both are 'message passing' models as opposed to 'shared state'

"The difference between the CSP program and the Actor system is that the Actor Z does not get messages using a guarded choice command from multiple channels. Instead it processes messages in arrival ordering, and by the laws for Actor systems, the stop message is guaranteed to arrive." -- [42]




then there's also pi-calculus "The π-calculus, partially inspired by the Actor model as described by Milner above, introduced dynamic topology into the process calculi by allowing dynamic creation of processes and for the names to be passed among different processes. However, the goal of Milner and Hoare to attain an algebraic calculus led to a critical divergence from the Actor model: communication in the process calculi is not direct as in the Actor model but rather indirectly through channels " [43]



CSP link:

"The concurrency primitives of CSP were input, output, guarded commands, and parallel composition whereas the Actor model is based on asynchronous one-way messaging." [44]


Actor implementation: Akka


Petri nets



"Leading examples of process calculi include CSP, CCS, ACP, and LOTOS.[1] More recent additions to the family include the π-calculus, the ambient calculus, PEPA, the fusion calculus and the join-calculus." [45]

"CCS, CSP, and ACP constitute the three major branches of the process calculi family: the majority of the other process calculi can trace their roots to one of these three calculi" [46]

"Various process calculi have been studied and not all of them fit the paradigm sketched here. The most prominent example may be the ambient calculus. " [47]


"The use of channels for communication is one of the features distinguishing the process calculi from other models of concurrency, such as Petri nets and the Actor model (see Actor model and process calculi)." [48]


" Here is how I think Erlang works. I believe Akka is very similar.

Each process has a single mailbox. Messages are put into the receiver's mailbox by the sender, and fetched by the receiver using pattern matching. This matching process can change message ordering in the sense that the oldest message in a mailbox may not match, but a younger one does. In this case the younger one is consumed first. Other than that, message ordering is preserved.

With this in mind, the asynchronous π -calculus extended with input pattern matching input from buffer describes Erlang (and hence Akka) semantics accurately, although one needs to do a bit of work in the encoding, since the π-calculus doesn't have the restriction to single channels per process. However, one usually doesn't want an encoding, but rather a calculus that models the target system directly. Such a calculus exists, and it's called Featherweight Erlang. It is by Mostrous and Vasconcelos. Their paper focusses on typing, but you can ignore that and just look at the untyped calculus in Section 3. " -- [49]





" There is one important downside to the CSP way ((compared to Actors)) of networking concurrent behaviour:

    Any network of goroutines that has some mutual dependency may deadlock. The developer must be aware of this and code so that deadlock is eliminated; this involves identifying mutual dependencies and altering them in some way (there are several well-known techniques, the easiest of these being to use a pure client-server pattern, which is known to be deadlock-free).

But consider what happens if a deadlocking network of goroutines is converted to Akka. The non-blocking communications means that deadlock won't happen - but how can we be sure that the queues will not overflow? There is a direct equivalence between deadlock and exhaustion of not-so-infinite buffers. " --




some history of CSP:


" riffraff 1605 days ago [-]

AFAIR kilim guarantees statically checked safety, exactly because "messages" that can be sent across actor borders have guaranteed properties of being aliaseable^mutable. "


" A Fiber is a lightweight thread that uses cooperative multitasking instead of preemptive multitasking. A running fiber must explicitly "yield" to allow another fiber to run, which makes their implementation much easier than kernel or user threads.

A Coroutine is a component that generalizes a subroutine to allow multiple entry points for suspending and resuming execution at certain locations. Unlike subroutines, coroutines can exit by calling other coroutines, which may later return to the point where they were invoked in the original coroutine.

A Green Thread is a thread that is scheduled by a virtual machine (VM) instead of natively by the underlying operating system. Green threads emulate multithreaded environments without relying on any native OS capabilities, and they are managed in user space instead of kernel space, enabling them to work in environments that do not have native thread support. " --


concurrency vs parallel

Different people have different definitions. One common one is that concurrent processes may have their steps arbitrarily interleaved but may not necessarily ever have one step from each process executing simultaneously, whereas parallelism admits actual simultaneous execution.

An example is seen in the case of multitasking on a single CPU vs. on multiple CPUs. On a computer with a single CPU, you can still 'multitask', that is, run two separate threads of execution, by first executing the first task a little bit, then executing the second task a little bit, then executing a little more of the first task, etc. The two tasks are executing concurrently, but not in parallel; there is no chance that a step in the first task and a step in the second change will actually execute simultaneously, because you only have one CPU. By contrast, on a machine with two CPUs, you could have one CPU executing the first thread while the second CPU is executing the second thread. This means that the two threads can actually execute simultaneously.

Other people might have different definition of these terms. One alternate definition says concurrency is a property of the program (the program is broken into multiple threads which can tolerate interleaved or simultaneous execution), while parallelism is a property of the hardware (there are actually multiple CPUs). Some people use the phrase 'truly concurrent' to denote simultaneous execution (as opposed to merely interleaved execution).



    Process: OS-managed (possibly) truly concurrent, at least in the presence of suitable hardware support. Exist within their own address space.
    Thread: OS-managed, within the same address space as the parent and all its other threads. Possibly truly concurrent, and multi-tasking is pre-emptive.
    Green Thread: These are user-space projections of the same concept as threads, but are not OS-managed. Probably not truly concurrent, except in the sense that there may be multiple worker threads or processes giving them CPU time concurrently, so probably best to consider this as interleaved or multiplexed.
    Protothreads: I couldn't really tease a definition out of these. I think they are interleaved and program-managed, but don't take my word for it. My sense was that they are essentially an application-specific implementation of the same kind of "green threads" model, with appropriate modification for the application domain.
    Fibers: OS-managed. Exactly threads, except co-operatively multitasking, and hence not truly concurrent.
    Coroutines: Exactly fibers, except not OS-managed.
    Goroutines: They claim to be unlike anything else, but they seem to be exactly green threads, as in, process-managed in a single address space and multiplexed onto system threads. Perhaps somebody with more knowledge of Go can cut through the marketing material.

" -- [50]

"Threads scheduled by user-mode code are what’s known as lightweight threads; in Quasar (as well as in some other languages) they are called fibers."- -

(as you can see, different people define 'fibers' differently)


a quick overview of some of the stuff in here:

Concurrency paradigms can be deterministic or non-deterministic

A race condition is (often undesirable) non-deterministic program behavior resulting from concurrency. When programming in a non-deterministic concurrency paradigm, race conditions are possible, and the programmer must use patterns, analysis, and discipline to avoid, detect, and eliminate undesirable race conditions.

Threads and stuff like them (there is no consensus on terminology):

shared memory vs none; scheduled vs yield to target; preemptive vs cooperative (have to yield) vs semipreemptive (wont preempt in tight loop). preemptive or semipreemptive implies scheduled.

Processes: no shared memory between different processes, scheduled, preemptive

Threads: mb shared memory, scheduled, preemptive. Defined this way, processes are special cases of threads (threads with no shared memory), note however that in many systems you can have many threads within each process.

Fibers: scheduled? (different people disagree on the definition here; many define it the same as a coroutine, but for some, something that explicitly yields to a scheduler is a fiber), cooperative (have to yield) [51] [52] [53]

coroutines: unscheduled cooperative (have to yield)

strand: either a thread or a fiber

These things can be implemented at the OS level or at the application level. The application-level versions of these constructs are often more efficient (in time and memory) than the OS-level versions. Application-level threads or fibers (strands) can be mapped onto OS ones many:1 (all application-level strands are in one OS thread) or 1:1 (each application-level strand gets its own OS-level thread) or M:N (multiple application-level strands per OS-level thread, but also multiple OS-level threads. These systems sometimes also support migrating application-level strands between OS-level threads, often for the purpose of making sure that one blocking strand doesn't block other strands).

Both of the words 'process' and 'thread' are also used more abstractly to denote different subprograms that are running concurrently.

Some operations are 'blocking'; in general, operations can be categorized as 'blocking' or 'non-blocking'. 'Blocking' means that the operation is waiting for something from the outside ('outside' can mean the world outside the computer, or it can just mean some event from another thread). Blocking operations are usually categorized as either I/O (wait for something to be written to or read from the outside world) or synchronization (wait for some event to occur or some condition to be true other threads and/or shared resources). One says that a thread 'blocks' when it is stuck waiting on a blocking operation. Some languages and libraries support 'non-blocking I/O', which means that instead of blocking the thread, the I/O operation is started in the background while the thread goes on to do other things (logically, at least; sometimes this is actually implemented by spawning a new thread to do the blocking I/O operation). Non-blocking I/O is sometimes referred to as "nio" or "async I/O" or "aio".

Some people talk of 'concurrency' vs 'parallelism'. There is no consensus on the meanings of these terms, however, one meaning is that 'parallelism' means different processes which are running at the same time on different CPUs, so one step from one process might be executing simultaneously with a step from a different process; in contrast, 'concurrency' also includes the case in which a single CPU is multitasking between two processes, so they are never actually running at the same time, but rather they are interleaved.

Some language implementations, such as Python or Ruby, support concurrency but not parallelism (in the above sense) between multiple threads; so the interpreter will only allow one thread to be running at any one moment in time, even if multiple CPUs are available (however, often these language implementations do still allow a program to contain multiple processes, that is, subprograms which do not share memory). In such a situation, concurrency (multiple threads multitasked onto one CPU) is still useful for dealing with I/O, because while one thread is paused waiting for the I/O operation to complete, other threads can be doing other things. One says that a program whose performance is bottlenecked by I/O is 'I/O bound'; these programs will benefit from multitasking even without multiple CPUs. By contrast, programs which are 'compute-bound' are those whose performance is bottlenecked by the speed of the CPU (and RAM); those programs will not benefit much from multitasking on a single CPU but would benefit from parallel execution across many CPUs.

A hierarchy of three programming constructs for reactor-pattern concurrency with particular relevance to (non)-blocking I/O:

The reactor pattern is when you run a program in a single thread, but use only non-blocking operations (such as non-blocking I/O). You switch between logical subprograms so that when one logical subprogram would otherwise 'block' to wait for the result of a blocking operation, instead you just proceed with executing another logical subprogram. One thing to keep in mind with this pattern is that if one of these subprograms does execute a blocking operation, or if it remains running via a loop, then the other subprograms won't run at all during this time because they are all sharing the same thread.

The following can all be implemented in languages which support only 'concurrency' (multitasking onto a single CPU) even if not parallelism (simultaneous execution on many CPUs). The following starts with most primitive (callbacks) (primitive in this hierarchy; the actual implementation involves still lower layers of more primitive concurrency constructs), and then moves up to things built on top of this primitive. That is, callbacks are most primitive here, and promises are implemented on top of callbacks, and async/await is implemented on top of callbacks.

Callbacks. To do a blocking operation, you put in a request for the operation, and as part of the request, you give a callback which will be called when the I/O is complete. In some models, the callbacks can be called anytime, effectively pre-empting other code. In other models, calls to callbacks are queued up and only executed when other computation has completed or at explict 'yield' point in the code (note, though, that one can model the former as just the latter with a 'yield' in between every statement).

Futures/promises. To do a blocking operation, you call a function which returns, not a result, but a 'promise' object. The promise object provides the result value of the operation after it completes. Typically the promise object also provides a way to check if the operation is done yet. Typically the promise object also provides a way to register a callback which will be called when the operation completes (this allows you to 'chain together' a sequence of blocking operations). There is no consensus on what, if anything, is the difference between the words 'future' and 'promise', or even if they are synonyms or not. Promises can be implemented on top of callbacks (they are not equivalent though; a callback might be called many times, whereas the callback registered on a promise will only be called once; although the idea can be extended to Observables, in which there is a stream of zero or more events each of which will cause the callback to be called).

Async/await. These introduce two new keywords, 'async' and 'await', and uses coroutines. When you get a promise object, and you want to wait until the operation completes and then do something else, you 'await' on the promise object. Logically, this lets you write code sequentially as if you are blocking; but actually, the thread it not blocked, because 'await' implicitly yields control to another coroutine while it waits for the operation to complete. Any function containing an 'await' must be marked with the 'async' keyword, and any function that calls any other 'async' function itself must be marked as 'async'. This is important for readability because it makes it clear that calling any function marked 'async' might cause the current coroutine to yield control to another one; when that is possible, the programmer may have to take extra precautions to avoid race conditions.

An alternative to reactor-pattern concurrency is to use threads or processes. OS-level threads and processes are often too slow or memory-intensive to scale up as well as reactor-pattern concurrency, but application-level threads or processes (such as Golang's goroutines or Erlang's processes) are more competitive, especially if the application-level threading framework is M:N and automatically migrates application-level threads so that an application-level thread that blocks doesn't block other application-level threads. A downside of this compared to reactor-pattern concurrency is that operations from different threads may be arbitrarily (or almost arbitrarily) interleaved, meaning that a lot of care must be exercised to avoid race conditions, especially when accessing shared memory. By contrast, reactor-pattern concurrency often only has a few, explicit points when one code path may yield to another one, which makes it easier to reason about race conditions. Process-based systems (no shared memory) ameliorate this downside (but race conditions may still be possible in the form of deadlock and similar).


SIMD, MIMD, etc (Flynn's taxonomy)

shared memory: memory consistency models consistency, eventual consistency, last-write-wins consistency the hard part is resolving inconsistencies logical clocks, lamport clocks, vector clocks, CAP theorem CRDTs

avoiding sharing with shared memory:

atomic operations: these are the actual low-level primitives, supported by hardware, upon which other concurrency constructs are built. consensus number CAS LL/SC atomic reads/writes (single byte, multibyte/word) vector instructions in ISA fences, memory barriers see also memory consistency models

critical sections, mutexes, condition variables, monitors, semaphores, locks, read/write locks, synchronization barriers, monitors locks are not composable read-copy-update

deadlock livelock

wait-free algorithms (lock-free)

dataflow (spreadsheet-like) 'labeled lines' 'active data'



process calculi (often based on channels)

Message passing: CSP and Actors. Channels, mailboxes.

select/poll with pattern matching

propagator model

consensus algorithms zero-trust, eg bitcoin

todo systolic array

join-calculus style function calls

todo petri nets

in hardware: metastability arbiters clocks, clock domains, clockless/async architectures todo what are those other things with the grids, etc? cp from other file

GPUs parallel loops

todo (BSP)

tuple spaces

events, pub/sub

amorphous medium

optimistic vs pessimistic concurrency

worker pools


transactions STM ACID

pipe, socket blocking channel vs channel with fixed-length queue vs channel with 'infinite' queue

shared queue


some general groupings of the above:

you have different 'processes' or 'threads' of execution (subprograms executing linearly).

Fundamentally, a one-CPU doing batch processing can be modeled as a Turing machine (or something equivalent like lambda calculus). But this leaves out some phenomena that are relevant in the real world. Two closely related things that the Turing machine model leaves out are:

These are related in that a model for interactivity also seems to be a solution to concurrency; you could model a multi-CPU computer as a set of interactive Turing machines that can interact with each other and with a shared memory. There is not yet universal consensus on what this model should be. Process calculi are contenders in this space. When you start thinking about this stuff, two concepts come up again and again:

These models can be further expanded to distributed systems models. A key additional property of distributed system models is:

Note: NFAs (and non-deterministic automata in general) only capture a very constrained kind of concurrency. They can be thought of as concurrently exploring all of the nondeterministic paths, but the 'virtual CPUs' exploring these paths aren't allowed to communicate with each other. This models 'embarrassingly parallel' computations.

in hardware: metastability arbiters clocks, clock domains, clockless/async architectures todo crossbar switches, clos, etc; cp from other file; that file also mentioned SIMD i think

they send messages to each other (message passing, channels, (named destinations/processes?: addressing, named channels?), routing (implementation detail), process calculi, pipes, sockets, signals, events, pub/sub, select/epoll)

Sometimes you would like them to pretend to share memory. But true shared memory is an infeasible abstraction (and note that even a single computer with multiple CPUs is a distributed system in some sense; not in the sense that messages between CPUs have a significant probability of being lost, but in the sense that, because the CPUs are physically distant from each other and from main memory, it takes nonzero time for a signals to travel between CPUs or between a CPU and memory). So we have approximations (memory consistency models, eventual consistency, last-write-wins consistency, CAP theorem, tuple spaces, fences/memory barriers) and ways of working with these approximations (logical clocks, CRDTs, Glitch, Propagators, consensus algorithms).

Sometimes you need to synchronize (constrain the relationship between processes, or in more detail, constrain the state of the total system in a way that cannot be expressed as a constraint that applies to each process separately without reference to any other process). We have synchronization primitives in shared memory (CAS, LLSC).

We have synchronization constructs, some of which may be primitives in various models: critical sections, mutexes, condition variables, monitors, semaphors, locks, read/write locks, synchronization barriers

At higher-levels, shared-memory synchronization is included into memory consistency models (transactions, STM, ACID).

And we have shared data structures:

and ways to avoiding sharing with shared memory (so that you can analyze programs to avoid races while still using shared memory):

We have ways of specifying parallel control flow.

Processes themselves are first-order (processes, threads, process calculi, fork/clone, join), and can be emulated to an extent within one process via interleaving (reactor model, schedulers, preemption, fibers, coroutines). This is useful not only for implementation or efficiency, but also because it constraints when and how steps from different processes may be simultaneous or interleaved.

At higher levels, these combine with data and synchronization. join-calculus calls, callbacks, promises, async/await, blocking channels.

There are also ways of giving orders to first-order processses without treating them as first-class (sorta like declarative programming is computation without specifying the exact computation, here we might specify the computation but try to abstract away the concurrency): worker pools, SIMD and/or parallel map, map-reduce, dataflow

There are guarantees and other properties of parallel and distributed algorithms/systems (race conditions, wait-freedom, deadlock, livelock, liveness, progress, CAP). And properties of paradigms (deterministic, non-deterministic, composability, SIMD, MIMD). And mathematical models like Turing machines, but for parallelism (process calculi, petri nets). In order to reason about concurrent processes, we have temporal logics [54].

some todos: petri nets, amorphous medium, ambient calculus, systolic arrays


" The basic operation of an Actor is easy to understand: like a thread, it runs concurrently with other Actors. However, unlike threads it is not pre-emptable. Instead, each Actor has a mailbox and can call a routine named “receive” to check its mailbox for new messages. The “receive” routine takes a filter, and if no messages in an Actor’s mailbox matches the filter, the Actor sleeps until it receives new messages, at which time it’s rescheduled for execution. ... each actor can have no, one or multiple addresses ... the actor models makes no guarantees on the ordering of messages. Queuing and dequeuing of messages in a mailbox are atomic operations, so there cannot be a race condition ... There is no shared state and the interaction between actors is purely based on asynchronous messages. ...

" -- [55]


This paper has constructions and proofs of consensus numbers of various primitives:

Wait-Free Synchronization by MAURICE HERLIHY, 1991.

" First, we introduce a simple and general technique, based on reduction to a concensus protocol, for proving statements of the form, “there is no wait-free implementation of X by Y.” We derive a hierarchy of objects such that no object at one level has a wait-free implementation in terms of objects at lower levels. In particular, we show that atomic read/write registers, which have been the focus of much recent attention, are at the bottom of the hierarchy: thay cannot be used to construct wait-free implementations of many simple and familiar data types. Moreover, classical synchronization primitives such astest&set and fetch&add, while more powerful than read and write, are also computationally weak, as are the standard message-passing primitives. Second, nevertheless, we show that there do exist simple universal objects from which one can construct a wait-free implementation of any sequential object. " --

things that are proven to have an infinite consensus number in this paper are:

As [57] explains starting with slide 27, this hierarchy is not robust, but as [58] explains, a similar hierarchy, h^r_m, was proposed by Jayanti, and although it has not been proven robust, it has been proven robust in for 'well-behaved' cases.

Other links about consensus number and wait-freedom:


The paper which coined 'wait-free' and 'consensus number', Wait-Free Synchronization by MAURICE HERLIHY, 1991, presented a method for creating a wait-free implementation of any data structure. However, it is O(N^3) in space and 'very slow', 'preventing parallelism' ([59], Encyclopedia of Algorithms, Mark Moir).


lock-free and wait-free links:



is there a library of wait-free data structures somewhere? i see a bunch of libraries mixing lock-free and wait-free.

Here's some wait-free algorithms and implementations that i've heard of (note: some data structures provide wait-freedom only on some operations):

wait-free stack:

queues and ring buffers:

Hash maps:


Note: As notes, there are also general constructions of wait-free datastructures out of lock-free ones.

perhaps these links will lead to more libraries (i already skimmed them, but not too carefully):


" So what primitives are in your arsenal for implementation of advanced synchronization algorithms?

Compare-And-Swap Perhaps, it's the most famous primitive, it's other names are CAS, compare-and-exchange, compare-and-set, std::atomic_compare_exchange, InterlockedCompareExchange?, __sync_val_compare_and_swap, LOСK? CMPXCHG and other. It's an instance of so-called atomic RMW (read-modify-write) operation. It's pseudo-code is: T compare-and-swap(T* location, T cmp, T xchg) { do atomically { T val = *location; if (cmp == val) *location = xchg; return val; } } That is, it stores a new value (xchg) into a memory location only if it contains an expected value (cmp), in either case it returns a value that was stored in the location when the operation begins. And all that is done atomically on hardware level.

Fetch-And-Add Also atomic RMW operation, and also conducted atomically in hardware. Aka atomic_fetch_add, InterlockedExchangeAdd?, LOСK? XADD. Below is the pseudo-code: T fetch-and-add(T* location, T x) { do atomically { T val = *location; *location = val + x; return val; } } There are also variations like fetch-and-sub, fetch-and-and, fetch-and-or, fetch-and-xor.

Exchange Atomic RMW. Aka atomic_exchange, XCHG. Dead simple, but not less useful: T exchange(T* location, T x) { do atomically { T val = *location; *location = x; return val; } }

Atomic loads and stores They are not RMW (read-modify-write) operations, they are just independent atomic loads and stores. They are frequently unfairly underestimated. However, they play fundamental role in synchronization algorithms, and they are what you should generally strive for - atomic loads and stores are better/cheaper/faster than atomic RMW operations.

Mutexes and the company Why not? The most stupid thing one can do is try to implement everything in a non-blocking style (of course, if you are not writing infantile research paper, and not betting a money). Generally it's perfectly Ok to use mutexes/condition variables/semaphores/etc on cold-paths. For example, during process or thread startup/shutdown mutexes and condition variables is the way to go.



" what is the most important thing regarding synchronization algorithm's performance and scalability? I frequently hear the answer that it's a number of atomic RMW (read-modify-write) instructions (like Compare-And-Swap or Fetch-And-Add) per operation. It's dead wrong. The most important thing is amount of write sharing per operation. Numerical measure of write sharing is number cache-line transfers per operation, and the ideal value is 0. If there is 0 cache-line transfers per operations amortized (we are perfectly Ok with amortization here), then the algorithm is perfectly scalable. Anything other than 0, even 1, is a serious problem for scalability. ... First, if there is write sharing system ungracefully degrades, the more threads we add the slower it becomes.

Second, if there is no write sharing system linearly scales. Yes, atomic RMW operations are slower than plain stores and loads, but they do scale linearly in itself (by the way, cost of atomic RMW operations becomes smaller and smaller with each new processor generation, there are no fundamental reasons why they must be any slower and similar non-atomic read-modify-write sequence).

Third, loads are always scalable. Several threads are able to read a memory location simultaneously. Read-only accesses are your best friends in a concurrent environment. "




"no matter your concurrent programming model du jour, three fundamental concepts crop up again and again: isolation (of state), atomicity (of state transitions), and consistency (of those atomic transitions)." --


regarding locks: "And if you care about performance, you are also going to need to think about hardware esoterica such as CMPXCHG, spin waiting, cache contention, optimistic techniques with version counters and memory models, ABA, and so on." --


Time of check to time of use (TOCTOU) bugs: "changes in a system between the checking of a condition (such as a security credential) and the use of the results of that check. This is one example of a race condition." [62]. Example:

" async bool IsRed?(AsyncColor? c) { return (await c.R > 0 && await c.G == 0 && await c.B == 0); }

This rather simple (and silly) function checks to see if an AsyncColor? is “red”; to do so, it reads the R, G, and B properties. For whatever reason, they are asynchronous, so we must await between accesses. If AsyncColor? is a mutable object, well, guess what – these values might change after we’ve read them, opening up a possible TOCTOU bug. For instance, imagine a caller’s surprise when IsRed? may have lied to it:

AsyncColor? c = ...; await IsRed?(c); assert(await c.R > 0);

That assertion can very well fire. Even this callsite has a TOCTOU bug of its own, since c.R might be >0 at the end of IsRed’s? return, but not after the assert expression’s own await has completed. " --


another (or just an old?) idea for ways of categorizing concurrency:


Concurrent Collections/tstreams / tstreams is a deterministic parallel programming model for pure computations.

You specify the dependency graph between operations using three primitive types: items (data items, which must be immutable), steps (operations, which must be pure functions), and tags (to distinguish instances; eg if you are specifying the workflow for a restaurant, where pies, once ordered by a guest, must be prepared, and then baked, then pies, which are each associated with different, individual guests, are each tagged with a guest_ID). You then specify ordering relationships between steps and data items (eg the operation bake_pie requires a prepared_pie, which is produced by prepare_pie). In CnC? there is also a controller/controlee reliationship which is similar to function calling and which is used to create a new tag and initiate an assembly line of operations to produce the results of some operation for that tag, after satisfying prerequisites (see the "Waiter controlling cooks" section of the CnC tutorial for an example).



Some Intel-related parallel libraries and similar:

(maybe toread: )


Intel Threaded Building Blocks (TBB)

" Library contents

TBB is a collection of components for parallel programming:

    Basic algorithms: parallel_for, parallel_reduce, parallel_scan
    Advanced algorithms: parallel_while, parallel_do, parallel_pipeline, parallel_sort
    Containers: concurrent_queue, concurrent_priority_queue, concurrent_vector, concurrent_hash_map
    Memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator
    Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive_mutex
    Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store
    Timing: portable fine grained global time stamp
    Task scheduler: direct access to control the creation and activation of tasks" -- [70]


Intel Array Building Blocks

Chapter 8: Statements Function Execution Semantics Basic Block Statements Notational Conventions for Statements Elementwise Statements Function Call Statements Reordering Statements gather scatter pack unpack shuffle unshuffle repeat distribute repeat_row, repeat_col, repeat_page transpose swap_row, swap_col, swap_page shift_constant, shift_constant_reverse shift_clamp, shift_clamp_reverse rotate, rotate_reverse reverse Facility Statements const_vector bitwise_cast cast cat extract replace replace_row, replace_col, replace_page replace_element section index extract_row, extract_col, extract_page get_elt_coord get_neighbor expect_size mask sort, sort_rank alloc length get_nrows, get_ncols, get_npages Nesting Statements Collective Statements Reduction Statements Scan Statements Merge Statements Control Flow Statements if Statements Loops for Loops while Loops do Loops break Statements continue Statements return Statements Special Statements when Statements


from [71] :

hierarchy of common memory consistency model strengths, from strongest to weakest:

a "woefully incomplete" characterization of these memory consistency models: consider reorderings of the following instruction pairs: load/load, load/store, store/load, store/store:


" 1. All threads are interleaved into a single “thread” 2. The interleaved thread respects each thread’s original instruction ordering (“program order”) 3. Loads return the value of the most recent store to the same address, according to the interleaving


For performance, most processors weaken rule #2, and most weaken #1 as well.


Q: Can I think of an execution as an interleaving of the instructions in each thread (in some order)? ... That would make it illegal to forward values from a store buffer!.. Because with a store buffer, cores can read their own writes “early”.

Option 1: forbid store buffer forwarding, keep a simpler memory model, sacrifice performance

Option 2: change the memory model to allow store buffer forwarding, at the cost of a more complex model

Nearly all processors today choose #2

Q: Can I think of an execution as an interleaving of the instructions in each thread (in some order), with an exception for store buffer forwarding?


example of the exception for store buffer forwarding:

2 CPUs. Each CPU has 2 threads: threads 1 and 2 on CPU 1, threads 3 and 4 on CPU 2. Thread 1 (on CPU 1) stores a value to memory location A, then Thread 2 reads from memory location A. Starting at about the same time, Thread 3 (on CPU 2) stores a different value to memory location A, then Thread 4 reads from memory location A. If the time it takes for these stores to propagate between CPUs is short, Thread 2 perceives an ordering on which Thread 1's store came before Thread 3's store, but Thread 4 perceives the opposite ordering. So, in this case, there is no interleaving perceived by all threads.


" Memory Model Landscape

Sequential Consistency (SC)

Total Store Order (TSO)

Weaker memory models


example of acausal paradox due to load-store reordering, from

"Example: Ld-St Reordering Permitting a store to be issued to the memory before previous loads have completed, allows load values to be affected by future stores in the same thread" For example,

Process 1: r1 = Load(a) Store(b,1)

Process 2: r2 = Load(b) Store(a,r2)

Load-store reordering would allow the '1' stored by Process 1 into b to be loaded into r2 by process 2's load, and then stored into a by process 2's store, and then loaded into r1 from a by process 1! Implementation-wise, here is what could happen:


note: sometimes relaxed memory formalisms would permit similar but even more paradoxical results, called out-of-thin-air results, see

Outlawing Ghosts: Avoiding Out-of-Thin-Air Results


" While strong memory models like SC and SPARC/Intel-TSO are well understood, weak memory models of commercial ISAs like ARM and POWER are driven too much by microarchitectural details, and inadequately documented by manufacturers.... This forces the researchers to formalize these weak memory models by empirically determining the allowed/disallowed behaviors of commercial processors and then constructing models to fit these observations " -- Weak Memory Models: Balancing Definitional Simplicity and Implementation Flexibility


" SC [1] is the simplest model, but naive implementations of SC suffer from poor performance. Although researchers have proposed aggressive techniques to preserve SC [29]–[38], they are rarely adopted in commercial processors perhaps due to their hardware complexity. Instead the manufactures and researchers have chosen to present weaker memory models, e.g., TSO [2], [3], [25], [39], PSO [4], RMO [4], Alpha [5], Processor Consistency [40], Weak Consistency [41], RC [6], CRF [42], Instruction Reordering + Store Atomicity [43], POWER [11] and ARM [44]. The tutorials by Adve et al. [45] and by Maranget et al. [46] provide relationships among some of these models. A large amount of research has also been devoted to specifying the memory models of high-level languages: C++ [26], [47]–[50], Java [51]–[53],

" -- [74]


" Recently, Lustig et al. have used Memory Ordering Specification Tables (MOSTs) to describe memory models, and proposed a hardware scheme to dynamically convert programs across memory models described in MOSTs [19]. MOST specifies the ordering strength (e.g., locally ordered, multi-copy atomic) of two instructions from the same processor under different conditions (e.g., data dependency, control dependency). " -- [75]


What every systems programmer should know about lockless concurrency



CS 149: Parallel Computing This course is an introduction to parallelism and parallel programming. Most new computer architectures are parallel; programming these machines requires knowledge of the basic issues of and techniques for writing parallel software. Topics: varieties of parallelism in current hardware (e.g., fast networks, multicore, accelerators such as GPUs, vector instruction sets), importance of locality, implicit vs. explicit parallelism, shared vs. non-shared memory, synchronization mechanisms (locking, atomicity, transactions, barriers), and parallel programming models (threads, data parallel/streaming, MapReduce?, Apache Spark, SPMD, message passing, SIMT, transactions, and nested parallelism). Significant parallel programming assignments will be given as homework. The course is open to students who have completed the introductory CS course sequence through 110 and have taken CS 143. Terms: Win

Credit/No Credit Instructors: Olukotun, O. (PI) ; Zaharia, M. (PI) ; Aberger, C. (TA) ; Nötzli, A. (TA) ... more » Schedule for CS 149
Units: 3-4UG Reqs: GER:DB-EngrAppSci?Grading: Letter or


preemptive threading vs single-threaded nonblocking (async) I/O with callbacks vs single-threaded nonblocking (async) I/O with coroutines:

" 1) ((preemptive threading)) Use preemptive system threads that can execute in parallel. A task requiring simultaneous waiting is given an operating system thread of its own so it can block without stopping the entire program. But threads require significant memory and other resources per thread. Also, the operating system can arbitrarily interleave the execution of system threads, requiring the programmer to carefully protect shared resources with locks and condition variables, which is exceedingly error-prone.

2) ((single-threaded nonblocking (async) I/O with callbacks)) Have a single-threaded program, where that single thread runs an event loop whose job is to react to external events by invoking a callback function that has been registered. While it doesn't require the same kind of complex synchronization that preemptive threads do, the inverted control structure of this approach requires your own control flow to thread awkwardly through the system's event loop, leading to a maze of event callbacks. " -- [76]

advantages of coroutines:


KayEss? on May 26, 2017 [-]

The implementation of the coroutines might be complex, but their use is very straightforward -- the code looks almost exactly the same as the blocking code would, just with the awaits in there.

As for the number of threads, when I use Boost ASIO with coroutines I often end up with multiple threads servicing a pool of coroutines, so if there is shared state between them then there is still synchronisation. I use channels implemented on top of eventfd to help with that.

When I converted some network code from callbacks to coroutines the code ended up about 1/3 as long and was far simpler and easier to understand. It also fixed several bugs I simply couldn't find.

The reality is that the callbacks are far more complex to use than the coroutines.

vvanders on May 26, 2017 [-]

Yup, we used to let our designers loose on coroutines as a way to script Game AI. If designers can use them safely then I'm sure the average programmer can too :).

butterisgood on May 26, 2017 [-] ... Having written heavily callback-driven code. Threading contexts manually, thinking about object lifetimes, I can say that coroutines have a much cleaner flow to them and can make it easier to write and later read and debug programs than crazy callbacks and contexts.

Object lifetimes are one thing coroutines help with in a big way. I think it's why Rust is so attractive to so many.

mhink on May 26, 2017 [-] ... This is the biggest reason I vastly prefer using coroutines in my Javascript projects (by way of `redux-saga` if there's heavy lifting, or just async/await otherwise).

" -- [77] and [78] and others in the thread [79]

" dom0 on May 26, 2017 [-]

2) ((single-threaded code with nonblocking I/O and callbacks)) is concurrency, 1) ((multiple processes executing at the same time)) is parallelism; there are various programming models for multiple threads. The basic approach of mutual exclusion (=shared memory, locks, barriers) is just one of them, another approach would be message passing / actors (queues, ZeroMQ? etc.). You can mix the two, and the latter typically is implemented on top of the former, because hardware does not have message passing facilities. Some other approaches don't really fit that scheme (e.g. TBB).

Choosing an approach in this field is not always obvious, though sometimes it is. E.g. many stupid server workloads are bound on I/O concurrency, so green threads are the best fit.

However, as soon as CPU scaling is required, multiple independently schedulable kernel tasks are required -- you either have to fork/CreateProcess? or spawn threads. This does not necessarily imply that larger parts of the application have to be aware of this. E.g. it's entirely possible to plug a behind-the-scenes CPU thread into an event loop. " -- [80]


Some types of concurrency: Shared memory, message passing, data parallel