books-programmingLanguages-programmingLanguagesPartConcurrency

Table of Contents for Programming Languages: a survey

Chapter : introduction to concurrency

"For the past 30 years, computer performance has been driven by Moore's Law; from now on, it will be driven by Amdahl's Law. Writing code that effectively exploits multiple processors can be very challenging." -- Doron Rajwan

see also "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", by Herb Sutter

"Concurrency is the next major revolution in how we write software" -- Herb Sutter

 "Commonly used programming models are prone to subtle, hard to reproduce bugs, and parallel programs are notoriously hard to test due to data races, non-deterministic interleavings, and complex memory models,"

https://www.google.com/search?q=semaphore+monitor

optimistic vs pessimistic concurrency

memory models (architecture)

(contrast with programming language 'memory models', which we discuss later under the name 'memory order')

shared memory vs "shared nothing"

shared memory

NUMA

PGAS

(mb should put this one into a separate section about non-uniform memory models)

https://en.wikipedia.org/wiki/Partitioned_global_address_space

deterministic paradigms

nondeterministic paradigms

asynchronous

synchronous

note: sometimes the words 'synchronous' and 'synchronization' are used to mean 'two threads observe the same ordering of some events', rather than 'two threads are doing the same thing at the same time'.

composable

non-composable

defn data race

when the final result of the computation may differ depending on which ordering, out of a nondeterministic set of possible orderings of events, actually occurs? (todo: is that right? that is arguably the broader notion of a 'race condition'; a narrower, memory-level definition of 'data race' is given below)

e.g. two threads try to increment a counter simultaneously: each reads the old value, adds one, and writes the result back; if both read the same old value, one increment is lost (a code sketch follows below, after the definition quotes).

similar example: two ATMs try to withdraw from the same bank account at the same time. both succeed; 'double spend'; global invariant not maintained.

or mb defn is: two or more operations access one memory location concurrently, and at least one of them is a write

" assume the language allows distinguishing between synchronization and ordinary (non-synchronization or data) operations (see below). We say that two memory operations conflict if they access the same memory location (e.g., variable or array element), and at least one is a write. We say that a program (on a particular input) allows a data race if it has a sequentially consistent execution (i.e., a program-ordered interleaving of operations of the individual threads) in which two conflicting ordinary operations execute “simultaneously”. For our purposes, two operations execute “simultaneously” if they occur next to each other in the interleaving and correspond to different threads. Since these operations occur adjacently in the interleaving, we know that they could equally well have occurred in the opposite order; there are no intervening operations to enforce the order.... it is also possible to define data races as conflicting accesses not ordered by synchronization, as is done in Java. These definitions are essentially equivalent." -- Adve, Boehm. Memory Models: A Case for Rethinking Parallel Languages and Hardware

" First of all, let’s make sure we know what we’re talking about. In current usage a data race is synonymous with a low-level data race, as opposed to a high-level race that involves either multiple memory locations, or multiple accesses per thread. Everybody agrees on the meaning of data conflict, which is multiple threads accessing the same memory location, at least one of them through a write. But a data conflict is not necessarily a data race. In order for it to become a race, one more condition must be true: the access has to be “simultaneous.” ... In fact, most languages try to provide the so called DRF (Data Race Free) guarantee, which states that all executions of data-race-free programs are sequentially consistent. Don’t be alarmed by the apparent circularity of the argument: you start with sequentially consistent executions to prove data-race freedom and, if you don’t find any data races, you conclude that all executions are sequentially consistent. But if you do find a data race this way, then you know that non-sequentially-consistent executions are also possible. " -- http://corensic.wordpress.com/category/memory-model/

defn synchronization

when there is a data race, you must 'synchronize', here meaning use some mechanism to constrain the nondeterminism in orderings (possibly all the way to determinism) (todo is that defn right?)

to implement synchronization over shared memory you need some sort of read-modify-write primitive, in general one with infinite consensus number (see 'atomic primitives', below)

definition liveness

threads should make forward progress, not get 'stuck' (todo is that defn right?)

(todo group under above)

(todo which are built in terms of the others)

(todo which are more or less efficient)

Chapter : coarse-grained control flows (threads and processes) :

thread

pthreads

green thread

coroutines, cooperative multitasking vs preemptive

green thread hierarchy

with green threads/coroutines/fibers you can often feasibly have many more concurrent threads than with OS threads; e.g. Erlang, e.g. http://blog.paralleluniverse.co/post/64210769930/spaceships2

subroutine call

coroutine (synchronous green thread)

  https://en.wikipedia.org/wiki/Fiber_%28computer_science%29

async green thread (a subroutine call or coroutine at the runtime level); these could share scoped variables via a saguaro (cactus) stack. The runtime can transform code written like blocking I/O into async I/O by making an async I/O call, then yielding the green thread and not scheduling it again until the async I/O call completes.

green process (separate address space, but still a subroutine call or coroutine at the runtime level), e.g. Erlang processes (todo: are these async?)

thread (shared address space, OS scheduler)

process

fork

and join

task

worker

worker pool

deferred task queues

Chapter: fine-grained control flows

(todo: am i using the phrase fine-grained parallelism correctly? does GPGPU stuff fit here?)

parallelism annotation

e.g. Haskell's par

data parallelism

does this belong in this chapter? i think so...

pmap

parallel data types which are automatically evaluated in parallel when a function like 'map' is applied to them
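
a rough pmap-like sketch (my addition; it assumes a C++17 standard library that actually implements the parallel execution policies, e.g. libstdc++ built with TBB, otherwise the calls may run serially): the ordinary 'map' and 'reduce' calls, with an execution policy requesting data-parallel evaluation:

#include <algorithm>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> xs(1'000'000);
    std::iota(xs.begin(), xs.end(), 0.0);

    // "map": apply a pure function to every element, possibly in parallel
    std::transform(std::execution::par, xs.begin(), xs.end(), xs.begin(),
                   [](double x) { return x * x; });

    // "reduce": combine the results, also possibly in parallel
    double sum = std::reduce(std::execution::par, xs.begin(), xs.end(), 0.0);
    std::cout << sum << '\n';
}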

compiler implementation of nested data parallelism

http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/ndpslides.pdf

MIMD vs SIMD

Flynn's taxonomy: SISD, SIMD, MISD, MIMD (plus SPMD, a common later addition that is not one of Flynn's original four)

GPGPU

a turing-universal SIMD mechanism can emulate MIMD, but only without IPC and communication with main memory

e.g. https://www.google.com/search?q=simd+emulate+mimd

and data and code parallelism

active data

book rec: connection machine

Chapter: message passing

message passing

pipe

socket

and channel... ?

pub/sub

signals

actor

csp

synchronous (unbuffered) and async (buffered)

Chapter: sequencing around communication

receiving data

blocking

always use timeouts!

callbacks

and futures etc

callbacks

con: "...the cumbersome use of callbacks. This, of course, is a problem with many asynchronous computations: you gain concurrency but lose the natural code flow. It’s hard for a programmer to reason about which line of code executes on which thread, and passing information from one callback to another is cumbersome as well." -- http://blog.paralleluniverse.co/post/64210769930/spaceships2

futures

'promises' are used somewhat interchangeably, although some languages have both 'promises' and 'futures' and distinguish between them

implicit (it's just a value from the program's point of view) vs. explicit (it's a structure that you query to see if it's been bound yet)

in some languages, the writable side of the future is separated from the readable side (like a pipe). This allows you to e.g. copy the readable side and pass it to 5 consumers, and pass the writable side to 1 producer, without fearing that one of the consumers will overstep their authority and try to write to the future.

in clojure, it's easy to make futures behave like ordinary variables, because stateful variables must be 'deref'd anyway

clojure has something called a 'promise' and something called a 'future'. The difference is that with a clojure 'future', at the time of future creation you bind an expression that will be evaluated to give the result; whereas with a clojure 'promise', the promise is just a structure that you later bind a result to, e.g. the computation that will produce the result need not be determined yet at the time of promise creation. (see http://stackoverflow.com/questions/4623536/how-do-clojure-futures-and-promises-differ )

in c++, futures are (see http://stackoverflow.com/questions/12620186/futures-vs-promises ) the read side and promises are the write side.
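
a minimal C++ sketch of that split (std::promise and std::future are the standard types; the producer/consumer arrangement is my own illustration): the promise is the write side, and a copyable shared_future is the read side that can be handed to several consumers, none of whom can write to it:

#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::promise<int> write_side;   // held only by the producer
    std::shared_future<int> read_side = write_side.get_future().share();

    // several consumers get copies of the read side; they block until the value is set
    std::vector<std::thread> consumers;
    for (int i = 0; i < 3; ++i)
        consumers.emplace_back([read_side, i] {
            std::cout << "consumer " << i << " saw " << read_side.get() << '\n';
        });

    std::thread producer([&write_side] { write_side.set_value(42); });  // fulfil exactly once

    producer.join();
    for (auto& c : consumers) c.join();
}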

A hardware version is full/empty bits, found for example in the https://en.wikipedia.org/wiki/Cray_MTA : "For example, an array A is initially written with "empty" bits, and any thread reading a value from A blocks until another thread writes a value.". In the Cray MTA, the blocking on empty bits eventually times out. Opcodes are provided which ignore full/empty status.

await

labeled lines

like in neuroscience ('labeled line' coding) -- always have an answer even if your inputs aren't ready (the answer can be 'i don't know')

sending data

yield

(just like in the serial case with coroutines)

return

(just like in the serial case)

send a message thru a channel

pub/sub

alter a 'shared' memory

events

generalize to unordered eventcounts (cite Unify: A scalable, loosely-coupled, distributed shared memory multicomputer; that article reserves the term 'eventual consistency' when there is a maximum time duration until consistency)

condition variable

supports 3 operations

wait: go to sleep until you get woken (see below)

signal: wake one thread waiting on condVar

broadcast: wake all threads waiting on condVar

example usage: producer/consumer: every now and then the producer sends some data, and the consumer is supposed to do something with it. It would be a waste of time for the consumer to busy-wait for new data. Solution: the producer signals the condition variable when it produces something; the consumer sleeps on the condition variable, then wakes up and processes what the producer produced (see the sketch below).
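
a minimal sketch of this pattern with C++'s std::mutex and std::condition_variable (my own illustration of the wait/signal/broadcast operations above); note the predicate passed to wait, which re-checks the condition under the lock and so also guards against spurious wakeups:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::queue<int> q;
bool done = false;

void producer() {
    for (int i = 0; i < 5; ++i) {
        {
            std::lock_guard<std::mutex> lk(m);
            q.push(i);
        }
        cv.notify_one();             // "signal": wake one waiting consumer
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_all();                 // "broadcast": wake everyone so they can exit
}

void consumer() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        // "wait": sleep until woken AND the predicate holds
        cv.wait(lk, [] { return !q.empty() || done; });
        if (!q.empty()) {
            int v = q.front(); q.pop();
            lk.unlock();
            std::cout << "consumed " << v << '\n';
        } else if (done) {
            return;
        }
    }
}

int main() {
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}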

todo should this also be mentioned in 'locks and friends'?

dataflow

? or should this be in a different chapter

event-driven programming, pub/sub

? or should this be in a different chapter

Chapter: models of concurrent time

motivating problem:

Logical time and logical clocks

Usually we track time with physical time, e.g. we assign a global real number representing time to each event. But one can instead consider only the ordering of events; in computation, it often does not matter that step 3 occurred at 13:30 and step 4 occurred at 14:53, just that step 4 came after step 3. So we might consider a discrete version of time.

Sometimes it is unclear whether some events came before or after others. For example, you may have a distributed system with a bunch of processor nodes whose clocks are not synchronized very precisely. Or, you may be working with 'events' that are actually intervals, for example 'processor A executes subroutine X' and 'processor B executes subroutine Y', in which case two 'events' can overlap. Or, you may be working with events whose ordering relative to each other is known in some cases but unknown in others; for example, two concurrent threads each take an action A1, then wait for the other at a synchronization barrier, then take another action A2; you know that thread 1's A1 preceded thread 2's A2, but you don't know whether thread 1's A1 preceded thread 2's A1.

For these reasons we often model logical time not as a totally ordered structure (similar to the natural numbers), but rather as a partial order.

In many applications we assume that each individual process's logical time is totally ordered (that is, that each individual process knows, for any pair of events that it observed or generated, which one came first; therefore all the events observed or generated by an individual process can be arranged in order as a line), but that the global logical time of a concurrent distributed system is not.

We define a logical time by defining an ordering relation on events, such as 'happens-before' or 'caused' or 'might-have-caused'.

A logical clock is a function C whose domain is events and whose range is some representation of logical time, that is, C maps events to times, or said another way, for any event e, C(e) tells us what time that event occurred at. We call the value returned by C(e) a 'timestamp'. We place a requirement on C that relates the ordering of the timestamps to the ordering of events; "if e1 < e2, then C(e1) < C(e2)".

See also 'logical clocks for consistency', below, in which we discuss schemes by which processes communicate with each other about their logical clocks for the purpose of combining their local logical time into a global logical time.

Links:

event (re)ordering models

event (re)ordering models is my own term; "memory model" in programming languages; "consistency model" in theory

for some authors, in shared memory, a "coherence model" is about event (re)ordering for events pertaining to a single item in memory; a "consistency model" is about event (re)ordering for events across different items. Typically, stronger guarantees are provided for coherence than for consistency; typically, coherence is taken to mean that every observer is guaranteed to observe the same sequence of values for any single item. For other authors, "coherence" and "consistency" are synonyms.

https://en.wikipedia.org/wiki/Consistency_model

partial and total orders

one technique is to attach integer timestamps to each event. Each processor sends out each event with a unique timestamp which is greater than all timestamps it has ever sent or observed (Lamport).
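
a sketch of that rule in C++ (my own; the type and method names are made up): each process keeps a counter, bumps it for local and send events, and on receive jumps past the incoming timestamp before counting the receive event. This guarantees 'if e1 happens-before e2 then C(e1) < C(e2)', but not the converse:

#include <algorithm>
#include <cstdint>

struct LamportClock {
    uint64_t time = 0;

    // local event, or sending a message: bump the clock and stamp the event
    uint64_t tick() { return ++time; }

    // receiving a message stamped remote_ts: jump past it, then count the receive event
    uint64_t receive(uint64_t remote_ts) {
        time = std::max(time, remote_ts) + 1;
        return time;
    }
};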

relaxed (totally ordered only within each variable), acquire-release (partially ordered), sequentially consistent (totally ordered)

'serializability' from the database literature, where event = transaction compressed to a point: "a transaction schedule is serializable if its outcome (e.g., the resulting database state) is equal to the outcome of its transactions executed serially, i.e., sequentially without overlapping in time" -- https://en.wikipedia.org/wiki/Serializability

others from http://www.cs.rice.edu/~druschel/comp413/lectures/replication.html (should we include this?):

http://corensic.wordpress.com/2011/06/13/understanding-violations-of-sequential-consistency/ gives an example where it's hard to explain a delay in terms of a single reordering (but you can explain it when you realize that different processes can observe different reorderings):

P1: A = 1; t1 = A; t2 = B

P2: B = 1; t3 = B; t4 = A

assert(!(t2 == 0 and t4 == 0))

the assertion can fail because it may take a long time for the writes of 1 to A and B to propagate inter-processor, even though those same writes were read locally immediately.

http://people.engr.ncsu.edu/efg/506/sum06/lectures/notes/lec20.pdf gives an example that can fail on read/read or write/write reordering (i think) but not on write/read reordering (x/y being read as "reorder the x, which in the program code is before the y, to after the y"):

P1: A = 1; flag = 1;

P2: while (flag == 0); assert(A == 1);

Example from http://people.engr.ncsu.edu/efg/506/sum06/lectures/notes/lec20.pdf of something that fails under processor consistency but not (i think) under causal ordering:

P1: A = 1;

P2: while (A==0); B = 1;

P3: while (B==0); print A;

everyone likes this example (under sequential consistency at least one of the prints must show 1; with write/read reordering, i.e. store buffering, both can print 0):

P1: A = 1; print B;

P2: B = 1; print A;

and this similar one, which is part of Dekker's algorithm for critical sections:

P1: A = 1; if (B == 0) {enter critical section}

P2: B = 1; if (A == 0) {enter critical section}

note that if there are any delays, or write/read reorderings, both threads may enter the critical section
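
a C++ rendering of that handshake (my own sketch, not from the cited notes): with std::atomic's default sequentially consistent ordering it is impossible for both loads to read 0, so at most one thread enters; with memory_order_relaxed, or plain variables on hardware with store buffering, both may enter:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A{0}, B{0};
std::atomic<int> inside{0};   // how many threads "entered the critical section"

void p1() {
    A.store(1);               // default ordering: memory_order_seq_cst
    if (B.load() == 0)
        ++inside;             // enter critical section
}

void p2() {
    B.store(1);
    if (A.load() == 0)
        ++inside;
}

int main() {
    std::thread t1(p1), t2(p2);
    t1.join();
    t2.join();
    assert(inside.load() <= 1);   // holds under seq_cst; not guaranteed under relaxed
}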

Example of counterintuitive things with 'C relaxed' ordering:

" For example, with x and y initially zero,

Thread 1: r1 = atomic_load_explicit(y, memory_order_relaxed); atomic_store_explicit(x, r1, memory_order_relaxed);

Thread 2: r2 = atomic_load_explicit(x, memory_order_relaxed); atomic_store_explicit(y, 42, memory_order_relaxed);

is allowed to produce r1 == r2 == 42. "

consistency models in which we distinguish 'synchronization operations' and give them stronger consistency properties ( http://people.cs.pitt.edu/~mosse/cs2510/class-notes/Consistency1.pdf ; todo i dont quite understand these):

"an operation is defined as a synch operation if it forms a race with any operation in any sequentially consistent execution" -- http://www.cs.utah.edu/~rajeev/cs7820/pres/7820-12.pdf

" Background: Most large-scale distributed systems (i.e ., databases) apply replication for scalability, but can support only weak consistency:

    DNS: Updates are propagated slowly, and inserts may not be immediately visible.
    NEWS: Articles and reactions are pushed and pulled throughout the Internet, such that reactions can be seen before postings.
    Lotus Notes: Geographically dispersed servers replicate documents, but make no attempt to keep (concurrent) updates mutually consistent.
    WWW: Caches all over the place, but there need be no guarantee that you are reading the most recent version of a page." -- http://www.cs.rice.edu/~druschel/comp413/lectures/replication.html (should we include this?):

client-centric consistency properties: assume there is a 'client' who migrates across processes in between each read and write and who accesses shared memory from whichever process they currently reside in. We want to provide guarantees for this client, but only regarding their own data.

todo https://en.wikipedia.org/wiki/Happened-before

Links:

what else? apparently Chu spaces can model this stuff?

event (re)ordering primitives

http://en.cppreference.com/w/c/atomic/memory_order

fences , memory barriers

https://en.wikipedia.org/wiki/Consistency_model

acquire/release ordering vs. sequentially consistent ordering: http://stackoverflow.com/questions/14861822/acquire-release-versus-sequentially-consistent-memory-order

acquire/release in terms of memory barriers: " Acquire semantics is a property which can only apply to operations which read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.

    Release semantics is a property which can only apply to operations which write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order." -- http://preshing.com/20120913/acquire-and-release-semantics/#comment-20810

a release barrier is load/store + store/store, and an acquire barrier is load/load + load/store.
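
a sketch of release/acquire publication in C++ (my own; it is essentially the A/flag example from the reordering examples above): the release store to the flag cannot be reordered before the write of the payload, and the acquire load cannot be reordered after the read of the payload, so a reader that sees the flag set also sees the payload:

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                  // ordinary, non-atomic payload
std::atomic<int> flag{0};

void writer() {
    data = 42;                                     // plain write
    flag.store(1, std::memory_order_release);      // write-release: publish
}

void reader() {
    while (flag.load(std::memory_order_acquire) == 0)  // read-acquire: wait for publish
        ;                                               // spin
    assert(data == 42);        // guaranteed: the acquire synchronizes-with the release
}

int main() {
    std::thread r(reader), w(writer);
    w.join();
    r.join();
}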

in C (and at least in C++, todo check) variables declared 'atomic' are automatically also treated like 'volatile', i.e. the compiler assumes that other threads may be reading/writing them (cite http://corensic.wordpress.com/2011/05/09/dealing-with-benign-data-races-the-c-way/ )

Links:

branching time

todo

Links:

Chapter : shared memory

note: when i say 'shared' here i am including a distributed or remote memory resource that multiple processes can write to, e.g. a database. this is very different from how the term 'shared memory' is usually used so mb i should change that. see https://en.wikipedia.org/wiki/Distributed_shared_memory

in fact, when thinking about pitfalls of shared memory, i find it easiest if i imagine the 'shared memory' to be a virtual construct which is actually realized by message passing between separate processes with their own caches.

pros and cons

Pros:

Cons:

ACID properties

hardware shared memory

UMA, ccNUMA

cache coherence

consistency model

PGAS Partitioned Global Address Space (non-uniform memory with processor affinity)

Links:

distributed shared memory

DSM

atomic primitives

atomic primitives

Common atomic primitives:

Simple primitives:

Read-modify-write primitives:

Atomic primitives can be ranked by their consensus number, which says how many concurrent processes can be brought to consensus in a wait-free manner using the primitive (todo: i have only the vaguest idea what this sentence means) (Herlihy, Maurice (January, 1991). "Wait-free synchronization". ACM Trans. Program. Lang. Syst. 13 (1): 124–149. doi:10.1145/114005.102808). From https://en.wikipedia.org/wiki/Read-modify-write:

So, the only primitives from the first list above with infinite consensus number are Compare-and-swap and Load-Link/Store-Conditional. I believe that either of these can be used to implement the others, but i'm not positive. Both of these are common: "As of 2013, most multiprocessor architectures support CAS in hardware." ; "All of Alpha, PowerPC?
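
as an illustration of building one read-modify-write primitive out of compare-and-swap (a sketch of the standard retry-loop idiom, my own, not taken from the sources above): fetch-and-add written in terms of C++'s compare_exchange_weak:

#include <atomic>

// returns the value observed just before our update, like std::atomic's fetch_add
int fetch_add_via_cas(std::atomic<int>& x, int delta) {
    int old = x.load();
    // on failure, compare_exchange_weak reloads the current value into `old`,
    // so the loop simply retries until no other thread raced with us
    while (!x.compare_exchange_weak(old, old + delta)) {
    }
    return old;
}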