books-programmingLanguages-programmingLanguagesPartConcurrency

Table of Contents for Programming Languages: a survey

Chapter : introduction to concurrency

"For the past 30 years, computer performance has been driven by Moore's Law; from now on, it will be driven by Amdahl's Law. Writing code that effectively exploits multiple processors can be very challenging." -- Doron Rajwan

see also "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", by Herb Sutter

"Concurrency is the next major revolution in how we write software" -- Herb Sutter

 "Commonly used programming models are prone to subtle, hard to reproduce bugs, and parallel programs are notoriously hard to test due to data races, non-deterministic interleavings, and complex memory models,"

https://www.google.com/search?q=semaphore+monitor

optimistic vs pessimistic concurrency

memory models (architecture)

(contrast with programming language 'memory models', which we discuss later under the name 'memory order')

shared memory vs "shared nothing"

shared memory

NUMA

PGAS

(mb should put this one into a separate section about non-uniform memory models)

https://en.wikipedia.org/wiki/Partitioned_global_address_space

deterministic paradigms

nondeterministic paradigms

asynchronous

synchronous

note: sometimes the words 'synchronous' and 'synchronization' are used to mean 'two threads observe the same ordering of some events', rather than 'two threads are doing the same thing at the same time'.

composable

non-composable

defn data race

when the final result of the computation may differ depending on which ordering, out of a nondeterministic set of possible orderings of events, actually occurs? (todo: is that right? that is arguably the broader notion of a 'race condition'; a narrower, memory-level definition of 'data race' is given below)

e.g. two threads try to increment a counter simultaneously: each reads the old value, adds one, and writes the result back; if both read the same old value, one increment is lost (a code sketch follows below, after the definition quotes).

similar example: two ATMs try to withdraw from the same bank account at the same time. both succeed; 'double spend'; global invariant not maintained.

or mb defn is: two or more operations access one memory location concurrently, and at least one of them is a write

" assume the language allows distinguishing between synchronization and ordinary (non-synchronization or data) operations (see below). We say that two memory operations conflict if they access the same memory location (e.g., variable or array element), and at least one is a write. We say that a program (on a particular input) allows a data race if it has a sequentially consistent execution (i.e., a program-ordered interleaving of operations of the individual threads) in which two conflicting ordinary operations execute “simultaneously”. For our purposes, two operations execute “simultaneously” if they occur next to each other in the interleaving and correspond to different threads. Since these operations occur adjacently in the interleaving, we know that they could equally well have occurred in the opposite order; there are no intervening operations to enforce the order.... it is also possible to define data races as conflicting accesses not ordered by synchronization, as is done in Java. These definitions are essentially equivalent." -- Adve, Boehm. Memory Models: A Case for Rethinking Parallel Languages and Hardware

" First of all, let’s make sure we know what we’re talking about. In current usage a data race is synonymous with a low-level data race, as opposed to a high-level race that involves either multiple memory locations, or multiple accesses per thread. Everybody agrees on the meaning of data conflict, which is multiple threads accessing the same memory location, at least one of them through a write. But a data conflict is not necessarily a data race. In order for it to become a race, one more condition must be true: the access has to be “simultaneous.” ... In fact, most languages try to provide the so called DRF (Data Race Free) guarantee, which states that all executions of data-race-free programs are sequentially consistent. Don’t be alarmed by the apparent circularity of the argument: you start with sequentially consistent executions to prove data-race freedom and, if you don’t find any data races, you conclude that all executions are sequentially consistent. But if you do find a data race this way, then you know that non-sequentially-consistent executions are also possible. " -- http://corensic.wordpress.com/category/memory-model/

defn synchronization

when there is a data race, you must 'synchronize', here meaning use some mechanism to constrain the nondeterminism in orderings (possibly all the way to determinism) (todo is that defn right?)

to implement synchronization over shared memory you need some sort of read-modify-write primitive, in general one with infinite consensus number (see 'atomic primitives', below)

definition liveness

threads should make forward progress, not get 'stuck' (todo is that defn right?)

(todo group under above)

(todo which are built in terms of the others)

(todo which are more or less efficient)

Chapter : coarse-grained control flows (threads and processes) :

thread

pthreads

green thread

coroutines, cooperative multitasking vs preemptive

green thread hierarchy

with green threads/coroutines/fibers you can often feasibly have many more concurrent threads than with OS threads; e.g. Erlang, e.g. http://blog.paralleluniverse.co/post/64210769930/spaceships2

subroutine call

coroutine (synchronous green thread)

  https://en.wikipedia.org/wiki/Fiber_%28computer_science%29

async green thread (a subroutine call or coroutine at the runtime level); these could share scoped variables via a saguaro (cactus) stack. The runtime can transform code written like blocking I/O into async I/O by making an async I/O call, then yielding the green thread and not scheduling it again until the async I/O call completes.

green process (separate address space, but still a subroutine call or coroutine at the runtime level), e.g. Erlang processes (todo: are these async?)

thread (shared address space, OS scheduler)

process

fork

and join

task

worker

worker pool

deferred task queues

Chapter: fine-grained control flows

(todo: am i using the phrase fine-grained parallelism correctly? does GPGPU stuff fit here?)

parallelism annotation

e.g. Haskell's par

data parallelism

does this belong in this chapter? i think so...

pmap

parallel data types which are automatically evaluated in parallel when a function like 'map' is applied to them
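
a rough pmap-like sketch (my addition; it assumes a C++17 standard library that actually implements the parallel execution policies, e.g. libstdc++ built with TBB, otherwise the calls may run serially): the ordinary 'map' and 'reduce' calls, with an execution policy requesting data-parallel evaluation:

#include <algorithm>
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> xs(1'000'000);
    std::iota(xs.begin(), xs.end(), 0.0);

    // "map": apply a pure function to every element, possibly in parallel
    std::transform(std::execution::par, xs.begin(), xs.end(), xs.begin(),
                   [](double x) { return x * x; });

    // "reduce": combine the results, also possibly in parallel
    double sum = std::reduce(std::execution::par, xs.begin(), xs.end(), 0.0);
    std::cout << sum << '\n';
}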

compiler implementation of nested data parallelism

http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/ndpslides.pdf

MIMD vs SIMD

Flynn's taxonomy: SISD, SIMD, MISD, MIMD (plus SPMD, a common later addition that is not one of Flynn's original four)

GPGPU

a turing-universal SIMD mechanism can emulate MIMD, but only without IPC and communication with main memory

e.g. https://www.google.com/search?q=simd+emulate+mimd

and data and code parallelism

active data

book rec: connection machine

Chapter: message passing

message passing

pipe

socket

and channel... ?

pub/sub

signals

actor

csp

synchronous (unbuffered) and async (buffered)

Chapter: sequencing around communication

receiving data

blocking

always use timeouts!

callbacks

and futures etc

callbacks

con: "...the cumbersome use of callbacks. This, of course, is a problem with many asynchronous computations: you gain concurrency but lose the natural code flow. It’s hard for a programmer to reason about which line of code executes on which thread, and passing information from one callback to another is cumbersome as well." -- http://blog.paralleluniverse.co/post/64210769930/spaceships2

futures

'promises' are used somewhat interchangeably, although some languages have both 'promises' and 'futures' and distinguish between them

implicit (it's just a value from the program's point of view) vs. explicit (it's a structure that you query to see if it's been bound yet)

in some languages, the writable side of the future is separated from the readable side (like a pipe). This allows you to e.g. copy the readable side and pass it to 5 consumers, and pass the writable side to 1 producer, without fearing that one of the consumers will overstep their authority and try to write to the future.

in clojure, it's easy to make futures behave like ordinary variables, because stateful variables must be 'deref'd anyway

clojure has something called a 'promise' and something called a 'future'. The difference is that with a clojure 'future', at the time of future creation you bind an expression that will be evaluated to give the result; whereas with a clojure 'promise', the promise is just a structure that you later bind a result to, e.g. the computation that will produce the result need not be determined yet at the time of promise creation. (see http://stackoverflow.com/questions/4623536/how-do-clojure-futures-and-promises-differ )

in c++, futures are (see http://stackoverflow.com/questions/12620186/futures-vs-promises ) the read side and promises are the write side.
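
a minimal C++ sketch of that split (std::promise and std::future are the standard types; the producer/consumer arrangement is my own illustration): the promise is the write side, and a copyable shared_future is the read side that can be handed to several consumers, none of whom can write to it:

#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::promise<int> write_side;   // held only by the producer
    std::shared_future<int> read_side = write_side.get_future().share();

    // several consumers get copies of the read side; they block until the value is set
    std::vector<std::thread> consumers;
    for (int i = 0; i < 3; ++i)
        consumers.emplace_back([read_side, i] {
            std::cout << "consumer " << i << " saw " << read_side.get() << '\n';
        });

    std::thread producer([&write_side] { write_side.set_value(42); });  // fulfil exactly once

    producer.join();
    for (auto& c : consumers) c.join();
}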

A hardware version is full/empty bits, found for example in the https://en.wikipedia.org/wiki/Cray_MTA : "For example, an array A is initially written with "empty" bits, and any thread reading a value from A blocks until another thread writes a value.". In the Cray MTA, the blocking on empty bits eventually times out. Opcodes are provided which ignore full/empty status.

await

labeled lines

like in neuroscience ('labeled line' coding) -- always have an answer even if your inputs aren't ready (the answer can be 'i don't know')

sending data

yield

(just like in the serial case with coroutines)

return

(just like in the serial case)

send a message thru a channel

pub/sub

alter a 'shared' memory

events

generalize to unordered eventcounts (cite Unify: A scalable, loosely-coupled, distributed shared memory multicomputer; that article reserves the term 'eventual consistency' when there is a maximum time duration until consistency)

condition variable

supports 3 operations

wait: go to sleep until you get woken (see below)

signal: wake one thread waiting on condVar

broadcast: wake all threads waiting on condVar

example usage: producer/consumer: every now and then the producer sends some data, and the consumer is supposed to do something with it. It would be a waste of time for the consumer to busy-wait for new data. Solution: the producer signals the condition variable when it produces something; the consumer sleeps on the condition variable, then wakes up and processes what the producer produced (see the sketch below).
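
a minimal sketch of this pattern with C++'s std::mutex and std::condition_variable (my own illustration of the wait/signal/broadcast operations above); note the predicate passed to wait, which re-checks the condition under the lock and so also guards against spurious wakeups:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::queue<int> q;
bool done = false;

void producer() {
    for (int i = 0; i < 5; ++i) {
        {
            std::lock_guard<std::mutex> lk(m);
            q.push(i);
        }
        cv.notify_one();             // "signal": wake one waiting consumer
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_all();                 // "broadcast": wake everyone so they can exit
}

void consumer() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        // "wait": sleep until woken AND the predicate holds
        cv.wait(lk, [] { return !q.empty() || done; });
        if (!q.empty()) {
            int v = q.front(); q.pop();
            lk.unlock();
            std::cout << "consumed " << v << '\n';
        } else if (done) {
            return;
        }
    }
}

int main() {
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}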

todo should this also be mentioned in 'locks and friends'?

dataflow

? or should this be in a different chapter

event-driven programming, pub/sub

? or should this be in a different chapter

Chapter: models of concurrent time

motivating problem:

Logical time and logical clocks

Usually we track time with physical time, e.g. we assign a global real number representing time to each event. But one can instead consider only the ordering of events; in computation, it often does not matter that step 3 occurred at 13:30 and step 4 occurred at 14:53, just that step 4 came after step 3. So we might consider a discrete version of time.

Sometimes it is unclear whether some events came before or after others. For example, you may have a distributed system with a bunch of processor nodes whose clocks are not synchronized very precisely. Or, you may be working with 'events' that are actually intervals, for example 'processor A executes subroutine X' and 'processor B executes subroutine Y', in which case two 'events' can overlap. Or, you may be working with events whose ordering relative to each other is known in some cases but unknown in others; for example, two concurrent threads each take an action A1, then wait for the other at a synchronization barrier, then take another action A2; you know that thread 1's A1 preceded thread 2's A2, but you don't know whether thread 1's A1 preceded thread 2's A1.

For these reasons we often model logical time not as a totally ordered structure (similar to the natural numbers), but rather as a partial order.

In many applications we assume that each individual process's logical time is totally ordered (that is, that each individual process knows, for any pair of events that it observed or generated, which one came first; therefore all the events observed or generated by an individual process can be arranged in order as a line), but that the global logical time of a concurrent distributed system is not.

We define a logical time by defining an ordering relation on events, such as 'happens-before' or 'caused' or 'might-have-caused'.

A logical clock is a function C whose domain is events and whose range is some representation of logical time, that is, C maps events to times, or said another way, for any event e, C(e) tells us what time that event occurred at. We call the value returned by C(e) a 'timestamp'. We place a requirement on C that relates the ordering of the timestamps to the ordering of events; "if e1 < e2, then C(e1) < C(e2)".

See also 'logical clocks for consistency', below, in which we discuss schemes by which processes communicate with each other about their logical clocks for the purpose of combining their local logical time into a global logical time.

Links:

event (re)ordering models

event (re)ordering models is my own term; "memory model" in programming languages; "consistency model" in theory

for some authors, in shared memory, a "coherence model" is about event (re)ordering for events pertaining to a single item in memory; a "consistency model" is about event (re)ordering for events across different items. Typically, stronger guarantees are provided for coherence than for consistency; typically, coherence is taken to mean that every observer is guaranteed to observe the same sequence of values for any single item. For other authors, "coherence" and "consistency" are synonyms.

https://en.wikipedia.org/wiki/Consistency_model

partial and total orders

one technique is to attach integer timestamps to each event. Each processor sends out each event with a unique timestamp which is greater than all timestamps it has ever sent or observed (Lamport).
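
a sketch of that rule in C++ (my own; the type and method names are made up): each process keeps a counter, bumps it for local and send events, and on receive jumps past the incoming timestamp before counting the receive event. This guarantees 'if e1 happens-before e2 then C(e1) < C(e2)', but not the converse:

#include <algorithm>
#include <cstdint>

struct LamportClock {
    uint64_t time = 0;

    // local event, or sending a message: bump the clock and stamp the event
    uint64_t tick() { return ++time; }

    // receiving a message stamped remote_ts: jump past it, then count the receive event
    uint64_t receive(uint64_t remote_ts) {
        time = std::max(time, remote_ts) + 1;
        return time;
    }
};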

relaxed (totally ordered only within each variable), acquire-release (partially ordered), sequentially consistent (totally ordered)

'serializability' from the database literature, where event = transaction compressed to a point: "a transaction schedule is serializable if its outcome (e.g., the resulting database state) is equal to the outcome of its transactions executed serially, i.e., sequentially without overlapping in time" -- https://en.wikipedia.org/wiki/Serializability

others from http://www.cs.rice.edu/~druschel/comp413/lectures/replication.html (should we include this?):

http://corensic.wordpress.com/2011/06/13/understanding-violations-of-sequential-consistency/ gives an example where it's hard to explain a delay in terms of a single reordering (but you can explain it when you realize that different processes can observe different reorderings):

P1: A = 1; t1 = A; t2 = B

P2: B = 1; t3 = B; t4 = A

assert(!(t2 == 0 and t4 == 0))

the assertion can fail because it may take a long time for the writes of 1 to A and B to propagate inter-processor, even though those same writes were read locally immediately.

http://people.engr.ncsu.edu/efg/506/sum06/lectures/notes/lec20.pdf gives an example that can fail on read/read or write/write reordering (i think) but not on write/read reordering (x/y being read as "reorder the x, which in the program code is before the y, to after the y"):

P1: A = 1; flag = 1;

P2: while (flag == 0); assert(A == 1);

Example from http://people.engr.ncsu.edu/efg/506/sum06/lectures/notes/lec20.pdf of something that fails under processor consistency but not (i think) under causal ordering:

P1: A = 1;

P2: while (A==0); B = 1;

P3: while (B==0); print A;

everyone likes this example (under sequential consistency at least one of the prints must show 1; with write/read reordering, i.e. store buffering, both can print 0):

P1: A = 1; print B;

P2: B = 1; print A;

and this similar one, which is part of Dekker's algorithm for critical sections:

P1: A = 1; if (B == 0) {enter critical section}

P2: B = 1; if (A == 0) {enter critical section}

note that if there are any delays, or write/read reorderings, both threads may enter the critical section
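
a C++ rendering of that handshake (my own sketch, not from the cited notes): with std::atomic's default sequentially consistent ordering it is impossible for both loads to read 0, so at most one thread enters; with memory_order_relaxed, or plain variables on hardware with store buffering, both may enter:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A{0}, B{0};
std::atomic<int> inside{0};   // how many threads "entered the critical section"

void p1() {
    A.store(1);               // default ordering: memory_order_seq_cst
    if (B.load() == 0)
        ++inside;             // enter critical section
}

void p2() {
    B.store(1);
    if (A.load() == 0)
        ++inside;
}

int main() {
    std::thread t1(p1), t2(p2);
    t1.join();
    t2.join();
    assert(inside.load() <= 1);   // holds under seq_cst; not guaranteed under relaxed
}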

Example of counterintuitive things with 'C relaxed' ordering:

" For example, with x and y initially zero,

Thread 1: r1 = atomic_load_explicit(y, memory_order_relaxed); atomic_store_explicit(x, r1, memory_order_relaxed);

Thread 2: r2 = atomic_load_explicit(x, memory_order_relaxed); atomic_store_explicit(y, 42, memory_order_relaxed);

is allowed to produce r1 == r2 == 42. "

consistency models in which we distinguish 'synchronization operations' and give them stronger consistency properties ( http://people.cs.pitt.edu/~mosse/cs2510/class-notes/Consistency1.pdf ; todo i dont quite understand these):

"an operation is defined as a synch operation if it forms a race with any operation in any sequentially consistent execution" -- http://www.cs.utah.edu/~rajeev/cs7820/pres/7820-12.pdf

" Background: Most large-scale distributed systems (i.e ., databases) apply replication for scalability, but can support only weak consistency:

    DNS: Updates are propagated slowly, and inserts may not be immediately visible.
    NEWS: Articles and reactions are pushed and pulled throughout the Internet, such that reactions can be seen before postings.
    Lotus Notes: Geographically dispersed servers replicate documents, but make no attempt to keep (concurrent) updates mutually consistent.
    WWW: Caches all over the place, but there need be no guarantee that you are reading the most recent version of a page." -- http://www.cs.rice.edu/~druschel/comp413/lectures/replication.html (should we include this?):

client-centric consistency properties: assume there is a 'client' who migrates across processes in between each read and write and who accesses shared memory from whichever process they currently reside in. We want to provide guarantees for this client, but only regarding their own data.

todo https://en.wikipedia.org/wiki/Happened-before

Links:

what else? apparently Chu spaces can model this stuff?

event (re)ordering primitives

http://en.cppreference.com/w/c/atomic/memory_order

fences , memory barriers

https://en.wikipedia.org/wiki/Consistency_model

acquire/release ordering vs. sequentially consistent ordering: http://stackoverflow.com/questions/14861822/acquire-release-versus-sequentially-consistent-memory-order

acquire/release in terms of memory barriers: " Acquire semantics is a property which can only apply to operations which read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.

    Release semantics is a property which can only apply to operations which write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order." -- http://preshing.com/20120913/acquire-and-release-semantics/#comment-20810

a release barrier is load/store + store/store, and an acquire barrier is load/load + load/store.
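
a sketch of release/acquire publication in C++ (my own; it is essentially the A/flag example from the reordering examples above): the release store to the flag cannot be reordered before the write of the payload, and the acquire load cannot be reordered after the read of the payload, so a reader that sees the flag set also sees the payload:

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                  // ordinary, non-atomic payload
std::atomic<int> flag{0};

void writer() {
    data = 42;                                     // plain write
    flag.store(1, std::memory_order_release);      // write-release: publish
}

void reader() {
    while (flag.load(std::memory_order_acquire) == 0)  // read-acquire: wait for publish
        ;                                               // spin
    assert(data == 42);        // guaranteed: the acquire synchronizes-with the release
}

int main() {
    std::thread r(reader), w(writer);
    w.join();
    r.join();
}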

in C (and at least in C++, todo check) variables declared 'atomic' are automatically also treated like 'volatile', i.e. the compiler assumes that other threads may be reading/writing them (cite http://corensic.wordpress.com/2011/05/09/dealing-with-benign-data-races-the-c-way/ )

Links:

branching time

todo

Links:

Chapter : shared memory

note: when i say 'shared' here i am including a distributed or remote memory resource that multiple processes can write to, e.g. a database. this is very different from how the term 'shared memory' is usually used so mb i should change that. see https://en.wikipedia.org/wiki/Distributed_shared_memory

in fact, when thinking about pitfalls of shared memory, i find it easiest if i imagine the 'shared memory' to be a virtual construct which is actually realized by message passing between separate processes with their own caches.

pros and cons

Pros:

Cons:

ACID properties

hardware shared memory

UMA, ccNUMA

cache coherence

consistency model

PGAS Partitioned Global Address Space (non-uniform memory with processor affinity)

Links:

distributed shared memory

DSM

atomic primitives

atomic primitives

Common atomic primitives:

Simple primitives:

Read-modify-write primitives:

Atomic primitives can be ranked by their consensus number, which says how many concurrent processes can be brought to consensus in a wait-free manner using the primitive (todo: i have only the vaguest idea what this sentence means) (Herlihy, Maurice (January, 1991). "Wait-free synchronization". ACM Trans. Program. Lang. Syst. 13 (1): 124–149. doi:10.1145/114005.102808). From https://en.wikipedia.org/wiki/Read-modify-write:

So, the only primitives from the first list above with infinite consensus number are Compare-and-swap and Load-Link/Store-Conditional. I believe that either of these can be used to implement the others, but i'm not positive. Both of these are common: "As of 2013, most multiprocessor architectures support CAS in hardware." ; "All of Alpha, PowerPC?
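
as an illustration of building one read-modify-write primitive out of compare-and-swap (a sketch of the standard retry-loop idiom, my own, not taken from the sources above): fetch-and-add written in terms of C++'s compare_exchange_weak:

#include <atomic>

// returns the value observed just before our update, like std::atomic's fetch_add
int fetch_add_via_cas(std::atomic<int>& x, int delta) {
    int old = x.load();
    // on failure, compare_exchange_weak reloads the current value into `old`,
    // so the loop simply retries until no other thread raced with us
    while (!x.compare_exchange_weak(old, old + delta)) {
    }
    return old;
}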