proj-plbook-plChSharedMemory

Table of Contents for Programming Languages: a survey

Chapter : shared memory

note: when i say 'shared' here i am including a distributed or remote memory resource that multiple processes can write to, e.g. a database. this is very different from how the term 'shared memory' is usually used so mb i should change that. see https://en.wikipedia.org/wiki/Distributed_shared_memory

in fact, when thinking about pitfalls of shared memory, i find it easiest if i imagine the 'shared memory' to be a virtual construct which is actually realized by message passing between separate processes with their own caches.

pros and cons

Pros:

Cons:

ACID properties

hardware shared memory

UMA, ccNUMA

cache coherence

consistency model

PGAS Partitioned Global Address Space (non-uniform memory with processor affinity)

Links:

distributed shared memory

DSM

atomic primitives

Common atomic primitives:

Simple primitives:

Read-modify-write primitives:

Atomic primitives can be ranked by their consensus number: the maximum number of concurrent processes that can be brought to consensus in a wait-free manner using that primitive (todo: i have only the vaguest idea what this sentence means) (Herlihy, Maurice (January 1991). "Wait-free synchronization". ACM Trans. Program. Lang. Syst. 13 (1): 124–149. doi:10.1145/114005.102808). From https://en.wikipedia.org/wiki/Read-modify-write:

So, the only primitives from the first list above with infinite consensus number are Compare-and-swap and Load-Link/Store-Conditional. I believe that either of these can be used to implement the others, but i'm not positive. Both of these are common: "As of 2013, most multiprocessor architectures support CAS in hardware." ; "All of Alpha, PowerPC, MIPS, and ARM provide LL/SC instructions" -- https://en.wikipedia.org/wiki/Compare-and-swap and https://en.wikipedia.org/wiki/Load-Link/Store-Conditional

https://en.wikipedia.org/wiki/Compare-and-swap claims that "As of 2013, the compare-and-swap operation is the most popular synchronization primitive for implementing both lock-based and nonblocking concurrent data structures."
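To make this concrete, here is a minimal C++ sketch (my own, assuming C++11 std::atomic) of the usual CAS retry loop, used here to build a lock-free increment; the function name is just for illustration:

{{{
#include <atomic>

void lock_free_increment(std::atomic<int>& counter) {
    int expected = counter.load();
    // compare_exchange_weak writes (expected + 1) only if 'counter' still
    // holds 'expected'; on failure it reloads 'expected' with the current
    // value, so we simply retry until our CAS wins the race.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // another thread modified 'counter' between our read and our CAS
    }
}
}}}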

When using compare-and-swap, you have to be wary of the ABA problem, in which the programmer erroneously assumes that if some memory location was A in the past and is A now, then it has not been modified in between the past and now -- in reality it's possible that its value was changed to a new value B in the meantime, but then was changed back to A. Therefore, if you store the old value A of a memory location, and then later compare-and-swap, comparing to the old value A, it is possible that the CAS will succeed even if the memory location was modified in the meantime, provided its value has been changed back to A before the CAS.

One workaround for the ABA problem is to add a counter to the memory location [1] [2]; for example, a 32-bit value V can be augmented with a 32-bit counter C to yield a 64-bit word representing the tuple (V, C). Each time the algorithm writes a new V to the memory location, it also increments C; now even if the V being written is identical to an older V, the C will almost certainly be different. The only failure case is if the counter wraps around and exactly 2^32 updates have occurred in between accesses from some observer. One solution might be to not let the counter wrap around, and to require a global synchronization if it ever reaches 2^32. (How many bits are needed for the counter? [3] suggests that 32 bits are usually sufficient but that 16 bits are often not.)
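A sketch of this version-counter workaround, assuming a 32-bit value packed together with a 32-bit counter into a single 64-bit atomic word (the names and packing layout are just for illustration):

{{{
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> cell{0};   // high 32 bits: counter C, low 32 bits: value V

// Replace the value with new_value only if it is currently expected_value.
bool update_value(uint32_t expected_value, uint32_t new_value) {
    uint64_t old_word = cell.load();
    for (;;) {
        uint32_t cur_value   = static_cast<uint32_t>(old_word);
        uint32_t cur_counter = static_cast<uint32_t>(old_word >> 32);
        if (cur_value != expected_value)
            return false;                  // value changed; let the caller decide what to do
        // The new word carries the new value and an incremented counter, so
        // an A -> B -> A history of the value still changes the packed word.
        uint64_t new_word =
            ((static_cast<uint64_t>(cur_counter) + 1) << 32) | new_value;
        if (cell.compare_exchange_weak(old_word, new_word))
            return true;                   // CAS saw both V and C unchanged
        // on failure, old_word now holds the current packed word; retry
    }
}
}}}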

An issue in the use of atomics is that the hardware generally only provides atomics for specific bit-widths. E.g. there might be a way to atomically write to a single byte, but not a way to atomically replace all of the values in a 200-byte array with others.

If a programmer uses a high-level language that provides atomic variables that support atomic assignment, they must be careful to understand what is and isn't atomic. For example, in many languages, if a is an atomic variable, "a = a + 1" is not, as a whole, atomic. This would often be compiled into an atomic read from the memory location containing "a" into a register, followed by an increment of that register, followed by an atomic write of the value in that register into the memory location containing "a".
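For instance, in C++ (assuming std::atomic), the difference looks like this:

{{{
#include <atomic>

std::atomic<int> a{0};

void not_atomic_as_a_whole() {
    int tmp = a.load();   // atomic read into a "register"
    tmp = tmp + 1;        // plain increment
    a.store(tmp);         // atomic write -- another thread may have written in between
}

void atomic_as_a_whole() {
    a.fetch_add(1);       // a single atomic read-modify-write
}
}}}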

Links:

" 26

http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

bool x = false; bool y = false; int z = 0;

a() { x = true; }
b() { y = true; }
c() { while (!x); if (y) z++; }
d() { while (!y); if (x) z++; }

// kick off a, b, c, d, join all threads
assert(z != 0);

Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.

...

With ack_rel, c() can perceive that x=true; in a() happens before y=true; in b() at the same time d() can perceive that y=true; happens before x=true; (due to lack of "global ordering".) In particular c() can perceive x==true and y==false at the same time d() can perceive y==true and x==false. So z might not be incremented by either of c() or d(). With seq_cst, if c() perceives x=true; happens before y=true;, so does d(). – nodakai Jan 20 '16 at 11:03

...

@nodakai, Your explanation is accurate but I think the phrase "happens before" can be misleading since the crux of the issue with acquire-release is that neither of the writes happens-before the other. – jhoffman0x May 1 '18 at 3:50 "
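To make that example concrete, here is a runnable C++11 rendering (adapted from the cppreference page linked above); with every operation using memory_order_seq_cst the final assert cannot fire, whereas acquire/release alone would not give that guarantee:

{{{
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x{false};
std::atomic<bool> y{false};
std::atomic<int>  z{0};

void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }

void read_x_then_y() {
    while (!x.load(std::memory_order_seq_cst)) { }   // wait for x
    if (y.load(std::memory_order_seq_cst)) ++z;
}

void read_y_then_x() {
    while (!y.load(std::memory_order_seq_cst)) { }   // wait for y
    if (x.load(std::memory_order_seq_cst)) ++z;
}

int main() {
    std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z != 0);   // guaranteed only under sequential consistency
}
}}}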

locks and friends

mutex (lock)

build out of critical sections or atomic primitives

can be used to implement critical sections

also https://en.wikipedia.org/wiki/Reentrant_mutex

also https://en.wikipedia.org/wiki/Spinlock
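As an illustration of building a lock directly out of an atomic primitive, here is a minimal test-and-set spinlock sketch in C++ (assuming std::atomic; a production spinlock would add backoff and yielding):

{{{
#include <atomic>

class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // exchange atomically sets 'locked' to true and returns the old
        // value; we own the lock once we are the thread that flipped it
        // from false to true.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // busy-wait
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
}}}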

todo: is there any difference between 'lock' and 'mutex'?

if you can do this you are good (theorem?):

Links:

synchronization barriers

critical sections

critical section: a region of code (shared across threads) such that only one thread can be executing it at any given time

critical sections can be built out of semaphores, or vice versa (i think; see https://en.wikipedia.org/wiki/Mutual_exclusion https://en.wikipedia.org/wiki/Critical_section )

or can be built out of atomic primitives

semaphore

generalization of a mutex: up to n threads may hold the semaphore at once (a mutex is the special case n = 1). the semaphore counts units of a finite, discrete resource.
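A minimal sketch of a counting semaphore built from a mutex and a condition variable (C++20 provides std::counting_semaphore directly; this is only to show the "n units of a finite resource" idea):

{{{
#include <condition_variable>
#include <mutex>

class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int count;                      // units of the resource currently available
public:
    explicit Semaphore(int n) : count(n) {}

    void acquire() {                // a.k.a. P / wait / down
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return count > 0; });
        --count;
    }

    void release() {                // a.k.a. V / signal / up
        std::lock_guard<std::mutex> lk(m);
        ++count;
        cv.notify_one();
    }
};
}}}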

" Semaphores vs. mutexes

A mutex is essentially the same thing as a binary semaphore and sometimes uses the same basic implementation. The differences between them are in how they are used. While a binary semaphore may be used as a mutex, a mutex is a more specific use-case, in that only the process that locked the mutex is supposed to unlock it. This constraint makes it possible to implement some additional features in mutexes:

Serializing tokens

https://en.wikipedia.org/wiki/Serializing_tokens

composable/deadlock-free

" Tokens are similar to mutexes in that they can, if used correctly, prevent multiple threads from accessing a shared resource at the same time. Unlike mutexes, however, they do NOT exclude other threads from accessing the resource while they are blocked or asleep. "

so if you acquire token A, then do something, then acquire token B, you might block while trying to acquire token B. While you are blocked, the resource controlled by token A becomes unlocked, and other processes can use or modify it out from under you. So after any blocking operation (such as acquiring token B), you should check that any previously acquired resources haven't been modified by someone else in the meantime, if that matters to you; the cleanest idiom is to acquire all the tokens you will need at once, at the beginning of your function (the sketch below shows the analogous idiom with ordinary mutexes).
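Serializing tokens aren't a standard C++ facility, but the "acquire everything up front" idiom just mentioned can be illustrated with ordinary mutexes: std::scoped_lock (C++17) acquires several locks at once using a deadlock-avoidance algorithm (the mutex and function names here are just for illustration):

{{{
#include <mutex>

std::mutex resource_a_lock;
std::mutex resource_b_lock;

void work_on_both() {
    // Acquire every lock we will need before touching either resource,
    // rather than taking A, doing some work, and only then going after B.
    std::scoped_lock both(resource_a_lock, resource_b_lock);
    // ... use both resources; everything is released when 'both' goes out of scope ...
}
}}}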

readers-writer lock

generalization of the mutex, with separate read and write modes of acquisition:

when the writer lock is held, no one else can acquire either the writer lock or a reader lock

when a reader lock is held, anyone else can acquire another reader lock, but the writer lock cannot be acquired

in many implementations, when a thread is blocked trying to acquire the writer lock, no new reader lock can be acquired (this keeps a continuous stream of readers from starving writers)
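A usage sketch of these rules with C++17's std::shared_mutex, where shared_lock plays the role of the reader lock and unique_lock the writer lock (the data structure is just for illustration):

{{{
#include <shared_mutex>
#include <vector>

std::shared_mutex rw;
std::vector<int> data;

int read_first() {
    std::shared_lock<std::shared_mutex> reader(rw);   // many readers may hold this at once
    return data.empty() ? -1 : data.front();
}

void append(int v) {
    std::unique_lock<std::shared_mutex> writer(rw);   // excludes readers and other writers
    data.push_back(v);
}
}}}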

monitor

the monitor is an object whose methods are all critical sections, i.e. there is a mutex associated with each monitor which must be acquired when you call a method, and which is released when the method returns

built out of mutexes
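A minimal sketch of the monitor pattern in C++ (the class and method names are just for illustration; classical monitors also bundle condition variables, which are omitted here):

{{{
#include <mutex>

class CounterMonitor {
    std::mutex m;        // the mutex associated with the monitor
    long value = 0;
public:
    void increment() {
        std::lock_guard<std::mutex> enter(m);   // acquired on method entry
        ++value;
    }                                           // released on method return
    long get() {
        std::lock_guard<std::mutex> enter(m);
        return value;
    }
};
}}}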

other data structures

queue

transactions

side note: there is an interesting connection between reversible computing and transactions: anything done within a transaction must be reversible (undo-able; rollback-able)

stm

(composable/deadlock-free?)

"three fundamental concepts crop up again and again: isolation (of state), atomicity (of state transitions), and consistency (of those atomic transitions) ...

The canonical syntactic footprint of TM is also beautiful and simple. You say:

atomic { ... concurrency-safe code goes here ... }

And everything in that block is magically concurrency-safe. (Well, you still need to ensure the consistency part, but isolation and atomicity are built-in. Mix this with Eiffel- or Spec#-style contracts and assertions like those in .NET 4.0, run at the end of each transaction, and you’re well on your way to verified consistency also. The ‘check E’ work in Haskell was right along these lines.) " --- http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/

Implementation, and some design decisions:

" Turtles, but How Far Down? Or, Bounded vs. Unbounded Transactions

Not all transactions are equal. There is a broad spectrum of TMs, ranging from those that are bounded to updating, say, 4 memory locations or cache lines, to those that are entirely unbounded. ... Models can be pulled along other axes, however, such as whether memory locations must be tagged in order to be used in a transaction or not, etc. Haskell requires this tagging (via TVars) so that side-effects are evident in the type system as with any other kind of monad.

We quickly settled on unbounded transactions. ... hindsight, this was a critical decision that had far-reaching implications. And to be honest, I now frequently doubt that it was the right call. We had our hearts in the right places, and the entire industry was trekking down the same path at the same time (with the notable exception of Haskell). But with the wisdom of age and hindsight, I do believe limited forms of TM could be wildly successful at particular tasks and yet would have avoided many of the biggest challenges with unbounded TM. ... The model of unbounded transactions is the hard part. You surround any arbitrary code with ‘atomic { … }’ and everything just works. It sounds beautiful. But just think about what can happen within a transaction: memory operations, calls to other methods, P/Invokes to Win32, COM Interop, allocation of finalizable objects, I/O calls (disk, network, databases, console, …), GUI operations, lock acquisitions, CAS operations, …, the list goes on and on. ... Three main approaches were seriously considered:

    IL rewriting. Use a tool that passes over the IL post-compilation to inject transaction calls.
    Hook the (JIT) compiler. The runtime itself would intrinsically know how to inject such calls.
    Library-based. All transactional operations would be explicit calls into the TM infrastructure.... we chose approach #2 for our “real” prototype, and never looked back. ...

Throughout, it became abundantly clear that TM, much like generics, was a systemic and platform-wide technology shift. It didn’t require type theory, but the road ahead sure wasn’t going to be easy. ... And we also hit some tough snags early on. Some were trivial, like what happens when an exception is thrown out of an atomic block. Of course that exception was likely constructed within the atomic block (‘throw new SomeException?()’ being the most common form of ‘throw’), so we decided we probably need to smuggle at least some of that exception’s state out of the transaction. Like its stack trace. And perhaps its message string. I wrote the initial incarnation of the CLR exception subsystem support, and stopped at shallow serialization across the boundary. But this was a slippery slope, and eventually the general problem was seen, leading to more generalized nesting models (which I shall describe briefly below). Another snag, which was quite non-trivial, was the impact to the debugging experience. Depending on various implementation choices – like in-place versus buffered writes – you may need to teach the debugger about TM intrinsically. And some of the challenges were fundamental to building a first class TM implementation. Clearly the GC needed to know about TM and its logging, because it needs to keep both the “before” and “after” state of the transaction alive, in case it needed to roll back. The JIT compiler was very quickly splayed open and on the surgery table. And so on. ... There are other overheads that are not so obvious. Optimistic reads mandate that there is a version number for each location somewhere, and pessimistic writes mandate that there is a lock for each location somewhere. ... We quickly settled an approach much like Harris’s (and, at the time, pretty much the industry/research standard): optimistic reads, in-place pessimistic writes, and automatic retry. ... You’re in one atomic block and then enter into another one. What happens if that inner transaction commits or rolls back, before the fate of the outer transaction is known? Intuition guided us to the following answer:

    If the inner transaction rolls back, the outer transaction does not necessarily do so. However, no matter what the outer transaction does, the effects of the inner will not be seen.
    If the inner transaction commits, the effects remain isolated in the outer transaction. It “commits into” the outer transaction, we took to saying. Only if the outer transaction subsequently commits will the inner’s effects be visible; if it rolls back, they are not.... " -- http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/

" A Better Condition Variable

Here’s a brief aside on one of TM’s bonus features.

Some TM variants also provide for “condition variable”-like facilities for coordination among threads. I think Haskell was the first such TM to provide a ‘retry’ and ‘orElse’ capability. When a ‘retry’ is encountered, the current transaction is rolled back, and restarted once the condition being sought becomes true. How does the TM subsystem know when that might be? This is an implementation detail, but one obvious choice is to monitor the reads that occurred leading up to the ‘retry’ – those involved in the evaluation of the predicate – and once any of them changes, to reschedule that transaction to run. Of course, it will reevaluate the predicate and, if it has become false, the transaction will ‘retry’ again.

A simple blocking queue could be written this way. For example:

object TakeOne() {
    atomic {
        if (Count == 0) { retry; }
        return Pop();
    }
}

If, upon entering the atomic block, Count is witnessed as being zero, we issue a retry. The transaction subsystem notices we read Count with a particular version number, and then blocks the current transaction until Count’s associated version number changes. The transaction is then rescheduled, and races to read Count once again. After Count is seen as non-zero, the Pop is attempted. The Pop, of course, may fail because of a race – i.e. we read Count optimistically without blocking out writers – but the usual transaction automatic-reattempt logic will kick in to mask the race in that case.

The ‘orElse’ feature is a bit less obvious, though still rather useful. It enables choice among multiple transactions, each of which may end up issuing a ‘retry’. I don’t think I’ve seen it in any TMs except for ours and Haskell’s.

To illustrate, imagine we’ve got 3 blocking queues like the one above. Now imagine we’d like to take from the first of those three that becomes non-empty. ‘orElse’ makes this simple:

BlockingQueue bq1 = ..., bq2 = ..., bq3 = ...;

atomic {
    object obj = orElse { bq1.TakeOne(), bq2.TakeOne(), bq3.TakeOne() };
}

While ‘orElse’ is perhaps an optional feature, you simply can’t write certain kinds of multithreaded algorithms without ‘retry’. Anything that requires cross-thread communication would need to use spin variables. " -- http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/

" Deliberate Plans of Action: Policy

I waved my hands a bit above perhaps without you even knowing it. When I talk about optimistic, pessimistic, and automatic retry, I am baking in a whole lot of policy. ... There was a similar problem when deciding to back off outer layers of nesting, and in fact this becomes more complicated when deadlocks are involved. Imagine:

// Tx0                          // Tx1
atomic {                        atomic {
    x++;                            y++;
    atomic {                        atomic {
        y++;                            x++;
    }                               }
}                               }

This deadlock-prone example is tricky because rolling back the inner-most transactions won’t be sufficient to break the deadlock that may occur. Instead the TM policy manager needs to detect that multiple levels of nesting are involved and must be blown away in order to unstick forward progress. " -- http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/

Cons: " Disillusionment Part I: the I/O Problem...what about a read or write from a single block or entire file on the filesystem? Or output to the console? Or an entry in the Event Log? What about a web service call across the LAN? Allocation of some native memory?...The answer seemed clear. At least in theory. The transaction literature, including Reuter and Gray’s classic, had a simple answer for such things: on-commit and on-rollback actions, to perform or compensate the logical action being performed, respectively. (Around this same time, the Haskell folks were creating just this capability in their STM, where ‘onCommit’ and ‘onRollback’ would take arbitrary lambdas to execute the said operation at the proper time.) " -- http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/

" For example, how would you treat a lock block that was called from within a transaction? (You might say “that’s just not supported”, but when adding TM to an existing, large ecosystem of software, you’ve got to draw the compatibility line somewhere. If you draw it too carelessly, large swaths of existing software will not work; and in this case, that often meant that we claimed to provide unbounded transactions, and yet we would be placing bounds on them such that a lot of existing software could not be composed with transactions. Not good.) " -- http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/

" Disillusionment Part II: Weak or Strong Atomicity?

All along, we had this problem nipping at our heels. What happens if code accesses the same memory locations from inside and outside a transaction? We certainly expected this to happen over the life of a program: state surely transitions from public and shared among threads to private to a single thread regularly. But if some location were to be accessed transactionally and non-transactionally concurrently, at once, we’d (presumably) have a real mess on our hands. A supposedly atomic, isolated, etc. transaction would no longer be protected from the evils of racey code.

For example:

// Tx0                      // No-Tx
atomic {                    x++;
    x++;
}

... Another approach was static analysis. We could require transactional locations to be tagged, for example. This had the unfortunate consequence of making reusable data structures less, well, reusable. Collections for example presumably need to be usable from within and outside transactions alike. ...

bool itIsOwned = false;
MyObj x = new MyObj();

...

Disillusionment Part III: the Privatization Problem
...
{{{
atomic { // Tx0                         atomic { // Tx1
    // Claim the state for my use:          if (!itIsOwned)
    itIsOwned = true;                           x.field += 42;
}                                       }

int z = x.field;
}}}

...

The Tx0 transaction changes itIsOwned to true, and then commits. After it has committed, it proceeds to using whatever state was claimed (in this case an object referred to by variable x) outside of the purview of TM. Meanwhile, another transaction Tx1 has optimistically read itIsOwned as false, and has gone ahead to use x. An update in-place system will allow that transaction to freely change the state of x. Of course, it will roll back here, because isItOwned changed to true. But by then it is too late: the other thread using x outside of a transaction will see constantly changing state – torn reads even – and who knows what will happen from there. A known flaw in any weakly atomic, update in-place TM.

If this example appears contrived, it’s not. It shows up in many circumstances. The first one in which we noticed it was when one transaction removes a node from a linked list, while another transaction is traversing that same list. If the former thread believes it “owns” the removed element simply because it took it out of the list, someone’s going to be disappointed when its state continues to change.

This, we realized, is just part and parcel of an optimistic TM system that does in-place writes. I don’t know that we ever fully recovered from this blow. It was a tough pill to swallow. After that meeting, everything changed: a somber mood was present and I think we all needed a drink. Nevertheless we plowed forward.

We explored a number of alternatives. And so did the industry at large, because that intern in question published a paper on the problem. One obvious solution is to have a transaction that commits a change to a particular location wait until all transactions that have possibly read that location have completed – a technique we called quiescence. We experimented with this approach, but it was extraordinarily complicated, for obvious reasons.

We experimented with blends of pessimistic operations instead of optimistic, alternative commit protocols, like using a “commit ticket” approach that serializes transaction commits, each of which tended to sacrifice performance greatly. Eventually the team decided to do buffered writes instead of in-place writes, because any concurrent modifications in a transaction will simply not modify the actual memory being used outside of the transaction unless that transaction successfully commits.

This, however, led to still other problems, like the granular loss of atomicity problem. Depending on the granularity of your buffered writes – we chose object-level – you can end up with false sharing of memory locations between transactional and non-transactional code. Imagine you update two separate fields of an object from within and outside a transaction, respectively, concurrently. Is this legal? Perhaps not. The transaction may bundle state updates to the whole object, rather than just one field.

All these snags led to the realization that we direly needed a memory model for TM. " -- http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/

Links:

non-blocking algorithms

(i think this concept is only relevant for shared memory, but am i right?)

https://en.wikipedia.org/wiki/Non-blocking_synchronization

lock-free

wait-free

composable

wait free algorithms

" Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes...It was shown in the 1980s[4] that all algorithms can be implemented wait-free, and many transformations from serial code, called universal constructions, have been demonstrated. However, the resulting performance does not in general match even naïve blocking designs. Several papers have since improved the performance of universal constructions, but still, their performance is far below blocking designs.

Several papers have investigated the hardness of creating wait-free algorithms. For example, it has been shown[5] that the widely available atomic conditional primitives, CAS and LL/SC, cannot provide starvation-free implementations of many common data structures without memory costs growing linearly in the number of threads. But in practice these lower bounds do not present a real barrier as spending a word per thread in the shared memory is not considered too costly for practical systems.

Until 2011, wait-free algorithms were rare, both in research and in practice. However, in 2011 Kogan and Petrank[6] presented a wait-free queue building on the CAS primitive, generally available on common hardware. Their construction expands the lock-free queue of Michael and Scott,[7] which is an efficient queue often used in practice. A follow-up paper by Kogan and Petrank[8] provided a methodology for making wait-free algorithms fast and used this methodology to make the wait-free queue practically as fast as its lock-free counterpart. " -- https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom

Links:

read-copy-update

Let's say you have a linked list being read by a number of concurrent reader processes, and you want to delete one of the items in the middle.

One way would be to say, anyone accessing the list, read or write, must first acquire a lock.

Another way, the read-copy-update (RCU) way, would be to say: anyone reading the list must first make an entry in a shared log saying that they are about to read it, and when they are done, they make another entry in the shared log saying they are done. When you want to delete a node, first you atomically change the 'next' pointer at the previous node so that it no longer points to the node to be deleted. Then you look at the log and identify all of the readers who have started reading and haven't yet finished. Once all of these have finished, you can delete the node. Note that while you're waiting, other readers can start reading, and this doesn't delay you further, because you know they will not see the node that is about to be deleted -- you already changed the 'next' pointer.
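A rough, single-writer sketch of this scheme in C++ (this is not a real RCU implementation; real ones, such as the Linux kernel's, track readers per-CPU or per-epoch so that the writer only waits for readers that started before the unlink, whereas the coarse counter below waits for all in-flight readers):

{{{
#include <atomic>
#include <thread>

struct Node {
    int value;
    std::atomic<Node*> next{nullptr};
};

std::atomic<Node*> head{nullptr};

// The "shared log" from the description above, collapsed into a single
// counter of readers currently inside a read-side section.
std::atomic<long> readers_in_flight{0};

void read_all() {
    readers_in_flight.fetch_add(1, std::memory_order_acquire);   // "I started reading"
    for (Node* n = head.load(std::memory_order_acquire); n != nullptr;
         n = n->next.load(std::memory_order_acquire)) {
        // ... use n->value; no locks are taken on the read side ...
    }
    readers_in_flight.fetch_sub(1, std::memory_order_release);   // "I finished"
}

// The single writer removes 'victim', whose predecessor is 'prev'.
void remove_and_reclaim(Node* prev, Node* victim) {
    // 1. Atomically splice the victim out; readers that start after this
    //    point can no longer reach it.
    prev->next.store(victim->next.load(std::memory_order_acquire),
                     std::memory_order_release);

    // 2. Grace period: wait until every reader that might still hold a
    //    pointer to 'victim' has finished.
    while (readers_in_flight.load(std::memory_order_acquire) != 0)
        std::this_thread::yield();

    // 3. No reader can still see the victim; reclaim it.
    delete victim;
}
}}}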

Links:

consensus

arbiters

lamport's observation that it seems to be a physical fact that arbiters take longer to arbitrate if the events they are given arrive closer together in time

avoiding sharing with shared memory

uniqueness types

immutable data

threadlocals