proj-oot-ootConcurrencyNotes6

[1]:

" IPC can be synchronous or asynchronous. Asynchronous IPC is analogous to network communication: the sender dispatches a message and continues executing. The receiver checks (polls) for the availability of the message by attempting a receive, or is alerted to it via some notification mechanism. Asynchronous IPC requires that the kernel maintains buffers and queues for messages, and deals with buffer overflows; it also requires double copying of messages (sender to kernel and kernel to receiver). In synchronous IPC, the first party (sender or receiver) blocks until the other party is ready to perform the IPC. It does not require buffering or multiple copies, but the implicit rendezvous can make programming tricky. Most programmers prefer asynchronous send and synchronous receive.

First-generation microkernels typically supported synchronous as well as asynchronous IPC, and suffered from poor IPC performance. Jochen Liedtke assumed the design and implementation of the IPC mechanisms to be the underlying reason for this poor performance. In his L4 microkernel he pioneered methods that lowered IPC costs by an order of magnitude.[13] These include an IPC system call that supports a send as well as a receive operation, making all IPC synchronous, and passing as much data as possible in registers. Furthermore, Liedtke introduced the concept of the direct process switch, where during an IPC execution an (incomplete) context switch is performed from the sender directly to the receiver. If, as in L4, part or all of the message is passed in registers, this transfers the in-register part of the message without any copying at all. Furthermore, the overhead of invoking the scheduler is avoided; this is especially beneficial in the common case where IPC is used in an RPC-type fashion by a client invoking a server. Another optimization, called lazy scheduling, avoids traversing scheduling queues during IPC by leaving threads that block during IPC in the ready queue. Once the scheduler is invoked, it moves such threads to the appropriate waiting queue. As in many cases a thread gets unblocked before the next scheduler invocation, this approach saves significant work. Similar approaches have since been adopted by QNX and MINIX 3.[citation needed]

In a series of experiments, Chen and Bershad compared memory cycles per instruction (MCPI) of monolithic Ultrix with those of microkernel Mach combined with a 4.3BSD Unix server running in user space. Their results explained Mach's poorer performance by higher MCPI and demonstrated that IPC alone is not responsible for much of the system overhead, suggesting that optimizations focused exclusively on IPC will have limited impact.[14] Liedtke later refined Chen and Bershad's results by making an observation that the bulk of the difference between Ultrix and Mach MCPI was caused by capacity cache-misses and concluding that drastically reducing the cache working set of a microkernel will solve the problem.[15]

In a client-server system, most communication is essentially synchronous, even if using asynchronous primitives, as the typical operation is a client invoking a server and then waiting for a reply. As it also lends itself to more efficient implementation, most microkernels generally followed L4's lead and only provided a synchronous IPC primitive. Asynchronous IPC could be implemented on top by using helper threads. However, experience has shown that the utility of synchronous IPC is dubious: synchronous IPC forces a multi-threaded design onto otherwise simple systems, with the resulting synchronization complexities. Moreover, an RPC-like server invocation sequentializes client and server, which should be avoided if they are running on separate cores. Versions of L4 deployed in commercial products have therefore found it necessary to add an asynchronous notification mechanism to better support asynchronous communication. This signal-like mechanism does not carry data and therefore does not require buffering by the kernel. By having two forms of IPC, they have nonetheless violated the principle of minimality. Other versions of L4 have switched to asynchronous IPC completely.[16]

As synchronous IPC blocks the first party until the other is ready, unrestricted use could easily lead to deadlocks. Furthermore, a client could easily mount a denial-of-service attack on a server by sending a request and never attempting to receive the reply. Therefore, synchronous IPC must provide a means to prevent indefinite blocking. Many microkernels provide timeouts on IPC calls, which limit the blocking time. In practice, choosing sensible timeout values is difficult, and systems almost inevitably use infinite timeouts for clients and zero timeouts for servers. As a consequence, the trend is towards not providing arbitrary timeouts, but only a flag which indicates that the IPC should fail immediately if the partner is not ready. This approach effectively provides a choice of the two timeout values of zero and infinity. Recent versions of L4 and MINIX have gone down this path (older versions of L4 used timeouts, as does QNX). "

---

if ur using synchronous message passing (by which i mean that the sender blocks until receiver receives the message) on the same system, then you can possibly do zero-copy; the receiver can read and manipulate the sent data in-place while the sender is blocked. If giving the receiver access to that memory segment would be too kludgy, then the IPC system can provide function calls to read subsections of the message; this allows the receiver to process a large message incrementally, rather than allocating a buffer big enough for the whole message and then copying it in before it starts processing it.

---

synchronous IPC copying can be further reduced by allowing a message sender to send a list of message segments to be concatenated (eg a header and a payload); in the common case in which the payload is a data structure already in use in the sender before message construction, this prevents the sender from having to allocate a new buffer for the entire message and copying the payload into it before sending.

---

" Designing generalized lock-free algorithms is hard

---

" We could still scale to a very impressive amount of threads before we had stack chunks. The core reasons we can actually scale with so many threads are different: safe-points and the scheduler tie into allocations, allocations are fast and common, the scheduler is fast, and the overhead of an individual green thread is incredibly small (less than 20 machine words.) A real OS thread is vastly more expensive to manage, stack chunks or not. " -- [3]