notes-computer-jasper-jasperConcurrencyNotes1

https://developer.apple.com/library/ios/DOCUMENTATION/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html

-- http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/ndpslides.pdf

3 kinds of concurrency in haskell:

    explicit threads: forkIO and STM. Issue: nondeterminism.
    implicit threads: par and seq. Issue: hard to optimize in practice.
    data parallelism: pmap etc.

data parallelism: flat and nested. Flat is conventional; nested is better.

    Flat data parallel: apply a sequential operation to bulk data.
    Nested data parallel: apply a parallel operation to bulk data.

parallel array types, parallel array comprehensions

https://www.google.com/search?client=ubuntu&channel=fs&q=semaphor+monitor&ie=utf-8&oe=utf-8#fp=b07468abf853b2df&q=semaphore+monitor+mutex+lock+annotation++actor+csp+stm+future+promise+dataflow+unique+worker+pool+pipe+proxy+queue+condition+tuple+spaces++task+event+await+asynchronous+synchronous+deterministic+nondeterministic

--

http://freeprogrammersblog.vhex.net/post/linux-39-introdued-new-way-of-writing-socket-servers/2

--

" a new way of thinking about multiprocess socket servers. There are two standard UNIX designs of such servers:

    fork - the main process creates a listening socket and calls accept. After a client connects, a child process is created, which inherits the newly created socket and communicates with the client; when the communication ends, the process dies,
    prefork - the main process creates a listening socket and creates N child processes, which inherit the listening socket. The child processes call accept on the socket. The operating system is responsible for distributing incoming connections among all the child processes.

The prefork model works better (when it can be used): it uses a preallocated pool of processes instead of calling fork() for every connection. It also gives more control over the system resources used, by allowing a choice of process-pool size, whereas in the fork model the number of processes can grow uncontrollably.

SO_REUSEPORT option can be viewed as a vastly simplified prefork model. No need to run the parent process, manage children, setup signals etc. Just spawn any number of single-process, single-threaded servers, kill them and add new as necessary. The kernel does all the work of the parent process from the classic prefork model. "

--

http://www.kb.ecei.tohoku.ac.jp/~terauchi/papers/concur06-capcon-full.pdf

http://research.cs.berkeley.edu/project/cccd-impl/

--

" Jim Sukha Says:

September 19, 2013 at 9:31 pm

I think it is worth pointing out that C++ *can* offer some protection against data races in parallel programs. If you write parallel programs using a language that supports strict fork-join task-based parallelism, then you can use a race detector to find races in your code.

Cilk Plus provides such extensions to C and C++ for task parallelism, which fit exactly into this category. [Quasi-promotional links will follow. :) ]

https://www.cilkplus.org/cilk-plus-tutorial

It also provides Cilk Screen, a corresponding race detector for finding data races.

http://software.intel.com/en-us/articles/an-introduction-to-the-cilk-screen-race-detector

Note that Cilk Screen actually exploits the strict fork-join structure of parallelism to make strong correctness guarantees. For a large class of interesting Cilk Plus programs (e.g., mostly those that use only cilk_spawn, cilk_sync, cilk_for, and reducers), one can actually guarantee that a single test run under the race detector will either report a data race or certify that the program is race-free for that input.

This guarantee is not true in general of parallel programming using pthreads, or even other task-parallel programming platforms (e.g., TBB), because these platforms support more arbitrary, unstructured parallelism.

Language support for strict fork-join parallelism in the style of Cilk Plus has been proposed to the C++ standard.

http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3409.pdf

For anyone who is serious about making C++ usable for mainstream parallel programming, I would encourage taking a look at this proposal. The strictness provided by having language keywords for task parallelism provides stronger guarantees for race-detection tools, as compared to task-parallel libraries like Microsoft PPL or Intel TBB. In addition to the benefits of strictness, having a language extension for fork-join parallelism instead of a library interface is, at least in my opinion, a much more intuitive and user-friendly way to support parallelism in C++. "

--

" Mike Smith Says:

September 20, 2013 at 12:23 pm

Very nice article.

“Two words: data races. Imperative languages offer no protection against data races — maybe with the exception of D.”

Have you looked at Rust? Variable declarations default to immutable, and the type system disallows sharing of mutable data between tasks. "

" Bartosz Milewski Says:

September 20, 2013 at 1:30 pm

I don’t know much about Rust, but I noticed one thing: Like Erlang, it supports only one model of concurrency: message passing. That’s very limiting. Haskell actually lets you share memory in a very controlled way, which makes the implementation of STM possible. "

" Bartosz Milewski Says:

September 20, 2013 at 2:04 pm

Actually, behind the scenes Haskell uses pointers for everything, including passing data to and from functions. It also tries to share as much data as possible when making modifications to data structures. It also gives the programmer the option of not creating thunks (using strictness annotations) and using unboxed values inside data structures. But it does all this without exposing the program to data races. "

--

" millstone 2 days ago

> The power of multicore processors, vector units, and GPUs is being squandered by the industry because of an obsolete programming paradigm.

It’s worth addressing these in turn. In reverse order:

I dispute that the C paradigm squanders GPUs. OpenCL and CUDA are the two most prominent languages written with GPUs in mind, and both have a lot more in common with C than with Haskell. In particular they eschew functional mainstays like recursion, linked lists, and garbage collection. So for GPUs, it seems like the ideal paradigm is closer to C than Haskell.

For vector units, there’s some recent excitement about auto-vectorization in Haskell. But I’m skeptical about the potential of auto-vectorization in general, since it only seems to apply to simple loops, and can’t take advantage of all instructions. Most effectively vectorized code uses either compiler intrinsics or outright assembly, and I don’t see that changing any time soon (I’d love to be proven wrong on this though).

Lastly, multicore processors. C++11 more or less standardized the creaky POSIX threading model, with some additions like thread locals - hardly state of the art. I wonder if the author is familiar with other approaches, like libdispatch, which provide a much improved model.

One last observation. Parallelism is not a goal! Improved performance is the goal, and parallelism is one mechanism to improved performance. If your Haskell program runs at half the speed of a C++ program, but scales to twice as many cores so as to make the overall time equal, that’s not a wash: it’s a significant win for C++, which did the same thing with half the resources (think battery drain, heat, other processes, etc.) Horse in front of cart.

lmm 1 day ago

>I dispute that the C paradigm squanders GPUs. OpenCL and CUDA are the two most prominent languages written with GPUs in mind, and both have a lot more in common with C than with Haskell. In particular they eschew functional mainstays like recursion, linked lists, and garbage collection. So for GPUs, it seems like the ideal paradigm is closer to C than Haskell.

I think the more obvious explanation is that it's much easier to write a compiler for C than for Haskell. Thus the first tooling available for any new system is almost always C. It doesn't mean C's the best tool.

>For vector units, there’s some recent excitement about auto-vectorization in Haskell. But I’m skeptical about the potential of auto-vectorization in general, since it only seems to apply to simple loops, and can’t take advantage of all instructions. Most effectively vectorized code uses either compiler intrinsics or outright assembly, and I don’t see that changing any time soon (I’d love to be proven wrong on this though).

We'll have to wait and see on this.

>Lastly, multicore processors. C++11 more or less standardized the creaky POSIX threading model, with some additions like thread locals - hardly state of the art. I wonder if the author is familiar with other approaches, like libdispatch, which provide a much improved model.

My experience is that alternative approaches to something as fundamental as parallelism never take off, because they can't gather an ecosystem around them (I'm looking in particular at the chaos of approaches available in perl and especially python). I think a modern language needs to have a single, obvious, preferred way of achieving parallelism/concurrency that libraries can build on top of (and that's why javascript has been so successful - while it's a terrible language in many ways, there is one and only one way you handle parallel-type problems, and it's (slightly) better than the standard way in other languages)

"

--

" Processes that are waiting on other resources, such as a file read, wait on the EVENT data structure. Thus all processes waiting on a single resource wait on a single event. When the resource becomes available, the event is caused, which wakes up all the processes waiting on it. Processes may wait on multiple events for any one of them to happen, including a time out. Events are fully user programmable – that is, users can write systems that use the generalized event system provided by the MCP." -- https://en.wikipedia.org/wiki/Burroughs_MCP

--

"

Port files

Another technique for inter-process communication (IPC) is port files. They are like Unix pipes, except that they are generalized to be multiway and bidirectional. Since these are an order of magnitude slower than other IPC techniques such as libraries, it is better to use other techniques where the IPC is between different processes on the same machine.

The most advantageous use of port files is therefore for distributed IPC. Port files were introduced with BNA (Burroughs Network Architecture), but with the advent of standard networking technologies such as OSI and TCP/IP, port files can be used with these networks as well.

A server listening for incoming connections declares a port file (a file with the KIND attribute equal to PORT). Each connection that is made from a client creates a subfile with an index, so each port file represents multiple connections to different clients around the network.

A server process receives client requests from anywhere on the network by issuing a read on the port file (subfile = 0 to read from any subfile). It issues a response to the client that issued the request by writing to the particular subfile from which the request was read. " -- https://en.wikipedia.org/wiki/Burroughs_MCP

--

http://debu.gs/entries/inferno-part-2-let-s-make-a-cluster

--

http://blog.paralleluniverse.co/post/64210769930/spaceships2

" The challenge can be met, I believe, by adopting solutions that rest on two legs: languages (or at the very least – programming idioms) designed for concurrency, and appropriate transactional concurrent data-structures.

The former will need to enforce immutability, help manage state, and provide simple and effective concurrency mechanisms, such as message passing. So far, Clojure and Erlang seem to fit the bill. The latter will need to work hand-in-hand with the application, and help it exploit parallelism rather than hinder it, as many databases are prone to do.

... Callbacks are indeed an annoyance... To solve this, we have developed a lightweight thread library for the JVM called Quasar, along with its Clojure API – Pulsar.

...Quasar lightweight threads (which we call fibers)...unlike regular threads, you can have millions of fibers... Instead of a callback, you have fiber-blocking operations, that are as simple as a plain (thread-) blocking operation, only cheaper...Quasar then builds on the foundation of fibers, and includes Go-like CSP constructs (selectable channels), and an Erlang-like actor system. "

http://blog.paralleluniverse.co/post/49445260575/quasar-pulsar

" Lightweight Threads, Channels and Actors for the JVM

We’ll start at the end: Quasar and Pulsar are two new open-source libraries in Java and Clojure respectively that add Erlang-like actor-model (and Go-like coroutine/channel) programming to the JVM.

...

Several paradigms that try to make writing correct and efficient multithreaded code easier have emerged, but the actor model offered by Erlang has won some popularity, and for good reason: the actor model is easy to understand and lends itself to building complex fault-tolerant software. My own admiration for Erlang began when I realized that, to paraphrase Philip Greenspun, every mission-critical military software I’d worked on contained an ad-hoc, informally specified implementation of half of Erlang. Other languages — notably Go with channels and goroutines, and Clojure with agents — and frameworks (like Akka), have copied some aspects of the actor model, but they all suffer from deficiencies that make them more complicated, or less powerful than Erlang.

... A notable attempt that also provides lightweight threads is the Kilim library, and Erjang, the JVM Erlang implementation which uses Kilim under the hood. Kilim, however, is a very intrusive and complex infrastructure (it fuses all three components into a monolithic package), while Erjang seems to no longer be actively maintained....

...

There have been several attempts at porting actors to the JVM. Quasar and Pulsar’s main contribution — from which many of their advantages stem — is true lightweight threads[1]. Lightweight threads provide many of the same benefits “regular”, OS threads do, namely a single, simple control flow and the ability to block and wait for some resource to become available while in the meantime allowing other threads to run on the CPU. Unlike regular threads, lightweight threads are not scheduled by the operating system so their context-switch is often faster, and they require far less system resources. As a result, a single machine can handle millions of them.

Quasar/Pulsar actors have the following benefits:

    Extreme simplicity
    Pattern matching
    Binaries and pattern matching on binaries (just like Erlang!)
    Selective receive
    Super-scalable asynchronous IO accessed with a synchronous, blocking IO API.

Just to show where we’re headed, here’s an almost line-by-line port of the canonical Erlang pingpong example to Pulsar:

(defsusfn ping [n]
  (if (== n 0)
    (do (! :pong :finished)
        (println "ping finished"))
    (do (! :pong [:ping @self])
        (receive
          :pong (println "Ping received pong"))
        (recur (dec n)))))

(defsusfn pong []
  (receive
    :finished (println "Pong finished")
    (do (println "Pong received ping")
        (! ping :pong)
        (recur))))

(defn -main []
  (register :pong (spawn pong))
  (spawn ping 3)
  :ok) "

" Quasar, the Implementation

An Erlang-like actor system (i.e. with blocking, selective receives), requires three components:

    Coroutines
    A scheduler
    Queues

Coroutines are functions that can be paused and later resumed. They are necessary to build lightweight threads because they provide “user-mode” management of the stack. A good multithreaded scheduler is then added to schedule and run the coroutines while making the best use of available cores. This combination gives us lightweight threads. Finally, queues are needed to implement channels or actor mailboxes.

It is important that these three components be logically separated. Modularity assists future maintenance, and allows for a best-of-breed approach[2]. "

" Of the three pieces, coroutines are the most challenging to implement on the JVM, as they require bytecode instrumentation. Essentially, every possibly-pausing method (the Quasar/Pulsar nomenclature is “suspendable”) must be inspected to find all invocations of other suspendable methods. Before the call to a suspendable method, code must be injected to push the caller’s local variables onto a stack object. Also, code must be injected to the beginning of every suspendable method, that upon resuming would jump to the instruction following the pause-point. This has to be done to all suspendable methods on the (OS thread) stack.

Fortunately, I found Matthias Mann’s continuations library that not only does exactly that and only that, but is also small, fast and easy to understand.

Next came the scheduler. Here the job was easy. Java’s ForkJoinPool is a masterpiece of multi-threaded task scheduling, made even better for Java 8. In the meantime, we’re using the jsr166e library that contains the fork/join implementation that’s to go into the next major JDK release. We’ve used it in SpaceBase, and the performance improvements over the Java 7 implementation (which was very impressive to begin with) are significant. With a solid foundation like fork/join you can’t go wrong.

Combining Matthias Mann’s coroutines with fork/join gives us lightweight threads, which in Quasar are called fibers. A fiber can block, or “park”, just like a thread, by pausing the continuation and returning from the fork/join task. To be “unparked”, it is simply scheduled (forked) again.

Finally, I rolled my own queue implementations, partly as an exercise, but also because there’s room for improvement over Java’s concurrent queues in our case. The reason is that the queues we need aren’t symmetric: many fibers append messages onto a queue but only one pulls them out, so these are single-consumer multiple-producer queues.

Quasar has three types of lock-free queue implementations: concurrent linked lists based on John M. Mellor-Crummey’s 1987 paper (for some reason the simpler Craig-Landin-Hagersten queue I’ve tried gave worse performance than Java’s multiple-consumer multiple-producer ConcurrentLinkedQueue