proj-jasper-jasperConcurrencyNotes1

https://developer.apple.com/library/ios/DOCUMENTATION/Cocoa/Conceptual/Multithreading/ThreadSafety/ThreadSafety.html

-- http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/ndpslides.pdf

3 kinds of concurrency in haskell:

explicit threads: forkIO and STM. issues: nondeterministic
implicit threads: par and seq. issues: hard to really optimize
data parallelism: pmap etc

data parallelism: flat and nested. flat is conventional, nested is better. Flat data parallel: apply a sequential operation to bulk data. Nested data parallel: apply a parallel operation to bulk data.

parallel array types, parallel array comprehensions

https://www.google.com/search?client=ubuntu&channel=fs&q=semaphor+monitor&ie=utf-8&oe=utf-8#fp=b07468abf853b2df&q=semaphore+monitor+mutex+lock+annotation++actor+csp+stm+future+promise+dataflow+unique+worker+pool+pipe+proxy+queue+condition+tuple+spaces++task+event+await+asynchronous+synchronous+deterministic+nondeterministic

--

http://freeprogrammersblog.vhex.net/post/linux-39-introdued-new-way-of-writing-socket-servers/2

--

" a new way of thinking about multiprocess socket servers. There are two standard UNIX designs of such servers:

    fork - a main process creates a listening socket and calls accept. After a client connects, a child process is created, which inherits the newly created socket and communicates with the client; when the communication ends, the process dies,
    prefork - a main process creates a listening socket and creates N child processes, which inherit the listening socket. The child processes call accept on the socket. Operating system is responsible for distributing incoming connections among all the child processes.

The prefork model works better (when it can be used) - it uses a preallocated pool of processes instead of calling fork() for every connection. It also gives more control over used system resources by the ability to choose the size of the process pool, whereas in the fork model the number of processes can grow uncontrollably.

SO_REUSEPORT option can be viewed as a vastly simplified prefork model. No need to run the parent process, manage children, set up signals etc. Just spawn any number of single-process, single-threaded servers, kill them and add new ones as necessary. The kernel does all the work of the parent process from the classic prefork model. "
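
a minimal sketch (mine, not from the quoted post) of what each single-process, single-threaded server in the SO_REUSEPORT model might look like on Linux 3.9+; error handling omitted, and run_server is just an illustrative name:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* spawn any number of these as independent processes; the kernel
   load-balances incoming connections across every listener bound to
   the same port with SO_REUSEPORT */
void run_server(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof addr);
    listen(fd, 128);

    for (;;) {
        int client = accept(fd, NULL, NULL);
        /* ... talk to the client, single-threaded ... */
        close(client);
    }
}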

--

http://www.kb.ecei.tohoku.ac.jp/~terauchi/papers/concur06-capcon-full.pdf

http://research.cs.berkeley.edu/project/cccd-impl/

--

" Jim Sukha Says:

September 19, 2013 at 9:31 pm

I think it is worth pointing out that C++ *can* offer some protection against data races in parallel programs. If you write parallel programs using a language that supports strict fork-join task-based parallelism, then you can use a race detector to find races in your code.

Cilk Plus provides such extensions to C and C++ for task parallelism, which fits exactly into this category. [Quasi-promotional links will follow. :) ]

https://www.cilkplus.org/cilk-plus-tutorial

It also provides Cilk Screen, a corresponding race detector for finding data races.

http://software.intel.com/en-us/articles/an-introduction-to-the-cilk-screen-race-detector

Note that Cilk Screen actually exploits the strict fork-join structure of parallelism to make strong correctness guarantees. For a large class of interesting Cilk Plus programs (e.g., mostly those that use only cilk_spawn, cilk_sync, cilk_for, and reducers), one can actually guarantee that if the race detector reports no race for a given input, the program is free of data races on that input.

This guarantee is not true in general of parallel programming using pthreads, or even other task-parallel programming platforms (e.g., TBB), because these platforms support more arbitrary, unstructured parallelism.

Language support for strict fork-join parallelism in the style of Cilk Plus has been proposed to the C++ standard.

http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3409.pdf

For anyone who is serious about making C++ usable for mainstream parallel programming, I would encourage taking a look at this proposal. The strictness provided by having language keywords for task parallelism provides stronger guarantees for race-detection tools, as compared to task-parallel libraries like Microsoft PPL or Intel TBB. In addition to the benefits of strictness, having a language extension for fork-join parallelism instead of a library interface is, at least in my opinion, a much more intuitive and user-friendly way to support parallelism in C++. "
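
the canonical fork-join example in the Cilk Plus style the comment describes, for illustration (assumes a compiler with Cilk Plus support, e.g. a Cilk-enabled GCC or the Intel compiler):

#include <cilk/cilk.h>

/* strict fork-join: cilk_spawn forks a child task, cilk_sync waits for
   every child spawned in this function; this structure is what lets a
   race detector like Cilk Screen give its guarantees */
int fib(int n) {
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n - 1);  /* may run in parallel with the next line */
    int y = fib(n - 2);
    cilk_sync;                      /* join the spawned child */
    return x + y;
}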

--

" Mike Smith Says:

September 20, 2013 at 12:23 pm

Very nice article.

“Two words: data races. Imperative languages offer no protection against data races — maybe with the exception of D.”

Have you looked at Rust? Variable declarations default to immutable, and the type system disallows sharing of mutable data between tasks. "

" Bartosz Milewski Says:

September 20, 2013 at 1:30 pm

I don’t know much about Rust, but I noticed one thing: Like Erlang, it supports only one model of concurrency: message passing. That’s very limiting. Haskell actually lets you share memory in a very controlled way, which makes the implementation of STM possible. "

" Bartosz Milewski Says:

September 20, 2013 at 2:04 pm

Actually, behind the scenes Haskell uses pointers for everything, including passing data to and from functions. It also tries to share as much data as possible when making modifications to data structures. It also gives the programmer the option of not creating thunks (using strictness annotations) and using unboxed values inside data structures. But it does all this without exposing the program to data races. "

--

" millstone 2 days ago

> The power of multicore processors, vector units, and GPUs is being squandered by the industry because of an obsolete programming paradigm.

It’s worth addressing these in turn. In reverse order:

I dispute that the C paradigm squanders GPUs. OpenCL and CUDA are the two most prominent languages written with GPUs in mind, and both have a lot more in common with C than with Haskell. In particular they eschew functional mainstays like recursion, linked lists, and garbage collection. So for GPUs, it seems like the ideal paradigm is closer to C than Haskell.

For vector units, there’s some recent excitement about auto-vectorization in Haskell. But I’m skeptical about the potential of auto-vectorization in general, since it only seems to apply to simple loops, and can’t take advantage of all instructions. Most effectively vectorized code uses either compiler intrinsics or outright assembly, and I don’t see that changing any time soon (I’d love to be proven wrong on this though).

Lastly, multicore processors. C++11 more or less standardized the creaky POSIX threading model, with some additions like thread locals - hardly state of the art. I wonder if the author is familiar with other approaches, like libdispatch, which provide a much improved model.

One last observation. Parallelism is not a goal! Improved performance is the goal, and parallelism is one mechanism to improved performance. If your Haskell program runs at half the speed of a C++ program, but scales to twice as many cores so as to make the overall time equal, that’s not a wash: it’s a significant win for C++, which did the same thing with half the resources (think battery drain, heat, other processes, etc.) Horse in front of cart.

lmm 1 day ago

>I dispute that the C paradigm squanders GPUs. OpenCL and CUDA are the two most prominent languages written with GPUs in mind, and both have a lot more in common with C than with Haskell. In particular they eschew functional mainstays like recursion, linked lists, and garbage collection. So for GPUs, it seems like the ideal paradigm is closer to C than Haskell.

I think the more obvious explanation is that it's much easier to write a compiler for C than for Haskell. Thus the first tooling available for any new system is almost always C. It doesn't mean C's the best tool.

>For vector units, there’s some recent excitement about auto-vectorization in Haskell. But I’m skeptical about the potential of auto-vectorization in general, since it only seems to apply to simple loops, and can’t take advantage of all instructions. Most effectively vectorized code uses either compiler intrinsics or outright assembly, and I don’t see that changing any time soon (I’d love to be proven wrong on this though).

We'll have to wait and see on this.

>Lastly, multicore processors. C++11 more or less standardized the creaky POSIX threading model, with some additions like thread locals - hardly state of the art. I wonder if the author is familiar with other approaches, like libdispatch, which provide a much improved model.

My experience is that alternative approaches to something as fundamental as parallelism never take off, because they can't gather an ecosystem around them (I'm looking in particular at the chaos of approaches available in perl and especially python). I think a modern language needs to have a single, obvious, preferred way of achieving parallelism/concurrency that libraries can build on top of (and that's why javascript has been so successful - while it's a terrible language in many ways, there is one and only one way you handle parallel-type problems, and it's (slightly) better than the standard way in other languages)

"

--

" Processes that are waiting on other resources, such as a file read, wait on the EVENT data structure. Thus all processes waiting on a single resource wait on a single event. When the resource becomes available, the event is caused, which wakes up all the processes waiting on it. Processes may wait on multiple events for any one of them to happen, including a time out. Events are fully user programmable – that is, users can write systems that use the generalized event system provided by the MCP." -- https://en.wikipedia.org/wiki/Burroughs_MCP

--

"

Port files

Another technique for inter-process communication (IPC) is port files. They are like Unix pipes, except that they are generalized to be multiway and bidirectional. Since these are an order of magnitude slower than other IPC techniques such as libraries, it is better to use other techniques where the IPC is between different processes on the same machine.

The most advantageous use of port files is therefore for distributed IPC. Port files were introduced with BNA (Burroughs Network Architecture), but with the advent of standard networking technologies such as OSI and TCP/IP, port files can be used with these networks as well.

A server listening for incoming connections declares a port file (a file with the KIND attribute equal to PORT). Each connection that is made from a client creates a subfile with an index, so each port file represents multiple connections to different clients around the network.

A server process receives client requests from anywhere on the network by issuing a read on the port file (subfile = 0 to read from any subfile). It issues a response to the client that issued the request by writing to the particular subfile from which the request was read. " -- https://en.wikipedia.org/wiki/Burroughs_MCP

--

http://debu.gs/entries/inferno-part-2-let-s-make-a-cluster

--

http://blog.paralleluniverse.co/post/64210769930/spaceships2

" The challenge can be met, I believe, by adopting solutions that rest on two legs: languages (or at the very least – programming idioms) designed for concurrency, and appropriate transactional concurrent data-structures.

The former will need to enforce immutability, help manage state, and provide simple and effective concurrency mechanisms, such as message passing. So far, Clojure and Erlang seem to fit the bill. The latter will need to work hand-in-hand with the application, and help it exploit parallelism rather than hinder it, as many databases are prone to do.

... Callbacks are indeed an annoyance... To solve this, we have developed a lightweight thread library for the JVM called Quasar, along with its Clojure API – Pulsar.

...Quasar lightweight threads (which we call fibers)...unlike regular threads, you can have millions of fibers... Instead of a callback, you have fiber-blocking operations, that are as simple as a plain (thread-) blocking operation, only cheaper...Quasar then builds on the foundation of fibers, and includes Go-like CSP constructs (selectable channels), and an Erlang-like actor system. "

http://blog.paralleluniverse.co/post/49445260575/quasar-pulsar

" Lightweight Threads, Channels and Actors for the JVM

We’ll start at the end: Quasar and Pulsar are two new open-source libraries in Java and Clojure respectively that add Erlang-like actor-model (and Go-like coroutine/channel) programming to the JVM.

...

Several paradigms that try to make writing correct and efficient multithreaded code easier have emerged, but the actor model offered by Erlang has won some popularity, and for good reason: the actor model is easy to understand and lends itself to building complex fault-tolerant software. My own admiration for Erlang began when I realized that, to paraphrase Philip Greenspun, every mission-critical military software I’d worked on contained an ad-hoc, informally specified implementation of half of Erlang. Other languages — notably Go with channels and goroutines, and Clojure with agents — and frameworks (like Akka), have copied some aspects of the actor model, but they all suffer from deficiencies that make them more complicated, or less powerful than Erlang.

... A notable attempt that also provides lightweight threads is the Kilim library, and Erjang, the JVM Erlang implementation which uses Kilim under the hood. Kilim, however, is a very intrusive and complex infrastructure, while Erjang seems to no longer be actively maintained....Kilim; it fuses all three into a monolithic package....

...

There have been several attempts of porting actors to the JVM. Quasar and Pulsar’s main contribution — from which many of their advantages stem — is true lightweight threads[1]. Lightweight threads provide many of the same benefits “regular”, OS threads do, namely a single, simple control flow and the ability to block and wait for some resource to become available while in the meantime allowing other threads to run on the CPU. Unlike regular threads, lightweight threads are not scheduled by the operating system so their context-switch is often faster, and they require far less system resources. As a result, a single machine can handle millions of them.

Quasar/Pulsar actors have the following benefits:

    Extreme simplicity
    Pattern matching
    Binaries and pattern matching on binaries (just like Erlang!)
    Selective receive
    Super-scalable asynchronous IO accessed with a synchronous, blocking IO API.

Just to show where we’re headed, here’s an almost line-by-line port of the canonical Erlang pingpong example to Pulsar:

(defsusfn ping [n]
  (if (== n 0)
    (do
      (! :pong :finished)
      (println "ping finished"))
    (do
      (! :pong [:ping @self])
      (receive
        :pong (println "Ping received pong"))
      (recur (dec n)))))

(defsusfn pong []
  (receive
    :finished (println "Pong finished")
    (do
      (println "Pong received ping")
      (! ping :pong)
      (recur))))

(defn -main []
  (register :pong (spawn pong))
  (spawn ping 3)
  :ok) "

" Quasar, the Implementation

An Erlang-like actor system (i.e. with blocking, selective receives), requires three components:

    Coroutines
    A scheduler
    Queues

Coroutines are functions that can be paused and later resumed. They are necessary to build lightweight threads because they provide “user-mode” management of the stack. A good, multithreaded scheduler, is then added to schedule and run the coroutines while making the best use of available cores. This combination gives us lightweight threads. Finally, queues are needed to implement channels or actor mailboxes.

It is important these three components be logically separated. Modularity assists in future maintenance, and allows for a best-of-breed approach[2]. "

" Of the three pieces, coroutines are the most challenging to implement on the JVM, as they require bytecode instrumentation. Essentially, every possibly-pausing method (the Quasar/Pulsar nomenclature is “suspendable”) must be inspected to find all invocations of other suspendable methods. Before the call to a suspendable method, code must be injected to push the caller’s local variables onto a stack object. Also, code must be injected to the beginning of every suspendable method, that upon resuming would jump to the instruction following the pause-point. This has to be done to all suspendable methods on the (OS thread) stack.

Fortunately, I found Matthias Mann’s continuations library that not only does exactly that and only that, but is also small, fast and easy to understand.

Next, came the scheduler. Here the job was easy. Java’s ForkJoinPool is a masterpiece of multi-threaded task scheduling, made even better for Java 8. In the meantime, we’re using the jsr166e library that contains the fork/join implementation that’s to go into the next major JDK release. We’ve used it in SpaceBase, and the performance improvement over the Java 7 implementation (which was very impressive to begin with) is significant. With a solid foundation like fork/join you can’t go wrong.

Combining Matthias Mann’s coroutines with fork/join gives us lightweight threads, which in Quasar are called fibers. A fiber can block, or “park”, just like a thread, by pausing the continuation and returning from the fork/join task. To be “unparked”, it is simply scheduled (forked) again.

Finally, I rolled my own queue implementations, partly as an exercise, but also because there’s room for improvement over Java’s concurrent Queues in our case. The reason is that the queues we need aren’t symmetric. Many fibers append messages onto the queues but only one pulls them out, so these are single-consumer multiple-producers queues.

Quasar has three types of lock-free queue implementations: concurrent linked-lists based on John M. Mellor-Crummey’s 1987 paper (for some reason the simpler Craig-Landin-Hagersten queue I’ve tried gave worse performance than Java’s multiple-consumer multiple-producer ConcurrentLinkedQueue. I must have done something wrong — I’ll look into that); bounded array queues with some added “mechanical sympathy” based on Martin Thompson’s talks; and an unbounded linked-array queue based on Gidenstam, Sundell and Tsigas. Java’s memory model and excellent concurrency primitives allowed building some pretty fast queues (the bounded queue performed best by far, but the others can be improved; see benchmark code here).

Combining the queues with a poor-man’s condition variable that parks the reading fiber while it waits for messages gives us Go-like channels. A fiber attached to a channel, combined with some lifecycle management (Erlang-like “links”) gives us actors.

A special sendSync operation makes a request-reply conversation between two actors almost as efficient as a simple method call, all without changing actor semantics at all. "
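
not Quasar's actual code, but a toy sketch of the "queue plus something that parks the waiting reader" recipe described above, written with plain OS threads, a mutex and a condition variable instead of fibers and lock-free queues:

#include <pthread.h>

#define CAP 64

/* a bounded channel: a ring buffer guarded by a lock, with condition
   variables playing the role of park/unpark */
typedef struct {
    void *buf[CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} channel_t;

void channel_init(channel_t *c) {
    c->head = c->tail = c->count = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->not_empty, NULL);
    pthread_cond_init(&c->not_full, NULL);
}

void channel_send(channel_t *c, void *msg) {
    pthread_mutex_lock(&c->lock);
    while (c->count == CAP)                /* "park" the sender while full */
        pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[c->tail] = msg;
    c->tail = (c->tail + 1) % CAP;
    c->count++;
    pthread_cond_signal(&c->not_empty);    /* "unpark" a waiting receiver */
    pthread_mutex_unlock(&c->lock);
}

void *channel_receive(channel_t *c) {
    pthread_mutex_lock(&c->lock);
    while (c->count == 0)                  /* "park" the receiver while empty */
        pthread_cond_wait(&c->not_empty, &c->lock);
    void *msg = c->buf[c->head];
    c->head = (c->head + 1) % CAP;
    c->count--;
    pthread_cond_signal(&c->not_full);     /* "unpark" a waiting sender */
    pthread_mutex_unlock(&c->lock);
    return msg;
}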

" Pulsar

Pulsar wraps Quasar with an API identical to Erlang, pattern-matching and all.

I’ve mentioned before that I see Erlang and Clojure as the two languages[3: Some commenters on HN added Haskell to the list, but I know too little Haskell to form an opinion on the matter.] best suited for programming in our multicore age. They don’t only make concurrency easy (Go tries to do that, too, with a too-early-to-tell measure of success), they also greatly help to ensure that concurrency is correct. In short, they are both languages built for concurrency. They achieve this difficult task because they are opinionated languages with clear philosophies on how modern software should be written.

But Clojure and Erlang have their differences, too. The one most apparent to me is that Erlang was made for programming-in-the-large. Its roots (and motivations) are not in concurrency, but in fault tolerance. Erlang was designed to build large, complex software that needs to withstand failure. Its shortcomings are in supporting low-level constructs. It’s hard or impossible to write a high-performance data structure in Erlang, and it’s ill-suited for computation-heavy work. Erlang delegates these low-level problems to modules written in other languages.

I think Clojure is the opposite. It mostly shines when programming-in-the-small. It’s hard to beat Clojure’s expressivity when it comes to data processing, but it doesn’t offer a clear vision when it comes to fault-tolerance and provides little guidance to high-level organization of complex, large software. But Clojure is a more modern language than Erlang, as a Lisp it’s very malleable, and it runs on a very versatile platform, so it is much easier to bring Erlang to Clojure than the other way around.

Rich Hickey, Clojure’s designer, has made it very clear http://clojure.org/state why he chose not to have Erlang-style actors in Clojure. He makes some good arguments, and far be it from me to disagree with Hickey on Clojure’s philosophy, but I think that that reasoning, too, serves programming-in-the-small rather than in-the-large. Actors, in addition to their contribution to simplifying concurrency, help isolate and handle errors. In fact, that is why they’re used in Erlang in the first place. Actors in Clojure may be a tool for programming-in-the-large. "

" Tim Bray also compares Clojure and Erlang http://www.tbray.org/ongoing/When/200x/2009/10/26/Messaging and finds Clojure more elegant, but Erlang’s message-passing model easier to understand. All this goes to say that porting Erlang actors to Clojure might be a worthwhile undertaking (and at the very least — an interesting exercise). "

excerpt from http://clojure.org/state : " Message Passing and Actors

There are other ways to model identity and state, one of the more popular of which is the message-passing actor model, best exemplified by the quite impressive Erlang. In an actor model, state is encapsulated in an actor (identity) and can only be affected/seen via the passing of messages (values). In an asynchronous system like Erlang's, reading some aspect of an actor's state requires sending a request message, waiting for a response, and the actor sending a response. It is important to understand that the actor model was designed to address the problems of distributed programs. And the problems of distributed programs are much harder - there are multiple worlds (address spaces), direct observation is not possible, interaction occurs over possibly unreliable channels, etc. The actor model supports transparent distribution. If you write all of your code this way, you are not bound to the actual location of the other actors, allowing a system to be spread over multiple processes/machines without changing the code.

I chose not to use the Erlang-style actor model for same-process state management in Clojure for several reasons:

    It is a much more complex programming model, requiring 2-message conversations for the simplest data reads, and forcing the use of blocking message receives, which introduce the potential for deadlock. Programming for the failure modes of distribution means utilizing timeouts etc. It causes a bifurcation of the program protocols, some of which are represented by functions and others by the values of messages.
    It doesn't let you fully leverage the efficiencies of being in the same process. It is quite possible to efficiently directly share a large immutable data structure between threads, but the actor model forces intervening conversations and, potentially, copying. Reads and writes get serialized and block each other, etc.
    It reduces your flexibility in modeling - this is a world in which everyone sits in a windowless room and communicates only by mail. Programs are decomposed as piles of blocking switch statements. You can only handle messages you anticipated receiving. Coordinating activities involving multiple actors is very difficult. You can't observe anything without its cooperation/coordination - making ad-hoc reporting or analysis impossible, instead forcing every actor to participate in each protocol.
    It is often the case that taking something that works well locally and transparently distributing it doesn't work out - the conversation granularity is too chatty or the message payloads are too large or the failure modes change the optimal work partitioning, i.e. transparent distribution isn't transparent and the code has to change anyway.

Clojure may eventually support the actor model for distributed programming, paying the price only when distribution is required, but I think it is quite cumbersome for same-process programming. YMMV of course. "

excerpt from http://www.tbray.org/ongoing/When/200x/2009/10/26/Messaging

"

Let’s suppose we’ve got a program we want to smash across lots of cores, and we want it to count things. The counts aren’t going to be reported out till the program finishes, so there’s no reason for the main workload to wait for the counting arithmetic. So the obvious thing to do is to run the counters off in their own thread(s), send them please-count messages (no need to wait for the answers), and the numbers will all get added up eventually.

In Erlang ·

Here’s how you might do it. Let’s take the simplest possible approach and have a separate process for each item you’re counting.

-module(counter).
-export([new/0, incr/1, count/1, counter/0]).

new() ->
    Counter = spawn(fun counter:counter/0),
    Counter.

incr(Counter) ->
    Counter ! { incr }.

count(Counter) ->
    Counter ! { count, self() },
    receive
        { count, Count } ->
            Count
    end.

%- The counter process -------------------
counter() ->
    counter_loop(0).

counter_loop(Count) ->
    receive
        { incr } ->
            counter_loop(Count + 1);
        { count, Requestor } ->
            Requestor ! { count, Count },
            counter_loop(Count)
    end.

In Clojure

You’d use an agent, which is a reference to some data; here’s an example from Stuart Halloway’s Programming Clojure book. First, you get a reference to the agent:

(def counter (agent 0))

Then, to increment it, you send it a function; since Clojure has a built-in inc function, this is pretty easy.

(send counter inc)

Now, you could have sent any old function over there that could apply to whatever counter referred to, in this case an integer, and have that function applied; you can send arguments too. Clojure takes care of maintaining a thread pool and sequencing the messages to the agent. If there’s a chance that the function might block, you can use send-off rather than send; it uses an expandable thread pool.

Anyhow, to find the current value, you just ask; either of these will do.

(deref counter) @counter

Different Strokes · The differences between these two approaches are extremely interesting. I’m wondering if the class of things you can do with one but not the other is very big or interesting; I suspect that both can be made isomorphic to the Actor model of computation, although the mapping from Erlang is more direct.

[Update:] Thanks to Phil Hagelberg for this comment, which links to a detailed discussion of the Clojure/Erlang trade-offs hosted by Bill Clementson. http://web.archive.org/web/20130510063315/http://bc.tech.coop/blog/081201.html I see lots of stuff to argue about there...

I’ll dig into this some more (I’m doing a Wide Finder in Clojure to get some hands-on), but I place a certain value on first impressions and thought it might be useful to capture some.

"

http://web.archive.org/web/20130510063315/http://bc.tech.coop/blog/081201.html :

http://web.archive.org/web/20090721080837/http://eigenclass.org/hiki/wide-finder-conclusions :

" we cannot dismiss constant factors. JoCaml? code written by an individual in an afternoon can perform better than Erlang code refined over the course of one month by Erlang experts*2 on a per-core basis, while scaling just as well.

A few personal reflections:

    JoCaml allows to express and reason about concurrent and distributed systems at very high level, leaving little place for the sort of bugs that infest code written with threads and shared state. (I have written a couple fault-tolerant systems in JoCaml that would have been much harder with threads.)
    Erlang's major strength is fault-tolerance, scalability, etc., not raw performance; promoting Erlang by saying that it will be "faster given enough cores" is disingenuous because it conceals the existence of alternatives like JoCaml that can scale just as well but perform much better per core."

---

http://web.archive.org/web/20130719005643/http://bc.tech.coop/blog/060105.html

" Thursday, January 5, 2006

Concurrent/Parallel programming is hard. Just listen to comments from industry leaders like Tim Bray (he also provides a good summary of some of the issues associated with this type of programming):

    "But the bottom line is, for application programmers, don't get into this space unless you're pretty sure you know what you're doing and you're willing to wrestle with some seriously difficult debugging." 

or Bill de hÓra:

    "Distributed programming is hard (and with it comes the truly hard matters of cache invalidation and naming)."

or Herb Sutter and James Larus - here:

    "Probably the greatest cost of concurrency is that concurrency really is hard: The programming model, meaning the model in the programmer's head that he needs to reason reliably about his program, is much harder than it is for sequential control flow." 

and here:

    "But concurrency is hard. Not only are today's languages and tools inadequate to transform applications into parallel programs, but also it is difficult to find parallelism in mainstream applications, and-worst of all-concurrency requires programmers to think in a way humans find difficult." 

And yet, almost everyone (see here, here, and here for examples) is saying:

    "Concurrency has long been touted as the "next big thing" and "the way of the future," but for the past 30 years, mainstream software development has been able to ignore it. Our parallel future has finally arrived: new machines will be parallel machines, and this will require major changes in the way we develop software." 

So, how are developers going to deliver the next generation of concurrent applications and (most importantly) how will Lisp fit into all of this? First of all, let's start with some basics. In my previous articles on this subject (see here, here, here, here, here), I've gone through some background and some definitions and I won't re-hash any of that again. However, it is probably a good idea to go over the different concurrent programming paradigms (as taken from the book Concepts, Techniques, and Models of Computer Programming):

    Shared-state Concurrency: This approach assumes multiple threads of execution with shared data using locks, monitors and transactions. It is the typical Java approach to concurrency and it is also the most widespread form of concurrency. However, shared-state concurrency is also the most complicated model to program to as it requires a fine level of synchronization and scheduling between threads.
    Message-passing Concurrency: Uses asynchronous messaging to communicate between multiple threads over a specified port. This is the model used (very successfully) by Erlang. This model provides for a coarser level of interaction between threads but is easier to program to.
    Declarative/Dataflow Concurrency: This approach is data-driven and extends sequential declarative-type programming with multiple threads of execution. It is a simpler approach than approaches #1 & #2 and it is very useful for certain types of problems; however, it is not widely used."

then goes on to mention map-reduce, which he placed in the Declarative/dataflow section

"

When compared to Java/C++, Clojure's approach to concurrency is much simpler and easier to "get right" when you're coding concurrent programs. However, Clojure isn't the only language to provide "native" support for concurrency. It's interesting to compare the approach that Clojure has taken with the approach that Erlang (another language that has built-in support for concurrency) has taken. I've posted a bit about Erlang in the past on my blog; so, although I'm not an Erlang expert, I do know something about the language. I'll focus on three main areas as a basis for comparison:

    Autonomous/independent threads of execution: Actors in Erlang, Agents in Clojure. Erlang's actors are light-weight "user-land" processes (e.g. - "green threads") that are autonomous and can scale tremendously. Clojure's agents are native threads that are managed in thread pools for performance. As explained in the documentation, "Clojure's Agents are reactive, not autonomous - there is no imperative message loop and no blocking receive". Rich Hickey has summarized the Actor/Agent differences and his design rationale:
        "There are other ways to model identity and state, one of the more popular of which is the message-passing actor model, best exemplified by the quite impressive Erlang. In an actor model, state is encapsulated in an actor (identity) and can only be affected/seen via the passing of messages (values). In an asynchronous system like Erlang's, reading some aspect of an actor's state requires sending a request message, waiting for a response, and the actor sending a response. It is important to understand that the actor model was designed to address the problems of distributed programs. And the problems of distributed programs are much harder - there are multiple worlds (address spaces), direct observation is not possible, interaction occurs over possibly unreliable channels, etc. The actor model supports transparent distribution. If you write all of your code this way, you are not bound to the actual location of the other actors, allowing a system to be spread over multiple processes/machines without changing the code.
        I chose not to use the Erlang-style actor model for same-process state management in Clojure for several reasons:
            It is a much more complex programming model, requiring 2-message conversations for the simplest data reads, and forcing the use of blocking message receives, which introduce the potential for deadlock. Programming for the failure modes of distribution means utilizing timeouts etc. It causes a bifurcation of the program protocols, some of which are represented by functions and others by the values of messages.
            It doesn't let you fully leverage the efficiencies of being in the same process. It is quite possible to efficiently directly share a large immutable data structure between threads, but the actor model forces intervening conversations and, potentially, copying. Reads and writes get serialized and block each other, etc.
            It reduces your flexibility in modeling - this is a world in which everyone sits in a windowless room and communicates only by mail. Programs are decomposed as piles of blocking switch statements. You can only handle messages you anticipated receiving. Coordinating activities involving multiple actors is very difficult. You can't observe anything without its cooperation/coordination - making ad-hoc reporting or analysis impossible, instead forcing every actor to participate in each protocol.
            It is often the case that taking something that works well locally and transparently distributing it doesn't work out - the conversation granularity is too chatty or the message payloads are too large or the failure modes change the optimal work partitioning, i.e. transparent distribution isn't transparent and the code has to change anyway.
        Clojure may eventually support the actor model for distributed programming, paying the price only when distribution is required, but I think it is quite cumbersome for same-process programming. YMMV of course."
    Safe use/update of shared data: Mnesia in Erlang, STM in Clojure. In Erlang, control of updates to thread-shared data is handled by Mnesia, a distributed DBMS. This provides full ACID data integrity; however, Erlang is designed to be a distributed concurrency language and Clojure is not, so the cost/benefit of using a DBMS for non-distributed data integrity has to be taken into consideration. Clojure's STM-based approach is an effective, well-performing solution that doesn't require any out-of-process storage. Although STM has recently had some vocal detractors, most of the criticisms of STM would appear to apply more in an environment where mutable variables/structures are the norm. There is a thread on the Clojure mailing list that details a number of reasons why Clojure's use of STM avoids many of the things that STM has been criticized for. As concurrency-guru Cliff Click says in the thread:
        "I think Clojure is on to something good here - Immutability IS the Big Hammer For Concurrency here, and the STM is just one of several flavors of 'Big Hammer isn't working so I need another approach'. Given the Immutability-By-Default, STM has a much better chance of being performant than in other languages so it makes sense to give it a go."
    Distributed concurrency: This one really subsumes the other two (as well as some others that I won't be discussing here). Providing native support for distributed concurrency (also called parallel computing) affects quite a few design choices when you're building a language. Erlang was designed from the outset to support both distributed and non-distributed (same-address-space) concurrency. Therefore, Erlang uses the same actor-based messaging model in both distributed and non-distributed processing environments and Mnesia is designed to provide a shared data store for Erlang processes that are not necessarily running on the same machine. These choices make sense for Erlang because Erlang has a unified approach to both distributed and non-distributed concurrency. Clojure deliberately does not have any unified, "built-in" support for distributed concurrency. This decision means that distributed concurrency can be more difficult to program in Clojure than in Erlang; however, it also means that non-distributed concurrency (which, for many applications, will be the more common concurrency scenario) can be catered for in an easier manner. In a comparison of the two languages, Rich Hickey explained his reasons for not designing Clojure with support for a distributed concurrency model:
        "I have some reservations about unifying the distributed and non-distributed models (see e.g. http://research.sun.com/techrep/1994/smli_tr-94-29.pdf), and have decided not to do so in Clojure, but I think Erlang, in doing so, does the right thing in forcing programmers to work as if the processes are distributed even when they are not, in order to allow the possibility of transparent distribution later, e.g. in the failure modes, the messaging system etc. However, issues related to latency, bandwidth, timeouts, chattiness, and costs of certain data structures etc remain. My experiences with transparent distribution were with COM and DCOM, and, quite frankly, not happy ones. I think Erlang has a much better story there, but the fact is that distributed programming is more complex, and I personally wouldn't want to incur that complexity all the time. I think it is ok for a system designer to decide one part of a system will be distributed and another not, and incur some rewrite if they are wrong. If I wrote phone switches I might think otherwise :) "

" Rich recently outlined his thoughts on what is available (for distributed concurrency support) as well as what he might do in the future:

        "There's JMS:
        http://en.wikipedia.org/wiki/Java_Message_Service
        https://mq.dev.java.net/
        http://joram.objectweb.org/
        http://activemq.apache.org/
        XMPP:
        http://en.wikipedia.org/wiki/Xmpp
        http://www.igniterealtime.org/projects/index.jsp
        JXTA/Shoal:
        http://en.wikipedia.org/wiki/Jxta
        https://shoal.dev.java.net/
        JINI:
        http://en.wikipedia.org/wiki/Jini
        http://incubator.apache.org/projects/river.html
        DHTs like Pastry:
        http://freepastry.org/
        JGroups:
        http://www.jgroups.org/javagroupsnew/docs/index.html
        Terracotta:
        http://www.terracotta.org
        Jinterface:
        http://www.erlang.org/doc/apps/jinterface/
        NetKernel:
        http://www.1060.org/
        and more. All useful from Clojure. Given the diversity, sophistication, maturity, interoperability, robustness etc of these options, it's unlikely I'm going to fiddle around with some language-specific solution. That said, I have been thinking about putting a simple wrapper API around queues that would work both locally and over something like JMS."
    So, for the time being, it's possible to "roll your own" distributed Clojure programs; but, in the future, there may be some extra or built-in support for distributed concurrency in Clojure."

" Selective Receive

Before discussing performance and the roadmap, I’d like to briefly pause on one feature that makes Erlang actors quite unique: selective receive. This feature allows an actor to process messages not in the order of their arrival.

While selective receive might cause deadlocks, it makes managing complex state-transitions much easier and less error-prone than without selective receive, as demonstrated in this talk by Ulf Wiger.

Anyway, I found this question on stackoverflow.com that gives an example of selective receive in Erlang, and translated it to Clojure using Pulsar (this isn’t idiomatic Clojure at all, but a demonstration of a direct translation from Erlang):

(ns co.paralleluniverse.pulsar-test.examples.priority (:use co.paralleluniverse.pulsar))

(declare normal)

(defsusfn important []
  (receive
    [(priority :guard #(> % 10)) msg] (cons msg (important))
    :after 0 (normal)))

(defsusfn normal []
  (receive
    [_ msg] (cons msg (normal))
    :after 0 ()))

(defn -main []
  (join (spawn (fn []
                 (! @self [15 :high])
                 (! @self [7 :low])
                 (! @self [1 :low])
                 (! @self [17 :high])
                 (important)))))

=> (:high :high :low :low)

This example is found in Pulsar’s examples directory. "

---

Before C11, compilers were allowed to introduce new stores to shared memory, potentially screwing up lock-free algorithms!

http://preshing.com/20120625/memory-ordering-at-compile-time/

" int A, B;

void foo() { if (A) B++; }

Though it’s rather unlikely in practice, nothing prevents a compiler from promoting B to a register before checking A, resulting in machine code equivalent to the following:

void foo()
{
    register int r = B;  // Promote B to a register before checking A.
    if (A)
        r++;
    B = r;               // Surprise! A new memory store where there previously was none.
}

... In any case, the new C++11 standard explicitly prohibits such behavior from the compiler in cases where it would introduce a data race. The wording can be found in and around section 1.10.22 of the most recent C++11 working draft:

    Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard."

---

looked at libatomic_ops.

re: memory_order: seems to me that this is a lot of complexity for Jasper. Let's just forbid reordering in Jasper and memory-fence all atomic ops (the default in libatomic_ops anyways). If Jasper succeeds, we can add optional memory_order annotations later.

otoh i feel like the brain may not work that way... it's probably possible for modules A and B on one side of the brain to send messages a and b in that order, and for module C to receive them in that order, while module D on the other side of the brain receives them in order b,a due to routing delays.

also if you standardize on something relaxed then you can always use synch primitives

see http://en.cppreference.com/w/c/atomic/memory_order ; perhaps all we want is for everything to be like C 'volatile', which prevents intra-thread reordering but not interthread.

otoh i guess what i really want is:

if thread A emits a,b, then any other thread C will observe a,b. But if thread A emits a then thread B emits b, another thread C might observe a then b, whereas D might observe b then a.

i think that might be 'Release-Acquire ordering'. In addition, the synchronization in 'Release-Acquire ordering' is transitive; so assuming Release-Acquire ordering is used for everything, if thread B looked at a before emitting b, then a,b are ordered for any other threads.

see http://stackoverflow.com/questions/14861822/acquire-release-versus-sequentially-consistent-memory-order
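
a small C11 sketch of that guarantee (if thread A emits a then b, any thread that observes b also observes a); the names payload and ready are mine:

#include <stdatomic.h>
#include <stdbool.h>

int payload;          /* "a": plain data */
atomic_bool ready;    /* "b": the flag that publishes it; zero-initialized */

/* thread A: emits a, then b */
void thread_a(void) {
    payload = 42;                                                /* a */
    atomic_store_explicit(&ready, true, memory_order_release);   /* b */
}

/* thread C: once it observes b, it is guaranteed to observe a too */
void thread_c(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;   /* spin until b becomes visible */
    /* here payload is guaranteed to read 42 */
}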

hmm.. but looking at jasperBrain, it seems that a case might be made for relaxed ordering. the idea of working memory being the only seq. consist. memory gives me an idea: instead of specifying memory order with per-operation granularity, specify it with per-channel or per-variable granularity.

e.g. there are three types of shared memory: normal shared memory (sequentially consistent/total ordering), acquire-release memory (acquire-release ordering; the relationship between any two threads is totally ordered, but different threads might see different orderings), and relaxed memory (there is no ordering across different variables, only within one variable).
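
a hypothetical sketch of what per-variable (rather than per-operation) ordering could look like if lowered onto C11 atomics; the cell-type and function names are made up for illustration:

#include <stdatomic.h>

typedef struct { atomic_int v; } seq_cell;      /* "normal" shared memory  */
typedef struct { atomic_int v; } ra_cell;       /* acquire-release memory  */
typedef struct { atomic_int v; } relaxed_cell;  /* relaxed memory          */

/* every access to a cell uses the order implied by the cell's type,
   so the choice is made once, where the variable is declared */
int  seq_read (seq_cell *c)           { return atomic_load_explicit(&c->v, memory_order_seq_cst); }
void seq_write(seq_cell *c, int x)    { atomic_store_explicit(&c->v, x, memory_order_seq_cst); }

int  ra_read  (ra_cell *c)            { return atomic_load_explicit(&c->v, memory_order_acquire); }
void ra_write (ra_cell *c, int x)     { atomic_store_explicit(&c->v, x, memory_order_release); }

int  rx_read  (relaxed_cell *c)       { return atomic_load_explicit(&c->v, memory_order_relaxed); }
void rx_write (relaxed_cell *c, int x) { atomic_store_explicit(&c->v, x, memory_order_relaxed); }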

and in addition you can open a channel to another process, but it might have arbitrary and variable delays (e.g. if you send token A and the other side sends X in response, and then you receive X and then send B in response, if you then receive Y, you can't assume that the other side has seen B yet; maybe they sent Y after you sent B but before they received it. If a third process is also listening, they might see Y before they see B, or they might see B before they see Y. in other words, on such a channel, there is a total ordering over what each party sent and received, and a total ordering between dependent/caused multiparty interactions, but that's it) (i guess the 'dependent' thing doesn't make intuitive sense in this case; so just this: if B receives A and then sends X in response, and C is observing, there's no way that C can see X before it sees A. But since whether or not X was sent 'in response', that just means that anything that B sends after it receives A only appears to C after C has received A; this sounds a lot like release-acquire ordering to me but i'm not sure)

and in addition possibly you and some friend processes can construct a group shared memory, exclusive to the group, of any of the above types. in fact perhaps the global shared memory is just a generalization of this.

--

https://en.wikipedia.org/wiki/Relativity_of_simultaneity says

"... such as a car crash in London and another in New York. The question of whether the events are simultaneous is relative: in some reference frames the two accidents may happen at the same time, in other frames (in a different state of motion relative to the events) the crash in London may occur first, and in still other frames the New York crash may occur first. However, if the two events are causally connected ("event A causes event B"), the causal order is preserved (i.e., "event A precedes event B") in all frames of reference."

so that sounds like 'Release-Acquire ordering' if we make the same assumption as above, namely that 'causal connection' is really the relevant concept here, and 'if the two events are causally connected', then that should be replaced by 'if any observer observes A before causing B'

because looking at the example in http://en.cppreference.com/w/c/atomic/memory_order for 'relaxed ordering',

" For example, with x and y initially zero,

Thread 1:
    r1 = atomic_load_explicit(y, memory_order_relaxed);
    atomic_store_explicit(x, r1, memory_order_relaxed);
Thread 2:
    r2 = atomic_load_explicit(x, memory_order_relaxed);
    atomic_store_explicit(y, 42, memory_order_relaxed);

is allowed to produce r1 == r2 == 42. "

we have the physical equivalent of me receiving a letter from a friend that says '42', and the cause of it saying 42 was that, the day after i received it, i sent a letter to a friend saying '42'. Regardless of whether my decision to send 42 the next day WAS ACTUALLY caused by my receiving 42 the previous day, it's certainly suspicious, and if i have free choice, i COULD HAVE chosen to alter my decision of what to send the second day based on what i received the previous day.

if we really did only care if event B was 'caused' by event A, then it would be like release-consume ordering...hmmm though maybe quantum physics is like this? as long as i don't open the letter until after i send my letter.. (went off and wrote [1])

release-consume ordering is what http://preshing.com/20120930/weak-vs-strong-memory-models/ calls "Weak With Data Dependency Ordering". He points out that ARM (very popular) and PowerPC (used by Xbox 360) support this model. http://en.cppreference.com/w/c/atomic/memory_order claims that x86, SPARC, and IBM mainframes all have release-acquire, the next stronger step.

do we need to support release-acquire or release-consume for jasper? we don't need it for 'local' efficiency; if it's just a matter of letting the compiler or processor optimize things a little bit because of pipelining or whatever, jasper would rather pass up that increase in speed in exchange for being simpler to understand.

but we might need it for concurrency. recall that a main goal of jasper is to support massively concurrent systems, to allow programmers like me to gain intuition about how the brain might work. we know that it's hard to scale up current processor architectures to, say, a million CPUs and the bottleneck is cache coherency. so we need to get rid of shared memory. but we also know that the human brain can have, at least in working memory, a small global 'blackboard' which supports a totally ordered narrative.

now we could say, 'well if you want jasper to model how the brain works then stick to message passing because that's all that individual neurons can do'. but in that case we may as well have said 'stick to simulating diff eqs like hodgkin-huxley'; we want to allow researchers to start in the middle and consider various questions, to say things like 'okay assume there's some sort of working memory, now here's my theory of knowledge representation'. and there's the not-so-small issue that we want jasper to also be for other programming tasks besides cognitive studies simulations. so we want to provide shared memory, but we want it to seem like a special, expensive facility, not the simplest primitive. a standard library sort of thing, except that perhaps it should be in the language proper because it utilizes support from the underlying platform (not sure if that's a good criterion, i mean, do we put GPGPU stuff in the language proper too? well, bad example, because that has to do with massive concurrency, so maybe. what about a library to use a webcam? that's probably not in core).

anyhow.. so it seems like sequential consistency is too expensive to do all the time in a massively concurrent system, not just that it prohibits optimizations that save some proportion of time, but that it has a superlinear cost in the number of concurrent CPUs, e.g. it's not scalable. So the question is, is release-acquire ordering scalable, or are we forced to release-consume ordering?

--

now, are things automatically 'C volatile'?

i think that whether something is 'shared' or not should be the same as whether it is 'volatile', to reduce cognitive load. that is, things that aren't shared aren't 'volatile', and things that are shared are.

that is, the example of https://en.wikipedia.org/wiki/Volatile_variable#Example_of_memory-mapped_I.2FO_in_C would be solved because the meaning of 'not shared' is 'jasper may assume that this thread is the only one that can ever see the value of this variable, and can optimize accordingly'
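
roughly the busy-wait situation from the Wikipedia example referenced above (a sketch, not the exact article code); without the volatile qualifier the compiler may assume foo never changes on this path, hoist the load, and spin forever:

/* foo is changed by hardware or by another thread, not by this code path */
static volatile int foo;

void wait_for_ready(void) {
    foo = 0;
    while (foo != 255)
        ;   /* busy-wait until the "other side" writes 255 */
}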

--

now, are things automatically 'atomic'?

yes, i think that any single assignment operation to a shared variable, or to its parts, is atomic

but this is way expensive; the normal thing to do is to have most shared memory accesses be non-atomic, so that you can copy stuff in or whatever, and then to use synchronization when you're 'done'
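
a sketch of that "non-atomic bulk writes, then synchronize once when you're done" pattern in C11 (names are mine; a single writer and one-shot publication are assumed):

#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define N 1024
int buf[N];                        /* plain, non-atomic shared buffer */
atomic_bool published = false;

/* writer: fill the buffer with ordinary stores, then pay for exactly
   one synchronizing (release) store at the end */
void publish(const int *src) {
    memcpy(buf, src, sizeof buf);                /* cheap, non-atomic */
    atomic_store_explicit(&published, true, memory_order_release);
}

/* reader: one acquire load, then ordinary reads of the buffer */
bool try_read(int *dst) {
    if (!atomic_load_explicit(&published, memory_order_acquire))
        return false;
    memcpy(dst, buf, sizeof buf);
    return true;
}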

but ppl sometimes recommend that you don't bother with that and just use locks anyways

according to the criteria i laid out above, it sounds like this is only a proportional optimization, not a change from unscalable to scalable; because you have to do the lock at some point anyways

but i'm not sure that's true.

wait-freedom is stronger than lock-freedom, and:

" Wait-freedom is the strongest non-blocking guarantee of progress, combining guaranteed system-wide throughput with starvation-freedom. An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes...It was shown in the 1980s[4] that all algorithms can be implemented wait-free, and many transformations from serial code, called universal constructions, have been demonstrated. However, the resulting performance does not in general match even naïve blocking designs. Several papers have since improved the performance of universal constructions, but still, their performance is far below blocking designs.

Several papers have investigated the hardness of creating wait-free algorithms. For example, it has been shown[5] that the widely available atomic conditional primitives, CAS and LL/SC, cannot provide starvation-free implementations of many common data structures without memory costs growing linearly in the number of threads. But in practice these lower bounds do not present a real barrier as spending a word per thread in the shared memory is not considered too costly for practical systems.

Until 2011, wait-free algorithms were rare, both in research and in practice. However, in 2011 Kogan and Petrank[6] presented a wait-free queue building on the CAS primitive, generally available on common hardware. Their construction expands the lock-free queue of Michael and Scott,[7] which is an efficient queue often used in practice. A follow-up paper by Kogan and Petrank[8] provided a methodology for making wait-free algorithms fast and used this methodology to make the wait-free queue practically as fast as its lock-free counterpart. " -- https://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom

so it sounds like a scalability difference can occur by using the atomics in a properly sparing way.
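
for reference, the kind of CAS loop the excerpt is talking about; it is lock-free (some thread always makes progress) but not wait-free, since an unlucky thread can in principle retry forever:

#include <stdatomic.h>

void lockfree_increment(atomic_int *counter) {
    int old = atomic_load_explicit(counter, memory_order_relaxed);
    /* retry until no other thread slipped in between the load and the CAS;
       on failure, old is reloaded with the current value */
    while (!atomic_compare_exchange_weak_explicit(
               counter, &old, old + 1,
               memory_order_relaxed, memory_order_relaxed))
        ;
}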

--

"Scalability Analysis of Release and Sequential Consistency Models in NoC? based Multicore Systems"

http://web.it.kth.se/~axel/papers/2012/SoCTampere-AbdulNaeem.pdf

looking at the graphs, it doesn't seem to me that the release (acquire-release) consistency was in a different complexity class w/r/t number of processors than the sequential consistency

---

looked up ""distributed shared memory" consistency thousands of processors scalability", but didn't find much of use. Most people are working on like 64-1000 processors. BlueGene? has tens of thousands of nodes with hundreds of thousands of processor cores but not many people have a BlueGene? and those who do are willing to program very carefully in each application rather than have an awesome oversimplifying inefficient framework.

---

i guess i think that 'relaxed consistency' as C defines it is too relaxed in one way but too strict in another:

it's too relaxed: the causality violation in the example

" For example, with x and y initially zero,

Thread 1:
    r1 = atomic_load_explicit(y, memory_order_relaxed);
    atomic_store_explicit(x, r1, memory_order_relaxed);
Thread 2:
    r2 = atomic_load_explicit(x, memory_order_relaxed);
    atomic_store_explicit(y, 42, memory_order_relaxed);

is allowed to produce r1 == r2 == 42. "

is really annoying because a single process is observing something that they caused before they caused it.

however in terms of inter-process communication i don't care as much about such things; e.g. if a third process observed x being 42 before y was 42, even though y's 42 caused x's 42, then i don't care as much. imagine a routing fabric where the third process is very distant, and x and y are located in different distributed memory nodes, and message routing may dynamically change based on traffic, etc. it's conceivable that the transient timing delays between the node where y is stored and the third node will be so much greater than between the node where x is stored and the third node that the third node sees 42 in x before it sees 42 in y.

i think a key distinction is that a single process can only observe something it caused before it caused it if reordering is applied to that process; if each operation in that process's program, in the order given in the source code, completes before the next operation starts, then this can't happen.

heck, in terms of what a third process observes, i don't know if i even need coherence, e.g. maybe it will see a different sequence of change-x events than other nodes, e.g. like vector clocks in eventually consistent, partitionable, available databases where there was a partition for awhile and both sides accepted writes to a single value.

so what i think i want is just to prevent the re-ordering within each process, but to allow it between processes.

in other words, what i want is pretty simple: if a process issues read R1 and then write W1, the observed value of R1 is not caused by W1, either directly or indirectly; and you read your own writes (in fact all the 'client-centric' criteria: monotonic reads, monotonic writes, writes-follow-reads).

so i would like jasper to support two modes for shared memory: that, and sequential consistency.

---

since it seems like acquire-release consistency only marginally helps (~2x speedup or something like that), no need to provide it in the language, although in theory i like it.

still todo look at location consistency


" DeNovo? has used Deterministic Parallel Java (DPJ) as an ex- ample disciplined programming model [11] to drive its design. DPJ provides the programmer with a novel region-based type and effects system to convey the read and write side-effects on shared-memory for every method. A type-checked DPJ program is guaranteed deterministic-by-default semantics. That is, unless non-determinism is explicitly requested, DPJ programs appear de- terministic and with sequential semantics (the programmer can debug and test such a program as if it were sequential). Even when non-determinism is explicitly requested, DPJ provides strong safety guarantees; e.g., data-race-freedom, strong isolation, and sequential composition of deterministic code sections [12]. The DPJ compiler enforces these guarantees by checking that conflicting accesses from two concurrent tasks – the root cause of non-determinism – are always identified (as atomic) and always occur within explicitly marked atomic sections. "

" DPJ is an extension to Java that enforces deterministic-by-default semantics via compile-time type checking [11, 12]. We first dis- cuss DPJ without non-deterministic constructs [11]. DPJ provides parallel constructs of foreach and cobegin to express parallelism in a structured way as in many current languages (we refer to an iteration of a foreach loop or a parallel statement of a cobegin as a task). DPJ provides a new type and effect system for expressing common patterns of imperative, object-oriented programs. The DPJ programmer assigns every object field or array element to a named “region” and annotates every method with read and write “effects” summarizing the regions read and written by that method (a re- gion can be non-contiguous in memory). The compiler uses this information to (i) type-check program operations in the region type system and (ii) ensure that no two parallel tasks interfere (conflict). DPJ also provides parallel constructs that are potentially non- deterministic; i.e., foreach nd and cobegin nd [12]. These con- structs allow conflicting accesses between their tasks, but require that such accesses be enclosed within atomic sections, their read and write effect declarations also include the atomic keyword, and their region types be declared as atomic. Note that there continue to be no conflicts allowed between a task from a deterministic parallel construct and any other concurrent (non-deterministic or determin- istic) task. The compiler checks that all of the above constraints are satisfied by any type-checked program, again using a simple, modular type checking algorithm. With the above constraints, DPJ is able to provide the follow- ing guarantees: (1) Data-race freedom. (2) Strong isolation of ac- cesses in atomic section constructs and all deterministic parallel constructs; i.e., these constructs appear to execute atomically. (3) Sequential composition for deterministic constructs; i.e., tasks of a deterministic construct appear to occur in the sequential order implied by the program (even if they contain or are contained within non-deterministic constructs). (4) Determinism-by-default; i.e., any parallel construct that does not contain an explicit non- deterministic construct provides deterministic heap output for a given heap input. "

---

" A single programming language or library typ- ically supports only one or two models of parallelism. For example, CUDA, OpenCL? and AMP naturally sup- port fine-grain data parallelism, with a macro function replicated across a large number of threads, but other parallelism models (like more flexible dataflow) are not specifically addressed. Similarly, BlueSpec? [30] and Lime [5], which have proved successful for FPGAs, are both dataflow models but it is not clear whether these can be mapped effectively to GPUs. T "

pvc? what was it called? pvm?

" CUDA, OpenCL?, AMP and OpenACC? are primarily focused on GPU computing

The Liquid Metal project [24] aims to program hybrid CPU/accelerator code, using a single object-oriented programming language called Lime. Their efforts to date have focused on CPU/FPGA systems. The FPGA part is compiled first to a language based on a dataflow graph of filters connected by input/output queues, which in turn is compiled to Verilog.

"

" At the object code level, the NVIDIA’s PTX and Microsoft’s DirectCompute? are virtual instruction sets for GPU computing. Both these systems aim to sup- port a stable programming model and instruction set (for CUDA and AMP programs respectively), while also providing efficient code" "

---

clearly, if there is an atomic variable 'a' and the user says 'a = a + 1'... that entire thing should be atomic! this should not be:

b = a (atomically)
b = a + 1
a = b (atomically)

(that's what ada, at least, does)
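
in C++ terms, the difference looks something like this (a sketch; 'a' is just an illustrative std::atomic<int>):

    #include <atomic>

    std::atomic<int> a{0};

    // NOT atomic as a whole: another thread can increment between the load and
    // the store, and one of the two increments is then lost.
    void broken_increment() {
        int b = a.load();    // atomic read
        b = b + 1;           // plain arithmetic on the local copy
        a.store(b);          // atomic write
    }

    // atomic as a whole: a single read-modify-write primitive.
    void real_increment() {
        a.fetch_add(1);      // this is what 'a = a + 1' on an atomic variable should mean
    }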

---

similarly, if we say that a variable holding a number is atomic, then even if that variable is a zillion bytes long and we can only atomically modify 1 byte using hardware primitives, it is jasper's job to make the whole thing atomic (perhaps using a transaction?)
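
a minimal sketch (C++ as a stand-in, names illustrative) of one way a runtime could keep a value wider than any hardware atomic "logically atomic": guard it with a lock (an STM transaction would be another option):

    #include <array>
    #include <mutex>

    struct BigAtomic {
        std::array<char, 4096> value{};   // far wider than any atomic instruction
        mutable std::mutex m;

        std::array<char, 4096> load() const {
            std::lock_guard<std::mutex> g(m);
            return value;                  // a reader never sees a half-written value
        }

        void store(const std::array<char, 4096>& v) {
            std::lock_guard<std::mutex> g(m);
            value = v;
        }
    };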

--

http://denovo.cs.illinois.edu/Pubs/10-cacm-memory-models.pdf notes that java had to define what the language could not do in cases of data races, rather than just leaving them undefined like in C++, b/c java wanted to deal with untrusted code and prohibit, say, reading a password that you aren't supposed to have access to from another part of memory

--

http://denovo.cs.illinois.edu/Pubs/10-cacm-memory-models.pdf likes the data-race-free model:

"Data-race-free provides a simple and consistent model for t hreads and shared variables. We are convinced that it is the best mod el to- day to target during initial software development. "

--

http://denovo.cs.illinois.edu/Pubs/10-cacm-memory-models.pdf seems to think that maybe the non-sequential-consistency memory model options in C/C++ atomics are only really needed b/c hardware doesn't provide primitives that quite work; e.g. on x86 there is no version of a memory fence that only fences synchronization variables, not ordinary data variables:

" Throughout this process, it repeatedly became clear that cu rrent hardware models and supporting fence instructions are ofte n at best a marginal match for programming language memory models, pa r- ticularly in the presence of Java volatile fields or C++ atomic objects. It is always possible to implement such synchroniz ation variables by mapping each one to a lock, and acquiring and re- leasing the corresponding lock around all accesses. Howeve r, this typically adds an overhead of hundreds of cycles to each acce ss, particularly since the lock accesses are likely to result in coherence cache misses, even when only read accesses are involved. Volatile and atomic variables are typically used to avoid locks for exactly these reasons. A typical use is a flag that in dicates a read-only data structure has been lazily initialized. Sin ce the ini- tialization has to happen only once, nearly all accesses sim ply read the atomic / volatile flag and avoid lock acquisitions. Acquiring a lock to access the flag defeats the purpose. On hardware that relaxes write atomicity (see Figure 3), how - ever, it is often unclear that more efficient mappings (than t he use of locks) are possible; even the fully fenced implementatio n may not be sequentially consistent. Even on other hardware, the re are apparent mismatches, most probably caused by the lack of a we ll- understood programming language model when the hardware wa s designed. On x86, it is almost sufficient to map synchronization loads and stores directly to ordinary load and store instruc tions. The hardware provides sufficient guarantees to ensure that o rdinary memory operations are not visibly reordered with synchronization operations. However it fails to prevent reordering of a sync hroniza- tion store followed by a synchronization load; thus this imp lemen- tation does not prevent the incorrect outcome for Figure 1. This may be addressed by translating a synchronization stor e to an ordinary store instruction followed by an expensive fenc e. The sole purpose of this fence is to prevent reordering of the syn chro- nization store with a subsequent synchronization load. In practice, such a synchronization load is unlikely to follow closely en ough (Dekker’s algorithm is not commonly used) to really constra in the hardware. But the only available fence instruction constra ins all memory reordering around it, including that involving ordi nary data accesses, and thus overly constrains the hardware. A be tter solution would involve distinguishing between two flavors o f loads and stores (ordinary and synchronization), roughly along t he lines of Itanium’s ld.acq and st.rel [21]. This, however, requires a change to the instruction set architecture, usually a diffic ult propo- sition. We suspect the current situation makes the fence instruc- tion more expensive than necessary, in turn motivating addi - tional language-level complexity such as C++ low-level ato mics or lazySet() in Java "

--

they suggest:

" A be tter solution would involve distinguishing between two flavors o f loads and stores (ordinary and synchronization), roughly along t he lines of Itanium’s ld.acq and st.rel [21]. This, however, requires a change to the instruction set architecture, usually a diffic ult propo- sition. "

i looked that up ( http://www.intel.com/content/dam/www/public/us/en/documents/manuals/itanium-architecture-software-developer-rev-2-3-vol-1-manual.pdf , section 4.4.7, page 1:73, PDF page 85). it looks like Itanium provides normal ld and st instructions, but it also provides variant versions, ld.acq and st.rel. the difference appears to be that an ld.acq will not be reordered after following loads and stores (whether those are ordinary ld, st, or ld.acq, st.rel), and an st.rel will not be reordered before previous loads and stores; whereas normal ('unordered') ld and st might be reordered with other unordered ld, st in either direction. A fence is like both an acquire and a release at once.

the manual says:

" Memory data access ordering must satisfy read-after-write (RAW), write-after-write (WAW), and write-after-read (WAR) data dependencies to the same memory location. In addition, memory writes and flushes must observe control dependencies. Except for these restrictions, reads, writes, and flushes may occur in an order different from the specified program order.

Note that no ordering exists between instruction accesses and data accesses or between any two instruction accesses. .... Memory accesses follow one of four memory ordering semantics: unordered, release, acquire or fence. Unordered data accesses may become visible in any order.

Release data accesses guarantee that all previous data accesses are made visible prior to being made visible themselves. Acquire data accesses guarantee that they are made visible prior to all subsequent data accesses.

Fence operations combine the release and acquire semantics into a bi-directional fence, i.e., they guarantee that all previous data accesses are made visible prior to any subsequent data accesses being made visible. "

so http://denovo.cs.illinois.edu/Pubs/10-cacm-memory-models.pdf seems to be suggesting that it would be best if the data-race-free model were the main or only model in languages, but that current hardware does not provide the ability to fence only with respect to synchronization variables, whereas the Itanium does. but i don't think the Itanium actually provides that: Table 4-20 clearly states that an ld.acq is ordered before any subsequent ld or st, and an st.rel is ordered after any previous ld or st. maybe they are just saying that, similarly to (but not identically to) Itanium, they want an instruction set that allows the programmer to mark sync variables; and, unlike Itanium, they want to be able to say: order only w/r/t sync variables.

--

"A valuable discipline, therefore, is to provide a guarantee of determinism by default ....In par tic- ular, DPJ proposes a region-based type and effect system for deterministic-by-default semantics – “regions” name disj oint par- titions of the heap and per-method effect annotations summarize which regions are read and written by each method. Coupled wi th a disciplined parallel control structure, the compiler can easily use the effect summaries to ensure that there are no unordered co n- flicting accesses and the program is deterministic. Recent r esults show that DPJ is applicable to a range of applications and com plex data structures and provides performance comparable to thr eads code [8] "

--

"ensuring that data races are never actually executed at run- time, thus both avoiding the need to specify their behavior and gre atly simplifying or eliminating the debugging issues associate d with data races."

--

here's how they want to go beyond data-race-free:

" Further, data-race-free by itself is, arguably, insuf ficient as a discipline for writing correct, easily debuggable, and mai ntainable shared-memory code; e.g., it does not completely eliminate atom- icity violations or non-deterministic behavior. Moving forward, we believe a critical research agenda to ena ble “parallelism for the masses” is to develop and promote disciplined shared-memory models that: • are simple enough to be easily teachable to undergraduates; i.e., minimally provide sequential consistency to programs that obey the required discipline; • enable the enforcement of the discipline; i.e., violations of the discipline should not have undefined or horrendously comple x semantics, but should be caught and returned back to the pro- grammer as illegal; • are general-purpose enough to express important parallel algo- rithms and patterns; and • enable high and scalable performance "

--

so i guess it would make them happy if Core Jasper had a way to indicate which variables (and which variable loads and stores) were 'synchronization variables' and which were 'data variables'

--

i heard in "On the tamability of the Location Consistency memory..." that release consistency -> sequential consistency if you are free of data races (apparently the proof is in the appendix of http://eecs.vanderbilt.edu/courses/eece343/papers/p15-gharachorloo.pdf )? but this seems to disagree:

http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/seq_con.html

at the end he says:

"

None of the above applies if we allow load_acquire and store_release operations, since the synchronization operations themselves may not behave in a sequentially consistent manner. In particular, consider the following example from the strawman proposal:

Thread1:
store_release(&y, 1);
r1 = load_acquire(&x);

Thread2:
store_release(&x, 1);
r2 = load_acquire(&y);

This allows r1 = r2 = 0, where sequential consistency (and Java) do not.

"

what gives?

afaict the theorem in http://eecs.vanderbilt.edu/courses/eece343/papers/p15-gharachorloo.pdf only assumes 'processor consistency', not sequential consistency, for 'special accesses' which i assume means synchronization operations
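
for concreteness, the same store-then-load test written with C++11 atomics (a sketch; the standard permits r1 == r2 == 0 with release/acquire but not with the default seq_cst, though whether a given run actually shows it depends on the hardware):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        y.store(1, std::memory_order_release);
        r1 = x.load(std::memory_order_acquire);
    }

    void thread2() {
        x.store(1, std::memory_order_release);
        r2 = y.load(std::memory_order_acquire);
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join();
        t2.join();
        std::printf("r1=%d r2=%d\n", r1, r2);   // r1 == r2 == 0 is a legal outcome here
    }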

---

anyhow it sounds like, especially since well-behaved programs get sequential consistency anyhow, everyone is recommending that, for most programs, shared memory should exhibit sequential consistency. so maybe that's what jasper should do? but what about eventually consistent cassandra-style stuff?

and we can still have those channels in which different ppl see things in different orders, and we can still have message passing

if we go this route we're essentially saying that the construct of shared memory is only useful (or at least simple enough to be worth including in jasper) if it's sequentially consistent

but i'd still like some eventual consistency

--

http://denovo.cs.illinois.edu/Pubs/10-cpc-dpj.pdf claims that language-guaranteed determinism for data parallel language(s) is described in:

13. M. Metcalf and J. Reid. Fortran 90 Explained. Oxford University Press, 1992

and for pure functional language(s):

H.-W. Loidl et al. Comparing parallel functional languages: Programming and performance. Higher Order Symbol. Comput., 2003.

" However, while guaranteed determinism is available for some restrict ed styles of parallel programming (e.g., data parallel [13], or pure functional [1 1]), it re- mains a challenging research problem to guarantee determinism for im perative, object-oriented languages such as Java, C++, and C#. In such lan guages, object references, aliasing, and updates to mutable state obscure the d ata dependences between parts of a program, making it hard to prove that those de pendences are respected by the program’s synchronization "

--

" All these choices will build on some variant of the sequential LLVM instruction set. Our tentative strategy is to use a combination of two parallelism abstractions: vector parallelism and general- ized macro dataflow graphs. A virtual vector instruction set, such as Vector LLVA [8] or Vapor SIMD [31], is an obvious candidate because vectorizable code is likely to capture a large category of codes (though not all) that run successfully on GPUs, FPGAs, and vector hardware. Moreover, when applicable, vector instructions are sim- ple, have intuitive deterministic semantics, and can be made highly portable through careful design of the vector lengths, control flow, memory alignment, and operation abstractions [8, 31]. For computations not cleanly expressible as vector code, our second choice is a general dataflow graph, which may include fine-grain dataflow such as Static Sin- gle Assignment (SSA) form within procedures (which al- ready exists in LLVM) and “macro dataflow” representa- tions for coarse-grain data flow. This dataflow graph will not be “pure dataflow” in the sense that individual nodes may include side-effects, i.e,. reads and writes, to shared memory, because explicitly managing memory accesses may be important for performance tuning and front-end compiler optimization. Although these side effects may cause concurrency errors like data races or unintentional non-determinism, it is important to note that the virtual ISA is not a source-level programming language but an object code language: guaranteeing correctness proper- ties like the absence of concurrency errors is not the re- sponsibility of the virtual ISA and is instead left to source level languages.

We believe that such a (less restrictive) dataflow graph language will be able to capture all the forms of data parallelism. For example, the implicit “grid of kernel functions” used in PTX [32], Renderscript [3], WebGL [37], and DirectCompute? [1] are naturally expressible as dataflow graphs. While dataflow graphs can also capture vectorizable code, they may be less efficient and less compact than vector instructions, which is why we expect to use vector instructions as well. Moreover, having one monolithic dataflow graph for an entire application would severely restrict the modularity and flexibility of the code. We therefore propose to make the dataflow graphs hierarchical by simply allowing the individual nodes in the overall graph to be either a vector subprogram, or another dataflow graph of its own. Individual nodes in the graph would represent computation within each element, either as simple vector code or rich fine-grain dataflow graphs. Different nodes can use different choices for different kernels. Together this combination would give an elegant, yet powerful, mechanism for expressing parallelism across an entire heterogeneous system since a wide range of high-level data-parallel programming models can be compiled efficiently down to either a dataflow model or a vector model, including OpenCL?, Array Building Blocks [21], CUDA, Renderscript, OpenMP?, Fortran 9x, Hierarchical Tiled Arrays (HTAs) [7, 22], Concurrent Collections [10], Lime, StreamIt? [38], and others ... abstraction of the overall memory system we present to programmers for coordinating parallelism across compute elements ... for instance, in Listing 1 we abstract away the data movement details in the FPGA and GPU (e.g. CUDA memcpy operations) through edges which move data from the producer node to the consumer node. This dataflow across elements can then be efficiently mapped to either uncached local memories or to cache-coherent shared memory. Although explicit loads and stores of the GPU and FPGA implementations have been abstracted away in this example, we do expect to need explicit loads and stores, which is not uncommon in macro dataflow languages. For example, an alternative approach to expressing the FFT algorithm would be to use loads and stores inside the Stage node for transferring data between successive stages, instead of using explicit dataflow edges as shown. (We omit that pseudocode for lack of space. " -- http://denovo.cs.illinois.edu/Pubs/12-HotPar-VAdve.pdf

---

reading Memory Consistency Models. Mosberger 1991, it seems like what the http://denovo.cs.illinois.edu/Pubs/10-cacm-memory-models.pdf guys like is 'weak consistency', because i think they want synch ops to be sequentially consistent; mosberger says that release consistency only requires PCD consistency (Stanford DASH's variant definition of processor consistency), and he suggests that entry consistency is similar or weaker.

---

i guess 'processor consistency' sounds good. that means (a) coherency (everyone sees the same sequence of values in any single memory location), and (b) everyone sees the same sequence coming from each processor.

can we still have this under processor consistency?

" For example, with x and y initially zero,

Thread 1: r1 = atomic_load_explicit(y, memory_order_relaxed); atomic_store_explicit(x, r1, memory_order_relaxed); Thread 2: r2 = atomic_load_explicit(x, memory_order_relaxed); atomic_store_explicit(y, 42, memory_order_relaxed);

is allowed to produce r1 == r2 == 42. "

1: R(y)42 W(x)42
2: R(x)42 W(y)42

coherency: yes (x: 0 42; y: 0 42)
same sequence of writes, for each processor, seen by all processors: yes (1: x; 2: y)

but do reads matter?

same sequences of reads, for each processor, seen by all processors: yes (1: y; 2: x)

todo

--

wow, even Milewski can't stomach relaxed memory orders!

http://stackoverflow.com/questions/9553591/c-stdatomic-what-is-stdmemory-order-and-how-to-use-them

"

    I had no idea what I was getting myself into when attempting to reason about C++ weak atomics. The theory behind them is so complex that it’s borderline unusable. It took three people (Anthony, Hans, and me) and a modification to the Standard to complete the proof of a relatively simple algorithm. Imagine doing the same for a lock-free queue based on weak atomics!

Hans, who was instrumental in creating the C++ memory model, hopes that weak atomics will be a stopgap measure until processor designers find more efficient ways of implementing sequential consistency. For now, it is strongly recommended to leave weak atomics to top experts whose numbers can be counted on the fingers of one hand. " -- http://bartoszmilewski.com/2008/12/23/the-inscrutable-c-memory-model/

well, i guess that just about decides it for me, then. jasper's shared memory abstraction should be sequentially consistent.

well wait a minute. i think his example mixes relaxed consistency for data with acquire-release consistency for synchronization. the other guys say, use sequential consistency for synchronization and acquire-release for everything else. also, isn't acquire-release consistency weaker than processor consistency? so could processor consistency still be ok?

note to self: see projects/eventReordering !

--

also, just the idea that you read from and write to channels (which are 'volatile' in that they can change when you aren't looking because someone else changed them) just like variables

--

do we need something like 'volatile' in the old C sense just so an external extension module has a way to tell Jasper that something is volatile? not sure


i guess that, given the presence of different memory consistency models, and given that many of these models make various guarantees between memory locations (e.g. FIFO consistency and above), we must be able to create 'shared memory objects' of various types, and access variables within them. perhaps memory is threadlocal by default?

---

the language Jade has "with" which specifies a section of code to be done under a lock; withonly, which specifies which shared objects the with applies to; without, which can be used within a with to temporarily release the lock; and withth, which i don't understand

--

maybe have all vars be threadlocal by default, must mark 'shared' for them to be shared and use milewski's type system (was it, threadlocal or immutable or unique or auto-locked by an owning monitor?)

mb everything within one line of code (e.g. until the semicolon or EOL) is atomic, e.g. x = x + 1 is atomic, and even x = bigLongFunction(x) is atomic? note that the latter requires locking and so could call another function that calls another function that also tries to lock x, which could cause a problem... so maybe this is only allowed if x is local or something and not passed-by-reference to bigLongFunction?
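
here's a sketch of that re-locking problem (C++ with illustrative names, assuming the implicit per-statement lock is an ordinary non-recursive mutex):

    #include <mutex>

    std::mutex x_lock;   // the lock the runtime would implicitly take for 'x = ...'
    int x = 0;

    int helper() {
        std::lock_guard<std::mutex> g(x_lock);   // also wants x's lock
        return x * 2;
    }

    int bigLongFunction(int v) {
        return v + helper();                     // indirectly tries to lock x again
    }

    void update_x() {
        std::lock_guard<std::mutex> g(x_lock);   // implicit lock for the whole statement
        x = bigLongFunction(x);                  // deadlock: helper() blocks on x_lock forever
    }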

---

amorphous medium computing

--

map reduce over multiple processors special case: consensus

--

C11 threads: "C11 threads are similar to but simpler than Pthreads" -- http://fileadmin.cs.lth.se/cs/Education/EDAN25/F06.pdf (slide #62/70)

--

" You can create a goroutine with any function, even a closure. But be careful: a questionable design decision was to make closures capture variables by reference instead of by value. To use an example from Go's FAQ, this innocent looking code actually contains a serious race:

    values := []string{"a", "b", "c"}
    for _, v := range values {
        go func() {
            fmt.Println(v)
            done <- true
        }()
    }
  

The for loop and goroutines share memory for the variable v, so the loop's modifications to the variable are seen within the closure. For a language that exhorts us to "do not communicate by sharing memory," it sure makes it easy to accidentally share memory! (This is one reason why the default behavior of Apple's blocks extension is to capture by value.) " -- http://ridiculousfish.com/blog/posts/go_bloviations.html#go_concurrency
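
for comparison, the same pitfall and the capture-by-value fix in C++ (a sketch, not from the quoted post):

    #include <cstdio>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<std::string> values = {"a", "b", "c"};
        std::vector<std::thread> threads;
        for (std::size_t i = 0; i < values.size(); ++i) {
            // buggy: [&] captures i by reference, so every thread shares the one loop
            // variable, which keeps changing (and then goes out of scope):
            //   threads.emplace_back([&] { std::puts(values[i].c_str()); });

            // ok: capture the index by value (and the vector by reference):
            threads.emplace_back([i, &values] { std::puts(values[i].c_str()); });
        }
        for (auto& t : threads) t.join();
        return 0;
    }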

--

parallel by default, unless the compiler determines the computation is constant-time or has a serial dependency (e.g. in a + b, a and b may be evaluated in parallel)

--

throwaway41597 8 days ago

> We could achieve similar things in Node with generators, but in my opinion generators will only ever get us half way there

I wish more people realized this. Generators seem to be billed (ironically by TJ amongst others with his Koa project) as the solution to callback hell but, having tried Go, I got the same feeling that actors are much easier to reason about and debug (although concurrency is still hard).

On the client side, web workers allow parallelism but the API is inconvenient to use: you need to have a file for each web worker whereas Go's `go` routines are akin to a function call. In addition to this standardized conundrum, you have browsers discrepancies with the APIs available inside a worker varying between vendors [1].

On node, you have the cluster API, which allows managing several processes so it's even further from light threads. On the bright side, most (all?) APIs have both a sync and async version.

As a result, there's nothing you can use for multithreading in a module without coordinating with the whole project. I think JavaScript? needs language support for light threads, not APIs for multiprocessing.

[1]: https://developer.mozilla.org/en-US/docs/Web/API/Worker/Func...

iambase 1 day ago

Rather than cluster API, I guess you meant child process API but otherwise I agree 100%.

---

http://www.quora.com/Computer-Programming/Why-are-Haskell-green-threads-more-efficient-performant-than-native-threads

http://stackoverflow.com/questions/2708033/technically-why-are-processes-in-erlang-more-efficient-than-os-threads

summary: (a) the minimal (default?) stack space per green thread is less than per OS thread (on the order of 4k vs. 32k) (b) no full context switch needed (don't have to necessarily save every register, etc) b/c the Haskell compiler knows exactly what does and doesn't have to be switched

 note: one page noted that since the language wants to handle scheduling anyhow, maybe it would be best for it to create the same number of OS threads as (virtual) CPUs (e.g. if the CPU has hyperthreading then you create the same number of OS threads as hyperthreads, PLUS a new OS thread each time you execute a potentially blocking OS call (e.g. if you execute 1000 printf calls in parallel, you must create 1000 new OS threads, so that you can still act while you are waiting for that call to unblock))
 note: how much memory per greenthread? Go used to allocate 4k per goroutine, now it allocates 8k; Erlang is on the order of 309 words ( http://www.erlang.org/doc/efficiency_guide/processes.html ); 1k in Haskell http://stackoverflow.com/questions/1900165/how-long-does-it-take-to-create-1-million-threads-in-haskell; 1MB in C++ http://stackoverflow.com/questions/1900165/how-long-does-it-take-to-create-1-million-threads-in-haskell ; 128 bytes in C protothreads library http://stackoverflow.com/questions/1900165/how-long-does-it-take-to-create-1-million-threads-in-haskell
   i'm guessing we want to be more like Erlang, since we are focused on massive concurrency. I wonder what the Erlang figure would be with SMP and HiPE support?

i guess Ubuntu might have an erlang with SMP support?

$ erl
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [kernel-poll:false]

Eshell V5.8.5 (abort with ^G)
1> Fun = fun() -> receive after infinity -> ok end end.

#Fun<erl_eval.20.21881191>
2> {_,Bytes} = process_info(spawn(Fun), memory).
{memory,2560}
3> Bytes div erlang:system_info(wordsize).
320

so i guess that's >2k and 320 words? i guess SMP is included?

Let's try HiPE?, which seems to be an Ubuntu package too:

$ sudo aptitude install erlang-base-hipe
$ erl
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.5 (abort with ^G)
1> Fun = fun() -> receive after infinity -> ok end end.

#Fun<erl_eval.20.21881191>
2> {_,Bytes} = process_info(spawn(Fun), memory).
{memory,2656}
3> Bytes div erlang:system_info(wordsize).
332

So, again, order of 309 words, on a 64-bit system, >2k.

So i guess 2k minimum per greenthread (on a 64-bit system, this means 256 words) is a good goal for Jasper.

as a maximum, we may not want the amount per greenthread to be greater than typical L1 data caches. On my system, there is a 32K instruction cache and a 32K data cache per processor, and two hyperthreads per processor. So 16k per hyperthread.

So between 2k and 16k. So, between 256 and 2048 words.

Go used to be at 512 words and is now at 1024 words. Haskell is at 128 words (shall we try and match that? is this an apples-to-apples comparison, e.g. the erlang figure appears to include overhead, does the haskell figure include overhead?).

So, i guess we should shoot for between 128 and 512 words.

perhaps if we have a fixed-size 'jasper assembly' dialect we should also consider having the number of words here being the max size accessible by some fixed address operand field length. 128 corresponds to a 7-bit addressing field, 256 to an 8-bit addressing field, and 512 to a 9-bit addressing field. Which gives us another reason to go with 256 (or 128).. Is 256 easily addressable words crazy small? the PIC MCP family has "A small amount of addressable data space (32, 128, or 256 bytes, depending on the family), extended through banking" (http://en.wikipedia.org/wiki/PIC_microcontroller). About half of the tinyAVR MCP family listed on http://en.wikipedia.org/wiki/Atmel_AVR_ATtiny_comparison_chart has at least 256 bytes of memory (and very slightly less than half has 512). According to http://en.wikipedia.org/wiki/TI_MSP430, many of the MSP430 MCPs have 128 (word?) or 256 (word?) RAM options (see also the 'Memory configurations' table on that page, which has 2 entries each for 128 and 256 (word?) RAM options). So, mb it's not crazy small. However the Arduino Uno (using an ATmega328 AVR) has 2kb of SRAM, and according to http://www.ti.com/lit/ug/slas191a/slas191a.pdf , the MSP430 starter kit has 406 (words?) of RAM available to the user application program (including stack space) even when the monitor program is running. So, maybe those 128/256 word options are in fact crazy small and really there should be at least 512 words? also, each of the 8 cogs in the https://en.wikipedia.org/wiki/Parallax_Propeller has 512 words. otoh Haskell's 1k default is only 128 words. So heck. OK, so all this does is confirm that a reasonable choice may be between 128 and 512 words, and that 256 may or may not be a sweet spot. Let's stick with a 256 word target for now, just b/c it's 8 bits; although if we need to shave a bit, feel free to go down to 128.

---

now, if you merely read some data from an object which is proxied over the network, you might expect this to be timeoutless if you didn't know it would be proxied, and be surprised by the timeout exception. We don't want to make everyone writing a quick script prepare for network timeout exceptions. Which is why the above is BY DEFAULT; you can add a simple sigil to override these defaults, in essence saying "do this in a serial, blocking, timeoutless manner". In addition, b/c we don't have checked exceptions, you can simply not bother to write code to deal with the timeout exception where you know it won't occur.

note: in FFI, how do we deal with foreign constant-time functions which are nominally pure, non-blocking functions, but which might write to STDERR in case of an exception? If we treat them as potentially blocking, we may lose a lot of efficiency if they are called in an inner loop, because Jasper will feel the need to fork a new process for each call and wrap it with a timeout and a promise. But if we treat them as guaranteed nonblocking, then in particularly unfortunate error cases in which somehow STDERR gets blocked (perhaps because there are a zillion of these guys running at once and all writing to STDERR in an infinite loop) the programmer may be surprised when their entire Jasper program freezes up.

Well, after reading what i wrote there, i guess the answer is clear. The most important thing is for Jasper to minimize the leakiness of its concurrency abstractions, e.g. the programmer should be able to depend on the idea that Jasper greenthreads by default won't block each other. So the convention must be to be pedantic about whether FFI is 'potentially blocking', and only mark it as 'guaranteed non-blocking' if it really is. If the programmer is calling FFI in an inner loop and needs the performance, they can add an annotation (or sigil) that tells Jasper to allow it to block.
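
a rough analog of "wrap it with a timeout and a promise" (C++ as a stand-in for whatever Jasper's runtime would do; possibly_blocking_ffi_call is a placeholder):

    #include <chrono>
    #include <future>
    #include <stdexcept>

    int possibly_blocking_ffi_call();   // assumed foreign function

    int call_with_timeout(std::chrono::milliseconds timeout) {
        // run the call on its own thread and attach a timeout to the future
        std::future<int> result = std::async(std::launch::async, possibly_blocking_ffi_call);
        if (result.wait_for(timeout) == std::future_status::timeout) {
            // stand-in for Jasper's timeout exception; note that with std::async the
            // future's destructor still waits for the call to finish, so a real runtime
            // would detach the worker instead.
            throw std::runtime_error("ffi call timed out");
        }
        return result.get();
    }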

---

if we really want Jasper to be good for programming a massively concurrent machine with 64k slow CPUs, must adopt a dataflow or active data approach, because you can't have each CPU waiting to be told which step of the program we're on/what to do, as you would if you simply tried to 'autoparallelize' loops within the context of a single global thread of execution.

---

http://lambda-the-ultimate.org/node/4910

---

PJ Eby's Trellis seems good

https://pypi.python.org/pypi/Trellis/0.7a2

here's a fork of it:

https://pypi.python.org/pypi/Reaction/0.2.2

(note: remember, the language Subtext also tries to handle spreadsheet-like dependencies; latest seems to be a talk at the upcoming Future Programming Workshop: http://alarmingdevelopment.org/?p=880 . Glitch (managed time) is also mentioned there in the comments; the commentor implies that Glitch is 'more permissive than FRP')

---

https://gist.github.com/staltz/868e7e9bc2a7b8c1f754 's way of clearing the suggestions upon refresh (section "Modelling the 3 suggestions with streams") suggests that something more like Glitch, with defined Epochs, could be helpful; because the suggestion stream shows that the three suggestions go Null at the same time and then non-null at the same time, but this might be hard to recognize using FRP alone; also, it would be good to be able to, instead of having a stream with all three suggestions at once, somehow relate the 3 streams (i wonder if Glitch can essentially do that? they say Glitch was borne out of a frustration that FRP was too restrictive)

---

hah, the principle behind Glitch epochs is the same as some of the mechanics governing time travel in the fiction/game http://www.aleph.se/Nada/Game/Fukuyama/timetravel.html

---

hendzen 2 days ago

As someone who has used TSX to optimize synchronization primitives, you can expect to see a ~15-20% performance increase, if (big if) your program is heavy on disjoint data access, i.e. a lock is needed for correctness, but conflicts are rare in practice. If you have a lot of threads frequently writing the same cache lines, you are probably going to see worse performance with TSX as opposed to traditional locking. It helps to think about TSX as transparently performing optimistic concurrency control, which is actually pretty much how it is implemented under the hood.

While the API of TSX (_xbegin(), _xabort(), _xend()) gives the appearance of implementing transactional memory, it is really an optimized fast path - the abort rate will determine performance. The technical term for what TSX actually implements is 'Hardware Lock Elision'.

If you are going to use TSX, don't use it directly unless you are writing your own primitives (i.e. transactional condition variables [0]), prefer using Andi Kleene's TSX enabled glibc [1] or the speculative_spin_mutex [2] in TBB.

[0] - http://transact09.cs.washington.edu/9_paper.pdf

[1] - https://github.com/andikleen/glibc

[2] - http://www.threadingbuildingblocks.org/docs/doxygen/a00248.html#ga69722571e9e4406693c4a1379f0b47eb
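
here's a rough sketch of what hardware lock elision looks like with the RTM intrinsics (_xbegin/_xend/_xabort/_xtest); this is illustrative, not taken from the comment above, and needs a TSX-capable CPU plus compilation with -mrtm:

    #include <immintrin.h>
    #include <atomic>

    class ElidedSpinLock {
        std::atomic<int> locked{0};
    public:
        void lock() {
            for (int attempt = 0; attempt < 3; ++attempt) {
                unsigned status = _xbegin();
                if (status == _XBEGIN_STARTED) {
                    // now executing transactionally; reading the lock word puts it in our
                    // read-set, so a real acquisition by another thread aborts us
                    if (locked.load(std::memory_order_relaxed) == 0)
                        return;
                    _xabort(0xff);   // lock is actually held: give up the transaction
                }
                // transaction aborted (conflict, capacity, lock held, ...): retry, then fall back
            }
            while (locked.exchange(1, std::memory_order_acquire) != 0) {
                while (locked.load(std::memory_order_relaxed) != 0) { /* spin */ }
            }
        }
        void unlock() {
            if (_xtest())
                _xend();   // we were speculating: commit the transaction
            else
                locked.store(0, std::memory_order_release);
        }
    };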

---

beagle3 1 day ago

Do you have any insight into how it compares to lock-free techniques?

A full-blown (say) lock-free hash table is almost insurmountable - I recall seeing an efficient reusable implementation a couple of years ago, but most before that had serious performance or correctness issues. However, if you take this into account early enough in the design of the system, you can usually do with much simpler lock free data structures (seqlock and friends).

transactional memory is, of course, a useful abstraction when you can't (or didn't) take this into account upfront.

colanderman 1 day ago

Lock-freedom is all about progress, not performance. i.e. generally the only reason to use a lock-free algorithm is if you have issues with ill-behaved code stalling other processes. (I can think of several simple ways to make a fast concurrent hash table, none of which provide lock-freedom, yet which avoid taking locks when necessary.)

Transactional memory is an access abstraction that presents memory as something over which transactions can be made. It says nothing about whether the implementation is lock-free (though the better implementations approach, or are lock-free). There exist software TM algorithms which lock during the transaction, which lock only briefly at the end of the transaction, and which are lock-free, and some which are a combination of these.

TSX (specifically HLE) only takes a lock if the transaction fails without taking a lock, in order to guarantee forward progress in the absence of an ill-behaved transaction (assuming fair locks). Were the software fallback implemented using lock-free techniques (possible with the more general RTM portion of TSX), this guarantee would extend to ill-behaved transactions as well.

beagle3 1 day ago

> Lock-freedom is all about progress, not performance. i.e. generally the only reason to use a lock-free algorithm is if you have issues with ill-behaved code stalling other processes.

Lock freedom is about progress, but in many practical cases (short time between load-linked/store-conditional pair, for example) it beats locks and everything else in performance -- basically, in the best case you do the same work as locks (i.e. synchronize caches among CPUs), and in the worst case, you spin instead of suspending the process.

Of course, if you do a lot of work between the load and store (or whatever pair of operations need to surround your concurrent operation), and there's a good chance of contention, then ... yes, it will not perform well. But that's not how you should use it (or how it is used most of the time).

And .. transactional memory is an abstraction that leaks differently than locks, but you must still be aware of the failure modes. It's higher level, but does not magically resolve contention.

lelf 2 days ago

And it was the only widespread TM implementation in metal.

So N years more till we see software transactional memory with dedicated hardware support.

wahern 1 day ago

Strong LL/SC provides the ability to do real software transactional memory. I believe ARM has the strongest LL/SC implementation, with the architecture supporting transaction blocks up to 2KB in size! Although ARM has specified LL/SC such that vendors can choose to only support blocks much smaller than 2KB.

Note that people have totally confused the terms. Originally "software transactional memory" meant using primitives like [strong] LL/SC to get lock-free, wait-free algorithms. Strong LL/SC has been proven to be a universal primitive which can be used to implement most known lock-free, wait-free algorithms.

"Hardware transactional memory" gave you access to lock-free, wait-free algorithms in a much more convenient manner. But really it's a difference of degree because strong LL/SC requires _significant_ hardware support.

These days people (like the PyPy? project and most STM libraries) use "software transactional memory" for implementations which just use a mutex under the hood. It's emulated STM; they're just being fanciful.

I believe TSX is more like strong LL/SC. Maybe stronger. OTOH I'm not sure if it's universal enough to implement wait-free algorithms.

---

reitzensteinm 1 day ago

I also did this, missing out on 100mhz of base clock (3.4ghz vs 3.5ghz for the k).

TSX seemed like a once in a decade step forward, though as I understand it the restrictions with the cache size (and thus the amount of memory you can write to before the transaction gets too big) meant it wasn't very practical for much beyond optimistic lock acquisition.

For example, PyPy? isn't planning on doing a TSX port even with their enthusiasm for transactional memory.

---

ak217 1 day ago

> For example, PyPy? isn't planning on doing a TSX port even with their enthusiasm for transactional memory.

Do you have more information about their reasoning behind this? From my point of view this is the highest profile software project to potentially make use of HTM, and I recall reading that the plan was to eventually introduce hardware acceleration.

reitzensteinm 1 day ago

Here are a few links:

http://pypy.org/tmdonate.html (Search for "haswell")

http://grokbase.com/t/python/pypy-dev/13bvt3kg70/pluggable-h...

It seems to boil down to:

Interestingly, this does not bode well for HTM on a platform with many smaller cores, say a hypothetical 64 core ARM. Each core will have a tiny amount of L1 cache, severely limiting transaction size.

And many smaller cores is exactly where you'd want the benefits of HTM, since the overhead of synchronization is higher in proportion to the work each core can do.

sounds 1 day ago

In reference to the sibling post that gives more details about pypy, I'd like to call to your mind the history of the vector extensions for x86.

First revision: MMX. It reused the same registers as the older x87 floating point coprocessor (even though the x87 transistors lived on the same die). As a result, legacy x87 code and MMX code had to transition using an expensive EMMS instruction.

Second revision: (well, ignoring some small changes to MMX) ... SSE. Finally got its own registers, but lacked a lot of real-world capability.

Third revision: SSE2, finally got to a level of parity with competing vector extensions (see, for example, PowerPC?'s Altivec).

And so forth.

I guess the take-home lesson for me is that these new TSX instructions are indeed fascinating to play around with, but I wouldn't expect it to blow the doors off. Intel will incrementally refine it.

(The incremental approach also gives Intel a chance to study how it's being used and keeps AMD playing catch-up.)

MBCook 1 day ago

The other big problem with MMX was that it was integer only. While that might have been ok for some applications, 3D games and other software that could really use the boost needed floating point and not only couldn't benefit (since it was integer only), it actually interfered (since, as you said, it reused the registers).

AMD's 3DNow had single precision floating point support, so it was actually somewhat useful. SSE followed 3DNow and added single precision support (as well as fixing the register stuff). SSE2 added double precision support.

sounds 12 hours ago

Right, thanks for those additional details.

Today, no one would use MMX instructions (since SSE is vastly superior). I expect Intel will continue to add TSX capabilities which will eventually produce some nice results for parallel code.

---

" You mention "a good async story" -- that has already been addressed with asyncio. A better multi-processing API has also been addressed in the concurrent.futures module. It's true that "multi-core" is still unsolved in Python, but most practitioners work around this problem by just using process-oriented parallelism instead.

Nonetheless, the frequent complaints about multi-core haven't fallen on deaf ears; they will probably come next! It's pretty much the primary focus of the pypy project and recent development work has begun on a Software Transactional Memory implementation that would remove the GIL.

GvR? at PyCon? 2014 said that he thinks pypy is the future of core Python development -- and that at some point in the future, pypy might become the reference implementation rather than CPython. " -- https://news.ycombinator.com/item?id=8187180

---

to deal with distributed systems:

maybe have two basic error conditions, for three result conditions in total:

(1) success
(2) clean failure
(3) success or failure unknown; also, partial success possible

so, if a side-effectful RPC was sent but there was a network failure that prevents us from knowing if the RPC request was received or not, (3) would be returned.
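
a minimal sketch of that result type (illustrative names, C++ as a stand-in):

    #include <string>

    enum class RpcOutcome {
        Success,        // (1) the call definitely completed
        CleanFailure,   // (2) the call definitely did not take effect
        Unknown         // (3) request sent but reply lost: it may or may not have run,
                        //     and may have partially succeeded
    };

    struct RpcResult {
        RpcOutcome outcome;
        std::string detail;   // e.g. "network timeout after send"
    };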