Bayle Shanks's website: proj-oot-ootConcurrencyNotes3

Coordinated concurrent programming in Syndicate

http://lambda-the-ultimate.org/node/5301

---

https://downloads.haskell.org/~ghc/7.0.3/docs/html/users_guide/lang-parallel.html

https://wiki.haskell.org/GHC/Data_Parallel_Haskell

---

https://rcrowley.org/2010/01/06/things-unix-can-do-atomically.html

rwmj 15 hours ago

I always thought it would be a good idea for system calls to support transactions. Probably in a limited way because implementing general transactions would require massive changes to the kernel. But it would be nice to be able to do [error checking omitted]:

    begin ();
    fp = fopen ("file", "w");
    fputs (content, fp);
    fclose (fp);
    commit ();

It could solve the whole thing with ending up with zero-length files because you didn't use the right incantation to update a file atomically on ext4 (https://thunk.org/tytso/blog/2009/03/12/delayed-allocation-a...).

In Unix v7 mkdir was not a system call. It was a setuid program implemented using mknod + link. That was racy so the mkdir(2) system call was added. But it could have been solved more generally (and more elegantly) by adding transactions.

tobias3 14 hours ago

Windows has it, but not many people seem to use it: https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

krylon 14 hours ago

Beware, it appears to be deprecated:

"Microsoft strongly recommends developers utilize alternative means to achieve your application’s needs. Many scenarios that TxF? was developed for can be achieved through simpler and more readily available techniques. Furthermore, TxF? may not be available in future versions of Microsoft Windows."

netdog 22 hours ago

The GCC Atomic Builtins mentioned in the article are not specific to Unix. They are compiler constructs, and depend on specific architecture hardware support. All x86 CPUs have such support for some years now. So these atomic operations can also be used in non-Unix software running on x86 CPUs.

The GCC documentation lists other non-intel architectures which also have the features required to support the atomic built-ins.

comex 21 hours ago

Also, if you can depend on recent compilers you should probably be using the standard C <stdatomic.h> or C++ <atomic> instead.

---

" ReactiveCocoa? is inspired by functional reactive programming. Rather than using mutable variables which are replaced and modified in-place, RAC offers “event streams,” represented by the Signal and SignalProducer? types, that send values over time.

Event streams unify all of Cocoa’s common patterns for asynchrony and event handling, including:

    Delegate methods
    Callback blocks
    NSNotifications
    Control actions and responder chain events
    Futures and promises
    Key-value observing (KVO)

Because all of these different mechanisms can be represented in the same way, it’s easy to declaratively chain and combine them together, with less spaghetti code and state to bridge the gap.

For more information about the concepts in ReactiveCocoa?, see the Framework Overview. "

---

notation for multiple parallel input 'sources' to a computation:

eg "I also need to stress that this ((async/await)) syntax doesn't make Promise go away from your codebase. In fact, you must have a thorough understanding of them, which you'll frequently need.

A common example where Promise makes an appearence is code that requires multiple values as part of a loop, which are requested concurrently:

const ids = [1, 2, 3]; const values = await Promise.all(ids.map((id) => { return db.query('SELECT * from products WHERE id = ?', id); })); " [1]

eg forking

eg [2] says "A dojo command is a classic filter pipeline, with sources, filters, and sinks.". In a Unix command line, you have one source, then filters, then sinks. But what else could you have in the middle besides filters? You could have 'maps' (transformations on each item coming through). But you also might want to operate across items, eg reduce. Or reduce's dual, something that takes one item and produces many. But this is still linear; you have multiple items in a stream but only one stream. But what about 'multiplexing' multiple streams? Well i guess Unix does have this; you have STDOUT and STDERR 'multiplexed'. But you also want operators to do the multi-stream stuff in https://gist.github.com/staltz/868e7e9bc2a7b8c1f754 , like merge which merges two streams into one, and combineLatest.

so in general what notation do we have for situations with threads or streams where we would draw diagrams like

  -->--
 /     \

--->---* \-->--/

(which is supposed to indicate a directed graph with a source node, three edges from the source node, and a sink node; more generally you could have intermediate notes in any of the three streams, or even more branching)?

nodrygo

so Erlang remain the king here

according the doc:

    preemptive (based on number of reductions)
    309 words of memory in the non-SMP (little more for SMP)
    hundreds of thousands or even millions of processes
    messages processing
    manager for tree of process

hope that in far future Nim become a competitor here

interesting blog on Erlang internal scheduling http://jlouisramblings.blogspot.fr/2013/01/how-erlang-does-scheduling.html

---

Jehan

vbtt: Firstly, I'm not particularly keen on being able to move running routines, but I just want to point out the (stated) benefits.

I think we all (or at least most of us) know the benefits. The problem is that the benefits aren't free; for example, Go sacrifices a fair amount of memory safety to accomplish its goals and its scheduler can have issues dealing with foreign language libraries that are unaware of it. There are ways to reify a process algebra in a programming language (which is was Go does) that does not have this defect (but of course has different types of tradeoffs).

--- " For example, Go's runtime and API abstractions actually mask a lot of asynchronous behaviour. My understanding is that when you make a blocking call to some io.Reader.Read() or io.Writer.Write() (representing, say, a network socket), the runtime or the library actually tends to translate that into an asynchronous call (e.g. with epoll). This is how Go achieves its magic of seemingly blocking calls only blocking a single goroutine, rather than the whole process.

Java does have plenty of libraries providing asynchronous behaviour, but this comes "out of the box" with Go. I don't know of any Java library that offers lightweight concurrency with blocking semantics for ease of use that sits atop an asynchronous implementation for performance and concurrency. That's not a dig at Java - I just think it means you can't make a meaningful comparison.

I find it much easier to simply compare the performance of equivalent concurrent / parallel applications designed idiomatically - and Go has repeatedly come out on top here in my own experience. " --- " what you are requesting is something like a fiber library - the forthcoming library boost.fiber contains cooperatively scheduled fibers, mutexes/condition-vairables/barriers/... - the interface is similar to boost.thread. " ---

http://ferd.ca/beating-the-cap-theorem-checklist.html

" Beating the CAP Theorem Checklist

Your ( ) tweet ( ) blog post ( ) marketing material ( ) online comment advocates a way to beat the CAP theorem. Your idea will not work. Here is why it won't work:

( ) you are assuming that software/network/hardware failures will not happen ( ) you pushed the actual problem to another layer of the system ( ) your solution is equivalent to an existing one that doesn't beat CAP ( ) you're actually building an AP system ( ) you're actually building a CP system ( ) you are not, in fact, designing a distributed system

Specifically, your plan fails to account for:

( ) latency is a thing that exists ( ) high latency is indistinguishable from splits or unavailability ( ) network topology changes over time ( ) there might be more than 1 partition at the same time ( ) split nodes can vanish forever ( ) a split node cannot be differentiated from a crashed one by its peers ( ) clients are also part of the distributed system ( ) stable storage may become corrupt ( ) network failures will actually happen ( ) hardware failures will actually happen ( ) operator errors will actually happen ( ) deleted items will come back after synchronization with other nodes ( ) clocks drift across multiple parts of the system, forward and backwards in time ( ) things can happen at the same time on different machines ( ) side effects cannot be rolled back the way transactions can ( ) failures can occur while in a critical part of your algorithm

And the following technical objections may apply:

( ) your solution requires a central authority that cannot be unavailable ( ) read-only mode is still unavailability for writes ( ) your quorum size cannot be changed over time ( ) your cluster size cannot be changed over time ( ) using 'infinite timeouts' is not an acceptable solution to lost messages ( ) your system accumulates data forever and assumes infinite storage ( ) re-synchronizing data will require more bandwidth than everything else put together ( ) acknowledging reception is not the same as confirming consumption of messages ( ) you don't even wait for messages to be written to disk ( ) you assume short periods of unavailability are insignificant ( ) you are basing yourself on a paper or theory that has not yet been proven

... "

---

praise for Go's Goroutines: "how the language and the runtime transparently let Go programmers write highly scalable network servers, without having to worry about thread management or blocking I/O."

---

some things erlang and go have in common with their greenthreads:

they can resize their stacks by copying them -- because stack size can grow, they can start with a super small stack
being able to copy the stack appears to necessitate knowing which items in memory are pointers (so that they can fix the pointers when they move stuff, assuming that you can point into the stack). [3] [4]. Note that Haskell also makes a big deal out of having a mechanism to know where the pointers are in stuff [5] (and mb see also [6] for more about info tables being all over the place).
because scheduling is done in user space, no need for a full OS-driven context switch to switch between threads
preemption
channels that can easily be configured to send either async (buffered) or sync messages [7] [8]
hide callback-ish or event queue-based async i/o behind a blocking api, and then switch blocked greenthreads

---

on greenthreads:

"In the case of GHC Haskell, a context switch occurs at the first allocation after a configurable timeout" [9]

"Most Smalltalk virtual machines do not count evaluation steps; however, the VM can still preempt the executing thread on external signals (such as expiring timers, or I/O becoming available). Usually round-robin scheduling is used so that a high-priority process that wakes up regularly will effectively implement time-sharing preemption:"

"Other implementations, e.g. QKS Smalltalk, are always time-sharing. Unlike most green thread implementations, QKS Smalltalk also has support for preventing priority inversion."

---

Python Copperhead looks awesome. I already copied a list of its primitives into OA Ops Notes1, and copied my summary of its type system into ootTypeNotes5.

Summary (copied from Self:proj-plbook-plPartConcurrencyTodos ; more details are in that file, just above the summary)

Python Copperhead compiles a pure subset of Python (to C++, where it uses the Thrust library to run it on CUDA, TBB, or OpenMP?). Functions to be compiled to Copperhead code are marked with '@cu'. The primitive functions are given above; in addition, list comprehensions can be used as syntactic sugar. Copperhead is statically typed (and type inferred). The types are (mostly):

scalars: float(32/64), int(32/64), bool
tuples of scalars (note: you must access tuples by deconstructing bind, not with the [] operator, because [] is polymorphic across different tuple shapes)
homogeneous flat lists of scalars and tuples of scalars
tuples of all of the above

Python Copperhead executes async and returns futures (but "successive calls to Copperhead functions always wait for previous calls to finish before beginning").

Can use 'with places.gpu0' 'with places.openmp', etc, to control what is run on the GPU.

Example:

from copperhead import *
import numpy as np

@cu
def saxpy(a, x, y):
  return [a * xi + yi for xi, yi in zip(x, y)]

x = np.arange(2**20, dtype=np.float32)
y = np.arange(2**20, dtype=np.float32)

with places.gpu0:
  gpu_result = saxpy(2.0, x, y)

with places.openmp:
  cpu_result = saxpy(2.0, x, y)

Links:

---

details on Python Copperhead, already read, no need to re-read, just here for reference:

from copperhead import *
import numpy as np

@cu
def saxpy(a, x, y):
  return [a * xi + yi for xi, yi in zip(x, y)]

x = np.arange(2**20, dtype=np.float32)
y = np.arange(2**20, dtype=np.float32)

with places.gpu0:
  gpu_result = saxpy(2.0, x, y)

with places.openmp:
  cpu_result = saxpy(2.0, x, y)

"Copperhead defines a small functional, data parallel subset of Python which it dynamically compiles and executes on parallel platforms, such as NVIDIA GPUs and multicore CPUs through OpenMP? and Threading Building Blocks (TBB). " [10]

https://devblogs.nvidia.com/parallelforall/copperhead-data-parallel-python/

"Copperhead programs are embedded in Python programs using a clearly delineated subset of the Python language. ... We use a Python function decorator @cu, to mark the functions which are written in the subset of Python supported by Copperhead. Python programs execute normally, until the first call from the Python interpreter to a Copperhead decorated function. From that point, the program executes via the Copperhead runtime until it returns from the original entry-point function back to the Python interpreter. Data is garbage collected using Python’s standard garbage collector. ... Looking at the body of axpy, we see an element-wise operation being performed over the two input arrays, applied via a Python list comprehension. This list comprehension will be executed in parallel. Equivalently, the user could have written axpy using map and a lambda anonymous function.

@cu def axpy(a, x, y): return map(lambda xi, yi: a * xi + yi, x, y)

Or via a named, nested function.

@cu def axpy(a, x, y): def el(xi, yi): return a * xi + yi return map(el, x, y)

All three forms perform identically, programmers are free to use whichever form they find most convenient.

Python’s map operates element-wise over sequences, passing elements from the input sequences to the function it invokes. What about auxiliary data that is not operated on element-wise, but needs to be used in an element-wise computation? Copperhead programs use closures to broadcast this data; in this example, you can see that the element-wise function uses the scalar a from the enclosing scope. ...

it interoperates with Python’s widely-used numpy library to provide homogeneous, well-typed arrays ... note that this function contains no loops. In general, Copperhead does not support side effects, where the value of a variable is changed. ... For example, ... This style of programming is not supported in Copperhead...:

def side_effect_axpy(a, x, y): for i in range(len(y)): y[i] = a * x[i] + y[i]

... As a consequence of this decision, all branches in Copperhead must terminate with a return statement. ...

Copperhead Primitives ...

map(f, ...) Applies the function f element-wise across a number of sequences. The function f should have arity equal to the number of arguments. Each sequence must have the same length. For practical reasons, this arity is limited to 10. For programmer convenience, list comprehensions are treated equivalently to map, so the following two lines of code are equivalent.

z = [fn(xi, yi) for xi in zip(x, y)] z = map(fn, x, y)

reduce(fn, x, init) Applies binary function fn cumulatively to the items of x so as to reduce them to a single value. The given function fn is required to be both associative and commutative, and unlike Python’s built-in reduce, parallel semantics mean that the elements are not guaranteed to be reduced from left to right.

filter(fn, x) Returns a sequence containing those items of x for which fn(x) returns True. The order of items in sequence x is preserved.

gather(x, indices) Returns the sequence [x[i] for i in indices]

scatter(src, indices, dst) Creates a copy of dst and updates it by scattering each src[i] to location indices[i] of the copy. If any indices are duplicated, one of the corresponding values from src will be chosen arbitrarily and placed in the result. The updated copy is returned.

scan(fn, x) Returns the inclusive scan (also known as prefix sum) of fn over sequence x. Also, rscan, exclusive_scan, and exclusive_rscan.

zip(...) Returns a sequence of tuples, given several sequences. All input sequences must have the same length.

unzip(x) Returns a tuple of sequences, given a sequence of tuples.

sort(fn, x) Returns a sorted copy of x, given binary comparator fn(a,b) that returns True if a < b.

indices(x) Returns a sequence containing all the indices for elements in x.

replicate(x, n) Returns a sequence containing n copies of x, where x is a scalar.

range(n) Returns a sequence containing [0, n).

bounded_range(a, b) Returns a sequence containing [a, b)

Copperhead Type System

Copperhead has a simple type system that operates on only a few basic types: scalars, tuples, and sequences. ...

    np.float32 : 32-bit floating point number
    np.float64 : 64-bit floating point number
    np.int32 : 32-bit integer number
    np.int64 : 64-bit integer number
    np.bool : Boolean

...

In addition to the five basic scalar types, Copperhead programs can use tuples of these types, and these tuples can be nested. ...

You access elements of a tuple by unpacking it into multiple elements, as commonly done in Python code. For example, k0, v0 = kv0 unpacks kv0 into two elements. This unpacking can be done in a bind statement, as in...k0, v0 = kv0...or in the parameters to a function. ...

Unlike standard Python, Copperhead does not support dynamic, random access to tuples via the [] operator. Since tuples are heterogeneously typed, random access to tuples would require dynamic typing, which Copperhead does not support.

Along with scalars and tuples, Copperhead programs also use sequences. Sequences are similar to Python lists, with the restriction that, like numpy arrays, they must be homogeneously typed. Sequences can be indexed using the [] operator, but only when reading elements. They cannot be assigned to using the [] operator in Copperhead code, since Copperhead does not support side effects.

When calling a Copperhead function, the inputs must be types that the Copperhead runtime understands. These types are:

the 5 scalar types from numpy, along with the 3 scalar types from Python that we outlined earlier ((these are treated as syntactic sugar for the 5 numpy types));
homogeneously typed Python lists, where each element is one of the scalar types mentioned earlier, or a tuple of those scalar types;
one-dimensional numpy arrays, where each element is a supported scalar type;
tuples of the five scalar types and of sequence types;
copperhead.cuarray. This sequence type is returned by Copperhead programs. It manages memory automatically across heterogeneous memory spaces, conserving bandwidth by lazily transferring data.

The runtime converts all sequence types to copperhead.cuarray types before calling a Copperhead function. ...

Functions can be used as values in Copperhead programs, but at present lambdas cannot escape the scope at which they are defined. In practice, this means that Copperhead programs make use of nested functions that close over values as arguments to primitives like map, but a program cannot accept a function as an argument when called from Python, and it cannot produce a function and return it to Python. This restriction could be lifted in the future.

...

The programmer controls the execution “place” (e.g. CPU or GPU) using Python with statements.

...

def on_gpu(): with places.gpu0: foo()

def on_cpu(): with places.openmp: foo()

...

The Copperhead runtime lazily moves data between memory spaces as necessary....Returned values from Copperhead procedures are futures...successive calls to Copperhead functions always wait for previous calls to finish before beginning

...

The Copperhead runtime automatically invokes compilers to create binaries for Copperhead functions. These binaries are cached persistently in a __pycache__ directory ...

The Copperhead compiler aggressively fuses primitive operations together, to optimize memory bandwidth usage. It generates Thrust code to implement computations, and the output of the compiler can be examined in the __pycache__ directory after a program has been executed. ...

At present, Copperhead programs can only be invoked on flat, one-dimensional sequences, although we envision removing these restrictions with time.

https://github.com/bryancatanzaro/copperhead

---

Rust futures and streams look great:

http://aturon.github.io/blog/2016/08/11/futures/

---

pypy-stm appears to be one of the latest attempts to remove the GIL from Python (Stackless didn't remove the GIL, it was about microtasks; PyPy? incorporated much of Stackless but also didn't remove the GIL)

http://doc.pypy.org/en/latest/stm.html

might be useful to take a look at their implementation:

"The core of the implementation is in a separate C library called stmgc, in the c7 subdirectory (current version of pypy-stm) and in the c8 subdirectory (bleeding edge version). Please see the README.txt for more information. In particular, the notion of segment is discussed there.

PyPy? itself adds on top of it the automatic placement of read and write barriers and of “becomes-inevitable-now” barriers, the logic to start/stop transactions as an RPython transformation and as supporting C code, and the support in the JIT (mostly as a transformation step on the trace and generation of custom assembler in assembler.py).

...

The core of STM works as a library written in C (see reference to implementation details below). It means that it can be used on other interpreters than the ones produced by RPython. Duhton is an early example of that. At this point, you might think about adapting this library for CPython. You’re warned, though: as far as I can tell, it is a doomed idea. I had a hard time debugging Duhton, and that’s infinitely simpler than CPython. Even ignoring that, you can see in the C sources of Duhton that many core design decisions are different than in CPython: no refcounting; limited support for prebuilt “static” objects; stm_read() and stm_write() macro calls everywhere (and getting very rare and very obscure bugs if you forget one); and so on. You could imagine some custom special-purpose extension of the C language, which you would preprocess to regular C. In my opinion that’s starting to look a lot like RPython itself, but maybe you’d prefer this approach. Of course you still have to worry about each and every C extension module you need, but maybe you’d have a way forward. "

-- http://doc.pypy.org/en/latest/stm.html#reference-to-implementation-details and http://doc.pypy.org/en/latest/stm.html#python-3-cpython-and-others

---

pypy-stm's TransactionQueue? construct looks good:

http://doc.pypy.org/en/latest/stm.html#transaction-transactionqueue

that document also explains a more blunt, lower-level construct called atomic sections: "Atomic sections are similar to re-entrant locks (they can be nested), but additionally they protect against the concurrent execution of any code instead of just code that happens to be protected by the same lock in other threads." -- [11]

---

" Most main stream languages today use a shared state model - all threads share the same memory space, and programmers must be diligent about locking memory in order to prevent race conditions. If locks are used incorrectly, then deadlocks can occur. This is a fairly low level approach to managing concurrency. It also makes it quite difficult to create composable software components.

Fantom tackles concurrency using a couple techniques:

    Immutability is built into the language (thread safe classes)
    Static fields must be immutable (no shared mutable state)
    Actors model for message passing (Erlang style concurrency)

---

one guy tried Haxe partially b/c

" Re-using browser's JS engine is nearly impossible for such complex things because of async nature of Unity<->Browser interop which conflicted with our synchronous game logic architecture. Also we certainly didn't want to deal with differences between various browsers and plugins installed by thousands of users all around the world. "

---

interesting only insomuch as it's a list of functions that a message-passing system should/could have:

"Barrelfish provides a uniform interface for passing messages between domains, which handles message formating and marshalling, name lookup, and end-point binding."

---

The part about Rust having no shared memory is just wrong. Rust has a very rich set of concurrency primitives at this point: you can use immutable shared memory (Arc), mutexes (MutexArc?), reader-writer locks (RWarc), and atomic variables (AtomicInt? and friends). And if you're willing to drop down to unsafe code, you get the full set of LLVM concurrency primitives. "

Python's FFI locks them into having to do this:

 @coroutine
     def getresp(): 
        s = socket()
         yield from loop.sock_connect(s, host, port)
         yield from loop.sock_sendall(s, b'xyzzy')
         data = yield from loop.sock_recv(s, 100)
        # ...

instead of this:

    def getresp():
        s = socket()
        s.connect((host,port))
        s.sendall(s,b'xyzzy')
        data = s.recv(s,100)

[12]

b/c

"Because the CPython API relies so heavily on the C stack, either some platform-specific assembly is required to slice up the C stack to implement green threads, or the entire CPython API would have to be redesigned to not keep the Python stack state on the C stack.

Way back in the day [1] the proposal for merging Stackless into mainline Python involved removing Python's stack state from the C stack. However there are complications with calling from C extensions back into Python that ultimately killed this approach. ... using the Stackless strategy in mainline python would have either required breaking a bunch of existing C extensions and placing limitations on how C extensions could call back into Python, or custom low level stack slicing assembly that has to be maintained for each processor architecture. CPython does not contain any assembly, only portable C, so using greenlet in core would mean that CPython itself would become less portable.

Generators, on the other hand, get around the issue of CPython's dependence on the C stack by unwinding both the C and Python stack on yield. The C and Python stack state is lost, but a program counter state is kept so that the next time the generator is called, execution resumes in the middle of the function instead of the beginning.

There are problems with this approach; the previous stack state is lost, so stack traces have less information in them; the entire call stack must be unwound back up to the main loop instead of a deeply nested call being able to switch without the callers being aware that the switch is happening; and special syntax (yield or yield from) must be explicitly used to call out a switch.

But at least generators don't require breaking changes to the CPython API or non-portable stack slicing assembly. So maybe now you can see why Guido prefers it. "

[13]

---

keywords: channel file stream socket BSD berkeley tcp tcp/ip network api server connect bind listen

i guess a difference between a 'channel' API and a 'socket' API, such as is often used for TCP/IP (https://en.wikipedia.org/wiki/Berkeley_sockets#Socket_API_functions according to https://www.joyent.com/blog/tcp-puzzlers ), is that the 'socket' API lets you 'listen' and then 'accept' on a port (server) or 'connect' (client) and then setup and take down a persistent 'connection'.

---

how would the brain handle race conditions? I just thought of a brainlike approach to race conditions:

just rerun the same code 3 or more times in parallel in different processes, and then have those processes vote on the result. Assuming the race condition occurs less than 50% of the time, this gives you a better chance of getting the right answer as you add more parallel redundancy.

---

" Of all the ways of doing concurrency, callbacks are by far the worst, Twisted was plagued by them and is the main reason why it failed, and that was with a much more sane and reasonable language like Python (stackless Python was a much better alternative and used a model similar to Go’s CSP).

And the sad thing is that there are much better alternatives around with much more sound models and environments, Erlang and Go are the two obvious examples, and that is for the highly specialized situations where you have great concurrency needs, for any other problem anything else will be much better than Node.js, even PHP.

– uriel, in response to Why Node.JS is absolutely terrible, by Hasen el Judy "

---

MostAwesomeDude? 1793 days ago [-]

I wonder if perhaps he doesn't realize that Ted Dziuba is not a fan of Twisted either. He's generally recognized as a very belligerent, assertive personality, in the same vein as Zed Shaw, and you have to have a certain amount of thick skin when reading his commentary.

That said, the fact that Node doesn't provide the tools necessary to defer blocking JS code to a thread does pose a problem for these sorts of situations. Apparently (and correct me if I'm wrong; I'm not a Node expert!) Node won't let you run JS in any thread which is not the main thread. Twisted does let you run Python in non-main threads with the deferToThread()/callFromThread()[1] functionality.

I also agree with him about JS being a poor language for server-side work, but that's because I don't think JS's object model is well-suited to large, interface-driven/service-driven applications, and that isn't really a gripe with Node.

http://twistedmatrix.com/documents/current/api/twisted.internet.interfaces.IReactorThreads.html http://twistedmatrix.com/documents/current/api/twisted.internet.threads.html

document threading in Twisted.

---

on the greenArray (GA144) Forth chips:

" avodonosov 503 days ago [-]

Another Forth CPU by the Forth creator Chuck Moore and colleagues: http://www.greenarraychips.com/. It's has 144 cores on a square centimeter chip.

Each core is equipped with its own little data and control Forth stacks, making it a fully fledged independent computer (that's why the more precise term is "multi-computer chips" rather than "multi-core").

The cores talk to each other via communication ports. Writing to a port suspends the core until the peer reads the value. And vice-versa. (similar to channels in Go language).

Some other interesting properties (quoting the docs):

A computer can read from multiple ports [corresponds to Go's select] and can execute instructions directly from those ports.

FINE GRAINED ENERGY CONTROL: ... The read or write instruction is automatically suspended in mid-operation if the address [one or more of communication ports and I/O pin] is inactive, consuming energy only due to transistor leakage currents, resuming when the address becomes active.

NO CLOCKS: Most computing devices have one or more clocks that synchronize all operations. When a conventional computer is powered up and waiting to respond quickly to stimuli, clock generation and distribution are consuming energy at a huge rate by our standards, yet accomplishing nothing. This is why “starting” and “stopping” the clock is a big deal and takes much time and energy for other architectures. Our architecture explicitly omits a clock, saving energy and time among other benefits.

http://www.greenarraychips.com/home/documents/greg/PB002-100822-GA-Arch.pdf "

misc: http://excamera.com/sphinx/article-ga144-ram.html

---

http://excamera.com/sphinx/article-ga144-ram.html is so cool that i'm going to copy it here so that i see it later:

" GA144 note: one node RAM

A node X can use all 64 words of a neighbor node R0 as RAM.

Assuming X's b register points to R0, and R0 is executing from port X, this is how node X reads word at addr from node R0:

: ramread ( addr -- v ) @p !b !b @p a! @ !p @b ;

it works by sending a tiny four-opcode program @p a! @ !p to node R0. R0 executes this program:

        @p asks X for another word
        a! stores the word in R0's a register
        @ fetches from R0's RAM at address a
        !p writes the result back to the caller, in this case node X.

After node X has sent this program to R0, it reads the result back from R0.

Writing to R0 can be done in a couple of ways. To write a value v to the same location as was just read, X can do:

: ramwrite ( v -- ) @p !b !b ; @p ! . .

The trick here is that R0's a already points to the correct word, because of the preceding ramread.

The above are good factors for a read/modify/write, for example to negate a word in RAM:

call ramread - call ramwrite "

---

	A 32nm 1000-Processor Array (

willvarfar 70 days ago [-]

So how do you program such a beast? What progress is being made on that front?

Cache coherency seems really hard to give up on, and even CPU-GPU cache coherency is becoming the expected norm, with even ARM delivering it.

RachelF? 70 days ago [-]

It's very hard.

In the mid 1980s there was a CPU called a "Transputer" [1] made some of the people who moved to ARM. These CPUs could be connected together in huge networks and directly talk to each other.

The network of CPU's could auto-discover its topology, but coding for so many CPU's was difficult. Some specific algorithms scaled well with the number of CPUs, but most did not.

[1] https://en.wikipedia.org/wiki/Transputer

jacquesm 70 days ago [-]

Occam did it quite nicely.

atrn 70 days ago [-]

transputers had four links and comms beyond that required routing in s/w. I wrote such a thing for a transputer machine.

Also, the auto-discovery wasn't really auto... That was the boot code probing for CPUs on other ends of links and propagating itself to connected CPUs, building a map of the network in the process (which is kind of cool).

sedachv 70 days ago [-]

The Connection Machine was a SIMD design and the languages available for it (StarLisp?: https://omohundro.files.wordpress.com/2009/03/omohundro86_the_essential_starlisp_manual.pdf and C*: http://people.csail.mit.edu/bradley/cm5docs/CStarProgrammingGuide.pdf were actually pretty good compared to OpenCL?.

The Connection Machine Lisp programming language described in Daniel Hillis' PhD? dissertation was essentially going to be Lisp with parallel map/reduce but AFAIK was never done being implemented.

I think the big problem was that most of the SIMD algorithms were yet to be discovered at the time. For example this paper by Hillis and Steele was a very big deal but looks kind of basic today: http://uenics.evansville.edu/~mr56/ece757/DataParallelAlgorithms.pdf

Guy Blelloch did a lot of work on Connection Machines and basically wrote the book on SIMD programming: http://www.cs.cmu.edu/~blelloch/papers/Ble90.pdf He also made a very nice programming language for parallel computing (NESL), as did Gary Sabot who worked at Connection Machines (Paralations). When you compare those to Hadoop or OpenCL? it really is a wonder where we went wrong and what the designers of the latter were thinking (or not).

---

endergen 70 days ago [-]

I'm an ex game engine developer and I bristle anytime anyone thinks any existing functional language is better for multicore. Specifically garbage collection alone will make any language an order of magnitude slower generally per a single core. Also the C/C++ game development community at least has great approaches to multicore which makes C/C++ linearly scale with scores to boot, see for example: http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine

((someone else said: "Fibers" like you linked in the presentation (m:n green thread scheduling) have been in use for decades. Many, many languages other than C++ have had them for over a decade. Go is built on them."))

((and also: "Functional programming languages do not _require_ a GC. They just largely have it."))

I love functional languages, more for thinking in them, prototyping ideas, especially compilers/visualizers, and etc. But for any language that adds garbage collection, immutable data structures (way more operations per write and crazy memory thrashing/alignment issues), unless used sparsely or in a mixed paradigm (ugh, except maybe scala/clojure) are going to pay a magnitude of performance loss.

Mind you there are tricks around using more system languages (C/C++/Rust/D etc) for a lot of the heavy lifting with the application core being functional that gets you closer to the best of both worlds.

---

summary of https://glyph.twistedmatrix.com/2014/02/unyielding.html (copied from plPartConcurrency):

threads with shared state are dangerous, because threads can interleave in any way (or even execute in parallel). This makes it hard to reason locally about code.
concurrency paradigms such as explicit, cooperative coroutines (non-preemptive multitasking), where each thread executes in isolation until it yields, and any function that might yield must be syntactically marked, is better, because you only have to worry about other threads executing at the yield.
greenthreads are presented as one solution to the problems of threading. But greenthreads with shared state have the same disadvantage noted above as ordinary threading; it's hard to reason locally about code because, even if individual instructions cannot execute in parallel, it is still true that different threads may interleave arbitrarily
note that this is also an argument for why systems like async/await should indeed mark things as async, rebutting http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/
some alterative concurrent systems include:
- threading with shared state
- callbacks, eg Twisted’s Deferred, JavaScript’s? Promises/A[+], E’s Promises
- futures/promises (managed callbacks), eg Twisted’s Deferred, JavaScript’s? Promises/A[+], E’s Promises
- explicit coroutines (cooperative multitasking with explicit yield, where any function that might call a yield is syntactically marked), eg Java’s “green threads”, Twisted’s Corotwine, eventlet, gevent
- implicit coroutines (eg Java’s “green threads”, Twisted’s Corotwine, eventlet, gevent)

another post making similar points is https://glyph.twistedmatrix.com/2012/01/concurrency-spectrum-from-callbacks-to.html . That one adds a few details:

"at the simplest end of the spectrum, you have callback-based concurrency...This is very explicit and reasonably straightforward to debug and test, but it can be tedious and overly verbose, especially in Python where you have to think up a new function name and argument list for every step. The extra lines for the function definition and return statement..Twisted's Deferreds make this a bit easier than raw callback-passing without fundamentally changing the execution dynamic ... Then you have explicit concurrency, where every possible switch-point has to be labeled...This is more compact than using callbacks, but also more limiting. For example, you can only resume a generator once, whereas you can run a callback multiple times. However, for a logical flow of sequential concurrent steps, it reads very naturally, and is shorter, as it collapses out the 'def' and 'return' lines, and you have to think of at least two fewer names per step. ... a cooperatively multithreading program with implicit context switches makes every line with any function call on it (or any line which might be a function call, like any operator which can be overridden by a special method) a possible, but not likely culprit. Now when you have a concurrency bug you have to audit absolutely every line of code you've got, ... All the way at the end of the spectrum of course you have preemptive multithreading, where every line of code is a mind-destroying death-trap hiding every possible concurrency peril you could imagine, and anything could happen at any time ... Personally I like Twisted's style best; the thing that you yield is itself an object whose state can be inspected, and you can write callback-based or yield-based code as each specific context merits. My opinion on this has shifted over time, but currently I find that it's best to have a core which is written in the super-explicit callback-based approach with no coroutines at all, and then high-level application logic which wraps that core using yield-based coroutines (@inlineCallbacks, for Twisted fans). "

---

    In the past, oh, 20 years since they invented threads, lots of new, safer models have arrived on the scene. Since 98% of programmers consider safety to be unmanly, the alternative models (e.g. CSP, fork/join tasks and lightweight threads, coroutines, Erlang-style message-passing, and other event-based programming models) have largely been ignored by the masses, including me.

Shared memory concurrency is still where it’s at for really high performance programs, but Go has popularized CSP; actors and futures are both “popular” on the JVM; etc. "

---

smsm42 4 days ago [-]

> the mental model works better for asynchrony; instead of describing a series of steps to follow, and treating the interrupt as the exception, you describe the processes that should be undertaken in certain circumstances

I could never understand how this works better as a mental model. Say you ask somebody to buy you a gadget in a store they don't know. What do you do tell them:

a) "drive in your car on this street, turn left on Prune street, turn right on Elm street, the store will be after the second light. Go there, find "Gadgets" isle, on the second shelf in the middle there would be a green gadget saying "Magnificent Gadget", buy it and bring it home"

or:

b) when you find yourself at home, go to car. When you find yourself in the car, if you have a gadget, drive home, otherwise if you're on Elm street, drive in direction of Prune Street. If you're in the crossing of Elm street and Prune street, turn to Prune street if you have a gadget but to Elm street if you don't. When you are on Prune street, count the lights. When the light count reaches two, if you're on Prune street, then stop and exit the vehicle. If you're outside the vehicle and on Prune street and have no gadget, locate store and enter it, otherwise enter the vehicle. If you're in the store and have no gadget then start counting shelves, otherwise proceed to checkout. Etc. etc. - I can't even finish it!

I don't see how "steps to follow" is not the most natural mental model for humans to achieve things - we're using it every day! We sometimes do go event-driven - like, if you're driving and somebody calls, you may perform event-driven routine "answer the phone and talk to your wife" or "ignore the call and remember to call back when you arrive", etc. But again, most of these routines will be series of steps, only triggered by an event.

---

" Communicating sequential processes (Go, PHP+MySql?) makes IO have a modestly simpler synchronous syntax at the cost of communicating between I/O operations much more complex (sending a message to a port or performing some sort of transaction instead of just assigning to a value). It's a tradeoff. " -- [14]

---

i guess the picture that is coming together is:

you have shared-nothing processes with channels:

Go-style preemptive greenthreads with channels and no shared state, and/or Erlang/Elixir-style processes, and/or CSP, and/or quasar/pulsar, and/or clojure.async (and/or occam?) (and/or jcsp (java csp)) (and/or actors, although http://clojure.com/blog/2013/06/28/clojure-core-async-channels.html contrasts actors and CSP)
- regarding CSP/actors/pi-calculus; i think what we want is the most general/featureful combination of these things, so we have channels, and the channels can be variable-buffered ('infinitely'-buffered, until global memory runs out; like Erlang, i think) (async) or have a fixed buffer (like Golang, async) or can have a 0 buffer (sync) (like Golang, which lets channels be a fixed buffer or unbuffered) and channels are first-class but anonymous, and processes can be anonymous or can have IDs (but the application can choose not to send these IDs down channels if it wants to keep processes 'anonymous'). You can 'poll' across multiple channels at once, and the poll can pattern-match on messages so that only messages of a certain type (or even a certain 'shape', structural typing) are selected. You can introspect on the buffer of incoming messages (its size, whether it is fixed or variable, if fixed, what the fixed size is, and its contents). Processes can dynamically spawn new processes and channels. There are primitive 'timer processes' provided. A single channel can have multiple 'read' ports and/or multiple 'write' ports (like Golang; and like Golang, "sequentially ((interleave)) messages in an arbitrary way"). (what happens when one side or the other closes and then the other tries to write? i recall something about infinite blocking in Golang which was bad, see:
  - recalling these from elsewhere, consider this stuff:
  - Using most things that are nil causes a panic. But sending to or receiving from a nil channel blocks forever.
  - * "The author points out that channel teardown is hard. He's right. Figuring out how to shut down your Go program cleanly can be difficult, especially since calling "close" on a closed channel causes a panic. You have to send an EOF on each channel so the receiver knows to stop. When you have a pair of channels going in opposite directions between two goroutines, and either end can potentially initiate shutdown, it gets messy." [15]
  - "For concreteness, at least from what I've experienced, the "messiness" is that if you close one of these channels, you may have to "drain" the other channel lest you let the other side block. If the other side is only using the channel in a "select" block with other options you may not need to but if it ever does a "bare" send you need to wait for the other end to send its close." [16]
  - https://github.com/golang/go/issues/14601 and https://github.com/golang/go/issues/11344 and "ok := n -> ch" syntax
  - https://news.ycombinator.com/item?id=11211091 (i already read that but you may want to take another look at some point)

would be nice if frameworks/programs could guarantee, via the type system, that they aren't going us the full gamut of features of our message passing (eg that they are only using synchrononus channels, or that they are only using infinite-buffer channels, etc).

and you also have have STM (software transactional memory)

and then within each 'process' you can have 'threads' with shared state. The following is for handling threading with shared state:

(somewhat) primitively, callbacks (and event-driven programming)
built on top of that, promises (actually, we want their generalization into streams)
built on top of that, async/await

note that both 'process' and 'thread' here can be implemented as 'greenthreads' within a single Oot process (or, better, multiplexed onto n Oot processes where n = # of CPUs, or mb by default # of CPUs minus 1 (to preserve UI latency for the user's other applications, since we are by default targeting desktops over servers))

to extend this to a distributed systems context we have to add:

automagically marshall stuff sent over channels, and convert intra-process channels to IPC channels, and IPC channels to over-the-network channels
the possiblity of trying to send or receive a message and then getting an error (do we even need this? see next bullet point)
the possiblity of trying to send or receive a message and it seeming to complete but actually failing
binding channels to 'ports'; 'listen'ing and then 'accept'ing on a port (server) or attempt 'connect'ing (client), and setting up and taking down a persistent 'connection'
'addressses' (names) of nodes (and nodes own ports)
should we add retry/'guaranteed delivery' (at least once/at most once? is exactly once possible? can we sequence streams of packets, like TCP? vector clocks?) on top of this?
does zeromq (zmq) have anything else we should learn here?
someone once said "Well, BSD sockets don't use file abstraction. Plan 9 recovered "Unix philosophy" in that regard. In my opinion.". Do we have anything to learn here from socket vs. file, stream vs. packet
what else?

also need the things that Milewski's system doesn't have to lock:

immutable things (can be sent across channels)
threadlocal varibles
unique (un-aliased) pointers (can be sent across channels)

also need:

atoms (atomically-accessed single memory locations with CAS, which (should? must?) hold immutable data)

clojure's 4 concurrency things (plus core.async, which is covered above):

agents (TODO) note "Agents are integrated with the STM - any dispatches made in a transaction are held until it commits, and are discarded if it is retried or aborted."
covered above: STM, threadlocals, atoms

todo:

join calculus's "multi-way join patterns, the ability to match against messages from multiple channels simultaneously", eg you can have a function that is 'called' from multiple other processes, each of them passing some arguments, and which doesn't start executing until it has all of its arguments; and it can also 'return' different results to various of these processes.
threads, atomics, critical sections, barriers, mutexes, semaphors, conditions vars, fork, https://en.wikipedia.org/wiki/Rendezvous_(Plan_9) ("The rendezvous call takes a tag and a value as its arguments. The tag is typically an address in memory shared by both processes. Calling rendezvous causes a process to sleep until a second rendezvous call with a matching tag occurs. Then, the values are exchanged and both processes are awakened.") ( https://rosettacode.org/wiki/Rendezvous ), etc
map reduce
https://en.wikipedia.org/wiki/Parallel_programming_model
monitors
wait-free algorithms

---

the following are somewhat related: channel file stream socket port

---

" Node came of age about a decade after epoll was introduced, when not having access to nonblocking IO was considered a big liability for a couple of dominant web programming languages, and they built their concurrency model around the semantics of epoll.

However, there are languages like Haskell, Erlang, and Go that IMO did the right thing by building a synchronous programming model for concurrency and offering preemptable lightweight processes to avoid the overhead associated with OS thread per connection concurrency models. These languages offer concurrent semantics to programmers, yet still are able to use nonblocking IO underneath the covers by parking processes waiting on I/O events. It's not the right tradeoff for every language, particularly I think lower level languages like Rust are better off not inheriting all the extra baggage of a runtime like this, but for higher level languages I think its probably the most convenient model to programmers.

---

int_19h 7 days ago [-]

Apparently, I wasn't paying as much attention to the state of async in Python 3.5 as I should have. Looks like they actually have the most featureful implementation of that right now, complete with async for all constructs where it can be sensibly implemented and provide some benefits (like for-loops and with-blocks).

---

" But there is one great reason why node's async first mentality is superior- when you want to do async in node, you don't have to worry that some library you are using is going to lock up your thread.

In any other language, you have to painstaikingly make sure everything you use isn't doing sync, or try and monkeypatch all io operations (python has something like that). " [17]

icebraining 8 days ago [-]

In any other language

No; for example, in Go, every time the code does IO, the scheduler will re-assign another goroutine, and they asynchronously get back to the other when the IO finishes. It doesn't need any special support by the library.

---

ga144's capability of executing code streamed to it from a port is really cool

---

http://www.kamaelia.org/Home.html

---

" I've heard the anecdote that Ruby threading was so slow because Ruby used "green threads" instead of native threads ... From what I've read, Ruby does a lot more work than a native context-switch, mainly by saving and restoring the entire stack by copying memory to and from the heap. The reason that native threads are not used (if this is still true) is probably for simplicity rather than performance. "

---

"Code that's protected...from indeterminacy in a multithreading context - is called thread-safe." -- [18]

"A program or method is thread-safe if it has no indeterminacy in the face of any multithreading scenario." -- [19]

"General-purpose types are rarely thread-safe in their entirety, for the following reasons:

    The development burden in full thread safety can be significant, particularly if a type has many fields (each field is a potential for interaction in an arbitrarily multithreaded context).
    Thread safety can entail a performance cost (payable, in part, whether or not the type is actually used by multiple threads).
    A thread-safe type does not necessarily make the program using it thread-safe, and often the work involved in the latter makes the former redundant.

Thread safety is hence usually implemented just where it needs to be, in order to handle a specific multithreading scenario." 00 [20]

---

" Primitive types aside, few .NET Framework types, when instantiated, are thread-safe for anything more than concurrent read-only access. The onus is on the developer to superimpose thread safety, typically with exclusive locks. (The collections in System.Collections.Concurrent are an exception.) "

---

each primitive needs to be labeled as to whether the scheduler should put the executing greenthread to sleep when it occurs (eg blocking I/O, yes) although i guess this isn't very effective unless the scheduler knows how to check if the thing is done

---

" Careful implementations of GreenThreads?? can be very lightweight. A good example is OzLanguage?'s built-in thread support, which effortlessly tracks thousands of threads to support an interesting style of constraint-solving programming. (Most of these threads are usually blocked, but the efficiency with which these numbers are managed is still impressive.) " -- http://c2.com/cgi/wiki?GreenVsNativeThreads

---

"since you know when a greenlet will context switch, you may be able to get away with not creating locks for shared data-structures." -- [21]

---

" package main

import "fmt"

var x = 1

func inc_x() { test for { x += 1 } }

func main() { go inc_x() for { fmt.Println(x) } }

I recognize that I should be using channels to prevent race conditions with x, but that's not the point here. The program prints 1 and then seems to loop forever (without printing anything more). ... because the main function never yields back to the thread and is instead involved in a busy loop ...because the main function never yields back to the thread and is instead involved in a busy loop " -- [22]

(but now that they are preemptive, i guess this is fixed?)

(no! it's not! http://www.sarathlakshman.com/2016/06/15/pitfall-of-golang-scheduler )

---

"When a goroutine executes a blocking system call, no other goroutine is blocked."

and from elsewhere:

"If a goroutine is blocking, the runtime will start a new OS thread to handle the other goroutines until the blocking one stops blocking." -- [23]

---

so for Oot, some things that the reference implementation will have that not every implementation may support:

non-blocking I/O
migrating greenthreads, like Goroutines
if something blocking is called, then either a new thread is spawned for it, or other greenthreads in the same OS thread are migrated to other OS threads, so that in any case other greenthreads aren't blocked
preemptively scheduled greenthreads (so even in a tight loop/busywait, other threads get some CPU time)
stacks can start small and grow rather than be preallocated

---

in golang, "The starting stack size has changed over time; it started at 4 KiB? (one page), then in 1.2 was increased to 8 KiB? (2 pages), then in 1.4 was decreased to 2 KiB? (half a page). These changes were due to segmented stacks causing performance problems when rapidly switching back and forth between segments ("hot stack split"), so increased to mitigate (1.2), then decreased when segmented stacks were replaced with contiguous stacks (1.4):" [24]

---

"Where fork() only gives the child process a copy of the state of the parent process, clone() can be used to create a new process that shares or copies resources with an existing process. You can share or copy the memory map, file systems, file descriptors, and signal handlers of the existing process. The fork() system call is essentially a special case of clone() where none of the resources are shared." -- http://www.linux-mag.com/id/792/

---

ruby threads example from http://schmurfy.github.io/2011/09/25/on_fibers_and_threads.html :

require 'thread'

MUTEX = Mutex.new

def msg(str) MUTEX.synchronize { puts str } end

th1 = Thread.new do 100.times {

end

n	msg "[Thread 1] Tick #{n}" }

th2 = Thread.new do 100.times {

end

n	msg "[Thread 2] Tick #{n}" }

th1.join th2.join

---

http://schmurfy.github.io/2011/09/25/on_fibers_and_threads.html shows how using cooperative multitasking (fibers (coroutines)), you can transform callback style code into synchronous style code (just set the callback be to fiber.resume() and then do yield())

---

" Quasar abstracts both thread implementations — Java’s Thread and Quasar’s Fiber — into Strand .. Quasar also includes a fiber-compatible, strand-based port of java.util.concurrent as well as Go-like channels, Erlang-like actors and Dataflow. "

---

" So aren’t fibers generators or async/awaits?

No, as we have seen fibers are real threads: namely a continuation plus a scheduler. Generators and async/awaits are implemented with continuations (often a more limited form of continuation called stackless, which can only capture a single stack frame), but those continuations don’t have a scheduler, and are therefore not threads. "

---

sounds like Quasar, like Go, doesn't pre-empt greenthreads during tight loops:

" Because those instructions add some overhead (normally under 3%), Quasar does not instrument all methods. Instrumentation is only required to facilitate the suspension and resumption of fibers, so it is only required for methods that might potentially run in a fiber and block (methods that never block, i.e. they only perform a computation, need never be instrumented, even if they’re called within a fiber). Currently, the programmer must mark those blocking methods manually (either by declaring them to throw SuspendExecution? or by annotating them with @Suspendable). Obviously, any function calling a blocking function is itself blocking. "

note however that the overhead is only 3% (well, i guess that means 3% if function calls are instrumented, which means more if we do it every few instructions, but that's fine). So we should do it.

-- http://blog.paralleluniverse.co/2014/02/06/fibers-threads-strands/

---

" Two well-known JVM languages have also recently introduced constructs akin to lightweight threads: Scala with Async, and Clojure with the excellent core.async. Those employ a similar instrumentation scheme, but perform it at the language level, using macros, rather than at the JVM bytecode level like Quasar. While this allows Clojure to support core.async on non-JVM platforms (like JavaScript?), the approach has two limitations: 1) it is only supported by a single language and can’t interoperate with other JVM languages, and 2) because it uses macros, the suspendable constructs are limited to the scope of a single code block, i.e. a function running in a suspendable block cannot call another blocking function; all blocking must be performed at the topmost function. It’s because of the second limitation that these constructs aren’t true lightweight threads, as threads must be able to block at a any call-stack depth (Pulsar, Quasar’s Clojure API, contains an implementation of core.async that makes use of Quasar fibers). "

-- http://blog.paralleluniverse.co/2014/02/06/fibers-threads-strands/

---

" Strands

While Quasar’s Fiber class has an API very similar to that of Thread, we would still like to abstract the two into a single interface. The Strand class is that abstraction. A strand in quasar is either a (plain Java, OS) thread or a fiber.

The Fiber class directly extends Strand, but Threads have to be wrapped (automatically, as we’ll see).

The static method Strand.currentStrand(), returns the current strand (the one that called the method). It will return a strand representing a fiber or a thread depending on whether it was called from a fiber or a thread.

The Strand class abstract pretty much all of the operations you can do with threads or fibers. Strand.sleep(int millis) suspends the current strand (again, thread or fiber) for the given duration. Strand.getState() and Strand.getStackTrace() returns a strand’s execution state and stack trace, again, regardless of whether the strand is a fiber or a thread.

But perhaps most importantly, Strand exposes park and unpark methods, that generalize these operations for both threads and fibers. "

---

i still feel like there could be some graph abstraction for threads of control. Perhaps this has already been done in process calculi or temporal logic?

what about representing eg multiple threads in a process (can simultaneously execute) vs multiple fibers in a thread (can be interleaved but not simultaneously execute)?

---

" In fact, the synchronization classes in java.util.concurrent (locks, semaphores, countdown-latches and more) do not rely on OS-provided constructs, but are implemented directly in Java using CPU concurrency primitives like CAS combined with the ability to park and unpark fibers. Therefore, all we had to do to adapt these classes to work for fibers as well as threads was to replace all references to Thread with Strand, and every call to LockSupport?.park/unpark with calls to Strand.park/unpark. Quasar contains a few such ports from java.util.concurrent (the package’s source code is in the public domain) that work on strands and maintain the exact same API, so no code changes are required by clients of these classes. " -- http://blog.paralleluniverse.co/2014/02/06/fibers-threads-strands/

---

" This is why fibers and threads can run side by side, and you can easily choose how you want your code to run. Code that runs in short bursts and blocks often (e.g. code that receives and responds to messages) is better suited to run in a fiber, while code performs a lot of computations and blocks infrequently better belongs in a plain thread. The two can still exchange data and synchronize with one another thanks to the strand abstraction. " -- http://blog.paralleluniverse.co/2014/02/06/fibers-threads-strands/

---

"complicated control flow patterns are rare in practice. We examined the code structure of the Flash web server and of several applications in Ninja, SEDA, and TinyOS? [8,12,16,17]. In all cases, the control flow patterns used by these applications fell into three simple categories: call/return, parallel calls, and pipelines. All of these patterns can be expressed more naturally with threads. ... The only patterns we considered that are less graceful with threads are dynamic fan-in and fan-out; such patterns might occur with multicast or publish/subscribe applications. In these cases, events are probably more natural. However, none of the high-concurrency servers that we studied used these patterns. "

-- https://www.usenix.org/legacy/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html

---

"Adding .parallelStream() to your methods can cause bottlenecks and slowdowns (some 15% slower on this benchmark we ran), " -- [25]

---

" I've implemented coroutines in assembler, and measured performance.

Switching between coroutines, a.k.a. Erlang processes, takes about 16 instructions and 20 nanoseconds on a modern processor. Also, you often know the process you are switching to (example: a process receiving a message in its queue can be implemented as straight hand-off from the calling process to the receiving process) so the scheduler doesn't come into play, making it an O(1) operation.

To switch OS threads, it takes about 500-1000 nanoseconds, because you're calling down to the kernel. The OS thread scheduler might run in O(log(n)) or O(log(log(n))) time, which will start to be noticeable if you have tens of thousands, or even millions of threads.

Therefore, Erlang processes are faster and scale better because both the fundamental operation of switching is faster, and the scheduler runs less often. shareeditflag " -- http://stackoverflow.com/questions/2708033/technically-why-are-processes-in-erlang-more-efficient-than-os-threads

---

random distributed systems class. I liked their lecture slides on consensus numbers, so i'm putting it here in case i want to read any more of it later: http://www.cs.yale.edu/homes/aspnes/pinewiki/TitleIndex.html

---

http://www.1024cores.net/home/technologies

" The first thing to say is that there are two main categories of libraries and technologies. The first category is targeted at parallel computations (HPC), that is, it will help you to parallelize computations (like matrix multiplication, sorting, etc). OpenMP?, Cilk++, TBB, QuickThread?, Ct, PPL, .NET TPL, Java Fork/Join fall into this category.

The second category is targeted at general concurrency, that is, it will help to implement things like multi-threaded servers, rich client application, games, etc. TBB, QuickThread?, just::thread, PPL, AAL, AppCore? fall into this category. .NET and Java standard libraries also include some things that may be of help (like concurrent containers, synchronization primitives, thread pools, etc).

As you may notice, some libraries are mentioned in both categories. For example, Intel TBB contains some primitives that help with parallel computations (parallel algorithms and task scheduler), and some things that help with general concurrency (threads, synchronization primitives, concurrent containers, atomics).

Efficient and scalable parallel computations is a rather difficult field, it's easy to create a parallel program that executes slower than original single-threaded version (and the more cores you add the slower it executes). So, if you are not doing something extraordinary, you may consider using internally parallelized libraries like Intel MKL, Intel IPP, AMD ACML. They are not only internally parallelized, they are also highly optimized and will be updates for future architectures. So you only need to call a function for matrix multiplication, FFT, video stream decoding, image processing or whatever you need.

Subpages (6): Concurrency Kit (C) FastFlow? (C++) Intel TBB Just::Thread (C++) OpenMP? QuickThread? (C++/Fortran) "

---

http://www.1024cores.net/home/technologies/just-thread

std::thread class for launching and managing threads.
std::async function for starting asynchronous tasks.
Mutex classes (std::mutex, std::timed_mutex, etc.) for protecting shared data.
Condition variables (std::condition_variable and std::condition_variable_any) for synchronizing operations.
Atomic types (std::atomic_int, std::atomic_long, etc.) for low level atomic access.
Futures and promises (std::future, std::promise, etc.) for communicating data between threads.
Compatible with Microsoft Visual Studio 2005, 2008, and 2010 for both 32-bit and 64-bit Windows targets.
Compatible with g++ 4.3, 4.4 and 4.5 for 32-bit and 64-bit Ubuntu linux (x86/x86_64) targets, making full use of the C++0x support from g++ including rvalue references and variadic templates.
Full documentation available online.
Special debug mode for identifying the call chain leading to a deadlock.

---

(replying to "Generally I see the idea that "technology X makes the problems of concurrency go away" is a bad smell.")

 nickpsecurity 199 days ago [-]

The only ones I've seen that delivered [with caveats] were Concurrent Pascal, Ada's Ravenscar, Eiffel's SCOOP, and recently Rust. SCOOP got the most CompSci? work with a Java port, one proving no deadlocks, one proving no livelocks, a formal verification in Maude that caught problems, and one modification to almost entirely eliminate performance penalty.

So, it can happen but does rarely enough to arouse strong skepticism. I usually don't buy it.

---

_halgari 199 days ago

parent [-]

on: How core.async and CSP help you prove your async c...

Big hairy core.async go blocks like shown in these articles makes me a tad uneasy (see the sync source in the second part to this article). Because they seem to be programmed in a mostly imperative style. Almost every time I see a go block loop, I see code that could be better written with core.async/pipeline. `If` statements can be come filters, transformations can become map. And with the addition of transducers all this becomes much cleaner.

So in that sense, I think many FRP-esque libraries have something going for them, they force their users into a model that is more declarative than imperative. That sort of programming can and should be done with core.async. Go blocks are a primitive, build abstractions on top of that and keep your app code declarative.

hoodunit 199 days ago [-]

The go blocks in here are really ugly, but to be honest I just found it difficult to refactor in this particular case. The issues were a) I wanted all the events in a single event loop to make timing issues clear and to fit the model semantics better b) core.async go blocks, because they are macros, can't be refactored like functions (>! and such need to be in a go block) and c) the sync logic otherwise was easier to reason about with all of the logic in one place. One better way to factor this would be to have a declarative format for what each action performs, but it ended up being more difficult to follow the logic when I refactored it this way. Types would help, and I think in general there is unexploited potential for types and CSP and FRP approaches. But if you have suggestions for how to make this better I'm all ears.

---

" However, the biggest reason for CORBA’s failure is that it tried to make remote calls look the same as local calls. Cap’n Proto does NOT do this – remote calls have a different kind of API involving promises, and accounts for the presence of a network introducing latency and unreliability.

As shown above, promise pipelining is absolutely critical to making object-oriented interfaces work in the presence of latency. If remote calls look the same as local calls, there is no opportunity to introduce promise pipelining, and latency is inevitable. Any distributed object protocol which does not support promise pipelining cannot – and should not – succeed. Thus the failure of CORBA (and DCOM, etc.) was inevitable, but Cap’n Proto is different. "

---

https://en.m.wikipedia.org/wiki/Monitor_(synchronization) has an example implementation of semaphores and condition variables from test-and-set

---

" Removing the GIL in CPython has two problems:

    how do we guard access to mutable data structures with locks and
    what to do with reference counting that needs to be guarded.

PyPy? only has the former problem; the latter doesn't exist, due to a different garbage collector approach. "

---

"...a massive use case ((that Python currently doesnt handle well)): shared memory...shared memory is available in ((Python library named)) multi-processing, but it doesn't necessarily interact well with existing codes. ((my multi-language program)) wants to manage shared memory so that multiple cores don't necessarily need multiple copies of the data, when the executing tasks don't conflict (all are read-only, or access disjoint data)."

" ^This. It is a very common usecase for applications I work with to create a very large in memory read-only pd dataframe and then put a flask interface to operations on that dataframe using gunicorn and expose as an API. If I use async workers, the dataframe operations are bound by GIL restraints. If I use sync workers, each process needs a copy of the pd dataframe which the server cannot handle (I have never seen pre-fork shared memory work for this problem). I don't want to introduce another technology to solve this problem. "

---