Bayle Shanks's website: proj-oot-ootConcurrencyNotes6

[1]:

" IPC can be synchronous or asynchronous. Asynchronous IPC is analogous to network communication: the sender dispatches a message and continues executing. The receiver checks (polls) for the availability of the message by attempting a receive, or is alerted to it via some notification mechanism. Asynchronous IPC requires that the kernel maintains buffers and queues for messages, and deals with buffer overflows; it also requires double copying of messages (sender to kernel and kernel to receiver). In synchronous IPC, the first party (sender or receiver) blocks until the other party is ready to perform the IPC. It does not require buffering or multiple copies, but the implicit rendezvous can make programming tricky. Most programmers prefer asynchronous send and synchronous receive.

First-generation microkernels typically supported synchronous as well as asynchronous IPC, and suffered from poor IPC performance. Jochen Liedtke assumed the design and implementation of the IPC mechanisms to be the underlying reason for this poor performance. In his L4 microkernel he pioneered methods that lowered IPC costs by an order of magnitude.[13] These include an IPC system call that supports a send as well as a receive operation, making all IPC synchronous, and passing as much data as possible in registers. Furthermore, Liedtke introduced the concept of the direct process switch, where during an IPC execution an (incomplete) context switch is performed from the sender directly to the receiver. If, as in L4, part or all of the message is passed in registers, this transfers the in-register part of the message without any copying at all. Furthermore, the overhead of invoking the scheduler is avoided; this is especially beneficial in the common case where IPC is used in an RPC-type fashion by a client invoking a server. Another optimization, called lazy scheduling, avoids traversing scheduling queues during IPC by leaving threads that block during IPC in the ready queue. Once the scheduler is invoked, it moves such threads to the appropriate waiting queue. As in many cases a thread gets unblocked before the next scheduler invocation, this approach saves significant work. Similar approaches have since been adopted by QNX and MINIX 3.[citation needed]

In a series of experiments, Chen and Bershad compared memory cycles per instruction (MCPI) of monolithic Ultrix with those of microkernel Mach combined with a 4.3BSD Unix server running in user space. Their results explained Mach's poorer performance by higher MCPI and demonstrated that IPC alone is not responsible for much of the system overhead, suggesting that optimizations focused exclusively on IPC will have limited impact.[14] Liedtke later refined Chen and Bershad's results by making an observation that the bulk of the difference between Ultrix and Mach MCPI was caused by capacity cache-misses and concluding that drastically reducing the cache working set of a microkernel will solve the problem.[15]

In a client-server system, most communication is essentially synchronous, even if using asynchronous primitives, as the typical operation is a client invoking a server and then waiting for a reply. As it also lends itself to more efficient implementation, most microkernels generally followed L4's lead and only provided a synchronous IPC primitive. Asynchronous IPC could be implemented on top by using helper threads. However, experience has shown that the utility of synchronous IPC is dubious: synchronous IPC forces a multi-threaded design onto otherwise simple systems, with the resulting synchronization complexities. Moreover, an RPC-like server invocation sequentializes client and server, which should be avoided if they are running on separate cores. Versions of L4 deployed in commercial products have therefore found it necessary to add an asynchronous notification mechanism to better support asynchronous communication. This signal-like mechanism does not carry data and therefore does not require buffering by the kernel. By having two forms of IPC, they have nonetheless violated the principle of minimality. Other versions of L4 have switched to asynchronous IPC completely.[16]

As synchronous IPC blocks the first party until the other is ready, unrestricted use could easily lead to deadlocks. Furthermore, a client could easily mount a denial-of-service attack on a server by sending a request and never attempting to receive the reply. Therefore, synchronous IPC must provide a means to prevent indefinite blocking. Many microkernels provide timeouts on IPC calls, which limit the blocking time. In practice, choosing sensible timeout values is difficult, and systems almost inevitably use infinite timeouts for clients and zero timeouts for servers. As a consequence, the trend is towards not providing arbitrary timeouts, but only a flag which indicates that the IPC should fail immediately if the partner is not ready. This approach effectively provides a choice of the two timeout values of zero and infinity. Recent versions of L4 and MINIX have gone down this path (older versions of L4 used timeouts, as does QNX). "

---

if ur using synchronous message passing (by which i mean that the sender blocks until receiver receives the message) on the same system, then you can possibly do zero-copy; the receiver can read and manipulate the sent data in-place while the sender is blocked. If giving the receiver access to that memory segment would be too kludgy, then the IPC system can provide function calls to read subsections of the message; this allows the receiver to process a large message incrementally, rather than allocating a buffer big enough for the whole message and then copying it in before it starts processing it.

---

synchronous IPC copying can be further reduced by allowing a message sender to send a list of message segments to be concatenated (eg a header and a payload); in the common case in which the payload is a data structure already in use in the sender before message construction, this prevents the sender from having to allocate a new buffer for the entire message and copying the payload into it before sending.

---

" Designing generalized lock-free algorithms is hard

Design lock-free data structures instead
- Buffer, list, stack, queue, map, deque, snapshot
Often implemented in terms of simpler primitives
- e.g. ‘Multi-word Compare and Set’ (MCAS, CAS2, CASN)
- Cannot implement lock-free algorithms in terms of lock-based data structures " [2]

---

" We could still scale to a very impressive amount of threads before we had stack chunks. The core reasons we can actually scale with so many threads are different: safe-points and the scheduler tie into allocations, allocations are fast and common, the scheduler is fast, and the overhead of an individual green thread is incredibly small (less than 20 machine words.) A real OS thread is vastly more expensive to manage, stack chunks or not. " -- [3]

---

" Day 18: concurrency I can finally understand and write correctly

(problem statement / my solution)

This day was an exciting one for me. Discussing this problem with others, it seems many people had much difficulty solving this problem in other programming languages. Most people seemed to have to emulate their own concurrency with no help from their programming language of choice. In contrast, D absolutely shined here, because it is based on the actor concurrency model (message passing), which precisely fits the problem statement. (There are other concurrency primitives in case the actor concurrency model isn’t sufficient, but it was sufficient for me.)

The basic idea of concurrency in D is that each thread of execution localises all of its state. By default, threads share no data. In order to communicate with each other, threads pass messages. A thread can indicate at any time when it’s ready to send or receive messages. Messages can be any type, and each thread says what type it’s expecting to receive. If a thread receives a type it’s not prepared to handle, it will throw an exception.

There are more details, such as what happens if a thread receives too many messages but doesn’t respond to any of them. Let’s not go into that now. Basic idea: threads get spawned, threads send and receive messages.

Let’s spend a little bit of time looking at the relevant functions and types I used, all defined in std.concurrency

spawn Starts a thread of execution. The first argument is a reference to the function that this thread will execute, along with any arguments that function may take. Returns a Tid type, a thread ID. This is used as the address to send messages to. A thread can refer to its parent thread via the special variable ownerTid.

    Unless explicitly declared shared, the arguments of a threaded function must be immutable. This is how the compiler guarantees no race conditions when manipulating those variables. Of course, with shared variables, the programmer is signalling that they are taking over synchronisation of that data, which may require using low-level mutexes. send Send a message to a particular thread. The first argument is the thread id. The other arguments can be anything. It’s up to the receiving thread to handle the arguments it receives. receiveOnly Indicate that this thread is ready to receive a single type, and returns the value of that type. The type must of course be specified as a compile-time argument. receive Indicates what to do with any of several possible types. The arguments to this function are a collection of functions, whose parameters types will be dynamically type-matched with received types. I didn’t need this function, but I wanted to mention it exists. receiveTimeout The problem statement is designed to deadlock. Although there probably is a more elegant solution, timing out on deadlock was the solution I wrote. This function does just that: listens for a message for a set of amount of time. If a message is received in the designated time, its handler function is executed and receiveTimeout returns true. If the timeout happens, it returns false instead.

Armed with these tools, the solution was a breeze to write. I first spawn two threads and save their thread ids, auto tid1 = spawn(&runProgram, opcodes, 0); auto tid2 = spawn(&runProgram, opcodes, 1);

Each of the two threads defined by runProgram then immediately stops, waiting for a thread id, to know whom to talk to, void runProgram(immutable string[] opcodes, ulong pid) { auto otherProg = receiveOnly!Tid(); ... }

The parent thread then connects the two worker threads to each other, send(tid1, tid2); send(tid2, tid1);

And off they go, the two threads run through the opcodes in the problem statement, and eventually they deadlock, which I decided to handle with a timeout like so, case "rcv": if (!receiveTimeout(100.msecs, (long val) {regs[reg] = val;})) { goto done; } no timeout, handle next opcode goto default;

After the thread has timed out, it signals to the parent thread that it’s done, done: send(ownerTid, thisTid, sent);

The parent in turn receives two tuples with thread id and computed value from each thread, and based on that decides what to output, after figuring out which thread is which, Wait for both children to let us know they're done. auto done1 = receiveOnly!(Tid, long); auto done2 = receiveOnly!(Tid, long); if (done1[0] == tid2) { writeln(done1[1]); } else { writeln(done2[1]); }

And voilà, concurrency easy-peasy. "

[4]

---

moomin 9 hours ago [-]

People always used to ask me for advice on multithreading, because I had a reputation for multithreaded code that worked. I recommended they use Retlang (there's a Java version called Jetlang), which works on a shared-nothing message passing basis (and is blindingly fast). They tended to say they didn't want a library and just wanted to use locks and threads. I said that not doing that was how come I had a reputation for correctness.

jcrites 6 hours ago [-]

I agree that you shouldn't be touching locks and threads, typically.

However, I think you can get pretty far these days with just the built-in Java class library, using features like Futures (CompletableFuture?), ExecutorService?, ForkJoinPool?, and BlockingQueue? and such. Before CompletableFuture? there was Guava's ListenableFuture?. Synchronized methods are also useful in certain situations. I've built several fairly sophisticated concurrent applications using these primitives and felt they worked out reasonably well. I find that BlockingQueues? with thread-pool-backed ExecutorServices? provide a model that's fairly robust while being simple to reason about and monitor.

---

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent Feng Niu, Benjamin Recht, Christopher Re, Stephen J. Wright (Submitted on 28 Jun 2011 (v1), last revised 11 Nov 2011 (this version, v2))

    Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.

---

alfanerd 9 hours ago [-]

In my 30+ years experience of concurrency, the shared-nothing message model makes it relative easy to reason about concurrency (since every actor is single threaded).

I have worked with Java, C# and Python, and for each I have developed a concurrent object framework, making it quite easy to work with multible channels of incoming events, for example for process control systems with barcode-scanners, scales, I/O-busses etc. connected to the system

I think ponylang does a pretty good implementation of the Actor model.

Feel free to have a look at the Python impl. of the framework (www.github.com/pylots/pyworks)

pjmlp 5 hours ago [-]

In a way it is ironic that we went full speed into threads, shared memory and plugins, only to go back into processes, messages and sandboxing as a means to tame multi-core programming and safety.

zokier 4 hours ago [-]

I thought threads and shared memory were introduced more as performance hacks than superior architecture. It is not surprising that as the performance profile has changed (due both increased core counts and improved software primitives), we are getting back those things that were deemed too expensive previously

sametmax 1 hour ago [-]

But don't we have thread pools and futures for that ? You can't get a simpler API.

KingMob? 11 hours ago [-]

I'm surprised nobody's mentioned Clojure in the comments. Its concurrency abilities are top-notch, straightforward to use, and immutable data makes it dead-simple to avoid issues.

---

https://hackernoon.com/synchronization-primitives-in-python-564f89fee732

---

https://github.com/WebAssembly/threads/blob/master/proposals/threads/Overview.md

---

" C/POSIX type threads have no language support for indicating what data is shared and which locks protect which data. That's a common cause of trouble. The big question in shared memory concurrency is "who locks what". Most of the bugs in concurrent programs come from ambiguities over that question.

Early attempts to deal with this at the language level included Modula's "monitors", the "rendezvous" in Ada, and Java "synchronized" classes. These all bound the data and its lock together. Rust's locking system does this, and is probably the most successful one so far. (Yes, the functional crowd has their own approaches.)

...

The separation of data into immutable, unshared, or synchronized is mainstream now...Immutable objects could be shared. Mutable objects had to either be unshared, or a subclass of an object that enforced synchronization. "

[5]

---

" The (simplified, and as I understand it) gist of concurrency in CSP is that the program is expressed as a series of parallel (PAR) and sequential (SEQ) operations.

Everything in a PAR block will run in parallel and all their outputs will be collected and fed as the input to the next SEQ. Everything in a SEQ will run sequentially as a pipeline until the next PAR. Every PAR must follow a SEQ and vice versa, as two PARS or SEQS next to each other will simply coalesce.

eg.

    PAR
      longCall1
      longCall2
      longCall3
    SEQ
      reduceAllThreeResults
      doSomethingWithTheReducedResult
    PAR
      nextParallelOp1
      nextParallelOp2

etc.

...

'select' is ALT

...

(The "basic" processes are just the send/receive of a value over a channel. By the way, Go channels are a direct lift from CSP!)

hevrard 10 hours ago [-]

Speaking of CSP, more broadly process algebras (a.k.a. process calculus) generally have such a "parallel composition" operator.

Also, CSP genuine inter-process communication primitive is a _multi-way_ rendezvous which can synchronise an arbitrary number of process, possibly more than two. This is called "interactions" in chapter 2 of Tony Hoare CSP book [1], and channels are built on top of these interactions in chapter 4.

This multi-way synchronization is also present e.g. in the LOTOS specification language, which is an ISO standard. The CADP [2] verification toolbox offers various tools like model-checker to verify LOTOS programs, and also to generate executables. For those who know how special/weird the LOTOS syntax is, the CADP folks also develops the LNT language which looks much more like Ada/Pascal.

[1] http://usingcsp.com/cspbook.pdf [2] http://cadp.inria.fr/

Go doesn't implement the process operators parallel, seq and interleaved.

[6]

---

nursery concurrency

[7]

e notes that GOTO was considered harmful because it prevents you from considering functions as a black box, because a GOTO within a function could escape the caller.

e then suggests that concurrent 'spawn' as a primitive has a similar problem. The problem is that if you allow a 'spawn' or 'fork' (or 'go') primitive, then a subroutine could spawn something unbeknowst to the caller, and the caller has no guarantee that these unknown spawned processes are not still running upon function return.

this means that you can't treat calling a function as a black box from a control flow point of view; you can't assume that the only thing the function did was computing the return value plus whatever side-effects it had, because in addition it may have left something still running when it returns to you

this means that you can't do stuff like (a) have an airtight RAII/python-WITH thingee for resource cleanup, because if you eg open a file and then close it when exiting the WITH block, well the spawned function may still hold a reference to that file, and (b) have airtight exceptions, because if the spawned function throws an exception, well by now in the main thread the block of code that spawned it may be long gone, so who should handle the exception?

when he recommends instead is, just as GOTO was replace by if+while+function_calling, all of which have the property that control is ultimately returned to the statement following the construct, we replace SPAWN with the following proposal called 'nursery concurrency' (this example is in a Python library called 'trio' that implements eis proposal):

" async with trio.open_nursery() as nursery: nursery.start_soon(myfunc) nursery.start_soon(anotherfunc)

...

[4] If you call start_soon after the nursery block has exited, then start_soon raises an error, and conversely, if it doesn't raise an error, then the nursery block is guaranteed to remain open until the task finishes. If you're implementing your own nursery system then you'll want to handle synchronization carefully here.

There's no rule that only the code directly inside the async with open_nursery() block can call nursery.start_soon – so long as the nursery block remains open [4], then anyone who acquires a reference to the nursery object gets the capability of spawning tasks into that nursery. You can pass it in as a function argument, send it through a queue, whatever.

...

the nursery block doesn't exit until all the tasks inside it have exited – if the parent task reaches the end of the block before all the children are finished, then it pauses there and waits for them. ... Since nursery objects have to be passed around explicitly, you can immediately identify which functions violate normal flow control by looking at their call sites, so local reasoning is still possible.

Any tasks the function spawns are still bound by the lifetime of the nursery that was passed in.

...

You can define new types that quack like a nursery.

The standard nursery semantics provide a solid foundation, but sometimes you want something different. Perhaps you're envious of Erlang and its supervisors, and want to define a nursery-like class that handles exceptions by restarting the child task. That's totally possible, and to your users, it'll look just like a regular nursery:

async with my_supervisor_library.open_supervisor() as nursery_alike: nursery_alike.start_soon(...)

If you have a function that takes a nursery as an argument, then you can pass it one of these instead to control the error-handling policy for the tasks it spawns. Pretty nifty. But there is one subtlety here that pushes Trio towards different conventions than asyncio or some other libraries: it means that start_soon has to take a function, not a coroutine object or a Future. (You can call a function multiple times, but there's no way to restart a coroutine object or a Future

...

In Trio, it's possible for code to receive a cancellation request at any time. After a cancellation is requested, then the next time the code executes a "checkpoint" operation (details), a Cancelled exception is raised. This means that there's a gap between when a cancellation is requested and when it actually happens – it might be a while before the task executes a checkpoint, and then after that the exception has to unwind the stack, run cleanup handlers, etc. When this happens, the nursery always waits for the full cleanup to happen. We never terminate a task without giving it a chance to run cleanup handlers, and we never leave a task to run unsupervised outside of the nursery, even if it's in the process of being cancelled.

...

Automatic resource cleanup works.

Because nurseries follow the black box rule, they make with blocks work again. There's no chance that, say, closing a file at the end of a with block will accidentally break a background task that's still using that file. Automated error propagation works.

As noted above, in most concurrency systems, unhandled errors in background tasks are simply discarded. There's literally nothing else to do with them.

In Trio, since every task lives inside a nursery, and every nursery is part of a parent task, and parent tasks are required to wait for the tasks inside the nursery... we do have something we can do with unhandled errors. If a background task terminates with an exception, we can rethrow it in the parent task.

...

when an unhandled exception occurs in a child, Trio immediately cancels all the other tasks in the same nursery, and then waits for them to finish before re-raising the exception. The intuition here is that exceptions cause the stack to unwind, and if we want to unwind past a branch point in our stack tree, we need to unwind the other branches, by cancelling them.

    Trio's cancellation system is easier to use and more reliable than competitors, because it can assume that tasks are nested in a regular tree structure; see Timeouts and cancellation for humans for a full discussion.
    Trio is the only Python concurrency library where control-C works the way Python developers expect (details). This would be impossible without nurseries providing a reliable mechanism for propagating exceptions.

---

(responses to the nursery concept discussed in the previous section)

[–]Eirenarch 6 points 10 hours ago

So as a C# dev I read this as "make the C# compiler raise errors on any Task object that is not awaited". Am I right?

    permalinkembedsavereportgive goldreply

But what if instead we:

    Force tasks to produce a return value (even if only a dummy unit type).
    Do not fork execution when a task is created, but only when a consumer synchronizes with it to consume its result value.

Now all paths go "forward" to a synchronization point where their results are consumed.

    permalinkembedsavereportgive goldreply

my note: hmm i guess maybe we should present that as an alternative to nurserys? "You can either have nursery concurrency or a variant of async/await in which the task isn't started until the await is reached"

---

[8]

Coroutine

Stackfull semi and full coroutines allow retaining data and execution state between calls.

include <fstream>
include <coroutine>

coroutine RunTotal? { input numbers and return running total int input, total; communication }; void ?{}( RunTotal? & rntl ) { rntl.total = 0; } void update( RunTotal? & rntl, int input ) with( rntl ) { helper total += input; remember between activations suspend(); inactivate on stack } void main( RunTotal? & rntl ) with( rntl ) { for ( ;; ) { update( rntl, input ); } } int add( RunTotal? & rntl, int input ) { rntl.input = input; pass input to coroutine resume( rntl ); return rntl.total; return total from coroutine } int main() { RunTotal? rntl; for ( int i = 0; i < 10; i += 1 ) { sout

add( rntl, i )

endl;

}}

Monitor

A monitor type defines data protected with mutual exclusion, and the mutex qualifier acquires mutual exclusion. Bulk acquisition of multiple monitors is supported.

include <thread> monitor Bank { shared resource int money; }; acquire mutual exclusion of bank on entry void deposit( Bank & mutex bank, int deposit ) { bank.money += deposit; } acquire mutual exlcusion of both banks on entry void transfer( Bank & mutex mybank, Bank & mutex yourbank, int me2you ) { deposit( mybank, -me2you ); deposit( yourbank, me2you ); }

Thread

A thread type, T, instance creates a user-level thread, which starts running in routine void main( T & ), and joins on deallocation.

include <fstream>
include <thread> thread T { int id; }; void ?{}( T & t ) { t.id = 0; } void ?{}( T & t, int id ) { t.id = id; } void main( T & t ) with( t ) { thread starts here sout

} int main() { enum { NumThreads? = 5 }; T t[ NumThreads? ]; create/start threads T * tp[ NumThreads? ]; for ( int i = 0; i < NumThreads?; i += 1 ) { tp[i] = new( i + 1 ); create/start threads } for ( int i = 0; i < NumThreads?; i += 1 ) { delete( tp[i] ); wait for thread to terminate } } wait for threads to terminate

endl;

---

https://github.com/fluture-js/Fluture

---

chubot 21 days ago

parent [-]

on: The Problem with Threads (2006) [pdf]

This talk was about Foundation DB was brought up recently, and it's pretty amazing. I recommend watching the whole thing, but to be brief they are taming the "infinite interleavings" problem through determinism.

"Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson

https://www.youtube.com/watch?v=4fFDFbi3toc

They wrote an interesting Actor DSL that compiles to C++ and is completely deterministic, and they torture this deterministic engine with generated test cases on a cluster every night.

I guess you could say that the whole cluster is necessarily non-deterministic, but an individual node is deterministic, given an ordering of the messages it receives.

---

" That said, here is a good paper from 2004 from the authors of Lua: Revisiting Coroutines. Among other things, it points out that "coroutine" doesn't mean one thing. "

https://scholar.google.com/scholar?cluster=14280883920152261782&hl=en&as_sdt=0,5&sciodt=0,5

---

[9]

" Thou shalt not block

I/O can be performed in a blocking fashion or in an asynchronous fashion. And there’s more than one way to do asynchronous I/O: there’s the callback model (“run this function when data is ready”), the futures model (“a future represents an I/O action that will be completed some time in the future; poll to see if the I/O has completed or not”) and the generators model (“yield control, instead of blocking, when no more progress can be made”).

We can implement a blocking API on top of the svd2rust API; we can implement a futures based version of the API on top of the svd2rust API; and we can implement a generator based version of the API on top of the svd2rust API. The implementations of those three flavors of the same API will look very similar. The question is: can we implement some intermediate API to avoid writing register manipulation code in those three implementations? The answer is the nb crate.

I pulled a solution out of the Tokio book, or maybe it’s actually from the *nix book. The solution I chose was to implement the intermediate API as a non blocking API that returns a WouldBlock? error variant to signal that the operation can not be completed right now. This non fatal error variant has a different meaning depending on the (a)synchronous model is used in:

    In a blocking API, WouldBlock means try again i.e. busy wait (or maybe do something more elaborated if you have a threading system).
    In a futures based API, WouldBlock means Async::NotReady
    In a generator based API, WouldBlock means yield

Let’s look at an example. Below is shown the nb API for reading a single byte from a serial interface. Serial::read will return Ok(byte) if new data is available to be read; it will return Err(nb::Error::WouldBlock?) if no new data is available at the moment; and it will return e.g. Err(nb::Error::Other(Error::Overrun)) if some fatal error occurred.

...

block! is used to implement blocking APIs by busy waiting.

fn blocking_read(serial: &mut Serial) -> Result<u8, Error> { block!(serial.read()) }

try_nb! is used to implement futures based APIs.

fn futures_read(serial: Serial) -> impl Future<(Serial, u8), Error> { let mut serial = Some(serial); futures::poll_fn(move

{

        let byte = try_nb!(serial.as_mut().unwrap().read());
        Ok(Async::Ready((serial.take().unwrap(), byte)))
    })}

await! is used to implement generator based APIs.

fn generator_read( mut serial: Serial, ) -> impl Generator<Return = Result<(Serial, u8), Error>, Yield = ()> { let byte = await!(serial.read())?; Ok((serial, byte)) }

None of these higher level APIs required writing any register manipulation code and the functions could even have been made generic if Serial::read was a trait method instead of an inherent method. Which brings me to the next topic."

---

" In Ada, mutual exclusion is provided by the protected objects.

At run-time, the protected objects provide the following properties:

    There can be only one task executing a protected operation at a given time (mutual exclusion)
    There can be no deadlock

In the Ravenscar profile, this is achieved with Priority Ceiling Protocol.

A priority is assigned to each protected object, any tasks calling a protected sub-program must have a priority below or equal to the priority of the protected object.

When a task calls a protected sub-program, its priority will be temporarily raised to the priority of the protected object. As a result, this task cannot be preempted by any of the other tasks that potentially use this protected object, and therefore the mutual exclusion is ensured.

The Priority Ceiling Protocol also provides a solution to the classic scheduling problem of priority inversion. " [10]

---

[11]

" Go hits the concurrency nail right on the head ...

Unfortunately, most languages didn't yet move past the threads vs. asynchronous dichotomy. You either use threads, or a single-threaded event loop decorated with a bunch of bells and whistles to make code more palatable. Mixing threads with event loops is possible, but so complicated that few programmers can afford the mental burden for their applications.

...

Threads aren't a bad thing in languages that have good library support for them, and their scalability is much better than it used to be a decade ago, but for very high levels of concurrency (~100,000 threads and above) they are still inadequate. On the other hand, event-driven programming models are usually single-threaded and don't make good use of the underlying HW. More offensively, they significantly complicate the programming model. I've enjoyed Bob Nystrom's What Color is Your Function for explaining how annoying the model of "non-blocking only here, please" is. The core idea is that in the asynchronous model we have to mentally note the blocking nature of every function, and this affects where we can call it from.

Python took the plunge with asyncio, which is so complicated that many Python luminaries admit they don't understand it, and of course it suffers from the "function color" problem as well, where any blocking call can ruin your day.

...

Go is the mainstream language that gets this really right. And it does so by relying on two key principles in its core design:

    Seamless light-weight preemptive concurrency across cores
    CSP and sharing by communicating

...The main reason for this is that they are both implemented in the language runtime, rather than being delegated to libraries.

...

You can think of goroutines as threads, it's a fairly good mental model. They are truly cheap threads - because the Go runtime implements launching them and switching between them without deferring to the OS kernel. In a recent post I've measured goroutine switching time to be ~170 ns on my machine, 10x faster than thread switching time.

But it's not just the switching time; goroutines also have small stacks that can grow at run-time (something thread stacks cannot do), which is also carefully tuned to be able to run millions of goroutines simultaneously.

There's no magic here; consider this claim - if threads in C++ or JS or Python were extremely lightweight and fast, we wouldn't need async models. Well, this is the case with Go. As Bob Nystrom says in his post - Go has eliminated the distinction between synchronous and asynchronous code.

That's not all, however. The second principle is critical too. The main objections to threads are not just about performance, there's also correctness issues and complexity. Programming with threads is hard - it's hard to synchronize access to data structures without causing deadlocks; it's hard to reason about multiple threads accessing the same data, it's hard to choose the right locking granularity, etc.

And this is where Go's sharing by communicating principle comes in. In idiomatic Go programs you won't see a lot of mutexes, condition variables and critical areas protecting shared data. In fact, you probably won't see much locking at all. This is because Go encourages programmers to use channels instead, and channels are built into the language, with awesome features like select, and so on. Proper use of channels removes the need for more explicit locking, is easier to write correctly, tune for performance, and debug.

Moreover, building these capabilities into the runtime, Go can implement great tools like the race detector, which make concurrency bugs easier to smoke out

...

Obviously many challenges of concurrent programming remain in Go - these are the essential complexities of the problem that no language is likely to remove; Go does a great job at removing the incidental complexities, though.

...

For these reasons, I believe Ryan Dahl - the creator of Node.js - is absolutely right when he says:

    [...] if you’re building a server, I can’t imagine using anything other than Go. [...] I think Node is not the best system to build a massive server web. I would use Go for that. And honestly, that’s the reason why I left Node. It was the realization that: oh, actually, this is not the best server-side system ever.

Different languages are good for different things, which is why programmers should have several sufficiently different languages in their arsenal. If concurrency is central to your application, Go is the language to use.

" The vanity style can be floor standing type, it could have vanity legs, or it could be mounted to the wall. The shape and size of your bathroom should be factored into this decision.

You can also decide on a single vanity or a double vanity, depending on the space that you have and how your bathroom is organized. It will also depend on how much you can afford. Of course a double vanity will usually cost more than a single unit of similar quality. "

---

from discussion on [12]

" 1. "I've measured goroutine switching time to be ~170 ns on my machine, 10x faster than thread switching time." This is because of the lack of switchto support in the Linux kernel, not because of any fundamental difference between threads and goroutines. A Google engineer had a patch [1] that unfortunately never landed to add this support in 2013. Windows already has this functionality, via UMS. "

1: https://blog.linuxplumbersconf.org/2013/ocw/system/presentations/1653/original/LPC%20-%20User%20Threading.pdf

Syscall API pid_t switchto_wait(timespec *timeout) ● Enter an 'unscheduled state', until our control is re-initiated by another thread or external event (signal). void switchto_resume(pid_t tid) ● Resume regular execution of tid pid_t switchto_switch(pid_t tid) ● Synchronously transfer control to target sibling thread, leaving the current thread unscheduled. ● Analogous to: ○ Atomically { Resume(t1); Wait(NULL); }

API choices/Considerations ● Operations must be commutative (reversible). {T1:Wait, T2:Switch(T1)} should behave the same as {T2:Switch (T1), T1:Wait} ● Requiring a re-entrant (asynchronous) user-scheduler entry classically hard; prefer a synchronous programming model. ● User scheduling id compatible with kernel scheduling; the kernel scheduler grants us quanta, we schedule within that quanta. ● Load-balancing is best left to the load-balancer

" Go doesn't do anything to prevent data races, which are the biggest problem facing concurrent code. It actually makes data races easier than in languages like C++, because it has no concept of const, even as a lint. Race detectors have long existed in C++ as well, at least as far back as Helgrind. "

" In other languages, asynchronous APIs return futures or have explicit callbacks, whereas synchronous APIs return the type directly, thus creating a type level distinction between asynchronous and synchronous code. Go's model eliminates that distinction even though it calls asynchronous OS APIs under the hood. "

" Go isn't the king of cheap threads, though. Some cilk style task parallelism implementations have figured out clever tricks to make threads and synchronisation even cheaper. They're so cheap in fact, that a recursive fibonacci function that spawns threads for its recursive calls is almost as fast as one that doesn't. This is achieved by spawning those threads lazily: if a core is idle it looks at the call stacks of other cores and retroactively spawns threads for some of the remaining work on those stacks. It steals that work by mutating that call stack in such a way that if that other core returns to the stack frame it will not perform that work itself, but it will use the result computed by the core that stole the work. This not only avoids thread spawning cost unless the parallelism is actually used, but it also avoids synchronisation cost, because synchronisation is only necessary if work was actually stolen. Cilk style task parallelism traditionally focuses on parallelism rather than concurrency, but there's no reason why the same implementation strategies couldn't work for concurrency. OS threads have no hope of beating this. "

nelsonje 16 hours ago [-]

This is the canonical Cilk paper: http://supertech.csail.mit.edu/papers/cilk5.pdf

jules 16 hours ago [-]

Here is the paper describing the lazy thread spawning implementation: https://ece.umd.edu/~barua/ppopp164.pdf

-- [13]

---

Groxx 21 hours ago [-]

>Goroutines are threads

This is a thing I have to continually remind newer engs that are using Go. Goroutines have exactly the same memory observability / race conditions / need for locking/etc as any threaded language. Every race bug that exists in e.g. Java also exists in Go. The only real difference for most code is that it pushes you very hard to use channels directly, as they're the only type-safe option.

There are some neat tricks lower down, e.g. where you don't have to explicitly reserve a thread for most syscalls (but there are caveats there too, which is why runtime.LockOSThread?() exists), which are legitimately nice conveniences. But not a whole lot else.

-- [14]

---

Matthias247 19 hours ago [-]

> The only real difference for most code is that it pushes you very hard to use channels directly, as they're the only type-safe option.

It is, but it is also a super important difference: Go provides inbuilt libraries to make inter-thread synchronization more bearable for the average user, e.g. channels and waitgroups. These are lot harder to misuse than bare mutexes and condition variables. Since those are not available in other languages and are too complex for most users to implement them on their own (especially when select{} is required), the solutions there often end up more error-prone.

Groxx 19 hours ago [-]

Practically every language includes stuff higher level than bare mutexes and atomics. That people use them doesn't mean other options aren't available. And yeah, totally 100% agreed, most people should never implement them themselves.

But "futures", "blocking queues", "synchronized maps", and "locked objects" (e.g. a synchronized wrapper around a java object) are extremely common and often higher level than channels and selects and waitgroups[1].

[1]: Waitgroups in particular are covered by normal counting mutexes, marking them as low-level constructs like they are, and are also available nearly everywhere.

Matthias247 18 hours ago [-]

True, there are plenty of primitives available. However I found many of them are mainly for synchronizing low-level access to shared-data (e.g. the concurrent data structures, synchronized objects, mutexes), etc.

Primitives for synchronizing concurrent control flow (like Go's select()) seem less common. E.g. in Java I would have some executor services, and could post all tasks to the same single threaded executor to avoid concurrency issues. But it's clearly a different way of thinking than with channels/select.

You are also fully right about waitgroups. They are really nothing special. But they might see more use in Go since some structuring patterns are more often used there: E.g. starting multiple parallel workflows with "go", and then waiting for them to complete with a waitgroup before going on.

Groxx 17 hours ago [-]

Select is a bit less common (at least in use) from what I've seen, yea... until you start looking at Promises and Rx. Then it's absolutely everywhere, e.g. `Promise.race(futures...)` (which trivially handles a flexible length of futures, unlike select) or any merging operator in Rx. More often though I see code waiting on all rather than any, and Go doesn't seem to change that.

Channels though are everywhere, and sample code tends to look almost exactly like introductory Go stuff[1] - Java has multiple BlockingQueues? which serve the same purpose as a buffered channel, and SynchronousQueue? is a non-buffered one. Though they don't generally include the concept of "close"...

But streams do have the "close" concept in nearly all cases, and streams are all over the place in many many forms, and have been for a long time. They generally replace both select and channels, but are usually much easier to use IMO (e.g. `stream.map(i -> i.string())` vs two channels + for/range/don't forget to safely close it or you leak a goroutine). Some of that is due to generics though.

[1]: https://docs.oracle.com/javase/7/docs/api/java/util/concurre...

some alternate stream concepts are super interesting too, e.g. .NET's new pipelines: https://blog.marcgravell.com/2018/07/pipe-dreams-part-1.html

pcwalton 19 hours ago [-]

I think more languages should have typed channels, but untyped channels are readily available via OS primitives. Unix has had pipe() since 1973.

Matthias247 17 hours ago [-]

True! Haven't thought about it, but it's actually an OS-backed untyped channel, which even supports select().

Guess the differences are: It operates on bytes and not on objects, so a custom protocol is needed on top of it. And it can't provide the same guarantees as an unbounded channel, where there are certain guarantees that sending an item actually reached the other site, and which makes guaranteed resource handoff through channels a bit easier.

" Multithreaded Python is a complete minefield for race conditions. Sure, you might not segfault, but you have to think if you use fork() or spawn() depending on the OS, you have to install libraries to detect races, you have to write special tests. With Go all of that comes out of the box and it makes it _easy_. "

 roganartu 21 hours ago [-]

> Finally, Go doesn't do anything to prevent data races, which are the biggest problem facing concurrent code. It actually makes data races easier than in languages like C++, because it has no concept of const, even as a lint. Race detectors have long existed in C++ as well, at least as far back as Helgrind.

While it can't guarantee correctness outside of runtime (and even then, obviously only if whatever you are running actually triggers the race), a race detector has been part of the Go core since 2012[1].

[1]: https://blog.golang.org/race-detector

anonacct37 19 hours ago [-]

Also there's a "best effort" concurrent map access detector that runs even when you don't compile with race detector support enabled.

-- [15]

---

jadbox 21 hours ago [-]

So what's preventing Linux from adopting switchto support? I'd love to see the discussion around it.

pcwalton 21 hours ago [-]

Beats me! The author no longer works at Google from what I can tell—emails to him bounced.

I'd love to see 1:1 threading become competitive with M:N for the heaviest workloads. It just plays so much nicer with the outside world than M:N does.

Matthias247 18 hours ago [-]

> I'd love to see 1:1 threading become competitive with M:N for the heaviest workloads. It just plays so much nicer with the outside world than M:N does.

After working lots of years with all of the available async paradigms (eventloops, promises, async-await, observables, etc) I would tend to agree. Even though many of those things (including Rusts futures implementation) are very well-engineered, the integration problems (as e.g. outlined in the what color is your function article) are very real. And the extra amount of understanding that is required to use and implement those technologies might often not justify the gains. One basically needs to understand normal threaded-synchronization as well as async-synchronization (e.g. as with .NET Tasks or Rusts futures) to build things on top of it. Same goes for the extra level of type indirection (Task<T> vs T), or the distinction between "hot" and "cold" tasks/promises.

If it would be possible to get 1:1 threading into the same performance region for most normal applications (e.g. everything but a 1 million connection servers) it seems to very favorable.

-- [16]

---

weberc2 19 hours ago [-]

> Goroutines are threads. So the idea the "Go has eliminated the distinction between synchronous and asynchronous code" is only vacuously true, because in Go, everything is synchronous.

Abstractly, I agree, but practically the distinction is important. In Go you don't have to use a frustrating async interface to get decent performance. You don't have to manage a threadpool or other tricks. You pretty much get the threadlike interface you want to use without the difficulty.

 pcwalton 19 hours ago [-]

Let's be clear: the one-OS-thread-per-connection model does yield decent performance. We used to call async I/O "solving the C10K problem"—i.e. serving 10,000 clients simultaneously.

I'm speaking from experience here, having tried to implement M:N and abandoning it in favor of 1:1, which yielded better performance. Can M:N yield better performance than 1:1? Sure, in some circumstances. But I think that, ideally, we should be striving for 1:1 everywhere.

dboreham 19 hours ago

unvote [-]

All true. But I have been waiting nearly 30 years for a 1:1 implementation that scaled to the #threads I want to have. Still waiting...

The point of M:N schemes isn't to achieve better performance but rather to achieve higher scaling.

dboreham 18 hours ago [-]

I believe there are boxes in production with 1E6 live connections.

-- [17]

---

thinkpad20 23 hours ago [-]

I think the author might be overstating how unique Go’s position is in terms of making concurrency easier. Haskell has the best concurrency story of any language I’ve used. Super lightweight green threads, simple concurrency primitives like MVar and STM, good performance (possibly requiring tweaks, but not bad out of the box). Referential transparency (by which I basically mean immutable data) makes sharing data across threads much easier and in some cases more performant by allowing you to avoid copying. Plus, you have a type system which makes it much easier to reason about your code.

All that being said, I haven’t written Go and can’t compare the two. Also, Haskell doesn’t support the actor/message passing model out of the box (although libraries exist for it) or prevent deadlocks (although the immutability helps a great deal here). BEAM languages, clojure, rust, pony and others all have their strengths here — again, this doesn’t discredit the article at all but the idea that Go is the clear winner is debatable.

jerf 21 hours ago [-]

In terms of raw capability, Haskell is the concurrency winner. Not only does it have basically every paradigm, they even all work together reasonably well. (Not perfectly, but reasonably well.) But I don't think you can argue Haskell is mainstream. It continues to be on a fairly slow growth trajectory from what I see, but it's not in any danger of cracking into the top tier language set anytime soon. Go just might in another 5 years.

-- [18]

---

kasey_junk 22 hours ago [-]

Its fairly interesting that this article doesn't mention actors, futures/promises and async/await style co-routines which are all extremely available in all of the major languages available today and broadly used (with the possible exception of golang).

Frankly, I think the concurrency story is one of the weaknesses of golang. Contrary to what this article says you cannot stop thinking about concurrency in your code in golang like you can in some of the more avante garde concurrency models (STM) and it doesn't provide nearly the sophistication in type or post build checking of other languages.

wahern 20 hours ago [-]

All of those interfaces are trivial to implement in Go precisely because Go implements a much stronger abstraction. By contrast, if you only have those other abstractions you're quite limited in how you can structure your code. (You might not realize how limited you are, however, if those are your only options.)

As the article says, most languages settle for those interfaces because they can mostly be implemented as libraries with minimal refactoring of the existing implementations and their execution model.

They have their uses but they're not universal enough. If you think actors, futures, and async/await are so great, imagine having to write every single function invocation in that manner. It would be unimaginable outside of some special-purpose declarative DSL. By contrast, the concept of a thread--linear flow of logical execution of statements and expressions--is fundamental to programming. Even in functional languages with lazy evaluation or automagic parallelism. It's a basic building block similar to functions. And much like functions, it's incredibly useful to be able to easily compose those blocks, which is where channels and message passing come into the equation. One way to compose them is with actor and future interfaces, but that's not the only way and not always the most appropriate way.

Threads ended up with a bad name because of (1) performance and (2) races, but that conflates the abstract concept of a thread--a construct that encapsulates the state of recursive function calls--with particular implementations. Goroutines are threads, pure and simple, but [mostly] without all the baggage of traditional implementations. (That they can be scheduled to execute in parallel complicates the construct but also makes them useful for the same reasons traditional OS threading constructs were predominately used, except with much less of the baggage.)

kasey_junk 17 hours ago [-]

The biggest problem with go is that it’s not easy to implement those as libraries. This is a combination of the golang story around generics & their opinionated strategy on concurrency.

Conversely most other modern, mainstream languages can mimic golang concurrency as a library.

GordonS? 19 hours ago [-]

> By contrast, if you only have those other abstractions you're quite limited in how you can structure your code.

OK... but what languages only have those abstractions?

wahern 16 hours ago [-]

I think it's fair to say that neither JavaScript?, Perl, nor Python, notwithstanding Web Workers or rough POSIX threading support; but Lua and Scheme do even though their threading construct is not based on the system threading model (though strictly speaking it's likewise available to those languages).

What I really meant to get at was that Go provides a flavor of threading that is lightweight, simple, and ergonomic. The threading construct is a first-class citizen and fundamental to both the language design and its implementation tradeoffs. Threading, lexical closures, and GC were designed and implemented holistically. Go takes a performance hit for its seamless support of closures (lots of incidental heap allocation), for example, but they did it because notwithstanding the Go authors' utilitarian goal it was important that the core abstractions remained relatively unadulterated and natural. If this wasn't the case (if it was just about avoiding manual memory management), Go could have required explicit capturing of lexicals, like C++ and Python do. People complain that Go doesn't have a lot of sophisticated features, but that's because everybody is focused on typing and generics. But modeling execution flow is at least as interesting academically and important commercially. While Go seems simple, supporting these constructs the way Go does is actually really difficult. Which is why it's so rare to find.

I intentionally didn't mention channels because while syntactically nice the real internal complexity (and payoff) comes from the closures. Channels are something of an implementation detail which you can easily hide inside closures in a functional-style of programming; threading + closures allow you to implement coroutine semantics very cleanly (i.e. no async/await necessary) and, if you so desire, transparently. (And it just occurred to me that async/await is such a misnomer. In a coroutine caller and callee are logically synchronous. The fact that languages like Python, C#, and C++ use async/await terminology shows how they put the cart before the horse--these constructs were designed and implemented for the very specific use case of writing highly concurrent async-I/O services, and they stopped at making that use case moderately convenient. They're a very leaky abstraction. See function color problem mentioned in the article.)

-- [19]

jy3 20 hours ago [-]

Describing its ((Go's)) concurrency model as one of its weakness is a bold statement, because it's probably one of the biggest reason it became widely adopted.

Everyone is different, but Go 's concurrency model has been the easiest for me to reason about. And I've dealt with my fair share of "async/await".

kasey_junk 16 hours ago [-]

Maybe. But I’ve been developing golang full time in high scale concurrency environments for 4 years and working with a team of similar people. It’s an opinion that is near universally shared on that team.

At high concurrency levels almost everything abandons standard golang concurrency patterns and tools.

tandr 13 hours ago [-]

Would you be so kind and explain what term "high concurrency levels" implies?

thank you in advance.

kasey_junk 12 hours ago [-]

Consumer facing systems. System wide throughput between 6-12 million QPS (daily low/high) (query body average size is 1.5KB). Each server tops at ~130K QPS. On a system with 2 1 GB NICs we pop the NIC. On a 10G we pop the CPU.

Current bottleneck is the golang http/net libs. Would likely need to rewrite it from the NIC up to do better.

-- [20]

---

jerf 21 hours ago [-]

"We've had ways of doing multithreaded code that are easier to reason about for decades. They really do work quite well. Why people doggedly insist on pretending they don't exist is a perennial mystery to me."

The real advantage to Go in a lot of ways was just starting over again with a couple decades more experience with multithreaded coding, and making the community default to a set of those more sane concurrency primitives. Nothing nominally stops you from doing the same thing in a number of other older threaded languages, but you're trying to bootstrap a new set of libraries from scratch, and that's not just a neutral operation, you are actively fought by the existing bulk of libraries for your existing language.

It's sort of weird that sometimes it's literally easier to start an entirely new language than fix an existing ecosystem and I can't say I've necessarily gotten my head wrapped around it, but observationally the evidence seems quite strong.

spinningslate 22 hours ago [-]

Don't disagree with the challenges about other languages being equally applicable. My first thought was also "Erlang does it at least as well as Go". Reasonable challenges on whether Erlang (or Elixir/Pony/...) are mainstream though.

For me the more valuable point is not that Go specifically gets it right: it's that async - as implemented in javascript/python/java/c# and so on - is fundamentally wrong. These two quotes get to the heart of it:

>The core idea is that in the asynchronous model we have to mentally note the blocking nature of every function, and this affects where we can call it from.

>The fundamental issue here is that both Python and C++ try to solve this problem on a library level, when it really needs a language runtime solution.

I've said for a while that async as implemented in javascript et al is the "GOTO" of concurrency - and should be considered equally as harmful as Dijkstra's observation on GOTO, for many of the same reasons [0].

[0] https://en.wikipedia.org/wiki/Considered_harmful

-- [21]

notriddle 21 hours ago [-]

Shared-memory threads are worse than async/await. I'd rather have GOTO than data races, since, while it might be hard to model in your head, at least it's deterministic and using it wrong doesn't cause UB.

p.s. https://vorpus.org/blog/notes-on-structured-concurrency-or-g...

spinningslate 21 hours ago [-]

Erlang doesn't do shared memory. It's a message passing, shared-nothing model. I believe GO is the same.

tomjakubowski 20 hours ago [-]

Go is not shared-nothing: if you write a slice to a channel between two goroutines, for example, the memory that the slice references is now shared between the goroutines

-- [22]

dreamcompiler 23 hours ago [-]

Or you could just use the Actor model, which I like much better than CSP. (Or at least I think I'll like it better when I finally understand it.)

sagichmal 22 hours ago [-]

The Actor model is trivially implementable with CSP.

zzzcpan 19 hours ago [-]

CSP can be implemented with Actor model and rather trivially. You can think of an actor as a programmable channel. But not the other way around. Because you can't implement unbounded nondeterminism with bounded.

notriddle 21 hours ago [-]

And CSP is trivially implementable using the Actor model. What's your point?

Matthias247 19 hours ago [-]

Depends. I would argue that e.g. Go's CSP approach is not implementable on top of Akka, since actors there are not allowed to block on reception of single message. They will always get messages pushed, so the blocking Go concurrency constructs can not be built on top of it. It's obviously slightly different with Erlang, where Actors can block.

I think the basic Actor definition does only say that Actors need to send messages to others, but not whether they can block on reception of selected responses. So it might be a bit undefined.

zzzcpan 19 hours ago [-]

Lack of blocking support in runtime cannot prevent you from implementing waiting. Either through higher-order event driven code or just plain busy-waiting style message exchanging until you receive your message.

Matthias247 18 hours ago [-]

Busy waiting won't work, since there will potentially never run another thread that changes the thing you are waiting for (that's why e.g. in JS everything needs to be async). There might be ways to model things a bit different (like using Promises and other monadic abstractions), but things will never look the same as in a blocking environment.

zzzcpan 18 hours ago [-]

It's a model, it doesn't matter how it looks.

sagichmal 18 hours ago [-]

My point is that you can program with the actor model in Go, if you want to.

karols 17 hours ago [-]

They are computationally equivalent. Except that you can't typecheck actor calculus (since you can send anything to an actor) but you CAN typecheck pi calculus (channels).

[23]

---

favorited 22 hours ago [-]

> Mixing threads with event loops is possible, but so complicated that few programmers can afford the mental burden for their applications.

Correct me if I'm wrong, but doesn't basically every UI framework from the last 20 years do exactly this?

Matthias247 19 hours ago [-]

They do, but I think also that possibly every user of such a framework has at least once created a concurrency issue once in their life. All of these the issues below are things I have seen (and created) over and over in these environments:

Trying to access UI framework elements from the wrong thread
not understanding that a callback inside one of the included libraries is not running on the UI thread but from somewhere else
callbacks capturing references to objects which already had been destroyed (but the callback had been queued and can't be cancelled), and accessing those objects later on
deadlocks, due to mutex locking in both callbacks as well as the "normal" code.

There are ways to minimize the amount of issues, e.g. always deferring asynchronous callbacks to the next eventloop iteration, making sure that callbacks are queued on the main thread instead of an arbitrary one, deferring object deletion behind all other callbacks, etc. But these are on the harder to learn, teach and enforce side.

Therefore I would agree with the author of that article that the mixture of eventloop-based programming and multiple threads is the hardest combination.

zzzcpan 18 hours ago [-]

> Correct me if I'm wrong, but doesn't basically every UI framework from the last 20 years do exactly this?

Yeah, multithreading with event loops was never a big deal. Simple API to run something in a thread pool out of event loop is definitely much easier, than using threads. Apple did Grand Central Dispatch and I believe it's universally recognized to be easier than using threads.

tybit 18 hours ago [-]

C# does this server side too, async isn’t particular fun but threading doesn’t make it any worse when the runtime handles it imo.

iainmerrick 21 hours ago [-]

Mixing threads with event loops is possible, but so complicated that few programmers can afford the mental burden for their applications.

This is just Apple's Grand Central Dispatch model, or the event loops used internally in Chromium. It's not complicated at all, it's a very practical and productive approach.

-- [24]

---

erik_seaberg 22 hours ago [-]

> Proper use of channels removes the need for more explicit locking

If you're lucky. Sharing mutable state is unsafe by default (map writes can crash!) yet very common and the language doesn't help you avoid it. A good language for concurrency would also make it easy to switch between sync and async method calls; the trouble with channels is they don't support passing errors and panics without rewriting everything to wrap them in a pseudo-generic struct.

weberc2 19 hours ago [-]

If you're using channels "properly", then you only have one thread writing to the map at a time. In this case, "proper" use of channels is easy enough, but there are plenty of cases where "proper" use of channels is (in my opinion) a fair bit harder than proper use of mutexes, which is one of the problems Go's concurrency scheme was meant to solve.

I will say that even in spite of the above criticism, I've found that writing safe, concurrent Go is quite a lot easier (channels are still useful if not the panacea they were made out to be), but it still takes a bit of planning to minimize the risk of creating race conditions (minimizing the interaction points between goroutines, not throwing goroutines at every problem, etc). This takes experience which requires social controls instead of technical controls, which is a bummer, but on balance still better (IMHO) than the problems brought by other languages with technical controls for this particular problem.

camgunz 22 hours ago [-]

Go's answer to this is tooling, -race in this case. I think Go's general strategy of moving complexity/functionality to external tools is a little underexamined -- it's definitely interesting, especially as it's kind of a middle path between "language with lots of helper stuff" and "IDE with lots of helper stuff".

erik_seaberg 21 hours ago [-]

It's better than nothing, but it's so expensive they don't really expect anyone to use it at prod scale. That's a recipe for letting catastrophic bugs do damage for hours or days.

-- [25]

---

 siscia 22 hours ago [-]

Leave aside the extremely interesting engineer behind th go scheduler and goroutine.

The abstraction that goroutine provides is a simple, indipendent, isolated unit of execution.

You start it, and that is all you can do with it.

No way to set priorities, decided when to stop it or inspect it.

After the goroutine start the only interface that you get is a channel where to push and pop stuff from. Which is just too limited consider the complete lack of genetics.

It is really the best we can come up with?

-- [26]

---

kahlonel 20 hours ago [-]

Apparently there is a C library for green (Go-like) threads called "libdill" or "libmill".

-- [27]

---

rustcharm 22 hours ago [-]

I disagree. You really need "shared nothing" and messages. The only language that really hits the concurrency nail on the head is Erlang, which we use extensively in our business.

truth_seeker 21 hours ago [-]

Erlang is latency sensitive and Go is throughput sensitive

-- [28]

---

vvanders 23 hours ago [-]

> Programming with threads is hard - it's hard to synchronize access to data structures without causing deadlocks; it's hard to reason about multiple threads accessing the same data, it's hard to choose the right locking granularity, etc.

It's almost as if we need a language that has a focus on data races, concurrency and fearless threading.

rubiquity 23 hours ago [-]

Hah! Good one. To anyone else not in on the joke, parent is referring to the language that is often discussed on HN and known for having a sometimes overly enthusiastic community, Pony: https://www.ponylang.io/

markdoubleyou 16 hours ago [-]

Hah! Good one. Parent is clearly referring to Ada and the oft-quoted chapter 9 of NASA's Ada Style Guide (c. 1987).

https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/198700...

The discussion of pitfalls surrounding Ada's task facility is eerily familiar to what you're reading in this thread 31 years later.

eatbitseveryday 23 hours ago [-]

Hah! You must be in on another joke, because OP was referring to Rust ;)

> Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.

www.rust-lang.org

edit: /s

-- [29]

---

kazinator 23 hours ago [-]

Actually it's very easy to avoid data-structure related deadlocks:

avoid holding two locks simultaneously: that guarantees no deadlocks.
if you have to hold several locks, always acquire them in the same order in all scenarios.

The "avoid holding multiple locks" rule also harmonizes very well with "minimize the durations/sizes of critical regions". That is, hold a lock over the minimum number of machine instructions necessary. That reduces lock contention, promoting better concurrency.

Of course, we don't get anything free and easy. Not holding multiple locks means that you can't lock in some consistent condition across multiple structures to do an atomic update. Any time you let go of a lock to go do something elsewhere and re-acquire the lock, the "world has changed". If the code hangs on to any previous assumptions about the state (e.g. cached info in local variables), there will be a bug.

LukeShu? 23 hours ago [-]

"avoid holding two locks simultaneously" only guarantees no deadlocks if locks are your only form of inter-thread-blocking. If you have synchronization points or channels/pipes or anything else, that needs to be expanded to include "avoid using a channel while you hold a lock" and similar for every blocking primitive.

kazinator 22 hours ago [-]

That's right; avoid doing anything that could block while holding a lock: acquiring another lock, waiting on a semaphore, and so on.

Note how condition variables have an aspect of this built-in: the wait operation gives up the mutex.

dreamcompiler 23 hours ago [-]

Or if you have to hold several locks, grab them all atomically and if even one grab fails, release them all and try again later.
If you really need all the locks right now, steal the ones you don't own from another thread, and make sure all threads can deal with lock theft.

It's not always easy to sort out the best way to deal with deadlocks. Each approach has significant tradeoffs.

kazinator 22 hours ago [-]

Multiple locks are not always in the same scope. Method Foo::Update acquires a lock, then without releasing the lock calls into another class/object where Bar::Commit also acquires its own lock.

dreamcompiler 20 hours ago [-]

Yep. That's certainly one of the tradeoffs.

pkaye 20 hours ago [-]

How would you do this in Go? If not Go, what language has the capability?

gatestone 6 hours ago [-]

"...shared values are passed around on channels and, in fact, never actively shared by separate threads of execution. Only one goroutine has access to the value at any given time. Data races cannot occur, by design." [...if you stick to this style of programming...]

-- [30]

---

reacharavindh 22 hours ago [-]

I thought this was one of the "not strong" points of Rust? It had Tokyo, futures, async/await but I vaguely remember many libraries waiting for the async story to settle down? Is there a one standard way to do async now in Rust stable?

steveklabnik 22 hours ago [-]

You can use tokio with stable Rust. Async/await is in nightly, but won’t be stable until early next year. As part of that work, some of the APIs are changing, but there is/will be a compatibility layer to make upgrading from the stuff today to tomorrow’s stuff easier.

That said, to address the question directly, rust-the-language provides a stronger guarantee than Go: rust code is free of data races at compile time. However, the APIs are much harder. Async/await will make them much easier, but still not as easy as “everything is always async.” The tradeoff is speed and safety; rust will be faster and have more guarantees, but be slightly harder to use (though some people do think the explicitness of async/await is easier, but that’s personal opinion).

-- [31]

---

[32]

" Why you can have millions of Goroutines but only thousands of Java Threads ... The term “thread” can mean a lot of different things. In this post, I’m going to use it to refer to a logical thread. That is: a series of operations with are run in a linear order; a logical path of execution. ... To support being paused and resumed, a thread minimally needs two things:

    An instruction pointer of some kind. AKA: What line of code was I executing when I was paused?
    A stack. AKA: What is my current state? The stack contains local variables as well as pointers to heap allocated variables. All threads within the same process share the same heap.

...

The JVM uses operating system threads

...

Using operating system threads incurs a constant, large, memory cost per thread

The second major problem with operating system threads comes because each OS thread has its own fixed-size stack. Though the size is configurable, in a 64-bit environment, the JVM defaults to a 1MB stack per thread. You can make the default stack size smaller, but you tradeoff memory usage with increased risk of stack overflow. The more recursion in your code, the more likely you are to hit stack overflow. If you keep the default value, 1k threads will use almost 1GB of RAM! RAM is cheap these days, but almost no one has the terabyte of ram you’d need to run a million threads with this machinery.

...

How Go does it differently: Dynamically Sized Stacks

Golang prevents large (mostly unused) stacks running the system out of memory with a clever trick: Go’s stacks are dynamically sized, growing and shrinking with the amount of data stored. This isn’t a trivial thing to do, and the design has gone through a couple of iterations (4). While I’m not going to get into the internal details here (they’re more than enough for their own posts and others have written about it at length), the upshot is that a new goroutine will have a stack of only about 4KB. With 4KB per stack, you can put 2.5 million goroutines in a gigabyte of RAM – a huge improvement over Java’s 1MB per thread. ...

4: Golang first started with a segmented-stacks model where the stack would actually expand into a separate area of memory, using some clever bookkeeping to keep track. A later implementation improved performance in specific cases by instead using a contiguous stack where instead of splitting the stack, much like resizing a hashtable, a new large stack is allocated and through some very tricky pointer manipulation, all the contents are carefully copied into the new, larger, stack. [return]

...

In the JVM: Context Switching Delay

Using operating system threads caps you in the double digit thousands, simply from context switching delay.

Because the JVM uses operating system threads, it relies on the operating system kernel to schedule them. The operating system has a list of all the running processes and threads, and attempts to give them each a “fair” share of time running on the CPU.5 When the kernel switches from one thread to another, it has a significant amount of work do. The new thread or process running must be started with a view of world that abstracts away the fact that other threads are running on the same CPU. I won’t the get into the nitty gritty here, but you can read more if you’re curious. The critical takeaway is that switching contexts will take on the order of 1-100µ seconds. This may not seem like much, but at a fairly realistic 10µ seconds per switch, if you want to schedule each thread at least once per second, you’ll only be able to run about 100k threads on 1 core. And this doesn’t actually give the threads time to do any useful work.

...

How Go does it differently: Run multiple Goroutines on a single OS thread

...

Even if JVM brought threads to user space, it still wouldn’t be able to support millions of threads. Suppose for a minute that in your new system, switching between new threads takes only 100 nanoseconds. Even if all you did was context switch, you could only run about a million threads if you wanted to schedule each thread ten times per second. More importantly, you’d be maxing out your CPU to do so. Supporting truly massive concurrency requires another optimization: Only schedule a thread when you know it can do useful work! If you’re running that many threads, only a handful can be be doing useful work anyway. Go facilitates this by integrating channels and the scheduler. If a goroutine is waiting on a empty channel, the scheduler can see that and it won’t run the Goroutine. Go goes one step further and actually sticks the mostly-idle goroutines on their own operating system thread. This way the (hopefully much smaller) number of active goroutines can be scheduled by one thread while the millions of mostly-sleeping goroutines can be tended to separately. This helps keep latency down.

Unless Java added language features that the scheduler could observe, supporting intelligent scheduling would be impossible. However, you can build runtime schedulers in “user space” that are aware of when a thread can do work. This forms the basis for frameworks like Akka that can support millions of actors(6)

6: Actors serve the same purpose as Goroutines for Scala/Java by enabling massive concurrency. Just like Goroutines, the actor scheduler can see which actors have messages in their mailbox and only run actors that are ready to do useful work. You can actually have even more actors than you can have goroutines because actors don’t need a stack. However, this means that if an actor does not quickly process a message, the scheduler will be blocked (since it can’t pause in the middle of the message since the Actor doesn’t have it’s own stack). A blocked scheduler means no messages are processed and things will quickly grind to a halt. Trade offs! [return]

...

In Apache, each request is handled by 1 OS thread, effectively capping Apache to thousands of concurrent connections. Nginx opts for a model where 1 OS thread tends to hundreds or even thousands of concurrent connections, allowing a much higher degree of concurrency. Erlang uses a similar model that allows millions of actors to execute concurrently. Gevent brings greenlets (user space threads) to Python, enabling a much higher degree of concurrency than would be supported otherwise (Python threads are OS threads). [return]

...

((a goroutine-relevant part of Go's runtime:)) https://github.com/golang/go/blob/d9b006a7057d4666cb4fa9c421f2360ef3994b0f/src/runtime/proc.go

---

 cryptonector 1 day ago [-]

It's all about fixed resource allocations. The default stack size for user-land threads tends to be 1MB or 2MB, and there's also (smaller) kernel stacks.

In implementations of languages like Scheme or Haskell where no stack may even be allocated, or where stacks might be allocated in chunks, the cost of a co-routine can be very small -- as small as the cost of a closure. If you take that to the limit and allocate every call frame on the heap, and if you make heavy use of closures/continuations, then you end up with more GC pressure because instead of freeing every frame on return you have to free many of them via the GC.

In terms of ease of programming to deal with async events, the spectrum runs from threads on the one hand to callback hell on the other. Callback hell can be ameliorated by allowing lambdas and closures, but you still end up with indentation and/or paren/brace hell. Co-routines are somewhere in the middle of the spectrum, but closer to threads than to callback hell. Await is something closer to callback hell with nice syntax to make things easier on the programmer.

Ultimately it's all about how to represent state.

In threaded programming program state is largely implicit in the call stack (including all the local variables). In callback hell program state is largely explicit in the form of the context argument that gets passed to the callback functions (or which they close over), with a little bit of state implicit in the extant event registrations.

Threaded programming is easier because the programmer doesn't have to think about how to compactly represent program state, but the cost is higher resource consumption, especially memory. More memory consumption == more cache pressure == more cache misses == slower performance.

Callback hell is much harder on the programmer because it forces the programmer to be much more explicit about program state, but this also allows the programmer to better compress program state, thus using fewer resources, thus allowing the program to deal with many more clients at once and also be faster than threaded programming.

Everything in computer science in the async I/O space in the last 30 years has been about striking the right balance between minimizing program state on the one hand and minimizing programmer pain on the other. Continuations, delineated continuations, await, and so on -- all are about finding that sweet spot.

TimJYoung? 8 hours ago [-]

Very good summary. I would only add that it pays to mention that thread/fiber stacks typically only reserve 1-2MB of VM, and only a small amount is committed initially (1 to 2 pages). With 32-bit address spaces, this is a minor distinction because the size of the address space is what matters the most, but with 64-bit address spaces, it can mean that you can aren't putting as much pressure on the VMM as one might think. If all of the threads in a process only use the default two pages, and the system page size is 4K, then 10,000 threads are only going to result in a commit size of 81MB of VM for the stacks. Of course, that's a lot of "ifs", and it completely ignores the effects of that many threads on the per-process VM mappings/CPU caches.

cryptonector 5 hours ago [-]

I keep getting comments about this, but it's not that simple. First of all, unless you're overcommitting, those stacks consume more resources than one might think at first blush. Second, the apps I am familiar with (on the Java side) get pretty deep stacks (you can tell from the stack traces in exceptions!). Third, if you want millions of stacks on a 64-bit address-space, I'm thinking you'll want larger pages if much more than a handful of 4KB pages per-stack get used (as otherwise the page tables get large), and then you're consuming large amounts of RAM anyways. I may be wrong as to the third item, but I suspect the actual depth of the stacks as-used makes a big difference.

The point is that thread-per-client is just never going to scale better than C10K no matter what, and that the work we see in this space is all about finding the right balance between the ease-of-programming of thread-per-client and the efficiency of callback hell (typical C10K).

For any size computer (larger than tiny), C10K designs will wring out orders of magnitude better scaling than typical thread-per-client apps no matter how much you optimize the hardware for the latter.

Thread-per-client can only possibly compare in the same ballpark as C10K when the actual stack space used is tiny and comparable to the amount of explicit state kept in the C10K alternative, and even then, the additional kernel resources consumed by those per-client threads will dwarf those needed for the C10K case (thread per-CPU), thus adding cache pressure. In practice, thread-per-client is never that efficient because the whole point of it is that it makes it trivial to employ layered libraries of synchronous APIs.

Now, I'm not saying that the alternatives to thread-per-client are easy, but we should do a lot better at building new libraries so that async I/O can get easier.

---