proj-oot-ootNotes20

Notes on an article "Why is Python slow?" and on a related presentation and on the HN discussion about it:

http://blog.kevmod.com/2016/07/why-is-python-slow/
https://docs.google.com/presentation/d/10j9T4odf67mBSIJ7IDFAtBMKP-DpxPCKWppzoIfmbfQ/edit#slide=id.g142ccf5e49_0_432
https://lwn.net/Articles/691243/
https://news.ycombinator.com/item?id=12025309

The article is about why Python is slower than JS and Lua.

" Python spends almost all of its time in the C runtime ... Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them. ... Pyston's performance improvements come from speeding up the C code, not the Python code. ... why is the Python C runtime slow? ... This is a for loop written in Python, but that doesn't execute any Python bytecodes:

    import itertools
    sum(itertools.repeat(1.0, 100000000))

The amazing thing about this is that if you write the equivalent loop in native JS, V8 can run it 6x faster than CPython. In the talk I mistakenly attributed this to boxing overhead, but Raymond Hettinger kindly pointed out that CPython's sum() has an optimization to avoid boxing when the summands are all floats (or ints). So it's not boxing overhead, and it's not dispatching on tp_as_number->tp_add to figure out how to add the arguments together.

My current best explanation is that it's not so much that the C runtime is slow at any given thing it does, but it just has to do a lot. In this itertools example, about 50% of the time is dedicated to catching floating point exceptions. The other 50% is spent figuring out how to iterate the itertools.repeat object, and checking whether the return value is a float or not. All of these checks are fast and well optimized, but they are done every loop iteration so they add up. A back-of-the-envelope calculation says that CPython takes about 30 CPU cycles per iteration of the loop, which is not very many, but is proportionally much more than V8's 5. ... If JS/Lua can be fast why don't the Python folks get their act together and be fast?

Python is a much, much more dynamic language than even JS. Fully talking about that probably would take another blog post, but I would say that the increase in dynamicism from JS->Python is larger than the increase going from Java->JS.

Why don't we rewrite the C runtime in Python and then JIT it? ... If you're going to rewrite the runtime into another language, I don't think Python would be a very good choice. There are just too many warts/features in the language, so even if you could somehow get rid of 100% of the dynamic overhead I don't think you'd end up ahead. ... "
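
A quick way to reproduce the article's itertools observation locally. This is a sketch, not the article's benchmark: N is reduced from the article's 100M so it finishes quickly, and absolute timings are machine-dependent; the point is that both variants spend their time inside the C runtime.

```python
import itertools
import timeit

N = 1_000_000  # reduced from the article's 100M for a quick run

# the article's one-liner: the whole loop runs inside the C runtime
t_sum = timeit.timeit(lambda: sum(itertools.repeat(1.0, N)), number=1)

# an equivalent loop that executes Python bytecodes on every iteration
def manual_sum():
    s = 0.0
    for x in itertools.repeat(1.0, N):
        s += x
    return s

t_loop = timeit.timeit(manual_sum, number=1)
print(f"sum(): {t_sum:.4f}s  bytecode loop: {t_loop:.4f}s")
```

On a typical CPython build the bytecode loop is noticeably slower, even though both are dominated by C-runtime work.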

my note: the article thinks that b/c most of the time is spent executing the OP codes rather than in interpreter overhead (b/c the opcodes are complex rather than simple), there will be little gains to stuff like JITing, and most gains will come from optimizing the opcode implementations, but some HN comments say that JITs etc can do optimizations that span multiple opcodes so this isn't necessarily true. What i'm more interested in though is how design choices in the Python language affect this stuff.

"
Lots of expensive features
Dynamic overhead of the runtime

What is it then?

((examples of 'features'))
Support not calling PyType_Ready
Strong signal handling guarantees
Tracebacks (including frame info) for every exception
sys.exc_info() / 0-arg raise
isinstance() override checking
Predictable deallocation (prevents fast refcounting schemes)
"

" Raymond Hettinger wondered about the tradeoffs involved. There is a lot of checking for things like signals and recursion depth in the C runtime that have a fairly high cost. Would it make sense to allow those to be relaxed or eliminated under some circumstances? "

 gopalv 3 days ago
 ...Python is extremely dynamic and this makes things hard for someone who wants to build a JIT.

The powerful bits of Python metaprogramming make it really impossible for a JIT to say with some certainty, across all running threads, that what it is doing is right.

Inlining a simple call like a.x() is rather hard when everything underneath can move around - I am not saying that it always does, but implementing a python variant which is nearly the same isn't very useful.
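
gopalv's point about a.x() can be made concrete in a few lines (class and method names are illustrative): nothing stops the method from being rebound between two calls, so a JIT that inlined the old body would silently be wrong.

```python
class A:
    def x(self):
        return "original"

a = A()
before = a.x()               # "original"

# legal at any time, even from another thread
A.x = lambda self: "patched"
after = a.x()                # "patched" -- an inlined copy of the old body is now stale
```

This is why real Python JITs guard such calls (e.g. on the type's version) and deoptimize when the guard fails.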

Compare this to PHP, which has fixed method calls (unless you use runkit, which you shouldn't) - a->x() will always be the same method as long as there was an a->x which was valid.

The method will never change once it has been validated.

However unlike Java, both languages end up not knowing exactly what type "a" will be when the method is being called.

Java also doesn't quite know, but only when the invoke is via an interface. But the engine at least knows exactly how many impls of that interface have been loaded so far (and the bi-morphic case is commonly 1 real impl and 1 mock impl).

But both in the case of PHP and Python, the whole idea of "which object do I have to look up ::x() for?" is an unknown. In PHP's case, you have to look it up once per class encountered, and in Python's case, you have to verify someone hasn't replaced it at runtime.

There are very nice functional ways around this problem at the bottom end of numeric loops for Python, which makes it great for numeric processing interleaved with generic control flow.

...

jcranmer 3 days ago

JavaScript has the same problems, and that hasn't stopped every major JS engine from building a JIT.

reply

madgar 3 days ago

JS is single-threaded which makes an enormous difference to actually squeezing performance out of your JIT.

Just building a JIT for a language generally isn't the hard part. Building a JIT that is substantially faster than a bytecode-compiled implementation of the language is what's hard, and how hard that is depends intimately on the semantics of the source language. When I say intimately, I mean every single detail of the language's semantics matter.

reply

"

" ... python opcodes are inefficient because they must support a significant amount of dynamicity, even as opposed to some other language peers like JS, PHP, Lua, etc.

...

> Does type hinting in Python 3.x (PEP 484) change this

No. The type hints are only meant for tooling (type checking, refactoring, documentation, etc.).

reply

"It's because of advanced JITs that Java got a lot faster."

" When you have fat opcodes you're SOL.

Consider this: an instruction like STORE_ATTR (implements obj.name = value) has to check whether the top of the stack refers to an object, then whether writing to the object attributes is allowed. Then "name" has to be checked if it's a string. Perhaps additional checking is needed to normalize the string or other unicode testing. Then the assignment has to happen which is a dictionary update (internally a hash map insertion which may trigger a resize). This is just the tip of the iceberg. A lot more stuff is happening and the correct exceptions are thrown when the instruction is used incorrectly (which leads to code bloat which hurts the instruction cache).

A thin bytecode instruction for STORE_ATTR could actually reduce the store to a single LEA machine code instruction (Load Effective Address).

The downside of a language with a thin instruction set is that the individual instructions can't validate their input. They have to trust that the compiler did its job correctly, a segfault or memory corruption will happen otherwise. One of Guido's goals when creating Python was that the runtime should never ever crash, even on nonsensical input. This pretty much rules out a thin instruction set right from the start. Simple concurrency is also a lot easier with a fat instruction set, because the interpreter can just yield in between instructions (Global interpreter lock). With a thin instruction set there is no clear delineation between instructions where all memory is guaranteed to be in a stable state. So a different locking model is needed for multi-threading, which adds even more complexity to the compiler and runtime. "

chrisseaton 3 days ago

All the problems you're describing are solved with a powerful JIT. And the core team do seem to be opposed to doing the work needed for that.

reply

gizmo 3 days ago

Python's philosophy chooses simplicity over everything else. Simple grammar. Simple bytecode instruction set. Simple CPython implementation. Simple threading model (GIL). Simple data structures. Adding a highly sophisticated and complicated JIT on top of that makes little sense.

It's not so difficult to create a high performance language that's much like Python. It's just not possible to make Python fast without abandoning some of its founding principles.

reply

"the Hotspot JVM has added intrinsics for memset, autovectorization, etc."

pjmlp 3 days ago

Lisp and Smalltalk like languages run circles around Python and Perl performance.

Enjoy the same powerful features, have JIT and AOT compilers to native code.

It all boils down to how much the language designers care about performance.

reply

dkersten 2 days ago

Lua is almost as dynamic and flexible as Python and is very easy for the programmer. LuaJIT's performance is close to that of C.

reply

nameless912 15 hours ago

I think part of the reason for this, though, is that Lua is a very "thin" language. It purposefully doesn't have a lot of fancy features (I mean, come on, it literally has one composite datatype, and you're supposed to use them [tables] for everything from arrays to maps to full objects). By removing a lot of assumptions that make Python "easy", Lua has made it much easier to write optimizing JITers and compilers since runtime behavior is (perhaps a little paradoxically) easier to predict. And it doesn't make Lua any less "easy" a language, it's just a different set of rules to follow. Guido's hunt for the simplest programming language possible is noble, but sometimes it causes poor decisions to be made imho.

reply

eva1984 3 days ago ((this comment modded down))

Well, just want to say, for one example here, why Python is harder to speed up than JS.

In Python, essentially every object is a dictionary. And behavior like "x.y = 'z'" is legal, meaning pretty much anything can be put into object x without declaration ahead of time, at all. While (not very sure) JS doesn't allow such behavior: you can still do assignment, but accessing will yield undefined.

The above behavior coming as the default (you can use __slots__ for sure, but the compiler can't assume that) is pretty problematic for optimization: the compiler simply won't know where to locate a specific variable ahead of time, so it cannot substitute the variable access with a pointer offset; it has to go through a hash lookup, which compared to the offset alternative is a magnitude slower.
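
A minimal sketch of the difference (class names are illustrative): an ordinary class stores attributes in a per-instance dict, while `__slots__` fixes the layout at class-creation time, which is the kind of guarantee an optimizer could exploit if it were the default.

```python
class Open:
    pass

class Slotted:
    __slots__ = ("y",)   # attribute layout fixed at class-creation time

o = Open()
o.y = "z"        # goes into o.__dict__ (a per-instance hash table)

s = Slotted()
s.y = "z"        # stored at a fixed offset; no per-instance __dict__
try:
    s.other = 1  # undeclared attributes are rejected
except AttributeError:
    print("slotted instances reject undeclared attributes")
```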

reply

trishume 3 days ago

Except that LuaJIT does something close to just a pointer offset. The trick is to use a hash function that is known by the compiler, so when you have a constant key ("y" in this case) the hash table set is compiled down to 3 instructions (imagine hash("y") = 0xa7):

1. Load whatever is necessary to check if x[0xa7] is occupied
2. Branch to slow path if occupied
3. Store x[0xa7] = 'z'

And a modern super-scalar CPU will even do some of this in parallel.

reply

" Yes, Python semantics allow for lots of introspection, and that's expensive — but so did Smalltalk to a large extent. Yet people managed, for instance, to not allocate stack frames on the heap unless needed (I'm pointing vaguely in the direction of JIT compilers for Smalltalk and Self, though by now I forgot those details). "

---

great slides on JVM:

http://www.azulsystems.com/blog/wp-content/uploads/2011/03/2011_WhatDoesJVMDo.pdf

---

Python vs Java vs PHP memory:

" Python runs in very low memory...Not because of JIT, but because of refcounting GC....Python is not really memory prudent either. PHP is much better in that regard (very fast startup time, fast page render times, but a rather different approach). "

---

" Fast VMs for dynamically typed languages (Chakra, V8) design the object model and the runtime around being able to spend as much time in JITted code as possible - the "fast path". "

---

chrisseaton 3 days ago

I think it's true that language implementations such as Ruby and Python spend most of their time running the C parts of the code. I did a talk saying the same thing about Ruby a couple of weeks ago, but referring to the Java code in JRuby, https://ia601503.us.archive.org/32/items/vmss16/seaton.pdf.

But this doesn't mean that a JIT is not going to help you. It means that you need a more powerful JIT which can optimise through this C code. That may mean that you need to rewrite the C in a managed language such as Java or RPython which you can optimise through (which we know works), or maybe we could include the LLVM IR of the C runtime and make that accessible to the JIT at runtime (which is a good idea, but we don't know if it's practical).

...

---

brianwawok 3 days ago

PyPy makes your app take many times the memory for like a 20% perf gain. Which is good but seems maybe often not worth the effort.

reply

---

beagle3 3 days ago

(based on discussion, can't get to website)

Any discussion that does not compare to LuaJIT 2 is suspect in its conclusions. On the surface, Lua is almost as dynamic as Python, and LuaJIT 2 is able to do wonders with it.

Part of the problem with Python (one that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations, e.g.

    for x in os.listdir(PATH):
        y = os.path.join(PATH, x)
        process_file(y)

There is no guarantee that "os.path" (and thus "os.path.join") called by one iteration is the same one called by the next iteration - process_file() might have modified it.
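
The point is easy to demonstrate: a callee really can rebind `os.path.join` mid-loop, and Python semantics require the next lookup to observe it (the `process_file` below is a deliberately pathological stand-in).

```python
import os
import os.path

def process_file(path):
    # pathological but perfectly legal: rebind os.path.join,
    # so the next loop iteration calls different code
    os.path.join = lambda *parts: "patched:" + "/".join(parts)

original_join = os.path.join
first = os.path.join("a", "b")
process_file(first)
second = os.path.join("a", "b")   # not the function we started with
os.path.join = original_join      # restore, to avoid surprising later code
```

Since the interpreter cannot prove this never happens, it must repeat the `os` -> `path` -> `join` lookups every iteration; caching `os.path.join` in a local before the loop sidesteps that.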

It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration. I'm not sure why it isn't anymore - it would also likely help PyPy and Pyston produce better code.

This is by no means the only thing that makes Python slower - my point is that if one looks for inherently slow issues that cannot be solved by an implementation, comparison with Lua/LuaJIT 2 is a good way to go.

reply

dagw 3 days ago

It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration.

I'll admit to being skeptical to how big an effect this would have, but it was bigger than I thought:

  %%timeit
  s=0
  for i in itertools.repeat(1.0, 1000000):
      s+=math.sin(i)
  10 loops, best of 3: 124 ms per loop

vs

  %%timeit
  s=0
  math_sin=math.sin
  for i in itertools.repeat(1.0, 1000000):
      s+=math_sin(i)
  10 loops, best of 3: 89.1 ms per loop

reply

fovc 3 days ago

Not sure how this timeit is running, but there are two things going on: local name lookups are done as array indexing, whereas global lookups are done as cascading dictionary lookups. And the attribute lookup is also a dictionary search. On instances, looking up a method defined in a superclass involves failed lookups in the instance and all the base classes in definition order. In a hot inner loop, definitely worth cutting out.
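
The local-vs-global distinction is visible in the bytecode. A sketch of both lookup styles (the default-argument trick binds the function once, at definition time):

```python
import dis
import math

def global_lookup(i):
    # LOAD_GLOBAL "math" (dict lookups), then an attribute lookup for "sin"
    return math.sin(i)

def local_lookup(i, math_sin=math.sin):
    # LOAD_FAST: an indexed read of the frame's local-variable array
    return math_sin(i)

print(dis.Bytecode(global_lookup).dis())
print(dis.Bytecode(local_lookup).dis())
```

(The exact attribute opcode name varies by CPython version, e.g. LOAD_METHOD vs LOAD_ATTR, but the LOAD_GLOBAL vs LOAD_FAST contrast holds throughout.)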

reply

---

trishume 3 days ago

Except that Lua has almost the same problem, and Mike Pall solved it in LuaJIT. So did the Truffle/Graal team with JRuby+Truffle.

Lua has the same property that global methods are hash table lookups. What LuaJIT does is specialize hash table lookups to the key, so that looking up an item in a hash table is two instructions which can be executed at the same time on a super-scalar CPU.

LuaJIT also gets serious wins from lifting the lookups out of loops since they can't change. This is only due to the fact that Lua doesn't have threads that can muck with each other's state. JRuby+Truffle solves this with multithreading by assuming that it doesn't change and deoptimizing into a slow path if it detects a change.

reply

trishume 3 days ago

Idea: I wonder if it would be profitable to have a mode for Pyston/PyPy/ZiPy/Whatever that disables threads, or perhaps an annotation that stops other threads from being scheduled inside certain loops.

This would allow so many checks to be hoisted out of the loop like LuaJIT does, that otherwise can't be hoisted because in theory some thread could re-assign os.path in the middle of it. If this could provide a 2-5x speedup, in most scenarios it would outweigh the performance benefit of parallelism, as well as being easier than parallelizing.

reply

xamuel 3 days ago

There isn't even any guarantee that "os.path" is the vanilla iterative lookup it's presented as. For all we know, it could be a @property-decorated method thousands of lines long.
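
xamuel's point in miniature (the class is a made-up example): what reads like a plain attribute access can execute arbitrary code via `@property`, so even "just a lookup" can't be assumed cheap or side-effect-free.

```python
class Sneaky:
    @property
    def path(self):
        # runs on every access that looks like a plain attribute read;
        # in real code this could be thousands of lines
        return "computed on access"

s = Sneaky()
value = s.path    # no call syntax, yet a method executed
```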

reply

jblow 3 days ago

Due to aliasing problems, you have exactly this issue in C and C++ as well, yet those languages are both massively faster than the languages in question.

reply

kayamon 3 days ago

While true, an aliasing 'miss' in C just causes a pointer reload, while the same miss in Python causes a hash table lookup.

reply

beagle3 3 days ago

it causes much more than a hash table lookup: it causes an attribute lookup, which is ...

     1. an instance __dict__ lookup, 
     2. a __getattr__ call (itself involving another hash lookup)
     3. a class property lookup
     4. superclass property lookups (if there are any superclasses)

If it was just a hash table lookup, it would be LuaJIT 2 fast. But it's more like 10 hash table lookups and indirections, and that's assuming you don't actually try to hook any of this machinery to do something.

reply

pcwalton 3 days ago

A C compiler doesn't have to assume that the definitions of functions called directly can change. LLVM couldn't optimize calls to memcpy if it couldn't assume its semantics, just to name one example.

reply

---

Animats 3 days ago

Poor article. This subject has been covered many times, and others have put in the key references.

Python's dynamism is part of the problem, of course. Too much time is spent looking up objects in dictionaries. That's well known, and there are optimizations for that.

Python's concurrency approach is worse. Any thread can modify any object in any other thread at any time. That prevents a wide range of optimizations. This is a design flaw of Python. Few programs actually go mucking with stuff in other threads; in programs that work, inter-thread communication is quite limited. But the language neither knows that nor provides tools to help with it.
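
Animats's point can be sketched in a few lines (the `Shared` class is hypothetical): any thread may rebind state that another thread's hot loop reads, so an optimizer cannot treat such reads as constant without guards.

```python
import threading

class Shared:
    limit = 10

def mutator():
    # any thread may rebind state other threads are reading
    Shared.limit = 99

t = threading.Thread(target=mutator)
t.start()
t.join()
# a compiler that had assumed Shared.limit was constant would now be wrong
print(Shared.limit)
```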

This is a legacy from the old C approach to concurrency - concurrency is an OS issue, not a language issue. C++ finally dealt with that a little, but it took decades. Go deals with it a little more, but didn't quite get it right; Go still has race conditions. Rust takes it seriously and finally seems to be getting it right. Python is still at the C level of concurrency understanding.

Except, of course, that Python has the Global Interpreter Lock. Attempts to eliminate it result in lots of smaller locks, but not much performance increase. Since the language doesn't know what's shared, lots of locking is needed to prevent the primitives from breaking.

It's the combination of those two problems that's the killer. Javascript has almost as much dynamism, but without extreme concurrency, the compiler can look at a block of code and often decide "the type of this can never change".

reply

EE84M3i 3 days ago

Note that threading is not the only place this can happen: it can also happen as the result of signal handlers. CPython can trigger signal handlers between any two Python opcodes.

reply

---

"

Why is Python so slow?

We alluded to this yesterday, but languages tend to have a compromise between convenience and performance.

    C, Fortran, etc.: static typing and compiled code leads to fast execution
        But: lots of development overhead in declaring variables, no interactive prompt, etc.
    Python, R, Matlab, IDL, etc.: dynamic typing and interpreted execution leads to fast development
        But: lots of execution overhead in dynamic type-checking, etc.

We like Python because our development time is generally more valuable than execution time. But sometimes speed can be an issue. Strategies for making Python fast

    Use Numpy ufuncs to your advantage
    Use Numpy aggregates to your advantage
    Use Numpy broadcasting to your advantage
    Use Numpy slicing and masking to your advantage
    Use a tool like SWIG, cython or f2py to interface to compiled code.

Here we'll cover the first four, and leave the fifth strategy for a later session.

" [1]
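
The first four strategies from the quote in one small sketch (array contents are arbitrary; requires NumPy, which the quoted tutorial assumes):

```python
import numpy as np

x = np.linspace(0.0, 10.0, 1_000_000)

y = np.sin(x)          # ufunc: one C-level call replaces a million interpreted iterations
total = y.sum()        # aggregate: the reduction runs in C
scaled = y * 2.0       # broadcasting: the scalar is stretched across the whole array
positive = y[y > 0]    # masking: selection without a Python-level loop
```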

---

"Why Python is Slow...it boils down to Python being a dynamically typed, interpreted language, where values are stored not in dense buffers but in scattered objects."

[2]
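
The "scattered objects" part is measurable with the stdlib: a Python list of floats is an array of pointers to individually heap-allocated float objects, not a dense buffer.

```python
import sys

xs = [float(i) for i in range(1000)]

# every element is its own heap object; the list only stores pointers to them
per_object = sys.getsizeof(xs[0])   # typically 24 bytes per float on 64-bit CPython
pointer_array = sys.getsizeof(xs)   # the pointer array, on top of the objects
print(per_object, pointer_array)
```

A dense C/NumPy buffer of 1000 doubles would be ~8000 bytes total; here that much is spent on the pointer array alone, before counting the boxed floats.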

---

makecheck 3 days ago

Python has had some surprising costs but some well-known cases have straightforward work-arounds.

For example, at one point, the way you called a multi-dot function actually mattered (not sure if this persists in the latest Python releases). If your function is like "os.path.join", the interpreter would seem to incur a cost for each dot: once to look up "os.path" and once to look up "path.join". This meant that there was a noticeable difference between calling it directly several times, and calling it only through a fully resolved value (e.g. say "path_join = os.path.join" and call it only as "path_join" each time).

Another good option is to use the standard library’s pure-C alternatives for common purposes (containers, etc.) if they fit your needs.

reply

---

saynsedit 3 days ago

What's expensive about the runtime is the redundant type/method-dispatch not the opcode dispatch. The runtime is constantly checking types and looking up methods in hash tables.

Gains can be made by "inter-bytecode" optimization in the same vein as inter-procedural optimization.

If you can prove more assumptions about types between the execution of sequential opcodes, you can remove more type checks and reduce the amount of code that must be run and make it faster.

E.g.:

    01: x = a + b
    02: y = x + c

If we have already computed that a and b are Python ints, then we can assume that x is a Python int. To execute line 2, we just then need to compute the type of c, thus saving time.

The larger the sequence of opcodes you are optimizing over, the better gains you can make.
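
The two-line example compiles to two generic binary-add opcodes (BINARY_ADD before CPython 3.11, BINARY_OP since), each of which re-dispatches on the operand types; proving `a` and `b` are ints is exactly what would let the second add skip its check.

```python
import dis

def f(a, b, c):
    x = a + b   # generic add opcode: checks types, dispatches
    y = x + c   # same generic opcode again, even though x's type is knowable
    return y

dis.dis(f)
```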

Right now I think Pyston uses a traditional inline-cache method. This only works for repeated executions. The code must eagerly fill the inline cache of each opcode to get the speed I'm talking about.

Another reason Python's runtime is slow is because there is no such thing as undefined behavior and programming errors are always checked against. E.g. The sqrt() function always checks if its argument is >= 0, even though it's programming error to use it incorrectly and should never happen. This can't be fixed by a compiler project, it's a problem at the language level.
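
The sqrt example is easy to confirm: CPython's `math.sqrt` validates its argument on every call and raises `ValueError` for a negative input rather than returning NaN or invoking undefined behavior (the wrapper function here is just for illustration).

```python
import math

def safe_sqrt(x):
    # math.sqrt itself performs the domain check on every call
    try:
        return math.sqrt(x)
    except ValueError:
        return None

print(safe_sqrt(4.0))    # 2.0
print(safe_sqrt(-1.0))   # None -- the C code checked the domain
```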

Being LLVM based, I think Pyston has its greatest potential for success as an ahead-of-time mypy compiler (with type information statically available). IMO leave the JITing to PyPy?.

reply

---

rurban 3 days ago

The arguments presented here are extremely dumb. Of course everyone knows already that the ops themselves are slow. What not many people know is that all this could be easily optimized. javascript had the very same problems, but had good engineers to overcome the dynamic overhead. php7 just restructured their data, lua and luajit are extreme examples of small data and ops winning the cache race. look at v8, which was based on strongtalk. look at guile. look at any lisp. All these problems were already solved in the '80s, and then again in the '90s, and then again in the 2000s.

python is similar to perl and ruby plagued with dumb engineering. python is arguably the worst. They should just look at lua or v8. How to slim down the ops to one word, how to unbox primitives (int, bool), how to represent numbers and strings. How to speed up method dispatch, how to inline functions, how to optimize internals, how to call functions and represent lexicals. How to win with optional types. Basically almost everything done in python, perl and ruby is extremely dumb. And it will not change. I'm still wondering how php7 could overcome the very same problems, and I believe it was the external threat from hhvm and some kind of inner revolution, which could convince them to rewrite it. I blame google who had the chance to improve it, after they hired Guido, and went nowhere. You still have the old guys around resisting any improvements. They didn't have an idea how to write fast VMs in the old days, and they still don't know.

reply

sitkack 3 days ago

The semantics of the language are extremely flexible; that flexibility adds a ton of "yeah but" when trying to do obvious things. "Yeah, but what if they monkey-patched boolean?" ... If I were starting a Python VM from scratch, objects would be closed and would only deopt if they were mucked with. Most Python code in the wild is written in a dynamically typed version of Java. Yes, most Python is Java w/o the types. Type inference, escape analysis and closed classes would allow for the same speed as the JVM.

reply

rurban 3 days ago

Yes, but then it's not python anymore. You will get a similar speedup for a fully dynamic python with easier optimizations, as done with v8 or lua. Even without types. types and attributes (like closed classes) do help, but a proper vm design is more important.

But not within this community. It needs to be a completely separate project, as cython, mypy or pypy. pypy is pretty good already, but a bit overblown.

...python needs to be rewritten completely. cython, pypy or mypy are good starts, but their op and data layout is still suboptimal and not best practice.

And I have no interest in python as language, only in fast dynamic VMs. e.g. potion could just add a python parser and get a 200x function call speedup, just as done for ruby or perl. The problem is the library.

reply

sitkack 2 days ago

Agree, but Pythonic-Python, isn't Python the language, it is dynamically typed Java. It isn't considered good form to use the flexibility that is there.

reply

rurban 3 days ago

They might be good hackers, but have no idea how a VM should be implemented. That's why you got this mess.

Just look how they designed the ops, the primitives, the calling convention, the stack, the hash tables. This is not engineering, this was amateurish compared to existing practice.

reply

pkd 2 days ago

Can you give some examples of more professional VMs which support dynamic languages without an extra compilation step apart from the JVM?

reply

rurban 2 days ago

I gave already and I didn't call them more professional. I called them engineered, in opposite to those simple adhoc implementations. The only part why worse is better in this regard was because of getting business attraction.

lua, v8, guile (but just last year), smalltalk and family (esp. strongtalk and the other pre-v8 projects), ml and family (MLton and ocaml), haskell (after many years of being too slow), any better scheme (>20), any better lisp (>5), php7 (just last year), and then the tiny lua based ones, like potion, wren, tinyrb, tinypy or tvmjit (with simple method ast->jit), or partially io, lily, or the other post-smalltalk languages, like maru (simpliest ast->jit) or the soda family around alan key and ian piumarta.

when you design a 20-100x slower calling convention without proper support for lexical variables, slow data, no unboxing, no tagging, slow ops, no inlining, no fast memory reclamation you are in the pre-SICP days and should be treated like that.

not even talking about the modern static languages and type systems which are now catching up, via llvm. not go or rust, but stuff like pony or julia.

reply

---

awesome example of literate programming:

https://github.com/jameysharp/corrode/blob/master/src/Language/Rust/Corrode/C.md

---

on writing java without allocation:

chollida1 1 day ago ... EDIT: someone emailed asking for how HFT firms write their java code. I haven't written java in 3 years so I'm probably not the best person to author a list but this is what I'd include:

In order of importance:

This time increasing is considered a bug in the same way that an app crashing due to user input is considered a bug, which is to say that you just don't ship with this kind of bug.

Watch this video: https://www.youtube.com/watch?v=iINk7x44MmM

vvanders 1 day ago

That's true, I'm very familiar with tuning Java to be cache aware. I've played around with FlatBuffers[1] in my spare time to good effect which seems to be the only sane way of doing cache aware layout in Java.

That said the amount of restrictions you have makes it really painful(needing to use pre-allocated wrappers if you want to read more than intrinsic types, not being able to grow/shrink anything). There's sun.misc.Unsafe, but that isn't available on every JVM implementation.

A GC is going to by nature introduce pauses if since at some point it's going to need to do work on the memory you hold. There's a set of tunable tradeoffs but it will impact latency.

I guess my meta point is if throughput and latency are paramount to you then you should really consider a language suited for it. For me today that would be Rust or C++ depending on how risk-adverse the organization you're working with is.

[1] https://google.github.io/flatbuffers/

reply

vitalyd 1 day ago

The fantastic standard library mostly goes away because it allocates. It's possible to write Java code that doesn't allocate in steady state, but the coding style becomes terrible (e.g. overuse of primitives, mutable wrappers, manually flattened data structures, etc).

There's also the issue that even without GC running you pay the cost of card marking (GC store barriers) on every reference write. There's unpredictability due to deoptimizations occurring due to type/branch profile changes, safepoints occurring due to housekeeping, etc.

It's unclear whether that style of Java coding is actually a net win over using languages with better performance model.

reply

---

" Also until pretty recently, C and C++ did not have a sane memory model so what happened when threads was involved was dependent on what CPU you are on, the sign of the moon, etc.

Negitivefrags 1 day ago

> Also until pretty recently, C and C++ did not have a sane memory model

Technically true, but in practice it was rarely relevant.

johncolanduoni 1 day ago

Lockless structures is one of those rare cases where it is relevant. Not that I think Java is great for these (Java's atomics support is not great compared to C/C++ or even C#), but at least it's not the free-for-all that C/C++ used to enjoy.

reply

"

---

yummyfajitas 1 day ago

Garbage collection.

A common pattern in Java based systems is fan-out based processing. One immutable message is created, sent to multiple processors, who then combine signals to make a decision. Tracking the memory here would slow you down. But usually the JVM handles this just fine, and kills those objects in first generation.

---

Animats 1 day ago

"However BTreeMap is implemented using a modest spoonful of Unsafe Rust (most collections are)."

Unsafe pure Rust code reflects a lack of expressive power. How raw memory becomes a typed object is a touchy area for arrays. To do collections safely, you have to be able to talk about an array which is partially valid. This is quite possible, but you need formal verification like constructs to do it. You need to be able to express that a slice of an array is valid, even though the whole array isn't.

If you can talk about a slice being valid, and an element being valid, you're almost there. It's a theorem that if an element is adjacent to the end of a slice, the element has some property, and all elements in the slice have a property, then a slice containing the original slice and the element has that property. Using that, you can prove by induction that as you grow a collection, the elements in use remain valid. If the access primitives don't let you access outside the valid range, they're safe.
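The induction step can be written down directly; `valid` and `init` here are hypothetical predicates, a sketch of the annotations such a prover would need:

```latex
% valid(a, 0, n): elements a[0..n) of the backing array are initialized.
% Step theorem: an initialized element adjacent to the end of a valid slice
% extends the valid slice by one.
\mathrm{valid}(a, 0, n) \land \mathrm{init}(a[n]) \implies \mathrm{valid}(a, 0, n+1)

% Base case: the empty slice is trivially valid.
\mathrm{valid}(a, 0, 0)

% By induction, any sequence of "initialize a[len], then len := len + 1"
% operations maintains valid(a, 0, len); access primitives that check
% index < len therefore never touch uninitialized memory.
```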

With some extra predicates and a modest theorem prover, much unsafe code in the collection area could probably be proven safe.

Rather than describing unsafe code as a "dark art", it would be more useful to try to formalize safety in this way. The theory is well understood, and there are many proof of correctness systems around for other languages. It might require a lot of annotation in unsafe code, to help the prover along, but that's not a bad thing.

reply

---

"

Python is a well-designed language with a reasonably clean syntax, a comprehensive standard library, excellent included and third party documentation, widespread deployment, and the immediacy of a "scripting" style language (ie. no explicit compile step).

" -- http://programmers.stackexchange.com/a/5429/167249


more motivations for modular/extensible type systems:

---

some common tasks on mac CLI:

https://github.com/rgcr/m-cli https://github.com/guarinogabriel/mac-cli

---

(in a discussion about Jupyter)

drauh 12 hours ago [-]

Beaker Notebook (posted several times, but without much attention[0]) does something similar: http://beakernotebook.com/features

It supports these languages: Python, Python3, R, JavaScript, SQL, C++, Scala/Spark, Lua/Torch, Java, Julia, Groovy, Node, Ruby, HTML, and Clojure.

It has an experimental native version: https://github.com/twosigma/beaker-notebook/wiki/Electron-Be...

Talk at SciPy 2015: https://www.youtube.com/watch?v=iMPfLz6kKv8

[0] https://news.ycombinator.com/item?id=8364237

reply

denfromufa 12 hours ago [-]

Maybe because Jupyter already has more than 70 kernels and got adopted before other notebooks appeared (Beaker, Spark, Zeppelin)?

reply

spot 2 hours ago [-]

true but the languages are siloed, each notebook runs just one language. with beaker the languages can communicate with each other. there's no easier way to combine python and javascript for d3, for example: https://pub.beakernotebook.com/publications/7fdcaaa6-fb83-11e5-9212-0797fdc961bc?fullscreen=true

there are a lot more differences in the UI as well.

reply

nl 1 hour ago [-]

You can actually use R and Python in the same notebook. See https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/

The %Rpush and %Rpull magics are what you need.

Also, if you are using Spark, then Apache Toree[1] lets you use Python, R, Scala and SQL in the same notebook against Spark[2].

[1] https://toree.incubator.apache.org/documentation/user/how-it-works.html

[2] https://github.com/ibm-et/spark-kernel/wiki/Language-Support-on-the-Spark-Kernel

reply

denfromufa 1 hour ago [-]

Not everyone needs more than one language in the same notebook communicating between each other. But if required, then the cell magic system looks superior to me: have a look at the %%fortran, %%cython, %%javascript, %%html, and %%bash options. Also it is possible to switch kernels in the same notebook, but serialization of state between kernels is handled by the user.

reply

---

the SEAM meta-VM uses a graph store as a generic memory. The graph store also has 'references' (aliases), which are used for completed futures (a special case of their "transients") ("Furthermore, a reserved label, internal to the store, is used to represent references. A replaced transient updates its label in place to become a reference, and stores an edge to the node the transient was replaced with. "), so that is very interesting for us.

---

technosmurf 1 day ago [-]

WordStar is popular with science fiction writers for more fundamental reasons. (Not because it doesn't have spell check or isn't connected to the internet.)

WordStar treats the text like a long-hand manuscript. You can add notes to yourself like "fix this!" and deal with the edits later. Or, you can block copy large sections of text and have the computer keep a reference to it without having to "copy and paste" it immediately. I've seen George RR Martin write that he likes how easy it is to move large sections of text around in WordStar.

Here's how the SF writer Robert J. Sawyer describes it:

"... as a creative writer, I am convinced that the long-hand page is the better metaphor.

Consider: On a long-hand page, you can jump back and forth in your document with ease. You can put in bookmarks, either actual paper ones, or just fingers slipped into the middle of the manuscript stack. You can annotate the manuscript for yourself with comments like "Fix this!" or "Don't forget to check these facts" without there being any possibility of you missing them when you next work on the document. And you can mark a block, either by circling it with your pen, or by physically cutting it out, without necessarily having to do anything with it right away. The entire document is your workspace...

WordStar's ^Q (Quick cursor movement) and ^K (block) commands give me more of what I used to have when I wrote in longhand than any other product does. WordStar's powerful suite of cursor commands lets me fly all over my manuscript, without ever getting lost. That's because WordStar is constantly keeping track of where I've been and where I'm likely to want to go. ^QB will take me to the beginning of the marked block; ^QK will take me to the end; ^QV will take me to where the marked block was moved from; ^QP will take me to my previous cursor position. And, just as I used to juggle up to ten fingers inserted into various places in my paper manuscript, WordStar provides me with ten bookmarks, set with ^K0 through ^K9, and ten commands to jump to them, ^Q0 to ^Q9...

WordStar, with its long-hand-page metaphor, says, hey, do whatever you want whenever you want to. This is a good spot to mark the beginning of a block? Fine. What would you like to do next? Deal with the block? Continue writing? Use the thesaurus?

After another half hour of writing, I can say, ah hah!, this is where I want to end that block. And two hours later I can say, and this is where that block should go. I'm in control, not the program. That's clearly more powerful, more intuitive, and more flexible than any other method of text manipulation I've yet seen implemented in a word processor. That WordStar lets me have separate marked blocks in each of its editing windows multiplies that power substantially: imagine doing a cut and paste job between two versions of a paper document, but being told that you could only have one piece cut out at a time. Madness! Yet that's what WordPerfect, Microsoft Word, and others would force you to do. (In WordStar 7.0, you can even, in essence, have two marked blocks per window, toggling between them with the "mark previous block" command, ^KU.)"

http://sfwriter.com/wordstar.htm

reply

blacksmith_tb 1 day ago [-]

There are more modern word processors that implement many of these features, for example Scrivener[1] and Ulysses[2]. Clipboard managers also help quite a bit, I can't do anything without one these days (but I also have a lot of commonly-used commands and blocks of code stored as snippets in mine, so it isn't just for juggling).

1: http://www.literatureandlatte.com/

2: http://ulyssesapp.com/

reply

lobster_johnson 23 hours ago [-]

Borland's IDEs (Turbo Pascal, etc.) also implemented the WordStar keyboard shortcuts and block behaviour, and it was amazingly productive.

You could mark a block (^KB to start, ^KK to end), and you could move around, find a place to insert it, then hit ^KV to move it there. Borland's IDEs also had selection (using shift+arrows like many UIs today), which didn't affect blocks, so you could do some really fast editing by combining them.

The WordStar keystrokes currently survive in JOE [1], which I use as my $EDITOR. There's also a tiny Atom plugin [2] which gives you those block commands.

[1] https://en.wikipedia.org/wiki/Joe%27s_Own_Editor

[2] https://github.com/iarna/atom-joe

reply

dalke 22 hours ago [-]

The phrase I remember was that the WordStar key commands were no one's favorite commands, but everyone's second favorite. That is, everyone in microcomputing in the 1980s knew and could use the ^KB, etc. set, but preferred some other editor's way of doing things.

Looking around now, this may have come from Philippe Kahn and Sidekick. Quoting Jim Mischel at http://www.mischel.com/diary/2005/08/22.htm :

> While I'm on the subject of WordStar, I've heard a story that I haven't been able to verify. When Philippe Kahn was asked why he chose to use the WordStar command set in the first version of Sidekick, he said that he asked a lot of people for their editor preferences. Almost everybody had a different first preference (back then it could have been Emacs, vi, Word Perfect, WordStar, Brief, Leading Edge Word Processor, or who knows what else). But almost everybody he asked knew WordStar and named it as their second preference. I don't know if it's true, but it smacks of truth. Certainly every microcomputer programmer I knew back in the late 80s was proficient with WordStar.

reply

---

nostrademons 11 hours ago [-]

There's actually a fairly long history of cross-language VMs, with various degrees of success. What usually happens is that they work fine for languages that look, semantically, basically like the native language on the VM. So LLVM works well as long as your language is mostly like C (C, C++, Objective-C, Rust, Swift). Parrot works if your language is mostly like Perl 6 (Perl, Python, PHP, Ruby). .NET works if your language is mostly like C# (C#, F#, Boo, IronPython). The JVM works if your language is mostly like Java (Java, Clojure, Scala, Kotlin). Even PyPy has frontends for Ruby, PHP, Smalltalk, and Prolog.

---

" Lua is unparalleled in its support of coroutines (...((including)) asymmetric coroutines that can yield across multiple, unadorned function invocation...), and metatables are what make optimizing Lua _difficult_. ... Lua's lexical closures or tail call optimization, both critical to Lua's ability to support functional programming patterns. ... the Lua C API, which is part-and-parcel of what makes Lua preferable as an extension and glue language "

[3]

" ...coroutines...

That's a pretty critical feature of Lua that somewhat defines it (along with its low overhead + embeddability) " [4]

"we use to run VM + tuneable data + game logic code in a 400kb block allocated to Lua on the PSP....Lua has a really rich history of being embedding in some pretty small targets." [5]

---

article on Truffle:

https://medium.com/@octskyward/graal-truffle-134d8f28fb69#.xjjpiifrz

wcrichton 10 hours ago [-]

Before anyone gets too excited, make sure you look at what a compiler using Truffle actually looks like, and remember that Java is far from an ideal language for writing a compiler. I'll use their SimpleLanguage (their primary tutorial lang) as an example.

Parser [1] shows how lack of sum types makes code messy. Lots of "factory.createStringLiteral" kind of calls. At least Java has a mature parser generator.

Implementing interpreter nodes [2] requires a pretty large amount of boilerplate, each individual node is relegated to its own file, some classes are littered with probably autogenerated getters/setters. While functional languages can be too terse, Java shows its exceptional verbosity here.

[1] https://github.com/graalvm/truffle/blob/master/truffle/com.o...

[2] https://github.com/graalvm/truffle/tree/master/truffle/com.o...

reply

---

kodablah 11 hours ago [-]

The Oracle flavors of the JVM/JDK probably scare too many away with regards to redistribution of their product. That, coupled with the fact that things like the AOT engine are closed source and the weight of the JVM for anything besides daemons (both mentioned towards the end of the article), probably keeps language designers away these days.

I would love a lightweight toolkit that made for easy language development. VMKit[0] is dead, MicroVM[1] is a nice idea but not full fledged. Many like myself looking to implement a toy language would love to be fast out of the gate and would rather not mess with the OS-specific constructs and specifics of LLVM IR. So we end up cross-compiling to existing languages or marrying ourselves to a certain runtime (e.g. the CLR). Any other lightweight VMs or AOT compilers that are cross-platform that language designers can use these days?

0 - http://vmkit.llvm.org/ 1 - http://microvm.github.io/

reply

frankpf 10 hours ago [-]

QBE[1] seems to be what you're looking for. It "aims to be a pure C embeddable backend that provides 70% of the performance of advanced compilers in 10% of the code".

Previous discussion on HN: https://news.ycombinator.com/item?id=11555527

[1]: http://c9x.me/compile/

reply

kodablah 10 hours ago [-]

Neat, I missed this when originally posted. Sadly, it falls apart on the "cross-platform" requirement (does not appear to support Windows).

reply

the8472 11 hours ago [-]

Graal/Truffle is not a proprietary part of the JDK, i.e. it's also part of OpenJDK. The article mentions this.

Some tangential things mentioned are proprietary though.

reply

geodel 11 hours ago [-]

Also some time back there used to be the OpenJDK Maxine VM project, which is gone, and now an option of the closed-source SubstrateVM has appeared.

It may be unlikely, but Oracle can simply remove highly paid staff from the above-mentioned projects and put them on closed-source alternative/analogous projects. The open source projects may then be left to rot. As it is, almost every single developer on Graal etc. is Oracle Labs or an Oracle-funded research group at a university.

reply

mahmoudimus 10 hours ago [-]

I think we should start to see post-Java9 a less resource-intensive JVM, even more so with Java10. The problem boils down to whether there exists an alternative ecosystem that is going to be as featureful and as powerful in the meantime. I am bullish on Oracle here, as much as I hate to say it.

reply

merb 8 hours ago [-]

yeah java9/10/11 stuff is probably a little bit late.

reply

pjmlp 7 hours ago [-]

Yeah, the problem is that it will only appear around 2020 and meanwhile the competition doesn't stand still.

Also we still have deployments using EOL Java versions that don't get upgraded, because IT doesn't want to change those servers.

reply

Scarbutt 11 hours ago [-]

check out RPython.

reply

kodablah 10 hours ago [-]

I should have mentioned one of my wants was a pluggable type system. Maybe not a full type inferencing component (that would be nice!) but at least static typing at the IR level.

reply

---

more on truffle:

qwertyuiop924 8 hours ago [-]

Okay, very cute, but what if you want your language to have significant semantic differences from C and the like? What if you want to write a Scheme interpreter, for instance? Can I have Continuations? Can I have Tail Call Optimization? I very much doubt it.

If you want to write a language that is semantically like C for the most part, then go ahead and use Truffle, but that's not where interesting language design is happening. Show me that Truffle's Ruby implementation hasn't removed callcc (yes, Ruby has callcc), and maybe I'll reconsider.

reply

Veedrac 3 hours ago [-]

You can do TCO relatively simply. You throw a special exception class for nonlocal control flow (eg. tail calls, breaks from loops, continues, etc.) and Graal knows how to optimize the cost away to a normal jump.

http://cesquivias.github.io/blog/2015/01/15/writing-a-langua...

ZipPy (an incomplete Python implementation) supports Python's coroutines, which are faster than in any other implementation. In theory these might be directly generalizable to get continuations.

http://thezhangwei.com/documents/oopsla113-zhang.pdf

reply

qwertyuiop924 2 hours ago [-]

The TCO sounds like an ugly hack to me, but I'll take it. How would this interact with the magical AST merging? I would assume that languages with different calling conventions can't just merge like that.

reply

Veedrac 2 hours ago [-]

Imagine you have an AST. You execute it directly, by running a node. This node will run its sub-nodes, etc. The implemented language has its own stack frames stored in specific objects. Thus the stack is actually split between those variables in the Java call stack and those in the explicit object.

Let's say you're in a particular function at a tail call node. Since the tail call node is semantically (from the point of view of the Java program) many function calls deep into the implemented language's function and we want to get back to the start, we want some way of exiting a whole stack of function calls in one jump.

Exceptions offer a simple and neat way to do this. You throw from the call site and catch back at the start of the function. Graal will inline all of the internal function calls up to this point, since Graal uses inlining extremely heavily. This means that Graal will see the throw site and the call site next to each other in one block of code, and consider it to just be a jump.

Because Graal is just inlining Java functions, and the actual target language's call stack is stored separately as a particular object, the target language's calling convention for a large part doesn't matter. The jump to the first node doesn't invalidate any of the target language's stack - that has to be done by the language implementor manually at that point.
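The exception-as-jump trick is easy to mimic in plain Python with a trampoline; `TailCall` and `run` are hypothetical names, and in Truffle the catch would sit at the function's entry node while Graal inlines the throw into a plain jump:

```python
class TailCall(Exception):
    """Nonlocal control flow: carries the next callee and its arguments."""
    def __init__(self, func, args):
        self.func, self.args = func, args

def run(func, *args):
    """Catch TailCall back at 'the start of the function' and loop,
    so a chain of tail calls uses constant stack."""
    while True:
        try:
            return func(*args)
        except TailCall as tc:
            func, args = tc.func, tc.args

def countdown(n, acc=0):
    """Sums n + (n-1) + ... + 1, with every self-call in tail position."""
    if n == 0:
        return acc
    raise TailCall(countdown, (n - 1, acc + n))

print(run(countdown, 100000))  # far beyond CPython's recursion limit
```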

reply

grashalm 2 hours ago [-]

If you do it in a language-specific way, you would probably need to catch those TC exceptions when calling into functions of your language. That is very easy to do, as every Truffle language controls how its functions get called from foreign languages.

reply

qwertyuiop924 1 hour ago [-]

Ah.

reply

---

more on truffle:

qwertyuiop924 1 hour ago [-]

Wait, nevermind. See above.

reply

dkarapetyan 10 hours ago [-]

Doesn't RPython do all the same things? Also there are language workbenches (http://www.languageworkbenches.net/) that allow one to get a bunch of things like IDE support and even basic compilers by properly expressing the language syntax and semantics.

I agree there is a lot of interesting stuff going on here but let's not forget the prior art. Even OMeta, I think, already did a lot of these things way back when. Going further back there's META II (https://en.wikipedia.org/wiki/META_II).

I think building self-specializing interpreters is the main trick. If that becomes easy then a whole bunch of other magic is also possible. But I'm just an amateur so all of this is still very impressive.

reply

jakub_h 5 hours ago [-]

Since you mention OMeta, let me remark here that for some reason, this whole Graal+Truffle thing looks to me like an "enterprisey" version of OMeta+COLA.

reply

Veedrac 2 hours ago [dupe] [-]

RPython is very similar to Graal+Truffle, except that RPython is based on tracing and Graal+Truffle is based on partial evaluation.

There's a nice paper comparing the two approaches and their relative merits:

http://stefan-marr.de/papers/oopsla-marr-ducasse-meta-tracin...

reply

http://stefan-marr.de/papers/oopsla-marr-ducasse-meta-tracing-vs-partial-evaluation/

---

" compiling Ruby into fast code is largely about assuming people will behave sanely (e.g. people won't usually give Fixnum#+ side-effects, or make it return weird stuff), but add fallbacks for when they do.

This is the big challenge with Ruby: The subset of Ruby that people actually tend to work in is largely very predictable and possible to compile efficiently, but there's a lot of stuff around the fringes that people expect to work on the very rare occasions that we use it. "
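The "assume sanity, fall back when violated" approach can be sketched as a guarded fast path. The class and flag names here are made up for illustration; a real JIT invalidates compiled code rather than checking a flag:

```python
class Interp:
    """Toy interpreter core for a dynamic language's '+' operator."""

    def __init__(self):
        # Set if user code redefines '+'; a real JIT would instead throw
        # away the compiled code when this assumption breaks.
        self.plus_redefined = False

    def generic_add(self, a, b):
        # Slow path: full dynamic dispatch (method lookup, coercion, ...).
        return a.__add__(b)

    def add(self, a, b):
        # Fast path: cheap guards that almost always succeed.
        if not self.plus_redefined and type(a) is int and type(b) is int:
            return a + b  # compiles down to a machine add (plus overflow check)
        return self.generic_add(a, b)  # fall back when the assumption fails

interp = Interp()
print(interp.add(2, 3))   # fast path
interp.plus_redefined = True
print(interp.add(2, 3))   # same result via the generic fallback
```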

---

wellpast 11 hours ago [-]

I think another key part of the dream is IDE support. So maybe the dream is complete with: Graal + Truffle + Nitra[1].

[1] Nitra - https://github.com/JetBrains/Nitra

reply

zokier 11 hours ago [-]

How do Nitra and MPS fit together? They both come from JetBrains and both describe themselves as "language workbenches"

reply

---

truffle vs racket:

TheMagicHorsey 10 hours ago [-]

Interesting. But I feel like Racket is a better framework for quickly prototyping new languages.

reply

trishume 7 hours ago [-]

Totally, Truffle is not nearly as easy as a simpler interpreter in a terser language, but unlike a Racket interpreter your code can actually run faster than raw Racket, and have incredible tooling and integration support.

reply

---

more on Truffle:

zzzcpan 9 hours ago [-]

So, judging by the slides for that "one vm to rule them all" talk [1], it's Java all the way. One has to generate the AST in the form of Java code and deal with the Java ecosystem. And it feels all very, very complicated for all the wrong reasons.

I guess some people from the Java world could find something useful there, but it's very unlikely to attract anyone else.

[1] https://lafo.ssw.uni-linz.ac.at/pub/papers/2016_PLDI_Truffle...

reply

---

lkrubner 1 day ago [-]

Since I've been interested in how the Functional paradigm might help us developers deal with concurrency, there are 2 items here that really strike me:

1.) Clojure fell back

2.) Elixir, though small, is still moving forward

I've been a fan of Clojure since 2009. It grew rapidly for a few years, from 2009 till at least 2014. It has stalled out. This makes me sad because I've loved working with it and I think it a great community, in most ways. But it is true that the community has been unable to answer some of the criticisms leveled against it. The reliance on Emacs has meant it is not easy for beginners. On the other side, the elite (the purists?) have wanted to see Clojure be more like Scheme, or more like Racket, or more like Haskell. Clojure has offered up many interesting ideas, but perhaps hasn't quite built a complete eco-system that makes it broadly appealing.

Elixir, though very small, still seems to be moving forward, and perhaps has a chance to answer some of the demands that people made of Clojure (more pure? easier syntax?). Maybe Elixir is the language that make the Functional style the dominant style in computer programming?

reply

s_kilk 1 day ago [-]

A few (rambling) thoughts, as a long-time Clojure dev who's recently been focusing on Elixir.

The thing I like most about Elixir is the low friction between it and its host language, Erlang. Erlang is a functional language right from the start, and the BEAM is designed to run a functional language, which means you never hit the weird FP-OO barrier that comes up so commonly in Clojure.

I really think the impedance mismatch between Clojure and the underlying Java implementation is a big, painful problem which is going mostly unacknowledged by the inner-core of the Clojure community. And that makes sense, because they're all (or mostly) "Java Guys" who are already sold on the Java ecosystem and are fine with there being a layer of spindly OO horror in their projects. They don't mind the Java layer poking through in unpleasant ways.

When I bring this up in a congregation of Clojurists I just get "shrug, welp, Clojure is a hosted language", not even an acknowledgement that there may be a problem worth discussing.

For the rest of us, the FP enthusiasts who don't have an emotional or professional attachment to Java, this sucks, because you can't do much in Clojure without also touching Java. As I said at the start of this ramble, Elixir/Erlang don't suffer from this problem, it's FP all the way down. Erlang idioms fit in with Elixir, and vice versa. No friction, no bullshit.

One final point: I think the Elixir team visibly care about ergonomics and good design, much more than the Clojure team seem to. Look at `mix` and compare it to `leiningen`, honestly. Look at Plug and Phoenix and compare to something like Luminus in Clojure-land. The difference is night-and-day. Look at how fast the `iex` repl boots up, or how fast the stop-compile-start workflow is in a Phoenix app. Even look at the content of the languages respective homepages (http://elixir-lang.org and http://clojure.org)

reply

---

" Terra: at Stanford, Zach DeVito? along with Pat Hanrahan created the Terra language, which is a low-level language that is interoperable with Lua and can also be metaprogrammed in Lua. Their primary goal for Terra is using it to construct domain-specific languages like Ebb and Opt. While Wyvern compiles down its languages all into the same base language and type system (SML), Terra uses Lua to define both the compiler and the top-level interface for each new language, but then compiles them down into Terra (as opposed to Lua) for performance. Figure 4: JIT compiling Terra code in Lua. Here, the fundamental concept is that exposing the compiler's intermediate representations enables other languages to precisely control their compiled code (as oppose to just compiling to the host language and hoping for the best) while still maintaining a level of interoperability. "

[6]

---

" Lia: in my free time, I started working on a language called Lia which is a Javascript-esque syntax that compiles into pure Rust code using the aforementioned procedural macros—imagine a Javascript runtime where you could easily call any Rust function. Figure 5: Using Rust code in Lia to multiply matrices. In my mind, Rust is the ideal compile target for an ecosystem of interoperable languages, as it has the least number of opinions on how to run your program while still ensuring safety. Specifically, it manages memory at compile time and also provides a static type system. In my mind, there are three main tiers of general purpose languages: low level systems languages like Rust, C++, and Terra that provide static types and compile-time or manually managed memory comprise the first tier. In the next tier are languages like OCaml, Haskell, Scala, and Java that provide static types but impose a garbage collector in exchange for usability. The last tier consists of dynamically typed and garbage collected languages, usually called scripting languages, like Javascript, Python, and Ruby. Traditionally, all of these kinds of languages are developed in relative isolation, but there is no real reason for that to be the case. I see a future where Rust, OCaml, and Javascript (or some selection of languages like that) all seamlessly interoperate in the same ecosystem. Lia is a first step towards that ideal by exploring the issues and uses cases for a third tier scripting language embedded inside of a first tier systems language. "

[7]

---

JS 'Get'

8.12.3 [[Get]] (P)

When the [[Get]] internal method of O is called with property name P, the following steps are taken:

1. Let desc be the result of calling the [[GetProperty]] internal method of O with property name P.
2. If desc is undefined, return undefined.
3. If IsDataDescriptor(desc) is true, return desc.[[Value]].
4. Otherwise, IsAccessorDescriptor(desc) must be true, so let getter be desc.[[Get]].
5. If getter is undefined, return undefined.
6. Return the result of calling the [[Call]] internal method of getter providing O as the this value and providing no arguments.
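The steps above can be modeled in a few lines of Python; objects and property descriptors are plain dicts here (an illustrative encoding, not how any engine stores them):

```python
def get_property(obj, name):
    """[[GetProperty]]: walk the prototype chain looking for a descriptor."""
    while obj is not None:
        if name in obj['props']:
            return obj['props'][name]
        obj = obj['proto']
    return None  # the spec's "undefined"

def js_get(obj, name):
    """The six [[Get]] steps quoted above."""
    desc = get_property(obj, name)      # step 1
    if desc is None:                    # step 2
        return None
    if 'value' in desc:                 # step 3: data descriptor
        return desc['value']
    getter = desc.get('get')            # step 4: accessor descriptor
    if getter is None:                  # step 5
        return None
    return getter(obj)                  # step 6: [[Call]] with this = O

proto = {'proto': None, 'props': {'x': {'value': 1}}}
o = {'proto': proto, 'props': {'y': {'get': lambda this: 42}}}
print(js_get(o, 'x'), js_get(o, 'y'), js_get(o, 'z'))  # 1 42 None
```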

---

great article on select vs poll vs epoll vs kqueue (listed from worst to best):

http://people.eecs.berkeley.edu/~sangjin/2012/12/21/epoll-vs-kqueue.html

some notes: select and poll are similar, but select has some suboptimal data structures that require constantly rewriting the same thing and iterating densely over an array which is probably sparse.

epoll and kqueue are superior in that the 'interest set' data structure is stateful, i.e. persistent between calls to epoll and kqueue, and maintained by the kernel, instead of passed in on each call. All you do is send CHANGES to the interest set to the kernel. This allows the kernel not to waste time on each call re-scanning the old, almost always unchanged, interest set.
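The difference is visible from Python's `select` module (the `epoll` half is Linux-only):

```python
# select() takes the whole interest set as an argument on every call, while
# epoll keeps the set in the kernel and the application only sends changes.
import select
import socket

a, b = socket.socketpair()

# select: the interest set [a] is rebuilt and re-scanned on every call
b.send(b'ping')
readable, _, _ = select.select([a], [], [], 1.0)
sel_ok = a in readable
a.recv(4)

# epoll: register once; later polls pass no set at all
ep = select.epoll()
ep.register(a.fileno(), select.EPOLLIN)  # a CHANGE to the kernel's interest set
b.send(b'pong')
events = ep.poll(1.0)                    # kernel already knows what we watch
ep_ok = any(fd == a.fileno() for fd, _ in events)

ep.unregister(a.fileno())                # another change; the set persists otherwise
ep.close(); a.close(); b.close()
print(sel_ok, ep_ok)
```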

kqueue is superior in that:

" The last issue is that epoll does not even support all kinds of file descriptors; select()/poll()/epoll do not work with regular (disk) files. This is because epoll has a strong assumption of the readiness model; you monitor the readiness of sockets, so that subsequent IO calls on the sockets do not block. However, disk files do not fit this model, since simply they are always ready.

Disk I/O blocks when the data is not cached in memory, not because the client did not send a message. For disk files, the completion notification model fits. In this model, you simply issue I/O operations on the disk files, and get notified when they are done. kqueue supports this approach with the EVFILT_AIO filter type, in conjunction with POSIX AIO functions, such as aio_read(). In Linux, you should simply pray that disk access would not block with high cache hit rate (surprisingly common in many network servers), or have separate threads so that disk I/O blocking does not affect network socket processing (e.g., the FLASH architecture).

In our previous paper, MegaPipe?, we proposed a new programming interface, which is entirely based on the completion notification model, for both disk and non-disk files. "

---

" While these approaches have the benefit of maintain- ing backward compatibility for existing applications, the need to maintain the generality of the existing API – e.g., its reliance on file descriptors, support for block- ing and nonblocking communication, asynchronous I/O, event polling, and so forth – limits the extent to which it can be optimized for performance. In contrast, a clean- slate redesign offers the opportunity to present an API that is specialized for high performance network I/O. "

" The current best practice for event-driven server pro- gramming is based on the readiness model. Applica- tions poll the readiness of interested sockets with select/poll/epoll and issue non-blocking I/O commands on the those sockets. The alternative is the completion no- tification model. In this model, applications issue asyn- chronous I/O commands, and the kernel notifies the appli- cations when the commands are complete. This model has rarely been used for network servers in practice, though, mainly because of the lack of socket-specific opera- tions such as accept/connect/shutdown (e.g., POSIX AIO [6]) or poor mechanisms for notification delivery (e.g., SIGIO signals).

MegaPipe adopts the completion notification model over the readiness model for three reasons. First, it allows transparent batching of I/O commands and their notifications. Batching of non-blocking I/O commands in the readiness model is very difficult without explicit assistance from applications. Second, it is compatible with not only sockets but also disk files, allowing a unified interface for any type of I/O. Lastly, it greatly simplifies the complexity of I/O multiplexing. Since the kernel controls the rate of I/O with completion events, applications can blindly issue I/O operations without tracking the readiness of sockets "

" MegaPipe? API functions. ... The application associates a handle (either a regular file descriptor or a lwsocket) with the specified channel with mp_register() . All further I/O commands and com- pletion notifications for the registered handle are done through only the associated channel. A cookie, an opaque pointer for developer use, is also passed to the kernel with handle registration. This cookie is attached in the comple- tion events for the handle, so the application can easily identify which handle fired each event. The application calls mp_unregister() to end the membership. Once unregistered, the application can continue to use the reg- ular FD with general system calls. In contrast, lwsockets are immediately deallocated from the kernel memory. "

" MegaPipe? API functions (not exhaustive)

Function Parameters Description

mp_create() Create a new MegaPipe channel instance.

mp_register() channel, fd, cookie, cpu_mask Create a MegaPipe handle for the specified file descriptor (either regular or lightweight) in the given channel. If a given file descriptor is a listening socket, an optional CPU mask parameter can be used to designate the set of CPU cores which will respond to incoming connections for that handle.

mp_unregister() handle Remove the target handle from the channel. All pending completion notifications for the handle are canceled.

mp_accept() handle, count, is_lwsocket Accept one or more new connections from a given listening handle asynchronously. The application specifies whether to accept a connection as a regular socket or a lwsocket. The completion event will report a new FD/lwsocket and the number of pending connections in the accept queue.

mp_read(), mp_write() handle, buf, size Issue an asynchronous I/O request. The completion event will report the number of bytes actually read/written....The application should not use the provided buffer for any other purpose until the completion event, as the ownership of the buffer has been delegated to the kernel

mp_disconnect() handle Close a connection in a similar way to shutdown(). It does not deallocate or unregister the handle.

mp_dispatch() channel,timeout Retrieve a single completion notification for the given channel. If there is no pending notification event, the call blocks until the specified timer expires.... ((the notification) contains: i) a completed command type (e.g., read/write/accept/etc.), ii) a cookie, iii) a result field that indicates success or failure (such as broken pipe or connection reset) with the corresponding errno value, and iv) a union of command-specific return values.

"

" There is an interesting spectrum between pure event- driven servers and pure thread-based servers. Some frameworks expose thread-like environments to user ap- plications to retain the advantages of thread-based archi- tectures, while looking like event-driven servers to the kernel to avoid the overhead of threading. Such function- ality is implemented in various ways: lightweight user- level threading [23, 39], closures or coroutines [4, 18, 28], and language runtime [14]. Those frameworks intercept I/O calls issued by user threads to keep the kernel thread from blocking, and manage the outstanding I/O requests with polling mechanisms, such as epoll . These frame- works can leverage MegaPipe? for higher network I/O per- formance without requiring modifications to applications themselves. We leave the evaluation of effectiveness of MegaPipe? for these frameworks as future work. '

[8]

---

" Yes I would say kqueue, the interface, is superior to epoll. Kqueue allows one to batch modify watcher states and to retrieve watcher states in a single system call. With epoll, you have to call a system call for every modification. Kqueue also allows one to watch for things like filesystem changes and process state changes, epoll is limited to socket/pipe I/O only "

---

so ppl say IOCP > kqueue > epoll > poll > select.

They say IOCP and kqueue have a completion model whereas the others have a readiness model (the application subscribes to completion events vs. the application asks who is ready for I/O). They say poll and select have stateless interest sets whereas the others have stateful ones (the application sends the full list of file descriptors to wait upon in each poll, forcing the OS to rescan this list each time, vs. the kernel maintains the list and the application just sends mutations to it).
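The stateless vs. stateful distinction is visible even from Python (a sketch using the stdlib; on Linux, DefaultSelector is backed by epoll): select() takes the whole interest set as an argument on every call, while a selector is registered once and then polled with no list at all.

```python
# Stateless vs stateful interest sets. select() receives the full fd
# list on *every* call (the kernel rescans it each time); an
# epoll/kqueue-backed selector is registered once and then polled.
import select
import selectors
import socket

a, b = socket.socketpair()
b.send(b"x")                                 # make `a` readable

# stateless: the interest set is an argument to each and every call
readable, _, _ = select.select([a], [], [], 1)
assert a in readable

# stateful: register once, then poll without re-sending the list
sel = selectors.DefaultSelector()
sel.register(a, selectors.EVENT_READ)        # one registration call
events = sel.select(timeout=1)               # no fd list passed here
assert events and events[0][0].fileobj is a

sel.unregister(a)
a.close(); b.close()
```

The kqueue advantage quoted above goes one step further: its kevent() call can batch many registration changes and retrieve events in a single syscall, whereas epoll needs one epoll_ctl() per modification.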

one interesting detail from http://people.eecs.berkeley.edu/~sangjin/static/pub/osdi2012_megapipe.pdf is that it lets the application send a 'cookie' when a file descriptor is associated with a channel; this cookie is returned with each completion event. Presumably this is like how I propose that an API that lets clients create entities should allow the clients to associate an opaque ID with each created entity (and should also let the client alias an externally created entity so that their alias has their opaque ID).


the Erlang guy gives a good example of why the concurrency model of Erlang is easier to reason about than in JS:

" Here's a example of a green callback in Erlang: loop(F) -> receive {new_callback, F1} -> loop(F1); Msg -> F(Msg), loop(F) end.

When the process running this code receives a message Msg it evaluates the function F(Msg). There is no uncertainty about this: we know _exactly_ when the callback is triggered. It is triggered immediately after receiving the message Msg.

This wee code fragment is doubly beautiful: if you send the process a message {new_callback, F1} then it will change its behavior, using the new callback on the next invocation.

I don't know how you would write this in Javascript. I've written quite a lot of jQuery and know how to set up and remove callbacks. But what happens if an event is triggered in the time interval between removing an event handler and adding a new one? I have no idea, and life is too short to find out. "

[9]

note, however, that in the comments to that same blog post, a commenter says that actually in JS no event can be triggered in between removing the event handler and adding the new one, because JavaScript's event model runs each task to completion: incoming events don't (logically) occur in the middle of the currently executing task.

the point seems to be, however, that the Erlang implementation handles the hard work of scheduling multiple lightweight threads, and that all the threads have to do is check their mailbox from time to time, rather than scheduling callbacks. This is probably the sort of thing that the MegaPipe paper was talking about when they said, above, "frameworks expose thread-like environments to user applications to retain the advantages of thread-based architectures, while looking like event-driven servers to the kernel to avoid the overhead of threading".
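The Erlang loop above can be transliterated into Python with a thread draining a queue as its mailbox (an illustrative sketch only; nothing here reproduces Erlang's actual scheduler): because the handler is swapped between messages, there is never a window in which a message can arrive with no handler installed.

```python
# Python transliteration of the Erlang loop/receive pattern (a sketch,
# not Erlang's scheduler). A thread drains a queue (its "mailbox");
# a ("new_callback", f) message swaps the handler, anything else is
# handled by the current callback. The swap happens *between* messages,
# so no message is ever processed without a handler in place.
import queue
import threading

mailbox = queue.Queue()
handled = []

def loop(f):
    while True:
        msg = mailbox.get()
        if msg == "stop":
            return
        if isinstance(msg, tuple) and msg[0] == "new_callback":
            f = msg[1]          # like loop(F1): continue with the new callback
        else:
            f(msg)              # like F(Msg), loop(F)

t = threading.Thread(target=loop, args=(lambda m: handled.append(("old", m)),))
t.start()
mailbox.put("a")
mailbox.put(("new_callback", lambda m: handled.append(("new", m))))
mailbox.put("b")
mailbox.put("stop")
t.join()
print(handled)                  # [('old', 'a'), ('new', 'b')]
```

The messages are processed strictly in mailbox order, so "a" hits the old callback and "b" the new one; the jQuery remove-then-add race from the quote has no analogue here.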

---

" By always writing total functions, we can be confident that our services and jobs don't fail in uncontrolled ways. "Proving of theorems" is what you end up doing as you write total functions. NonEmptyList?, ADTs, Options, Coproducts, These, singleton types, sized collections. All these things can provide powerful proof that is used to write total functions. The proof that various types provide makes it easier to write total functions as we can use Scala's powerful type system to greatly restrict which programs we even have to think about. " ---

fbonetti 22 hours ago

> For instance I have yet to see an easy and simple to use (and as such maintainable) functional widget and gui library.

I will continue to shill for Elm[1], like I always do.

[1] http://elm-lang.org/


---