notes-computer-programming-programmingLanguageDesign-gaynorWhyPythonRubyAndJavascriptAreSlow

Why Python, Ruby, and Javascript are slow

https://speakerdeck.com/alex/why-python-ruby-and-javascript-are-slow

hn discussion: https://news.ycombinator.com/item?id=5304873

reddit discussion: http://www.reddit.com/r/programming/comments/19gv4c/why_python_ruby_and_js_are_slow/c8nyejd

What follows is a summary of, and excerpts from, Alex Gaynor's talk at Waza 2013 in San Francisco. Some material is left out or rearranged, and some typos are corrected.


Lame Excuses about why they're slow

In truth, there are two things that can cause code to be slow: data structures and algorithms... Let's start with a core point: I can run a line of Python or Ruby as fast as you can run a line of C or Java. It's all about what that line does.


Hash tables

Example task: a data structure for a 3-d point

C

struct Point { 
  double x; 
  double y; 
  double z; 
};

Python

class Point(object): 
  def __init__(self, x, y, z): 
    self.x = x
    self.y = y
    self.z = z

Some obvious differences: no type declarations, you can add any other field you want to Point instances, and Point instances are always allocated by the GC on the heap. NONE OF THAT MATTERS; I WRITE COMPILERS, I'LL DEAL WITH THAT.

Python

Here's what I see people do a lot. This is a dictionary. Dictionaries are not objects, dictionaries are hash tables:

data = { "x": x, 
         "y": y, 
         "z": z, 
}

Dictionary vs. object: a dictionary is a mapping of arbitrary keys (usually of homogeneous type) to arbitrary values. An object is a thing with a fixed set of properties and behaviors. You should never confuse them; even if the performance difference didn't exist, they would still be different things.

C

std::unordered_map<std::string, double> point; 
point["x"] = x; 
point["y"] = y; 
point["z"] = z; 

If you put this in a code review, first your coworkers would laugh a lot, and then they'd make mean jokes about you for the next year. It looks ridiculous, and it is ridiculous. And it would be slow: if you mixed up hash tables and objects in C or the like, it would be idiotically slow.

Hash tables: Discussion

Why don't people care? Because on naive implementations of these languages (CPython, MRI, and most JS implementations up until about five years ago) there is no performance difference. When you have a real JIT, though, the difference is huge. People train themselves to think of a dictionary as a lightweight object, as a weird dogma. And in Javascript it's all one type, so that's a mess anyway.
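
An aside not from the talk: CPython lets you make the "fixed set of properties" intent explicit with __slots__, which replaces the per-instance dict with fixed storage. A minimal sketch:

class Point(object):
  __slots__ = ("x", "y", "z")  # fixed attribute set; no per-instance __dict__
  def __init__(self, x, y, z):
    self.x = x
    self.y = y
    self.z = z

Instances work as before, but adding a new attribute (p.w = 4) raises AttributeError and each instance is smaller: the object-not-hash-table distinction above, stated in code.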


Strings

Example task: given a string matching "\w+-\d+", return the integral part

Python

int(s.split("-", 1)[1])

This is probably how 99% of Python people would solve it: split the string on the dash, take the second part, convert it to an int. Let's talk about efficiency: split is going to allocate a list of 2 elements, allocate 2 strings, and do 2 copies, and then we convert one of the strings to an int, throwing everything else away.
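
A middle ground (my addition, not from the talk), mirroring the C approach below with a single slice. It still does one allocation and one copy, which backs up the claim that zero copies is out of reach:

int(s[s.index("-") + 1:])

This skips the 2-element list and the first-half string that split creates; only the tail after the dash is copied into a new string.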

C

atoi(strchr(s, '-') + 1) 

Finds the first instance of a '-' and converts the remainder of the string to an int. 0 allocations, 0 copies. Doing this with 0 copies is pretty much impossible in Python, and probably in Ruby and Javascript too.

Discussion: Things that take time

The JIT will take care of the accidental ones (stray allocations, copies, and hash lookups), but the ones that are a fundamental part of your algorithm, you need to deal with yourself.

The C way: BYOB

Bring Your Own Buffer. In C, basically no stdlib function allocates a string; you're always responsible for bringing a buffer to put the data into. In practice this means C programmers tend to work on a single allocated block of memory.

Example task: Read data from a file descriptor, and then print each chunk with the leading spaces removed

C

char *data = malloc(1024); 
while (true) { 
  read(fd, data, 1024); 
  char *start = data; 
  while (start < data + 1024) { 
    if (!isspace(*start)) { break; }  /* stop at the first non-space */
    start++; 
  } 
  printf("%s\n", start); 
}

This does one allocation for the whole run.

Python

import os

while True:
  data = os.read(fd, 1024)
  print data.lstrip()

Much prettier, much shorter, and much slower: it does multiple allocations per iteration of the loop. If we wanted to add a .lower() to print the data in lowercase, there would now be an extra allocation and an extra copy per iteration, whereas the C version could add that with no additional copying.
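
You can approximate BYOB in Python with a preallocated bytearray and readinto. The sketch below is mine, not from the talk, and uses Python 3 (where file objects, memoryview, and sys.stdout.buffer cooperate); fd is a readable file descriptor as in the original:

import os
import sys

buf = bytearray(1024)                    # the only buffer we ever allocate
view = memoryview(buf)                   # zero-copy window onto buf
f = os.fdopen(fd, "rb")
while True:
  n = f.readinto(buf)                    # fill the existing buffer in place
  if n == 0:
    break
  start = 0
  while start < n and buf[start] in b" \t\r\n":
    start += 1                           # skip leading whitespace, no slicing
  sys.stdout.buffer.write(view[start:n]) # memoryview slice: no copy
  sys.stdout.buffer.write(b"\n")

It is noticeably uglier than the three-line version, which is rather the point: the language makes the allocating style the pretty one.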

Example task: return an array of squares

C

long *squares(long n) { 
  long *sq = malloc(sizeof(long) * n); 
  for (long i = 0; i < n; i++) { 
    sq[i] = i * i; 
  } 
  return sq; 
}

One allocation, no copying.

Python

def squares(n): 
  sq = [] 
  for i in xrange(n): 
    sq.append(i * i) 
  return sq

No preallocation of the list, so on any given iteration we may need to resize it and copy all the data over. That's inefficient.
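
Plain CPython can at least preallocate with list multiplication: one allocation of the full list up front, then in-place assignment. A sketch of mine, not from the talk:

def squares(n):
  sq = [0] * n        # single allocation of the full-size list
  for i in xrange(n):
    sq[i] = i * i     # assign in place; no resizing, no copying
  return sq

The cost is writing the placeholder zeros first, which is why a real preallocation hint API (next section) is nicer.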


Pre-allocate APIs: one part of the solution

Don't make us (compiler authors) add heuristics!

Python, Ruby, Javascript: there's no necessary reason for these inefficiencies. It's that we're missing APIs, and we could have GREAT APIs for this. PyPy has some secret ones:

newlist_hint returns you a list that looks totally normal, but internally it's been preallocated. For building large lists this is way faster, with 0 loss of power, 0 loss of flexibility.

from __pypy__ import newlist_hint

def squares(n):
  sq = newlist_hint(n)
  for i in xrange(n):
    sq.append(i * i)
  return sq

Recap

Line for line, these languages can run as fast as C; what's slow is using hash tables where objects belong, plus allocations and copies. The JIT handles the accidental ones; the fundamental ones are the programmer's problem, and better APIs (like preallocation hints) are part of the solution.


Discussion on reddit/hn: some excerpts

On the bigness of Python

"Lua is much much simpler than Javascript, which is again simpler than Python (Python is really vast)." -- https://news.ycombinator.com/item?id=5310186

"I've been working on a range of JIT compilers, but the PyPy? is the biggest one. The problem is ... that interaction between (semantics) is a headache... descriptors, metaclassses, new/old style classes, tons of builtin types, tons of builtin modules, all of it the user will expect to seamlessly integrate with the JIT compiler." -- https://news.ycombinator.com/item?id=5331454

People who think that Gaynor is overly optimistic about JIT optimizations

DannyBee 552 days ago

Speaking as a compiler guy, and having a hand in a few successful commercial JITs: The only reason he thinks they aren't slow is because they haven't yet reached the limits of making the JIT faster vs the program faster. Yes, it's true that the languages are not slow in the sense of being able to take care of most situations through better optimization strategies. As a compiler author, one can do things like profile types/trace/whatever, and deoptimize if you get it wrong. You can do a lot. You can recognize idioms, use different representations behind people's back, etc.

But all those things take time that is not spent running your program. On average, you can do pretty well. But it's still overhead. As you get farther along in your JIT, optimization algorithms get trickier and trickier, your heuristics, more complex. You will eventually hit the wall, and need to spend more time doing JIT'ing than doing real work to make optimizations to some code. This happens to every single JIT, of course. This is why they try to figure out which code to optimize. But even then, you may find there is too much of it.

Because of this, the languages are slower; it's just the overhead of better JIT algorithms, not slower code.

evincarofautumn 551 days ago

I develop a compiler for ActionScript 3. Though it’s not a great language, it does have a few distinct advantages over its cousin JavaScript. First, type annotations. Obviously the more knowledge you have about the structure of the program, the better you can help it run well. Having a class model instead of a prototype model also helps—much as I like working with prototypes—because you can easily eliminate late binding (“hash lookups”) and allocations, two of the performance killers mentioned in the slides. The operative word there is easily. You can analyse and JIT all you want, but you cannot feasibly solve at runtime the fundamental constraints imposed by the language.

Say a compiler author needs twice as much effort and cleverness to make programs in language X run fast as for language Y. That means that—all other things being equal—implementations of X will be twice as slow as Y for the same basic quality of implementation.

People who think that Gaynor is overly pessimistic about JIT optimizations

" While I agree with the first part ("excuses"), the "hard" things mentioned in the second part are a) not that hard and b) solved issues (just not in PyPy?).

Hash tables: Both v8 and LuaJIT manage to specialize hash table lookups and bring them to similar performance as C structs (*). Interestingly, with very different approaches. So there's little reason NOT to use objects, dictionaries, tables, maps or whatever it's called in your favorite language.

(*) If you really, really care about the last 10% or direct interoperability with C, LuaJIT offers native C structs via its FFI. And PyPy has inherited the FFI design, so they should be able to get the same performance someday. I'm sure v8 has something to offer for that, too.

Allocations: LuaJIT has allocation sinking, which is able to eliminate the mentioned temporary allocations. Incidentally, the link shows how that's done for an x,y,z point class! And it works the same for ALL cases: arrays {1,2,3} (on top of a generic table), hash tables {x=1,y=2,z=3} or FFI C structs.

String handling: Same as above -- a buffer is just a temporary allocation and can be sunk, too. Provided the stores (copies) are eliminated first. The extracted parts can be forwarded to the integer conversion from the original string. Then all copies and references are dead and the allocation itself can be eliminated. LuaJIT will get all of that string handling extravaganza with the v2.1 branch -- parts of the new buffer handling are already in the git repo. I'm sure the v8 guys have something up their sleeves, too.

I/O read buffer: Same reasoning. The read creates a temporary buffer which is lazily interned to a string, ditto for the lstrip. The interning is sunk, the copies are sunk, the buffer is sunk (the innermost buffer is reused). This turns it into something very similar to the C code.

Pre-sizing aggregates: The size info can be backpropagated to the aggregate creation from scalar evolution analysis. SCEV is already in LuaJIT (for ABC elimination). I ditched the experimental backprop algorithm for 2.0, since I had to get the release out. Will be resurrected in 2.1.

Missing APIs: All of the above examples show you don't really need to define new APIs to get the desired performance. Yes, there's a case for when you need low-level data structures -- and that's why higher-level languages should have a good FFI. I don't think you need to burden the language itself with these issues.

Heuristics: Well, that's what those compiler textbooks don't tell you: VMs and compilers are 90% heuristics. Better deal with it rather than fight it.

tl;dr: The reason why X is slow is that X's implementation is slow, unoptimized or untuned. Language design just influences how hard it is to make up for it. There are no excuses."

-- mikemike, LuaJIT author, http://www.reddit.com/r/programming/comments/19gv4c/why_python_ruby_and_js_are_slow/c8nyejd
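
To make allocation sinking concrete (my illustration, not from the thread): in the function below the temporary Point never escapes, so a tracing JIT such as LuaJIT or PyPy can keep x, y, and z in registers and never allocate the object at all:

class Point(object):
  def __init__(self, x, y, z):
    self.x = x
    self.y = y
    self.z = z

def norm_sq(x, y, z):
  p = Point(x, y, z)  # born and dead inside one trace; sinkable
  return p.x * p.x + p.y * p.y + p.z * p.z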

julesjacobs 11 points 1 year ago

Very informative post. What in your opinion keeps LuaJIT's performance away from C/C++, i.e. what other "hard" things need to be solved? ...

mikemike 22 points 1 year ago

Well, nothing really. It's just a matter of engineering, i.e. man-years put into the VM and the compiler.

The amount of work and money poured into making C compilers fast, plus dealing with the abominations of C++, has been massive over the past decades. The combined effort of tuning every single dynamic language VM in existence is minuscule in comparison. And it was a mostly dormant field until a few years ago. ...

Replies arguing that the previous comment is overly optimistic

gsg_ 8 points 1 year ago

You're playing down the role that the design of the language plays in reducing the effort needed to produce workable compilers. A C compiler doesn't need to do escape analysis to allocate a struct on the stack, doesn't need lifetime analysis to safely 'inline' a member into its containing type, doesn't need to eliminate bounds checking, doesn't need strong closure conversion, representation analysis, a clever GC, etc. ...

oridb 9 points 1 year ago

You still have to pay for things like PIC indirections and type checks, and unfortunately, without whole program analysis (ie, no loaded modules) there's no good way to eliminate any of those. Dynamic language features cost you, unless the compiler can do an analysis to demonstrate that things aren't actually dynamic. ...

fijal 3 points 1 year ago

Hi Mike.

PyPy also has allocation sinking (or escape analysis, or "virtuals" as we call them). As for the rest of the stuff: yes, we're also working on this. Yes, you can go a long way without extra APIs etc., but sometimes the info simply isn't there, because it's only in the programmer's head. ...

another example of optimization in Python

moreati 552 days ago

Great presentation, thank you for making me aware of an aspect of Python performance. One slide struck me as odd: the "basically pythonic" squares() function. I understand it's an example chosen to illustrate a point; I just hope people aren't writing loops like that. You inspired me to measure it:

    $ cat squares.py
    def squares_append(n):
        sq = []
        for i in xrange(n):
            sq.append(i*i)
        return sq
    def squares_comprehension(n):
        return [i*i for i in xrange(n)]
    $ PYTHONPATH=. python -m timeit -s "from squares import squares_append" "squares_append(1000)"
    10000 loops, best of 3: 148 usec per loop
    $ PYTHONPATH=. python -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
    10000 loops, best of 3: 74.1 usec per loop
    $ PYTHONPATH=. pypy -m timeit -s "from squares import squares_append" "squares_append(1000)"
    10000 loops, best of 3: 46.9 usec per loop
    $ PYTHONPATH=. pypy -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
    100000 loops, best of 3: 8.67 usec per loop

I'm curious to know how many allocations/copies a list comprehension saves in CPython/PyPy. However, I wouldn't begin to know how to measure it.

ericmoritz 551 days ago

If you really want power, use NumPy:

    from numpy import arange
    def squares_numpy(n):
        a = arange(n)
        return a * a
    $ python -m timeit -s "from squares import squares_append" "squares_append(1000)"
    10000 loops, best of 3: 130 usec per loop
    $ python -m timeit -s "from squares import squares_comprehension" "squares_comprehension(1000)"
    10000 loops, best of 3: 95.4 usec per loop
    $ python -m timeit -s "from squares import squares_numpy" "squares_numpy(1000)"
    100000 loops, best of 3: 5.31 usec per loop

More features similar to preallocation of lists

Zak 552 days ago

The creators of Common Lisp knew what Alex is talking about. Lisp is, of course, just as dynamic as Ruby, Python or Javascript, but it exposes lower-level details about data structures and memory allocation if the programmer wants them.

Features that come to mind include preallocated vectors (fixed-size or growable), non-consing versions of the standard list functions and the ability to bang on most any piece of data in place. There are fairly few situations in which a CL program can't come within a factor of 2 or 3 of the performance of C.

riobard 552 days ago

Completely agree. APIs are so important for many optimizations to pull off.

I'd really like to use a lot more buffer()/memoryview() objects in Python. Unfortunately many APIs (e.g. sockets) won't work well with them (at least in Python 2.x; not sure about 3.x).

So we ended up with tons of unnecessary allocation and copying all over the place. So sad.

https://docs.python.org/2/c-api/buffer.html https://docs.python.org/2/library/stdtypes.html#memoryview
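
To illustrate riobard's point (my sketch, works in CPython 2.7 and 3.x): a memoryview slice references the underlying buffer instead of copying it, which is exactly the copy that string-returning APIs force on you:

data = bytearray(b"  hello world")
view = memoryview(data)
tail = view[2:]                 # no new string, no copy of the payload
data[2] = ord("H")              # mutate the underlying buffer...
assert tail.tobytes() == b"Hello world"  # ...the view sees it (tobytes copies, but only for this check)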