
proj-oot-ootNotes20

Notes on the article "Why is Python slow?", on a related presentation, and on the HN discussion about it:

http://blog.kevmod.com/2016/07/why-is-python-slow/
https://docs.google.com/presentation/d/10j9T4odf67mBSIJ7IDFAtBMKP-DpxPCKWppzoIfmbfQ/edit#slide=id.g142ccf5e49_0_432
https://lwn.net/Articles/691243/
https://news.ycombinator.com/item?id=12025309

The article is about why Python is slower than JS and Lua.

" Python spends almost all of its time in the C runtime ... Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them. ... Pyston's performance improvements come from speeding up the C code, not the Python code. ... why is the Python C runtime slow? ... This is a for loop written in Python, but that doesn't execute any Python bytecodes:

import itertools
sum(itertools.repeat(1.0, 100000000))

The amazing thing about this is that if you write the equivalent loop in native JS, V8 can run it 6x faster than CPython. In the talk I mistakenly attributed this to boxing overhead, but Raymond Hettinger kindly pointed out that CPython's sum() has an optimization to avoid boxing when the summands are all floats (or ints). So it's not boxing overhead, and it's not dispatching on tp_as_number->tp_add to figure out how to add the arguments together.

My current best explanation is that it's not so much that the C runtime is slow at any given thing it does, but it just has to do a lot. In this itertools example, about 50% of the time is dedicated to catching floating point exceptions. The other 50% is spent figuring out how to iterate the itertools.repeat object, and checking whether the return value is a float or not. All of these checks are fast and well optimized, but they are done every loop iteration so they add up. A back-of-the-envelope calculation says that CPython takes about 30 CPU cycles per iteration of the loop, which is not very many, but is proportionally much more than V8's 5. ... If JS/Lua can be fast why don't the Python folks get their act together and be fast?

Python is a much, much more dynamic language than even JS. Fully talking about that would probably take another blog post, but I would say that the increase in dynamicism from JS->Python is larger than the increase going from Java->JS.

Why don't we rewrite the C runtime in Python and then JIT it? ... If you're going to rewrite the runtime into another language, I don't think Python would be a very good choice. There are just too many warts/features in the language, so even if you could somehow get rid of 100% of the dynamic overhead I don't think you'd end up ahead. ... "

my note: the article argues that because most of the time is spent executing the opcodes rather than in interpreter overhead (because the opcodes are complex rather than simple), there will be little gain from things like JITing, and most gains will come from optimizing the opcode implementations. But some HN comments point out that JITs can do optimizations that span multiple opcodes, so this isn't necessarily true. What I'm more interested in, though, is how design choices in the Python language affect this.
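my note: a minimal sketch (my own, not from the article) to time that loop on a smaller N and estimate the per-iteration cost:

    # minimal sketch: time the article's loop and estimate ns per iteration
    import itertools
    import timeit

    N = 10_000_000  # smaller than the article's 100M so it finishes quickly
    secs = timeit.timeit(lambda: sum(itertools.repeat(1.0, N)), number=1)
    print(f"{secs:.3f} s total, ~{secs / N * 1e9:.1f} ns per iteration")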

" Lots of expensive features Dynamic overhead of the runtime

What is it then?

((examples of 'features')) Support not calling PyType?_Ready Strong signal handling guarantees Tracebacks (including frame info) for every exception sys.exc_info() / 0-arg raise isinstance() override checking Predictable deallocation (prevents fast refcounting schemes) "
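my note: a small illustration (mine) of two of those features; keeping sys.exc_info() and 0-arg raise working means the runtime must carry full frame/traceback state for every exception:

    import sys
    import traceback

    def f():
        try:
            1 / 0
        except ZeroDivisionError:
            print(sys.exc_info()[0])  # the in-flight exception is always queryable
            raise                     # 0-arg raise: re-raises with the original traceback

    try:
        f()
    except ZeroDivisionError:
        traceback.print_exc()         # traceback still points at the original 1 / 0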

" Raymond Hettinger wondered about the tradeoffs involved. There is a lot of checking for things like signals and recursion depth in the C runtime that have a fairly high cost. Would it make sense to allow those to be relaxed or eliminated under some circumstances? "

gopalv 3 days ago

...Python is extremely dynamic and this makes things hard for someone who wants to build a JIT.

The powerful bits of Python metaprogramming make it practically impossible for a JIT to say with any certainty, across all running threads, that what it is doing is right.

Inlining a simple call like a.x() is rather hard when everything underneath can move around - I am not saying that it always does, but implementing a python variant which is nearly the same isn't very useful.

Compare this to PHP, which has fixed method calls (unless you use runkit, which you shouldn't) - a->x() will always be the same method as long as there was an a->x which was valid.

The method will never change once it has been validated.

However, unlike Java, both languages end up not knowing exactly what type "a" will be when the method is being called.

Java also doesn't quite know, but only when the invoke is via an interface. But the engine at least knows exactly how many impls of that interface have been loaded so far (and the bi-morphic case is commonly 1 real impl and 1 mock impl).

But both in the case of PHP and Python, the whole idea of "which object do I have to look up ::x() for?" is an unknown. In PHP's case, you have to look it up once per class encountered, and in Python's case, you have to verify someone hasn't replaced it at runtime.

There are very nice functional ways around this problem at the bottom end of numeric loops for Python, which makes it great for numeric processing interleaved with generic control flow.
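my note: gopalv's a.x() point in concrete form (my sketch); any code, including another thread, can rebind the method between calls, so an inlined body can silently become stale:

    class A:
        def x(self):
            return "original"

    a = A()
    print(a.x())          # original

    def patched(self):
        return "patched"

    A.x = patched         # legal at any time, from any thread
    print(a.x())          # patched -- a JIT that inlined the old body must deoptimize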

...

jcranmer 3 days ago

JavaScript has the same problems, and that hasn't stopped every major JS engine from building a JIT.

reply

madgar 3 days ago

JS is single-threaded which makes an enormous difference to actually squeezing performance out of your JIT.

Just building a JIT for a language generally isn't the hard part. Building a JIT that is substantially faster than a bytecode-compiled implementation of the language is what's hard, and how hard that is depends intimately on the semantics of the source language. When I say intimately, I mean every single detail of the language's semantics matter.

reply

"

" ... python opcodes are inefficient because they must support a significant amount of dynamicity, even as opposed to some other language peers like JS, PHP, Lua, etc.

...

> Does type hinting in Python 3.x (PEP 484) change this?

No. The type hints are only meant for tooling (type checking, refactoring, documentation, etc.).

reply

"It's because of advanced JITs that Java got a lot faster."

" When you have fat opcodes you're SOL.

Consider this: an instruction like STORE_ATTR (implements obj.name = value) has to check whether the top of the stack refers to an object, then whether writing to the object attributes is allowed. Then "name" has to be checked to see whether it's a string. Perhaps additional checking is needed to normalize the string, or other unicode testing. Then the assignment has to happen, which is a dictionary update (internally a hash map insertion which may trigger a resize). This is just the tip of the iceberg. A lot more stuff is happening, and the correct exceptions are thrown when the instruction is used incorrectly (which leads to code bloat which hurts the instruction cache).

A thin bytecode instruction for STORE_ATTR could actually reduce the store to a single LEA machine code instruction (Load Effective Address).

The downside of a language with a thin instruction set is that the individual instructions can't validate their input. They have to trust that the compiler did its job correctly, a segfault or memory corruption will happen otherwise. One of Guido's goals when creating Python was that the runtime should never ever crash, even on nonsensical input. This pretty much rules out a thin instruction set right from the start. Simple concurrency is also a lot easier with a fat instruction set, because the interpreter can just yield in between instructions (Global interpreter lock). With a thin instruction set there is no clear delineation between instructions where all memory is guaranteed to be in a stable state. So a different locking model is needed for multi-threading, which adds even more complexity to the compiler and runtime. "
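my note: a rough Python-level sketch (mine, illustrative only; the real thing is type-slot-based C) of the work a fat STORE_ATTR implies:

    # illustrative sketch of obj.name = value semantics, not CPython's actual code
    def store_attr(obj, name, value):
        if not isinstance(name, str):               # attribute names must be strings
            raise TypeError("attribute name must be a string")
        descr = None
        for klass in type(obj).__mro__:             # one hash lookup per class
            if name in klass.__dict__:
                descr = klass.__dict__[name]
                break
        if descr is not None and hasattr(type(descr), "__set__"):
            type(descr).__set__(descr, obj, value)  # data descriptor wins
        elif hasattr(obj, "__dict__"):
            obj.__dict__[name] = value              # dict insert; may trigger a resize
        else:
            raise AttributeError(f"can't set attribute {name!r}")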

chrisseaton 3 days ago

All the problems you're describing are solved with a powerful JIT. And the core team do seem to be opposed to doing the work needed for that.

reply

gizmo 3 days ago

Python's philosophy chooses simplicity over everything else. Simple grammar. Simple bytecode instruction set. Simple CPython implementation. Simple threading model (GIL). Simple data structures. Adding a highly sophisticated and complicated JIT on top of that makes little sense.

It's not so difficult to create a high performance language that's much like Python. It's just not possible to make Python fast without abandoning some of its founding principles.

reply

"the Hotspot JVM has added intrinsics for memset, autovectorization, etc."

pjmlp 3 days ago

Lisp and Smalltalk like languages run circles around Python and Perl performance.

Enjoy the same powerful features, have JIT and AOT compilers to native code.

It all boils down to how much the language designers care about performance.

reply

dkersten 2 days ago

Lua is almost as dynamic and flexible as Python and is very easy for the programmer. LuaJIT's performance is close to that of C.

reply

nameless912 15 hours ago

I think part of the reason for this, though, is that Lua is a very "thin" language. It purposefully doesn't have a lot of fancy features (I mean, come on, it literally has one composite datatype [tables], and you're supposed to use it for everything from arrays to maps to full objects). By removing a lot of assumptions that make Python "easy", Lua has made it much easier to write optimizing JITs and compilers, since runtime behavior is (perhaps a little paradoxically) easier to predict. And it doesn't make Lua any less "easy" a language, it's just a different set of rules to follow. Guido's hunt for the simplest programming language possible is noble, but sometimes it causes poor decisions to be made imho.

reply

eva1984 3 days ago ((this comment modded down))

Well, just want to give one example here of why Python is harder to speed up than JS.

In Python, essentially every object is a dictionary. And behavior like "x.y = 'z'" is legal, meaning pretty much anything can be put into object x without any declaration ahead of time. While (not very sure) JS doesn't allow such behavior: you can still do the assignment, but accessing a missing property will yield undefined.

The above behavior being the default (you can use __slots__, for sure, but the compiler can't assume that) is pretty problematic for optimization: the compiler simply won't know where to locate a specific attribute ahead of time, so it cannot replace the attribute access with a pointer offset; it has to go through a hash lookup, which compared to the offset alternative is an order of magnitude slower.

reply
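my note: the default-vs-__slots__ contrast the comment is pointing at (my example):

    class Open:
        pass

    o = Open()
    o.y = "z"                 # legal with no declaration: lands in o.__dict__ (hash map)

    class Slotted:
        __slots__ = ("y",)    # fixed layout: access can compile to an offset

    s = Slotted()
    s.y = "z"                 # fine
    try:
        s.z = 1               # no such slot
    except AttributeError as e:
        print(e)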

trishume 3 days ago

Except that LuaJIT does something close to just a pointer offset. The trick it uses is a hash function that is known by the compiler, so when you have a constant key ("y" in this case) the hash table store is compiled down to 3 instructions (imagine hash("y") = 0xa7):

    1. Load whatever is necessary to check if x[0xa7] is occupied
    2. Branch to the slow path if occupied
    3. Store x[0xa7] = 'z'

And a modern super-scalar CPU will even do some of this in parallel.

reply
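my note: a toy Python sketch (mine; names hypothetical) of that constant-key specialization; the real LuaJIT fast path is a few machine instructions, not Python:

    SLOTS = 64
    Y_SLOT = hash("y") % SLOTS              # the key is a compile-time constant

    def store_y_fast(keys, vals, value):
        # fast path: one load, one branch, one store
        if keys[Y_SLOT] not in (None, "y"):
            return store_slow_path(keys, vals, "y", value)
        keys[Y_SLOT] = "y"
        vals[Y_SLOT] = value

    def store_slow_path(keys, vals, key, value):
        # fallback: linear probing, as a plain hash table would
        for i in range(SLOTS):
            j = (hash(key) + i) % SLOTS
            if keys[j] in (None, key):
                keys[j], vals[j] = key, value
                return
        raise MemoryError("table full; a real table would resize")

    keys, vals = [None] * SLOTS, [None] * SLOTS
    store_y_fast(keys, vals, "z")           # takes the fast path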

" Yes, Python semantics allow for lots of introspection, and that's expensive — but so did Smalltalk to a large extent. Yet people managed, for instance, to not allocate stack frames on the heap unless needed (I'm pointing vaguely in the direction of JIT compilers for Smalltalk and Self, though by now I forgot those details). "

---

great slides on JVM:

http://www.azulsystems.com/blog/wp-content/uploads/2011/03/2011_WhatDoesJVMDo.pdf

---

Python vs Java vs PHP memory:

" Python runs in very low memory...Not because of JIT, but because of refcounting GC....Python is not really memory prudent either. PHP is much better in that regard (very fast startup time, fast page render times, but a rather different approach). "

---

" Fast VMs for dynamically typed languages (Chakra, V8) design the object model and the runtime around being able to spend as much time in JITted code as possible - the "fast path". "

---

chrisseaton 3 days ago

I think it's true that language implementations such as Ruby and Python spend most of their time running the C parts of the code. I did a talk saying the same thing about Ruby a couple of weeks ago, but referring to the Java code in JRuby, https://ia601503.us.archive.org/32/items/vmss16/seaton.pdf.

But this doesn't mean that a JIT is not going to help you. It means that you need a more powerful JIT which can optimise through this C code. That may mean that you need to rewrite the C in a managed language such as Java or RPython which you can optimise through (which we know works), or maybe we could include the LLVM IR of the C runtime and make that accessible to the JIT at runtime (which is a good idea, but we don't know if it's practical).

...

---

brianwawok 3 days ago

Pypy makes your app take many times the memory for like 20% perf. Which is good but seems maybe often not worth the effort.

reply

---

beagle3 3 days ago

(based on discussion, can't get to website)

Any discussion that does not compare to LuaJIT2 is suspect in its conclusions. On the surface, Lua is almost as dynamic as Python, and LuaJIT2 is able to do wonders with it.

Part of the problem with Python (that Lua doesn't share) is that things you use all the time can potentially shift between two loop iterations, e.g.

    for x in os.listdir(PATH):
        y = os.path.join(PATH, x)
        process_file(y)

There is no guarantee that "os.path" (and thus "os.path.join") called by one iteration is the same one called by the next iteration - process_file() might have modified it.

It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration. I'm not sure why it isn't anymore - it would also likely help PyPy and Pyston produce better code.

This is by no means the only thing that makes Python slower - my point is that if one looks for inherently slow issues that cannot be solved by an implementation, comparison with Lua/LuaJIT2 is a good way to go.

reply

dagw 3 days ago

It used to be common practice to cache useful routines (e.g. start with "os_path_join = os.path.join" before the loop and call "os_path_join" instead of "os.path.join"), thus avoiding the iterative lookup on each iteration.

I'll admit to being skeptical to how big an effect this would have, but it was bigger than I thought:

  %%timeit
  s=0
  for i in itertools.repeat(1.0, 1000000):
      s+=math.sin(i)
  10 loops, best of 3: 124 ms per loop

vs

  %%timeit
  s=0
  math_sin=math.sin
  for i in itertools.repeat(1.0, 1000000):
      s+=math_sin(i)
  10 loops, best of 3: 89.1 ms per loop

reply

fovc 3 days ago

Not sure how this timeit is running, but there are two things going on: local name lookups are done as array indexing, whereas global lookups are done as cascading dictionary lookups. And the attribute lookup is also a dictionary search. On instances, looking up a method defined in a superclass involves failed lookups in the instance and all the base classes before the defining one. In a hot inner loop, definitely worth cutting out.

reply
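my note: dis makes fovc's point visible (my example); the cached name compiles to LOAD_FAST (an array index) instead of LOAD_GLOBAL + LOAD_ATTR (dict lookups):

    import dis
    import math

    def slow(x):
        return math.sin(x)           # LOAD_GLOBAL math; LOAD_ATTR sin

    def fast(x, math_sin=math.sin):  # bind once, at definition time
        return math_sin(x)           # LOAD_FAST: index into the frame's locals

    dis.dis(slow)
    dis.dis(fast)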

---

trishume 3 days ago

Except that Lua has almost the same problem, and Mike Pall solved it in LuaJIT. So did the Truffle/Graal team with JRuby+Truffle.

Lua has the same property that global methods are hash table lookups. What LuaJIT does is that it specializes hash table lookups to the key, so that looking up an item in a hash table is two instructions which can be executed at the same time on a super-scalar CPU.

LuaJIT also gets serious wins from lifting the lookups out of loops since they can't change. This is only due to the fact that Lua doesn't have threads that can muck with each other's state. JRuby+Truffle solves this with multithreading by assuming that it doesn't change and deoptimizing into a slow path if it detects a change.

reply

trishume 3 days ago

Idea: I wonder if it would be profitable to have a mode for Pyston/PyPy/ZiPy/Whatever that disables threads, or perhaps an annotation that stops other threads from being scheduled inside certain loops.

This would allow so many checks to be hoisted out of the loop like LuaJIT does, that otherwise can't be hoisted because in theory some thread could re-assign os.path in the middle of it. If this could provide a 2-5x speedup, in most scenarios it would outweigh the performance benefit of parallelism, as well as being easier than parallelizing.

reply

xamuel 3 days ago

There isn't even any guarantee that "os.path" is the vanilla iterative lookup it's presented as. For all we know, it could be a @property-decorated method thousands of lines long.

reply

jblow 3 days ago

Due to aliasing problems, you have exactly this issue in C and C++ as well, yet those languages are both massively faster than the languages in question.

reply

kayamon 3 days ago

While true, an aliasing 'miss' in C just causes a pointer reload, while the same miss in Python causes a hash table lookup.

reply

beagle3 3 days ago

it causes much more than a hash table lookup: it causes an attribute lookup, which is ...

     1. an instance __dict__ lookup, 
     2. a __getattr__ call (itself involving another hash lookup)
     3. a class property lookup
     4. superclass property lookups (if there are any superclasses)

If it was just a hash table lookup, it would be LuaJIT2 fast. But it's more like 10 hash table lookups and indirections, and that's assuming you don't actually try to hook any of this machinery to do something.

reply

pcwalton 3 days ago

A C compiler doesn't have to assume that the definitions of functions called directly can change. LLVM couldn't optimize calls to memcpy if it couldn't assume its semantics, just to name one example.

reply

---

Animats 3 days ago

Poor article. This subject has been covered many times, and others have put in the key references.

Python's dynamism is part of the problem, of course. Too much time is spent looking up objects in dictionaries. That's well known, and there are optimizations for that.

Python's concurrency approach is worse. Any thread can modify any object in any other thread at any time. That prevents a wide range of optimizations. This is a design flaw of Python. Few programs actually go mucking with stuff in other threads; in programs that work, inter-thread communication is quite limited. But the language neither knows that nor provides tools to help with it.

This is a legacy from the old C approach to concurrency - concurrency is an OS issue, not a language issue. C++ finally dealt with that a little, but it took decades. Go deals with it a little more, but didn't quite get it right; Go still has race conditions. Rust takes it seriously and finally seems to be getting it right. Python is still at the C level of concurrency understanding.

Except, of course, that Python has the Global Interpreter Lock. Attempts to eliminate it result in lots of smaller locks, but not much performance increase. Since the language doesn't know what's shared, lots of locking is needed to prevent the primitives from breaking.

It's the combination of those two problems that's the killer. Javascript has almost as much dynamism, but without extreme concurrency, the compiler can look at a block of code and often decide "the type of this can never change".

reply

EE84M3i 3 days ago

Note that threading is not the only place this can happen: it can happen as the result of signal handlers as well. CPython can trigger signal handlers between any two Python opcodes.

reply

---

"

Why is Python so slow?

We alluded to this yesterday, but languages tend to have a compromise between convenience and performance.

    C, Fortran, etc.: static typing and compiled code leads to fast execution
        But: lots of development overhead in declaring variables, no interactive prompt, etc.
    Python, R, Matlab, IDL, etc.: dynamic typing and interpreted execution leads to fast development
        But: lots of execution overhead in dynamic type-checking, etc.

We like Python because our development time is generally more valuable than execution time. But sometimes speed can be an issue.

Strategies for making Python fast

    Use Numpy ufuncs to your advantage
    Use Numpy aggregates to your advantage
    Use Numpy broadcasting to your advantage
    Use Numpy slicing and masking to your advantage
    Use a tool like SWIG, cython or f2py to interface to compiled code.

Here we'll cover the first four, and leave the fifth strategy for a later session.

" [1]

---

"Why Python is Slow...it boils down to Python being a dynamically typed, interpreted language, where values are stored not in dense buffers but in scattered objects."

[2]

---

makecheck 3 days ago

Python has had some surprising costs but some well-known cases have straightforward work-arounds.

For example, at one point, the way you called a multi-dot function actually mattered (not sure if this persists in the latest Python releases). If your function is like "os.path.join", the interpreter would seem to incur a cost for each dot: once to look up "os.path" and once to look up "path.join". This meant that there was a noticeable difference between calling it directly several times, and calling it only through a fully resolved value (e.g. say "path_join = os.path.join" and call it only as "path_join" each time).

Another good option is to use the standard library’s pure-C alternatives for common purposes (containers, etc.) if they fit your needs.

reply

---

saynsedit 3 days ago

What's expensive about the runtime is the redundant type/method-dispatch not the opcode dispatch. The runtime is constantly checking types and looking up methods in hash tables.

Gains can be made by "inter-bytecode" optimization in the same vein as inter-procedural optimization.

If you can prove more assumptions about types between the execution of sequential opcodes, you can remove more type checks and reduce the amount of code that must be run and make it faster.

E.g.:

    01: x = a + b
    02: y = x + c

If we have already computed that a and b are Python ints, then we can assume that x is a Python int. To execute line 2, we just then need to compute the type of c, thus saving time.

The larger the sequence of opcodes you are optimizing over, the better gains you can make.

Right now I think Pyston uses a traditional inline-cache method. This only works for repeated executions. The code must eagerly fill the inline cache of each opcode to get the speed I'm talking about.

Another reason Python's runtime is slow is because there is no such thing as undefined behavior, and programming errors are always checked for. E.g. the sqrt() function always checks whether its argument is >= 0, even though using it incorrectly is a programming error and should never happen. This can't be fixed by a compiler project; it's a problem at the language level.
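my note: the sqrt() check the comment describes (my example); CPython validates and raises rather than leaving the result undefined:

    import math

    try:
        math.sqrt(-1.0)
    except ValueError as e:
        print(e)    # "math domain error" -- validated on every call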

Being LLVM based, I think Pyston has its greatest potential for success as an ahead-of-time mypy compiler (with type information statically available). IMO leave the JITing to PyPy