Notes on an article "Why is Python slow?" and on a related presentation and on the HN discussion about it:
http://blog.kevmod.com/2016/07/why-is-python-slow/ https://docs.google.com/presentation/d/10j9T4odf67mBSIJ7IDFAtBMKP-DpxPCKWppzoIfmbfQ/edit#slide=id.g142ccf5e49_0_432 https://lwn.net/Articles/691243/ https://news.ycombinator.com/item?id=12025309
The article is about why Python is slower than JS and Lua.
" Python spends almost all of its time in the C runtime ... Another way of saying this is that Python opcodes are very complex, and the cost of executing them dwarfs the cost of dispatching them. ... Pyston's performance improvements come from speeding up the C code, not the Python code. ... why is the Python C runtime slow? ... This is a for loop written in Python, but that doesn't execute any Python bytecodes:
import itertools
sum(itertools.repeat(1.0, 100000000))
The amazing thing about this is that if you write the equivalent loop in native JS, V8 can run it 6x faster than CPython. In the talk I mistakenly attributed this to boxing overhead, but Raymond Hettinger kindly pointed out that CPython's sum() has an optimization to avoid boxing when the summands are all floats (or ints). So it's not boxing overhead, and it's not dispatching on tp_as_number->tp_add to figure out how to add the arguments together.
My current best explanation is that it's not so much that the C runtime is slow at any given thing it does, but it just has to do a lot. In this itertools example, about 50% of the time is dedicated to catching floating point exceptions. The other 50% is spent figuring out how to iterate the itertools.repeat object, and checking whether the return value is a float or not. All of these checks are fast and well optimized, but they are done every loop iteration so they add up. A back-of-the-envelope calculation says that CPython takes about 30 CPU cycles per iteration of the loop, which is not very many, but is proportionally much more than V8's 5. ... If JS/Lua can be fast why don't the Python folks get their act together and be fast?
Python is a much, much more dynamic language than even JS. Fully talking about that would probably take another blog post, but I would say that the increase in dynamism from JS->Python is larger than the increase going from Java->JS.
Why don't we rewrite the C runtime in Python and then JIT it? ... If you're going to rewrite the runtime into another language, I don't think Python would be a very good choice. There are just too many warts/features in the language, so even if you could somehow get rid of 100% of the dynamic overhead I don't think you'd end up ahead. ... "
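The per-iteration cost quoted above can be reproduced roughly from pure Python. A minimal sketch (the iteration count is scaled down from the quoted 100000000 so it finishes quickly; the exact ns/iteration you see depends on your machine and CPython version):

```python
import itertools
import timeit

# Scaled down from the quoted 100000000 iterations.
N = 1_000_000

def c_loop():
    # No Python bytecode runs per iteration: both sum() and
    # itertools.repeat() are implemented in C.
    return sum(itertools.repeat(1.0, N))

seconds = timeit.timeit(c_loop, number=1)
print(f"~{seconds / N * 1e9:.0f} ns per iteration")
```

On a ~3 GHz machine, 30 cycles per iteration would show up as roughly 10 ns/iteration here.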
my note: the article thinks that b/c most of the time is spent executing the OP codes rather than in interpreter overhead (b/c the opcodes are complex rather than simple), there will be little gains to stuff like JITing, and most gains will come from optimizing the opcode implementations, but some HN comments say that JITs etc can do optimizations that span multiple opcodes so this isn't necessarily true. What i'm more interested in though is how design choices in the Python language affect this stuff.
" Lots of expensive features
Dynamic overhead of the runtime
What is it then?
((examples of 'features'))
Support not calling PyType_Ready
Strong signal handling guarantees
Tracebacks (including frame info) for every exception
sys.exc_info() / 0-arg raise
isinstance() override checking
Predictable deallocation (prevents fast refcounting schemes) "
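One of the listed features, tracebacks with frame info for every exception, is observable from pure Python. A sketch (the `raiser` helper is just an illustrative name):

```python
import sys

def raiser():
    raise ValueError("boom")

# Every raised exception carries a traceback with full frame info,
# even when it is caught and discarded immediately -- one of the
# per-exception costs the slide lists.
try:
    raiser()
except ValueError:
    exc_type, exc_value, tb = sys.exc_info()
    frames = []
    while tb is not None:
        # Each traceback level records a frame object and a line number.
        frames.append(tb.tb_frame.f_code.co_name)
        tb = tb.tb_next

print(frames)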
" Raymond Hettinger wondered about the tradeoffs involved. There is a lot of checking for things like signals and recursion depth in the C runtime that have a fairly high cost. Would it make sense to allow those to be relaxed or eliminated under some circumstances? "
gopalv 3 days ago ...Python is extremely dynamic and this makes things hard for someone who wants to build a JIT.
The powerful bits of Python metaprogramming make it practically impossible for a JIT to say with any certainty, across all running threads, that what it is doing is right.
Inlining a simple call like a.x() is rather hard when everything underneath can move around - I am not saying that it always does, but implementing a Python variant which is nearly the same isn't very useful.
Compare this to PHP, which has fixed method calls (unless you use runkit, which you shouldn't) - a->x() will always be the same method as long as there was an a->x which was valid.
The method will never change once it has been validated.
However unlike Java, both languages end up not knowing exactly what type "a" will be when the method is being called.
Java also doesn't quite know, but only when the invoke is via an interface. But the engine at least knows exactly how many impls of that interface have been loaded so far (and the bi-morphic case is commonly 1 real impl and 1 mock impl).
But in the case of both PHP and Python, the whole idea of "which object do I have to look up ::x() on?" is an unknown. In PHP's case, you have to look it up once per class encountered, and in Python's case, you have to verify someone hasn't replaced it at runtime.
There are very nice functional ways around this problem at the bottom end of numeric loops for Python, which makes it great for numeric processing interleaved with generic control flow.
...
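The lookup problem gopalv describes takes only a few lines to demonstrate. A sketch with a hypothetical class `A`, mirroring the a.x() example; any JIT that inlined the call would need a guard or invalidation scheme for exactly this:

```python
import types

class A:
    def x(self):
        return "original"

a = A()
assert a.x() == "original"

# The method can be swapped on the class at any time, possibly from
# another thread, so an inlined a.x() call site becomes stale.
A.x = lambda self: "class patch"
assert a.x() == "class patch"

# It can even be shadowed per instance.
a.x = types.MethodType(lambda self: "instance patch", a)
assert a.x() == "instance patch"
```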
jcranmer 3 days ago
JavaScript has the same problems, and that hasn't stopped every major JS engine from building a JIT.
reply
madgar 3 days ago
JS is single-threaded which makes an enormous difference to actually squeezing performance out of your JIT.
Just building a JIT for a language generally isn't the hard part. Building a JIT that is substantially faster than a bytecode-compiled implementation of the language is what's hard, and how hard that is depends intimately on the semantics of the source language. When I say intimately, I mean every single detail of the language's semantics matter.
reply
"
" ... python opcodes are inefficient because they must support a significant amount of dynamicity, even as opposed to some other language peers like JS, PHP, Lua, etc.
...
> Does type hinting in Python 3.x (PEP 484) change this
No. The type hints are only meant for tooling (type checking, refactoring, documentation, etc.).
reply
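That answer is easy to confirm: PEP 484 hints are stored on the function object but never enforced by the interpreter. A sketch:

```python
def double(x: int) -> int:
    return x * 2

# The hints are recorded on the function object...
assert double.__annotations__ == {"x": int, "return": int}

# ...but CPython never checks them at runtime, so this
# "mistyped" call runs without complaint.
assert double("ab") == "abab"
```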
"It's because of advanced JITs that Java got a lot faster."
" When you have fat opcodes you're SOL.
Consider this: an instruction like STORE_ATTR (implements obj.name = value) has to check whether the top of the stack refers to an object, then whether writing to the object attributes is allowed. Then "name" has to be checked if it's a string. Perhaps additional checking is needed to normalize the string or other unicode testing. Then the assignment has to happen which is a dictionary update (internally a hash map insertion which may trigger a resize). This is just the tip of the iceberg. A lot more stuff is happening and the correct exceptions are thrown when the instruction is used incorrectly (which leads to code bloat which hurts the instruction cache).
A thin bytecode instruction for STORE_ATTR could actually reduce the store to a single LEA machine code instruction (Load Effective Address).
The downside of a language with a thin instruction set is that the individual instructions can't validate their input. They have to trust that the compiler did its job correctly; otherwise a segfault or memory corruption will happen. One of Guido's goals when creating Python was that the runtime should never ever crash, even on nonsensical input. This pretty much rules out a thin instruction set right from the start. Simple concurrency is also a lot easier with a fat instruction set, because the interpreter can just yield in between instructions (Global interpreter lock). With a thin instruction set there is no clear delineation between instructions where all memory is guaranteed to be in a stable state. So a different locking model is needed for multi-threading, which adds even more complexity to the compiler and runtime. "
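The fat opcode in question is easy to inspect with the standard dis module. A sketch (the `store` helper is just an illustrative name):

```python
import dis

def store(obj, value):
    obj.name = value

# The attribute assignment compiles to a single fat STORE_ATTR
# instruction; all the checks described above (object check,
# writability, name validation, dict update) happen inside its
# C implementation.
dis.dis(store)

opnames = {ins.opname for ins in dis.get_instructions(store)}
assert "STORE_ATTR" in opnames
```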
chrisseaton 3 days ago
All the problems you're describing are solved with a powerful JIT. And the core team do seem to be opposed to doing the work needed for that.
reply
gizmo 3 days ago
Python's philosophy chooses simplicity over everything else. Simple grammar. Simple bytecode instruction set. Simple CPython implementation. Simple threading model (GIL). Simple data structures. Adding a highly sophisticated and complicated JIT on top of that makes little sense.
It's not so difficult to create a high performance language that's much like Python. It's just not possible to make Python fast without abandoning some of its founding principles.
reply
"the Hotspot JVM has added intrinsics for memset, autovectorization, etc."
pjmlp 3 days ago
Lisp- and Smalltalk-like languages run circles around Python and Perl performance.
Enjoy the same powerful features, have JIT and AOT compilers to native code.
It all boils down to how much the language designers care about performance.
reply
dkersten 2 days ago
Lua is almost as dynamic and flexible as Python and is very easy for the programmer. LuaJIT's performance is close to that of C.
reply
nameless912 15 hours ago
I think part of the reason for this, though, is that Lua is a very "thin" language. It purposefully doesn't have a lot of fancy features (I mean, come on, it literally has one composite datatype, and you're supposed to use them [tables] for everything from arrays to maps to full objects). By removing a lot of assumptions that make Python "easy", Lua has made it much easier to write optimizing JITers and compilers since runtime behavior is (perhaps a little paradoxically) easier to predict. And it doesn't make Lua any less "easy" a language, it's just a different set of rules to follow. Guido's hunt for the simplest programming language possible is noble, but sometimes it causes poor decisions to be made imho.
reply
eva1984 3 days ago ((this comment modded down))
Well, just want to say for one example here why Python is harder to speed up than JS.
In Python, essentially every object is a dictionary. Behavior like "x.y = 'z'" is legal, meaning pretty much anything can be put into object x without being declared ahead of time at all. (JS, I'm not very sure, also allows such assignment, but accessing an unset property will yield undefined.)
The above behavior being the default (you can use __slots__, of course, but the compiler can't assume that) is pretty problematic for optimization: the compiler simply won't know where to locate a specific attribute ahead of time, so it cannot replace the access with a pointer offset; it has to go through a hash lookup, which compared to the offset alternative is an order of magnitude slower.
reply
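The default-dict-vs-fixed-layout difference eva1984 describes (the attribute is spelled __slots__ in CPython) is directly visible. A sketch with two hypothetical classes:

```python
class Open:
    pass

class Slotted:
    __slots__ = ("y",)  # attributes live at fixed offsets; no __dict__

o = Open()
o.y = "z"                        # goes into the per-instance dict (hash lookup)
assert o.__dict__ == {"y": "z"}

s = Slotted()
s.y = "z"                        # stored in the slot
assert not hasattr(s, "__dict__")

try:
    s.other = 1                  # undeclared attribute is rejected
    raised = False
except AttributeError:
    raised = True
assert raised
```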
trishume 3 days ago
Except that LuaJIT...