proj-plbook-plChTradeoffsImpl


Table of Contents for Programming Languages: a survey

Chapter: Trade-offs in language design and implementation

Why high-level languages tend to be slower

Aside: algorithms can matter more than languages

"It has been a long-known lesson in programming that different languages for the same algorithm give you factors of a few in improvement (let's say 5-10x here), but that improved algorithms give you orders of magnitude improvement (10-100x). " -- http://danbricklin.com/log/2013_06_19.htm#jsspeed

However, some disagree:

"With an interpreted language, either you are lucky and have an efficient routine that does what you need, or...the two orders of magnitude paid up-front are hard to recover with algorithmic improvements" -- http://web.archive.org/web/20090721080837/http://eigenclass.org/hiki/wide-finder-conclusions

layers of abstraction

In general, whenever one writes a program on a higher layer of abstraction which is executed on some intermediate layer platform, there is an efficiency penalty. For example, a program running on a VM will almost never be faster, and will often be slower, than the same program running on the underlying hardware, for the same reason that running an algorithm on top of an emulator can be slower than running the same algorithm directly on the underlying hardware that the emulator is running on.

Many high level languages use VMs.

late binding

If late binding is used, then a method call must be resolved at runtime to the actual code that will be executed. This resolution is an extra step.
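For example, here is a minimal Python sketch of that extra step (the Square class is made up for illustration): the target of s.area() is looked up again on every call, and because the binding is late it can even change out from under an already-written call site, which is exactly what prevents a compiler from hard-wiring the call.

    class Square:
        def __init__(self, side):
            self.side = side
        def area(self):
            return self.side * self.side

    def total_area(shapes):
        total = 0
        for s in shapes:
            # 'area' is resolved at runtime: the interpreter searches the
            # instance, then the class (and its bases), on every iteration.
            total += s.area()
        return total

    squares = [Square(n) for n in range(3)]
    print(total_area(squares))   # 0 + 1 + 4 = 5

    # Because binding is late, the code behind s.area() can change after the
    # call site was written; a compiler therefore cannot hard-wire the call.
    Square.area = lambda self: 0
    print(total_area(squares))   # now 0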

function calling

(antidote: inlining)
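A hedged micro-benchmark sketch of the call overhead in CPython (function names invented; exact numbers will vary by machine and interpreter). The second version is just the first with the callee's body pasted in by hand, which is what inlining does automatically:

    import timeit

    def inc(x):
        return x + 1

    def with_call(n):
        acc = 0
        for i in range(n):
            acc = inc(acc)        # each iteration pays for argument passing, a new frame, and a return
        return acc

    def hand_inlined(n):
        acc = 0
        for i in range(n):
            acc = acc + 1         # the callee's body pasted into the caller
        return acc

    print(timeit.timeit(lambda: with_call(100_000), number=50))
    print(timeit.timeit(lambda: hand_inlined(100_000), number=50))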

overuse of hashes and string allocations

see [1]

todo

Things that take time

computation < threadlocal memory access < shared memory access < memory allocations

In general, pure computation is cheaper than thread-local memory access, which is cheaper than shared memory access, which in turn is cheaper than memory allocation.

Of course, very long computations could take longer than memory access (todo: for example, would doing a lookup on a cached hash table with string keys take longer than an uncached access to main memory?)
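One rough way to get a feel for this ordering from inside Python (only a sketch: a CPython micro-benchmark also measures interpreter overhead, so it mirrors the hardware-level claim only loosely):

    import timeit

    x = 12345
    table = {"key%d" % i: i for i in range(1000)}

    print(timeit.timeit("x * 3 + 1", globals=globals()))        # pure computation
    print(timeit.timeit("table['key500']", globals=globals()))  # hash lookup with a string key
    print(timeit.timeit("object()", globals=globals()))         # heap allocation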

Some things that tend to take a lot of time ([2]):

error handling

GC

or reference counting

bounds checking

runtime type checking

runtime evaluation strategy

e.g. lazy

e.g. prolog

runtime multitasking

e.g. greenthreads

GILs

not making use of platform primitives

???

" Managed languages made deliberate design tradeoffs to optimize for programmer productivity even when that was fundamentally in tension with, and at the expense of, performance efficiency… In particular, managed languages chose to incur costs even for programs that don’t need or use a given feature; the major examples are assumption/reliance on always-on or default-on garbage collection, a virtual machine runtime, and metadata. But there are other examples; for instance, managed apps are built around virtual functions as the default, whereas C++ apps are built around inlined functions as the default, and an ounce of inlining prevention is worth a pound of devirtualization optimization cure. " -- http://herbsutter.com/2012/04/02/reader-qa-when-will-better-jits-save-managed-code/

metadata

see metadata subsection in next section on memory; following pointers in these larger memory structures takes time, too

memory overhead of indirection

" The Go authors had a pretty good article on what's wrong with Java's performance : pointers everywhere. Every last little thing that isn't a primitive type is a pointer. Everywhere, in every bit of code.

That means a "new Object()" takes up 16 bytes (8 bytes for the object, 8 for the pointer to it). That means you fill a cache line by allocating 4 objects, or 2 objects containing a single reference, or ...

So in java you should never program a line drawing loop by using 2 vectors, because 2 vectors, each with 2 32-bit ints take up 8*2 (2 pointers to the objects you're using) + 8*2 (overhead for the objects) + 4*2 (the actual data) = 40 bytes of data. No way you can fit that in registers and still use registers to actually calculate things. So instead you should use 4 ints and just forget about the objects, and even that will only work if you never call any functions.

Same loop in C/C++/Pascal/Go/... using structs takes 8 bytes (they don't keep structs on the heap), which, if necessary, fits in 1 register (granted, in practice we're talking 2 registers, but still).

People might reply to this with benchmarks, but if you actually analyse the java code where java beats or is comparable with C/C++ you're going to see zero object allocations. You're not even going to see them using bool in the extreme cases, rather they'll bitshift into ints to effectively generate packed bools (certainly in SAT benchmarks). This is not realistic java code, which would have been way slower.

Java's memory model is the main culprit at this point in time. Java can do incredible tricks with programs, and actually exposes them, enabling lots of language creativity on the JVM. But there's a pretty sizeable cost in speed and memory usage. " -- iofj

Chapter : Why high-level languages tend to use more memory

GC

metadata attached to values

e.g. type information, thunks

boxing, e.g. linked list vs. array
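A small Python illustration of the boxing cost (sizes are typical for 64-bit CPython and will vary): a list holds pointers to separately allocated int objects, while an array stores the raw machine values inline.

    import sys
    from array import array

    boxed = list(range(1000, 2000))          # 1000 pointers to 1000 separately allocated int objects
    unboxed = array("q", range(1000, 2000))  # 1000 raw 8-byte machine integers, stored inline

    pointers = sys.getsizeof(boxed)
    objects = sum(sys.getsizeof(n) for n in boxed)
    print(pointers + objects)      # list header + pointer array + the boxed int objects
    print(sys.getsizeof(unboxed))  # everything, inline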

prevention of inlining object members into the object struct

prevention of e.g. intrusive linked lists

Safety

Chapter : Why some languages take a long time to compile

todo

Case studies

scala

Chapter : Why some languages take a long time to startup

See my StackOverflow question: http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup/

Here is my summary answer to that question: http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup/28379292#28379292

Chapter : choices you can make to make your language simpler to implement

Choose an easy syntax

A good choice would be LL(1) syntax, or at least LL(k), with no symbol table required at parse time. This enables you to write a simple recursive descent parser or to use a wide variety of existing tools.
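For illustration, a minimal recursive descent parser for a tiny LL(1) grammar of integer expressions (a toy, not tied to any language surveyed here); note that every decision is made from a single token of lookahead and no symbol table is needed:

    import re

    TOKEN = re.compile(r"\s*(?:(\d+)|(.))")

    def tokenize(src):
        for number, op in TOKEN.findall(src):
            if number:
                yield ("num", int(number))
            elif not op.isspace():
                yield ("op", op)

    class Parser:
        # grammar:  expr   -> term ('+' term)*
        #           term   -> factor ('*' factor)*
        #           factor -> NUMBER | '(' expr ')'
        def __init__(self, src):
            self.tokens = list(tokenize(src))
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else ("eof", None)

        def eat(self, kind, value=None):
            tok = self.peek()
            assert tok[0] == kind and (value is None or tok[1] == value), tok
            self.pos += 1
            return tok

        def expr(self):
            value = self.term()
            while self.peek() == ("op", "+"):
                self.eat("op", "+")
                value += self.term()
            return value

        def term(self):
            value = self.factor()
            while self.peek() == ("op", "*"):
                self.eat("op", "*")
                value *= self.factor()
            return value

        def factor(self):
            if self.peek()[0] == "num":
                return self.eat("num")[1]
            self.eat("op", "(")
            value = self.expr()
            self.eat("op", ")")
            return value

    print(Parser("2 * (3 + 4)").expr())   # 14

Because each choice is driven by one token of lookahead, the parser never backtracks, which is what makes hand-written recursive descent practical for LL(1) grammars.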

todo: list of languages with LL(k) grammars: Python, what else?

Restrict the scope of runtime metaprogramming

If code anywhere in the program can, at any time, redefine the addition operator to do something else (this occurs, for example, with some forms of 'open classes'), and if this redefinition changes the meaning of the addition operator throughout the whole program, then the compiler can't optimize very much. It must not fold additions into other operations or lose track of where the original additions occurred, and it can't assume that addition has its usual properties (no side-effects, deterministic, no chance of throwing an exception, associative, commutative, 0 is an identity, etc.). In addition, either the runtime must be able to at least partially recompile the rest of the program when a redefinition happens, or every addition must be compiled as an indirect function call rather than a machine instruction.

Even if the program does not redefine addition, if the programming language allows the possibility of doing so, then the compiler must prove to itself statically (at compile time) that this particular program will not do it before it can optimize addition or use machine instructions for it. This can be tricky if the program contains any instances of 'eval' applied to a dynamically-generated string.
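A Python sketch of the problem (the Money class is invented): '+' on instances dispatches through __add__, which can be rebound at any time, possibly from behind a dynamically built string, so the compiler can't emit a plain add instruction unless it can prove no such rebinding occurs.

    class Money:
        def __init__(self, cents):
            self.cents = cents
        def __add__(self, other):
            return Money(self.cents + other.cents)

    a, b = Money(100), Money(250)
    print((a + b).cents)   # 350

    # Somewhere else in the program -- possibly behind code the compiler cannot
    # see until runtime -- addition on Money is redefined for every call site:
    patch = "Money.__add__ = lambda self, other: Money(0)"
    exec(patch)

    print((a + b).cents)   # 0: the meaning of '+' changed program-wide for Money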

The less the language is capable of runtime metaprogramming, the less this sort of thing will come up. If metaprogramming is possible but happens entirely at compile time (e.g. compile-time macros), the problem is avoided.

If runtime metaprogramming is permitted by the language, then consider restricting its lexical scope, so that the compiler can easily be sure that at least most of the code (that outside the lexical scope of the runtime metaprogramming) is not affected. For example, a form of 'eval' might be provided that can take input and return output, and use metaprogramming within the 'eval' itself, but that does not have the power to affect the interpretation of expressions outside of the 'eval'.
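In Python terms, a sketch of such a restricted 'eval' (the helper is hypothetical, and this illustrates scoping of effects, not a real security sandbox): the expression is evaluated against an explicitly supplied environment and can only hand back a result, so nothing in the caller's namespace can be rebound by it.

    def restricted_eval(expr, inputs):
        env = dict(inputs)
        # The expression sees only 'env'; it has no access to the caller's
        # namespace, so it cannot rebind names in the surrounding code.
        return eval(expr, {"__builtins__": {}}, env)

    x = 10
    print(restricted_eval("a * b + 1", {"a": 6, "b": 7}))   # 43
    print(x)                                                 # still 10: nothing outside was affected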

Another way out (or at least to 'contain the damage') is to allow metaprogramming to define new terms but never to redefine the meaning of terms already defined at compile-time.

One construct that is not often thought of as 'metaprogramming' but that can cause the same sort of problem is the 'import' statement, especially if it operates at runtime. If an import statement can redefine existing terms at runtime, then the compiler cannot finish dealing with expressions containing those terms until all possible imports that might affect those expressions have been resolved. Dynamic interpreted languages such as Ruby sometimes have import constructs of this sort [4].
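A Python sketch of the same hazard (the module and plugin names are invented; the exec stands in for a plugin module's top-level code running at import time): a call site that has already been seen by the compiler changes meaning when a later runtime import rebinds an existing name.

    import sys
    import types

    # Stand-in for a separately 'compiled' module (hypothetical name).
    mathutils = types.ModuleType("mathutils")
    mathutils.add = lambda a, b: a + b
    sys.modules["mathutils"] = mathutils

    def call_site():
        import mathutils          # resolved at runtime, like a Ruby 'require'
        return mathutils.add(200, 100)

    print(call_site())            # 300

    # A plugin imported later at runtime rebinds an *existing* name in mathutils:
    plugin = types.ModuleType("plugin")
    exec("import mathutils\nmathutils.add = lambda a, b: (a + b) % 256", plugin.__dict__)
    sys.modules["plugin"] = plugin

    print(call_site())            # 44 -- the earlier call site now means something else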

Note that even restricted scope, non-redefining metaprogramming (for example, an 'eval' that cannot affect anything besides returning a result) could force the language implementation to include an interpreter in the runtime (if not in every program, at least in those programs making use of constructs similar to 'eval').

Examples:

Don't use multiple inheritance in a statically typed, compiled language

" Single inheritance is convenient, because it simplifies many aspects of the implementation. Objects can be extended just by appending fields; a cast to the supertype just involves ignoring the end, and a cast to a subtype just involves a check—the pointer values remain the same. Downcasting in C++ requires a complex search of the inheritance graph in the run-time type information via a runtime library function. " -- The Challenge of Cross-language Interoperability by David Chisnall

Chapter : Constraints that performance considerations of the language and the toolchain impose on language design

Whole program analysis

other stuff that Go does (later: what?!?)

a sequence point is "a point in the program's execution sequence where all previous side effects SHALL have taken place and all subsequent side-effects SHALL NOT have taken place" -- http://www.slideshare.net/olvemaudal/deep-c

If each computation step is a sequence point, then the programmer doesn't have to think too much about what is going on, but if there are few sequence points, there is more room for optimization. Similarly, if the order of evaluation of expressions is unspecified, then the compiler can optimize more.

something about memory model and reordering of statements e.g. java volatile

if you pad data structures to fall on word boundaries, you get better performance

what else?

---

" qznc 1 day ago [-]

While I generally agree with your point, D/Walter Bright slightly disagrees with the sentiment. One constraint for D is "must not require data flow analysis" because that incurs a cost on compilation time. Afaik this was mostly inspired by Javas definite assignment rule [0]. If fast compilation is a goal, then a language is constrained by implementation and IR aspects.

[0] https://docs.oracle.com/javase/specs/jls/se6/html/defAssign.html

reply " -- [6]

---

https://wren.io/performance.html

---

Generics

Reflection

" Traits do not make Rust more powerful than your average scripting language. They are much less powerful. The biggest hole here is the complete absence of any dynamicism. Rust types do not have any real runtime presence. Concretely, there is no way to dynamically check whether this value implements some Trait. This is used in Golang (Copy -> WriterTo?) but is impossible in Rust without manual implementation. " -- [7]

---

" > I think having something built in is wrong, if it can be a library

I'm split on this one. This sort of "axiomatic" philosophy of programming is fairly common, and it's not hard to see why; it just feels right, intuitively, for everything to be built out of a tiny set of simple constructs. It's really empowering to know that, if you understand those constructs, you can (in principle) understand any program. Lisp and Urbit are extreme examples of this.

But to everything, there is a price, and here the price is paid in performance, tooling, and readability. Reified builtins can offend our sensibilities, but they are easier to optimize, permit deeper tooling, and require less cognitive overhead. ... " -- [8]

---

" Another problem is that tracing GC algorithms are selfish; they only work well when a lot of information about the potential "root set" is available, this makes interoperability between different garbage collectors (and thus between different programming language implementations) quite challenging. " -- [9]

---

Julia: dynamism and performance reconciled by design describes design decisions made in Julia to promote performance while keeping implementation relatively simple

some things that are mentioned:

---

Modules

" In this post, we’ll learn how to make a snappy IDE...

Specifically, we’ll look at the backbone infrastructure of an IDE which serves two goals:

    Quickly accepting new edits to source files.
    Providing type information about currently opened files for highlighting, completion, etc.

...

architecture is reminiscent of the map-reduce paradigm. The idea is to split analysis into relatively simple indexing phase, and a separate full analysis phase.

The core constraint of indexing is that it runs on a per-file basis. The indexer takes the text of a single file, parses it, and spits out some data about the file. The indexer can’t touch other files.

Full analysis can read other files, and it leverages information from the index to save work.

This all sounds way too abstract, so let’s look at a specific example — Java. In Java, each file starts with a package declaration. The indexer concatenates the name of the package with a class name to get a fully-qualified name (FQN). It also collects the set of methods declared in the class, the list of superclasses and interfaces, etc.

Per-file data is merged into an index which maps FQNs to classes. Note that constructing this mapping is an embarrassingly parallel task — all files are parsed independently. Moreover, this map is cheap to update. When a file change arrives, this file’s contribution from the index is removed, the text of the file is changed and the indexer runs on the new text and adds the new contributions. The amount of work to do is proportional to the number of changed files, and is independent from the total number of files.

...

One problem with the approach described thus far is that resolving types from the index requires a non-trivial amount of work. This work might be duplicated if, for example, Foo.f is called several times. The fix is to add a cache. Name resolution results are memoized, so that the cost is paid only once. The cache is blown away completely on any change — with an index, reconstructing the cache is not that costly.

To sum up, the first approach works like this:

    Each file is being indexed, independently and in parallel, producing a "stub" — a set of visible top-level declarations, with unresolved types.
    All stubs are merged into a single index data structure.
    Name resolution and type inference work primarily off the stubs.
    Name resolution is lazy (we only resolve a type from the stub when we need it) and memoized (each type is resolved only once).
    The caches are completely invalidated on every change
    The index is updated incrementally:
        if the edit doesn’t change the file’s stub, no change to the index is required.
        otherwise, old keys are removed and new keys are added

...

This approach combines simplicity and stellar performance. The bulk of work is the indexing phase, and you can parallelize and even distribute it across several machines. Two examples of this architecture are IntelliJ and Sorbet.

The main drawback of this approach is that it works only when it works — not every language has a well-defined FQN concept. I think overall it’s a good idea to design name resolution and module systems (mostly boring parts of a language) such that they work well with the map-reduce paradigm.

The last point is worth elaborating. Let’s look at the following Rust example:

File: ./foo.rs

    trait T { fn f(&self) {} }

File: ./bar.rs

    struct S;

File: ./somewhere/else.rs

    impl T for S {}

File: ./main.rs

    use foo::T;
    use bar::S;

    fn main() { let s = S; s.f(); }

Here, we can easily find the S struct and the T trait (as they are imported directly). However, to make sure that s.f indeed refers to f from T, we also need to find the corresponding impl, and that can be roughly anywhere! " -- Three Architectures for a Responsive IDE

note that in the elided parts of the above blog post, it is explained that the indexer stores references as text; it doesn't actually compute what they refer to ('resolve' them) because that computation requires looking outside the file being indexed, and the indexer is supposed to work on a purely per-file basis. The resolution happens later, lazily (but cached), in the analysis phase.
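A toy Python sketch of that shape (the file contents and naming rules are invented; the point is only the data flow): a per-file indexer produces stubs keyed by fully-qualified name, stubs are merged into an index, resolution against the index is lazy and memoized, and the memo cache is dropped wholesale on any edit.

    from functools import lru_cache

    # One 'file' is a package name plus the classes it declares, each with the
    # textual names of its superclasses (unresolved, as the indexer records them).
    FILES = {
        "a.java": ("com.example", {"Base": []}),
        "b.java": ("com.example", {"Widget": ["Base"]}),
    }

    def index_file(path):
        """Per-file, embarrassingly parallel step: text in, stubs out."""
        package, classes = FILES[path]
        return {f"{package}.{name}": supers for name, supers in classes.items()}

    def build_index(paths):
        index = {}
        for path in paths:               # could be a parallel map
            index.update(index_file(path))
        return index

    INDEX = build_index(FILES)

    @lru_cache(maxsize=None)             # memoized name resolution
    def resolve(fqn):
        supers = INDEX[fqn]
        # full analysis may consult other stubs, but only through the index;
        # for simplicity, superclasses are assumed to live in the same package
        return {"name": fqn, "supers": [resolve(f"com.example.{s}") for s in supers]}

    print(resolve("com.example.Widget"))

    def apply_edit(path, new_contents):
        """Incremental update: replace one file's contribution, drop the cache."""
        FILES[path] = new_contents
        INDEX.update(index_file(path))   # (a real index would also remove the file's old keys)
        resolve.cache_clear()            # caches are rebuilt cheaply from the index

    apply_edit("b.java", ("com.example", {"Widget": []}))
    print(resolve("com.example.Widget"))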

---

https://instagram-engineering.com/python-at-scale-strict-modules-c0bb9245c834?source=social.tw&gi=c39932b5c7cd

---

"

abernard1 5 hours ago [–]

> I wish all the "big data" tooling was written in Ruby.

On a practical note, Ruby's heavy use of reflection, inheritance and method_missing style dispatch currently makes performance for big data tasks less than ideal. Python is less expressive but its "one true way" makes things like vectorization and type specialization easier for data tasks.

Sometimes you really just need a better Fortran.

reply

lloeki 5 hours ago [–]

I believe you’re confusing Ruby with Rails.

While these can eventually come in handy sometimes, they are by and large seldom used, but Rails heavily leverages such patterns for flashy yet ultimately questionable magic.

As a long time Rubyist, I believe Rails, for all its innovation, has been warping the view of what Ruby is, and when and how to use its features responsibly.

reply

chrisseaton 4 hours ago [–]

> While these can eventually come in handy sometimes, they are by and large seldom used

I don't think they are seldom used in practice - I think large parts of the Ruby ecosystem use them pervasively, even if you disregard Rails.

But even if we don't agree on that - it's missing the point. The point is you pay for much of Ruby's metaprogramming, even if you aren't using it. So not using it doesn't help you.

A concrete example is that you pay for `Kernel#binding` every single time you finish using a local variable but Ruby retains it for the binding, even if you never ever call `Kernel#binding`. Even JITs cannot remove this cost.

" -- [10]

---

late binding vs run-time perf

---

single pass compiler (fast compile) vs intermediate language (better optimization)

---

"

jakobnissen 18 days ago

...
on: What's bad about Julia?

The problem with simply compiling dependencies to static binaries is that all Julia code is allowed to redefine other Julia code. So package X can load package Y, then define a method that causes package Y to change behaviour, and thus needs to be recompiled.

This is not unintentional, by the way. Having packages able to use each other's code is critical for having the ecosystem be "composable", that is, being able to make two different packages work effectively together. For example, I might take a type from package X and throw it into a function from package Y, and it works.

lasagnaphil 18 days ago [–]

I don't think this is an inherent problem of having a compiled dynamic (in other words, JITted) language.

Things like Javascript (V8) and Lua (LuaJIT) manages to have fast startup times while having exceptional performance in hot paths - this is because they have a fast bytecode interpreter that executes the script first while the actual compilation is taking place. Unfortunately Julia in its current state, doesn't have a fallback interpreter to execute its bytecode/IR (which is similar to LLVM IR). And LLVM IR isn't really suited for fast execution in interpreters - it's moreso designed as an intermediate language for a heavyweight compiler than a bytecode for a dynamic language.

Maybe some heroic figure would come out of the fog and suddenly brings a whole new bytecode representation and a fast interpreter for Julia, but that would be quite a project..

fiddlerwoaroof 18 days ago [–]

> It's a fundamental problem of having a dynamic, compiled language. It's not possible to actually solve.

There are plenty of Common Lisp implementations that allow you to redefine functions or add methods to multi-dispatch functions without requiring you to recompile every use-site. There are other trade-offs here, but it’s not “impossible” to precompile and optimize code in a dynamically-typed language.

StefanKarpinski 17 days ago [–]

It’s not really a problem with dynamism, it’s more about the fact that Julia’s compilation model is similar to C++ with templates but at runtime (and with first class parametric types instead of textual substitution, but the compilation and specialization story is similar). Just as separate compilation isn’t possible for C++ template code, it’s a challenge for Julia as well.

" [11]

---

" In this study we attempt to quantify the costs of language features such as dynamic typing, reference counting for memory management, boxing of numbers, and late binding of function calls

...

We find that a boxed representation of numbers as heap objects is the single most costly language feature on numeric codes, accounting for up to 43 % of total execution time in our benchmark set. On symbolic object-oriented code, late binding of function and method calls costs up to 30 %. Redundant reference counting, dynamic type checks, and Python’s elaborate function calling convention have comparatively smaller costs.

...

The optimizations performed by pylibjit on request aim directly at eliminating the sources of overheads discussed above.

...

Unboxing of numbers uses type annotations to directly use machine operations on integers and floating-point numbers that fit into machine registers. This avoids storing every number in a heap object and managing that memory. As a special case, this also turns loops of the very common form for i in range(n) , which would normally use an iterator to enumerate boxed numbers from 0 to n , into unboxed counting loops if i is an unboxed machine integer.

...

Early binding of function calls resolves the addresses of compiled or built-in functions or methods at compile time (guided by type annotations for methods’ receiver objects) and generates direct CPU call instructions. " -- Python Interpreter Performance Deconstructed by Gergö Barany

---

"

4 icefox 1 month ago

link

The whole example of juggling Pin in the futures executor section is a great example of how assumptions in the design of programming languages can have unexpectedly far-reaching effects. Rust’s borrow checker assumes that things in memory generally don’t move: copies are explicit, if an object is moved between variables then it is actually memcpy’d from one struct or variable to another, and it may live on the stack for a particular function or in thread local storage for a thread that might go away or strange stuff like that.. This means that when you create a piece of state, like a future, and hand it off to something that may or may not execute it on a different thread and free it, then the question of where the object is actually supposed to be is nontrivial and you need to be very meticulous about explaining to the language what’s going on.

Contrast this with async in languages that have a garbage collector, like C#. Most objects are boxed on the heap and you seldom actually move them around, you just pass around pointers to them. The things that aren’t on the heap and so might have a weird lifetime usually can’t have pointers to them created, a la C#’s struct’s (if I am remembering correctly). This means that moving objects around in memory is very easy and the garbage collector can do it whenever it feels like it, since it already knows how to fix up all pointers in the program to point to the new object. So, storing the state for an async state machine and handing it around to different threads or executors or whatever needs very little additional work compared to just writing a function, because all the framework for dynamically tracking lifetimes is already there.

There’s a reason that programming language design from 1995 to 2015 tried to make garbage collection good enough to be universal and boxed every object possible: it makes some very nice simplifying assumptions possible.

" -- [12]

---

"My (mostly uninformed) feeling is that the job of the compiler is to get rid of abstractions (e.g. monomorphizing code), and perfect optimization is a global process. So when any piece of code can be redefined at any time, that makes recompilation slow." [13]

---

"...from reading about v8 over the years, the tiers seem to become a huge tax. It’s not just “someone has to do that work”, but “every language change from now on requires more work” (from a limited group of people; it’s not orthogonal)." [14]

---

"When you timed MzScheme?, did you put the code in a module?

If not the JIT-compiler can't inline the operations such as +, - and friends, since the R5RS requires you can assign to builtin operators. " [15]

---

--- from part 2 of the previous series:

Since we prefer compile-time speed to runtime speed, I guess we prefer vtables over monomorphization? [17] says:

" In general, for programming languages, there are two ways to translate a generic function:

1) translate the generic function for each set of instantiated type parameters, calling each trait method directly, but duplicating most of the generic function's machine instructions, or

2) translate the generic function just once, calling each trait method through a function pointer (via a "vtable").

The first results in static method dispatch, the second in dynamic (or "virtual") method dispatch. The first is sometimes called "monomorphization", particularly in the context of C++ and Rust, a confusingly complex word for a simple idea. ... These two strategies represent a notoriously difficult tradeoff: the first creates lots of machine instruction duplication, forcing the compiler to spend time generating those instructions, and putting pressure on the instruction cache, but — crucially — dispatching all the trait method calls statically instead of through a function pointer. The second saves lots of machine instructions and takes less work for the compiler to translate to machine code, but every trait method call is an indirect call through a function pointer, which is generally slower because the CPU can't know what instruction it is going jump to until the pointer is loaded. "

---

"Ada, D, Delphi, Eiffel, .NET Native, Ada all compile quite fast and have generics support." [18]

---

[19] recommends this "Responsive compilers" talk at PLISS 2019, saying "In that talk he also provided some examples of how the Rust language was accidentally mis-designed for responsive compilation. It is an entirely watchable talk about compiler engineering and I recommend checking it out." slides: https://nikomatsakis.github.io/pliss-2019/responsive-compilers.html#1

in the talk, Matsakis suggests using the Salsa framework (spreadsheet-y updates; "a Rust framework for writing incremental, on-demand programs -- these are programs that want to adapt to changes in their inputs, continuously producing a new output that is up-to-date").

---

" It seems like there are three basic approaches to generics:

    (The C approach.) Leave them out.
    This slows programmers.
    But it adds no complexity to the language.
    (The C++ approach.) Compile-time specialization or macro expansion.
    This slows compilation.
    It generates a lot of code, much of it redundant, and needs a good linker to eliminate duplicate copies. The individual specializations may be efficient but the program as a whole can suffer due to poor use of the instruction cache. I have heard of simple libraries having text segments shrink from megabytes to tens of kilobytes by revising or eliminating the use of templates.
    (The Java approach.) Box everything implicitly.
    This slows execution.
    Compared to the implementations the C programmer would have written or the C++ compiler would have generated, the Java code is smaller but less efficient in both time and space, because of all the implicit boxing and unboxing. A vector of bytes uses significantly more than one byte per byte. Trying to hide the boxing and unboxing may also complicate the type system. On the other hand, it probably makes better use of the instruction cache, and a vector of bytes can be written separately.

The generic dilemma is this: do you want slow programmers, slow compilers and bloated binaries, or slow execution times? " -- The Generic Dilemma by Russ Cox (in the context of contemplating adding generics to Golang)

---

" ... am implementing an Ownership/Borrowing system for the D programming language, and to make it work requires Data Flow Analysis. DFA is slow. It's normally only used for optimized builds, which is why optimized builds are slow.

But when DFA is a required semantic feature of the language, it has to be done for debug builds, too, and so they're going to be slow. ... It tends to be quadratic, based on the number of variables and the cyclomatic complexity. The DFA equations cannot be solved directly, but only iteratively until a solution is reached.

Debug code generation, on the other hand, involves none of that and the compiler just blasts out code expression by expression. The time goes up linearly with the number of expressions.

DFA (and the O/B DFA) is done at the function level. It can be done globally, but I try to keep the compile times reasonable. ... Many implementors try to cache the results of DFA and patch it incrementally. Anecdotal evidence is that they spend endless hours dealing with weird optimization bugs because they patched it wrong.

I decided to redo the DFA from scratch anytime the AST changed, and have had pretty reliable optimization as a result. " -- Walter Bright

---

" One of the main reason I got into Ocaml (coming from C++) was that it had bytecode compilation (and obviously a bytecode interpreter) next to the native one. This is really fast (5-7 times faster last time I measured) and perfect during development, when you don't really care about performance. " [20]

---

" qaq on Dec 26, 2019 [–]

need to support hot reload is one of the reasons Erlang does not have static typing

thosakwe on Dec 26, 2019 [–]

Dart has static typing, and hot reloading in its VM, so it's definitely possible. " -- [21]

---

https://github.com/athas/raytracers

---

" Why Python is slow ... The most common reason given is that interpreted languages are slow and, since Python is interpreted, it is slow, but that is not what he has found. In his measurements of web servers, the overhead of interpretation is about 10%; that is "significant, and people don't want it, but it doesn't explain why Python can be ten to 100 times slower than C". In order to explain that, you need to look at the dynamic nature of Python.

"Python is a very dynamic language", in lots of different ways, but he wanted to focus on just a few of those in the talk. The first is perhaps the most obvious, the interpreter does not know the types of any of the variables. That means any operation on a variable needs to look up what type it is in order to figure out how to perform the operation on that type.

The second dynamic behavior is in variable lookups, though that is a less obvious one. For example, "print" is not a keyword in Python, so a simple print statement needs to look up the value for the "print" name in order to call it as a function. In particular, it has to determine whether the name has been overridden by anything in the scope and it has to do that every time print() is called. The same goes things like len(), int(), and many more built-in functions; all of them require expensive lookups every time they are used.

The third aspect is Python's dynamic attribute lookup. In general, even in a single class, the interpreter does not know what attributes exist on an object of that class. That means it needs a dynamic representation for the object to track the attributes. Looking up the attributes in Python is pretty fast, but still much slower than it would be if the list of attributes were static. " -- Modern Python performance considerations
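The lookup cost is visible from within Python itself; a common (if ugly) micro-optimization is to snapshot a builtin into a local variable, which CPython reads with a single array access instead of a dict lookup that must first check whether the name has been shadowed (a hedged benchmark sketch, numbers will vary):

    import timeit

    def count_longer_global(words, limit):
        n = 0
        for w in words:
            if len(w) > limit:      # 'len' looked up in globals, then builtins, on every iteration
                n += 1
        return n

    def count_longer_local(words, limit, _len=len):
        n = 0
        for w in words:
            if _len(w) > limit:     # '_len' is a local: one fast array-slot access
                n += 1
        return n

    words = ["abcdefgh"] * 10_000
    print(timeit.timeit(lambda: count_longer_global(words, 5), number=200))
    print(timeit.timeit(lambda: count_longer_local(words, 5), number=200))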

---

https://github.com/joncatanio/cannoli

"Cannoli is a compiler for a subset of Python 3.6.5 and is designed to evaluate the language features of Python that negatively impact performance"

---

Self-modifying code

If the language permits self-modifying code, then its executables must contain an implementation of the language, which may be easy enough for interpreter implementations, but adds complexity and makes optimization more difficult for compiler implementations [22].
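A small Python example of why (the generated function is invented): because a program may construct and run new code at runtime, an ahead-of-time compiler cannot strip the language implementation out of the shipped executable.

    # Source code that exists only at runtime -- for instance, assembled from
    # user input or configuration:
    field = "price"   # imagine this value arrives at runtime
    src = f"def total(items):\n    return sum(item['{field}'] for item in items)\n"

    namespace = {}
    exec(src, namespace)            # the language implementation must be present to do this
    total = namespace["total"]

    print(total([{"price": 3}, {"price": 4}]))   # 7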

---

'Unnecessary' stack memory move CPU instructions are one reason why some languages are slower than others. E.g., here's a web page comparing Rust and C++ on this measure: https://arewestackefficientyet.com/

---