
proj-plbook-plChTradeoffsImpl

Table of Contents for Programming Languages: a survey

Chapter: Trade-offs in language design and implementation

Why high-level languages tend to be slower

Aside: algorithms can matter more than languages

"It has been a long-known lesson in programming that different languages for the same algorithm give you factors of a few in improvement (let's say 5-10x here), but that improved algorithms give you orders of magnitude improvement (10-100x). " -- http://danbricklin.com/log/2013_06_19.htm#jsspeed

However, some disagree:

"With an interpreted language, either you are lucky and have an efficient routine that does what you need, or...the two orders of magnitude paid up-front are hard to recover with algorithmic improvements" -- http://web.archive.org/web/20090721080837/http://eigenclass.org/hiki/wide-finder-conclusions

layers of abstraction

In general, whenever one writes a program at a higher layer of abstraction and executes it on some intermediate-layer platform, there is an efficiency penalty. For example, a program running on a VM will almost never be faster, and will often be slower, than the same program compiled directly for the underlying hardware, for the same reason that running an algorithm on top of an emulator can be slower than running it directly on the hardware that the emulator itself runs on.

Many high level languages use VMs.
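To make this concrete, here is a minimal sketch, in Python and not modeled on any particular VM, of a stack-based bytecode interpreter; every 'instruction' pays for a loop iteration, an opcode dispatch, and stack traffic on top of the work the instruction actually does, which is the kind of per-operation tax a VM layer adds:

    # A toy stack-based bytecode interpreter (illustrative only, not a real VM).
    # Each bytecode costs a dispatch test, a loop iteration, and list operations
    # in addition to the work the instruction itself performs.
    def run(bytecode, env):
        stack = []
        for op, arg in bytecode:
            if op == "PUSH_CONST":
                stack.append(arg)
            elif op == "LOAD_VAR":
                stack.append(env[arg])
            elif op == "ADD":
                b = stack.pop()
                a = stack.pop()
                stack.append(a + b)
            else:
                raise ValueError("unknown opcode: " + op)
        return stack.pop()

    # Computes x + 1, which native code would do in a single instruction.
    program = [("LOAD_VAR", "x"), ("PUSH_CONST", 1), ("ADD", None)]
    print(run(program, {"x": 41}))   # 42

A JIT compiler can narrow this gap by translating hot bytecode into machine code, but the translation and its bookkeeping are themselves costs.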

late binding

If late binding is used, then a method call must be resolved at runtime to the actual code that will be executed. This resolution is an extra step.
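A small illustration (Python here only because it makes the lookup visible; the same issue arises with virtual dispatch in other languages): the code actually run by the call site is chosen at each call, so there is always at least a lookup step that an early-bound, inlinable call would not pay:

    # Late binding: the call in greet() cannot be resolved to a fixed code
    # address ahead of time, because the method may be replaced at runtime.
    class Dog:
        def speak(self):
            return "woof"

    def greet(animal):
        return animal.speak()   # which code runs is resolved at each call

    d = Dog()
    print(greet(d))             # woof

    Dog.speak = lambda self: "meow?"   # rebinding at runtime
    print(greet(d))                    # meow? -- same call site, different code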

function calling

(antidote: inlining)
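A rough micro-benchmark sketch (Python; exact numbers vary by machine and runtime, and the function names are made up for illustration), showing that the call-per-iteration version pays for frame setup and argument passing that the hand-inlined version does not:

    import timeit

    def add1(x):
        return x + 1

    def with_call(n):
        total = 0
        for i in range(n):
            total += add1(i)     # one function call per iteration
        return total

    def inlined(n):
        total = 0
        for i in range(n):
            total += i + 1       # add1's body, inlined by hand
        return total

    # The called version is typically noticeably slower, purely from call overhead.
    print(timeit.timeit(lambda: with_call(100_000), number=10))
    print(timeit.timeit(lambda: inlined(100_000), number=10))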

overuse of hashes and string allocations

see [1]
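A hypothetical Python sketch of the general pattern: building a fresh key string on every iteration forces a string allocation plus a hash per lookup, where a plain integer index (or a key computed once, outside the loop) would not:

    record = {"field_0": 10, "field_1": 20, "field_2": 30}

    def sum_by_string_keys(reps):
        total = 0
        for _ in range(reps):
            for i in range(3):
                total += record["field_" + str(i)]   # allocates and hashes a new string each time
        return total

    as_list = [10, 20, 30]

    def sum_by_index(reps):
        total = 0
        for _ in range(reps):
            for i in range(3):
                total += as_list[i]                  # plain index, no allocation or hashing
        return total

    print(sum_by_string_keys(1000) == sum_by_index(1000))   # True: same result, different cost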

todo

Things that take time

computation < thread-local memory access < shared memory access < memory allocation

In general, computation is cheaper than thread-local memory access, which is cheaper than shared memory access, which is cheaper than memory allocation.

Of course, very long computations could take longer than memory access (todo: for example, would doing a lookup on a cached hash table with string keys take longer than an uncached access to main memory?)

Some things that tend to take a lot of time ([2]):

error handling

GC

or reference counting

bounds checking

runtime type checking (see the sketch after this list)

runtime evaluation strategy

e.g. lazy

e.g. prolog

runtime multitasking

e.g. greenthreads

GILs

not making use of platform primitives

???
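To make the 'runtime type checking' item above concrete, here is a rough sketch (illustrative Python, not any real interpreter's code) of the checking and re-boxing that a dynamically typed addition pays on every operation, where a statically compiled language would emit a single add instruction:

    # Illustrative only: roughly the kind of work a dynamic language performs
    # for "a + b" on every single operation.
    class BoxedInt:
        __slots__ = ("value",)
        def __init__(self, value):
            self.value = value

    def dynamic_add(a, b):
        # runtime type check / dispatch on every operation
        if isinstance(a, BoxedInt) and isinstance(b, BoxedInt):
            return BoxedInt(a.value + b.value)   # plus a fresh allocation for the result
        raise TypeError("unsupported operand types")

    x, y = BoxedInt(2), BoxedInt(3)
    print(dynamic_add(x, y).value)   # 5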

" Managed languages made deliberate design tradeoffs to optimize for programmer productivity even when that was fundamentally in tension with, and at the expense of, performance efficiency… In particular, managed languages chose to incur costs even for programs that don’t need or use a given feature; the major examples are assumption/reliance on always-on or default-on garbage collection, a virtual machine runtime, and metadata. But there are other examples; for instance, managed apps are built around virtual functions as the default, whereas C++ apps are built around inlined functions as the default, and an ounce of inlining prevention is worth a pound of devirtualization optimization cure. " -- http://herbsutter.com/2012/04/02/reader-qa-when-will-better-jits-save-managed-code/

metadata

see the metadata subsection in the next section, on memory; following pointers through these larger memory structures takes time, too

memory overhead of indirection

" The Go authors had a pretty good article on what's wrong with Java's performance : pointers everywhere. Every last little thing that isn't a primitive type is a pointer. Everywhere, in every bit of code.

That means a "new Object()" takes up 16 bytes (8 bytes for the object, 8 for the pointer to it). That means you fill a cache line by allocating 4 objects, or 2 objects containing a single reference, or ...

So in java you should never program a line drawing loop by using 2 vectors, because 2 vectors, each with 2 32-bit ints take up 8*2 (2 pointers to the objects you're using) + 8*2 (overhead for the objects) + 4*2 (the actual data) = 40 bytes of data. No way you can fit that in registers and still use registers to actually calculate things. So instead you should use 4 ints and just forget about the objects, and even that will only work if you never call any functions.

Same loop in C/C++/Pascal/Go/... using structs takes 8 bytes (they don't keep structs on the heap), which, if necessary, fits in 1 register (granted, in practice we're talking 2 registers, but still).

People might reply to this with benchmarks, but if you actually analyse the java code where java beats or is comparable with C/C++ you're going to see zero object allocations. You're not even going to see them using bool in the extreme cases, rather they'll bitshift into ints to effectively generate packed bools (certainly in SAT benchmarks). This is not realistic java code, which would have been way slower.

Java's memory model is the main culprit at this point in time. Java can do incredible tricks with programs, and actually exposes them, enabling lots of language creativity on the JVM. But there's a pretty sizeable cost in speed and memory usage. " -- iofj

Chapter : Why high-level languages tend to use more memory

GC

metadata attached to values

e.g. type information, thunks

boxing, e.g. linked list vs. array (see the sketch after this list)

prevention of inlining object members into the object struct

prevention of e.g. intrusive linked lists

Safety
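To put rough numbers on the 'metadata attached to values' and 'boxing' points above, here is a small measurement sketch (CPython; exact sizes are implementation- and platform-dependent):

    import sys
    from array import array

    # A boxed Python int carries a type pointer, reference count, etc.,
    # far beyond its few bytes of payload:
    print(sys.getsizeof(7))            # ~28 bytes in CPython

    boxed = [7] * 1000                  # a list holds pointers to boxed ints
    packed = array("i", [7] * 1000)     # packed 32-bit ints, no per-value box

    print(sys.getsizeof(boxed))         # ~8 KB of pointers (the int objects are extra)
    print(sys.getsizeof(packed))        # ~4 KB of raw 32-bit values

The boxed list stores a pointer per element, and each distinct element is additionally a full object with its own header; the packed array stores only the 32-bit payloads.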

Chapter : Why some languages take a long time to compile

todo

Case studies

scala

Chapter : Why some languages take a long time to startup

See my StackOverflow question: http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup/

Here is my summary answer to that question: http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup/28379292#28379292

Chapter : choices you can make to make your language simpler to implement

Choose an easy syntax

A good choice would be LL(1) syntax, or at least LL(k), with no symbol table required at parse time. This enables you to write a simple recursive descent parser or to use a wide variety of existing tools.

todo: list of languages with LL(k) grammars: Python, what else?
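As a sketch of how little machinery an LL(1) grammar needs (a made-up micro-grammar, not any real language's syntax): each nonterminal becomes one function, and one token of lookahead is always enough to decide which rule to apply:

    # Recursive descent parser for a tiny LL(1) grammar (illustrative only):
    #   expr := term (("+" | "-") term)*
    #   term := NUMBER | "(" expr ")"
    def tokenize(src):
        return src.replace("(", " ( ").replace(")", " ) ").split()

    def parse_expr(tokens, pos):
        value, pos = parse_term(tokens, pos)
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            rhs, pos = parse_term(tokens, pos + 1)
            value = value + rhs if op == "+" else value - rhs
        return value, pos

    def parse_term(tokens, pos):
        if tokens[pos] == "(":
            value, pos = parse_expr(tokens, pos + 1)
            return value, pos + 1        # skip the closing ")"
        return int(tokens[pos]), pos + 1

    print(parse_expr(tokenize("1 + (2 - 3) + 4"), 0)[0])   # 4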

Restrict the scope of runtime metaprogramming

If code anywhere in the program can, at any time, redefine the addition operator to mean something else (this occurs, for example, with some forms of 'open classes'), and if this redefinition changes the meaning of the addition operator throughout the whole program, then the compiler can't optimize very much: it must not merge additions with other operations or lose track of where the original additions occurred, and it can't assume that addition has its usual properties (no side effects, deterministic, no chance of throwing an exception, associative, commutative, 0 is an identity, etc.). In addition, either the runtime must be able to at least partially recompile the rest of the program when the redefinition happens, or every addition must be an indirect function call rather than a machine instruction.

Even if a particular program does not redefine addition, if the language allows the possibility of doing so, then the compiler must prove statically (at compile time) that this program will not do it before it can optimize addition or compile it to a machine instruction. This could be tricky if the program contains any instance of 'eval' applied to a dynamically generated string.
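A small Python illustration of the problem (Python's built-in numeric types cannot be redefined, so a user-defined class stands in for the 'open classes' case): because the addition method can be rebound at any time by any code, the runtime cannot fold 'a + b' into a fixed machine instruction and must keep it as an indirect, late-bound operation:

    class Money:
        def __init__(self, cents):
            self.cents = cents
        def __add__(self, other):
            return Money(self.cents + other.cents)

    def total(a, b):
        return a + b            # must remain an indirect, late-bound operation

    print(total(Money(100), Money(50)).cents)    # 150

    # Any code, anywhere in the program, can later rebind the operator...
    Money.__add__ = lambda self, other: Money(0)
    # ...which changes what the unchanged 'total' function now computes:
    print(total(Money(100), Money(50)).cents)    # 0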

The less the language is capable of runtime metaprogramming, the less this sort of thing will come up. If metaprogramming is possible but happens entirely at compile time (e.g. compile-time macros), this problem is avoided.

If runtime metaprogramming is permitted by the language, then consider restricting its lexical scope, so that the compiler can easily be sure that at least most of the code (the code outside the lexical scope of the runtime metaprogramming) is not affected. For example, a form of 'eval' might be provided that can take input, return output, and use metaprogramming within the 'eval' itself, but that does not have the power to affect the interpretation of expressions outside of the 'eval'.
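A rough sketch of that shape in Python (note: passing restricted namespaces to Python's eval is not a real security sandbox; the point here is only the scoping): the evaluated code gets its own namespaces, so the bindings of the surrounding program remain untouched and analyzable:

    def contained_eval(expr, inputs):
        # The expression sees only the names the caller hands in; this is NOT
        # a security boundary, it only illustrates the restricted scoping.
        scratch = dict(inputs)                     # copy, so the caller's dict is not mutated
        return eval(expr, {"__builtins__": {}}, scratch)

    x = 99                                          # a binding the compiler can still rely on
    print(contained_eval("a * 2 + b", {"a": 3, "b": 4}))   # 10
    print(x)                                                # still 99, untouched by the eval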

Another way out (or at least to 'contain the damage') is to allow metaprogramming to define new terms but never to redefine the meaning of terms already defined at compile-time.

One construct that is not often thought of as 'metaprogramming' but that can cause the same sort of problem is the 'import' statement, especially if it operates at runtime. If an import statement can redefine existing terms at runtime, then the compiler cannot finish dealing with expressions containing a term until all possible imports that might affect those expressions have been resolved. Dynamic interpreted languages such as Ruby sometimes have import constructs of this sort [4].
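A Python illustration of why such constructs block optimization (Python's rebinding of module globals stands in for the more aggressive cases): because a later import can rebind a global name, the call site below cannot be constant-folded and must look the name up on every call:

    pi = 3                      # a deliberately coarse initial binding

    def circumference(r):
        return 2 * pi * r       # 'pi' must be looked up anew on every call

    print(circumference(1))     # 6

    from math import pi         # an import executed at runtime rebinds the global
    print(circumference(1))     # 6.283185307179586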