
Table of Contents for Programming Languages: a survey

Chapter : Why high-level languages tend to be slower

Aside: algorithms can matter more than languages

"It has been a long-known lesson in programming that different languages for the same algorithm give you factors of a few in improvement (let's say 5-10x here), but that improved algorithms give you orders of magnitude improvement (10-100x). " -- http://danbricklin.com/log/2013_06_19.htm#jsspeed

However, some disagree:

"With an interpreted language, either you are lucky and have an efficient routine that does what you need, or...the two orders of magnitude paid up-front are hard to recover with algorithmic improvements" -- http://web.archive.org/web/20090721080837/http://eigenclass.org/hiki/wide-finder-conclusions

layers of abstraction

In general, whenever one writes a program on a higher layer of abstraction which is executed on some intermediate layer platform, there is an efficiency penalty. For example, a program running on a VM will almost never be faster, and will often be slower, than the same program running on the underlying hardware, for the same reason that running an algorithm on top of an emulator can be slower than running the same algorithm directly on the underlying hardware that the emulator is running on.

Many high-level languages use VMs.

late binding

If late binding is used, then a method call must be resolved at runtime to the actual code that will be executed. This resolution is an extra step.
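To make the cost concrete, here is a minimal C sketch (the names 'shape', 'area_square', etc. are invented for illustration): with early binding the compiler can emit a direct call and may inline it; with late binding it must first load a function pointer (e.g. out of a vtable or method table) and then make an indirect call, which also blocks inlining.

  /* Early binding: the callee is known at compile time. */
  static int area_square(int side) { return side * side; }

  int early(int side) {
      return area_square(side);    /* direct call; can be inlined */
  }

  /* Late binding: the callee is found through a table at run time,
     roughly what a vtable or method-dictionary lookup does. */
  struct shape {
      int (*area)(int);            /* slot filled in at run time */
  };

  int late(struct shape *s, int side) {
      return s->area(side);        /* load the pointer, then an indirect call */
  }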

function calling

(antidote: inlining)
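A small C sketch of the problem and the antidote (illustrative only; 'add1' and 'sum_plus_n' are made-up names): each call to a tiny function pays for argument passing, the call, and the return; if the compiler inlines it, the loop body reduces to a plain add.

  /* Each call to add1 pays for argument passing, the call, and the return.
     If the compiler inlines it, the loop body becomes a plain add. */
  static inline int add1(int x) { return x + 1; }

  int sum_plus_n(const int *a, int n) {
      int total = 0;
      for (int i = 0; i < n; i++)
          total += add1(a[i]);     /* likely compiled as a[i] + 1 */
      return total;
  }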

overuse of hashes and string allocations

see [1]

todo

Things that take time

computation < thread-local memory access < shared memory access < memory allocations

In general, the items above are ordered from cheapest (pure computation) to most expensive (memory allocation).

Of course, very long computations could take longer than memory access (todo: for example, would doing a lookup on a cached hash table with string keys take longer than an uncached access to main memory?)

Some things that tend to take a lot of time ([2]):

error handling

GC

or reference counting

bounds checking (see the sketch after this list)

runtime type checking

runtime evaluation strategy

e.g. lazy

e.g. prolog

runtime multitasking

e.g. greenthreads

GILs

not making use of platform primitives

???
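As a rough illustration of why some of these cost time, here is a hand-written C sketch of bounds checking (illustrative only, not what any particular language implementation actually emits): a bounds-checked array access adds a compare and a branch to what would otherwise be a single load.

  #include <stdbool.h>
  #include <stddef.h>

  /* Unchecked access: a single load. */
  int get_unchecked(const int *a, size_t i) {
      return a[i];
  }

  /* Checked access: a compare, a branch, and some way to signal the error,
     on every access the compiler cannot prove is in range. */
  bool get_checked(const int *a, size_t len, size_t i, int *out) {
      if (i >= len)
          return false;            /* a managed language would raise an exception here */
      *out = a[i];
      return true;
  }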

" Managed languages made deliberate design tradeoffs to optimize for programmer productivity even when that was fundamentally in tension with, and at the expense of, performance efficiency… In particular, managed languages chose to incur costs even for programs that don’t need or use a given feature; the major examples are assumption/reliance on always-on or default-on garbage collection, a virtual machine runtime, and metadata. But there are other examples; for instance, managed apps are built around virtual functions as the default, whereas C++ apps are built around inlined functions as the default, and an ounce of inlining prevention is worth a pound of devirtualization optimization cure. " -- http://herbsutter.com/2012/04/02/reader-qa-when-will-better-jits-save-managed-code/

metadata

see metadata subsection in next section on memory; following pointers in these larger memory structures takes time, too

memory overhead of indirection

" The Go authors had a pretty good article on what's wrong with Java's performance : pointers everywhere. Every last little thing that isn't a primitive type is a pointer. Everywhere, in every bit of code.

That means a "new Object()" takes up 16 bytes (8 bytes for the object, 8 for the pointer to it). That means you fill a cache line by allocating 4 objects, or 2 objects containing a single reference, or ...

So in java you should never program a line drawing loop by using 2 vectors, because 2 vectors, each with 2 32-bit ints take up 8*2 (2 pointers to the objects you're using) + 8*2 (overhead for the objects) + 4*2 (the actual data) = 40 bytes of data. No way you can fit that in registers and still use registers to actually calculate things. So instead you should use 4 ints and just forget about the objects, and even that will only work if you never call any functions.

Same loop in C/C++/Pascal/Go/... using structs takes 8 bytes (they don't keep structs on the heap), which, if necessary, fits in 1 register (granted, in practice we're talking 2 registers, but still).

People might reply to this with benchmarks, but if you actually analyse the java code where java beats or is comparable with C/C++ you're going to see zero object allocations. You're not even going to see them using bool in the extreme cases, rather they'll bitshift into ints to effectively generate packed bools (certainly in SAT benchmarks). This is not realistic java code, which would have been way slower.

Java's memory model is the main culprit at this point in time. Java can do incredible tricks with programs, and actually exposes them, enabling lots of language creativity on the JVM. But there's a pretty sizeable cost in speed and memory usage. " -- iofj
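The layout difference described above can be sketched in plain C (sizes assume a typical 64-bit platform; this is an illustration, not Java's actual object layout): two ints stored by value occupy 8 contiguous bytes, while the boxed equivalent needs two pointers plus two separate heap objects, and every use chases a pointer.

  #include <stdlib.h>

  /* By-value layout: 8 contiguous bytes; cache- and register-friendly. */
  struct point { int x, y; };

  /* Boxed layout: two pointers (16 bytes) plus two separate heap objects,
     each carrying whatever per-object header the runtime adds. */
  struct boxed_point {
      int *x;
      int *y;
  };

  struct boxed_point make_boxed(int x, int y) {
      struct boxed_point p;
      p.x = malloc(sizeof *p.x);   /* extra allocations (error handling omitted)... */
      p.y = malloc(sizeof *p.y);
      *p.x = x;                    /* ...and an extra pointer to chase on every use */
      *p.y = y;
      return p;
  }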

Chapter : Why high-level languages tend to use more memory

GC

metadata attached to values

e.g. type information, thunks

boxing, e.g. linked list vs. array

prevention of inlining object members into the object struct

prevention of e.g. intrusive linked lists (see the sketch below)
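A C sketch of the intrusive-list point (illustrative only; 'task' and 'node' are invented names): when the language lets you embed the link fields directly in your own struct, each element is a single allocation; when every member must be a separately allocated boxed object, the same list needs an extra node object and an extra pointer hop per element.

  /* Intrusive list: the link lives inside the payload struct.
     One allocation per element, and the link sits next to the data. */
  struct task {
      struct task *next;
      int priority;
      /* ... other payload fields ... */
  };

  /* Non-intrusive (boxed) list: a generic node points at a separately
     allocated payload: two allocations per element, one more pointer to
     follow, plus per-object metadata in a managed runtime. */
  struct node {
      struct node *next;
      void *payload;
  };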

Safety

Undefined behavior

http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html goes into detail about how allowing undefined behavior (or at least undefined value return) makes C faster (and conversely, how safe languages must be either more restrictive, or slower):

Use of an uninitialized variable: no need to autoinitialize

Signed integer overflow: can optimize e.g. "X*2/2" to "X"; "While these may seem trivial, these sorts of things are commonly exposed by inlining and macro expansion." Another example: "for (i = 0; i <= N; ++i) { ... }"; the compiler can assume that the loop will iterate exactly N+1 times (as opposed to i possibly wrapping around if N is large)
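A sketch of the loop case in C: because overflow of the signed counter would be undefined, the compiler may assume it never happens, treat the trip count as exactly N+1, and unroll or vectorize accordingly; with an unsigned counter and N == UINT_MAX, wraparound is defined and the loop would never terminate, so the same assumption cannot be made.

  /* Signed counter: i <= n with overflow undefined, so the compiler may
     treat the trip count as exactly n+1 and optimize accordingly. */
  void scale(int *a, int n) {
      for (int i = 0; i <= n; ++i)
          a[i] *= 2;
  }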

Oversized Shift Amounts: different platforms do different things in this case, so leaving the result undefined lets the compiler just let the platform decide, rather than putting in extra instructions

Dereferences of Wild Pointers and Out of Bounds Array Accesses: "To eliminate this source of undefined behavior, array accesses would have to each be range checked, and the ABI would have to be changed to make sure that range information follows around any pointers that could be subject to pointer arithmetic"

Dereferencing a NULL Pointer: "dereferencing wild pointers and the use of NULL as a sentinel. NULL pointer dereferences being undefined enables a broad range of optimizations: in contrast, Java makes it invalid for the compiler to move a side-effecting operation across any object pointer dereference that cannot be proven by the optimizer to be non-null. This significantly punishes scheduling and other optimizations. In C-based languages, NULL being undefined enables a large number of simple scalar optimizations that are exposed as a result of macro expansion and inlining."

Violating Type Rules: It is undefined behavior to cast an int* to a float* and dereference it (accessing the "int" as if it were a "float"). C requires that these sorts of type conversions happen through memcpy: using pointer casts is not correct and undefined behavior results.... (eg an example is given where a loop that zeros memory is replaced by a call to a platform primitive for this purpose)
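For example (a sketch of the rule; it assumes sizeof(float) == sizeof(uint32_t), which holds on typical platforms): to reinterpret the bits of a float as an integer, the sanctioned route is memcpy, which compilers recognize and usually compile to a single register move, whereas the pointer-cast version violates the aliasing rules.

  #include <stdint.h>
  #include <string.h>

  /* Correct: copy the bytes; no object is read through the wrong type. */
  uint32_t float_bits(float f) {
      uint32_t u;
      memcpy(&u, &f, sizeof u);
      return u;
  }

  /* Undefined behavior: reads a float object through a pointer to uint32_t. */
  uint32_t float_bits_ub(float f) {
      return *(uint32_t *)&f;
  }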

https://kukuruku.co/post/undefined-behavior-and-fermats-last-theorem/ gives some examples where C/C++'s undefined behavior rules lead to paradoxical results. For example, in the following code there is an off-by-one mistake in the loop condition; the <= should be <, because table[4] doesn't exist. If the loop reaches the fifth iteration (i == 4), this code accesses table[4], which is undefined behavior. Therefore, if the loop reaches the fifth iteration, the compiler is allowed to do anything (because the compiler is allowed to do anything in the case of undefined behavior):

  int table[4];

  bool exists_in_table(int v) {
      for (int i = 0; i <= 4; i++) {
          if (table[i] == v)
              return true;
      }
      return false;
  }

So one thing the compiler could choose to do when the fifth iteration is reached is to 'return true'; with that choice, the function returns true in every case. Therefore, the compiler is allowed to optimize this code to:

  int table[4];

  bool exists_in_table(int v) {
      return true;
  }

" John Regehr provides the following selection of examples when undefined behavior led to significant consequences:

    Using uninitialized data as an extra source of randomness caused a compiler to delete an entire seed computation. [http://kqueue.org/blog/2012/06/25/more-randomness-or-less/]
    A compiler can evaluate (x+1)>x to both 0 and 1 in the same program, compiled at the same optimization level, for the same value of x. [http://www.cs.utah.edu/~regehr/papers/overflow12.pdf]
    An infinite loop such as a counterexample search (where no counterexample exists) can be terminated by the compiler, permitting, for example, Fermat’s Last Theorem to be “disproved.” [http://blog.regehr.org/archives/161]
    An undefined shift operation caused a compiler to delete an important safety check in Google’s Native Client. [http://code.google.com/p/nativeclient/issues/detail?id=245]
    A reference to a possibly-null pointer in the Linux kernel caused the compiler to remove a subsequent null check, creating an exploitable vulnerability. [https://isc.sans.edu/diary.html?storyid=6820]
    A compiler can run the code inside if (p) {… } and also inside if (!p) {… } when p is not initialized. [http://markshroyer.com/2012/06/c-both-true-and-false/]
    A division instruction (that potentially crashes the process) can be moved in front of code that has externally visible side effects. [http://blog.regehr.org/archives/232]

For example, this code:

  int a;

  void bar (void) {
      setlinebuf(stdout);
      printf ("hello!\n");
  }

  void foo3 (unsigned y, unsigned z) {
      bar();
      a = y%z;
  }

  int main (void) {
      foo3(1,0);
      return 0;
  }

will have time to print the message before the SIGFPE (from the division by zero) if compiled without optimizations, but will crash right away if optimization is enabled. The program “knows in advance” that it is destined to crash with SIGFPE, and does not even bother printing the message. Chen provides a similar example, but with SIGSEGV.

In 2012, Regehr announced a contest for the craziest consequence of undefined behavior. One of the winners took advantage of the fact that using a pointer after it has been passed to realloc() is undefined behavior. His program prints different values for the same pointer:

  #include <stdio.h>
  #include <stdlib.h>

  int main() {
      int *p = (int*)malloc(sizeof(int));
      int *q = (int*)realloc(p, sizeof(int));
      *p = 1;
      *q = 2;
      if (p == q)
          printf("%d %d\n", *p, *q);
  }

$ clang -O realloc.c; ./a.out

1 2

...

But nothing can compare with Regehr’s example: the disproof of Fermat’s Last Theorem by the compiler... " -- [3]

In many cases (and in even more cases before 2011, when the C 1999 (C99) standard was in effect [4]), the C standard allows the compiler to assume that almost every loop terminates [5]. In 2010, Regehr ran the following program, which terminates if and only if it finds a counterexample to Fermat's Last Theorem (the theorem was proven true in 1994, so no counterexample exists):

  int fermat (void) {
      const int MAX = 1000;
      int a=1,b=1,c=1;
      while (1) {
          if (((a*a*a) == ((b*b*b)+(c*c*c)))) return 1;
          a++;
          if (a>MAX) { a=1; b++; }
          if (b>MAX) { b=1; c++; }
          if (c>MAX) { c=1; }
      }
      return 0;
  }

  #include <stdio.h>

  int main (void) {
      if (fermat()) {
          printf ("Fermat's Last Theorem has been disproved.\n");
      } else {
          printf ("Fermat's Last Theorem has not been disproved.\n");
      }
      return 0;
  }

regehr@john-home:~$ icc fermat2.c -o fermat2
regehr@john-home:~$ ./fermat2
Fermat's Last Theorem has been disproved.
regehr@john-home:~$ suncc -O fermat2.c -o fermat2
"fermat2.c", line 20: warning: statement not reached
regehr@john-home:~$ ./fermat2
Fermat's Last Theorem has been disproved.

It appears that the program has found a counterexample to Fermat's Last Theorem! So what is the counterexample, i.e. what a, b, c values has it found? Regehr added printing of the found values; the revised program hung in an infinite loop.

What happened was that the compiler deduced that the only way out of the while(1) loop in finite time was the 'return 1'. There are no side effects in this loop (before the printing of the found values was added). According to some interpretations of the C 1999 standard ('C99'), the compiler is allowed to assume that this loop terminates. Therefore, the compiler optimized the loop to just "return 1".

When the printing of the found values was added, the loop had side effects, so according to the standard the compiler was no longer allowed to optimize the loop away; the program then ran as expected, i.e. it looped forever. (Another way to block the compiler from optimizing away the loop would have been to access volatile variables within the loop.)

A list of some other undefined behaviors in C++ is found at https://stackoverflow.com/questions/367633/what-are-all-the-common-undefined-behaviours-that-a-c-programmer-should-know-a

Chapter : Why some languages take a long time to compile

todo

Case studies

scala

Chapter : Why some languages take a long time to startup

See my StackOverflow question: http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup/

Here is my summary answer to that question: http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup/28379292#28379292

Chapter : choices you can make to make your language simpler to implement

Choose an easy syntax

A good choice would be LL(1) syntax, or at least LL(k), with no symbol table required at parse time. This enables you to write a simple recursive descent parser or to use a wide variety of existing tools.
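As a sketch of why LL(1) is convenient (a toy grammar, not taken from any particular language): each grammar rule becomes one function, and a single token of lookahead is enough to decide which production to take. Here is a minimal recursive descent evaluator for integer expressions with '+', '*', and parentheses.

  #include <ctype.h>
  #include <stdlib.h>

  /* Toy LL(1) grammar, one function per rule:
       expr   -> term ('+' term)*
       term   -> factor ('*' factor)*
       factor -> NUMBER | '(' expr ')'                                   */

  static const char *p;              /* current position in the input */

  static long expr(void);

  static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

  static long factor(void) {
      skip_ws();
      if (*p == '(') {               /* factor -> '(' expr ')' */
          p++;
          long v = expr();
          skip_ws();
          if (*p == ')') p++;        /* (error handling omitted) */
          return v;
      }
      char *end;                     /* factor -> NUMBER */
      long v = strtol(p, &end, 10);
      p = end;
      return v;
  }

  static long term(void) {           /* term -> factor ('*' factor)* */
      long v = factor();
      skip_ws();
      while (*p == '*') { p++; v *= factor(); skip_ws(); }
      return v;
  }

  static long expr(void) {           /* expr -> term ('+' term)* */
      long v = term();
      skip_ws();
      while (*p == '+') { p++; v += term(); skip_ws(); }
      return v;
  }

  long eval(const char *s) { p = s; return expr(); }   /* eval("2*(3+4)") == 14 */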

todo: list of languages with LL(k) grammars: Python, what else?

Restrict the scope of runtime metaprogramming

If code anywhere in the program can, at any time, redefine the addition operator to do something else (this occurs, for example, with some forms of 'open classes'), and if this redefinition changes the meaning of the addition operator throughout the whole program, then the compiler cannot optimize very much: it must not merge additions with other operations in a way that forgets where the original additions occurred, and it cannot assume that addition has its usual properties (no side effects, deterministic, no chance of throwing an exception, associative, commutative, 0 is an identity, etc.). In addition, either the runtime must be able to at least partially recompile the rest of the program, or every addition must be compiled as an indirect function call rather than a machine instruction.

Even if the program does not redefine addition, if the programming language allows the possibility of doing so, then the compiler must prove to itself statically (at compile time) that this particular program will not do it before it can optimize addition or use machine instructions for it. This could be tricky if the program contains any instance of 'eval' applied to a dynamically generated string.

The less the language is capable of runtime metaprogramming, the less this sort of thing will come up. If metaprogramming is possible, but all of it happens at compile time (e.g. compile-time macros), this problem is solved.

If runtime metaprogramming is permitted by the language, then consider restricting its lexical scope, so that the compiler can easily be sure that at least most of the code (the code outside the lexical scope of the runtime metaprogramming) is not affected. For example, a form of 'eval' might be provided that can take input and return output, and use metaprogramming within the 'eval' itself, but that does not have the power to affect the interpretation of expressions outside of the 'eval'.

Another way out (or at least a way to 'contain the damage') is to allow metaprogramming to define new terms but never to redefine the meaning of terms already defined at compile time.

One construct that is not often thought of as 'metaprogramming' but that can cause the same sort of problem is the 'import' statement, especially if it operates at runtime. If an import statement can redefine existing terms at runtime, then the compiler cannot finish dealing with expressions containing such a term until all possible imports that might affect those expressions have been resolved. Dynamic interpreted languages such as Ruby sometimes have import constructs of this sort [6].

Note that even restricted-scope, non-redefining metaprogramming (for example, an 'eval' that cannot affect anything besides returning a result) can force the language implementation to include an interpreter in the runtime (if not in every program, at least in those programs making use of constructs similar to 'eval').

Examples:

Don't use multiple inheritance in a statically typed, compiled language

" Single inheritance is convenient, because it simplifies many aspects of the implementation. Objects can be extended just by appending fields; a cast to the supertype just involves ignoring the end, and a cast to a subtype just involves a check—the pointer values remain the same. Downcasting in C++ requires a complex search of the inheritance graph in the run-time type information via a runtime library function. " -- The Challenge of Cross-language Interoperability by David Chisnall

Chapter : Constraints that performance considerations of the language and the toolchain impose on language design

Whole program analysis

other stuff that Go does (later: what?!?)

a sequence point is "a point in the program's execution sequence where all previous side effects SHALL have taken place and all subsequent side-effects SHALL NOT have taken place" -- http://www.slideshare.net/olvemaudal/deep-c

If each computation step is a sequence point, then the programmer doesn't have to think too much about what is going on, but if there are few sequence points, there is more room for optimization. Similarly, if the order of evaluation of expressions is unspecified, then the compiler can optimize more.
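For example (a small C sketch): the two operands of '+' below may be evaluated in either order, so the compiler is free to schedule the calls however is cheapest, and a program that depends on one particular order is already broken.

  #include <stdio.h>

  static int f(void) { printf("f "); return 1; }
  static int g(void) { printf("g "); return 2; }

  int main(void) {
      int x = f() + g();             /* may print "f g " or "g f " */
      printf("%d\n", x);
      return 0;
  }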


something about memory model and reordering of statements e.g. java volatile

if you pad data structures to fall on word boundaries, you get better performance
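The related mechanics in C (a sketch; exact sizes depend on the ABI, but these are typical for 64-bit targets): the compiler inserts padding so that each field falls on its natural alignment, and field order changes how much padding is needed.

  #include <stdint.h>

  /* Typical 64-bit layout: 1 byte + 7 bytes padding + 8 + 1 + 7 bytes padding
     = 24 bytes, so that the double is 8-byte aligned. */
  struct padded {
      uint8_t a;
      double  b;
      uint8_t c;
  };

  /* The same fields reordered: 8 + 1 + 1 + 6 bytes trailing padding = 16 bytes. */
  struct repacked {
      double  b;
      uint8_t a;
      uint8_t c;
  };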

what else?

---

" qznc 1 day ago [-]

While I generally agree with your point, D/Walter Bright slightly disagrees with the sentiment. One constraint for D is "must not require data flow analysis" because that incurs a cost on compilation time. Afaik this was mostly inspired by Javas definite assignment rule [0]. If fast compilation is a goal, then a language is constrained by implementation and IR aspects.

[0] https://docs.oracle.com/javase/specs/jls/se6/html/defAssign.html

" -- [8]

---