proj-plbook-plChControlImpl

todo: a lot of this affects language semantics, not just implementation. Move elsewhere.

Upvariables and upvalues

If inner functions can access local variables from enclosing lexical scopes ('upvariables'), then how are references to these variables implemented?

If local variables are allocated on the stack, then typically they are referenced via a known offset from the beginning of the current stack frame (for example, local variable 'a' might be found at offset 5 from the current stack frame, local variable 'b' might be found at offset 7, etc). In essence, after compilation, the name of the local variable 'a' is 'local variable #5'. However, a local variable from an enclosing scope will be stored before the beginning of the current stack frame.

It is not sufficient to use a constant negative offset, because, due to recursion (eg if the inner function calls itself), there may be a variable number of frames in between the current frame and the referenced variable.

Lua's VM, for example, provides the bytecode instructions GETUPVAL and SETUPVAL, which access upvariables. Confusingly, Lua calls this concept 'upvalues' rather than 'upvariables', which suggests immutability; in fact Lua 'upvalues' are mutable, and closures that capture the same variable share it.

If upvariables are mutable, another design choice is whether the state of these variables is local to each closure (see below) or shared between all functions with lexical access to the upvariable. Go, for example, takes the shared approach: its closures capture variables by reference, so every closure with access to a given upvariable sees the same state.

Another class of solution, which I will call 'upvalues' (although I mean something different from what Lua means here), is to allow references to the values of variables in enclosing scopes, but not to allow those variables to be mutated. These semantics allow an implementation to copy each relevant outer local variable into the stack frame of the inner function at the time the inner function is 'instantiated' in the outer scope. The inner function can perceive these variables as immutable, or it can mutate them, but in the latter case the mutations will not affect the outer function's copies. The time of 'instantiation' can vary; if a first-class function value is created, the values of the variables could be copied at that time, or alternately they could be copied only at the time that function is actually called.
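
To make copy-at-instantiation semantics concrete, here is a sketch that abuses Python's default-argument idiom (default values are evaluated once, at function definition time) to force a copy; the function names are hypothetical:

def outer():
    a = 1
    # the default value is evaluated here, at 'instantiation' time,
    # so inner gets a private copy of a == 1
    def inner(a=a):
        a = a + 10   # mutates only inner's private copy
        return a
    a = 2            # not seen by inner: it copied a == 1 earlier
    return inner()

print(outer())   # prints 11: inner saw the copy made at definition time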

If mutability is desired, a related 'solution' is global variables: don't allow access to local variables from enclosing scopes, but do allow global variables. When such access is desired, declare the variable global in both the enclosing scope and the inner scope. This has the usual disadvantages of global variables: (a) if some other function happens to use the same global variable name for something different, there will be a conflict. This can be partially ameliorated with 'module-scoped globals', e.g. each module has its own global variable namespace (we assume that each module is written by one team, and a module's team can take care not to reuse their own names). But this still leaves another disadvantage: (b) you cannot have multiple distinct instances of the same global variable at different points in the call stack. For example, if function 'outer' uses global variable 'g' and calls function 'inner', and 'inner' then calls 'outer' again, then this second instance of 'outer' cannot change the value of 'g' without overwriting the old value, which will be seen by the first instance of 'outer'. So global variables are not a real substitute for upvariables.

Python's solution, for example, is upvalues. At parse time, Python determines which variable references in an inner scope are upvalue references: a reference is treated as an upvalue reference if that variable is never written to anywhere in the inner scope. (Note that the classification of variables as mutable locals vs. upvalues is per-scope; even if a variable is first read and only later written, the later write causes it to be categorized as a mutable local, not an upvalue, even for the earlier read.) For first-class functions, upvalues are read at function call time, not at function creation time. To see this behavior of Python, consider these examples:

Python example 1:

def outer():
    a = 1
    def inner():
         print 'inner: %s' % a
         a = 5
         print 'inner after: %s' % a
    f = inner
    print 'outer before: %s' % a
    a = 2
    f()
    print 'outer after: %s' % a

outer()

Result:

...
----> 5          print 'inner: %s' % a
...
UnboundLocalError: local variable 'a' referenced before assignment

Here, the variable 'a' was written to in the inner scope, causing it to be classified at parse time as a mutable local variable rather than an upvalue. It was then read before it was assigned, causing an UnboundLocalError at runtime.

Python example 2:

def outer():
    global a
    a = 1
    def inner():
         global a
         print 'inner: %s' % a
         a = 5
         print 'inner after: %s' % a
    f = inner
    print 'outer before: %s' % a
    a = 2
    f()
    print 'outer after: %s' % a

outer()

Result:

outer before: 1
inner: 2
inner after: 5
outer after: 5

Here the variable is a module-scoped variable (all globals in Python are really module-scoped 'globals'). It appears to be behaving similarly to an upvariable; but see the next example.

Python example 3:

def outer(x):
    global a
    a = 1
    def inner():
         global a
         print 'inner: %s' % a
         a = 5
         outer(0)
         print 'inner after: %s' % a
    f = inner
    print 'outer %d before: %s' % (x, a)
    a = 2
    if x:
        f()
    print 'outer %d after: %s' % (x, a)
    a = 20

outer(1)

Result:

outer 1 before: 1
inner: 2
outer 0 before: 1
outer 0 after: 2
inner after: 20
outer 1 after: 20

If this were really an upvariable, and not a global variable, then the 'a = 20' in the recursive call to 'outer' would not have any effect on the caller, because upvariables are lexically scoped, and 'inner' is not an enclosing lexical scope to 'outer'. This example shows the difference between upvariables and global variables.

Finally, the following two examples show that upvalues (not upvariables) are supported by Python:

Python example 4:

def outer():
    a = 1
    def inner():
         print 'inner: %s' % a
    f = inner
    print 'outer before: %s' % a
    a = 2
    f()
    print 'outer after: %s' % a

outer()

Result:

outer before: 1
inner: 2
outer after: 2

Python example 5:

def outer(x):
    a = 1
    def inner():
         print 'inner: %s' % a
         outer(0)
         print 'inner after: %s' % a
    f = inner
    print 'outer %d before: %s' % (x, a)
    a = 2
    if x:
        f()
    print 'outer %d after: %s' % (x, a)
    a = 20

outer(1)

Result:

outer 1 before: 1
inner: 2
outer 0 before: 1
outer 0 after: 2
inner after: 2
outer 1 after: 2

Here, because the variable is a local, not a global, the assignment 'a = 20' at the end of the recursive call to 'outer' had no effect on the caller.
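
(As an aside, Python 3 does add true upvariables via the 'nonlocal' declaration. The following sketch, which assumes Python 3, shows the shared, mutable behavior that the examples above lack:)

def outer():
    a = 1
    def inner():
        nonlocal a                # 'a' is now a true upvariable
        print('inner: %s' % a)
        a = 5                     # this write is shared with outer
    print('outer before: %s' % a)
    a = 2
    inner()
    print('outer after: %s' % a)  # prints 5: inner's write is visible

outer()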

What about dynamically bound variables? (todo: a lot of this should be moved out of the implementation section, to the scoping section) How do they differ from upvariables? They differ in where variables not found in the local scope are looked up: in dynamic scopes (up the call stack; first in the caller, then in the caller's caller, etc), or in lexical scopes (first in the lexically enclosing function, then in the function lexically enclosing that one, etc). For example:

def outer():
    a = 3
    def inner():
        print a
    return inner

def other():
    f = outer()
    a = 5
    f()

other()

would print 5 with dynamic variables but 3 with lexical upvalues or upvariables. The dynamic answer would be 5 because at the time f() is called, 'a' is not found in f's local scope, so we look up the call stack, starting with the caller, which is 'other'. The lexical answer would be 3 because we would look up the chain of lexical scopes, starting with the enclosing lexical scope, which is 'outer'.
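
To make the dynamic lookup rule concrete, here is a sketch assuming an interpreter that keeps an explicit stack of per-call environments (all names here are hypothetical):

call_stack = []   # one dict of variable bindings per active call

def lookup(name):
    # dynamic lookup: search the newest frame first, then the caller's
    # frame, and so on up the call stack
    for frame in reversed(call_stack):
        if name in frame:
            return frame[name]
    raise NameError(name)

def call(func):
    call_stack.append({})   # push a fresh frame for this call
    try:
        return func()
    finally:
        call_stack.pop()    # bindings vanish when the call returns

def inner():
    print(lookup('a'))

def other():
    call_stack[-1]['a'] = 5   # 'a' is bound in other's own frame
    call(inner)

call(other)   # prints 5: inner finds 'a' in its caller's frame

Note that once other() returns, its frame is popped and its binding of 'a' disappears; this is exactly the dynamic-vs-global difference discussed next.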

Dynamically scoped variables vs. globals: with dynamically scoped variables, if a change is made to a variable's value, that change is seen by functions below the changer in the call chain (its direct and indirect callees), but not by functions above the changer once it returns. For example:

def outer():
    a = 3
    def inner():
        print a
    return inner

def other():
    f = outer()
    a = 5
    f()

a = 3
other()
print a

would print '5' followed by '3' using dynamic variables, but '5' followed by '5' using global variables: with dynamic variables, the binding made by 'a = 5' inside other() disappears when other() returns, so it does not affect the outer scope.

The upvalue approach is also taken in Java [1].

Links:

Closures (and continuations)

If a closure can be created, then the values of the variables stored in the closure must be able to persist past the time when the function that originally created the closure returns. Note that this is necessary if continuations other than escape continuations are supported.

The typical implementation of local variables inside stack frames does not suffice for this, because those stack frames will be lost when the function that created the closure returns. One solution is to allocate all 'stack frames' on the heap instead of the stack (this is why a more general term for 'stack frame' is 'activation record'). However, this is slower than using the stack.
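
A sketch of the heap-allocated approach, representing each activation record as a dict with a pointer to the lexically enclosing record (the names here are hypothetical):

def make_frame(parent):
    return {'__parent__': parent}

def lookup(frame, name):
    # walk the chain of lexically enclosing activation records
    while frame is not None:
        if name in frame:
            return frame[name]
        frame = frame['__parent__']
    raise NameError(name)

def outer():
    frame = make_frame(None)
    frame['a'] = 1
    def inner():
        # inner's record points at outer's record, which stays alive
        # on the heap as long as this closure exists
        my_frame = make_frame(frame)
        return lookup(my_frame, 'a')
    frame['a'] = 2
    return inner

f = outer()   # outer has returned, but its record is still referenced
print(f())    # prints 2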

Optimizations include static analysis to deduce which activation records need to be created on the heap and which can be placed on the stack.

Another solution is similar to 'upvalues', above: at closure creation time, a copy is made of the values of the variables captured by the closure. In this solution, different instances of the same closure cannot share mutable variable state (unlike typical closures, in which all closures created in the same calling scope share upvariable state). This works well in languages with immutable data, such as ML [2].

The term "funarg problem" refers to the design choice of whether and how to implement first-class closures (how to pass around first class functions, or function arguments, which refer lexically to variables defined in the enclosing scope). It is divided into the 'downward funarg problem' (how to pass closures down the call stack) and the 'upward funarg problem' (how to pass closures up the call stack).

https://en.wikipedia.org/wiki/Funarg_problem

https://en.wikipedia.org/wiki/Parent_pointer_tree#Use_in_programming_language_runtimes

http://matt.might.net/articles/closure-conversion/

(see also https://en.wikipedia.org/wiki/Parent_pointer_tree#Use_as_call_stacks )

Trampolines

https://en.wikipedia.org/wiki/Trampoline_%28computing%29#High-level_programming

https://en.wikipedia.org/wiki/Tail_call#Through_trampolining

todo
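
As a sketch of the idea (hypothetical names; it assumes the final result itself is not callable): each function, instead of making a tail call directly, returns a zero-argument thunk for the trampoline to invoke, so the host language's stack never grows:

def trampoline(thunk):
    # keep invoking thunks until we get a non-callable final result
    while callable(thunk):
        thunk = thunk()
    return thunk

def even(n):
    return True if n == 0 else (lambda: odd(n - 1))

def odd(n):
    return False if n == 0 else (lambda: even(n - 1))

# mutually recursive to depth 100000 without growing the Python stack:
print(trampoline(lambda: even(100000)))   # True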

Tail call optimization

http://akkartik.name/post/swamp

Continuation-passing

https://en.wikipedia.org/wiki/Continuation-passing_style

todo

" This style adds an additional item to each function call (you could think of it as a parameter, if you like), that designates the next bit of code to run (the continuation k can be thought of as a function that takes a single parameter). For example you can rewrite your example in CPS like this:

(defun f (x y k) (a x y k))

(defun a (x y k) (+ x y k))

(f 1 2 print)

The implementation of + will compute the sum of x and y, then pass the result to k sort of like (k sum).

Your main interpreter loop then doesn't need to be recursive at all. It will, in a loop, apply each function application one after another, passing the continuation around. " -- http://stackoverflow.com/a/5986168/171761

another example: " Direct style

(define (pyth x y) (sqrt (+ (* x x)(* y y))))

CPS style (k is the continuation)

(define (pyth& x y k)
  (*& x x (lambda (x2)
    (*& y y (lambda (y2)
      (+& x2 y2 (lambda (x2py2)
        (sqrt& x2py2 k))))))))

(define (*& x y k) (k (* x y)))
(define (+& x y k) (k (+ x y)))
(define (sqrt& x k) (k (sqrt x)))
"

" The main interest of CPS is to make explicit several things that are typically implicit in functional languages: returns, intermediate values (= continuation arguments), order of argument evaluation...

Like for SSA, the main interest is to ease optimizations. Theoretically, SSA and CPS are equivalent: a program in SSA form can be transformed into a CPS program and vice versa.

Previous program can be rewritten as:

  x2 = x*x
  y2 = y*y
  x2py2 = x2+y2
  res = sqrt(x2py2)

" -- http://www.montefiore.ulg.ac.be/~geurts/Cours/compil/2014/05-intermediatecode-2014-2015.pdf
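
For comparison, here is the same example in Python (a sketch; the '_k' suffix is just a naming convention for the CPS variants here):

import math

def mul_k(x, y, k):  return k(x * y)
def add_k(x, y, k):  return k(x + y)
def sqrt_k(x, k):    return k(math.sqrt(x))

def pyth_k(x, y, k):
    # every intermediate value gets a name (x2, y2, x2py2), and every
    # 'return' becomes a call to the continuation k
    return mul_k(x, x, lambda x2:
           mul_k(y, y, lambda y2:
           add_k(x2, y2, lambda x2py2:
           sqrt_k(x2py2, k))))

pyth_k(3, 4, print)   # prints 5.0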

'Threading' models for interpreters and computed goto

https://en.wikipedia.org/wiki/Threaded_code

see also [3], section 'computed goto'.

see also:

Deep dives into implementation/actual performance of dispatch methods

Dynamic jumps and static analysis

A few VMs started out with dynamic jumps and then later considered removing them (and possibly replacing with a small set of other less expressive control flow constructs) because they are so expressive that they make static analysis difficult.

See plChJvmLang, section 'issues with the JSR/RET instructions'.

See plChMiscIntermedLangs, section Ethereum Virtual Machine (EVM), subsection 'proposal to remove dynamic jumps to assist static analysis'.

---

"

rurban 353 days ago

parent favorite on: Branch Prediction and the Performance of Interpret...

This research misses a few important points.

First, the fastest interpreter is never switch or "jump threading" (i.e. computed goto) based. It is either based on passing the next pointer from one opcode to the next (e.g. perl5) or using an opcode array (lua). Those two cache much better and don't need the dispatch overhead with a jump table at all. Not talking about asm tuned interpreter dispatch yet, which count as best of the best.

Second, the famous cited Ertl paper mostly suggested to use smaller opcodes to influence the icache. Lua and luajit do this with one-word ops and blow all other interpreters away.

Using the computed goto extension only helps in bypassing the out of bounds check at the start of the switch dispatch, and gains the famous 2-3%, and less on more modern CPU's. When talking about the differences, the differences should be explained.

Summary: This research result is not useful for state of the art interpreters, only for conventional slow interpreters. And for those the result should be ignored at all, as much better interpreter speedup techniques exist, not noted in this paper.

" -- [4]

---

" sklogic 353 days ago

parent favorite on: Branch Prediction and the Performance of Interpret...

The article is missing one important difference between a primitive Python use of jump threading and the OCaml bytecode interpreter which pre-compiles the bytecode into a threaded code. A difference in performance between these two is huge. ...

sklogic:

> Wait a minute, how is that even possible? Or rather, what do you mean?

It means that the stream of bytecodes (32-bit numbers in OCaml) is replaced with the corresponding label addresses (for 64-bit hosts - addresses minus a fixed offset) once. Then the execution is trivial. This is called "indirect threaded code" [1] and used a lot in Forth implementations, for example.

See the definition of the "Next" macro in interp.c [2] for details.

[1] https://en.wikipedia.org/wiki/Threaded_code#Indirect_threadi...

[2] https://github.com/ocaml/ocaml/blob/trunk/byterun/interp.c

EDIT: if you wonder where the actual threading is done, see https://github.com/ocaml/ocaml/blob/trunk/byterun/fix_code.c

loup-vaillant:

OK, I'm rephrasing to make sure I understand.

In the ordinary, C compliant switch code, you would do this:

  int instructions[] = { /* bytecode */ };
  // main loop
  int* ip = instructions;
  while(1) {
    switch (*ip) {
    case 1: /* instruction 1 */ break;
    case 2: /* instruction 2 */ break;
    case 3: /* instruction 3 */ break;
    /* etc */
    }
    ip++; /* goto next instruction. beware jumps */
  }

With a jump threaded implementation, you would do this instead:

  int instructions[] = { /* bytecode */ };
  void jump_table[] = {
    &&lbl1,
    &&lbl2,
    &&lbl3,
    /* etc */
  };
  // main loop
  int* ip = instructions;
  goto *jump_table[*ip];
  lbl1: /* instruction1 */ ip++; goto *jump_table[*ip];
  lbl2: /* instruction2 */ ip++; goto *jump_table[*ip];
  lbl3: /* instruction3 */ ip++; goto *jump_table[*ip];
  /* etc */

Which means, instead of having the compiler constructing a jump table under the hood, I do this myself, and get the benefit of jumping from several locations instead of just one. But I still look up that table. Now the indirect threading you speak of:

  int instructions[] = { /* bytecode */ };
  void jump_table[] = {
    &&lbl1,
    &&lbl2,
    &&lbl3,
    /* etc */
  };
  // translating bytecode into adresses
  void* labels[] = malloc(sizeof(void*) * nb_instructions);
  for (uint i = 0; i < nb_instructions; i++)
    labels[i] = jump_table[instructions[i]];
  // main loop
  void** lp = labels;  
  goto **lp;
  lbl1: /* instruction1 */ lp++; goto **lp;
  lbl2: /* instruction2 */ lp++; goto **lp;
  lbl3: /* instruction3 */ lp++; goto **lp;
  /* etc */

If I got that correctly, instead of accessing the jump table at some random place, I only access the label table in a much more linear fashion, saving one indirection and some memory access in the process —this should relieve some pressure off the L1 cache.

Did I get that right?

sklogic:

Exactly, you got it right. One indirection less, and no jump table in the cache. And CPUs are very well optimised for this kind of indirect jumps, because of vtables.

There is also one step further - emit the actual jump instructions, making a direct threaded code. It is a bit more complicated and not very portable, but if you want to squeeze all the cycles you can it's still an easy and fast option before resorting to a full-scale compiler.

ehaliewicz2:

I think direct threading is usually implemented just by reducing the indirection another level,

e.g.

  goto *lp

rather than

  goto **lp

You could implement an interpreter with generated JMP instructions, which might be called direct-threading, but at least in the forth world, where most jumps are to nested user-defined subroutines, subroutine-threading using JSR instructions is typically used. "

-- [5]
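
The pre-translation idea is not tied to C. As a rough Python analogue (with hypothetical toy opcodes): translate each opcode number into a direct reference to its handler once, before execution, so the dispatch loop does no table lookup per instruction:

# handlers for a toy stack machine
def op_push(stack, arg):  stack.append(arg)
def op_add(stack, arg):   stack.append(stack.pop() + stack.pop())
def op_print(stack, arg): print(stack.pop())

HANDLERS = [op_push, op_add, op_print]   # opcode number -> handler

# raw bytecode: (opcode, argument) pairs computing and printing 2 + 3
code = [(0, 2), (0, 3), (1, None), (2, None)]

# the 'threading' pass: resolve each opcode to its handler once
threaded = [(HANDLERS[op], arg) for (op, arg) in code]

# dispatch loop: no per-instruction table lookup remains
# (a real interpreter would use an explicit pc to support jumps)
stack = []
for handler, arg in threaded:
    handler(stack, arg)   # prints 5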

Coroutines: Cooperative multitasking and yield

https://en.wikipedia.org/wiki/Coroutine

todo
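
Python's generators already provide the core mechanism: 'yield' suspends a function, saving its frame, and next() resumes it. A minimal round-robin scheduler sketch (the task names here are hypothetical):

def task(name, steps):
    for i in range(steps):
        print('%s: step %d' % (name, i))
        yield                 # cooperatively give up control

def run(tasks):
    # round-robin scheduler: resume each task in turn until all finish
    tasks = list(tasks)
    while tasks:
        t = tasks.pop(0)
        try:
            next(t)
            tasks.append(t)   # still alive; reschedule it
        except StopIteration:
            pass              # task finished

run([task('A', 2), task('B', 3)])   # interleaves A and B's steps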

Stackless

http://stackoverflow.com/questions/5986058/how-does-one-implement-a-stackless-interpreted-language

"...call frames are stored on the heap, rather than in a stack. This makes it simple to implement closures as you can then treat a call frame as a first class object that is available for garbage collection once the closure is unreferenced. I think of the call frames as being implemented as a kind of graph structure as opposed to a flat structure in other languages." -- http://stackoverflow.com/a/1016232/171761

"The modern operating systems we have (Windows, Linux) operate with what I call the "big stack model". And that model is wrong, sometimes, and motivates the need for "stackless" languages.

The "big stack model" assumes that a compiled program will allocate "stack frames" for function calls in a contiguous region of memory, using machine instructions to adjust registers containing the stack pointer (and optional stack frame pointer) very rapidly. This leads to fast function call/return, at the price of having a large, contiguous region for the stack. Because 99.99% of all programs run under these modern OSes work well with the big stack model, the compilers, loaders, and even the OS "know" about this stack area. One common problem all such applications have is, "how big should my stack be?". With memory being dirt cheap, mostly what happens is that a large chunk is set aside for the stack (MS defaults to 1Mb), and typical application call structure never gets anywhere near to using it up. But if an application does use it all up, it dies with an illegal memory reference ("I'm sorry Dave, I can't do that"), by virtue of reaching off the end of its stack.

Most so-called "stackless" languages aren't really stackless. They just don't use the contiguous stack provided by these systems. What they do instead is allocate a stack frame from the heap on each function call. The cost per function call goes up somewhat; if functions are typically complex, or the language is interpretive, this additional cost is insignificant. (One can also determine call DAGs in the program call graph and allocate a heap segment to cover the entire DAG; this way you get both heap allocation and the speed of classic big-stack function calls for all calls inside the call DAG).

There are several reasons for using heap allocation for stack frames:

1) If the program does deep recursion dependent on the specific problem it is solving, it is very hard to preallocate a "big stack" area in advance because the needed size isn't known. One can awkwardly arrange function calls to check to see if there's enough stack left, and if not, reallocate a bigger chunk, copy the old stack and readjust all the pointers into the stack; that's so awkward that I don't know of any implementations. Allocating stack frames means the application never has to say it's sorry until there's literally no allocatable memory left.

2) The program forks subtasks. Each subtask requires its own stack, and therefore can't use the one "big stack" provided. So, one needs to allocate stacks for each subtask. If you have thousands of possible subtasks, you might now need thousands of "big stacks", and the memory demand suddenly gets ridiculous. Allocating stack frames solves this problem. Often the subtask "stacks" refer back to the parent tasks to implement lexical scoping; as subtasks fork, a tree of "substacks" is created called a "cactus stack".

3) Your language has continuations. These require that the data in lexical scope visible to the current function somehow be preserved for later reuse. This can be implemented by copying parent stack frames, climbing up the cactus stack, and proceeding."

-- http://stackoverflow.com/a/1053159/171761 (from http://stackoverflow.com/questions/1016218/how-does-a-stackless-language-work )

" What does it really mean for them to be stackless? Does it mean they don't use a call stack?

Yes, that's about right.

    If they don't use a call stack, what do they use?

The exact implementation will, of course, vary from language to language. In Stackless Python, there's a dispatcher which starts the Python interpreter using the topmost frame and its results. The interpreter processes opcodes as needed one at a time until it reaches a CALL_FUNCTION opcode, the signal that you're about to enter into a function. This causes the dispatcher to build a new frame with the relevant information and return to the dispatcher with the unwind flag. From there, the dispatcher begins anew, pointing the interpreter at the topmost frame.

Stackless languages eschew call stacks for a number of reasons, but in many cases it's used so that certain programming constructs become much easier to implement. The canonical one is continuations. Continuations are very powerful, very simple control structures that can represent any of the usual control structures you're probably already familiar with (while, do, if, switch, et cetera).

If that's confusing, you may want to try wrapping your head around the Wikipedia article, and in particular the cutesy continuation sandwich analogy:

    Say you're in the kitchen in front of the refrigerator, thinking about a sandwich. You take a continuation right there and stick it in your pocket. Then you get some turkey and bread out of the refrigerator and make yourself a sandwich, which is now sitting on the counter. You invoke the continuation in your pocket, and you find yourself standing in front of the refrigerator again, thinking about a sandwich. But fortunately, there's a sandwich on the counter, and all the materials used to make it are gone. So you eat it.

" -- http://stackoverflow.com/a/796233/171761

" They don't use a call stack, because they operate in continuation-passing style. If you're not familiar with tail call optimization, that's probably a good first step to understanding what this means.

To emulate the traditional call/return on this model, instead of pushing a return address and expecting the remainder of the frame to remain untouched the caller closes over the remainder of its code and any variables that are still needed (the rest are freed). It then performs a tail call to the callee, passing this continuation as an argument. When the callee "returns", it does so by calling this continuation, passing the return value as an argument to it.

As far as the above goes, it's merely a complicated way to do function calls. However, it generalizes very nicely to more complex scenarios:

    exception/finally/etc blocks are very easily modeled - if you can pass one "return" continuation as an argument, you can pass 2 (or more) just as easily. lisp-y "condition handler" blocks (which may or may not return control to the caller) are also easy - pass a continuation for the remainder of this function, that might or might not be called.
    Multiple return values are similarly made easy - pass several arguments on to the continuation.
    Return temporaries/copying are no longer different than function argument passing. This often makes it easier to eliminate temporaries.
    Tail recursion optimization is trivial - the caller simply passes on the "return" continuation it received rather than capturing a new one.

" -- http://stackoverflow.com/a/796284/171761

Links:

---

http://matt.might.net/articles/implementing-exceptions/

implements exceptions and loops (while with break and continue) in terms of escape continuations
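
That article builds exceptions out of continuations; going the other direction, in a language with exceptions but no first-class continuations, an escape continuation can be simulated with an exception, as in this hypothetical sketch:

class _Escape(Exception):
    def __init__(self, value):
        self.value = value

def with_escape(body):
    # call body(k), where k is a one-shot escape continuation:
    # invoking k(v) aborts body and makes with_escape return v
    def k(value=None):
        raise _Escape(value)
    try:
        return body(k)
    except _Escape as e:
        return e.value

# 'break with a value' expressed via an escape continuation:
def first_match(pred, items):
    def body(brk):
        for x in items:
            if pred(x):
                brk(x)   # escape the loop immediately
        return None
    return with_escape(body)

print(first_match(lambda x: x > 3, [1, 2, 4, 8]))   # prints 4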

---

https://wiki.c2.com/?ContinuationImplementation

https://github.com/rain-1/continuations-study-group/wiki/Questions

https://github.com/rain-1/continuations-study-group/wiki/Topics

---

Tail calls

---

General links on control implementation