proj-plbook-plChAdministrationImpl

Table of Contents for Programming Languages: a survey

Administrative Automation

Language-supported management functionality (centralized administration).

When the language implementation takes responsibility for various difficult, tedious, functionality in order to make things easy for the programmer.

Chapter : hard things to make it easy: memory management and garbage collection

There are two main strategies for memory management:

There are also various points in the middle of these extremes, for example, Automatic Reference Counting in which the language handles most memory management but some situations (namely, reference cycles) still require programmer attention. We'll refer to some of these as 'semi-automatic memory management'

Links:

Manual memory management (and general allocation issues)

" In C: malloc and free In object oriented languages: object constructors and destructors " -- http://www.montefiore.ulg.ac.be/~geurts/Cours/compil/2014/06-finalcode-2014-2015.pdf

Typically uses one or more 'free lists'. This is a list of free areas of memory. One trick to reduce fragmentation is to restrict memory allocation sizes to powers of 2, and then to have one free list for each available power of two (can also reallocate a block of the next-biggest size if the free list for the requested size is empty; if you ever split blocks then you'll also want to walk the free list upon free() and merge consecutive free blocks).

Links:

https://news.ycombinator.com/item?id=32288087 claims that "Quick fit" as described at http://www.flounder.com/memory_allocation.htm "is the algorithm almost everyone uses"

Semi-automatic memory management

Automatic Reference Counting

compiler insertion of 'retain' and 'release' messages to objects to alter their reference count

Automatic memory management

"tracing garbage collectors operate on and attempt to find live objects, while reference counters operate on and attempt to find dead ones" -- https://lobste.rs/s/ce1u9d/how_not_learn_rust#c_baqddu, citing https://courses.cs.washington.edu/courses/cse590p/05au/p50-bacon.pdf

tracing garbage collectors

" For tracing, some of the major advances have been iterative copying collection [15], generational collection [41, 1], constant-space tracing [36], barrier optimization techniques [13, 45, 46], soft real-time collection [2, 7, 8, 14, 26, 30, 44], hard real-time collection [5, 16, 23], distributed garbage collection [29], replicating copying collection [34], and multiprocessor concurrent collection [21, 22, 27, 28, 39]. For reference counting, some of the major advances have been incremental freeing [42], deferred reference counting [20], cycle collection [17, 32, 6], compile-time removal of counting opera- tions [9], and multiprocessor concurrent collection [3, 19, 31]. " [2]

Links:

Generational references:

Advanced tracing GC:

Advanced reference counting:

example or pedagogical or toy GCs with code:

production or advanced research GCs/memory management with code:

Discussions:

" Conventual wisdom is the overhead of write barriers is somewhere around 5-10%. There are tricks you can use to reduce it, though; you can coalesce the barriers for objects together into one block of operations if you have a JIT, only do the write barrier if the GC is in a collecting state (and then the overhead is 0% most of the time!), skip the barrier if you know an object is short lived and doesn't escape a scope, etc. " -- chc4

Copying GCs

todo

"The fewer values that survive - the less work to do... So it has a counter-intuitive behavior: the larger percent of your values are garbage - the faster it works. " -- [5]

Links:

Immutable data and GC

"Data immutability forces us to produce a lot of temporary data but it also helps to collect this garbage rapidly. The trick is that immutable data NEVER points to younger values. Indeed, younger values don't yet exist at the time when an old value is created, so it cannot be pointed to from scratch. And since values are never modified, neither can it be pointed to later. This is the key property of immutable data." -- [6]

case study: Erlang

"it is a Generational Copying garbage collection that runs inside each Erlang process private heap independently, and also a Reference Counting garbage collection occurs for global shared heap." -- [7]

"one of the main reasons that makes Erlang a soft realtime platform from its garbage collection point of view...each process has its own private heap and its own GC, so each time GC occurs inside a process it just stops the Erlang process which is being collected, but doesn’t stop other processes, and this is what a soft realtime system needs." -- [8]

Links:

case study: Python

case study: Lua

todo see [[proj-oot-memoryManagementNotes1?]]

case study: Inferno/Limbo

todo see [[proj-oot-memoryManagementNotes1?]]

case study: JVM

case study: Haskell GHC

"Haskell computations produce a lot of memory garbage - much more than conventional imperative languages...because data are immutable...But GHC is able to efficiently manage garbage collection, so it's not uncommon to produce 1gb of data per second (most part of which will be garbage collected immediately)." -- [9]

"GHC uses generational GC. New data are allocated in 512kb "nursery". Once it's exhausted, "minor GC" occurs - it scans the nursery and frees unused values. Or, to be exact, it copies live values to the main memory area." -- [10]

Links:

case study: LuaJIT

http://wiki.luajit.org/New-Garbage-Collector

case study: OCaml

Allocators

Yeah, I wasn’t clear in my terminology, sorry. I was referring to the common design where the consumer can directly allocate from a cache of objects freed by that thread but allocated by other threads. That’s what I think is commonly referred to as a “tcache” in tcmalloc and (later) jemalloc. This of course leads to false sharing and requires complex GC mechanisms to eventually return cached objects to their allocating heap or to a global heap when the cache gets too large. The key difference with modern allocators like snmalloc and mimalloc is that they never allocate from memory originally allocated by another thread, right? 2 david_chisnall 4 days ago (unread)
link flag
link flag
    The key difference with modern allocators like snmalloc and mimalloc is that they never allocate from memory originally allocated by another thread, right?That’s more or less what I said, but I oversimplified very slightly. It’s true for all small and medium allocations but for large allocations we allocate entire pages with mmap or a platform equivalent. When you free one of these, it’s cached locally in the freeing allocator and then returned to a central pool (snmalloc never returns address space, only physical memory), where it can then be used to satisfy later large allocations or can be split up to provide new chunks. Aside from that corner case, I agree that this is the correct characterisation. " -- [18] * https://lobste.rs/s/kfwxvu/microsoft_mimalloc#c_hvxzcs

Discussions about allocators:

Misc / todo

Linux memory management intro: sbrk, mmap, malloc, ELF:

Clang memory management (search):

FreeRTOS? memory management intro:

python's similar sized obj pools

Joel Spolsky's suggestion that memory allocation sizes should be in powers of 2 to minimize fragmentation: http://www.joelonsoftware.com/articles/fog0000000319.html

reference counting misses cycles reference count overflows? one soln is by having the reference counters have at least as large bitwidth as the system addressing (eg Python does this)

tracing garbage collection pauses tend to come from: heap compaction and root scanning, both of which must be atomic one fix for heap compaction: write barriers another fix for heap compaction: max size of memory allocation region

finalizers what to do about finalizers when there is a reference cycle? CPython's solution; don't GC objects in a reference cycle if there are finalizers

haskell's pointer maps precise garbage collectors need to know what is a pointer and what is not

handles (doubly-indirect pointers) let you relocate a block of memory and change only one pointer

https://news.ycombinator.com/item?id=10001519

maximilianburke 1 hour ago

parent flag

One of my work projects involved building a .NET runtime with LLVM. It evolved from conservative garbage collection through to an advanced precise collector supporting a multithreaded runtime.

The conservative collector required no support from the runtime as they operate by treating every word-sized value on the stack and heap as a potential reference. All you need for a conservative collector is to be able to know where your stack starts and ends, where your heap begins and ends, and where your globals are.

We went through a couple designs for precise collectors and they all used the same meta-information. A precise collector needs to know what is a reference and what is not so your compiler needs to emit stack maps (tables of offsets where, given your PC in a function, all the live objects are), global root information, as well as object layout information.

Sometimes this becomes more complex. Object layout information included both location of references within type instances as well as whether or not an object is an array. Arrays of value types require both.

.NET also has a type of reference, a managed pointer, which can be used to refer to the middle of an object, such as a pointer to a value type in the middle of an array, which contributes to the parent object's liveliness information.

To sum up, conservative garbage collection can be easy to add. Precise garbage collection much less so. Language specifics can throw a big wrench into things.

https://news.ycombinator.com/item?id=9999360

cfallin 12 hours ago

If you want an imprecise (conservative) garbage collector, the GC and the compiler are almost disjoint. The GC only needs to know where the heap and stack are, and be able to capture register values at GC time. In other words, it's a runtime-library issue, not a compiler issue. (See the Boehm GC for an example that works with a stock C compiler.)

If you want a precise garbage collector, you need type information to know which heap and stack slots and registers are pointers. This is much harder because your compiler needs to emit metadata for each struct and then keep stack frame info or somesuch at the GC safe points so the runtime knows what the thread's roots are.

Most of the precise GCs I know of are in JIT'd or interpreted languages where you have this type info anyway... AOT-compiled Lisp is the one counterexample I can think of, but usually Lisps solve this in a different way by using tagged pointers (so you know if any heap slot is a pointer by e.g. its LSB).

reply

eslaught 10 hours ago

The issue is not having the type info available. AOT compilers certainly have that (or at least, enough of it to get the job done).

The issue is that most compilers with precise GC---AOT or otherwise---are built around that assumption from the very beginning. Therefore, they choose representations, algorithms, and optimizations that are amenable to GC, and in particular don't destroy the type info necessary to make GC work.

It is possible in theory to do the same with LLVM, but difficult in practice. Most existing LLVM-based GC'd languages compromise on performance by requiring LLVM to "root" pointers, effectively forcing them to stay on the stack (instead of registers) so that the GC can see them when it needs to. LLVM's optimizations (particularly at the machine-specific level) were previously allowed to do arbitrary things with register-based pointers, which is obviously bad for a GC if no copies of those pointers exist on the stack. It just takes a long time to take a code base that large and untangle the code from such basic assumptions.

This appears to be changing, as the release nodes for LLVM 3.6 indicated. But it will take time. In the mean time languages on LLVM are either going with conservative GCs, or with precise GCs with the performance degradation noted above (e.g. OCaml), or with no GC (e.g. Rust for the time being).

Source: My talk on precise GC for Rust (from 2012): https://github.com/elliottslaughter/rust-gc-talk

...

mafribe 10 hours ago

All you need is access to the roots and the ability always to determine which data is another pointer. The main 'enemy' of this is the ability to cast integers to pointers (as you may in C/C++).

...

https://verdagon.dev/blog/surprising-weak-refs

Chapter : hard things to make it easy: coroutines and greenthreads

cooperative vs preemptive

todo preemptive on top of cooperative

(note: 'fibers' are not greenthreads, they are cooperative [19])

scheduling

todo

It is desirable for the scheduler to be aware when a thread is waiting on some condition, so that the scheduler doesn't bother scheduling the thread until the condition is true. Scheduler awareness of things like blocking I/O, locks, condition variables, semaphores, mutexes, futexes, timers/alarms, and OS signals might be useful here.

"A thread may block on IO or on a syncronization construct like a mutex, in which case it waits not for an external event to occur, but for another thread to signal that it has completed some task. " -- http://blog.paralleluniverse.co/2014/02/06/fibers-threads-strands/

Links:

How can greenthreads be more efficient than OS threads?

Links:

case study: erlang

case study: go

todo

case study: clojure core.async

Links:

case study: D

small coroutine implementations

vs async/await / cooperative multitasking / generators

" So aren’t fibers generators or async/awaits?

No, as we have seen fibers are real threads: namely a continuation plus a scheduler. Generators and async/awaits are implemented with continuations (often a more limited form of continuation called stackless, which can only capture a single stack frame), but those continuations don’t have a scheduler, and are therefore not threads. " -- http://zeroturnaround.com/rebellabs/what-are-fibers-and-why-you-should-care/

case study: C++

On M:N greenthreads

https://news.ycombinator.com/item?id=9762489:

pcwalton 4 hours ago

Golang implements a custom scheduler in userspace too, and this is why its FFI is not as fast or as well integrated with the OS as the calling conventions of C or C++ are. Heck, Golang's FFI is even less straightforward than that of Java, since the JVM is typically 1:1, not M:N.

Once you've gone M:N you're effectively in your own little world. There's a reason Golang's runtime is compiled with the Golang toolchan, and why it invokes syscalls directly as opposed to going through libc.

reply

Sorry, what is M:N?

reply

gilgoomesh 2 hours ago

Any situation where you have M userspace jobs running on N system threads, i.e. the number of tasks is different to the number of system threads.

Normally this occurs because you're running a large number of "green" threads on your own scheduler which schedules onto a thread pool underneath. This is good if all your threads are small/tiny since userspace thread creation is cheaper than creating an OS thread but if your jobs are long-lived then your userspace job scheduler is really just adding additional scheduling overhead on top of the overhead that the OS already has for thread scheduling and you would have been better with OS threads. If your M:N threading requires separate stack space for each job, there can be a sizeable overhead (this is why Rust abandoned M:N threading).

reply

anon3_ 2 hours ago

Can you come up with some examples of when this would began to be noticeable to an end-user?

reply

pcwalton 1 hour ago

If you're crossing the FFI boundary a lot, any overhead adds up quick. For example, drawing a bunch of small objects using Skia, performing lots of OpenGL? draw calls, allocating LLVM IR nodes, or calling a C memory allocator…

reply

bkeroack 8 hours ago

One of the nice things about M:N is it decouples concurrency from parallelism. That is, your application can spawn as many inexpensive green threads as the task requires without worrying about optimizing for core count or capping threads to avoid overhead, etc. With Go 1.5, underlying system thread count will default to the same as the system CPU core count.

reply

jrochkind1 9 hours ago

It's noticeable to the end-user only in it's negative performance implications in certain situations, making things slower than they would be otherwise on the same hardware. It's a low-level construct, it is not directly noticeable to the user either way. The negative performance implications are largely under heavy load. The post you replied to gave some more specific situations.

reply

Greenthreads libraries

Memory management links

(todo move to relevant chapter)

Region-based memory management

http://www.elsman.com/mlkit/ offers this; it describes itself as "region-based memory management with automatic inference of regions, aiming to support realtime applications"

" The store consists of a stack of regions , see Figure 1.1. Regions hold values, for example tuples, records, function closures, references, and values of recursive types (such as lists and trees). All values, except those that fit within one machine word (for example integers), are stored in regions. The size of a region is not necessarily known when the region is allocated. Thus a region can grow gradually (and many regions can grow at the same time) so one might think of the region stack as a stack of heaps. However, the region stack really is a stack in the sense that (a) if region r1 is allocated before region r2 then r2 is de-allocated before r1 and (b) when a region is de-allocated, all the memory occupied by that region is reclaimed in one constant time operation. Values that reside in one region are often, but not always, of the same type. A region can contain pointers to values that reside in the same region or in other regions. Both forward pointers (i.e., pointers from a region into a region closer to the stack top) and backwards pointers (i.e., pointers to an older region) occur.

As mentioned in the preface, the present version of the MLKit supports reference-tracing garbage collection in combination with region memory man-agement [Hal99]. While most de-allocations can be efficiently performed by region de-allocation, there are some uses of memory for which it is difficult to predict when memory can be de-allocated. In these cases reference-tracing garbage collection does a good job in combination with region de-allocation. ...

In the ML Kit the vast majority of region management is done automatically by the compiler and the runtime system. Indeed, with one exception, source programs are written in Standard ML, with no added syntax or special directives. The exception has to do with resetting of regions...resetRegions is a safe form of resetting where the compiler only inserts region resetting instructions if it can prove that they are safe, and prints thorough explanations of why it thinks resetting might be unsafe otherwise. "

-- [32] and [33]

Chapter : hard things to make it easy: TCO and strictness analysis

Chapter : hard things to make it easy: binding / dataflow

spreadsheet-like dependencies between variables

Chapter : administration: Bounds checking

" Bounds checks cause programs to execute slower for two reasons. One is the cost of executing the bounds checks themselv es since they can occur quite frequently and involve a memory load of the array length and two compare operations. Even more importantly , the presence of bounds checks greatly limits the application of code optimizations. Precise exception semantics (as in Java) requires that all code transformations preserve the program state at an exception point, and also the order in which exceptions occur. As a consequence, the application of traditional optimizations must be restricted to prevent side-effect-causing instructions from moving across an y exception points. Since the exception points introduced by array bounds checks can be frequent, the scope over which optimizations are applicable can be severely restricted" -- [34]

ABCD bounds checking

Links:

Fat pointers or tagged pointers for bounds checking

"Fat pointer systems work by replacing every pointer value with a structure that describes the valid range for the pointer, and replacing every pointer read/write with code that checks or updates this structure as necessary. The canonical form of this stores a ’base’ and ’extent’ value with each pointer, describing the lower and upper bounds that it is permitted to point into; if it is outside this range when it is dereferenced, an error is reported. " -- http://www.imperial.ac.uk/pls/portallive/docs/1/18619746.PDF section 2.2

There are proposals for hardward-supported bounds checking using this scheme [35]. Some propose that, instead of using more bytes to store fat pointers than ordinary pointers, taking advantage of pointer alignment requirements on popular platforms (which mean that the low-order bits of pointers must always be 0), masking the low-order bits of the pointer, and using these bits to specify imprecise, power-of-two bounds [36].

Note that tagged pointers require a power-of-two memory size limit.

Links:

Misc

"global constants that can be initialized from imported values" is a desirable feature in a target language to allow this sort of thing: https://github.com/WebAssembly/design/issues/333

misc / todo

tagged pointers

Some proposed uses for the bits in tagged pointers:

how many bits are typically available for tagging on popular hardware architectures? https://en.wikipedia.org/wiki/Tagged_pointer says 2 bits on 32-bit and 3 bits on 64-bit.

advantages: " The major advantage of tagged pointers is that they take up less space than a pointer along with a separate tag field. This can be especially important when a pointer is a return value from a function. It can also be important in large tables of pointers.

A more subtle advantage is that by storing a tag in the same place as the pointer, it is often possible to guarantee the atomicity of an operation that updates both the pointer and its tag without external synchronization mechanisms. This can be an extremely large performance gain, especially in operating systems. " [39]

---

write about how low level sandboxable code may require something like The CLI in order to provide a meta data language to allow stack maps and memory maps, to ensure that the sandboxed code can't coerce integers to references. This allows references to serve as capabilities. Refer to https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-335.pdf I.6.1 Relationship to type safety, pdf page 35, page 9. I don't think they mention capabilities, the that might be my own insight, but I'm not sure.

---

data format optimization:

https://dl.acm.org/doi/10.1145/3473596

---

spatial memory safety

temporal memory safety

some types of temporal memory safety bugs:

---

https://dl.acm.org/doi/10.1145/53990.53992 https://dl.acm.org/doi/pdf/10.1145/53990.53992 Appel-Ellis-Li "real time" concurrent collection gc faq describes as: " It isn't really real time (see Jeff Barth's question from the audience at 1988 SIGPLAN PLDI) but that's probably ok, unless you are feeling pedantic. Use page protection to enforce the mutator invariant, and move objects on the entire page when one of them is loaded. (...) "

The Treadmill: Real-Time Garbage Collection Without Motion Sickness Henry G. Baker https://dl.acm.org/doi/pdf/10.1145/130854.130862

https://github.com/robey/mwgc micro-wibble garbage collector "A sesame-seed-sized heap with a tri-color, tracing, conservative, incremental, non-compacting garbage collector, for implementing a tiny language on tiny hardware."

---

One Pass Real-Time Generational Mark-SweepGarbage Collection by Joe Armstrong and Robert Virding

"Traditional mark-sweep garbage collection algorithms do not allow reclamation of data until the mark phase of the algorithm has terminated.For the class of languages in which destructive operations are not allowed we can arrange that all pointers in the heap always point backwards towards older data. In this paper we present a simple scheme for reclaiming data for such language classes with a single pass mark-sweep collector. We also show how the simple scheme can be modifed so that the collection can be done in an incremental manner (making it suitable for real-time collection). Following this we show how the collector can be modifed for generational garbage collection, and finally how the scheme can be used for a language with concurrent processes."

---