proj-oot-ootMemoryManagementNotes1

--

Python has reference counting, which frees most memory immediately, as soon as it can be freed, but which never frees reference cycles. To catch the reference cycles, they have garbage collection on top of this. For some reason that they sort of explain but not really, having to do with their preexisting extension API, they cannot use a traditional garbage collector, which marks objects reachable from the roots; instead they just search for UNreachable objects. I don't fully understand this part (I think the reason is that extension modules can hold references in arbitrary C variables, so the root set can't be enumerated; the cycle collector instead looks only at tracked container objects and subtracts internal references from refcounts to find groups unreachable from outside).
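Here's a rough sketch of the refcounting half in C (names made up, nothing like CPython's actual API): each object carries a count, decref frees immediately when the count hits zero, and a cycle whose counts never reach zero is exactly what the tracing pass has to hunt down.

    #include <stdlib.h>

    typedef struct Obj {
        int refcount;
        struct Obj *child;   /* one outgoing reference, enough to form a cycle */
    } Obj;

    Obj *obj_new(void) {
        Obj *o = calloc(1, sizeof *o);
        o->refcount = 1;     /* the creator holds the first reference */
        return o;
    }

    void obj_incref(Obj *o) { if (o) o->refcount++; }

    void obj_decref(Obj *o) {
        if (o && --o->refcount == 0) {
            obj_decref(o->child);   /* releasing us releases what we point to */
            free(o);                /* freed immediately, no waiting for a GC */
        }
    }

    int main(void) {
        Obj *a = obj_new(), *b = obj_new();
        a->child = b; obj_incref(b);
        obj_decref(b);   /* b is now kept alive only by a */
        obj_decref(a);   /* frees a, then b: nothing left over */

        /* but a cycle defeats this: if a->child pointed back at a, neither
           count would ever reach 0, which is why CPython layers a separate
           cycle detector on top of the refcounting */
        return 0;
    }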

Lua has an incremental garbage collector.

Apparently writing external modules in Python requires unwanted interaction with Python's reference counting/garbage collection (http://lua-users.org/lists/lua-l/2007-11/msg00248.html : " Compared to other language's binding mechanisms, Lua's C API effectively provides a shield between the internals of the Lua VM and the outer world. One does not have to care about internal structures of the Lua VM or even the garbage collector (how many refcounting bugs do your Python bindings have?).")

I like the idea of combining reference counting with an incremental garbage collector. Not sure how to do it. Should think about whether and how to get Lua's property that extensions don't need to interact with refcounting, and whether and how to avoid whatever design decision prevented Python from using a traditional GC.

http://arctrix.com/nas/python/gc/

Upon a suggestion of an incremental GC for Lua, this guy argued for a Python-esque approach: http://lua-users.org/lists/lua-l/2003-07/msg00359.html So did this guy: http://lua-users.org/lists/lua-l/2003-05/msg00444.html

and later provides more arguments (many of which are not applicable to a non-copying incremental GC): http://lua-users.org/lists/lua-l/2003-07/msg00370.html

this post notes that inferno/limbo uses a Python-esque system: http://lua-users.org/lists/lua-l/2003-05/msg00496.html

This guy suggests that, if refcounting is used, objects on the stack be omitted from the counting as an optimization: http://lua-users.org/lists/lua-l/2003-07/msg00359.html . This guy likes it but notes that it makes finalizer semantics more complex: http://lua-users.org/lists/lua-l/2003-07/msg00370.html . This post suggests that, if that optimization is used, there should be a language API that allows one to manually trigger garbage collection (and hence finalization) of objects on the stack: http://lua-users.org/lists/lua-l/2003-07/msg00381.html

http://stackoverflow.com/questions/4484167/details-how-python-garbage-collection-works

This post argues against a Python-esque model and for a pure incremental garbage collector, although it also seems to say that something like reference counting could at most be an optimization detail, and that the only important issue is how much overhead the incremental garbage collector's write barrier imposes:

http://lua-users.org/lists/lua-l/2003-08/msg00020.html

"I am willing to believe reference counting would relieve pressure on the M&S algorithm, but we are talking about a generational optimisation for Lua so this should become less of a problem. I think the issue here is whether the write barrier implementation for the generational incremental GC is a bigger overhead than just using RC."

This guy also argues against it: http://lua-users.org/lists/lua-l/2003-08/msg00088.html He notes, among other things, that naive reference counting, like stop-the-world garbage collection, has unbounded pause times, whereas incremental garbage collectors can have bounded pause times. He also talks about read-barrier-based and write-barrier-based GCs.

here's an interesting suggestion: http://lua-users.org/lists/lua-l/2003-08/msg00101.html

I would extend the previous suggestion to say that, like disabling interrupts, there could be a language API to allow Oot code to tell the language to stop automatic garbage collection and to run the collector only on demand.

excellent glossary:

http://www.iecc.com/gclist/GC-algorithms.html

good intro that also discusses what sort of annoying constraints the various GCs place on the rest of your language implementation:

http://llvm.org/docs/GarbageCollection.html

this blog post likes Lua's GC better than Python's, likes that you can turn automatic GC off, and implicitly suggests that refcounting+GC might make the GC part even faster:

http://www.altdevblogaday.com/2011/06/23/from-python-to-lua/

on the complexity of the don't-reference-count-on-the-stack optimization: http://lua-users.org/lists/lua-l/2003-05/msg00446.html

http://lua-users.org/lists/lua-l/2003-05/msg00447.html

random technical points: http://lua-users.org/lists/lua-l/2003-05/msg00498.html http://lua-users.org/lists/lua-l/2003-05/msg00483.html

some random links in this post: http://lua-users.org/lists/lua-l/2003-05/msg00465.html

here's Go's garbage collector: http://stackoverflow.com/questions/7823725/what-kind-of-garbage-collection-does-go-use

It's stop-the-world and zero-cost, so I don't think we want that one. It also doesn't support weak refs.

points out that incremental reference counting is also complicated: http://lua-users.org/lists/lua-l/2003-05/msg00458.html

--

Could allow the programmer to assign integer 'finalizer levels' to objects. A cycle of objects with finalizers will be garbage collected, finalizing from least level to most, iff their 'finalizer levels' are all distinct.
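A toy sketch of this rule (my own illustration; the names and the fallback policy are invented): sort the cycle's members by level, bail out if any two levels tie, otherwise finalize lowest level first.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        const char *name;
        int finalizer_level;   /* programmer-assigned ordering hint */
    } Obj;

    static int by_level(const void *a, const void *b) {
        return (*(Obj *const *)a)->finalizer_level
             - (*(Obj *const *)b)->finalizer_level;
    }

    /* try to finalize a garbage cycle: only proceed if all levels are
       distinct, and then finalize from lowest level to highest */
    int finalize_cycle(Obj **cycle, int n) {
        qsort(cycle, n, sizeof *cycle, by_level);
        for (int i = 1; i < n; i++)
            if (cycle[i]->finalizer_level == cycle[i - 1]->finalizer_level)
                return 0;   /* ambiguous order: fall back to some other policy */
        for (int i = 0; i < n; i++)
            printf("finalizing %s (level %d)\n",
                   cycle[i]->name, cycle[i]->finalizer_level);
        return 1;
    }

    int main(void) {
        Obj conn = {"connection", 2}, buf = {"buffer", 1};
        Obj *cycle[] = {&conn, &buf};
        finalize_cycle(cycle, 2);   /* finalizes buffer first, then connection */
        return 0;
    }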

--

"

jerf 2 days ago

It's easier to start from a GC'ed language, and use escape hatches when you need them, then to start without GC and then realize that your program has become a mess because you don't have it.

Many GC'ed languages have such escape hatches. Those that don't, well, don't start writing performance-critical code in them would be my recommendation "

--

" 2) Pointers everywhere. Pointers, pointers and yet more pointers, and more than that still. So datastructures in java will never match their equivalent in C++ in lookup speeds. Plus, in C++ you can do intrusive datastructures (not pretty, but works), which really wipe the floor with Java's structures. If you intend to store objects with lots of subobjects, this will bit you. As this wasn't bad enough java objects feel the need to store metadata, whereas C++ objects pretty much are what you declared them to be (the overhead comes from malloc, not from the language), unless you declared virtual member functions, in which case there's one pointer in there. In Java, it may (Sadly) be worth it to not have one object contain another, but rather copy all fields from the contained object into the parent object. You lose the benefits of typing (esp. since using an interface for this will eliminate your gains), but it does accelerate things by keeping both things together in memory. "

--

So we should be able to directly embed a subobject in a parent object, without an extra pointer. We should also be able to do things like intrusive linked lists.
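For example, something like this in C terms (illustrative; the types are made up): the Point is embedded in the Entity's allocation and the list link lives inside the Entity, so there is no boxing and no separate list-node allocation.

    #include <stddef.h>
    #include <stdio.h>

    /* the link field lives inside the object itself ("intrusive") */
    typedef struct Link { struct Link *next; } Link;

    typedef struct {
        double x, y;
    } Point;

    typedef struct {
        int id;
        Point pos;    /* subobject embedded directly: no pointer, no extra allocation */
        Link link;    /* this object can sit on a list with no separate list node */
    } Entity;

    /* recover the containing Entity from a pointer to its embedded Link */
    #define CONTAINER_OF(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    int main(void) {
        Entity a = {1, {0.0, 0.0}, {NULL}};
        Entity b = {2, {3.0, 4.0}, {NULL}};
        a.link.next = &b.link;    /* intrusive singly-linked list */

        for (Link *l = &a.link; l; l = l->next) {
            Entity *e = CONTAINER_OF(l, Entity, link);
            printf("entity %d at (%g, %g)\n", e->id, e->pos.x, e->pos.y);
        }
        return 0;
    }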

--

what is the metadata that the above quote (from https://news.ycombinator.com/item?id=6425412 ) says that Java needs to store with each object? do we need to do that?

--

"

--

interesting that Apple chose automatic reference counting for Objective-C on iOS ( http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/ ) despite Apple's preference for speed in iOS ( https://news.ycombinator.com/item?id=6016628 ), although garbage collection is supposedly better for throughput.

OTOH, http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/ seems to think that garbage collection is only at a huge disadvantage in very memory-constrained situations; in the future, perhaps CPUs will speed up less than memory becomes cheaper, leading to less memory-constrained situations?

--

Did I ever note this? Garbage collection for young generations and reference counting for old ones:

http://grothoff.org/christian/teaching/2007/4705/urc-oopsla-2003.pdf

--

https://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29#Reference_counting

notes that refcount updates can often be omitted, e.g. when a pointer is moved rather than copied (e.g. upon a function call), or when one object is part of a larger object (I think; maybe I got that wrong)

--

So far I am leaning towards something like Inferno's approach:

"

Then I found this quote from Ken Thompson (http://www.computer.org/computer/thompson.htm), and it crystalized my little rant about GC:


Thompson, talking about the Limbo language in Inferno:

    In addition, the language implementation -- and again I don't want
    to take any credit -- doesn't have big mark-and-sweep type garbage
    collection. It has reference counting: If you open something and
    then return, it's gone by reference count. Thus, you don't have
    high and low watermarks because 99 percent of the garbage goes
    away as soon as it is dereferenced. If you store a null in a
    pointer, it chases the pointer and all that stuff goes.
    If you make cycles, there is a background distributed
    mark-and-sweep algorithm that just colors the next step a little
    bit at a time. It doesn't have to be very aggressive because there
    is not much garbage around in most applications: People really
    don't leave dangling loop structures that require this kind of
    algorithm. So you can devote just one percent of your time in the
    background looking for garbage without these monster
    mark-and-sweep things.
    So, again, it's pragmatic. It's not the theoretical
    top-of-the-line garbage collection paper. It's just a way of doing
    it that seems to be very, very effective.

end Thompson quote " -- http://tulrich.com/geekstuff/ref-counting-is-good.txt

see also

http://lua-users.org/lists/lua-l/2003-05/msg00496.html

Also note that I want to defer reference chasing, so that if one free causes a cascade there isn't a long pause.
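One way to do that (a sketch, not something any of the systems above necessarily do): when a count hits zero, push the object onto a to-free worklist instead of recursing, and have the allocator or interpreter loop drain only a bounded amount of it at a time.

    #include <stdlib.h>

    typedef struct Obj {
        int refcount;
        struct Obj *child;       /* outgoing reference */
        struct Obj *free_next;   /* link for the deferred-free worklist */
    } Obj;

    static Obj *to_free = NULL;  /* dead objects whose children haven't been released yet */

    void decref(Obj *o) {
        if (o && --o->refcount == 0) {
            o->free_next = to_free;   /* don't recurse: just queue it */
            to_free = o;
        }
    }

    /* called from the allocator or the interpreter loop: do at most `budget`
       frees per call, so one decref never triggers a long cascading pause */
    void drain_free_list(int budget) {
        while (to_free && budget-- > 0) {
            Obj *o = to_free;
            to_free = o->free_next;
            decref(o->child);         /* may queue more work; never recurses deeply */
            free(o);
        }
    }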

Also, we probably want a non-moving GC (if we have a GC at all), so that we can give pointers to external functions via the FFI without pinning.

--

Maybe, in addition to a 'turn off background GC' command, have a 'do this part fast' boundary type, e.g. sort of like turning off interrupts.
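Roughly the surface API I have in mind (all names invented): a nesting disable counter like an interrupt-disable depth, which the collector's automatic trigger checks.

    static int gc_disabled = 0;   /* nesting counter, like an interrupt-disable depth */

    void gc_disable(void) { gc_disabled++; }
    void gc_enable(void)  { if (gc_disabled > 0) gc_disabled--; }

    /* the collector's automatic trigger checks this; an explicit, on-demand
       gc_collect() call from user code could ignore it */
    int gc_may_run_automatically(void) { return gc_disabled == 0; }

    /* usage: a soft-realtime section that must not be interrupted by the collector */
    void mix_audio_block(void) {
        gc_disable();
        /* ... time-critical work, no automatic collections in here ... */
        gc_enable();
    }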

--

Another idea for destructor cycles is to not guarantee that the object will be deallocated immediately after its destructor is called (but only in the case of a destructor cycle).

--

The new Python system for dealing with finalizers and reference cycles looks good:

http://www.python.org/dev/peps/pep-0442/

---

toread

http://www.memorymanagement.org/articles/lang.html


---

Rust is cool:

" The ~ sigil represents a unique handle for a memory allocation on the heap:

    {
        // an integer allocated on the heap
        let y = ~10;
    }
    // the destructor frees the heap memory as soon as `y` goes out of scope

Rust includes syntax for heap memory allocation in the language since it's commonly used, but the same semantics can be implemented by a type with a custom destructor.

8 Ownership

Rust formalizes the concept of object ownership to delegate management of an object's lifetime to either a variable or a task-local garbage collector. An object's owner is responsible for managing the lifetime of the object by calling the destructor, and the owner determines whether the object is mutable.

Ownership is recursive, so mutability is inherited recursively and a destructor destroys the contained tree of owned objects. Variables are top-level owners and destroy the contained object when they go out of scope. A box managed by the garbage collector starts a new ownership tree, and the destructor is called when it is collected.

    // the struct owns the objects contained in the `x` and `y` fields
    struct Foo { x: int, y: ~int }

    {
        // `a` is the owner of the struct, and thus the owner of the struct's fields
        let a = Foo { x: 5, y: ~10 };
    }
    // when `a` goes out of scope, the destructor for the `~int` in the struct's field is called

    // `b` is mutable, and the mutability is inherited by the objects it owns
    let mut b = Foo { x: 5, y: ~10 };
    b.x = 10;

If an object doesn't contain garbage-collected boxes, it consists of a single ownership tree and is given the Owned trait which allows it to be sent between tasks. Custom destructors can only be implemented directly on types that are Owned, but garbage-collected boxes can still contain types with custom destructors.

9 Boxes

Many modern languages represent values as pointers to heap memory by default. In contrast, Rust, like C and C++, represents such types directly. Another way to say this is that aggregate data in Rust are unboxed. This means that if you let x = Point { x: 1f, y: 1f };, you are creating a struct on the stack. If you then copy it into a data structure, you copy the entire struct, not just a pointer.

For small structs like Point, this is usually more efficient than allocating memory and indirecting through a pointer. But for big structs, or mutable state, it can be useful to have a single copy on the stack or on the heap, and refer to that through a pointer.

9.1 Owned boxes

An owned box (~) is a uniquely owned allocation on the heap. It inherits the mutability and lifetime of the owner as it would if there was no box:

    let x = 5;      // immutable
    let mut y = 5;  // mutable
    y += 2;

    let x = ~5;     // immutable
    let mut y = ~5; // mutable

The purpose of an owned box is to add a layer of indirection in order to create recursive data structures or cheaply pass around an object larger than a pointer. Since an owned box has a unique owner, it can only be used to represent a tree data structure.

The following struct won't compile, because the lack of indirection would mean it has an infinite size:

struct Foo { child: Option<Foo> }

    Note: The Option type is an enum that represents an optional value. It's comparable to a nullable pointer in many other languages, but stores the contained value unboxed.

Adding indirection with an owned pointer allocates the child outside of the struct on the heap, which makes it a finite size and won't result in a compile-time error:

struct Foo { child: Option<~Foo> }

9.2 Managed boxes

A managed box (@) is a heap allocation with the lifetime managed by a task-local garbage collector. It will be destroyed at some point after there are no references left to the box, no later than the end of the task. Managed boxes lack an owner, so they start a new ownership tree and don't inherit mutability. They do own the contained object, and mutability is defined by the type of the managed box (@ or @mut). An object containing a managed box is not Owned, and can't be sent between tasks.

    let a = @5;             // immutable

    let mut b = @5;         // mutable variable, immutable box
    b = @10;

    let c = @mut 5;         // immutable variable, mutable box

    let mut d = @mut 5;     // mutable variable, mutable box

A mutable variable and an immutable variable can refer to the same box, given that their types are compatible. Mutability of a box is a property of its type, however, so for example a mutable handle to an immutable box cannot be assigned a reference to a mutable box.

    let a = @1;     // immutable box
    let b = @mut 2; // mutable box

    let mut c : @int;     // declare a variable with type managed immutable int
    let mut d : @mut int; // and one of type managed mutable int

    c = a;  // box type is the same, okay
    d = b;  // box type is the same, okay

    // but b cannot be assigned to c, or a to d
    c = b;  // error

10 Move semantics

Rust uses a shallow copy for parameter passing, assignment and returning values from functions. A shallow copy is considered a move of ownership if the ownership tree of the copied value includes an owned box or a type with a custom destructor. After a value has been moved, it can no longer be used from the source location and will not be destroyed there.

    let x = ~5;
    let y = x.clone();  // y is a newly allocated box
    let z = x;          // no new memory allocated, x can no longer be used

Since in owned boxes mutability is a property of the owner, not the box, mutable boxes may become immutable when they are moved, and vice-versa.

    let r = ~13;
    let mut s = r;      // box becomes mutable

11 Borrowed pointers

Rust's borrowed pointers are a general purpose reference type. In contrast with owned boxes, where the holder of an owned box is the owner of the pointed-to memory, borrowed pointers never imply ownership. A pointer can be borrowed to any object, and the compiler verifies that it cannot outlive the lifetime of the object.

As an example, consider a simple struct type, Point:

struct Point { x: float, y: float }

We can use this simple definition to allocate points in many different ways. For example, in this code, each of these three local variables contains a point, but allocated in a different location:

    let on_the_stack :  Point =  Point { x: 3.0, y: 4.0 };
    let managed_box  : @Point = @Point { x: 5.0, y: 1.0 };
    let owned_box    : ~Point = ~Point { x: 7.0, y: 9.0 };

Suppose we want to write a procedure that computes the distance between any two points, no matter where they are stored. For example, we might like to compute the distance between on_the_stack and managed_box, or between managed_box and owned_box. One option is to define a function that takes two arguments of type point—that is, it takes the points by value. But this will cause the points to be copied when we call the function. For points, this is probably not so bad, but often copies are expensive. So we’d like to define a function that takes the points by pointer. We can use borrowed pointers to do this:

    fn compute_distance(p1: &Point, p2: &Point) -> float {
        let x_d = p1.x - p2.x;
        let y_d = p1.y - p2.y;
        sqrt(x_d * x_d + y_d * y_d)
    }

Now we can call compute_distance() in various ways:

    compute_distance(&on_the_stack, managed_box);
    compute_distance(managed_box, owned_box);

Here the & operator is used to take the address of the variable on_the_stack; this is because on_the_stack has the type Point (that is, a struct value) and we have to take its address to get a value. We also call this borrowing the local variable on_the_stack, because we are creating an alias: that is, another route to the same data.

In the case of the boxes managed_box and owned_box, however, no explicit action is necessary. The compiler will automatically convert a box like @point or ~point to a borrowed pointer like &point. This is another form of borrowing; in this case, the contents of the managed/owned box are being lent out.

Whenever a value is borrowed, there are some limitations on what you can do with the original. For example, if the contents of a variable have been lent out, you cannot send that variable to another task, nor will you be permitted to take actions that might cause the borrowed value to be freed or to change its type. This rule should make intuitive sense: you must wait for a borrowed value to be returned (that is, for the borrowed pointer to go out of scope) before you can make full use of it again.

For a more in-depth explanation of borrowed pointers, read the borrowed pointer tutorial.

11.1 Freezing

Borrowing an immutable pointer to an object freezes it and prevents mutation. Owned objects have freezing enforced statically at compile-time.

    let mut x = 5;
    {
        let y = &x; // x is now frozen, it cannot be modified
    }
    // x is now unfrozen again

Mutable managed boxes handle freezing dynamically when any of their contents are borrowed, and the task will fail if an attempt to modify them is made while they are frozen:

    let x = @mut 5;
    let y = x;
    {
        let z = &*y; // the managed box is now frozen
        // modifying it through x or y will cause a task failure
    }
    // the box is now unfrozen again

"

-- http://static.rust-lang.org/doc/0.8/tutorial.html

---

some discusson on nimrod's GC:

https://news.ycombinator.com/item?id=6820474

--

http://pivotallabs.com/why-not-to-use-arc/

--

https://docs.python.org/3/whatsnew/3.4.html#pep-442-safe-object-finalization

---

" Since only one task can own a boxed array at a time, if instead of distributing our numbers array to a single task we wanted to distribute it to many tasks, we would need to copy the array for each. Let's see an example that uses the clone method to create copies of the data:

fn main() {
    let numbers = vec![1i, 2i, 3i];

    for num in range(0u, 3) {
        let (tx, rx)  = channel();
        // Use `clone` to send a *copy* of the array
        tx.send(numbers.clone());
        spawn(proc() {
            let numbers = rx.recv();
            println!("{:d}", numbers[num as uint]);
        })
    }
}

This is similar to the code we had before, except now we loop three times, making three tasks, and cloning numbers before sending it.

However, if we're making a lot of tasks, or if our data is very large, creating a copy for each task requires a lot of work and a lot of extra memory for little benefit. In practice, we might not want to do this because of the cost. Enter Arc, an atomically reference counted box ("A.R.C." == "atomically reference counted"). Arc is the most common way to share data between tasks. Here's some code:

use std::sync::Arc;

fn main() {
    let numbers = vec![1i, 2i, 3i];
    let numbers = Arc::new(numbers);

    for num in range(0u, 3) {
        let (tx, rx)  = channel();
        tx.send(numbers.clone());
        spawn(proc() {
            let numbers = rx.recv();
            println!("{:d}", (*numbers)[num as uint]);
        })
    }
}

This is almost exactly the same, except that this time numbers is first put into an Arc. Arc::new creates the Arc, .clone() makes another Arc that refers to the same contents. So we clone the Arc for each task, send that clone down the channel, and then use it to print out a number. Now instead of copying an entire array to send it to our multiple tasks we are just copying a pointer (the Arc) and sharing the array.

How can this work though? Surely if we're sharing data then can't we cause data races if one task writes to the array while others read?

Well, Rust is super-smart and will only let you put data into an Arc that is provably safe to share. In this case, it's safe to share the array as long as it's immutable, i.e. many tasks may read the data in parallel as long as none can write. So for this type and many others Arc will only give you an immutable view of the data. " -- http://doc.rust-lang.org/master/intro.html

---

Go 1.4+ GC Plan and Roadmap

https://docs.google.com/document/d/16Y4IsnNRCN43Mx0NZc5YXZLovrHvvLhK_h0KN8woTO4/edit

https://news.ycombinator.com/item?id=8148666

---

random LtU? discussion on real-time GC:

http://lambda-the-ultimate.org/node/2393

---

How to concisely let the user choose ARC (refcounting) vs. tracing GC?

---

"

Internal Memory Management

To speed-up memory allocation (and reuse) Python uses a number of lists for small objects. Each list will contain objects of similar size: there will be a list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object needs to be created, either we reuse a free block in the list, or we allocate a new one.

There are some internal details on how Python manages those lists into blocks, pools, and “arena”: a number of block forms a pool, pools are gathered into arena, etc., but they’re not very relevant to the point we want to make (if you really want to know, read Evan Jones’ ideas on how to improve Python’s memory allocation). The important point is that those lists never shrink.

Indeed: if an item (of size x) is deallocated (freed by lack of reference) its location is not returned to Python’s global memory pool (and even less to the system), but merely marked as free and added to the free list of items of size x. The dead object’s location will be reused if another object of compatible size is needed. If there are no dead objects available, new ones are created.

If small objects memory is never freed, then the inescapable conclusion is that, like goldfishes, these small object lists only keep growing, never shrinking, and that the memory footprint of your application is dominated by the largest number of small objects allocated at any given point.

Therefore, one should work hard to allocate only the number of small objects necessary for one task, favoring (otherwise unpythonèsque) loops where only a small number of elements are created/processed rather than (more pythonèsque) patterns where lists are created using list generation syntax then processed.

While the second pattern is more à la Python, it is rather the worst case: you end up creating lots of small objects that will come populate the small object lists, and even once the list is dead, the dead objects (now all in the free lists) will still occupy a lot of memory.

The fact that the free lists grow does not seem like much of a problem because the memory it contains is still accessible to the Python program. But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS on the heap (that allocates other objects than small objects) only on Windows, if you run on Linux, you can only see the total memory used by your program increase. " -- http://deeplearning.net/software/theano/tutorial/python-memory-management.html
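The core of that small-object scheme, stripped way down (a sketch; CPython's real allocator adds blocks/pools/arenas on top): segregated free lists per size class, where a freed block goes back on its class's list and never back to the OS.

    #include <stdlib.h>

    #define ALIGN    8
    #define NCLASSES 64                 /* size classes 8, 16, ..., 512 bytes */
    #define CLASS(sz) (((sz) + ALIGN - 1) / ALIGN - 1)

    typedef struct FreeBlock { struct FreeBlock *next; } FreeBlock;
    static FreeBlock *free_lists[NCLASSES];   /* one free list per size class */

    void *small_alloc(size_t size) {
        if (size == 0 || size > NCLASSES * ALIGN)
            return malloc(size);               /* big objects: fall back to malloc */
        size_t c = CLASS(size);
        if (free_lists[c]) {                   /* reuse a dead block of this class */
            FreeBlock *b = free_lists[c];
            free_lists[c] = b->next;
            return b;
        }
        return malloc((c + 1) * ALIGN);        /* no free block: grow (the lists never shrink) */
    }

    void small_free(void *p, size_t size) {
        if (size == 0 || size > NCLASSES * ALIGN) { free(p); return; }
        FreeBlock *b = p;                      /* the block is merely marked free: it goes */
        b->next = free_lists[CLASS(size)];     /* onto its class's list, not back to the OS, */
        free_lists[CLASS(size)] = b;           /* hence the "goldfish" growth described above */
    }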

---

" Running an application with Metronome

Metronome is designed to provide RT behavior to existing applications. No user code modification should be required. Desired heap size and target utilization must be tuned to the application so target utilization maintains the desired application throughput while letting the GC keep up with allocation. Users should run their applications at the heaviest load they want to sustain to ensure RT characteristics are preserved and application throughput is sufficient. This article's Tuning Metronome section explains what you can do if throughput or utilization is insufficient. In certain situations, Metronome's short pause-time guarantees are insufficient for an application's RT characteristics. For these cases, you can use the RTSJ to avoid GC-incurred pause times.

The Real-time Specification for Java

The RTSJ is a "specification for additions to the Java platform to enable Java programs to be used for real-time applications." Metronome must be aware of certain aspects of the RTSJ -- in particular, RealtimeThreads (RT threads), NoHeapRealtimeThreads (NHRTs), and immortal memory. RT threads are Java threads that, among other characteristics, run at a higher priority than regular Java threads. NHRTs are RT threads that can't contain references to heap objects. In other words, NHRT-accessible objects can't refer to objects subject to GC. In exchange for this compromise, the GC won't impede the scheduling of NHRTs, even during a GC cycle. This means NHRTs won't incur any pause times. Immortal memory provides a memory space that's not subject to GC; this means NHRTs are allowed to refer to immortal objects. These are only some aspects of the RTSJ; see Resources for a link to the complete specification. " -- http://www.ibm.com/developerworks/java/library/j-rtj4/index.html


http://www.jopdesign.com/doc/nbgc.pdf points out that there are two typically atomic parts of GC which lead to stop-the-world pauses; scanning the GC roots, and copying objects during heap compaction.

http://www.jopdesign.com/doc/nbgc.pdf points out that although GC roots must be scanned atomically, this is only per-thread, so at least one thread can be executing while another one is doing GC stuff (scanning its roots atomically).

http://www.ibm.com/developerworks/java/library/j-rtj4/index.html points out that a large part of GC pauses are due to atomic copying of large objects from one memory location to another during heap compaction. It solves this by having a max size for a contiguous block in memory; eg a large array is split into many discontiguous memory blocks. Now the length of the pause is bounded by the time it takes to copy just one of these guys.
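A sketch of that chunking idea (my own toy version, not IBM's code): the array is a spine of pointers to fixed-size chunks, so the collector only ever has to copy one bounded chunk at a time, at the cost of an extra indirection per element access.

    #include <stdlib.h>

    #define CHUNK_ELEMS 1024   /* max contiguous block: bounds any single copy */

    typedef struct {
        size_t len;
        double **spine;        /* spine of pointers to fixed-size chunks */
    } ChunkedArray;

    ChunkedArray *chunked_new(size_t len) {
        ChunkedArray *a = malloc(sizeof *a);
        size_t nchunks = (len + CHUNK_ELEMS - 1) / CHUNK_ELEMS;
        a->len = len;
        a->spine = malloc(nchunks * sizeof *a->spine);
        for (size_t i = 0; i < nchunks; i++)
            a->spine[i] = calloc(CHUNK_ELEMS, sizeof(double));
        return a;
    }

    /* element access goes through the spine: one extra indirection */
    double *chunked_at(ChunkedArray *a, size_t i) {
        return &a->spine[i / CHUNK_ELEMS][i % CHUNK_ELEMS];
    }

    /* during compaction the collector can relocate one chunk, patch one spine
       slot, and yield: the pause is bounded by CHUNK_ELEMS, not by a->len */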

Another solution for pauses during heap compaction is write barriers: if you do the copies concurrently with ongoing execution, then when someone writes to the object being copied (it will write to the source, but not the destination, because you haven't changed the pointer yet), you need to keep a log of the write, and after the copy is finished, replay the writes in this log on the destination object before actually changing the pointer. Now you only need atomicity when you actually swap in the new pointer. (Question: couldn't there be multiple references to the same object in different places? Don't you have to swap those all atomically to the new object? Or do you use another layer of indirection ("handles")?)

Write barriers are also used in incremental or parallel garbage collection to prevent the mutators (the application program) from creating, midway through a GC cycle, a new reference from an object that the GC has already traced to an object the GC hasn't seen, so that the GC isn't aware of this new reference, in which case the GC might delete the destination object. (I was confused about how the new reference could be created at all if the GC would have thought the object unreachable; I think the answer is that the object is still reachable, but the mutator can store the only remaining pointer to a not-yet-traced object into an already-traced object and then overwrite the old pointer, so without a write barrier the collector never discovers that reference and frees a live object.) http://www.ibm.com/developerworks/java/library/j-rtj4/index.html also talks about something called a "fuzzy barrier" that I don't understand that has to do with "objects being moved from one thread to another" (I understand this problem, but I don't understand what a fuzzy barrier is).

http://stackoverflow.com/questions/2663292/how-does-heap-compaction-work-quickly#comment4586534_2668486 suggests the indirection solution

https://en.wikipedia.org/wiki/Handle_%28computing%29 however says "Doubly indirect handles have fallen out of favour in recent times, as increases in available memory and improved virtual memory algorithms have made the use of the simpler pointer more attractive. "

---

what does C's malloc do? Actually it does arena memory management and has no heap compaction so fragmentation of the arena can increase arbitrarily, in the worst case causing memory consumption to increase arbitrarily: https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_native_memory_fragmentation_and_process_size_growth?lang=en

---

http://www.ibm.com/developerworks/java/library/j-rtj4/index.html also uses a 'cooperative suspend' in which, when the GC needs to stop the world (STW) (which is always for a bounded amount of time in this system), it notifies all threads, each of which then stores any held object references into locations accessible from the roots and sleeps until the GC signals that it's done.

---

support read and write barriers; support associating some GC data (like reference counts) with each reference

---

https://news.ycombinator.com/item?id=9163125 suggests that you should be able to group eg an array of structs, or a tree of structs, or whatever, into a single 'allocation unit', to save a tracing garbage collector time

---

Looks good:

Go GC: Prioritizing low latency and simplicity

---

musesum 13 hours ago

"Best Fit" memory management. Malloc three buffers, in order: 64K, 1K, 64K. Now free all three. Next allocate 1K. Guess where it resides in the heap? Yep: the middle! After a while, memory was fragmented with 1K detritus. Even though you were supposed to have 500K available, there wasn't room for a 64K buffer anymore. This was the default for Microsoft's C compiler, in 1990. How did I find out?

There was a project that was crashing into a blue screen of death after a few hours. The crashing library had #defines that went 15 levels deep. The stack dump wasn't very instructive. Even though the library ISV was based in Boston, I had to fly out to Santa Cruz to talk to the developer directly. He was somewhat belligerent: "you should have figured it out yourself!". 15 levels of #defines. He was the perfect example of the other side of Dunning-Kruger thinking his arcane shit was obvious.

But, that didn't solve the problem. His fix didn't. So, I flew back to Boston. I started to trace everything with -ahem- #defines of the main APIs. It was turning into a Zen Koan: the sound of one app crashing. Everything worked with the Borland compiler. It was just the MSC compiler that failed. The company was one of the top 3 ISVs back then. Meetings about what to do would last for hours. There were technical politics: "Can we use Borland?" No; the company had standardized on MS. I could talk to their engineers directly, though. But, they didn't have a clue. So, I read though the docs to see what compiler switches were available. And there it was: best-fit. Fixed!

So, I wrote an analysis about why best-fit was absolutely the worst possible memory management and sent it to Microsoft. It took 8 years for MS to change the default. So, the reason why software would blow up? Best fit. Why early versions of Windows with go into blue-screen-of-death? Best fit. How many developers had to deal with this? Worst nightmare in aggregate.

reply

davidst 13 hours ago

You just brought back old memories for me. I wrote the heap manager for Borland C. I chose the next-fit algorithm for its balance of speed and reasonable fragmentation under common use cases. Best-fit performs exactly as you described.

reply

dalke 2 hours ago

I pulled out my old Minix (1987) book, which I keep for nostalgia. Tanenbaum on p. 202 writes:

> The simplest algorithm is first fit... A minor variation of first fit is next fit. ... Simulations by Bays (1977) show that next first gives slightly worse performance than first fit. ... Another well-known algorithm is best fit. ... Best fit is slower than first fit.... Somewhat surprisingly, it also results in more wasted memory than first fit or next fit because it tends to fill up memory with tiny, useless holes. First fit generates larger holes on the average. ... one could think about worst fit... Simulation has shown that worst first is not a very good idea. ...

> quick fit ... has the same disadvantage as all schemes that sort by hole size, namely, when a process terminates or is swapped out, finding its neighbors to see if a merge is possible is expensive. If merging is not done, memory will quickly fragment into a large number of small, useless holes.

reply

Fiahil 4 hours ago

Wait, the three buffers should be merged together upon free(), if they're consecutive in memory, isn't it?

What did I miss?

reply

coldcode 4 hours ago

Typical memory managers put blocks onto a free list during free() calls in order to reuse them later. I wrote a commercial memory allocator library in the 90's and I always coalesced consecutive blocks in free(), which limits fragmentation and speeds up allocation by avoiding long list searches. But even then many libraries did the easy choice of the free lists.

reply

---

in a talk about paravirtualization:

"The only somewhat tricky thing is that you need to handle MMU updates somehow. I believe (but don't know for certain) that with nested page table support at the processor level, you can just safely give x86 hardware-virtualized guests access to their nested page tables, and they can use the same instructions to modify that as they would on native hardware. So you don't need paravirtualization for that. This support has been in hardware since around 2008."

---

BenoitP? 2 days ago

The Go runtime is really starting to look sexy. 20 ms GC pauses on 200+ GB heaps!

I remember a thread discussing a pauseless GC on the dev mailing-list; where Gil Tene, of Azul C4's fame, commented on having a clear policy on safepoints from the beginning was paramount to having a good GC. It looked like the community is very biased towards doing the fundamental things well, and attracting the right people.

And on top of that we're get BLAS and LAPACK bindings from https://github.com/gonum

reply


_ph_ 1 day ago

The GO gc is not generational, those 20ms are for a full gc. Certainly the details of the heap usage of the application is going to affect gc times, but this is not the time for just a nursery collection.

reply

mike_hearn 1 day ago

OK, the tradeoff they're making that I missed is that it's not a compacting collector. So eventually your heap can fragment to the point where allocation gets expensive or impossible. Unusual design choice.

reply

geodel 1 day ago

Unlike Java Go has first class value types and memory layout can be controlled by developers. So it leads to much less objects on heap and compact layouts both will lead to far less fragmentation. As you can see here Go apps use quite less memory than Java. https://benchmarksgame.alioth.debian.org/u64q/go.html

reply

pcwalton 1 day ago

You can't really compare total memory usage of a JIT to total memory usage of an AOT compiler that way if what you're trying to show is that value types reduce memory usage.

Also, I suspect that the fact that JVMs use a generational GC (and a compacting GC) blows everything else out of the water when it comes to fragmentation. There's no way a best-fit malloc implementation can possibly beat bump allocation in the nursery for low fragmentation.

reply

mike_hearn 1 day ago

Unfortunately it's impossible to reliably measure the memory usage of Java that way because the JVM will happily prefer to keep allocating memory from the OS rather than garbage collect. It makes a kind of sense: GC has a CPU cost that gets lower the more memory is given to the heap, so if you have spare memory lying around, may as well deploy it to make things run faster.

Of course that isn't always what you want (e.g. desktop apps) ... sometimes you'd rather spend the CPU and minimise the heap size. The latest Java versions on some platforms will keep track of total free system RAM and if some other program is allocating memory quickly, it'll GC harder to reduce its own usage and give back memory to the OS.

In the benchmarks game I suspect there aren't any other programs running at the same time, so Java will go ahead and use all the RAM it can get. Measuring it therefore won't give reasonable results as the heap will be full of garbage.

Value types don't have much to do with fragmentation, if anything they make it worse because embedding a value type into a larger container type results in needing larger allocations that are harder to satisfy when fragmentation gets serious. But ultimately a similar amount of data is going to end up in the heap no matter what. Yes, you can save some pointers and some object headers, so it'll be a bit less. But not so much that it solves fragmentation.

reply

---

martijn_himself 2 days ago

A Golang beginner's question: do these GC improvements make Golang a suitable language/ platform for writing games?

EDIT: I realise this is a vague question. I suppose I was wondering if the order of magnitude GC performance in Go is likely to interfere with game loops you might find in reasonably CPU/ GPU intensive Indie games (i.e. NOT Crysis).

reply

espadrine 1 day ago

You have to ensure that the GC never makes you drop a frame. For 60Hz, that means staying below 16.7ms.

Given that a Go 1.6 GC will still take about 4ms, you have 12.7ms to generate a frame, which can be too limiting for some CPU-intensive games, but is perfectly acceptable for many games.

(In Go 1.5, a GC was much more likely to make you drop a frame, as it could easily average 40ms.)

On the other hand, there are less high-quality libraries in Go than in C++ or in JS. That may be the most limiting factor.

reply

mike_hearn 1 day ago

Interesting fact: Unreal Engine uses a GCd heap (or used to at least) for its core game state, so that means many AAA games use GC. And you can hit 60fps with it.

Their secret is they keep the game state heap really small. Like, 50 megabytes, maybe.

reply

 twotwotwo 1 day ago

Another weird thing is, the background collection is still using a core (plus somewhat slowing foreground code, with write barriers) when it runs. It's kinda like you have a varying amount of CPU power.

One approach is just to program as if you had less CPU, as if there were always a GC running. I suppose if you have some code that isn't smoothness critical (game AI, say), or CPU-affecting detail settings you can twiddle without looking too glitchy, maybe you can figure out some way to shed work when you start to fall behind.

It'd be really cool to see someone attempt a game or such in Go--boundary pushing's always fun, more so when the boundaries are recently expanded.

reply

omginternets 1 day ago

In addition to what others have said, you can relieve much of the pressure on the GC by making liberal use of `sync.Pool`.

reply

---

list of @property attributes in Objective-C, from https://www.quora.com/What-is-the-difference-between-strong-retain-nonatomic-etc-in-the-Objective-C-iOS-property/answer/Aritra-Das-9?srid=udfV:

note that nonatomic or even unsafeness may be required for soft realtime, to avoid blocking, not only to make things 'faster'; see http://atastypixel.com/blog/four-common-mistakes-in-audio-development/

" List of attributes of @property : atomic. nonatomic. retain. copy. readonly. readwrite. assign. strong.

    atomic : It is the default behaviour. If an object is declared as atomic then it becomes thread-safe. Thread-safe means, at a time only one thread of a particular instance of that class can have the control over that object.
    Example : 
    @property NSString *name; //by default atomic
    @property (atomic)NSString *name; // explicitly declared atomic
    nonatomic: It is not thread-safe. You can use the nonatomic property attribute to specify that synthesized accessors simply set or return a value directly, with no guarantees about what happens if that same value is accessed simultaneously from different threads. For this reason, it’s faster to access a nonatomic property than an atomic one.

@property (nonatomic)NSString *name;

    retain: is required when the attribute is a pointer to an object.The setter method will increase retain count of the object, so that it will occupy memory in autorelease pool.

@property (retain)NSString *name;

    copy: If you use copy, you can't use retain. Using copy instance of the class will contain its own copy.

Even if a mutable string is set and subsequently changed, the instance captures whatever value it has at the time it is set. No setter and getter methods will be synthesized.

    @property (copy) NSString *name;
     
    NSMutableString *nameString = [NSMutableString stringWithString:@"Liza"];    
    xyzObj.name = nameString;    
    [nameString appendString:@"Pizza"]; 
    readonly: If you don't want to allow the property to be changed via setter method, you can declare the property readonly.

@property (readonly) NSString *name;

    readwrite: is the default behaviour. You don't need to specify readwrite attribute explicitly.

@property (readwrite) NSString *name;

    assign: will generate a setter which assigns the value to the instance variable directly, rather than copying or retaining it. This is best for primitive types like NSInteger and CGFloat, or objects you don't directly own, such as delegates.

@property (assign) NSInteger year;

    strong: is a replacement for retain.

@property (nonatomic, strong) AVPlayer *player;

    unsafe_unretained: There are a few classes in Cocoa and Cocoa Touch that don’t yet support weak references, which means you can’t declare a weak property or weak local variable to keep track of them. These classes include NSTextView, NSFont and NSColorSpace,etc. If you need to use a weak reference to one of these classes, you must use an unsafe reference.

An unsafe reference is similar to a weak reference in that it doesn’t keep its related object alive, but it won’t be set to nil if the destination object is deallocated.

@property (unsafe_unretained) NSObject *unsafeProperty;

"

---

maybe have a read barrier for everything, plus a write barrier but only when writing pointers?
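Something like this, maybe (illustrative only): field loads go through a read-barrier hook, and only pointer stores pay for a write barrier, which logs already-traced objects into a remembered set for the incremental collector to rescan.

    #include <stddef.h>

    typedef struct Obj Obj;
    struct Obj {
        int marked;          /* already traced by the incremental collector? */
        int dirty;           /* already recorded in the mutation log? */
        Obj *fields[4];      /* pointer-typed slots */
        long scalars[4];     /* non-pointer slots: no write barrier needed */
    };

    #define LOG_CAP 4096
    static Obj *mutation_log[LOG_CAP];   /* remembered set for the current GC cycle */
    static size_t log_len;

    /* read barrier: a hook on every field load (a pass-through here, but a
       concurrent copying collector could forward to an object's new location) */
    static inline Obj *read_ptr(Obj *o, int i) { return o->fields[i]; }

    /* write barrier, only on pointer stores: record already-traced objects
       that gain a new outgoing reference, so the collector will rescan them */
    static inline void write_ptr(Obj *o, int i, Obj *v) {
        if (o->marked && !o->dirty && log_len < LOG_CAP) {
            o->dirty = 1;
            mutation_log[log_len++] = o;
        }
        o->fields[i] = v;
    }

    /* plain scalar stores skip the barrier entirely */
    static inline void write_scalar(Obj *o, int i, long v) { o->scalars[i] = v; }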

---

anonymousDan 18 hours ago

So how does the GC performance of Go compare to something like Java/the Hotspot JVM?

reply

_ph_ 17 hours ago

...Overall, the Java collectors are very sophisticated and tuned over years, so in principle are excellent. The downside is, that the Java language itself puts a lot of stress on the GC. The biggest problem is, that Java offers no "value" types beyond the builtin int, double,... So everything else has to be allocated as a separate object and pointed to via references. The GC then has to trace all these references, which takes time....

Go on the other side has structs as values, so the memory layout is much easier for the GC. Go always performs full GCs, but mostly running in parallel with the application, a GC cycle only requires a stop-the-world phase of a few milliseconds (for multi gigabyte heaps). ...

reply

fpoling 16 hours ago

Another problem with Java is inability to return multiple values. For that one often creates a wrapper object holding the results. ...

reply

---

pepesza 16 hours ago

I think that not using Erlang in this particular case was a mistake. Erlang is running some of the largest chats out there, including League of Legends and WhatsApp. They would have avoided all the hassle of GC pauses, since Erlang has per-process GC collection. And scaling single OS process to their number of connections was done for Erlang machines years ago.

reply

--

geodel 17 hours ago

Technically hotspot GC might do more work in same amount of time but Go's GC makes some performance guarantees like <10ms STW phase which hotspot do not claim or offers for large heaps.

reply

pcwalton 13 hours ago

HotSpot does offer that. It's basic functionality that all incremental or concurrent garbage collectors offer. You can adjust the max pause time with -XX:MaxGCPauseMillis.

reply

sievebrain 16 hours ago

Yes, but that's because the Go GC doesn't compact, and nobody quotes throughput numbers. Building a slow GC that leaves the heap fragmented but has short pauses is not that difficult indeed, the tricky part starts when you wish to combine all these things together.

reply

--- " The biggest problem with all nongenerational GCs, including Go's, is lack of bump allocation in the nursery. You really want a two-space copying collector (or a single-space one) so that allocation in the nursery can be reduced to 3 or 4 machine instructions. By allowing fragmentation in the nursery, Go pays a heavy cost in allocation performance relative to HotSpot?.

> And again, as with everything it's not a silver bullet, as you sacrifice the high cost of promotion (again expensive moving of memory) in order to have very fast allocations while the nursery isn't fragmented or full.

You're questioning the generational hypothesis. Generational GC was invented in 1984. Whether generational GC works is something we've had over 30 years to figure out, and the answer has consistently been a resounding "yes, it works, and works well". "

---

Haven't even skimmed this: http://nethack4.org/blog/memory.html . Discussion: https://news.ycombinator.com/item?id=12118532

---

CPython is not compacting: http://www.gossamer-threads.com/lists/python/python/1163027#1163027 http://stackoverflow.com/questions/35300396/does-cpythons-garbage-collection-do-compaction

---

mono's GC:

http://www.mono-project.com/docs/advanced/garbage-collector/sgen/

---

http://www.elsman.com/mlkit/ "region-based memory management with automatic inference of regions, aiming to support realtime applications"

" The store consists of a stack of regions , see Figure 1.1. Regions hold values, for example tuples, records, function closures, references, and values of recursive types (such as lists and trees). All values, except those that fit within one machine word (for example integers), are stored in regions. The size of a region is not necessarily known when the region is allocated. Thus a region can grow gradually (and many regions can grow at the same time) so one might think of the region stack as a stack of heaps. However, the region stack really is a stack in the sense that (a) if region r1 is allocated before region r2 then r2 is de-allocated before r1 and (b) when a region is de-allocated, all the memory occupied by that region is reclaimed in one constant time operation. Values that reside in one region are often, but not always, of the same type. A region can contain pointers to values that reside in the same region or in other regions. Both forward pointers (i.e., pointers from a region into a region closer to the stack top) and backwards pointers (i.e., pointers to an older region) occur.

As mentioned in the preface, the present version of the MLKit supports reference-tracing garbage collection in combination with region memory man-agement [Hal99]. While most de-allocations can be efficiently performed by region de-allocation, there are some uses of memory for which it is difficult to predict when memory can be de-allocated. In these cases reference-tracing garbage collection does a good job in combination with region de-allocation. ... notice that the pure stack discipline (a stack, but no heap) is a special case of the region stack. Here the size of a region is known at the latest when the region is allocated. Another special case is when one has just one region in the region stack and that region grows dynamically . This case can be thought of as a heap with no garbage collection, which again would not be sufficient. But when one has many regions, one obtains the possibility of distinguishing between values according to what region they reside in. The MLKit has operations for allocating, de-allocating, and extending regions. But it also has an explicit operation for resetting an existing region, that is, reclaiming all the memory occupied by the region without eliminating the region from the region stack. This primitive, simple as it is, enables one to cope with most of those situations where lifetimes simply are not nested.

...

In the ML Kit the vast majority of region management is done automatically by the compiler and the runtime system. Indeed, with one exception, source programs are written in Standard ML, with no added syntax or special directives. The exception has to do with resetting of regions. The Kit provides two built-in functions (resetRegions and forceResetting) which instruct the program to reset regions. Here resetRegions is a safe form of resetting where the compiler only inserts region resetting instructions if it can prove that they are safe, and prints thorough explanations of why it thinks resetting might be unsafe otherwise. Function forceResetting is for potentially unsafe resetting of regions, which is useful in cases where the programmer jolly well knows that resetting is safe even if the compiler cannot prove it. forceResetting is the only way we allow users to make decisions that can make the program crash; many programs do not need forceResetting and hence cannot crash (unless we have bugs in our system). All other region directives, including directives for allocation and deallocation of regions, are inferred automatically by the compiler. This happens through a series of fairly complex program analyses and transformations (in the excess of twenty-five passes involving three typed intermediate languages). These analyses are formally de ned and the central one, called region inference, has been proved correct for a skeletal language. ...

with one exception, every region directive takes constant time and constant space to execute...The exception has to do with exceptions. When an exception is raised, a search down the stack for a handler takes place; this search is not constant time and it involves popping of regions on the way. However, the number of region operations is bounded by the number of handlers that appear on the stack.... a region profiler for examining run-time behaviour. The region pro ler gives a graphical representation of region sizes as a function of time. This makes it possible to see which regions use the most space and even to relate memory consumption back to individual allocation points in the (annotated) source program. ... the region scheme encourages a particular discipline of programming. The purpose of this report is to lay out this discipline of programming ... Example: the Game of Life...A snapshot of the game consisting of the board together with an indication of which positions are alive is called a generation. The rules of the game specify how to progress from one generation to the next...But now we have a design choice. Should we put the new generation in the same region as the previous region or should we arrange that it is put in a separate region? Piling all generations on top of each other in the same region would clearly be a waste of space: only the most recent generation is ever needed. Similarly, giving each generation a separate region on the region stack is no good either, since it would make the stack grow ad infinitum (although this could be alleviated somewhat by resetting all regions except the topmost one). The solution is simple, however: use two regions, one for the current generation and one for the new generation. When the new generation has been created, reset the region of the old region and copy the contents of the the new region into the old region...The use of the as pattern in line 1 forces the argument and the result of nthgen' to be in the same regions. Such a function is called a region endomorphism. In line 3, we compute a fresh generation which lies in fresh regions, as it happens. Having printed the generation (line 4) we then reset the regions containing g. The compiler checks that this is safe. Then, in line 6 we copy g' and the target of this copy must be the regions of g, since nthgen' is a region endomorphism (see Figure 1.4). The above device, which we refer to as double copying, can be seen as a much expanded version of what is often called \tail recursion optimisation". In the case of regions, not just the stack space, but also region space, is re-used. Indeed, double copying is similar to invoking a copying garbage collector on speci c regions which are known not to have live pointers into them. But by doing the copying ourselves, we have full control over when it happens, we know that the cost of copying will be proportional to the size of the generation under consideration and that all other memory management is done automatically by the region mechanism. ... For efficiency reason, however, the Kit distinguishes between two kinds of regions: those regions whose size it can determine at compile-time and those it cannot. These regions are referred to as finite and infinite regions, respectively. Finite regions are always allocated on the runtime stack. An infinite region is represented as a linked list of fixed-size pages. The runtime system maintains a free list of such pages. 
An infinite region is represented by a region descriptor, which is a record kept on the runtime stack. The region descriptor contains two pointers: one to the rst and one to the last region page in the linked list which represents the region. Allocating an in nite region involves getting a page from the free list and pushing a region descriptor onto the runtime stack. Popping a region is done by appending the region pages of the region and the free list (this is done in constant time) and then popping the region descriptor o the runtime stack. At runtime, every region is represented by a 32-bit entity, called a region name. If the region is nite, the region name is a pointer into the stack, namely to the beginning of the region. If the region is in nite, the region name is a pointer to the region descriptor of the region.

The multiplicity of a region is a statically determined upper bound on the number of times a value is put into the region. The Kit operates with three multiplicities: 0, 1 and 1, ordered by 0 < 1 < infinity. ... Every region has a runtime type. The following runtime types exist: real, string and top. Not surprisingly, regions of runtime type real and string contain values of ML type real and string, respectively. Regions with runtime type top can contain all other forms of allocated values, i.e., constructed values, tuples, records and function closures. It is often, but not always, the case that all values that reside in the same region have the same type (considered as representations of ML values). ... The Kit contains a virtual machine, called the Kit Abstract Machine (KAM, for short), which details the above ideas. The KAM is a register machine with one linear address space which it partitions into a stack and a heap. The heap holds region pages, all of the same size. ... The MLKit uses unboxed representation for integers, booleans, words, the unit value, and characters. The MLKit uses boxed representation for pairs, records (with at least one element), reals, exception values, function closures, and constructed values (i.e., data types, except lists and booleans). A boxed value may reside in a finite or an infinite region. Unboxed values are not stored in regions, except when they are part of a boxed value. ... The intermediate languages we shall refer to in the following are (in the order in which they are used in the compilation process):

Lambda A lambda-calculus like intermediate language. The main difference between the Standard ML Core Language and Lambda is that the latter only has trivial patterns.

RegionExp Same as Lambda, but with explicit region annotations (such as the letregion-bindings mentioned in Section 2.3). Region variables have their runtime type (Section 2.2) as an attribute, although, for brevity, the pretty printer omits runtime types when printing expressions, unless instructed otherwise.

MulExp Same as RegionExp, but now every binding occurrence of a region variable is also annotated with a multiplicity (Section 2.1) in addition to a runtime type. Again, the default is that the runtime type is not printed. The terms of MulExp are polymorphic in the information that annotates the nodes of the terms. That way, MulExp can be used as a common intermediate language for a number of the internal analyses of the compiler, which add more and more information to the syntax tree. The analysis which computes multiplicities is called the multiplicity analysis.

The runtime system is written in C. It is small (less than 30Kb of code when compiled). It contains operations for allocating and de-allocating regions, extending regions, obtaining more space from the operating system, recording region profiling information and performing low-level operations... ((the earlier draft said "low-level operations on strings")) "

-- [1] and [2]
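
a minimal sketch of the runtime machinery described above, in C. this is my own reconstruction from the prose, not the MLKit's actual code; the names and the page size are made up, and error handling and alignment are ignored for brevity:

    #include <stdlib.h>

    #define PAGE_SIZE 1024            /* fixed page size, chosen arbitrarily */

    typedef struct Page {             /* one region page */
        struct Page *next;
        size_t used;                  /* bytes allocated in this page */
        char data[PAGE_SIZE];
    } Page;

    typedef struct {                  /* "region descriptor", lives on the stack */
        Page *first;
        Page *last;
    } Region;

    static Page *free_list = NULL;    /* runtime-wide free list of pages */

    static Page *get_page(void) {     /* error handling omitted */
        Page *p = free_list;
        if (p) free_list = p->next; else p = malloc(sizeof(Page));
        p->next = NULL; p->used = 0;
        return p;
    }

    /* letregion: push a region (constant time) */
    static void region_init(Region *r) { r->first = r->last = get_page(); }

    /* allocate n bytes (n <= PAGE_SIZE) in region r, extending it if needed */
    static void *region_alloc(Region *r, size_t n) {
        if (r->last->used + n > PAGE_SIZE) {
            r->last->next = get_page();
            r->last = r->last->next;
        }
        void *p = r->last->data + r->last->used;
        r->last->used += n;
        return p;
    }

    /* pop a region: splice its whole page list onto the free list (constant time) */
    static void region_pop(Region *r) {
        r->last->next = free_list;
        free_list = r->first;
        r->first = r->last = NULL;
    }

in this sketch, the resetRegion needed for the double-copying trick would just be region_pop followed by region_init (or keeping the first page and returning the rest to the free list).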

---

oo, i dont know much about it, but if it works in practice then i really like the idea of MLKit's memory management; i like it because, like reference counting without cycle detection (but with weak references), it fits the idea of safe memory management with only a little bit of programmer input required.

---

kazinator 70 days ago [-]

Pure functional programming doesn't require any special memory management beyond the stack, if it avoids any data representations that use reference semantics and have indefinite lifetimes. Lazy evaluation and higher order functions with environment are pretty much out. But even C can be functional:

   int (*pg)(int) = f();
   int x = h() + 2*z;
   int w = pg(y);
   /* ... etc */

Here, we just introduce new variables instead of reassigning existing ones, don't malloc anything, and indirect only by means of dumb function pointers carrying no environments.

There could be some higher level (though nonetheless quite primitive) assignment-free language for specifying tasks for the 1000 cores of this chip, instead of programming them in assembler.

---

possibly need a way to indicate that some data in memory is 'pinned', eg should not be moved by the GC. This is important if a pointer to the pinned data has been given out via the FFI.
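not sure yet what the API should look like; as a purely hypothetical sketch (all names made up), the runtime could keep a pin bit in the object header that a moving/compacting collector consults before relocating anything whose address has been handed out through the FFI:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical object header; gc_pin/gc_unpin/gc_may_move are made-up
       names for illustration, not from any real runtime. */
    typedef struct {
        uint32_t flags;
    } obj_header;

    #define GC_PINNED 0x1u

    static inline void gc_pin(obj_header *h)   { h->flags |= GC_PINNED; }
    static inline void gc_unpin(obj_header *h) { h->flags &= ~GC_PINNED; }

    /* inside the collector's relocation phase: */
    static inline bool gc_may_move(const obj_header *h) {
        return (h->flags & GC_PINNED) == 0;   /* never move pinned objects */
    }

the usual pattern would be: pin before passing the pointer out via the FFI, unpin when the foreign code is guaranteed not to hold it any more.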

---

"Sub-millisecond GC pauses in Go 1.8"

https://golang.org/design/17503-eliminate-rescan https://groups.google.com/forum/?fromgroups#!topic/golang-dev/Ab1sFeoZg_8

---

re: "Sub-millisecond GC pauses in Go 1.8"

johncolanduoni 13 hours ago [-]

The tradeoff here seems to be a more complicated write barrier, so the loss in performance here will for the most part not show up as time spent in the GC. I'm curious to see how big of an issue this will be; the only GC I've heard of with such a heavy write barrier is OCaml's, which is ameliorated by the language's disposition towards immutability.

reply

pcwalton 13 hours ago [-]

And OCaml has a generational GC, unlike Go. So Go's throughput is going to be hit even harder.

Not going with a generational GC is a choice that runs against what has been accepted practice for decades. It may of course work out, but it's what I'd consider an experiment.

reply

fpoling 6 hours ago [-]

Typical Go code does not generate a lot of short-lived objects, compared with, say, Java or with typical usage of persistent data structures in functional languages. That removes the practical need for generational GC.

reply

weberc2 1 hour ago [-]

I'm also of the impression that Go's "transactional GC" is similar to a generational GC?

reply
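
re: the write barrier discussion above — a minimal sketch in C (my own; gc_marking and mark_queue_push are made-up names) of a Dijkstra-style insertion barrier, where every pointer store shades the stored-to object grey so a concurrent marker can't lose it. Go 1.8's actual barrier is a hybrid that additionally shades the pointer being overwritten (Yuasa-style), which is what lets it skip stop-the-world stack re-scanning; the extra branch on every store below is the throughput cost being discussed:

    enum color { WHITE, GREY, BLACK };       /* tri-color marking states */

    typedef struct Obj Obj;
    struct Obj {
        enum color color;
        Obj *field;                          /* some heap pointer field */
    };

    extern int gc_marking;                   /* hypothetical: true while a GC cycle runs */
    extern void mark_queue_push(Obj *o);     /* hypothetical: hand an object to the marker */

    /* Every pointer store into the heap is compiled into a call like this
       (or the equivalent inlined code). */
    static void write_field(Obj *holder, Obj *value) {
        if (gc_marking && value != NULL && value->color == WHITE) {
            value->color = GREY;             /* shade the new target */
            mark_queue_push(value);
        }
        holder->field = value;
    }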

---

geodel 14 hours ago [-]

Java 7 and above provide the -XX:MaxGCPauseMillis flag for GC conf

From Mechanical Sympathy Blog [1]

" G1 is target driven on latency –XX:MaxGCPauseMillis?=<n>, default value = 200ms. The target will influence the amount of work done on each cycle on a best-efforts only basis. Setting targets in tens of milliseconds is mostly futile, and as of this writing targeting tens of milliseconds has not been a focus of G1. "

So in the Java world we are talking about hundreds of milliseconds in the worst case, which is three orders of magnitude higher than Go.

1. http://mechanical-sympathy.blogspot.com/2013/07/java-garbage...

reply

pcwalton 13 hours ago [-]

Well, yeah. You're comparing the G1 (for server workloads) to Go's concurrent collector. There's a concurrent collector you can use in incremental mode for the JVM if you want to trade throughput for pause time, like Go does.

The HotSpot designers have (correctly, IMO) observed that server use cases typically prefer throughput to latency.

On a meta note, I wish people would focus on the actual algorithms instead of taking cited performance numbers out of context and comparing them head to head. The Go GC uses the same concurrent marking and sweeping techniques that CMS does. This change to avoid stop-the-world stack sweeps is something that no HotSpot collector does. But it's counterbalanced by the fact that Go's GC is not generational. Generational GCs change the balance significantly by dramatically improving throughput.

reply

sambe 13 hours ago [-]

In my experience you are still getting single-to-double-digit millisecond pauses on average using CMS (even with some attention/tuning). Do you really think Hotspot can offer a GC where pauses longer than 100us are considered interesting enough to look into?

reply

pcwalton 13 hours ago [-]

Sure, if they implement barriers for stack references like Go is. But that's a significant throughput hit.

reply


---

pcwalton 14 hours ago [-]

To stop the world, Go preempts at the next stack size check [1], which happens during function call preludes. I guess that happens often enough to be fast in practice.

[1]: https://github.com/golang/go/blob/8f81dfe8b47e975b90bb4a2f8d...

reply

rayiner 13 hours ago [-]

I assume this works so quickly in part because goroutines are backed by a pool of OS threads that don't block? So everybody doesn't get stuck waiting for a thread that's blocked to get around to checking the preemption flag?

reply

pcwalton 13 hours ago [-]

Right, blocked threads are stopped in the userspace scheduler so there are no synchronization issues with just keeping them stopped.

reply

johncolanduoni 14 hours ago [-]

Other systems I'm aware of that are capable of similar feats (I know .NET's collector can, not sure about Hotspot's) use page poisoning. Basically the compiler/JIT puts a bunch of extraneous memory accesses to a known address at key points where the GC can track what all variables and registers hold (called "safepoints"), in a fairly liberal pattern around the code. When the GC wants to stop all the threads in their tracks, it unmaps the page holding that address, causing segfaults in every thread when it hits one of the safepoints. The GC hooks the segfault handler, and simply waits for all threads to stop.

I'm not sure that's what Go does, but I'm not aware of a faster way on standard hardware.

reply
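
a rough sketch of the page-poisoning/safepoint trick described above, on POSIX. this is my own illustration, not literally what .NET or HotSpot do; a real runtime records stack maps at each safepoint and parks threads on a proper synchronization primitive rather than pause(), and error checks are omitted:

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *poll_page;                  /* the "known address" safepoints touch */

    /* The compiler/JIT emits this load at safepoints, where it also knows
       which registers/slots hold GC pointers. */
    static inline void safepoint(void) {
        *(volatile char *)poll_page;
    }

    static void on_segv(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        if (si->si_addr != (void *)poll_page)
            abort();                         /* a real crash, not a safepoint */
        /* This thread is now stopped at a known-safe point; a real runtime
           would block here until the GC resumes the world. */
        pause();
    }

    int main(void) {
        poll_page = mmap(NULL, 4096, PROT_READ,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct sigaction sa;
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        safepoint();                          /* harmless while the page is readable */
        mprotect(poll_page, 4096, PROT_NONE); /* GC "poisons" the page... */
        /* ...now every thread faults at its next safepoint() and parks itself. */
        return 0;
    }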

---

psyc 1 hour ago [-]

I'll mention a significant case that doesn't have to do with allocation. Large graph-like data structures (lots of small nodes linked by reference) are effectively prohibited entirely by the GC. They make every marking phase take much too long, until the whole thing eventually gets promoted into the long-lived generation. A mainstream runtime having such an impactful opinion about my data structures (not alloc patterns) is something I just find deeply offensive. Imagine if C++ programs stuttered because you had too many pointers!

They could have avoided that whole problem by providing an API like DontCollectThisRoot, but instead they (and a lot of programmers) chose to pretend the GC was good enough to treat like a black box.

reply

---

pjmlp 8 hours ago [-]

> In Java for instance the difference in performance compared to C++ comes from memory layout ability not memory management.

We need more modern languages with AOT compilation, ability to allocate on the stack and use value types.

I like C++ and have spent quite a few years using it, before settling on memory safe languages for my work.

Many C++ presentations seem to forget on purpose that there are memory safe programming languages, which offer similar capabilities for controlling memory layouts, thus presenting the language as the only capable of doing it.

Modula-3 is an example, or Swift and D for something more actual.

reply

CyberDildonics 1 hour ago [-]

The reality is that there aren't any straight replacements though since C++ has such mature tools, move semantics take away significant amounts of memory management pain, D without a gc is still exotic, and rust is still in its early stages.

reply

---

lossolo 12 minutes ago [-]

Embedded systems in most cases have memory constraints and GC languages are memory hogs. The cost of having GC is higher memory usage: you do GC in batches, thanks to which you do not have to pay for individual deallocations, but that requires slack memory. So this performance advantage cannot be used in the embedded space because there is no free memory for it; you would need to GC all the time, which would kill the performance.

reply

---

weberc2 37 minutes ago [-]

Java, Go, and C# all do stack allocation either implicitly via escape analysis (Java), explicitly via value types (C#), or both (Go). I don't know that this is "most" (perhaps by marketshare), but these are certainly 3 of the most popular languages in this space.

reply

kasey_junk 11 hours ago [-]

> stack allocation is much, much faster than any sort of heap-based memory management.

Is that true? I thought nursery allocation was usually as fast if not faster than stack allocations.

reply

johncolanduoni 11 hours ago [-]

The allocation speed will be close; a bit worse since the pointer to the end of the current allocation buffer usually isn't in a register and a function call and range check is required. However the overall cost of handling that memory from start to finish is significantly higher than the stack even if it gets clobbered in the first nursery collection.

Not because it is very high, but because stack allocation/deallocation is so very simple.

reply

kasey_junk 11 hours ago [-]

C# & Go both have stack allocation options. The JVM is supposedly getting them sometime soon.

reply
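
re: nursery vs stack allocation above — a sketch of the comparison (a hypothetical TLAB-style nursery, not any particular runtime): a nursery allocation is a pointer bump plus a limit check with a slow path into the collector, while the stack "allocation" is folded into the function's frame setup for free:

    #include <stddef.h>

    /* hypothetical thread-local nursery (TLAB-style) */
    static char *nursery_top;
    static char *nursery_limit;

    extern void *gc_slow_path(size_t n);   /* hypothetical: collect / refill the buffer */

    static void *nursery_alloc(size_t n) {
        char *p = nursery_top;
        if (nursery_limit - p < (ptrdiff_t)n)   /* range check on every allocation */
            return gc_slow_path(n);
        nursery_top = p + n;                    /* bump the pointer */
        return p;
    }

    void demo(void) {
        int on_stack[16];            /* "allocated" by bumping the stack pointer in the
                                        function prologue; freed for free on return */
        int *in_nursery = nursery_alloc(16 * sizeof(int)); /* bump + check, plus the
                                        copy/trace work the GC does for it later */
        (void)on_stack; (void)in_nursery;
    }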

---

	 Any thoughts on how arenas (rust) compare to gc and manual allocation for speed?

reply

buzzybee 7 hours ago [-]

Arenas trade some granularity and flexibility for speed and fragmentation-free allocation; they're a great choice for bulk data that you can precompute the size of and want to iterate over very quickly, and they're also easy to reason about. You can do many tasks using only arenas and stack allocations, and it'll zip along very quickly. If needed, you can flag liveness to get very efficient reuse of short-lived objects. They're less ideal if you are gradually adding more data over time, and you want to retain a consistently low usage, since you end up with "stairsteps" of latency when it copies into a bigger buffer, and once you have the big buffer, it's quite complicated to shrink it down again.

malloc()-style allocation gives you precision over how much you want, and this is most interesting to a memory-constrained system trying to allocate amongst many different processes (the original Unix use-case). But willy-nilly use of malloc() and free() leaves you with lots of fragmentation, as well as a larger attack surface for memory errors. What the actual allocation algorithm does is out of your hands, too, at least if you're depending on the OS allocator (you can always go write your own and supply it with a giant heap to munch on, and this may occur when you need tuning for individual allocation scenarios).

In the right scenario, a GC won't do too much differently from a manual allocator (there are lots of optimizations that could bring the allocation to same-or-negligible runtime), but as we all know, right scenarios are something you can't always count on. A GC does, however, greatly simplify the memory management of a long-lived process since it can do nifty things like automatically compact the heap.

IME, a mix of stack, GC, and some arenas in the form of growable arrays, is absolutely fine for the needs of most applications. Quite often this last requirement creates a sticking point, though, where the language disallows value semantics for arrays of objects, and then you can no longer assume they're allocated linearly, plus the GC is taxed with additional references to trace. In those cases, if I have them I use arrays of primitive values as crude containers for an equivalent optimization. Either way, it's very annoying and creates some awful code, because those huge batches that I'd like to put in arenas also tend to be the bottleneck for the GC.

reply
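
a sketch of the growable-buffer style of arena described above (my own simplification): allocation is a bump, but growth doubles the buffer and may copy everything, which is where the "stairsteps" of latency and the difficulty of shrinking come from. error handling is omitted:

    #include <stdlib.h>

    typedef struct {
        char  *buf;
        size_t used;
        size_t cap;
    } Arena;

    static void *arena_alloc(Arena *a, size_t n) {
        if (a->used + n > a->cap) {
            size_t new_cap = a->cap ? a->cap * 2 : 4096;  /* doubling => latency "stairstep" */
            while (new_cap < a->used + n) new_cap *= 2;
            a->buf = realloc(a->buf, new_cap);            /* may copy the whole arena */
            a->cap = new_cap;
        }
        void *p = a->buf + a->used;
        a->used += n;
        return p;
    }

    /* freeing everything at once is the whole point */
    static void arena_reset(Arena *a) { a->used = 0; }

note that if values in the arena hold pointers into the arena, the realloc moving the buffer would invalidate them, which is one reason real arenas (like the region pages earlier in these notes) often grow by chaining blocks instead of copying.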

---

faragon 8 hours ago [-]

In most C libraries malloc() is O(log n), e.g. when implemented as balanced trees.

reply

---

re: "Sub-millisecond GC pauses in Go 1.8"

psyc 15 hours ago [-]

If someone could do this for Unity, it would change my life.

reply

forrestthewoods 14 hours ago [-]

I've never been more careful to avoid any allocation than I have been in Unity. I had fewer memory concerns in C++, for crying out loud.

reply

psyc 12 hours ago [-]

I would 10,000x rather have to match new with delete (big deal) than to maintain the revolting unidiomatic contortions I'm obligated to do to outwit the GC.

reply

 JoeAltmaier 12 hours ago [-]

This. Times 100,000

reply

forrestthewoods 11 hours ago [-]

Agreed x10000. For VR we must maintain strictly zero garbage generation. Which is really damn hard in a language built on the assumption of GC.

reply

johncolanduoni 13 hours ago [-]

Depends on the platform you are targeting. Mono's GC (last I heard, they may have integrated .NET Core's by now) is relatively primitive. It was mostly developed just to have something better than Boehm.

Their experience was actually pretty instructive; with their first release of the new GC (a proficient but basic generational collector) they still weren't beating Boehm all the time, and usually weren't beating it by much. Given its constraints, Boehm's GC is impressively optimized when you run it in semi-precise mode.

reply

pcwalton 13 hours ago [-]

Were they bump allocating in the nursery? That's where you tend to see the biggest gains from generational GC.

reply

johncolanduoni 12 hours ago [-]

Yes, they even used TLABs. I believe the issue was that Xamarin was more interested in mobile performance at the time when SGEN was seeing heavy development and preparing for release, so they optimized for memory usage instead of throughput. The generational part was probably more of a hindrance than a benefit at that point in the process.

reply

---