proj-oot-ootMemoryManagementNotes2

	"

Brad Abrams posted an e-mail from Brian Harry, written during development of the .NET framework. It details many of the reasons reference counting was not used, even though one of the early priorities was to keep semantic equivalence with VB6, which uses reference counting. It looks into possibilities such as having some types ref counted and not others (IRefCounted?!), or having specific instances ref counted, and why none of these solutions were deemed acceptable.

    Because [the issue of resource management and deterministic finalization] is such a sensitive topic I am going to try to be as precise and complete in my explanation as I can. I apologize for the length of the mail. The first 90% of this mail is trying to convince you that the problem really is hard. In that last part, I'll talk about things we are trying to do but you need the first part to understand why we are looking at these options.
    ...
    We initially started with the assumption that the solution would take the form of automatic ref counting (so the programmer couldn't forget) plus some other stuff to detect and handle cycles automatically. ...we ultimately concluded that this was not going to work in the general case.
    ...
    In summary:
        We feel that it is very important to solve the cycle problem without forcing programmers to understand, track down and design around these complex data structure problems.
        We want to make sure we have a high performance (both speed and working set) system and our analysis shows that using reference counting for every single object in the system will not allow us to achieve this goal.
        For a variety of reasons, including composition and casting issues, there is no simple transparent solution to having just those objects that need it be ref counted.
        We chose not to select a solution that provides deterministic finalization for a single language/context because it inhibits interop with other languages and causes bifurcation of class libraries by creating language specific versions.

"

https://blogs.msdn.microsoft.com/brada/2005/02/11/resource-management/

---

http://stackoverflow.com/tags/garbage-collection/info

---

maybe someday ask StackOverflow for a survey of approaches to garbage collection in programming languages with the following goals:

I'm designing a programming language and i'm looking for suggestions for garbage collection/automatic memory management algorithms that meet the following criteria. Latency (GC pauses) must be minimized, but throughput performance is not important. The language is intended to be easy to reimplement on different platforms, so it is imperative that the GC technique in the reference implementation be simple. To reiterate, the algorithm under discussion could later be replaced with a more complicated, higher-performing GC algorithm on some implementations, but what i need for the reference implementation (i.e. what i am asking for here) is simplicity and ease of (correct) implementation, even at the expense of inefficiency. The data structures being managed will be concurrently accessed by multiple threads, and will contain reference cycles.

In more detail, design goals include:
    Latency: minimize GC pauses; throughput is secondary.
    Simplicity: the reference implementation must be simple and easy to reimplement correctly.
    Concurrency: managed data will be accessed by multiple threads.
    Cycles: reference cycles must be collected.

The default garbage collection systems of language implementations such as the JVM are not suitable because (a) they can have long GC pauses, and (b) they are probably more concerned with efficiency than simplicity. Some of the other automatic memory management systems that i am currently looking at as potential models are Inferno's, Go's, and Python's; other suggestions, or comments and comparisons, are welcome.

http://stackoverflow.com/questions/tagged/programming-languages
http://stackoverflow.com/questions/tagged/language-design
http://softwareengineering.stackexchange.com/questions/tagged/programming-languages
http://softwareengineering.stackexchange.com/questions/tagged/programming-languages+language-design

---

jeffdavis 6 hours ago [-]

Tangent:

In databases, it's common to do arena based allocation. You allocate objects of similar lifetime (life of a query, life of a transaction, life of a procedure execution, etc.), and free them all at once when the lifetime is up.

When hacking postgresql, for instance, using pfree() is fairly uncommon. You just allocate, (which goes to the current arena by default), and then the arena is wiped out all at once later.

Of course, sometimes you might use a lot of memory temporarily in a loop, and you need to pfree() it each iteration. But that is the exception.

I think of each arena as a tiny heap (though of course they can grow large).

reply

favorited 5 hours ago [-]

This was also an optimization you could do in Objective-C on NeXTSTEP/OpenStep/macOS. Allocating methods took an optional "zone" parameter (`allocWithZone:`, `copyWithZone:`), which was basically an arena.

In this case, the bigger benefit wasn't the allocation/deallocation – but the fact that you would be working with memory on the same virtual memory page, IIUC.

reply
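To make the arena idea above concrete, here is a minimal bump-allocator sketch in C (hypothetical names; PostgreSQL's real equivalent is its MemoryContext machinery):

    #include <stdlib.h>

    /* A bump-pointer arena: each allocation is O(1), there is no per-object
       free, and the whole lifetime's worth of objects is released at once. */
    typedef struct Arena {
        char  *base;   /* backing block */
        size_t used;   /* bytes handed out so far */
        size_t cap;    /* size of the backing block */
    } Arena;

    Arena arena_new(size_t cap) {
        Arena a = { malloc(cap), 0, cap };
        return a;
    }

    void *arena_alloc(Arena *a, size_t n) {
        n = (n + 7) & ~(size_t)7;               /* keep results 8-byte aligned */
        if (!a->base || a->used + n > a->cap)   /* real arenas grow by chaining blocks */
            return NULL;
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    /* "the arena is wiped out all at once": one free() for the whole lifetime */
    void arena_release(Arena *a) {
        free(a->base);
        a->base = NULL;
        a->used = a->cap = 0;
    }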

---

https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ohyo057ev

---

"Go’s garbage collector is a concurrent mark-sweep with very short stop-the-world pauses and it just seems to keep getting better. This is definitely a good thing for a garbage collected language. We did have some issues with unacceptable latencies, but were able to work them all out. No such luck with the Haskell project."

"...previously tried to use Haskell for the project I was working on, but eventually had to give up due to unpredictable latency caused by garbage collection pauses. GHC’s garbage collector is designed for throughput, not latency. It is a generational copying collector, which means that pause times are proportional to the amount of live data in the heap. To make matters worse, it’s also stop-the-world."

---

"Go is garbage-collected. I personally think Swift/Objective-C-style Automatic Reference Counting would have been a better choice for a statically typed systems language, since it gives you the brevity benefits without the GC pauses. I’m sure this has been argued to death elsewhere, so I’ll save my GC rant for a very boring dinner party."

---

pjmlp 1 day ago

on: C for All

> C does not have RAII-like memory management in any way.

Yes it does, as language extension on gcc and clang.

It is called cleanup attribute.

https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attribute...

https://clang.llvm.org/docs/LanguageExtensions.html#non-stan...
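For reference, the extension looks like this (the attribute itself is the real gcc/clang feature; cleanup_free is a helper name we define ourselves):

    #include <stdio.h>
    #include <stdlib.h>

    /* The cleanup function receives a pointer to the annotated variable. */
    static void cleanup_free(void *p) {
        free(*(void **)p);
    }

    int main(void) {
        /* buf is freed automatically when it leaves scope, on any exit
           path -- per-variable RAII, minus constructors. */
        __attribute__((cleanup(cleanup_free))) char *buf = malloc(64);
        if (!buf) return 1;
        snprintf(buf, 64, "hello");
        puts(buf);
        return 0;   /* cleanup_free(&buf) runs here */
    }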

---

tomsmeding 1 day ago

on: C for All

C does not have RAII-like memory management in any way. Exceptions work beautifully with memory management like that, but if it's not there, you can't just say it should work because memory management should work like that.

So basically you're saying, before adding exceptions, add RAII-like memory management, and then actually add exceptions. I like both features, but am not sure how you'd wedge RAII into C. Any ideas on that?

---

sedachv 1 day ago [-]

> C does not have RAII-like memory management in any way.

C does not have memory management in any way period. The C standard library does. How you get to something with dynamic scoping like RAII in C is to use a different library for managing memory. For example Thinlisp[1] and Ravenbrook's Memory Pool System[2] both provide dynamically-scoped region/pool allocation schemes.

[1] https://github.com/vsedach/Thinlisp-1.1

[2] https://www.ravenbrook.com/project/mps/

reply

---

BruceIV 1 day ago [-]

[On the Cforall team] For what it's worth, one of the features Cforall adds to C is RAII.

The exception implementation isn't done yet (it's waiting on limited run-time type information), but it already respects RAII.

reply

RhysU 1 day ago [-]

I have often wanted C99 with destructors for RAII purposes.

reply

---

UncleMeat 1 day ago

on: C for All

Exceptions work great in garbage collected languages. Reasoning about exceptional control flow w.r.t. manual memory management is a total nightmare.

sedachv 1 day ago [-]

> Reasoning about exceptional control flow w.r.t. manual memory management is a total nightmare.

Exceptions make manual memory management easier because a proper exception system has unwind-protect[1]. Exceptions are just movements up the stack - exceptions combine naturally with dynamic scoping for memory allocation (memory regions/pools). This kind of memory management was used in some Lisp systems in the 1980s, and made its way into C++ in the form of RAII. By extending the compiler you can add further memory management conveniences like smart pointers to this scheme.

Now if you want to talk about something that actually makes manual memory management a total nightmare, look at the OP's suggestion for adding closures to C.

[1] http://www.lispworks.com/documentation/HyperSpec/Body/s_unwind.htm

reply

---

reference counting may require write barriers (for pointers at least):

" while reference counting incurs a high cost — every pointer write (including those to stack frames) results in reference count updates "

" The incremental nature of reference counting is generally con- sidered to be its fundamental advantage. However, the cost of up- dating reference counts every time a new pointer is loaded into a register is typically much too high for high-performance applica- tions. As a result, some form of deferred reference counting [20], in which references from stack frames are accounted for separately, is used in most high-performance implementations of reference counting [3, 19]. However, this means that when an object’s reference count drops to zero, it can not be reclaimed immediately, since there might be a reference from the stack that is not accounted for. As a result, collection is deferred until the periodic scanning of the stack refer- ences. However, the result is delayed collection, floating garbage, and longer application pauses — the typical characteristics of trac- ing collectors! "

---

have some bits in each fat pointer (and every pointer should be fat) for the GC to use. This way it can mark pointer instances.

load and store barriers for the GC
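a sketch of the tag-bits idea (just the idea, not a worked-out design; assumes 8-byte-aligned objects so the low address bits are free):

    #include <stdint.h>
    #include <stddef.h>

    /* A fat pointer: address plus metadata. With 8-byte-aligned objects the
       low 3 bits of the address are always zero, so the GC can use them to
       mark pointer instances (e.g. during incremental marking). */
    typedef struct FatPtr {
        uintptr_t addr;   /* low 3 bits reserved for the GC */
        size_t    len;    /* room for other metadata, e.g. bounds */
    } FatPtr;

    #define GC_BITS ((uintptr_t)0x7)

    static void  *fp_deref(FatPtr p)           { return (void *)(p.addr & ~GC_BITS); }
    static int    fp_mark(FatPtr p)            { return (int)(p.addr & GC_BITS); }
    static FatPtr fp_set_mark(FatPtr p, int m) {
        p.addr = (p.addr & ~GC_BITS) | ((uintptr_t)m & GC_BITS);
        return p;
    }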

---

One reason that Java takes lots of memory, besides GC, is lack of value types: "data structures can't be value types[1], consequently consuming large amounts of extra memory because everything sits behind 2 or 3 or 5 layers of unnecessary 8-byte pointers." [1]

---

" jemalloc is removed by default

Long, long ago, Rust had a large, Erlang-like runtime. We chose to use jemalloc instead of the system allocator, because it often improved performance over the default system one. Over time, we shed more and more of this runtime, and eventually almost all of it was removed, but jemalloc was not. We didn't have a way to choose a custom allocator, and so we couldn't really remove it without causing a regression for people who do need jemalloc.

Also, saying that jemalloc was always the default is a bit UNIX-centric, as it was only the default on some platforms. Notably, the MSVC target on Windows has shipped the system allocator for a long time.

Finally, while jemalloc usually has great performance, that's not always the case. Additionally, it adds about 300kb to every Rust binary. We've also had a host of other issues with jemalloc in the past. It has also felt a little strange that a systems language does not default to the system's allocator.

For all of these reasons, once Rust 1.28 shipped a way to choose a global allocator, we started making plans to switch the default to the system allocator, and allow you to use jemalloc via a crate. In Rust 1.32, we've finally finished this work, and by default, you will get the system allocator for your programs. "

---

[2] [3] " drfuchs 9 months ago [-]

The proposal here is way too vague. And if you flesh it out, things start to fall apart: If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"? If not, how do you manage to get a string variable on the stack if its length might change? Or in a struct? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)? On the other hand, if the fat pointers have a length and capacity, how do I get a fat pointer to a substring that's in the middle of a given string?

Similar questions apply to general arrays, as well. Also: Am I able to take the address of an element of an array? Will that be a fat pointer too? How about a pointer to a sequence of elements? Can I do arithmetic on these pointers? If not, am I forced to pass around fat array pointers as well as index values when I want to call functions to operate on pieces of the array? How would you write Quicksort? Heapsort? And this doesn't even start to address questions like "how can I write an arena-allocation scheme when I need one"?

In short, the reason that this sort of thing hasn't appeared in C is not because nobody has thought about it, nor because the C folks are too hide-bound to accept a good idea, but rather because it's not clear that there's a real, workable, limited, concise, solution that doesn't warp the language far off into Java/C#-land. It would be great if there were, but this isn't it.

WalterBright 9 months ago [-]

I happened to know the idea does work, and has been working in D for 18 years now.

> If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"?

No. You'll still have the same issues with how memory is allocated and resized. But, once the memory is allocated, you have a safe and reliable way to access the memory without buffer overflows.

> If not, how do you manage to get a string variable on the stack if its length might change? Or in a struct? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)?

As I mentioned, it does not address allocating memory. However, it does offer one performance advantage in not having to call strlen to determine the size of the data.

> On the other hand, if the fat pointers have a length and capacity, how do I get a fat pointer to a substring that's in the middle of a given string?

In D, we call those slices. They look like this:

    T[] array = ...
    T[] slice = array[lower .. upper];

The compiler can insert checks that the slice[] lies within the bounds of array[].

> Am I able to take the address of an element of an array?

Yes: `T* p = &array[3];`

> Will that be a fat pointer too?

No, it'll be a regular pointer. To get a fat pointer, i.e. a slice:

    slice = array[lower .. upper];

> How about a pointer to a sequence of elements?

Not sure what you mean. You can get a pointer or a slice of a dynamic array.

> Can I do arithmetic on these pointers?

Yes, via the slice method outlined above.

> If not, am I forced to pass around fat array pointers as well as index values when I want to call functions to operate on pieces of the array?

No, just the slice.

> How would you write Quicksort? Heapsort?

Show me your pointer version and I'll show you an array version.

> And this doesn't even start to address questions like "how can I write an arena-allocation scheme when I need one"?

The arena will likely be an array, right? Then return slices of it.

"

" MrBingley? 9 months ago [-]

I absolutely agree. Adding an array type to C that knows its own length would solve so many headaches, fix so many bugs, and prevent so many security vulnerabilities it's not even funny. Null terminated strings? Gone! Checked array indexing? Now possible! More efficient free that gets passed the array length? Now we could do it! The possibilities are incredible. Sadly, C is so obstinately stuck in its old ways that adding such a radical change will likely never happen. But one can dream ...

bluetomcat 9 months ago [-]

> Adding an array type to C that knows its own length would solve so many headaches

C arrays know their length, it's always `sizeof(arr) / sizeof(*arr)`. It's just that arrays become pointers when passed between functions, and dynamically-sized regions (what is an array in most other languages) are always accessed via a pointer. "
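the D-style slice from the thread above, transliterated into C (a sketch of what such a proposal amounts to):

    #include <assert.h>
    #include <stddef.h>

    /* A fat pointer / slice: base pointer plus length, like D's T[]. */
    typedef struct {
        int   *ptr;
        size_t len;
    } IntSlice;

    /* array[lower .. upper], with the bounds check the compiler would insert. */
    IntSlice slice(IntSlice a, size_t lower, size_t upper) {
        assert(lower <= upper && upper <= a.len);
        return (IntSlice){ a.ptr + lower, upper - lower };
    }

    /* Checked indexing: no silent buffer overflows, no strlen needed. */
    int slice_index(IntSlice a, size_t i) {
        assert(i < a.len);
        return a.ptr[i];
    }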

---

https://stackoverflow.com/questions/54016849/how-does-the-haskell-runtime-distinguish-between-pointers-and-unboxed-word-sized

---

could have reference counting and rely on the programmer to avoid reference cycles by using weak references. Apparently Swift does this:

[4]

i guess that makes our implementation simpler and more performant. But i think it comes at the cost of making our programmers' lives more complex, so i don't really like it.
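the pattern in C terms (a sketch; Swift spells the back-pointer `weak var` and zeroes it automatically when the target dies):

    typedef struct Node Node;
    struct Node {
        long  refcount;
        Node *child;    /* strong reference: counted, owns the child */
        Node *parent;   /* weak reference: NOT counted, so the
                           parent<->child pair forms no retain cycle */
    };
    /* A manual scheme like this has to guarantee the child never uses
       `parent` after the parent dies; Swift instead zeroes weak references
       for you. Deciding strong-vs-weak everywhere is exactly the burden
       on the programmer mentioned above. */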

---

afaict the only way to have a truly pauseless compacting GC is to have a read barrier upon each dereference of any pointer. Or maybe handles (pointers to pointers) are enough?

ok there's another way; have a load barrier upon load of any memory location containing a pointer; it's not truly pauseless but you only have to pause to scan the stack roots, so at least the max pause time does not grow as the heap grows.
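for reference, the classic shape of the per-dereference read barrier is a Brooks-style forwarding pointer, roughly (sketch):

    typedef struct ObjHeader {
        struct ObjHeader *forward;   /* points to the object itself, or to
                                        its tospace copy once the GC has
                                        moved it */
        /* ...object fields follow... */
    } ObjHeader;

    /* Every access indirects through the forwarding pointer, so mutators
       always see the current copy even while the collector is compacting
       concurrently. One extra load per dereference is the price. */
    static inline ObjHeader *read_barrier(ObjHeader *o) {
        return o->forward;
    }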

---

[5]

the-alchemist 1 day ago [-]

Looks like the big challenge is managing a large, LRU cache, which tends to be a difficult problem for GC runtimes. I bet the JVM, with its myriad tunable GC algorithms, would perform better, especially Shenandoah and, of course, the Azul C4.

The JVM world tends to solve this problem by using off-heap caches. See Apache Ignite [0] or Ehcache [1].

I can't speak for how their Rust cache manages memory, but the thing to be careful of in non-GC runtimes (especially non-copying GC) is memory fragmentation.

Its worth mentioning that the Dgraph folks wrote a better Go cache [2] once they hit the limits of the usual Go caches.

From a purely architectural perspective, I would try to put cacheable material in something like memcache or redis, or one of the many distributed caches out there. But it might not be an option.

It's worth mentioning that Apache Cassandra itself uses an off-heap cache.

[0]: https://ignite.apache.org/arch/durablememory.html
[1]: https://www.ehcache.org/documentation/2.8/get-started/storag...
[2]: https://blog.dgraph.io/post/introducing-ristretto-high-perf-...

reply

dochtman 1 day ago [-]

On the one hand, yes. On the other hand, all of this sounds much more complex and fragile. This seems like an important point to me:

"Remarkably, we had only put very basic thought into optimization as the Rust version was written. Even with just basic optimization, Rust was able to outperform the hyper hand-tuned Go version."

reply

---

chubs 1 day ago [-]

I found similarly when I ported an image resizing algorithm from Swift to Rust: I'm experienced in Swift and thus was able to write it in an idiomatic way, and I have little Rust experience and thus wrote it in a naive way; yet still the Rust algorithm was twice(!) as fast. And Swift doesn't even have a GC slowing things down!

reply

giornogiovanna 20 hours ago [-]

Reference counting is a (low-throughput, low-latency) form of garbage collection.

reply

arcticbull 20 hours ago [-]

Yes and no. From a theoretical perspective, I suppose that's true, but "garbage collection" tends to mean a non-deterministic collector that does its own thing, and you don't have to think at all about memory. That does not apply to Swift, as of course, you need to understand the difference between strong and weak references. It's unfairly simplistic to couple the two.

reply

barrkel 16 hours ago [-]

No, RC is GC.

Most people think of Python as GCed language, and it uses mostly RC.

Any runtime that uses mark & sweep today may elect to use RC for some subset of the heap at some point in a future design, if that makes more sense. The mix of marking GC vs refcounting GC shouldn't affect the semantics of the program.

reply

staticassertion 18 hours ago [-]

But it actually is just garbage collection.

reply

Sean1708 17 hours ago [-]

It is but in the context of this discussion it's very clear that they meant a tracing garbage collector, which has a very different cost than atomic reference counting. Or to put it another way: you're technically correct, the worst kind of correct.

reply

dgellow 1 day ago [-]

ARC, used by Swift, has its own cost.

reply

1propionyl 19 hours ago [-]

True, but it's generally better than most full GC solutions (for processes running for relatively short times without the benefit of profile-guided optimization), and worse than languages with fully statically analyzable memory usage.

Note: that parenthetical is a very big caveat, because properly profile-optimized JVM executables can often achieve exceptional performance/development cost tradeoffs.

In addition however, ARC admits a substantial amount of memory-usage optimization given bytecode, which is now what developers provide to Apple on iOS. Not to mention potential optimizations by allowing Apple to serve last-minute compiled microarchitecture optimized binaries for each device (family).

To satiate the pedants... ARC is more or less GC where calls into the GC mechanism are compiled in statically and where there are at worst deterministic bounds on potential "stop the world" conditions.

While this may not be presently optimal because profile-guided approaches can deliver better performance by tuning allocation pool and collection time parameters, it's arguably a more consistent and statically analyzable approach that with improvement in compilers may yield better overall performance. It also provides tight bounds on "stop the world" situations, which also exist far less frequently on mobile platforms than in long-running server applications.

Beyond those theoretical bounds, it's certainly much easier to handle when you have an OS that is loading and unloading applications according to some policy. This is extremely relevant as most sensible apps are not actually long running.

reply

ernst_klim 15 hours ago [-]

> but it's generally better than most full GC solutions

I doubt that. It implies huge costs without giving any benefits of GC.

A typical GC has compaction, nearly stack-like fast allocation [1], and the ability to allocate a bunch of objects at once (just bump the heap pointer once for a whole bunch).

And both Perl and Swift do indeed perform abysmally, usually worse than both GC and manual languages [2].

> ARC is more of less GC

Well, no. A typical contemporary GC is generational, often concurrent, allowing fast allocation. ARC is just a primitive allocator with ref/deref attached.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.49....

[2] https://github.com/ixy-languages/ixy-languages

reply

pkolaczk 7 hours ago [-]

It is nowhere near stack-like. The stack is hot in cache. Heap memory in a TLAB is cold. Bringing the lines into cache is the major cost, not bumping the pointer.

reply

pjmlp 12 hours ago [-]

Swift's form of GC is one of the slowest ones; no wonder porting to Rust made it faster, especially given that most tracing GCs outperform Swift's current implementation.

If one goes with reference counting as the GC implementation, then one should take the effort to use hazard pointers and related optimizations.

reply

terminaljunkid 1 day ago [-]

> Swift doesn't have a GC slowing things down

This Apple marketing meme needs to die. Reference counting incurs arguably more cost than GC, or at least the cost is spread throughout processing.

reply

pkolaczk 20 hours ago [-]

It incurs some cost, but whether it is higher is very debatable. This is very much workload dependent. A smart compiler can elide most reference updates.

reply

pjmlp 12 hours ago [-]

No it can't, not in Swift's current implementation.

https://github.com/ixy-languages/ixy-languages

reply

abjKT26nO8 18 hours ago [-]

It would seem that the Swift compiler is far from smart[1].

[1]: https://media.ccc.de/v/35c3-9670-safe_and_secure_drivers_in_...

reply

---

regarding Rust: "

pornel 2 days ago [-] ... Simplifying the problem to single ownership with shared-immutable vs exclusive-mutable makes it much easier to reason about borrow checking. If something doesn't fit these rules, you either use a different approach, or use `unsafe`, and then wrap it in a higher-level safe abstraction that fits the model. "

...

Animats 2 days ago [-]

Doubly-linked lists are hard with Rust's ownership system. I've argued for Rust having backpointers as a built-in type the borrow checker understands. You have to maintain the invariant that if A points to B, B points back to A. A is an owning pointer, and B is a non-owning pointer locked in a relationship with A. Easy to check if the language lets you say that's what you're doing. Rust lacks that.

If you have safe backpointers, most tree-type data structures with backpointers can be constructed. A nice feature to have.

magicalhippo 2 days ago [-]

As a non-Ruster, any particular reason they didn't include this? A lot of the code I've done over the years involves graphs, which have a lot of circular pointers and/or backpointers. I find it kinda weird they'd make such a common thing a PITA in a new language.

reply

dodobirdlord 2 days ago [-]

...

The Rust question to ask here is: Do you really need pointers, specifically? Or would some other reference mechanism work? You could put all of your graph nodes into a linear data structure like an array, and have them point to each other by holding a list of indices instead of a list of pointers.

dmitrygr 1 day ago [-]

> You could put all of your graph nodes into a linear data structure like an array, and have them point to each other by holding a list of indices instead of a list of pointers

Rust practitioners keep proposing this solution. It is like you have never heard of caches or do not understand that modern CPUs have multiple special prefetchers for pointer chasing and no prefetchers for "rust fanatics"

"

---

derefr 2 days ago [-]

Erlang has a parameter called initial_heap_size. Each new actor-process in Erlang gets its own isolated heap, for which it does its own garbage-collection on its own execution thread. This initial_heap_size parameter determines how large each newly-spawned actor’s heap will be.

Why would you tune it? Because, if you set it high enough, then for all your short-lived actors, memory allocation will become a no-op (= bump allocation), and the actor will never experience enough memory-pressure to trigger a garbage-collection pass, before the actor exits and the entire process heap can be deallocated as a block.

 dahart 2 days ago [-]

The games and GPU apps I’ve worked on use memory pools for small allocations, where there will be individual pools for all, say, 1-16 byte allocations, 16-64 byte allocations, 64-256 byte allocations, etc. (Sizes just for illustration, not necessarily realistic). The pool sizes always get tuned over time to match the approximate high water mark of the application.

I think pools and arenas mean pretty much the same thing. https://en.wikipedia.org/wiki/Memory_pool I’ve mostly heard this discussed in terms of pools, but I wonder if there’s a subtle difference, or if there’s a historical reason arena is popular in some circles and pool in others...?

I haven’t personally seen a per-frame heap while working in console games, even though games I’ve worked on probably had one, or something like it. Techniques that I did see and are super-common are fixed maximum-size allocations: just pre-allocate all the memory you’ll ever need for some feature and never let it go; stack allocations sometimes with alloca(); and helper functions/classes that put something on the stack for the lifetime of a particular scope.

reply

twoodfin 2 days ago [-]

I’ve always understood an arena to use a bump pointer for allocation and to support only a global deallocation, as the GP describes.

A pool or slab allocator separates allocations into one of a range of fixed-size chunks to avoid fragmentation. Such allocators do support object-by-object deallocation.

reply
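a fixed-size-chunk pool in that sense, sketched in C:

    #include <stdlib.h>

    /* A pool: fixed-size chunks threaded on a free list. Unlike an arena,
       individual chunks can be returned, and fragmentation stays bounded
       because every chunk in a given pool is the same size. */
    typedef struct Chunk { struct Chunk *next; } Chunk;

    typedef struct Pool {
        Chunk *free_list;
        size_t chunk_size;   /* must be >= sizeof(Chunk) */
    } Pool;

    void *pool_alloc(Pool *p) {
        if (!p->free_list)
            return malloc(p->chunk_size);   /* real pools grab a new slab here */
        Chunk *c = p->free_list;
        p->free_list = c->next;
        return c;
    }

    void pool_free(Pool *p, void *ptr) {   /* O(1) object-by-object deallocation */
        Chunk *c = ptr;
        c->next = p->free_list;
        p->free_list = c;
    }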

WalterBright 6 hours ago [-]

D now has an Ownership/Borrowing system implemented (though it is in its early stages). It will work with existing code, including DasBetterC code. You're able to incrementally adapt a program to O/B rather than rewrite it.

reply

---

" The zig standard library provides a standard interface for allocators. Everything in the stdlib that allocates memory takes an allocator as an argument. This makes it easy to use custom allocators without having to write your own basic data-structures. While there isn't much of a zig ecosystem yet, it seems likely that future libraries will follow this convention.

Rust provides an implicit global allocator. The standard library and most other libraries use only the global allocator. It's possible to choose the global allocator, but not to use different global allocators in different parts of your program. " -- [6]
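the Zig convention, sketched in C terms (a hypothetical interface; Zig's real std.mem.Allocator is richer):

    #include <stddef.h>

    /* An explicit allocator interface. ctx lets one struct of function
       pointers wrap malloc, an arena, a pool, etc. */
    typedef struct Allocator {
        void *(*alloc)(void *ctx, size_t n);
        void  (*release)(void *ctx, void *p);
        void  *ctx;
    } Allocator;

    /* Library code takes the allocator as an argument instead of calling
       malloc directly, so callers can substitute custom allocators. */
    int *make_ints(const Allocator *a, size_t n) {
        return a->alloc(a->ctx, n * sizeof(int));
    }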

---

https://stackoverflow.com/questions/19154607/how-actually-card-table-and-writer-barrier-works

---

golang's GC seems to be focused on low latency and somewhat on simplicity. It seems like they succeeded in getting the latency down. Probably should model after theirs. i put links to lots of presentations on theirs in plChGolang.

they have a write barrier but it is turned off (via a global variable) except during GC.

they don't do copying or compaction. idea: we could do copying upon program's request but not otherwise.
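schematically, the toggled barrier looks like this (a sketch; Go's actual barrier is a more involved Dijkstra/Yuasa hybrid):

    extern int gc_write_barrier_enabled;   /* flipped on only during a GC cycle */

    void gc_shade(void *ptr);   /* assumed: marks ptr grey for the collector */

    /* Pointer stores check the global flag; outside of a GC cycle the
       barrier costs one predictable branch. */
    void write_pointer(void **slot, void *ptr) {
        if (gc_write_barrier_enabled) {
            gc_shade(ptr);     /* Go's hybrid also shades the old *slot */
        }
        *slot = ptr;
    }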

---

" Anything that needs to allocate memory in Zig takes an allocator as an argument. " -- [7]

---

https://nim-lang.org/blog/2020/12/08/introducing-orc.html

---

Golang's low latency GC is apparently not compacting. So maybe not what we want.

These days JVM seems to have Shenandoah, Azul C4, and ZGC, which apparently can have relatively short pauses (and pauses that do not increase with increasing heap size), and apparently compact. It sounds like all three of those have a read barrier (or 'load barrier', what is that?). Why not just a write barrier (e.g. at each write, first check if the GC is doing anything at all; if not, just write; if so, then check if the GC is in the middle of copying this destination, and if so apply the write to both copies)? I'm not sure.

"The Garbage Collection Handbook: The Art of Automatic Memory Management By Richard Jones, Antony Hosking, Eliot Moss" might have something to say about that. A google search turns up a hit on page 405 of this book, within section 19.7 CONTROLLING FRAGMENTATION: " Incremental replication on uniprocessors

Before considering more complicated schemes for concurrent compaction, it is worth noting that many real-time applications run in embedded systems, where uniprocessors have been the dominant platform. Preserving atomicity of mutator operations (with respect to the collector and other mutators) is simple on a uniprocessor, either by disabling scheduling or by preventing thread switching except at GC-safe points (making sure that mutator barriers never contain a GC-safe point). In this setting, the collector can freely copy objects as long as mutators subsequently access only the copy (using a Brooks indirection barrier to force a tospace invariant), or they make sure to update __both__ copies (in case other mutators are still reading from the old version in a replicating collector). ...(the book goes on to describe some algorithms that work like this but assume uniprocessors)... Concurrent compaction on a multiprocessor system prevents us from assuming that Read and Write can be made straightforwardly atomic. For that we must consider more fine-grained synchronization among mutators, and between mutator and collector, as follows. ...(the book goes on to describe some algorithms that work on multiprocessors)... "

So i still don't quite understand. Perhaps they are saying that the issue is that if there are two copies of a block in the heap, and there are a zillion mutator threads, and some of them are working with the first copy and some with the second, and you have only a write barrier that ensures that each write is written to both copies, then in between the time you write to the first copy and the time that you write to the second copy, some of the other threads will observe the first copy and some will observe the second copy? In other words, you'd need to be able to atomically write to both copies at once, and you can't. Although i guess you could just have some sort of lock that makes everyone else wait to read until you're done writing to both copies, but maybe the sync overhead here is considered too high, or maybe the sync time increases with the number of threads, or maybe that would itself be considered a read barrier?
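sketching the write-to-both-copies barrier makes the problem easier to see (a sketch; the two stores are not atomic as a pair, which is exactly the multiprocessor issue above):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct Hdr {
        struct Hdr *replica;   /* non-NULL while the GC holds a tospace copy */
    } Hdr;

    /* Replicating-collector write barrier: keep both copies in sync. On a
       uniprocessor the pair of stores can be made atomic by forbidding a
       thread switch between them; on a multiprocessor another thread can
       read between the two stores and see the copies disagree. */
    void write_field(Hdr *obj, size_t offset, uintptr_t val) {
        *(uintptr_t *)((char *)obj + offset) = val;
        if (obj->replica)
            *(uintptr_t *)((char *)obj->replica + offset) = val;
    }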

that book has another section, 17Concurrentcopying&compaction, 17.4Replicationcopying

and certainly this sort of thing would get in the way of lockless concurrency algorithms, but i think we can sacrifice that in Oot -- use a lower-level library if you need that; the Oot runtime doesn't provide the ability for user code to have that level of perf (the runtime itself might be able to use lockless concurrency in some places though).

btw there's some debate about what 'pauseless' means. Does it count as a pause if each thread pauses to find pointers in its roots/stack? Does it count as a pause if two threads have to synchronize? I think we're happy with things like that; what we really care about is probably limiting the maximum time of the pause to a very low number, which doesn't increase with larger heap size and larger number of threads. However, we also care a lot more than others about simplicity -- and less than others about reductions in throughput.

---

sounds like the state of the art for the JVM is Red Hat's Shenandoah, Azul's C4, and Oracle's ZGC. Azul's C4 sounds like it is not (completely?) FOSS, so people don't talk about it as much.

Shenandoah used to add 'forwarding pointers' as a field in every heap object (50% worst case memory cost, 5-10% typical case) but now "Shenandoah 2.0" dispenses with that and sounds like it uses some sort of tagged pointers instead. ZGC sounds like it uses some sort of tagged pointers. Shenandoah 2.0 works with a 'strong tospace invariant', which sounds like it means some sort of barrier scheme that ensures that if an object is written to during copying, every thread only uses the new version from that point forward.

links:

should also look into the .NET GC. https://stackoverflow.com/questions/64252590/how-does-clr-gc-compare-to-latest-zgc-and-shenandoah-gc-on-jvm suggests that it achieves 10ms max pause times, which is good enough for us. But a 2018 article https://medium.com/servicetitan-engineering/go-vs-c-part-2-garbage-collection-9384677f86f1 says that .NET GC STW pause times scale with the # of objects in the heap, and that pauses as large as 125 sec have been observed! "Go’s pauses are barely visible here — because they are tiny. The max. pause I was able to measure was ~ 1.3 seconds — for the comparison, .NET got 125 sec. STW pause on the same test case...STW pause time seems to scale linearly with the static set size both for .NET and for Go. But if we compare larger pauses, they are ~ 100x shorter for Go...Conclusion: Go’s incremental GC does work; .NET’s concurrent GC doesn’t."

---

chrisseaton on Dec 14, 2019 [–]

I don't think there are any GCs that do not stop-the-world at some point, except those with special hardware support, are there? Even C4 requires in theory stopping individual threads, and in practice even has a global stop-the-world phase.

pcwalton on Dec 14, 2019 [–]

It's easy to write a GC that doesn't stop the world at all. A Baker treadmill [1], for example, is a very simple type of incremental collector that can be implemented quickly. (It's a fun exercise, by the way!) If you implement it "by the book", it will have no stop-the-world pauses whatsoever (assuming the mutator doesn't allocate faster than collection can happen).

The problem is that a simple GC like this will have low throughput, so it's not worth it. These are some of the most common misconceptions about GCs: that it's hard to write a low-latency GC and that pauses are all that matter. In reality, GC engineering is all about the tradeoff between throughput and latency.

[1]: https://www.memorymanagement.org/glossary/t.html#treadmill

---

what's our GC pause target?

a subthread in https://news.ycombinator.com/item?id=19435890 talks about GC pauses and games. A person points out that JVM Shenandoah's 10 ms pause target is still too high for games. People point out that each frame at 60 FPS is 16ms; at 90 FPS (often used in VR), it's 11ms; and at 144 FPS, it's slightly less than 7ms. A commenter points out that "...throwing away more than half your frame budget just because of GC seems unreasonable". So, if we take half of 11ms, that would be 5ms. So let's target 5ms max pauses (avg would be even smaller).

Comparing this to others: text below a slide in https://blog.golang.org/ismmkeynote says "We now have an objective of 500 microseconds stop the world pause per GC cycle" (in 2018). In 2014 they had "10ms STW pauses every 50ms", according to the slide above that text. So 5ms max seems aggressive but doable.

---

" ResidentSleeper? on March 19, 2019 [–]

Keep in mind that C# has support for non-primitive value types (structs) which means you can avoid garbage collection entirely if you're careful (not all that difficult in my experience). "

"

keyle on March 20, 2019 [–]

The first thing a game programmer will do in a garbage collected environment is either try to disable it or control it.

pjmlp on March 20, 2019 [–]

Or learn that maybe for the game that they are making it doesn't matter.

And if it does, making themselves acquainted with how value types, slices and gc-free pointers work would probably be a better idea. "

" _qjt0 on March 21, 2019 [–]

Thanks. I understand that the numbers require profiling to determine, but I wish there was at least a high-level guide like:

Do you want to optimise pause times? Use X. Optimise throughput without regard to pause times? Use Y. Optimise memory use? Use Z.

It may not be accurate in every case (no guideline is), but it would be a good starting point for the vast majority of us who don't know the pros and cons of various collectors.

pjmlp on March 21, 2019 [–]

Here are some guides for different JVMs,

https://docs.oracle.com/javase/10/gctuning/introduction-garbage-collection-tuning.htm#JSGCT-GUID-326EB4CF-8C8C-4267-8355-21AB04F0D304

https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/memman.html#wp1087125

https://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All

https://www.infoq.com/articles/tuning-tips-G1-GC

https://developer.ibm.com/articles/garbage-collection-tradeoffs-and-tuning-with-openj9/

https://www-01.ibm.com/support/docview.wss?uid=swg27013824&aid=1 "

---

	rayiner on Oct 29, 2016
on: Sub-millisecond GC pauses in Go 1.8

Very impressive. I wonder how Go manages to even stop all threads in 100 microseconds, much less do any work in that time.

johncolanduoni on Oct 29, 2016 [–]

Other systems I'm aware of that are capable of similar feats (I know .NET's collector can, not sure about Hotspot's) use page poisoning. Basically the compiler/JIT puts a bunch of extraneous memory accesses to a known address at key points where the GC can track what all variables and registers hold (called "safepoints"), in a fairly liberal pattern around the code. When the GC wants to stop all the threads in their tracks, it unmaps the page holding that address, causing segfaults in every thread when it hits one of the safepoints. The GC hooks the segfault handler, and simply waits for all threads to stop.

I'm not sure that's what Go does, but I'm not aware of a faster way on standard hardware.

pcwalton on Oct 29, 2016 [–]

That's basically what Go does; it folds the preemption points into the stack checks during function prologs.

menzoic on Oct 29, 2016 [–]

Where did you learn that? (curious to learn more)

ci5er on Oct 29, 2016 [–]

I haven't seen it written up, but it's discussed some here:

hinkley on Oct 29, 2016 [–]

hopefully in back branches too, right?

pcwalton on Oct 29, 2016 [–]

Not according to this bug. :( https://github.com/golang/go/issues/10958

pcwalton on Oct 29, 2016 [–]

To stop the world, Go preempts at the next stack size check [1], which happens during function call preludes. I guess that happens often enough to be fast in practice.

[1]: https://github.com/golang/go/blob/8f81dfe8b47e975b90bb4a2f8d...

rayiner on Oct 29, 2016 [–]

I assume this works so quickly in part because goroutines are backed by a pool of OS threads that don't block? So everybody doesn't get stuck waiting for a thread that's blocked to get around to checking the preemption flag?

pcwalton on Oct 29, 2016 [–]

Right, blocked threads are stopped in the userspace scheduler so there are no synchronization issues with just keeping them stopped.
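the polling half of that page-poisoning technique, sketched in C (the SIGSEGV-handler half is omitted; names are ours):

    /* One page that is normally mapped readable. To stop the world, the
       runtime mprotect()s it to PROT_NONE; every thread then faults at its
       next poll, and the runtime's SIGSEGV handler parks the thread until
       the GC is done. */
    extern char *gc_poll_page;

    /* The compiler emits this load at safepoints -- e.g. in function
       prologues, where Go folds it into the stack-size check. When no GC
       is pending it costs a single read. */
    static inline void safepoint_poll(void) {
        *(volatile char *)gc_poll_page;
    }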

---

conclusions so far:

---

https://developers.redhat.com/articles/2021/09/16/shenandoah-openjdk-17-sub-millisecond-gc-pauses#concurrent_thread_processing_in_openjdk_17

---

typical182 1 day ago [–]

This is a great list of influences on the design (from the article comments where the prototype author Sam Gross responded to someone wishing for more cross pollination across language communities):

" "… but I'll give a few more examples specific to this project of ideas (or code) taken from other communities:

---

hn_throwaway_99 9 hours ago [–]

After being both a Java programmer for the greater part of a decade, then spending a bunch of years doing Objective C/Swift programming, I don't really understand why the Automatic Reference Counting approach hasn't won out over all others. In my opinion it's basically the best of all possible worlds:

1. Completely deterministic, so no worrying about pauses or endlessly tweaking GC settings.

2. Manual ref counting is a pain, but again ARC basically makes all that tedious and error prone bookkeeping go away. As a developer I felt like I had to think about memory management in Obj C about as much as I did for garbage collected languages, namely very little.

3. True, you need to worry about stuff like ref cycles, but in my experience I had to worry about potential memory leaks in Obj C about the same as I did in Java. I.e. I've seen Java programs crash with OOMs because weak references weren't used when they were needed, and I've seen similar in Obj C.

reply

tsimionescu 8 hours ago [–]

On the contrary, as others have pointed out, automatic reference counting seems to be the worst of all worlds.

1. Automatic reference counting adds a barrier every time you pass around an object, even if only for reading, unlike both manual memory management and tracing GC.

2. ARC still requires thought about ownership, since you have to ensure there are no cycles in the ownership graph unlike tracing/copying/compacting GC.

3. ARC still means that you can't look at a piece of code and tell how much work it will do, since you never know if a particular piece of code is holding the last reference to a piece of data, and must free the entire object graph, unlike manual memory management. In the presence of concurrency, cleanup is actually non-deterministic, as 'last one to drop its reference' is a race.

4. ARC doesn't solve the problem of memory fragmentation, unlike arena allocators and compacting/copying GC.

5. ARC requires expensive allocations, it can't use cheap bump-pointer allocation like copying/compacting GC can. This is related to the previous point.

6. ARC cleanup still takes time proportional to the number of unreachable objects, unlike tracing GC (proportional to number of live objects) or arena allocation (constant time).

Reference counting is a valid strategy in certain pieces of manually memory managed code (particularly in single-threaded contexts), but using it universally is almost always much worse than tracing/copying/compacting GC.

Note that there are (soft) realtime tracing GCs, but this can't be achieved with ARC.

reply

elcritch 6 hours ago [–]

It depends on the implementation. In traditional ARC implementations these are all issues to a degree, but even then they tend toward much lower overhead with more predictable behavior. Though I agree tracing GCs can be really fast and better in many ways.

1. I'm guessing you mean lock-based implementations? There are several non lock, non atomics based ARC designs that still handle threading safely. That means it's little more than a single integer operation.

2. True, but for many contexts this is easy to do and makes it easier to understand data flows. In other cases it's possible to use index based references pretty readily, like in Rust. Or add a cycle collector.

3. In theory you can't, but in practice it's often pretty easy to tell, at least in non-OOP languages. I use Nim with its ARC (1) on an RTOS and this is really only a concern for large lists or large dynamic structures. It can be managed using the same patterns as RAII where you call child functions who know they won't be the last ones holding the bag. Also you can use the same trick as some Rust code does where you pass the memory to another thread to dispose of (2).

4/5. It depends on the implementation, but you can use pools or arenas or other options. Nim provides an allocator algorithm (tlsf) with proven O(1) times and known fragmentation properties (3). Still tracing gcs can make better usage of short lifetime arenas true. Though with ARC you get similar benefits using stack based objects.

6. It's tradeoffs. Tracing GCs also end up needing to scan the entire heap every so often. ARC only needs to check a root object during usage, and only touches the entire graph when de-allocating.

Your last point isn't accurate, as you can use an appropriately designed ARC in a hard realtime context. I've found it quite easy to do; granted, it takes a little bit of care, but any real-time system does. For items like interrupt handlers I ensure no memory is allocated or destroyed.

1) https://nim-lang.org/blog/2020/10/15/introduction-to-arc-orc... 2) https://abramov.io/rust-dropping-things-in-another-thread 3) http://www.gii.upv.es/tlsf/

reply

m0zg 3 hours ago [–]

> non lock, non atomics based ARC designs that still handle threading safely

Don't think that's even doable at all, at least not portably. Do you have some examples?

reply

elcritch 2 hours ago [–]

There's a couple of methods like deferred counting, differential counting, or biased references. They're usually not completely atomic free but generally provide guaranteed constant overhead or tweak when or how the memory can be shared.

It's pretty cool stuff!

reply

chrisseaton 8 hours ago [–]

Isn't ARC absolutely worst-case for most multi-threading patterns? You'll thrash objects between cores just when you reference them. Every object becomes a mutable object!

reply

flohofwoe 7 hours ago [–]

If you check ARC code in a profiler, there's a shocking amount of time spent in the retain/release calls:

https://floooh.github.io/2016/01/14/metal-arc.html

There are ways around this overhead at least in Metal, but this requires at least as much effort as not using ARC to begin with.

reply

Twisol 7 hours ago [–]

I think the M1 CPU lowers some of the ARC retain/release to silicon -- it doesn't remove the problem, but it does seem to reduce the relative overhead significantly.

https://news.ycombinator.com/item?id=25203924

> this requires at least as much effort as not using ARC to begin with.

Designing a brand new CPU architecture certainly counts as "at least as much effort", yes. ^_^

P.S. While I'm here, your "handles are the better pointers" blog post is one of my all-time favorites. I appreciate you sharing your experiences!

reply

pcwalton 3 hours ago [–]

Reference counting is a form of garbage collection. Remember that reference counting and tracing garbage collection are simply two extremes of a continuum of approaches. Must-read paper: "A Unified Theory of Garbage Collection".

reply

fulafel 9 hours ago [–]

Compared to other, state of the art GC strategies, it's slow and has bad cache behavior esp. for multithreading. Also, not deterministic perf wise.

reply

hn_throwaway_99 8 hours ago [–]

Can you explain these or point me to any links, I'd like to learn more.

> has bad cache behavior for multithreading

Why is this? Doesn't seem like that would be something inherent to ref counting.

> Also, not deterministic

I always thought that with ref counting, when you decrement a ref count and it goes to 0, it will then essentially call free on the object. Is this not the case?

reply

lallysingh 8 hours ago [–]

Bad cache behavior: you're on core B, and the object is used by core A and in A's L2 cache. Just by getting a pointer to the object, you have to mutate it. Mutation invalidates A's cache entry for it and forces it to load into B's cache.

determinism: you reset a pointer variable. Once in a while, you're the last referent and now have to free the object. That takes more instructions and cache invalidation.

reply

hn_throwaway_99 7 hours ago [–]

> Bad cache behavior: you're on core B, and the object is used by core A and in A's L2 cache. Just by getting a pointer to the object, you have to mutate it. Mutation invalidates A's cache entry for it and forces it to load into B's cache.

Thank you! This was the response that made the cache issues clearest to me.

reply

[8]

---

kevingadd 4 hours ago [–]

Right, but if you're allocating way too much, no lifetime management model will save you.

As a game dev I still will take a modern GC over refcounting or manual lifetime management any day. There are scenarios where you simply don't want to put up with the GC's interference, but those are rare - and modern stacks like .NET let you do manual memory management in cases where you are confident it's what you want to do. For the vast majority of the code I write, GC is the right approach - the best-case performance is good and the worst-case scenario is bad performance instead of crashes. The tooling is out there for me to do allocation and lifetime profiling and that's usually sufficient to find all the hotspots causing GC pressure and optimize them out without having to write any unsafe code.

The cost of having a GC walk a huge heap is a pain, though. I have had to go out of my way to ensure that large buffers don't have GC references in them, so that the GC isn't stuck scanning hundreds of megabytes worth of static data. Similar to what I said above though, the same problem would occur if those data structures had a smart pointer in them :/

reply

---