proj-oot-ootMemoryManagementNotes2


	"

Brad Abrams posted an e-mail from Brian Harry written during development of the .Net framework. It details many of the reasons reference counting was not used, even when one of the early priorities was to keep semantic equivalence with VB6, which uses reference counting. It looks into possibilities such as having some types ref counted and not others (IRefCounted?!), or having specific instances ref counted, and why none of these solutions were deemed acceptable.

    Because [the issue of resource management and deterministic finalization] is such a sensitive topic I am going to try to be as precise and complete in my explanation as I can. I apologize for the length of the mail. The first 90% of this mail is trying to convince you that the problem really is hard. In that last part, I'll talk about things we are trying to do but you need the first part to understand why we are looking at these options.
    ...
    We initially started with the assumption that the solution would take the form of automatic ref counting (so the programmer couldn't forget) plus some other stuff to detect and handle cycles automatically. ...we ultimately concluded that this was not going to work in the general case.
    ...
    In summary:
        We feel that it is very important to solve the cycle problem without forcing programmers to understand, track down and design around these complex data structure problems.
        We want to make sure we have a high performance (both speed and working set) system and our analysis shows that using reference counting for every single object in the system will not allow us to achieve this goal.
        For a variety of reasons, including composition and casting issues, there is no simple transparent solution to having just those objects that need it be ref counted.
        We chose not to select a solution that provides deterministic finalization for a single language/context because it inhibits interop with other languages and causes bifurcation of class libraries by creating language specific versions.

"

https://blogs.msdn.microsoft.com/brada/2005/02/11/resource-management/

---

http://stackoverflow.com/tags/garbage-collection/info

---

maybe someday ask StackOverflow for a survey of approaches to garbage collection in programming languages with the following goals:

I'm designing a programming language and i'm looking for suggestions for garbage collection/automatic memory management algorithms that meet the following criteria. Latency (GC pauses) must be minimized, but throughput performance is not important. The language is intended to be easy to reimplement on different platforms, so it is imperative that the GC technique in the reference implementation be simple. To reiterate, the algorithm under discussion could later be replaced with a more complicated, higher-performing GC algorithm on some implementations, but what i need for the reference implementation (i.e. what i am asking for here) is simplicity and ease of (correct) implementation, even at the expense of inefficiency. The data structures being managed will be concurrently accessed by multiple threads, and will contain reference cycles.

In more detail, design goals include:

The default garbage collection systems of language implementations such as the JVM are not suitable because (a) they can have long GC pauses, and (b) they are probably more concerned with efficiency than simplicity. Some of the other automatic memory management systems that i am currently looking at as potential models are Inferno's, Go's, and Python's; other suggestions, or comments and comparisons, are welcome.

http://stackoverflow.com/questions/tagged/programming-languages http://stackoverflow.com/questions/tagged/language-design http://softwareengineering.stackexchange.com/questions/tagged/programming-languages http://softwareengineering.stackexchange.com/questions/tagged/programming-languages+language-design

---

jeffdavis 6 hours ago [-]

Tangent:

In databases, it's common to do arena based allocation. You allocate objects of similar lifetime (life of a query, life of a transaction, life of a procedure execution, etc.), and free them all at once when the lifetime is up.

When hacking postgresql, for instance, using pfree() is fairly uncommon. You just allocate, (which goes to the current arena by default), and then the arena is wiped out all at once later.

Of course, sometimes you might use a lot of memory temporarily in a loop, and you need to pfree() it each iteration. But that is the exception.

I think of each arena as a tiny heap (though of course they can grow large).

reply

favorited 5 hours ago [-]

This was also an optimization you could do in Objective-C on NeXTSTEP/OpenStep/macOS. Allocating methods took an optional "zone" parameter (`allocWithZone:`, `copyWithZone:`), which was basically an arena.

In this case, the bigger benefit wasn't the allocation/deallocation – but the fact that you would be working with memory on the same virtual memory page, IIUC.

reply
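
A minimal sketch of the arena idea described above, in C (illustrative, hypothetical names; not postgres's actual palloc/MemoryContext API nor NeXT's zone API):

    #include <stdlib.h>

    /* A tiny arena: allocation just bumps a pointer; everything is freed at once. */
    typedef struct Arena {
        char  *base;   /* backing block */
        size_t used;   /* bytes handed out so far */
        size_t cap;    /* total size of the backing block */
    } Arena;

    Arena *arena_new(size_t cap) {
        Arena *a = malloc(sizeof(Arena));
        a->base = malloc(cap);
        a->used = 0;
        a->cap  = cap;
        return a;
    }

    /* Bump allocation: no per-object free, no free lists. */
    void *arena_alloc(Arena *a, size_t n) {
        n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment */
        if (a->used + n > a->cap) return NULL; /* a real arena would grow or chain blocks */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    /* "Wipe out the arena all at once" when the lifetime (query, transaction, ...) ends. */
    void arena_destroy(Arena *a) {
        free(a->base);
        free(a);
    }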

---

https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ohyo057ev

---

"Go’s garbage collector is a concurrent mark-sweep with very short stop-the-world pauses and it just seems to keep getting better. This is definitely a good thing for a garbage collected language. We did have some issues with unacceptable latencies, but were able to work them all out. No such luck with the Haskell project."

"...previously tried to use Haskell for the project I was working on, but eventually had to give up due to unpredictable latency caused by garbage collection pauses. GHC’s garbage collector is designed for throughput, not latency. It is a generational copying collector, which means that pause times are proportional to the amount of live data in the heap. To make matters worse, it’s also stop-the-world."

---

"Go is garbage-collected. I personally think Swift/Objective-C-style Automatic Reference Counting would have been a better choice for a statically typed systems language, since it gives you the brevity benefits without the GC pauses. I’m sure this has been argued to death elsewhere, so I’ll save my GC rant for a very boring dinner party."

---

pjmlp 1 day ago

parent flag favorite on: C for All

> C does not have RAII-like memory management in any way.

Yes it does, as language extension on gcc and clang.

It is called cleanup attribute.

https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attribute...

https://clang.llvm.org/docs/LanguageExtensions.html#non-stan...
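
For example (gcc/clang only; the cleanup function is passed a pointer to the variable that is going out of scope):

    #include <stdio.h>
    #include <stdlib.h>

    /* Runs automatically when a variable declared with
       __attribute__((cleanup(free_charp))) goes out of scope. */
    static void free_charp(char **p) {
        free(*p);
        *p = NULL;
    }

    int main(void) {
        __attribute__((cleanup(free_charp))) char *buf = malloc(64);
        if (!buf) return 1;
        snprintf(buf, 64, "freed automatically at end of scope");
        puts(buf);
        return 0;  /* free_charp(&buf) is called here, on any exit from the scope */
    }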

---

tomsmeding 1 day ago

parent flag favorite on: C for All

C does not have RAII-like memory management in any way. Exceptions work beautifully with memory management like that, but if it's not there, you can't just say it should work because memory management should work like that.

So basically you're saying, before adding exceptions, add RAII-like memory management, and then actually add exceptions. I like both features, but am not sure how you'd wedge RAII into C. Any ideas on that?

---

sedachv 1 day ago [-]

> C does not have RAII-like memory management in any way.

C does not have memory management in any way period. The C standard library does. How you get to something with dynamic scoping like RAII in C is to use a different library for managing memory. For example Thinlisp[1] and Ravenbrook's Memory Pool System[2] both provide dynamically-scoped region/pool allocation schemes.

[1] https://github.com/vsedach/Thinlisp-1.1

[2] https://www.ravenbrook.com/project/mps/

reply

---

BruceIV 1 day ago [-]

[On the Cforall team] For what it's worth, one of the features Cforall adds to C is RAII.

The exception implementation isn't done yet, but it's waiting on (limited) run-time type information, it already respects RAII.

reply

RhysU 1 day ago [-]

I have often wanted c99 with destructors for RAII purposes.

reply

---

UncleMeat 1 day ago

parent flag favorite on: C for All

Exceptions work great in garbage collected languages. Reasoning about exceptional control flow w.r.t. manual memory management is a total nightmare.

sedachv 1 day ago [-]

> Reasoning about exceptional control flow w.r.t. manual memory management is a total nightmare.

Exceptions make manual memory management easier because a proper exception system has unwind-protect[1]. Exceptions are just movements up the stack - exceptions combine naturally with dynamic scoping for memory allocation (memory regions/pools). This kind of memory management was used in some Lisp systems in the 1980s, and made its way into C++ in the form of RAII. By extending the compiler you can add further memory management conveniences like smart pointers to this scheme.

Now if you want to talk about something that actually makes manual memory management a total nightmare, look at the OP's suggestion for adding closures to C.

[1] http://www.lispworks.com/documentation/HyperSpec/Body/s_unwind.htm

reply

---

reference counting may require write barriers (for pointers at least):

" while reference counting incurs a high cost — every pointer write (including those to stack frames) results in reference count updates "

" The incremental nature of reference counting is generally con- sidered to be its fundamental advantage. However, the cost of up- dating reference counts every time a new pointer is loaded into a register is typically much too high for high-performance applica- tions. As a result, some form of deferred reference counting [20], in which references from stack frames are accounted for separately, is used in most high-performance implementations of reference counting [3, 19]. However, this means that when an object’s reference count drops to zero, it can not be reclaimed immediately, since there might be a reference from the stack that is not accounted for. As a result, collection is deferred until the periodic scanning of the stack refer- ences. However, the result is delayed collection, floating garbage, and longer application pauses — the typical characteristics of trac- ing collectors! "

---

have some bits in each fat pointer (and every pointer should be fat) for the GC to use. This way it can mark pointer instances.

load and store barriers for the GC
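
Roughly this shape, maybe (a sketch; the field sizes and the particular hooks are arbitrary):

    #include <stdint.h>

    /* A fat pointer: the address plus metadata, including a few bits reserved
       for the GC to use to mark this particular pointer instance. */
    typedef struct FatPtr {
        void    *addr;     /* the actual address */
        uint32_t len;      /* e.g. bounds, for slices */
        uint8_t  gc_bits;  /* bits the collector can use per pointer instance */
    } FatPtr;

    /* Load barrier: every load of a pointer-holding location goes through here,
       so the GC can inspect or fix up the pointer as it is loaded. */
    FatPtr gc_load(FatPtr *loc) {
        FatPtr p = *loc;
        /* ...GC hook: e.g. mark p, or redirect p.addr if the target moved... */
        return p;
    }

    /* Store barrier: every store of a pointer goes through here, so the GC can
       record that the destination now refers to p. */
    void gc_store(FatPtr *loc, FatPtr p) {
        /* ...GC hook: e.g. remember the old *loc / shade p for a concurrent marker... */
        *loc = p;
    }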

---

One reason that Java takes lots of memory, besides GC: lack of value types "data structures can't be value types[1], consequently consuming large amounts of extra memory because everything sits behind 2 or 3 or 5 layers of unnecessary 8-byte pointers." [1]

---

" jemalloc is removed by default

Long, long ago, Rust had a large, Erlang-like runtime. We chose to use jemalloc instead of the system allocator, because it often improved performance over the default system one. Over time, we shed more and more of this runtime, and eventually almost all of it was removed, but jemalloc was not. We didn't have a way to choose a custom allocator, and so we couldn't really remove it without causing a regression for people who do need jemalloc.

Also, saying that jemalloc was always the default is a bit UNIX-centric, as it was only the default on some platforms. Notably, the MSVC target on Windows has shipped the system allocator for a long time.

Finally, while jemalloc usually has great performance, that's not always the case. Additionally, it adds about 300kb to every Rust binary. We've also had a host of other issues with jemalloc in the past. It has also felt a little strange that a systems language does not default to the system's allocator.

For all of these reasons, once Rust 1.28 shipped a way to choose a global allocator, we started making plans to switch the default to the system allocator, and allow you to use jemalloc via a crate. In Rust 1.32, we've finally finished this work, and by default, you will get the system allocator for your programs. "

---

[2] [3] " drfuchs 9 months ago [-]

The proposal here is way too vague. And if you flesh it out, things start to fall apart: If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"? If not, how do you manage to get string variable on the stack if its length might change? Or in a struct? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)? On the other hand, if the fat pointers have a length and capacity, how do I get a fat pointer to a substring that's in the middle of a given string?

Similar questions apply to general arrays, as well. Also: Am I able to take the address of an element of an array? Will that be a fat pointer too? How about a pointer to a sequence of elements? Can I do arithmetic on these pointers? If not, am I forced to pass around fat array pointers as well as index values when I want to call functions to operate on pieces of the array? How would you write Quicksort? Heapsort? And this doesn't even start to address questions like "how can I write an arena-allocation scheme when I need one"?

In short, the reason that this sort of thing hasn't appeared in C is not because nobody has thought about it, nor because the C folks are too hide-bound to accept a good idea, but rather because it's not clear that there's a real, workable, limited, concise, solution that doesn't warp the language far off into Java/C#-land. It would be great if there were, but this isn't it.

WalterBright 9 months ago [-]

I happened to know the idea does work, and has been working in D for 18 years now.

> If nul-termination of strings is gone, does that mean that the fat pointers need to be three words long, so they have a "capacity" as well as a "current length"?

No. You'll still have the same issues with how memory is allocated and resized. But, once the memory is allocated, you have a safe and reliable way to access the memory without buffer overflows.

> If not, how do you manage to get string variable on the stack if its length might change? Or in a struct? How does concatenation work such that you can avoid horrible performance (think Java's String vs. StringBuffer)?

As I mentioned, it does not address allocating memory. However, it does offer one performance advantage in not having to call strlen to determine the size of the data.

> On the other hand, if the fat pointers have a length and capacity, how do I get a fat pointer to a substring that's in the middle of a given string?

In D, we call those slices. They look like this:

    T[] array = ...
    T[] slice = array[lower .. upper];

The compiler can insert checks that the slice[] lies within the bounds of array[].

> Am I able to take the address of an element of an array?

Yes: `T* p = &array[3];`

> Will that be a fat pointer too?

No, it'll be regular pointer. To get a fat pointer, i.e. a slice:

    slice = array[lower .. upper];

> How about a pointer to a sequence of elements?

Not sure what you mean. You can get a pointer or a slice of a dynamic array.

> Can I do arithmetic on these pointers?

Yes, via the slice method outlined above.

> If not, am I forced to pass around fat array pointers as well as index values when I want to call functions to operate on pieces of the array?

No, just the slice.

> How would you write Quicksort? Heapsort?

Show me your pointer version and I'll show you an array version.

> And this doesn't even start to address questions like "how can I write an arena-allocation scheme when I need one"?

The arena will likely be an array, right? Then return slices of it.

"

MrBingley 9 months ago [-]

I absolutely agree. Adding an array type to C that knows its own length would solve so many headaches, fix so many bugs, and prevent so many security vulnerabilities it's not even funny. Null terminated strings? Gone! Checked array indexing? Now possible! More efficient free that gets passed the array length? Now we could do it! The possibilities are incredible. Sadly, C is so obstinately stuck in its old ways that adding such a radical change will likely never happen. But one can dream ...

bluetomcat 9 months ago [-]

> Adding an array type to C that knows its own length would solve so many headaches

C arrays know their length, it's always `sizeof(arr) / sizeof(*arr)`. It's just that arrays become pointers when passed between functions, and dynamically-sized regions (what is an array in most other languages) are always accessed via a pointer. "
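
Concretely, the sizeof trick only works where the array type hasn't decayed to a pointer:

    #include <stdio.h>

    void takes_array(int arr[10]) {
        /* Here 'arr' is really an int*, so this computes sizeof(int*) / sizeof(int),
           not 10: the length information is gone. */
        printf("in callee: %zu\n", sizeof(arr) / sizeof(*arr));
    }

    int main(void) {
        int a[10];
        /* In the defining scope the type is int[10], so this correctly prints 10. */
        printf("in caller: %zu\n", sizeof(a) / sizeof(*a));
        takes_array(a);
        return 0;
    }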

---

https://stackoverflow.com/questions/54016849/how-does-the-haskell-runtime-distinguish-between-pointers-and-unboxed-word-sized

---

could have reference counting and rely on the programmer to avoid reference cycles by using weak references. Apparently Swift does this:

[4]

i guess that makes our implementation simpler and more performant. But i think it comes at the cost of making our programmers' lives more complex, so i don't really like it.
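
For reference, the pattern in question, sketched in C (this only shows the non-owning back-pointer idea; Swift's actual weak references are also automatically set to nil when the target is deallocated):

    #include <stdlib.h>

    typedef struct Parent Parent;
    typedef struct Child  Child;

    struct Child {
        long    rc;
        Parent *parent;   /* weak / non-owning: does NOT contribute to the parent's count */
    };

    struct Parent {
        long   rc;
        Child *child;     /* strong / owning: counted */
    };

    void child_release(Child *c) {
        if (--c->rc == 0) free(c);
    }

    /* Because child->parent is not counted, dropping the last external reference
       to the parent frees the parent, which then releases the child. If the back
       pointer were strong, the pair would keep each other alive forever. */
    void parent_release(Parent *p) {
        if (--p->rc == 0) {
            if (p->child) child_release(p->child);
            free(p);
        }
    }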

---

afaict the only way to have a truly pauseless compacting GC is to have a read barrier upon each dereference of any pointer. or maybe handles (pointers to pointers) are enough?

ok there's another way; have a load barrier upon load of any memory location containing a pointer; it's not truly pauseless but you only have to pause to scan the stack roots, so at least the max pause time does not grow as the heap grows.
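
The classic shape of the read-barrier approach is a Brooks-style forwarding pointer: every object carries a header word that points either to itself or to its relocated copy, and every access goes through it (a sketch):

    typedef struct ObjHeader {
        struct ObjHeader *forward;  /* points to self, or to the new (tospace) copy */
        /* ...payload follows... */
    } ObjHeader;

    /* Read barrier: applied on each dereference (or, in the 'load barrier' variant,
       whenever a reference is loaded from the heap). Mutators always end up at the
       current copy, so the collector can relocate objects concurrently. */
    static inline ObjHeader *resolve(ObjHeader *obj) {
        return obj->forward;
    }

When the collector copies an object it installs old->forward = new_copy; mutators that still hold the old address get redirected on their next access.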

---

[5]

the-alchemist 1 day ago [-]

Looks like the big challenge is managing a large, LRU cache, which tends to be a difficult problem for GC runtimes. I bet the JVM, with its myriad tunable GC algorithms, would perform better, especially Shenandoah and, of course, the Azul C4.

The JVM world tends to solve this problem by using off-heap caches. See Apache Ignite [0] or Ehcache [1].

I can't speak for how their Rust cache manages memory, but the thing to be careful of in non-GC runtimes (especially non-copying GC) is memory fragmentation.

Its worth mentioning that the Dgraph folks wrote a better Go cache [2] once they hit the limits of the usual Go caches.

From a purely architectural perspective, I would try to put cacheable material in something like memcache or redis, or one of the many distributed caches out there. But it might not be an option.

It's worth mentioning that Apache Cassandra itself uses an off-heap cache.

[0]: https://ignite.apache.org/arch/durablememory.html [1]: https://www.ehcache.org/documentation/2.8/get-started/storag... [2]: https://blog.dgraph.io/post/introducing-ristretto-high-perf-...

reply

dochtman 1 day ago [-]

On the one hand, yes. On the other hand, all of this sounds much more complex and fragile. This seems like an important point to me:

"Remarkably, we had only put very basic thought into optimization as the Rust version was written. Even with just basic optimization, Rust was able to outperform the hyper hand-tuned Go version."

reply

---

chubs 1 day ago [-]

I found similarly when I ported an image resizing algorithm from Swift to Rust: I'm experienced in swift thus was able to write in an idiomatic way, and have little Rust experience thus I wrote it in a naive way; yet still the rust algorithm was twice(!) as fast. And swift doesn't even have a GC slowing things down!

reply

giornogiovanna 20 hours ago [-]

Reference counting is a (low-throughput, low-latency) form of garbage collection.

reply

arcticbull 20 hours ago [-]

Yes and no. From a theoretical perspective, I suppose that's true, but "garbage collection" tends to mean a non-deterministic collector that does its own thing, and you don't have to think at all about memory. That does not apply to Swift, as of course, you need to understand the difference between strong and weak references. It's unfairly simplistic to couple the two.

reply

barrkel 16 hours ago [-]

No, RC is GC.

Most people think of Python as GCed language, and it uses mostly RC.

Any runtime that uses mark & sweep today may elect to use RC for some subset of the heap at some point in a future design, if that makes more sense. The mix of marking GC vs refcounting GC shouldn't affect the semantics of the program.

reply

staticassertion 18 hours ago [-]

But it actually is just garbage collection.

reply

Sean1708 17 hours ago [-]

It is but in the context of this discussion it's very clear that they meant a tracing garbage collector, which has a very different cost than atomic reference counting. Or to put it another way: you're technically correct, the worst kind of correct.

reply

dgellow 1 day ago [-]

ARC, used by Swift, has its own cost.

reply

1propionyl 19 hours ago [-]

True, but it's generally better than most full GC solutions (for processes running for relatively short times without the benefit of profile-guided optimization), and worse than languages with fully statically analyzable memory usage.

Note: that parenthetical is a very big caveat, because properly profile-optimized JVM executables can often achieve exceptional performance/development cost tradeoffs.

In addition however, ARC admits a substantial amount of memory-usage optimization given bytecode, which is now what developers provide to Apple on iOS. Not to mention potential optimizations by allowing Apple to serve last-minute compiled microarchitecture optimized binaries for each device (family).

To satiate the pedants... ARC is more or less GC where calls into the GC mechanism are compiled in statically and where there are at worst deterministic bounds on potential "stop the world" conditions.

While this may not be presently optimal because profile-guided approaches can deliver better performance by tuning allocation pool and collection time parameters, it's arguably a more consistent and statically analyzable approach that with improvement in compilers may yield better overall performance. It also provides tight bounds on "stop the world" situations, which also exist far less frequently on mobile platforms than in long running server applications.

Beyond those theoretical bounds, it's certainly much easier to handle when you have an OS that is loading and unloading applications according to some policy. This is extremely relevant as most sensible apps are not actually long running.

reply

ernst_klim 15 hours ago [-]

> but it's generally better than most full GC solutions

I doubt that. It implies huge costs without giving any benefits of GC.

A typical GC have compaction, nearly stack-like fast allocation [1], ability to allocate a bunch of objects at once (just bump the heap pointer once for a whole bunch).

And both Perl and Swift do indeed perform abysmally, usually worse than both GC and manual languages [2].

> ARC is more of less GC

Well, no. A typical contemporary GC is generational, often concurrent, allowing fast allocation. ARC is just a primitive allocator with ref/deref attached.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.49....

[2] https://github.com/ixy-languages/ixy-languages

reply

pkolaczk 7 hours ago [-]

It is nowhere near stack-like. Stack is hot in cache. Heap memory in tlab is cold. Bringing the lines into cache is the major cost, not bumping the pointer.

reply

pjmlp 12 hours ago [-]

Swift's form of GC is one of the slowest one, no wonder porting to Rust made it faster, specially given that most tracing GC outperform Swift's current implementation.

If one goes with reference counting as GC implementation, then one should take the effort to use hazardous pointers and related optimizations.

reply

terminaljunkid 1 day ago [-]

> Swift doesn't have a GC slowing things down

This Apple marketing meme needs to die. Reference counting incurs arguably more cost than GC, or at least the cost is spread through-out processing.

reply

pkolaczk 20 hours ago [-]

It incurs some cost, but whether it is higher is very debatable. This is very much workload dependent. A smart compiler can elide most reference updates.

reply

pjmlp 12 hours ago [-]

No it can't, not in Swift's current implementation.

https://github.com/ixy-languages/ixy-languages

reply

abjKT26nO8 18 hours ago [-]

It would seem that the Swift compiler is far from smart[1].

[1]: <https://media.ccc.de/v/35c3-9670-safe_and_secure_drivers_in_...

reply

---

regarding Rust: "

pornel 2 days ago [-] ... Simplifying the problem to single ownership with shared-immutable vs exclusive-mutable makes it much easier to reason about borrow checking. If something doesn't fit these rules, you either use a different approach, or use `unsafe`, and then wrap it in a higher-level safe abstraction that fits the model. "

...

Animats 2 days ago [-]

Doubly-linked lists are hard with Rust's ownership system. I've argued for Rust having backpointers as a built-in type the borrow checker understands. You have to maintain the invariant that if A points to B, B points back to A. A is an owning pointer, and B is a non-owning pointer locked in a relationship with A. Easy to check if the language lets you say that's what you're doing. Rust lacks that.

If you have safe backpointers, most tree-type data structures with backpointers can be constructed. A nice feature to have.

magicalhippo 2 days ago [-]

As a non-Ruster, any particular reason they didn't include this? A lot of the code I've done over the years involve graphs, which have a lot of circular pointer and/or backpointers. I find it kinda weird they'd make such a common thing a PITA in a new language.

reply

dodobirdlord 2 days ago [-]

...

The Rust question to ask here is: Do you really need pointers, specifically? Or would some other reference mechanism work? You could put all of your graph nodes into a linear data structure like an array, and have them point to each other by holding a list of indices instead of a list of pointers

dmitrygr 1 day ago [-]

> You could put all of your graph nodes into a linear data structure like an array, and have them point to each other by holding a list of indices instead of a list of pointers

Rust practitioners keep proposing this solution. It is like you have never heard of caches or do not understand that modern CPUs have multiple special prefetchers for pointer chasing and no prefetchers for "rust fanatics"

"

---

derefr 2 days ago [-]

Erlang has a parameter called initial_heap_size. Each new actor-process in Erlang gets its own isolated heap, for which it does its own garbage-collection on its own execution thread. This initial_heap_size parameter determines how large each newly-spawned actor’s heap will be.

Why would you tune it? Because, if you set it high enough, then for all your short-lived actors, memory allocation will become a no-op (= bump allocation), and the actor will never experience enough memory-pressure to trigger a garbage-collection pass, before the actor exits and the entire process heap can be deallocated as a block.

 dahart 2 days ago [-]

The games and GPU apps I’ve worked on use memory pools for small allocations, where there will be individual pools for all, say, 1-16 byte allocations, 16-64 byte allocations, 64-256 byte allocations, etc. (Sizes just for illustration, not necessarily realistic). The pool sizes always get tuned over time to match the approximate high water mark of the application.

I think pools and arenas mean pretty much the same thing. https://en.wikipedia.org/wiki/Memory_pool I’ve mostly heard this discussed in terms of pools, but I wonder if there’s a subtle difference, or if there’s a historical reason arena is popular in some circles and pool in others...?

I haven’t personally see a per-frame heap while working in console games, even though games I’ve worked on probably had one, or something like it. Techniques that I did see and are super-common are fixed maximum-size allocations: just pre-allocate all the memory you’ll ever need for some feature and never let it go; stack allocations sometimes with alloca(); and helper functions/classes that put something on the stack for the lifetime of a particular scope.

reply

twoodfin 2 days ago [-]

I’ve always understood an arena to use a bump pointer for allocation and to support only a global deallocation, as the GP describes.

A pool or slab allocator separates allocations into one of a range of fixed-size chunks to avoid fragmentation. Such allocators do support object-by-object deallocation.

reply
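
A minimal sketch of the fixed-size-chunk side of this, in C (one size class, with the free list threaded through the free chunks themselves; chunk_size must be at least sizeof(void*)):

    #include <stdlib.h>

    /* A pool of fixed-size chunks: alloc and free are O(1) and there is no
       external fragmentation inside the pool. A full allocator keeps one such
       pool per size class (e.g. 16, 64, 256 bytes, ...). */
    typedef struct Pool {
        void  *free_list;   /* singly linked list threaded through the free chunks */
        char  *slab;        /* backing memory */
        size_t chunk_size;
    } Pool;

    Pool pool_new(size_t chunk_size, size_t nchunks) {
        Pool p = { NULL, malloc(chunk_size * nchunks), chunk_size };
        for (size_t i = 0; i < nchunks; i++) {       /* thread the free list */
            void *chunk = p.slab + i * chunk_size;
            *(void **)chunk = p.free_list;
            p.free_list = chunk;
        }
        return p;
    }

    void *pool_alloc(Pool *p) {
        void *chunk = p->free_list;
        if (chunk) p->free_list = *(void **)chunk;   /* pop a free chunk */
        return chunk;
    }

    void pool_free(Pool *p, void *chunk) {
        *(void **)chunk = p->free_list;              /* push: per-object deallocation works */
        p->free_list = chunk;
    }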

WalterBright 6 hours ago [-]

D now has an Ownership/Borrowing system implemented (though it is in its early stages). It will work with existing code, including DasBetterC code. You're able to incrementally adapt a program to O/B rather than rewrite it.

reply

---

" The zig standard library provides a standard interface for allocators. Everything in the stdlib that allocates memory takes an allocator as an argument. This makes it easy to use custom allocators without having to write your own basic data-structures. While there isn't much of a zig ecosystem yet, it seems likely that future libraries will follow this convention.

Rust provides an implicit global allocator. The standard library and most other libraries use only the global allocator. It's possible to choose the global allocator, but not to use different global allocators in different parts of your program. " -- [6]
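
The Zig convention, approximated in C as an explicit allocator interface passed to anything that allocates (a sketch, not Zig's actual std.mem.Allocator API):

    #include <stdlib.h>
    #include <stddef.h>

    /* An allocator passed explicitly, so callers can supply an arena, a pool,
       a test allocator, etc., per call site rather than one global per program. */
    typedef struct Allocator {
        void *(*alloc)(void *ctx, size_t n);
        void  (*free_)(void *ctx, void *p);
        void  *ctx;
    } Allocator;

    /* A library data structure takes the allocator as an argument. */
    typedef struct Buf { char *data; size_t len; } Buf;

    Buf buf_new(Allocator *a, size_t len) {
        Buf b = { a->alloc(a->ctx, len), len };
        return b;
    }

    void buf_drop(Allocator *a, Buf *b) {
        a->free_(a->ctx, b->data);
    }

    /* The system allocator as one possible implementation. */
    static void *sys_alloc(void *ctx, size_t n) { (void)ctx; return malloc(n); }
    static void  sys_free(void *ctx, void *p)   { (void)ctx; free(p); }
    Allocator system_allocator = { sys_alloc, sys_free, NULL };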

---

https://stackoverflow.com/questions/19154607/how-actually-card-table-and-writer-barrier-works

---

golang's GC seems to be focused on low latency and somewhat on simplicity. It seems like they succeeded on getting the latency down. Probably should model after theirs. i put links to lots of presentations on theirs in plChGolang.

they have a write barrier but it is turned off (via a global variable) except during GC.

they don't do copying or compaction. idea: we could do copying upon program's request but not otherwise.
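
The "write barrier that is switched off outside of a GC cycle" idea looks roughly like this (just the shape of it, not Go's actual hybrid barrier):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Obj Obj;

    /* Set by the runtime only while a collection cycle is in progress. */
    bool gc_write_barrier_enabled = false;

    static void gc_shade(Obj *o) {
        (void)o;  /* stub: a real collector would mark o / push it on a mark queue */
    }

    /* Every pointer store compiles to this. Outside of a GC cycle it is just a
       plain store plus one predictable branch on a global flag. */
    void write_ptr(Obj **slot, Obj *val) {
        if (gc_write_barrier_enabled) {
            if (*slot) gc_shade(*slot);   /* details vary: Go's hybrid barrier shades the
                                             old value and, in some cases, the new one */
            if (val)   gc_shade(val);
        }
        *slot = val;
    }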

---

" Anything that needs to allocate memory in Zig takes an allocator as an argument. " -- [7]

---

https://nim-lang.org/blog/2020/12/08/introducing-orc.html

---

Golang's low latency GC is apparently not compacting. So maybe not what we want.

These days the JVM seems to have Shenandoah, Azul C4, and ZGC, which apparently can have relatively short pauses (and pauses that do not increase with increasing heap size), and apparently compact. It sounds like all three of those have a read barrier (or 'load barrier', which apparently means the barrier runs when a reference is loaded from the heap rather than when it is dereferenced). Why not just a write barrier (e.g. at each write, first check if the GC is doing anything at all; if not, just write; if so, then check if the GC is in the middle of copying this destination, and if so then apply the write to both copies)? I'm not sure.

"The Garbage Collection Handbook: The Art of Automatic Memory Management By Richard Jones, Antony Hosking, Eliot Moss" might have something to say about that. A google search turns up a hit on page 405 of this book, within section 19.7 CONTROLLING FRAGMENTATION: " Incremental replication on uniprocessors

Before considering more complicated schemes for concurrent compaction, it is worth noting that many real-time applications run in embedded systems, where uniprocessors have been the dominant platform. Preserving atomicity of mutator operations (with respect to the collector and other mutators) is simple on a uniprocessor, either by disabling scheduling or by preventing thread switching except at GC-safe points (making sure that mutator barriers never contain a GC-safe point). In this setting, the collector can freely copy objects as long as mutators subsequently access only the copy (using a Brooks indirection barrier to force a tospace invariant), or they make sure to update __both__ copies (in case other mutators are still reading from the old version in a replicating collector). ...(the book goes on to describe some algorithms that work like this but assume uniprocessors)... Concurrent compaction on a multiprocessor system prevents us from assuming that Read and Write can be made straightforwardly atomic. For that we must consider more fine-grained synchronization among mutators, and between mutator and collector, as follows. ...(the book goes on to describe some algorithms that work on multiprocessors)... "

So i still don't quite understand. Perhaps they are saying that the issue is this: if there are two copies of a block in the heap, and there are a zillion mutator threads, some working with the first copy and some with the second, and you have only a write barrier that ensures that each write is applied to both copies, then in between the time you write the first copy and the time you write the second copy, some of the other threads will observe the first copy and some will observe the second copy. In other words, you'd need to be able to atomically write to both copies at once, and you can't. Although i guess you could just have some sort of lock that makes everyone else wait to read until you're done writing to both, but maybe the sync overhead here is considered too high, or maybe the sync time increases with the number of threads, or maybe that would itself be considered a read barrier?
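
The "lock and write both copies" idea from the previous paragraph, as a sketch (per-object locks; whether this synchronization cost is acceptable, and whether the locked read counts as a read barrier, is exactly the question above):

    #include <pthread.h>

    typedef struct Obj {
        pthread_mutex_t lock;   /* per-object lock (a sketch; real collectors avoid this cost) */
        struct Obj     *copy;   /* NULL normally; the tospace copy while relocation is in progress */
        long            field;
    } Obj;

    /* Write barrier while the collector may be copying this object: write the
       original and mirror the write into the new copy under the same lock, so
       no reader can observe the two copies out of sync. */
    void write_field(Obj *o, long v) {
        pthread_mutex_lock(&o->lock);
        o->field = v;
        if (o->copy) o->copy->field = v;
        pthread_mutex_unlock(&o->lock);
    }

    /* Readers must synchronize too, which is effectively a read barrier. */
    long read_field(Obj *o) {
        pthread_mutex_lock(&o->lock);
        long v = o->field;
        pthread_mutex_unlock(&o->lock);
        return v;
    }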

that book has another section, 17Concurrentcopying&compaction, 17.4Replicationcopying

and certainly this sort of thing would get in the way of lockless concurrency algorithms, but i think we can sacrifice that in Oot -- use a lower-level library if you need that; the Oot runtime doesn't provide the ability for user code to have that level of perf (the runtime itself might be able to use lockless concurrency in some places though).

btw there's some debate about what 'pauseless' means. Does it count as a pause if each thread pauses to find pointers in its roots/stack? Does it count as a pause if two threads have to synchronize? I think we're happy with things like that; what we really care about is probably limiting the maximum time of the pause to a very low number, which doesn't increase with larger heap size and larger number of threads. However, we also care a lot more than others about simplicity -- and less than others about reductions in throughput.

---

sounds like the state of the art on the JVM is Red Hat's Shenandoah, Azul's C4, and Oracle's Z (ZGC). Azul's C4 sounds like it is not (completely?) FOSS, so people don't talk about it as much.

Shenandoah used to add 'forwarding pointers' as a field in every heap object (50% worst case memory cost, 5-10% typical case) but now "Shenandoah 2.0" dispenses with that and sounds like it uses some sort of tagged pointers instead. ZGC sounds like it uses some sort of tagged pointers. Shenandoah 2.0 works with a 'strong tospace invariant', which sounds like it means some sort of barrier scheme that ensures that if an object is written to during copying, every thread only uses the new version from that point forward.

links:

should also look into the .NET GC https://stackoverflow.com/questions/64252590/how-does-clr-gc-compare-to-latest-zgc-and-shenandoah-gc-on-jvm suggests that it achieves 10ms max pause times, which is good enuf for us. But 2018 article https://medium.com/servicetitan-engineering/go-vs-c-part-2-garbage-collection-9384677f86f1 says that .NET GC STW pause times scale with # of objects in heap, and that as large as 125 sec has been observed! "Go’s pauses are barely visible here — because they are tiny. The max. pause I was able to measure was ~ 1.3 seconds — for the comparison, .NET got 125 sec. STW pause on the same test case...STW pause time seems to scale linearly with the static set size both for .NET and for Go. But if we compare larger pauses, they are ~ 100x shorter for Go...Conclusion: Go’s incremental GC does work; .NET’s concurrent GC doesn’t."

---

chrisseaton on Dec 14, 2019 [–]

I don't think there are any GCs that do not stop-the-world at some point, except those with special hardware support, are there? Even C4 requires in theory stopping individual threads, and in practice even has a global stop-the-world phase.

pcwalton on Dec 14, 2019 [–]

It's easy to write a GC that doesn't stop the world at all. A Baker treadmill [1], for example, is a very simple type of incremental collector that can be implemented quickly. (It's a fun exercise, by the way!) If you implement it "by the book", it will have no stop-the-world pauses whatsoever (assuming the mutator doesn't allocate faster than collection can happen).

The problem is that a simple GC like this will have low throughput, so it's not worth it. These are some of the most common misconceptions about GCs: that it's hard to write a low-latency GC and that pauses are all that matter. In reality, GC engineering is all about the tradeoff between throughput and latency.

[1]: https://www.memorymanagement.org/glossary/t.html#treadmill

---

what's our GC pause target?

a subthread in https://news.ycombinator.com/item?id=19435890 talks about GC pauses and games. A person points out that JVM Shenandoah's 10 ms pause target is still too high for games. Ppl point out that each frame at 60 FPS is 16ms, at 90 FPS (often used in VR) it's 11ms, and at 144 FPS it's slightly less than 7ms. A commenter points out that "...throwing away more than half your frame budget just because of GC seems unreasonable". So, if we take half of 11ms, that would be about 5ms. So let's target 5ms max pauses (avg would be even smaller).

Comparing this to others, text below a slide in https://blog.golang.org/ismmkeynote says "We now have an objective of 500 microseconds stop the world pause per GC cycle" (in 2018). In 2014 they had "10ms STW pauses every 50ms", according to the slide above that text. So 5ms max seems aggressive but doable.

---

" ResidentSleeper?