proj-oot-ootMemoryManagementNotes3

sources of GC latency in golang, from https://tip.golang.org/doc/gc-guide : " Brief stop-the-world pauses when the GC transitions between the mark and sweep phases, Scheduling delays because the GC takes 25% of CPU resources when in the mark phase, User goroutines assisting the GC in response to a high allocation rate, and Pointer writes requiring additional work while the GC is in the mark phase. Running goroutines must be suspended for their roots to be scanned. "

so, the relevant ones for us are:

Getting to Go: The Journey of Go's Garbage Collector mentions that:

---

i don't think we'll do this but it is interesting, b/c that's not what i thought 'reference counting' was:

randyrand 26 days ago

parent prev next [–]

I feel like the most user friendly solution for Use After Free is to just use Reference Counting. Basically, just copy Objective-C's "weak_ptr" design.

For every allocation, also store on the heap a "reference object", that keeps track of this new reference.

  struct reference_t {
      // the ptr returned by malloc
      void* ptr;
      // the stack/heap location that ptr was written to,
      // i.e. the reference location.
      void* referenceAddr;
  };

Reference tracking: every time ptr is copied or overwritten, create and delete reference objects as needed.

If ptr is freed, visit every reference and call *referenceAddr = NULL, turning all of these references into NULL pointers.
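a sketch of that scheme, in Python standing in for the runtime's instrumented pointer writes (all names here are mine, not from the comment):

```python
# Every allocation keeps a list of "reference locations" -- the places its
# pointer has been stored. free() visits them all and nulls them out,
# turning dangling pointers into null pointers (as the comment describes).

class Runtime:
    def __init__(self):
        # allocation handle -> list of (container, key) locations holding it
        self.locs = {}

    def alloc(self):
        handle = object()              # stands in for the ptr from malloc
        self.locs[handle] = []
        return handle

    def write(self, container, key, handle):
        # instrumented pointer store: forget the overwritten reference's
        # record, then record the new reference location
        old = container.get(key)
        if old is not None and old in self.locs:
            self.locs[old] = [(c, k) for (c, k) in self.locs[old]
                              if not (c is container and k == key)]
        container[key] = handle
        if handle is not None:
            self.locs[handle].append((container, key))

    def free(self, handle):
        # visit every recorded reference and null it out
        for container, key in self.locs.pop(handle):
            container[key] = None
```

so a use-after-free becomes a null-pointer dereference rather than a silent read of recycled memory; the cost is bookkeeping on every pointer write.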

---

maybe ask somewhere for advice:

Choosing an automatic memory management scheme for a high-level programming language.

I'm looking for a simple automatic memory management scheme with short pause times. Most programming languages that I see want performant memory management at the cost of being complex, but I care more about simplicity; except that I do care about having short pause times.

Reference counting or tracing garbage collection or both are ok (if there is a third option, that's okay too); but if reference counting, some sort of cycle detection should be in there too. The high-level language will have a runtime, so it's okay to require that the runtime do things like:

- box values and store GC metadata with each box
- have read barriers or write barriers or both
- do reference counting, or some other computation whenever memory is accessed or changed

The developer using this high-level programming language should not have to think about memory management at all, so stuff like what Rust does is out of scope.

Criteria:

My best guess right now is that I should use reference counting plus a cycle detector, and a write barrier that is on only when the cycle detector is confirming a cycle, but I could be wrong.

What do you recommend?

(later: well i think a 'cycle detector' would have to be a full tracing GC, like Python, b/c what if each node in every cycle participates in multiple cycles? It's not as simple as looking for nodes with refcount 1, following outgoing references from them, and finding a cycle)

shorter version:

Choosing an automatic memory management scheme for a programming language.

I'm looking for a simple automatic memory management scheme with short pause times. Most programming languages that I see want performant memory management at the cost of being complex, but I care more about simplicity; except that I do care about having short pause times.

This will be a high-level programming language with a runtime. The programmer should not have to think about memory management much, if at all (therefore the programmer should not have to do Rust-style thinking about ownership). The runtime will know where pointers are and is allowed to do things like box values and store GC metadata there, have write barriers and/or read barriers, reference count, etc.

Criteria:

What do you recommend?

---

I'm looking for recommendations for a simple automatic memory management scheme for a programming language. Reference-counting and/or tracing garbage collection are both options (I'm open to other suggestions too). I care about short pause times.

Unlike many programming language implementations that I see, which are performant at the cost of a complex implementation, I care most about simplicity of implementation, and am willing to accept poor performance (except for my concern about pause times). This is all relative; I will accept a little implementation complexity in exchange for a lot of performance. So the criteria are: short pause times > simple implementation > low memory usage > fast

I'm on the fence about compaction. I do want the user to be able to run programs for a long time without running out of memory. However, various contemporary language implementations, such as Golang (Go) and Python have non-compacting GCs and I don't hear a lot of complaints about people's programs crashing due to memory fragmentation. Is fragmentation only a theoretical issue that doesn't come up in practice -- or, do those implementations have to do something complicated to avoid fragmentation, something that I could avoid doing if I do compaction?

This will be a high-level programming language with a runtime. The programmer should not have to think about memory management much, if at all (the language should detect reference cycles). The runtime will know where pointers are and could do things like box values and store GC metadata there, have write barriers and/or read barriers, reference count, etc. The memory management scheme should be portable and concurrent.

What do you recommend?

---

"The M-Machine’s limitation of power-of-two-sized objects was a show stopper for a lot of real-world things (at least 25% memory overhead, close to 50% in some cases). " -- [1]

---

Maas, Martin Christoph. "Hardware and Software Support for Managed-Language Workloads in Data Centers." Thesis/dissertation, 2017. Permalink: https://escholarship.org/uc/item/1qv7m47s

https://escholarship.org/content/qt1qv7m47s/qt1qv7m47s.pdf -- see sections 7.2 and 7.2.1

---

doug-moen 24 hours ago

link flag

RC in C++ using std::shared_ptr is a different proposition than RC in a memory safe language that is designed for safe and seamless RC. In the latter case, the compiler knows about RC and can do a lot of tricks to minimize cost.

High performance GC uses a lot of tricks to achieve performance. If the same amount of work is put into optimizing RC, using an optimizing compiler that emits different kinds of refcount management code in different situations, then you can do far better than what’s possible using a smart pointer class in C++.

In C++ using std::shared_ptr, it’s true that RC is all done locally and you can’t stop it halfway. That isn’t true in a compiler supported and optimized RC runtime. You can use lazy RC, for example. Deferred RC is also nonlocal.

Furthermore, with the approach I’m describing, you can eliminate the need for atomic operations for RC update in multithreaded code. There are several ways to do lock-free RC.

For more information, the Wikipedia article is a start but there’s a lot more information about techniques in papers that you can find with a Google Scholar search.

https://en.wikipedia.org/wiki/Reference_counting https://scholar.google.com/scholar?&q=reference%20counting
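for concreteness, here is a toy sketch of one of the tricks mentioned (Deutsch-Bobrow-style deferred RC, where stack references are not counted at all; the names and structure are my illustration, not from the comment):

```python
# Deferred RC: heap-to-heap references are counted, stack references are not.
# Objects whose count drops to zero go into a zero-count table (ZCT); at
# collection time, anything in the ZCT not reachable from the stack is freed.

class Obj:
    def __init__(self, name):
        self.name = name
        self.rc = 0                    # counts heap references only
        self.fields = {}

class DeferredRC:
    def __init__(self):
        self.heap = set()
        self.zct = set()               # zero-count table

    def new(self, name):
        o = Obj(name)
        self.heap.add(o)
        self.zct.add(o)                # rc == 0 until stored into the heap
        return o

    def store(self, container, field, target):
        # only heap stores touch reference counts -- no work on stack ops
        old = container.fields.get(field)
        if old is not None:
            old.rc -= 1
            if old.rc == 0:
                self.zct.add(old)
        container.fields[field] = target
        if target is not None:
            target.rc += 1
            self.zct.discard(target)

    def collect(self, stack_roots):
        # free ZCT members not on the stack, cascading into their children
        roots = set(stack_roots)
        while True:
            dead = [o for o in self.zct if o not in roots]
            if not dead:
                return
            for o in dead:
                self.zct.discard(o)
                self.heap.discard(o)
                for child in o.fields.values():
                    if child is not None:
                        child.rc -= 1
                        if child.rc == 0:
                            self.zct.add(child)
```

the payoff is that the extremely frequent stack pushes/pops do no RC work at all; the cost is a periodic stack scan.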

---

"AFAIK, the best modern state of art garbage collectors have stop-the-world pauses, proportional to size of root set and/or thread count." -- [2]

"alloc/free..can be doing all sorts of book keeping behind the scenes, including cascading destructor calls which could free a chain of objects the size of live memory." [3]

"The Garbage Collection Handbook originally published in 1996 defines garbage collection as: "All garbage collection schemes are based on one of four fundamental approaches; mark-sweep collection, copying collection, mark-compact collection or reference counting."" -- [4]

--- real-time inplace tricolor marking as per https://dl.acm.org/doi/pdf/10.1145/130854.130862 (thanks davmac https://lobste.rs/s/qdtqi2/automatic_memory_management_with_short#c_4c4fde ) looks good

some potentially good stuff here:

https://www.iecc.com/gclist/GC-algorithms.html

---

https://lobste.rs/s/qdtqi2/automatic_memory_management_with_short#c_68owhm

NOTE: updated version of this list, specific for this project, is at ootMemoryManagementThoughts.txt; if you want to add stuff, add it there

A summary of things I think I've learned so far (from the replies in the first 2 days)


---

ok after reading that i figured i should lookup concurrent tricolor garbage collectors. I googled for concurrent incremental tricolor barrier and found some interesting results. The most useful was probably this:

https://www.cs.tau.ac.il/~maon/teaching/2014-2015/seminar/seminar1415a.html

which is a sort of journal club where the students read http://gchandbook.org/ and each presented a chapter or so to the class. The PDFs of the presentations are on this webpage.

There are chapters on

i note that the phrases "incremental", "concurrent", "real-time" and "on-the-fly" all seem to come up in papers discussing techniques relevant to me.

my notes on Runtime interface and language specific concerns: http://www.cs.tau.ac.il/~maon/teaching/2014-2015/seminar/seminar1415a-lec6-runtime.pdf

the following are quotes but i omitted the '...' marks where i omit things; also as usual i change punctuation and layout: " Where to look for pointers

Accurate pointer finding using tagged values

Accurate pointer finding in stacks Finding frames within the stack

Handling interior pointers Computing standard reference from interior pointer

Handling derived pointers Definition Derived pointer is a pointer that is derived from other pointers using arithmetic or logic expression

Object tables Definition Object table - dense array of small records which describe objects

Object tables advantages

Object tables disadvantages

References from external (non-managed) code

If the object must survive only for the duration of the call to external code, it's enough to leave a live reference to the object on the stack

Stack barriers

It’s possible to scan a stack incrementally using stack barriers

GC-safe points

Most systems have short sequences of code that must be run entirely to preserve GC invariants

Write barriers detect and record interesting pointers in remembered sets

Card tables Definition Card tables record written pointers with card’s size granularity. Implemented as dense array of cards (memory blocks) indexed by dense array of bytes (header), which mark dirty cards.
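a card table as defined above is tiny to implement; a minimal sketch (sizes illustrative, names mine):

```python
CARD_SIZE = 512   # bytes of heap covered per card; a power of two, illustrative

class CardTable:
    def __init__(self, heap_size):
        # dense array of bytes, one per card (memory block)
        self.cards = bytearray((heap_size + CARD_SIZE - 1) // CARD_SIZE)

    def write_barrier(self, addr):
        # on every pointer store, dirty the card covering the written address
        self.cards[addr // CARD_SIZE] = 1

    def dirty_ranges(self):
        # the collector rescans only the dirty cards (then clears them)
        for i, dirty in enumerate(self.cards):
            if dirty:
                yield (i * CARD_SIZE, (i + 1) * CARD_SIZE)
```

the barrier is one unconditional byte store, which is why card marking is so cheap; the tradeoff is that the collector must rescan a whole card's worth of memory per dirty byte.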

" Weak references with tracing collectors

---

notes on Concurrent GC

tricolor intro (i'm not going to bother reiterating this, it's widely available eg wikipedia)

the mutator can be colored black or gray, too (to keep track of the status of the roots)

(sorta-quotations, as in the previous section)" Mutator color influences color of allocated memory

Allocating black objects is always safe

Tricolor invariant types

Barrier techniques To maintain one of the two invariants, the following actions can be taken at barriers:

Incremental update solutions

Snapshot-at-the-beginning solutions

all possible barrier solutions:

Grey mutator incremental update (v1)

Maintains the strong invariant

Grey mutator incremental update (v2)

Less precise

Grey mutator incremental update (v3)

(can this be replaced by a non-atomic version without the isBlack condition check? Surprisingly, apparently no)

Black mutator incremental update (v1)

Maintains the strong invariant (after snapshotting mutator)

Black mutator incremental update (v2)

Less precise (more coarse-grained) than before

Black mutator snapshot-at-the-beginning

atomic Write(src, i, ref):
    shade(src[i])    (shade the overwritten target; some presentations guard this with an isWhite check)
    src[i] <- ref

Maintains the weak invariant
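a minimal Python illustration of the snapshot-at-the-beginning (deletion) barrier above, Yuasa-style (my names; real implementations emit this per-field in compiled code):

```python
# Deletion barrier: before overwriting a pointer, shade the old target.
# Everything reachable in the snapshot at the start of marking then survives
# this cycle, which is exactly the weak-invariant guarantee.

class Obj:
    def __init__(self):
        self.color = 'white'
        self.refs = [None]

class GC:
    def __init__(self):
        self.marking = True
        self.grey = []

def write(gc, src, i, ref):
    old = src.refs[i]
    if gc.marking and old is not None and old.color == 'white':
        old.color = 'grey'         # shade the overwritten target
        gc.grey.append(old)
    src.refs[i] = ref
```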

Black mutator hybrid barrier

Maintains the weak invariant

Completeness of barrier techniques

Card tables: " Concurrent write barrier mechanisms

---

my notes on http://www.cs.tau.ac.il/~maon/teaching/2014-2015/seminar/seminar1415a-lec12-conc-rc.pdf

most of these are quotes, some are not: " to have less locking: buffered reference counting

Sliding View Reference Counting (Coalesced Reference Counting, and Sliding View):

An Efficient On-the-Fly Cycle Collection. A "big algorithm" - incorporates many techniques: Mark & Sweep (Lecs. 2 & 10), Generational GC (Lec. 5), The Recycler (Lec. 3), Coalesced reference counting, Sliding views. Relatively new - Paz et al. 2007

Young generation: high mutation and death rates; use Mark & Sweep - trace only the live objects. Old generation: low mutation and death rates; use Reference counting - handles big live heaps

The Recycler. Proposed by Bacon & Rajan in 2001. Reclaims unreachable cycles using trial deletion. Algorithm: build a set of candidates; decrement counts for each candidate (mark as gray); scan objects reachable from candidates with rc != 0 - those items are marked black, the rest are marked white; remove white objects. Many traversals are required - needs a "fixed" heap
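a synchronous toy version of the Recycler's trial deletion (my simplification of Bacon & Rajan's algorithm; the real thing buffers candidates and runs concurrently):

```python
# Trial deletion: tentatively remove the candidates' internal references;
# any object whose count is still > 0 afterwards is externally referenced
# and gets its counts restored; the rest are a garbage cycle.

class Obj:
    def __init__(self):
        self.rc = 0
        self.refs = []
        self.color = 'black'

def link(src, dst):
    src.refs.append(dst)
    dst.rc += 1

def collect_cycles(candidates):
    def mark_gray(o):
        if o.color != 'gray':
            o.color = 'gray'
            for c in o.refs:
                c.rc -= 1          # trial-delete internal references
                mark_gray(c)

    def scan(o):
        if o.color == 'gray':
            if o.rc > 0:
                scan_black(o)      # still externally referenced: live
            else:
                o.color = 'white'  # tentatively garbage
                for c in o.refs:
                    scan(c)

    def scan_black(o):
        o.color = 'black'
        for c in o.refs:
            c.rc += 1              # undo the trial deletion
            if c.color != 'black':
                scan_black(c)

    for o in candidates:
        mark_gray(o)
    for o in candidates:
        scan(o)

    garbage = []
    def collect_white(o):
        if o.color == 'white':
            o.color = 'black'      # avoid revisiting
            garbage.append(o)
            for c in o.refs:
                collect_white(c)
    for o in candidates:
        collect_white(o)
    return garbage
```

on a pure two-node cycle both objects come back as garbage; if anything outside the cycle still points in, scan_black restores all the counts and nothing is freed.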

---

my notes on Concurrent Mark-Sweep

"Doligez-Leroy-Gonthier [1993]: Let's separate data allocated solely on a single thread and not shared with other threads. In addition we have a global heap that allows sharing of objects among threads. Private objects may move to the global heap, if needed."

see slides for algorithms

(is this paper relevant? http://pauillac.inria.fr/~doligez/publications/doligez-gonthier-popl-1994.pdf ?)

clearer than that paper and those slides: https://justinethier.github.io/cyclone/docs/Garbage-Collector

looks pretty darn complicated. But still perhaps the simplest thing that is reasonable...

---

my notes on [5]

mostly quotes, as above: " Outline: Concurrent Copying

compacting vs copying: with copying you are copying stuff all the time; with compacting you copy stuff only when you feel that you want to -- more freedom.

Concurrent copying invariants: the To-space invariant - the mutator must, by definition, hold only To-space pointers. The From-space invariant - the mutator must, by definition, hold only From-space pointers at the beginning of the collector cycle.

Termination: all mutator threads must end the collection cycle holding only To-space pointers!

Baker's read barrier: every object read by the mutator must be in To-space! When the mutator tries to read through a pointer:

atomic Read(src, i):
    ref <- src[i]
    if ref != null && isGrey(src)
        ref <- forward(ref)
    return ref

forward(fromRef):
    toRef <- forwardingAddress(fromRef)
    if toRef = null
        toRef <- copy(fromRef)
    return toRef

copy(fromRef):
    toRef <- free
    free <- free + size(fromRef)
    move(fromRef, toRef)
    forwardingAddress(fromRef) <- toRef
    add(worklist, toRef)
    return toRef

objects allocated in To-space are considered Black

expensive in time and space

Halstead's multiprocessor Baker-style algorithm (1985)

alternative:

Brooks's indirection barrier. Brooks suggested a different approach: instead of requiring the To-space invariant, allow the mutator to make progress without concern for the wavefront. Brooks observed that if every object (whether in From-space or To-space) has a non-null forwarding pointer (either to its From-space original or to its copy in To-space), then the test on the src object in the read barrier can be eliminated!

atomic Read(src, i):
    src <- forwardingAddress(src)
    return src[i]

atomic Write(src, i, ref):
    src <- forwardingAddress(src)
    if isBlack(src)
        ref <- forward(ref)
    src[i] <- ref

Brooks's indirection barrier - termination: termination requires all mutators to hold only To-space references, but Brooks allows mutators to operate grey! Therefore, once copying is finished, all mutators need a final scan of their stacks to replace any remaining unforwarded references
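the forwarding-pointer trick is easy to see in miniature (Python, my names):

```python
# Brooks's unconditional forwarding pointer: every object carries a pointer
# that initially points to itself, so every access pays one indirection
# but needs no isGrey/phase test at all.

class Obj:
    def __init__(self, payload):
        self.forward = self            # self-forwarding until copied
        self.payload = payload

def read(o):
    return o.forward.payload           # unconditional indirection, no test

def write(o, value):
    o.forward.payload = value          # writes also land on the current copy

def gc_copy(o):
    # the collector copies o to To-space and installs the forwarding pointer
    new = Obj(o.payload)
    o.forward = new
    return new
```

after gc_copy, reads and writes through the stale From-space reference transparently hit the To-space copy, which is why mutators are allowed to keep running with grey (unforwarded) references until the final stack scan.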

pros:

cons:

more complicated ones: "

Concurrent compaction: the one-pass Compressor [Kermany and Petrank, 2006]

downside: trapping the mutator means copying all of the accessed page's objects and updating the pointers in those objects to refer to To-space locations -- significant pauses

Click and Azul to the rescue! The Pauseless collector [Click et al, 2005; Azul, 2008]

The Pauseless collector is divided into 3 phases: Mark phase- responsible for periodically refreshing the mark bits. Relocate phase- uses the most recently available mark bits to find pages with little amount of live data, relocate their objects and free their physical memory Remap phase- updates every pointer in the heap whose target has been relocated. "

Summary "

---

https://justinethier.github.io/cyclone/docs/Garbage-Collector (and many other places) point out that if your language creates a lot of garbage quickly, a copying GC is better, at least for the young generation, because most of the memory is dead but it only copies the live memory

honestly Cyclone Scheme's writeup, https://justinethier.github.io/cyclone/docs/Garbage-Collector looks like almost the simplest thing that is reasonable.

the tracing itself is just tricolor marking with a write barrier. There's some additional fanciness just to make sure the collector is sync'd with the mutators, without having to do a global pause of all mutators at once
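that core -- tricolor marking plus a write barrier, run incrementally -- can be sketched like this (a single-threaded toy with a Dijkstra-style insertion barrier; this is my illustration, not Cyclone's actual DLG code):

```python
# Incremental tricolor marking: do a bounded amount of tracing per step,
# and use a write barrier so the mutator can never hide a white object
# behind an already-scanned (black) one.

WHITE, GREY, BLACK = 'white', 'grey', 'black'

class Obj:
    def __init__(self, refs=None):
        self.color = WHITE
        self.refs = refs or []

class Marker:
    def __init__(self, roots):
        self.grey = []
        for r in roots:
            self.shade(r)

    def shade(self, o):
        if o is not None and o.color == WHITE:
            o.color = GREY
            self.grey.append(o)

    def mark_step(self, budget):
        # trace at most `budget` objects, then return to the mutator
        while self.grey and budget > 0:
            o = self.grey.pop()
            for child in o.refs:
                self.shade(child)
            o.color = BLACK
            budget -= 1
        return not self.grey       # True once marking is complete

    def write(self, src, i, ref):
        # insertion barrier: never install a white pointer into a black object
        if src.color == BLACK:
            self.shade(ref)
        src.refs[i] = ref
```

when marking finishes, everything still WHITE is unreachable and can be swept; the barrier is what makes it safe to interleave mutator steps between mark_step calls.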

the DLG GC is at https://github.com/justinethier/cyclone/blob/master/gc.c. looks pretty long... (later) actually i've been skimming through it, it's pretty straightforward

it uses http://concurrencykit.org/ btw

btw they also added http://justinethier.github.io/cyclone/docs/Garbage-Collection-Using-Lazy-Sweeping

i guess in our case we could skip the copying collector and just use DLG; also we don't need lazy sweeping


how PyPy reduced their pauses with an incremental tricolor GC with a write barrier:

" This "major collection" is what gives the long GC pauses. To fix this problem we made the GC incremental: instead of running one complete major collection, we split its work into a variable number of pieces and run each piece after every minor collection for a while, until there are no more pieces. The pieces are each doing a fraction of marking, or a fraction of sweeping. It adds some few milliseconds after each of these minor collections, rather than requiring hundreds of milliseconds in one go.

The main issue is that splitting the major collections means that the main program is actually running between the pieces, and so it can change the pointers in the objects to point to other objects. This is not a problem for sweeping: dead objects will remain dead whatever the main program does. However, it is a problem for marking. Let us see why.

In terms of the incremental GC literature, objects are either "white", "gray" or "black". This is called tri-color marking. See for example this blog post about Rubinius, or this page about LuaJIT or the wikipedia description. The objects start as "white" at the beginning of marking; become "gray" when they are found to be alive; and become "black" when they have been fully traversed. Marking proceeds by scanning grey objects for pointers to white objects. The white objects found are turned grey, and the grey objects scanned are turned black. When there are no more grey objects, the marking phase is complete: all remaining white objects are truly unreachable and can be freed (by the following sweeping phase).

In this model, the important part is that a black object can never point to a white object: if the latter remains white until the end, it will be freed, which is incorrect because the black object itself can still be reached. How do we ensure that the main program, running in the middle of marking, will not try to write a pointer to white object into a black object? This requires a "write barrier", i.e. a piece of code that runs every time we set a pointer into an object or array. This piece of code checks if some (hopefully rare) condition is met, and calls a function if that is the case.

The trick we used in PyPy is to consider minor collections as part of the whole, rather than focus only on major collections. The existing minimark GC had always used a write barrier of its own to do its job, like any generational GC. This existing write barrier is used to detect when an old object (outside the nursery) is modified to point to a young object (inside the nursery), which is essential information for minor collections. Actually, although this was the goal, the actual write barrier code is simpler: it just records all old objects into which we write any pointer --- to a young or old object. As we found out over time, doing so is not actually slower, and might actually be a performance improvement: for example, if the main program does a lot of writes into the same old object, we don't need to check over and over again if the written pointer points to a young object or not. We just record the old object in some list the first time, and that's it.

The trick is that this unmodified write barrier works for incminimark too. Imagine that we are in the middle of the marking phase, running the main program. The write barrier will record all old objects that are being modified. Then at the next minor collection, all surviving young objects will be moved out of the nursery. At this point, as we're about to continue running the major collection's marking phase, we simply add to the list of pending gray objects all the objects that we just considered --- both the objects listed as "old objects that are being modified", and the objects that we just moved out of the nursery. A fraction of the former list were black objects; so this means that they are turned back from the black to the gray color. This technique implements nicely, if indirectly, what is called a "backward write barrier" in the literature. The backwardness is about the color that needs to be changed in the opposite of the usual direction "white -> gray -> black", thus making more work for the GC. (This is as opposed to "forward write barrier", where we would also detect "black -> white" writes but turn the white object gray.)

In summary, I realize that this description is less about how we turned minimark into incminimark, and more about how we differ from the standard way of making a GC incremental. What we really had to do to make incminimark was to write logic that says "if the major collection is in the middle of the marking phase, then add this object to the list of gray objects", and put it at a few places throughout minor collection. Then we simply split a major collection into increments, doing marking or sweeping of some (relatively arbitrary) number of objects before returning. That's why, after we found that the existing write barrier would do, it was not much actual work, and could be done without major changes. For example, not a single line from the JIT needed adaptation. All in all it was relatively painless work. ;-) "

-- [6]
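a toy version of the PyPy scheme described in the quote (my names; the point is that the one generational barrier just records written-to old objects, and the minor collection re-greys them while major marking is in progress):

```python
# incminimark-style trick: the write barrier only records old objects that
# are written to; if a major marking phase is underway, the next minor
# collection turns those objects (plus nursery survivors) back to grey.

class Obj:
    def __init__(self, is_old=False):
        self.is_old = is_old
        self.recorded = False
        self.color = 'black'
        self.fields = {}

class IncMiniMark:
    def __init__(self):
        self.modified_old = []

    def write(self, obj, field, value):
        # the single write barrier: record old objects written to, nothing else
        if obj.is_old and not obj.recorded:
            obj.recorded = True
            self.modified_old.append(obj)
        obj.fields[field] = value

    def minor_collection(self, grey_list, major_marking, nursery_survivors):
        if major_marking:
            # "backward" barrier effect: black objects go back to grey
            for o in self.modified_old + nursery_survivors:
                o.color = 'grey'
                grey_list.append(o)
        for o in self.modified_old:
            o.recorded = False
        self.modified_old.clear()
```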

---

https://www.google.com/search?q=cocurrent%20incremental%20tricolor%20barrier https://www.google.com/search?q=Baker%20Brooks%20Sapphire%20concurrent%20garbage%20collection

---

" > And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system.

This is not a typical solution.

Java threw finalizers into the mix and everyone overused them at first before they realized that finalizers suck. This is bad enough that, in response to "too many files open" in your Java program, you might invoke the GC. Other languages designed since then typically use some kind of scoped system for closing file descriptors. This includes C# and Go.

Garbage collection does not need to be used to collect all objects. " -- https://news.ycombinator.com/item?id=32282512

---

" origin_path 10 days ago

root parent prev next [–]

The best real world garbage collectors are like ZGC, Shenandoah or C4 which don't pause your program at all. They are fully concurrent.

Even if you go for a GC that is designed to balance throughput and latency, like G1, you can still configure what pause times it should target and those can easily be ~10-20msec if you want, with the vast majority of pauses being far less than that (less than 3 msec).

"I was recently reading some blog posts of the V8 JS engine team. There is nothing to understand or not understand there"

Look, if your knowledge of GC comes from reading V8 blog posts then you're proving the grandparent's point pretty nicely. You leaped from an extremely basic and remote understanding of GC to "there's nothing to understand or not understand here".

reply

ahartmetz 10 days ago

root parent next [–]

Serious questions: if these garbage collectors are so good, why aren't they widely used? My guess is that they have downsides such as low throughput, high CPU usage, high memory usage, etc... You can't import a JVM GC into V8, to stay with my example, but you can reimplement the ideas if you have quasi-infinite money like Google.

reply

origin_path 10 days ago

root parent next [–]

For ZGC/Shenandoah because they're new, and they're new because they're extremely hard to implement well. For C4 because it is expensive and requires kernel patches. Also there isn't a whole lot of need for them in many use cases. Web servers for example have far bigger latency problems than GC, normally. Pauseless GC was historically driven by the HFT/finance sector for that reason.

Also yes, pauseless GC tends to have higher overheads than GC that pauses for longer. Whether that matters or not depends a lot on the use cases and actual size of the overheads. For example ZGC is pauseless but not yet generational. Generational ZGC when it launches will improve throughput significantly.

Google haven't done pauseless GC for V8. I don't know why not because they have done one for Android. ART uses a fully concurrent collector iirc. At any rate, although you can't import a JVM GC to V8, you can run JavaScript on the JVM using GraalJS and use the GCs that way. Though I don't recall off hand if Graal supports ZGC yet. There's no deep reason why it couldn't.

reply "

---

" ART GC overview

ART has a few different GC plans that consist of running different garbage collectors. Starting with Android 8 (Oreo), the default plan is Concurrent Copying (CC). The other GC plan is Concurrent Mark Sweep (CMS).

Some of the main characteristics of Concurrent Copying GC are:

    CC enables use of a bump-pointer allocator called RegionTLAB. This allocates a thread-local allocation buffer (TLAB) to each app thread, which can then allocate objects out of its TLAB by bumping the "top" pointer, without any synchronization.
    CC performs heap defragmentation by concurrently copying objects without pausing app threads. This is achieved with the help of a read-barrier which intercepts reference reads from the heap, without the need of any intervention from the app developer.
    GC only has one small pause, which is constant in time with regards to the heap size.
    CC extends to be a generational GC in Android 10 and higher. It enables collecting young objects, which often become unreachable fairly quickly, with little effort. This helps by increasing GC throughput and considerably delaying the need to perform a full-heap GC.

The other GC that ART still supports is CMS. This GC also supports compaction, but not concurrently. Compaction is avoided until the app goes into the background, at which time the app threads are paused to perform compaction. Compaction also becomes necessary when an object allocation fails due to fragmentation. In this case the app may potentially become unresponsive for some time.

Since CMS rarely compacts, and thus free objects may not be contiguous, it uses a free-list-based allocator called RosAlloc. It has a higher allocation cost as compared to RegionTLAB. Finally, because of internal fragmentation the memory usage for the Java heap can be higher for CMS than CC. "
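the RegionTLAB idea -- per-thread bump-pointer allocation -- is simple to sketch (addresses as plain integers, sizes illustrative, names mine):

```python
# A thread-local allocation buffer: allocation is just bumping a "top"
# pointer, with no synchronization because the buffer belongs to one thread.

class TLAB:
    def __init__(self, base, size):
        self.top = base
        self.limit = base + size

    def alloc(self, nbytes):
        if self.top + nbytes > self.limit:
            return None        # TLAB exhausted: ask the GC for a fresh buffer
        addr = self.top
        self.top += nbytes
        return addr
```

this is why a compacting/copying heap makes allocation so cheap: free memory is one contiguous run, so alloc is an add and a compare, versus a free-list search in a non-moving heap like RosAlloc's.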

---

Waterluvian 10 days ago

parent next [–]

I’m out of my league so this may be dumb, but does any language or VM or whatnot have a combined system where each thread has its own heap, and you can talk by passing messages, but they also have a common heap for larger data that’s too expensive to pass around, but at the cost that you have to be much more careful with lifetimes or have to manage it manually or something?

reply

tkhattra 10 days ago

root parent next [–]

In the Erlang VM, each Erlang "process" has its own garbage collected heap [1]. There's also an ETS module for more efficiently storing and sharing large amounts of data, but data stored in ETS is not automatically garbage collected [2].

[1] https://www.erlang.org/doc/apps/erts/garbagecollection [2] https://www.erlang.org/doc/man/ets.html

reply

jefftk 10 days ago

root parent prev next [–]

If you squint at it right you could say C works that way: you have many processes each with their own heap and they can pass messages, but if they want larger data that's too expensive to pass around you can use shared memory.

reply

kaba0 10 days ago

root parent next [–]

That’s just IPC, nothing inherent to C.

reply

jefftk 10 days ago

root parent next [–]

Sure, you can do this in a lot of languages. I gave C mainly because it's the "native language" of Unix.

reply

Jweb_Guru 10 days ago

root parent prev next [–]

I believe Nim works this way, among others.

reply

jeffmurphy 10 days ago

root parent prev next [–]

I believe Erlang works this way.

reply

Bolkan 10 days ago

root parent prev next [–]

Dart

reply

---

ridiculous_fish 10 days ago

prev next [–]

There's a lot of discussion of comparative performance, but most software isn't performance sensitive so it just doesn't matter. But there's another major facet: FFIs. The choice of memory management has huge implications for how you structure your FFI.

JavaScriptCore uses a conservative GC: the C stack is scanned, and any word which points at a heap object will act as a root. v8 is different, it uses a moving collector: references to heap objects are held behind a double-redirection so the GC may move them. Both collectors are highly tuned and extremely fast, but their FFIs look very different because of their choice of memory management.

Read and write barriers also come into play. If your GC strategy requires that reads/writes go through a barrier, then this affects your FFI. This is part of what sunk Apple's ObjC GC effort: there was just a lot of C/C++ code which manipulated references which was subtly broken under GC; the "rules" for the FFI became overbearing.

Java's JNI also illustrates this. See the restrictions around e.g. GetPrimitiveArrayCritical. It's hard to know if you're doing the right thing, especially since bugs may only manifest if the GC runs, which it might not in your test.

One of the under-appreciated virtues of RC is the interoperability ease. I know std::sort only rearranges, doesn't add or remove references, so I can just call it. But if my host language has a GC then std::sort may mess up the card marking and cause a live object to be prematurely collected; but it's hard to know for sure!

reply

chubot 10 days ago

parent next [–]

I agree that the API and interop with C/C++ is a huge issue and something I haven't seen good articles on.

But I was sort of put off from reference counting by working with Python extensions that leaked memory many years ago. It's so easy to forget a ref count operation. I don't have data, but I suspect it happens a lot in practice.

With tracing, you have to annotate stack roots (and global roots if you have them). To me that seems less error prone. You can overapproximate them and it doesn't really change much.

Moving is indeed a big pain, and I'm about to back out of it for Oil :-/


edit: I don't have any experience with Objective C, but I also think this comment is unsurprising, and honestly I would probably get it wrong too:

https://news.ycombinator.com/item?id=32283641

I feel like ref counting is more "littered all over your code" than GC is, which means there's more opportunity to get it wrong.


kaba0 10 days ago

parent prev next [–]

It’s a good point that it is a topic that doesn’t get enough coverage, but let’s just add that it has good solutions for most use cases: GCs can pin objects that might be used from some other language.


---

 eru 10 days ago | parent | prev | next [–]

If you have pure, immutable and strict, you can't create cycles. (That's what Erlang does for example.) That makes a lot of memory management techniques much simpler. Both tracing garbage collection and reference counting.

If you have pure, immutable and lazy, you can get cycles. (That's Haskell.) This is almost as complicated for a GC as not having immutability.


---

aidenn0 on Dec 9, 2013

parent prev next [–]

See this: http://www.hpl.hp.com/personal/Hans_Boehm/gc/

But also, with a very small number of restrictions, you can give C a non-conservative GC.

ANSI C forbids accessing an object through a pointer of an incompatible type, except through a character type. Furthermore, unions are to be interpreted as the last type they were written as.

The three places you run into issues are:

1) You can treat two structs that begin with the same values interchangeably so long as you only access the common beginning of the structures.

2) You can freely convert pointers to integers and back.

3) Storage returned by malloc() can be used heterogeneously (however, with strict aliasing rules, you can't switch types of the data once you've accessed it)

If you forbid those three things, then you can have a precise (non-conservative) GC by adding runtime overhead to accessing data (once malloc'd data is accessed as a non-character type, it can be treated as that type forever).

However, lots and lots of code violates these rules; the most famous offender is probably the fast inverse square root [1]:

http://betterexplained.com/articles/understanding-quakes-fas...

---

benhoyt 23 hours ago

next [–]

The "Garbage Collection" chapter in the OP's recent book Crafting Interpreters goes into this in a bunch more detail (in the context of a language VM): https://craftinginterpreters.com/garbage-collection.html -- so well written, both this and the 2013 blog post.


(by 2013 blog post he means https://journal.stuffwithstuff.com/2013/12/08/babys-first-garbage-collector/ )

rurban 16 hours ago

parent next [–]

But it stops at a primitive mark-and-sweep collector, not the more advanced Cheney compacting/copying semispace collector. It doesn't even mention how bad mark-and-sweep is, especially for larger heaps.

---

snarfy 11 hours ago

prev next [–]

For a long time I had an idea I thought was great - a lock free garbage collector that could be run on its own core to achieve zero latency. I've got 32 cores busy doing nothing and thought I could throw hardware at the problem and get garbage collection for free.

A bunch of people much smarter than me all chimed in on how this was a terrible idea. They argued it would destroy the cache and result in worse performance than a traditional garbage collector. Locality matters, apparently. I still sometimes think of this idea but it doesn't fit very well into the actual hardware.


hinkley 8 hours ago

parent next [–]

Interprocess cache invalidation is one of the bigger devils in the details. It’s one part of the memory corruption crisis that the Java Memory Model tries to solve and then a number of other languages copied.

There are concurrency libraries that allocate 8x as much memory as they need just to guarantee that two threads don’t share a cache line.


---


agentultra 25 days ago

parent context prev next [–]on: Some Thoughts on Zig

Another thing which makes Zig interesting is its approach to allocation: making a failure to allocate memory a catchable error, and enabling developers to easily assign their own allocators.

pjmlp 25 days ago

parent next [–]

Turbo Pascal 6.0 manual from 1990, page 210

http://www.bitsavers.org/pdf/borland/turbo_pascal/Turbo_Pasc...

What is old is new again. :)

alphazino 25 days ago

root parent next [–]

Yes, but I think you missed the point of the parent comment.

> enabling developers to easily assign their own allocators

Through a combination of Zig's convention to pass around allocators instead of using a global allocator, the allocator interface's ability to return errors, and the Zig language forcing you to handle all errors (unless you explicitly discard them), a lot of Zig code is written with allocation failure in mind.

This means that you can seamlessly pass a custom allocator, say a bump allocator backed by a fixed buffer, to a function in some third-party library and still handle the case where the buffer runs out of space.

---

this HN user frequently posts insightful comments with links to research papers about advanced garbage collection techniques https://news.ycombinator.com/threads?id=hayley-patton

this appears to be their blog: https://applied-langua.ge/

and their twitter: https://twitter.com/nodefunallowed/

they apparently use sbcl: https://twitter.com/nodefunallowed/status/1557691547336208384/photo/1

this is apparently their reddit username, according to https://news.ycombinator.com/item?id=32277443: https://www.reddit.com/user/theangeryemacsshibe/

---

adgjlsfhk1 26 days ago

parent context favorite on: Reference count, don't garbage collect

The blog post is largely incoherent for several reasons.

1. It recommends 8 byte counters. This implies making every object in your programming language 8 bytes bigger which is pretty much a non-starter. Almost everyone that actually uses reference counting uses more like 2-4 bits.

2. Reference counting can have arbitrarily long pauses as well. If you decrease the reference count of an object to zero, you have to decrease the reference count of all referenced objects, which can do significant amounts of work (specifically, it will do a lot of work in the cases where regular GC does almost no work).

3. The blog states that atomic writes are "basically free", but that ignores the fact that in multi-threaded code, atomic writes can actually be fairly expensive since each one requires communication between every thread (This is why python still has a GIL)

4. Because no one uses 8 bytes for the count (since you don't want to double your memory consumption), you still need a GC anyway to deal with objects that get too many references.

OskarS 26 days ago

root parent prev next [–]

> Almost everyone that actually uses reference counting uses more like 2-4 bits.

Here is an incomplete list of languages which use 8 bytes for their reference count on 64-bit computers:

1. Rust, in Rc<T> and Arc<T>

2. C++, in std::shared_ptr<T>

3. Objective-C, in NSObject's retainCount

4. Swift, because of Objective-C

5. Python, where reference count is Py_ssize_t

These were literally the first languages I thought of; they all use 64-bit types. I would argue that since reference counting is much rarer than GC, these make up the bulk of reference counting use in the real world. "Almost everyone", huh? It's a bit rich to accuse the author of being "largely incoherent" and then say this.

-- [7]

---

 kaba0 26 days ago | root | parent | prev | next [–]

With all due respect, why do you believe that your experience is meaningful compared to people writing actual industrial-scale runtimes, when RC is the hello world of GC algorithms? It’s not that they don’t know about it, it’s that they studied the topic countless times and found significant difference between the two approaches, to the point that no language considered fast uses RC (or if they do, they do so because it is a useful escape hatch to manual memory management that doesn’t need support from the runtime).

---

benibela 25 days ago

prev next [–]

An advantage of RC is that you can also use it to verify ownership.

When the counter is 1, you can do anything with the object without affecting any other references.

Like the object could be mutable for a counter=1, and copy-on-write otherwise. Then you can make a (lazy) deep copy by just increasing the counter.

---

pizlonator 25 days ago

prev next [–]

Atomic inc/dec is hella expensive relative to not doing it. It’s true that CPUs optimize it, but not enough to make it free. RC as a replacement for GC means doing a lot more of this expensive operation - which the GC will do basically zero of in steady state - so this means RC just costs more. Like 2x slowdown more.

The atomic inc/dec also have some nasty effects on parallel code. The cpu ends up thinking you mutated lines you didn’t mean to mutate.

So, GC is usually faster. RC has other benefits (more predictable behavior and timing, uses less memory, plays nicer with OS APIs).

manuelabeledo 25 days ago

parent next [–]

> So, GC is usually faster.

GC is way faster if there is little collection.

In memory or cache intensive applications, garbage collection as a whole can be significantly slower.

pizlonator 25 days ago

root parent next [–]

GC is faster even if you collect a lot. GCs create better cache locality especially for recently allocated objects, and their cache behavior is not generally worse than malloc (but there are many GCs and many mallocs and some try harder than others to make caches happy).

The total time spent in GC across a program’s execution time is usually around 30% or so. Maybe more in some cases (some crazy Java workloads can go higher) or less in others (JavaScript, since the mutator is slow), but 30% is a good rule of thumb. That includes the barriers, and total cost of all allocations, including the cost of running the GC itself.

Reference counting applied as a solution to memory safety, as a replacement for GC, is going to cost you 2x overhead just for the ref counting operations and then some more on top of that for the actual malloc/free. When you throw in the fact that GCs always beat malloc/free in object churn workloads, it’s likely that the total overhead of counting refs, calling free(), and using a malloc() that isn’t a GC malloc is higher than 2x, i.e. more than 50% of time spent in memory management operations (inc, dec, malloc, free).

It’s a trade off, though. The GC achieves that 30% because it uses more memory. All of the work of understanding the object graph is amortized into a graph search that happens infrequently, leading to many algorithmic benefits (like no atomic inc/dec, faster allocation fast path, freeing is freeish, etc), but also causing free memory to be reused with a delay, leading to 2x or more memory overhead.

That also implies that if you ask the GC to run with lower memory overhead, it’ll use more than 30% of your execution time. It’s true that if you want the memory usage properties of RC, and you try to tune your GC to get you there, you gonna have a slow GC. But that’s not how most GC users run their GCs.

---

" But yeah, the correct way to handle resources (not just memory!) is with value semantics and RAII. Because then you know the object will be cleaned up as soon as it goes out of scope with zero additional effort on your part. In places where this is not appropriate, a simple reference counting scheme may be used, but the idea is to keep the number of rc'd objects small. Do not use cyclical data structures. Enforce a constraint: zero cycles. For data structures like arbitrary graphs, keep an array of vertices and an array of edges that reference vertices by index. " -- [8]

---

"objc said this some small number of bits is more than enough in all normal code, and falls back to the hash table if the refcount ever hits the maximum value." -- [9]

---

" There are also promising experiments with non-garbage-collected managed languages. Eg both val and roc feature deterministic memory management and c/rust-like levels of control over memory layout while still preserving the simple experience of a garbage-collected language. I think it's likely that we'll see at least one production-ready language in this vein within the next decade. "-- https://www.scattered-thoughts.net/writing/how-safe-is-zig/

https://www.val-lang.dev/ https://www.roc-lang.org/

---

Oil shell had a lot of problems with their GC b/c they kept forgetting to notify the garbage collector about intermediate values that should have been GC roots [10]

Their current solution is to disallow garbage collection except at safe points (it used to be allowed at any allocation) and to only put safepoints in a few spots. Now only functions that are on the call stack as of the safepoint need to have correctly notified the GC about their roots. Most of these functions are automatically compiled from a higher-level-language, so now there is less chance of a mistaken omission from the GC roots: [11]

[12]

older: [13] (i assume the reference to MPS means this? https://www.ravenbrook.com/project/mps/master/manual/html/guide/overview.html#guide-overview )

---