I built a garbage collector for a language that doesn't need one
If carcinization happens when languages evolve to be more like Rust, then what do you call it when Rust evolves to be more like Java? Caffeination?
Over this summer, I’ve had a decent amount of time to kill. What better way to spend a beautiful hot summer than sitting inside, staring at core dumps in a supposedly memory-safe language? I built a garbage collector - in short, a piece of software that manages allocations for another, more exciting piece of software. The cool part: I made it in Rust, for Rust - a language designed to eliminate garbage collection and which provides few facilities for making it work properly. Not only that, I managed to make my garbage collector work with relatively few compromises, which I’ll describe in further detail lower down.
If you don’t like reading, don’t care about the journey to implement it, or just want to get a sales pitch for the final result, check out the source code on GitHub here. You can also download it directly from crates.io.
Background
If you’re familiar with the details of Rust and its standard library, feel free to skip this section.
At the core of Rust’s memory model are affine typing and the borrow checker. Values may be bound to only one identifier at a time, and borrows (a.k.a. references) may not outlive the scope binding their referent.
For example, the following code is invalid:
let x = vec![1, 2, 3];
let y = x;
println!("{x:?}"); // compile error - x has already been moved
Normally, we work around this by borrowing against a binding, such as by making `y = &x` in the example above.
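For instance, the earlier snippet compiles fine once `y` borrows `x` instead of taking ownership:

let x = vec![1, 2, 3];
let y = &x; // borrow instead of move
println!("{x:?} {y:?}"); // fine - x is still valid here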
However, we often need to share some heap-allocated value without knowing which binding will live the longest. The solution to this problem is shared ownership via garbage collection.
Rust’s standard library offers two simple reference-counted garbage collectors: the single-threaded `Rc` and its atomically reference-counted counterpart `Arc`. They operate by maintaining a reference count in each heap allocation. Under most circumstances, these work great, but they can’t handle cyclic references. Combined with interior mutability, it’s trivial to defeat them:
use std::{cell::OnceCell, rc::Rc};
struct Foo(OnceCell<Rc<Foo>>);
let x = Rc::new(Foo(OnceCell::new()));
x.0.set(Rc::clone(&x));
// My foo has a reference to itself. It can never be freed!
This is why people actually get paid money to build garbage collectors. If using a reference counter were all you needed, a number of people working at Oracle would be out of a job.
Battle plan
We’d like to create some `Gc` data structure with a similar API to `Rc` and `Arc`, which can accept nearly any data type contained within, and still manage to detect and collect cycles.
We have a few weapons at our disposal:

- `Drop`: Every non-copiable data type in Rust can implement the `Drop` trait to ensure some code is called every time it is dropped. In our case, we can implement `Drop` for `Gc` to try to glean some knowledge about when an allocation becomes inaccessible.
- Traits: We can construct some trait (in our case, let’s call it `Collectable`) as a mandatory requirement to be contained in a `Gc`. Creating this trait has some major downsides (libraries upstream of `dumpster` can’t implement it), but it’s a necessary evil.
However, our formidable tools are matched by equally formidable challenges:

- Undetectable moves. When a value is moved in Rust, there is no way to detect that fact from within that value. If we had some sort of trait like `OnMove` which allowed a function to be called every time a value moved, we could use it to detect when a `Gc` moved inside another `Gc`, making it unrooted, and allowing us to create a simple mark-sweep collector. At that rate, though, we would just reinvent C++’s copy constructors.
- Variable-sized data. I foolishly decided to make it so that `Gc` could store `?Sized` types, which is more flexible for library users (enabling things like `Gc<[T]>`).
- Type erasure. In a typical Rust program, generics are implemented via monomorphization, and no type information is retained at runtime. This makes it harder to clean up an allocation without prior context.
With these relatively simple tools, we have enough to build a collector which can handle just about any data type inside.
My approach
We need to detect whether a `Gc` is reachable without actually scanning the stack. I’ll start with a few definitions, using graph-theoretical language.
- An allocation graph \(G = (V, E, r)\) is a directed graph with a root node \(r \in V\) whose indegree is zero.
- A node \(v\) in an allocation graph is said to be accessible if and only if there exists a path from \(r\) to \(v\) in the graph.
It should be clear to see why these definitions are useful to us. We can imagine each node being one allocation, pretending that all data not indirected through a `Gc` is part of an imaginary allocation for the root. Additionally, each `Gc` acts as an edge connecting two allocations. If an allocation is accessible in the graph-theoretical sense, it’s still possible for a program to reach it, and if not, it’s safe to free that allocation. Note that the indegree of a node is precisely equal to the reference count of its corresponding allocation.

I’ll roughly outline our approach to determining whether some node \(n \neq r\) is accessible.
- Find \(W\), the set of all descendants of \(n\). Construct the subgraph \(H\) of \(G\) whose vertex set is \(W\), preserving all edges connecting nodes in \(W\).
- Find the set of nodes \(A \subseteq W\) for which the indegree of each \(a \in A\) in \(H\) is not equal to its indegree in \(G\).
- For each \(a \in A\), mark it and its descendants as accessible.
- If and only if \(n\) was marked as accessible, it is accessible.
Here’s a serpentine implementation in pseudocode, if that’s what you prefer.
def is_accessible(n):
    counts = {n: n.indegree()}
    for (_, m) in n.edges():
        dfs(m, counts)

    reachable = set()
    for (a, count) in counts.items():
        if count != 0:
            mark(a, reachable)

    return n in reachable

def dfs(n, counts):
    if n not in counts:
        counts[n] = n.indegree()
        for (_, m) in n.edges():
            dfs(m, counts)
    # every edge traversed into n accounts for one of its references
    counts[n] -= 1

def mark(a, reachable):
    if a not in reachable:
        reachable.add(a)
        for (_, b) in a.edges():
            mark(b, reachable)
It should be pretty clear to see that `is_accessible` runs in \(O(|V| + |E|)\) time.
Reference counting with extra steps
Our single-threaded `Gc` will behave much like an `Rc`, but with some minor details changed. First, we define a `Gc` and the allocation it points to, a `GcBox`.
use std::{cell::Cell, ptr::NonNull};

pub struct Gc<T: ?Sized> {
    ptr: NonNull<GcBox<T>>,
}

#[repr(C)]
struct GcBox<T: ?Sized> {
    ref_count: Cell<usize>,
    value: T,
}
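Before wiring up `Drop`, it helps to see roughly how construction and cloning maintain that count. This is only a minimal sketch, not the crate’s actual code (which does more bookkeeping), but the shape is something like this:

impl<T> Gc<T> {
    pub fn new(value: T) -> Gc<T> {
        // Allocate the GcBox, then leak it so that we manage its lifetime manually.
        Gc {
            ptr: NonNull::from(Box::leak(Box::new(GcBox {
                ref_count: Cell::new(1),
                value,
            }))),
        }
    }
}

impl<T: ?Sized> Clone for Gc<T> {
    fn clone(&self) -> Gc<T> {
        // Each new Gc pointing to the allocation bumps its reference count.
        let box_ref = unsafe { self.ptr.as_ref() };
        box_ref.ref_count.set(box_ref.ref_count.get() + 1);
        Gc { ptr: self.ptr }
    }
}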
We can then hook into the `Drop` behavior for our `Gc` to make it all work.
use std::alloc::{dealloc, Layout};
use std::ptr::{addr_of_mut, drop_in_place};

impl<T: ?Sized> Drop for Gc<T> {
    fn drop(&mut self) {
        let box_ref = unsafe { self.ptr.as_ref() };
        match box_ref.ref_count.get() {
            0 => (),
            n => {
                box_ref.ref_count.set(n - 1);
                if n == 1 || !is_accessible(box_ref) {
                    box_ref.ref_count.set(0);
                    unsafe {
                        drop_in_place(addr_of_mut!(self.ptr.as_mut().value));
                        dealloc(
                            self.ptr.as_ptr().cast(),
                            Layout::for_value(self.ptr.as_ref()),
                        );
                    }
                }
            }
        }
    }
}
I, being very clever, decided to use a sentinel value of 0 for the reference count when an allocation is being cleaned up (to prevent spurious double-frees).
However, this code has a problem - how the hell are we supposed to implement `is_accessible`?
Trait hackery
Our pseudocode for `is_accessible` required us to be able to access the set of edges going out from a node. Doing so is kind of hard. If we were writing this in C, we would scan the allocation on the heap, looking for data that looks like it could be a pointer into another allocation. In Rust, though, we can be a lot more precise by adding a small constraint to every garbage-collected type.

If the garbage collector wasn’t enough, we’re bringing the visitor pattern over from Java. We force every garbage-collected value to implement the `Collectable` trait, which delegates some `Visitor` to each of its garbage-collected fields.
pub struct Gc<T: Collectable + ?Sized> {
    // same as before ...
}

pub trait Collectable {
    fn accept<V: Visitor>(&self, visitor: &mut V);
}

pub trait Visitor {
    fn visit_gc<T: Collectable + ?Sized>(&mut self, gc: &Gc<T>);
}
For example, a slice might implement it by delegating to `accept` on each of its elements.
impl<T: Collectable> Collectable for [T] {
    fn accept<V: Visitor>(&self, visitor: &mut V) {
        self.iter().for_each(|elem| elem.accept(visitor));
    }
}
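A user-defined type would do the same thing by visiting each of its fields. For example, a hypothetical `Node` type (illustrative only, not part of the crate) might look like this:

struct Node {
    neighbors: Vec<Gc<Node>>,
    name: String,
}

impl Collectable for Node {
    fn accept<V: Visitor>(&self, visitor: &mut V) {
        // Only fields that may (transitively) contain a Gc carry edges;
        // plain data like `name` has nothing to visit.
        self.neighbors
            .iter()
            .for_each(|neighbor| visitor.visit_gc(neighbor));
    }
}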
We can now write code that finds the outgoing edges from each allocation by simply applying a visitor to the allocation. I ended up writing two visitors - one for the `dfs` step and one for the `mark` step.

I would have liked to make `Collectable` object-safe, but that would also require `Visitor` to be object-safe. That, in turn, would cause every garbage collector to be coercible to `Gc<dyn Collectable>`, which would only be possible on nightly, making it impossible to write a stable crate.
Bulk dropping
A cursory analysis of the implementation of `Drop` above shows that the call to `is_accessible` runs in linear time with respect to the total number of `Gc`s in existence. You would be excused for thinking that `drop`, as written, therefore runs in linear time. However, the call to `drop_in_place` could also result in another `Gc` being dropped, yielding a quadratic-time cleanup. This is unacceptable, especially since our first call to `is_accessible` already did all the work of determining whether the allocations are reachable from the original one - if the original allocation being freed was inaccessible, any allocation referred to by it should also be inaccessible.

We can save our time on marking by a pretty simple method: we’ll do one more pass through the reference graph, recording a pointer to each allocation to be destroyed and its implementation of `drop`. Then, once we’ve found every inaccessible allocation, we’ll drop them all in one quick go.
There’s one final problem, though: each `Gc` owned by a dropped value could itself try to manipulate the reference counts of its pointees, resulting in undefined behavior or, worse, slow code. However, all problems in computer science can be solved by adding more state or more indirection. We opt to solve this one with extra state, creating a thread-local flag called `COLLECTING`. When we do a bulk drop, we set `COLLECTING` to `true`. Then, when `drop` is called on a `Gc`, it checks the state of `COLLECTING`; if `COLLECTING` is true, the `Gc` does nothing when dropped (not even touching its reference count). To handle outbound edges that point to still-accessible allocations, we add one more traversal before the final cleanup.
Let’s take a look at what the code for that roughly looks like now.
thread_local! {
    static COLLECTING: Cell<bool> = Cell::new(false);
}

impl<T: Collectable + ?Sized> Drop for Gc<T> {
    fn drop(&mut self) {
        if COLLECTING.with(Cell::get) {
            return;
        }
        let box_ref = unsafe { self.ptr.as_ref() };
        match box_ref.ref_count.get() {
            0 => (),
            1 => {
                box_ref.ref_count.set(0);
                unsafe {
                    drop_in_place(addr_of_mut!(self.ptr.as_mut().value));
                    dealloc(
                        self.ptr.as_ptr().cast(),
                        Layout::for_value(self.ptr.as_ref()),
                    );
                }
            }
            n => {
                box_ref.ref_count.set(n - 1);
                unsafe { collect(self.ptr) };
            }
        }
    }
}
`collect` is the function we’ll write to check whether the allocation is accessible and perform a bulk cleanup if needed.

There’s one more problem before we can implement `collect`, though: our pseudocode required us to be able to store each allocation as a key in `counts` and also to access the allocation’s neighbors. This means we have to use a little bit of type erasure to achieve our ends. I won’t go into too much detail, but we need to be able to do two things:

- Store pointers to allocations as keys in a hash-map.
- Reconstruct a stored pointer using type information to make it possible to do work.

Due to the limitations of Rust, these are actually mutually exclusive. Because it’s undefined behavior (due to provenance issues) to compare the raw data in pointers, we can’t erase a pointer and then use it as a key in a hash-map. However, we can’t just get a pointer to the first byte of the allocation, because our allocation could be `?Sized`.
As a result, my implementation actually has two different kinds of erased pointers: a thin pointer which can be used for comparisons and a fat pointer which can be used to reconstruct a pointer to any type.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct AllocationId(NonNull<Cell<usize>>); // used as a key in hash-maps

#[derive(Clone, Copy)]
struct ErasedPtr([usize; 2]); // used as a reconstructible pointer to the allocation
Then, all the functions which need to know the type information of the allocation (such as `dfs` and `mark`) can be stored opaquely as function pointers, and when called, they can “rehydrate” the `ErasedPtr` passed in with type information. Below, I’ve included an example of how it can be used to implement `dfs`.
use std::collections::HashMap;

unsafe fn dfs<T: Collectable + ?Sized>(
    ptr: ErasedPtr,
    counts: &mut HashMap<AllocationId, usize>,
) {
    struct DfsVisitor<'a> {
        counts: &'a mut HashMap<AllocationId, usize>,
    }

    impl Visitor for DfsVisitor<'_> {
        fn visit_gc<U: Collectable + ?Sized>(&mut self, gc: &Gc<U>) {
            let id = AllocationId::new(gc.ptr);
            // Only recurse into the allocation the first time we see it...
            if self
                .counts
                .insert(id, unsafe { gc.ptr.as_ref().ref_count.get() })
                .is_none()
            {
                (**gc).accept(self);
            }
            // ...but decrement its count once for every edge traversed into it.
            *self.counts.get_mut(&id).unwrap() -= 1;
        }
    }

    // pretend that `specify` converts the pointer back into NonNull<T>
    let specified = ptr.specify::<T>();
    specified.as_ref().accept(&mut DfsVisitor { counts });
}
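For intuition, here is roughly how such a type-erased bundle might be stored. The names here (`TrashBag`, `to_trash_bag`, `ErasedPtr::new`) are hypothetical, not the crate’s real types; the point is only that the monomorphized `dfs::<T>` can be saved as an ordinary function pointer alongside the erased pointer it knows how to rehydrate.

// Hypothetical bookkeeping record for one allocation awaiting inspection.
struct TrashBag {
    ptr: ErasedPtr,
    // dfs::<T> coerces to a plain function pointer once T is known here.
    dfs_fn: unsafe fn(ErasedPtr, &mut HashMap<AllocationId, usize>),
}

impl<T: Collectable + ?Sized> Gc<T> {
    fn to_trash_bag(&self) -> TrashBag {
        TrashBag {
            ptr: ErasedPtr::new(self.ptr), // pretend this erases the (possibly fat) pointer
            dfs_fn: dfs::<T>,
        }
    }
}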
I won’t include the implementation of `collect` here because it’s lengthy and not even the final revision of the cleanup algorithm, but you can trust me that it works. After much gnashing of teeth and tearing of hair, this is finally enough to achieve bulk dropping, making a call to `drop()` on a `Gc` a linear-time operation.
Amortizing
What’s that, you say? You need your reference operations to work in \(O(1)\) time? Why are programmers these days so picky?
Luckily, it isn’t too hard to make our garbage-collection efforts terminate in average-case \(O(1)\) time.
Instead of calling `collect` directly, we’ll just punt the allocation, which may be inaccessible, into a set of “dirty” allocations. In my code, I call that set a dumpster. Then, once every \(|E|\) times an allocation is dropped, we’ll go through the whole dumpster and collect them all in \(O(|E|)\) time. Amortized across all allocations, that’s technically \(O(1)\), enabled by the same accounting wizardry that makes array-backed lists useful and that caused the 2008 financial crisis.
To save ourselves a little bit of work, whenever an allocation is accessed (such as by dereferencing or cloning), we can yank the allocation away from that same dirty set since it was just proven to be accessible.
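Sketched in Rust, the bookkeeping looks roughly like the following. This builds on the hypothetical `TrashBag` above, and `should_collect` and `collect_all` are placeholder helpers, not the crate’s real API.

use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // The "dumpster": allocations that might be inaccessible, keyed by identity.
    static DUMPSTER: RefCell<HashMap<AllocationId, TrashBag>> = RefCell::new(HashMap::new());
}

fn mark_dirty(id: AllocationId, bag: TrashBag) {
    DUMPSTER.with(|d| {
        d.borrow_mut().insert(id, bag);
    });
    if should_collect() {
        // e.g. once the dirty set has grown past roughly |E| entries
        collect_all(); // sweep everything in the dumpster in O(|E|) time
    }
}

fn mark_clean(id: &AllocationId) {
    // Called when a Gc is dereferenced or cloned: it was just proven accessible,
    // so there's no need to examine it during the next sweep.
    DUMPSTER.with(|d| {
        d.borrow_mut().remove(id);
    });
}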
Are we concurrent yet?
So far, the garbage collector I’ve been outlining to you has been a single-threaded collector, which can only store thread-local garbage-collected values. I’m told that all code worth writing ends up being concurrent, so we have to do better than that. Lucky for us, this algorithm can be made to operate concurrently with little effort, assuming that only one thread collects at a time.
However, even if only one thread is collecting, we can still run into some nasty concurrency issues. Let’s imagine that we have 2 allocations, \(a\) and \(b\), plus the imaginary “root” allocation \(r\), where \(a\) can only be accessed from \(r\) and \(b\) can only be accessed from \(a\).
flowchart LR
r("r")
a --> b
r --> a
Now, consider performing a depth-first search starting from \(a\). First, we record in `counts` that \(a\) has 1 unaccounted-for reference. Then, we move on to looking at \(b\), because it’s one of \(a\)’s children. However, between when we record \(a\)’s reference count and when we examine \(b\), another malicious thread mutates \(b\), adding a back-reference from \(b\) to \(a\).
flowchart LR
r("r")
a --> b
b --> a
r --> a
Then, the depth-first search will return to \(a\), decrement the number of unaccounted references, and assume that all incoming references to \(a\) and \(b\) were totally accounted for (and therefore inaccessible). Then the garbage collector would happily destroy \(a\) and \(b\)’s allocations, leaving a dangling reference to the heap, potentially causing a use-after-free or other gnarly memory errors.
Resolving concurrency
The simplest solution to this concurrency issue is to wrap everything in a big fat lock, forcing every single operation involving a `Gc` to acquire a single global mutex and preventing any sort of concurrency bug at all. However, that’s ridiculously slow. We’d like an approach that only needs to achieve mutual exclusion over a small subset of all allocations, if at all possible.

To do so, let’s take a closer look at our concurrency problem: we only really get erroneous results when another thread mutates the subgraph \(H\) while we interact with it. The solution is quite clever - we tag every `Gc` with a special number, which I’ll call a collecting tag. First, we’ll add a global atomic integer called `COLLECTING_TAG`.
static COLLECTING_TAG: AtomicUsize = AtomicUsize::new(0);
Whenever we create or clone a new `Gc`, we’ll annotate it with the current value of `COLLECTING_TAG`.
// this new `Gc` type is intended to be `Sync`.
pub struct Gc<T: Collectable + Send + Sync + ?Sized> {
    ptr: NonNull<GcBox<T>>,
    tag: AtomicUsize,
}

impl<T: Collectable + Send + Sync + ?Sized> Gc<T> {
    pub fn new(x: T) -> Gc<T> {
        // (other bookkeeping hidden)
        Gc {
            ptr: /* initialization of GcBox<T> */,
            tag: AtomicUsize::new(COLLECTING_TAG.load(Ordering::Relaxed)),
        }
    }
}
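Clones get the same treatment as freshly created `Gc`s. A sketch of what that might look like, with the reference-count bookkeeping elided just as it is in `new`:

impl<T: Collectable + Send + Sync + ?Sized> Clone for Gc<T> {
    fn clone(&self) -> Gc<T> {
        // (reference-count bookkeeping hidden)
        Gc {
            ptr: self.ptr,
            // Freshly created or cloned Gcs are stamped with the current tag.
            tag: AtomicUsize::new(COLLECTING_TAG.load(Ordering::Relaxed)),
        }
    }
}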
Next, at the start of the collection process, we’ll increment the value of `COLLECTING_TAG` just before we call `dfs`.
fn collect() {
    COLLECTING_TAG.fetch_add(1, Ordering::Relaxed);
    // carry on with DFS and such
}
Lastly, whenever we encounter a `Gc` during the `dfs` process, we’ll check its tag. If the tag is equal to the current value of `COLLECTING_TAG`, that means the `Gc` was created after `dfs` started. If so, then whatever spot we found the `Gc` in must have been accessible, and we can mark it as such. To prevent us from visiting the same edge twice (due to shenanigans), we’ll also update the tag on every edge we visit to the value of `COLLECTING_TAG`.
Let’s return to our broken example to show how this modification fixes it. We return to our reference graph with \(a\) pointing to \(b\) and \(r\) pointing to \(a\). Since every `Gc` must be annotated with a tag, let’s say they’re all tagged with 0.
graph LR
r("r")
r -- "0" --> a
a -- "0" --> b
Next, when `dfs` begins, `COLLECTING_TAG` is incremented to 1. As a result, the new reference created by a malicious thread must be tagged with 1.
graph LR
r("r")
r -- "0" --> a
a -- "0" --> b
b -- "1" --> a
Now, when `dfs` visits the edge pointing from \(b\) to \(a\), it sees that the reference is labeled with a tag of 1, and therefore both \(b\) and \(a\) must be accessible - no accidental early deallocations here.
Moving in and out
We might hope that tagging the pointers is sufficient to prevent all concurrency bugs. However, this is sadly not enough. To show why, we’ll begin with an example.
First, let’s imagine an allocation called \(x\), which contains two optional references to itself.
struct X {
    x1: Mutex<Option<Gc<X>>>,
    x2: Mutex<Option<Gc<X>>>,
}
At the start of our example, there will be only one `Gc` pointing from inside \(x\) to itself, and the other slot that can hold a `Gc` will be empty. \(x\) is accessible directly from the root allocation \(r\), and all existing references are tagged with 0. For the sake of clarity, I’ll draw the arrow representing the first slot in \(x\) with a solid line and the second slot with a dashed line.
graph LR
r("r")
r -- "0" --> x
x -- "0" --> x
x -.-> nowhere[ ]
style nowhere fill:#FFFFFF00, stroke:#FFFFFF00;
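In code, that starting state might be set up like so (a sketch; the binding `x` on the stack plays the role of the root’s reference):

// x's first slot points back to x itself; the second slot starts out empty.
let x = Gc::new(X {
    x1: Mutex::new(None),
    x2: Mutex::new(None),
});
*x.x1.lock().unwrap() = Some(Gc::clone(&x));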
Now, let’s have `dfs` begin, starting at \(x\).

- We increment `COLLECTING_TAG` to 1.
- `dfs` records that \(x\) has 2 unaccounted-for references at the start.
- `dfs` visits \(x\)’s first slot and tags the currently-existing self-reference in that slot. \(x\) now has 1 unaccounted-for reference.
graph LR
r("r")
r -- "0" --> x
x -- "1" --> x
x -.-> nowhere[ ]
style nowhere fill:#FFFFFF00, stroke:#FFFFFF00;
- A malicious thread intervenes, first moving the `Gc` from \(x\)’s first slot out to \(r\), then moving the original `Gc` which connected \(r\) to \(x\) into \(x\)’s second slot.

fn do_evil(r_to_x: Gc<X>) {
    // Steal the self-reference out of x's first slot...
    let z = std::mem::take(&mut *r_to_x.x1.lock().unwrap()).unwrap();
    // ...and stash the Gc which connected r to x into x's second slot.
    *z.x2.lock().unwrap() = Some(r_to_x);
}
graph LR
r("r")
r -- "1" --> x
x --> nowhere[ ]
x -. "0" .-> x
style nowhere fill:#FFFFFF00, stroke:#FFFFFF00;
- `dfs` continues, now visiting the second slot in \(x\). Seeing another reference to \(x\), we decrement the number of unaccounted-for references to 0.
- `dfs` found no ancestors of \(x\) with unaccounted-for references, so we free \(x\). However, \(x\) is still accessible!
This problem stumped me for a decent while. It seems impossible to prevent other threads from do-si-do-ing every `Gc` pointing to an allocation in order to fool our search. Luckily, there is a way through: since tagging every `Gc` worked out for us before, we’ll try tagging every allocation too.
#[repr(C)]
struct GcBox<T: Collectable + Send + Sync + ?Sized> {
    ref_count: AtomicUsize,
    tag: AtomicUsize,
    value: T,
}
Next, every time a `Gc` is dereferenced, we’ll update the tag in the allocation to the current value of `COLLECTING_TAG`.
impl<T: Collectable + Send + Sync + ?Sized> std::ops::Deref for Gc<T> {
    type Target = T;

    fn deref(&self) -> &Self::Target {
        let box_ref: &GcBox<T> = unsafe { self.ptr.as_ref() };
        box_ref
            .tag
            .store(COLLECTING_TAG.load(Ordering::Acquire), Ordering::Release);
        &box_ref.value
    }
}
Whenever `dfs` observes that the tag on an allocation is equal to `COLLECTING_TAG`, that means the allocation was accessed, and is therefore accessible. In order to shuffle references around in our example above, the malicious thread would have to dereference a `Gc`, notifying the search and preventing an early deallocation.

In fact, this is enough to prove that `dfs` will never incorrectly mark an allocation as inaccessible, even under concurrent execution. This blog post is already rather long, so I’ll only provide a rough sketch of the proof.
Theorem: `dfs` will never erroneously mark an allocation as inaccessible.

Proof:

1. If `dfs` encounters a `Gc` created after `dfs` begins, it will know that the referent is accessible.
2. Due to (1), if `dfs` marked some allocation \(a\) as inaccessible, it must have only visited `Gc`s pointing to \(a\) which existed before `dfs` began.
3. Weakening the statement from (2), if `dfs` marked some allocation \(a\) as inaccessible, it must have only visited `Gc`s pointing to \(a\) which existed when \(a\)’s reference count was recorded.
4. `dfs` never visits a `Gc` while it points from the root allocation \(r\) to any other allocation.
5. `dfs` must visit every `Gc` which pointed to an allocation at the point when its reference count was recorded in order to mark it as inaccessible.
6. Due to (4) and (5), if \(a\) is marked as inaccessible, any `Gc` which at any point connected \(r\) to \(a\) must have been moved to connect another node (not \(r\)) to \(a\).
7. Due to (6), there is no lifetime `'x` which can borrow against a `Gc` pointing from \(r\) to \(a\) and last between when `dfs` first visits \(a\) and when it last visits \(a\).
8. The final `Gc` to move into \(a\) before `dfs` must have been moved while another borrow against a `Gc` pointing from \(r\) to \(a\) existed. Let the lifetime of that borrow be `'y`.
9. Because `'y` lasts longer than the final move, `'y` can be extended to last arbitrarily long, even until after `dfs` has finished visiting every `Gc` pointing to \(a\).
10. Due to (7), `'y` must not have existed before `dfs` first visited \(a\). Therefore, if \(a\) is erroneously marked as inaccessible, a `Gc` pointing to \(a\) must have been dereferenced to generate a borrow with lifetime `'y`.
11. If a `Gc` pointing to \(a\) was dereferenced before the last time `dfs` visited a `Gc` pointing to \(a\), then `dfs` would have observed that \(a\)’s tag had changed. Therefore, if an allocation was erroneously marked as inaccessible, it must also have been marked as accessible - a contradiction. Therefore no allocations are ever erroneously marked as inaccessible by `dfs`. QED.
This proof is relatively handwavy, but it should be possible for an astute reader to fill in the details. I will leave it to such astute readers to prove the complementary result; namely, that `dfs` will never erroneously mark an allocation as accessible.

I also think that it’s really cool that the entire correctness of this approach rests on Rust’s borrow-checking semantics. It makes the whole thing feel like it fits right in.
It’s the weak ones that count
There’s one more cause for memory safety issues from concurrency. If an allocation is freed because its reference count reached zero, it could still be accessible by another thread because it was stored in the global dumpster. We can’t guarantee that we’ve obliterated all references to that allocation from every other thread’s memory, so it seems like we’re hosed.
However, there’s no problem in computer science that can’t be solved by adding more state.
Taking a leaf from `Rc`’s book and adapting it liberally, we annotate each allocation with a “weak” reference count, which is the number of times it’s referenced by a dumpster or a cleanup thread. We’ll rename our original reference count to be the strong count instead.
#[repr(C)]
struct GcBox<T: Collectable + Send + Sync + ?Sized> {
    strong: AtomicUsize,
    weak: AtomicUsize,
    tag: AtomicUsize,
    value: T,
}
Then, when dropping the last strong reference to a `Gc`, we first check that its weak reference count is zero before we deallocate it. If it isn’t, some dumpster still holds a pointer to the allocation, and it’s the cleanup threads’ problem to take care of it.
Whenever an allocation is removed from a dumpster and about to be thrown away by a thread running a cleanup procedure, the thread decrements the weak count of the allocation and checks if both the weak and strong count have reached zero. If so, that thread can then safely destroy the allocation.
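Ignoring the memory-ordering subtleties a real implementation has to worry about, the destruction condition boils down to something like this sketch (hypothetical helper, not the crate’s actual code):

unsafe fn try_destroy<T: Collectable + Send + Sync + ?Sized>(ptr: NonNull<GcBox<T>>) {
    let box_ref = ptr.as_ref();
    // Only destroy the allocation once no Gc (strong) and no dumpster or
    // cleanup thread (weak) can possibly reach it any more.
    if box_ref.strong.load(Ordering::Acquire) == 0
        && box_ref.weak.load(Ordering::Acquire) == 0
    {
        drop_in_place(addr_of_mut!((*ptr.as_ptr()).value));
        dealloc(ptr.as_ptr().cast(), Layout::for_value(ptr.as_ref()));
    }
}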
Deadlocked!
Imagine that we want to create a `Gc<Mutex<()>>`, and then drop a `Gc` pointing to it while the `Mutex` is locked.
let gc1 = Gc::new(Mutex::new(()));
let gc2 = gc1.clone();
let guard = gc1.lock()?;
drop(gc2); // suppose this triggers a cleanup
drop(guard);
While traversing the reference graph, the `dfs` visitor would attempt to visit the mutex, lock it, and try to carry on. However, since the same thread that triggered the collection still holds a guard to that mutex, the thread would deadlock disastrously.

Lucky for us, we know that any allocation with a held mutex guard must still be accessible (due to `Mutex`’s excellent API), so failing to acquire a mutex is immediate proof that the allocation is accessible.
To take advantage of that, we’ll make visiting a collectable type a fallible operation, and make it so that failing to acquire a lock also fails the visit.
pub trait Collectable {
    fn accept<V: Visitor>(&self, visitor: &mut V) -> Result<(), ()>;
}

pub trait Visitor {
    fn visit_gc<T: Collectable + ?Sized>(&mut self, gc: &Gc<T>) -> Result<(), ()>;
}

impl<T: Collectable + ?Sized> Collectable for Mutex<T> {
    fn accept<V: Visitor>(&self, visitor: &mut V) -> Result<(), ()> {
        // pretend poison errors don't exist
        self.try_lock().map_err(|_| ())?.accept(visitor)?;
        Ok(())
    }
}
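With that in place, the self-referential `X` type from earlier could propagate failure from its two mutexes like so (a sketch, assuming `Option<Gc<X>>` is also `Collectable`):

impl Collectable for X {
    fn accept<V: Visitor>(&self, visitor: &mut V) -> Result<(), ()> {
        // If either lock is currently held, the whole visit fails, which is
        // exactly the signal we want: the allocation is demonstrably accessible.
        self.x1.accept(visitor)?;
        self.x2.accept(visitor)?;
        Ok(())
    }
}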
Sharing your dumpsters
You’ve made it this far into this (admittedly rather lengthy) post. Enjoy this hand-drawn picture by yours truly.
When implementing this garbage collector, one of the biggest bottlenecks for concurrent performance is accessing the one global dumpster every single time a `Gc` is dropped. If using a `Mutex<HashMap>`, this will remove nearly all parallelism from the whole system and make you wonder why you paid all that money for a 16-core computer. As a result, any sane person might reach for a concurrent hashmap. There are many Rust implementations - I tried `chashmap` but gave up on it due to a rare panic bug, and then I tried `dashmap`, which worked fine but was too slow as thread counts increased.
The heart of the issue is that concurrently updating any data structure is going to be slow. Even so-called “concurrent” data structures which use clever sharded locks have this issue - there’s always some contention overhead. The best workaround to concurrency slowdowns is to just not do things concurrently. We can fix this by making every thread have its own dumpster, and then having a global collection of to-be-cleaned allocations which I’ll call the garbage truck. Whenever another allocation is marked dirty, we can check if this thread’s dumpster is full according to some heuristic, and if so, transfer all the allocations from this thread’s dumpster to the garbage truck. I’ve included some pseudocode for the whole process below.
dumpster = set()              # local to this thread
garbage_truck = mutex(set())  # global to all threads

def mark_dirty(allocation):
    dumpster.add(allocation)
    if is_full(dumpster):
        garbage_truck.lock()
        for trash_bag in dumpster:
            garbage_truck.add(trash_bag)
        garbage_truck.unlock()
        dumpster = set()
If `is_full` returns `true` rarely enough, there will be almost no contention over the garbage truck. This means the concurrency overhead is vastly reduced and we get to pocket some great performance returns.
Okay, but how fast is it?
In short, this garbage collector runs pretty fast! In the end, I made two different garbage collector implementations - one which is thread-local, saving us from all the concurrency headaches, and another thread-safe one that has all the bells and whistles at the cost of some performance. Borrowing from `once_cell`’s API, the former lives in the `unsync` module of the crate and the latter in `sync`.

Next, I collected all the garbage collectors I could find in order to compare them. I selected the ones which had a similar API to `dumpster`, were published on crates.io, and actually worked.
| Name                | Concurrent?                   | Like `dumpster`? |
| ------------------- | ----------------------------- | ---------------- |
| `dumpster` (unsync) | ❌                            | ✅               |
| `dumpster` (sync)   | ✅                            | ✅               |
| `Rc`                | ❌                            | ✅               |
| `Arc`               | ✅                            | ✅               |
| `bacon-rajan-cc`    | ❌                            | ✅               |
| `cactusref`         | ❌                            | ❌               |
| `elise`             | ✅                            | ❌               |
| `gc`                | ❌                            | ✅               |
| `gcmodule`          | yes, but with a different API | ✅               |
| `rcgc`              | ❌                            | ❌               |
| `shredder`          | ✅                            | ✅               |
I excluded the ones that didn’t have a `dumpster`-like API because it would have been a pain to set up a benchmarking harness to handle them. If you feel this is poor diligence, you can go benchmark them yourself. I also included `Rc` and `Arc`, even though they don’t collect cycles, because they’re a decent baseline for the minimum performance of a garbage collector.

During my research, I also found that `bacon-rajan-cc` doesn’t initiate collections of its own accord, so I added a version of each of `dumpster`’s garbage collectors which doesn’t automatically trigger collections, in the spirit of competitive fairness. Those versions are labeled as “manual.” Lastly, I found that `gcmodule` simply stack-overflows and sometimes segfaults when facing extremely large collection loads, so I removed it from the benchmark set.
As my benchmark, I decided to do the very simplest thing I could think of: have a bunch of allocations randomly add and remove references to each other in a loop, dropping and creating new allocations as well. To account for random noise, I ran 100 trials of each garbage collector doing 1 million operations. Each operation is either the creation of a reference or a deletion.
First, I ran a benchmark of all the garbage collectors, measuring pure single-threaded performance. I’ve shown my benchmark results as a set of violin plots below, relating runtime to frequency.
This plot doesn’t tell us much other than that `shredder` is slow. To take a closer look, let’s remove `shredder` from the list and see what we find.

`dumpster`’s unsync implementation is in a decent position, beaten out only by the non-cycle-detecting allocators and `bacon-rajan-cc`, which has very little overhead and is quite performant.
I think there are two main reasons why the concurrent version of the garbage collector is much slower than the single-threaded one. First, moving allocations from a thread-local dumpster to the global garbage truck takes some time. Second, the `dfs` operation in a concurrent environment must record the current set of children of an allocation, typically in a separate heap-allocated structure, to prevent some concurrency bugs. It’s certainly possible to mitigate the losses from the first issue, but the second is harder to handle.

My guess as to why the concurrent version is significantly slower when using automatic collection triggering is that it’s a consequence of my testing harness. Currently, it runs all the tests in a loop, so the single-threaded benchmark could run directly after the multi-threaded one and be forced to clean up some of its garbage.
In any event, let’s move on to the results from the multi-threaded benchmark. The only other concurrent garbage collector available for me to test against was `shredder`, and it looks like it wasn’t much of a competition. `dumpster` doesn’t scale as smoothly as I’d like with thread count, but it’s a relief that performance does actually improve with thread count, except for the jump from one thread to two. I think being only 3x slower than `Arc` is a decent victory, especially for a project that I wasn’t paid any money to do.

I’m not much of an expert at high-performance computing in concurrent situations, so trying to optimize this further would likely just be an adventure in trying random things and seeing if they yield an improvement. If anyone out there has clever ideas on how to optimize this approach, let me know! My email is on the “About” page.
Coda
Over the last few months, I built a garbage collector for Rust from scratch. I implemented a novel (ish) algorithm of my own design, got it to work in both single-threaded and multi-threaded contexts, and showed that it was actually pretty performant.
The implementation of this crate is now up on crates.io and available for anyone to download. It’s licensed under the GNU GPLv3.
Special thanks to Wisha, Charlie, and Shreyas for reviewing this post, fielding my bad questions, and putting up with me rambling about this for nearly two months straight.
Postscript
I have a few things that I thought were interesting along the way but didn’t quite have a good spot to include in the story above, so they’ll all be mish-mashed in here.
Ode to bad good ideas
The final, polished solution you see in this blog post is actually quite far from my original idea. In particular, almost all of my first ideas for handling concurrency didn’t work. Here’s a quick roll of them all:
- Wrap a `Mutex` around every single allocation’s reference count and hold them while doing `dfs`
- Save on space in `Gc`s by tagging only the lowest bit instead of adding an entire `usize`
- Make everything enable dynamic dispatch so we can have real Java-like objects
- Clean up allocations locally on every thread
- Avoid directly dropping allocations and instead cleverly decrement their reference count
Cool nightly tricks
`Rc` and the other pointer types are somewhat “blessed” in that they can access nightly features from the Rust compiler and still be consumed by stable projects. `dumpster` has no such blessings. I would like to implement `CoerceUnsized` unconditionally for all my garbage collectors, but that’s simply not possible on stable yet.
Likewise, pending the implementation of strict provenance, there’s actually no way for a stable crate to manipulate the address portion of a fat pointer. This means that even if I found some space optimization using lower bit tagging, I wouldn’t be able to use it because it’s gated behind a nightly feature.
Another path to Gc<dyn T>
Without the ability to coerce `Gc` to an unsized type, I can’t make `Collectable` object-safe due to the function signature of `accept`, which forces `Visitor` to also be object-safe, forcing visit operations to also be object-safe.

I’ve thought of a slightly different approach: we can make `Visitor` a single concrete type, and that makes `Collectable` trivially object-safe. The downside is that I would then have to store all 5 of my current visitor implementations in the same structure, which loses a lot of my nice type-safety features and adds a large runtime penalty. I’m not very happy with that idea, though.
Miri rules
I would probably not have been able to finish this project without `cargo miri`. If you haven’t heard of it, you should download it right now and run all of your Rust projects with it. It’s such a fantastic tool for debugging unsafe code that I wish I could have it in other languages, too - valgrind is a poor substitute for it when using C.
Testing and debugging
Early on, I had just a few handwritten tests. The first reference graph that I found that gave me trouble looked like this:
graph LR
a --> b
b --> d
d --> a
a --> c
c --> d
However, once I had simple graphs with multiple cycles down, it was hard to come up with new tests. As a result, I resorted to fuzzing, using the same approach as my benchmark - randomly generating references and then seeing if they got cleaned up properly.
Debugging the results of those fuzzed tests absolutely sucked, though. The experience of trying to find the source of an issue was awful, so I ended up having to make some (crude) tools to make the process easier. Eventually, I had settled on printing out the reference graph at choice times during the run of my tests and then visualizing them to get a good guess of what was wrong. However, when making the concurrent implementation, printing something can change the execution behavior of the program. Not only that, it’s not always obvious what the issue is. I had one bug which only occurred after several thousand operations had completed, resulting in reference graphs that looked like this:
graph LR
2892 --> 2883
2892 --> 2852
2849 --> 2848
2849 --> 2852
2868 --> 2890
2868 --> 2837
2776 --> 2762
2853 --> 2857
2853 --> 2871
2822 --> 2814
2836 --> 2822
2837 --> 2868
2848 --> 2801
2835 --> 2765
2851 --> 2783
2851 --> 2852
2828 --> 2878
2857 --> 2862
2857 --> 2837
2713 --> 2698
2713 --> 2835
2698 --> 2685
2698 --> 2712
2698 --> 2742
2813 --> 2819
2813 --> 2783
2772 --> 2713
2772 --> 2813
2772 --> 2823
2772 --> 2846
2771 --> 2807
2771 --> 2759
2771 --> 2851
2881 --> 2771
2881 --> 2837
2783 --> 2836
2783 --> 2807
2783 --> 2849
2783 --> 2843
2890 --> 2901
2765 --> 2750
2765 --> 2776
2765 --> 2776
2765 --> 2849
2819 --> 2853
root --> 2871
root --> 2890
root --> 2902
root --> 2900
root --> 2842
root --> 2837
root --> 2877
root --> 2869
root --> 2873
root --> 2867
root --> 2881
root --> 2868
root --> 2897
root --> 2884
root --> 2855
root --> 2892
root --> 2852
root --> 2822
root --> 2901
root --> 2874
root --> 2899
root --> 2893
root --> 2828
root --> 2883
root --> 2896
root --> 2903
root --> 2904
root --> 2772
2892 --> 2867
The graph above was made by fuzzing my garbage collector, using the same approach as my benchmarks.