How is the “grey list” implemented in modern garbage collectors?

Question

In the Wikipedia article for “tracing garbage collection”, the following claim is made:

...most modern tracing garbage collectors implement some variant of the tri-color marking abstraction

In this abstraction, objects are grouped into three sets: white, black, and grey. The basic tracing algorithm repeatedly takes an object from the grey set, moves any white objects it references into the grey set, and marks the object itself black. When the grey set is empty, every object that is still white is unreachable and can be deleted.

What the article doesn’t go into is how the grey set is implemented, which has clear implications for the overall performance of the GC algorithm. The most direct and obvious solution is to model the grey set as a data structure of pointers to objects (hash set, stack, queue, whatever). I have no data on this, but that seems quite expensive in terms of space: one pointer per object reference in the call stack.
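A minimal sketch of the "data structure of pointers" approach described above, written in Python purely for illustration. The names `Color`, `Obj`, `mark`, and `sweep` are hypothetical and not taken from any real collector; the grey set is assumed to be a plain stack.

```python
from enum import Enum


class Color(Enum):
    WHITE = 0  # not reached yet
    GREY = 1   # reached, outgoing references not scanned yet
    BLACK = 2  # reached and fully scanned


class Obj:
    """Hypothetical heap object: a name plus outgoing references."""

    def __init__(self, name):
        self.name = name
        self.refs = []
        self.color = Color.WHITE


def mark(roots):
    """Tri-color marking with an explicit grey stack.

    Returns the largest number of entries the stack held at once.
    """
    grey = []
    for root in roots:
        if root.color is Color.WHITE:
            root.color = Color.GREY
            grey.append(root)
    peak = len(grey)
    while grey:
        obj = grey.pop()
        for child in obj.refs:
            # Only white objects are pushed, so each object enters
            # the grey stack at most once.
            if child.color is Color.WHITE:
                child.color = Color.GREY
                grey.append(child)
        obj.color = Color.BLACK
        peak = max(peak, len(grey))
    return peak


def sweep(heap):
    """Keep black objects; everything still white is garbage."""
    return [o for o in heap if o.color is Color.BLACK]


# Example heap: a -> b, a -> c, b -> c, and d -> a (nothing points to d).
a, b, c, d = Obj("a"), Obj("b"), Obj("c"), Obj("d")
a.refs = [b, c]
b.refs = [c]
d.refs = [a]
heap = [a, b, c, d]

peak = mark([a])
survivors = sweep(heap)
```

Because an object is pushed only on its white-to-grey transition, the stack in this sketch never holds more than one entry per reachable object, which bounds the space cost of this particular design.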

How is the “grey list” implemented in modern garbage collectors to maximize efficiency in terms of time and space and what kind of overhead do these solutions incur?

A pointer is the smallest possible way there is to reference an object, unless you know of some way to scan for objects in memory, which seems like it would be an interminably slow process. — Robert Harvey, Feb 19 '18 at 17:29
And if you did that, you'd have to have markers and garbage collection information in the objects so that they could be identified and categorized. This would almost certainly take up more space than a pointer. — Robert Harvey, Feb 19 '18 at 17:58
The key word in that name is "abstraction." It's an abstract idea. The "black" set and the "white" set and the "gray" set are part of the description of what a tracing GC does. They are not necessarily explicit in the description of how any given GC implementation does it. — Solomon Slow, Feb 19 '18 at 21:25

score 1 · Answer 1 · answered Feb 19 '18 at 18:11

In the naive mark-and-sweep method, each object in memory has a flag (typically a single bit) reserved for garbage collection use only.

For tri-color marking you need one more bit in each object, for a total of 2 bits reserved for the GC.

One implementation would go like this: each GC-managed heap object starts its memory layout with a hidden first field that points to a kind of class object. The class object identifies the true runtime type of the object and also holds vtables and interface tables (protocol witness tables). This first field is used by the runtime and is unseen by the user. (The class object itself is probably just an ordinary object in the runtime's implementation language rather than a GC-managed user object.)

Like most objects, these class objects are located on word-aligned boundaries, which makes the lowest 2-3 bits of their addresses always zero (2 bits with 4-byte alignment, 3 bits with 8-byte alignment). These always-zero bits can be borrowed for the GC purposes above.

The only downside is that anyone who wants the real class object (e.g. the runtime doing virtual dispatch, a cast, or something else) will most likely have to mask off these bits before dereferencing the pointer (this does depend on the instruction set architecture).
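The bit-borrowing idea can be simulated with plain integers. This is only a sketch: the header word is modelled as a Python `int`, the address is made up, and the helper names (`make_header`, `class_address`, `color_of`, `with_color`) are hypothetical. A real runtime would do the same masking on a machine word.

```python
COLOR_BITS = 2
COLOR_MASK = (1 << COLOR_BITS) - 1  # 0b11: the two borrowed low bits

WHITE, GREY, BLACK = 0, 1, 2


def make_header(class_addr, color=WHITE):
    """Pack a class-object address and a GC color into one word."""
    # Assumption: class objects are at least 4-byte aligned, so the
    # low two address bits are zero and free to reuse.
    assert class_addr & COLOR_MASK == 0, "class object must be 4-byte aligned"
    return class_addr | color


def class_address(header):
    """Mask off the GC bits to recover the real class-object address."""
    return header & ~COLOR_MASK


def color_of(header):
    """Read the GC color stored in the low bits."""
    return header & COLOR_MASK


def with_color(header, color):
    """Return the same header with its color replaced."""
    return (header & ~COLOR_MASK) | color


# Made-up, 8-byte-aligned address used only for the demonstration.
CLASS_ADDR = 0x7F001238

header = make_header(CLASS_ADDR)
grey_header = with_color(header, GREY)
black_header = with_color(grey_header, BLACK)
```

Every read of the class pointer goes through `class_address`, which is the masking cost the answer mentions; the color changes never disturb the address bits.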

A naive mark-and-sweep only needs one "mark" bit. During the mark phase, the "black" set is the set of all objects that have been marked. The "gray" set is the set of all unmarked objects that are reachable, and the "white" set is the set of all unmarked objects that are unreachable. At the start of the mark phase, the black set is empty and the white and gray sets are unknown. At the end of the mark phase, the gray set is empty, the black set is explicit, and the white set is the set of all objects, minus the members of the black set. — Solomon Slow, Feb 19 '18 at 21:33
A minimal GC system needs to provide a function, allocate_object(num_pointers,num_raw_bytes). As far as the GC is concerned, all "objects" can be the same type. The only thing the GC needs to know about them is how to find the pointers in one object that reference other objects, and maybe, how to find that "mark" bit. "Classes" and "vtables" and all that other stuff can be implemented in higher layers that the GC does not need to know about. — Solomon Slow, Feb 20 '18 at 16:18
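The two comments above can be combined into one small sketch: a heap whose only allocation entry point is `allocate_object(num_pointers, num_raw_bytes)` and whose objects carry a single mark bit. Everything apart from that function name is an assumption made for illustration (the `GcObject` and `Heap` classes, the `collect` method, passing roots in explicitly). The grey set is never stored here; pending work lives implicitly on the call stack of the recursive `_mark`. A production collector would avoid unbounded recursion.

```python
class GcObject:
    """All objects look the same to the GC: pointers, raw bytes, mark bit."""

    def __init__(self, num_pointers, num_raw_bytes):
        self.marked = False
        self.pointers = [None] * num_pointers
        self.raw = bytearray(num_raw_bytes)


class Heap:
    def __init__(self):
        self.objects = []

    def allocate_object(self, num_pointers, num_raw_bytes):
        obj = GcObject(num_pointers, num_raw_bytes)
        self.objects.append(obj)
        return obj

    def _mark(self, obj):
        # Marked objects play the role of the black set. Unmarked
        # objects are grey or white; which one is unknown until the
        # traversal reaches them or finishes without doing so.
        if obj is None or obj.marked:
            return
        obj.marked = True
        for target in obj.pointers:
            self._mark(target)

    def collect(self, roots):
        """Mark from the given roots, sweep, and return the freed count."""
        for root in roots:
            self._mark(root)
        survivors = [o for o in self.objects if o.marked]
        freed = len(self.objects) - len(survivors)
        for o in survivors:
            o.marked = False  # reset the bit for the next cycle
        self.objects = survivors
        return freed


heap = Heap()
a = heap.allocate_object(2, 0)
b = heap.allocate_object(1, 8)
c = heap.allocate_object(0, 16)  # never referenced
a.pointers[0] = b
b.pointers[0] = a                # a cycle is handled by the mark bit

freed = heap.collect([a])
```

Note that the GC in this sketch knows nothing about classes or vtables; it only needs the pointer slots and the mark bit.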
