Project: Record and Replay for Garbage Collection Development

Overview

Garbage collectors are notoriously difficult to debug. One of the reasons for this is errors in the algorithm (or implementation) often do not manifest promptly. For example, if a collector has a leak (objects are incorrectly kept alive), then evidence of that leak may not become apparent until a long time after the collector first neglected to collect an object, and for some workloads, the error may not be evident at all. A perhaps more surprising example is that if an object is prematurely collected, programs often continue to run without apparently problems for a considerable time, perhaps because the affected object is rarely used by the program, or perhaps because the program is quite resilient to the content of the affected memory. Worse still, the root causes of errors like those above can often be very subtle and can be many levels of indirection prior to the garbage collector making an incorrect decision.

Because garbage collectors are very low-level software, the above problems have generally been combatted using fairly primitive means. Above all, well-written collectors make extensive use of compile-time and run-time assertions that are designed to flag failures of preconditions earlier rather than later. Some sytems (like MMTK and V8) also implement validity checkers, which can be invoked at the command line. In the case of MMTk, the checker (sanity) takes the form of an extremely simple stop-the-world, whole-heap tracing collector that is entirely non-invasive because it uses its own metadata on the side. The premise is that the sanity collector is correct (with high confidence), and that it is therefore possible to identify some classes of error by comparing the work of the sanity collector and the production collector, by invoking the sanity collector before and after each garbage collection. If the sanity collector and production collector disagree about the liveness of an object, this indicates an error.

Record and replay debugging (also known as time travel debugging) is a technique whereby a buggy program can be deterministically replayed, and during that deterministic replay, the system can be heavily instrumented (at great performance overhead) to produce a complete log of all reads and writes of memory values. The results of this profile are then encoded in a spatio-temporal database, and then connected with a suitable modified IDE, allowing backwards-in-time queries to be made with respect to the state of any memory variable. The user can then go to a crash, and find out where (in the past execution) a particular value was written to memory. This is an extraordinarily powerful debugging technique, particularly for memory bugs, and particularly for bugs that have long paths between initial cause and their concrete manifestation.

This program will integrate the rr time-travel debugger into MMTk and demonstrate its use in debugging GC bugs.

Themes

Garbage collection
Debugging
Software engineering

Requirements

Coding This project will involve a modest amount of low-level systems coding in Rust and C++.

Project Length

This project should fit a one or two semester half time student project. Given the depth of the problem of debugging garbage collectors, this could be the beginning of a much larger (masters or PhD-length) project looking at debugging techniques for garbage collectors.

Project Outputs

The project should produce an important addition to MMTk, allowing seamless use of rr when debugging MMTk.