Copyright Notice. The below material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s or organization’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

## Publications

• X. Yang, S. M. Blackburn, and K. S. McKinley, "Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading," in Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC’16), Denver, CO, June 22-24, 2016.
FOR subject classification codes: 080308, 080501
Web services from search to games to stock trading impose strict Service Level Objectives (SLOs) on tail latency. Meeting these objectives is challenging because the computational demand of each request is highly variable and load is bursty. Consequently, many servers run at low utilization (10 to 45%); turn off simultaneous multithreading (SMT); and execute only a single service — wasting hardware, energy, and money. Although co-running batch jobs with latency-critical requests to utilize multiple SMT hardware contexts (lanes) is appealing, unmitigated sharing of core resources induces non-linear effects on tail latency and SLO violations. We introduce principled borrowing to control SMT hardware execution in which batch threads borrow core resources. A batch thread executes in a reserved batch SMT lane when no latency-critical thread is executing in the partner request lane. We instrument batch threads to quickly detect execution in the request lane, step out of the way, and promptly return the borrowed resources. We introduce the nanonap system call to stop the batch thread’s execution without yielding its lane to the OS scheduler, ensuring that requests have exclusive use of the core’s resources. We evaluate our approach for co-locating batch workloads with latency-critical requests using the Apache Lucene search engine. A conservative policy that executes batch threads only when the request lane is idle improves utilization between 90% and 25% on one core depending on load, without compromising request SLOs. Our approach is straightforward, robust, and unobtrusive, opening the way to substantially improved resource utilization in datacenters running latency-critical workloads.
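The stepping-aside protocol the abstract describes can be sketched as follows. This is a toy model with invented names (`BorrowedLane`, `batch_step`): the real system proposes a `nanonap` system call so the batch thread blocks without yielding its SMT lane to the OS scheduler, which a `threading.Event` merely emulates here.

```python
import threading
import time

class BorrowedLane:
    """Toy model of Elfen-style principled borrowing (illustration only)."""

    def __init__(self):
        self.request_active = False     # is the partner request lane busy?
        self._wake = threading.Event()
        self._wake.set()                # batch may run initially
        self.naps = 0

    def request_arrives(self):
        self.request_active = True
        self._wake.clear()              # ask the batch thread to step aside

    def request_departs(self):
        self.request_active = False
        self._wake.set()                # borrowed resources come back

    def batch_step(self, work):
        # Instrumented, cheap check the batch thread performs frequently.
        if self.request_active:
            self.naps += 1
            self._wake.wait()           # "nanonap": give up core resources
        work()

lane = BorrowedLane()
done = []
lane.batch_step(lambda: done.append("batch-0"))   # request lane idle: runs now

lane.request_arrives()
t = threading.Thread(target=lane.batch_step,
                     args=(lambda: done.append("batch-1"),))
t.start()
while lane.naps == 0:                  # wait until the batch thread has parked
    time.sleep(0.001)
lane.request_departs()                 # request finished; batch work resumes
t.join()
```

The key property being modelled is that the check is cheap in the common (untaken) case, so batch throughput is preserved when the request lane is idle.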
@InProceedings{YBM:16,
  author    = {Xi Yang and Stephen M Blackburn and Kathryn S McKinley},
  title     = {Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading},
  booktitle = {Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC'16), Denver, CO, June 22-24, 2016},
  year      = {2016},
}
• Y. Lin, S. M. Blackburn, A. L. Hosking, and M. Norrish, "Rust as a Language for High Performance GC Implementation," in Proceedings of the Sixteenth ACM SIGPLAN International Symposium on Memory Management, ISMM ’16, Santa Barbara, CA, June 13, 2016.
FOR subject classification codes: 080308, 080501
High performance garbage collectors build upon performance-critical low-level code, typically exhibit multiple levels of concurrency, and are prone to subtle bugs. Implementing, debugging and maintaining such collectors can therefore be extremely challenging. The choice of implementation language is a crucial consideration when building a collector. Typically, the drive for performance and the need for efficient support of low-level memory operations leads to the use of low-level languages like C or C++, which offer little by way of safety and software engineering benefits. This risks undermining the robustness and flexibility of the collector design. Rust’s ownership model, lifetime specification, and reference borrowing deliver safety guarantees through a powerful static checker with little runtime overhead. These features make Rust a compelling candidate for a collector implementation language, but they come with restrictions that threaten expressiveness and efficiency. We describe our experience implementing an Immix garbage collector in Rust and C. We discuss the benefits of Rust, the obstacles encountered, and how we overcame them. We show that our Immix implementation has almost identical performance on micro benchmarks, compared to its implementation in C, and outperforms the popular BDW collector on the gcbench micro benchmark. We find that Rust’s safety features do not create significant barriers to implementing a high performance collector. Though memory managers are usually considered low-level, our high performance implementation relies on very little unsafe code, with the vast majority of the implementation benefiting from Rust’s safety. We see our experience as a compelling proof-of-concept of Rust as an implementation language for high performance garbage collection.
@InProceedings{LBH+:16,
  author    = {Yi Lin and Stephen M Blackburn and Antony L Hosking and Michael Norrish},
  title     = {Rust as a Language for High Performance GC Implementation},
  booktitle = {Proceedings of the Sixteenth ACM SIGPLAN International Symposium on Memory Management, ISMM '16, Santa Barbara, CA, June 13, 2016},
  year      = {2016},
}
• I. Jibaja, T. Cao, S. M. Blackburn, and K. S. McKinley, "Portable Performance on Asymmetric Multicore Processors," in Proceedings of the 14th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2016.
FOR subject classification codes: 100605, 080308, 100606
@InProceedings{JCBM:16,
  author    = {Jibaja, Ivan and Cao, Ting and Blackburn, Stephen M and McKinley, Kathryn S.},
  title     = {Portable Performance on Asymmetric Multicore Processors},
  booktitle = {Proceedings of the 14th Annual IEEE/ACM International Symposium on Code Generation and Optimization},
  year      = 2016,
  month     = feb,
  location  = {Barcelona, Spain},
  publisher = {IEEE},
}
• V. Kumar, J. Dolby, and S. M. Blackburn, "Integrating Asynchronous Task Parallelism and Data-centric Atomicity," in Proceedings of PPPJ’16, the 13th International Conference on Principles and Practices of Programming on the Java Platform: virtual machines, languages, and tools, Lugano, Switzerland, 2016.
FOR subject classification codes: 080308, 080501
Processor design has turned toward parallelism and heterogeneous cores to achieve performance and energy efficiency. Developers find high-level languages attractive as they use abstraction to offer productivity and portability over these hardware complexities. Over the past few decades, researchers have developed increasingly advanced mechanisms to deliver performance despite the overheads naturally imposed by this abstraction. Recent work has demonstrated that such mechanisms can be exploited to attack overheads that arise in emerging high-level languages, which provide strong abstractions over parallelism. However, current implementation of existing popular high-level languages, such as Java, offer little by way of abstractions that allow the developer to achieve performance in the face of extensive hardware parallelism. In this paper, we present a small set of extensions to the Java programming language that aims to achieve both high performance and high productivity with minimal programmer effort. We incorporate ideas from languages like X10 and AJ to develop five annotations in Java for achieving asynchronous task parallelism and data-centric concurrency control. These annotations allow the use of a highly efficient implementation of a work-stealing scheduler for task parallelism. We evaluate our proposal by refactoring classes from a number of existing multithreaded open source projects to use our new annotations. Our results suggest that these annotations significantly reduce the programming effort while achieving performance improvements up to 30\% compared to conventional approaches.
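A loose Python analogue of the two ideas combined above (the paper's actual Java annotations differ): tasks are spawned onto a pool that stands in for a work-stealing runtime, and atomicity is declared against the data by giving each object its own lock, rather than writing synchronization at every call site.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Counter:
    """Data-centric concurrency control sketch: the lock lives with the data."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def add(self, n):
        with self._lock:        # every access is mediated by the object's own lock
            self.value += n

counter = Counter()
# Spawn many asynchronous tasks; a work-stealing scheduler would balance
# them across workers, which ThreadPoolExecutor only approximates.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(counter.add, 1) for _ in range(1000)]
    for f in futures:
        f.result()              # propagate any task exception
```

The design point is that correctness (the lock discipline) is attached to the shared object once, so task-spawning code stays free of synchronization boilerplate.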
@InProceedings{KDB:16,
  author    = {Vivek Kumar and Julian Dolby and Stephen M Blackburn},
  title     = {Integrating Asynchronous Task Parallelism and Data-centric Atomicity},
  booktitle = {Proceedings of PPPJ'16, the 13th International Conference on Principles and Practices of Programming on the Java Platform: virtual machines, languages, and tools, Lugano, Switzerland},
  year      = {2016},
}
• S. M. Blackburn, A. Diwan, M. Hauswirth, P. F. Sweeney, J. N. Amaral, T. Brecht, L. Bulej, C. Click, L. Eeckhout, S. Fischmeister, D. Frampton, L. J. Hendren, M. Hind, A. L. Hosking, R. E. Jones, T. Kalibera, N. Keynes, N. Nystrom, and A. Zeller, "The Truth, The Whole Truth, and Nothing But the Truth: A Pragmatic Guide to Assessing Empirical Evaluations," ACM Trans. Program. Lang. Syst., vol. 38, iss. 4, pp. 15:1-15:20, 2016.
 http://dx.doi.org/10.1145/2983574
An unsound claim can misdirect a field, encouraging the pursuit of unworthy ideas and the abandonment of promising ideas. An inadequate description of a claim can make it difficult to reason about the claim, for example, to determine whether the claim is sound. Many practitioners will acknowledge the threat of unsound claims or inadequate descriptions of claims to their field. We believe that this situation is exacerbated, and even encouraged, by the lack of a systematic approach to exploring, exposing, and addressing the source of unsound claims and poor exposition.

This article proposes a framework that identifies three sins of reasoning that lead to unsound claims and two sins of exposition that lead to poorly described claims and evaluations. Sins of exposition obfuscate the objective of determining whether or not a claim is sound, while sins of reasoning lead directly to unsound claims.

Our framework provides practitioners with a principled way of critiquing the integrity of their own work and the work of others. We hope that this will help individuals conduct better science and encourage a cultural shift in our research community to identify and promulgate sound claims.

@article{BDHS+:16,
  author     = {Blackburn, Stephen M. and Diwan, Amer and Hauswirth, Matthias and Sweeney, Peter F. and Amaral, Jos{\'e} Nelson and Brecht, Tim and Bulej, Lubom\'{\i}r and Click, Cliff and Eeckhout, Lieven and Fischmeister, Sebastian and Frampton, Daniel and Hendren, Laurie J. and Hind, Michael and Hosking, Antony L. and Jones, Richard E. and Kalibera, Tomas and Keynes, Nathan and Nystrom, Nathaniel and Zeller, Andreas},
  title      = {The Truth, The Whole Truth, and Nothing But the Truth: A Pragmatic Guide to Assessing Empirical Evaluations},
  journal    = {ACM Trans. Program. Lang. Syst.},
  issue_date = {October 2016},
  volume     = {38},
  number     = {4},
  month      = oct,
  year       = {2016},
  issn       = {0164-0925},
  pages      = {15:1--15:20},
  articleno  = {15},
  numpages   = {20},
  url        = {http://doi.acm.org/10.1145/2983574},
  doi        = {10.1145/2983574},
  acmid      = {2983574},
  publisher  = {ACM},
  address    = {New York, NY, USA},
  keywords   = {Experimental evaluation, experimentation, observation study},
}
• K. Wang, Y. Lin, S. M. Blackburn, M. Norrish, and A. L. Hosking, "Draining the Swamp: Micro Virtual Machines as Solid Foundation for Language Development," in 1st Summit on Advances in Programming Languages (SNAPL 2015), 2015.
 http://dx.doi.org/10.4230/LIPIcs.SNAPL.2015.321
FOR subject classification codes: 080308, 080501
Many of today’s programming languages are broken. Poor performance, lack of features and hard-to-reason-about semantics can cost dearly in software maintenance and inefficient execution. The problem is only getting worse with programming languages proliferating and hardware becoming more complicated.

An important reason for this brokenness is that much of language design is implementation-driven. The difficulties in implementation and insufficient understanding of concepts bake bad designs into the language itself. Concurrency, architectural details and garbage collection are three fundamental concerns that contribute much to the complexities of implementing managed languages.

We propose the micro virtual machine, a thin abstraction designed specifically to relieve implementers of managed languages of the most fundamental implementation challenges that currently impede good design. The micro virtual machine targets abstractions over memory (garbage collection), architecture (compiler backend), and concurrency. We motivate the micro virtual machine and give an account of the design and initial experience of a concrete instance, which we call Mu, built over a two year period. Our goal is to remove an important barrier to performant and semantically sound managed language design and implementation.

@InProceedings{WLB+:15,
  author    = {Kunshan Wang and Yi Lin and Stephen M Blackburn and Michael Norrish and Antony L Hosking},
  title     = {Draining the Swamp: Micro Virtual Machines as Solid Foundation for Language Development},
  booktitle = {1st Summit on Advances in Programming Languages (SNAPL 2015)},
  year      = {2015},
  doi       = {http://dx.doi.org/10.4230/LIPIcs.SNAPL.2015.321},
}
• Y. Lin, K. Wang, S. M. Blackburn, M. Norrish, and A. L. Hosking, "Stop and Go: Understanding Yieldpoint Behavior," in Proceedings of the Fourteenth ACM SIGPLAN International Symposium on Memory Management, ISMM ’15, Portland, OR, June 14, 2015.
 http://dx.doi.org/10.1145/2754169.2754187
FOR subject classification codes: 080308, 080501
Yieldpoints are critical to the implementation of high performance garbage collected languages, yet the design space is not well understood. Yieldpoints allow a running program to be interrupted at well-defined points in its execution, facilitating exact garbage collection, biased locking, on-stack replacement, profiling, and other important virtual machine behaviors. In this paper we identify and evaluate yieldpoint design choices, including previously undocumented designs and optimizations. One of the designs we identify opens new opportunities for very low overhead profiling. We measure the frequency with which yieldpoints are executed and establish a methodology for evaluating the common case execution time overhead. We also measure the median and worst case time-to-yield. We find that Java benchmarks execute about 100 M yieldpoints per second, of which about 1/20000 are taken. The average execution time overhead for untaken yieldpoints on the VM we use ranges from 2.5% to close to zero on modern hardware, depending on the design, and we find that the designs trade off total overhead with worst case time-to-yield. This analysis gives new insight into a critical but overlooked aspect of garbage collector implementation, and identifies a new optimization and new opportunities for very low overhead profiling.
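The polled design described above can be illustrated with a minimal sketch: the "interpreter" loop checks a flag at each back edge, and the thread yields only at that well-defined point, where its state is precisely known. Names here are invented; real VMs compile the check into generated code or use page-protection traps instead of an explicit flag.

```python
yield_requested = False   # the flag the VM raises to stop threads
yields_taken = 0

def vm_service():
    """Stand-in for GC, profiling, biased locking, OSR, etc."""
    global yield_requested
    yield_requested = False       # service done; lower the flag

def yieldpoint():
    """Executed at every loop back edge; almost always untaken."""
    global yields_taken
    if yield_requested:           # the common case is one cheap test
        yields_taken += 1
        vm_service()

total = 0
for i in range(100_000):
    total += i                    # the "application" work
    if i == 50_000:
        yield_requested = True    # the VM asks the thread to stop
    yieldpoint()                  # back-edge yieldpoint
```

The paper's measurements concern exactly this structure: the cost of the untaken test in the loop, and the time between raising the flag and the thread reaching the next yieldpoint.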
@InProceedings{LWB+:15,
  author    = {Yi Lin and Kunshan Wang and Stephen M Blackburn and Michael Norrish and Antony L Hosking},
  title     = {Stop and Go: Understanding Yieldpoint Behavior},
  booktitle = {Proceedings of the Fourteenth ACM SIGPLAN International Symposium on Memory Management, ISMM '15, Portland, OR, June 14, 2015},
  year      = {2015},
  doi       = {http://dx.doi.org/10.1145/2754169.2754187},
}
• X. Yang, S. M. Blackburn, and K. S. McKinley, "Computer Performance Microscopy with Shim," in ISCA ’15: The 42nd International Symposium on Computer Architecture, 2015.
FOR subject classification codes: 100605, 080308, 100606
Developers and architects spend a lot of time trying to understand and eliminate performance problems. Unfortunately, the root causes of many problems occur at a fine granularity that existing continuous profiling and direct measurement approaches cannot observe. This paper presents the design and implementation of Shim, a continuous profiler that samples at resolutions as fine as 15 cycles; three to five orders of magnitude finer than current continuous profilers. Shim’s fine-grain measurements reveal new behaviors, such as variations in instructions per cycle (IPC) within the execution of a single function. A Shim observer thread executes and samples autonomously on unutilized hardware. To sample, it reads hardware performance counters and memory locations that store software state. Shim improves its accuracy by automatically detecting and discarding samples affected by measurement skew. We measure Shim’s observer effects and show how to analyze them. When on a separate core, Shim can continuously observe one software signal with a 2% overhead at a ~1200 cycle resolution. At an overhead of 61%, Shim samples one software signal on the same core with SMT at a ~15 cycle resolution. Modest hardware changes could significantly reduce overheads and add greater analytical capability to Shim. We vary prefetching and DVFS policies in case studies that show the diagnostic power of fine-grain IPC and memory bandwidth results. By repurposing existing hardware, we deliver a practical tool for fine-grain performance microscopy for developers and architects.
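The skew-filtering idea can be sketched in a few lines: an observer reads a timestamp around each signal read and discards samples whose read itself took too long, since a perturbed read pairs time and value unreliably. The fake clock and signal below are invented for a deterministic demonstration; the real tool reads hardware cycle counters and performance counters.

```python
def sample(clock, read_signal, n, skew_threshold):
    """Collect n (time, value) samples, dropping skew-affected ones."""
    samples = []
    for _ in range(n):
        t0 = clock()
        value = read_signal()
        t1 = clock()
        if t1 - t0 <= skew_threshold:   # keep only unperturbed samples
            samples.append((t0, value))
    return samples

# Deterministic demo: every third read is "slow" and should be discarded.
ticks = iter(range(1000))
state = {"count": 0}

def fake_clock():
    return next(ticks)

def fake_signal():
    state["count"] += 1
    if state["count"] % 3 == 0:
        for _ in range(10):             # simulate a perturbed, slow read
            next(ticks)
    return state["count"]

kept = sample(fake_clock, fake_signal, 9, skew_threshold=1)
```

Injecting the clock as a parameter also makes the filtering policy itself testable, which is one practical upshot of treating skew detection as part of the profiler's design.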
@InProceedings{XBM:15,
  author    = {Yang, Xi and Blackburn, Stephen M. and McKinley, Kathryn S.},
  title     = {Computer Performance Microscopy with {Shim}},
  booktitle = {ISCA '15: The 42nd International Symposium on Computer Architecture},
  year      = {2015},
  location  = {Portland, OR},
  publisher = {IEEE},
}
• J. Sartor, W. Heirman, S. M. Blackburn, L. Eeckhout, and K. S. McKinley, "Cooperative Cache Scrubbing," in PACT ’14 Proceedings of the 23rd International Conference on Parallel Architectures and Compilation Techniques, 2014.
FOR subject classification codes: 080308, 100604, 100605
Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multi-threaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10 to 60% of DRAM writes are useless, because the data on these lines are dead – the program is guaranteed to never read them again. Furthermore, reading memory only to immediately zero initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache coherence protocol invariants and demonstrate them in a Java Virtual Machine and multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing reduces memory bandwidth and improves energy efficiency, two critical problems in modern systems.
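The software side of the idea, deciding which cache lines a dead region makes eligible for scrubbing, can be sketched as address arithmetic. Only lines wholly inside the dead region may be invalidated without writeback; a line partly holding live data must still be written back. This is an invented illustration of that one step; the paper's contribution also includes ISA and coherence-protocol support that no user-level code can express.

```python
LINE = 64  # bytes per cache line

def scrubbable_lines(start, size):
    """Return start addresses of cache lines wholly inside [start, start+size)."""
    end = start + size
    first = -(-start // LINE) * LINE   # round start up to a line boundary
    last = (end // LINE) * LINE        # round end down to a line boundary
    return list(range(first, last, LINE))

# A dead object occupying bytes [100, 400): its interior lines can be
# scrubbed, while the partial lines at each end cannot.
lines = scrubbable_lines(100, 300)
```

A memory manager would run this over regions it knows are dead (e.g. after collection) and issue the paper's proposed scrub instructions for each returned line.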
@InProceedings{SHB+:14,
  author    = {Jennifer Sartor and Wim Heirman and Stephen M. Blackburn and Lieven Eeckhout and Kathryn S McKinley},
  title     = {Cooperative Cache Scrubbing},
  booktitle = {PACT '14 Proceedings of the 23rd International Conference on Parallel Architectures and Compilation Techniques},
  year      = {2014},
  month     = {August},
  publisher = {IEEE},
  location  = {Edmonton, Canada},
}
• V. Kumar, S. M. Blackburn, and D. Grove, "Friendly Barriers: Efficient Work-Stealing With Return Barriers," in VEE’14: Proceedings of the 2014 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2014, pp. 165-176.
 http://dx.doi.org/10.1145/2576195.2576207
FOR subject classification codes: 080308, 080501
This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Recent work identified sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead. That work was able to reduce sequential overheads to just 15%.

In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim’s stack when a steal takes place. We exploit the idea of a low overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Because, unlike prior work, we attack the overheads directly due to stealing and therefore attack the overheads that grow as parallelism grows, we improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the increasing pressure due to increasing intra-node hardware parallelism.
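The saving from a return barrier can be modelled with a toy victim stack. A thief must walk the victim's stack for the next stealable frame; with a barrier, the scan position from the previous steal is remembered (the victim would trap and move it when returning past that frame), so later steals skip the already-scanned prefix instead of re-walking the whole stack. All names here are invented, and the model ignores synchronization entirely.

```python
class VictimStack:
    """Toy model of stack introspection with a remembered scan position."""

    def __init__(self, depth):
        self.frames = list(range(depth))  # index 0 is the oldest frame
        self.barrier = None               # frames below this were already scanned
        self.frames_scanned = 0           # bookkeeping: total scan work done

    def steal(self):
        start = 0 if self.barrier is None else self.barrier
        # Scan upward from the last barrier for the next stealable frame.
        for i in range(start, len(self.frames)):
            self.frames_scanned += 1
            if self.frames[i] is not None:       # found unstolen work
                stolen, self.frames[i] = self.frames[i], None
                self.barrier = i + 1             # the next steal resumes here
                return stolen
        return None

victim = VictimStack(depth=100)
stolen = [victim.steal() for _ in range(3)]      # three consecutive steals
```

Without the remembered position, each steal would rescan the stolen (now empty) prefix, so scan work would grow with the number of steals, which is exactly the dynamic overhead the paper attacks.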

@InProceedings{KBG:14,
  author    = {Vivek Kumar and Stephen M Blackburn and David Grove},
  title     = {Friendly Barriers: Efficient Work-Stealing With Return Barriers},
  booktitle = {VEE'14: Proceedings of the 2014 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},
  year      = {2014},
  volume    = {47},
  doi       = {http://dx.doi.org/10.1145/2576195.2576207},
  pages     = {165--176},
}
• R. Shahriyar, S. M. Blackburn, and K. S. McKinley, "Fast Conservative Garbage Collection," in OOPSLA ’14: Proceedings of the 25th ACM SIGPLAN conference on object oriented programming systems languages and applications, 2014.
 http://dx.doi.org/10.1145/2660193.2660198
FOR subject classification codes: 080308, 100604
Garbage collectors are exact or conservative. An exact collector identifies all references precisely and may move referents and update references, whereas a conservative collector treats one or more of stack, register, and heap references as ambiguous. Ambiguous references constrain collectors in two ways. (1) Since they may be pointers, the collectors must retain referents. (2) Since they may be values, the collectors cannot modify them, pinning their referents.

We explore conservative collectors for managed languages, with ambiguous stacks and registers. We show that for Java benchmarks they retain and pin remarkably few heap objects: <0.01% are falsely retained and 0.03% are pinned. The larger effect is collector design. Prior conservative collectors (1) use mark-sweep and unnecessarily forgo moving all objects, or (2) use mostly copying and pin entire pages. Compared to generational collection, overheads are substantial: 12% and 45% respectively. We introduce high performance conservative Immix and reference counting (RC). Immix is a mark-region collector with fine line-grain pinning and opportunistic copying of unambiguous referents. Deferred RC simply needs an object map to deliver the first conservative RC. We implement six exact collectors and their conservative counterparts. Conservative Immix and RC come within 2 to 3% of their exact counterparts. In particular, conservative RC Immix is slightly faster than a well-tuned exact generational collector. These findings show that for managed languages, conservative collection is compatible with high performance.
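The ambiguous-reference rule at the heart of this design can be stated in a few lines: any stack word whose value equals a live object's address must be treated as a possible pointer, so the object is both retained and pinned (the word might really be an integer, so the collector may not rewrite it by moving the referent). This sketch with invented names shows only that filtering step, not the Immix heap organization.

```python
def scan_conservatively(stack_words, heap_objects):
    """Treat stack words as ambiguous references against a set of object addresses."""
    retained = set()
    pinned = set()
    for word in stack_words:
        if word in heap_objects:   # the word "looks like" an object address
            retained.add(word)     # must be kept alive, in case it is a pointer
            pinned.add(word)       # must not move, in case it is a plain value
    return retained, pinned

heap = {0x1000, 0x2000, 0x3000}              # addresses of live objects
stack = [0x1000, 42, 0x9999, 0x3000]         # two ambiguous hits, two plain values
retained, pinned = scan_conservatively(stack, heap)
```

The paper's measurement is that, for Java workloads, this rule fires rarely enough (<0.01% falsely retained, 0.03% pinned) that the rest of the heap can still be managed exactly.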

@InProceedings{SBM:14,
  author    = {Shahriyar, Rifat and Blackburn, Stephen M. and McKinley, Kathryn S.},
  title     = {Fast Conservative Garbage Collection},
  booktitle = {OOPSLA '14: Proceedings of the 25th ACM SIGPLAN conference on object oriented programming systems languages and applications},
  location  = {Portland, OR, USA},
  year      = {2014},
  month     = {oct},
  doi       = {http://dx.doi.org/10.1145/2660193.2660198},
}
• T. Gao, K. Strauss, S. M. Blackburn, K. S. McKinley, D. Burger, and J. Larus, "Using Managed Runtime Systems to Tolerate Holes in Wearable Memories," in PLDI ’13: Proceedings of the 34th ACM SIGPLAN conference on Programming Language Design and Implementation, Seattle, WA, USA, June 2013.
FOR subject classification codes: 080308, 100604, 100605
New memory technologies, such as phase-change memory (PCM), promise denser and cheaper main memory, and are expected to displace DRAM. However, many of them experience permanent failures far more quickly than DRAM. DRAM mechanisms that handle permanent failures rely on very low failure rates and, if directly applied to PCM, are extremely inefficient: Discarding a page when the first line fails wastes 98% of the memory.

This paper proposes low complexity cooperative software and hardware that handle failure rates as high as 50%. Our approach makes error handling transparent to the application by using the memory abstraction offered by managed languages. Once hardware error correction for a memory line is exhausted, rather than discarding the entire page, the hardware communicates the failed line to a failure-aware OS and runtime. The runtime ensures memory allocations never use failed lines and moves data when lines fail during program execution. This paper describes minimal extensions to an Immix mark-region garbage collector, which correctly utilizes pages with failed physical lines by skipping over failures. This paper also proposes hardware support that clusters failed lines at one end of a memory region to reduce fragmentation and improve performance under failures. Contrary to accepted hardware wisdom that advocates for wear-leveling, we show that with software support non-uniform failures delay the impact of memory failure. Together, these mechanisms incur no performance overhead when there are no failures and at failure levels of 10% to 50% suffer only an average overhead of 4% and 12%, respectively. These results indicate that hardware and software cooperation can greatly extend the life of wearable memories.
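The runtime side of "never use failed lines" reduces to a bump allocator that consults a failure map. The sketch below is an invented simplification (objects are assumed to fit in one line, and the line size is arbitrary); the paper's actual extension is to the Immix collector's line/block allocation.

```python
LINE = 256  # bytes per memory line in this sketch (not the hardware value)

def allocate(region_size, failed_lines, obj_size):
    """Bump-allocate obj_size objects in [0, region_size), skipping failed lines."""
    placed = []
    cursor = 0
    while cursor + obj_size <= region_size:
        line = cursor // LINE
        if line in failed_lines:
            cursor = (line + 1) * LINE    # skip the whole failed line
            continue
        if (cursor + obj_size - 1) // LINE != line:
            cursor = (line + 1) * LINE    # don't straddle into the next line
            continue
        placed.append(cursor)
        cursor += obj_size
    return placed

# A 1 KiB region whose lines 1 and 3 have failed: allocation simply
# flows around the holes.
addrs = allocate(region_size=1024, failed_lines={1, 3}, obj_size=128)
```

Because the garbage collector already moves objects, data that lands on a line that fails later can be relocated by the same machinery, which is why a managed runtime absorbs these failures so cheaply.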

@InProceedings{GSB+:13,
  author    = {Gao, Tiejun and Strauss, Karin and Blackburn, Stephen M. and McKinley, Kathryn S. and Burger, Doug and Larus, James},
  title     = {Using Managed Runtime Systems to Tolerate Holes in Wearable Memories},
  booktitle = {PLDI '13: Proceedings of the 34th ACM SIGPLAN conference on Programming Language Design and Implementation, Seattle, WA, USA, June, 2013},
  year      = {2013},
  month     = {June},
  series    = {ACM SIGPLAN Notices},
  volume    = {48(6)},
  publisher = {ACM Press},
}
• R. Shahriyar, S. M. Blackburn, X. Yang, and K. S. McKinley, "Taking Off the Gloves with Reference Counting Immix," in OOPSLA ’13: Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, 2013.
 http://dx.doi.org/10.1145/2509136.2509527
FOR subject classification codes: 080308, 100604
Despite some clear advantages and recent advances, reference counting remains a poor cousin to high-performance tracing garbage collectors. The advantages of reference counting include a) immediacy of reclamation, b) incrementality, and c) local scope of its operations. After decades of languishing with hopelessly bad performance, recent work narrowed the gap between reference counting and the fastest tracing collectors to within 10%. Though a major advance, this gap remains a substantial barrier to adoption in performance-conscious application domains.

Our work identifies heap organization as the principal source of the remaining performance gap. We present the design, implementation, and analysis of a new collector, RC Immix, that replaces reference counting’s traditional free-list heap organization with the line and block heap structure introduced by the Immix collector. The key innovations of RC Immix are 1) to combine traditional reference counts with per-line live object counts to identify reusable memory and 2) to eliminate fragmentation by integrating copying with reference counting of new objects and with backup tracing cycle collection. In RC Immix, reference counting offers efficient collection and the line and block heap organization delivers excellent mutator locality and efficient allocation. With these advances, RC Immix closes the 10% performance gap, matching the performance of a highly tuned production generational collector. By removing the performance barrier, this work transforms reference counting into a serious alternative for meeting high performance objectives for garbage collected languages.
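Innovation (1) above, pairing per-object reference counts with per-line live counts, can be sketched with a toy heap. Names are invented, and copying, cycles, and the block level are all omitted; the point is only that a line whose live count reaches zero becomes reusable memory without any tracing.

```python
LINE = 128  # bytes per line in this sketch

class LineCountedHeap:
    """Toy version of RC Immix's per-line live object counts."""

    def __init__(self):
        self.rc = {}          # object address -> reference count
        self.line_live = {}   # line index -> number of live objects on it

    def new_object(self, addr):
        self.rc[addr] = 1
        line = addr // LINE
        self.line_live[line] = self.line_live.get(line, 0) + 1

    def inc(self, addr):
        self.rc[addr] += 1

    def dec(self, addr):
        self.rc[addr] -= 1
        if self.rc[addr] == 0:            # object is dead
            del self.rc[addr]
            self.line_live[addr // LINE] -= 1

    def reusable_lines(self):
        # Lines with no live objects may be handed back to the allocator.
        return sorted(l for l, n in self.line_live.items() if n == 0)

heap = LineCountedHeap()
for addr in (0, 64, 128):   # two objects on line 0, one on line 1
    heap.new_object(addr)
heap.dec(0)
heap.dec(64)                # line 0 now holds no live objects
```

Reclaiming whole lines this way is what lets RC Immix keep reference counting's incrementality while gaining Immix's bump-allocation locality.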

@InProceedings{SBYM:13,
  author    = {Shahriyar, Rifat and Blackburn, Stephen M. and Yang, Xi and McKinley, Kathryn S.},
  title     = {Taking Off the Gloves with Reference Counting Immix},
  booktitle = {OOPSLA '13: Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications},
  location  = {Indianapolis, IN, USA},
  year      = {2013},
  month     = {oct},
  doi       = {http://dx.doi.org/10.1145/2509136.2509527},
}
• R. Shahriyar, S. M. Blackburn, and D. Frampton, "Down for the Count? Getting Reference Counting Back in the Ring," in Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM ’12, Beijing, China, June 15-16, 2012.
 http://dx.doi.org/10.1145/2258996.2259008
FOR subject classification codes: 080308, 100604
Reference counting and tracing are the two fundamental approaches that have underpinned garbage collection since 1960. However, despite some compelling advantages, reference counting is almost completely ignored in implementations of high performance systems today. In this paper we take a detailed look at reference counting to understand its behavior and to improve its performance. We identify key design choices for reference counting and analyze how the behavior of a wide range of benchmarks might affect design decisions. As far as we are aware, this is the first such quantitative study of reference counting. We use insights gleaned from this analysis to introduce a number of optimizations that significantly improve the performance of reference counting.

We find that an existing modern implementation of reference counting has an average 30% overhead compared to tracing, and that in combination, our optimizations are able to completely eliminate that overhead. This brings the performance of reference counting on par with that of a well tuned mark-sweep collector. We keep our in-depth analysis of reference counting as general as possible so that it may be useful to other garbage collector implementers. Our finding that reference counting can be made directly competitive with well tuned mark-sweep should shake the community’s prejudices about reference counting and perhaps open new opportunities for exploiting reference counting’s strengths, such as localization and immediacy of reclamation.
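The "immediacy of reclamation" strength mentioned above is easy to demonstrate with a minimal reference-counting sketch (invented names; no deferral, coalescing, or cycle handling, all of which real designs add): dropping the last reference reclaims an object, and its decrements cascade, with no pause to trace the heap.

```python
class Heap:
    """Naive reference counting with immediate, cascading reclamation."""

    def __init__(self):
        self.rc = {}        # object name -> reference count
        self.fields = {}    # object name -> referents it holds
        self.freed = []     # reclamation order, for inspection

    def alloc(self, name, referents=()):
        self.rc[name] = 1   # one reference from the creating root
        self.fields[name] = list(referents)
        for r in referents:
            self.rc[r] += 1
        return name

    def drop(self, name):
        """A reference to `name` goes away."""
        self.rc[name] -= 1
        if self.rc[name] == 0:
            self.freed.append(name)
            for r in self.fields.pop(name):
                self.drop(r)            # decrements cascade immediately
            del self.rc[name]

h = Heap()
h.alloc("leaf")
h.alloc("node", referents=["leaf"])
h.drop("leaf")   # root drops leaf; it survives via node
h.drop("node")   # last reference gone: node, then leaf, reclaimed at once
```

The paper's optimizations target exactly the costs this naive version incurs on every pointer store, which is where the 30% overhead relative to tracing comes from.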

@InProceedings{SBF:12,
  author    = {Shahriyar, Rifat and Blackburn, Stephen M. and Frampton, Daniel},
  title     = {Down for the Count? {G}etting Reference Counting Back in the Ring},
  booktitle = {Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM '12, Beijing, China, June 15-16},
  year      = {2012},
  results   = {rc-ismm-2012.zip},
  month     = {jun},
  doi       = {http://dx.doi.org/10.1145/2258996.2259008},
  location  = {Beijing, China},
}
• Y. Lin, S. M. Blackburn, and D. Frampton, "Unpicking The Knot: Teasing Apart VM/Application Interdependencies," in VEE ’12: Proceedings of the 2012 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, New York, NY, USA, 2012.
 http://dx.doi.org/10.1145/2151024.2151048
FOR subject classification codes: 080308, 080309
Flexible and efficient runtime design requires an understanding of the dependencies among the components internal to the runtime and those between the application and the runtime. These dependencies are frequently unclear. This problem exists in all runtime design, and is most vivid in a metacircular runtime — one that is implemented in terms of itself. Metacircularity blurs boundaries between application and runtime implementation, making it harder to understand and make guarantees about overall system behavior, affecting isolation, security, and resource management, as well as reducing opportunities for optimization. Our goal is to shed new light on VM interdependencies, helping all VM designers understand these dependencies and thereby engineer better runtimes.

We explore these issues in the context of a high-performance Java-in-Java virtual machine. Our approach is to identify and instrument transition points into and within the runtime, which allows us to establish a dynamic execution context. Our contributions are: 1) implementing and measuring a system that dynamically maintains execution context with very low overhead, 2) demonstrating that such a framework can be used to improve the software engineering of an existing runtime, and 3) analyzing the behavior and runtime characteristics of our runtime across a wide range of benchmarks. Our solution provides clarity about execution state and allowable transitions, making it easier to develop, debug, and understand managed runtimes.

@InProceedings{LBF:12,   author = {Lin, Yi and Blackburn, Stephen M. and Frampton, Daniel},   title = {Unpicking The Knot: Teasing Apart VM/Application Interdependencies},   booktitle = {VEE '12: Proceedings of the 2012 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},   year = {2012},   location = {London, UK},   doi = {http://dx.doi.org/10.1145/2151024.2151048},   publisher = {ACM},   address = {New York, NY, USA},   }
• T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley, "The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software," in ISCA ‘12: The 39th International Symposium on Computer Architecture, 2012.
FOR subject classification codes: 100605, 080308, 100606
On the hardware side, asymmetric multicore processors present software with the challenge and opportunity of optimizing in two dimensions: performance and power. Asymmetric multicore processors (AMP) combine general-purpose big (fast, high power) cores and small (slow, low power) cores to meet power constraints. Realizing their energy efficiency opportunity requires workloads with differentiated performance and power characteristics.

On the software side, managed workloads written in languages such as C#, Java, JavaScript, and PHP are ubiquitous. Managed languages abstract over hardware using Virtual Machine (VM) services (garbage collection, interpretation, and/or just-in-time compilation) that together impose substantial energy and performance costs, ranging from 10% to over 80%. We show that these services manifest a differentiated performance and power workload. To differing degrees, they are parallel, asynchronous, communicate infrequently, and are not on the application’s critical path.

We identify a synergy between AMP and VM services that we exploit to attack the 40% average energy overhead due to VM services. Using measurements and very conservative models, we show that adding small cores tailored for VM services should deliver, at least, improvements in performance of 13%, energy of 7%, and performance per energy of 22%. The yin of VM services is overhead, but it meets the yang of small cores on an AMP. The yin of AMP is exposed hardware complexity, but it meets the yang of abstraction in managed languages. VM services fulfill the AMP requirement for an asynchronous, non-critical, differentiated, parallel, and ubiquitous workload to deliver energy efficiency. Generalizing this approach beyond system software to applications will require substantially more software and hardware investment, but these results show the potential energy efficiency gains are significant.

@InProceedings{CBGM:12,   author = {Cao, Ting and Blackburn, Stephen M. and Gao, Tiejun and McKinley, Kathryn S.},   title = {The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software},   booktitle = {ISCA '12: The 39th International Symposium on Computer Architecture},   year = {2012},   location = {Portland, OR},   results = {yinyang-isca-2012.zip},   publisher = {IEEE},   }
• X. Yang, S. M. Blackburn, D. Frampton, and A. L. Hosking, "Barriers Reconsidered, Friendlier Still!," in Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM ‘12, Beijing, China, June 15-16, 2012.
 http://dx.doi.org/10.1145/2258996.2259004
FOR subject classification codes: 080308, 100604
Read and write barriers mediate access to the heap allowing the collector to control and monitor mutator actions. For this reason, barriers are a powerful tool in the design of any heap management algorithm, but the prevailing wisdom is that they impose significant costs. However, changes in hardware and workloads make these costs a moving target. Here, we measure the cost of a range of useful barriers on a range of modern hardware and workloads. We confirm some old results and overturn others. We evaluate the microarchitectural sensitivity of barrier performance and the differences among benchmark suites. We also consider barriers in context, focusing on their behavior when used in combination, and investigate a known pathology and evaluate solutions. Our results show that read and write barriers have average overheads as low as 5.4% and 0.9% respectively. We find that barrier overheads are more exposed on the workload provided by the modern DaCapo benchmarks than on old SPECjvm98 benchmarks. Moreover, there are differences in barrier behavior between in-order and out-of-order machines, and their respective memory subsystems, which indicate different barrier choices for different platforms. These changing costs mean that algorithm designers need to reconsider their design choices and the nature of their resulting algorithms in order to exploit the opportunities presented by modern hardware.
@InProceedings{YBFH:12,   author = {Yang, Xi and Blackburn, Stephen M. and Frampton, Daniel and Hosking, Antony L.},   title = {Barriers Reconsidered, Friendlier Still!},   booktitle = {Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM '12, Beijing, China, June 15-16},   year = {2012},   results = {barrier-ismm-2012.zip},   month = {jun},   doi = {http://dx.doi.org/10.1145/2258996.2259004},   location = {Beijing, China},   }
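As one concrete example of the kind of barrier measured above, a generational write barrier that logs old-to-young pointer stores in a remembered set can be sketched as follows (hypothetical names, not the paper's code):

```python
# Illustrative sketch of a generational write barrier: every pointer store is
# mediated so that old-to-young references are logged in a remembered set,
# letting a nursery collection avoid scanning the entire old space.

OLD, YOUNG = "old", "young"
remembered_set = set()

class Obj:
    def __init__(self, name, gen):
        self.name, self.gen = name, gen
        self.fields = {}

def write_barrier(src, slot, target):
    src.fields[slot] = target                     # fast path: the plain store
    if src.gen == OLD and target.gen == YOUNG:    # slow path: log the crossing ref
        remembered_set.add(src.name)

old_obj = Obj("o", OLD)
young_obj = Obj("y", YOUNG)
write_barrier(old_obj, "f", young_obj)   # old-to-young store: logged
write_barrier(young_obj, "g", old_obj)   # young-to-old store: not logged
```

The cost the paper measures is precisely the conditional work added to every such store on real hardware.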
• H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, "What is Happening to Power, Performance, and Software?," IEEE Micro, vol. 32, pp. 110-121, 2012.
 http://doi.ieeecomputersociety.org/10.1109/MM.2012.20
FOR subject classification codes: 100605, 080308, 100606
Systematically exploring power, performance and energy sheds new light on the clash of two trends that unfolded over the past decade: the rise of parallel processors in response to technology constraints on power, clock speed, and wire delay; and the rise of managed high-level, portable programming languages.
@article{ECYBM:12b,   author = {Esmaeilzadeh, Hadi and Cao, Ting and Yang, Xi and Blackburn, Stephen M. and McKinley, Kathryn S},   title = {What is Happening to Power, Performance, and Software?},   journal ={IEEE Micro},   volume = {32},   issn = {0272-1732},   year = {2012},   pages = {110-121},   doi = {http://doi.ieeecomputersociety.org/10.1109/MM.2012.20},   publisher = {IEEE Computer Society},   address = {Los Alamitos, CA, USA},   }
• V. Kumar and S. M. Blackburn, "Faster Work Stealing With Return Barriers," in The 6th workshop on Virtual Machines and Intermediate Languages (VMIL 2012), Tucson, AZ, October 21, 2012, 2012.
FOR subject classification codes: 080308, 080501
Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Our prior work demonstrates that using the exception handling mechanism of modern VMs and gathering the runtime information directly from the victim’s execution stack can significantly reduce these overheads.

In this paper we identify the overhead associated with managing the work-stealing related information on a victim’s execution stack. A return barrier is a mechanism for intercepting the popping of a stack frame, and thus is a powerful tool for optimizing mechanisms that involve scanning of stack state. We present the design and preliminary findings of using return barriers on a victim’s execution stack to reduce these overheads. We evaluate our design using classical work-stealing benchmarks. On these benchmarks, compared to our prior design, we are able to reduce the overheads by as much as 58%. These preliminary findings give further hope to an already promising technique of harnessing rich features of a modern VM inside a work-stealing scheduler.

@InProceedings{KB:12,   author = {Vivek Kumar and Stephen M Blackburn},   title = {Faster Work Stealing With Return Barriers},   booktitle = {The 6th workshop on Virtual Machines and Intermediate Languages (VMIL 2012), Tucson, AZ, October 21, 2012},   year = {2012},   volume = {47},   note = {Unpublished},   }

• H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, "Looking back and looking forward: power, performance, and upheaval," Communications of the ACM, vol. 55, iss. 7, pp. 105-114, 2012.
 http://dx.doi.org/10.1145/2209249.2209272
FOR subject classification codes: 100605, 080308, 100606
The past 10 years have delivered two significant revolutions. (1) Microprocessor design has been transformed by the limits of chip power, wire latency, and Dennard scaling—leading to multicore processors and heterogeneity. (2) Managed languages and an entirely new software landscape emerged—revolutionizing how software is deployed, is sold, and interacts with hardware. Researchers most often examine these changes in isolation. Architects mostly grapple with microarchitecture design through the narrow software context of native sequential SPEC CPU benchmarks, while language researchers mostly consider microarchitecture in terms of performance alone. This work explores the clash of these two revolutions over the past decade by measuring power, performance, energy, and scaling, and considers what the results may mean for the future. Our diverse findings include the following: (a) native sequential workloads do not approximate managed workloads or even native parallel workloads; (b) diverse application power profiles suggest that future applications and system software will need to participate in power optimization and management; and (c) software and hardware researchers need access to real measurements to optimize for power and energy.
@article{ECYBM:12,   author = {Esmaeilzadeh, Hadi and Cao, Ting and Yang, Xi and Blackburn, Stephen M. and McKinley, Kathryn S.},   title = {Looking back and looking forward: power, performance, and upheaval},   journal = {Communications of the ACM},   issue_date = {July 2012},   volume = {55},   number = {7},   month = jul, year = {2012},   issn = {0001-0782},   pages = {105--114},   numpages = {10},   doi = {http://dx.doi.org/10.1145/2209249.2209272},   acmid = {2209272},   publisher = {ACM},   address = {New York, NY, USA},   }
• Y. Lin and S. M. Blackburn, "Bypassing Portability Pitfalls of High-level Low-level Programming," in The 6th workshop on Virtual Machines and Intermediate Languages (VMIL 2012), Tucson, AZ, October 21, 2012, New York, NY, USA, 2012.
FOR subject classification codes: 080308, 080309
Program portability is an important software engineering consideration. However, when high-level languages are extended to effectively implement system projects for software engineering gain and safety, portability is compromised—high-level code for low-level programming cannot execute on a stock runtime, and, conversely, a runtime with such special support will not be portable across platforms.

We explore the portability pitfalls of high-level low-level programming in the context of virtual machine implementation tasks. Our approach is to design a restricted high-level language called RJava, with a flexible restriction model and effective low-level extensions, suitable both for different scopes of virtual machine implementation and as a bypass of low-level languages for improved portability. Apart from designing such a language, another major outcome of this work is to clarify and sharpen the philosophy around language restriction in virtual machine design. In combination, our approach to solving portability pitfalls with RJava benefits virtual machine design and implementation in both portability and robustness.

@InProceedings{LB:12,   author = {Lin, Yi and Blackburn, Stephen M.},   title = {Bypassing Portability Pitfalls of High-level Low-level Programming},   booktitle = {The 6th workshop on Virtual Machines and Intermediate Languages (VMIL 2012), Tucson, AZ, October 21, 2012},   year = {2012},   location = {Tucson, AZ},   publisher = {ACM},   address = {New York, NY, USA},   }
• V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu, "Work-Stealing Without The Baggage," in Proceedings of the 2012 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA 2012), Tucson, AZ, October 19-26, 2012, 2012.
 http://dx.doi.org/10.1145/2384616.2384639
FOR subject classification codes: 080308, 080501
Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. Prior work has demonstrated that work-stealing is very effective in practice. However, work-stealing comes with a substantial overhead: as much as 2X to 12X slowdown over orthodox sequential code.

In this paper we identify the key sources of overhead in work-stealing schedulers and present two significant refinements to their implementation. We evaluate our work-stealing designs using a range of benchmarks, four different work-stealing implementations, including the popular fork-join framework, and a range of architectures. On these benchmarks, compared to orthodox sequential Java, our fastest design has an overhead of just 15%. By contrast, fork-join has a 2.3X overhead and the previous implementation of the system we use has an overhead of 4.1X. These results and our insight into the sources of overhead for work-stealing implementations give further hope to an already promising technique for exploiting increasingly available hardware parallelism.

@InProceedings{KFB+:12,   author = {Vivek Kumar and Daniel Frampton and Stephen M Blackburn and David Grove and Olivier Tardieu},   title = {Work-Stealing Without The Baggage},   booktitle = {Proceedings of the 2012 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA 2012), Tucson, AZ, October 19-26, 2012},   year = {2012},   volume = {47},   number = {10},   series = {SIGPLAN Notices},   month = {October},   publisher = {ACM},   doi = {http://dx.doi.org/10.1145/2384616.2384639},   }
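The core deque discipline underlying work-stealing schedulers like the one refined above can be sketched as follows (an illustrative single-threaded model; production implementations such as Chase-Lev deques are lock-free and concurrent):

```python
from collections import deque

# Minimal sketch of a work-stealing deque: the owning worker pushes and pops
# tasks at the bottom (LIFO, cache-friendly), while an idle worker steals from
# the top (FIFO, taking the oldest and typically largest unit of work).
# Illustrative only; not the paper's X10/Jikes RVM implementation.

class WorkStealingDeque:
    def __init__(self):
        self.tasks = deque()

    def push(self, task):          # owner: add newest work at the bottom
        self.tasks.append(task)

    def pop(self):                 # owner: take newest work first
        return self.tasks.pop() if self.tasks else None

    def steal(self):               # thief: take oldest work from the top
        return self.tasks.popleft() if self.tasks else None

q = WorkStealingDeque()
for t in ("t1", "t2", "t3"):
    q.push(t)

stolen = q.steal()    # a thief takes the oldest task, "t1"
local = q.pop()       # the owner continues with the newest, "t3"
```

The overheads the paper attacks come from maintaining enough state at each `push` that a concurrent thief can safely execute `steal` against a victim running at full speed.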
• H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, "Looking Back on the Language and Hardware Revolutions: Measured Power, Performance, and Scaling," in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, Newport Beach, CA, USA, March 5 – 11, 2011.
 http://dx.doi.org/10.1145/1961296.1950402
FOR subject classification codes: 100605, 080308, 100606
This paper reports and analyzes measured chip power and performance on five process technology generations executing 61 diverse benchmarks with a rigorous methodology. We measure representative Intel IA32 processors with technologies ranging from 130nm to 32nm while they execute sequential and parallel benchmarks written in native and managed languages. During this period, hardware and software changed substantially: (1) hardware vendors delivered chip multiprocessors instead of uniprocessors, and independently (2) software developers increasingly chose managed languages instead of native languages. This quantitative data reveals the extent of some known and previously unobserved hardware and software trends.

Two themes emerge. (I) Workload: The power, performance, and energy trends of native workloads do not approximate managed workloads. For example, (a) the SPEC CPU2006 native benchmarks on the i7-920 and i5-670 draw significantly less power than managed or scalable native benchmarks; and (b) managed runtimes exploit parallelism even when running single-threaded applications. The results recommend that architects always include native and managed workloads when designing and evaluating energy efficient designs. (II) Architecture: Clock scaling, microarchitecture, simultaneous multithreading, and chip multiprocessors each elicit a huge variety of processor power, performance, and energy responses. This variety and the difficulty of obtaining power measurements recommend exposing on-chip power meters and, when possible, structure-specific power meters for cores, caches, and other structures. Just as hardware event counters provide a quantitative grounding for performance innovations, power meters are necessary for optimizing energy.

@InProceedings{EBCYM:11,   author = {Hadi Esmaeilzadeh and Ting Cao and Xi Yang and Stephen M. Blackburn and Kathryn S. McKinley},   title = {Looking Back on the Language and Hardware Revolutions: Measured Power, Performance, and Scaling},   booktitle = {Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, Newport Beach, CA, USA, March 5 - 11},   year = {2011},   results = {powerperf-asplos-2011.zip},   month = {mar},   doi = {10.1145/1961296.1950402},   location = {Newport Beach, CA, USA}, }
• X. Yang, S. M. Blackburn, D. Frampton, J. Sartor, and K. S. McKinley, "Why Nothing Matters: The Impact of Zeroing," in Proceedings of the 2011 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages & Applications (OOPSLA 2011), Portland, OR, October 22-27, 2011, 2011.
 http://dx.doi.org/10.1145/2076021.2048092
FOR subject classification codes: 080308, 100604
Memory safety defends against inadvertent and malicious misuse of memory that may compromise program correctness and security. A critical element of memory safety is zero initialization. The direct cost of zero initialization is surprisingly high: up to 12.7%, with average costs ranging from 2.7 to 4.5% on a high performance virtual machine on IA32 architectures. Zero initialization also incurs indirect costs due to its memory bandwidth demands and cache displacement effects. Existing virtual machines either: a) minimize direct costs by zeroing in large blocks, or b) minimize indirect costs by zeroing in the allocation sequence, which reduces cache displacement and bandwidth. This paper evaluates the two widely used zero initialization designs, showing that they make different tradeoffs to achieve very similar performance.

Our analysis inspires three better designs: (1) bulk zeroing with cache-bypassing (non-temporal) instructions to reduce the direct and indirect zeroing costs simultaneously, (2) concurrent non-temporal bulk zeroing that exploits parallel hardware to move work off the application’s critical path, and (3) adaptive zeroing, which dynamically chooses between (1) and (2) based on available hardware parallelism. The new software strategies offer speedups sometimes greater than the direct overhead, improving total performance by 3% on average. Our findings invite additional optimizations and microarchitectural support.

@InProceedings{YBF+:11,   author = {Xi Yang and Stephen M Blackburn and Daniel Frampton and Jennifer Sartor and Kathryn S McKinley},   title = {Why Nothing Matters: The Impact of Zeroing},   booktitle = {Proceedings of the 2011 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages \& Applications (OOPSLA 2011), Portland, OR, October 22-27, 2011},   year = {2011},   volume = {46},   number = {10},   series = {SIGPLAN Notices},   month = {October},   publisher = {ACM},   doi = {http://dx.doi.org/10.1145/2076021.2048092},   }
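The two existing zero-initialization designs compared above can be sketched as follows (illustrative only; a real VM zeroes raw memory, and the paper's cache-bypassing variant relies on non-temporal store instructions that are not expressible from Python):

```python
# Sketch of the two widely used zero-initialization designs: (a) bulk zeroing
# of whole free regions up front, which amortizes the direct cost, versus
# (b) zeroing each object inline in the allocation sequence, which keeps the
# freshly zeroed bytes warm in the cache. Names are hypothetical.

HEAP = bytearray(b"\xff" * 64)   # stale contents left over from previous use

def bulk_zero(heap, start, size):
    # design (a): zero an entire free region in one pass
    heap[start:start + size] = bytes(size)

def hot_path_alloc(heap, cursor, size):
    # design (b): zero each object at allocation time
    heap[cursor:cursor + size] = bytes(size)
    return cursor, cursor + size             # (object address, new cursor)

bulk_zero(HEAP, 0, 32)                       # region [0, 32) zeroed up front
obj, cursor = hot_path_alloc(HEAP, 32, 16)   # object [32, 48) zeroed on allocation
```

The paper's contribution is to get the best of both: bulk zeroing as in (a), but with cache-bypassing stores (and, optionally, a concurrent helper thread) so that the bulk pass no longer displaces the application's working set.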
• R. Garner, S. M. Blackburn, and D. Frampton, "A Comprehensive Evaluation of Object Scanning Techniques," in Proceedings of the Tenth ACM SIGPLAN International Symposium on Memory Management, ISMM ‘11, San Jose, CA, USA, June 4 – 5, 2011.
 http://dx.doi.org/10.1145/1993478.1993484
FOR subject classification codes: 080308, 100604
At the heart of all garbage collectors lies the process of identifying and processing reference fields within an object. Despite its key role, and evidence of many different implementation approaches, to our knowledge no comprehensive quantitative study of this design space exists. The lack of such a study means that implementers must rely on ‘conventional wisdom’, hearsay, and their own costly analysis. Starting with mechanisms described in the literature and a variety of permutations of these, we explore the impact of a number of dimensions including: a) the choice of data structure, b) levels of indirection from object to metadata, and c) specialization of scanning code. We perform a comprehensive examination of these tradeoffs on four different architectures using eighteen benchmarks and hardware performance counters. We inform the choice of mechanism with a detailed study of heap composition and object structure as seen by the garbage collector on these benchmarks. Our results show that choice of scanning mechanism is important. We find that a careful choice of scanning mechanism alone can improve garbage collection performance by 16% and total time by 2.5%, on average, over a well tuned baseline. We observe substantial variation in performance among architectures, and find that some mechanisms–particularly specialization, layout of reference fields in objects, and encoding metadata in object headers–yield consistent, significant advantages.
@InProceedings{GBF:11,   author = {Robin Garner and Stephen M Blackburn and Daniel Frampton},   title = {A Comprehensive Evaluation of Object Scanning Techniques},   booktitle = {Proceedings of the Tenth ACM SIGPLAN International Symposium on Memory Management, ISMM '11, San Jose, CA, USA, June 4 - 5},   year = {2011},   month = {jun},   doi = {http://dx.doi.org/10.1145/1993478.1993484},   location = {San Jose, CA, USA}, }
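One point in the design space explored above — per-type metadata held separately from the object — can be sketched as follows (hypothetical names, not the paper's mechanisms):

```python
# Illustrative sketch of object scanning via per-type metadata: the collector
# looks up which fields of an object hold references and enumerates their
# referents during a trace. Real designs vary the data structure, the levels
# of indirection to this metadata, and whether the loop is specialized per type.

TYPE_INFO = {
    "Node": ("left", "right"),    # names of the reference fields only
    "Leaf": (),                   # no reference fields: nothing to scan
}

class HeapObject:
    def __init__(self, type_name, **fields):
        self.type_name = type_name
        self.__dict__.update(fields)

def scan(obj):
    """Yield the referents of obj's reference fields, skipping nulls."""
    for field in TYPE_INFO[obj.type_name]:
        ref = getattr(obj, field)
        if ref is not None:
            yield ref

leaf = HeapObject("Leaf", value=42)
root = HeapObject("Node", left=leaf, right=None)
children = list(scan(root))       # the trace visits leaf via root
```

The alternatives the paper measures (e.g. encoding the metadata in the object header, or compiling a specialized scan routine per type) trade indirections in this lookup against code size and cache behavior.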
• I. Jibaja, S. M. Blackburn, M. R. Haghighat, and K. S. McKinley, "Deferred Gratification: Engineering for High Performance Garbage Collection from the Get Go," in Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC 2011), San Jose, CA, June 5, 2011, 2011.
 http://dx.doi.org/10.1145/1988915.1988930
FOR subject classification codes: 080308, 080309
Implementing a new programming language system is a daunting task. A common trap is to punt on the design and engineering of exact garbage collection and instead opt for reference counting or conservative garbage collection (GC). For example, AppleScript, Perl, Python, and PHP implementers chose reference counting (RC) and Ruby chose conservative GC. Although easier to get working, reference counting has terrible performance and conservative GC is inflexible and performs poorly when allocation rates are high. However, high performance GC is central to performance for managed languages and only becoming more critical due to relatively lower memory bandwidth and higher memory latency of modern architectures. Unfortunately, retrofitting support for high performance collectors later is a formidable software engineering task due to their exact nature. Whether they realize it or not, implementers have three routes: (1) forge ahead with reference counting or conservative GC, and worry about the consequences later; (2) build the language on top of an existing managed runtime with exact GC, and tune the GC to scripting language workloads; or (3) engineer exact GC from the ground up and enjoy the correctness and performance benefits sooner rather than later.

We explore this conundrum using PHP, the most popular server side scripting language. PHP implements reference counting, mirroring scripting languages before it. Because reference counting is incomplete, the implementers must (a) also implement tracing to detect cyclic garbage, or (b) prohibit cyclic data structures, or (c) never reclaim cyclic garbage. PHP chose (a), AppleScript chose (b), and Perl chose (c). We characterize the memory behavior of five typical PHP programs to determine whether their implementation choice was a good one in light of the growing demand for high performance PHP. The memory behavior of these PHP programs is similar to that of other managed languages, such as Java — they allocate many short-lived objects and a large variety of object sizes, and the average allocated object size is small. These characteristics suggest copying generational GC will attain high performance. Language implementers who are serious about correctness and performance need to understand deferred gratification: paying the software engineering cost of exact GC up front will deliver correctness and memory system performance later.

@InProceedings{JBH+:11,   author = {Ivan Jibaja and Stephen M Blackburn and Mohammad R. Haghighat and Kathryn S McKinley},   title = {Deferred Gratification: Engineering for High Performance Garbage Collection from the Get Go},   booktitle = {Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC 2011), San Jose, CA, June 5, 2011},   year = {2011},   month = {June},   publisher = {ACM},   doi = {http://dx.doi.org/10.1145/1988915.1988930},   }
• T. Cao, S. M. Blackburn, and K. S. McKinley, "Virtual Machine Services: An Opportunity for Hardware Customization," in Workshop on Energy-efficient Computing for a Sustainable World, Porto Alegre, Brazil, Dec 4, 2011, 2011.
FOR subject classification codes: 100605, 080308, 100606
This paper considers the intersection of two computing trends: (1) Computer architecture design is now constrained by power, instead of transistor count, which is leading to architecture heterogeneity and customization. (2) Software developers increasingly choose managed languages, such as JavaScript, PHP, Java, and C#. Managed languages require a VM (Virtual Machine), which executes services such as profiling, compilation, scheduling, and memory management together with every application. The ubiquity of VM services makes them a perfect test case for potential improvements in energy through the use of hardware heterogeneity. This paper systematically explores the effect of hardware features such as frequency and voltage scaling, cache size, hardware parallelism, and gross microarchitecture design on the power and performance of VM services. It thus evaluates this potential on actual hardware, rather than through simulation. We study Just-in-Time (JIT) compilation, interpretation, and memory management. We find that VM services consume around 20% of energy on average. Compared to application code, the memory manager and interpreter offer substantially different workloads and do not uniformly benefit from high performance architectural features. A heterogeneous multicore processor thus has the potential to substantially improve energy of managed applications with cores customized for VM services.
@InProceedings{CBM:11,   author = {Ting Cao and Stephen M Blackburn and Kathryn S McKinley},   title = {Virtual Machine Services: An Opportunity for Hardware Customization},   booktitle = {Workshop on Energy-efficient Computing for a Sustainable World, Porto Alegre, Brazil, Dec 4, 2011},   year = {2011},   month = {December},   publisher = {IEEE/ACM},   }
• J. Sartor, S. M. Blackburn, D. Frampton, M. Hirzel, and K. S. McKinley, "Z-Rays: Divide Arrays and Conquer Speed and Flexibility," in ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010.
 http://doi.acm.org/10.1145/1809028.1806649
FOR subject classification codes: 080308, 100604
Arrays are the ubiquitous organization for indexed data. Throughout programming language evolution, implementations have laid out arrays contiguously in memory. This layout is problematic in space and time. It causes heap fragmentation, garbage collection pauses in proportion to array size, and wasted memory for sparse and over-provisioned arrays. Because of array virtualization in managed languages, an array layout that consists of indirection pointers to fixed-size discontiguous memory blocks can mitigate these problems transparently. This design, however, incurs significant overhead, and is justified only when real-time deadlines and space constraints trump performance.

This paper proposes z-rays, a discontiguous array design with flexibility and efficiency. A z-ray has a spine with indirection pointers to fixed-size memory blocks called arraylets, and uses five optimizations: (1) inlining the first N array bytes into the spine, (2) lazy allocation, (3) zero compression, (4) fast array copy, and (5) arraylet copy-on-write. Whereas discontiguous arrays in prior work improve responsiveness and space efficiency, z-rays combine time efficiency and flexibility. On average, the best z-ray configuration performs within 12.7% of an unmodified Java Virtual Machine on 19 benchmarks, whereas previous designs have two to three times higher overheads. Furthermore, language implementers can configure z-ray optimizations for various design goals. This combination of performance and flexibility creates a better building block for past and future array optimization.

@InProceedings{SBF+:10,   author = {Jennifer Sartor and Stephen M. Blackburn and Daniel Frampton and Martin Hirzel and Kathryn S. McKinley},   title = {Z-Rays: Divide Arrays and Conquer Speed and Flexibility},   booktitle = {ACM SIGPLAN Conference on Programming Language Design and Implementation},   year = {2010},   month = {June},   doi = {http://doi.acm.org/10.1145/1809028.1806649},   publisher = {ACM},   location = {Toronto, Canada},   patch = {arraylets-jikesrvm-3.0.1.patch}, }
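The spine-and-arraylet layout described above can be sketched as follows (a minimal model of the discontiguous design with lazy allocation; the paper's other optimizations, such as first-N inlining, zero compression, and copy-on-write, are omitted, and the names are illustrative):

```python
# Minimal sketch of a discontiguous array in the spirit of z-rays: a spine of
# indirection pointers to fixed-size blocks (arraylets), with blocks allocated
# lazily on first write so sparse arrays cost little space.

ARRAYLET_SIZE = 4

class DiscontiguousArray:
    def __init__(self, length):
        self.length = length
        nblocks = -(-length // ARRAYLET_SIZE)  # ceiling division
        self.spine = [None] * nblocks          # lazy: no arraylets allocated yet

    def _locate(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        return index // ARRAYLET_SIZE, index % ARRAYLET_SIZE

    def __setitem__(self, index, value):
        block, offset = self._locate(index)
        if self.spine[block] is None:          # allocate arraylet on first write
            self.spine[block] = [0] * ARRAYLET_SIZE
        self.spine[block][offset] = value

    def __getitem__(self, index):
        block, offset = self._locate(index)
        arraylet = self.spine[block]
        return 0 if arraylet is None else arraylet[offset]

a = DiscontiguousArray(10)
a[9] = 7            # allocates only the last arraylet; the others stay unmaterialized
```

The extra indirection on every access is the overhead the paper's optimizations attack; bounding each block at `ARRAYLET_SIZE` is what bounds GC pause contributions and fragmentation.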
• H. Esmaeilzadeh, S. M. Blackburn, X. Yang, and K. S. McKinley, "Power and Performance of Native and Java Benchmarks on 130nm to 32nm Process Technologies," in Sixth Annual Workshop on Modeling, Benchmarking and Simulation, MoBS 2010, Saint-Malo, France, 2010.
FOR subject classification codes: 100605, 080308, 100606
Over the past decade, chip fabrication technology shrank from 130nm to 32nm. This reduction was generally considered to provide performance improvements together with chip power reductions. This paper examines how well process technology and microarchitecture delivered on this assumption. This paper evaluates power and performance of native and Java workloads across a selection of IA32 processors from five technology generations (130nm, 90nm, 65nm, 45nm, and 32nm). We use a Hall effect sensor to accurately measure chip power. This paper reports a range of findings in three areas. 1) Methodology: TDP is unsurprisingly a poor predictor of application power consumption for a particular processor, but worse, TDP is a poor predictor of relative power consumption between processors. 2) Power-performance trends: Processors appear to have already hit the power wall at 45nm. 3) Native versus Java workloads and their relationship to processor technology: Single-threaded Java workloads exploit multiple cores. These results indicate that Java workloads offer different opportunities and challenges compared to native workloads. Our findings challenge prevalent methodologies and offer new insight into how microarchitectures have traded power and performance as process technology shrank.
@InProceedings{EBYM:10,   author = {Hadi Esmaeilzadeh and Stephen M. Blackburn and Xi Yang and Kathryn S. McKinley},   title = {Power and Performance of Native and Java Benchmarks on 130nm to 32nm Process Technologies},   booktitle = {Sixth Annual Workshop on Modeling, Benchmarking and Simulation, MoBS 2010, Saint-Malo, France},   year = {2010},   month = {June},   location = {Saint-Malo, France}, }
• D. Frampton, S. M. Blackburn, P. Cheng, R. J. Garner, D. Grove, E. J. B. Moss, and S. I. Salishev, "Demystifying magic: high-level low-level programming," in VEE ‘09: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, New York, NY, USA, 2009, pp. 81-90.
 http://doi.acm.org/10.1145/1508293.1508305
FOR subject classification codes: 080308, 080309
The power of high-level languages lies in their abstraction over hardware and software complexity, leading to greater security, better reliability, and lower development costs. However, opaque abstractions are often show-stoppers for systems programmers, forcing them to either break the abstraction, or more often, simply give up and use a different language. This paper addresses the challenge of opening up a high-level language to allow practical low-level programming without forsaking integrity or performance.

The contribution of this paper is three-fold: 1) we draw together common threads in a diverse literature, 2) we identify a framework for extending high-level languages for low-level programming, and 3) we show the power of this approach through concrete case studies. Our framework leverages just three core ideas: extending semantics via intrinsic methods, extending types via unboxing and architectural-width primitives, and controlling semantics via scoped semantic regimes. We develop these ideas through the context of a rich literature and substantial practical experience. We show that they provide the power necessary to implement substantial artifacts such as a high-performance virtual machine, while preserving the software engineering benefits of the host language.

The time has come for high-level low-level programming to be taken more seriously: 1) more projects now use high-level languages for systems programming, 2) increasing architectural heterogeneity and parallelism heighten the need for abstraction, and 3) a new generation of high-level languages are under development and ripe to be influenced.

@InProceedings{FBC+:09,   author = {Frampton, Daniel and Blackburn, Stephen M. and Cheng, Perry and Garner, Robin J. and Grove, David and Moss, J. Eliot B. and Salishev, Sergey I.},   title = {Demystifying magic: high-level low-level programming},   booktitle = {VEE '09: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},   year = {2009},   isbn = {978-1-60558-375-4},   pages = {81--90},   location = {Washington, DC, USA},   doi = {http://doi.acm.org/10.1145/1508293.1508305},   publisher = {ACM},   address = {New York, NY, USA},   }
• J. Ha, M. Arnold, S. M. Blackburn, and K. S. McKinley, "A concurrent dynamic analysis framework for multicore hardware," in OOPSLA ‘09: Proceeding of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, New York, NY, USA, 2009, pp. 155-174.
 http://doi.acm.org/10.1145/1640089.1640101
FOR subject classification codes: 080304, 100604, 080308
Software has spent the bounty of Moore’s law by solving harder problems and exploiting abstractions, such as high-level languages, virtual machine technology, binary rewriting, and dynamic analysis. Abstractions make programmers more productive and programs more portable, but usually slow them down. Since Moore’s law is now delivering multiple cores instead of faster processors, future systems must either bear a relatively higher cost for abstractions or use some cores to help tolerate abstraction costs.

This paper presents the design, implementation, and evaluation of a novel concurrent, configurable dynamic analysis framework that efficiently utilizes multicore cache architectures. It introduces Cache-friendly Asymmetric Buffering (CAB), a lock-free ring-buffer that implements efficient communication between application and analysis threads. We guide the design and implementation of our framework with a model of dynamic analysis overheads. The framework implements exhaustive and sampling event processing and is analysis-neutral. We evaluate the framework with five popular and diverse analyses, and show performance improvements even for lightweight, low-overhead analyses.

Efficient inter-core communication is central to high performance parallel systems and we believe the CAB design gives insight into the subtleties and difficulties of attaining it for dynamic analysis and other parallel software.
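The communication discipline at the heart of CAB can be illustrated with a minimal single-producer/single-consumer ring buffer. This Python sketch (names and structure are ours, not the paper's Jikes RVM implementation) shows only the lock-free indexing idea: each index is written by exactly one thread, so no locks are needed; the paper's design additionally lays out producer and consumer state to avoid cache-line contention between cores.

```python
# Sketch of an SPSC ring buffer: the producer owns `tail`, the consumer
# owns `head`, so neither index is ever written by two threads.
class RingBuffer:
    def __init__(self, capacity):
        # Power-of-two capacity makes wrap-around a cheap bit mask.
        assert capacity & (capacity - 1) == 0
        self.buf = [None] * capacity
        self.mask = capacity - 1
        self.head = 0  # written only by the consumer (analysis thread)
        self.tail = 0  # written only by the producer (application thread)

    def put(self, event):
        # Producer side: fail fast when full rather than block, so the
        # application thread is never stalled by the analysis thread.
        if self.tail - self.head == len(self.buf):
            return False
        self.buf[self.tail & self.mask] = event
        self.tail += 1
        return True

    def get(self):
        # Consumer side: returns None when the buffer is empty.
        if self.head == self.tail:
            return None
        event = self.buf[self.head & self.mask]
        self.head += 1
        return event
```

In the framework described above, the application thread would `put` analysis events and a dedicated analysis thread would `get` them, keeping instrumentation overhead off the application's critical path.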

@InProceedings{HAB+:09,   author = {Ha, Jungwoo and Arnold, Matthew and Blackburn, Stephen M. and McKinley, Kathryn S.},   title = {A concurrent dynamic analysis framework for multicore hardware},   booktitle = {OOPSLA '09: Proceeding of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications},   year = {2009},   isbn = {978-1-60558-766-0},   pages = {155--174},   location = {Orlando, Florida, USA},   doi = {http://doi.acm.org/10.1145/1640089.1640101},   publisher = {ACM},   address = {New York, NY, USA},   }
• S. M. Blackburn, S. I. Salishev, M. Danilov, O. A. Mokhovikov, A. A. Nashatyrev, P. A. Novodvorsky, V. I. Bogdanov, X. F. Li, and D. Ushakov, "The Moxie JVM Experience," Australian National University, Department of Computer Science, TR-CS-08-01, 2008.
By January 1998, only two years after the launch of the first Java virtual machine, almost all JVMs in use today had been architected. In the nine years since, technology has advanced enormously with respect to the underlying hardware, language implementation, and the application domain. Although JVM technology has moved forward in leaps and bounds, basic design decisions made in the '90s have anchored JVM implementation.

The Moxie project set out to explore the question: “How would we design a JVM from scratch knowing what we know today?” Amid the mass of design questions we faced, the tension between performance and flexibility was pervasive, persistent and problematic. In this experience paper we describe the Moxie project and its lessons, a process which began with consulting experts from industry and academia, and ended with a fully working prototype.

@TechReport{BSD+:08,   author = {Stephen M. Blackburn and Sergey I. Salishev and Mikhail Danilov and Oleg A. Mokhovikov and Anton A. Nashatyrev and Peter A. Novodvorsky and Vadim I. Bogdanov and Xiao Feng Li and Dennis Ushakov},   title = {The {M}oxie {JVM} Experience},   institution = {Australian National University, Department of Computer Science},   year = {2008},   number = {TR-CS-08-01},   month = {May},   }
• J. Ha, M. Gustafsson, S. M. Blackburn, and K. S. McKinley, "Microarchitectural Characterization of Production JVMs and Java Workloads," in IBM CAS Workshop, Austin, TX, 2008.
FOR subject classification codes: 080308, 100604
Understanding and comparing Java Virtual Machine (JVM) performance at a microarchitectural level can identify JVM performance anomalies and potential opportunities for optimization. The two primary tools for microarchitectural performance analysis are hardware performance counters and cycle accurate simulators. Unfortunately, the nondeterminism, complexity, and size of modern JVMs make these tools difficult to apply and therefore the microarchitectural performance of JVMs remains under-studied. We propose and use new methodologies for measuring unmodified production JVMs using both performance counters and a cycle accurate simulator. Our experimental design controls nondeterminism within a single measurement by using multiple iterations after a steady state is reached. We also use call-backs provided by the production JVMs to isolate application performance from the garbage collector, and where supported, the JIT. Finally, we use conventional statistical approaches to understand the effect of the remaining sources of measurement noise such as nondeterministic JIT optimization plans.

This paper describes these methodologies and then reports on work in progress using these methodologies to compare IBM J9, BEA JRockit, and Sun HotSpot JVM performance with hardware performance counters and simulation. We examine one benchmark in detail to give a flavor of the depth and type of analyses possible with this methodology.

@InProceedings{HGBM:08,   author = { J. Ha and M. Gustafsson and S. M. Blackburn and K. S. McKinley},   title = {Microarchitectural Characterization of Production JVMs and Java Workloads},   booktitle = {IBM CAS Workshop, Austin, TX},   year = {2008},   month = {February},   }
• S. M. Blackburn and K. S. McKinley, "Immix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance," in ACM SIGPLAN Conference on Programming Language Design and Implementation, 2008.
 http://doi.acm.org/10.1145/1375581.1375586
FOR subject classification codes: 080308
Programmers are increasingly choosing managed languages for modern applications, which tend to allocate many short-to-medium lived small objects. The garbage collector therefore directly determines program performance by making a classic space-time tradeoff that seeks to provide space efficiency, fast reclamation, and mutator performance. The three canonical tracing garbage collectors (semi-space, mark-sweep, and mark-compact) each sacrifice one objective. This paper describes a collector family, called mark-region, and introduces opportunistic defragmentation, which mixes copying and marking in a single pass. Combining both, we implement immix, a novel high performance garbage collector that achieves all three performance objectives. The key insight is to allocate and reclaim memory in contiguous regions, at a coarse block grain when possible and otherwise in groups of finer grain lines. We show that immix outperforms existing canonical algorithms, improving total application performance by 7 to 25% on average across 20 benchmarks. As the mature space in a generational collector, immix matches or beats a highly tuned generational collector, e.g. it improves SPECjbb by 5%. These innovations and the identification of a new family of collectors open new opportunities for garbage collector design.
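The block/line structure can be sketched in a few lines. The line and block sizes below match the paper (128-byte lines, 32KB blocks), but everything else is a deliberate simplification (whole-line allocation, no object headers, and no overflow allocation or opportunistic copying):

```python
LINE = 128          # Immix line size in bytes
BLOCK_LINES = 256   # 256 lines = one 32KB block

class Block:
    def __init__(self):
        self.line_marks = [False] * BLOCK_LINES  # True = line holds live data
        self.cursor = 0  # next line to try

    def find_hole(self, start):
        # Locate the next run of contiguous free lines (a "hole") at or
        # after `start`; returns (lo, hi) for the half-open range [lo, hi).
        i = start
        while i < BLOCK_LINES and self.line_marks[i]:
            i += 1
        if i >= BLOCK_LINES:
            return None
        j = i
        while j < BLOCK_LINES and not self.line_marks[j]:
            j += 1
        return (i, j)

def alloc(block, nbytes):
    # Bump-allocate into the current hole, skipping to the next hole when
    # the request does not fit; returns the starting line, or None when the
    # block is exhausted (the allocator would then take a fresh block).
    need = (nbytes + LINE - 1) // LINE  # simplification: whole lines only
    pos = block.cursor
    while True:
        hole = block.find_hole(pos)
        if hole is None:
            return None
        lo, hi = hole
        if hi - lo >= need:
            for k in range(lo, lo + need):
                block.line_marks[k] = True
            block.cursor = lo + need
            return lo
        pos = hi
```

The point of the design is visible even in this reduction: allocation is a bump pointer within a hole, and reclamation is simply clearing line marks, so both the fast path of semi-space and the space efficiency of mark-sweep are retained.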
@InProceedings{BM:08,   author = {Stephen M. Blackburn and Kathryn S. McKinley},   title = {Immix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance},   booktitle = {ACM SIGPLAN Conference on Programming Language Design and Implementation},   year = {2008},   month = {June},   publisher = {ACM},   doi = {http://doi.acm.org/10.1145/1375581.1375586},   patch = {immix-jikesrvm-r13767.patch.gz},   results = {immix-pldi-2008.csv.tgz},   location = {Tucson, AZ, USA},   }
• S. M. Blackburn, K. S. McKinley, R. Garner, C. Hoffmann, A. M. Khan, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanovic, T. VanDrunen, D. von Dincklage, and B. Wiedermann, "Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century," Communications of the ACM, 2008.
 http://doi.acm.org/10.1145/1378704.1378723
FOR subject classification codes: 080308, 080309, 100605
Evaluation methodology underpins all innovation in experimental computer science. It requires relevant workloads, appropriate experimental design, and rigorous analysis. Unfortunately, methodology is not keeping pace with the changes in our field. The rise of managed languages such as Java, C#, and Ruby in the past decade and the imminent rise of commodity multicore architectures for the next decade pose new methodological challenges that are not yet widely understood. This paper explores the consequences of our collective inattention to methodology on innovation, makes recommendations for addressing this problem in one domain, and provides guidelines for other domains. We describe benchmark suite design, experimental design, and analysis for evaluating Java applications. For example, we introduce new criteria for measuring and selecting diverse applications for a benchmark suite. We show that the complexity and nondeterminism of the Java runtime system make experimental design a first-order consideration, and we recommend mechanisms for addressing complexity and nondeterminism. Drawing on these results, we suggest how to adapt methodology more broadly. To continue to deliver innovations, our field needs to significantly increase participation in and funding for developing sound methodological foundations.
@Article{BMG+:08,   author = {Stephen M. Blackburn and Kathryn S. McKinley and Robin Garner and Chris Hoffmann and Asjad M. Khan and Rotem Bentzur and Amer Diwan and Daniel Feinberg and Daniel Frampton and Samuel Z. Guyer and Martin Hirzel and Antony Hosking and Maria Jump and Han Lee and J. Eliot B. Moss and Aashish Phansalkar and Darko Stefanovic and Thomas VanDrunen and Daniel von Dincklage and Ben Wiedermann},   title = {{Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century}},   journal = {Communications of the ACM},   year = {2008},   month = {August},   publisher = {ACM Press},   address = {New York, NY, USA},   doi = {http://doi.acm.org/10.1145/1378704.1378723},   note = {Invited paper. CACM Research Highlights.},   }
• S. M. Blackburn, M. Hertz, K. S. McKinley, E. J. B. Moss, and T. Yang, "Profile-based pretenuring," ACM Transactions on Programming Languages and Systems, vol. 27, iss. 1, 2007.
 http://doi.acm.org/10.1145/1180475.1180477
FOR subject classification codes: 080308
Pretenuring can reduce copying costs in garbage collectors by allocating long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on pretenuring as follows: (1) We produce pretenuring advice that is neutral with respect to the garbage collector algorithm and configuration. We thus can and do combine advice from different applications. We find for our benchmarks that predictions using object lifetimes at each allocation site in Java programs are accurate, which simplifies the pretenuring implementation. (2) We gather and apply advice to both applications and Jikes RVM, a compiler and runtime system for Java written in Java. Our results demonstrate that building combined advice into Jikes RVM from different application executions improves performance, regardless of the application Jikes RVM is compiling and executing. This build-time advice thus gives user applications some benefits of pretenuring, without any application profiling. No previous work uses profile feedback to pretenure in the runtime system. (3) We find that application-only advice also consistently improves performance, but that the combination of build-time and application-specific advice is almost always noticeably better. (4) Our same advice improves the performance of generational, Older First, and Beltway collectors, illustrating that it is collector neutral. (5) We include an immortal allocation space in addition to a nursery and older generation, and show that pretenuring to immortal space has substantial benefit.
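The advice-generation step can be illustrated with a small sketch. The profile format, thresholds, and space names here are hypothetical, chosen only to show the shape of collector-neutral, combinable per-site advice described above:

```python
def combine(profiles):
    # Combine per-site lifetime profiles from several runs by summing byte
    # counts, mirroring how advice from different applications is merged.
    total = {}
    for profile in profiles:
        for site, counts in profile.items():
            acc = total.setdefault(site, [0, 0, 0])
            for i, c in enumerate(counts):
                acc[i] += c
    return total

def classify(profile, long_frac=0.5, immortal_frac=0.5):
    # Per-site counts are (short_lived, long_lived, immortal) bytes; a site
    # is pretenured when most of its bytes are long-lived or immortal.
    # The 0.5 thresholds are illustrative, not the paper's.
    advice = {}
    for site, (short, long_, immortal) in profile.items():
        total = short + long_ + immortal
        if total == 0:
            continue
        if immortal / total >= immortal_frac:
            advice[site] = "immortal"
        elif (long_ + immortal) / total >= long_frac:
            advice[site] = "old"
        else:
            advice[site] = "nursery"
    return advice
```

Because the advice is keyed by allocation site and expressed as a target space rather than a collector-specific action, the same table can direct allocation in generational, Older First, or Beltway configurations.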
@Article{BHMMY:07,   author = {Stephen M. Blackburn and Matthew Hertz and Kathryn S. McKinley and J. Eliot B. Moss and Ting Yang},   title = {Profile-based pretenuring},   journal = {ACM Transactions on Programming Languages and Systems},   year = {2007},   month = {January},   volume = {27},   number = {1},   doi = {http://doi.acm.org/10.1145/1180475.1180477},   }
• R. Garner, S. M. Blackburn, and D. Frampton, "Effective Prefetch for Mark-Sweep Garbage Collection," in The 2007 International Symposium on Memory Management, 2007.
 http://doi.acm.org/10.1145/1296907.1296915
FOR subject classification codes: 080308, 100604
Garbage collection is a performance-critical feature of most modern object oriented languages, and is characterized by poor locality since it must traverse the heap. In this paper we show that by combining two very simple ideas we can significantly improve the performance of the canonical mark-sweep collector, resulting in improvements in application performance. We make three main contributions: 1) we develop a methodology and framework for accurately and deterministically analyzing the tracing loop at the heart of the collector, 2) we offer a number of insights and improvements over conventional design choices for mark-sweep collectors, and 3) we find that two simple ideas — edge order traversal and software prefetch — combine to greatly improve garbage collection performance although each is unproductive in isolation.

We perform a thorough analysis in the context of MMTk and Jikes RVM on a wide range of benchmarks and four different architectures. Our baseline system (which includes a number of our improvements) is very competitive with highly tuned alternatives. We show a simple marking mechanism which offers modest but consistent improvements over conventional choices. Finally, we show that enqueuing the edges (pointers) of the object graph rather than the nodes (objects) significantly increases opportunities for software prefetch, despite increasing the total number of queue operations. Combining edge-ordered enqueuing with software prefetching yields average performance improvements over a large suite of benchmarks of 20-30% in garbage collection time and 4-6% in total application performance in moderate heaps, across four architectures.
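The edge-order idea can be sketched in a few lines. The `prefetch` stub stands in for a hardware prefetch instruction (a no-op here), and the graph encoding is hypothetical; the point is that deferring the mark test to dequeue time creates the slack needed for the prefetch to be useful:

```python
from collections import deque

def prefetch(ref):
    # Stand-in for a hardware prefetch instruction; in the real collector
    # this hides the cache miss on the object's header and fields.
    pass

def mark_edges(roots, fields, mark):
    # Edge-order tracing: the queue holds *references* (edges), not marked
    # objects. The mark-test is deferred until dequeue, so a prefetch can
    # be issued at enqueue time, well before the object is examined.
    work = deque()
    for ref in roots:
        prefetch(ref)
        work.append(ref)
    while work:
        obj = work.popleft()
        if obj in mark:      # deferred mark test; duplicates are harmless
            continue
        mark.add(obj)
        for child in fields.get(obj, ()):
            prefetch(child)  # issued many queue operations ahead of use
            work.append(child)
    return mark
```

The queue may hold the same reference more than once (hence more queue operations), but each enqueue gives the memory system an early, accurate hint, which is what makes the prefetch productive where node-order prefetching is not.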

@InProceedings{GBF:07,   author = {Robin Garner and Stephen M. Blackburn and Daniel Frampton},   title = {Effective Prefetch for Mark-Sweep Garbage Collection},   booktitle = {The 2007 International Symposium on Memory Management},   year = {2007},   month = {October},   publisher = {ACM},   location = {Montreal, Canada},   doi = {http://doi.acm.org/10.1145/1296907.1296915},   patch = {pf-ismm-2007.tgz}, }
• X. Huang, S. M. Blackburn, D. Grove, and K. S. McKinley, "Fast and efficient partial code reordering: taking advantage of dynamic recompilation," in The 2006 International Symposium on Memory Management (ISMM 2006), Ottawa, Ontario, Canada, 2006, pp. 184-192.
 http://doi.acm.org/10.1145/1133956.1133980
FOR subject classification codes: 100604, 080308
Poor instruction cache locality can degrade performance on modern architectures. For example, our simulation results show that eliminating all instruction cache misses improves performance by as much as 16% for a modestly sized instruction cache. In this paper, we show how to take advantage of dynamic code generation in a Java Virtual Machine (VM) to improve instruction locality at run-time. We develop a dynamic code reordering (DCR) system; a low overhead, online approach for improving instruction locality. DCR has three optimizations: (1) Interprocedural method separation; (2) Intraprocedural code splitting; and (3) Code padding. DCR uses the dynamic call graph and an edge profile that most VMs already collect to separate hot/cold methods and hot/cold code within a method. It also puts padding between methods to minimize conflict misses between frequent caller/callee pairs. It incrementally performs these optimizations only when the VM is optimizing a method at a higher level. We implement DCR in Jikes RVM and show its overhead is negligible. Extensive simulation and run-time experiments show that a simple code space improves average performance on a Pentium 4 by around 6% on SPEC and DaCapo Java benchmarks. These programs however have very small instruction cache footprints that limit opportunities for DCR to improve performance. Consequently, DCR optimizations on average show little effect, sometimes degrading performance and occasionally improving performance by up to 5%. Our work shows that the VM has the potential to dynamically improve instruction locality incrementally by simply piggybacking on hotspot recompilation.
@InProceedings{HBGM:06,   author = {Xianglong Huang and Stephen M. Blackburn and David Grove and Kathryn S. McKinley},   title = {Fast and efficient partial code reordering: taking advantage of dynamic recompilation},   booktitle = {The 2006 International Symposium on Memory Management (ISMM 2006),Ottawa, Ontario, Canada},   year = {2006},   isbn = {1-59593-221-6},   pages = {184--192},   doi = {http://doi.acm.org/10.1145/1133956.1133980},   publisher = {ACM},   }
• M. Hertz, S. M. Blackburn, E. J. B. Moss, K. S. McKinley, and D. Stefanovic, "Generating Object Lifetime Traces with Merlin," ACM Transactions on Programming Languages and Systems, vol. 26, iss. 3, 2006.
 http://doi.acm.org/10.1145/1133651.1133654
FOR subject classification codes: 080308
Programmers are writing a rapidly growing number of programs in object-oriented languages, such as Java and C#, that require garbage collection. Garbage collection traces and simulation speed up research by enabling deeper understandings of object lifetime behavior and quick exploration and design of new garbage collection algorithms. When generating perfect traces, the brute-force method of computing object lifetimes requires a whole-heap garbage collection at every potential collection point in the program. Because this process is prohibitively expensive, researchers often use granulated traces by collecting only periodically, for example, every 32 KB of allocation. We extend the state of the art for simulating garbage collection algorithms in two ways. First, we develop a systematic methodology for simulation studies of copying garbage collection and present results showing the effects of trace granularity on these simulations. We show that trace granularity often distorts simulated garbage collection results compared with perfect traces. Second, we present and measure the performance of a new algorithm called Merlin for computing object lifetimes. Merlin timestamps objects and later uses the timestamps of dead objects to reconstruct when they died. The Merlin algorithm piggybacks on garbage collections performed by the base system. Experimental results show that Merlin can generate traces over two orders of magnitude faster than the brute-force method which collects after every object allocation. We also use Merlin to produce visualizations of heap behavior that expose new object lifetime behaviors.
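The timestamping mechanism can be sketched as follows. This is an illustrative reduction (names and the naive fixpoint are ours; the real algorithm stamps objects cheaply on pointer stores and reconstructs death times in a single timestamp-ordered pass piggybacked on collection):

```python
class Merlin:
    def __init__(self):
        self.clock = 0
        self.stamp = {}  # object -> last time it was provably live

    def tick(self):
        # Advance the logical allocation clock (e.g. once per allocation).
        self.clock += 1

    def pointer_update(self, target):
        # When a pointer to `target` is created or overwritten, `target`
        # is provably live right now: record the current time.
        if target is not None:
            self.stamp[target] = self.clock

    def death_times(self, dead, edges):
        # An object is live as long as anything that points to it is live,
        # so timestamps flow from source to target through the dead
        # subgraph until a fixpoint: each dead object's final stamp is the
        # time it was last (transitively) reachable, i.e. its death time.
        changed = True
        while changed:
            changed = False
            for src in dead:
                for dst in edges.get(src, ()):
                    if dst in dead and self.stamp[dst] < self.stamp[src]:
                        self.stamp[dst] = self.stamp[src]
                        changed = True
        return {o: self.stamp[o] for o in dead}
```

The sketch shows why no whole-heap collection per allocation is needed: liveness evidence is recorded cheaply as the program runs, and death times are reconstructed lazily at the next real collection.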
@Article{HBMMS:06,   author = {Matthew Hertz and Stephen M. Blackburn and J. Eliot B. Moss and Kathryn S. McKinley and Darko Stefanovic},   title = {Generating Object Lifetime Traces with Merlin},   journal = {ACM Transactions on Programming Languages and Systems},   year = {2006},   volume = {26},   number = {3},   doi = {http://doi.acm.org/10.1145/1133651.1133654},   }
• S. M. Blackburn and K. S. McKinley, "Transient Caches and Object Streams," Australian National University, Department of Computer Science, TR-CS-06-03, 2006.
FOR subject classification codes: 100604, 080308
Memory latency limits program performance. Object-oriented languages such as C# and Java exacerbate this problem, but their software engineering benefits make them increasingly popular. We show that current memory hierarchies are not particularly well suited to Java in which object streams write and read a window of short-lived objects that pollute the cache. These observations motivate the exploration of transient caches which assist a parent cache. For an L1 parent cache, transient caches are positioned similarly to a classic L0, providing one cycle access time. Their distinguishing features are (1) they are tiny (4 to 8 lines), (2) they are highly associative, and (3) the processor may seek them in parallel with their parent. They can assist any cache level. To address object stream behavior, we explore policies for read and write instantiation, promotion, filtering, and valid bits to implement no-fetch on write.

Good design points include a parallel L0 (PL0) which improves Java programs by 3% on average, and C# by 2% in cycle-accurate simulation over a two-cycle 32KB, 128B line, 2-way L1. A transient qualifying cache (TQ) improves further by a) minimizing pollution in the parent by filtering short-lived lines without temporal reuse, and b) using a write no-fetch policy with per-byte valid bits to eliminate wasted fetch bandwidth. TQs at L1 and L2 improve Java programs by 5% on average and up to 15%. The TQ even achieves improvements when the parent has half the capacity or associativity compared to the original larger L1. The one-cycle access time, a write no-fetch policy, and filtering bestow these benefits. Java motivates this approach, but it also improves performance for C# programs.

@TechReport{BM:06,   author = {Stephen M. Blackburn and Kathryn S. McKinley},   title = {Transient Caches and Object Streams},   institution = {Australian National University, Department of Computer Science},   year = {2006},   number = {TR-CS-06-03},   month = {October},   }
• S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khang, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanovic, T. VanDrunen, D. von Dincklage, and B. Wiedermann, "The DaCapo benchmarks: Java benchmarking development and analysis," in OOPSLA ‘06: Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications, New York, NY, USA, 2006, pp. 169-190.
 http://doi.acm.org/10.1145/1167473.1167488
FOR subject classification codes: 080308
Since benchmarks drive computer science research and industry product development, which ones we use and how we evaluate them are key questions for the community. Despite complex runtime tradeoffs due to dynamic compilation and garbage collection required for Java programs, many evaluations still use methodologies developed for C, C++, and Fortran. SPEC, the dominant purveyor of benchmarks, compounded this problem by institutionalizing these methodologies for their Java benchmark suite. This paper recommends benchmarking selection and evaluation methodologies, and introduces the DaCapo benchmarks, a set of open source, client-side Java benchmarks. We demonstrate that the complex interactions of (1) architecture, (2) compiler, (3) virtual machine, (4) memory management, and (5) application require more extensive evaluation than C, C++, and Fortran which stress (4) much less, and do not require (3). We use and introduce new value, time-series, and statistical metrics for static and dynamic properties such as code complexity, code size, heap composition, and pointer mutations. No benchmark suite is definitive, but these metrics show that DaCapo improves over SPEC Java in a variety of ways, including more complex code, richer object behaviors, and more demanding memory system requirements. This paper takes a step towards improving methodologies for choosing and evaluating benchmarks to foster innovation in system design and implementation for Java and other managed languages.
@InProceedings{BGH+:06,   author = {Stephen M. Blackburn and Robin Garner and Chris Hoffmann and Asjad M. Khang and Kathryn S. McKinley and Rotem Bentzur and Amer Diwan and Daniel Feinberg and Daniel Frampton and Samuel Z. Guyer and Martin Hirzel and Antony Hosking and Maria Jump and Han Lee and J. Eliot B. Moss and Aashish Phansalkar and Darko Stefanovic and Thomas VanDrunen and Daniel von Dincklage and Ben Wiedermann},   title = {{The DaCapo benchmarks: Java benchmarking development and analysis}},   booktitle = {OOPSLA '06: Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications},   year = {2006},   isbn = {1-59593-348-4},   pages = {169--190},   location = {Portland, Oregon, USA},   doi = {http://doi.acm.org/10.1145/1167473.1167488},   publisher = {ACM Press},   address = {New York, NY, USA},   }
• H. Paz, E. Petrank, and S. M. Blackburn, "Age-Oriented Concurrent Garbage Collection," in 14th International Conference on Compiler Construction (CC), Edinburgh, Scotland, April 4-8, 2005, 2005.
 http://dx.doi.org/10.1007/b107108
Generational collectors are well known as a tool for shortening pause times incurred by garbage collection and for improving garbage collection efficiency. In this paper, we investigate how to best use generations with on-the-fly collectors. On-the-fly collectors run concurrently with the program threads and induce very short program pauses. Thus, the motivation for incorporating generations is focused on improving throughput; pauses do not matter, since they are already very short. We propose a new collection approach, denoted age-oriented collection, for exploiting the generational hypothesis to obtain better efficiency. This approach is particularly useful when reference counting is used to collect the old generation, yielding a highly efficient and non-obtrusive on-the-fly collector. Finally, an implementation is provided demonstrating that the age-oriented collector outperforms both the non-generational and the generational collectors in efficiency.
@InProceedings{PPB:05,   author = {Harel Paz and Erez Petrank and Stephen M. Blackburn},   title = {Age-Oriented Concurrent Garbage Collection},   booktitle = {14th International Conference on Compiler Construction (CC), Edinburgh, Scotland, April 4-8, 2005},   year = 2005, impact = {20-50},   doi = {http://dx.doi.org/10.1007/b107108},   }
• B. Alpern, S. Augart, S. M. Blackburn, M. Butrico, A. Cocchi, P. Cheng, J. Dolby, S. Fink, D. Grove, M. Hind, K. S. McKinley, M. Mergen, J. E. B. Moss, T. Ngo, V. Sarkar, and M. Trapp, "The Jikes Research Virtual Machine project: Building an open source research community," IBM Systems Journal, vol. 44, iss. 2, 2005.
 http://dx.doi.org/10.1147/sj.442.0399
FOR subject classification codes: 080308
This paper describes the evolution of the Jikes Research Virtual Machine project from an IBM internal research project, called Jalapeno, into an open-source project. After summarizing the original goals of the project, we discuss the motivation for releasing it as an open-source project and the activities performed to ensure the success of the project. Throughout, we highlight the unique challenges of developing and maintaining an open-source project designed specifically to support a research community.
@Article{AABB+:05,   author = {B. Alpern and S. Augart and S. M. Blackburn and M. Butrico and A. Cocchi and P. Cheng and J. Dolby and S. Fink and D. Grove and M. Hind and K. S. McKinley and M. Mergen and J. E. B. Moss and T. Ngo and V. Sarkar and M. Trapp},   title = {The {Jikes Research Virtual Machine} project: {B}uilding an open source research community},   journal = {IBM Systems Journal},   year = {2005},   month = {May},   volume = {44},   number = {2},   doi = {http://dx.doi.org/10.1147/sj.442.0399},   }
• S. M. Blackburn, P. Cheng, and K. S. McKinley, "Oil and Water? High Performance Garbage Collection in Java with MMTk," in ICSE 2004, 26th International Conference on Software Engineering, Edinburgh, Scotland, May 23-28, 2004, 2004.
 http://doi.ieeecomputersociety.org/10.1109/ICSE.2004.1317436
Increasingly popular languages such as Java and C# require efficient garbage collection. This paper presents the design, implementation, and evaluation of MMTk, a Memory Management Toolkit for and in Java. MMTk is an efficient, composable, extensible, and portable framework for building garbage collectors. MMTk uses design patterns and compiler cooperation to combine modularity and efficiency. The resulting system is more robust, easier to maintain, and has fewer defects than monolithic collectors. Experimental comparisons with monolithic Java and C implementations reveal MMTk has significant performance advantages as well. Performance critical system software typically uses monolithic C at the expense of flexibility. Our results refute common wisdom that only this approach attains efficiency, and suggest that performance critical software can embrace modular design and high-level languages.
@InProceedings{BCM:04,   author = {S. M. Blackburn and P. Cheng and K. S. McKinley},   Title = {Oil and Water? {High} Performance Garbage Collection in {Java} with {MMTk}},   Booktitle = {ICSE 2004, 26th International Conference on Software Engineering, Edinburgh, Scotland, May 23-28, 2004},   month = {May},   Year = {2004},   publisher = {IEEE},   doi = {http://doi.ieeecomputersociety.org/10.1109/ICSE.2004.1317436}, }
• S. M. Blackburn and A. L. Hosking, "Barriers: Friend or Foe?," in The 2004 International Symposium on Memory Management (ISMM 2004), Vancouver, Canada, October 24-25, 2004, 2004.
 http://doi.acm.org/10.1145/1029873.1029891
Modern garbage collectors rely on read and write barriers imposed on heap accesses by the mutator, to keep track of references between different regions of the garbage collected heap, and to synchronize actions of the mutator with those of the collector. It has been a long-standing untested assumption that barriers impose significant overhead on garbage-collected applications. As a result, researchers have devoted effort to the development of optimization approaches for the elimination of unnecessary barriers, or proposed new algorithms for garbage collection that avoid the need for barriers while retaining the capability for independent collection of heap partitions. On the basis of the results presented here, we dispel the assumption that barrier overhead should be a primary motivator for such efforts.
We present a methodology for precise measurement of mutator overheads for barriers associated with mutator heap accesses. We provide a taxonomy of different styles of barrier and measure the cost of a range of popular barriers used for different garbage collectors within Jikes RVM. Our results demonstrate that barriers impose surprisingly low cost on the mutator, though results vary by architecture. We found that the overhead for a reasonable generational write barrier was less than 2% on average, and less than 6% in the worst case. Furthermore, we found that the average overhead of an unconditional read barrier on the PowerPC was only 0.85%, while on the AMD it was 8.05%. With both read and write barriers, we found that second order locality effects were sometimes more important than the overhead of the barriers themselves, leading to counter-intuitive speedups in a number of situations.
@InProceedings{BH:04,   author = {S. M. Blackburn and A. L. Hosking},   Title = {Barriers: {F}riend or Foe?},   Booktitle = {The 2004 International Symposium on Memory Management (ISMM 2004), Vancouver, Canada, October 24-25, 2004},   month = {October},   Year = {2004},   publisher = {ACM},   doi = {http://doi.acm.org/10.1145/1029873.1029891},   patch = {wb-ismm-2004.tgz},   }
• S. M. Blackburn, P. Cheng, and K. S. McKinley, "Myths and Realities: The Performance Impact of Garbage Collection," in SIGMETRICS — Performance 2004, Joint International Conference on Measurement and Modeling of Computer Systems, New York, NY, USA, June 12–16, 2004, 2004.
 http://doi.acm.org/10.1145/1005686.1005693
This paper explores and quantifies garbage collection (GC) behavior for three whole heap collectors and generational counterparts: copying semi-space, mark-sweep, and reference counting, the canonical algorithms from which essentially all other GC algorithms are derived. Efficient implementations in the memory management toolkit (MMTk) in Jikes RVM share all common mechanisms to provide a clean experimental platform. Performance counters and instrumentation measure timing and memory performance, separating GC and program behavior.

Experiments on SPEC JVM Benchmarks reveal key algorithmic features and how they match program characteristics to explain the direct cost of GC as a function of heap size, and the indirect impact of GC on application performance. Results include the finding that the choice of GC algorithms can improve mutator locality, disputing the myth that “no GC is good GC.” We show how the trade-offs in space utilization versus allocation and tracing costs in copying and mark-sweep collectors motivate a copying nursery for newly allocated objects, even without high nursery mortality. We find that object locality and pointer mutation demographics in the mature space are much less sensitive to GC algorithm, but that copying and mark-sweep for the mature space occasionally excel in different programs. This study is unique in its breadth of GC algorithms and its depth of analysis.

@InProceedings{BCM:04b,   author = {S. M. Blackburn and P. Cheng and K. S. McKinley},   Title = {Myths and Realities: {The} Performance Impact of Garbage Collection},   Booktitle = {{SIGMETRICS} -- Performance 2004, Joint International Conference on Measurement and Modeling of Computer Systems, New York, NY, USA, June 12--16, 2004},   month = {June},   Year = {2004},   publisher = {ACM},   doi = {http://doi.acm.org/10.1145/1005686.1005693}, }
• M. Jump, S. M. Blackburn, and K. S. McKinley, "Dynamic Object Sampling for Pretenuring," in The 2004 International Symposium on Memory Management (ISMM 2004), Vancouver, Canada, October 24-25, 2004, 2004.
 http://doi.acm.org/10.1145/1029873.1029892
Generational garbage collectors make a space-time tradeoff by exploiting the gross statistical object lifetime property that young objects tend to die at a higher rate than old ones. They thus collect the young nursery objects more frequently than old objects. Pretenuring decreases nursery collection work by allocating new, but long-lived objects directly into the old space. This paper introduces a new low-cost dynamic object sampling technique to generate pretenuring advice as the program executes. Every 2^k bytes of allocation, object sampling marks the new object with its allocation site. The collector computes allocation site survival rates using the sampled objects, and then starts pretenuring sites that produce mostly long-lived objects. It occasionally back-samples a site to react to phase changes. Evaluations in Jikes RVM on Java programs show the sampling space and time overheads are low, on average 1 to 2% when the compiler inlines the allocation sequence. As with previous dynamic pretenuring approaches, consistent performance improvements are difficult to attain with dynamic pretenuring, and many programs degrade slightly (by 1 to 2%); however, we demonstrate a consistent 10% improvement over a number of heap sizes for javac, a SPECjvm benchmark.
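The per-allocation sampling check the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the interval, site identifiers, per-site counters, and survival threshold are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of low-cost allocation-site sampling: every 2^k
// bytes of allocation, tag the newly allocated object with its site,
// then tally per-site survival at collection time. Names and the
// threshold policy are assumptions for illustration only.
public class ObjectSamplingSketch {
    static final int SAMPLE_INTERVAL = 1 << 16;   // 2^k bytes, here k = 16
    static int bytesUntilSample = SAMPLE_INTERVAL;

    // site id -> {sampled count, survived count}
    static final Map<Integer, int[]> siteStats = new HashMap<>();

    // Called on the allocation path; returns the site id if this
    // object is sampled (i.e. should be marked), else -1.
    static int onAllocate(int siteId, int size) {
        bytesUntilSample -= size;
        if (bytesUntilSample <= 0) {
            bytesUntilSample += SAMPLE_INTERVAL;
            siteStats.computeIfAbsent(siteId, s -> new int[2])[0]++;
            return siteId;
        }
        return -1;
    }

    // Called by the collector for each sampled object that survives.
    static void onSurvive(int siteId) {
        siteStats.get(siteId)[1]++;
    }

    // Pretenure a site once most of its sampled objects survive.
    static boolean shouldPretenure(int siteId, double threshold) {
        int[] s = siteStats.getOrDefault(siteId, new int[2]);
        return s[0] > 0 && (double) s[1] / s[0] >= threshold;
    }
}
```

The cheap decrement-and-test is the only work added to the hot allocation path, which is why the overhead stays at 1 to 2% when that sequence is inlined.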
@InProceedings{JBM:04,   author = {M. Jump and S. M. Blackburn and K. S. McKinley},   Title = {Dynamic Object Sampling for Pretenuring},   Booktitle = {The 2004 International Symposium on Memory Management (ISMM 2004), Vancouver, Canada, October 24-25, 2004},   month = {October},   Year = {2004},   publisher = {ACM},   doi = {http://doi.acm.org/10.1145/1029873.1029892},   }
• X. Huang, S. M. Blackburn, K. S. McKinley, J. E. B. Moss, Z. Wang, and P. Cheng, "The Garbage Collection Advantage: Improving Program Locality," in Proceedings of the 2004 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages \& Applications (OOPSLA 2004), Vancouver, Canada, October 24-28, 2004, 2004.
 http://doi.acm.org/10.1145/1028976.1028983
As increases in processor speed continue to outpace increases in cache and memory speed, programs are losing more performance to poor locality. Because copying garbage collectors move objects, they have the opportunity to improve locality for languages such as Java. This paper introduces a new dynamic, online class analysis for finding and exploiting locality in a copying collector. The analysis exploits method sampling in a JIT (just-in-time) optimizing compiler. For each hot (frequently accessed) method, object reordering analysis marks the class fields that the method accesses as hot. Then at garbage collection time, the collector copies referents of hot fields together with their parent. Enhancements to this basic technique include heuristics that decay heat to respond to phase changes, group objects of hot classes together in a separate copy space, and static analysis to exclude cold basic blocks from the reordering analysis. In experiments with Jikes RVM using MMTk on a range of Java programs, the overhead of dynamic class reordering is on average negligible and at most 1.9%. We compare class reordering with a number of static class oblivious orderings (e.g., breadth and depth first). The overall time variation between static orderings can be up to 25% and there is no consistent winner. In contrast, dynamic class reordering always matches or improves over the best static ordering since its history-based copying order tunes memory layout to program traversal.
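The copy-order idea behind the hot-field analysis might be sketched as follows. The object model and traversal here are simplified assumptions for illustration, not the collector's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: when a copying collector transports an object,
// it copies referents of fields marked hot (by the JIT's method
// sampling) before cold ones, so hot parent and child end up adjacent
// in to-space. The Obj type and traversal order are hypothetical.
public class HotFieldCopyOrderSketch {
    static class Obj {
        final String name;
        final List<Obj> fields = new ArrayList<>();
        final List<Boolean> hot = new ArrayList<>(); // per-field heat
        boolean copied = false;
        Obj(String name) { this.name = name; }
    }

    // Record the to-space copy order: hot referents travel with the
    // parent; cold referents follow later.
    static void copy(Obj o, List<String> order) {
        if (o == null || o.copied) return;
        o.copied = true;
        order.add(o.name);
        for (int i = 0; i < o.fields.size(); i++)
            if (o.hot.get(i)) copy(o.fields.get(i), order);
        for (int i = 0; i < o.fields.size(); i++)
            if (!o.hot.get(i)) copy(o.fields.get(i), order);
    }
}
```

Because the heat marks come from runtime sampling, the layout adapts to the program's actual traversal order, which is what lets the dynamic scheme match or beat every static ordering.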
@InProceedings{HBMM+:04,   author = {Xianglong Huang and Stephen M Blackburn and Kathryn S. McKinley and J.E.B. Moss and Zhenlin Wang and Perry Cheng},   title = {The Garbage Collection Advantage: Improving Program Locality},   booktitle = {Proceedings of the 2004 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages \& Applications (OOPSLA 2004), Vancouver, Canada, October 24-28, 2004},   year = {2004},   volume = {39},   number = {10},   series = {SIGPLAN Notices},   month = October, publisher = {ACM},   doi = {http://doi.acm.org/10.1145/1028976.1028983},   }
• S. M. Blackburn and K. S. McKinley, "Ulterior Reference Counting: Fast Garbage Collection without a Long Wait," in Proceedings of the 2003 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages \& Applications (OOPSLA 2003), Anaheim, CA, USA, October 26-30, 2003, 2003.
 http://doi.acm.org/10.1145/949305.949336
General purpose garbage collectors have yet to combine short pause times with high throughput. For example, generational collectors can achieve high throughput. They have modest average pause times, but occasionally collect the whole heap and consequently incur long pauses. At the other extreme, concurrent collectors, including reference counting, attain short pause times but with significant performance penalties. This paper introduces a new hybrid collector that combines copying generational collection for the young objects and reference counting for the old objects to achieve both goals. It restricts copying and reference counting to the object demographics for which they perform well. Key to our algorithm is a generalization of deferred reference counting we call Ulterior Reference Counting. Ulterior reference counting safely ignores mutations to select heap objects. We compare a generational reference counting hybrid with pure reference counting, pure mark-sweep, and hybrid generational mark-sweep collectors. This new collector combines excellent throughput, matching a high performance generational mark-sweep hybrid, with low maximum pause times.
@InProceedings{BM:03,   author = {Stephen M Blackburn and Kathryn S. McKinley},   title = {Ulterior Reference Counting: {F}ast Garbage Collection without a Long Wait},   booktitle = {Proceedings of the 2003 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages \& Applications (OOPSLA 2003), Anaheim, CA, USA, October 26-30, 2003},   year = {2003},   volume = {38},   number = {10},   series = {SIGPLAN Notices},   month = {October},   publisher = {ACM},   doi = {http://doi.acm.org/10.1145/949305.949336},   }
• S. M. Blackburn and K. S. McKinley, "In or Out? Putting write barriers in their place," in Proceedings of the Third International Symposium on Memory Management, ISMM ‘02, Berlin, Germany, 2002.
 http://doi.acm.org/10.1145/512429.512452
In many garbage collected systems, the mutator performs a write barrier for every pointer update. Using generational garbage collectors, we study in depth three code placement options for remembered-set write barriers: inlined, out-of-line, and partially inlined (fast path inlined, slow path out-of-line). The fast path determines if the collector needs to remember the pointer update. The slow path records the pointer in a list when necessary. Efficient implementations minimize the instructions on the fast path, and record few pointers (from 0.16 to 3% of pointer stores in our benchmarks). We find the mutator performs best with a partially inlined barrier, by a modest 1.5% on average over full inlining.

We also study the compilation cost of write-barrier code placement. We find that partial inlining reduces the compilation cost by 20 to 25% compared to full inlining. In the context of just-in-time compilation, the application is exposed to compiler activity. Regardless of the level of compiler activity, partial inlining consistently gives a total running time performance advantage over full inlining on the SPEC JVM98 benchmarks. When the compiler optimizes all application methods on demand and compiler load is highest, partial inlining improves total performance on average by 10.2%, and up to 18.5%.
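The fast-path/slow-path split studied above can be sketched roughly like this. The address ranges, class name, and remembered-set representation are illustrative assumptions, not the paper's implementation; in a partially inlined barrier only the filter would be inlined at each pointer store.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of a generational remembered-set write barrier.
// Fast path: a cheap filter, inlined at every pointer store. Slow path:
// out-of-line recording of the rare interesting stores. The nursery
// bounds below are hypothetical constants.
public class WriteBarrierSketch {
    static final long NURSERY_START = 0x1000_0000L;
    static final long NURSERY_END   = 0x2000_0000L;

    // Remembered set: mature-space slots that point into the nursery.
    static final Deque<Long> rememberedSet = new ArrayDeque<>();

    static boolean inNursery(long addr) {
        return addr >= NURSERY_START && addr < NURSERY_END;
    }

    // Fast path, conceptually inlined at each store of 'target' into
    // the slot at 'slotAddr'.
    static void writeBarrier(long slotAddr, long target) {
        if (!inNursery(slotAddr) && inNursery(target)) {
            remember(slotAddr);   // slow path, kept out of line
        }
    }

    // Slow path: record the slot for the next nursery collection.
    static void remember(long slotAddr) {
        rememberedSet.add(slotAddr);
    }
}
```

Since only 0.16 to 3% of pointer stores take the slow path, keeping it out of line shrinks compiled code (and compile time) at almost no cost to the mutator, which is the trade-off the paper quantifies.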

@InProceedings{BM:02,   author = {Stephen M. Blackburn and Kathryn S. McKinley},   title = {In or Out? {P}utting write barriers in their place},   month = jun, year = {2002},   booktitle = {Proceedings of the Third International Symposium on Memory Management, ISMM '02, Berlin, Germany},   series = {ACM SIGPLAN Notices},   volume = 37, publisher = {ACM Press},   doi = {http://doi.acm.org/10.1145/512429.512452},   }
• S. M. Blackburn, R. E. Jones, K. S. McKinley, and E. J. B. Moss, "Beltway: Getting Around Garbage Collection Gridlock," in Proceedings of SIGPLAN 2002 Conference on Programming Languages Design and Implementation, PLDI’02, Berlin, June, 2002, Berlin, Germany, 2002.
 http://doi.acm.org/10.1145/512529.512548
We present the design and implementation of a new garbage collection framework that significantly generalizes existing copying collectors. The Beltway framework exploits and separates object age and incrementality. It groups objects in one or more increments on queues called belts, collects belts independently, and collects increments on a belt in first-in-first-out order. We show that Beltway configurations, selected by command line options, act and perform the same as semi-space, generational, and older-first collectors, and encompass all previous copying collectors of which we are aware.

The increasing reliance on garbage collected languages such as Java requires that the collector perform well. We show that the generality of Beltway enables us to design and implement new collectors that are robust to variations in heap size and improve total execution time over the best generational copying collectors of which we are aware by up to 40%, and on average by 5 to 10%, for small to moderate heap sizes. New garbage collection algorithms are rare, and yet we define not just one, but a new family of collectors that subsumes previous work. This generality enables us to explore a larger design space and build better collectors.

@InProceedings{BJMM:02,   author = {Stephen M. Blackburn and Richard E. Jones and Kathryn S. McKinley and J. Eliot B. Moss},   title = {Beltway: Getting Around Garbage Collection Gridlock},   booktitle = {Proceedings of {SIGPLAN 2002} Conference on Programming Languages Design and Implementation, PLDI'02, Berlin, June, 2002},   year = {2002},   month = {June},   series = {ACM SIGPLAN Notices},   volume = {37(5)},   publisher = {ACM Press},   address = {Berlin, Germany},   doi = {http://doi.acm.org/10.1145/512529.512548},   }
• M. Hertz, S. M. Blackburn, E. J. B. Moss, K. S. McKinley, and D. Stefanovic, "Error-Free Garbage Collection Traces: How to Cheat and Not Get Caught," in SIGMETRICS 2002, International Conference on Measurements and Modeling of Computer Systems, June 15-19, 2002, Marina Del Rey, California, USA, Proceedings, 2002.
 http://doi.acm.org/10.1145/511334.511352
Programmers are writing a large and rapidly growing number of programs in object-oriented languages such as Java that require garbage collection (GC). To explore the design and evaluation of GC algorithms quickly, researchers are using simulation based on traces of object allocation and lifetime behavior. The brute force method generates perfect traces using a whole-heap GC at every potential GC point in the program. Because this process is prohibitively expensive, researchers often use granulated traces by collecting only periodically, e.g., every 32K bytes of allocation.

We extend the state of the art for simulating GC algorithms in two ways. First, we present a systematic methodology and results on the effects of trace granularity for a variety of copying GC algorithms. We show that trace granularity often distorts GC performance results compared with perfect traces, and that some GC algorithms are more sensitive to this effect than others. Second, we introduce and measure the performance of a new precise algorithm for generating GC traces which is over 800 times faster than the brute force method. Our algorithm, called Merlin, frequently timestamps objects and later uses the timestamps of dead objects to reconstruct precisely when they died. It performs only periodic garbage collections and achieves high accuracy at low cost, eliminating any reason to use granulated traces.
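The essence of the Merlin idea might be sketched as follows. This is a heavily compressed illustration under assumed names: the real algorithm derives timestamps from pointer updates and root events, and computes death times by propagating stamps backwards through the dead object graph.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of Merlin-style trace generation: keep a
// last-known-live timestamp per object, and when an (infrequent)
// collection finds an object unreachable, report its recorded stamp
// as its precise time of death. Class and method names are hypothetical.
public class MerlinSketch {
    static long clock = 0;                        // advances at potential GC points
    static final Map<Object, Long> lastLive = new HashMap<>();

    static void tick() { clock++; }

    // Stamp an object whose reachability may have just changed.
    static void stamp(Object o) { lastLive.put(o, clock); }

    // At a periodic collection, derive death times for all objects
    // not in the reachable set, and drop them from the table.
    static Map<Object, Long> collect(Set<Object> reachable) {
        Map<Object, Long> deaths = new HashMap<>();
        lastLive.keySet().removeIf(o -> {
            if (!reachable.contains(o)) {
                deaths.put(o, lastLive.get(o)); // died just after its last stamp
                return true;
            }
            return false;
        });
        return deaths;
    }
}
```

The point of the design is that precise lifetimes come from cheap timestamping plus occasional collections, rather than from a whole-heap collection at every allocation point.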

@InProceedings{HBMMS:02,   author = {Matthew Hertz and Stephen M. Blackburn and J. Eliot B. Moss and Kathryn S. McKinley and Darko Stefanovic},   title = {Error-Free Garbage Collection Traces: {H}ow to Cheat and Not Get Caught},   month = {June},   year = {2002},   booktitle = {SIGMETRICS 2002, International Conference on Measurements and Modeling of Computer Systems, June 15-19, 2002, Marina Del Rey, California, USA, Proceedings},   publisher = {ACM Press},   doi = {http://doi.acm.org/10.1145/511334.511352},   }
• D. Stefanovic, M. Hertz, S. M. Blackburn, K. S. McKinley, and E. J. B. Moss, "Older-first Garbage Collection in Practice: Evaluation in a Java Virtual Machine," in Proceedings of the First ACM SIGPLAN Workshop on Memory System Performance (MSP), Berlin, Germany, June 16, 2002, 2002.
 http://doi.acm.org/10.1145/773146.773042
Until recently, the best performing copying garbage collectors used a generational policy which repeatedly collects the very youngest objects, copies any survivors to an older space, and then infrequently collects the older space. A previous study that used garbage-collection simulation pointed to potential improvements by using an Older-First copying garbage collection algorithm. The Older-First algorithm sweeps a fixed-sized window through the heap from older to younger objects, and avoids copying the very youngest objects which have not yet had sufficient time to die. We describe and examine here an implementation of the Older-First algorithm in the Jikes RVM for Java. This investigation shows that Older-First can perform as well as the simulation results suggested, and greatly improves total program performance when compared to using a fixed-size nursery generational collector. We further compare Older-First to a flexible-size nursery generational collector in which the nursery occupies all of the heap that does not contain older objects. In these comparisons, the flexible-nursery collector is occasionally the better of the two, but on average the Older-First collector performs the best.
@InProceedings{SHBMM:02,   author = {Darko Stefanovic and Matthew Hertz and Stephen M. Blackburn and Kathryn S. McKinley and J. Eliot B. Moss},   title = {Older-first Garbage Collection in Practice: Evaluation in a Java Virtual Machine},   month = {June},   year = {2002},   booktitle = {Proceedings of the First ACM SIGPLAN Workshop on Memory System Performance (MSP), Berlin, Germany, June 16, 2002},   publisher = {ACM Press},   doi = {http://doi.acm.org/10.1145/773146.773042},   }
• S. M. Blackburn, S. Singhai, M. Hertz, K. S. McKinley, and E. J. B. Moss, "Pretenuring for Java," in Proceedings of the 2001 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages \& Applications (OOPSLA 2001), Tampa, Florida, October 14-18, 2001, 2001.
 http://doi.acm.org/10.1145/504282.504307
Pretenuring can reduce copying costs in garbage collectors by allocating long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on pretenuring as follows. (1) We produce pretenuring advice that is neutral with respect to the garbage collector algorithm and configuration. We thus can and do combine advice from different applications. We find that predictions using object lifetimes at each allocation site in Java programs are accurate, which simplifies the pretenuring implementation. (2) We gather and apply advice to applications and the Jalapeno JVM, a compiler and run-time system for Java written in Java. Our results demonstrate that building combined advice into Jalapeno from different application executions improves performance regardless of the application Jalapeno is compiling and executing. This build-time advice thus gives user applications some benefits of pretenuring without any application profiling. No previous work pretenures in the run-time system. (3) We find that application-only advice also improves performance, but that the combination of build-time and application-specific advice is almost always noticeably better. (4) Our same advice improves the performance of generational and Older First collection, illustrating that it is collector neutral.
@InProceedings{BSHMM:01,   author = {Stephen M Blackburn and Sharad Singhai and Matthew Hertz and Kathryn S. McKinley and J. Eliot B. Moss},   title = {Pretenuring for {J}ava},   booktitle = {Proceedings of the 2001 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages \& Applications (OOPSLA 2001), Tampa, Florida, October 14-18, 2001 },   year = {2001},   volume = {36},   number = {10},   series = {SIGPLAN Notices},   month = {October},   publisher = {ACM},   doi = {http://doi.acm.org/10.1145/504282.504307},   }
• S. M. Blackburn, R. Hudson, R. Morrison, E. J. B. Moss, D. Munro, and J. Zigman, "Starting with Termination: A Methodology for Building Distributed Garbage Collection Algorithms," in Australian Computer Science Conference, Proceedings, January 29–31, 2001, Gold Coast, Australia, 2001.
 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=906619
We propose an effective methodology in which a distributed garbage collector may be derived from a distributed termination algorithm and a centralized garbage collector in a manner that preserves interesting properties of the original collector, such as completeness. To illustrate our technique we show how two distributed termination algorithms, credit recovery and task balancing, may be suitably described; we then map four centralized garbage collectors, namely reference counting, mark/scan, a generational scheme, and the Mature Object Space collector (MOS), onto this description. The advantage of our approach is that, by separating the issues of distribution and collection, we alleviate the difficulty of inventing, understanding, and comparing distributed garbage collection techniques.
@InProceedings{blackburn01b,   author = {Stephen M Blackburn and Rick Hudson and Ron Morrison and J. Eliot B. Moss and David Munro and John Zigman},   title = {Starting with Termination: A Methodology for Building Distributed Garbage Collection Algorithms},   booktitle = {Australian Computer Science Conference, Proceedings, January 29--31, 2001, Gold Coast, Australia},   year = {2001},   doi = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=906619},   }
• A. Marquez and S. M. Blackburn, "Addressing Complexity and Scale in a High Performance Object Server." John Wiley and Sons, 2001.
The so-called ‘information explosion’ has led to an enormous demand for networked information, which has in turn placed enormous pressure on information servers.  The design of high performance server technology has thus become a hot topic in computer science. In this paper we describe how our exposure to a large real-world application has shaped our response to two key issues for high performance object server technologies—the management of complexity and the management of scale.

The Australian Bureau of Statistics’ Business Register (BR) is an OODB application that forms a register of all businesses in Australia.  The register is fundamental to much of the ABS’ economic survey activity and is used by many branches of the large organization.  The basis of the register is an object model that reflects the business structure of all Australian businesses, from large multinationals through to corner stores.  There are over 100,000,000 objects in the register and it is constantly updated, both through operator-driven data entry and the batch processing of data from other government agencies. Specific requirements are acceptable response time for interactive access for more than 100 users, long and short transaction support, efficient storage and online access to historic information, flexible object version views and schema evolution support.

The challenges raised by the BR are dominated by the problems of complexity and scale.  Our response to these challenges has been guided by our view that abstraction has a basic role in addressing complexity and scale, and by our use of Java as an implementation context.

In this paper we report not only our approach, but also novel technological outcomes which stem from our exposure to the challenges highlighted by the ABS-BR.  These outcomes include: portable orthogonally persistent Java, a system for object versioning, a proposal for schema versioning, and a system architecture for scalable object servers based on a separation of programming language and storage system concerns.  These outcomes are examined in the context of an ABS-BR technology demonstrator—a small-scale version of the ABS-BR built to test and showcase new technologies for high performance object servers.

@InCollection{MB:00,   author = {Alonso Marquez and Stephen M Blackburn},   title = {Addressing Complexity and Scale in a High Performance Object Server},   booktitle = {Succeeding with Object Databases},   publisher = {John Wiley and Sons},   year = 2001, chapter = 10, gscholar = 19, pdf = {opj-abs-chapter.pdf}, }
• Z. He, A. Marquez, and S. M. Blackburn, "Opportunistic Prioritised Clustering Framework (OPCF)," in Objects and Databases, International Symposium, Sophia Antipolis, France, June 13, 2000, Proceedings, 2001, pp. 86-100.
 http://dx.doi.org/10.1007/3-540-44677-X_6
Ever since the early days of database management systems, clustering has proven to be one of the most effective performance enhancement techniques for object oriented database management systems.  The bulk of the work in the area has been on static clustering algorithms which re-cluster the object base when the database is static.  However, this type of re-clustering cannot be used when 24-hour database access is required.  In such situations dynamic clustering is required, which allows the object base to be reclustered while the database is in operation.  We believe that most existing dynamic clustering algorithms lack three important properties.  These include: the use of opportunism to impose the smallest I/O footprint for re-organisation; the re-use of prior research on static clustering algorithms; and the prioritisation of re-clustering so that the worst clustered pages are re-clustered first.  In this paper, we present OPCF, a framework in which any existing static clustering algorithm can be made dynamic and given the desired properties of I/O opportunism and clustering prioritisation.  In addition, this paper presents a performance evaluation of the ideas suggested above and in particular shows the importance of I/O opportunism in improving the performance of dynamic clustering algorithms in a variety of situations.  The main contribution of this paper is the observation that existing static clustering algorithms, when transformed via a simple transformation framework such as OPCF, can produce dynamic clustering algorithms that out-perform complex existing dynamic algorithms, in a variety of situations.  This makes the solution presented in this paper particularly attractive to real OODBMS system implementers who often prefer to opt for simpler solutions.
@InProceedings{he00,   author = {Zhen He and Alonso Marquez and Stephen M Blackburn},   title = {Opportunistic Prioritised Clustering Framework {(OPCF)}},   booktitle = {Objects and Databases, International Symposium, Sophia Antipolis, France, June 13, 2000, Proceedings},   editor = {Klaus R. Dittrich and Giovanna Guerrini and Isabella Merlo and Marta Oliva and Elena Rodr\'{\i}guez},   year = 2001, series = {Lecture Notes in Computer Science},   publisher = {Springer},   volume = 1944, pages = {86--100},   doi = {http://dx.doi.org/10.1007/3-540-44677-X_6},   }
• J. Zigman, S. M. Blackburn, and E. J. B. Moss, "TMOS: A Transactional Garbage Collector," in Advances in Persistent Object Systems: Proceedings of the 9th International Workshop on Persistent Object Systems (POS9), September 6–8, 2000, Lillehammer, Norway, 2001.
 http://dx.doi.org/10.1007/3-540-45498-5_12
Defining persistence in terms of reachability is fundamental to achieving orthogonality of persistence. It is implicit to the principles of orthogonal persistence and is a part of the ODMG 3.0 data objects standard. Although space reclamation in the context of persistence by reachability can be achieved automatically using garbage collection, relatively few papers address the problem of implementing garbage collection in a transactional storage system. A transactional GC algorithm must operate correctly in the face of failure, and in particular must deal with the problem of transaction abort, which by undoing changes such as the deletion of references, subverts the GC reachability axiom of ‘once garbage, always garbage’. In this paper we make two key contributions. First, we present a generic approach to the design of transactional collectors that promotes clarity, generality, and understandability, and then using this approach, we present a new transactional garbage collection algorithm, TMOS. Our design approach brings together three independent components—a mutator, a transactional store, and a GC algorithm. TMOS represents the application of the Mature Object Space family of GC algorithms to the transactional context through our approach to transactional GC design.
@InProceedings{zigman01,   author = {John Zigman and Stephen M Blackburn and J Eliot B Moss},   title = {{TMOS}: A Transactional Garbage Collector},   booktitle = {Advances in Persistent Object Systems: Proceedings of the 9th International Workshop on Persistent Object Systems (POS9), September 6--8, 2000, Lillehammer, Norway},   editor = {Alan Dearle and Graham Kirby and Dag Sj{\o}berg},   year = 2001, publisher = {Springer},   url = {http://www.informatik.uni-trier.de/~ley/db/conf/pos},   doi = {http://dx.doi.org/10.1007/3-540-45498-5_12},   }
• A. Marquez, S. M. Blackburn, G. Mercer, and J. Zigman, "Implementing Orthogonally Persistent Java," in Advances in Persistent Object Systems: Proceedings of the 9th International Workshop on Persistent Object Systems (POS9), September 6–8, 2000, Lillehammer, Norway, 2001.
 http://dx.doi.org/10.1007/3-540-45498-5_22
Orthogonally persistent Java combines the power of abstraction over persistence with Java’s rich programming environment. In this paper we report our experience in designing and implementing orthogonally persistent Java. Our design approach is anchored by the view that any system that brings together Java and orthogonal persistence should as far as possible avoid diluting the strengths of Java or the principles of orthogonal persistence. Our approach is thus distinguished by three features: complete transparency of persistence, support for both intra- and inter-application concurrency through ACID transactions, and the preservation of Java’s property of portability. In addition to discussing design and implementation, we present results that show that our approach performs credibly.
@InProceedings{marquez01,   author = {Alonso Marquez and Stephen M Blackburn and Gavin Mercer and John Zigman},   title = {Implementing Orthogonally Persistent {J}ava},   booktitle = {Advances in Persistent Object Systems: Proceedings of the 9th International Workshop on Persistent Object Systems (POS9), September 6--8, 2000, Lillehammer, Norway},   editor = {Alan Dearle and Graham Kirby and Dag Sj{\o}berg},   year = {2001},   publisher = {Springer},   doi = {http://dx.doi.org/10.1007/3-540-45498-5_22},   }
• Z. He, S. M. Blackburn, L. Kirby, and J. Zigman, "Platypus: The design and implementation of a flexible high performance object store," in Advances in Persistent Object Systems: Proceedings of the 9th International Workshop on Persistent Object Systems (POS9), sep 6–8, 2000, Lillehammer, Norway, 2001.
 http://dx.doi.org/10.1007/3-540-45498-5_10
This paper reports the design and implementation of Platypus, a transactional object store. The twin goals of flexibility and performance dominate the design of Platypus. The design includes: strong support for SMP concurrency; stand-alone, client-server and client-peer distribution configurations; configurable logging and recovery; and object management which can accommodate garbage collection and clustering mechanisms. The first implementation of Platypus incorporates a number of innovations. (1) A new recovery algorithm derived from ARIES that removes the need for log sequence numbers to be present on store pages. (2) A zero-copy memory-mapped buffer manager with controlled write-back behavior. (3) New data structures for highly concurrent locking and map querying. We present performance results comparing Platypus with a streamlined implementation of the SHORE object store. For both medium and small OO7 workloads Platypus outperforms SHORE across a wide range of benchmark operations in both ‘hot’ and ‘cold’ settings.
@InProceedings{he01,   author = {Zhen He and Stephen M Blackburn and Luke Kirby and John Zigman},   title = {Platypus: The design and implementation of a flexible high performance object store},   booktitle = {Advances in Persistent Object Systems: Proceedings of the 9th International Workshop on Persistent Object Systems (POS9), sep 6--8, 2000, Lillehammer, Norway},   editor = {Alan Dearle and Graham Kirby and Dag Sj{\o}berg},   year = {2001},   publisher = {Springer},   url = {http://www.informatik.uni-trier.de/~ley/db/conf/pos},   doi = {http://dx.doi.org/10.1007/3-540-45498-5_10},   }
• A. Marquez, J. N. Zigman, and S. M. Blackburn, "Fast Portable Orthogonally Persistent Java," Software: Practice and Experience, vol. 30, iss. 4, pp. 449-479, 2000.
 http://dx.doi.org/10.1002/(SICI)1097-024X(20000410)30:4%3C449::AID-SPE306%3E3.3.CO;2-P
A powerful feature of the Java programming language is its user-definable class loading policy, which when combined with the namespace independence between class loaders, allows portable implementation of semi-dynamic program transformations.  Such transformations can be used for a range of purposes, including optimization and semantic extension.

In this paper we present a framework for semantic extensions in Java. This framework consists of a number of simple but powerful transformations that, among other things, allow us to semantically extend Java to provide orthogonal persistence.

The use of semi-dynamic program transformations lends our orthogonally persistent Java a number of important qualities, including simplicity, portability and a clean model of persistence. Significantly, our implementations are efficient and can in some cases outperform PJama, a well-known orthogonally persistent Java, which is based on a modified virtual machine.

In addition to describing the application of these transformations to orthogonally persistent Java, we foreshadow their use in a number of other contexts, including dynamic instance versioning and instrumentation.

@Article{marquez00,   author = {Alonso Marquez and John N Zigman and Stephen M Blackburn},   title = {Fast Portable Orthogonally Persistent {J}ava},   journal = {Software: Practice and Experience},   year = {2000},   volume = {30},   number = {4},   pages = {449--479},   month = {April},   doi = {http://dx.doi.org/10.1002/(SICI)1097-024X(20000410)30:4%3C449::AID-SPE306%3E3.3.CO;2-P},   }
• S. M. Blackburn and R. B. Stanton, "The Transactional Object Cache: A foundation for high performance persistent system construction.," in Advances in Persistent Object Systems: Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and Proceedings of the 3rd International Workshop on Persistence and Java (PJW3), aug 30– sep 1, 1998, Tiburon, CA, U.S.A., San Francisco, 1999, pp. 37-50.
This paper argues that caching, atomicity and layering are fundamental to persistent systems, and that the transactional object cache architecture, as an embodiment of these concerns, provides a foundation for high performance persistent system construction. Central to the paper is a description of the semantics of an abstract transactional object cache architecture that supports a wide range of transactional models and is open to a broad spectrum of transactional cache coherency algorithms. The centrality of the abstraction is a consequence of its role in facilitating the definition of a transactional interface, which is the key to the practical application of the transactional object cache architecture. The utility of the architectural framework in general, and the interface in particular, is argued for in the context of existing systems and systems currently under construction within that framework.
@InProceedings{blackburn99,   author = {Stephen M. Blackburn and Robin B. Stanton},   title = {The Transactional Object Cache: {A} foundation for high performance persistent system construction},   booktitle = {Advances in Persistent Object Systems: Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and Proceedings of the 3rd International Workshop on Persistence and Java (PJW3), aug 30-- sep 1, 1998, Tiburon, CA, U.S.A.},   editor = {Ronald Morrison and Mick Jordan and Malcolm Atkinson},   year = {1999},   pages = {37--50},   address = {San Francisco},   publisher = {Morgan Kaufmann},   }
• J. N. Zigman and S. M. Blackburn, "Java Finalize Method, Orthogonal Persistence and Transactions," in Advances in Persistent Object Systems: Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and Proceedings of the 3rd International Workshop on Persistence and Java (PJW3), aug 30– sep 1, 1998, Tiburon, CA, U.S.A., San Francisco, 1999, pp. 363-369.
Java is a popular, object oriented language that is runtime type safe. As such, it has been seen as an attractive basis for the implementation of orthogonally persistent systems by several research groups. Transactions are widely used as a means of enforcing consistency of the stable image in the face of concurrency, and have been adopted by most groups developing persistent Java systems. However, Java has a user definable finalize method which provides an asynchronous cleanup mechanism. The strict temporal semantics of transactions and the asynchrony of the finalize method seem at odds. This paper describes this conflict and provides a strategy for resolving the problem.
@InProceedings{zigman99,   author = {John N Zigman and Stephen M Blackburn},   title = {Java Finalize Method, Orthogonal Persistence and Transactions},   booktitle = {Advances in Persistent Object Systems: Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and Proceedings of the 3rd International Workshop on Persistence and Java (PJW3), aug 30-- sep 1, 1998, Tiburon, CA, U.S.A.},   editor = {Ronald Morrison and Mick Jordan and Malcolm Atkinson},   year = {1999},   pages = {363--369},   address = {San Francisco},   publisher = {Morgan Kaufmann},   }
• S. M. Blackburn and J. N. Zigman, "Concurrency—The fly in the ointment?," in Advances in Persistent Object Systems: Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and Proceedings of the 3rd International Workshop on Persistence and Java (PJW3), aug 30– sep 1, 1998, Tiburon, CA, U.S.A., San Francisco, 1999, pp. 250-258.
Concurrency is a central pillar of the Java programming language, is implicit in the transactional model of computation adopted by most persistent systems, and has been widely studied in the context of orthogonal persistence. We argue that despite the substantial literature on concurrency control and transaction models for orthogonal persistence, a basic question as to the interaction between concurrency and orthogonal persistence has yet to be adequately addressed. The demands of orthogonality appear to place orthogonal persistence at odds with more general approaches to concurrency control. Given its stated objective of providing a ‘complete’ computational environment, the future of the orthogonal persistence vision in general, and persistent Java in particular, will depend, to some extent, on the release of this tension through the realization of new approaches to the integration of concurrency and persistence. This paper addresses both strict transactional and more ‘open’ approaches to managing concurrency in the face of persistence, and examines difficulties with each approach. A simple transaction model that relieves some of these difficulties is presented and its limitations briefly examined.
@InProceedings{blackburn99c,   author = {Stephen M Blackburn and John N Zigman},   title = {Concurrency---{T}he fly in the ointment?},   booktitle = {Advances in Persistent Object Systems: Proceedings of the 8th International Workshop on Persistent Object Systems (POS8) and Proceedings of the 3rd International Workshop on Persistence and Java (PJW3), aug 30-- sep 1, 1998, Tiburon, CA, U.S.A.},   editor = {Ronald Morrison and Mick Jordan and Malcolm Atkinson},   year = {1999},   pages = {250--258},   address = {San Francisco},   publisher = {Morgan Kaufmann},   }
• S. M. Blackburn and R. B. Stanton, "Scalable Multicomputer Object Spaces: a Foundation for High Performance Systems," in Third International Working Conference on Massively Parallel Programming Models, nov 12–14, 1997, London, UK, Los Alamitos, CA, 1998.
The development of scalable architectures at the store levels of a layered model has concentrated on processor parallelism balanced against scalable memory bandwidth, primarily through distributed memory structures of one kind or another. A great deal of attention has been paid to hiding the distribution of memory to produce a single store image across the memory structure. It is unlikely that the distribution and concurrency aspects of scalable computing can be completely hidden at that level. The paper argues for a store layer which respects the need for caching and replication, and to do so at an “object” level granularity of memory use. These facets are interrelated through atomic processes, leading to an interface for the store which is strongly transactional in character. The paper describes the experimental performance of such a layer on a scalable multi-computer architecture. The behaviour of the store supports the view that a scalable cached “transactional” store architecture is a practical objective for high performance based on parallel computation across distributed memories.
@InProceedings{blackburn98d,   author = {Stephen M Blackburn and Robin B Stanton},   title = {Scalable Multicomputer Object Spaces: a Foundation for High Performance Systems},   booktitle = {Third International Working Conference on Massively Parallel Programming Models, nov 12--14, 1997, London, UK},   editor = {John Darlington},   year = {1998},   publisher = {IEEE},   address = {Los Alamitos, CA},   }