Research

My research interests are centered on the challenge of making software run faster and more power-efficiently on modern hardware.  My primary interests include: microarchitectural support for managed languages, fast and efficient garbage collection, and the design and implementation of virtual machines.  As a backdrop to this, I have a longstanding interest in the role of sound methodology and infrastructure in successful research innovation. Read more here.

News

My student Xi Yang has been awarded a 2012 Google PhD Fellowship in Energy Aware Computing.  Great work, Xi!
Our 2011 ASPLOS paper on the measurement of power and performance has been selected for the Communications of the ACM Research Highlights and IEEE Micro Top Picks 2011!
Daniel Frampton has won the 2011 CORE Australasian Distinguished Doctoral Dissertation Award for his dissertation “Garbage collection and the case for high-level low-level programming”.  Congratulations Daniel!

Select Recent Publications

  • R. Shahriyar, S. M. Blackburn, X. Yang, and K. S. McKinley, "Taking Off the Gloves with Reference Counting Immix," in OOPSLA '13: Proceedings of the 24th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2013.
    Despite some clear advantages and recent advances, reference counting remains a poor cousin to high-performance tracing garbage collectors. The advantages of reference counting include a) immediacy of reclamation, b) incrementality, and c) local scope of its operations. After decades of languishing with hopelessly bad performance, recent work narrowed the gap between reference counting and the fastest tracing collectors to within 10%. Though a major advance, this gap remains a substantial barrier to adoption in performance-conscious application domains.

    Our work identifies heap organization as the principal source of the remaining performance gap. We present the design, implementation, and analysis of a new collector, RC Immix, that replaces reference counting’s traditional free-list heap organization with the line and block heap structure introduced by the Immix collector. The key innovations of RC Immix are 1) to combine traditional reference counts with per-line live object counts to identify reusable memory and 2) to eliminate fragmentation by integrating copying with reference counting of new objects and with backup tracing cycle collection. In RC Immix, reference counting offers efficient collection and the line and block heap organization delivers excellent mutator locality and efficient allocation. With these advances, RC Immix closes the 10% performance gap, matching the performance of a highly tuned production generational collector. By removing the performance barrier, this work transforms reference counting into a serious alternative for meeting high performance objectives for garbage collected languages.

    @InProceedings{SBYM:13,
      author = {Shahriyar, Rifat and Blackburn, Stephen M. and Yang, Xi and McKinley, Kathryn S.},
      title = {Taking Off the Gloves with Reference Counting Immix},
      booktitle = {OOPSLA '13: Proceedings of the 24th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications},
      location = {Indianapolis, IN, USA},
      year = {2013},
      month = {oct},
      doi = {10.1145/2509136.2509527},
      }
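The line-and-block heap organization that RC Immix borrows from Immix can be illustrated with a toy model (this is a sketch for intuition, not the paper's implementation; `Block` and `LINES_PER_BLOCK` are illustrative names): each line in a block keeps a count of live objects, and a line whose count reaches zero can be recycled without a full collection.

```python
# Toy model of per-line live-object counts in a line/block heap.
# Illustrative only -- not the RC Immix implementation.

LINES_PER_BLOCK = 4

class Block:
    def __init__(self):
        self.line_counts = [0] * LINES_PER_BLOCK  # live objects per line

    def allocate(self, line):
        self.line_counts[line] += 1   # a new object lands on this line

    def object_died(self, line):
        self.line_counts[line] -= 1   # reference count of an object hit zero

    def reusable_lines(self):
        # Lines holding no live objects can be reused for fresh allocation.
        return [i for i, n in enumerate(self.line_counts) if n == 0]

b = Block()
b.allocate(0); b.allocate(0); b.allocate(2)
b.object_died(2)
print(b.reusable_lines())  # [1, 2, 3]
```

The coarse per-line count is what lets reference counting identify reusable memory at line granularity rather than managing a free list per object.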
  • T. Gao, K. Strauss, S. M. Blackburn, K. S. McKinley, D. Burger, and J. Larus, "Using Managed Runtime Systems to Tolerate Holes in Wearable Memories," in PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, Seattle, WA, USA, 2013.
    New memory technologies, such as phase-change memory (PCM), promise denser and cheaper main memory, and are expected to displace DRAM. However, many of them experience permanent failures far more quickly than DRAM. DRAM mechanisms that handle permanent failures rely on very low failure rates and, if directly applied to PCM, are extremely inefficient: Discarding a page when the first line fails wastes 98% of the memory.

    This paper proposes low complexity cooperative software and hardware that handle failure rates as high as 50%. Our approach makes error handling transparent to the application by using the memory abstraction offered by managed languages. Once hardware error correction for a memory line is exhausted, rather than discarding the entire page, the hardware communicates the failed line to a failure-aware OS and runtime. The runtime ensures memory allocations never use failed lines and moves data when lines fail during program execution. This paper describes minimal extensions to an Immix mark-region garbage collector, which correctly utilizes pages with failed physical lines by skipping over failures. This paper also proposes hardware support that clusters failed lines at one end of a memory region to reduce fragmentation and improve performance under failures. Contrary to accepted hardware wisdom that advocates for wear-leveling, we show that with software support non-uniform failures delay the impact of memory failure. Together, these mechanisms incur no performance overhead when there are no failures and at failure levels of 10% to 50% suffer only an average overhead of 4% and 12%, respectively. These results indicate that hardware and software cooperation can greatly extend the life of wearable memories.

    @InProceedings{GSB+:13,
      author = {Gao, Tiejun and Strauss, Karin and Blackburn, Stephen M. and McKinley, Kathryn S. and Burger, Doug and Larus, James},
      title = {Using Managed Runtime Systems to Tolerate Holes in Wearable Memories},
      booktitle = {PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, Seattle, WA, USA, June 2013},
      year = {2013},
      month = {jun},
      series = {ACM SIGPLAN Notices},
      volume = {48(6)},
      publisher = {ACM Press},
      }
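The idea of an allocator skipping over failed lines can be sketched as follows (a hedged illustration, not the paper's system; `usable_extents` and the line size are made up for the example): given a failure map reported by hardware, the runtime only ever allocates into contiguous runs of healthy lines.

```python
# Illustrative sketch: enumerate the usable extents of a page whose
# failed memory lines must be skipped by the allocator.

def usable_extents(num_lines, failed):
    """Yield (start_line, end_line) runs of contiguous non-failed lines."""
    run_start = None
    for i in range(num_lines):
        if i in failed:
            if run_start is not None:
                yield (run_start, i - 1)   # a failed line ends the run
                run_start = None
        elif run_start is None:
            run_start = i                  # a healthy line starts a run
    if run_start is not None:
        yield (run_start, num_lines - 1)

# A page of 8 lines where lines 2 and 5 have worn out:
print(list(usable_extents(8, {2, 5})))  # [(0, 1), (3, 4), (6, 7)]
```

Clustering failed lines at one end of a region, as the paper's hardware support does, would reduce the number of such extents and hence fragmentation.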
  • V. Kumar, D. Frampton, S. M. Blackburn, D. Grove, and O. Tardieu, "Work-Stealing Without The Baggage," in Proceedings of the 2012 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA 2012), Tucson, AZ, October 19-26, 2012.
    Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. Prior work has demonstrated that work-stealing is very effective in practice. However, work-stealing comes with a substantial overhead: as much as 2X to 12X slowdown over orthodox sequential code.

    In this paper we identify the key sources of overhead in work-stealing schedulers and present two significant refinements to their implementation. We evaluate our work-stealing designs using a range of benchmarks, four different work-stealing implementations, including the popular fork-join framework, and a range of architectures. On these benchmarks, compared to orthodox sequential Java, our fastest design has an overhead of just 15%. By contrast, fork-join has a 2.3X overhead and the previous implementation of the system we use has an overhead of 4.1X. These results and our insight into the sources of overhead for work-stealing implementations give further hope to an already promising technique for exploiting increasingly available hardware parallelism.

    @InProceedings{KFB+:12,
      author = {Kumar, Vivek and Frampton, Daniel and Blackburn, Stephen M. and Grove, David and Tardieu, Olivier},
      title = {Work-Stealing Without The Baggage},
      booktitle = {Proceedings of the 2012 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA 2012), Tucson, AZ, October 19-26, 2012},
      year = {2012},
      volume = {47},
      number = {10},
      series = {SIGPLAN Notices},
      month = {oct},
      publisher = {ACM},
      }
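The work-stealing discipline the paper builds on can be shown with a toy, single-threaded deque (no synchronization, so this is not a real scheduler; the class and method names are illustrative): the owning worker pushes and pops at one end, while idle workers steal from the other, taking the oldest tasks.

```python
# Toy illustration of a work-stealing deque's access discipline.

from collections import deque

class WorkStealingDeque:
    def __init__(self):
        self.tasks = deque()

    def push(self, task):           # owner: add newly spawned work
        self.tasks.append(task)

    def pop(self):                  # owner: take newest work (LIFO)
        return self.tasks.pop() if self.tasks else None

    def steal(self):                # thief: take oldest work (FIFO)
        return self.tasks.popleft() if self.tasks else None

q = WorkStealingDeque()
for t in ["t1", "t2", "t3"]:
    q.push(t)
print(q.pop())    # t3 -- owner works newest-first, for locality
print(q.steal())  # t1 -- a thief takes the oldest (typically largest) task
```

The overheads the paper attacks come not from this core structure but from the bookkeeping real implementations add around it, such as maintaining stealable state on every potentially parallel call.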
  • H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, "Looking back and looking forward: power, performance, and upheaval," Communications of the ACM, vol. 55, iss. 7, pp. 105-114, 2012.
    The past 10 years have delivered two significant revolutions. (1) Microprocessor design has been transformed by the limits of chip power, wire latency, and Dennard scaling—leading to multicore processors and heterogeneity. (2) Managed languages and an entirely new software landscape emerged—revolutionizing how software is deployed, is sold, and interacts with hardware. Researchers most often examine these changes in isolation. Architects mostly grapple with microarchitecture design through the narrow software context of native sequential SPEC CPU benchmarks, while language researchers mostly consider microarchitecture in terms of performance alone. This work explores the clash of these two revolutions over the past decade by measuring power, performance, energy, and scaling, and considers what the results may mean for the future. Our diverse findings include the following: (a) native sequential workloads do not approximate managed workloads or even native parallel workloads; (b) diverse application power profiles suggest that future applications and system software will need to participate in power optimization and management; and (c) software and hardware researchers need access to real measurements to optimize for power and energy.
    @article{ECYBM:12,
      author = {Esmaeilzadeh, Hadi and Cao, Ting and Yang, Xi and Blackburn, Stephen M. and McKinley, Kathryn S.},
      title = {Looking back and looking forward: power, performance, and upheaval},
      journal = {Communications of the ACM},
      issue_date = {July 2012},
      volume = {55},
      number = {7},
      month = {jul},
      year = {2012},
      issn = {0001-0782},
      pages = {105--114},
      numpages = {10},
      doi = {10.1145/2209249.2209272},
      acmid = {2209272},
      publisher = {ACM},
      address = {New York, NY, USA},
      }
  • H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, "What is Happening to Power, Performance, and Software?," IEEE Micro, vol. 32, pp. 110-121, 2012.
    Systematically exploring power, performance and energy sheds new light on the clash of two trends that unfolded over the past decade: the rise of parallel processors in response to technology constraints on power, clock speed, and wire delay; and the rise of managed high-level, portable programming languages.
    @article{ECYBM:12b,
      author = {Esmaeilzadeh, Hadi and Cao, Ting and Yang, Xi and Blackburn, Stephen M. and McKinley, Kathryn S.},
      title = {What is Happening to Power, Performance, and Software?},
      journal ={IEEE Micro},
      volume = {32},
      issn = {0272-1732},
      year = {2012},
      pages = {110--121},
      doi = {10.1109/MM.2012.20},
      publisher = {IEEE Computer Society},
      address = {Los Alamitos, CA, USA},
      }
  • R. Shahriyar, S. M. Blackburn, and D. Frampton, "Down for the Count? Getting Reference Counting Back in the Ring," in Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM '12, Beijing, China, June 15-16, 2012.
    Reference counting and tracing are the two fundamental approaches that have underpinned garbage collection since 1960. However, despite some compelling advantages, reference counting is almost completely ignored in implementations of high performance systems today. In this paper we take a detailed look at reference counting to understand its behavior and to improve its performance. We identify key design choices for reference counting and analyze how the behavior of a wide range of benchmarks might affect design decisions. As far as we are aware, this is the first such quantitative study of reference counting. We use insights gleaned from this analysis to introduce a number of optimizations that significantly improve the performance of reference counting.

    We find that an existing modern implementation of reference counting has an average 30% overhead compared to tracing, and that in combination, our optimizations are able to completely eliminate that overhead. This brings the performance of reference counting on par with that of a well tuned mark-sweep collector. We keep our in-depth analysis of reference counting as general as possible so that it may be useful to other garbage collector implementers. Our finding that reference counting can be made directly competitive with well tuned mark-sweep should shake the community’s prejudices about reference counting and perhaps open new opportunities for exploiting reference counting’s strengths, such as localization and immediacy of reclamation.

    @InProceedings{SBF:12,
      author = {Shahriyar, Rifat and Blackburn, Stephen M. and Frampton, Daniel},
      title = {Down for the Count? {G}etting Reference Counting Back in the Ring},
      booktitle = {Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM '12, Beijing, China, June 15-16},
      year = {2012},
      results = {rc-ismm-2012.zip},
      month = {jun},
      doi = {10.1145/2258996.2259008},
      location = {Beijing, China},
      }
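The "immediacy of reclamation" that makes reference counting attractive can be seen in a textbook sketch of plain, non-deferred counting (an illustration only; the paper's optimized collector defers and coalesces count updates rather than applying them eagerly like this):

```python
# Minimal sketch of naive reference counting: an object is reclaimed
# the moment its count reaches zero, with no collection pause.

reclaimed = []

class Obj:
    def __init__(self, name):
        self.name, self.rc = name, 0

def inc(o):
    o.rc += 1          # a new reference to o was created

def dec(o):
    o.rc -= 1          # a reference to o was destroyed
    if o.rc == 0:      # last reference gone: reclaim immediately
        reclaimed.append(o.name)

a = Obj("a")
inc(a); inc(a)   # two references to a
dec(a)           # one dropped: still live
dec(a)           # last dropped: reclaimed at once
print(reclaimed)  # ['a']
```

The 30% overhead the paper measures comes largely from performing this `inc`/`dec` bookkeeping on every pointer mutation, which is precisely what its optimizations reduce.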
  • T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley, "The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software," in ISCA '12: The 39th International Symposium on Computer Architecture, 2012.
    On the hardware side, asymmetric multicore processors present software with the challenge and opportunity of optimizing in two dimensions: performance and power. Asymmetric multicore processors (AMP) combine general-purpose big (fast, high power) cores and small (slow, low power) cores to meet power constraints. Realizing their energy efficiency opportunity requires workloads with differentiated performance and power characteristics.

    On the software side, managed workloads written in languages such as C#, Java, JavaScript, and PHP are ubiquitous. Managed languages abstract over hardware using Virtual Machine (VM) services (garbage collection, interpretation, and/or just-in-time compilation) that together impose substantial energy and performance costs, ranging from 10% to over 80%. We show that these services manifest a differentiated performance and power workload. To differing degrees, they are parallel, asynchronous, communicate infrequently, and are not on the application’s critical path.

    We identify a synergy between AMP and VM services that we exploit to attack the 40% average energy overhead due to VM services. Using measurements and very conservative models, we show that adding small cores tailored for VM services should deliver, at least, improvements in performance of 13%, energy of 7%, and performance per energy of 22%. The yin of VM services is overhead, but it meets the yang of small cores on an AMP. The yin of AMP is exposed hardware complexity, but it meets the yang of abstraction in managed languages. VM services fulfill the AMP requirement for an asynchronous, non-critical, differentiated, parallel, and ubiquitous workload to deliver energy efficiency. Generalizing this approach beyond system software to applications will require substantially more software and hardware investment, but these results show the potential energy efficiency gains are significant.

    @Inproceedings{CBGM:12,
      author = {Cao, Ting and Blackburn, Stephen M. and Gao, Tiejun and McKinley, Kathryn S.},
      title = {The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software},
      booktitle = {ISCA '12: The 39th International Symposium on Computer Architecture},
      year = {2012},
      location = {Portland, OR},
      results = {yinyang-isca-2012.zip},
      publisher = {IEEE},
      }
  • X. Yang, S. M. Blackburn, D. Frampton, and A. L. Hosking, "Barriers Reconsidered, Friendlier Still!," in Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM '12, Beijing, China, June 15-16, 2012.
    Read and write barriers mediate access to the heap allowing the collector to control and monitor mutator actions. For this reason, barriers are a powerful tool in the design of any heap management algorithm, but the prevailing wisdom is that they impose significant costs. However, changes in hardware and workloads make these costs a moving target. Here, we measure the cost of a range of useful barriers on a range of modern hardware and workloads. We confirm some old results and overturn others. We evaluate the microarchitectural sensitivity of barrier performance and the differences among benchmark suites. We also consider barriers in context, focusing on their behavior when used in combination, and investigate a known pathology and evaluate solutions. Our results show that read and write barriers have average overheads as low as 5.4% and 0.9% respectively. We find that barrier overheads are more exposed on the workload provided by the modern DaCapo benchmarks than on old SPECjvm98 benchmarks. Moreover, there are differences in barrier behavior between in-order and out-of-order machines, and their respective memory subsystems, which indicate different barrier choices for different platforms. These changing costs mean that algorithm designers need to reconsider their design choices and the nature of their resulting algorithms in order to exploit the opportunities presented by modern hardware.
    @InProceedings{YBFH:12,
      author = {Yang, Xi and Blackburn, Stephen M. and Frampton, Daniel and Hosking, Antony L.},
      title = {Barriers Reconsidered, Friendlier Still!},
      booktitle = {Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM '12, Beijing, China, June 15-16},
      year = {2012},
      results = {barrier-ismm-2012.zip},
      month = {jun},
      doi = {10.1145/2258996.2259004},
      location = {Beijing, China},
      }
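One barrier style in the design space the paper measures is an object-remembering write barrier, sketched here as a toy (illustrative names, not any VM's API): every pointer store also records the modified source object in a remembered set, so a later generational or concurrent collection need only scan remembered objects.

```python
# Toy illustration of an object-remembering write barrier.

remembered = set()

def write_barrier(src, field, target):
    src[field] = target        # the actual pointer store
    remembered.add(id(src))    # barrier bookkeeping: remember the source

parent, child = {}, {}
write_barrier(parent, "child", child)
print(id(parent) in remembered)  # True
```

The paper's point is that the cost of this extra bookkeeping on every store is now small (under 1% on average for write barriers) and varies with microarchitecture, so the barrier choice should too.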
  • Y. Lin, S. M. Blackburn, and D. Frampton, "Unpicking The Knot: Teasing Apart VM/Application Interdependencies," in VEE '12: Proceedings of the 2012 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, London, UK, 2012.
    Flexible and efficient runtime design requires an understanding of the dependencies among the components internal to the runtime and those between the application and the runtime. These dependencies are frequently unclear. This problem exists in all runtime design, and is most vivid in a metacircular runtime — one that is implemented in terms of itself. Metacircularity blurs boundaries between application and runtime implementation, making it harder to understand and make guarantees about overall system behavior, affecting isolation, security, and resource management, as well as reducing opportunities for optimization. Our goal is to shed new light on VM interdependencies, helping all VM designers understand these dependencies and thereby engineer better runtimes.

    We explore these issues in the context of a high-performance Java-in-Java virtual machine. Our approach is to identify and instrument transition points into and within the runtime, which allows us to establish a dynamic execution context. Our contributions are: 1) implementing and measuring a system that dynamically maintains execution context with very low overhead, 2) demonstrating that such a framework can be used to improve the software engineering of an existing runtime, and 3) analyzing the behavior and runtime characteristics of our runtime across a wide range of benchmarks. Our solution provides clarity about execution state and allowable transitions, making it easier to develop, debug, and understand managed runtimes.

    @Inproceedings{LBF:12,
      author = {Lin, Yi and Blackburn, Stephen M. and Frampton, Daniel},
      title = {Unpicking The Knot: Teasing Apart VM/Application Interdependencies},
      booktitle = {VEE '12: Proceedings of the 2012 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},
      year = {2012},
      location = {London, UK},
      doi = {10.1145/2151024.2151048},
      publisher = {ACM},
      address = {New York, NY, USA},
      }
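The idea in the last paper of instrumenting transition points to maintain a dynamic execution context can be sketched in miniature (a loose analogy, not the paper's mechanism or API; all names here are invented for illustration): each boundary crossing into a runtime service pushes a context, so the runtime always knows on whose behalf it is executing.

```python
# Toy sketch: tracking application/runtime execution context via
# explicit transition points, simulated with a context manager.

from contextlib import contextmanager

current_context = ["application"]   # a stack of execution contexts

@contextmanager
def in_context(name):
    current_context.append(name)    # transition into a runtime service
    try:
        yield
    finally:
        current_context.pop()       # restore the caller's context on exit

def allocate():
    # A runtime service can ask which context invoked it:
    return current_context[-1]

with in_context("jit"):
    assert allocate() == "jit"      # called from within the JIT
assert allocate() == "application"  # back in application code
print("context transitions tracked")
```

Making such transitions explicit is what lets the paper reason about, and enforce, which application/runtime interdependencies are allowable.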

A full list of my publications appears here.

Prospective Students

I’m always looking for bright students.  If you’re interested in doing research work with me, please read this before you contact me.