Research

My research interests center on the challenge of making software run faster and more power-efficiently on modern hardware. My primary interests include microarchitectural support for managed languages, fast and efficient garbage collection, and the design and implementation of virtual machines. As a backdrop to this, I have a longstanding interest in the role of sound methodology and infrastructure in successful research innovation. Read more here.

News

Our 2011 ASPLOS paper on the measurement of power and performance has been selected for the Communications of the ACM Research Highlights and IEEE Micro Top Picks 2011!
Is performance evaluation an important element of the computer systems research landscape? I think so, and I think it is undervalued. If this interests you, please consider attending Evaluate’12, and take a look at the Evaluate Collaboratory.
Daniel Frampton has won the 2011 CORE Australasian Distinguished Doctoral Dissertation Award for his dissertation “Garbage collection and the case for high-level low-level programming”. Congratulations Daniel!

Recent Publications

  • R. Shahriyar, S. M. Blackburn, and D. Frampton, "Down for the Count? Getting Reference Counting Back in the Ring," in Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM ‘12, Beijing, China, June 15-16, 2012.
    FOR subject classification codes: 080308, 100604
    Reference counting and tracing are the two fundamental approaches that have underpinned garbage collection since 1960. However, despite some compelling advantages, reference counting is almost completely ignored in implementations of high performance systems today. In this paper we take a detailed look at reference counting to understand its behavior and to improve its performance. We identify key design choices for reference counting and analyze how the behavior of a wide range of benchmarks might affect design decisions. As far as we are aware, this is the first such quantitative study of reference counting. We use insights gleaned from this analysis to introduce a number of optimizations that significantly improve the performance of reference counting.

    We find that an existing modern implementation of reference counting has an average 30% overhead compared to tracing, and that in combination, our optimizations are able to completely eliminate that overhead. This brings the performance of reference counting on par with that of a well tuned mark-sweep collector. We keep our in-depth analysis of reference counting as general as possible so that it may be useful to other garbage collector implementers. Our finding that reference counting can be made directly competitive with well tuned mark-sweep should shake the community’s prejudices about reference counting and perhaps open new opportunities for exploiting reference counting’s strengths, such as localization and immediacy of reclamation.

    @InProceedings{SBF:12,
      author = {Shahriyar, Rifat and Blackburn, Stephen M. and Frampton, Daniel},
      title = {Down for the Count? {G}etting Reference Counting Back in the Ring},
      booktitle = {Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM '12, Beijing, China, June 15-16},
      year = {2012},
      results = {rc-ismm-2012.zip},
      month = {jun},
      location = {Beijing, China},
      }
  • X. Yang, S. M. Blackburn, D. Frampton, and A. L. Hosking, "Barriers Reconsidered, Friendlier Still!," in Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM ‘12, Beijing, China, June 15-16, 2012.
    FOR subject classification codes: 080308, 100604
    Read and write barriers mediate access to the heap allowing the collector to control and monitor mutator actions. For this reason, barriers are a powerful tool in the design of any heap management algorithm, but the prevailing wisdom is that they impose significant costs. However, changes in hardware and workloads make these costs a moving target. Here, we measure the cost of a range of useful barriers on a range of modern hardware and workloads. We confirm some old results and overturn others. We evaluate the microarchitectural sensitivity of barrier performance and the differences among benchmark suites. We also consider barriers in context, focusing on their behavior when used in combination, and investigate a known pathology and evaluate solutions. Our results show that read and write barriers have average overheads as low as 5.4% and 0.9% respectively. We find that barrier overheads are more exposed on the workload provided by the modern DaCapo benchmarks than on old SPECjvm98 benchmarks. Moreover, there are differences in barrier behavior between in-order and out-of-order machines, and their respective memory subsystems, which indicate different barrier choices for different platforms. These changing costs mean that algorithm designers need to reconsider their design choices and the nature of their resulting algorithms in order to exploit the opportunities presented by modern hardware.
    @InProceedings{YBFH:12,
      author = {Yang, Xi and Blackburn, Stephen M. and Frampton, Daniel and Hosking, Antony L.},
      title = {Barriers Reconsidered, Friendlier Still!},
      booktitle = {Proceedings of the Eleventh ACM SIGPLAN International Symposium on Memory Management, ISMM '12, Beijing, China, June 15-16},
      year = {2012},
      results = {barrier-ismm-2012.zip},
      month = {jun},
      location = {Beijing, China},
      }
  • T. Cao, S. M. Blackburn, T. Gao, and K. S. McKinley, "The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software," in ISCA ‘12: The 39th International Symposium on Computer Architecture, 2012.
    On the hardware side, asymmetric multicore processors present software with the challenge and opportunity of optimizing in two dimensions: performance and power. Asymmetric multicore processors (AMP) combine general-purpose big (fast, high power) cores and small (slow, low power) cores to meet power constraints. Realizing their energy efficiency opportunity requires workloads with differentiated performance and power characteristics.

    On the software side, managed workloads written in languages such as C#, Java, JavaScript, and PHP are ubiquitous. Managed languages abstract over hardware using Virtual Machine (VM) services (garbage collection, interpretation, and/or just-in-time compilation) that together impose substantial energy and performance costs, ranging from 10% to over 80%. We show that these services manifest a differentiated performance and power workload. To differing degrees, they are parallel, asynchronous, communicate infrequently, and are not on the application’s critical path.

    We identify a synergy between AMP and VM services that we exploit to attack the 40% average energy overhead due to VM services. Using measurements and very conservative models, we show that adding small cores tailored for VM services should deliver, at least, improvements in performance of 13%, energy of 7%, and performance per energy of 22%. The yin of VM services is overhead, but it meets the yang of small cores on an AMP. The yin of AMP is exposed hardware complexity, but it meets the yang of abstraction in managed languages. VM services fulfill the AMP requirement for an asynchronous, non-critical, differentiated, parallel, and ubiquitous workload to deliver energy efficiency. Generalizing this approach beyond system software to applications will require substantially more software and hardware investment, but these results show the potential energy efficiency gains are significant.

    @Inproceedings{CBGM:12,
      author = {Cao, Ting and Blackburn, Stephen M. and Gao, Tiejun and McKinley, Kathryn S.},
      title = {The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software},
      booktitle = {ISCA '12: The 39th International Symposium on Computer Architecture},
      year = {2012},
      location = {Portland, OR},
      results = {yinyang-isca-2012.zip},
      publisher = {IEEE},
      }
  • Y. Lin, S. M. Blackburn, and D. Frampton, "Unpicking The Knot: Teasing Apart VM/Application Interdependencies," in VEE ‘12: Proceedings of the 2012 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, New York, NY, USA, 2012.
    FOR subject classification codes: 080308, 080309
    Flexible and efficient runtime design requires an understanding of the dependencies among the components internal to the runtime and those between the application and the runtime. These dependencies are frequently unclear. This problem exists in all runtime design, and is most vivid in a metacircular runtime — one that is implemented in terms of itself. Metacircularity blurs boundaries between application and runtime implementation, making it harder to understand and make guarantees about overall system behavior, affecting isolation, security, and resource management, as well as reducing opportunities for optimization. Our goal is to shed new light on VM interdependencies, helping all VM designers understand these dependencies and thereby engineer better runtimes.

    We explore these issues in the context of a high-performance Java-in-Java virtual machine. Our approach is to identify and instrument transition points into and within the runtime, which allows us to establish a dynamic execution context. Our contributions are: 1) implementing and measuring a system that dynamically maintains execution context with very low overhead, 2) demonstrating that such a framework can be used to improve the software engineering of an existing runtime, and 3) analyzing the behavior and runtime characteristics of our runtime across a wide range of benchmarks. Our solution provides clarity about execution state and allowable transitions, making it easier to develop, debug, and understand managed runtimes.

    @Inproceedings{LBF:12,
      author = {Lin, Yi and Blackburn, Stephen M. and Frampton, Daniel},
      title = {Unpicking The Knot: Teasing Apart VM/Application Interdependencies},
      booktitle = {VEE '12: Proceedings of the 2012 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},
      year = {2012},
      location = {London, UK},
      doi = {10.1145/2151024.2151048},
      publisher = {ACM},
      address = {New York, NY, USA},
      }
  • X. Yang, S. M. Blackburn, D. Frampton, J. Sartor, and K. S. McKinley, "Why Nothing Matters: The Impact of Zeroing," in Proceedings of the 2011 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages & Applications (OOPSLA 2011), Portland, OR, October 22-27, 2011.
    Memory safety defends against inadvertent and malicious misuse of memory that may compromise program correctness and security. A critical element of memory safety is zero initialization. The direct cost of zero initialization is surprisingly high: up to 12.7%, with average costs ranging from 2.7 to 4.5% on a high performance virtual machine on IA32 architectures. Zero initialization also incurs indirect costs due to its memory bandwidth demands and cache displacement effects. Existing virtual machines either: a) minimize direct costs by zeroing in large blocks, or b) minimize indirect costs by zeroing in the allocation sequence, which reduces cache displacement and bandwidth. This paper evaluates the two widely used zero initialization designs, showing that they make different tradeoffs to achieve very similar performance.

    Our analysis inspires three better designs: (1) bulk zeroing with cache-bypassing (non-temporal) instructions to reduce the direct and indirect zeroing costs simultaneously, (2) concurrent non-temporal bulk zeroing that exploits parallel hardware to move work off the application’s critical path, and (3) adaptive zeroing, which dynamically chooses between (1) and (2) based on available hardware parallelism. The new software strategies offer speedups sometimes greater than the direct overhead, improving total performance by 3% on average. Our findings invite additional optimizations and microarchitectural support.

    @InProceedings{YBF+:11,
      author = {Xi Yang and Stephen M Blackburn and Daniel Frampton and Jennifer Sartor and Kathryn S McKinley},
      title = {Why Nothing Matters: The Impact of Zeroing},
      booktitle = {Proceedings of the 2011 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages \& Applications (OOPSLA 2011), Portland, OR, October 22-27, 2011},
      year = {2011},
      volume = {46},
      number = {10},
      series = {SIGPLAN Notices},
      month = {oct},
      publisher = {ACM},
      }
  • R. Garner, S. M. Blackburn, and D. Frampton, "A Comprehensive Evaluation of Object Scanning Techniques," in Proceedings of the Tenth ACM SIGPLAN International Symposium on Memory Management, ISMM ‘11, San Jose, CA, USA, June 4 – 5, 2011.
    FOR subject classification codes: 080308, 100604
    At the heart of all garbage collectors lies the process of identifying and processing reference fields within an object. Despite its key role, and evidence of many different implementation approaches, to our knowledge no comprehensive quantitative study of this design space exists. The lack of such a study means that implementers must rely on ‘conventional wisdom’, hearsay, and their own costly analysis. Starting with mechanisms described in the literature and a variety of permutations of these, we explore the impact of a number of dimensions including: a) the choice of data structure, b) levels of indirection from object to metadata, and c) specialization of scanning code. We perform a comprehensive examination of these tradeoffs on four different architectures using eighteen benchmarks and hardware performance counters. We inform the choice of mechanism with a detailed study of heap composition and object structure as seen by the garbage collector on these benchmarks. Our results show that choice of scanning mechanism is important. We find that a careful choice of scanning mechanism alone can improve garbage collection performance by 16% and total time by 2.5%, on average, over a well tuned baseline. We observe substantial variation in performance among architectures, and find that some mechanisms–particularly specialization, layout of reference fields in objects, and encoding metadata in object headers–yield consistent, significant advantages.
    @InProceedings{GBF:11,
      author = {Robin Garner and Stephen M Blackburn and Daniel Frampton},
      title = {A Comprehensive Evaluation of Object Scanning Techniques},
      booktitle = {Proceedings of the Tenth ACM SIGPLAN International Symposium on Memory Management, ISMM '11, San Jose, CA, USA, June 4 - 5},
      year = {2011},
      month = {jun},
      doi = {10.1145/1993478.1993484},
      location = {San Jose, CA, USA},
    }
  • H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, "Looking Back on the Language and Hardware Revolutions: Measured Power, Performance, and Scaling," in Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, Newport Beach, CA, USA, March 5 – 11, 2011.
    FOR subject classification codes: 100605, 080308, 100606
    This paper reports and analyzes measured chip power and performance on five process technology generations executing 61 diverse benchmarks with a rigorous methodology. We measure representative Intel IA32 processors with technologies ranging from 130nm to 32nm while they execute sequential and parallel benchmarks written in native and managed languages. During this period, hardware and software changed substantially: (1) hardware vendors delivered chip multiprocessors instead of uniprocessors, and independently (2) software developers increasingly chose managed languages instead of native languages. This quantitative data reveals the extent of some known and previously unobserved hardware and software trends.

    Two themes emerge. (I) Workload: The power, performance, and energy trends of native workloads do not approximate managed workloads. For example, (a) the SPEC CPU2006 native benchmarks on the i7-920 and i5-670 draw significantly less power than managed or scalable native benchmarks; and (b) managed runtimes exploit parallelism even when running single-threaded applications. The results recommend architects always include native and managed workloads to design and evaluate energy efficient designs. (II) Architecture: Clock scaling, microarchitecture, simultaneous multithreading, and chip multiprocessors each elicit a huge variety of processor power, performance, and energy responses. This variety and the difficulty of obtaining power measurements recommends exposing on-chip power meters and when possible structure specific power meters for cores, caches, and other structures. Just as hardware event counters provide a quantitative grounding for performance innovations, power meters are necessary for optimizing energy.

    @InProceedings{EBCYM:11,
      author = {Hadi Esmaeilzadeh and Ting Cao and Xi Yang and Stephen M. Blackburn and Kathryn S. McKinley},
      title = {Looking Back on the Language and Hardware Revolutions: Measured Power, Performance, and Scaling},
      booktitle = {Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, Newport Beach, CA, USA, March 5 - 11},
      year = {2011},
      results = {powerperf-asplos-2011.zip},
      month = {mar},
      doi = {10.1145/1961296.1950402},
      location = {Newport Beach, CA, USA},
    }

A full list of my publications appears here.

Prospective Students

I’m always looking for bright students.  If you’re interested in doing research work with me, please read this before you contact me.