Research

My research interests are centered on the challenge of making software run faster and more power-efficiently on modern hardware.  My primary interests include: microarchitectural support for managed languages, fast and efficient garbage collection, and the design and implementation of virtual machines.  As a backdrop to this, I have a longstanding interest in the role of sound methodology and infrastructure in successful research innovation. Read more here.

News

Ting Cao graduated in November 2015 after completing her PhD, which includes her landmark work on how we think about energy, power, and performance. Ting holds a distinguished fellowship at the Chinese Academy of Sciences.

Rifat Shahriyar graduated in July 2015 after completing a PhD that changed the way we think about reference counting.  Rifat is now a professor at BUET.

Select Recent Publications

  • X. Yang, S. M. Blackburn, and K. S. McKinley, "Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading," in Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC’16), Denver, CO, June 22-24, 2016, 2016.
    Web services from search to games to stock trading impose strict Service Level Objectives (SLOs) on tail latency. Meeting these objectives is challenging because the computational demand of each request is highly variable and load is bursty. Consequently, many servers run at low utilization (10 to 45%); turn off simultaneous multithreading (SMT); and execute only a single service — wasting hardware, energy, and money. Although co-running batch jobs with latency-critical requests to utilize multiple SMT hardware contexts (lanes) is appealing, unmitigated sharing of core resources induces non-linear effects on tail latency and SLO violations. We introduce principled borrowing to control SMT hardware execution in which batch threads borrow core resources. A batch thread executes in a reserved batch SMT lane when no latency-critical thread is executing in the partner request lane. We instrument batch threads to quickly detect execution in the request lane, step out of the way, and promptly return the borrowed resources. We introduce the nanonap system call to stop the batch thread’s execution without yielding its lane to the OS scheduler, ensuring that requests have exclusive use of the core’s resources. We evaluate our approach for co-locating batch workloads with latency-critical requests using the Apache Lucene search engine. A conservative policy that executes batch threads only when the request lane is idle improves utilization between 90% and 25% on one core, depending on load, without compromising request SLOs. Our approach is straightforward, robust, and unobtrusive, opening the way to substantially improved resource utilization in datacenters running latency-critical workloads.
    @InProceedings{YBM:16,
      author = {Xi Yang and Stephen M Blackburn and Kathryn S McKinley},
      title = {Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading},
      booktitle = {Proceedings of the 2016 USENIX Annual Technical Conference (USENIX ATC'16), Denver, CO, June 22-24, 2016},
      year = {2016},
      }
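    The borrowing discipline Elfen enforces can be sketched in ordinary threaded code. This is an illustrative Rust sketch, not the paper's system: the name `run_borrowing_demo` is ours, and a `sleep` stands in for the nanonap syscall, which parks the batch thread without surrendering its SMT lane to the OS scheduler:

    ```rust
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    // Runs a "batch" thread that borrows cycles only while the
    // latency-critical request lane is idle, and returns the number
    // of batch work units it completed.
    fn run_borrowing_demo() -> u64 {
        let request_active = Arc::new(AtomicBool::new(false));
        let done = Arc::new(AtomicBool::new(false));

        let flag = Arc::clone(&request_active);
        let stop = Arc::clone(&done);
        let batch = thread::spawn(move || {
            let mut work = 0u64;
            while !stop.load(Ordering::Acquire) {
                if flag.load(Ordering::Acquire) {
                    // Step out of the way while the partner lane is
                    // busy. The paper stops the thread with nanonap,
                    // which idles the SMT lane *without* handing it to
                    // the OS scheduler; sleep only approximates that.
                    thread::sleep(Duration::from_micros(50));
                    continue;
                }
                work += 1; // one unit of borrowed-lane batch work
            }
            work
        });

        // Simulate a burst of latency-critical requests, then idle time.
        request_active.store(true, Ordering::Release);
        thread::sleep(Duration::from_millis(5));
        request_active.store(false, Ordering::Release);
        thread::sleep(Duration::from_millis(5));

        done.store(true, Ordering::Release);
        batch.join().unwrap()
    }

    fn main() {
        let work = run_borrowing_demo();
        println!("batch work units completed while borrowing: {}", work);
        assert!(work > 0);
    }
    ```

    The key property, which the sketch preserves, is that the batch thread itself detects request-lane activity and steps aside, rather than relying on the OS scheduler to arbitrate.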
  • Y. Lin, S. M. Blackburn, A. L. Hosking, and M. Norrish, "Rust as a Language for High Performance GC Implementation," in Proceedings of the Sixteenth ACM SIGPLAN International Symposium on Memory Management, ISMM ’16, Santa Barbara, CA, June 13, 2016, 2016.
    High performance garbage collectors build upon performance-critical low-level code, typically exhibit multiple levels of concurrency, and are prone to subtle bugs. Implementing, debugging and maintaining such collectors can therefore be extremely challenging. The choice of implementation language is a crucial consideration when building a collector. Typically, the drive for performance and the need for efficient support of low-level memory operations leads to the use of low-level languages like C or C++, which offer little by way of safety and software engineering benefits. This risks undermining the robustness and flexibility of the collector design. Rust’s ownership model, lifetime specification, and reference borrowing deliver safety guarantees through a powerful static checker with little runtime overhead. These features make Rust a compelling candidate for a collector implementation language, but they come with restrictions that threaten expressiveness and efficiency. We describe our experience implementing an Immix garbage collector in Rust and C. We discuss the benefits of Rust, the obstacles encountered, and how we overcame them. We show that our Immix implementation has almost identical performance on micro benchmarks, compared to its implementation in C, and outperforms the popular BDW collector on the gcbench micro benchmark. We find that Rust’s safety features do not create significant barriers to implementing a high performance collector. Though memory managers are usually considered low-level, our high performance implementation relies on very little unsafe code, with the vast majority of the implementation benefiting from Rust’s safety. We see our experience as a compelling proof-of-concept of Rust as an implementation language for high performance garbage collection.
    @InProceedings{LBH+:16,
      author = {Yi Lin and Stephen M Blackburn and Antony L Hosking and Michael Norrish},
      title = {Rust as a Language for High Performance GC Implementation},
      booktitle = {Proceedings of the Sixteenth ACM SIGPLAN International Symposium on Memory Management, ISMM '16, Santa Barbara, CA, June 13, 2016},
      year = {2016},
      }
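    The paper's point about confining unsafety can be illustrated with a toy Immix-style bump allocator. The sketch below is ours, not code from the paper: an `Address` newtype hides raw-pointer arithmetic so the allocation fast path itself is entirely safe Rust:

    ```rust
    // An address abstraction: once raw-pointer arithmetic is wrapped
    // here, the rest of the allocator needs no unsafe code at all.
    #[derive(Copy, Clone, PartialEq, PartialOrd, Debug)]
    struct Address(usize);

    impl Address {
        fn plus(self, bytes: usize) -> Address { Address(self.0 + bytes) }
        fn align_up(self, align: usize) -> Address {
            Address((self.0 + align - 1) & !(align - 1))
        }
    }

    // Immix-style bump allocation within a block: entirely safe code.
    // Returns None when the block is exhausted, which in a real
    // collector would trigger acquiring a fresh block.
    struct BumpAllocator { cursor: Address, limit: Address }

    impl BumpAllocator {
        fn alloc(&mut self, size: usize, align: usize) -> Option<Address> {
            let start = self.cursor.align_up(align);
            let end = start.plus(size);
            if end > self.limit { return None; }
            self.cursor = end;
            Some(start)
        }
    }

    fn main() {
        let mut a = BumpAllocator { cursor: Address(0x1000), limit: Address(0x1100) };
        let p = a.alloc(24, 8).unwrap();
        assert_eq!(p, Address(0x1000));
        let q = a.alloc(24, 8).unwrap();
        assert_eq!(q, Address(0x1018)); // bumped past the first object
        assert!(a.alloc(1024, 8).is_none()); // block exhausted
        println!("ok");
    }
    ```

    Turning an `Address` back into a typed reference is where the small unsafe core lives; the abstract's claim is that such regions can be kept vanishingly small.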
  • I. Jibaja, T. Cao, S. M. Blackburn, and K. S. McKinley, "Portable Performance on Asymmetric Multicore Processors," in Proceedings of the 14th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2016.
    @InProceedings{JCBM:16,
      author = {Jibaja, Ivan and Cao, Ting and Blackburn, Stephen M and McKinley, Kathryn S.},
      title = {Portable Performance on Asymmetric Multicore Processors},
      booktitle = {Proceedings of the 14th Annual IEEE/ACM International Symposium on Code Generation and Optimization},
      year = 2016, month = feb, location = {Barcelona, Spain},
      publisher = {IEEE},
      }
  • X. Yang, S. M. Blackburn, and K. S. McKinley, "Computer Performance Microscopy with Shim," in ISCA ’15: The 42nd International Symposium on Computer Architecture, 2015.
    Developers and architects spend a lot of time trying to understand and eliminate performance problems. Unfortunately, the root causes of many problems occur at a fine granularity that existing continuous profiling and direct measurement approaches cannot observe. This paper presents the design and implementation of Shim, a continuous profiler that samples at resolutions as fine as 15 cycles; three to five orders of magnitude finer than current continuous profilers. Shim’s fine-grain measurements reveal new behaviors, such as variations in instructions per cycle (IPC) within the execution of a single function. A Shim observer thread executes and samples autonomously on unutilized hardware. To sample, it reads hardware performance counters and memory locations that store software state. Shim improves its accuracy by automatically detecting and discarding samples affected by measurement skew. We measure Shim’s observer effects and show how to analyze them. When on a separate core, Shim can continuously observe one software signal with a 2% overhead at a ~1200 cycle resolution. At an overhead of 61%, Shim samples one software signal on the same core with SMT at a ~15 cycle resolution. Modest hardware changes could significantly reduce overheads and add greater analytical capability to Shim. We vary prefetching and DVFS policies in case studies that show the diagnostic power of fine-grain IPC and memory bandwidth results. By repurposing existing hardware, we deliver a practical tool for fine-grain performance microscopy for developers and architects.
    @Inproceedings{XBM:15,
      author = {Yang, Xi and Blackburn, Stephen M. and McKinley, Kathryn S.},
      title = {Computer Performance Microscopy with {Shim}},
      booktitle = {ISCA '15: The 42nd International Symposium on Computer Architecture},
      year = {2015},
      location = {Portland, OR},
      publisher = {IEEE},
      }
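    Shim's observer-thread structure can be approximated in portable code. The following Rust sketch is illustrative only: `Instant` stands in for reading hardware performance counters, and `keep_after_skew_filter` (our name) simply drops samples whose inter-sample gap is anomalously long, in the spirit of the paper's skew detection:

    ```rust
    use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::{Duration, Instant};

    // Skew filter: discard samples whose gap to the previous sample
    // is anomalous, since something (an interrupt, contention)
    // perturbed the measurement. Gaps are in microseconds.
    fn keep_after_skew_filter(gaps_us: &[u128], max_gap_us: u128) -> usize {
        gaps_us.iter().filter(|&&g| g <= max_gap_us).count()
    }

    fn main() {
        // A word of shared software state the application publishes
        // to, e.g. an identifier for the code region it is executing.
        let signal = Arc::new(AtomicUsize::new(0));
        let stop = Arc::new(AtomicBool::new(false));

        let sig = Arc::clone(&signal);
        let halt = Arc::clone(&stop);
        let observer = thread::spawn(move || {
            // Tight sampling loop: a timestamp plus the signal word.
            // Shim reads hardware performance counters here.
            let mut samples = Vec::new();
            while !halt.load(Ordering::Relaxed) {
                samples.push((Instant::now(), sig.load(Ordering::Relaxed)));
            }
            samples
        });

        // Application thread: move through three "phases" of execution.
        for phase in 1..=3usize {
            signal.store(phase, Ordering::Relaxed);
            thread::sleep(Duration::from_millis(2));
        }
        stop.store(true, Ordering::Relaxed);

        let samples = observer.join().unwrap();
        let gaps: Vec<u128> = samples
            .windows(2)
            .map(|w| (w[1].0 - w[0].0).as_micros())
            .collect();
        let kept = keep_after_skew_filter(&gaps, 100);
        println!("{} samples, {} kept after skew filter", samples.len(), kept);
        assert!(!samples.is_empty());
    }
    ```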
  • Y. Lin, K. Wang, S. M. Blackburn, M. Norrish, and A. L. Hosking, "Stop and Go: Understanding Yieldpoint Behavior," in Proceedings of the Fourteenth ACM SIGPLAN International Symposium on Memory Management, ISMM ’15, Portland, OR, June 14, 2015, 2015.
    Yieldpoints are critical to the implementation of high performance garbage collected languages, yet the design space is not well understood. Yieldpoints allow a running program to be interrupted at well-defined points in its execution, facilitating exact garbage collection, biased locking, on-stack replacement, profiling, and other important virtual machine behaviors. In this paper we identify and evaluate yieldpoint design choices, including previously undocumented designs and optimizations. One of the designs we identify opens new opportunities for very low overhead profiling. We measure the frequency with which yieldpoints are executed and establish a methodology for evaluating the common case execution time overhead. We also measure the median and worst case time-to-yield. We find that Java benchmarks execute about 100 M yieldpoints per second, of which about 1/20000 are taken. The average execution time overhead for untaken yieldpoints on the VM we use ranges from 2.5% to close to zero on modern hardware, depending on the design, and we find that the designs trade off total overhead with worst case time-to-yield. This analysis gives new insight into a critical but overlooked aspect of garbage collector implementation, and identifies a new optimization and new opportunities for very low overhead profiling.
    @InProceedings{LWB+:15,
      author = {Yi Lin and Kunshan Wang and Stephen M Blackburn and Michael Norrish and Antony L Hosking},
      title = {Stop and Go: Understanding Yieldpoint Behavior},
      booktitle = {Proceedings of the Fourteenth ACM SIGPLAN International Symposium on Memory Management, ISMM '15, Portland, OR, June 14, 2015},
      year = {2015},
      doi = {http://dx.doi.org/10.1145/2754169.2754187},
      }
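    Of the designs the paper evaluates, the conditional (check-based) yieldpoint is easy to sketch. The Rust below is an illustration of the idea, not VM code: a global flag is polled at simulated loop back-edges, and the untaken fast path is a single load and a predictable branch:

    ```rust
    use std::sync::atomic::{AtomicBool, Ordering};

    // Global flag set by the VM when it needs every thread to stop
    // at its next yieldpoint (GC, biased-lock revocation, OSR, ...).
    static YIELD_REQUESTED: AtomicBool = AtomicBool::new(false);

    // A conditional yieldpoint, as planted by the compiler at method
    // prologues and loop back-edges. The competing page-protection
    // design replaces the branch with a plain load of a page the VM
    // mprotects, trading trap cost on the taken path for an even
    // cheaper untaken path.
    #[inline(always)]
    fn yieldpoint(taken: &mut u64) {
        if YIELD_REQUESTED.load(Ordering::Relaxed) {
            // Slow path: the thread would rendezvous with the VM here.
            *taken += 1;
        }
    }

    fn main() {
        let mut taken = 0u64;
        let mut executed = 0u64;
        for i in 0..1_000_000u64 {
            yieldpoint(&mut taken); // a loop back-edge yieldpoint
            executed += 1;
            if i == 500_000 { YIELD_REQUESTED.store(true, Ordering::Relaxed); }
            if i == 500_100 { YIELD_REQUESTED.store(false, Ordering::Relaxed); }
        }
        // Mirrors the paper's observation: yieldpoints execute
        // constantly but are almost never taken.
        println!("executed: {}, taken: {}", executed, taken);
        assert_eq!(executed, 1_000_000);
        assert_eq!(taken, 100);
    }
    ```

    The flag check makes the common-case cost easy to see: one relaxed load per back-edge, which is what the paper's untaken-overhead measurements quantify.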
  • K. Wang, Y. Lin, S. M. Blackburn, M. Norrish, and A. L. Hosking, "Draining the Swamp: Micro Virtual Machines as Solid Foundation for Language Development," in 1st Summit on Advances in Programming Languages (SNAPL 2015), 2015.
    Many of today’s programming languages are broken. Poor performance, lack of features and hard-to-reason-about semantics can cost dearly in software maintenance and inefficient execution. The problem is only getting worse with programming languages proliferating and hardware becoming more complicated.

    An important reason for this brokenness is that much of language design is implementation-driven. The difficulties in implementation and insufficient understanding of concepts bake bad designs into the language itself. Concurrency, architectural details and garbage collection are three fundamental concerns that contribute much to the complexities of implementing managed languages.

    We propose the micro virtual machine, a thin abstraction designed specifically to relieve implementers of managed languages of the most fundamental implementation challenges that currently impede good design. The micro virtual machine targets abstractions over memory (garbage collection), architecture (compiler backend), and concurrency. We motivate the micro virtual machine and give an account of the design and initial experience of a concrete instance, which we call Mu, built over a two year period. Our goal is to remove an important barrier to performant and semantically sound managed language design and implementation.

    @InProceedings{WLB+:15,
      author = {Kunshan Wang and Yi Lin and Stephen M Blackburn and Michael Norrish and Antony L Hosking},
      title = {Draining the Swamp: Micro Virtual Machines as Solid Foundation for Language Development},
      booktitle = {1st Summit on Advances in Programming Languages (SNAPL 2015)},
      year = {2015},
      doi = {http://dx.doi.org/10.4230/LIPIcs.SNAPL.2015.321},
      }

A full list of my publications appears here.

Prospective Students

I’m always looking for bright students.  If you’re interested in doing research work with me, please read this before you contact me.