architecting future warehouse scale computers

The class of datacenters known as “warehouse scale computers” (WSCs) houses large-scale, data-intensive web services such as web search, maps, social networking, docs, and video sharing. Companies like Google, Microsoft, Yahoo, and Amazon spend tens to hundreds of millions to construct and operate WSCs to provide these services. Maximizing the efficiency of this class of computing reduces this cost and has energy implications for a greener planet. However, WSC design and architecture remain in their relative infancy.

Research insight: WSCs are built using commodity processor architectures (Intel/AMD) and software components (Linux, GCC, JVM, etc.) that have been engineered and optimized for traditional computing environments and workloads, such as those found in desktop, laptop, and HPC environments. Many characteristics, assumptions, and requirements present in a WSC computing environment affect many design decisions.

The goals: Rethink how WSCs are designed and architected in both the underlying hardware and the system software platform. Identify sources of inefficiency and develop solutions to improve WSCs. Imagine a line graph where the y-axis is the size of the WSC required to do some fixed amount of work, and the x-axis is time as research progresses. My vision is to have a line that is monotonically decreasing with a steep slope (that is, over time, an increasingly smaller WSC is needed for the same fixed amount of work).

Impact: Today, advances in computing are moving in two directions: into the mobile space and into the cloud. The impact of this vision is to reduce the environmental footprint and cost of providing the platform that is the cloud.

My related publications:

  1. “Bubble-Up: Increasing Utilization in Modern Warehouse Scale Computers via Sensible Co-locations” in MICRO 2011
  2. “Heterogeneity in “Homogeneous” Warehouse-Scale Computers: A Performance Opportunity” in CAL 2011
  3. “The Impact of Memory Subsystem Resource Sharing on Datacenter Applications” at ISCA 2011
  4. “Contention Aware Execution: Online Contention Detection and Response” at CGO 2010

addressing contention on multicore processors

Multicore processors allow multiple streams of execution to occur in parallel. However, on-chip caches, the bus, the memory controller, and other components of the memory subsystem are shared among processing cores. Contention for these resources occurs when the working sets of co-running processes or threads exceed the size of the private caches and rely on the shared memory subsystem. This contention can result in a significant amount of performance interference across cores, counteracting the ability to achieve the parallelism promised by multicore processors. When an application suffers a performance degradation due to contention for shared resources with an application on a separate processing core, we call this cross-core performance interference.
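As a minimal sketch of the definitions above, cross-core interference can be expressed as the fractional slowdown of an application when it co-runs with another application, compared with running alone. The function names, the timing inputs, and the cache sizes below are illustrative assumptions, not measurements or tooling from the publications listed here.

```python
def cross_core_interference(solo_seconds: float, corun_seconds: float) -> float:
    """Fractional slowdown of an application when co-running versus running alone.

    0.0 means no interference; 0.25 means the co-run took 25% longer.
    """
    if solo_seconds <= 0:
        raise ValueError("solo_seconds must be positive")
    return corun_seconds / solo_seconds - 1.0


def spills_into_shared_subsystem(working_set_bytes: int, private_cache_bytes: int) -> bool:
    """True when a working set no longer fits in the private caches.

    This is a deliberately simple size comparison (an assumption); it ignores
    associativity, access patterns, and prefetching.
    """
    return working_set_bytes > private_cache_bytes


if __name__ == "__main__":
    # Hypothetical numbers: 10.0 s alone, 12.5 s next to a co-runner.
    print(cross_core_interference(10.0, 12.5))
    # Hypothetical sizes: an 8 MiB working set against 256 KiB of private cache.
    print(spills_into_shared_subsystem(8 * 2**20, 256 * 2**10))
```

With the hypothetical timings above, the application runs 25% slower when co-located, which is the degradation the term cross-core performance interference refers to.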

Research insight: Contention for shared resources is severely limiting our ability to realize the promised parallelism of multicore architectures. We must address this challenge by creating new system software solutions that facilitate the interaction between workloads and the underlying microarchitecture.

The goals: Fully understand the nature of contention as it occurs in real commodity multicore processors, and exploit this understanding to devise innovative approaches and mechanisms for mitigating its impact. Contention has had, and continues to have, negative implications for system performance, utilization, throughput, and quality of service.

Impact: Making significant progress in this direction will enable the continuation of Moore’s law by allocating transistors to higher degrees of parallel processing, at the very least for multi-programmed workloads.

My related publications:

  1. “The Impact of Memory Subsystem Resource Sharing on Datacenter Applications” at ISCA 2011
  2. “Directly Characterizing Cross Core Interference Through Contention Synthesis” at HiPEAC 2011
  3. “Contention Aware Execution: Online Contention Detection and Response” at CGO 2010
  4. “Synthesizing Contention” at WBIA 2009 (workshop @ MICRO 2009)

practical online application restructuring

The online restructuring of native applications enables an entire class of application optimizations that can only be performed dynamically, as they require information that is only available at runtime. However, restructuring native application code layout dynamically demands a high level of complexity, and traditionally the benefit has not justified this cost enough to motivate practical and commercial adoption. The difficulty of applying such optimizations to natively compiled binary applications can be attributed to two factors: a lack of source-level information in binary-to-binary restructuring, and the added complexity and overhead of online monitoring and code rewriting.

Research insight: The ‘rewriting’ doesn’t have to occur dynamically to enable dynamic restructuring. Why not identify the dynamic situations you wish to accommodate and statically specialize a number of instances (or versions) of the targeted application code for these dynamic situations?
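The idea above can be illustrated with a small sketch: several versions of a routine are prepared ahead of time, and a runtime observation only selects among them. Everything here is a hypothetical stand-in (the scenario labels, the miss-rate threshold, and the two toy versions); it shows the shape of the dispatch, not the framework described in the publications below.

```python
from typing import Callable, Dict, List

# Hypothetical scenario labels; the text does not enumerate specific scenarios.
LOW_CONTENTION = "low_contention"
HIGH_CONTENTION = "high_contention"


def total_fast(values: List[int]) -> int:
    """Stand-in for a version specialized ahead of time for the low-contention case."""
    return sum(values)


def total_careful(values: List[int]) -> int:
    """Stand-in for a version specialized ahead of time for the high-contention case."""
    total = 0
    for value in values:
        total += value
    return total


# All versions exist before the program runs; nothing is rewritten at runtime.
VERSIONS: Dict[str, Callable[[List[int]], int]] = {
    LOW_CONTENTION: total_fast,
    HIGH_CONTENTION: total_careful,
}


def classify_scenario(miss_rate: float, threshold: float = 0.1) -> str:
    """Map a runtime observation to a scenario (the threshold is an assumed value)."""
    return HIGH_CONTENTION if miss_rate > threshold else LOW_CONTENTION


def dispatch(miss_rate: float, values: List[int]) -> int:
    """Pick the pre-built version that matches the observed scenario and run it."""
    return VERSIONS[classify_scenario(miss_rate)](values)


if __name__ == "__main__":
    print(dispatch(0.02, [1, 2, 3]))  # low-contention version is selected
    print(dispatch(0.40, [1, 2, 3]))  # high-contention version is selected
```

Both versions compute the same result; only the selected implementation differs, so the runtime cost is limited to classifying the scenario and making one table lookup.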

The goals: We must understand the implications of specializing at varying granularities to understand their trade-offs. We must also demonstrate the unique capabilities of this “scenario based optimization” (SBO) and its practical applications.

Impact: I expect SBO-style optimizations to one day be commonly applied optimization techniques. The only challenge currently is the lack of standardized semantics for performance monitoring hardware and its ABI (application binary interface), and for how each layer of the software stack interfaces with it. As our community further establishes these semantics, SBO can be easily adopted in numerous application domains.

My related publications:

  1. “Loaf: A Framework and Infrastructure for Creating Online Adaptive Solutions” at Exadapt 2011 (co-located with PLDI 2011)
  2. “Scenario Based Optimization: A Framework for Statically Enabling Online Optimizations” at CGO 2009
  3. “Evaluating Indirect Branch Handling Mechanisms in Software Dynamic Translation Systems” at CGO 2007

general online optimization analyses for a new microarchitectural environment

Online and dynamic optimization for native binary applications is not yet popular commercially; however, it is proving to be one of the most promising research directions for continuing to realize performance optimization opportunities at the binary level. Online optimizers use information that is only available dynamically to predict future application behavior and microarchitectural events, and exploit these predictions to allow the executing application to adapt to its execution environment, or to allow the environment to adapt to the application. However, the required online analyses impose overhead on the application, and traditionally, in many cases, the overhead outweighs the benefits of the optimizations themselves. As a result, effectively achieving online optimization has proved quite challenging, especially at the binary level, since traditional dynamic binary optimizers often limit themselves to performing only the least costly, lightweight online analyses.
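The trade-off between analysis overhead and optimization benefit can be made concrete with a little arithmetic. The sketch below assumes a simple additive model, where the optimization removes a fraction of the original run time and the online analysis adds a fraction back; both the model and the example numbers are assumptions chosen for illustration.

```python
def net_speedup(optimization_gain: float, analysis_overhead: float) -> float:
    """Speedup relative to the unoptimized run under a simple additive model.

    optimization_gain: fraction of the original run time removed (0 <= gain < 1).
    analysis_overhead: run time added by online analysis, as a fraction of the
        original run time (>= 0).
    A result below 1.0 means the overhead outweighs the benefit.
    """
    if not 0.0 <= optimization_gain < 1.0:
        raise ValueError("optimization_gain must be in [0, 1)")
    if analysis_overhead < 0.0:
        raise ValueError("analysis_overhead must be non-negative")
    new_time = (1.0 - optimization_gain) + analysis_overhead
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical: a 10% gain paid for with 15% analysis overhead is a net loss.
    print(net_speedup(0.10, 0.15) < 1.0)
    # Hypothetical: the same gain with 1% overhead is a net win.
    print(net_speedup(0.10, 0.01) > 1.0)
```

Under this model an optimization only pays off when the analysis overhead is smaller than the gain it enables, which is why the cost of the analysis matters as much as the optimization itself.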

Research insight: A new microarchitectural environment is emerging, and traditional online binary optimization techniques and analyses may no longer be relevant. In this new environment we have sophisticated multicore processors and lightweight hardware performance monitoring capabilities. By leveraging these capabilities intelligently, the cost of online monitoring and analysis begins to melt away.

The goals: Demonstrate the opportunity to unobtrusively perform sophisticated and much more beneficial online analyses for dynamic optimizations.

Impact: Showing that binary-level online optimization can be harnessed with reduced complexity, and demonstrating the benefits of leveraging the latest microarchitectural advances to re-approach online optimization problems, should provide evidence of the commercial and practical viability of online optimization for native application binaries.

My related publications:

  1. “Exploiting Hardware Advances for Software Testing and Debugging” at ICSE 2011 (NIER)
  2. “A Reactive Unobtrusive Prefetcher for Multicore and Manycore Architectures” at SHCMP 2008 (workshop @ ISCA 2008)
  3. “MATS: Multicore Adaptive Trace Selection” at STMCS 2008 (workshop @ CGO 2008)

ph.d proposal: online adaptation for application performance and efficiency

Outdated; see the PhD thesis: “Rethinking the Architecture of Warehouse-Scale Computers: Online Adaptation for Efficiency and Utilization”

Online adaptation is the restructuring of an executing application to dynamically react and adapt to its execution environment using information that is only available at run-time. This information includes the dynamic application inputs, its resulting execution paths, microarchitectural events, system wide events, and with the proliferation of multicore and many-core architectures, the set of programs running simultaneously alongside the executing application. These characteristics are unpredictable and change between application runs, and indeed during a single application run.

Achieving effective online adaptation for natively executed applications has proved quite challenging [6, 34, 45, 4] and to date has not been widely adopted. Traditionally, at the binary level, a run-time layer is added that virtualizes the execution of the application by performing dynamic binary-to-binary translation, injecting trampolines and instrumentation into the translated code to maintain control of the application. This approach often adds high overhead and complexity to the application [33, 45, 17, 34], discouraging its use and adoption in industry and for commercial applications. We propose a new paradigm for online adaptation: a lightweight approach that leverages current microarchitectural advances to efficiently enable online monitoring and adaptation without the complexity of binary translation or fine-grain instrumentation. Our proposed methodology takes advantage of the ubiquitous hardware performance monitors [13, 44, 3] present in modern chip microarchitectures to dynamically monitor the microarchitectural events of a chip and application behavior with negligible overhead. By leveraging these capabilities to develop an innovative lightweight online adaptation framework (Loaf), we will be able to address a number of important real-world online adaptation problems.
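The general shape of a monitor-and-respond loop can be sketched as follows. This is only an illustration under stated assumptions: the event counts are supplied as plain integers rather than read from real hardware performance monitors, and the threshold and the "adapt" / "keep" responses are hypothetical placeholders, not the Loaf framework itself.

```python
from typing import Callable, Iterable, List


def adaptation_loop(
    event_counts: Iterable[int],
    threshold: int,
    adapt: Callable[[int], None],
) -> List[str]:
    """Walk a stream of periodic event-count samples and respond to each one.

    event_counts: one sample per monitoring interval (simulated here; a real
        system would read hardware performance monitors instead).
    threshold: assumed cut-off above which the application should adapt.
    adapt: hypothetical callback invoked with the sample that triggered it.
    Returns the decision taken for every interval, for inspection.
    """
    decisions: List[str] = []
    for count in event_counts:
        if count > threshold:
            adapt(count)
            decisions.append("adapt")
        else:
            decisions.append("keep")
    return decisions


if __name__ == "__main__":
    triggered: List[int] = []
    # Simulated samples: only the second interval exceeds the threshold.
    print(adaptation_loop([5, 50, 7], threshold=10, adapt=triggered.append))
    print(triggered)
```

The loop itself does no translation or instrumentation of the application; it only consumes periodic samples and decides whether to respond, which is the property that keeps the monitoring side inexpensive.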

This proposal argues for application flexibility: the ability of an application to adapt itself to its environment or to adapt its environment to itself. We propose a mechanism to enable this, and show how it is to be applied to a number of problems in computing.

ph.d thesis proposal (pdf)



  • Web Chair for CGO 2012
  • Co-organizing EXADAPT workshop @ PLDI 2011
  • CGO 2011 External Reviewer
  • PLDI 2010 External Reviewer
  • Founder of the Sicro Group

my research vision

as microarchitectural design advances, so must software;
as software advances, so must microarchitectural design.

my cv

Download (PDF, 299.87KB)


  1. [pending] Scenario Based Optimization Technology [Jason Mars, Robert Hundt]
  2. [pending] Detecting and Responding to Cross-Core Interference [Jason Mars, Neil Vachharajani, Robert Hundt]

contributions to my work and research vision

  • (december 2011) awarded google research grant – 70k
  • (december 2010) awarded google research grant – 80k
  • (july 2010) awarded google phd fellowship for compiler technology – up to 3 years / up to 105k + tuition (2011-2013)
  • (march 2010) awarded google research grant – 55k
  • (april 2009) awarded google research grant – 80k
  • (january 2007) awarded uncf scholarship - 5k (2007)
  • (november 2006) awarded ford diversity fellowship - 3 years / 60k (2007-2010)