MICRO 2017, CAL 2016, SIGMETRICS 2017, SIGMETRICS 2016, HPCA 2017, SIGMETRICS 2014, DSN 2016, DSN 2015

Demand for main memory capacity keeps growing with the increasing number of cores and the wide adoption of in-memory applications. For the past few decades, DRAM vendors have provided higher-capacity DRAM at low cost by manufacturing smaller cells in the same die area. Unfortunately, smaller cells increase interference among cells, which in turn makes cells vulnerable to failures. As a result, it is becoming increasingly challenging to build reliable high-capacity DRAM chips at smaller technology nodes. This work proposes to enable DRAM scaling by allowing manufacturers to build smaller cells without fully ensuring correct operation for every cell. Instead, the system takes responsibility for reliable DRAM operation, detecting and mitigating DRAM failures while the system runs in the field. The ability to detect and correct failing cells in the system enables smaller feature sizes at low cost and high yield. Using experimental data from real DRAM chips, this work analyzes the efficacy of existing system-level detection and mitigation techniques, and develops new, efficient system-level techniques to identify and mitigate DRAM failures.
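As one concrete illustration of system-level detection and mitigation (a minimal sketch, not the exact mechanism from these papers), the C fragment below shows an online scrubber that periodically checks ECC-protected physical pages and retires pages whose corrected-error count crosses a threshold. The hooks `ecc_check_region` and `os_retire_page`, the page granularity, and the threshold are all assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical platform hooks: read a physical region through the
 * ECC-protected path and report the corrected-error count; ask the
 * OS to unmap and stop allocating a failing physical page. */
extern unsigned ecc_check_region(uint64_t phys_addr, size_t bytes);
extern void os_retire_page(uint64_t phys_addr);

#define PAGE_SIZE        4096u
#define RETIRE_THRESHOLD 4u      /* illustrative policy knob */

/* One scrubbing pass over a physical address range: detect failing
 * cells in the field and mitigate them by retiring the page before
 * the errors become uncorrectable. */
void scrub_pass(uint64_t start, uint64_t end, unsigned *err_count)
{
    for (uint64_t pa = start; pa < end; pa += PAGE_SIZE) {
        size_t idx = (size_t)((pa - start) / PAGE_SIZE);
        err_count[idx] += ecc_check_region(pa, PAGE_SIZE);
        if (err_count[idx] >= RETIRE_THRESHOLD)
            os_retire_page(pa);
    }
}
```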

HPCA 2018, MICRO 2015, WEED 2013

Emerging memory technologies, such as spin-transfer torque RAM (STT-RAM), phase-change memory (PCM), and resistive random-access memory (ReRAM), offer fast, energy-efficient, byte-addressable, and persistent memory access. Leveraging these byte-addressable non-volatile memories, this work proposes to unify memory and storage under a single address space, where persistent data structures are directly accessed by the processor with load/store instructions. Such a system improves performance by eliminating the overhead of data transfer between memory and storage. One major challenge of manipulating persistent data directly in memory is ensuring data consistency when a system crash or power loss leaves a partial update. This work designs a software-transparent consistency mechanism that leverages an efficient hardware-assisted checkpointing scheme to enforce consistent memory states in non-volatile memories.
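To make the consistency challenge concrete, the sketch below (C with x86 cache-line flush and store-fence intrinsics) shows the manual ordering software must otherwise enforce when updating a persistent structure: the payload must be durable before the commit flag is set, or a crash can expose a partial update. The `pnode` layout is a hypothetical example; the work's contribution is to make such ordering software-transparent through hardware-assisted checkpointing.

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush */
#include <xmmintrin.h>   /* _mm_sfence  */

/* A hypothetical persistent record living in byte-addressable NVM. */
struct pnode {
    uint64_t payload;
    uint64_t valid;      /* commit flag: 1 => payload is consistent */
};

/* Manual crash-consistent update: without an ordering point between
 * writing the payload and setting the flag, a crash could leave
 * valid == 1 alongside a partially written payload. */
void pnode_update(struct pnode *n, uint64_t value)
{
    n->payload = value;
    _mm_clflush(&n->payload);   /* push payload out of the volatile cache */
    _mm_sfence();               /* order: payload persists before the flag */

    n->valid = 1;
    _mm_clflush(&n->valid);
    _mm_sfence();               /* ensure the commit flag itself persists */
}
```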

HPCA 2015 (Data Set), TACO 2015

Memory latency and bandwidth are both critical bottlenecks that limit system performance. This project reduces DRAM access latency and increases the bandwidth of die-stacked 3D memory. First, this work shows that the DRAM specification adds a large margin to the access latency to ensure reliable operation in the worst case (for example, it accounts for high leakage at the extreme operating temperature of 85°C and adds margin to guarantee correct operation of atypically weak cells that are more leaky). As a result, in the common case, when systems run at ambient temperature with DRAM chips consisting of average cells, it is possible to reduce this extra margin to provide faster, yet reliable, access to DRAM.
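A hedged sketch of how a memory controller could exploit that margin, assuming it exposes programmable timing parameters: near the rated temperature limit it keeps the conservative specification timings, and in the common (ambient) case it programs reduced ones. The `dram_set_timing` interface and the reduced timing values are illustrative assumptions, not numbers from the papers.

```c
/* Hypothetical controller interface: program DRAM timing parameters
 * (in nanoseconds) for row activation and restoration. */
extern void dram_set_timing(double tRCD_ns, double tRAS_ns);
extern double dram_read_temperature_celsius(void);

/* Specification timings include worst-case margin (85°C, weak cells);
 * the reduced values are assumed, for illustration only. */
#define T_RCD_SPEC 13.75
#define T_RAS_SPEC 35.00
#define T_RCD_FAST 10.00   /* assumed common-case value */
#define T_RAS_FAST 24.00   /* assumed common-case value */

void pick_dram_timings(void)
{
    /* Use reduced timings at ambient temperature; fall back to the
     * conservative spec timings near the rated limit. */
    if (dram_read_temperature_celsius() < 55.0)
        dram_set_timing(T_RCD_FAST, T_RAS_FAST);
    else
        dram_set_timing(T_RCD_SPEC, T_RAS_SPEC);
}
```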

HPCA 2013, Intel Technology Journal 2013

Improving Multi-Core Performance with Low-Voltage Operation

In a power-constrained system, operating cores at a lower voltage and frequency increases the number of cores that can be active simultaneously, and therefore the parallelism in the system. However, reducing the operating voltage of caches leads to a dramatic loss in reliability. Building a small fraction of the cache from larger, more robust cells can mitigate failures at low voltage, but drastically reduces cache capacity, leading to performance degradation. This work improves performance by designing a cache management technique that enables reliable cache access at a lower voltage without sacrificing any cache capacity.
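The power argument can be made concrete with a back-of-the-envelope model. The sketch below assumes per-core dynamic power scales as V² · f with the sustainable frequency scaling roughly linearly in V, a common first-order approximation; the constants and operating points are illustrative, not measurements from this work.

```c
#include <stdio.h>

/* First-order model: per-core dynamic power ~ k * V^2 * f. */
static double core_power(double v, double f_ghz)
{
    const double k = 10.0;          /* illustrative constant, W/(V^2*GHz) */
    return k * v * v * f_ghz;
}

int main(void)
{
    const double budget_w = 60.0;   /* illustrative chip power budget */

    /* Nominal operating point vs. a scaled-down one (frequency
     * reduced roughly in proportion to voltage). */
    double p_hi = core_power(1.0, 3.0);   /* 1.0 V, 3.0 GHz */
    double p_lo = core_power(0.7, 2.1);   /* 0.7 V, 2.1 GHz */

    /* Lower voltage lets more cores fit under the same budget. */
    printf("active cores at nominal V: %d\n", (int)(budget_w / p_hi));
    printf("active cores at low V:     %d\n", (int)(budget_w / p_lo));
    return 0;
}
```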

HPCA 2014, HPCA 2012, ISCA 2012, MICRO 2010, PACT 2010

Modern processors buffer highly reused data in large on-die caches to hide the long access latency of main memory. Since cached data can be accessed within a few cycles, maximizing cache reuse through efficient cache management significantly improves performance. This project improves system performance by designing cache management techniques that fall into two groups. The first set of works leverages the key observation that read requests are more critical than write requests: writes are buffered and generally do not stall the processor, so these works improve performance by protecting cache blocks that service read requests, at the cost of blocks that service only writes. The second set of works predicts which blocks will no longer be reused and evicts them to free cache space for blocks with high reuse.
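A minimal sketch of the read/write-criticality idea, assuming a simple set-associative cache with per-block metadata: the victim-selection routine prefers to evict blocks that have serviced only writes, falling back to LRU when every block in the set services reads. The structure and policy are an illustration of the stated observation, not the exact mechanism of any one paper.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 8

struct block {
    uint64_t tag;
    bool     valid;
    bool     read_reused;   /* serviced at least one read hit */
    uint8_t  lru;           /* 0 = most recent, WAYS-1 = least recent */
};

/* Pick a victim: prefer blocks that only serviced writes, since
 * writes are buffered and rarely stall the processor. */
int pick_victim(struct block set[WAYS])
{
    for (int w = 0; w < WAYS; w++)
        if (!set[w].valid)
            return w;                      /* free way: no eviction needed */

    /* First pass: least-recently-used block with no read reuse. */
    int best = -1;
    for (int w = 0; w < WAYS; w++)
        if (!set[w].read_reused &&
            (best < 0 || set[w].lru > set[best].lru))
            best = w;
    if (best >= 0)
        return best;

    /* All blocks service reads: evict the plain LRU block. */
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (set[w].lru > set[victim].lru)
            victim = w;
    return victim;
}
```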