Computer Science Colloquia
Tuesday, December 20, 2011
Taniya Siddiqua
Advisor: Sudhanva Gurumurthi
Attending Faculty: Kevin Skadron, Chair; Mary Lou Soffa; Joanne Bechta Dugan and Mircea Stan
Olsson Hall, Room 236D, 1:00 PM
Ph.D. Proposal Presentation
A Multi-Level Approach to Processor and Memory Reliability
ABSTRACT
We are in the era of multicore processors, and the number of
processing cores on a chip is expected to increase steadily over the
next decade, driven by Moore's Law. While technology scaling paves the way
for high performance multicore processors, the scaling has a dark side
too: silicon reliability. Silicon reliability problems affect the
processor cores, the caches, and main memory. Processors
and their memory systems have to be designed to provide adequate protection
against these reliability problems. Designing a fault-tolerant computing
system is a three-step process: i) understanding the underlying
reliability problem through measurement studies and field-data analysis
of deployed systems, ii) abstraction of the understanding and insights
gained in the first step in the form of models for each reliability phenomenon,
and iii) developing protection techniques to mitigate or tolerate the
reliability problems. While this three-step measurement-modeling-optimization process may appear straightforward,
there are several challenges one has to address in applying this
methodology. Designing a reliable computer system is a large and complex
multi-dimensional and multi-level problem, comprising different
hardware blocks, reliability phenomena, design layers, metrics, and
optimization techniques. This dissertation proposes to target a subset of
this large problem space, covering both the
processor and main memory. For the processor, it will consider two key
reliability phenomena, namely Bias Temperature Instability (BTI) and
Process Variations (PV). The proposed research will involve measurement
studies from chips that contain PMOS and NMOS devices, the development of
models that are suitable for architecture analysis, and the development
of mitigation techniques for both logic and memory structures. For main
memory, this dissertation will present an analysis of field data collected
from 30,000 systems deployed in data centers and will develop a model for
main memory reliability based on that analysis.