Computer Science Colloquia
Monday, December 12, 2011
Sriram Sankar
Advisor: Sudhanva Gurumurthi
Attending Faculty: Kevin Skadron (Chair), Paul Reynolds, Marty Humphrey
Olsson Hall, Room 236D, 2:00 pm
Ph.D. Qualifying Exam Presentation
Impact of Temperature on Hard Disk Drive Reliability in Large Datacenters
ABSTRACT
With the advent of cloud computing and online services, large enterprises
rely heavily on their datacenters to serve end users. A large datacenter
facility incurs increased maintenance costs in addition to service
unavailability when failures increase. However, there is very
little understanding of the major determinants of server failures in
datacenters. Hard disk drives are known to contribute significantly to
server failures in datacenters. In this work, I focus on the relationship
between temperature and hard disk drive failures in a real datacenter. I
present a case study on failures in a dense storage design from a large
population of servers housing close to 80,000 disk drives, hosting a
large-scale online service at Microsoft. In our preliminary DSN 2011 work, we
established a correlation between temperatures and failures observed
at different location granularities: a) across drive locations within a server
chassis, b) across server locations in a rack and c) across multiple racks in a
datacenter. In this presentation, we extend the previous study and show that
temperature exhibits a stronger correlation with failures than disk
utilization or workload characteristics do. Additionally, I explore the impact of
variations in temperature on hard disk drive failures with data collected from
the datacenter deployment. With data from real drives and experimental evaluation
under lab conditions, we show that workload changes contribute minimally to
temperature changes or failures in the storage system under study. We also
explore parameters in chassis design that can influence the temperature
experienced by hard disk drives, including placement of disk drives within
the chassis and the impact of varying fan speeds. Finally, with the help
of a datacenter cost model and the results of an Arrhenius model to
estimate reliability, I show the estimated cost benefit of
temperature optimizations that increase hard disk drive reliability, and
motivate the need for datacenter architects to consider temperature
impact at the design phase.
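
As a rough illustration of how an Arrhenius model relates temperature to
failure rate, the sketch below computes the standard Arrhenius acceleration
factor between two operating temperatures. It is not the model or the
parameters used in this work: the function name and the activation energy of
0.5 eV are hypothetical values chosen only for illustration.

```python
import math

# Boltzmann constant in electron-volts per kelvin.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_acceleration_factor(temp_base_c, temp_hot_c, activation_energy_ev=0.5):
    """Return the Arrhenius acceleration factor between two temperatures.

    AF = exp((Ea / k) * (1 / T_base - 1 / T_hot)), with temperatures in kelvin.
    A value above 1 means the failure rate at temp_hot_c is expected to be
    that many times the rate at temp_base_c.

    activation_energy_ev is an ASSUMED illustrative value, not a figure
    taken from the study described above.
    """
    t_base_k = temp_base_c + 273.15
    t_hot_k = temp_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_base_k - 1.0 / t_hot_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # Compare a hypothetical 40 C baseline with a 50 C drive location.
    factor = arrhenius_acceleration_factor(40.0, 50.0)
    print(f"Acceleration factor from 40 C to 50 C: {factor:.2f}")
```

Under these assumptions, a factor of this kind could be combined with drive
replacement costs in a cost model to compare cooling or chassis design
options; the actual estimates depend on the activation energy fitted to the
observed failure data.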