- 1998-02
- Process State Capture and Recovery in High-Performance Heterogeneous Distributed Systems
- Adam John Ferrari, January, 1998
- Advisor: Andrew Grimshaw
- Online Formats: PostScript, PDF
Abstract:
Process Introspection is a fundamentally new solution to the
process state capture and recovery problem suitable for use in
high-performance heterogeneous distributed systems. A process state
capture and recovery mechanism for such an environment has the primary
requirement that it must be platform-independent: process checkpoints
produced on a computer system of one architecture or operating system
platform must be recoverable on a computer system of a different
architecture or operating system platform. The central feature of the
Process Introspection approach is automatic transformation of program
code to incorporate state capture and recovery functionality. This
program modification is performed at a platform-independent intermediate
level of code representation, and preserves the original program semantics.
The attractive properties of this approach include portability, ease
of use, and flexibility with respect to basic performance trade-offs
and application-specific requirements. Our solution is novel in its
true platform and run-time system independence: no system support or non-portable
code is required by our core mechanisms. Experimental results obtained
using a prototype implementation of the Process Introspection system
indicate this mechanism can be applied to computationally demanding
scientific applications automatically, resulting in very low run-time
overhead (typically below 10%) and efficient state capture and recovery
service.
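
As a rough illustration of the idea described in the abstract, the following
hand-written sketch shows what a computation might look like after state
capture and recovery code has been added to it. It is not the thesis's
transformation, which is automatic and operates on an intermediate code
representation; here the instrumentation is written by hand in Python. All
names (`sum_squares`, `CheckpointRequested`, `stop_at`) and the checkpoint
layout are assumptions made for this example. The checkpoint uses fixed-width
integers in network byte order so that the same bytes can be read back on a
host with a different native byte order or word size.

```python
import struct

# Hypothetical checkpoint layout: two signed 64-bit integers in network
# (big-endian) byte order, so the bytes mean the same thing on any host.
_FORMAT = "!qq"


class CheckpointRequested(Exception):
    """Carries the captured state out of the interrupted computation."""

    def __init__(self, blob):
        super().__init__("checkpoint taken")
        self.blob = blob


def sum_squares(n, blob=None, stop_at=None):
    """Sum i*i for i in range(n), with a poll point in each iteration.

    blob:    a checkpoint to resume from, or None for a fresh start.
    stop_at: iteration at which a capture is requested (a stand-in for an
             external checkpoint request); None means never.
    """
    # Recovery prologue: restore the live variables from the checkpoint.
    i, total = struct.unpack(_FORMAT, blob) if blob is not None else (0, 0)
    while i < n:
        # Poll point: capture the live variables if a checkpoint is requested.
        if stop_at is not None and i == stop_at:
            raise CheckpointRequested(struct.pack(_FORMAT, i, total))
        total += i * i
        i += 1
    return total


if __name__ == "__main__":
    try:
        sum_squares(10, stop_at=4)
    except CheckpointRequested as request:
        # The blob could be written to disk or sent to another machine here.
        print(sum_squares(10, blob=request.blob))
```

Resuming from the captured blob yields the same result as an uninterrupted
run. The poll point sits inside the loop so that a capture request is noticed
promptly; placing poll points less often would lower run-time overhead at the
cost of a longer wait before a checkpoint can be taken, which is one form of
the performance trade-off the abstract mentions.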