Introduction
The Process Introspection project is a design and implementation effort,
the main goal of which is to construct a general purpose, flexible, efficient
checkpoint/restart mechanism appropriate for use in high performance
heterogeneous distributed systems. This checkpoint/restart mechanism
has the primary constraint that it must be platform independent; that is,
checkpoints produced on one architecture or operating system platform
must be restartable on a different architecture or operating system
platform. The Process Introspection mechanism is based on a design pattern
for constructing interoperable checkpointable modules. Application of the
design pattern is automated by two levels of software tools: a library of
support routines that facilitate the use of the design pattern, and a source
code translator that automatically applies the pattern to platform
independent modules. A prototype implementation of the library has been
constructed and used to demonstrate that the design pattern can be applied
effectively to construct platform independent checkpointable programs
that operate efficiently.
Process Introspection is the ability of a process to examine and describe
its own internal state in a logical, platform independent format. In some
senses, all processes that employ a custom programmed checkpoint/restart
implementation utilize the concept of Process Introspection. The Process
Introspection approach extends this technique of hand coding checkpoint/restart
functionality for individual processes into an integrated approach in
which the development of checkpointable program modules is completely
automated when possible, or is at least rendered significantly less
complex through the use of library tools and a general design pattern
when a hand coded checkpoint facility for a module is still appropriate.
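As a rough illustration of what a "logical, platform independent format" can mean for a single value, the sketch below stores a 32-bit integer in a fixed byte order so that it can be read back on a machine with a different native layout. The helper names `put_u32` and `get_u32` are hypothetical and are not part of the PIL; the actual format used by the project is not described here.

```c
#include <stdint.h>

/* Hypothetical helpers (not part of the PIL): write a 32-bit value in a
   fixed big-endian byte order, whatever the host's native byte order is. */
void put_u32(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Read the value back; the result is the same on any architecture. */
uint32_t get_u32(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because the shifts operate on values rather than on memory, neither routine depends on the endianness of the machine that runs it.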
The system design consists of the following components:
- The Process Introspection Design Pattern, a design
template for writing checkpointable codes. This design pattern
describes the elements that must be added to a program in order for
it to support introspective checkpointing, as well as the
relationships and responsibilities of these elements. A key element
of the pattern is the mechanism by which subroutine activation
stacks can be checkpointed and restored using the normal
"subroutine return" and "subroutine call" language features.
- Dynamic runtime support via the Process Introspection Library
(PIL), a set of tools to automate or simplify many of the
tasks involved in implementing an architecture independent
introspective checkpoint mechanism for a module.
- The Automatic Process Introspection Compiler (APrIL),
a source code translator that can transform architecture independent
modules specified in a high level language (e.g. C or Fortran) into
introspective checkpointable modules that utilize the PIL for
interoperability.
- A Central Checkpoint Coordinator (CCC) module
interface definition. The CCC is the part of an introspective process
that coordinates the operation of the introspective modules in the
process at checkpoint/restart time. The CCC also provides the public
checkpoint interface for the introspective process; that is, the
interface via which the process can be asked to produce a checkpoint
or to restart from a given checkpoint.
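The coordinating role of the CCC described above might be sketched as follows. This is only an assumed shape for such an interface: the type `ccc_module`, the functions `ccc_register`, `ccc_checkpoint` and `ccc_restart`, and the example counter module are all hypothetical names chosen for illustration, not the project's actual interface definition.

```c
#include <stddef.h>

/* Hypothetical module descriptor: each introspective module supplies
   routines to save and to restore its own state. */
typedef struct {
    const char *name;
    int (*save)(void *state);     /* returns 0 on success */
    int (*restore)(void *state);  /* returns 0 on success */
    void *state;
} ccc_module;

#define CCC_MAX_MODULES 16
static ccc_module ccc_modules[CCC_MAX_MODULES];
static size_t ccc_module_count = 0;

/* Add a module to the set the coordinator manages. */
int ccc_register(ccc_module m)
{
    if (ccc_module_count == CCC_MAX_MODULES) return -1;
    ccc_modules[ccc_module_count++] = m;
    return 0;
}

/* Public entry point: ask every registered module to save its state. */
int ccc_checkpoint(void)
{
    for (size_t i = 0; i < ccc_module_count; i++)
        if (ccc_modules[i].save(ccc_modules[i].state) != 0) return -1;
    return 0;
}

/* Public entry point: ask every registered module to restore its state. */
int ccc_restart(void)
{
    for (size_t i = 0; i < ccc_module_count; i++)
        if (ccc_modules[i].restore(ccc_modules[i].state) != 0) return -1;
    return 0;
}

/* Example module: a counter whose "checkpoint" is a second field. */
typedef struct { int live; int saved; } counter_state;

int counter_save(void *state)
{
    counter_state *c = state;
    c->saved = c->live;
    return 0;
}

int counter_restore(void *state)
{
    counter_state *c = state;
    c->live = c->saved;
    return 0;
}
```

A process would register each of its modules once at start-up and then call `ccc_checkpoint` or `ccc_restart` when asked to do so from outside.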
Initial Implementation Efforts
Prototype implementations of the PIL, a simple CCC module, and the
APrIL compiler have been constructed and used as the basis for
feasibility demonstrations and initial performance and cost analysis.
A set of sample applications was transformed using the APrIL
source-to-source compiler, linked with the PIL and basic CCC,
and executed to examine typical runtime overheads, to gain an initial
insight into the impact on back end optimizations, to determine checkpoint
request service wait times (i.e. the time between checkpoint request
and service), and to measure basic checkpoint and restart costs.
These initial tests also demonstrate the fundamental feasibility of the
process introspection technique, as each of the example programs was
verified as checkpointable/restartable across the following platforms:
- Sun workstations running Solaris or SunOS 4.x
- SGI workstations running IRIX 5.x
- IBM RS/6000 workstations running AIX
- DEC Alpha workstations running OSF1
- PC compatibles running Linux
- PC compatibles running Microsoft Windows NT or
Microsoft Windows 95, using a GNU/Win32 compatibility library
The interface selected for the simple CCC overloads the "control-C" interrupt
of a process to checkpoint and exit the running program instead of
simply terminating it. Later, when the program is run again, the
CCC notes the presence of a checkpoint, and uses it to implement a restart
instead of allowing the process to start up normally.
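A minimal sketch of this behaviour in standard C is shown below, assuming the interrupt handler does nothing more than record the request and that a checkpoint is stored in an ordinary file. The names `checkpoint_requested`, `install_checkpoint_on_interrupt` and `checkpoint_exists` are hypothetical and are not taken from the prototype.

```c
#include <signal.h>
#include <stdio.h>

/* Set by the handler; the running program polls it at safe points. */
static volatile sig_atomic_t checkpoint_requested = 0;

/* Handler for "control-C": record the request instead of terminating. */
static void on_interrupt(int signo)
{
    (void)signo;
    checkpoint_requested = 1;
}

/* Replace the default SIGINT action. Returns 0 on success, -1 on error. */
int install_checkpoint_on_interrupt(void)
{
    return signal(SIGINT, on_interrupt) == SIG_ERR ? -1 : 0;
}

/* At start-up: report whether a checkpoint file is present, so the
   caller can choose between a restart and a normal start. */
int checkpoint_exists(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL) return 0;
    fclose(f);
    return 1;
}
```

Keeping the handler down to a single flag assignment leaves the actual checkpointing to ordinary program code, which runs outside signal context.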
Initial performance results obtained using the prototype compiler and library
indicate that the system can be used to achieve very low checkpoint
request wait times (0.01 to 1.0 milliseconds on average),
and that it introduces little or no run-time overhead into the normal
operation of transformed programs. Overhead due to the code inserted by the
compiler and its impact on back-end optimizations has
been found to be generally low (0%-15%), but is application dependent
and tunable via certain trade-offs (for example, a slightly higher checkpoint
request wait time can be traded off for less optimizer interference).
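One plausible reading of this trade-off, offered here as an assumption rather than as a description of what APrIL does, is the choice of where polls are placed. The sketch below sums a matrix twice: once polling at every element and once polling only at the start of each row. The names `checkpoint_flag` and `poll_count` are hypothetical, and the counter exists only to make the difference visible.

```c
#include <stddef.h>

/* Hypothetical stand-ins: a flag set asynchronously by a checkpoint
   request, and a counter recording how often the code polls it. */
volatile int checkpoint_flag = 0;
unsigned long poll_count = 0;

/* Poll at every element: shortest wait for a request, but the volatile
   read sits in the innermost loop, where it may hinder the optimizer. */
double sum_poll_per_element(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++) {
            poll_count++;
            if (checkpoint_flag) return s;  /* a real module would save state here */
            s += a[i * cols + j];
        }
    return s;
}

/* Poll once per row: the inner loop is free of polls, at the cost of
   waiting up to one full row before a request is noticed. */
double sum_poll_per_row(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; i++) {
        poll_count++;
        if (checkpoint_flag) return s;      /* a real module would save state here */
        for (size_t j = 0; j < cols; j++)
            s += a[i * cols + j];
    }
    return s;
}
```

For a matrix with R rows and C columns, the first version polls R*C times and the second R times, which is the kind of knob the text calls tunable.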
A very simple example of using process introspection to automatically implement
the checkpoint mechanism for a simple matrix multiply program is available,
with listings of both the original code and the APrIL transformed version
of the code. A key feature
to note about the transformed code is that it contains "consistency
points" at which the process polls the "PIL_CheckpointStatus" variable
to check the "PIL_StatusCheckpointNow" bit to determine if a checkpoint
should be performed. If the checkpoint is requested (checkpoint requests
interrupt the process and set the "PIL_CheckpointStatus" variable),
the code uses the normal C "return" mechanism to traverse the stack
saving all local variables and actual parameters. An equally important
feature of the transformed code is the addition of a prologue to each
function that checks the "PIL_CheckpointStatus" variable for the
"PIL_StatusRestoreNow" bit. If this bit is set, the process uses the
normal C function call mechanism to restore the stack (restoring the
actual parameter and local variable values of each function as it restores
its stack frame), and the C goto mechanism to jump to the right code
location in each stack frame.
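The save and restore paths described above can be illustrated with a hand-written sketch for a two-function call chain. It is not APrIL output. The identifiers `PIL_CheckpointStatus`, `PIL_StatusCheckpointNow` and `PIL_StatusRestoreNow` come from the text, but their types and values here are assumptions, and the in-memory buffer with `save_int` and `load_int` is a hypothetical stand-in for the library's storage routines.

```c
#include <stddef.h>

/* Names from the text; the values and the type are assumptions. */
enum { PIL_StatusCheckpointNow = 1, PIL_StatusRestoreNow = 2 };
volatile int PIL_CheckpointStatus = 0;

/* Hypothetical checkpoint storage: a last-in, first-out buffer of ints. */
static int ckpt_buf[64];
static size_t ckpt_len = 0;
static void save_int(int v) { ckpt_buf[ckpt_len++] = v; }
static int  load_int(void)  { return ckpt_buf[--ckpt_len]; }

/* Callee: sums 0 .. n-1, with one consistency point per iteration. */
int sum_to(int n)
{
    int i, total;

    /* Prologue: on restart, reload this frame and jump to where it stopped. */
    if (PIL_CheckpointStatus & PIL_StatusRestoreNow) {
        total = load_int();
        i = load_int();
        n = load_int();
        PIL_CheckpointStatus &= ~PIL_StatusRestoreNow;  /* innermost frame: done */
        goto resume;
    }

    total = 0;
    for (i = 0; i < n; i++) {
        /* Consistency point: save the frame and unwind with a normal return. */
        if (PIL_CheckpointStatus & PIL_StatusCheckpointNow) {
            save_int(n);
            save_int(i);
            save_int(total);
            return 0;
        }
resume:
        total += i;
    }
    return total;
}

/* Caller: its frame is saved after the callee's and restored before it. */
int run(int n)
{
    int result;

    if (PIL_CheckpointStatus & PIL_StatusRestoreNow) {
        n = load_int();
        goto call_site;       /* re-enter the call that was active */
    }
call_site:
    result = sum_to(n);
    if (PIL_CheckpointStatus & PIL_StatusCheckpointNow) {
        save_int(n);          /* the callee has already saved its frame */
        return 0;
    }
    return result;
}
```

Frames are saved innermost first as the stack unwinds, so a last-in, first-out buffer hands them back outermost first, which is the order in which the calls are re-made during restart. In this sketch the innermost frame clears the restore bit; clearing the checkpoint bit once the stack has unwound is left to whatever code requested the checkpoint.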
Current and Future Directions
Ongoing work on the Process Introspection project centers on the
following areas:
- First, current efforts are focused on the completion of the design
and implementation of version 1 of the Process Introspection Library
(PIL), the APrIL compiler, the Central Checkpoint Coordinator, and
checkpointable utility libraries for sequential codes. Currently,
language features that the compiler does not yet implement limit
the complexity of the codes that can be
automatically checkpointed and restarted. Furthermore, without
checkpointable utility libraries (e.g. a checkpointable file
system interface), the system is not usable for most real problems.
- Secondly, current and future research will continue to expand the
empirical cost analysis of the checkpoint/restart mechanism for
sequential applications. This research is aimed at experimentally
determining:
- What is the run-time overhead of code transformed to
use the PIL as compared to non-transformed code?
- How much does APrIL affect the operation of back-end
optimizers? In other words, how great is the speedup of
APrIL transformed code with back end optimization as compared
to non-transformed, non-checkpointable code?
- What is the observed checkpoint request service wait time
for APrIL transformed codes? In other words, how long on
average and at worst does a process take to begin servicing
a checkpoint?
- What is the observed cost of performing a checkpoint
for processes of different state complexities and sizes?
- What is the observed cost of performing a restart of
checkpoints of different complexities and sizes?
- The third future research goal will be the integration of
introspective checkpointing into at least one parallel distributed
system. The general problems associated with integration of the
system are somewhat independent of the distributed system targeted
for integration. The primary task here is the design and implementation
of a checkpointable wrapper for the selected system's interface.
For example, a checkpointable MPI library is a possibility.
- The fourth major item to be addressed by future work will be
the evaluation of the checkpoint mechanism for distributed programs,
and for use with sequential programs in a distributed environment
(e.g. for load sharing).
More Information
Links to Related Projects
- A useful Survey of Checkpoint Mechanisms
- Condor, load balancing for networks of workstations.
- CUMULVS, distributed program visualization and control, including
heterogeneous task checkpoint/restart.
- Tui, heterogeneous process migration.
- MIST, PVM with transparent migration and checkpointing.
ferrari@virginia.edu
Last modified
Mon Sep 9 17:33:15 EDT 1996