Asynchronous Checkpointing for PVM Requires Message-Logging

Kevin Skadron

18 April 1994; revised 7 February, 1996

I find that checkpointing under PVM using the aggressive, asynchronous two-phase-commit protocol proposed by Elnozahy et al. [1] requires sender-based logging.

Distributed computing using networked workstations offers cost-efficient parallel computing, but the higher rate of failure in such distributed systems requires effective fault-tolerance. Asynchronous consistent checkpointing provides transparent fault-tolerance with low failure-free overhead and minimal rollback. Although checkpointing is well studied, little work exists on distributed checkpointing under Unix. I examine consistent checkpointing for PVM (Parallel Virtual Machine), a popular message-passing environment for networks of Unix systems.

I focus on Elnozahy's two-phase-commit protocol. This protocol suspends computation only long enough to fork a thread; checkpoints are then written concurrently with computation. The overlap permits very low failure-free overhead, typically less than 5% even with frequent checkpoints. The protocol requires tagging all messages and acknowledgements and delaying delivery of some messages. The existing PVM checkpointer by Leon et al. [2] does not overlap computation with checkpoint writing. (Other checkpointers for PVM may have been introduced since I did this work.)
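
The overlap of checkpoint writing with computation can be pictured with a small sketch. It is an illustration only, not code from the report or from Elnozahy et al.: it assumes a Unix-like system, uses a forked process (via os.fork) rather than a thread, and the names checkpoint_async and demo are hypothetical. The parent stops only for the fork; the child holds a copy-on-write snapshot of the state as of that moment and writes it out while the parent keeps computing.

```python
import os
import pickle
import tempfile


def checkpoint_async(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    The parent is suspended only for the duration of fork(). The child
    sees a copy-on-write snapshot of the parent's memory as of the fork.
    """
    pid = os.fork()
    if pid == 0:
        # Child: write the snapshot, then exit without parent cleanup.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            os.rename(tmp, path)  # atomic replace on POSIX file systems
            status = 0
        finally:
            os._exit(status)
    return pid


def demo():
    state = {"iteration": 10, "data": [0] * 1000}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
    pid = checkpoint_async(state, path)
    # Parent keeps computing while the child writes the checkpoint.
    for _ in range(5):
        state["iteration"] += 1
    _, wait_status = os.waitpid(pid, 0)
    with open(path, "rb") as f:
        saved = pickle.load(f)
    return state["iteration"], saved["iteration"], wait_status
```

In this sketch the saved checkpoint reflects the state at the time of the fork, while the live state has already moved on. The message tagging and delayed delivery that the protocol also requires are not modelled here.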

I find that a correct user-level implementation of Elnozahy's scheme under Unix requires sender-based logging. PVM uses Unix sockets, for which the kernel sends message acknowledgements automatically. Without logging, the checkpointer would need access to these acknowledgements, and that would require extending the Unix kernel. This is not an option: one of PVM's chief virtues is that it is portable, strictly user-level code that any user can install on a Unix system. Message logging provides a simple solution.
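
The general idea of sender-based logging can be sketched as follows. This is a simplified illustration under my own assumptions, not the design described in the report: the Sender and Receiver classes, the per-message sequence numbers, and the in-memory log are all hypothetical, and message passing is modelled as a direct method call. The sender keeps a copy of every message it sends, so a receiver that rolls back to a checkpoint can have the later messages replayed entirely at user level.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    expected_seq: int = 0  # next sequence number this receiver will accept
    total: int = 0         # stand-in for application state

    def deliver(self, seq, payload):
        if seq != self.expected_seq:
            return  # ignore duplicates and out-of-order replays
        self.total += payload
        self.expected_seq += 1

    def checkpoint(self):
        return (self.expected_seq, self.total)

    def restore(self, snapshot):
        self.expected_seq, self.total = snapshot


@dataclass
class Sender:
    next_seq: int = 0
    log: list = field(default_factory=list)  # (seq, payload), kept by sender

    def send(self, receiver, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.log.append((seq, payload))  # log before handing off
        receiver.deliver(seq, payload)

    def replay(self, receiver, from_seq):
        # Resend every logged message the rolled-back receiver has not seen.
        for seq, payload in self.log:
            if seq >= from_seq:
                receiver.deliver(seq, payload)

    def trim(self, upto_seq):
        # Drop entries already covered by a committed receiver checkpoint.
        self.log = [(s, p) for s, p in self.log if s >= upto_seq]
```

A typical use would be: the receiver checkpoints, more messages arrive, the receiver fails and calls restore, and the sender then calls replay starting at the receiver's expected_seq to bring it back to its pre-failure state.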

I conducted this work as a senior at Rice in the spring of 1993-94, supervised by Prof. Willy Zwaenepoel. It was supported by the Rice Undergraduate Scholars Program and the Rice Department of Computer Science. The work was done using PVM version 2.4.


This work is reported in:

Kevin Skadron. "Asynchronous Checkpointing for PVM Requires Message-Logging". Rice University Department of Computer Science, April 18, 1994. Revised 7 February, 1996. (Previously titled "Modifications to PVM for Distributed Checkpointing under UNIX".)



References:

1. E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. "The Performance of Consistent Checkpointing". In Proc. of the Eleventh Symposium on Reliable Distributed Systems, October 1992.

2. J. Leon, A. L. Fisher, and P. Steenkiste. "Fail-safe PVM: a portable package for distributed programming with transparent recovery". Carnegie Mellon University, CMU-CS-93-124, February 1993.


Updated Aug. 26, 1996 Copyright © 1996