CS 656, Spring 2002

CS 851

Lecture Notes on Group Communication

MAIN CONCEPTS

Group Communication

Group communication is a paradigm for multi-party communication that is based on the notion of groups as a main abstraction. A group is a set of parties that, presumably, want to exchange information in a reliable, consistent manner. For example:

Group Communication Primitives

Group communication is implemented using middleware that provides two sets of primitives to the application:

Group Communication Semantics

Different approaches to group communication present different delivery semantics concerning the post primitive. Several alternatives are possible:

Failure Semantics

The semantics of group communication (such as causal delivery or totally ordered delivery) must be enforced in the presence of failures. Several different failure models have been proposed in the literature. For example:

Synchronous versus Asynchronous Systems

For the purposes of developing group communication algorithms, we divide distributed systems into two categories:

Virtual Synchrony

The concept of virtual synchrony was proposed by Kenneth Birman as the abstraction that group communication protocols should attempt to build on top of an asynchronous system. Virtual synchrony, according to Birman's paper [1], is defined as follows:

Impossibility of Consensus

At first it appears that virtual synchrony requires members to achieve consensus on the group view. In other words, all correct processors agree on the same view. Consensus, however, is impossible to achieve in an asynchronous system in the presence of crash failures because we cannot distinguish a crashed processor from an arbitrarily slow one (or an arbitrarily slow link). To appreciate the impossibility of consensus, consider the following problem:

Process Group Agreement

Since the absence of a deterministic solution to the consensus problem results from the impossibility of telling a slow processor or link from a crashed one, the process group protocol can still reach agreement by faking member death. In other words, if a member is thought to be dead, that member is declared dead by the protocol and is isolated from the group. In reality, the member might simply be slow. Once a member is declared dead, however, subsequent messages from this member are dropped in the communication subsystem. Thus, from the application's perspective, a crash failure of the dead member is emulated. Agreement can now be achieved among the surviving members (those not declared dead by the protocol). In a way, this solution to the agreement problem is similar to the following.

The point is, process group agreement is different from true consensus in that it resorts to killing dissenting processes instead of reconciling them with the rest (i.e., killing the patients who wouldn't recover instead of curing them). The trick circumvents the impossibility of consensus. Agreement is guaranteed among the surviving members. Hence, virtual synchrony can be implemented. In the subsequent sections we describe how that implementation is done. Intuitively, we need three components to implement virtual synchrony:

PROTOCOL DESIGN

FIFO delivery

FIFO message delivery between a sender and a receiver is the easiest to implement. The sender simply numbers outgoing messages sequentially. The receiver delivers them in the order of their sequence numbers. Note that this simple protocol catches send and receive omissions. Should a message get lost (e.g., because of an error at the sender or receiver), a gap is observed at the receiver in the sequence numbers. The missing message can then be requested from the sender. Delivery of all subsequent messages is postponed until the missing one is delivered. Such a layer provides the abstraction of a reliable FIFO channel with a potentially unbounded latency. (The latency is unbounded because we do not know how many retransmissions will be necessary to deliver a message reliably.) Hence, in the following we assume reliable FIFO channels and an asynchronous system model (unbounded latency).
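The receiver side of this scheme can be sketched as follows. This is a minimal illustration, not code from the lecture: the class name FifoReceiver, its fields, and the convention that sequence numbers start at 0 are all assumptions made for the example. The return value of receive() lists the sequence numbers that would have to be re-requested from the sender.

```python
class FifoReceiver:
    """Delivers messages from one sender in sequence-number order."""

    def __init__(self):
        self.next_seq = 0      # next sequence number to deliver (assumed to start at 0)
        self.pending = {}      # out-of-order messages, keyed by sequence number
        self.delivered = []    # messages handed to the application, in order

    def receive(self, seq, payload):
        """Accept a message; return the sequence numbers that are missing."""
        if seq < self.next_seq or seq in self.pending:
            return []          # duplicate, e.g. a repeated retransmission
        self.pending[seq] = payload
        # Deliver the longest gap-free run starting at next_seq.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        if not self.pending:
            return []
        # Anything still buffered sits behind a gap; list the missing numbers.
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]


rx = FifoReceiver()
rx.receive(0, "a")
missing = rx.receive(2, "c")   # message 1 was lost, so a gap is observed
rx.receive(1, "b")             # the retransmission fills the gap
```

In this run, message "c" is held back until "b" arrives, after which both are delivered in order.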

Reliable Multicast

Next, we consider the all-or-nothing problem, i.e., that of ensuring that a message is delivered reliably to all destinations in a group once it has been delivered to at least one destination in the group. There are two algorithms that achieve reliable delivery to all members:

Causal Multicast

Causal multicast implements causal delivery semantics. Each process keeps a precedence vector with one entry per group member. The vector is updated upon send and receive events. Each entry in the vector gives the number of the latest message from the corresponding group member that causally precedes the current event. The protocol for ensuring causal delivery order is as follows:

The above protocol can be implemented on top of the reliable message delivery layer described earlier (in the same sense as TCP is implemented above IP). In such an implementation, a message declared stable (i.e., committed) by the reliable multicast layer is forwarded up the protocol stack to the causal multicast layer that may further delay the delivery of the message to the application if the message is forwarded out of order.
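One common way to realize such a precedence vector is sketched below. The delivery rule used here (deliver a message from sender s only when it is the next message from s and every other vector entry it carries is already covered locally) is the standard vector-timestamp condition and is an assumption of this sketch, as are the names CausalMember, held, and delivered. A sender is assumed to deliver its own messages locally at send time.

```python
class CausalMember:
    """Holds back messages until everything that causally precedes them is delivered."""

    def __init__(self, me, group_size):
        self.me = me
        self.vector = [0] * group_size   # one entry per group member
        self.held = []                   # received but not yet deliverable
        self.delivered = []              # payloads passed to the application

    def send(self, payload):
        self.vector[self.me] += 1
        return (self.me, list(self.vector), payload)

    def _deliverable(self, sender, stamp):
        for k, count in enumerate(stamp):
            if k == sender:
                if count != self.vector[k] + 1:   # must be the next one from sender
                    return False
            elif count > self.vector[k]:          # a causal predecessor is missing
                return False
        return True

    def receive(self, message):
        self.held.append(message)
        progress = True
        while progress:                  # one delivery may unblock held messages
            progress = False
            for held in list(self.held):
                sender, stamp, payload = held
                if self._deliverable(sender, stamp):
                    self.held.remove(held)
                    self.vector[sender] = stamp[sender]
                    self.delivered.append(payload)
                    progress = True


p0, p1, p2 = (CausalMember(i, 3) for i in range(3))
m1 = p0.send("question")
p1.receive(m1)
m2 = p1.send("answer")     # causally follows m1
p2.receive(m2)             # arrives first at p2, so it is held back
held_early = list(p2.delivered)
p2.receive(m1)             # now both can be delivered, in causal order
```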

Totally Ordered Multicast

In the simplest case, to implement totally ordered multicast, all messages are sent to a single coordinator. The coordinator assigns sequence numbers to them and broadcasts them to the group. Everyone in the group delivers the messages in the order of these sequence numbers. If the coordinator dies, a new coordinator is elected.
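A minimal sketch of the coordinator-based scheme follows. The class names Coordinator and Member are hypothetical, network transport is replaced by direct method calls, and coordinator election is left out entirely.

```python
class Coordinator:
    """Assigns a single global sequence number to every message."""

    def __init__(self):
        self.next_seq = 0

    def order(self, payload):
        stamped = (self.next_seq, payload)
        self.next_seq += 1
        return stamped


class Member:
    """Delivers stamped messages strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}
        self.delivered = []

    def receive(self, stamped):
        seq, payload = stamped
        self.pending[seq] = payload
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


coordinator = Coordinator()
x = coordinator.order("x")   # submitted by one member
y = coordinator.order("y")   # submitted by another member

a, b = Member(), Member()
a.receive(x)
a.receive(y)
b.receive(y)                 # the broadcasts arrive in the opposite order at b
b.receive(x)
```

Both members end up delivering "x" before "y", whatever the arrival order of the broadcasts.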

Membership View Changes

To ensure the first two properties of virtual synchrony (i.e., that all members who deliver a message agree on their membership view, and that the message destination list consists precisely of all members in that view), membership view changes must be totally ordered with respect to message multicasts. If that is not true, a message can be delivered to two members with different membership views as shown below.

The following algorithm is used to achieve total order of membership changes with respect to multicasts. Let the current membership view be view i.

The previous example now looks as follows.

Note that if another process (e.g., P3) fails before view i+1 is installed, the surviving processes will not receive a flush message from P3 and will not be able to install the new view. Instead, we allow a new view change message, flush (i+2), to be sent to members of the latest view, informing them of the demise of P3. Those members can now give up waiting for P3 and install view i+1 followed immediately by i+2. This can be easily generalized to any number of back-to-back failures.
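The back-to-back failure case can be illustrated with a small sketch of the bookkeeping at one surviving member. The details are assumptions made for illustration only: the class ViewInstaller is hypothetical, each member is assumed to send a single flush that the receiver records (including its own), and a pending view is installed once flushes have arrived from every member that survives into the latest proposed view.

```python
class ViewInstaller:
    """Tracks flush messages and installs pending views in order."""

    def __init__(self, current_view):
        self.view = set(current_view)   # membership of the installed view
        self.proposed = []              # pending views, oldest first
        self.flushed = set()            # members whose flush has arrived
        self.installed = []             # views installed so far, in order

    def view_change(self, failed):
        """A view-change message announces that `failed` has left the group."""
        base = self.proposed[-1] if self.proposed else self.view
        self.proposed.append(set(base) - {failed})
        self._try_install()

    def flush(self, sender):
        self.flushed.add(sender)
        self._try_install()

    def _try_install(self):
        if not self.proposed:
            return
        # Wait only for members that survive into the latest proposed view.
        if self.proposed[-1] <= self.flushed:
            for view in self.proposed:
                self.installed.append(sorted(view))
            self.view = self.proposed[-1]
            self.proposed = []
            self.flushed = set()


v = ViewInstaller({"P1", "P2", "P3", "P4"})
v.view_change("P4")          # view i+1 excludes P4
v.flush("P1")
v.flush("P2")
before = list(v.installed)   # still waiting for a flush from P3
v.view_change("P3")          # P3 failed too: stop waiting for it
```

After the second view change, the member installs view i+1 and then view i+2 back to back.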

RING-BASED PROTOCOL SIMPLIFICATIONS

Significant simplifications could be made to the implementation of group communication protocols if members could be assumed to form a logical ring where only one member can send at a time. While the one-at-a-time sender assumption is too restrictive in WANs, it is perfectly reasonable if all members are on a single LAN. In such a case, members can be numbered in some arbitrary order. A token can circulate around the logical ring. A member that has the token can send messages. The serialization of senders simplifies protocol design considerably. In the following we shall demonstrate these simplifications using RTCAST as a case study. RTCAST is a group communication protocol that provides reliable, totally ordered delivery of messages, as well as reliable, totally ordered delivery of membership changes.

Implementing Total Order

To implement total order in RTCAST, each member (i) numbers its own messages sequentially, and (ii) marks the last message it sends using a special bit in the header. Each member keeps track of the last message it received from every other member. Following the receipt of message j from sender i, a receiver expects one of the following:

Hence, the next message to be received is known unambiguously to the receiver. If the receiver receives a different message instead, a gap is identified. In such a case, the new message is not delivered until the gap is filled. Hence, messages are delivered to the application in total order. Note that the algorithm requires that each member send at least one message each time it gets the token. If no messages are queued up, a dummy message (marked last) is sent. This message is also specially marked to be discarded at the receiver.
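The rule for determining the next expected message can be sketched as a small function. The two cases coded here are an assumption inferred from the surrounding text: if message j from sender i is not marked last, the receiver expects message j+1 from i; if it is marked last, the receiver expects the next message from i's successor on the ring. The function name and the ring and last_seen structures are hypothetical.

```python
def next_expected(ring, last_seen, sender, seq, is_last):
    """Return (member, sequence number) of the only acceptable next message.

    ring      -- member ids in token order
    last_seen -- last sequence number received from each member (updated here)
    """
    last_seen[sender] = seq
    if not is_last:
        return sender, seq + 1               # same sender keeps the token
    successor = ring[(ring.index(sender) + 1) % len(ring)]
    return successor, last_seen[successor] + 1


ring = ["A", "B", "C"]
last_seen = {"A": 4, "B": 7, "C": 2}
first = next_expected(ring, last_seen, "A", 5, is_last=False)
second = next_expected(ring, last_seen, "A", 6, is_last=True)
third = next_expected(ring, last_seen, "C", 3, is_last=True)   # wraps around to A
```

Any arriving message that does not match the returned pair signals a gap.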

Implementing Reliable Multicast

In RTCAST, receivers deliver received messages immediately to the application. No agreement protocol is run to make sure that all members receive the message, and no two-phase commit is invoked. Yet, it is guaranteed that delivery is all-or-nothing to surviving members. The implementation of this property is very simple. Having sent all its messages, each sender broadcasts a heartbeat to the group. When received by the next member on the ring, this heartbeat is interpreted as the passing of the token. Note that because channels are FIFO, by the time the token arrives at a member from the immediately preceding one, the receiving member should have also received the last data message sent by its predecessor (since it is sent immediately before the token). In fact, it should have received all messages up to that last message. Upon receiving the token, each member can thus check whether it has received all messages transmitted in the previous token round. If a gap exists in the sequence of messages received, the member fails by crashing. The only exception is when the member is missing a message from someone who was declared dead in the same round. The death declaration carries the number of the last message communicated by the dead sender, which allows receivers to check for gaps. Hence, surviving members are guaranteed to have received all messages in order. In other words, every message sent is guaranteed to reach all surviving members.
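The check a member performs on receiving the token can be sketched as follows. The data layout is an assumption made for this example: round_state maps each sender to the first sequence number expected from it this round, the set of numbers actually received, and the number of the message marked last (or None if that message never arrived); death_notices maps each member declared dead in the round to the last message number carried by its death declaration.

```python
def has_gap(first_expected, received, last_seq):
    """True if any number in first_expected..last_seq was not received."""
    return any(s not in received for s in range(first_expected, last_seq + 1))


def survives_token_check(round_state, death_notices):
    """Return True to keep running, False if the member must crash."""
    for sender, (first, received, last_marked) in round_state.items():
        if sender in death_notices:
            last = death_notices[sender]   # bound taken from the death declaration
        elif last_marked is None:
            return False                   # the sender's final message is missing
        else:
            last = last_marked
        if has_gap(first, received, last):
            return False
    return True


complete = {"A": (5, {5, 6}, 6), "B": (8, {8, 9}, 9)}
missing_one = {"A": (5, {5, 6}, 6), "B": (8, {8, 10}, 10)}
dead_sender = {"A": (5, {5, 6}, 6), "B": (8, {8}, None)}

ok = survives_token_check(complete, {})
crash = survives_token_check(missing_one, {})
ok_dead = survives_token_check(dead_sender, {"B": 8})
crash_dead = survives_token_check(dead_sender, {"B": 9})
```

In the last case the death declaration says B sent message 9, which this member never received, so it must still crash.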

Note that virtual synchrony has somewhat stronger semantics than those achieved above. Virtual synchrony requires that a message sent to a group reach all members of that group. It turns out that RTCAST achieves these semantics as well. This is because members who missed a message will crash as soon as they get the token following the gap due to the missing message. Hence, they crash before they can send anything that divulges the fact that they did not get all the messages communicated in the preceding round. From the perspective of the other members, it is as if each message indeed reached all its recipients, immediately followed by the failure of some of them. The fact that in reality some messages were never delivered to the failed processors remains entirely concealed.

Total Order of Membership Updates

Membership updates are sent as regular data messages to the union of the two membership views involved. The mechanisms described above ensure that these updates are delivered reliably and in total order to all members. In RTCAST, a maximum token rotation time is imposed on the protocol by bounding the amount of time each sender is allowed to send messages. If a member does not receive the token in time, it will time out. The member will then check the heartbeats it received during the last round and eliminate (by sending a membership change message) all its consecutive predecessors from which heartbeats have not been received. In this elimination message, the sender also mentions the number of the latest message received from each failed member. The member then sends a heartbeat, which serves as a regenerated token to its successor. The ring operation continues as usual.
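The timeout step can be sketched as a walk backwards around the ring. This is an illustrative sketch under assumptions: the function names, the ring list, the heartbeats set, and the dictionary format of the elimination message are all hypothetical.

```python
def silent_predecessors(ring, me, heartbeats):
    """Collect the consecutive predecessors of `me` that sent no heartbeat."""
    start = ring.index(me)
    silent = []
    for step in range(1, len(ring)):
        candidate = ring[(start - step) % len(ring)]
        if candidate in heartbeats:
            break                      # first live predecessor ends the walk
        silent.append(candidate)
    return silent


def elimination_message(ring, me, heartbeats, last_seen):
    """Pair each eliminated member with the latest message number received from it."""
    return {m: last_seen[m] for m in silent_predecessors(ring, me, heartbeats)}


ring = ["A", "B", "C", "D", "E"]
last_seen = {"A": 3, "B": 9, "C": 4, "D": 7, "E": 1}

# E times out having heard heartbeats from A and B only in the last round.
removed = silent_predecessors(ring, "E", {"A", "B"})
notice = elimination_message(ring, "E", {"A", "B"}, last_seen)
```

Here E eliminates D and C, stopping at B, the nearest predecessor from which a heartbeat was received.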