CS 656, Spring 2002
CS 851
Lecture Notes on Group Communication
MAIN CONCEPTS
Group Communication
Group communication is a paradigm for multi-party communication that
is based on the notion of groups as a main abstraction.
A group is a set of parties that, presumably, want to exchange information
in a reliable, consistent manner. For example:
- The participants of a message-based conferencing tool
may constitute a group.
Ideally, in order to have meaningful communication, each participant
wants to receive all communicated messages from each other
participant. Moreover, if one message is a response to another,
the original message should be delivered before the response.
(In this example, if two participants originate messages
independently at about the same time, the order in which such
independent messages are delivered is not important)
- The set of replicas of a fault-tolerant database server
may constitute a group. Consider update messages to the
server. Since the contents of the database
depend on the history of all update messages received, all updates
must be delivered to all replicas. Furthermore, all updates
must be delivered in the same order. Otherwise, inconsistencies
may arise.
Group Communication Primitives
Group communication is implemented using middleware that provides two
sets of primitives to the application:
- Multicast primitive (e.g., post): This primitive allows a sender
to post a message to the entire group.
- Membership primitives
(e.g., join, leave, query_membership): These
primitives allow a process to join or leave a particular group,
as well as to query the group for the list of all current
participants.
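As a rough illustration, the two sets of primitives could be exposed to an application through an interface like the one below. Only the primitive names (post, join, leave, query_membership) come from these notes; the class name InMemoryGroup and the callback-style delivery are assumptions made for this sketch, which runs in a single process and provides no delivery guarantees of its own.

```python
from typing import Callable, Dict, List


class InMemoryGroup:
    """Toy, single-process stand-in for group communication middleware."""

    def __init__(self) -> None:
        # Maps member name -> callback invoked as callback(sender, message).
        self._members: Dict[str, Callable[[str, str], None]] = {}

    def join(self, name: str, on_deliver: Callable[[str, str], None]) -> None:
        self._members[name] = on_deliver

    def leave(self, name: str) -> None:
        self._members.pop(name, None)

    def query_membership(self) -> List[str]:
        return sorted(self._members)

    def post(self, sender: str, message: str) -> None:
        # Hand the message to every current member, including the sender.
        for deliver in list(self._members.values()):
            deliver(sender, message)
```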
Group Communication Semantics
Different approaches to group communication present different delivery
semantics concerning the post
primitive. Several alternatives are possible:
- Best effort delivery:
No guarantees are given on message delivery. No guarantees are given
on the order of messages delivered to the destination. Obviously,
this is not a useful abstraction.
- Reliable message delivery:
The communication subsystem ensures the following three properties:
- If a correct process broadcasts message m, then
all correct processes eventually deliver m.
- If a correct process delivers message m, then
all correct processes eventually deliver m.
(Note that the difference of this property from the previous
one is that this property should hold even if the
sender was not a correct process. This is important to
ensure consistency even when the
sender crashes in the middle of a broadcast after some
correct receivers have already delivered the message).
- Every process delivers m at most once, and only if m
was previously broadcast.
- FIFO message delivery:
This is reliable message delivery that also
ensures that messages from the same source are delivered in the
order they were sent. (If a process broadcasts message m
before broadcasting message m', then no correct process
delivers m' unless it has previously delivered m.)
FIFO delivery does not guarantee that messages from different
sources will be delivered in the same order. For example, if
process A responds to a message from process B, these messages
can be delivered in different orders to processes C and D who
are listening to the conversation. This is
because the messages do not originate from the same sender.
TCP implements FIFO delivery for the special case of a group of
two participants. (Note that: a. Every byte gets delivered. b. Bytes
from the same sender are delivered in order.)
- Causal message delivery:
It generalizes FIFO message delivery to a scenario where messages
are delivered in the order of potential causality. In other words,
if message m' could have been a response to m,
all correct processes must deliver m before m'.
Causal message delivery delivers messages in the order they
were sent according to Lamport's happened before relation.
Lamport proposed that, in the absence of global time in
a distributed system, event A happens before event B iff one of the following holds:
- Event A precedes event B on the same processor
- Event A is that of sending message m, and
event B is that of receiving the same message m
- Event A is shown to occur before B by transitivity
(e.g., A happens before C which happens before B)
- Totally ordered message delivery:
All messages are delivered in the same order to all destinations.
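The three rules of the happened-before relation listed above can be sketched as a reachability computation over events. The function name happened_before and its input format (per-process event lists plus send/receive pairs) are assumptions chosen for this illustration.

```python
from typing import Dict, List, Set, Tuple


def happened_before(process_events: Dict[str, List[str]],
                    messages: List[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Return all pairs (a, b) such that event a happened before event b.

    process_events: for each process, its event names in local order.
    messages: (send_event, receive_event) pairs, one per message.
    """
    edges: Dict[str, Set[str]] = {}
    for events in process_events.values():
        for e in events:
            edges.setdefault(e, set())
        # Rule 1: an event precedes the next event on the same processor.
        for earlier, later in zip(events, events[1:]):
            edges[earlier].add(later)
    # Rule 2: sending a message precedes receiving it.
    for send, recv in messages:
        edges.setdefault(send, set()).add(recv)
        edges.setdefault(recv, set())
    # Rule 3: transitivity, computed by a graph search from every event.
    result: Set[Tuple[str, str]] = set()
    for start in edges:
        stack = list(edges[start])
        seen: Set[str] = set()
        while stack:
            e = stack.pop()
            if e not in seen:
                seen.add(e)
                result.add((start, e))
                stack.extend(edges[e])
    return result
```

Two events that appear in neither order in the result are concurrent, which is exactly the case where causal delivery leaves the delivery order unconstrained.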
Failure Semantics
The semantics of group communication (such as causal delivery or totally
ordered delivery) must be enforced in the presence of failures. Several
different failure models have been proposed in the literature. For example:
- Processor crash failures: In this model, a failed processor
stops communicating with the outer world. This type of failure is
the easiest to deal with, since we never hear from the failed
party again. It should be noted that a recovered processor
(after a crash failure) joins the group as a new member. It has
no "memory" of its "previous life" before the crash. It will
be made consistent with the rest of the group anew as a part of the
join process.
- Send/Receive omissions: In this model, a failed processor
can sporadically omit sending or receiving messages. Such is usually
the case when the network interface is faulty. This type of failure
is more difficult to deal with because a failed processor can
become inconsistent with the rest of the group
(e.g., due to failure to receive
an important message delivered to all other members).
This processor, however, continues to
communicate with the outer world possibly spreading erroneous
information to other parties based on its inconsistent data.
This is called contamination. Contamination can be avoided
by preventing the processor from communicating with others
upon a send or receive omission.
Of course, this requires a mechanism for detecting such
omissions which is what makes these types of failure more difficult
to deal with than crash failures.
- Arbitrary failures: Such failures can exhibit an arbitrarily
complex (possibly malicious) behavior. For example, a processor
can knowingly send false information to others in order to
contaminate them. Such failures are not dealt with in current
group communication approaches.
Synchronous versus Asynchronous Systems
For the purposes of developing group communication algorithms, we
divide distributed systems into two categories:
- Synchronous systems: The key property of such systems is that
all communication and processing takes bounded time. Assuming
a reliable communication medium and a crash failure model for
processors, a sender can know for sure that a receiver has failed
if the sender didn't get an acknowledgement from the receiver within
some finite time computed from the worst case processing and
communication delays.
- Asynchronous systems: The key property of such systems is
that communication and processing delays can be arbitrarily long
even in the absence of failures. Hence, there is no way we can
distinguish between an arbitrarily slow processor, and one that has
failed. Implementing group communication semantics in the presence
of failures becomes more challenging in the asynchronous model.
Virtual Synchrony
The concept of virtual synchrony was proposed by Kenneth Birman as the
abstraction that group communication protocols should attempt to build
on top of an asynchronous system. Virtual synchrony, according to Birman's
paper [1], is defined as follows:
- All recipients have identical group views when a message is
delivered. (The group view of a recipient defines the set of
"correct" processors from the perspective of that recipient.)
- The destination list of the message consists precisely of the
members in that view
- The message should be delivered either
to all members in its destination
list or to no one at all.
The latter case can occur only if the sender fails
during transmission.
- Messages should be delivered in causal or total order (depending
on application semantics).
Impossibility of Consensus
At first it appears that virtual synchrony requires members to achieve
consensus on the group view. In other words, all correct processors
must agree on the same view. Deterministic consensus, however, is impossible to
guarantee in an asynchronous system in the presence of crash failures because we cannot
distinguish a crashed processor from an arbitrarily slow one (or an
arbitrarily slow link). To appreciate the impossibility of consensus,
consider the following problem:
A Babylonian King hears that two of his prime aides are plotting
a conspiracy against him. The King orders that the culprits
be apprehended. When they are brought to his presence, he tells
them: "The punishment for conspiracy is death. However, since
you've served me for so many years, I will give you a last chance.
I will put you in separate prison cells that are far apart. Each cell
has two doors, one Blue and one Green, guarded
at all times. You may exit the cell at your own risk.
If both of you happen to exit through the same-color door, I will
pardon you and set you free. If you happen to exit through doors of
different colors, I will hang you both. If you stay in your cell, I may hang
one of you, in which case no matter which door the other one uses to exit
he's free. You may communicate by messengers to
reach agreement on which door color to choose. If the messenger finds the
recipient dead (hanged) he will not return. Messengers can take an
arbitrarily long time to reach the recipient. There is no bound on
how long that time might be." The King then put the two men
in their cells and asked the messengers to take an infinite amount of
time to communicate their messages. In the absence of communication,
there is no way the two prisoners can deterministically achieve
consensus on the door
color to use. Furthermore, neither one can deterministically tell
that the other has been executed (which would have given the survivor the
right to exit from either door with impunity). Note however, that
probabilistic solutions exist to the problem. For example, if both
choose a door at random, there is a 50% chance that they will survive.
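The 50% figure can be checked by enumerating the equally likely outcomes. The sketch below assumes both prisoners are alive, both exit, and each picks a door independently and uniformly at random; the variable names are illustrative only.

```python
from itertools import product

doors = ("Blue", "Green")

# Every pair of independent choices (prisoner 1, prisoner 2); all are equally likely.
outcomes = list(product(doors, repeat=2))

# The prisoners are pardoned only when both exit through the same color.
survive = [pair for pair in outcomes if pair[0] == pair[1]]

survival_probability = len(survive) / len(outcomes)
```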
Process Group Agreement
Since the absence of a deterministic solution to the consensus problem
results from the impossibility to tell a slow processor/link from
a crash failure, the process group protocol can still reach agreement
by faking member death. In other words, if a member is thought to
be dead, that member is declared dead by the protocol and is isolated from
the group. In reality, the member might simply be slow. Once a member is
declared dead, however, subsequent messages from this member are dropped in
the communication subsystem. Thus, from the application's perspective
a crash failure of the dead member is emulated. Agreement can now be
achieved among the surviving members (those not declared dead by the
protocol). In a way, this solution to the agreement problem is similar
to the following.
A pharmaceutical company announces the release of a new medication
"X". The medication is guaranteed to cure its recipient from the
common cold within 24 hours, with the obvious minor stipulation that
the person should still be alive at that time. The company delivers
the promise literally. If the recipient is not fully cured after
23 hours and 59 minutes, the company has a hit man eliminate the
patient immediately. As a result of this policy, all those who
remain alive 24 hours after administering the wonder drug must,
by design, be
fully cured! Note that this mechanism indeed
guarantees the promised semantics, i.e., full recovery of all
surviving patients.
The point is, process group agreement is different from true
consensus in that it resorts to killing dissenting processes
instead of reconciling them with the rest
(i.e., killing the patients who wouldn't recover instead of curing
them). The trick circumvents the impossibility of consensus.
Agreement is guaranteed among the surviving members. Hence, virtual
synchrony can be implemented. In the subsequent sections we describe
how that implementation is done. Intuitively, we need three components
to implement virtual synchrony:
- An implementation of reliable multicast (that enforces the
all-or-nothing property).
- An implementation of causal or total order
among different multicasts
- An implementation of membership updates that are totally ordered
with respect to multicasts. This ensures that all members who deliver
a multicast have the same membership view which also happens to
be the view of the sender.
PROTOCOL DESIGN
FIFO delivery
FIFO message delivery between a sender and receiver is easiest to
implement. The sender simply numbers outgoing messages
sequentially. The receiver delivers them in the order of their
sequence numbers. Note that this simple protocol catches send and
receive omissions. Should a message get lost (e.g., because of an
error at the sender or receiver), a gap is observed at the receiver in the
sequence numbers. The missing message can then be requested from the
sender. Delivery of all subsequent messages is postponed until the missing
one is delivered. Such a layer will provide the abstraction
of a reliable FIFO channel with a potentially unbounded latency.
(The latency is unbounded because we do not know how many retransmissions
will be necessary to deliver a message reliably.) Hence, in
the following we assume reliable FIFO channels and an asynchronous system
model (unbounded latency).
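A minimal sketch of the receiver side of such a FIFO layer is shown below. It assumes a single sender whose sequence numbers start at 0; the class name FifoReceiver and the convention of returning the sequence numbers to re-request are assumptions of this example, and the actual retransmission request is left to the caller.

```python
from typing import Dict, List


class FifoReceiver:
    """Delivers messages from one sender in sequence-number order."""

    def __init__(self) -> None:
        self.next_seq = 0                  # next sequence number to deliver
        self.held: Dict[int, str] = {}     # out-of-order messages waiting for a gap to fill
        self.delivered: List[str] = []     # what the application has seen, in order

    def receive(self, seq: int, payload: str) -> List[int]:
        """Accept one message; return the sequence numbers still missing."""
        if seq < self.next_seq or seq in self.held:
            return []                      # duplicate: ignore it
        self.held[seq] = payload
        # Deliver every message that is now contiguous with what was delivered.
        while self.next_seq in self.held:
            self.delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        # Anything still held sits above a gap; report the missing numbers.
        if self.held:
            return [s for s in range(self.next_seq, max(self.held))
                    if s not in self.held]
        return []
```

Delivery of message 2 is postponed until message 1 arrives, at which point both are handed to the application in order.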
Reliable Multicast
Next, we consider the all-or-nothing problem, i.e., that of ensuring that
a message is delivered reliably to all destinations in a group once
it has been delivered to at least one destination in the group.
There are two algorithms that achieve reliable delivery to all members:
- The first protocol is the simplest but has quadratic complexity.
In this protocol a sender sends a copy of the message to each group
member. A member upon receiving the first copy of a message
forwards it to all members. This ensures that all correct members
eventually get the message. Note that if everything goes well,
every member will eventually receive a copy of the message from
every other member. If some member, A, never gets a copy of the
message from anyone, that member will never broadcast the message
either. Hence, the other members will not receive a copy of the message
from A, and each will send A a negative acknowledgement (with the
original message piggybacked on it). Eventually, A either receives the
(piggybacked) message and broadcasts it, or is presumed dead
by the rest of the group and is excluded from membership.
Consequently, the multicast is guaranteed to reach all
surviving members.
- The second protocol is similar to a 2-phase commit. It has the
following two stages:
- The sender sends a "prepare to commit m" message to all
recipients. Everyone stores a local copy of m
and sends an ack. At this point message m is said
to be unstable. It will be considered stable
when the commit message arrives.
- When the sender has collected all the acks it knows that
everyone in the group has a local copy of m and
should be able to deliver it. The sender then sends a
commit message to everyone. Upon receipt of the commit
message each process delivers m and sends an ack.
If the sender does not receive an ack from some processes
it will reissue the commit message
to them. Eventually, such
processes either commit m or are considered to have
crashed and are excluded from membership.
Note that in this protocol, inconsistency will arise if the sender
dies in the middle of sending the commit message. In this situation,
some receivers may receive the commit and deliver, while others
might not. Since the sender is dead, no one is there to
re-issue the commit message to them. The problem
is solved when the failure of the sender is detected as will
be described in the protocol for membership changes. The
solution forces delivery of unstable messages received
from the dead sender.
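The first (quadratic) protocol can be illustrated with a small simulation. This is a sketch under simplifying assumptions: channels are reliable, only the sender may crash, and the negative-acknowledgement step is omitted. The function name flood_multicast and the parameter sender_reaches (the set of members the sender manages to send to before crashing) are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple


def flood_multicast(members: List[str], sender: str, msg: str,
                    sender_reaches: Set[str]) -> Dict[str, List[str]]:
    """Simulate forward-on-first-receipt multicast; return deliveries per member.

    Pass set(members) as sender_reaches for a sender that does not crash.
    """
    # The sender is treated as crashed if it failed to reach every member.
    crashed = set() if sender_reaches >= set(members) else {sender}
    delivered: Dict[str, List[str]] = {m: [] for m in members}
    # Queue of (destination, message) pairs currently in flight.
    in_flight: Deque[Tuple[str, str]] = deque(
        (d, msg) for d in members if d in sender_reaches)
    while in_flight:
        dest, m = in_flight.popleft()
        if dest in crashed or m in delivered[dest]:
            continue                       # crashed members and duplicates are ignored
        delivered[dest].append(m)          # first copy: deliver it ...
        for other in members:              # ... and forward it to everyone else
            if other != dest:
                in_flight.append((other, m))
    return delivered
```

If the sender crashes after reaching only one member, that member's forwarding still brings the message to every survivor; if the sender reaches no one, nobody delivers. Either way the outcome among surviving members is all-or-nothing.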
Causal Multicast
Causal multicast implements causal delivery semantics. Each process
keeps a precedence vector with one entry per group member. The vector is
updated upon send and receive events. Each entry in the vector gives
the number of the latest message from the corresponding group member
that causally precedes the current event. The protocol for ensuring causal
delivery order is as follows:
- When a member sends a message, it increments its own entry
in its precedence vector and sends the vector along with the
message.
- When a member A receives a message from B, it checks that:
- The message arrived in order from B. That is, the
sequence number in B's entry of the precedence vector
carried in the message is one more than the
number in B's entry of the precedence vector maintained
at A.
- The message does not causally depend on anything that
A hasn't seen. That is, all remaining fields in the
vector carried in the message are no larger than the
corresponding fields in the vector maintained at A.
If both of the above conditions are satisfied, A delivers the message and
updates its precedence vector by incrementing B's entry
in that vector.
The above protocol can be implemented on top of the reliable message delivery
layer described earlier (in the same sense as TCP is implemented above IP).
In such an implementation,
a message declared stable (i.e., committed) by the reliable
multicast layer is forwarded up the protocol stack to the causal multicast
layer that may further delay the delivery of the message to the
application if the message is forwarded out of order.
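The receive-side check of the causal multicast protocol can be sketched as follows. The class name CausalReceiver, the dictionary representation of the precedence vector, and the pending list used to hold back undeliverable messages are assumptions of this example; a reliable underlying layer is taken for granted.

```python
from typing import Dict, List, Tuple

Vector = Dict[str, int]


class CausalReceiver:
    """Holds back messages until the two delivery conditions are met."""

    def __init__(self, members: List[str]) -> None:
        self.vector: Vector = {m: 0 for m in members}
        self.pending: List[Tuple[str, Vector, str]] = []
        self.delivered: List[str] = []

    def _deliverable(self, sender: str, msg_vector: Vector) -> bool:
        # Condition 1: this is the next message from the sender.
        in_order = msg_vector[sender] == self.vector[sender] + 1
        # Condition 2: it depends on nothing this member has not yet delivered.
        no_missing_deps = all(msg_vector[m] <= self.vector[m]
                              for m in self.vector if m != sender)
        return in_order and no_missing_deps

    def receive(self, sender: str, msg_vector: Vector, payload: str) -> None:
        self.pending.append((sender, msg_vector, payload))
        progress = True
        while progress:                    # one delivery may unblock others
            progress = False
            for item in list(self.pending):
                s, v, p = item
                if self._deliverable(s, v):
                    self.pending.remove(item)
                    self.delivered.append(p)
                    self.vector[s] += 1    # increment the sender's entry
                    progress = True
```

For instance, if a reply from A (vector A:1, B:1, C:0) reaches C before B's original message (vector A:0, B:1, C:0), the reply is held back until the original has been delivered.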
Totally Ordered Multicast
In the simplest case, to implement totally ordered multicast, all messages
are sent to a single coordinator. The coordinator assigns sequence
numbers to them and broadcasts them to the group. Everyone in the
group delivers the messages in the order of these sequence numbers. If the
coordinator dies, a new coordinator is elected.
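A minimal sketch of the coordinator scheme is given below. The class names Coordinator and OrderedMember are illustrative, sequence numbers are assumed to start at 0, and coordinator election after a failure is not modeled.

```python
from typing import Dict, List, Tuple


class Coordinator:
    """Assigns one global sequence number to every message it is handed."""

    def __init__(self) -> None:
        self.next_seq = 0

    def order(self, payload: str) -> Tuple[int, str]:
        stamped = (self.next_seq, payload)
        self.next_seq += 1
        return stamped


class OrderedMember:
    """Delivers coordinator-stamped messages strictly by sequence number."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.held: Dict[int, str] = {}     # messages that arrived ahead of their turn
        self.delivered: List[str] = []

    def receive(self, seq: int, payload: str) -> None:
        self.held[seq] = payload
        while self.next_seq in self.held:
            self.delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1
```

Two members that receive the stamped messages in different network orders still deliver them in the same order, because delivery follows the coordinator's numbering rather than arrival time.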
Membership View Changes
To ensure the first two properties of virtual synchrony (i.e.,
that all members who deliver a message agree on their membership view,
and that the message destination list consists precisely of all members
in that view), membership view changes must be
totally ordered with respect to message multicasts. If that is not
true, a message can be delivered to two members with different membership
views as shown below.
The following algorithm is used to achieve total order of membership
changes with respect to multicasts. Let the current
membership view be view i.
- When a process initiates a membership change to install view
i+1, it sends a
flush(i+1) message to all members of the new view. The message
contains the new membership view (or equivalently,
the change from the previous view).
- Upon receiving the flush message for the first time, each
process (a) broadcasts all unstable messages received
from the dead process, (b) stops sending regular multicasts, and
(c) forwards the flush (i+1)
message to each process in the new view.
Note that since channels are FIFO (which is trivial to implement),
the flush message is the last message sent on each channel in
the old view.
- When a process has received a flush (i+1) message from all other
processes, and because channels are FIFO, it
can safely conclude that it has received all messages sent
in the old view. After delivering those messages,
the process installs the new view.
The previous example now looks as follows.
Note that if another process (e.g., P3) fails before view i+1
is installed, the surviving processes will not receive a flush message
from P3 and will not be able to install the new view. Instead, we allow
a new view change message, flush (i+2), to be sent to members of the
latest view, informing them of the demise of P3. Those members can now
give up waiting for P3 and
install view i+1 followed immediately by i+2. This can
be easily generalized to any number of back-to-back failures.
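The per-process handling of flush messages for a single view change can be sketched as below. This is a simplified illustration, not the full protocol: the rebroadcast of unstable messages (step a) and back-to-back view changes are left out, and the class name FlushMember and its fields are hypothetical. The initiator is modeled as handing the flush to itself first.

```python
from typing import List, Set


class FlushMember:
    """Tracks one process's progress through a single view change."""

    def __init__(self, name: str, view_id: int, view: List[str]) -> None:
        self.name = name
        self.view_id = view_id
        self.view = list(view)
        self.sending_enabled = True
        self.pending_view: List[str] = []
        self.flushes_from: Set[str] = set()

    def on_flush(self, sender: str, new_view_id: int,
                 new_view: List[str]) -> List[str]:
        """Handle flush(new_view_id); return the members to forward it to."""
        if new_view_id != self.view_id + 1:
            return []                          # stale or skipped view: ignored in this sketch
        forward_to: List[str] = []
        if not self.pending_view:              # first flush seen for this change
            self.pending_view = list(new_view)
            self.sending_enabled = False       # (b) stop regular multicasts
            forward_to = [m for m in new_view if m != self.name]  # (c) forward the flush
        self.flushes_from.add(sender)
        others = set(self.pending_view) - {self.name}
        if others <= self.flushes_from:        # a flush has arrived on every channel
            self.view_id, self.view = new_view_id, self.pending_view
            self.pending_view, self.flushes_from = [], set()
            self.sending_enabled = True        # resume multicasts in the new view
        return forward_to
```

Because a flush is the last thing sent on each FIFO channel in the old view, a process that has collected a flush from every other member of the new view has necessarily seen every old-view message, which is the condition the last branch checks before installing the view.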
RING-BASED PROTOCOL SIMPLIFICATIONS
Significant simplifications could be made to the implementation
of group communication protocols if members could be assumed to
form a logical ring where only one member can send at a time.
While the one-at-a-time sender assumption is too restrictive in
WANs, it is perfectly
reasonable if all members are on a single LAN. In such a case
members can be numbered in some arbitrary order. A token can
circulate the logical ring. A member that has the token can send
messages. The serialization of senders simplifies protocol design
considerably. In the following we shall demonstrate these simplifications
using RTCAST as a case study. RTCAST is a group communication protocol
that provides reliable, totally ordered delivery of messages, as well
as reliable totally ordered delivery of membership changes.
Implementing Total Order
To implement total order in RTCAST,
each member (i) numbers its own messages
sequentially, and (ii) marks the last message it sends in each token visit using
a special bit in the header. Each member keeps track of the
last message it received from every other member.
Following the receipt of message j from sender i,
a receiver expects one of the following:
- If message j was not marked last, the receiver expects
message j+1 from sender i.
- If message j was marked last, the receiver expects
message k+1 from sender i+1, where k
is the last message received from i+1. Note that sender
i+1 is the successor of sender i on the logical
ring. If the ring has N members, sender N+1
is in fact sender 1, etc.
Hence, the next message to be received is known unambiguously to the receiver.
If the receiver receives a different message instead, a gap is identified.
In such a case the new message is not delivered until the gap is filled.
Hence, messages are delivered to the application in total order.
Note that the algorithm requires that each member send at least one message
each time it gets the token. If no messages are queued up, a dummy message
(marked last) is sent. This message is also specially marked to
be discarded at the receiver.
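The rule for determining the next expected message can be written as a small function. The name next_expected and the last_seen dictionary are assumptions of this sketch; senders are numbered 1 through N as in the text.

```python
from typing import Dict, Tuple


def next_expected(sender: int, seq: int, marked_last: bool,
                  last_seen: Dict[int, int], ring_size: int) -> Tuple[int, int]:
    """Return (sender, sequence number) of the message that must arrive next.

    last_seen[i] is the number of the latest message received from sender i.
    """
    if not marked_last:
        # Same sender continues with its next message.
        return sender, seq + 1
    # The token moves on: sender N wraps around to sender 1.
    successor = sender % ring_size + 1
    return successor, last_seen[successor] + 1
```

Any received message whose (sender, sequence number) pair differs from this result reveals a gap, so its delivery is postponed.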
Implementing Reliable Multicast
In RTCAST, receivers deliver received messages immediately to the
application. No agreement protocol is run to make sure that all
receive the
message and no two phase commit is
invoked. Yet, it is guaranteed that delivery is all-or-nothing
to surviving members. The implementation of this property is very simple.
Having sent all its messages, each sender broadcasts a heartbeat to the group.
When received by the next member on the ring, this heartbeat is interpreted
as the passing of the token. Note that because channels are FIFO, by the time
the token arrives at a member from the immediately preceding one,
the receiving member should have also received the last data message sent
by its predecessor (since it is sent immediately before the token).
In fact, it should have received all messages
up to that last message. Upon receiving the token, each member can thus
check if it has received all messages transmitted in the previous token round.
If a gap exists in the sequence of messages received, the member
fails by crashing. The only exception to that is if the member is missing
a message from someone who was declared dead in the same round.
The death declaration carries the number of the last message communicated
by the dead sender, allowing the other members to check for gaps.
Hence, surviving members are guaranteed to have received
all messages in order. In other words, every message sent is guaranteed to
reach all
surviving members.
Note that virtual synchrony has somewhat stronger semantics than achieved above.
Virtual synchrony requires that a message sent to a group reach all members
of that group. It turns out that RTCAST achieves these semantics as well.
This is because members who missed a message will crash as soon as they
get the token following the gap due to the missing message. Hence, they
crash before they can send
anything that divulges the fact that they didn't get all the messages
communicated in the preceding round.
From the perspective of the other members, it is as if each message
indeed reached all its recipients, immediately followed by the failure
of some of them. The fact that in reality some messages were never
delivered to the failed processors remains entirely concealed.
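The check a member performs on receiving the token can be sketched as follows. The sketch assumes the member knows, for each sender, the first message number it expected in the round and the number of that sender's last message (taken from the message marked last or, for a sender declared dead, from the death declaration). The function names has_gap and survives_round and the tuple layout are hypothetical.

```python
from typing import Dict, Set, Tuple


def has_gap(first_expected: int, last_number: int, received: Set[int]) -> bool:
    """True if any message numbered first_expected..last_number is missing."""
    return any(n not in received
               for n in range(first_expected, last_number + 1))


def survives_round(per_sender: Dict[str, Tuple[int, int, Set[int]]]) -> bool:
    """Return True if the member may keep running, False if it must crash.

    per_sender maps each sender to (first_expected, last_number, received).
    """
    return not any(has_gap(first, last, got)
                   for first, last, got in per_sender.values())
```

A member for which survives_round returns False crashes before sending anything, which is what conceals the missed message from the rest of the group.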
Total Order of Membership Updates
Membership updates are sent as regular data messages to the union of
the two membership views involved. The mechanisms described above
ensure that these updates are delivered in total order reliably to
all members. In RTCAST, a maximum token rotation time is imposed
on the protocol by bounding the amount of time each sender is allowed
to send messages. If a member does not receive the token in time, it
will time out. The member will then check the heartbeats it received
during the last round and eliminate (by sending a membership change message)
all its consecutive predecessors from which heartbeats have not been
received. In this elimination message, the sender also mentions
the number of the latest message received from the failed members.
The member then sends a heartbeat, which serves as a regenerated token to
its successor. The ring operation continues as usual.
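The selection of members to eliminate after a token timeout can be sketched as a backward walk around the ring. The function name silent_predecessors and its arguments are assumptions of this example; sending the membership change message and the regenerated token is left to the caller.

```python
from typing import List, Set


def silent_predecessors(ring: List[str], me: str,
                        heartbeats_seen: Set[str]) -> List[str]:
    """Collect the consecutive predecessors of `me` that sent no heartbeat."""
    idx = ring.index(me)
    silent: List[str] = []
    for step in range(1, len(ring)):
        pred = ring[(idx - step) % len(ring)]   # walk backwards, wrapping around
        if pred in heartbeats_seen:
            break                               # first live predecessor ends the run
        silent.append(pred)
    return silent
```

Only the unbroken run of silent members immediately before the timed-out member is eliminated; a silent member further back, behind a live predecessor, is not this member's to remove.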