A link failure must be detected as early as possible to keep the total repair time low, which calls for sending probes at a high frequency. However, since detecting a link failure triggers a traffic switchover, a node should not report a failure when the link has not actually failed. We call $T$ the period used by nodes to send probes. A probe can be detected as missing although the link has not failed in two cases. First, due to delay jitter, a delayed probe may be considered missing. Second, since probes are sent as UDP messages, a lost probe is not retransmitted. Therefore, a single missing probe should not be interpreted as the consequence of a link failure.
We consider that a link has failed only when several probes are missing in sequence. Let the beat checking number $b$ be the number of consecutive probes that must be missing before a node reports a link failure. The failure detection time $T_d$ is the time between the instant at which a link fails and the instant at which a node that receives probes via this link considers that the link has failed. We now determine the distribution of the failure detection time. Suppose node $A$ is sending probes with period $T$ to node $B$ as shown in Figure 4.3. At time $t_0$, $A$ sends the last probe before link $AB$ fails. Link $AB$ fails at time $t_f$. Time $t_f$ is uniformly distributed between $t_0$ and $t_0 + T$. Every period $T$, node $B$ checks whether it received at least one probe from $A$. Since the sender of probes at node $A$ and the receiver of probes at node $B$ are not synchronized, the time $t_c$ at which $B$ checks and records the presence of the last probe sent by $A$ is uniformly distributed between $t_0$ and $t_0 + T$. Node $B$ detects the failure at time $t_c + bT$ since no probe is received between $t_c$ and $t_c + bT$. Therefore the time $t_d$ at which the failure is detected is uniformly distributed between $t_0 + bT$ and $t_0 + (b+1)T$. The failure detection time $T_d$ is given by $T_d = t_d - t_f$. The distribution of $T_d$ is represented in Figure 4.4.
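The model above can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis. It writes $T$ for the probe period and $b$ for the beat checking number, takes the sending instant of the last probe as time 0, ignores propagation delay, and assumes that the failure instant and the receiver's check instant are independent. The function and parameter names are hypothetical.

```python
import random

def simulate_failure_detection(period, beat_checks, trials=100_000, seed=0):
    """Sample failure detection times under the model sketched above.

    period      -- probe period T (also the receiver's checking period)
    beat_checks -- beat checking number b
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        # Time 0 is the instant at which the last probe is sent
        # (assumption: propagation delay is ignored).
        t_fail = rng.uniform(0.0, period)    # link fails before the next probe is due
        t_check = rng.uniform(0.0, period)   # check that records the last probe
        t_detect = t_check + beat_checks * period  # b further checks see no probe
        samples.append(t_detect - t_fail)
    return samples

# Assumed example values: T = 10 ms, b = 3.
samples = simulate_failure_detection(period=0.010, beat_checks=3)
mean = sum(samples) / len(samples)
print(f"min={min(samples):.4f}s max={max(samples):.4f}s mean={mean:.4f}s")
```

Under these assumptions every sample lies between $(b-1)T$ and $(b+1)T$, and the sample mean settles close to $bT$.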
There is a trade-off between the speed of failure detection and its accuracy, i.e., the ability of the nodes to detect failures only when a link has actually failed. Low values of the beat checking number $b$ may lead to a high number of false failure detections. Since the link failure detection time lies between $(b-1)T$ and $(b+1)T$, where $T$ is the probe period, high values of $b$ yield long times before link failures are reported.
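To make the trade-off concrete, the sketch below tabulates both sides of it for a few values of the beat checking number $b$. It relies on an assumption that the text does not state: each probe is lost or excessively delayed independently with probability `loss_prob`, so that a given run of $b$ consecutive probes is missing on a healthy link with probability `loss_prob ** b`. The function name and the numeric values are illustrative only.

```python
def detection_tradeoff(period, beat_checks, loss_prob):
    """Return (p_false, t_min, t_max) for one choice of b.

    p_false -- probability that b given consecutive probes are all missing
               on a healthy link (assumption: independent losses)
    t_min   -- lower bound (b - 1) * T of the failure detection time
    t_max   -- upper bound (b + 1) * T of the failure detection time
    """
    p_false = loss_prob ** beat_checks
    t_min = (beat_checks - 1) * period
    t_max = (beat_checks + 1) * period
    return p_false, t_min, t_max

# Assumed example values: T = 10 ms, 1% of probes lost or late.
for b in range(1, 6):
    p_false, t_min, t_max = detection_tradeoff(0.010, b, 0.01)
    print(f"b={b}  p_false={p_false:.0e}  "
          f"detection in [{t_min * 1e3:.0f}, {t_max * 1e3:.0f}] ms")
```

Each increment of $b$ shifts the detection window later by one period while dividing the false-detection probability by `1 / loss_prob` under the independence assumption.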
The same probing mechanism can be used to detect the repair of a link. When a link is reported as failed, its two end nodes keep trying to send probes with period $T$. When one of these two nodes receives such a probe, the link is detected as repaired.
Unlike link failure detection, where a single missing probe is not enough for an end node to infer that a failure has occurred, the arrival of the first probe on a link previously reported as failed indicates the recovery. For instance, suppose that node $A$ sends a probe on link $AB$ a time $t_r$ after link $AB$ has been repaired, as shown in Figure 4.5. Since node $A$ tries to send probes every period $T$, $t_r$ is uniformly distributed between $0$ and $T$. Node $B$ detects the repair as soon as it receives a probe, thus the recovery detection time $T_r$ is equal to $t_r$ and is uniformly distributed between $0$ and $T$ (Figure 4.6).
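The asymmetry between the two rules can be sketched as a small receiver-side state tracker: a failure is reported only after $b$ consecutive checks without a probe, whereas a single probe on a link reported as failed is enough to report the repair. This is an illustrative sketch rather than the implementation used by the nodes. The class and method names are hypothetical, and timers and UDP reception are left to the caller.

```python
class LinkMonitor:
    """Receiver-side link state tracker (illustrative, hypothetical names)."""

    def __init__(self, beat_checks):
        self.beat_checks = beat_checks  # b: empty checks needed to report a failure
        self.probe_seen = False         # was a probe received since the last check?
        self.missed = 0                 # consecutive checks without a probe
        self.link_up = True

    def on_probe(self):
        """Call when a probe arrives on the link."""
        self.probe_seen = True
        self.missed = 0
        if not self.link_up:
            # The first probe on a link reported as failed signals the repair.
            self.link_up = True

    def on_check(self):
        """Call once per period T from the receiver's timer."""
        if self.probe_seen:
            self.missed = 0
        else:
            self.missed += 1
            if self.link_up and self.missed >= self.beat_checks:
                self.link_up = False
        self.probe_seen = False
        return self.link_up


monitor = LinkMonitor(beat_checks=3)
monitor.on_probe()                               # last probe before the failure
monitor.on_check()                               # this check records that probe
states = [monitor.on_check() for _ in range(3)]  # three checks with no probe
print(states)                                    # [True, True, False]
monitor.on_probe()                               # first probe after the repair
print(monitor.link_up)                           # True
```

Keeping the counter in the check routine rather than in the probe handler mirrors the description above, in which the receiver inspects once per period whether at least one probe arrived.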
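The average value of the failure detection time $T_d$ is $bT$ and the average value of the recovery detection time $T_r$ is $T/2$, where $T$ is the probe period and $b$ the beat checking number. Using millisecond timers on the nodes that perform failure or repair detection, it is possible to detect the failure or the repair of a link in a few milliseconds or tens of milliseconds, depending on $T$.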
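The two averages follow from the uniform distributions described earlier. The short derivation below writes $T$ for the probe period, $b$ for the beat checking number, $t_0$ for the sending time of the last probe, $t_f$ and $t_d$ for the failure and detection instants, and $t_r$ for the delay of the first probe after the repair. The numeric values in the comments are assumed for illustration and are not taken from the text.

```latex
\begin{align*}
E[T_d] &= E[t_d] - E[t_f]
        = \Bigl(t_0 + bT + \tfrac{T}{2}\Bigr) - \Bigl(t_0 + \tfrac{T}{2}\Bigr)
        = bT, \\
E[T_r] &= E[t_r] = \tfrac{T}{2}.
\end{align*}
% Example with assumed values T = 5 ms and b = 3:
%   E[T_d] = 3 * 5 ms = 15 ms, with T_d between 10 ms and 20 ms;
%   E[T_r] = 5 ms / 2 = 2.5 ms.
```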