next up previous contents
Next: Failure and recovery notification Up: MPLS Multicast Fast Reroute Previous: Overview   Contents


Link failure and recovery detection

To detect link failure or recovery, nodes regularly send small messages called probes on all the links on which they send traffic, and listen for probes on the links from which they receive traffic. In the context of bidirectional mLSPs, nodes actually send and listen for probes on each link they are attached to. Probes are small messages that take a low percentage of the bandwidth of the link on which they are sent. In our implementation (see Chapter 5), we send small UDP messages with a payload of 4 bytes.

A link failure must be detected as early as possible in order to keep the total repair time low. To do so, probes should be sent at a high frequency. However, since detecting a link failure triggers traffic switchover, link failures should not be detected when a link has not actually failed. We call $T_p$ the period used by nodes to send probes. A probe can be detected as missing while a link has not failed in two cases. First, due to delay jitter, a delayed probe may be considered as missing. Second, since probes are sent as UDP messages, a lost probe is not retransmitted. Therefore, a single lost probe should not be interpreted as a consequence of a link failure.

Figure 4.3: Link failure detection mechanism. Node $C$ starts sending probes with period $T_p$ at time $0$, node $D$ starts checking the number of probes it received from $C$ every period $n T_p$ at time $T_2$. Link $CD$ fails at time $T_1$ and $D$ detects the failure at time $T_3$. The failure detection time is $T_{fdetect}=T_3-T_1$.
\includegraphics[width=0.55\textwidth]{figures/probe_distrib_explanation}
Figure 4.4: Link failure detection time probability density function. The probability density function of the difference $T_{fdetect}=T_3-T_1$ is obtained with a convolution of $T_1$ and $T_3$.
\includegraphics[width=0.7\textwidth]{figures/probe_distrib}

We consider that a link has failed only when several probes are missing in sequence. Let the beat checking number $n\ge2$ be the number of probes that must be missing before a node reports a link failure. The failure detection time $T_{fdetect}$ is the time between the instant at which a link fails and the instant at which a node that receives probes via this link considers that the link has failed. We now determine the distribution of the failure detection time. Suppose node $C$ is sending probes to node $D$ as shown in Figure 4.3. At time $t=0$, $C$ sends the last probe before link $CD$ fails. Link $CD$ fails at time $T_1$. Time $T_1$ is uniformly distributed between $0$ and $T_p$. Every period $n T_p$, node $D$ checks whether it received at least one probe from $C$. Since the sender of probes at node $C$ and the receiver of probes at node $D$ are not synchronized, the time $T_2$ at which $D$ checks and records the presence of the last probe sent by $C$ is uniformly distributed between $0$ and $n T_p$. Node $D$ detects the failure at time $T_2+n T_p$ since no probe is received between $t=T_2$ and $t=T_2+n T_p$. Therefore the time $T_3$ at which the failure is detected is uniformly distributed between $n T_p$ and $2n T_p$. The failure detection time $T_{fdetect}$ is given by $T_3-T_1$. The distribution of $T_{fdetect}$ is represented in Figure 4.4.

There is a trade-off between the speed of the failure detection and the accuracy of the detection, i.e., the ability of the nodes to detect failures only when a link has failed. Low values of $n$ may lead to a high number of false failure detection. Since the link failure detection time is comprised between $(n-1) T_p$ and $2n T_p$, high values of $n$ yield long times before link failures are reported.

Figure 4.5: Link repair detection mechanism. Link $CD$ is repaired at time $t=T_1$ and node $C$ sends a probe at time $t=0$. Node $D$ detects the link repair as soon as it receives the probe, thus the repair detection time is $T_{rdetect}=T_1$.
\includegraphics[width=0.55\textwidth]{figures/probe_distrib_explanation_rec}
Figure 4.6: Link repair detection time probability density function. $T_{rdetect}$ is uniformly distributed between $0$ (a probe is sent immediately after the link is repaired) and $T_p$ (a probe is sent time $T_p$ after the link is repaired).
\includegraphics[width=0.7\textwidth]{figures/probe_distrib_rec}

The same probing mechanism can be used to detect the repair of a link. When a link is reported as failed, its two end nodes keep trying sending probes with period $T_p$. When one of these two nodes receives such a probe, then the link is detected as repaired. Different from link failure detection where a single missing probe is not enough for an end node to infer that a failure has occurred, the arrival of the first probe on a link previously reported as failed indicates the recovery. For instance, suppose that node $C$ sends a probe on link $CD$ time $T_1$ after link $CD$ has been repaired as shown in Figure 4.5. Since node $C$ tries to send probes every period $T_p$, $T_1$ is uniformly distributed between $0$ and $T_p$. Node $D$ detects the repair as soon as it receives a probe, thus the recovery detection time $T_{rdetect}$ to detect the repair is equal to $T_1$ and is uniformly distributed between $0$ and $T_p$ (Figure 4.6).

The average value of $T_{fdetect}$ is $\overline{T}_{fdetect}=\frac{3n-1}{2}T_p$ and the average value of $T_{rdetect}$ is $\overline{T}_{rdetect}=\frac{T_p}{2}$. Using millisecond timers on the nodes that perform failure or repair detection, it is possible to detect the failure or the repair of a link in a few milliseconds or tens of milliseconds depending on $n$.


next up previous contents
Next: Failure and recovery notification Up: MPLS Multicast Fast Reroute Previous: Overview   Contents
Yvan Pointurier 2002-08-11