

Experiment 2: Measuring link failure and recovery detection times

Figure 6.4: Experimental setup for determining the link failure and recovery detection times. No traffic flows on the tree. Once the mLSP is established, we successively simulate the failure and recovery of the link between PC2 and PC3 by bringing down and up the interface eth3 of PC2.
\includegraphics[width=\textwidth]{figures/exp_detect_config}
In the second experiment, we determine the link failure and recovery detection times and compare the experimental values with the values predicted by the analytical model presented in Section 4.2. The setup for this experiment is depicted in Figure 6.4. We set up a multicast LSP but do not transmit data over it. The core of the tree is PC2, and the members of the multicast group are PC1 and PC5. On each machine involved in the experiment, we set the beat checking number $n$ to its minimum value $n=2$, as defined in Section 4.2. In the Linux operating system, the most accurate timer has a resolution of 10 ms [15]; we use this resolution for the period $T_p$. The 10 ms timer is accurate when the machine is lightly loaded; however, since Linux is not a real-time operating system, timer accuracy becomes questionable when the system is overloaded [15,45]. In this experiment, the PC routers are lightly loaded and we assume that the timer is accurate. When an interface is disabled, the kernel considers that no link is attached to it; therefore, disabling (bringing down) an interface is equivalent to cutting the link attached to the interface, and re-enabling (bringing up) an interface is equivalent to repairing that link.
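The following is a minimal sketch, not part of MulTreeLDP, of how an interface can be brought down and up from a C program on Linux using the standard SIOCSIFFLAGS ioctl; the interface name eth3 is the one used in this experiment, and the function name and delay are illustrative only.

/* Sketch: simulate a link cut/repair by toggling the IFF_UP flag of eth3. */
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

static int set_link(const char *ifname, int up)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0) { close(fd); return -1; }

    if (up)
        ifr.ifr_flags |= IFF_UP;     /* equivalent to repairing the link */
    else
        ifr.ifr_flags &= ~IFF_UP;    /* equivalent to cutting the link   */

    int ret = ioctl(fd, SIOCSIFFLAGS, &ifr);
    close(fd);
    return ret;
}

int main(void)
{
    set_link("eth3", 0);   /* simulate a link failure  */
    sleep(1);
    set_link("eth3", 1);   /* simulate a link recovery */
    return 0;
}

In the experiment we do not run such a program by hand; an equivalent mechanism is embedded in a thread added to MulTreeLDP, as described below.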


We modify the code of MulTreeLDP on PC2 to measure $T_{fdetect}$ and $T_{rdetect}$. We add a thread to MulTreeLDP that automatically brings eth3 down and up. MulTreeLDP records in a timestamp the instant at which it brings eth3 down. When it detects the link failure, MulTreeLDP computes the difference between the detection time and the instant at which eth3 was brought down; this difference is $T_{fdetect}$. Then, the added thread chooses a random time value using the internal random number generator of the PC and sleeps for that time. When it wakes up, it brings eth3 up and records in a timestamp the instant at which it does so. When the link repair is detected, PC2 computes the difference between the instant of the repair detection and this timestamp; this difference is $T_{rdetect}$. The added thread brings eth3 down and up 100 times in succession and then exits. We record the 100 values of $T_{fdetect}$ and $T_{rdetect}$, stop the MulTreeLDP program, and restart it on all four machines used in this experiment. We collect 25 series of 100 values, that is, 2500 values for $T_{fdetect}$ and 2500 values for $T_{rdetect}$. According to the model presented in Section 4.2, the time to detect a link failure depends on two factors. The first factor, called $T_1$ in Section 4.2, is the length of the time interval between the instant at which PC3 sends the last probe before the failure and the instant at which the failure occurs; we randomize $T_1$ by bringing eth3 down and up at random times. The second factor, called $T_2$ in Section 4.2, is the synchronization offset between the timers of PC2 and PC3; we assume that manually stopping and restarting MulTreeLDP on all machines randomizes $T_2$.
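The structure of the added thread is sketched below; the identifiers and the random delay values are illustrative rather than the actual MulTreeLDP code, and set_link() is the helper sketched above. The timestamps t_down and t_up stand for the values used by the detection code to compute $T_{fdetect}$ and $T_{rdetect}$.

/* Sketch of the measurement thread added to MulTreeLDP on PC2. */
#include <pthread.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

extern int set_link(const char *ifname, int up);   /* see previous sketch */

struct timeval t_down, t_up;   /* read by the failure/repair detection code */

void *failure_generator(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100; i++) {
        gettimeofday(&t_down, NULL);        /* timestamp of the simulated cut    */
        set_link("eth3", 0);                /* bring eth3 down                   */

        usleep(100000 + rand() % 900000);   /* random wait: randomizes T_1       */

        gettimeofday(&t_up, NULL);          /* timestamp of the simulated repair */
        set_link("eth3", 1);                /* bring eth3 up                     */

        usleep(100000 + rand() % 900000);   /* leave time to detect the repair   */
    }
    return NULL;                            /* thread exits after 100 iterations */
}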

In Figure 6.5, we show the distribution of the 2500 samples of $T_{fdetect}$ using 2 ms time intervals, and compare this experimental distribution with the distribution expected from the analytical model of Section 4.2. The average over the 2500 samples is $\overline{T}_{fdetect}$ = 25.4 ms. With $n=2$ and $T_p$ = 10 ms, the theoretical average is $\frac{3n-1}{2}T_p$ = 25 ms. Although the model discussed in Section 4.2 is simple, our experimental results closely match the values it predicts.
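For reference, the theoretical value follows directly from the parameters of this experiment:
\begin{displaymath}
\overline{T}_{fdetect} = \frac{3n-1}{2}\,T_p = \frac{3 \times 2 - 1}{2} \times 10\ \mathrm{ms} = 25\ \mathrm{ms},
\end{displaymath}
which differs from the measured average by only 0.4 ms.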

In Figure 6.6, we show the distribution of the 2500 samples of $T_{rdetect}$ using 1 ms time intervals and compare this distribution with the distribution expected from our model. Here, the experimental results do not match the model well. We expect 10 % of the recovery detection times to fall in each 1 ms interval between 0 and 10 ms and none above 10 ms; however, only 6.1 % of the samples lie between 0 and 1 ms and more than 3 % of the values are higher than 10 ms. In fact, the experimental recovery detection times are not comprised between 0 and 10 ms but between 0.4 and 10.4 ms, as shown in Figure 6.7. The average over the 2500 samples of $T_{rdetect}$ is $\overline{T}_{rdetect}$ = 5.48 ms, which is close to the theoretical average of 5 ms.
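Assuming, as Figures 6.6 and 6.7 suggest, that the model spreads $T_{rdetect}$ uniformly over one probe period, the 0.4 ms offset also accounts for the measured average:
\begin{displaymath}
\overline{T}_{rdetect} \approx \frac{T_p}{2} + 0.4\ \mathrm{ms} = 5\ \mathrm{ms} + 0.4\ \mathrm{ms} = 5.4\ \mathrm{ms},
\end{displaymath}
which is within 0.1 ms of the measured value of 5.48 ms.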

Figure 6.5: Experimental distribution of the link failure detection time. In bold lines, the theoretical distribution for time intervals of 2 ms.
\includegraphics[width=\textwidth]{figures/exp_detect_failure}
Figure 6.6: Experimental distribution of the link recovery detection time. In bold lines, the theoretical distribution for time intervals of 1 ms.
\includegraphics[width=\textwidth]{figures/exp_detect_recovery}
Figure 6.7: Experimental distribution of the link recovery detection time compared with the theoretical distribution shifted by 0.4 ms. The experimental distribution matches the theoretical distribution when we add 0.4 ms to the time intervals of the model.
\includegraphics[width=\textwidth]{figures/exp_detect_recovery_offset}

We conduct additional experiments to assess the behavior of the link failure detection mechanism under high link utilization. We modify the setup so that PC1 sends traffic to PC5 over the multicast LSP; the traffic consists of UDP packets of 8192 bytes. When we set the sending rate to 93 Mbit/s or more, we observe that PC2 and PC3 make false detections, i.e., they detect that the link between PC2 and PC3 successively fails and is repaired several times per second. The PC routers are not fast enough to forward the packets and, at the same time, send probes and check their reception. As discussed earlier, Linux is not a real-time operating system, so there is no guarantee that probes are sent exactly every $T_p$ ms or that probe reception is checked exactly every $n T_p$ ms when the system is heavily loaded. Solutions to this issue include increasing $n$ or $T_p$ (at the cost of higher link failure and recovery detection times), using a real-time operating system, using faster routers, or limiting traffic to a fraction of the maximum throughput achievable with MPLS multicast. In the remaining experiments, we send traffic at lower rates to avoid false detections.
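For perspective, the forwarding load at which false detections appear in this experiment corresponds roughly to
\begin{displaymath}
\frac{93 \times 10^{6}\ \mathrm{bit/s}}{8192 \times 8\ \mathrm{bit/packet}} \approx 1400\ \mathrm{packets/s},
\end{displaymath}
that is, about one packet to forward every 0.7 ms, on top of sending a probe every $T_p$ = 10 ms and checking probe reception every $n T_p$ = 20 ms.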

