how multiple effects and multiple causes complicate troubleshooting of network performance. For more background, read part 2 to learn how round-trip time and limiting data rate impact network performance.
Multiple effects and multiple causes of degraded network performance
What we see are multiple effects of multiple causes, typical of real networks. It's clear that "How much [limiting] data rate is enough?" is a canard -- there's plenty. Had it always been available to our 1.5 MB transfer, we would have waited only about 8 seconds instead of 35. But, apparently, it's not always available, as the lower-right quadrant's measurements show.
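The 8-second figure can be checked with a few lines of arithmetic. The sketch below assumes that the limiting link is the T1 at its nominal 1.544 Mbit/s, that 1 MB means 10^6 bytes, and that protocol overhead can be ignored; none of these assumptions is stated explicitly in the measurements, and the function name is illustrative only.

```python
# Back-of-envelope check of the "8 seconds instead of 35" claim.
# Assumptions: 1 MB = 10**6 bytes, the limiting link is a T1 at its
# nominal 1.544 Mbit/s, and header overhead is ignored.
TRANSFER_BYTES = 1.5e6
T1_BITS_PER_SEC = 1.544e6
OBSERVED_SECONDS = 35.0


def ideal_transfer_seconds(size_bytes, rate_bps):
    """Time to push size_bytes through a link that is always available."""
    return size_bytes * 8 / rate_bps


ideal = ideal_transfer_seconds(TRANSFER_BYTES, T1_BITS_PER_SEC)
effective_bps = TRANSFER_BYTES * 8 / OBSERVED_SECONDS

print(f"ideal time:     {ideal:.1f} s")
print(f"effective rate: {effective_bps / 1e3:.0f} kbit/s")
print(f"share of T1 actually obtained: {effective_bps / T1_BITS_PER_SEC:.0%}")
```

Under these assumptions the transfer obtained well under a quarter of the link's nominal rate, which is the gap the causes below account for.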
Cause 1: Network path congestion. We're sharing the T1 part of our path with other traffic, and the router/bridge at that point is buffering and delaying our packets, sometimes by more than 100 msec. This delay is highly variable because the competing traffic is itself variable. Clearly, statistical analysis is the only way to estimate performance under these circumstances.
Cause 2: This is a simpler matter -- delayed ACKs in the upper-right quadrant. They cluster just above 100 msec. This is a protocol issue, specific to TCP-stack parameter settings at the receiver, but also interacting with the sender's way of sending TCP packets. It sounds complex, but it's just that TCP is very old and carries quirks from the days of very slow links, when elimination of "unnecessary" packets was thought useful.
TCP, therefore, still allows a receiver to ACK only every other incoming packet (the upper-right quadrant has about half as many dots as the upper-left one). Of course, receipt of packet N doesn't guarantee that N+1 will ever come, so if N were never ACKed, the sender would never know the receiver got all the data. The solution (or kludge) in TCP is to have the receiver start a timer after receiving each packet that it doesn't ACK at once (e.g., every odd-numbered one). If packet N+1 arrives, the timer is stopped and both packets are ACKed together. If the timer expires first, the receiver sends an ACK for the last packet received.
What value should we give that delayed-ACK-timer parameter? Well, we don't want the sender to wait too long for the delayed ACK, because it might time out and go into its retransmission mode, which would really slow things down. The default value of this TCP parameter is usually 100-150 msec and, unfortunately, is often inaccessible to the network technician. End of story? No, this added source of delay can be combated by making sure the sender's transmit window is not set to an odd number of packets -- a parameter that usually is adjustable (more on that later).
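To see why an odd transmit window hurts, consider a deliberately simplified model. It assumes the sender transmits one full window and then waits until every packet in it has been ACKed, and that the receiver ACKs each pair of packets at once but ACKs a leftover unpaired packet only when its delayed-ACK timer expires. Real TCP uses a sliding window, so this is a rough illustration rather than a prediction; the function names and the 1,000-packet transfer are hypothetical.

```python
# Simplified model of the delayed-ACK stall caused by an odd window.
# Assumptions: stop-and-wait per window; the receiver ACKs every second
# packet immediately and ACKs an unpaired packet only on timer expiry.

DELAYED_ACK_MS = 100.0  # low end of the 100-150 msec default quoted above


def window_stall_ms(window_packets, delayed_ack_ms=DELAYED_ACK_MS):
    """Extra wait for the final ACK of one window, in milliseconds."""
    unpaired = window_packets % 2 == 1
    return delayed_ack_ms if unpaired else 0.0


def transfer_stall_ms(total_packets, window_packets,
                      delayed_ack_ms=DELAYED_ACK_MS):
    """Total delayed-ACK stall for a transfer sent window by window."""
    full_windows, remainder = divmod(total_packets, window_packets)
    stall = full_windows * window_stall_ms(window_packets, delayed_ack_ms)
    if remainder:  # a short final window may also end on an unpaired packet
        stall += window_stall_ms(remainder, delayed_ack_ms)
    return stall


for window in (7, 8):
    stall = transfer_stall_ms(1000, window)  # 1000 packets is hypothetical
    print(f"window={window}: {stall / 1000:.1f} s of delayed-ACK stall")
```

In this model, moving from a 7-packet to an 8-packet window removes the stall entirely, because no window ever ends on a packet that must wait for the timer.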
Cause 3 is a lesser effect, also shown in the upper-right quadrant: the spread of ACK delays from 200 microseconds to 100 msec. We know the lower and upper bounds are not network induced, but the spread between them is, again, just congestion, this time on the ACK path. These delays have an effect similar to the strict ones of Cause 2, but because ACKs are usually small packets and, for TCP, sent half as often as data, their effects are noticeable but not too bad.
These causes of individual delays together generate variable performance that can only be modeled and estimated statistically. We see from the data above that the largest delay is due to congestion at a particular router/bridge driving a T1 line. We need more data rate only at that link, and only if the router/bridge can keep up with a faster line. If it can't, then we need a better, or upgraded, router/bridge as well.
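In practice, estimating statistically means summarizing a set of per-packet delay measurements by its distribution rather than by a single number. The sketch below uses the nearest-rank percentile method on a small list of made-up delay samples; the values are hypothetical placeholders, not data from the captures discussed here.

```python
import math

# Hypothetical per-packet delays in msec, standing in for values that
# would be extracted from a packet capture. NOT the article's data.
delays_ms = [2, 3, 3, 5, 8, 12, 20, 45, 110, 130]


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


median = percentile(delays_ms, 50)
p90 = percentile(delays_ms, 90)
worst = percentile(delays_ms, 100)
mean = sum(delays_ms) / len(delays_ms)

print(f"mean={mean:.1f} msec, median={median} msec, "
      f"90th percentile={p90} msec, max={worst} msec")
```

With a skewed distribution like this one, the mean sits far above the median, which is why the tail percentiles say more about what a user will experience than an average does.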
The prior graphs and the one below were generated from real network packet data, via tools (NetCalibrator and NetPredictor) specifically intended for network-performance prediction. Sophisticated tools like these aren't always necessary if the principles described here are understood and packet captures can be made on a path. The graph below shows how the sources of delay in another, shorter, path can be broken out, using just the kind of data we've discussed:
Note that the inevitably statistical distribution of measurements is reflected by the curved tops of the bars assigned to each component in the network path (from Percheron to AFS08). Unfortunately, as we now know, this tool's output display misuses "bandwidth"! Note also that having a network map and component properties (speeds, etc.) allows all of the above information to be estimated from measurements made at one key point, such as an end node. What we want is a breakout of the sources of delay, and therefore of throughput loss, along any path. In the graph above, we see that the server itself is very busy and is the source of most of the delay and variable throughput. In this example, the network path is OK -- it's not the network, and we know it!
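One simple way to begin such a breakout by hand is to compute the fixed, unavoidable part of a packet's delay from the link speeds on the network map and attribute whatever remains of a measured delay to queueing and end-node processing. The sketch below does only that; the link names and speeds, the packet size and the measured delay are all hypothetical, and propagation delay is ignored for simplicity.

```python
# Split a measured one-way delay into a fixed serialization floor and a
# variable remainder (queueing plus end-node processing).
# Assumptions: link speeds come from a network map; propagation delay is
# ignored; every value below is a hypothetical placeholder.

def serialization_delay_ms(packet_bytes, link_bps):
    """Time to clock one packet onto a link, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000


def variable_delay_ms(measured_ms, packet_bytes, link_rates_bps):
    """Measured delay minus the sum of per-link serialization delays."""
    floor_ms = sum(serialization_delay_ms(packet_bytes, bps)
                   for bps in link_rates_bps)
    return measured_ms - floor_ms


links = {"10 Mbit/s LAN": 10e6, "T1": 1.544e6}  # hypothetical path
packet_bytes = 1500
measured_ms = 60.0                              # hypothetical measurement

for name, bps in links.items():
    print(f"{name}: {serialization_delay_ms(packet_bytes, bps):.2f} msec fixed")

extra = variable_delay_ms(measured_ms, packet_bytes, links.values())
print(f"queueing/processing remainder: {extra:.2f} msec")
```

If the remainder dominates and the links themselves are lightly loaded, the end node becomes the prime suspect, which is the conclusion the graph above reaches for the server.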
⇒ Continue reading part 4: Space probes, throughput and error probability.
About the author:
Alexander B. Cannara, PhD, is an electrical engineer, a software and networking consultant, and an educator. He has 18 years of experience in the computer-networking field, including 11 years in managing, developing and delivering technical training. He is experienced in many computer languages and network protocols and is a member of IEEE, the Computer Society, and the AAAS. Alex lives with his wife and son in Menlo Park, California.
This was first published in September 2006