Internet-Draft cca_analysis October 2024
Lai, et al. Expires 24 April 2025
Workgroup: Congestion Control Working Group
Internet-Draft: draft-lai-ccwg-lsncc-00
Published: October 2024
Intended Status: Informational
Expires: 24 April 2025
Authors:
Z. Lai
Tsinghua University
Z. Li
Tsinghua University
Q. Wu
Tsinghua University
H. Li
Tsinghua University
Q. Zhang
Zhongguancun Laboratory

Analysis for the Adverse Effects of LEO Mobility on Internet Congestion Control

Abstract

This document provides a performance analysis of various congestion control algorithms (CCAs) in an operational low-earth-orbit (LEO) satellite network (LSN). Internet CCAs are expected to perform well on any Internet path, including paths with LEO satellite links. Our analysis reveals that existing CCAs struggle to deal with the drastic network variations caused by the mobility of LEO satellites, resulting in poor link utilization or high latency. Further, this document discusses the key challenges of achieving high throughput and low latency for end-to-end connections over an LSN, and provides information that may be useful when LSN-specific congestion control principles for the Internet are standardized in the future.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 24 April 2025.

1. Introduction

Low-earth-orbit (LEO) satellite networks (LSNs) have evolved rapidly in recent years, thanks to the fast deployment of mega-constellations such as SpaceX's Starlink, Eutelsat OneWeb and Amazon's Project Kuiper. Emerging LSNs aim to provide broadband coverage and low-latency services globally, and are carrying an increasing amount of Internet traffic. For example, Starlink, the largest operational LSN today, has attracted more than 4 million customers worldwide on 7 continents as of September 2024.

One key property differentiating LSNs from existing terrestrial networks is that a portion of the network infrastructure is moving at a high velocity relative to the earth's surface. Hence, for a network path through an LSN, a subset of its intermediate links changes continuously over time, imposing substantial challenges on existing Internet congestion control algorithms (CCAs).

On the one hand, the unique LEO mobility not only results in rapidly varying satellite link capacity, but also causes frequent non-congestion RTT variations (e.g. due to path fluctuations) and bursty packet losses (e.g. due to ground-satellite handovers). On the other hand, existing end-to-end CCAs use the network performance changes observed at the sender to infer congestion, but they cannot tell whether a given variation is actually caused by a congestion event (e.g. queuing at the bottleneck link) or not. As will be shown in this document, the frequent non-congestion network variations caused by LEO mobility often mislead existing CCAs such as TCP Cubic [RFC9438], Vegas [vegas_cc] and Copa [copa_cc], and as a result these CCAs needlessly restrain their own performance.

This document argues that CCAs in LSNs require more effective indicators that help the sender identify non-congestion performance changes and estimate network conditions more accurately. On further investigation, we find that LEO mobility is managed and scheduled in LSNs by a special feature called "LEO reconfiguration". This feature is closely related to the non-congestion network variations (e.g. link capacity and propagation RTT changes due to LEO mobility), and is thus a potential indicator for identifying them.

The aim of this document is to describe the new challenges introduced by the characteristics of emerging LSNs, and to provide suggestions based on insights obtained from an operational LSN. We hope this document will be a useful source when new LSN-specific congestion control principles for the Internet need to be standardized in the future.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Characteristics of LEO Satellite Networks

                        Outgoing Satellite  Incoming Satellite
 high-speed movement        +-----------+   +-----------+   +-----------+
     relative to   <--------| Satellite |---| Satellite |---| Satellite |---
the earth surface(~7.5km/s) +-----+-----+   +------+----+   +-----+-----+
                                      |  \  /  |
             frequent space-ground    |   \/   |
             handovers due to LEO     |   /\   |
              satellite mobility      |  /  \  |
                                      | /    \ |
 +----+----+    +---+---+    +----+----+    +---+---+    +--------+--------+
 |  User   |--->|  Home |--->|Satellite|    |Ground |--->|Point-of Presence|
 | Terminal|<---| Router|<---| Terminal|    |Station|<---|   and Internet  |
 +---------+    +---+---+    +----+----+    +---+---+    +--------+--------+
Figure 1: Emerging LSN architecture.

2.1. LSN Architecture

Figure 1 shows a typical architecture of emerging LSNs. Overall, the LSN can be divided into a space segment, which consists of a considerable number of LEO satellites, and a terrestrial segment, which contains many geo-distributed ground stations around the world. On the user side, to access the LSN, a user needs to purchase and deploy a dedicated satellite terminal (i.e. a dish), together with a home router which connects to the user's terminal (e.g. a smartphone or a laptop) via a WiFi or Ethernet interface. On the ground station side, the LSN exchanges traffic with the terrestrial Internet through a set of geographically distributed gateways behind ground stations. When the LSN provides Internet services for terrestrial users, user traffic is forwarded to LEO satellites via the satellite terminal, then to a ground station, and finally to the gateway and the terrestrial Internet (and vice versa). If the user is close to an available ground station, satellites can use the well-known "bent-pipe" routing mechanism to transparently forward user traffic to the corresponding ground station. Otherwise, for users in remote areas far away from available ground stations, the LSN can exploit inter-satellite links (ISLs) to route user traffic to a ground station.

2.2. Low Orbit Altitude, Low Latency and High Throughput

Low latency in LSNs results from their proximity to the earth, which minimizes the ground-to-satellite distance that data packets must travel. This allows for quicker response times, which are essential for applications like gaming and video conferencing. High throughput is achieved through the use of high-speed Ka-/Ku-band spectrum and of multiple satellites working together in a mesh network, enabling large amounts of data to be transmitted simultaneously. This combination of low latency and high throughput positions LEO networks as a competitive alternative to traditional broadband options, especially in remote areas.

2.3. LEO Mobility

Unlike traditional geostationary (GEO) satellite networks, one key property of emerging LSNs is that a portion of the network infrastructure, i.e. the LEO satellite routers, is moving at a high velocity relative to the earth. This LEO mobility can cause a series of network instability issues, such as handovers (i.e. ground nodes have to frequently disconnect from the outgoing satellite and connect to a new incoming satellite), channel quality variations and link rate adaptation, and path fluctuations, all of which can affect the performance of end-to-end connections.

3. Impacts of LEO Mobility on Internet Congestion Control

3.1. Principles of Today's Internet Congestion Control

Internet congestion control is vital for maintaining network performance and stability. Typically, congestion control uses feedback loops to manage data flow. The sender adjusts its sending rate based on time-varying network conditions, ensuring efficient use of available bandwidth without overwhelming the network. In particular, congestion control algorithms (CCAs) detect network congestion by monitoring the performance changes observed at the sender, based on indicators such as packet loss and increased latency. Based on the principles used for congestion detection and rate adaptation, existing CCAs can generally be classified into the following categories.

3.1.1. Loss-based CCAs

Loss-based CCAs, such as TCP Reno [RFC5681] and Cubic [RFC9438], treat packet loss as the primary signal of network congestion. Loss is typically detected through acknowledgments (ACKs) from the receiver: if packets are not acknowledged within a certain timeframe, the sender infers congestion. Upon detecting packet loss, loss-based CCAs reduce the congestion window size, which limits the amount of unacknowledged data in transit and thus decreases the sending rate.
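
As a minimal sketch of the loss-based reaction described above, the following Python fragment models a Reno-style additive-increase/multiplicative-decrease window. The function names, the per-RTT granularity and the two-segment floor are simplifying assumptions made for illustration; this is not the complete algorithm specified in [RFC5681].

```python
def aimd_step(cwnd: float, loss_detected: bool, mss: float = 1.0) -> float:
    """One per-RTT update of a simplified Reno-like congestion window (in MSS)."""
    if loss_detected:
        # Multiplicative decrease: any loss is interpreted as congestion,
        # whatever its real cause (e.g. a ground-satellite handover).
        return max(cwnd / 2.0, 2.0 * mss)
    # Additive increase: grow by one segment per round trip.
    return cwnd + mss


def run_aimd(loss_pattern, cwnd: float = 10.0):
    """Apply aimd_step over a sequence of per-RTT loss flags (hypothetical helper)."""
    trace = []
    for lost in loss_pattern:
        cwnd = aimd_step(cwnd, lost)
        trace.append(cwnd)
    return trace
```

For example, run_aimd([False, False, True, False]) grows the window for two round trips and then halves it on the single loss, regardless of whether that loss was caused by queue overflow.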

3.1.2. Delay-based CCAs

Delay-based CCAs such as Vegas [vegas_cc] and Copa [copa_cc] focus on monitoring changes in network delay as a signal of congestion, rather than relying solely on packet loss. In delay-based CCAs, the sender continuously measures round-trip time (RTT) to detect increases in latency, which can indicate congestion before packet loss occurs. When increased delay is detected, the sender adjusts its transmission rate, often by decreasing the congestion window, to alleviate potential congestion. Delay-based CCAs aim to react to congestion signals proactively, adjusting data rates to prevent packet loss rather than responding to it after the fact.
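
The delay-based reaction can be sketched in the style of Vegas as follows. This is an illustrative simplification: the function name, the per-RTT update of one packet and the thresholds alpha and beta are assumptions made here, not values taken from this document.

```python
def vegas_step(cwnd: float, base_rtt: float, rtt: float,
               alpha: float = 2.0, beta: float = 4.0) -> float:
    """One per-RTT update of a simplified Vegas-like window (in packets).

    base_rtt is the smallest RTT seen so far and rtt is the latest sample,
    both in seconds. alpha and beta are assumed thresholds in packets.
    """
    expected = cwnd / base_rtt               # rate if nothing were queued
    actual = cwnd / rtt                      # rate actually achieved
    backlog = (expected - actual) * base_rtt  # estimated packets in the queue
    if backlog < alpha:
        return cwnd + 1                      # path looks idle: speed up
    if backlog > beta:
        return cwnd - 1                      # queue seems to build: slow down
    return cwnd
```

Note that the sender only sees rtt. If the latest sample is larger than base_rtt because the path became longer, the computed backlog is positive even though no packet is queued, and the window is reduced.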

3.1.3. Model-based CCAs

Model-based CCAs employ mathematical and statistical models to predict network behavior and optimize data transmission. They often begin with modeling the network's dynamics, including bandwidth, delay, and packet loss, to understand how traffic interacts with these factors. Continuous data collection on network performance helps update the model, allowing for dynamic adjustments based on current conditions. Instead of reacting solely to congestion signals like packet loss, model-based control predicts congestion based on trends in delay and other metrics, allowing for proactive rate adjustments. For example, Google's BBR [cardwell2016bbr] models the bottleneck bandwidth and round-trip time to estimate the optimal sending rate. At runtime, BBR continuously monitors the conditions of the network path, adjusting its model based on real-time measurements of bandwidth and delay. By keeping the sending rate close to the estimated bandwidth without causing excessive queuing delay, BBR aims to maximize throughput while minimizing latency.

3.1.4. Learning-based CCAs

Recent learning-based CCAs such as PCC-VIVACE [dong2018pcc] leverage machine learning techniques to optimize data transmission by predicting and adapting to network conditions. Instead of relying solely on predefined rules, learning-based CCAs analyze historical network data to identify patterns and make informed decisions. Features such as packet loss, throughput, and round-trip time (RTT) are extracted from network measurements and serve as input for machine learning models. The trained model can adjust the sending rate in real time based on current network conditions, continuously learning and refining its predictions as it receives new data.

3.2. LEO Mobility Breaks the Fundamental Assumptions of Congestion Control

However, the unique characteristics of LEO networks introduced in Section 2 bring significant challenges to existing CCAs. Although the low orbital altitude and high-speed satellite links bring low latency and high bandwidth to LSNs, LEO mobility also causes very frequent network variations: because LEO satellites are constantly moving, ground-to-satellite handovers and link rate adaptations never stop. For example, through measurements on Starlink, currently the largest operational LSN, we find that on average Starlink performs quite well: at a vantage point in Europe, the average uplink/downlink capacity reaches about 30/300Mbps while the average RTT is about 27ms. However, due to LEO mobility, end-to-end connections through Starlink suffer from drastic network variations over time. The maximum network capacity can fluctuate between 10Mbps and 65Mbps within three minutes, and the RTT also changes drastically over time. Even at data rates below the network capacity, end-to-end flows can experience unpredictable bursts of packet loss. These features of LSNs can break the fundamental assumptions of existing CCAs. Packet loss may be caused not by congestion but by LEO satellite handovers. A delay increase observed by the sender may be due to a change of the end-to-end path rather than queuing at the bottleneck link. Delay, bandwidth, and packet loss rate all change unpredictably over time, and it is difficult to learn deterministic rules from these changes.

4. A Case Study: Congestion Control Performance in Starlink

To quantitatively understand the performance of different CCAs in real-world operational LSNs, we conduct a performance study of several kinds of representative CCAs on an operational LSN. Specifically, we evaluate: (i) loss-based CCAs, Reno [RFC5681] and Cubic [RFC9438], which use packet loss as the signal for adjusting the sending rate; (ii) delay-based CCAs, Vegas [vegas_cc] and Copa [copa_cc], which use measured delay to estimate network congestion and adjust the sending rate; (iii) model-based CCAs, Google BBRv1/v3 [I-D.ietf-ccwg-bbr], which frequently measure the bottleneck bandwidth and minimum RTT to model the bandwidth-delay product (BDP) of the path, and regulate the sending rate accordingly; and (iv) a learning-based CCA, PCC-VIVACE [dong2018pcc], which adapts itself to various conditions based on a utility function without manual tuning. We describe our case-by-case observations and the corresponding analysis as follows.

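+============+============+==========+=================+=================+
| Algorithm  | Average    | Average  | 90th-percentile | 95th-percentile |
|            | Throughput | RTT (ms) | RTT (ms)        | RTT (ms)        |
|            | (Mbps)     |          |                 |                 |
+============+============+==========+=================+=================+
| Reno       | 10.89      | 26.81    | 30.08           | 31.89           |
| Cubic      | 10.56      | 27.27    | 30.90           | 32.77           |
| Vegas      | 4.53       | 28.32    | 31.77           | 33.31           |
| Copa       | 6.85       | 39.87    | 43.46           | 44.71           |
| BBRv1      | 22.79      | 47.90    | 73.02           | 89.79           |
| BBRv3      | 16.52      | 26.13    | 35.35           | 48.29           |
| PCC-VIVACE | 17.15      | 97.08    | 171.33          | 207.26          |
+------------+------------+----------+-----------------+-----------------+
Table 1: Throughput and RTT of the evaluated CCAs over Starlink.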

Reno and Cubic. End-to-end connections experience non-congestion packet losses over the unstable LEO satellite links. It is a well-known limitation that Cubic and Reno cannot distinguish such non-congestion packet losses from congestion losses. As a result, Reno and Cubic mistakenly conclude that the network is congested and shrink their congestion window whenever non-congestion packet losses occur, which unnecessarily limits their throughput.

Vegas and Copa. Delay-based CCAs assume that an increase in the RTT observed by the sender reflects queuing at the bottleneck link. However, delay-based CCAs can be seriously misled in LSNs because they cannot tell whether an observed RTT change is caused by congestive queuing or by path fluctuations due to LEO mobility. Specifically, Vegas detects congestion through increasing RTTs, and we observe that Vegas is frequently misled by non-congestion RTT increases in LSNs, resulting in severe throughput degradation. Similarly, Copa is a recent delay-based CCA that converges on a target sending rate 1/(sigma * Dq), where Dq is the measured queuing delay and sigma is a constant. Copa adjusts the congestion window in the direction of this target rate, and estimates the queuing delay as Dq = RTTstanding - RTTmin, where RTTstanding is the smallest RTT observed over a recent short time window and RTTmin is the smallest RTT observed over a long period of time (e.g. 10 seconds). When the environmental RTT suddenly increases to a new level, Copa's RTTstanding estimate is updated quickly, but its RTTmin estimate takes a long time to converge to the correct value. Therefore, as the environmental RTT fluctuates frequently and drastically, Copa frequently underestimates RTTmin and thus overestimates Dq. As a result, Copa mistakenly infers that the network is congested and limits its sending rate, which leads to low link utilization and throughput.
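
The RTTmin lag described above can be reproduced with a small simulation. The sketch below is a simplified model written for this discussion, not Copa's implementation: the class and function names, the 50 ms RTTstanding window, the 10 ms sampling interval and sigma = 0.5 are all assumptions, and the RTT trace (a step from 25 ms to 40 ms with no queuing at all) is synthetic.

```python
from collections import deque


class CopaDelayEstimator:
    """Tracks RTTstanding and RTTmin from timestamped RTT samples (all in ms)."""

    def __init__(self, short_window_ms=50, long_window_ms=10_000):
        self.short_window_ms = short_window_ms  # assumed RTTstanding window
        self.long_window_ms = long_window_ms    # RTTmin window (10 seconds)
        self.samples = deque()                  # (timestamp_ms, rtt_ms)

    def on_rtt_sample(self, now_ms, rtt_ms):
        self.samples.append((now_ms, rtt_ms))
        # Forget samples that have aged out of the long window.
        while self.samples[0][0] < now_ms - self.long_window_ms:
            self.samples.popleft()

    def queuing_delay(self, now_ms):
        """Dq = RTTstanding - RTTmin, as the sender would compute it."""
        rtt_min = min(rtt for _, rtt in self.samples)
        rtt_standing = min(
            rtt for ts, rtt in self.samples
            if ts >= now_ms - self.short_window_ms
        )
        return rtt_standing - rtt_min


def target_rate(dq_ms, sigma=0.5):
    """Target rate 1/(sigma * Dq) in packets per second; unbounded if Dq is 0."""
    if dq_ms <= 0:
        return float("inf")
    return 1.0 / (sigma * dq_ms / 1000.0)


def simulate_path_change():
    """Propagation RTT steps from 25 ms to 40 ms at t = 5 s, with no queuing."""
    est = CopaDelayEstimator()
    dq_trace = {}
    for now_ms in range(0, 15_010, 10):  # one RTT sample every 10 ms
        rtt_ms = 25.0 if now_ms < 5_000 else 40.0
        est.on_rtt_sample(now_ms, rtt_ms)
        dq_trace[now_ms] = est.queuing_delay(now_ms)
    return dq_trace
```

In this synthetic trace the estimator reports Dq = 15 ms from shortly after the path change until the last 25 ms sample leaves the 10-second window, although no queue ever builds up. During that interval the target rate is bounded by target_rate(15.0) instead of being unconstrained.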

BBR. BBR frequently probes the network for its propagation RTT (pRTT) and bottleneck bandwidth (bBW), and then adjusts its sending rate to match the bandwidth-delay product (BDP). We identify several issues in different versions of BBR.

BBRv1 experiences bBW overestimation and pRTT underestimation under the drastic network variations caused by LEO mobility. First, BBRv1 estimates bBW as the maximum delivery rate (deliveryRate) over a 10-RTT window. When the link capacity fluctuates drastically, such a maximum filter overestimates bBW after every capacity drop. Note that BBRv1's sending rate is set to the estimated bBW multiplied by a factor called pacing_gain, and the data in flight is capped by cwnd = 2 * BDP. When the link capacity fluctuates, because bBW is overestimated, BBRv1 overshoots the link capacity until the data in flight reaches 2 * BDP, resulting in high queuing delay, especially when the link capacity slumps significantly. Second, BBRv1 estimates pRTT as the minimum RTT observed over a 10-second window. Thus, when the RTT increases due to LEO mobility rather than congestion, BBRv1 underestimates pRTT. However, because bBW is overestimated most of the time while pRTT is underestimated much less often, we observe in our experiments that in most cases the BDP is still overestimated.
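
The effect of the maximum filter can be illustrated with a short sketch. It is a simplified model under stated assumptions, not BBRv1 code: the function names are hypothetical, one delivery-rate sample per round trip is assumed, and the capacity trace (65 Mbps dropping to 10 Mbps) and the 27 ms pRTT are borrowed from the measurements quoted earlier purely as example inputs.

```python
from collections import deque


def windowed_max_bw(delivery_rates_mbps, window_rounds=10):
    """bBW estimate per round: max delivery rate over the last window_rounds."""
    recent = deque(maxlen=window_rounds)  # old samples fall out automatically
    estimates = []
    for rate in delivery_rates_mbps:
        recent.append(rate)
        estimates.append(max(recent))
    return estimates


def inflight_cap_bytes(bbw_mbps, prtt_ms, cwnd_gain=2.0):
    """cwnd = cwnd_gain * BDP, with BDP = bBW * pRTT (result in bytes)."""
    bdp_bytes = bbw_mbps * 1e6 / 8.0 * prtt_ms / 1e3
    return cwnd_gain * bdp_bytes


# Synthetic trace: the link delivers 65 Mbps for 5 rounds, then only 10 Mbps.
rates = [65.0] * 5 + [10.0] * 12
bbw = windowed_max_bw(rates)
```

In this trace the estimate stays at 65 Mbps for nine rounds after the capacity has dropped to 10 Mbps, so during that time the in-flight cap computed by inflight_cap_bytes is 6.5 times larger than the one the link can actually sustain, and the excess sits in the bottleneck queue.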

BBRv3 makes several modifications to BBRv1. One important change is that BBRv3 estimates bBW as the minimum of two new parameters, bw_high and bw_low. Specifically, bw_high is the maximum delivery rate over a short window, while bw_low is set to an extremely high value when there is no packet loss, but is set to max(latest_deliveryRate, 0.7 * bw_high) when packet loss is detected. In other words, BBRv3 reduces the sending rate in case of packet loss, on the grounds that packet loss may indicate congestion. However, in our experiments we observe that, due to random packet losses on LEO satellite links, BBRv3 avoids overshooting the link but is less resilient to non-congestion loss than BBRv1. As a result, BBRv3 only achieves about 60-70% link utilization under lossy Starlink conditions.
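
The rule described above can be written down directly. The sketch below only restates the description in this section; the function name and argument names are hypothetical, and real BBRv3 implementations maintain these bounds with additional state that is omitted here.

```python
import math


def bbrv3_bw_estimate(bw_high, latest_delivery_rate, loss_detected, beta=0.7):
    """bBW = min(bw_high, bw_low), following the description in the text.

    bw_low is effectively unbounded without loss; after a loss it becomes
    max(latest_delivery_rate, beta * bw_high).
    """
    bw_low = math.inf
    if loss_detected:
        bw_low = max(latest_delivery_rate, beta * bw_high)
    return min(bw_high, bw_low)
```

For instance, with bw_high = 50 Mbps and a latest delivery rate of 20 Mbps, the estimate is 50 Mbps without loss and 35 Mbps once a loss is observed, whether or not that loss was caused by congestion.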

PCC-VIVACE. Recent CCAs like VIVACE learn from the observed network conditions through a utility function, and estimate a proper sending rate accordingly. Specifically, VIVACE's utility is calculated in each measurement interval from a sending rate contribution, a latency penalty (based on the RTT gradient) and a loss penalty. We observe two performance issues in VIVACE. First, non-congestion RTT increases caused by LEO mobility inflate the latency penalty and result in inaccurate utility estimation. Second, VIVACE uses a dynamic change boundary omega to limit each rate change to a certain range. The original intention of omega is to avoid drastic rate changes that overshoot the link capacity, but such a boundary also leads to slow rate convergence in Starlink, where the link capacity changes rapidly. As a result, VIVACE: (i) under-utilizes the network when the link capacity drastically increases or when the propagation RTT suddenly increases due to LEO mobility; and (ii) overshoots the network when the link capacity drastically decreases, causing high queuing delay.
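
To make the first issue concrete, the following sketch evaluates a utility function with the three terms named above. The functional form, the exponent t and the coefficients b and c are assumptions chosen for illustration and should be checked against [dong2018pcc]; the function name is hypothetical.

```python
def vivace_utility(rate, rtt_gradient, loss_rate, t=0.9, b=900.0, c=11.35):
    """Illustrative utility: rate reward minus latency and loss penalties.

    rate is the sending rate in the interval, rtt_gradient is d(RTT)/dT
    (seconds of RTT change per second), and loss_rate is the fraction of
    packets lost. t, b and c are assumed constants.
    """
    rate_reward = rate ** t
    latency_penalty = b * rate * rtt_gradient
    loss_penalty = c * rate * loss_rate
    return rate_reward - latency_penalty - loss_penalty
```

With these assumed constants, an interval in which the RTT rises by 10 ms per second (rtt_gradient = 0.01) yields a much lower utility than an otherwise identical interval with a flat RTT. The sender cannot tell whether that rise came from queuing or from a longer path, so it is steered toward a lower rate in both cases.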

The fundamental challenge. Our analysis shows that it is quite challenging for existing CCAs to detect network congestion promptly and accurately in an LSN with drastic, multi-dimensional network variations induced by LEO mobility. Essentially, every CCA relies on certain network models and assumptions, from which it infers network conditions and whether congestion occurs. However, these fundamental assumptions become inaccurate in emerging LSNs. Link capacity, RTT and loss rate can change frequently and drastically in LSNs, mixing congestion and non-congestion variations, and existing CCAs are easily misled by the non-congestion signals. Since the fundamental challenge is that end-to-end CCAs can hardly tell whether an observed performance change is caused by congestion or not, we argue that CCAs in LSNs require effective indicators that help end hosts identify non-congestion performance changes.

5. Potential Mitigations

5.1. Explicit notifications for network variance discrimination

Our analysis shows that the main reason for the performance degradation of existing CCAs in the LSN environment is that a CCA running on the end system cannot tell whether an observed performance change is caused by network congestion or by other factors. One possible way to improve CCA performance is for the network to explicitly notify the sender of packet losses caused by satellite handovers or of delay increases caused by path changes. This can prevent the CCA on the sender side from misjudging changes in network conditions.
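
A sender-side reaction to such a notification might look like the sketch below. Everything in it is hypothetical: this document does not define a notification format, so the LossEvent type, its handover_notified flag and the on_loss function are placeholders that only illustrate the intended behavior.

```python
from dataclasses import dataclass


@dataclass
class LossEvent:
    lost_packets: int
    # Hypothetical flag: set when the network has told the sender that
    # this loss coincided with a satellite handover.
    handover_notified: bool


def on_loss(cwnd: float, event: LossEvent, min_cwnd: float = 2.0) -> float:
    """Return the new congestion window (in packets) after a loss event."""
    if event.handover_notified:
        # Non-congestion loss: retransmit, but keep the window unchanged.
        return cwnd
    # No notification: fall back to a conventional multiplicative decrease.
    return max(cwnd / 2.0, min_cwnd)
```

The design choice shown here is conservative: without a notification the sender behaves exactly like a conventional loss-based CCA, so a missing or delayed notification does not make it more aggressive than today.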

5.2. Cross-layer optimization

In conventional networks such as WiFi and cellular networks, in which the link capacity may change rapidly over time, one classic optimization is to use the underlying channel information directly to estimate bandwidth changes more accurately and promptly, thereby improving the performance of end-to-end congestion control. For example, ExLL [park2018exll] is a congestion control algorithm for LTE networks that exploits cellular bandwidth inference to achieve low latency. CQIC [lu2015cqic], CLAW [xie2017accelerating] and piStream [xie2015pistream] use PHY-layer information to accurately and promptly estimate the bandwidth of traditional cellular and WiFi networks. PropRate [leong2017tcp] adjusts the sending rate by directly monitoring the bottleneck buffer size in cellular networks. PBE-CC [xie2020pbe] leverages PHY-layer measurements to react precisely to capacity variations. Similar cross-layer optimizations might also become possible if satellite network operators expose sufficient programmable interfaces to system and application developers.

5.3. Multipath enhancement

Multipath transmission can also significantly enhance CCAs in LSNs by improving reliability, throughput, and resilience. By using multiple LSN paths, or an LSN path and a cellular path, simultaneously, end hosts can aggregate bandwidth, leading to higher overall data rates and improved performance for users. If one path encounters issues (e.g. packet loss), data can still be transmitted via the alternative paths, enhancing overall reliability.

6. Conclusion

In this document, we provide a performance analysis of various CCAs in an operational LEO satellite network (LSN). Our analysis reveals that existing CCAs struggle to deal with the drastic network variations caused by the mobility of LEO satellites, resulting in poor link utilization or high latency. Further, this document discusses the key challenges of achieving high throughput and low latency for end-to-end connections over an LSN, and provides information that may be useful when LSN-specific congestion control principles for the Internet are standardized in the future.

7. IANA Considerations

This document includes no request to IANA.

8. Security Considerations

This document does not represent a change to any aspect of the TCP/IP protocol suite and therefore does not directly impact Internet security.

9. References

9.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.
[RFC9293]
Eddy, W., Ed., "Transmission Control Protocol (TCP)", STD 7, RFC 9293, DOI 10.17487/RFC9293, <https://www.rfc-editor.org/rfc/rfc9293>.
[RFC5681]
Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, DOI 10.17487/RFC5681, <https://www.rfc-editor.org/rfc/rfc5681>.
[RFC9438]
Xu, L., Ha, S., Rhee, I., Goel, V., and L. Eggert, Ed., "CUBIC for Fast and Long-Distance Networks", RFC 9438, DOI 10.17487/RFC9438, <https://www.rfc-editor.org/rfc/rfc9438>.

9.2. Informative References

[I-D.ietf-ccwg-bbr]
Cardwell, N., Swett, I., and J. Beshay, "BBR Congestion Control", Work in Progress, Internet-Draft, draft-ietf-ccwg-bbr-00, <https://datatracker.ietf.org/doc/html/draft-ietf-ccwg-bbr-00>.
[copa_cc]
Arun, V. and H. Balakrishnan, "Copa: Practical Delay-Based Congestion Control for the Internet".
[vegas_cc]
Brakmo, L. S., O'Malley, S. W., and L. L. Peterson, "TCP Vegas: New techniques for congestion detection and avoidance".
[cardwell2016bbr]
Cardwell, N., Cheng, Y., Gunn, C. S., Yeganeh, S. H., and V. Jacobson, "BBR: Congestion-based congestion control: Measuring bottleneck bandwidth and round-trip propagation time".
[dong2018pcc]
Dong, M., Meng, T., Zarchy, D., Arslan, E., Gilad, Y., Godfrey, B., and M. Schapira, "PCC Vivace: Online-Learning Congestion Control".
[park2018exll]
Park, S., Lee, J., Kim, J., Lee, J., Ha, S., and K. Lee, "ExLL: An extremely low-latency congestion control for mobile cellular networks".
[lu2015cqic]
Lu, F., Du, H., Jain, A., Voelker, G. M., Snoeren, A. C., and A. Terzis, "CQIC: Revisiting cross-layer congestion control for cellular networks".
[xie2017accelerating]
Xie, X., Zhang, X., and S. Zhu, "Accelerating mobile web loading using cellular link information".
[xie2015pistream]
Xie, X., Zhang, X., Kumar, S., and L. E. Li, "piStream: Physical layer informed adaptive video streaming over LTE".
[leong2017tcp]
Leong, W. K., Wang, Z., and B. Leong, "TCP congestion control beyond bandwidth-delay product for mobile cellular networks".
[xie2020pbe]
Xie, Y., Yi, F., and K. Jamieson, "PBE-CC: Congestion control via endpoint-centric, physical-layer bandwidth measurements".

Acknowledgements

Contributors

Thanks to all of the contributors.

Authors' Addresses

Zeqi Lai
Tsinghua University
30 ShuangQing Ave
Beijing
100089
China
Zonglun Li
Tsinghua University
30 ShuangQing Ave
Beijing
100089
China
Qian Wu
Tsinghua University
30 ShuangQing Ave
Beijing
100089
China
Hewu Li
Tsinghua University
30 ShuangQing Ave
Beijing
100089
China
Qi Zhang
Zhongguancun Laboratory
Beijing
China