Internet-Draft Congestion Control for Distributed AIDC October 2024
Ji, et al. Expires 24 April 2025
Workgroup: Congestion Control Working Group
Internet-Draft: draft-ji-ccwg-distributed-lossless-mechanism-00
Published: October 2024
Intended Status: Standards Track
Expires: 24 April 2025
Authors:
S. Ji (China Telecom)
C. Li (China Telecom)
K. Zhu (Huawei Technologies)

A Congestion Control Mechanism for Distributed AIDC Lossless Networks

Abstract

This document proposes a congestion control mechanism for distributed AI data center (AIDC) lossless networks. The mechanism addresses the degradation of model training performance caused by congestion and packet loss on long-distance links when large models are trained across multiple data centers within a region. This document also describes a practice scenario in which the mechanism has been tested.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 24 April 2025.

1. Introduction

With the rapid development of big data and artificial intelligence (AI) technology, AI solutions represented by large models have penetrated various industries, and the demand for computing power keeps increasing. Training a large model requires a large-scale GPU cluster. However, when a cluster of 10,000 or even 100,000 GPUs is deployed, the computing power that a single intelligent data center (DC) can host is limited by constraints such as insufficient space, power, and heat dissipation in the equipment room. To solve this problem, multiple intelligent DCs within a region can be interconnected into one large virtual intelligent computing cluster, in which the DCs compute collaboratively over a distributed AI data center (AIDC) lossless network (also known as RDMA remote) to meet the demand for high computing power.

However, building a larger-scale intelligent computing cluster from multiple intelligent DCs raises many challenges. For example, RDMA remote generates traffic flows that cross long distances. If congestion occurs on a long-distance link, traditional congestion control mechanisms such as PFC and ECN may become ineffective because congestion feedback takes longer to reach the transmitting node. Traffic keeps arriving in the meantime, the buffers of network devices are exhausted, and packets are eventually lost.

To solve the problems of congestion and packet loss when DCs are interconnected across long distances, this document proposes a congestion control mechanism that alleviates network congestion by shortening the congestion feedback time and adjusting the flow rate of the transmitting node according to the congestion degree.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

1.2. Terminology

The following terms are used in this document:

RDMA remote: Remote Direct Memory Access (RDMA) across DCs, in which multiple intelligent DCs within a region are interconnected into a large virtual intelligent computing cluster and compute collaboratively.

PFC (Priority-based Flow Control): A hop-by-hop, per-priority flow control mechanism that enables multiple types of traffic flows to share an Ethernet link without affecting each other.

ECN (Explicit Congestion Notification): A congestion control mechanism [RFC3168] in which congested network devices mark packets. In the usage described in this document, the receiving node responds to marked packets by sending CNPs to the transmitting node, which then reduces its flow rate, achieving end-to-end congestion management.

CNP: Congestion Notification Packet.

2. Congestion Control Mechanism

2.1. Congestion Control Principle

At present, the most widely used congestion control mechanism in RoCE (RDMA over Converged Ethernet) networks is ECN. When congestion occurs on a network device, the device marks packets with ECN as it forwards them to the receiving node, and the receiving node then sends CNPs to the transmitting node to tell it to reduce its transmitting rate, thus alleviating network congestion. However, in distributed AIDC lossless networks, training large models collaboratively across multiple DCs generates long-distance data transmission. If congestion occurs on a long-distance link, the CNPs generated by the traditional ECN mechanism travel a longer feedback path, so the transmitting node may not reduce its rate in time, which results in packet loss and eventually degrades the training performance of the large models. To meet the lossless requirements of distributed AIDC networks, this document proposes a congestion control mechanism that moves the "congestion point" from the long-distance link to the network device closest to the transmitting node, so that congestion on long-distance links is handled with low latency.
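
The effect of distance on feedback time can be illustrated with a short calculation. The following Python sketch estimates the one-way propagation delay over a 120 km fiber span (the distance used in the test environment of Section 3) and the amount of data a transmitting node emits during one round trip over that span. The propagation speed in fiber (about 2 x 10^8 m/s) is a common approximation, and the 400 Gbit/s port rate is an assumption made only for illustration; neither value comes from this document.

```python
# Illustrative estimate only. The link rate is an ASSUMED value and the
# fiber propagation speed is a common approximation.

FIBER_SPEED_M_PER_S = 2.0e8  # approx. speed of light in optical fiber


def one_way_delay_s(distance_km: float) -> float:
    """Propagation delay over a fiber span of the given length."""
    return distance_km * 1000.0 / FIBER_SPEED_M_PER_S


def bytes_sent_during(rate_bps: float, interval_s: float) -> float:
    """Bytes a sender emits at rate_bps during interval_s seconds."""
    return rate_bps * interval_s / 8.0


distance_km = 120.0      # long-distance span, as in Section 3
link_rate_bps = 400e9    # ASSUMED port rate (hypothetical)

delay_s = one_way_delay_s(distance_km)
round_trip_s = 2.0 * delay_s
in_flight = bytes_sent_during(link_rate_bps, round_trip_s)

print(f"one-way delay: {delay_s * 1e3:.2f} ms")
print(f"data sent during one round trip: {in_flight / 1e6:.0f} MB")
```

Under these assumptions the one-way delay is about 0.6 ms and roughly 60 MB are sent during one round trip over the span, which gives a sense of how much buffer a congested device may need while it waits for the transmitting node to slow down.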

2.2. Congestion Control Process

Figure 1 shows the process of the congestion control mechanism. H1 and H2 are the transmitting node and the receiving node, respectively. R11 is the next-hop device closest to the transmitting node (known as the proximal device), and R12 is a device on the long-distance link. The distance between R11 and R12 is on the order of hundreds of kilometers (120 km in Figure 1).

               1.notification message
               <-------------------
+-------+     +------+  120km +------+     +-------+
|  H1   #-----#  R11 #--------#  R12 #-----#   H2  |
+-------+     +------+        +------+     +-------+
     2.flow-control
     protocol packets
   <--------------

Figure 1: The Process of Congestion Control Mechanism

• First, each device monitors the network state, including the queue accumulation and buffer usage of each port, to determine whether congestion occurs on the link;

• If congestion occurs on the link and the congested device (R12) is not the proximal device (R11) of the transmitting node, R12 sends a notification message to R11. The notification message contains information such as the number of the congested port, its queue depth, and its buffer usage;

• R11 determines the congestion degree of the congested device based on the content of the notification message and calculates the number of CNPs or other flow-control protocol packets that need to be sent. The flow-control protocol packets contain information about the congested traffic flows;

• After receiving the flow-control protocol packets, H1 reduces the transmitting rate of the corresponding congested traffic flows to alleviate the congestion on the network devices.
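
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a specification: this document does not define a message format, a congestion threshold, a formula for the number of CNPs, or a rate-reduction rule. All class names, field names, constants, and the linear and multiplicative rules below are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional

# ASSUMED constants; the document does not specify any of these values.
QUEUE_THRESHOLD_BYTES = 1_000_000  # queue depth treated as congestion
MAX_CNPS = 8                       # cap on CNPs sent per notification


@dataclass
class PortState:
    """Per-port state monitored by every device (step 1)."""
    port: int
    queue_depth: int   # bytes currently queued
    buffer_used: int   # bytes of buffer in use
    buffer_size: int   # total buffer bytes for the port


@dataclass
class Notification:
    """Hypothetical notification sent from R12 to R11 (step 2)."""
    port: int
    queue_depth: int
    buffer_usage: float  # fraction of buffer in use, 0.0 to 1.0


def detect_congestion(state: PortState) -> Optional[Notification]:
    """Congested device: build a notification when the queue is too deep."""
    if state.queue_depth < QUEUE_THRESHOLD_BYTES:
        return None
    usage = state.buffer_used / state.buffer_size
    return Notification(state.port, state.queue_depth, usage)


def cnp_count(note: Notification) -> int:
    """Proximal device: map buffer usage to a CNP count (step 3).

    ASSUMED rule: linear in buffer usage, at least one CNP.
    """
    usage = min(max(note.buffer_usage, 0.0), 1.0)
    return max(1, math.ceil(usage * MAX_CNPS))


def reduced_rate(rate_bps: float, cnps: int, factor: float = 0.9) -> float:
    """Transmitting node: ASSUMED multiplicative decrease per CNP (step 4)."""
    return rate_bps * factor ** cnps


# Example: R12 sees a deep queue with 75% of the buffer in use.
state = PortState(port=3, queue_depth=2_000_000,
                  buffer_used=6_000_000, buffer_size=8_000_000)
note = detect_congestion(state)
if note is not None:
    cnps = cnp_count(note)
    print(cnps, reduced_rate(100e9, cnps))
```

The sketch separates detection at the congested device from the decision at the proximal device, mirroring the split between R12 and R11 in Figure 1. A real implementation would also need to identify which flows traverse the congested port so that the flow-control packets reach the right sender.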

The traffic of large model training is periodic: if a certain flow is congested in the first training period, it can be expected to be congested again in each subsequent period. Therefore, in this design the network devices record information about forwarded packets in flow table entries, including which flows are congested. When a congested flow recurs in a later period, R11 sends CNPs or other flow-control protocol packets directly to H1 based on the learned flow table entries to control the transmitting rate, and the remote congested device (R12) no longer needs to send a notification message. In this way, once the congestion information of the entire network has been obtained in the first training period, the traffic flows can be lossless in the remaining periods.
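
The learning behavior described above can be sketched as a small flow table kept on the proximal device. This is only an illustration: the flow key, the class and method names, and the choice to store a CNP count per flow are assumptions, since this document does not define the contents of a flow table entry.

```python
from typing import Dict, Tuple

# ASSUMED flow key: (source address, destination address, queue pair number).
FlowKey = Tuple[str, str, int]


class ProximalFlowTable:
    """Hypothetical flow table on the proximal device (R11)."""

    def __init__(self) -> None:
        self._learned: Dict[FlowKey, int] = {}

    def record_congestion(self, flow: FlowKey, cnps: int) -> None:
        """First period: remember a flow that was reported as congested."""
        self._learned[flow] = cnps

    def cnps_for(self, flow: FlowKey) -> int:
        """Later periods: CNPs to send without waiting for a notification.

        Returns 0 for flows that were never recorded as congested.
        """
        return self._learned.get(flow, 0)


table = ProximalFlowTable()
flow_a: FlowKey = ("10.0.0.1", "10.1.0.1", 17)
flow_b: FlowKey = ("10.0.0.2", "10.1.0.2", 18)

# First training period: R12 reported congestion affecting flow_a.
table.record_congestion(flow_a, 6)

# Subsequent period: R11 acts on learned entries only.
print(table.cnps_for(flow_a), table.cnps_for(flow_b))
```

In this sketch a lookup replaces the notification exchange over the long-distance link for flows that have already been learned. Entry aging and handling of flows whose congestion pattern changes between periods are left out.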

3. Practice Scenario

Lossless interconnection for distributed AIDC networks has been an active research topic in recent years. The congestion control mechanism proposed in this document has been applied in a test environment on a live network.

Figure 2 and Figure 3 show the test environment, which consists of two AI training clusters with 512 GPUs each. Cluster A and cluster B are 120 km apart, and the spine switches of the two clusters are interconnected through wavelength division multiplexing (WDM) equipment with a capacity of 25.6 Tbit/s so that the clusters can collaboratively train large models with billions of parameters.

             +-------------+                +-------------+
             |    Spine1   |                |    Spine2   |
             +-+---+--+--+-+                +--+---+--+-+-+
              /    |  |  |                     |   |  |  |
             /     |  |  |                     |   |  |  |
            /   +--+--+--+---------------------+   |  +  +
           /   /   |  |  |   +---------------------+ /    \
          /   /    |  |  +---|-----------------+----/----+ \
         /   /     +  +------|----------+          /      \ \
        /   /       \        |          |         /        \ \
+------+---++      +-+-------+-+      +-+-------+-+      +--+-+------+
|   leaf1   |      |   leaf2   |      |   leaf3   | .... |   leaf16  |
+--+----+---+      +--+----+---+      +--+----+---+      +--+----+---+
   |    |             |    |             |    |             |    |
   H1...H4           H5...H8            H9...H12           H61...H64

                     Figure 2: Cluster A

             +-------------+                +-------------+
             |    Spine3   |                |    Spine4   |
             +-+---+--+--+-+                +--+---+--+-+-+
              /    |  |  |                     |   |  |  |
             /     |  |  |                     |   |  |  |
            /   +--+--+--+---------------------+   |  +  +
           /   /   |  |  |   +---------------------+ /    \
          /   /    |  |  +---|-----------------+----/----+ \
         /   /     +  +------|----------+          /      \ \
        /   /       \        |          |         /        \ \
+------+---++      +-+-------+-+      +-+-------+-+      +--+-+------+
|   leaf17  |      |   leaf18  |      |   leaf19  | .... |   leaf32  |
+--+----+---+      +--+----+---+      +--+----+---+      +--+----+---+
   |    |             |    |             |    |             |    |
  H65...H68           H69...H72         H73...H76          H125...H128

                      Figure 3: Cluster B

The experimental results show that, with the same number of GPUs, the training performance of the distributed intelligent DCs reaches over 90% of that of a centralized single intelligent DC, which demonstrates the feasibility of the distributed AIDC lossless network scheme and of the proposed congestion control mechanism.

4. Conclusion

Building distributed AI training clusters across multiple data centers is an important research direction for AIDC lossless networks. The congestion control mechanism proposed in this document addresses congestion and packet loss in long-distance DC interconnection by shortening the congestion feedback time and adjusting the flow rate of the transmitting node according to the congestion degree. It thereby supports the construction of distributed AIDC lossless networks.

5. Security Considerations

There is no additional security risk introduced by this design.

6. IANA Considerations

This document introduces no additional considerations for IANA.

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC3168]
Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, <https://www.rfc-editor.org/info/rfc3168>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.

7.2. Informative References

[I-D.hcl-rtgwg-ai-network-problem]
Huo, P., Chen, G., Lin, C., and Z. Jiang, "Gap Analysis, Problem Statement, and Requirements in AI Networks", Work in Progress, Internet-Draft, draft-hcl-rtgwg-ai-network-problem-01, <https://datatracker.ietf.org/doc/html/draft-hcl-rtgwg-ai-network-problem-01>.
[I-D.he-huang-rtgwg-wan-lossless-framework]
He, T., Huang, H., Zhengxin, H., Wang, N., and T. Zhou, "Framework for Implementing Lossless Techniques in Wide Area Networks", Work in Progress, Internet-Draft, draft-he-huang-rtgwg-wan-lossless-framework-00, <https://datatracker.ietf.org/doc/html/draft-he-huang-rtgwg-wan-lossless-framework-00>.
[I-D.huang-rtgwg-wan-lossless-uc]
Zhengxin, H., He, T., Huang, H., and T. Zhou, "Use Cases and Requirements for Implementing Lossless Techniques in Wide Area Networks", Work in Progress, Internet-Draft, draft-huang-rtgwg-wan-lossless-uc-01, <https://datatracker.ietf.org/doc/html/draft-huang-rtgwg-wan-lossless-uc-01>.

Authors' Addresses

Siwei Ji
China Telecom
Beiqijia Town, Changping District
Beijing, 102209
China

Cong Li
China Telecom
Beiqijia Town, Changping District
Beijing, 102209
China

Keyi Zhu
Huawei Technologies
Huawei Campus, No.156 Beiqing Road
Beijing, 100095
China