GSE architecture for AIDC

Introduction

The development of Artificial Intelligence (AI) and Machine Learning (ML) has transformed data center design. Because large language model (LLM) computation is data-intensive, AI tasks often generate large volumes of traffic. If link bandwidth is insufficient, the result is packet loss or significant latency. AI computation has very high reliability requirements and extremely low tolerance for both, so network congestion that causes packet loss or excessive latency significantly reduces the computational efficiency of AI tasks. The industry has many implementations that reduce packet loss and latency. This document describes one such implementation architecture, called GSE, for reference.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in .

GSE Architecture

Figure 1 shows a common data center architecture for AI computing, divided into three layers: control layer, network layer, and computation layer.

The computation layer consists of servers used for AI computing, including GPUs and NICs.
The network layer is illustrated here with a common Clos/Fat Tree topology; other topologies can also be used in practice. A 3-layer Clos topology consists of Leaf switches connected to the servers, plus Layer 1 and Layer 2 Spine switches.
The control layer consists of centralized or distributed controllers.

This document mainly focuses on implementation methods for the network layer. Notably, cross-layer collaboration between the network layer and the computation layer is also required. To meet the stringent packet loss and latency requirements for AI computing, the following implementation mechanisms can be used at the network layer:

Credit-based authorization mechanism: The main idea is to use credit-based authorization to control data transmission and reduce congestion probability. Before packet transmission, the sender initiates an authorization request to the receiver to ensure that the receiver has sufficient bandwidth to receive packets, thereby avoiding packet loss caused by last-hop congestion.
Packet aggregation mechanism: The main idea is to aggregate packets into uniform-sized segments, which is more conducive to packet forwarding and reception control.
Improved ECMP mechanism: This mechanism not only distributes traffic evenly across ECMP links to avoid congestion, but also ensures in-order packet arrival at the destination, thereby reducing the buffering and processing overhead at the receiver.
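
As a rough illustration of the packet aggregation mechanism listed above, the following Python sketch packs variable-length packets into uniform-sized segments and recovers them again. The segment size, the 2-byte length prefix, the zero padding, and the function names `aggregate` and `disaggregate` are all assumptions made for this example; this document does not define a segment format.

```python
import struct

SEGMENT_SIZE = 16  # bytes; hypothetical value chosen for the example


def aggregate(packets, segment_size=SEGMENT_SIZE):
    """Pack non-empty packets into uniform-sized segments.

    Assumption: each packet is prefixed with a 2-byte length so that packet
    boundaries survive segmentation, and the last segment is zero-padded.
    """
    stream = b"".join(struct.pack("!H", len(p)) + p for p in packets)
    stream += b"\x00" * (-len(stream) % segment_size)
    return [stream[i:i + segment_size]
            for i in range(0, len(stream), segment_size)]


def disaggregate(segments):
    """Recover the original packets from a complete, ordered segment list."""
    stream = b"".join(segments)
    packets, offset = [], 0
    while offset + 2 <= len(stream):
        (length,) = struct.unpack_from("!H", stream, offset)
        if length == 0:  # a zero length marks the start of the padding
            break
        offset += 2
        packets.append(stream[offset:offset + length])
        offset += length
    return packets


if __name__ == "__main__":
    segments = aggregate([b"hello", b"gse", b"x" * 20])
    print(len(segments), [len(s) for s in segments])
    print(disaggregate(segments))
```

Because every segment has the same size, a forwarding or receiving device can reason about buffer occupancy in units of segments rather than in variable-length packets.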

Additionally, technologies such as PFC (Priority-based Flow Control, IEEE 802.1Qbb) and ECN (Explicit Congestion Notification, ) are also deployed to further reduce congestion-related packet loss.

GSE deployment scenarios

GSE Scenario 1

Each server includes GPUs, NICs, and other components. As shown in Figure 2, NIC1 connects to Leaf1, and NIC20 connects to Leaf3. Take the case where GPU1 sends AI computing traffic to GPU20: before sending the traffic, NIC1, to which GPU1 is attached, initiates an authorization request to NIC20, to which GPU20 is attached. Both the request and the response are encapsulated in a specific message so that switches can identify and forward them. When NIC20 confirms that it can receive the traffic, it sends a negotiation response granting authorization to NIC1. NIC1 begins sending traffic only after it receives this response.

This specific message (covering both the negotiation request and the negotiation response) is generated and sent by hardware such as chips. Its outer addressing can adopt an encapsulation similar to the GSE header defined in this document. The negotiation message carries information such as the required bandwidth; its format is not defined in this draft. The negotiation mechanism applies in both directions: for example, the same workflow is followed when GPUs on Server2 or Server3 send traffic to GPU1 on Server1.

The negotiation mechanism relies on the link between the NIC and the Leaf switch, which can be identified by the address of the GPU/NIC plus the index of the interface connected to the Leaf switch. A NIC that supports the credit authorization mechanism obtains the port ID from its upstream Leaf switch and initiates credit authorization requests and responses based on this identification information. Information exchange between the NIC and the Leaf switch can be implemented via private ARP messages or extensions such as LLDP, which are not defined in this document. The GPU/NIC address and the associated port information can be advertised via control plane routing protocols and learned through interactions between the Leaf and Spine switches.

If the NIC does not support the authorization mechanism, the Leaf switch connected to the NIC can perform this process instead.
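
The request/response workflow of Scenario 1 can be sketched in Python as follows. This is a minimal model under stated assumptions, not the negotiation protocol itself: the class names `AuthRequest`, `ReceiverNic`, and `SenderNic`, the single `bandwidth` field in arbitrary units, and the `release` step are hypothetical, and the port ID and message encapsulation are omitted because this document does not define them.

```python
from dataclasses import dataclass, field


@dataclass
class AuthRequest:
    """Hypothetical negotiation request; the real format is not defined here."""
    src: str
    dst: str
    bandwidth: int  # requested bandwidth, arbitrary units


@dataclass
class ReceiverNic:
    name: str
    available: int  # receive bandwidth that has not been granted yet
    received: list = field(default_factory=list)

    def authorize(self, req: AuthRequest) -> bool:
        """Grant the request only if enough receive bandwidth remains."""
        if req.dst != self.name or req.bandwidth > self.available:
            return False
        self.available -= req.bandwidth
        return True

    def release(self, bandwidth: int) -> None:
        """Return granted bandwidth once a transfer has completed."""
        self.available += bandwidth


@dataclass
class SenderNic:
    name: str

    def send(self, receiver: ReceiverNic, payload: bytes, bandwidth: int) -> bool:
        """Transmit only after the receiver has authorized the transfer."""
        req = AuthRequest(self.name, receiver.name, bandwidth)
        if not receiver.authorize(req):
            return False  # no authorization, so nothing is sent
        receiver.received.append(payload)
        return True


if __name__ == "__main__":
    nic1, nic20 = SenderNic("NIC1"), ReceiverNic("NIC20", available=100)
    print(nic1.send(nic20, b"gradients-0", bandwidth=60))
    print(nic1.send(nic20, b"gradients-1", bandwidth=60))
```

In this model a second request that exceeds the remaining receive bandwidth is refused, so the sender holds its traffic instead of congesting the last hop.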

GSE Scenario 2

Figure 1 shows only a portion of a single PoD. Typical data centers used for AI computing are much larger and require more connections to other PoDs. Figure 3 provides an example; not all connections are displayed because of the complexity of the wiring. Each PoD's SSpine switches are connected to other PoDs' SSpine switches via Core switches. It is difficult to ensure that traffic flows from a Leaf switch in one PoD to a Leaf switch in another PoD without congestion along the entire forwarding path. Within a single PoD, however, a method similar to that in Scenario 1 can be used, employing GSE to ensure low-latency, congestion-free forwarding of traffic from the Leaf switch to the Core switch.

Before forwarding traffic from N1, Leaf1 sends a negotiation message to SSpine1 that specifies the link between SSpine1 and Core1. Leaf1 begins forwarding traffic from N1 only if the bandwidth between SSpine1 and Core1 is sufficient, i.e., if the negotiation succeeds. If the bandwidth of that link is insufficient, Leaf1 can send another negotiation message to SSpine1 that specifies the link between SSpine1 and Core2. Again, Leaf1 begins forwarding traffic from N1 only if the negotiation succeeds. This ensures that, unless a link failure occurs, no packets are lost before the traffic reaches the Core switch.
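
The fallback behavior described for Leaf1 can be illustrated with a short Python sketch. The function name `negotiate_uplink`, the dictionary of per-link available bandwidth, and the bandwidth units are assumptions for illustration only; how SSpine1 actually tracks and reserves bandwidth is not specified in this document.

```python
def negotiate_uplink(available, candidates, demand):
    """Try candidate (SSpine, Core) links in order and reserve the first fit.

    `available` maps a link to its remaining bandwidth (arbitrary units).
    Returns the link on which the negotiation succeeded, or None if every
    candidate lacks sufficient bandwidth, in which case the Leaf switch
    does not start forwarding.
    """
    for link in candidates:
        if available.get(link, 0) >= demand:
            available[link] -= demand  # reserve bandwidth for this flow
            return link
    return None


if __name__ == "__main__":
    available = {("SSpine1", "Core1"): 10, ("SSpine1", "Core2"): 50}
    candidates = [("SSpine1", "Core1"), ("SSpine1", "Core2")]
    # Core1 lacks bandwidth for this demand, so the second request succeeds.
    print(negotiate_uplink(available, candidates, demand=40))
```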

GSE summary

The requirements of the two scenarios above are similar: both need the corresponding port information to be carried when routes are announced (routes from the NIC/GPU and routes obtained from other PoDs). The port ID information used for authorization can therefore be advertised along with the route; for the advertising method, refer to . Based on this information, certain existing implementations support pre-transmission negotiation to ensure sufficient bandwidth at the egress point before traffic is sent.

After traffic transmission begins, data packets are aggregated into uniform-sized segments, and sequence numbers are added to the packets based on the segments they belong to. Even if a link fails or congestion occurs, data packets that traverse different paths can then be reordered by sequence number. To keep a segment on the same ECMP path as far as possible, which reduces the buffering and processing pressure of packet reassembly, an entropy value is needed to keep path selection stable. In some implementations, the source and destination queue identifiers, such as QP, can be used directly as entropy.

Although this mechanism reduces the probability of congestion, network congestion can still occur. In such cases, other idle or lightly loaded ECMP links can be used to transmit segments. Mechanisms such as PFC and ECN can also be used to adjust the traffic transmission rate, further reducing the probability of congestion.

Such a forwarding mechanism is difficult to implement with traditional IP-based forwarding, so additional field definitions may be required. These fields can be identified and processed through GSE header encapsulation.
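
The interaction between entropy-based path selection and sequence-number reordering can be sketched in Python as follows. The CRC32 hash, the 0.8 load threshold, the 4-byte queue identifiers, and the names `qp_entropy`, `select_path`, and `Reorderer` are assumptions made for this example; the sketch also numbers whole segments rather than individual packets to stay short.

```python
import zlib


def qp_entropy(src_qp: int, dst_qp: int) -> bytes:
    """Build an entropy value from source and destination queue identifiers."""
    return src_qp.to_bytes(4, "big") + dst_qp.to_bytes(4, "big")


def select_path(entropy: bytes, loads, threshold=0.8):
    """Pick an ECMP link index for a segment.

    The same entropy always maps to the same preferred link. If that link
    is loaded at or above the threshold, fall back to the least loaded link.
    """
    preferred = zlib.crc32(entropy) % len(loads)
    if loads[preferred] < threshold:
        return preferred
    return min(range(len(loads)), key=loads.__getitem__)


class Reorderer:
    """Deliver segments in sequence order even if they arrive out of order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}

    def receive(self, seq: int, segment: bytes):
        """Buffer one segment and return those now deliverable in order."""
        self.pending[seq] = segment
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


if __name__ == "__main__":
    loads = [0.1, 0.2, 0.3, 0.4]  # utilization of four ECMP links
    print(select_path(qp_entropy(17, 42), loads))
    rx = Reorderer()
    print(rx.receive(1, b"second"))  # held back until segment 0 arrives
    print(rx.receive(0, b"first"))
```

A stable preferred path keeps most segments in order, so the reorder buffer is exercised mainly when congestion or a link failure forces a switch to another link.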

GSE header

Figure 4 shows an example GSE packet header for reference. It can be recognized and forwarded by the network layer; one implementation uses a new type of Ethernet encapsulation.

Destination: The IP address or address value of the GPU/NIC or switch.
port-ID: The port ID used for authorization.
Priority: The traffic transmission priority, similar to DSCP.
Entropy: The value used for traffic load balancing.
Seq: The sequence number of the data packet, used for reassembly in case of out-of-order delivery.
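
To make the field list concrete, the following Python sketch encodes and decodes a header carrying these five fields. The field widths and order (a 4-byte IPv4 destination, 2-byte port ID, 1-byte priority, 2-byte entropy, 4-byte sequence number) and the class name `GseHeader` are purely hypothetical; this document does not specify a wire layout, and a real encoding may differ in every respect.

```python
import ipaddress
import struct
from dataclasses import dataclass

# Hypothetical layout in network byte order:
# destination (4 bytes), port ID (2), priority (1), entropy (2), seq (4).
_FORMAT = "!4sHBHI"
HEADER_LEN = struct.calcsize(_FORMAT)


@dataclass(frozen=True)
class GseHeader:
    destination: str  # assumed here to be an IPv4 address
    port_id: int
    priority: int
    entropy: int
    seq: int

    def pack(self) -> bytes:
        """Serialize the header using the assumed layout."""
        dst = ipaddress.IPv4Address(self.destination).packed
        return struct.pack(_FORMAT, dst, self.port_id, self.priority,
                           self.entropy, self.seq)

    @classmethod
    def unpack(cls, data: bytes) -> "GseHeader":
        """Parse a header from the first HEADER_LEN bytes of `data`."""
        dst, port_id, priority, entropy, seq = struct.unpack(
            _FORMAT, data[:HEADER_LEN])
        return cls(str(ipaddress.IPv4Address(dst)), port_id, priority,
                   entropy, seq)


if __name__ == "__main__":
    header = GseHeader("10.0.0.20", port_id=7, priority=3,
                       entropy=0xBEEF, seq=42)
    raw = header.pack()
    print(len(raw), GseHeader.unpack(raw) == header)
```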

IANA Considerations

This document includes no request to IANA.

Security Considerations

This draft provides an implementation reference. Implementing this scheme introduces new packet identification and forwarding processes, which affects the implementation of switches and NICs. Improper implementation or deployment may expose the network to packet forgery attacks.