Internet-Draft: draft-zhuang-rtgwg-aidc-gse-architecture-00
Workgroup: RTGWG
Published: June 2026
Intended Status: Informational
Expires: 26 December 2026
Authors: R. Zhuang (China Mobile), Z. Zhang, Ed. (ZTE Corporation)

GSE architecture for AIDC

Abstract

This document introduces a Global Scheduling Ethernet (GSE) architecture for data centers used for AI computing. This architecture can minimize the probability of packet forwarding congestion in the network and improve the efficiency of packet interaction.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 26 December 2026.

1. Introduction

The development of Artificial Intelligence (AI) and Machine Learning (ML) has brought about a transformation in data center development. Due to the data-intensive nature of large language model (LLM) computations, AI tasks often generate large amounts of traffic. If the link bandwidth is insufficient, it can lead to packet loss or significant latency. AI computation has very high reliability requirements and extremely low tolerance for packet loss and latency. Network congestion that causes packet loss or excessive latency will significantly impact the computational efficiency of AI tasks.

There are many implementations in the industry to reduce packet loss and latency. This document introduces an implementation architecture called GSE for reference.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2. GSE Architecture

                     +------------+
                     | Controller |                    Control Layer
                     +------------+
--------------------------------------------------------------------

                                                      Network Layer
        +-------+    +-------+           +-------+
        | Spine |    | Spine |  ......   | Spine |     Layer2
        +-------+    +-------+           +-------+

        +-------+    +-------+           +-------+
        | Spine |    | Spine |  ......   | Spine |     Layer1
        +-------+    +-------+           +-------+

    +------+   +------+   +------+   +------+        +------+
    | Leaf |   | Leaf |   | Leaf |   | Leaf | ...... | Leaf |
    +------+   +------+   +------+   +------+        +------+
--------------------------------------------------------------------

                                                   Computation Layer
+--------+  +--------+  +--------+  +--------+        +--------+
| Server |  | Server |  | Server |  | Server | ...... | Server |
+--------+  +--------+  +--------+  +--------+        +--------+
Figure 1

Figure 1 shows a common data center architecture for AI computing, divided into three layers: control layer, network layer, and computation layer.

This document mainly focuses on implementation methods for the network layer. Notably, cross-layer collaboration between the network layer and the computation layer is also required.

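To meet the stringent packet loss and latency requirements of AI computing, the network layer can use the mechanisms described in Section 3: negotiation of egress bandwidth before traffic is sent, and segment-based forwarding that uses sequence numbers and an entropy value.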

Additionally, technologies such as PFC (Priority-based Flow Control, IEEE 802.1Qbb) and ECN (Explicit Congestion Notification, [RFC3168]) can be deployed to further reduce congestion-related packet loss.

3. GSE deployment scenarios

3.1. GSE Scenario 1

           +----------+                        +----------+
           |  Spine1  |                        |  Spine2  |
           +--+-+-+-+-+                        +--+-+-+-+-+
              | | | |                             | | | |
            +-------------------------------------+ | | |
            | | | | |    +--------------------------+ | |
            | | | | |    |                   +--------+ |
            | | | | |    |                   |          |
            | | | | +--------------------------------------+
            | | | +-----------------------+  |          |  |
          +---+ +------+ |                |  |          |  |
          | |          | |                |  |          |  |
     +----+-+--+    +--+-+----+        +--+--+---+    +-+--+----+
     |  Leaf1  |    |  Leaf2  |        |  Leaf3  |    |  Leaf4  |
     +-+-----+-+    +-+-----+-+        +-+-----+-+    +-+-----+-+
       |     |        |     |            |     |        |     |
       | ..  |        | ... |            | ... |        | ... |
       |     |        |     |            |     |        |     |
     +-+-----+-+    +-+-----+-+        +-+-----+-+    +-+-----+-+
     |N1|N2|...|    |N9|N10|..|        |N17|N18|.|    |N25|N26|.|
     +---------+    +---------+        +---------+    +---------+
       Server1        Server2            Server3        Server4
Figure 2

Each server includes GPUs, NICs, and other components. As shown in Figure 2, NIC1 connects to Leaf1, and NIC20 connects to Leaf3. Take the case where GPU1 sends AI computing traffic to GPU20 as an example. Before sending the traffic, NIC1, to which GPU1 is attached, sends an authorization (negotiation) request to NIC20, to which GPU20 is attached. Both the request and the response are encapsulated in a specific message so that switches can identify and forward them. When NIC20 confirms that it can accept the traffic, it returns an authorization response to NIC1. NIC1 begins sending traffic only after it receives this response.

This specific message (a negotiation request or a negotiation response) is generated and sent by hardware such as chips. Its outer addressing can adopt an encapsulation similar to the GSE header described in this document. The negotiation message carries information such as the required bandwidth; its format is not defined in this document. The negotiation mechanism applies in both directions. For example, the same workflow is followed when GPUs on Server2 or Server3 send traffic to GPU1 on Server1.
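
The request/response exchange above can be sketched as follows. This is a minimal illustration, not the message format used by any implementation: the class names, the `required_gbps` field, and the all-or-nothing grant policy are assumptions made for the example, since the negotiation message itself is not defined in this document.

```python
from dataclasses import dataclass


@dataclass
class NegotiationRequest:
    src_nic: str          # requesting NIC, e.g. "NIC1"
    dst_nic: str          # receiving NIC, e.g. "NIC20"
    required_gbps: float  # requested bandwidth (hypothetical field)


@dataclass
class NegotiationResponse:
    granted: bool
    granted_gbps: float


class ReceiverNic:
    """Receiving NIC that tracks how much of its bandwidth is still free."""

    def __init__(self, name: str, capacity_gbps: float) -> None:
        self.name = name
        self.free_gbps = capacity_gbps

    def handle_request(self, req: NegotiationRequest) -> NegotiationResponse:
        # Assumed policy: grant only if the full request fits.
        if req.dst_nic == self.name and req.required_gbps <= self.free_gbps:
            self.free_gbps -= req.required_gbps
            return NegotiationResponse(True, req.required_gbps)
        return NegotiationResponse(False, 0.0)


def try_send(sender: str, receiver: ReceiverNic, required_gbps: float) -> bool:
    """Return True if the sender is authorized and may start sending."""
    req = NegotiationRequest(sender, receiver.name, required_gbps)
    return receiver.handle_request(req).granted


nic20 = ReceiverNic("NIC20", capacity_gbps=400.0)
first = try_send("NIC1", nic20, 300.0)   # fits within the free 400 Gbps
second = try_send("NIC9", nic20, 200.0)  # only 100 Gbps left, so denied
```

In this sketch a denied sender simply does not transmit; a real implementation would also need to release granted bandwidth when a transfer ends, which is omitted here.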

This negotiation mechanism relies on the link between the NIC and the Leaf switch, which can be identified by the GPU/NIC address plus the index of the interface connected to the Leaf switch. A NIC that supports the credit authorization mechanism obtains the port ID from its upstream Leaf switch and uses this identification information when it initiates credit authorization requests and responses. The information exchange between the NIC and the Leaf switch can be implemented via private ARP messages or extensions to protocols such as LLDP; these are not defined in this document. The GPU/NIC address and the associated port information can be advertised by control plane routing protocols, so that Leaf and Spine switches learn them from one another.
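
As an illustration of the identification information described above, the following sketch keeps a table that maps a GPU/NIC address to its Leaf switch and port ID. The `EndpointTable` class, its method names, and the example addresses are hypothetical; how the entries are actually learned (ARP, LLDP extensions, or routing protocol advertisements) is outside the sketch.

```python
from typing import Dict, NamedTuple, Optional


class AttachmentPoint(NamedTuple):
    leaf: str     # Leaf switch the NIC attaches to
    port_id: int  # port ID on that Leaf switch


class EndpointTable:
    """Maps a GPU/NIC address to the link that identifies it."""

    def __init__(self) -> None:
        self._entries: Dict[str, AttachmentPoint] = {}

    def learn(self, nic_addr: str, leaf: str, port_id: int) -> None:
        # Called when an advertisement for this address is received.
        self._entries[nic_addr] = AttachmentPoint(leaf, port_id)

    def lookup(self, nic_addr: str) -> Optional[AttachmentPoint]:
        return self._entries.get(nic_addr)


table = EndpointTable()
table.learn("10.0.0.1", leaf="Leaf1", port_id=1)    # example address for NIC1
table.learn("10.0.0.20", leaf="Leaf3", port_id=4)   # example address for NIC20
target = table.lookup("10.0.0.20")
```

A sender would look up the destination address in such a table to find the port ID to name in its authorization request.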

If the NIC does not support the authorization mechanism, the Leaf switch connected to the NIC can perform this process on its behalf.

3.2. GSE Scenario 2

   +--------------------------------------------------------+
   |                                            ...  PodM   |
   |                                                        |
   |   +----------+  +----------+         +----------+      |
   |   |  Core1   |  |  Core2   |  ...    |  CoreZ   |      |
   |   +--+-------+  ++---------+         +----------+      |
   |      |           |       ......                        |
   |      |      +----+                                     |
   |      |      |                                          |
   |   +--+------++  +----------+         +----------+      |
   +-- |  SSpine1 |  |  SSpine2 |   ...   | SSpineN  | -----+
       +----------+  +----------+         +----------+

           +----------+                     +----------+
           |  Spine1  |                     |  Spine2  |
           +----------+                     +----------+

     +----+-+--+    +--+-+----+        +--+--+---+    +-+--+----+
     |  Leaf1  |    |  Leaf2  |        |  Leaf3  |    |  Leaf4  |
     +-+-----+-+    +-+-----+-+        +-+-----+-+    +-+-----+-+
       |
     +-+-----+-+    +-+-----+-+        +-+-----+-+    +-+-----+-+
     |N1|N2|...|    |N9|N10|..|        |N17|N18|.|    |N25|N26|.|
     +---------+    +---------+        +---------+    +---------+
       Server1        Server2            Server3        Server4
Figure 3

Figure 1 only shows a portion of a single PoD. Typical data centers used for AI computing are much larger and require more connections to other PoDs. Figure 3 provides an example where not all connections are displayed due to the complexity of the wiring. Each PoD's SSpine switch is connected to other PoDs' SSpine switches via a Core switch. It is difficult to ensure that traffic flows from one PoD's Leaf switch to another PoD's Leaf switch without congestion throughout the entire forwarding process. However, within a single PoD, a method similar to that in Scenario 1 can be used, employing GSE to ensure low-latency, congestion-free forwarding of traffic from the Leaf switch to the Core switch.

Before forwarding traffic from N1, Leaf1 sends a negotiation message to the SSpine1 switch, specifying the link between SSpine1 and Core1. Leaf1 begins forwarding traffic from N1 only if the bandwidth between SSpine1 and Core1 is sufficient, i.e., if the negotiation succeeds. If the bandwidth of the link between SSpine1 and Core1 is insufficient, Leaf1 can send another negotiation message to the SSpine1 switch, this time specifying the link between SSpine1 and Core2. Again, Leaf1 begins forwarding traffic from N1 only if the negotiation succeeds. This ensures that, unless a link failure occurs, there will be no packet loss before the traffic reaches the Core switch.
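
The fallback from one SSpine-to-Core link to another can be sketched as below. The `SSpine` class, the per-link free-bandwidth bookkeeping, and the fixed order in which candidate Core switches are tried are assumptions for illustration only.

```python
from typing import Dict, List, Optional


class SSpine:
    """SSpine switch that tracks free bandwidth on each link to a Core switch."""

    def __init__(self, free_gbps_by_core: Dict[str, float]) -> None:
        self.free = dict(free_gbps_by_core)

    def negotiate(self, core: str, required_gbps: float) -> bool:
        # Succeeds only if the named SSpine-to-Core link has enough bandwidth.
        if self.free.get(core, 0.0) >= required_gbps:
            self.free[core] -= required_gbps
            return True
        return False


def select_uplink(sspine: SSpine, cores: List[str],
                  required_gbps: float) -> Optional[str]:
    """Try each candidate Core link in order; return the first one granted."""
    for core in cores:
        if sspine.negotiate(core, required_gbps):
            return core
    return None  # no link granted: the Leaf does not forward yet


sspine1 = SSpine({"Core1": 50.0, "Core2": 400.0})
chosen = select_uplink(sspine1, ["Core1", "Core2"], 100.0)
```

Here the link to Core1 has too little free bandwidth, so the second negotiation selects Core2.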

3.3. GSE summary

The requirements of the two scenarios above are similar: both require the corresponding port information to be carried when routes are announced (routes from the NIC/GPU and routes obtained from other PoDs). Therefore, the port ID information used for authorization can be advertised along with the route; for the advertising method, refer to [I-D.zhang-idr-portid-ec]. Based on this information, certain existing implementations support pre-transmission negotiation to ensure sufficient bandwidth at the egress point before traffic is sent.

After traffic transmission begins, data packets are aggregated into uniform-sized segments, and each packet is given a sequence number based on the segment it belongs to. This ensures that even if a link fails or congestion occurs, data packets that travel over different paths can be reordered by sequence number. While a segment is being transmitted, its packets should use the same path among the ECMP links as much as possible, which reduces the buffering and processing pressure of packet reassembly; an entropy value is therefore needed to keep path selection stable. In some implementations, the source and destination queue identifiers, such as the QP, can be used directly as entropy. Although this mechanism reduces the probability of congestion, network congestion can still occur. In such cases, other idle or lightly loaded ECMP links can be used to transmit segments. At the same time, mechanisms such as PFC and ECN can also be used to adjust the traffic transmission rate, further reducing the probability of congestion.
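
The segment, entropy, and reordering steps described above can be sketched together as follows. The segment size, the CRC32 hash, and the "next non-congested path" fallback are assumptions chosen for the example, not behavior specified by this document.

```python
import zlib
from typing import AbstractSet, List, Tuple

SEGMENT_BYTES = 4096  # hypothetical segment size


def tag_packets(packets: List[bytes],
                seg_bytes: int = SEGMENT_BYTES) -> List[Tuple[int, bytes]]:
    """Group packets into segments of at most seg_bytes and tag each packet
    with the sequence number of the segment it belongs to."""
    tagged, seq, used = [], 0, 0
    for pkt in packets:
        if used and used + len(pkt) > seg_bytes:
            seq += 1   # current segment is full: start the next one
            used = 0
        tagged.append((seq, pkt))
        used += len(pkt)
    return tagged


def pick_path(entropy: bytes, num_paths: int,
              congested: AbstractSet[int] = frozenset()) -> int:
    """Hash the entropy value to a stable ECMP path; move to the next
    non-congested path only when the preferred one is congested."""
    preferred = zlib.crc32(entropy) % num_paths
    for step in range(num_paths):
        candidate = (preferred + step) % num_paths
        if candidate not in congested:
            return candidate
    return preferred  # every path is congested: keep the preferred one


def reorder(received: List[Tuple[int, bytes]]) -> List[bytes]:
    """Restore segment order at the receiver. The sort is stable, so packets
    of one segment keep the order in which they arrived."""
    return [pkt for _, pkt in sorted(received, key=lambda item: item[0])]


packets = [b"a" * 3000, b"b" * 3000, b"c" * 1000, b"d" * 4000]
tagged = tag_packets(packets)
entropy = b"qp17->qp42"  # e.g. source and destination QP identifiers
path = pick_path(entropy, num_paths=4)
alternate = pick_path(entropy, num_paths=4, congested={path})
arrived = [tagged[3], tagged[1], tagged[2], tagged[0]]  # segments out of order
restored = reorder(arrived)
```

The same entropy value always maps to the same path, so all packets of a flow's segment follow one path unless that path is marked congested.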

Such a forwarding mechanism is difficult to implement with traditional IP-based forwarding, so additional field definitions may be required. These fields can be carried in a GSE header encapsulation so that they can be identified and processed.

4. GSE header

 +------------------------------------------------------------------+
 |  Destination | port-ID | Priority | Entropy | Seq | ......
 +------------------------------------------------------------------+
Figure 4

Figure 4 shows an example GSE packet header for reference. The header can be recognized and forwarded by the network layer. One implementation uses a new type of Ethernet encapsulation.
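
For illustration, the fields named in Figure 4 could be packed and parsed as shown below. The field widths (32-bit Destination, 16-bit port-ID, 8-bit Priority, 16-bit Entropy, 32-bit Seq), the network byte order, and the `GseHeader` name are all assumptions; this document does not define the sizes or the fields that follow Seq.

```python
import struct
from typing import NamedTuple

# Assumed layout, network byte order, no padding:
# Destination (4 bytes) | port-ID (2) | Priority (1) | Entropy (2) | Seq (4)
GSE_HEADER = struct.Struct("!IHBHI")


class GseHeader(NamedTuple):
    destination: int
    port_id: int
    priority: int
    entropy: int
    seq: int

    def pack(self) -> bytes:
        """Serialize the header fields into bytes."""
        return GSE_HEADER.pack(*self)

    @classmethod
    def unpack(cls, data: bytes) -> "GseHeader":
        """Parse a header from the first bytes of a frame payload."""
        return cls(*GSE_HEADER.unpack(data[:GSE_HEADER.size]))


hdr = GseHeader(destination=0x0A000014, port_id=4, priority=3,
                entropy=0xBEEF, seq=7)
wire = hdr.pack()
parsed = GseHeader.unpack(wire + b"payload")
```

Fixed-width fields keep parsing simple for hardware; the actual widths would depend on the address format and the sequence number space an implementation chooses.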

5. IANA Considerations

This document includes no request to IANA.

6. Security Considerations

This draft provides an implementation reference. Implementing this scheme will introduce new packet identification and forwarding processes, impacting the implementation of switches and NICs. Inappropriate implementation and deployment may lead to packet forgery attacks.

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.

7.2. Informative References

[I-D.zhang-idr-portid-ec]
Zhang, J., Zhuang, R., Zhang, Z., and D. Yuan, "BGP PORT EC for AIDC", Work in Progress, Internet-Draft, draft-zhang-idr-portid-ec-01, <https://datatracker.ietf.org/doc/html/draft-zhang-idr-portid-ec-01>.

[RFC3168]
Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, <https://www.rfc-editor.org/info/rfc3168>.

Authors' Addresses

Rui Zhuang
China Mobile
China
Zheng Zhang (editor)
ZTE Corporation
China