<?xml version='1.0' encoding='utf-8'?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<?xml-model href="rfc7991bis.rnc"?>  <!-- Required for schema validation and schema-aware editing -->
<!-- <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?> -->
<!-- This third-party XSLT can be enabled for direct transformations in XML processors, including most browsers -->

<rfc
      xmlns:xi="http://www.w3.org/2001/XInclude"
      category="info"
      docName="draft-zhuang-rtgwg-aidc-gse-architecture-00"
      ipr="trust200902"
      obsoletes=""
      updates=""
      submissionType="IETF"
      xml:lang="en"
      tocInclude="true"
      tocDepth="4"
      symRefs="true"
      sortRefs="true"
      version="3">
  <!-- xml2rfc v2v3 conversion 2.38.1 -->
  <!-- category values: std, bcp, info, exp, and historic
    ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
       or pre5378Trust200902
    you can add the attributes updates="NNNN" and obsoletes="NNNN" 
    they will automatically be output with "(if approved)" -->

 <!-- ***** FRONT MATTER ***** -->

 <front>
    <!-- The abbreviated title is used in the page header - it is only necessary if the 
        full title is longer than 39 characters -->

   <title abbrev="AIDC GSE Architecture">GSE architecture for AIDC</title>
    <seriesInfo name="Internet-Draft" value="draft-zhuang-rtgwg-aidc-gse-architecture-00"/>
    <!-- add 'role="editor"' below for the editors if appropriate -->

   <!-- Another author who claims to be an editor -->
   	<author fullname="Rui Zhuang" initials="R" surname="Zhuang">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>zhuangruiyjy@chinamobile.com</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
  
   <author fullname="Zheng Zhang" initials="Z" surname="Zhang" role="editor">
      <organization>ZTE Corporation</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>zhang.zheng@zte.com.cn</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
	
    <date year="2026"/>
    <!-- If the month and year are both specified and are the current ones, xml2rfc will fill 
        in the current day for you. If only the current year is specified, xml2rfc will fill 
     in the current day and month for you. If the year is not the current one, it is 
     necessary to specify at least a month (xml2rfc assumes day="1" if not specified for the 
     purpose of calculating the expiry date).  With drafts it is normally sufficient to 
     specify just the year. -->

   <!-- Meta-data Declarations -->

   <area>Routing</area>
    <workgroup>RTGWG</workgroup>
    <!-- WG name at the upperleft corner of the doc,
        IETF is fine for individual submissions.  
     If this element is not present, the default is "Network Working Group",
        which is used by the RFC Editor as a nod to the history of the IETF. -->

   <keyword>AIDC GSE Architecture</keyword>
    <!-- Keywords will be incorporated into HTML output
        files in a meta tag but they have no effect on text or nroff
        output. If you submit your draft to the RFC Editor, the
        keywords will be used for the search engine. -->

   <abstract>
      <t>This document introduces a Global Scheduling Ethernet (GSE) architecture for AI data centers (AIDCs), that is, data centers used for AI computing.
	  The architecture aims to reduce the probability of congestion during packet forwarding
	  and to improve the efficiency of packet exchange between endpoints.</t>
    </abstract>
  </front>
  <middle>
    <section numbered="true" toc="default">
      <name>Introduction</name>
      <t>The development of Artificial Intelligence (AI) and Machine Learning (ML) is
	  transforming data center design.
	  Because large language model (LLM) computations are data-intensive,
	  AI tasks often generate large amounts of traffic.
	  If the link bandwidth is insufficient, packets can be lost or significantly delayed.
	  AI computation has very high reliability requirements and an extremely low tolerance for packet loss and latency,
	  so network congestion that causes packet loss or excessive latency significantly reduces the computational efficiency of AI tasks.</t>
	  
	  <t>There are many implementations in the industry to reduce packet loss and latency. 
	  This document introduces an implementation architecture called GSE for reference.</t>
	  
      <section numbered="true" toc="default">
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
       "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
       document are to be interpreted as described in <xref target="RFC2119" format="default"/>.</t>
      </section>
    </section>
	
	<section numbered="true" toc="default">
      <name>GSE Architecture</name>
	    <figure anchor="Fig1">
          <artwork align="left" name="Figure 1" type="" alt=""><![CDATA[
                     +------------+
                     | Controller |                    Control Layer
                     +------------+
--------------------------------------------------------------------

                                                      Network Layer 
        +-------+    +-------+           +-------+
        | Spine |    | Spine |  ......   | Spine |     Layer2
        +-------+    +-------+           +-------+

        +-------+    +-------+           +-------+
        | Spine |    | Spine |  ......   | Spine |     Layer1
        +-------+    +-------+           +-------+

    +------+   +------+   +------+   +------+        +------+
    | Leaf |   | Leaf |   | Leaf |   | Leaf | ...... | Leaf |
    +------+   +------+   +------+   +------+        +------+
--------------------------------------------------------------------

                                                   Computation Layer
+--------+  +--------+  +--------+  +--------+        +--------+
| Server |  | Server |  | Server |  | Server | ...... | Server | 
+--------+  +--------+  +--------+  +--------+        +--------+
           ]]></artwork>
        </figure>
	  
	  
        <t>Figure 1 shows a common data center architecture for AI computing, 
		divided into three layers: control layer, network layer, and computation layer.</t>
		
		<ul spacing="normal">
        <li>The computation layer consists of the servers used for AI computing, each of which includes GPUs and NICs.</li>
        <li>The network layer is illustrated with a common Clos/Fat-Tree topology; other topologies can also be used in practice.
		In a 3-layer Clos topology, it consists of Leaf switches, which connect to the servers, and Layer 1 and Layer 2 Spine switches.</li>
		<li>The control layer consists of centralized or distributed controllers.</li>
      </ul>
		
        <t>This document mainly focuses on implementation methods for the network layer. 
		Notably, cross-layer collaboration between the network layer and the computation layer is also required.</t>
		
		<t>To meet the stringent packet loss and latency requirements for AI computing, 
		the following implementation mechanisms can be used at the network layer:</t>
		
		<ul spacing="normal">
        <li>Credit-based authorization mechanism: The main idea is to use 
		credit-based authorization to control data transmission and reduce congestion probability. 
		Before packet transmission, the sender initiates 
		an authorization request to the receiver to ensure that the receiver has sufficient bandwidth to receive packets, 
		thereby avoiding packet loss caused by last-hop congestion.</li>
		
        <li>Packet aggregation mechanism: The main idea is to aggregate packets into uniform-sized segments, 
		which is more conducive to packet forwarding and reception control.</li>
		
		<li>Improved ECMP mechanism: This mechanism not only distributes traffic evenly across ECMP links to avoid
        congestion, but also ensures in-order packet arrival at the destination, 
        thereby reducing the buffering and processing overhead at the receiver.</li>
      </ul>
		
      <t>Additionally, technologies such as PFC (Priority-based Flow Control,
        IEEE 802.1Qbb) and ECN (Explicit Congestion Notification, <xref target="RFC3168" format="default"/>)
        can also be deployed to further reduce congestion-related packet loss.</t>
    </section>

    <section numbered="true" toc="default">
	  <name>GSE Deployment Scenarios</name>
      <section numbered="true" toc="default">
	    <name>GSE Scenario 1</name>
	  	  <figure anchor="Fig2">
            <artwork align="left" name="Figure 2" type="" alt=""><![CDATA[
           +----------+                        +----------+
           |  Spine1  |                        |  Spine2  |
           +--+-+-+-+-+                        +--+-+-+-+-+
              | | | |                             | | | |
            +-------------------------------------+ | | |
            | | | | |    +--------------------------+ | |
            | | | | |    |                   +--------+ |
            | | | | |    |                   |          |
            | | | | +--------------------------------------+
            | | | +-----------------------+  |          |  |
          +---+ +------+ |                |  |          |  |
          | |          | |                |  |          |  |
     +----+-+--+    +--+-+----+        +--+--+---+    +-+--+----+
     |  Leaf1  |    |  Leaf2  |        |  Leaf3  |    |  Leaf4  |
     +-+-----+-+    +-+-----+-+        +-+-----+-+    +-+-----+-+
       |     |        |     |            |     |        |     |     
       | ..  |        | ... |            | ... |        | ... |
       |     |        |     |            |     |        |     |
     +-+-----+-+    +-+-----+-+        +-+-----+-+    +-+-----+-+
     |N1|N2|...|    |N9|N10|..|        |N17|N18|.|    |N25|N26|.|
     +---------+    +---------+        +---------+    +---------+
       Server1        Server2            Server3        Server4
           ]]></artwork>
          </figure>
	  
        <t>
		Each server includes GPUs, NICs, and other components. As shown in Figure 2, where NICs are labeled N1, N2, and so on, NIC1
        connects to Leaf1 and NIC20 connects to Leaf3.
		Take the scenario where GPU1 sends AI computing traffic to GPU20 as an example.
		Before sending the traffic, NIC1, to which GPU1 is attached, sends an authorization
        request (also called a negotiation request) to NIC20, to which GPU20 is attached. Both the request and the
        response are encapsulated in a specific message so that switches can
        identify and forward them.
		When NIC20 confirms that it can receive the traffic, it sends an
        authorization response to NIC1.  NIC1 begins sending traffic only after
        receiving the authorization response.</t>
		
		<t>This specific message (covering both the negotiation request and the negotiation
        response) is generated and sent by hardware such as chips.
        Its outer addressing can use an encapsulation similar
        to the GSE header described in this document.  The
        negotiation message carries information such as the required
        bandwidth; its format is not defined in this document.
		The negotiation mechanism applies in both directions.
		For example, the same workflow is followed when GPUs on
		Server2 or Server3 send traffic to GPU1 on Server1.
		</t>
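		<t>As a non-normative illustration, the request/response exchange described above can be sketched as follows.
		The class names, the field names, and the simple bandwidth-accounting grant policy are assumptions made for this sketch only;
		this document does not define the negotiation message format or the way a receiver decides whether to grant a request.</t>
		<sourcecode type="python"><![CDATA[
from dataclasses import dataclass


@dataclass
class AuthRequest:
    src_nic: str        # hypothetical NIC identifier, e.g. "NIC1"
    dst_nic: str        # hypothetical NIC identifier, e.g. "NIC20"
    port_id: int        # port ID learned from the Leaf switch
    required_gbps: int  # requested bandwidth (unit is assumed)


@dataclass
class AuthResponse:
    granted: bool
    granted_gbps: int


class ReceiverNic:
    """Receiver side: grants a request while bandwidth remains."""

    def __init__(self, name: str, capacity_gbps: int) -> None:
        self.name = name
        self.free_gbps = capacity_gbps

    def handle(self, req: AuthRequest) -> AuthResponse:
        # Grant only if the whole requested bandwidth is available.
        if req.required_gbps <= self.free_gbps:
            self.free_gbps -= req.required_gbps
            return AuthResponse(True, req.required_gbps)
        return AuthResponse(False, 0)

    def release(self, gbps: int) -> None:
        # Return bandwidth once a transfer has completed.
        self.free_gbps += gbps


def send_if_authorized(req: AuthRequest, rx: ReceiverNic) -> bool:
    """Sender side: transmit only after a positive response."""
    return rx.handle(req).granted
]]></sourcecode>
		<t>In this sketch, a second sender whose request exceeds the remaining receive bandwidth is refused and does not transmit,
		which corresponds to avoiding last-hop congestion.</t>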
		
		<t>
		This negotiation mechanism needs to identify the link between the NIC and the
        Leaf switch. The link can be identified by the GPU/NIC address
        together with the index of the interface connected to the Leaf switch.

        A NIC that supports the credit authorization mechanism
        obtains the port ID from its upstream Leaf switch
		and uses this identification information when it initiates credit authorization requests and responses.

        The information exchange between the NIC
        and the Leaf switch can be implemented via private ARP messages or
        extensions to protocols such as LLDP, which are not defined in this document.

        The GPU/NIC address and the associated port information can be
        advertised via control plane routing protocols and learned through
        interactions between the Leaf and Spine switches.</t>
		
		<t>If the NIC does not support the authorization mechanism, the Leaf switch
		connected to the NIC can perform this process on its behalf.</t>
      </section>
    
      <section numbered="true" toc="default">
	    <name>GSE Scenario 2</name>
	  	  <figure anchor="Fig3">
            <artwork align="left" name="Figure 3" type="" alt=""><![CDATA[
   +--------------------------------------------------------+
   |                                            ...  PodM   |
   |                                                        |
   |   +----------+  +----------+         +----------+      |
   |   |  Core1   |  |  Core2   |  ...    |  CoreZ   |      |
   |   +--+-------+  ++---------+         +----------+      |
   |      |           |       ......                        |
   |      |      +----+                                     |
   |      |      |                                          |
   |   +--+------++  +----------+         +----------+      |
   +-- |  SSpine1 |  |  SSpine2 |   ...   | SSpineN  | -----+
       +----------+  +----------+         +----------+

           +----------+                     +----------+
           |  Spine1  |                     |  Spine2  |
           +----------+                     +----------+

     +----+-+--+    +--+-+----+        +--+--+---+    +-+--+----+
     |  Leaf1  |    |  Leaf2  |        |  Leaf3  |    |  Leaf4  |
     +-+-----+-+    +-+-----+-+        +-+-----+-+    +-+-----+-+
       |
     +-+-----+-+    +-+-----+-+        +-+-----+-+    +-+-----+-+
     |N1|N2|...|    |N9|N10|..|        |N17|N18|.|    |N25|N26|.|
     +---------+    +---------+        +---------+    +---------+
	   Server1        Server2            Server3        Server4
           ]]></artwork>
          </figure>

       <t>Figure 2 only shows a portion of a single PoD.
	    Typical data centers used for AI computing are much larger and require connections to other PoDs.
	    Figure 3 provides an example; not all connections are drawn because of the complexity of the wiring.

		The Super Spine (SSpine) switches of each PoD are connected to the SSpine switches of other PoDs via Core switches.
		It is difficult to ensure that traffic flows from a Leaf switch in one PoD to a Leaf switch in another PoD
		without congestion along the entire forwarding path.
		However, within a single PoD, a method similar to that in Scenario 1 can be used:
		GSE is employed to provide low-latency, congestion-free forwarding of traffic from the Leaf switch to the Core switch.</t>
	    
	    <t>
		Before forwarding traffic from N1, Leaf1 sends a negotiation message to SSpine1
		that specifies the link between SSpine1 and Core1.
		Leaf1 begins forwarding traffic from N1 only if the negotiation succeeds,
		that is, if the available bandwidth between SSpine1 and Core1 is sufficient.
		If the bandwidth of that link is insufficient,
		Leaf1 can send another negotiation message to SSpine1 that specifies the link between SSpine1 and Core2.
		Again, Leaf1 begins forwarding traffic from N1 only if the negotiation succeeds.
		As a result, unless a link failure occurs, congestion-induced packet loss is avoided before the traffic reaches the selected Core switch.
		</t>
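		<t>The retry behavior described above can be sketched as follows. This is a minimal, non-normative sketch:
		the function name, the per-link table of free bandwidth, and the bandwidth unit are assumptions.
		In a real deployment the negotiation is carried in messages between Leaf1 and SSpine1, whose format is not defined in this document.</t>
		<sourcecode type="python"><![CDATA[
from typing import Dict, Optional, Sequence, Tuple

# (SSpine, Core) pair, e.g. ("SSpine1", "Core1")
Link = Tuple[str, str]


def negotiate_uplink(
    free_gbps: Dict[Link, int],
    candidates: Sequence[Link],
    required_gbps: int,
) -> Optional[Link]:
    """Try each SSpine-Core link in order; reserve the first fit.

    Returns the reserved link, or None if every negotiation
    fails, in which case the Leaf does not start forwarding.
    """
    for link in candidates:
        if free_gbps.get(link, 0) >= required_gbps:
            free_gbps[link] -= required_gbps
            return link
    return None
]]></sourcecode>
		<t>Calling the function with the candidate list [("SSpine1", "Core1"), ("SSpine1", "Core2")] mirrors the example in the text:
		the link to Core2 is requested only after the negotiation for the link to Core1 fails.</t>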
      </section>

      <section numbered="true" toc="default">
	    <name>GSE Summary</name>
		<t>
		The requirements of the two scenarios above are similar: both
        require the corresponding port information to be carried when routes are
        advertised (routes from the NIC/GPU and routes learned from other PoDs).
        Therefore, the port ID information used for authorization can be
        advertised along with the route; for the advertising method,
        refer to <xref target="I-D.zhang-idr-portid-ec" format="default"/>.  Based on this information,
        certain existing implementations support pre-transmission negotiation to ensure that sufficient bandwidth
		is available at the egress point before traffic is sent.</t>
		
		<t>
		After traffic transmission begins, data packets are aggregated into
        uniform-sized segments, and sequence numbers are added to the packets based on the segments they belong to.
        This ensures that even if a link fails or congestion occurs, data packets that travel over different paths
		can be reordered based on their sequence numbers.
        While a segment is being transmitted, its packets should use the same
        path among the ECMP links as far as possible, to reduce the buffering and processing load of packet reassembly.
		An entropy value is therefore needed to keep path selection stable.
		In some implementations, the source and destination queue identifiers, such
        as the Queue Pair (QP), can be used directly as entropy.  Although this
        mechanism reduces the probability of congestion, network congestion
        can still occur.  In such cases, other idle or lightly loaded ECMP links
        can be used to transmit segments.  In addition, mechanisms such as
        PFC and ECN can also be used to adjust the traffic transmission rate, thereby
        further reducing the probability of congestion.</t>
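		<t>The segmentation, entropy-based path selection, and reordering steps above can be illustrated with the following non-normative sketch.
		The segment size, the function names, and the use of CRC-32 to fold the QP pair into a 32-bit entropy value are assumptions made for illustration;
		as noted above, an implementation may use the queue identifiers directly.
		For brevity the last segment is not padded, so it may be shorter than the others.</t>
		<sourcecode type="python"><![CDATA[
import zlib
from typing import List, Tuple

SEGMENT_SIZE = 256  # bytes; assumed value, not defined by GSE


def segment(payload: bytes,
            size: int = SEGMENT_SIZE) -> List[Tuple[int, bytes]]:
    """Split payload into (seq, chunk) pairs of at most `size`."""
    offsets = range(0, len(payload), size)
    return [(seq, payload[off:off + size])
            for seq, off in enumerate(offsets)]


def entropy(src_qp: int, dst_qp: int) -> int:
    """Derive a stable 32-bit entropy value from the QP pair."""
    return zlib.crc32(f"{src_qp}:{dst_qp}".encode())


def pick_path(ent: int, n_paths: int) -> int:
    """The same entropy always maps to the same ECMP member."""
    return ent % n_paths


def reassemble(received: List[Tuple[int, bytes]]) -> bytes:
    """Reorder by sequence number and concatenate the chunks."""
    return b"".join(chunk for _, chunk in sorted(received))
]]></sourcecode>
		<t>Because the entropy depends only on the queue identifiers, all segments of a flow select the same ECMP member unless the sender deliberately moves them to another link;
		the sequence numbers then allow the receiver to restore the original order.</t>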
		
		<t>Such a forwarding mechanism is difficult to implement using traditional IP-based forwarding alone,
		so additional fields may need to be defined.
		These fields can be carried in a GSE header encapsulation, which allows devices to identify and process them.</t>
      </section>
    </section>

    <section numbered="true" toc="default">
      <name>GSE Header</name>
	  	<figure anchor="Fig4">
          <artwork align="left" name="Figure 4" type="" alt=""><![CDATA[
 +-------------+---------+----------+---------+-----+--------+
 | Destination | port-ID | Priority | Entropy | Seq | ...... |
 +-------------+---------+----------+---------+-----+--------+
           ]]></artwork>
        </figure>

      <t>Figure 4 shows an example GSE packet header for reference.  The header can be
         recognized and forwarded by the network layer.  One implementation
         uses a new type of Ethernet encapsulation.</t>
	  
	  <ul spacing="normal">
        <li>Destination: The IP address, or another address value, of the destination GPU/NIC or switch.</li>
        <li>port-ID: The port ID used for authorization.</li>
		<li>Priority: The traffic transmission priority, similar to DSCP.</li>
		<li>Entropy: The value used for traffic load balancing.</li>
		<li>Seq: The sequence number of the data packet, used for reassembly in case of out-of-order delivery.</li>
      </ul>
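	  <t>The following non-normative sketch shows how the example fields could be serialized.
	  This document does not define field widths or ordering on the wire, so the layout used here
	  (a 4-byte IPv4 destination, a 2-byte port-ID, a 1-byte priority, a 4-byte entropy value, and a 4-byte sequence number, in network byte order)
	  and all names are assumptions chosen only to make the example concrete.</t>
	  <sourcecode type="python"><![CDATA[
import ipaddress
import struct
from dataclasses import dataclass

# Assumed layout for illustration only: 4-byte IPv4 destination,
# 2-byte port-ID, 1-byte priority, 4-byte entropy, 4-byte seq.
_FMT = "!4sHBII"
HEADER_LEN = struct.calcsize(_FMT)


@dataclass(frozen=True)
class GseHeader:
    destination: str  # IPv4 address in dotted-decimal form
    port_id: int
    priority: int
    entropy: int
    seq: int

    def pack(self) -> bytes:
        """Serialize the header fields in network byte order."""
        dst = ipaddress.IPv4Address(self.destination).packed
        return struct.pack(_FMT, dst, self.port_id,
                           self.priority, self.entropy, self.seq)

    @classmethod
    def unpack(cls, data: bytes) -> "GseHeader":
        """Parse a header from the first HEADER_LEN bytes."""
        dst, port_id, prio, ent, seq = struct.unpack(
            _FMT, data[:HEADER_LEN])
        return cls(str(ipaddress.IPv4Address(dst)),
                   port_id, prio, ent, seq)
]]></sourcecode>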
	  
	</section>

    <section anchor="IANA" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>This document includes no request to IANA.</t>
    </section>
	
    <section anchor="Security" numbered="true" toc="default">
      <name>Security Considerations</name>
      <t>This document provides an implementation reference.
	  Implementing this scheme introduces new packet identification and forwarding processes,
	  which affect the implementation of switches and NICs.
	  Inappropriate implementation or deployment may expose the network to packet forgery attacks.</t>
    </section>
  </middle>
  <!--  *****BACK MATTER ***** -->

 <back>

   <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <xi:include href="http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
      </references>
      <references>
        <name>Informative References</name>
       <xi:include href="http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3168.xml"/>
	   <?rfc include="reference.I-D.zhang-idr-portid-ec.xml"?>
    </references>
    </references>
 </back>
</rfc>
