<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
    There has to be one entity for each item to be referenced. 
    An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml">
<!ENTITY RFC3552 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml">
<!ENTITY I-D.narten-iana-considerations-rfc2434bis SYSTEM "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.narten-iana-considerations-rfc2434bis.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs), 
    please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
    (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space 
    (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-xu-rtgwg-fare-in-mp-son-00"
     ipr="trust200902">
  <front>
    <title abbrev="FARE in MP-SON">Fully Adaptive Routing Ethernet in Multi-Plane
    Scale-Out Networks</title>

    <author fullname="Xiaohu Xu" initials="X." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <email>xuxiaohu_ietf@hotmail.com</email>
      </address>
    </author>

    <author fullname="Zongying He" initials="Z." surname="He">
      <organization>Broadcom</organization>

      <address>
        <email>zongying.he@broadcom.com</email>
      </address>
    </author>

    <author fullname="Nan Wang" initials="N." surname="Wang">
      <organization>Intel</organization>

      <address>
        <email>nan.wang@intel.com</email>
      </address>
    </author>

    <author fullname="Nan Wang" initials="N." surname="Wang">
      <organization>Hygon</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>wangn@hygon.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Wei Wan" initials="W." surname="Wan">
      <organization>Sugon</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>wanwei@sugon.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Hua Wang" initials="H." surname="Wang">
      <organization>Moore Threads</organization>

      <address>
        <email>wh@mthreads.com</email>
      </address>
    </author>

    <author fullname="Jian Guo" initials="J." surname="Guo">
      <organization>Biren Technology</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>jguo@birentech.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Xiang Li" initials="X." surname="Li">
      <organization>Enflame Technology</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>xiang.li@enflame-tech.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Tianyou Zhou" initials="T." surname="Zhou">
      <organization>Resnics Technology</organization>

      <address>
        <email>tzhou@resnics.com</email>
      </address>
    </author>

    <author fullname="Yongtao Yang" initials="Y." surname="Yang">
      <organization>Centec</organization>

      <address>
        <email>yangyt@centec.com</email>
      </address>
    </author>

    <author fullname="Yinben Xia" initials="Y." surname="Xia">
      <organization>Tencent</organization>

      <address>
        <email>forestxia@tencent.com</email>
      </address>
    </author>

    <author fullname="Weifeng Zhang" initials="W." surname="Zhang">
      <organization>Tencent</organization>

      <address>
        <email>wikkizhang@tencent.com</email>
      </address>
    </author>

    <author fullname="Peilong Wang" initials="P." surname="Wang">
      <organization>Baidu</organization>

      <address>
        <email>wangpeilong01@baidu.com</email>
      </address>
    </author>

    <author fullname="Yan Zhuang" initials="Y." surname="Zhuang">
      <organization>Huawei Technologies</organization>

      <address>
        <email>zhuangyan.zhuang@huawei.com</email>
      </address>
    </author>

    <author fullname="Fajie Yang" initials="F." surname="Yang">
      <organization>Cloudnine Information Technologies</organization>

      <address>
        <email>yangfajie@cloudnineinfo.com</email>
      </address>
    </author>

    <author fullname="Chao Li" initials="C." surname="Li">
      <organization>Metanet Networking Technology</organization>

      <address>
        <email>lichao22@ieisystem.com</email>
      </address>
    </author>

    <author fullname="Wang Xiaojun" initials="X." surname="Wang">
      <organization>Ruijie Networks</organization>

      <address>
        <email>wxj@ruijie.com.cn</email>
      </address>
    </author>

    <author fullname="Roman Glebov" initials="R." surname="Glebov">
      <organization>Yandex</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>kitaro630@yandex.ru</email>

        <uri/>
      </address>
    </author>

    <!---->

    <date day="10" month="June" year="2026"/>

    <abstract>
      <t>FARE&nbhy;BGP enables weighted ECMP load balancing using a
      path&nbhy;bandwidth extended community. FARE&nbhy;in&nbhy;SUN extends
      this mechanism from switches to GPUs for scale&nbhy;up networks, which
      are typically multi&nbhy;plane. Large AI training clusters increasingly
      adopt multi&nbhy;plane scale&nbhy;out network topologies. This document
      further extends FARE&nbhy;BGP from switches to RoCE NICs (RNICs) for
      such multi&nbhy;plane scale&nbhy;out networks. The document also
      presents two techniques to address route scalability concerns caused by
      the injection of numerous host routes.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Large AI training clusters (beyond 100,000 GPUs) increasingly use
      multi&nbhy;plane scale&nbhy;out network topologies (see Figure 1) to
      reduce the total number of switches and links. In such a topology, a
      high&nbhy;speed RNIC is split into multiple lower&nbhy;speed lanes, each
      connected to an independent CLOS fabric (a &ldquo;plane&rdquo;). Because
      there are no links between planes, the RNIC itself must decide which
      plane to use for each packet or flow. In other words, the RNIC must
      learn the reachability of each destination via each plane and then
      perform global load balancing across planes.</t>

      <t><figure>
          <artwork align="center"><![CDATA[      
       
   =========================================         
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   #                              Plane-1  #
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   =========================================

   ===================================     ===================================
   # +-----+ +-----+ +-----+ +-----+ #     # +-----+ +-----+ +-----+ +-----+ #
   # |RNIC1| |RNIC2| |RNIC3| |RNIC4| #     # |RNIC1| |RNIC2| |RNIC3| |RNIC4| #
   # +-----+ +-----+ +-----+ +-----+ #     # +-----+ +-----+ +-----+ +-----+ #
   #              Server-1           #     #             Server-n            #
   #================================== ... ===================================

   =========================================         
   # +----+ +----+ +----+ +----+           #
   # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
   # +----+ +----+ +----+ +----+           #
   #                              Plane-2  #
   # +----+ +----+ +----+ +----+           #
   # | S1 | | S2 | | S3 | | S4 | (Spine)   #
   # +----+ +----+ +----+ +----+           #
   =========================================           


                              Figure 1
]]></artwork>
        </figure></t>

      <t>(For simplicity, the diagram above omits the connections between
      RNICs and leaf switches. In practice, each RNIC is multi&nbhy;homed to
      one leaf switch in every plane.)</t>

      <t>FARE&nbhy;in&nbhy;SUN <xref target="I-D.xu-rtgwg-fare-in-sun"/>
      describes how to extend the FARE&nbhy;BGP protocol <xref
      target="I-D.xu-idr-fare"/> from switches to GPUs for scale&nbhy;up
      networks. Because scale&nbhy;up networks share the same
      multi&nbhy;plane architectural pattern as multi&nbhy;plane
      scale&nbhy;out networks, the adaptive routing approach defined in
      FARE&nbhy;in&nbhy;SUN can be applied to multi&nbhy;plane
      scale&nbhy;out networks almost unchanged.</t>

      <t>The solution described in this document is almost identical to
      FARE&nbhy;in&nbhy;SUN, with two essential differences. First,
      FARE&nbhy;BGP is extended from switches to RNICs rather than to GPUs.
      Second, the route scale differs. In a scale&nbhy;up network, the number
      of route entries is small (typically a few hundred), so all routes can
      be installed directly on GPUs. In an isolated multi&nbhy;plane
      scale&nbhy;out network with 100,000 GPUs and four planes, each plane
      may propagate up to 100,000 host routes, for a total of 400,000 routes.
      Storing all these routes on an RNIC is impractical. Therefore, the RNIC
      must reduce the size of its routing table using the techniques
      described in Section 3.</t>

      <t>This document describes how to extend Fully Adaptive Routing
      Ethernet using BGP (FARE&nbhy;BGP for short) <xref
      target="I-D.xu-idr-fare"/> from switches to RNICs in multi&nbhy;plane
      scale&nbhy;out networks.</t>
    </section>

    <section anchor="Abbreviations_Terminology" title="Terminology">
      <t>This memo makes use of the terms defined in <xref
      target="RFC2119"/>.</t>
    </section>

    <section title="Solution Description">
      <t>In an isolated multi&nbhy;plane scale&nbhy;out network, an RNIC
      connects to each plane and is configured as a stub BGP speaker in each
      plane. It establishes separate BGP sessions with its attached leaf
      switch in each plane. BGP neighbor autodiscovery <xref
      target="I-D.xu-idr-neighbor-autodiscovery"/> can be used to simplify
      configuration.</t>

      <t>Through these sessions, the RNIC learns routes to remote GPUs
      together with the path&nbhy;bandwidth extended community. Because the
      RNIC participates in BGP with each plane independently, it aggregates
      per&nbhy;plane path&nbhy;bandwidth information and performs weighted
      load balancing across planes. The RNIC thus performs the same Weighted
      Equal&nbhy;Cost Multi&nbhy;Path (WECMP) functions as a FARE&nbhy;capable
      switch, distributing traffic in proportion to the path bandwidth of each
      ECMP route.</t>
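
      <t>As an illustration only, the proportional split described above can
      be sketched in Python as follows. The function name, the plane labels,
      and the bandwidth values are hypothetical and are not defined by
      FARE&nbhy;BGP; the sketch simply normalizes the path bandwidth learned
      per plane into a traffic share.</t>

      <t><figure>
          <artwork><![CDATA[
```python
from fractions import Fraction


def plane_weights(path_bandwidth):
    """Map {plane: path bandwidth} to {plane: share of traffic}.

    A plane that advertises zero bandwidth receives a zero share.
    """
    total = sum(path_bandwidth.values())
    if total == 0:
        raise ValueError("no plane has usable bandwidth")
    return {
        plane: Fraction(bw, total)
        for plane, bw in path_bandwidth.items()
    }


# Hypothetical path bandwidths (Gbit/s) learned for one destination
# over four per-plane BGP sessions.
weights = plane_weights(
    {"plane-1": 100, "plane-2": 100, "plane-3": 50, "plane-4": 150}
)
```
]]></artwork>
        </figure></t>

      <t>With these example values, plane-3 carries 1/8 of the traffic to
      that destination and plane-4 carries 3/8.</t>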

      <t>Two modes of WECMP are supported: </t>

      <t><list>
          <t>Per&nbhy;flow WECMP (for RNICs that cannot handle
          out&nbhy;of&nbhy;order packet delivery): The RNIC establishes at
          least one queue pair (QP) per plane. The number of QPs allocated to
          a plane is proportional to the plane&rsquo;s weight. All packets of
          a given flow go through the same plane, which preserves packet
          order.</t>

          <t>Per&nbhy;packet WECMP (for RNICs that support
          out&nbhy;of&nbhy;order packet delivery): A single QP per (source,
          destination) RNIC pair suffices. The RNIC sprays the packets of
          that QP across all available planes according to the weights.</t>
        </list></t>
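
      <t>The two modes above can be sketched as follows. This is a
      non&nbhy;normative Python sketch that assumes integer plane weights;
      the function names and the interleaved round&nbhy;robin order are
      illustrative choices, not requirements of this document.</t>

      <t><figure>
          <artwork><![CDATA[
```python
from functools import reduce
from math import gcd


def qps_per_plane(weights):
    """Per-flow mode: smallest QP counts proportional to the weights.

    Planes with weight 0 get no QP; every other plane gets at least one.
    """
    usable = {p: w for p, w in weights.items() if w > 0}
    if not usable:
        raise ValueError("no usable plane")
    divisor = reduce(gcd, usable.values())
    return {p: w // divisor for p, w in usable.items()}


def spray_schedule(weights):
    """Per-packet mode: one cycle of a weighted round-robin schedule."""
    remaining = qps_per_plane(weights)
    schedule = []
    while remaining:
        for plane in list(remaining):
            schedule.append(plane)
            remaining[plane] -= 1
            if remaining[plane] == 0:
                del remaining[plane]
    return schedule


# Hypothetical integer weights, e.g. path bandwidth in Gbit/s.
weights = {"A": 100, "B": 100, "C": 50, "D": 150}
qp_counts = qps_per_plane(weights)
cycle = spray_schedule(weights)
```
]]></artwork>
        </figure></t>

      <t>In per&nbhy;flow mode, each flow is pinned to one of the QPs counted
      in qp_counts. In per&nbhy;packet mode, the RNIC repeats the cycle,
      sending each successive packet of the single QP to the next plane in
      the schedule.</t>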

      <t>As noted in Section 1, an isolated multi&nbhy;plane scale&nbhy;out
      network with 100,000 GPUs and four planes may present up to 400,000
      host routes to an RNIC, which is impractical to store. Two
      complementary approaches can reduce the number of routes the RNIC must
      store.</t>

      <section title="Route Aggregation with Explicit Unreachable Host Route Advertisement">
        <t>A straightforward approach is route aggregation, i.e., aggregating
        host routes when advertising them from leaf to spine. However, naive
        aggregation can cause route blackholes: if a specific host within an
        aggregate becomes unreachable via a plane, the aggregated route still
        points to that plane. Consequently, traffic destined for that host is
        still forwarded into that plane according to the aggregated route and
        then dropped.</t>

        <t>To address this issue, the switches MUST explicitly advertise
        unreachable host routes for a given RNIC to the other RNICs. When an
        RNIC becomes unreachable via a particular plane, the leaf switch
        advertises this unreachability to its attached RNICs using one of
        two methods:</t>

        <t><list>
            <t>Path bandwidth value of 0: The leaf switch advertises the host
            route (NLRI) with the BGP path&nbhy;bandwidth extended community
            set to 0. The RNIC interprets this as
            &ldquo;unreachable&rdquo; and excludes that plane from the
            next&nbhy;hop set for that destination.</t>

            <t>Specific BGP unreachability advertisement: The leaf switch
            sends a dedicated BGP unreachability message. This is distinct
            from a standard BGP route withdrawal. It explicitly marks the host
            as unreachable via that plane while keeping the aggregated route
            intact.</t>
          </list></t>

        <t>Upon receiving such an advertisement, the RNIC updates its
        forwarding table as follows: </t>

        <t><list>
            <t>It locates the longest&nbhy;matching aggregated route that
            covers the unreachable host (e.g., a default route or a supernet
            prefix).</t>

            <t>From that aggregated route&rsquo;s set of next&nbhy;hops (which
            originally included multiple planes), it removes the
            next&nbhy;hop corresponding to the plane where the host is
            unreachable.</t>

            <t>It then installs a host&nbhy;specific route for the
            unreachable destination, with the remaining next&nbhy;hops from
            the aggregated route.</t>
          </list></t>

        <t>Example: Suppose an RNIC has a default route (0.0.0.0/0) with
        next&nbhy;hops pointing to planes A, B, C, and D. Host X (a specific
        /32) becomes unreachable via plane A, and the RNIC receives an
        unreachability advertisement for X from plane A. It then creates a
        host route for X with next&nbhy;hops set to {B, C, D}, i.e., the
        original aggregated next&nbhy;hops minus the next&nbhy;hop associated
        with plane A. Traffic to X is no longer sent to plane A, which avoids
        the blackhole.</t>
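
        <t>The three steps and the example above can be sketched as follows.
        This is a minimal, IPv4&nbhy;only Python sketch; the class name, the
        method names, and the address 10.0.0.7 standing in for Host X are
        hypothetical, and a real RNIC would implement the lookup in
        hardware.</t>

        <t><figure>
            <artwork><![CDATA[
```python
import ipaddress


class RnicRoutes:
    """Toy forwarding table: prefix -> set of planes (next hops)."""

    def __init__(self):
        self.table = {}

    def install(self, prefix, planes):
        self.table[ipaddress.ip_network(prefix)] = set(planes)

    def lookup(self, host):
        """Longest-prefix match; returns (prefix, planes)."""
        addr = ipaddress.ip_address(host)
        matches = [p for p in self.table if addr in p]
        if not matches:
            raise KeyError(host)
        best = max(matches, key=lambda p: p.prefixlen)
        return best, self.table[best]

    def mark_unreachable(self, host, plane):
        """Handle an unreachability advertisement for host via plane."""
        _, planes = self.lookup(host)           # step 1: covering route
        remaining = set(planes) - {plane}       # step 2: drop the plane
        self.install(f"{host}/32", remaining)   # step 3: host route


routes = RnicRoutes()
routes.install("0.0.0.0/0", {"A", "B", "C", "D"})
routes.mark_unreachable("10.0.0.7", "A")  # Host X, unreachable via A
```
]]></artwork>
          </figure></t>

        <t>After the call, a lookup for 10.0.0.7 matches the new /32 route and
        returns planes B, C, and D, while any other destination still matches
        the default route with all four planes.</t>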

        <t>This technique dramatically reduces BGP table size on the RNIC: the
        RNIC only needs to store aggregated routes (e.g., a handful of default
        routes per plane) plus explicit unreachable host routes for the small
        number of hosts that are actually unreachable. The majority of
        reachable hosts are covered by aggregates and require no per&nbhy;host
        state. The approach is especially effective when unreachability is
        rare, which is typical in well&nbhy;managed clusters. </t>
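
        <t>The size of the reduction can be illustrated with simple
        arithmetic. The sketch below reuses the 100,000 GPUs and four planes
        from this document; the single aggregate per plane and the 50
        unreachable hosts are hypothetical values chosen only for
        illustration.</t>

        <t><figure>
            <artwork><![CDATA[
```python
def full_table_routes(hosts, planes):
    """Routes held by the RNIC when every host route is stored."""
    return hosts * planes


def aggregated_table_routes(aggregates_per_plane, planes, unreachable):
    """Routes held with aggregation plus explicit unreachable routes."""
    return aggregates_per_plane * planes + unreachable


full = full_table_routes(hosts=100_000, planes=4)
reduced = aggregated_table_routes(
    aggregates_per_plane=1,  # assumption: one default route per plane
    planes=4,
    unreachable=50,          # assumption: 50 unreachable host routes
)
```
]]></artwork>
          </figure></t>

        <t>Under these assumptions, the RNIC holds 54 routes instead of
        400,000.</t>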

        <t>Switches within each plane do not need to install the unreachable
        host routes into their FIB tables.</t>
      </section>

      <section title="Prefix&nbhy;ORF&nbhy;Based Route Filtering">
        <t>Since a given RNIC communicates only with a limited subset of GPUs
        (due to AI training parallelism patterns), the RNIC can filter routes
        and retain only those it actually needs.</t>

        <t>The RNIC sends Address Prefix Outbound Route Filter (ORF) entries
        to its BGP peer (leaf switch) in each plane. These entries identify
        the host routes of the remote RNICs that the local RNIC is interested
        in. The peer filters its outbound route updates accordingly and sends
        only the requested routes. In this way, the RNIC stores only a
        limited number of routes.</t>
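
        <t>The filtering behavior on the leaf switch can be sketched as
        follows. This Python sketch is a simplification: it models each ORF
        entry as a permitted IPv4 prefix only and ignores the other fields of
        a real ORF entry. The function name, the prefixes, and the bandwidth
        values are hypothetical.</t>

        <t><figure>
            <artwork><![CDATA[
```python
import ipaddress


def filter_outbound(routes, orf_entries):
    """Keep only routes covered by one of the peer's ORF entries."""
    wanted = [ipaddress.ip_network(e) for e in orf_entries]
    kept = {}
    for prefix, bandwidth in routes.items():
        net = ipaddress.ip_network(prefix)
        if any(net.subnet_of(w) for w in wanted):
            kept[prefix] = bandwidth
    return kept


# Hypothetical host routes held by a leaf switch
# (prefix -> path bandwidth).
leaf_routes = {
    "10.0.1.1/32": 400,
    "10.0.1.2/32": 400,
    "10.0.2.1/32": 200,
}
# Hypothetical ORF entries sent by the RNIC for its training peers.
sent = filter_outbound(leaf_routes, ["10.0.1.1/32", "10.0.2.1/32"])
```
]]></artwork>
          </figure></t>

        <t>Only the two requested host routes are advertised to the RNIC; the
        route for 10.0.1.2/32 stays on the leaf switch.</t>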

        <t>Switches do not need to install host routes for remote RNICs.
        Therefore, the FIB&nbhy;suppression mechanism described in Virtual
        Aggregation Auto-configuration <xref
        target="I-D.ietf-grow-va-auto"/> could be reused.</t>
      </section>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>TBD.</t>

      <!---->
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>

      <!---->
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>

      <!---->
    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.7306'?>

      <?rfc include="reference.I-D.xu-idr-fare"?>

      <?rfc include="reference.I-D.xu-rtgwg-fare-in-sun"?>

      <?rfc include="reference.I-D.xu-idr-neighbor-autodiscovery"?>

      <?rfc include="reference.I-D.ietf-grow-va-auto"?>

      <!---->
    </references>
  </back>
</rfc>
