The rapid proliferation of deep learning models has led to growing demands for low-latency and high-throughput inference across heterogeneous environments. While edge devices often host data sources, their limited compute and network resources restrict efficient model inference. Cloud servers provide abundant capacity but suffer from transmission delays and bottlenecks. Emerging programmable in-network devices (e.g., switches, FPGAs, SmartNICs) offer a unique opportunity to accelerate inference by processing tasks directly along data paths.¶
This document introduces an architecture for Distributed Collaborative Inference Acceleration. It proposes mechanisms to split, offload, and coordinate inference workloads across edge devices, in-network resources, and cloud servers, enabling reduced response time and improved utilization.¶
This note is to be removed before publishing as an RFC.¶
The latest revision of this draft can be found at https://kongyanye.github.io/draft-wang-cats-innetwork-infer/draft-wang-cats-innetwork-infer.html. Status information for this document may be found at https://datatracker.ietf.org/doc/draft-wang-cats-innetwork-infer/.¶
Discussion of this document takes place on the Computing-Aware Traffic Steering Working Group mailing list (mailto:cats@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/cats/. Subscribe at https://www.ietf.org/mailman/listinfo/cats/.¶
Source for this draft and an issue tracker can be found at https://github.com/kongyanye/draft-wang-cats-innetwork-infer.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 19 March 2026.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Large foundation models and domain-specific deep neural networks are increasingly deployed in real-time services such as surveillance video analysis, autonomous driving, industrial inspection, and natural language interfaces. Inference for such models requires both low latency and scalable throughput.¶
Current deployments typically follow two paradigms:¶
Edge-only inference, which minimizes data transmission but is constrained by limited device resources.¶
Cloud-centric inference, which exploits large compute capacity but introduces network delays.¶
However, neither paradigm fully exploits the potential of programmable in-network intelligence, where intermediate devices along the data path can actively participate in computation. By integrating such devices into distributed collaborative inference, networks can enable end-to-end acceleration of large-scale deep learning model inference.¶
This document outlines the motivation, problem statement, and architectural considerations for Distributed Collaborative Inference Acceleration (DCIA). The goal is to establish a framework where deep learning inference tasks are intelligently partitioned, scheduled, and executed across heterogeneous resources, including edge devices, in-network resources, and cloud servers.¶
DCIA is motivated by the following challenges:¶
Latency bottlenecks: Inference for large models may exceed the latency tolerance of interactive applications if it is executed only at the edge or only in the cloud.¶
Resource fragmentation: Heterogeneous resources (edge GPUs, in-network accelerators, cloud clusters) are not effectively coordinated.¶
Lack of steering semantics: Existing service steering approaches are not optimized for partitioning and scheduling inference workloads.¶
The framework for DCIA includes the following:¶
Model Partitioning and Mapping: Split large models into sub-tasks (e.g., early layers at the edge, middle layers in-network, final layers in the cloud) and map them to nodes based on node capabilities, load, and network conditions (a non-normative sketch of such a mapping decision follows this list).¶
In-Network Execution: Enable inference acceleration on programmable switches, FPGAs, or SmartNICs, using data-plane programmability to process features in transit (e.g., feature extraction, embedding computation).¶
Task Scheduling and Steering: Extend service capability advertisements with inference-oriented metrics (e.g., GPU/FPGA availability, model version, layer compatibility), and dynamically balance inference tasks across heterogeneous resources.¶
Load Balancing Protocols: Support task redirection and failover when a device becomes overloaded, and explore transport-level extensions that allow adaptive task splitting along paths (see the second sketch after this list).¶
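The following non-normative Python sketch illustrates one way a partition point could be selected from advertised capability metrics. The stage and node attributes (compute_tops, load, uplink_ms, gflops, out_kbytes), the toy cost model, and the exhaustive search are illustrative assumptions of this sketch, not requirements of the framework.¶

# Non-normative sketch: choose split points for a linear model across an
# edge node, an in-network node, and a cloud node.  All metric names and
# the cost model below are illustrative assumptions, not protocol elements.

from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional

@dataclass
class Node:
    name: str            # e.g., "edge", "in-network", "cloud"
    compute_tops: float  # advertised compute capability (tera-ops/s)
    load: float          # current utilization in [0, 1]
    uplink_ms: float     # latency toward the next node on the path (ms)

@dataclass
class Stage:
    name: str
    gflops: float        # compute cost of this model stage (giga-ops)
    out_kbytes: float    # size of the intermediate tensor it emits

def estimate_latency(cuts: List[int], stages: List[Stage],
                     nodes: List[Node]) -> float:
    """Estimate end-to-end latency (ms) for a given list of split points.

    Node i runs stages[bounds[i]:bounds[i+1]], where bounds = [0]+cuts+[len].
    """
    bounds = [0] + cuts + [len(stages)]
    total = 0.0
    for i, node in enumerate(nodes):
        assigned = stages[bounds[i]:bounds[i + 1]]
        # Compute time: cost divided by the capability left after current load.
        available_gops = node.compute_tops * 1e3 * (1.0 - node.load)
        total += 1e3 * sum(s.gflops for s in assigned) / max(available_gops, 1e-6)
        # Transfer time for the last intermediate tensor (toy model: 1 MB/ms).
        if assigned and i < len(nodes) - 1:
            total += node.uplink_ms + assigned[-1].out_kbytes / 1e3
    return total

def best_cuts(stages: List[Stage], nodes: List[Node]) -> Optional[List[int]]:
    """Exhaustively search split points; adequate for a small linear model."""
    best, best_cost = None, float("inf")
    for cuts in combinations(range(1, len(stages)), len(nodes) - 1):
        cost = estimate_latency(list(cuts), stages, nodes)
        if cost < best_cost:
            best, best_cost = list(cuts), cost
    return best

For a model with several stages and three nodes, best_cuts returns the two indices at which the model would be split between the edge and the in-network node and between the in-network node and the cloud.¶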
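A minimal sketch of task redirection on overload is shown below, assuming a fixed edge/in-network/cloud path, a hypothetical 0.9 utilization threshold, and dictionary-based bookkeeping; none of these details are mandated by this document.¶

# Non-normative sketch: redirect inference sub-tasks away from an overloaded
# node toward the next node on the path.  Node names, the utilization
# threshold, and the dict-based bookkeeping are illustrative assumptions.

from typing import Dict, List

PATH = ["edge", "in-network", "cloud"]   # assumed downstream order

def redirect_on_overload(assignment: Dict[str, List[str]],
                         utilization: Dict[str, float],
                         threshold: float = 0.9) -> Dict[str, List[str]]:
    """Move all sub-tasks of any overloaded node to the next node on the path.

    assignment maps a node name to the list of sub-task identifiers it runs;
    utilization maps a node name to its current load in [0, 1].
    """
    revised = {name: list(tasks) for name, tasks in assignment.items()}
    for i, name in enumerate(PATH[:-1]):             # the last node absorbs overflow
        if utilization.get(name, 0.0) > threshold and revised.get(name):
            revised.setdefault(PATH[i + 1], [])
            revised[PATH[i + 1]] = revised[name] + revised[PATH[i + 1]]
            revised[name] = []
    return revised

# Example: the in-network node is saturated, so its sub-tasks move to the cloud.
before = {"edge": ["stem"], "in-network": ["block1", "block2"], "cloud": ["head"]}
after = redirect_on_overload(before, {"edge": 0.4, "in-network": 0.95, "cloud": 0.2})
# after == {"edge": ["stem"], "in-network": [], "cloud": ["block1", "block2", "head"]}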
Video Analytics: Smart cameras extract features locally, switches perform intermediate tensor transformations, and cloud servers handle complex classification.¶
Autonomous Vehicles: Onboard processors execute lightweight inference, roadside units conduct mid-layer fusion, and cloud clusters finalize planning decisions.¶
Interactive AI Services: Edge devices handle pre-processing, in-network resources accelerate embeddings, and cloud models provide final responses.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Inference partitioning must consider the confidentiality and integrity of intermediate data exchanged between edge, in-network, and cloud nodes, as well as the trustworthiness of in-network devices that execute offloaded sub-tasks.¶
This document has no IANA actions.¶
The authors would like to thank colleagues and reviewers in the community who provided feedback on the early version of this draft.¶