<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
     There has to be one entity for each item to be referenced.
     An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC6241 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6241.xml">
<!ENTITY RFC7950 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7950.xml">
<!ENTITY RFC7149 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7149.xml">
<!ENTITY RFC7426 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7426.xml">
<!ENTITY RFC8299 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8299.xml">
<!ENTITY RFC8309 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8309.xml">
<!ENTITY RFC8340 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8340.xml">
<!ENTITY RFC8453 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8453.xml">
<!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
<!ENTITY RFC8345 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8345.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs),
     please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
     (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space
     (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-he-ippm-congestion-loss-monitoring-arch-00"
     ipr="trust200902">
  <front>
    <title>An Architectural Framework for Monitoring Packet Loss Caused by Network Congestion</title>
    <author fullname="Xiaoming He" initials="X." surname="He">
      <organization>China Telecom</organization>

      <address>
        <email>hexm4@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Zijing He" initials="Z." surname="He">
      <organization>South China University of Technology</organization>

      <address>
        <email>katehe163@163.com</email>
      </address>
    </author>
    
    <author fullname="Cancan Huang" initials="C." surname="Huang">
      <organization>China Telecom</organization>

      <address>
        <email>huangcanc@chinatelecom.cn</email>
      </address>
    </author>


    <date year="2026"/>

    <area>IPPM</area>

    <workgroup>IPPM Working Group</workgroup>

    <keyword>Monitoring Packet Loss Caused by Network Congestion</keyword>

    <abstract>
      <t>Network congestion can degrade performance and increase uncertainty in service delivery,
so real-time congestion monitoring is necessary. This document
describes a comprehensive architectural framework for monitoring
packet loss. The proposed scheme can not only determine the time
and location of packet loss, accurately count the discarded
packets, identify which traffic flows the discarded packets
belong to, and identify which traffic flows cause microbursts, but
also obtain accurate packet loss ratio results.
More importantly, the proposed scheme causes little or
no interference to the network and is applicable to any data
plane, because, unlike existing measurement methods, it does not
require modifying the forwarding chip or the packet header.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="Introduction" title="Introduction">
      <t>With the large-scale deployment of 5G networks,
emerging services including enhanced Mobile
Broadband (eMBB) and Ultra-Reliable Low Latency
Communication (uRLLC) have imposed stringent requirements on IP bearer network performance, demanding
significantly reduced latency, minimized jitter, and near-zero
packet loss rates [_GPP_TS_22.261]. At the same time, the technical development of Big
Data and Artificial Intelligence (AI) calls for intelligent
computing network infrastructure whose goal is to construct
a lossless network characterized by "high throughput, low
latency, and zero packet loss" [Adithya_Gangidi24][Kun_Qian24]. However, the inherent statistical multiplexing of
TCP/IP networks results in bursty traffic patterns,
making network congestion inevitable. Such
congestion degrades network performance and
introduces uncertainty in service delivery; for example, loss leads to
packet retransmission, and increased delay leads to decreased
throughput. Numerous studies have long
concentrated on congestion control mechanisms and related
algorithms [RFC9293][RFC9743] to improve network performance.</t>

      <t>Network congestion can be roughly divided into two classes:
long-lived congestion and short-lived congestion. Long-lived congestion is generally caused by persistent traffic
growth, with congestion lasting from hours to days,
and is easy to observe through a Network Management
System/Element Management System (NMS/EMS). In contrast,
short-lived congestion is almost always caused by traffic bursts,
among which microbursts are one of the major contributors.
A microburst is a phenomenon in which a device port receives
a considerable amount of burst data in a very short time
(i.e., milliseconds or even microseconds), resulting in an instantaneous burst rate much higher than the average rate, even
exceeding the port bandwidth [Microburst][Shuhei_Yoshida21]. A
microburst is prone to cause packet loss but is difficult to detect in time.
Many investigations indicate that microbursts are the main culprit
affecting latency-sensitive and loss-sensitive services.
When a microburst occurs, the queuing time increases rapidly,
and in severe cases packets are lost, both of which are
intolerable for applications such as Virtual Reality (VR).</t>

   <t>To reduce the uncertainty in service delivery caused by
network congestion, it is essential to monitor congestion-induced packet loss in real time. Network operators
can then quickly locate the congested nodes and links,
optimize paths for the affected traffic flows to avoid
congestion, and evaluate the network congestion level to
guide network planning, capacity expansion,
and optimization.</t>

      <t>[I-D.he-ippm-congestion-loss-monitoring-problem] discusses the requirements for real-time monitoring of packet loss caused by congestion and presents the
problems and challenges that existing monitoring and
measurement techniques face in meeting them. This document describes an architectural
framework for real-time monitoring of congestion-induced
packet loss.</t>
    </section>

    <section title="Conventions">
    
     <section title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
      "OPTIONAL" in this document are to be interpreted as described in BCP 14
      <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when,
      they appear in all capitals, as shown here.</t>
     </section>

     <section title="Terminology">
      <t>Abbreviations used in this document:</t>
      <t>AI:      Artificial Intelligence</t>
      <t>CLI: 	   Command Line Interface</t>
      <t>CPU: 	   Central Processing Unit</t>
      <t>MPLS:    Multi-Protocol Label Switching</t>
      <t>NTP:     Network Time Protocol</t>
      <t>PLR:     Packet Loss Ratio</t>
      <t>SNMP:    Simple Network Management Protocol</t>
      <t>SLA:     Service Level Agreement</t>
      <t>SLO:     Service Level Objective</t>
      <t>SRv6:    Segment Routing over IPv6</t>
      <t>VPN:     Virtual Private Network</t>
     </section>
    </section>

	<section title="Architectural Framework for Real-time Monitoring of Packet Loss Caused by Congestion">
      <t>To monitor congestion-induced packet loss effectively, this document
proposes a comprehensive architectural framework for packet loss monitoring [Xioaming_He25].
The proposed framework is mainly composed of network
devices and a collection and analysis system. All network
devices need to report loss events caused by congestion,
cache the packets discarded due to queue overflow, and
upload them to the collection and analysis system in real time. A telemetry interface (e.g., YANG Push [RFC8641] or
gRPC [gRPC]) with a subscription mechanism is used to push
loss data immediately when a loss event occurs, avoiding
the inefficiency of traditional SNMP polling. The
collection and analysis system is required to count the total
number of discarded packets reported, parse the service
types of the discarded packets, count the number of discarded
packets for every traffic flow involved in the loss events, and
calculate the packet loss ratio (PLR) of a specified user flow.
Furthermore, the real-time visibility of packet loss gained from
the collection and analysis system can be fed into the NMS/EMS
so that network operators can quickly pinpoint the congested
nodes and the affected traffic flows. With
such real-time visibility of packet loss, the network controller
can also optimize paths in a timely manner for affected traffic flows
that are sensitive to latency and loss, improving user Quality of Experience (QoE). Figure 1 illustrates the proposed framework for monitoring packet loss
caused by congestion.<figure title="Framework for monitoring packet loss caused by congestion">
   <artwork>
 +-------------+        +--------------------------+        +-------------+
 | Network     |&lt;-------| Collection and Analysis  |------->| NMS/EMS     |
 | controller  |        | system                   |        |             |
 +-------------+        +-------------^------------+        +-------------+
       |                              |
       | Timely path optimization     |                Rapid troubleshooting
       | based on real-time           |                based on real-time
       | loss visibility              |                loss visibility
       |                    Packet loss data reporting 
 +-----v--^-------------^-------------^-------------^-------------^-------+
 |        |             |             |             |             |       |
 |        |             |             |             |             |       |
 |    +---+--+      +---+--+      +---+--+      +---+--+      +---+--+    |
 |    | Node |------| Node |------| Node |------| Node |------| Node |    |
 |    +------+      +------+      +------+      +------+      +------+    |
 |                               IP Network                               |     
 +------------------------------------------------------------------------+</artwork>
     </figure></t>
            
	 <section title="Network Devices">  
       <t>In IP networks, network devices such as routers and switches
are mainly used to forward packets. Traditional
network devices only record the number of packets discarded
by port or queue overflow, and send no notification
when packet loss occurs. The operator has to log
on to the device (e.g., through the CLI) to search for loss events.
Network devices need to be able to detect congestion and packet loss in real time. Because
querying counters with the CPU on the main control engine consumes considerable
processing resources, the network device must leverage
built-in dedicated hardware to detect packet loss in real time.
In addition, existing forwarding devices do not cache
the packets that overflow a queue but simply drop them, so
it is not clear which packets were dropped or which traffic
flows contributed to the congestion or microburst. To
capture the traffic flows related to the packet loss, a cache
for the discarded packets is needed. The proposed in-device packet loss detection architecture
is shown in Figure 2. <figure title="In-Device Packet Loss Detection Architecture">
   <artwork>
+------------------------------------------------------------------+
|                             Network device                       |
|                                                                  |
|  +---------------------------+    +--------------------------+   |
|  | Real-time packet loss     |--->| packet loss information  |   |
|  | detection module          |    | reporting module         |   |
|  +-------------|-------------+    +--------------------------+   |
|                |                                                 |
|                v                                                 |
|  +---------------------------+        +-----+                    |
|  | packet loss counter       |&lt;-------|queue|  port1             |
|  +---------------------------+        +-----+                    |
|                                       +-----+                    |
|  +---------------------------+        |queue|  port2             |
|  |Cache module for discarded |&lt;-------+-----+                    |
|  | Packets                   |        +-----+                    |
|  +-------------|-------------+        |queue|  port3             |
|                |                      +-----+                    |
|                v                         :                       |
|  +---------------------------+        +-----+                    |
|  | packet loss file Upload   |        |queue|  portN             |
|  | module                    |        +-----+                    |
|  +---------------------------+        +-----+                    |
|                                       |queue|  portM             |
|                                       +-----+                    |
+------------------------------------------------------------------+</artwork>
     </figure></t>

	  <t>The in-device packet loss detection architecture
requires four new functional modules, which are
described as follows.
	  <list style="symbols">
<t>Real-time packet loss detection module: Leverages the
built-in dedicated hardware to query the queue overflow packet loss
counter of every port at millisecond intervals, and
records the location and time of each loss occurrence.</t>

<t>Packet loss information reporting module: Sends loss
information according to the subscription request, including
the number of discarded packets, the timestamp of the loss occurrence, and the location of the packet loss, such as device
ID, port ID, and queue ID.</t>

<t>Cache module for discarded packets: Caches packets
dropped by queue overflow and, optionally, records the
number of discarded packets, the time of the loss occurrence, and
the location of the packet loss, such as device ID, port
ID, and queue ID. Only one shared cache is needed for
all ports and queues. To save buffer space,
the cached packets should be cleared immediately after
uploading.</t>

<t>Packet loss file upload module: Packages the cached
discarded packets as a file or compressed file and uploads
it to the collection and analysis system according to the
specified rule.</t>
       </list></t>

	  <section title="Cache for Discarded Packets">
	   <t>To analyze packet drops
caused by queue overflow, a cache mechanism
is essential for capturing discarded packets. However, since
packet parsing and statistical analysis consume significant
local resources (such as memory and computing power),
these tasks are better handled by a remote central processing entity. Since
packet headers typically contain all the necessary service type and
flow attribute information, truncating discarded packets to a
fixed length (e.g., the first 64 bytes) provides sufficient data
for analysis while dramatically reducing the cache size needed.</t>
      <t>While a packet loss file is being uploaded and the
discarded packets are being cleared, a new loss event may occur, leaving
no buffer available for the subsequently dropped packets.
To avoid this situation, the cache should be divided
into two separate spaces in an appropriate proportion: a primary
space and a spare space. The primary space caches
the discarded packets for each upload, and the spare
space caches the packets discarded while
the current packaging and uploading operation is in progress.</t>
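      <t>The two-space cache can be illustrated with the following minimal
Python sketch. It is not a normative implementation: the class name DropCache,
the capacities, and the choice to move the spare space content into the primary
space after an upload are assumptions made only for illustration. A real device
would realize this logic in hardware or firmware.</t>
      <figure><artwork><![CDATA[
```python
from dataclasses import dataclass, field

TRUNCATE_LEN = 64  # assumed truncation length, in bytes


@dataclass
class DropCache:
    """Hypothetical two-space cache for discarded packets."""
    primary_cap: int = 4  # assumed capacities, in packets
    spare_cap: int = 2
    primary: list = field(default_factory=list)
    spare: list = field(default_factory=list)
    uploading: bool = False

    def store(self, packet: bytes) -> bool:
        """Cache a truncated copy; return False if no room."""
        head = packet[:TRUNCATE_LEN]
        # While an upload is in progress, new drops go to spare.
        if self.uploading:
            space, cap = self.spare, self.spare_cap
        else:
            space, cap = self.primary, self.primary_cap
        if len(space) >= cap:
            return False
        space.append(head)
        return True

    def begin_upload(self) -> list:
        """Freeze the primary space and hand it to the uploader."""
        self.uploading = True
        return list(self.primary)

    def finish_upload(self) -> None:
        """Clear uploaded packets; spare content becomes primary."""
        self.primary = self.spare
        self.spare = []
        self.uploading = False
```
]]></artwork></figure>
      <t>In this sketch, packets dropped between begin_upload() and
finish_upload() are kept in the spare space, so they are not lost while the
primary space is being packaged and cleared.</t>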
	 </section>

	 <section title="Packet Loss File Upload">
       <t>To support real-time
uploading of packet loss files, a file transfer protocol such as the Trivial File Transfer Protocol (TFTP) [RFC1350] should
be used to transfer the file as soon as the loss
file is available. To minimize cache capacity, a smart uploading scheme for packet loss files is proposed, which is described as follows.
</t>

<t>S1 If there is no discarded packet in either cache space, no
packet loss file is uploaded, which saves processing
resources.</t>

<t>S2 If there are discarded packets in either cache space
(the primary space or the spare space) and
neither space has reached the utilization threshold (e.g.,
90%), the packet loss file is uploaded according to a
preset fixed cycle (e.g., 10 s) that needs to meet the real-time requirements for packet parsing and statistics.</t>

<t>S3 Otherwise, when either space reaches the utilization threshold
because of a large number of dropped packets, the packet loss file
is uploaded immediately without waiting for the next
uploading cycle.</t>
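<t>The rules S1 to S3 can be sketched as a single decision function. The
following Python fragment is illustrative only; the function name, the
parameter names, and the returned strings are hypothetical, and the threshold
and cycle values are the examples given above rather than mandated
values.</t>
<figure><artwork><![CDATA[
```python
def upload_decision(primary_used, spare_used, threshold,
                    since_last_upload, cycle):
    """Return 'none', 'wait', or 'upload' per rules S1 to S3.

    primary_used / spare_used: utilization in [0.0, 1.0].
    threshold: utilization threshold, e.g. 0.9.
    since_last_upload / cycle: seconds, e.g. cycle = 10.
    """
    # S1: nothing cached, so nothing to upload.
    if primary_used == 0 and spare_used == 0:
        return "none"
    # S3: either space reached the threshold, upload at once.
    if primary_used >= threshold or spare_used >= threshold:
        return "upload"
    # S2: below the threshold, upload when the cycle elapses.
    if since_last_upload >= cycle:
        return "upload"
    return "wait"
```
]]></artwork></figure>
<t>For instance, with a threshold of 0.9 and a cycle of 10 s, a primary
space that is 95% full triggers an upload even if only 1 s has passed since
the previous upload.</t>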
	 </section>

	 <section title="Telemetry Data Collection and Report">
	 <t>The local device
is also required to collect real-time data on loss caused by congestion.
To capture loss events in real time, the network
device needs to leverage built-in dedicated hardware such as an Application-Specific Integrated Circuit (ASIC) to
read the packet loss counter of each port or queue at millisecond intervals
and send telemetry data about the loss
according to the subscription request. To improve the
real-time awareness of packet loss in scenarios such as
traffic optimization and congestion discovery, on-change
updates are preferable to periodic updates; that
is, a telemetry update is sent immediately when the packet loss
counter value changes. When on-change updates are supported,
a dampening period should be configurable to limit the
amount of data sent.</t>
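	 <t>The interaction between on-change updates and the dampening period
can be sketched as follows. This Python fragment is a simplified illustration,
not the YANG Push state machine: the class name OnChangeReporter and its
method are hypothetical, times are in milliseconds, and a change that occurs
during the dampening period is simply reported by the first poll after the
period ends.</t>
	 <figure><artwork><![CDATA[
```python
class OnChangeReporter:
    """Hypothetical on-change reporter with a dampening period."""

    def __init__(self, dampening_ms: int):
        self.dampening_ms = dampening_ms
        self.last_sent_value = 0
        self.last_sent_at = None  # time of last update, in ms

    def poll(self, now_ms: int, counter: int):
        """Return the number of new drops to report, or None."""
        if counter == self.last_sent_value:
            return None  # no change, nothing to send
        if (self.last_sent_at is not None
                and now_ms - self.last_sent_at < self.dampening_ms):
            return None  # change is held until the period ends
        delta = counter - self.last_sent_value
        self.last_sent_value = counter
        self.last_sent_at = now_ms
        return delta
```
]]></artwork></figure>
	 <t>The first counter change is reported at once; further changes within
the dampening period are aggregated into one later update, which bounds the
update rate during a sustained burst.</t>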

	<t>In addition, to measure the Packet Loss Ratio (PLR) caused by
congestion, the network device is required to collect
statistical data for the monitored traffic flows and send the
corresponding telemetry data to the collection and analysis
system periodically. The ingress device, such as an access router
or a Provider Edge (PE) router, is required to configure
a received-packet counter for the monitored traffic. The
monitored traffic flows may be identified as
Layer 2 flows (e.g., based on source and/or destination Media
Access Control (MAC) address, Virtual Local Area Network
Identifier (VLAN ID), or Virtual eXtensible Local Area Network (VXLAN)
Network Identifier (VNI)), as Layer 3 flows (e.g., identified by
N-tuple and/or the Flow Label field of the IPv6 packet header), or by the Layer 2/3
VPN ID carried in the SR-MPLS label stack or the IPv6 Segment
Routing Header (SRH), etc.</t>
	 </section>

	 <section title="Time Synchronization">
	 <t>Global time synchronization is
also needed for accurate PLR calculation.
For instance, suppose the ingress device periodically reports the
statistics of the received VPN traffic (the packet counter value)
with a timestamp in the telemetry data. If, during some report
period, this VPN traffic encounters
packet loss caused by a microburst, the loss information is
reported immediately, carrying the timestamp of the loss occurrence.
Figure 3 depicts the timing
relationship between the reported telemetry data of the specified
traffic and the reported loss occurrence.<figure title="Loss Occurrence and Telemetry Data Report Period Timing">
   <artwork>                         
  |   Report period   |   Report period   |
--|-------------------|---------|---------|--------> Synchronization time
  ^                   ^         ^         ^
  |                   |         |         |
  |                   |         |         |
                      Tp        Tl       Tc </artwork>
     </figure></t>
     
<t>Because the timestamp Tl
of the loss occurrence falls between the timestamps Tp and Tc carried by two consecutive traffic telemetry reports, the collection and
analysis system can attribute the loss to that report period and correctly calculate the PLR of the specified
VPN traffic for that period.</t>

	 <t>The network device is required to support a time synchronization
technique such as the Network Time Protocol (NTP) or IEEE 1588,
both of which are widely deployed in operators' networks. Generally,
NTP can achieve a precision of 50 ms and IEEE 1588 can achieve
a precision of microseconds. In the proposed scheme, the required
synchronization precision depends on the measurement period.
For a normal measurement period of tens of seconds or even
minutes, a synchronization precision of 50 ms, which is easy to achieve,
is sufficient.</t>
	  </section>                 
	 </section>
	
	 <section title="Collection and Analysis System">
	 <t>The proposed framework is required to handle packet
loss information and has demanding real-time requirements.
Therefore, an independent collection and analysis system is
better suited to monitoring packet loss caused by
congestion in real time. The proposed structure of the collection and analysis
system is shown in Figure 4.<figure title="Internal Functional Modules of Collection and Analysis System">
   <artwork> 
+--------------------------------------------------------------------------+
|                        Collection and analysis system                    |
|                                                                          |
|  +------------------+                  +-------------------------+       |
|  | PLR measurement  |&lt;-----------------| Packet loss statistics  |       |
|  | module           |                  | module                  |       |
|  +--------^---------+                  +---^---------------^-----+       |
|           |                                |               |             |
|           |                                |               |             |
| +---------+--------------+     +----------------+    +-----------------+ |
| | Measured traffic flows |     | Packet parsing |    |Packet loss data | |
| | collection module      |     | module         |&lt;---|collection module| |
| +------------------------+     +----------------+    +-----------------+ |
+--------------------------------------------------------------------------+</artwork>
     </figure></t>
     
      <t>The collection and analysis system
mainly consists of five functional modules, which are
described as follows.
	<list style="symbols">

	<t>Packet loss data collection module: Accepts the packet
loss data from network devices, including the reported telemetry
data on loss and the uploaded loss files,
and stores them for a specified time; records the number
of discarded packets, the timestamp, and the location ID
carried in each telemetry report.</t>

      <t>Measured traffic flows collection module: Accepts the
telemetry data of the measured traffic flows reported by
network ingress devices and stores them for a specified
time; records the number of received packets and the
timestamp carried in each telemetry
report.</t>

	 <t>Packet parsing module: Uses packet
parsing tools to parse, in real time, the discarded
packets contained in the uploaded packet loss files.</t>

	 <t>Packet loss statistics module: Based on the packet parsing
results, counts the number of discarded packets belonging
to each traffic flow; based on the reported packet loss information,
counts the total number of discarded packets
for each device, port, and queue, and also records the
time and location of each loss occurrence.</t>

	 <t>PLR measurement module: Based on the periodically reported statistical data
of the measured traffic flows and
the number of discarded packets of the measured
traffic flows, calculates the PLR of the measured traffic flows
according to the requirements of network operators (e.g.,
periodic measurement).</t>
 	 </list></t>

	 <section title="Packet Parsing">
      <t>The discarded packets should be parsed as
soon as possible to meet the real-time requirements of packet
loss statistics and measurement. To provide real-time visibility of packet loss statistics and support online PLR
measurement, the parsing time for each uploaded
packet loss file should be as short as possible, e.g., 100 ms. The
parsing of the discarded packets should at least
cover the measured traffic mentioned above, such as Layer
2/3 flows and Layer 2/3 VPN traffic.</t>
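      <t>As an illustration of header-only parsing, the following Python
sketch extracts a 5-tuple flow key from a truncated frame and counts discarded
packets per flow. It is deliberately narrow and its scope is an assumption:
the function names are hypothetical, and only untagged Ethernet frames
carrying IPv4 with TCP or UDP are handled. VLAN tags, IPv6, SRv6, and MPLS
encapsulations would need additional parsing.</t>
      <figure><artwork><![CDATA[
```python
import socket
import struct
from collections import Counter


def flow_key(frame: bytes):
    """Return (src, dst, proto, sport, dport) or None.

    Handles only untagged Ethernet carrying IPv4.
    """
    if len(frame) < 34:  # 14 (Ethernet) + 20 (minimum IPv4)
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != 0x0800:  # not IPv4
        return None
    ihl = (frame[14] & 0x0F) * 4  # IPv4 header length, bytes
    if ihl < 20:
        return None
    proto = frame[23]
    src = socket.inet_ntoa(frame[26:30])
    dst = socket.inet_ntoa(frame[30:34])
    sport = dport = 0
    l4 = 14 + ihl  # offset of the transport header
    if proto in (6, 17) and len(frame) >= l4 + 4:
        sport, dport = struct.unpack_from("!HH", frame, l4)
    return (src, dst, proto, sport, dport)


def count_drops(frames):
    """Count discarded packets per flow key."""
    counts = Counter()
    for frame in frames:
        key = flow_key(frame)
        if key is not None:
            counts[key] += 1
    return counts
```
]]></artwork></figure>
      <t>For this encapsulation, the fields used lie within the first 38 bytes
of the frame when the IPv4 header carries no options, so a 64-byte truncation
retains them.</t>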
	 </section>

	 <section title="PLR Measurement">
      <t>The PLR measurement module obtains
the number of received packets and the timestamps carried in the telemetry
data of the measured traffic flow from the measured traffic
flows collection module. It also obtains the
number of discarded packets of the measured traffic flow
and the timestamps carried in the loss information or in the packet
loss file from the packet loss statistics module. Based
on the timing relationship between the timestamps carried in
the telemetry data of the measured traffic flow and the timestamp of the
loss occurrence, together with these two packet counts,
the PLR measurement module can calculate the PLR of the
measured traffic flow during a specified measurement period.</t>

      <t>For example, the collection and analysis system receives the
previous telemetry data of the measured traffic flow carrying
the number N1 of received packets and the timestamp T1, as well as
the current telemetry data carrying the number N2 of received
packets and the timestamp T2, so that N2-N1 packets were received during the period. It also obtains the
number N3 of discarded packets of the measured traffic
flow and the timestamp T3 carried in the packet loss file. If the
timestamp T3 is between the timestamps T1 and T2, then the PLR of
the measured traffic flow for the current measurement period
(T2-T1) is calculated as:</t>

	 <t>PLR = N3/(N2-N1)          (1) </t>
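	 <t>Equation (1) can be sketched in Python as follows. The function name
and the treatment of the period boundaries (a loss at exactly T1 is excluded,
a loss at exactly T2 is included) are assumptions of this sketch; the text
above only requires that T3 lie between T1 and T2.</t>
	 <figure><artwork><![CDATA[
```python
def plr(n1, t1, n2, t2, n3, t3):
    """PLR for the period (t1, t2] per equation (1).

    n1, n2: received-packet counter values at t1 and t2.
    n3: discarded packets of the flow, timestamped t3.
    Returns None if t3 is outside the period or if no
    packets were received during it.
    """
    if not (t1 < t3 <= t2):
        return None
    received = n2 - n1
    if received <= 0:
        return None
    return n3 / received
```
]]></artwork></figure>
	 <t>For instance, with N1 = 1,000,000, N2 = 1,500,000, and N3 = 250, and
with T3 inside the period, the PLR is 250/500,000 = 0.0005, i.e., 0.05%.</t>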
	 </section>
     </section>
    </section>
     
	<section title="Functional Requirements for Real-time Monitoring of Packet Loss Caused by Congestion">
	 <t>In summary, to monitor packet loss caused by congestion in real time and obtain accurate packet loss ratio results, 
	 the proposed architectural framework needs to meet the following functional requirements.</t>

	 <t>[REQ-1] The network device is REQUIRED to support detecting packet loss caused by congestion
	 at least once every millisecond.</t>

	 <t>[REQ-2] The network device is REQUIRED to report packet loss events in real time, i.e., immediately upon detection,
	 and the reported telemetry data is REQUIRED to carry the timestamp of the packet loss occurrence, the number of discarded packets, and the packet loss location, such as device ID, port ID, and queue ID.</t>

	 <t>[REQ-3] The network device is REQUIRED to support subscriptions to periodic updates, e.g., to collect the
statistical data of the monitored traffic flows and send the
corresponding telemetry data to the collection and analysis system periodically.
	 The subscription period shall be configurable as part of the subscription request.
	 For periodic subscriptions, the network device is RECOMMENDED to support redundancy suppression,
	 where a telemetry update should not be generated unless the value of the subscribed data objects has changed.</t>

	 <t>[REQ-4] The network device is REQUIRED to support subscriptions to on-change updates, i.e., updates sent whenever the values of the subscribed data objects change.
	 For example, a telemetry update is sent
   immediately when the queue overflow packet loss counter value changes.
	 For on-change subscriptions, the network device is REQUIRED to support a dampening period that needs to pass before subsequent on-change updates are sent.
	 The dampening period should be configurable as part of the subscription request.</t>

	 <t>[REQ-5] The network device is REQUIRED to cache all packets discarded because of queue overflow.
	 For the purpose of packet loss statistics and analysis, the network device is REQUIRED to record the time of the packet loss occurrence,
	 the number of discarded packets, and the packet loss location, such as device ID, port ID, and queue ID.
	 To reduce cache capacity, it is RECOMMENDED to truncate
   discarded packets to a fixed length (e.g., the first 64 bytes).</t>

	<t>[REQ-6] The network device is REQUIRED to upload all discarded packets as a file or compressed file in real time.</t>
   
	<t>[REQ-7] The network device is REQUIRED to support time synchronization for measuring the packet loss ratio caused by congestion, and the time synchronization precision SHOULD be 50 ms or better.</t>

	<t>[REQ-8] The collection and analysis system is REQUIRED to support parsing the headers of all discarded packets to determine the flow attributes of every discarded packet
	and counting the number of discarded packets of each traffic flow in real time.</t>

	<t>[REQ-9] The collection and analysis system is REQUIRED to support periodic measurement of the PLR as the total number of discarded packets divided by the total number of sent packets.
	It is also REQUIRED to support periodic measurement of the PLR for specified user traffic as the number of discarded packets divided by the number of sent packets of that traffic.</t>

	<t>[REQ-10] The collection and analysis system is REQUIRED to support visualization of the analysis of discarded packets in the form of tables and figures that operators can easily understand.</t>
	</section>

	<section title="Use Cases">

	 <t>In this section we consider three typical application scenarios to demonstrate the advantages of the proposed architectural framework for real-time
monitoring of packet loss caused by congestion.</t>

	 <section title="Detecting microbursts in Real Time">
	 <t>The real-time packet loss detection module uses built-in dedicated
hardware to read the queue-overflow packet loss counter of every port at millisecond intervals and to record the time and location of
each loss occurrence. Once a loss counter value changes, the
packet loss telemetry data is reported to the collection
and analysis system, which becomes aware of the loss immediately. Based
on the packet loss statistics collected, the operator (through some
on-line smart analytical tool) can correlate the number of
discarded packets with the time of loss occurrence, and thus
determine whether long-lived or short-lived congestion
causes the packet loss. For instance, if the number of lost
packets keeps increasing for only a very short time (e.g., a few milliseconds
to tens of milliseconds), a microburst is the likely cause of the
packet loss.</t>
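	 <t>As an illustration only, the correlation step described above could be sketched as follows. The sketch assumes that the telemetry for one port arrives as time-ordered (time in milliseconds, cumulative drop counter) samples. The names LossEpisode, find_episodes, and MICROBURST_MAX_MS, as well as the 100 ms cut-off, are hypothetical and are not defined by this document. Counter wrap-around is not handled.</t>
	 <figure><artwork><![CDATA[
```python
from dataclasses import dataclass

# Assumed cut-off: the text only says "a few milliseconds to
# tens of milliseconds", so 100 ms is an illustrative value.
MICROBURST_MAX_MS = 100


@dataclass
class LossEpisode:
    start_ms: int   # last sample time before the counter rose
    end_ms: int     # last sample time at which it still rose
    dropped: int    # packets dropped during the episode

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    @property
    def kind(self) -> str:
        if self.duration_ms <= MICROBURST_MAX_MS:
            return "short-lived"
        return "long-lived"


def find_episodes(samples):
    """Group consecutive counter increases into loss episodes.

    samples: time-ordered list of (time_ms, counter) pairs for
    one port, where counter is the cumulative drop count.
    """
    episodes = []
    current = None
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = c1 - c0
        if delta > 0:
            if current is None:
                current = LossEpisode(t0, t1, delta)
            else:
                current.end_ms = t1
                current.dropped += delta
        elif current is not None:
            # Counter stopped rising: the episode is over.
            episodes.append(current)
            current = None
    if current is not None:
        episodes.append(current)
    return episodes
```
]]></artwork></figure>
	 <t>Under these assumptions, a counter that rises for only a few consecutive millisecond samples yields one "short-lived" episode, which is a candidate microburst, whereas a counter that rises continuously for hundreds of milliseconds yields a "long-lived" episode.</t>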

	 <t>At the same time, we can parse the uploaded loss files to determine
which traffic flows the discarded packets belong to and
identify which traffic flows lead to the microburst, so that we can
take action against the flows responsible. The network
operator can therefore quickly pinpoint the congested
node, which improves the efficiency of fault diagnosis and root
cause analysis. In addition, based on the congestion state and
the trend of the packet loss statistics, timely actions can be
taken, e.g., redirecting the affected traffic flows to a non-congested port
or dynamically adjusting traffic to alleviate
congestion.</t>
	 </section>

	 <section title="Congestion Evaluation">	 
	 <t>Congestion evaluation is of significant value for subsequent network planning, capacity expansion,
and optimization. The PLR is a classical indicator of
network performance, but it cannot accurately reflect the
network congestion level, because the overall network packet
loss caused by congestion is not exactly known. As
mentioned above, existing monitoring techniques are not specifically designed to monitor packet loss caused by congestion.
In the proposed scheme, the PLR caused by congestion can be accurately calculated as the
total number of discarded packets divided by the total number of packets received by the network. No probe is required.</t>
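	 <t>The calculation can be sketched as follows, purely as an illustration. The sketch assumes that the collection and analysis system holds cumulative network-wide counters of received and discarded packets, sampled at the boundaries of each measurement period. The function names and the convention of reporting 0.0 for a period without traffic are assumptions, not part of the proposed scheme.</t>
	 <figure><artwork><![CDATA[
```python
def congestion_plr(discarded: int, received: int) -> float:
    """PLR caused by congestion for one measurement period."""
    if received == 0:
        return 0.0  # assumed convention: no traffic, no loss
    return discarded / received


def periodic_plr(snapshots):
    """Per-period PLR from cumulative network-wide counters.

    snapshots: list of (received_total, discarded_total) pairs
    taken at successive period boundaries.
    """
    ratios = []
    for (r0, d0), (r1, d1) in zip(snapshots, snapshots[1:]):
        ratios.append(congestion_plr(d1 - d0, r1 - r0))
    return ratios
```
]]></artwork></figure>
	 <t>For example, if 1000 packets are received and 10 are discarded due to congestion within one period, the congestion-induced PLR for that period is 0.01.</t>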

	 <t>In addition, we can obtain the average frequency and duration of short-lived congestion occurrences across the
entire network within a day, based on which we can evaluate the degree of traffic burstiness and expand network capacity accordingly.</t>
	 </section>

	 <section title="SLO Verification of User Services">
	 <t>The PLR is also a key indicator for SLA compliance and should be verified. In the proposed scheme, by configuring packet counters
for the specified user flows received on the ingress devices
and parsing the discarded packets of those flows in real
time, we can measure tens of thousands of service traffic
flows simultaneously. Because the proposed scheme uses
a separate entity to handle packet parsing and loss statistics,
the number of concurrently measured flows is not limited by
network resources (e.g., computing, storage, or bandwidth).
Also, the data plane does not need to be modified to adapt
to different transport protocols and monitoring techniques,
as existing measurement methods require (e.g., the Alternate-Marking method defined in [RFC9343] for IPv6, [RFC9714] for MPLS, and [RFC9947] for SRv6).</t>
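	 <t>A minimal sketch of such per-flow verification is shown below, for illustration only. It assumes that the ingress packet counters are available as a mapping from a flow key to a received-packet count, and that header parsing yields one flow key per discarded packet. The flow key format, the function names, and the max_plr bound are hypothetical.</t>
	 <figure><artwork><![CDATA[
```python
from collections import Counter


def per_flow_plr(ingress_counts, discarded_flows):
    """Return {flow: PLR} for every flow with an ingress counter.

    ingress_counts: {flow_key: packets received at ingress}
    discarded_flows: iterable of flow keys, one per discarded
    packet, as parsed from the uploaded packet headers.
    """
    drops = Counter(discarded_flows)
    result = {}
    for flow, received in ingress_counts.items():
        # Assumed convention: a flow with no traffic has PLR 0.
        result[flow] = drops[flow] / received if received else 0.0
    return result


def slo_violations(plr_by_flow, max_plr):
    """Flows whose measured PLR exceeds the agreed bound."""
    return sorted(
        flow for flow, plr in plr_by_flow.items() if plr > max_plr
    )
```
]]></artwork></figure>
	 <t>Because the counting happens in the collection and analysis system, adding more measured flows in this sketch only enlarges the two mappings and leaves the data plane unchanged.</t>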
	 </section>
	</section>
	
    <section anchor="IANA" title="IANA Considerations">
	<t>This document has no IANA actions.</t>
	</section>

    <section anchor="scecurity" title="Security Considerations">
      <t>The congestion-induced loss monitoring system introduces additional traffic into the
   network. During network congestion, the monitoring system itself must not exacerbate the situation.
   Mechanisms such as rate limiting and traffic prioritization for congestion-related monitoring data
   should be considered. In addition, appropriate defense measures against Distributed Denial of Service (DDoS) attacks are necessary to protect the data plane and the control plane.</t>
      <t>This document does not specify security mechanisms, but highlights
   that any solution must consider trust boundaries for telemetry data
   subscriptions and telemetry data reporting, as well as the protection
   of potentially sensitive operational data.  These aspects are
   expected to be addressed by solution proposals based on deployment
   requirements and threat models.</t>
   
    </section>
  </middle>

  <back>
    <references title="Normative References">    
	 <?rfc include="reference.RFC.2119.xml"?>

      <?rfc include="reference.RFC.8126.xml"?>

      <?rfc include="reference.RFC.8174.xml"?>

    </references>

    <references title="Informative References">     

      <?rfc include="reference.RFC.1350.xml"?>

      <?rfc include="reference.RFC.9293.xml"?>
      
	 <?rfc include="reference.RFC.9343.xml"?>	     

      <?rfc include="reference.RFC.9714.xml"?>

      <?rfc include="reference.RFC.9743.xml"?>

      <?rfc include="reference.RFC.9947.xml"?>

      <?rfc include="reference.I-D.he-ippm-congestion-loss-monitoring-problem.xml"?>

	 <reference anchor="3GPP_TS_22.261" target="https://www.3gpp.org/ftp/specs/archive/22_series/22.261">
       <front>
        <title>Service requirements for the 5G system; Stage 1 (Release 18)</title>
        <author>
         <organization>3GPP</organization>
        </author>
        <date year="2024"/>
       </front>
      </reference>

	 <reference anchor="Adithya_Gangidi24" target="https://doi.org/10.1145/3651890.3672233">
       <front>
        <title>RDMA over Ethernet for Distributed AI Training at Meta Scale</title>
        <author initials="A." surname="Gangidi">
         <organization/>
        </author>
        <author initials="R." surname="Miao">
         <organization/>
        </author>
        <author initials="S." surname="Zheng">
         <organization/>
        </author>        
        <date year="2024"/>
       </front>
       <seriesInfo name="In ACM SIGCOMM 2024 Conference" value=""/>
      </reference>

	 <reference anchor="Kun_Qian24" target="https://doi.org/10.1145/3651890.3672265">
       <front>
        <title>Alibaba HPN: A Data Center Network for Large Language Model Training</title>
        <author initials="K." surname="Qian">
         <organization/>
        </author>
        <author initials="Q." surname="Xi">
         <organization/>
        </author>
        <author initials="J." surname="Cao">
         <organization/>
        </author>        
        <date year="2024"/>
       </front>
       <seriesInfo name="In ACM SIGCOMM 2024 Conference" value=""/>
      </reference>

	 <reference anchor="Microburst" target="https://support.huawei.com/enterprise/en/doc/">
       <front>
        <title>What is a Microburst? How to Detect a Microburst</title>
        <author>
         <organization>Huawei Technologies Co., Ltd</organization>
        </author>
        <date month="November" year="2020"/>
       </front>
      </reference>

      <reference anchor="Shuhei_Yoshida21" target="https://doi.org/10.1364/JOCN.422859">
       <front>
        <title>FPGA-based network microburst analysis system with efficient packet capturing</title>
        <author initials="S." surname="Yoshida">
         <organization/>
        </author>
        <author initials="Y." surname="Ukon">
         <organization/>
        </author>
        <author initials="S." surname="Ohteru">
         <organization/>
        </author> 
        <date month="October" year="2021"/>
       </front>
       <seriesInfo name="Journal of Optical Communications and Networking" value="DOI 10.1364/JOCN.422859"/>
      </reference>

      <reference anchor="Xiaoming_He25" target="https://doi.org/10.1109/TNSM.2025.3578056">
       <front>
        <title>Framework for Real-Time Monitoring of Packet Loss Caused by Network Congestion</title>
        <author initials="X." surname="He">
         <organization/>
        </author>
        <author initials="Z." surname="He">
         <organization/>
        </author>
        <author initials="W." surname="Li">
         <organization/>
        </author> 
        <date month="December" year="2025"/>
       </front>
       <seriesInfo name="IEEE Transactions on Network and Service Management" value="DOI 10.1109/TNSM.2025.3578056"/>
      </reference>  
      
    </references>
  </back>
</rfc>