Networking                                                 H. Huang, Ed.
Internet-Draft                                                    Huawei
Intended status: Standards Track                                   T. He
Expires: 5 September 2024                                   China Unicom
                                                                 T. Zhou
                                                                  Huawei
                                                            4 March 2024


   Use Cases and Requirements for Implementing Lossless Techniques in
                           Wide Area Networks
                  draft-huang-rtgwg-wan-lossless-uc-00

Abstract

   This document outlines the use cases and requirements for
   implementing lossless data transmission techniques in Wide Area
   Networks (WANs), motivated by the increasing demand for
   high-bandwidth and reliable data transport in applications such as
   high-performance computing (HPC), genetic sequencing, and audio/video
   production.  The challenges associated with existing data transport
   protocols in WAN environments are discussed, along with the proposal
   of requirements for enhancing lossless transmission capabilities to
   support emerging data-intensive applications.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 5 September 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

Huang, et al.           Expires 5 September 2024                [Page 1]

Internet-Draft   Lossless WAN Use Cases and Requirements      March 2024

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Use Cases
     2.1.  High-Performance Computing (HPC) Services for Scientific
           Research
     2.2.  Rapid Transmission Services for Genetic Sequencing for
           Timely Medical Services
     2.3.  Stable Transmission Services for Large-Scale Audio/Video
           Data Migration
   3.  Problem Analysis and Goal
     3.1.  Problem Analysis
       3.1.1.  Impact of Packet Loss
     3.2.  Goal
   4.  Challenges and Requirements
   5.  Security Considerations
   6.  IANA Considerations
   7.  Informative References
   Appendix A.  Appendix-title
     A.1.  Appendix-subtitle
   Acknowledgements
   Contributors
   Authors' Addresses

1.  Introduction

   Big data is the foundation of innovation across numerous fields.
   From high-performance computing (HPC) in scientific research to the
   latest advancements in genetic sequencing and the production of
   high-definition multimedia content, the need for rapid, reliable,
   and lossless data transmission across wide area networks (WANs) has
   never been more critical.  Traditional network protocols, designed
   before data volumes of this size existed, struggle to keep up,
   particularly when it comes to ensuring zero data loss over long
   distances.

   This document focuses on the pressing need for lossless data
   transmission techniques in WANs, driven by the requirements of
   data-intensive applications that form the backbone of the
   scientific, medical, and creative industries.  For example, the
   Energy Sciences Network (ESnet) [ESnet] moves vast amounts of
   scientific data that underpin groundbreaking research.  Similarly,
   in the healthcare sector, the explosion of data from genetic
   sequencing calls for unprecedented levels of data transmission
   reliability and efficiency.  The media and entertainment industry
   also faces the challenge of moving large volumes of raw content over
   a stable network rather than manually transporting physical storage
   media.

   These scenarios underscore a growing disconnect between the
   capabilities of existing WAN protocols and the evolving demands of
   modern applications.  The difficulty of ensuring zero-loss
   transmission over an infrastructure not originally designed for it
   highlights the need for new solutions.

   This document aims to shed light on the necessity for advanced
   lossless transmission technologies in WANs.  By identifying the
   limitations of current network protocols and outlining the
   requirements for new developments, we hope to pave the way for a new
   generation of WANs.
   These networks will not only meet the current demands of
   data-intensive applications but will also support the next wave of
   digital innovation.

2.  Use Cases

   The necessity for implementing lossless data transmission techniques
   in Wide Area Networks (WANs) is underscored by several critical
   application areas.  These use cases show why reliable, high-speed
   data transfer is needed to support the demanding requirements of
   modern data-intensive operations.

2.1.  High-Performance Computing (HPC) Services for Scientific Research

   High-Performance Computing (HPC) services are fundamental to
   scientific advancement, where collaboration across geographical
   regions is commonplace.  For instance, the study of PSII proteins,
   which are crucial for understanding how water molecules split to
   produce oxygen, generates between 30 and 120 high-resolution images
   per second during experiments.  This results in 60-100 GB of data
   every five minutes, necessitating rapid and lossless data transfer
   from the National Renewable Energy Laboratory's equipment back to
   analysis labs such as the Lawrence Berkeley National Laboratory.
   The efficiency and reliability of WANs in this context are not just
   beneficial but essential: they allow scientists in different domains
   to share and analyze large datasets effectively.

2.2.  Rapid Transmission Services for Genetic Sequencing for Timely
      Medical Services

   The field of genetic sequencing has seen exponential growth, driven
   by the decreasing cost and widespread application of sequencing
   technologies.  This growth is matched by the data volumes generated,
   which require efficient and lossless transmission to cloud or
   private data centers for analysis.
   For example, sequencing a single human genome produces 100 GB to
   200 GB of data.
   With daily data production rates reaching 6 TB to 12 TB and annual
   data management needs surpassing 1.6 PB, the demand for high-speed,
   reliable data transfer is evident.  Existing network transfer
   efficiency is a significant bottleneck: it extends the turnaround
   time of sequencing services and delays the delivery of precision
   medicine.

2.3.  Stable Transmission Services for Large-Scale Audio/Video Data
      Migration

   The competitive landscape of the audio and video industry, coupled
   with the shift towards cloud-based post-production, necessitates the
   transfer of large volumes of raw footage across WANs.  Traditional
   methods of data transportation, involving physical media and manual
   transfer, are time-consuming and inefficient.  For instance, film
   crews generating 2 TB of data daily resort to physically moving
   storage media to processing locations, a process that significantly
   delays the production cycle and weakens market responsiveness.  A
   network infrastructure that can carry such transfers quickly and
   without loss is critical for maintaining the pace of production and
   ensuring the quality of the final multimedia content.

3.  Problem Analysis and Goal

3.1.  Problem Analysis

   The primary objective of Wide Area Networks (WANs) is to provide
   long-term, stable, and high-capacity network services that can
   accommodate the sudden surges in transmission demand that accompany
   data migration across geographical locations.  This goal is
   predicated on leveraging the inherent statistical multiplexing
   advantage of IP networks, which allows for cost-effective bandwidth
   allocation and enhanced overall network throughput.  Meeting these
   data transmission requirements efficiently is crucial for supporting
   today's data-driven applications, ranging from scientific research
   to global financial transactions and multimedia content delivery.

   Despite the advantages of statistical multiplexing in IP networks,
   such as cost reduction and throughput optimization, this model makes
   it difficult to guarantee resources absolutely and, consequently, to
   guarantee zero packet loss.  The practice of overprovisioning
   bandwidth, common among service providers, does not equate to
   lossless data transmission.  This is a critical shortfall when
   compared to dedicated optical networks or resources with hard
   isolation.

3.1.1.  Impact of Packet Loss

   In the data migration scenarios outlined above (high-performance
   computing services, genetic sequencing, and audio/video data
   migration), applications commonly rely on traditional transport
   protocols such as TCP or RDMA [RoCEv2].  However, both are adversely
   affected by packet loss, especially over long-haul transmissions.

   For TCP, the impact depends on the congestion control algorithm.
   With CUBIC, a loss-based congestion control algorithm, throughput
   declines by up to 89.9% at just 2% packet loss when the Round-Trip
   Time (RTT) is 30 ms.  BBR, a TCP congestion control algorithm based
   on bandwidth and delay, also suffers significantly when packet loss
   exceeds 5%, with throughput plummeting when packet loss reaches 20%.
   BBR's retransmission cost under these conditions is also notably
   high: with slight packet loss (<1%) its retransmission rate is 6 to
   10 times that of CUBIC, and under severe packet loss the rate can
   increase exponentially.

   RDMA, often used within data centers for inter-node data access over
   UDP, relies on a Go-Back-N retransmission mechanism.  Its throughput
   decreases dramatically at packet loss rates greater than 0.1%, and a
   2% packet loss rate effectively reduces throughput to zero.  To
   keep throughput unaffected, the packet loss rate must be kept below
   one in a hundred thousand.
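
   The sensitivity of Go-Back-N to loss over long paths can be
   illustrated with the textbook efficiency model for Go-Back-N ARQ,
   in which every lost packet forces retransmission of a full window of
   N packets, giving an efficiency of (1 - p) / (1 + (N - 1) * p) for a
   loss rate p.  The sketch below is only an illustration under that
   simplified model: it is not the source of the measurements quoted
   above, and the link speed, RTT, packet size, and function names are
   assumptions chosen for the example.

```python
import math


def gbn_efficiency(loss_rate: float, window_packets: int) -> float:
    """Textbook Go-Back-N efficiency: (1 - p) / (1 + (N - 1) * p).

    Assumes independent losses and that each loss causes the whole
    window of N packets to be resent (a simplification).
    """
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    if window_packets < 1:
        raise ValueError("window_packets must be >= 1")
    return (1.0 - loss_rate) / (1.0 + (window_packets - 1) * loss_rate)


def packets_in_flight(link_bps: float, rtt_s: float, packet_bytes: int) -> int:
    """Bandwidth-delay product expressed in packets (at least 1)."""
    return max(1, math.ceil(link_bps * rtt_s / (8 * packet_bytes)))


if __name__ == "__main__":
    # Hypothetical WAN path: 10 Gbit/s, 30 ms RTT, 4096-byte packets.
    n = packets_in_flight(10e9, 0.030, 4096)
    for p in (1e-6, 1e-5, 1e-3, 2e-2):
        eff = gbn_efficiency(p, n)
        print(f"loss={p:.0e}  window={n}  efficiency={eff:.4f}")
```

   Because the window N grows with the bandwidth-delay product, the
   same loss rate costs far more on a long-RTT WAN path than inside a
   data center, which is why the model is parameterized by RTT rather
   than by a fixed window.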

   These challenges underscore a critical gap in the ability of current
   IP networks to support the demanding requirements of modern,
   data-intensive applications.  The inability to ensure zero packet
   loss across WANs not only impacts application performance but also
   limits the potential for innovation and collaboration across key
   sectors reliant on rapid and reliable data transmission.

3.2.  Goal

   The overarching goal in evolving Wide Area Networks (WANs) to serve
   the aforementioned use cases is to enable lossless
   (zero-packet-loss) transmission services tailored for the migration
   of data across different geographical areas.  As the volume,
   velocity, and variety of digital data expand exponentially, ensuring
   lossless transmission during inter-regional migration becomes
   indispensable.  This is critically important for applications and
   operations that rely on the integrity and timeliness of data, such
   as AI and HPC workloads and data backup and recovery.

4.  Challenges and Requirements

   Lossless data transmission in Wide Area Networks (WANs) faces
   significant challenges, notably elephant flows: large, bursty data
   transfers that can cause instantaneous congestion and packet loss
   within network device queues.  This not only increases application
   latency but also diminishes throughput, adversely affecting
   application performance.  In data centers, certain lossless
   technologies are deployed to enhance the performance of such
   applications:

   *  *Priority-based Flow Control (PFC)*: Widely adopted for its
      ability to manage traffic flow, PFC [PFC] works by halting the
      transmission of specific queues when downstream congestion is
      detected, thereby achieving zero packet loss.
      The underlying Ethernet flow control mechanism, defined by IEEE
      802.3x, has a receiving device send a pause frame to a sending
      device to temporarily halt traffic, allowing time for congestion
      to clear before transmission resumes.

   *  *Explicit Congestion Notification (ECN) with Data Center
      Quantized Congestion Notification (DCQCN)*: DCQCN [DCQCN], the
      most extensively used congestion control algorithm in RDMA
      networks, requires network devices to support ECN functionality
      [RFC3168], with the other protocol functions implemented on the
      network card of the host machine.  DCQCN sustains high throughput
      in RDMA networks that need zero packet loss: congested nodes set
      ECN marks on packets, the receiver notifies the sender of the
      congestion, and the sender reduces its sending rate.

   However, applying these data-center-oriented lossless techniques to
   WANs encounters obstacles due to the larger scale and longer RTTs
   inherent in WAN environments.  The resulting challenges and
   corresponding requirements include:

   *  *Backpressure from PFC*: The widespread application of PFC in
      large-scale networks can lead to head-of-line blocking,
      deadlocks, and congestion spreading, which degrade network
      throughput.  These problems make traditional PFC backpressure
      mechanisms poorly suited to the high stability demands of WANs,
      necessitating innovation in protocol design to alleviate issues
      like deadlocks and PFC storms.

      *Requirement 1*: Innovate and improve upon the PFC backpressure
      mechanism for WANs, addressing and mitigating the risk of
      deadlocks and congestion spreading to ensure stable and lossless
      data transmission.

   *  *ECN-Based Congestion Control Limitations*: While ECN facilitates
      sender rate control through network collaboration, its
      effectiveness diminishes over the longer distances typical of
      WANs.
      Congestion notifications arrive late, which lengthens the control
      loop and makes it hard to relieve congestion quickly.

      *Requirement 2*: Optimize the ECN control loop for WANs,
      enhancing the network's ability to manage congestion through
      improved routing and control strategies, thereby ensuring
      efficient and lossless transmission across vast geographical
      distances.

   These challenges underscore the need for tailored solutions that
   address the unique demands and conditions of WANs.  By adapting and
   innovating on existing lossless transmission technologies from data
   center networks, the goal of achieving zero packet loss in WANs
   becomes attainable, paving the way for enhanced data mobility and
   application performance.

5.  Security Considerations

   TBD.

6.  IANA Considerations

   TBD.

7.  Informative References

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RoCEv2]   "Supplement to InfiniBand architecture specification
              volume 1 release 1.2.2 annex A17 - RoCEv2 (IP routable
              RoCE)", n.d.

   [DCQCN]    Y. Z., et al., "Congestion Control for Large-Scale RDMA
              Deployments", August 2015.

   [PFC]      "IEEE Standard for Local and metropolitan area networks--
              Media Access Control (MAC) Bridges and Virtual Bridged
              Local Area Networks--Amendment 17: Priority-based Flow
              Control", n.d.

   [ESnet]    "Energy Sciences Network", n.d.

Appendix A.  Appendix-title

A.1.  Appendix-subtitle

Acknowledgements

   TBD.

Contributors

   TBD.

Authors' Addresses

   Hongyi Huang (editor)
   Huawei
   Beijing
   China
   Email: hongyi.huang@huawei.com

   Tao He
   China Unicom
   Email: het21@chinaunicom.cn

   Tianran Zhou
   Huawei
   Email: zhoutianran@huawei.com