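Data centers are the core infrastructure for big data, cloud computing, artificial intelligence, and large-model training, and data center networking and systems has been one of the most important research areas in computer networking over the past 15 years. This course is built around the latest research results and top papers in data center networking and systems. It divides the field bottom-up into layers: network infrastructure, routing and load balancing, congestion control, flow scheduling, and networked systems and applications. For each layer, about 10 representative papers are studied. Students present and debate the papers in class, and the final assessment takes the form of a research report or a paper submission.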
Data centers are being built around the world to serve as the infrastructure for big data analytics (e.g., MapReduce, Spark, and Dryad), machine learning and AI frameworks (e.g., TensorFlow, PyTorch, and MXNet), and cloud computing services (e.g., Amazon EC2, Microsoft Azure, and Google App Engine). The goal of this course is to study the key technologies and new challenges in data center networking and systems. The course includes paper presentations, discussions, and projects. Papers are selected from top networking and systems conferences and organized bottom-up: network infrastructure, routing and load balancing, congestion control, flow scheduling, and networked systems and applications.
Network infrastructure
- A Scalable, Commodity Data Center Network Architecture, SIGCOMM 2008
- VL2: A Scalable and Flexible Data Center Network, SIGCOMM 2009
- BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers, SIGCOMM 2009
- PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric, SIGCOMM 2009
- Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers, SIGCOMM 2010
- c-Through: Part-time Optics in Data Centers, SIGCOMM 2010
- OSA: An Optical Switching Architecture for Data Center Networks with Unprecedented Flexibility, NSDI 2012
- Integrating Microsecond Circuit Switching into the Data Center, SIGCOMM 2013
- Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network, SIGCOMM 2015
- Enabling Wide-spread Communications on Optical Fabric with MegaSwitch, NSDI 2017
- Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-defined Networking, SIGCOMM 2022
Routing and load-balancing
- Hedera: Dynamic Flow Scheduling for Data Center Networks, NSDI 2010
- Improving Datacenter Performance and Robustness with Multipath TCP, SIGCOMM 2011
- Distributed Congestion-Aware Load Balancing for Datacenters, SIGCOMM 2014
- Explicit Path Control in Commodity Data Centers: Design and Applications, NSDI 2015
- Resilient Datacenter Load Balancing in the Wild, SIGCOMM 2017
- Multi-Path Transport for RDMA in Datacenters, NSDI 2018
- Network Load Balancing with In-network Reordering Support for RDMA, SIGCOMM 2023
Congestion control
- Data Center TCP (DCTCP), SIGCOMM 2010
- Deadline-Aware Datacenter TCP (D2TCP), SIGCOMM 2012
- Congestion Control for Large-Scale RDMA Deployments, SIGCOMM 2015
- TIMELY: RTT-based Congestion Control for the Datacenter, SIGCOMM 2015
- RDMA over Commodity Ethernet at Scale, SIGCOMM 2016
- Credit-Scheduled Delay-Bounded Congestion Control for Datacenters, SIGCOMM 2017
- Re-architecting datacenter networks and stacks for low latency and high performance, SIGCOMM 2017
- Revisiting Network Support for RDMA, SIGCOMM 2018
- Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities, SIGCOMM 2018
- HPCC: High Precision Congestion Control, SIGCOMM 2019
- Swift: Delay is Simple and Effective for Congestion Control in the Datacenter, SIGCOMM 2020
- 1RMA: Re-envisioning Remote Memory Access for Multi-tenant Datacenters, SIGCOMM 2020
- Host Congestion Control, SIGCOMM 2023
Flow scheduling
- Better Never than Late: Meeting Deadlines in Datacenter Networks, SIGCOMM 2011
- Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012
- pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013
- Decentralized Task-aware Scheduling for Data Center Networks, SIGCOMM 2014
- Efficient Coflow Scheduling with Varys, SIGCOMM 2014
- Information-Agnostic Flow Scheduling for Commodity Data Centers, NSDI 2015
- Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM 2015
- CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM 2016
- Scheduling Mix-flows in Commodity Datacenters with Karuna, SIGCOMM 2016
- AuTO: Scaling Deep Reinforcement Learning to Enable Datacenter-Scale Automatic Traffic Optimization, SIGCOMM 2018
- Programmable Packet Scheduling with a Single Queue, SIGCOMM 2021
Networked systems, applications, and beyond
- Using RDMA Efficiently for Key-Value Services, SIGCOMM 2014
- FaRM: Fast Remote Memory, NSDI 2014
- KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC, SOSP 2017
- Efficient Memory Disaggregation with Infiniswap, NSDI 2017
- Fast RDMA-based Ordered Key-Value Store using Remote Learned Cache, OSDI 2020
- FileMR: Rethinking RDMA Networking for Scalable Persistent Memory, NSDI 2020
- ATP: In-network Aggregation for Multi-tenant Learning, NSDI 2021
- Scaling Distributed Machine Learning with In-Network Aggregation, NSDI 2021
- TopoOpt: Optimizing the Network Topology for Distributed DNN Training, NSDI 2023
- Lightwave Fabrics: At-Scale Optical Circuit Switching for Datacenter and Machine Learning Systems, SIGCOMM 2023