USENIX ATC '24 and OSDI '24
Wednesday, July 10

7:30am PDT

Badge Pickup
Wednesday July 10, 2024 7:30am - 5:00pm PDT
Lobby West

8:00am PDT

Continental Breakfast
Wednesday July 10, 2024 8:00am - 9:00am PDT
Grand Ballroom Foyer

9:00am PDT

Joint Keynote Address: Scaling AI Sustainably: An Uncharted Territory
Wednesday July 10, 2024 9:00am - 10:00am PDT
Carole-Jean Wu, Meta

The past 50 years have seen a dramatic increase in the amount of compute per person, in particular compute enabled by AI. Despite the positive societal benefits, AI technologies come with significant environmental implications. I will talk about the scaling trend and the operational carbon footprint of AI computing by examining the model development cycle, spanning data, algorithms, and system hardware. At the same time, we will consider the life cycle of system hardware from the perspective of hardware architectures and manufacturing technologies. I will highlight key efficiency optimization opportunities for cutting-edge AI technologies, from deep learning recommendation models to multi-modal generative AI tasks. To scale AI sustainably, we need to make AI and computing more broadly efficient and flexible. We must also go beyond efficiency and optimize across the life cycle of computing infrastructures, from hardware manufacturing to datacenter operation and end-of-life processing for the hardware. Based on industry experience and lessons learned, my talk will conclude with important development and research directions to advance the field of computing in an environmentally responsible and sustainable manner.

https://www.usenix.org/conference/atc24/presentation/wu-joint-keynote
Grand Ballroom ABGH

10:00am PDT

Break with Refreshments
Wednesday July 10, 2024 10:00am - 10:30am PDT
Grand Ballroom Foyer

10:30am PDT

OSDI ’24 Opening Remarks and Awards
Wednesday July 10, 2024 10:30am - 10:45am PDT
Program Co-Chairs: Ada Gavrilovska, Georgia Institute of Technology; Douglas B. Terry, Amazon Web Services
Grand Ballroom ABGH

10:30am PDT

USENIX ATC ’24 Opening Remarks, Awards, and Presentation of the 2024 USENIX Lifetime Achievement (Flame) Award
Wednesday July 10, 2024 10:30am - 10:45am PDT
Program Co-Chairs: Saurabh Bagchi, Purdue University; Yiying Zhang, University of California, San Diego
Grand Ballroom CD

10:45am PDT

Sabre: Hardware-Accelerated Snapshot Compression for Serverless MicroVMs
Wednesday July 10, 2024 10:45am - 11:05am PDT
Nikita Lazarev and Varun Gohil, MIT, CSAIL; James Tsai, Andy Anderson, and Bhushan Chitlur, Intel Labs; Zhiru Zhang, Cornell University; Christina Delimitrou, MIT, CSAIL

MicroVM snapshotting significantly reduces the cold start overheads in serverless applications. Snapshotting enables storing part of the physical memory of a microVM guest into a file, and later restoring from it to avoid long cold start-up times. Prefetching memory pages from snapshots can further improve the effectiveness of snapshotting. However, the efficacy of prefetching depends on the size of the memory that needs to be restored. Lossless page compression is therefore a great way to improve the coverage of the memory footprint that snapshotting with prefetching achieves. Unfortunately, the high overhead and high CPU cost of software-based (de)compression make this impractical. We introduce Sabre, a novel approach to snapshot page prefetching based on hardware-accelerated (de)compression. Sabre leverages an increasingly pervasive near-memory analytics accelerator available in modern datacenter processors. We show that by appropriately leveraging such accelerators, microVM snapshots of serverless applications can be compressed up to a factor of 4.5×, with nearly negligible decompression costs. We use this insight to build an efficient page prefetching library capable of speeding up memory restoration from snapshots by up to 55%. We integrate the library with production-grade Firecracker microVMs and evaluate its end-to-end performance on a wide set of serverless applications.
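
For illustration only, here is a minimal Python sketch of the compressed-snapshot idea described above: each guest page is compressed individually so that only the pages needed during restore are decompressed. zlib stands in for the near-memory (de)compression accelerator that Sabre targets; the page size, page contents, and function names are hypothetical and not taken from Sabre's code.

```python
# Minimal sketch of compressed-snapshot prefetching (not Sabre's actual code).
# zlib stands in for the hardware (de)compression accelerator.
import zlib

PAGE_SIZE = 4096

def compress_snapshot(pages):
    """Compress each guest memory page individually so pages can be
    restored (decompressed) on demand during prefetching."""
    return [zlib.compress(p) for p in pages]

def prefetch(compressed_pages, wanted):
    """Decompress only the pages needed to serve the restored microVM."""
    return {i: zlib.decompress(compressed_pages[i]) for i in wanted}

if __name__ == "__main__":
    # Synthetic snapshot: mostly zero pages compress extremely well.
    snapshot = [bytes(PAGE_SIZE) for _ in range(256)]
    compressed = compress_snapshot(snapshot)
    ratio = sum(map(len, snapshot)) / sum(map(len, compressed))
    restored = prefetch(compressed, wanted=[0, 7, 42])
    print(f"compression ratio ~{ratio:.1f}x, restored {len(restored)} pages")
```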

https://www.usenix.org/conference/osdi24/presentation/lazarev
Grand Ballroom ABGH

10:45am PDT

Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu
Wednesday July 10, 2024 10:45am - 11:10am PDT
Qingyuan Liu, Yanning Yang, Dong Du, and Yubin Xia, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education; Ping Zhang and Jia Feng, Huawei Cloud; James R. Larus, EPFL; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education; Key Laboratory of System Software (Chinese Academy of Science)

Current serverless platforms struggle to optimize resource utilization due to their dynamic and fine-grained nature. Conventional techniques like overcommitment and autoscaling fall short, often sacrificing utilization for practicability or incurring performance trade-offs. Overcommitment requires predicting performance to prevent QoS violations, introducing a trade-off between prediction accuracy and overheads. Autoscaling requires scaling instances quickly in response to load fluctuations to reduce resource wastage, but more frequent scaling also leads to more cold start overheads. This paper introduces Jiagu to harmonize efficiency with practicability through two novel techniques. First, pre-decision scheduling achieves accurate prediction while eliminating overheads by decoupling prediction and scheduling. Second, dual-staged scaling achieves frequent adjustment of instances with minimal overhead. We have implemented a prototype and evaluated it using real-world applications and traces from a public cloud platform. Our evaluation shows a 54.8% improvement in deployment density over commercial clouds (with Kubernetes) while maintaining QoS, as well as 81.0%–93.7% lower scheduling costs and a 57.4%–69.3% reduction in cold start latency compared to existing QoS-aware schedulers.

https://www.usenix.org/conference/atc24/presentation/liu-qingyuan
Grand Ballroom CD

10:45am PDT

Power-aware Deep Learning Model Serving with μ-Serve
Wednesday July 10, 2024 10:45am - 11:10am PDT
Haoran Qiu, Weichao Mao, Archit Patke, and Shengkun Cui, University of Illinois Urbana-Champaign; Saurabh Jha, Chen Wang, and Hubertus Franke, IBM Research; Zbigniew Kalbarczyk, Tamer Başar, and Ravishankar K. Iyer, University of Illinois Urbana-Champaign

With the increasing popularity of large deep learning model-serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while satisfying throughput and model-serving latency requirements. Model multiplexing approaches such as model parallelism, model placement, replication, and batching aim to optimize model-serving performance. However, they fall short of leveraging GPU frequency scaling opportunities for power saving. In this paper, we demonstrate (1) the benefits of GPU frequency scaling in power saving for model serving; and (2) the necessity of co-designing and optimizing fine-grained model multiplexing and GPU frequency scaling. We explore the co-design space and present a novel power-aware model-serving system, µ-Serve. µ-Serve is a model-serving framework that optimizes the power consumption and model-serving latency/throughput of serving multiple ML models efficiently in a homogeneous GPU cluster. Evaluation results on production workloads show that µ-Serve achieves 1.2–2.6× power saving by dynamic GPU frequency scaling (up to 61% reduction) without SLO attainment violations.
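
As a rough illustration of the frequency-scaling side of the co-design, the sketch below picks the lowest-power GPU frequency whose predicted latency still meets the SLO. The candidate frequencies, latency estimates, and power numbers are made-up assumptions; µ-Serve's actual optimization is considerably more involved.

```python
# Illustrative sketch of power-aware frequency selection (not the µ-Serve
# implementation). A real system would estimate latency and power online
# per model and batch size.
def pick_frequency(candidates, slo_ms):
    """candidates: list of (freq_mhz, predicted_latency_ms, power_watts).
    Return the lowest-power frequency that still meets the latency SLO."""
    feasible = [c for c in candidates if c[1] <= slo_ms]
    if not feasible:
        return max(candidates, key=lambda c: c[0])  # fall back to max frequency
    return min(feasible, key=lambda c: c[2])

profile = [(1980, 18.0, 300.0), (1410, 24.0, 210.0), (1005, 38.0, 150.0)]
print(pick_frequency(profile, slo_ms=30.0))  # -> (1410, 24.0, 210.0)
```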

https://www.usenix.org/conference/atc24/presentation/qiu
Grand Ballroom EF

11:05am PDT

Nomad: Non-Exclusive Memory Tiering via Transactional Page Migration
Wednesday July 10, 2024 11:05am - 11:25am PDT
Lingfeng Xiang, Zhen Lin, Weishu Deng, Hui Lu, and Jia Rao, The University of Texas at Arlington; Yifan Yuan and Ren Wang, Intel Labs

With the advent of byte-addressable memory devices, such as CXL memory, persistent memory, and storage-class memory, tiered memory systems have become a reality. Page migration is the de facto method within operating systems for managing tiered memory. It aims to bring hot data whenever possible into fast memory to optimize the performance of data accesses while using slow memory to accommodate data spilled from fast memory. While the existing research has demonstrated the effectiveness of various optimizations on page migration, it falls short of addressing a fundamental question: Is exclusive memory tiering, in which a page is either present in fast memory or slow memory, but not both simultaneously, the optimal strategy for tiered memory management?

We demonstrate that page migration-based exclusive memory tiering suffers significant performance degradation when fast memory is under pressure. In this paper, we propose non-exclusive memory tiering, a page management strategy that retains a copy of pages recently promoted from slow memory to fast memory to mitigate memory thrashing. To enable non-exclusive memory tiering, we develop NOMAD, a new page management mechanism for Linux that features transactional page migration and page shadowing. NOMAD moves page migration off the critical path of program execution and makes migration completely asynchronous. Evaluations with carefully crafted micro-benchmarks and real-world applications show that NOMAD is able to achieve up to 6× performance improvement over the state-of-the-art transparent page placement (TPP) approach in Linux when under memory pressure. We also compare NOMAD with a recently proposed hardware-assisted, access sampling-based page migration approach and demonstrate NOMAD's strengths and potential weaknesses in various scenarios.
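
The toy simulation below illustrates the general idea of transactional page migration with shadowing: a page is copied off the critical path, and the promotion commits only if the page was not written during the copy, while the slow-tier copy is retained as a shadow. The classes and dirty-tracking scheme are illustrative assumptions, not NOMAD's kernel implementation.

```python
# Toy model of transactional page migration with shadowing (the general idea
# behind NOMAD, not its kernel code). A promotion copies the page
# asynchronously and commits only if the page was not written meanwhile.
class TieredMemory:
    def __init__(self):
        self.slow = {}        # page id -> data (slow tier, e.g. CXL memory)
        self.fast = {}        # page id -> data (fast tier, DRAM)
        self.in_flight = {}   # page id -> copied data
        self.dirtied = set()  # pages written since their copy started

    def write(self, page_id, data):
        tier = self.fast if page_id in self.fast else self.slow
        tier[page_id] = data
        self.dirtied.add(page_id)

    def begin_migration(self, page_id):
        self.dirtied.discard(page_id)
        self.in_flight[page_id] = self.slow[page_id]   # asynchronous copy

    def commit_migration(self, page_id):
        copy = self.in_flight.pop(page_id)
        if page_id in self.dirtied:   # page changed during the copy
            return False              # transaction aborts; retry later
        self.fast[page_id] = copy     # promote; keep slow copy as a shadow
        return True

mem = TieredMemory()
mem.slow[1] = b"hot"
mem.slow[2] = b"busy"
mem.begin_migration(1); print(mem.commit_migration(1))   # True: clean copy
mem.begin_migration(2); mem.write(2, b"dirty")
print(mem.commit_migration(2))                           # False: aborted
```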

https://www.usenix.org/conference/osdi24/presentation/xiang
Grand Ballroom ABGH

11:10am PDT

ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions
Wednesday July 10, 2024 11:10am - 11:35am PDT
Yuqi Fu, University of Virginia; Ruizhe Shi, George Mason University; Haoliang Wang, Adobe Research; Songqing Chen, George Mason University; Yue Cheng, University of Virginia

FaaS (Function-as-a-Service) workloads feature unique patterns. Serverless functions are ephemeral, highly concurrent, and bursty, with execution durations ranging from a few milliseconds to a few seconds. These workload behaviors pose new challenges to kernel scheduling. Linux CFS (Completely Fair Scheduler) is workload-oblivious and optimizes long-term fairness via proportional sharing. CFS neglects the short-term demands for CPU time from short-lived serverless functions, severely impacting the performance of short functions. Preemptive shortest job first, i.e., shortest remaining process time (SRPT), prioritizes shorter functions in order to satisfy their short-term demands for CPU time and therefore serves as a best-case baseline for optimizing the turnaround time of short functions. A significant downside of approximating SRPT, however, is that longer functions might be starved.

In this paper, we propose a novel application-aware kernel scheduler, ALPS (Adaptive Learning, Priority Scheduler), based on two key insights. First, approximating SRPT can largely benefit short functions but may inevitably penalize long functions. Second, CFS provides necessary infrastructure support to implement user-defined priority scheduling. To this end, we design ALPS to have a novel, decoupled scheduler frontend and backend architecture, which unifies approximate SRPT and proportional-share scheduling. ALPS’ frontend sits in the user space and approximates SRPT-inspired priority scheduling by adaptively learning from an SRPT simulation on a recent past workload. ALPS’ backend uses eBPF functions hooked to CFS to carry out the continuously learned policies sent from the frontend to inform scheduling decisions in the kernel. This design adds workload intelligence to workload-oblivious OS scheduling while retaining the desirable properties of OS schedulers. We evaluate ALPS extensively using two production FaaS workloads (Huawei and Azure), and results show that ALPS achieves a reduction of 57.2% in average function execution duration compared to CFS.
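
A minimal sketch of the SRPT-inspired priority idea, assuming per-function runtime histories are available: shorter expected runtimes map to higher priorities. The history format, the use of the median, and the number of priority levels are assumptions; ALPS actually learns priorities from an SRPT simulation over a recent past workload and applies them via eBPF hooks into CFS.

```python
# Simplified sketch of SRPT-inspired priority assignment (not ALPS itself).
# Shorter expected runtimes get numerically smaller (i.e., higher) priorities.
from statistics import median

def learn_priorities(history, levels=8):
    """history: dict mapping function name -> list of recent durations (ms).
    Returns a priority level per function, 0 = highest priority."""
    estimates = {fn: median(durs) for fn, durs in history.items() if durs}
    ranked = sorted(estimates, key=estimates.get)        # shortest first
    return {fn: min(i * levels // max(len(ranked), 1), levels - 1)
            for i, fn in enumerate(ranked)}

history = {"thumbnail": [12, 9, 15], "video_encode": [2400, 1900],
           "auth": [3, 4, 2], "report": [800, 650, 720]}
print(learn_priorities(history))  # auth and thumbnail get the top levels
```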

https://www.usenix.org/conference/atc24/presentation/fu
Grand Ballroom CD

11:10am PDT

Fast Inference for Probabilistic Graphical Models
Wednesday July 10, 2024 11:10am - 11:35am PDT
Jiantong Jiang, The University of Western Australia; Zeyi Wen, HKUST (Guangzhou) and HKUST; Atif Mansoor and Ajmal Mian, The University of Western Australia

Probabilistic graphical models (PGMs) have attracted much attention due to their firm theoretical foundation and inherent interpretability. However, existing PGM inference systems are inefficient and lack sufficient generality, due to issues with irregular memory accesses, high computational complexity, and modular design limitations. In this paper, we present Fast-PGM, a fast and parallel PGM inference system for importance sampling-based approximate inference algorithms. Fast-PGM incorporates careful memory management techniques to reduce memory consumption and enhance data locality. It also employs computation and parallelization optimizations to reduce computational complexity and improve overall efficiency. Furthermore, Fast-PGM offers high generality and flexibility, allowing easy integration with all the mainstream importance sampling-based algorithms. The system abstraction of Fast-PGM facilitates easy optimization, extension, and customization for users. Extensive experiments show that Fast-PGM achieves 3 to 20 times speedup over the state-of-the-art implementation. Fast-PGM's source code is freely available at https://github.com/jjiantong/FastPGM.
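
To make the class of algorithms concrete, the sketch below runs likelihood-weighted importance sampling on a toy two-variable Bayesian network. The network, probabilities, and sample count are invented for illustration and are not from Fast-PGM's benchmarks.

```python
# Likelihood weighting, the style of importance sampling that Fast-PGM
# parallelizes, shown on a toy two-node network Rain -> WetGrass.
import random

P_RAIN = 0.2
P_WET_GIVEN = {True: 0.9, False: 0.1}   # P(WetGrass=true | Rain)

def estimate_p_rain_given_wet(samples=100_000, seed=0):
    rng = random.Random(seed)
    weighted, total = 0.0, 0.0
    for _ in range(samples):
        rain = rng.random() < P_RAIN    # sample the unobserved variable
        weight = P_WET_GIVEN[rain]      # weight by the evidence likelihood
        weighted += weight * rain
        total += weight
    return weighted / total

# Exact answer: 0.9*0.2 / (0.9*0.2 + 0.1*0.8) = 0.692...
print(f"P(Rain | WetGrass) ~ {estimate_p_rain_given_wet():.3f}")
```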

https://www.usenix.org/conference/atc24/presentation/jiang
Grand Ballroom EF

11:25am PDT

Managing Memory Tiers with CXL in Virtualized Environments
Wednesday July 10, 2024 11:25am - 11:45am PDT
Yuhong Zhong, Columbia University, Microsoft Azure; Daniel S. Berger, Microsoft Azure, University of Washington; Carl Waldspurger, Carl Waldspurger Consulting; Ryan Wee, Columbia University; Ishwar Agarwal, Rajat Agarwal, Frank Hady, and Karthik Kumar, Intel; Mark D. Hill, University of Wisconsin–Madison; Mosharaf Chowdhury, University of Michigan; Asaf Cidon, Columbia University

Cloud providers seek to deploy CXL-based memory to increase aggregate memory capacity, reduce costs, and lower carbon emissions. However, CXL accesses incur higher latency than local DRAM. Existing systems use software to manage data placement across memory tiers at page granularity. Cloud providers are reluctant to deploy software-based tiering due to high overheads in virtualized environments. Hardware-based memory tiering could place data at cacheline granularity, mitigating these drawbacks. However, hardware is oblivious to application-level performance.

We propose combining hardware-managed tiering with software-managed performance isolation to overcome the pitfalls of either approach. We introduce Intel® Flat Memory Mode, the first hardware-managed tiering system for CXL. Our evaluation on a full-system prototype demonstrates that it provides performance close to regular DRAM, with no more than 5% degradation for more than 82% of workloads. Despite such small slowdowns, we identify two challenges that can still degrade performance by up to 34% for "outlier" workloads: (1) memory contention across tenants, and (2) intra-tenant contention due to conflicting access patterns.

To address these challenges, we introduce Memstrata, a lightweight multi-tenant memory allocator. Memstrata employs page coloring to eliminate inter-VM contention. It improves performance for VMs with access patterns that are sensitive to hardware tiering by allocating them more local DRAM using an online slowdown estimator. In multi-VM experiments on prototype hardware, Memstrata is able to identify performance outliers and reduce their degradation from above 30% to below 6%, providing consistent performance across a wide range of workloads.
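
A minimal sketch of the page-coloring idea used for inter-VM isolation: each VM is restricted to a disjoint slice of page colors derived from physical frame numbers. The number of colors, the color function, and the allocator interface are assumptions, not Memstrata's implementation.

```python
# Toy illustration of page coloring for inter-VM isolation (the idea used by
# Memstrata, not its allocator). Real colors come from how the hardware
# tiering indexes physical pages.
NUM_COLORS = 16

def page_color(pfn):
    return pfn % NUM_COLORS

def partition_colors(vms):
    """Give each VM a disjoint slice of the color space."""
    per_vm = NUM_COLORS // len(vms)
    return {vm: set(range(i * per_vm, (i + 1) * per_vm))
            for i, vm in enumerate(vms)}

def allocate(free_pfns, allowed_colors, count):
    """Allocate pages for a VM only from its own colors."""
    return [p for p in free_pfns if page_color(p) in allowed_colors][:count]

colors = partition_colors(["vm0", "vm1"])
print(colors["vm0"], allocate(range(64), colors["vm0"], 4))
```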

https://www.usenix.org/conference/osdi24/presentation/zhong-yuhong
Grand Ballroom ABGH

11:35am PDT

Starburst: A Cost-aware Scheduler for Hybrid Cloud
Wednesday July 10, 2024 11:35am - 12:00pm PDT
Michael Luo, Siyuan Zhuang, Suryaprakash Vengadesan, and Romil Bhardwaj, UC Berkeley; Justin Chang, UC Santa Barbara; Eric Friedman, Scott Shenker, and Ion Stoica, UC Berkeley

To efficiently tackle bursts in job demand, organizations employ hybrid cloud architectures to scale their batch workloads from their private clusters to the public cloud. This requires transforming cluster schedulers into cloud-enabled versions that navigate the tradeoff between cloud costs and scheduler objectives such as job completion time (JCT). However, our analysis of production-level traces shows that existing cloud-enabled schedulers incur inefficient cost-JCT trade-offs due to low cluster utilization.

We present Starburst, a system that maximizes cluster utilization to streamline the cost-JCT tradeoff. Starburst's scheduler dynamically controls jobs' waiting times to improve utilization: it assigns longer waits to large jobs to increase their chances of running on the cluster, and shorter waits to small jobs to increase their chances of running on the cloud. To offer configurability, Starburst provides system administrators a simple waiting budget framework to tune their position on the cost-JCT curve. In a departure from traditional cluster schedulers, Starburst operates as a higher-level resource manager over a private cluster and dynamic cloud clusters. Simulations over production-level traces and real-world experiments on a 32-GPU private cluster show that Starburst can reduce cloud costs by 54–91% over existing cluster managers while increasing average JCT by at most 5.8%.
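
The sketch below illustrates a waiting-budget style placement rule, assuming a budget that grows with a job's GPU count and runtime: jobs whose budget outlasts the cluster's expected availability wait for the private cluster, while the rest burst to the cloud. The budget formula and constants are hypothetical, not Starburst's policy.

```python
# Sketch of a waiting-budget policy in the spirit of Starburst (not its
# scheduler). The budget formula and constants are illustrative assumptions.
def waiting_budget(job_gpus, job_runtime_hr, budget_factor=0.25):
    """Larger/longer jobs get longer waits (better odds of running on the
    cluster); small jobs time out quickly and burst to the cloud."""
    return budget_factor * job_gpus * job_runtime_hr   # hours

def place(job, cluster_free_in_hr):
    budget = waiting_budget(job["gpus"], job["runtime_hr"])
    return "private-cluster" if cluster_free_in_hr <= budget else "cloud"

big = {"gpus": 32, "runtime_hr": 10}    # budget = 80 h  -> waits for cluster
small = {"gpus": 1, "runtime_hr": 0.5}  # budget = 0.125 h -> goes to cloud
print(place(big, cluster_free_in_hr=6), place(small, cluster_free_in_hr=6))
```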

https://www.usenix.org/conference/atc24/presentation/luo
Grand Ballroom CD

11:35am PDT

Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
Wednesday July 10, 2024 11:35am - 12:00pm PDT
Bin Gao, National University of Singapore; Zhuomin He, Shanghai Jiaotong University; Puru Sharma, Qingxuan Kang, and Djordje Jevdjic, National University of Singapore; Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo, Huawei Cloud

Interacting with humans through multi-turn conversations is a fundamental feature of large language models (LLMs). However, existing LLM serving engines executing multi-turn conversations are inefficient due to the need to repeatedly compute the key-value (KV) caches of historical tokens, incurring high serving costs. To address the problem, this paper proposes CachedAttention, a new attention mechanism that enables reuse of KV caches across multi-turn conversations, significantly reducing the repetitive computation overheads. CachedAttention maintains a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests. To reduce KV cache access overheads from slow mediums, CachedAttention employs layer-wise pre-loading and asynchronous saving schemes to overlap the KV cache access with the GPU computation. To ensure that the KV caches to be accessed are placed in the fastest hierarchy, CachedAttention employs scheduler-aware fetching and eviction schemes to consciously place the KV caches in different layers based on the hints from the inference job scheduler. To avoid the invalidation of the saved KV caches incurred by context window overflow, CachedAttention enables the saved KV caches to remain valid via decoupling the positional encoding and effectively truncating the KV caches. Extensive experimental results demonstrate that CachedAttention significantly decreases the time to the first token (TTFT) by up to 87%, improves the prompt prefilling throughput by up to 7.8× for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%.
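
As a rough sketch of cross-turn KV reuse, the code below persists each conversation's KV cache to a slower tier after a turn and pre-loads it on the next turn, so only the new tokens need to be prefilled. Two Python dicts stand in for the real HBM/host-memory/storage hierarchy, and the interfaces are hypothetical.

```python
# Minimal sketch of reusing KV caches across conversation turns (the idea
# behind CachedAttention, not its implementation).
class KVCacheStore:
    def __init__(self):
        self.gpu = {}    # conversation id -> KV cache for tokens seen so far
        self.host = {}   # slower tier (host memory / SSD stand-in)

    def load(self, conv_id):
        if conv_id in self.gpu:
            return self.gpu[conv_id]
        if conv_id in self.host:                  # pre-load from the slow tier
            self.gpu[conv_id] = self.host.pop(conv_id)
            return self.gpu[conv_id]
        return []

    def save(self, conv_id, kv):
        self.host[conv_id] = kv                   # asynchronous save in reality
        self.gpu.pop(conv_id, None)

def serve_turn(store, conv_id, new_tokens):
    kv = store.load(conv_id)
    prefill = list(new_tokens)                    # only new tokens need prefill
    kv = kv + prefill
    store.save(conv_id, kv)
    return len(prefill), len(kv)

store = KVCacheStore()
print(serve_turn(store, "c1", ["Hi", "there"]))        # (2, 2): full prefill
print(serve_turn(store, "c1", ["How", "are", "you"]))  # (3, 5): history reused
```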

https://www.usenix.org/conference/atc24/presentation/gao-bin-cost
Grand Ballroom EF

11:45am PDT

Harvesting Memory-bound CPU Stall Cycles in Software with MSH
Wednesday July 10, 2024 11:45am - 12:05pm PDT
Zhihong Luo, Sam Son, and Sylvia Ratnasamy, UC Berkeley; Scott Shenker, UC Berkeley & ICSI

Memory-bound stalls account for a significant portion of CPU cycles in datacenter workloads, which makes harvesting them to execute other useful work highly valuable. However, mainstream implementations of the hardware harvesting mechanism, simultaneous multithreading (SMT), are unsatisfactory. They incur high latency overhead and do not offer fine-grained configurability of the trade-off between latency and harvesting throughput, which hinders wide adoption for latency-critical services; and they support only limited degrees of concurrency, which prevents full harvesting of memory stall cycles.

We present MSH, the first system that transparently and efficiently harvests memory-bound stall cycles in software. MSH makes full use of stall cycles with concurrency scaling, while incurring minimal and configurable latency overhead. MSH achieves these with a novel co-design of profiling, program analysis, binary instrumentation and runtime scheduling. Our evaluation shows that MSH achieves up to 72% harvesting throughput of SMT for latency SLOs under which SMT has to be disabled, and that strategically combining MSH with SMT leads to higher throughput than SMT due to MSH's capability to fully harvest memory-bound stall cycles.

https://www.usenix.org/conference/osdi24/presentation/luo
Grand Ballroom ABGH

12:00pm PDT

StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow
Wednesday July 10, 2024 12:00pm - 12:25pm PDT
Hao Wu, Yue Yu, and Junxiao Deng, Huazhong University of Science and Technology; Shadi Ibrahim, Inria; Song Wu and Hao Fan, Huazhong University of Science and Technology and Jinyinhu Laboratory; Ziyue Cheng, Huazhong University of Science and Technology; Hai Jin, Huazhong University of Science and Technology and Jinyinhu Laboratory

The dynamic workloads and latency sensitivity of DNN inference drive a trend toward exploiting serverless computing for scalable DNN inference serving. Usually, GPUs are spatially partitioned to serve multiple co-located functions. However, existing serverless inference systems isolate functions in separate monolithic GPU runtimes (e.g., CUDA contexts), which are too heavy for short-lived and fine-grained functions, leading to high startup latency, a large memory footprint, and expensive inter-function communication. In this paper, we present StreamBox, a new lightweight GPU sandbox for serverless inference workflows. StreamBox unleashes the potential of streams and efficiently realizes them for serverless inference by implementing fine-grained and auto-scaling memory management, allowing transparent and efficient intra-GPU communication across functions, and enabling PCIe bandwidth sharing among concurrent streams. Our evaluations over real-world workloads show that StreamBox reduces the GPU memory footprint by up to 82% and improves throughput by 6.7× compared to state-of-the-art serverless inference systems.

https://www.usenix.org/conference/atc24/presentation/wu-hao
Grand Ballroom CD

12:00pm PDT

PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch
Wednesday July 10, 2024 12:00pm - 12:25pm PDT
Kinman Lei, Yuyang Jin, Mingshu Zhai, Kezhao Huang, Haoxing Ye, and Jidong Zhai, Tsinghua University

Aligning Large Language Models (LLMs) is currently the primary method to ensure AI systems operate in an ethically responsible and socially beneficial manner. Its paradigm differs significantly from standard pre-training or fine-tuning: it involves multiple models and workloads (contexts) and requires frequently switching execution between them, which introduces significant overhead, such as parameter updates and data transfer. This poses a critical challenge: efficiently switching between different models and workloads.

To address these challenges, we introduce PUZZLE, an efficient system for LLM alignment. We explore model orchestration as well as light-weight and smooth workload switching in aligning LLMs by considering the similarity between different workloads. Specifically, PUZZLE adopts a two-dimensional approach for efficient switching, focusing on both intra- and inter-stage switching. Within each stage, switching costs are minimized by exploring model affinities and overlapping computation via time-sharing. Furthermore, a similarity-oriented strategy is employed to find the optimal inter-stage switch plan with the minimum communication cost. We evaluate PUZZLE on various clusters with up to 32 GPUs. Results show that PUZZLE achieves up to 2.12× speedup compared with the state-of-the-art RLHF training system DeepSpeed-Chat.

https://www.usenix.org/conference/atc24/presentation/lei
Grand Ballroom EF

12:05pm PDT

A Tale of Two Paths: Toward a Hybrid Data Plane for Efficient Far-Memory Applications
Wednesday July 10, 2024 12:05pm - 12:25pm PDT
Lei Chen, University of Chinese Academy of Sciences; Shi Liu, UCLA; Chenxi Wang, University of Chinese Academy of Sciences; Haoran Ma and Yifan Qiao, UCLA; Zhe Wang and Chenggang Wu, University of Chinese Academy of Sciences; Youyou Lu, Tsinghua University; Xiaobing Feng and Huimin Cui, University of Chinese Academy of Sciences; Shan Lu, Microsoft Research; Harry Xu, UCLA

With rapid advances in network hardware, far memory has gained a great deal of traction due to its ability to break the memory capacity wall. Existing far memory systems fall into one of two data paths: one that uses the kernel's paging system to transparently access far memory at the page granularity, and a second that bypasses the kernel, fetching data at the object granularity. While it is generally believed that object fetching outperforms paging due to its fine-grained access, it requires significantly more compute resources to run object-level LRU and eviction.

We built Atlas, a hybrid data plane enabled by a runtime-kernel co-design that simultaneously enables accesses via these two data paths to provide high efficiency for real-world applications. Atlas uses always-on profiling to continuously measure page locality. For workloads already with good locality, paging is used to fetch data, whereas for those without, object fetching is employed. Object fetching moves objects that are accessed close in time to contiguous local space, dynamically improving locality and making the execution increasingly amenable to paging, which is much more resource-efficient. Our evaluation shows that Atlas improves the throughput (e.g., by 1.5x and 3.2x) and reduces the tail latency (e.g., by one and two orders of magnitude) when using remote memory, compared with AIFM and Fastswap, the state-of-the-art techniques respectively in the two categories.
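
A minimal sketch of the hybrid data-plane decision, assuming the runtime profiles how much of each fetched page the application actually touches: high locality routes fetches through paging, low locality through object fetching. The locality metric and threshold are assumptions, not Atlas's profiler.

```python
# Sketch of Atlas-style path selection between paging and object fetching
# (not the runtime/kernel co-design itself).
def page_locality(page_use_fractions):
    """Fraction of each fetched page the application actually used,
    averaged over recently fetched pages (1.0 = perfect locality)."""
    return sum(page_use_fractions) / len(page_use_fractions)

def choose_path(locality, threshold=0.6):
    return "paging" if locality >= threshold else "object-fetching"

recent = [0.9, 0.8, 0.95, 0.7]          # profiling says pages are well used
print(choose_path(page_locality(recent)))             # -> paging
print(choose_path(page_locality([0.1, 0.2, 0.05])))   # -> object-fetching
```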

https://www.usenix.org/conference/osdi24/presentation/chen-lei
Grand Ballroom ABGH

12:25pm PDT

DRust: Language-Guided Distributed Shared Memory with Fine Granularity, Full Transparency, and Ultra Efficiency
Wednesday July 10, 2024 12:25pm - 12:45pm PDT
Haoran Ma, Yifan Qiao, Shi Liu, and Shan Yu, UCLA; Yuanjiang Ni, Qingda Lu, and Jiesheng Wu, Alibaba Group; Yiying Zhang, UCSD; Miryung Kim and Harry Xu, UCLA

Despite being a powerful concept, distributed shared memory (DSM) has not been made practical due to the extensive synchronization needed between servers to implement memory coherence. This paper shows a practical DSM implementation based on the insight that the ownership model embedded in programming languages such as Rust automatically constrains the order of reads and writes, providing opportunities for significantly simplifying the coherence implementation if the ownership semantics can be exposed to and leveraged by the runtime. This paper discusses the design and implementation of DRust, a Rust-based DSM system that outperforms the two state-of-the-art DSM systems GAM and Grappa by up to 2.64× and 29.16× in throughput, and scales much better with the number of servers.

https://www.usenix.org/conference/osdi24/presentation/ma-haoran
Grand Ballroom ABGH

12:25pm PDT

ATC Conference Luncheon
Wednesday July 10, 2024 12:25pm - 2:00pm PDT
Sponsored by Roblox
Santa Clara Ballroom

12:45pm PDT

OSDI Conference Luncheon
Wednesday July 10, 2024 12:45pm - 2:00pm PDT
Sponsored by Roblox
Santa Clara Ballroom

2:00pm PDT

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Wednesday July 10, 2024 2:00pm - 2:20pm PDT
Amey Agrawal, Georgia Institute of Technology; Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, and Bhargav Gulavani, Microsoft Research India; Alexey Tumanov, Georgia Institute of Technology; Ramachandran Ramjee, Microsoft Research India

Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt and produces the first output token; the second is decode, which generates the rest of the output tokens one at a time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations, which makes it challenging to achieve both high throughput and low latency.

We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills, which split a prefill request into near-equal-sized chunks, and stall-free schedules, which add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations, resulting in minimal pipeline bubbles.

Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. Compared to vLLM, we achieve 2.6× higher serving capacity for Mistral-7B on a single A100 GPU and up to 3.7× higher serving capacity for the Yi-34B model on two A100 GPUs. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6× gain in end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.
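
The sketch below illustrates chunked prefills and stall-free batching in the abstract: a new prompt is split into fixed-budget chunks, and each iteration batches all ongoing decodes with at most one prefill chunk so decodes never pause. The token budget and data structures are assumptions, not Sarathi-Serve's scheduler.

```python
# Sketch of chunked prefills and stall-free batching in the spirit of
# Sarathi-Serve (not its scheduler). The token budget is an assumption.
from collections import deque

TOKEN_BUDGET = 512          # max prefill tokens processed per iteration

def chunk_prompt(prompt_len, budget=TOKEN_BUDGET):
    return [min(budget, prompt_len - i) for i in range(0, prompt_len, budget)]

def build_batch(prefill_chunks, decode_requests):
    """One scheduling iteration: all ongoing decodes (1 token each) plus at
    most one prefill chunk, so decodes are never paused."""
    batch = [("decode", r, 1) for r in decode_requests]
    if prefill_chunks:
        batch.append(("prefill", "new-request", prefill_chunks.popleft()))
    return batch

chunks = deque(chunk_prompt(1300))        # -> [512, 512, 276]
for step in range(3):
    print(step, build_batch(chunks, decode_requests=["r1", "r2"]))
```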

https://www.usenix.org/conference/osdi24/presentation/agrawal
Grand Ballroom ABGH

2:00pm PDT

ScalaAFA: Constructing User-Space All-Flash Array Engine with Holistic Designs
Wednesday July 10, 2024 2:00pm - 2:25pm PDT
Shushu Yi, Peking University and Zhongguancun Laboratory; Xiurui Pan, Peking University; Qiao Li, Xiamen University; Qiang Li, Alibaba; Chenxi Wang, University of Chinese Academy of Sciences; Bo Mao, Xiamen University; Myoungsoo Jung, KAIST and Panmnesia; Jie Zhang, Peking University and Zhongguancun Laboratory

All-flash array (AFA) is a popular approach to aggregating the capacity of multiple solid-state drives (SSDs) while guaranteeing fault tolerance. Unfortunately, existing AFA engines inflict substantial software overheads on the I/O path, such as user-kernel context switches and AFA internal tasks (e.g., parity preparation), and thus fail to keep pace with next-generation high-performance SSDs.

Tackling this challenge, we propose ScalaAFA, a holistic AFA engine design that can scale the throughput of next-generation SSD arrays at low CPU cost. We place ScalaAFA in user space to avoid user-kernel context switches, while harnessing SSD built-in resources for handling AFA internal tasks. Specifically, in adherence to the lock-free principle of existing user-space storage frameworks, ScalaAFA substitutes traditional locks with an efficient message-passing-based permission management scheme to facilitate inter-thread synchronization. Considering the CPU burden imposed by background I/O and parity computation, ScalaAFA offloads these tasks to SSDs. To mitigate host-SSD communication overheads in offloading, ScalaAFA adopts a novel data placement policy that enables transparent data gathering and in-situ parity computation. ScalaAFA also addresses two AFA-intrinsic issues, metadata persistence and write amplification, by thoroughly exploiting SSD architectural innovations. Comprehensive evaluation results indicate that ScalaAFA can achieve 2.5× the write throughput of state-of-the-art AFA engines and reduce average write latency by a significant 52.7%.

https://www.usenix.org/conference/atc24/presentation/yi-shushu
Grand Ballroom CD

2:00pm PDT

PeRF: Preemption-enabled RDMA Framework
Wednesday July 10, 2024 2:00pm - 2:25pm PDT
Sugi Lee and Mingyu Choi, Acryl Inc.; Ikjun Yeom, Acryl Inc. and Sungkyunkwan University; Younghoon Kim, Sungkyunkwan University

Remote Direct Memory Access (RDMA) provides high throughput, low latency, and minimal CPU usage for data-intensive applications. However, RDMA was initially designed for single-tenant use, and its application in a multi-tenant cloud environment poses challenges in terms of performance isolation, security, and scalability. This paper proposes a Preemption-enabled RDMA Framework (PeRF), which offers software-based performance isolation for efficient multi-tenancy in RDMA. PeRF leverages a novel RNIC preemption mechanism to dynamically control RDMA resource utilization for each tenant, while ensuring that RNICs remain busy, thereby enabling work conservation. PeRF outperforms existing approaches by achieving flexible performance isolation without compromising RDMA's bare-metal performance.

https://www.usenix.org/conference/atc24/presentation/lee
Grand Ballroom EF

2:20pm PDT

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
Wednesday July 10, 2024 2:20pm - 2:40pm PDT
Yao Fu, Leyang Xue, Yeqi Huang, and Andrei-Octavian Brabete, University of Edinburgh; Dmitrii Ustiugov, NTU Singapore; Yuvraj Patel and Luo Mai, University of Edinburgh

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The design of ServerlessLLM features three core contributions: (i) fast multi-tier checkpoint loading, featuring a new loading-optimized checkpoint format and a multi-tier loading system, fully utilizing the bandwidth of complex storage hierarchies on GPU servers; (ii) efficient live migration of LLM inference, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) startup-time-optimized model scheduling, which assesses the locality statuses of checkpoints on each server and schedules the model onto servers that minimize the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10 - 200X across various LLM inference workloads.

https://www.usenix.org/conference/osdi24/presentation/fu
Grand Ballroom ABGH

2:25pm PDT

FastCommit: resource-efficient, performant and cost-effective file system journaling
Wednesday July 10, 2024 2:25pm - 2:50pm PDT
Harshad Shirwadkar, Saurabh Kadekodi, and Theodore Tso, Google

JBD2, the current physical journaling mechanism in Ext4, is bulky and resource-hungry. Specifically, for metadata-heavy workloads, fsyncs issued by applications cause JBD2 to write copies of changed metadata blocks, incurring high byte and IO overheads. When storing data in Ext4 via NFS (a popular setup), the NFS protocol issues fsyncs for every file metadata update, which further exacerbates the problem. In a simple multi-threaded mail-server workload, JBD2 consumed approximately 76% of the disk’s write bandwidth. The high byte and IO utilization of JBD2 results in reduced application throughput, higher wear-out of flash-based media, and increased performance provisioning costs in cloud-based storage services.

We present FastCommit, a hybrid journaling approach for Ext4 that performs logical journaling for simple and frequent file system modifications while relying on JBD2 for more complex and rare modifications. The key design elements of FastCommit are compact logging, selective flushing, and inline journaling. The first two techniques work together to ensure that over 80% of commits fit within a single 4KB block and are written to disk without requiring an expensive cache flush operation. Inline journaling minimizes context switching delays. With faster and more efficient fsyncs, FastCommit reduces the throughput interference of JBD2 by over 2×, along with throughput improvements of up to 120%. We implemented FastCommit in Ext4 and successfully merged our code into the upstream Linux kernel.
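
As a rough illustration of the hybrid-journaling decision, the sketch below serializes pending metadata changes into a compact logical record and takes the fast-commit path only when the operations are simple and the record fits in one 4 KB block, otherwise falling back to a full JBD2-style commit. The record encoding and the set of "simple" operations are invented for illustration, not FastCommit's on-disk format.

```python
# Toy decision logic for hybrid journaling in the spirit of FastCommit (not
# the Ext4 code). The logical record encoding is a made-up stand-in.
import json

BLOCK_SIZE = 4096
SIMPLE_OPS = {"create", "unlink", "link", "inode_update"}

def logical_record(op):
    # Compact description of the change, instead of whole metadata blocks.
    return json.dumps(op, separators=(",", ":")).encode()

def commit_path(pending_ops):
    record = b"".join(logical_record(op) for op in pending_ops)
    simple = all(op["type"] in SIMPLE_OPS for op in pending_ops)
    if simple and len(record) <= BLOCK_SIZE:
        return "fast-commit"        # one 4 KB block, no full cache flush
    return "jbd2"                   # complex/large change: full journal commit

ops = [{"type": "create", "parent": 2, "name": "mail-0001", "ino": 1042}]
print(commit_path(ops))             # -> fast-commit
```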

https://www.usenix.org/conference/atc24/presentation/shirwadkar
Grand Ballroom CD

2:25pm PDT

CyberStar: Simple, Elastic and Cost-Effective Network Functions Management in Cloud Network at Scale
Wednesday July 10, 2024 2:25pm - 2:50pm PDT
Tingting Xu, Nanjing University; Bengbeng Xue, Yang Song, Xiaomin Wu, Xiaoxin Peng, and Yilong Lyu, Alibaba Group; Xiaoliang Wang, Chen Tian, Baoliu Ye, and Camtu Nguyen, Nanjing University; Biao Lyu and Rong Wen, Alibaba Group; Zhigang Zong, Alibaba Group and Zhejiang University; Shunmin Zhu, Alibaba Group and Tsinghua University

Network functions (NFs) facilitate network operations and have become a critical service offered by cloud providers. One of the key challenges is meeting the elasticity requirements of massive traffic and tenants' diverse NF requests. This paper identifies the opportunity to leverage cloud elastic compute services (ECS), i.e., containers or virtual machines, to provide cloud-scale network function services, and presents CyberStar. CyberStar introduces two key designs: (i) resource pooling based on a newly proposed three-tier architecture for scalable network functions; and (ii) on-demand resource assignment that maintains high resource utilization with respect to both tenant demands and operation cost. Compared to traditional NFs constructed over bare-metal servers, CyberStar achieves 100Gbps bandwidth (6.7×) and scales to millions of connections within one second (20×).

https://www.usenix.org/conference/atc24/presentation/xu-tingting
Grand Ballroom EF

2:40pm PDT

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
Wednesday July 10, 2024 2:40pm - 3:00pm PDT
Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim, Seoul National University

Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which synergistically works with modern offloading-based inference systems. InfiniGen leverages the key insight that a few important tokens that are essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV cache entries (without fetching them all), thereby mitigating the fetch overhead from the host memory in offloading-based LLM serving systems. Our evaluation on several representative LLMs shows that InfiniGen improves the overall performance of a modern offloading-based system by up to 3.00× compared to prior KV cache management methods while offering substantially better model accuracy.
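
A minimal sketch of the speculation idea, assuming attention scores can be approximated from a small subset of query/key dimensions: the top-scoring KV entries are selected for prefetching from host memory. The data, chosen dimensions, and selection interface are synthetic; InfiniGen's actual rehearsal uses part of the next layer's query weight and key cache.

```python
# Sketch of speculative KV prefetching in the spirit of InfiniGen (not its
# implementation). Scores use only a few "important" feature dimensions.
import math
import random

def approx_scores(query, keys, dims):
    """Attention-score estimate using a subset of feature dimensions."""
    return [sum(query[d] * k[d] for d in dims) / math.sqrt(len(dims))
            for k in keys]

def select_kv_to_prefetch(query, keys, dims, k):
    scores = approx_scores(query, keys, dims)
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

rng = random.Random(0)
hidden = 64
keys = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(1000)]
query = [rng.gauss(0, 1) for _ in range(hidden)]
important_dims = list(range(8))          # assumed "skewed" dimensions
print(select_kv_to_prefetch(query, keys, important_dims, k=16))
```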

https://www.usenix.org/conference/osdi24/presentation/lee
Grand Ballroom ABGH

2:50pm PDT

ZMS: Zone Abstraction for Mobile Flash Storage
Wednesday July 10, 2024 2:50pm - 3:15pm PDT
Joo-Young Hwang, Seokhwan Kim, Daejun Park, Yong-Gil Song, Junyoung Han, Seunghyun Choi, and Sangyeun Cho, Samsung Electronics; Youjip Won, Korea Advanced Institute of Science and Technology

We propose ZMS, an I/O stack for ZNS-based flash storage in mobile environments. The zone interface is known to relieve flash storage of two fundamental issues that modern flash storage suffers from: logical-to-physical mapping table size and garbage collection overhead. Through an extensive study, we find that realizing the zone interface in a mobile environment is highly challenging due to two unique characteristics of that environment: the lack of on-device memory in mobile flash storage and the frequent fsync() calls in mobile applications. Accordingly, we identify the root causes that need to be addressed in realizing the zone interface in the mobile I/O stack: write buffer thrashing and tiny synchronous file updates. We develop filesystem, block I/O layer, and device firmware techniques to address these two issues. The three key techniques in ZMS are (i) IOTailor, (ii) budget-based in-place update, and (iii) multi-granularity logical-to-physical mapping. Evaluation on a real production platform shows that ZMS improves write amplification by 2.9–6.4× and random write performance by 5.0–13.6×. With the three techniques, ZMS shows significant performance improvements when writing to multiple zones concurrently, executing SQLite transactions, and launching applications.

https://www.usenix.org/conference/atc24/presentation/hwang
Grand Ballroom CD

2:50pm PDT

OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs
Wednesday July 10, 2024 2:50pm - 3:15pm PDT
Mikhail Khalilov, Marcin Chrapek, Siyuan Shen, Alessandro Vezzu, Thomas Benz, Salvatore Di Girolamo, and Timo Schneider, ETH Zürich; Daniele De Sensi, ETH Zürich and Sapienza University of Rome; Luca Benini and Torsten Hoefler, ETH Zürich

Multi-tenancy is essential for unleashing SmartNICs' potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and IO resources. Compared to standard NIC data paths with a well-defined set of offloaded functions, the unpredictable execution times of SmartNIC kernels make conventional approaches to multi-tenancy and QoS insufficient. We fill this gap with OSMOSIS, a SmartNIC resource manager co-design. OSMOSIS extends existing OS mechanisms to enable dynamic hardware resource multiplexing of the on-path packet processing data plane. We integrate OSMOSIS within an open-source RISC-V-based 400Gbit/s SmartNIC. Our performance results demonstrate that OSMOSIS fully supports multi-tenancy and enables broader adoption of SmartNICs in datacenters with low overhead.

https://www.usenix.org/conference/atc24/presentation/khalilov
Grand Ballroom EF

3:00pm PDT

Llumnix: Dynamic Scheduling for Large Language Model Serving
Wednesday July 10, 2024 3:00pm - 3:20pm PDT
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin, Alibaba Group

Inference serving for large language models (LLMs) is the key to unleashing their potential in people's daily lives. However, efficient LLM serving remains challenging today because the requests are inherently heterogeneous and unpredictable in terms of resource and latency requirements, as a result of the diverse applications and the dynamic execution nature of LLMs. Existing systems are fundamentally limited in handling these characteristics and cause problems such as severe queuing delays, poor tail latencies, and SLO violations.

We introduce Llumnix, an LLM serving system that reacts to such heterogeneous and unpredictable requests by runtime rescheduling across multiple model instances. Similar to context switching across CPU cores in modern operating systems, Llumnix reschedules requests to improve load balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs. Llumnix implements the rescheduling with an efficient and scalable live migration mechanism for requests and their in-memory states, and exploits it in a dynamic scheduling policy that unifies the multiple rescheduling scenarios elegantly. Our evaluations show that Llumnix improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5×, and delivers up to 36% cost savings while achieving similar tail latencies, compared against state-of-the-art LLM serving systems. Llumnix is publicly available at https://github.com/AlibabaPAI/llumnix.
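
The sketch below captures the flavor of load-driven rescheduling: requests migrate from the most- to the least-loaded instance until the imbalance drops below a threshold. The load metric, per-request load delta, and threshold are assumptions; Llumnix's real policy and its live-migration mechanism are considerably richer.

```python
# Sketch of load-driven request migration in the spirit of Llumnix (not its
# policy or live-migration mechanism). Load metric and threshold are assumed.
def plan_migrations(instances, imbalance_threshold=0.3):
    """instances: dict name -> {"load": fraction of KV memory used,
    "requests": list of request ids}. Returns (request, src, dst) moves."""
    moves = []
    src = max(instances, key=lambda i: instances[i]["load"])
    dst = min(instances, key=lambda i: instances[i]["load"])
    while (instances[src]["load"] - instances[dst]["load"] > imbalance_threshold
           and instances[src]["requests"]):
        req = instances[src]["requests"].pop()
        instances[dst]["requests"].append(req)
        delta = 0.1                               # assumed per-request load
        instances[src]["load"] -= delta
        instances[dst]["load"] += delta
        moves.append((req, src, dst))
    return moves

cluster = {"gpu0": {"load": 0.9, "requests": ["a", "b", "c"]},
           "gpu1": {"load": 0.2, "requests": ["d"]}}
print(plan_migrations(cluster))   # e.g. [('c', 'gpu0', 'gpu1'), ('b', ...)]
```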

https://www.usenix.org/conference/osdi24/presentation/sun-biao
Grand Ballroom ABGH

3:15pm PDT

Ethane: An Asymmetric File System for Disaggregated Persistent Memory
Wednesday July 10, 2024 3:15pm - 3:40pm PDT
Miao Cai, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; Junru Shen, College of Computer Science and Software Engineering, Hohai University; Baoliu Ye, State Key Laboratory for Novel Software Technology, Nanjing University

Ultra-fast persistent memories (PMs) promise a practical path toward high-performance distributed file systems. This paper examines the current PM provision scheme and reveals a cascade of three performance and cost issues, namely expensive cross-node interaction, weak single-node capability, and costly scale-out performance, which not only underutilize fast PM devices but also magnify their limited storage capacity and high price. To remedy this, we introduce Ethane, a file system built on disaggregated persistent memory (DPM). Through resource separation using fast connectivity technologies, DPM achieves efficient and cost-effective PM sharing while retaining low-latency memory access. To unleash this hardware potential, Ethane incorporates an asymmetric file system architecture inspired by the imbalanced resource provision of DPM. It splits a file system into a control-plane FS and a data-plane FS and designs these two planes to make the best use of their respective hardware resources. Evaluation results demonstrate that Ethane reaps the DPM hardware benefits, performs up to 68× better than modern distributed file systems, and improves data-intensive application throughput by up to 17×.

https://www.usenix.org/conference/atc24/presentation/cai
Grand Ballroom CD

3:15pm PDT

ETC: An Elastic Transmission Control Using End-to-End Available Bandwidth Perception
Wednesday July 10, 2024 3:15pm - 3:40pm PDT
Feixue Han, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Qing Li, Peng Cheng Laboratory; Peng Zhang, Tencent; Gareth Tyson, Hong Kong University; Yong Jiang, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Mingwei Xu, Tsinghua University; Yulong Lan and ZhiCheng Li, Tencent

Researchers and practitioners have proposed various transport protocols to keep up with advances in networks and the applications that use them. Current Wide Area Network protocols strive to identify a congestion signal to make distributed but fair judgments. However, existing congestion signals such as RTT and packet loss can only be observed after congestion occurs. We therefore propose Elastic Transmission Control (ETC). ETC exploits the instantaneous receipt rate of N consecutive packets as the congestion signal. We refer to this as the pulling rate, as we posit that the receipt rate can be used to "pull" the sending rate towards a fair share of the capacity. Naturally, this signal can be measured prior to congestion, as senders can access it immediately after the acknowledgment of the first N packets. Exploiting the pulling rate measurements, ETC calculates the optimal rate update steps following a simple elastic principle: the further away from the pulling rate, the faster the sending rate increases. We conduct extensive experiments using both simulated and real networks. Our results show that ETC outperforms state-of-the-art protocols in terms of both throughput (15% higher than Copa) and latency (20% lower than BBR). ETC also shows superior convergence speed and fairness, with a 10× improvement in convergence time even compared to the protocol with the best convergence performance.
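
A minimal sketch of the elastic principle, assuming the pulling rate is measured from the acknowledgment timestamps of the last N packets: the sending rate is moved toward the pulling rate with a step proportional to their gap. N, the gain, and the synthetic timestamps are illustrative assumptions, not ETC's parameters.

```python
# Sketch of ETC's elastic principle (not the protocol): the sending rate is
# "pulled" toward the measured receipt (pulling) rate, with larger steps the
# further apart they are.
N = 16          # consecutive packets used to measure the pulling rate
GAIN = 0.5      # fraction of the gap closed per update

def pulling_rate(ack_timestamps, packet_size_bytes):
    """Instantaneous receipt rate of the last N acked packets (bytes/sec)."""
    span = ack_timestamps[-1] - ack_timestamps[-N]
    return (N - 1) * packet_size_bytes / span

def update_rate(sending_rate, pull_rate):
    # Elastic step: proportional to the distance from the pulling rate.
    return sending_rate + GAIN * (pull_rate - sending_rate)

rate = 10e6                               # start at 10 MB/s
acks = [i * 0.0001 for i in range(64)]    # synthetic ack timestamps
for _ in range(5):
    rate = update_rate(rate, pulling_rate(acks, packet_size_bytes=1500))
    print(f"{rate / 1e6:.2f} MB/s")       # converges toward ~15 MB/s
```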

https://www.usenix.org/conference/atc24/presentation/han
Grand Ballroom EF

3:20pm PDT

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Wednesday July 10, 2024 3:20pm - 3:40pm PDT
Yinmin Zhong and Shengyu Liu, Peking University; Junda Chen, UC San Diego; Jianbo Hu, Peking University; Yibo Zhu, StepFun; Xuanzhe Liu and Xin Jin, Peking University; Hao Zhang, UC San Diego

DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interferences but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both.

DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interference. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 7.4× more requests or meet a 12.6× tighter SLO compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests.

https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin
Grand Ballroom ABGH

3:40pm PDT

Break with Refreshments
Wednesday July 10, 2024 3:40pm - 4:10pm PDT
Grand Ballroom Foyer

4:10pm PDT

ACCL+: an FPGA-Based Collective Engine for Distributed Applications
Wednesday July 10, 2024 4:10pm - 4:30pm PDT
Zhenhao He, Dario Korolija, Yu Zhu, and Benjamin Ramhorst, Systems Group, ETH Zurich; Tristan Laan, University of Amsterdam; Lucian Petrica and Michaela Blott, AMD Research; Gustavo Alonso, Systems Group, ETH Zurich

FPGAs are increasingly prevalent in cloud deployments, serving as Smart-NICs or network-attached accelerators. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source, FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the entire design. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and its competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: as a collective offload engine to distribute CPU-based vector-matrix multiplication, and as a component in designing fully FPGA-based distributed deep-learning recommendation inference.

https://www.usenix.org/conference/osdi24/presentation/he
Grand Ballroom ABGH

4:10pm PDT

More is Different: Prototyping and Analyzing a New Form of Edge Server with Massive Mobile SoCs
Wednesday July 10, 2024 4:10pm - 4:35pm PDT
Li Zhang, Beijing University of Posts and Telecommunications; Zhe Fu, Tsinghua University; Boqing Shi and Xiang Li, Beijing University of Posts and Telecommunications; Rujin Lai and Chenyang Yang, vclusters; Ao Zhou, Xiao Ma, Shangguang Wang, and Mengwei Xu, Beijing University of Posts and Telecommunications

Huge energy consumption poses a significant challenge for edge clouds. In response, we introduce a new type of edge server, the SoC Cluster, which orchestrates multiple low-power mobile system-on-chips (SoCs) through an on-chip network. For the first time, we have developed a concrete SoC Cluster consisting of 60 Qualcomm Snapdragon 865 SoCs housed in a 2U rack, which has been successfully commercialized and extensively deployed in edge clouds. Cloud gaming emerges as the principal workload on these deployed SoC Clusters, owing to the compatibility between mobile SoCs and native mobile games.

In this study, we aim to demystify whether the SoC Cluster can efficiently serve more generalized, typical edge workloads. To this end, we developed a benchmark suite that employs state-of-the-art libraries for two critical edge workloads, i.e., video transcoding and deep learning inference. This suite evaluates throughput, latency, power consumption, and other application-specific metrics like video quality. Following this, we conducted a thorough measurement study and directly compared the SoC Cluster with traditional edge servers with regard to electricity usage and monetary cost. Our results quantitatively reveal when and for which applications mobile SoCs exhibit higher energy efficiency than traditional servers, as well as their ability to scale power consumption proportionally with fluctuating incoming loads. These outcomes provide insightful implications and offer valuable direction for further refinement of the SoC Cluster to facilitate its deployment across wider edge scenarios.

https://www.usenix.org/conference/atc24/presentation/zhang-li-prototyping
Wednesday July 10, 2024 4:10pm - 4:35pm PDT
Grand Ballroom CD

4:10pm PDT

Limitations and Opportunities of Modern Hardware Isolation Mechanisms
Wednesday July 10, 2024 4:10pm - 4:35pm PDT
Xiangdong Chen and Zhaofeng Li, University of Utah; Tirth Jain, Maya Labs; Vikram Narayanan and Anton Burtsev, University of Utah

A surge in the number, complexity, and automation of targeted security attacks has triggered a wave of interest in hardware support for isolation. Intel memory protection keys (MPK), ARM pointer authentication (PAC), ARM memory tagging extensions (MTE), and ARM Morello capabilities are just a few hardware mechanisms aimed at supporting low-overhead isolation in recent CPUs. These new mechanisms aim to bring practical isolation to a broad range of systems, e.g., browser plugins, device drivers and kernel extensions, user-defined database and network functions, serverless cloud platforms, and many more. However, as these technologies are still nascent, their advantages and limitations remain unclear. In this work, we take an in-depth look at modern hardware isolation mechanisms with the goal of understanding their suitability for the isolation of subsystems with the tightest performance budgets. Our analysis shows that while a huge step forward, the isolation mechanisms in commodity CPUs still lack several design principles critical for supporting low-overhead enforcement of isolation boundaries, zero-copy exchange of data, and secure revocation of access permissions.

https://www.usenix.org/conference/atc24/presentation/chen-xiangdong
Wednesday July 10, 2024 4:10pm - 4:35pm PDT
Grand Ballroom EF

4:30pm PDT

Beaver: Practical Partial Snapshots for Distributed Cloud Services
Wednesday July 10, 2024 4:30pm - 4:50pm PDT
Liangcheng Yu, University of Pennsylvania; Xiao Zhang, Shanghai Jiao Tong University; Haoran Zhang, University of Pennsylvania; John Sonchack, Princeton University; Dan Ports, Microsoft / University of Washington; Vincent Liu, University of Pennsylvania

Distributed snapshots are a classic class of protocols used for capturing a causally consistent view of states across machines. Although effective, existing protocols presume an isolated universe of processes to snapshot and require instrumentation and coordination of all. This assumption does not match today's cloud services—it is not always practical to instrument all involved processes nor realistic to assume zero interaction of the machines of interest with the external world.

To bridge this gap, this paper presents Beaver, the first practical partial snapshot protocol that ensures causal consistency under external traffic interference. Beaver presents a unique design point that tightly couples its protocol with the regularities of the underlying data center environment. By exploiting the placement of software load balancers in public clouds and their associated communication pattern, Beaver not only requires minimal changes to today's data center operations but also eliminates any form of blocking to existing communication, thus incurring near-zero overhead to user traffic. We demonstrate Beaver's effectiveness through extensive testbed experiments and novel use cases.

https://www.usenix.org/conference/osdi24/presentation/yu
Wednesday July 10, 2024 4:30pm - 4:50pm PDT
Grand Ballroom ABGH

4:35pm PDT

HiP4-UPF: Towards High-Performance Comprehensive 5G User Plane Function on P4 Programmable Switches
Wednesday July 10, 2024 4:35pm - 5:00pm PDT
Zhixin Wen and Guanhua Yan, Binghamton University

Due to better cost benefits, P4 programmable switches have been considered in a few recent works to implement the 5G User Plane Function (UPF). To circumvent the limited resources on P4 programmable switches, they either ignore some essential UPF features or resort to a hybrid deployment approach that requires extra resources. This work aims to improve the performance of UPFs with comprehensive features which, except for packet buffering, are deployable entirely on commodity P4 programmable switches. We build a baseline UPF based on prior work and analyze its key performance bottlenecks. We propose a three-tiered approach to optimize rule storage on the switch ASICs. We also develop a novel scheme that combines pendulum table access and selective usage pulling to reduce the operational latency of the UPF. Using a commodity P4 programmable switch, the experimental results show that our UPF implementation can support twice as many mobile devices as the baseline UPF and 1.9 times more than SD-Fabric. Our work also improves the throughputs in three common types of 5G call flows by 9–619% over the UPF solutions in two open-source 5G network emulators.

https://www.usenix.org/conference/atc24/presentation/wen
Wednesday July 10, 2024 4:35pm - 5:00pm PDT
Grand Ballroom CD

4:35pm PDT

FetchBPF: Customizable Prefetching Policies in Linux with eBPF
Wednesday July 10, 2024 4:35pm - 5:00pm PDT
Xuechun Cao, Shaurya Patel, and Soo Yee Lim, University of British Columbia; Xueyuan Han, Wake Forest University; Thomas Pasquier, University of British Columbia

Monolithic operating systems are infamously complex. Linux in particular has a tendency to intermingle policy and mechanisms in a manner that hinders modularity. This is especially problematic when developers aim to finely optimize performance, since it is often the case that a default policy in Linux, while performing well on average, cannot achieve the optimal performance in all circumstances. However, developing and maintaining a bespoke kernel to satisfy the needs of a specific application is usually an unrealistic endeavor due to the high software engineering cost. Therefore, we need a mechanism to easily customize kernel policies and behavior. In this paper, we design a framework called FetchBPF that addresses this problem in the context of memory prefetching. FetchBPF extends the widely used eBPF framework to allow developers to easily express, develop, and deploy prefetching policies without modifying the kernel codebase. We implement various memory prefetching policies from the literature and demonstrate that our deployment model incurs negligible overhead as compared to the equivalent native kernel implementation.

https://www.usenix.org/conference/atc24/presentation/cao
Wednesday July 10, 2024 4:35pm - 5:00pm PDT
Grand Ballroom EF

4:50pm PDT

Fast and Scalable In-network Lock Management Using Lock Fission
Wednesday July 10, 2024 4:50pm - 5:10pm PDT
Hanze Zhang, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Shanghai AI Laboratory; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; Ke Cheng, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Rong Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Shanghai AI Laboratory; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Key Laboratory of System Software (Chinese Academy of Sciences)

Distributed lock services are extensively utilized in distributed systems to serialize concurrent accesses to shared resources. The need for fast and scalable lock services has become more pronounced with decreasing task execution times and expanding dataset scales. However, traditional lock managers, reliant on server CPUs to handle lock requests, experience significant queuing delays in lock grant latency. Advanced network hardware (e.g., programmable switches) presents an avenue to manage locks without queuing delays due to their high packet processing power. Nevertheless, their constrained memory capacity restricts the manageable lock scale, thereby limiting their effectiveness in large-scale workloads.

This paper presents FISSLOCK, a fast and scalable distributed lock service that exploits the programmable switch to improve (tail) latency and peak throughput for millions of locks. The key idea behind FISSLOCK is the concept of lock fission, which decouples lock management into grant decision and participant maintenance. FISSLOCK leverages the programmable switch to decide lock grants synchronously and relies on servers to maintain participants (i.e., holders and waiters) asynchronously. By using the programmable switch for routing, FISSLOCK enables on-demand fine-grained lock migration, thereby reducing the lock grant and release delays. FISSLOCK carefully designs and implements the grant decision procedure on the programmable switch, supporting over one million locks. Evaluation using various benchmarks and a real-world application shows the efficiency of FISSLOCK. Compared to the state-of-the-art switch-based approach (NetLock), FISSLOCK cuts up to 79.1% (from 43.0%) of median lock grant time in the microbenchmark and improves transaction throughput for TATP and TPC-C by 1.76× and 2.28×, respectively.
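
As a rough illustration of the lock-fission idea described above, the following Python sketch separates a minimal grant-decision table (standing in for switch state) from holder/waiter bookkeeping on a server; all class and method names are hypothetical, and none of FISSLOCK's actual data-plane logic is modeled.

    # Toy sketch only: a simulated "switch" decides grants synchronously with
    # one free/held bit per lock, while a "server" records holders and waiters
    # off the critical path.
    from collections import deque

    class SwitchGrantTable:
        """Minimal grant-decision state: one bit per lock (free/held)."""
        def __init__(self):
            self.held = {}              # lock_id -> bool

        def try_grant(self, lock_id):
            if not self.held.get(lock_id, False):
                self.held[lock_id] = True
                return True             # grant decided immediately
            return False                # caller becomes a waiter

        def release(self, lock_id):
            self.held[lock_id] = False

    class ParticipantServer:
        """Holder/waiter bookkeeping, maintained asynchronously."""
        def __init__(self):
            self.holders = {}           # lock_id -> client
            self.waiters = {}           # lock_id -> deque of clients

        def record_grant(self, lock_id, client):
            self.holders[lock_id] = client

        def record_wait(self, lock_id, client):
            self.waiters.setdefault(lock_id, deque()).append(client)

        def next_waiter(self, lock_id):
            q = self.waiters.get(lock_id)
            return q.popleft() if q else None

    if __name__ == "__main__":
        switch, server = SwitchGrantTable(), ParticipantServer()
        for client in ("c1", "c2"):
            if switch.try_grant("L1"):
                server.record_grant("L1", client)
            else:
                server.record_wait("L1", client)
        switch.release("L1")
        print("hand over L1 to:", server.next_waiter("L1"))   # -> c2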

https://www.usenix.org/conference/osdi24/presentation/zhang-hanze
Wednesday July 10, 2024 4:50pm - 5:10pm PDT
Grand Ballroom ABGH

5:00pm PDT

KEPC-Push: A Knowledge-Enhanced Proactive Content Push Strategy for Edge-Assisted Video Feed Streaming
Wednesday July 10, 2024 5:00pm - 5:25pm PDT
Ziwen Ye, Peng Cheng Laboratory and Tsinghua Shenzhen International Graduate School; Qing Li, Peng Cheng Laboratory; Chunyu Qiao, ByteDance; Xiaoteng Ma, Tsinghua Shenzhen International Graduate School; Yong Jiang, Peng Cheng Laboratory and Tsinghua Shenzhen International Graduate School; Qian Ma and Shengbin Meng, ByteDance; Zhenhui Yuan, University of Warwick; Zili Meng, HKUST

Video Feed Streaming (e.g., TikTok, Reels) is increasingly popular nowadays. Users will be scheduled to the distribution infrastructure, including content distribution network (CDN) and multi-access edge computing (MEC) nodes, to access the content. Our observation is that the existing proactive content push algorithms, which are primarily based on historical access information and designed for on-demand videos, no longer meet the demands of video feed streaming. The main reason is that video feed streaming applications always push recently generated videos to attract users’ interests, thus lacking historical information when pushing. In this case, push mismatches and load imbalances will be observed, resulting in higher bandwidth costs and degraded user experience. To this end, we propose KEPC-Push, a Knowledge-Enhanced Proactive Content Push strategy with the knowledge of video content features. KEPC-Push employs knowledge graphs to determine the popularity correlation among similar videos (with similar authors, contents, length, etc.) and pushes content based on this guidance. Besides, KEPC-Push designs a hierarchical algorithm to optimize the resource allocation in edge nodes with heterogeneous capabilities and runs at the regional level to shorten the communication distance. Trace-driven simulations show that KEPC-Push reduces peak-period CDN bandwidth costs by 20% and improves average download speeds by 7% against state-of-the-art solutions.

https://www.usenix.org/conference/atc24/presentation/ye-ziwen
Wednesday July 10, 2024 5:00pm - 5:25pm PDT
Grand Ballroom CD

5:00pm PDT

Fast (Trapless) Kernel Probes Everywhere
Wednesday July 10, 2024 5:00pm - 5:25pm PDT
Jinghao Jia, University of Illinois Urbana-Champaign; Michael V. Le and Salman Ahmed, IBM T.J. Watson Research Center; Dan Williams, Virginia Tech and IBM T.J. Watson Research Center; Hani Jamjoom, IBM T.J. Watson Research Center; Tianyin Xu, University of Illinois at Urbana-Champaign

The ability to efficiently probe and instrument a running operating system (OS) kernel is critical for debugging, system security, and performance monitoring. While efforts to optimize the widely used Kprobes in Linux over the past two decades have greatly improved their performance, many fundamental gaps remain that prevent them from being completely efficient. Specifically, we find that Kprobe is only optimized for ~80% of kernel instructions, leaving the remaining probe-able kernel code to suffer the severe penalties of double traps needed by the Kprobe implementation. In this paper, we focus on the design and implementation of an efficient and general trapless kernel probing mechanism (no hardware exceptions) that can be applied to almost all code in Linux. We discover that the main limitation of current probe optimization efforts comes from not being able to assume or change certain properties/layouts of the target kernel code. Our main insight is that by introducing strategically placed nops, thus slightly changing the code layout, we can overcome this main limitation. We implement our mechanism on Linux Kprobe, which is transparent to the users. Our evaluation shows a 10x improvement in probe performance over standard Kprobe while providing this level of performance for 96% of kernel code.

https://www.usenix.org/conference/atc24/presentation/jia
Wednesday July 10, 2024 5:00pm - 5:25pm PDT
Grand Ballroom EF

5:10pm PDT

Chop Chop: Byzantine Atomic Broadcast to the Network Limit
Wednesday July 10, 2024 5:10pm - 5:30pm PDT
Martina Camaioni, Rachid Guerraoui, Matteo Monti, Pierre-Louis Roman, Manuel Vidigueira, and Gauthier Voron, EPFL

At the heart of state machine replication, the celebrated technique enabling decentralized and secure universal computation, lies Atomic Broadcast, a fundamental communication primitive that orders, authenticates, and deduplicates messages. This paper presents Chop Chop, a Byzantine Atomic Broadcast system that uses a novel authenticated memory pool to amortize the cost of ordering, authenticating and deduplicating messages, achieving "line rate" (i.e., closely matching the complexity of a protocol that does not ensure any ordering, authentication or Byzantine resilience) even when processing messages as small as 8 bytes. Chop Chop attains this performance by means of a new form of batching we call distillation. A distilled batch is a set of messages that are fast to authenticate, deduplicate, and order. Batches are distilled using a novel interactive protocol involving brokers, an untrusted layer of facilitating processes between clients and servers. In a geo-distributed deployment of 64 medium-sized servers, Chop Chop processes 43,600,000 messages per second with an average latency of 3.6 seconds. Under the same conditions, state-of-the-art alternatives offer two orders of magnitude less throughput for the same latency. We showcase three simple Chop Chop applications: a Payment system, an Auction house and a "Pixel war" game, respectively achieving 32, 2.3 and 35 million operations per second.

https://www.usenix.org/conference/osdi24/presentation/camaioni
Wednesday July 10, 2024 5:10pm - 5:30pm PDT
Grand Ballroom ABGH

5:25pm PDT

High-density Mobile Cloud Gaming on Edge SoC Clusters
Wednesday July 10, 2024 5:25pm - 5:40pm PDT
Li Zhang, Shangguang Wang, and Mengwei Xu, Beijing University of Posts and Telecommunications

System-on-Chip (SoC) Clusters, i.e., servers consisting of many stacked mobile SoCs, have emerged as a popular platform for serving mobile cloud gaming. Sharing the underlying hardware and OS, these SoC Clusters enable native mobile games to be executed and rendered efficiently without modification. However, the number of deployed game sessions is limited due to conservative deployment strategies and high GPU utilization in current game offloading methods. To address these challenges, we introduce SFG, the first system that enables high-density mobile cloud gaming on SoC Clusters with two novel techniques: (1) It employs a resource-efficient game partitioning and cross-SoC offloading design that maximally preserves GPU optimization intents in the standard graphics rendering pipeline; (2) It proposes an NPU-enhanced game partition coordination strategy to adjust game performance when co-locating partitioned and complete game sessions. Our evaluation of five Unity games shows that SFG achieves up to 4.5× higher game density than existing methods with trivial performance loss. Equally important, SFG extends the lifespan of SoC Clusters, enabling outdated SoC Clusters to serve new games that are infeasible on a single SoC due to GPU resource shortages.

https://www.usenix.org/conference/atc24/presentation/zhang-li-gaming
Wednesday July 10, 2024 5:25pm - 5:40pm PDT
Grand Ballroom CD

5:25pm PDT

HydraRPC: RPC in the CXL Era
Wednesday July 10, 2024 5:25pm - 5:40pm PDT
Teng Ma, Alibaba Group; Zheng Liu, Zhejiang University and Alibaba Group; Chengkun Wei, Zhejiang University; Jialiang Huang, Alibaba Group and Tsinghua University; Youwei Zhuo, Alibaba Group and Peking University; Haoyu Li, Zhejiang University; Ning Zhang, Yijin Guan, and Dimin Niu, Alibaba Group; Mingxing Zhang, Tsinghua University; Tao Ma, Alibaba Group

In this paper, we present HydraRPC, which utilizes CXL-attached HDM for data transmission. By leveraging CXL, HydraRPC can benefit from memory sharing, memory semantics, and high scalability. As a result, expensive network round trips, memory copying, and serialization/deserialization are eliminated. Since CXL.cache protocols are not fully supported, we employ non-cacheable sharing to bypass the CPU cache and design a busy-polling-free notification mechanism. This ensures efficient data transmission without the need for constant polling. We conducted evaluations of HydraRPC on real CXL hardware, which showcased the potential efficiency of utilizing CXL HDM to build RPC systems.

https://www.usenix.org/conference/atc24/presentation/ma
Wednesday July 10, 2024 5:25pm - 5:40pm PDT
Grand Ballroom EF

5:40pm PDT

ExtMem: Enabling Application-Aware Virtual Memory Management for Data-Intensive Applications
Wednesday July 10, 2024 5:40pm - 5:55pm PDT
Sepehr Jalalian, Shaurya Patel, Milad Rezaei Hajidehi, Margo Seltzer, and Alexandra Fedorova, University of British Columbia

For over forty years, researchers have demonstrated that operating system memory managers often fall short in supporting memory-hungry applications. The problem is even more critical today, with disaggregated memory, new memory technologies, tera-scale machine learning models, large-scale graph processing, and other memory-intensive applications. Past attempts to provide application-specific memory management either required significant in-kernel changes or suffered from high overhead. We present ExtMem, a flexible framework for providing application-specific memory management. It differs from prior solutions in three ways: (1) It is compatible with today’s Linux deployments, (2) it is a general-purpose substrate for addressing various memory and storage backends, and (3) it is performant in multithreaded environments. ExtMem allows for easy and rapid prototyping of new memory management algorithms, easy collection of memory patterns and statistics, and immediate deployment of isolated custom memory management.

https://www.usenix.org/conference/atc24/presentation/jalalian
Wednesday July 10, 2024 5:40pm - 5:55pm PDT
Grand Ballroom EF

6:00pm PDT

OSDI ’24 Poster Session and Reception
Wednesday July 10, 2024 6:00pm - 7:30pm PDT
Sponsored by Amazon
Wednesday July 10, 2024 6:00pm - 7:30pm PDT
Santa Clara Ballroom

7:30pm PDT

Joint Student Meet-up
Wednesday July 10, 2024 7:30pm - 8:30pm PDT
All student attendees from both USENIX ATC and OSDI are invited to an informal mixer following the OSDI '24 Poster Session and Reception. Snacks and drinks will be provided.

Refreshments courtesy of Roblox
Wednesday July 10, 2024 7:30pm - 8:30pm PDT
Bayshore Room

7:30pm PDT

Google Sponsor Event: Ask-Me-Anything Panel Session
Wednesday July 10, 2024 7:30pm - 8:30pm PDT
Please join us for this session where a diverse panel of Googlers will share their perspectives on systems research and innovation at Google. We will kick off the panel at 7:30 pm and answer all your questions (well, most of them). Beverages and desserts will be available.

Panelists: David Culler, Hank Levy, Arif Merchant, Jeff Mogul, Mangpo Phothilimthana, Amy Tai.

All attendees from both USENIX ATC and OSDI are invited to join. Snacks and drinks will be provided.
Wednesday July 10, 2024 7:30pm - 8:30pm PDT
Magnolia Room

7:30pm PDT

Birds-of-a-Feather Sessions (BoFs)
Wednesday July 10, 2024 7:30pm - 10:30pm PDT
Registered attendees may schedule Birds-of-a-Feather sessions (BoFs) and reserve meeting rooms for them in one-hour increments via the BoFs schedule grid posted outside the badge pickup area. The Attendee Guide, which will be sent to registered attendees shortly before the event, contains more details for scheduling a BoF. Each room will be set with a projector and screen.
Wednesday July 10, 2024 7:30pm - 10:30pm PDT
Central Room, Tasman Room
 
Thursday, July 11
 

8:00am PDT

Continental Breakfast
Thursday July 11, 2024 8:00am - 9:00am PDT
Thursday July 11, 2024 8:00am - 9:00am PDT
Grand Ballroom Foyer

8:00am PDT

Badge Pickup
Thursday July 11, 2024 8:00am - 5:00pm PDT
Thursday July 11, 2024 8:00am - 5:00pm PDT
Lobby West

9:00am PDT

Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
Thursday July 11, 2024 9:00am - 9:20am PDT
Yi Zhai, University of Science and Technology of China; Sijia Yang, Huawei Technologies Co., Ltd.; Keyu Pan, ByteDance Ltd.; Renwei Zhang, Huawei Technologies Co., Ltd.; Shuo Liu, University of Science and Technology of China; Chao Liu and Zichun Ye, Huawei Technologies Co., Ltd.; Jianmin Ji, University of Science and Technology of China; Jie Zhao, Hunan University; Yu Zhang and Yanyong Zhang, University of Science and Technology of China

Obtaining high-performance tensor programs with high efficiency continues to be a substantial challenge. Approaches that favor efficiency typically limit their exploration space through heuristic constraints, which often lack generalizability. Conversely, approaches targeting high performance tend to create an expansive exploration space but employ ineffective exploration strategies.

We propose a tensor program generation framework for deep learning applications. Its core idea involves maintaining an expansive space to ensure high performance while performing powerful exploration with the help of language models to generate tensor programs efficiently. We thus transform the tensor program exploration task into a language model generation task. To facilitate this, we explicitly design a language-model-friendly tensor language that records decision information to represent tensor programs. During the compilation of target workloads, the tensor language model (TLM) combines knowledge from offline learning and previously made decisions to probabilistically sample the best decision in the current decision space. This approach allows more informed space exploration than the random sampling commonly used in previous approaches.

Experimental results indicate that TLM excels in delivering both efficiency and performance. Compared to fully tuned Ansor/MetaSchedule, TLM matches their performance with a compilation speedup of 61×. Furthermore, when evaluated against Roller, with the same compilation time, TLM improves the performance by 2.25×. Code available at https://github.com/zhaiyi000/tlm.

https://www.usenix.org/conference/osdi24/presentation/zhai
Thursday July 11, 2024 9:00am - 9:20am PDT
Grand Ballroom ABGH

9:00am PDT

Telescope: Telemetry for Gargantuan Memory Footprint Applications
Thursday July 11, 2024 9:00am - 9:25am PDT
Alan Nair, Sandeep Kumar, and Aravinda Prasad, Intel Labs; Ying Huang, Intel Corporation; Andy Rudoff and Sreenivas Subramoney, Intel Labs

Data-hungry applications that require terabytes of memory have become widespread in recent years. To meet the memory needs of these applications, data centers are embracing tiered memory architectures with near and far memory tiers. Precise, efficient, and timely identification of hot and cold data and their placement in appropriate tiers is critical for performance in such systems. Unfortunately, the existing state-of-the-art telemetry techniques for hot and cold data detection are ineffective at terabyte scale.

We propose Telescope, a novel technique that profiles different levels of the application's page table tree for fast and efficient identification of hot and cold data. Telescope is based on the observation that for a memory- and TLB-intensive workload, higher levels of a page table tree are also frequently accessed during a hardware page table walk. Hence, the hotness of the higher levels of the page table tree essentially captures the hotness of its subtrees or address space sub-regions at a coarser granularity. We exploit this insight to quickly converge on even a few megabytes of hot data and efficiently identify several gigabytes of cold data in terabyte-scale applications. Importantly, such a technique can seamlessly scale to petabyte-scale applications.

Telescope's telemetry achieves 90%+ precision and recall at just 0.9% single CPU utilization for microbenchmarks with 5 TB memory footprint. Memory tiering based on Telescope results in 5.6% to 34% throughput improvement for real-world benchmarks with 1–2 TB memory footprint compared to other state-of-the-art telemetry techniques.
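
The core observation above, that an accessed bit at a higher page-table level summarizes accesses to its entire subtree, can be illustrated with a small Python sketch over a synthetic tree; this is a conceptual toy, not the paper's implementation, and a real profiler would read hardware A-bits from the page tables themselves.

    # Conceptual sketch: descend into a page-table subtree only if its
    # summarized accessed bit is set, so cold regions are pruned cheaply.
    import random

    def build(depth, fanout=8, hot_prob=0.05):
        """Leaf = a page (True if accessed); inner node = list of child subtrees."""
        if depth == 0:
            return random.random() < hot_prob
        return [build(depth - 1, fanout, hot_prob) for _ in range(fanout)]

    def a_bit(node):
        """The accessed bit at a level is set iff anything below was touched."""
        return node if isinstance(node, bool) else any(a_bit(c) for c in node)

    def count_hot_pages(node):
        """Visit a child subtree only when its summarized A-bit is set."""
        if isinstance(node, bool):
            return int(node)
        return sum(count_hot_pages(c) for c in node if a_bit(c))

    if __name__ == "__main__":
        random.seed(1)
        tree = build(depth=3)        # roughly PUD -> PMD -> PTE -> pages
        print("hot pages found:", count_hot_pages(tree), "out of", 8 ** 3)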

https://www.usenix.org/conference/atc24/presentation/nair
Thursday July 11, 2024 9:00am - 9:25am PDT
Grand Ballroom CD

9:00am PDT

WingFuzz: Implementing Continuous Fuzzing for DBMSs
Thursday July 11, 2024 9:00am - 9:25am PDT
Jie Liang, Zhiyong Wu, and Jingzhou Fu, Tsinghua University; Yiyuan Bai and Qiang Zhang, Shuimu Yulin Technology Co., Ltd.; Yu Jiang, Tsinghua University

Database management systems (DBMSs) are critical components within software ecosystems, and their security and stability are paramount. In recent years, fuzzing has emerged as a prominent automated testing technique, effectively identifying vulnerabilities in various DBMSs. Nevertheless, many of these fuzzers require specific adaptation to a particular version of a DBMS. Employing these techniques to test enterprise-level DBMSs continuously poses challenges due to the diverse specifications of DBMSs and the code changes during their rapid version evolution.

In this paper, we present the industry practice of implementing continuous DBMS fuzzing on enterprise-level DBMSs like ClickHouse. We summarize three main obstacles to such an implementation, namely the diverse SQL grammars in test case generation, the ongoing evolution of the codebase during continuous testing, and the disturbance of noise during anomaly analysis. We propose WingFuzz, which utilizes specification-based mutator generation, corpus-driven evolving code fuzzing, and noise-resilient anomaly assessment to address them. By working with the engineers in continuous DBMS fuzzing, we have found a total of 236 previously undiscovered bugs in 12 widely used enterprise-level DBMSs including ClickHouse, DamengDB, and TenDB. Due to these favorable test results, our efforts received recognition and cooperation invitations from some DBMS vendors. For example, ClickHouse’s CTO remarked, "Which tool did you use to find this test case? We need to integrate it into our CI." WingFuzz has since been successfully integrated into ClickHouse’s development process.
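
To illustrate what specification-based mutator generation might look like in miniature, the sketch below derives a mutator from a tiny, invented SQL grammar fragment; the spec and helper names are hypothetical and far simpler than the DBMS specifications WingFuzz actually consumes.

    # Toy specification-driven SQL generator and mutator (illustration only).
    import random

    # A deliberately tiny "specification" of one dialect's SELECT grammar.
    SPEC = {
        "select": ["SELECT {cols} FROM {table} {clause}"],
        "cols":   ["*", "id", "id, name"],
        "table":  ["t1", "t2"],
        "clause": ["", "WHERE id > {num}", "ORDER BY id {dir}", "LIMIT {num}"],
        "num":    ["0", "1", "42"],
        "dir":    ["ASC", "DESC"],
    }

    def generate(rule="select", spec=SPEC):
        """Expand a rule into a concrete statement by random choice per slot."""
        out = random.choice(spec[rule])
        while "{" in out:
            slot = out[out.index("{") + 1:out.index("}")]
            out = out.replace("{" + slot + "}", generate(slot, spec), 1)
        return out.strip()

    def mutate(stmt, spec=SPEC):
        """Grammar-aware mutation: re-derive one clause instead of flipping bytes."""
        head = stmt.split(" WHERE ")[0].split(" ORDER BY ")[0].split(" LIMIT ")[0]
        return (head + " " + generate("clause", spec)).strip()

    if __name__ == "__main__":
        random.seed(7)
        seed_stmt = generate()
        print("seed:  ", seed_stmt)
        print("mutant:", mutate(seed_stmt))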

https://www.usenix.org/conference/atc24/presentation/liang
Thursday July 11, 2024 9:00am - 9:25am PDT
Grand Ballroom EF

9:20am PDT

Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
Thursday July 11, 2024 9:20am - 9:40am PDT
Lei Wang, University of Chinese Academy of Sciences & Microsoft Research; Lingxiao Ma, Shijie Cao, Quanlu Zhang, and Jilong Xue, Microsoft Research; Yining Shi, Peking University & Microsoft Research; Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, Yuqing Yang, and Mao Yang, Microsoft Research

The increasing demand for improving deep learning model performance has led to a paradigm shift in supporting low-precision computation to harness the robustness of deep learning to errors. Despite the emergence of new low-precision data types and optimization approaches, existing hardware and software have insufficient and inefficient support for those evolving data types, making it challenging to achieve real performance gains through low-precision computing.

This paper introduces Ladder, a novel compiler designed to bridge the gap between evolving custom data types and the fixed precision formats supported by current hardware. Leveraging a general type system, tType, and an extended tensor expression, Ladder transforms deep neural network (DNN) computations into optimized computing pipelines with custom data types as first-class citizens, exposing an optimization space for efficiently handling data storage, accesses, and type conversions. Ladder employs a new set of tensor scheduling primitives and a hardware-aware optimization policy to navigate the complex transformation space, ensuring optimal performance across different memory layers and DNN operators. Our evaluation demonstrates Ladder's capability to systematically support a wide array of low-bit precision custom data types, significantly enhancing the performance of DNN computations on modern accelerators without necessitating hardware modifications. This innovation empowers model designers with the ability to explore data type optimizations and offers hardware vendors a flexible solution to expand their support for diverse precision formats.

https://www.usenix.org/conference/osdi24/presentation/wang-lei
Thursday July 11, 2024 9:20am - 9:40am PDT
Grand Ballroom ABGH

9:25am PDT

An Empirical Study of Rust-for-Linux: The Success, Dissatisfaction, and Compromise
Thursday July 11, 2024 9:25am - 9:50am PDT
Hongyu Li, Beijing University of Posts and Telecommunications; Liwei Guo, University of Electronic Science and Technology of China; Yexuan Yang, Shangguang Wang, and Mengwei Xu, Beijing University of Posts and Telecommunications

Developed for over 30 years, Linux has already become the computing foundation for today's digital world; from gigantic, complex mainframes (e.g., supercomputers) to cheap, wimpy embedded devices (e.g., IoTs), countless applications are built on top of it. Yet, such an infrastructure has been plagued by numerous memory and concurrency bugs since the day it was born, because many rogue memory operations are permitted by the C language. A recent project, Rust-for-Linux (RFL), has the potential to address Linux's safety concerns once and for all -- by incorporating Rust's static ownership and type checkers into the kernel code, the kernel may finally be free from memory and concurrency bugs without hurting its performance. While RFL has gradually matured and even been merged into the Linux mainline, it has rarely been studied, and it remains unclear whether it has indeed reconciled the safety and performance dilemma for the kernel.

To this end, we conduct the first empirical study on RFL to understand its status quo and benefits, especially on how Rust fuses with Linux and whether the fusion assures driver safety without overhead. We collect and analyze 6 key RFL drivers, which involve hundreds of issues and PRs, thousands of GitHub commits and exchanges on the Linux mailing list, as well as over 12K discussions on Zulip. We have found that while Rust mitigates kernel vulnerabilities, fully eliminating them is beyond Rust's capability; what is more, if not handled properly, its safety assurance can even cost developers dearly in terms of both runtime overhead and development effort.

https://www.usenix.org/conference/atc24/presentation/li-hongyu
Thursday July 11, 2024 9:25am - 9:50am PDT
Grand Ballroom CD

9:25am PDT

Balancing Analysis Time and Bug Detection: Daily Development-friendly Bug Detection in Linux
Thursday July 11, 2024 9:25am - 9:50am PDT
Keita Suzuki, Keio University; Kenta Ishiguro, Hosei University; Kenji Kono, Keio University

Linux, a battle-tested codebase, is known to suffer from many bugs despite its extensive testing mechanisms. While many of these bugs require domain-specific knowledge for detection, a significant portion matches well-known bug patterns. Even though these bugs can be found with existing tools, our simple check of Linux kernel patches suggests that these tools are not used much in the developer's daily workflow. The lack of usage is probably due to the well-known trade-off between analysis time and bug detection capabilities: tools typically employ complex analysis to effectively and comprehensively find bugs in return for a long analysis time, or focus on a short analysis time by only employing elementary analyses and thus can only find a very limited number of bugs. Ideally, developers expect the tools to incur short analysis time, while still finding many bugs to use them in daily development.

This paper explores an approach that balances this trade-off by focusing on bugs that can be found with less computationally complex analysis methods, and by limiting the analysis scope to each source file. To achieve this, we propose a combination of computationally lightweight analyses and demonstrate our claim by designing FiTx, a framework for generating daily-development-friendly bug checkers that focus on well-known patterns. Despite its simplicity, FiTx successfully identified 47 new bugs in the Linux kernel version 5.15 within 2.5 hours, outperforming Clang Static Analyzer and CppCheck in both speed and bug detection. It demonstrates that focusing on less complex bug patterns can still contribute significantly to improving codebase health. FiTx can be embedded into the daily development routine, enabling early bug detection without sacrificing developers' time.
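
A flavor of the lightweight, per-function pattern checking described above can be conveyed with a toy Python checker for one well-known pattern (double free / use after free) over an invented event representation; FiTx itself operates on compiler-level program representations, not on a list like this.

    # Toy per-function checker for a well-known bug pattern (illustration only).
    def check_double_free(events):
        """events: list of ("alloc"|"free"|"use", variable_name) for one function."""
        freed, reports = set(), []
        for i, (kind, var) in enumerate(events):
            if kind == "free":
                if var in freed:
                    reports.append(f"double free of '{var}' at event {i}")
                freed.add(var)
            elif kind == "alloc":
                freed.discard(var)
            elif kind == "use" and var in freed:
                reports.append(f"use after free of '{var}' at event {i}")
        return reports

    if __name__ == "__main__":
        func_events = [("alloc", "buf"), ("use", "buf"),
                       ("free", "buf"), ("free", "buf")]
        for report in check_double_free(func_events):
            print(report)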

https://www.usenix.org/conference/atc24/presentation/suzuki
Thursday July 11, 2024 9:25am - 9:50am PDT
Grand Ballroom EF

9:40am PDT

Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents
Thursday July 11, 2024 9:40am - 10:00am PDT
Qizheng Zhang, Stanford University; Ali Imran, Purdue University; Enkeleda Bardhi, Sapienza University of Rome; Tushar Swamy and Nathan Zhang, Stanford University; Muhammad Shahbaz, Purdue University and University of Michigan; Kunle Olukotun, Stanford University

Recent work on in-network machine learning (ML) expects offline-trained models to operate well in modern networking environments. However, upon deployment, these models struggle to cope with fluctuating traffic patterns and network conditions and, therefore, must be validated and updated frequently in an online fashion.

This paper presents CARAVAN, a practical online learning system for in-network ML models. We tackle two primary challenges in facilitating online learning for networking: (a) the automatic labeling of evolving traffic and (b) the efficient monitoring and detection of model performance degradation to trigger retraining. CARAVAN repurposes existing systems (e.g., heuristics, access control lists, and foundation models)—not directly suitable for such dynamic environments—into high-quality labeling sources for generating labeled data for online learning. CARAVAN also introduces a new metric, accuracy proxy, to track model degradation and potential drift to efficiently trigger retraining. Our evaluations show that CARAVAN's labeling strategy enables in-network ML models to closely follow the changes in the traffic dynamics with a 30.3% improvement in F1 score (on average), compared to offline models. Moreover, CARAVAN sustains comparable inference accuracy to that of a continuous-learning system while consuming 61.3% less GPU compute time (on average) via accuracy proxy and retraining triggers.
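
The accuracy-proxy and retraining-trigger loop can be sketched as follows, assuming a sliding window, a fixed threshold, and a single labeling-agent callback; the function names and the exact proxy definition are illustrative assumptions, not CARAVAN's implementation.

    # Sketch: track agreement between the deployed model and a labeling agent,
    # and trigger retraining when the accuracy proxy drops below a threshold.
    from collections import deque

    def run_online_loop(model_predict, labeling_agent, traffic,
                        window=256, threshold=0.9, retrain=None):
        """Monitor agreement between the in-network model and labeling agents."""
        recent = deque(maxlen=window)       # 1 if model agrees with agent label
        for flow_features in traffic:
            pred = model_predict(flow_features)
            label = labeling_agent(flow_features)     # weak/automatic label
            recent.append(int(pred == label))
            if len(recent) == window:
                proxy = sum(recent) / window          # accuracy proxy
                if proxy < threshold and retrain is not None:
                    retrain(list(recent))             # e.g. kick off fine-tuning
                    recent.clear()

    # Example usage with toy stand-ins for the model and the labeling agent.
    if __name__ == "__main__":
        run_online_loop(
            model_predict=lambda x: x % 2,                    # toy "model"
            labeling_agent=lambda x: (x % 2) ^ (x % 7 == 0),  # noisy "agent"
            traffic=range(2000),
            retrain=lambda window: print("retraining triggered"),
        )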

https://www.usenix.org/conference/osdi24/presentation/zhang-qizheng
Thursday July 11, 2024 9:40am - 10:00am PDT
Grand Ballroom ABGH

9:50am PDT

Scalable and Effective Page-table and TLB management on NUMA Systems
Thursday July 11, 2024 9:50am - 10:15am PDT
Bin Gao, Qingxuan Kang, and Hao-Wei Tee, National University of Singapore; Kyle Timothy Ng Chu, Horizon Quantum Computing; Alireza Sanaee, Queen Mary University of London; Djordje Jevdjic, National University of Singapore

Memory management operations that modify page-tables, typically performed during memory allocation/deallocation, are infamous for their poor performance in highly threaded applications, largely due to process-wide TLB shootdowns that the OS must issue due to the lack of hardware support for TLB coherence. We study these operations in NUMA settings, where we observe up to 40x overhead for basic operations such as munmap or mprotect. The overhead further increases if page-table replication is used, where complete coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it creates additional penalties upon any page-table changes due to the need to maintain all replicas coherent.

In this paper, we propose a novel page-table management mechanism, called Hydra, to enable transparent, on-demand, and partial page-table replication across NUMA nodes in order to perform address translation locally, while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that Hydra's precise knowledge of page-table sharers can be leveraged to significantly reduce the number of TLB shootdowns issued upon any memory-management operation. As a result, Hydra not only avoids replication-related slowdowns, but also provides significant speedup over the baseline on memory allocation/deallocation and access control operations. We implement Hydra in Linux on x86_64, evaluate it on 4- and 8-socket systems, and show that Hydra achieves the full benefits of eager page-table replication on a wide range of applications, while also achieving a 12% and 36% runtime improvement on Webserver and Memcached respectively due to a significant reduction in TLB shootdowns.
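
The sharer-tracking idea above, knowing exactly which NUMA nodes replicated a given page-table page so that invalidations can be targeted rather than broadcast, can be sketched conceptually as below; the names are invented and no kernel mechanisms are modeled.

    # Conceptual sketch: track replica holders per page-table page and send
    # shootdowns only to those nodes instead of broadcasting process-wide.
    class ReplicaDirectory:
        def __init__(self, all_nodes):
            self.all_nodes = set(all_nodes)
            self.sharers = {}             # pt_page -> set of nodes holding a copy

        def replicate_on_demand(self, pt_page, node):
            """A node builds its own local copy of this page-table page."""
            self.sharers.setdefault(pt_page, set()).add(node)

        def invalidate(self, pt_page):
            """Return only the nodes that must receive a shootdown."""
            return self.sharers.get(pt_page, set())

    if __name__ == "__main__":
        d = ReplicaDirectory(all_nodes=range(8))
        d.replicate_on_demand("pmd@0x1000", node=2)
        d.replicate_on_demand("pmd@0x1000", node=5)
        print("broadcast would hit:     ", sorted(d.all_nodes))
        print("targeted shootdown hits: ", sorted(d.invalidate("pmd@0x1000")))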

https://www.usenix.org/conference/atc24/presentation/gao-bin-scalable
Thursday July 11, 2024 9:50am - 10:15am PDT
Grand Ballroom CD

9:50am PDT

Kivi: Verification for Cluster Management
Thursday July 11, 2024 9:50am - 10:15am PDT
Bingzhe Liu and Gangmuk Lim, UIUC; Ryan Beckett, Microsoft; P. Brighten Godfrey, UIUC and Broadcom

Modern cloud infrastructure is powered by cluster management systems such as Kubernetes and Docker Swarm. While these systems seek to minimize users’ operational burden, the complex, dynamic, and non-deterministic nature of these systems makes them hard to reason about, potentially leading to failures ranging from performance degradation to outages.

We present Kivi, the first system for verifying controllers and their configurations in cluster management systems. Kivi focuses on the popular system Kubernetes and models its controllers and events as processes whose interleavings are exhaustively checked via model checking. Central to handling autoscaling and large-scale deployments are our modeling optimizations and our design, which seeks to find violations in a smaller, reduced topology. We show that Kivi is effective and accurate in finding issues in realistic and complex scenarios and showcase two new issues in Kubernetes controller source code.
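
The kind of exhaustive interleaving exploration described above can be illustrated with a toy explicit-state model checker: two invented "controllers" are modeled as guarded transitions over a tiny global state, and every interleaving is searched while an invariant is checked; Kivi's Kubernetes models are of course far richer.

    # Toy explicit-state model checker over controller interleavings.
    from collections import deque

    # Global state: (replicas, desired). An "autoscaler" raises desired,
    # a "deployment controller" moves replicas toward desired one at a time.
    def autoscaler(state):
        replicas, desired = state
        if desired < 3:
            yield (replicas, desired + 1)

    def deployment_controller(state):
        replicas, desired = state
        if replicas < desired:
            yield (replicas + 1, desired)
        elif replicas > desired:
            yield (replicas - 1, desired)

    def check(initial, controllers, invariant):
        """BFS over every interleaving of controller steps; report a violation."""
        seen, frontier = {initial}, deque([(initial, [])])
        while frontier:
            state, trace = frontier.popleft()
            if not invariant(state):
                return state, trace                 # counterexample found
            for ctrl in controllers:
                for nxt in ctrl(state):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, trace + [ctrl.__name__]))
        return None

    if __name__ == "__main__":
        violation = check(
            initial=(1, 1),
            controllers=[autoscaler, deployment_controller],
            invariant=lambda s: s[0] <= 3,          # never more than 3 replicas
        )
        print("violation:", violation)   # None here; tighten the bound to see one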

https://www.usenix.org/conference/atc24/presentation/liu-bingzhe
Thursday July 11, 2024 9:50am - 10:15am PDT
Grand Ballroom EF

10:00am PDT

nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
Thursday July 11, 2024 10:00am - 10:20am PDT
Zhiqi Lin, University of Science and Technology of China; Youshan Miao, Quanlu Zhang, Fan Yang, and Yi Zhu, Microsoft Research; Cheng Li, University of Science and Technology of China; Saeed Maleki, xAI; Xu Cao, Ning Shang, Yilei Yang, Weijiang Xu, and Mao Yang, Microsoft Research; Lintao Zhang, BaseBit Technologies; Lidong Zhou, Microsoft Research

With the growing model size of deep neural networks (DNN), deep learning training is increasingly relying on handcrafted search spaces to find efficient parallelization execution plans. However, our study shows that existing search spaces exclude plans that significantly impact the training performance of well-known DNN models (e.g., AlphaFold2) under important settings, such as when handling large embedding tables in large language models.

To address this problem, we propose nnScaler, a framework that generates efficient parallelization plans for deep learning training. Instead of relying on the existing search space, nnScaler advocates a more general approach that empowers domain experts to construct their own search space through three primitives, op-trans, op-assign, and op-order, which capture model transformation and the temporal-spatial scheduling of the transformed model of any parallelization plans. To avoid space explosion, nnScaler allows the application of constraints to those primitives during space construction. With the proposed primitives and constraints, nnScaler can compose existing search spaces as well as new ones. Experiments show that nnScaler can find new parallelization plans in new search spaces that achieve up to 3.5× speedup compared to solutions such as DeepSpeed, Megatron-LM, and Alpa for popular DNN models like SwinTransformer and AlphaFold2.
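
Purely as a schematic, the sketch below shows how the three primitives named above (op-trans, op-assign, op-order) might compose a plan for a single operator, with a constraint pruning candidate partition degrees; the signatures and the constraint are invented and do not reflect nnScaler's actual API.

    # Schematic composition of op-trans / op-assign / op-order with a constraint.
    def op_trans(op, degree):
        """Transform an operator into 'degree'-many partitioned sub-operators."""
        return [f"{op}.part{i}" for i in range(degree)]

    def op_assign(sub_ops, devices):
        """Assign each transformed sub-operator to a device (round-robin here)."""
        return {s: devices[i % len(devices)] for i, s in enumerate(sub_ops)}

    def op_order(assignment):
        """Pick a temporal order for sub-operators sharing a device."""
        return sorted(assignment, key=lambda s: (assignment[s], s))

    def enumerate_plans(op, devices, constraint):
        """Compose the primitives over candidate degrees, keeping valid plans."""
        plans = []
        for degree in (1, 2, 4, 8):
            subs = op_trans(op, degree)
            assign = op_assign(subs, devices)
            if constraint(degree, assign):          # prune before scheduling
                plans.append((degree, op_order(assign)))
        return plans

    if __name__ == "__main__":
        devices = ["gpu0", "gpu1", "gpu2", "gpu3"]
        # Example constraint: partition degree must not exceed the device count.
        plans = enumerate_plans("matmul", devices,
                                constraint=lambda d, a: d <= len(devices))
        for degree, order in plans:
            print(degree, order)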

https://www.usenix.org/conference/osdi24/presentation/lin-zhiqi
Thursday July 11, 2024 10:00am - 10:20am PDT
Grand Ballroom ABGH

10:15am PDT

UniMem: Redesigning Disaggregated Memory within A Unified Local-Remote Memory Hierarchy
Thursday July 11, 2024 10:15am - 10:40am PDT
Yijie Zhong, Minqiang Zhou, and Zhirong Shen, Xiamen University; Jiwu Shu, Xiamen University and Minjiang University

Disaggregated memory (DM) has been proposed as a feasible solution towards scaling memory capacity. A variety of memory disaggregation approaches have been introduced to facilitate the practical use of DM. The cache-coherent-based DM system, which relies on a cache-coherent accelerator, can offer network-attached memory as NUMA memory. However, the current cache-coherent-based DM system introduces an extra address translation for each remote memory access. Meanwhile, the local cache mechanism of existing approaches overlooks the inherent issues of cache thrashing and pollution that arise in DM systems. This paper presents UniMem, a cache-coherent-based DM system that proposes a unified local-remote memory hierarchy to remove the extra indirection layer on the remote memory access path. To optimize local memory utilization, UniMem redesigns the local cache mechanism to prevent cache thrashing and pollution. Furthermore, UniMem puts forth a page migration mechanism that promotes frequently used pages from device-attached memory to host memory based not only on page hotness but also on hotness fragmentation. Compared to state-of-the-art systems, UniMem reduces the average memory access time by up to 76.4% and offers substantial improvement in terms of data amplification.

https://www.usenix.org/conference/atc24/presentation/zhong
Thursday July 11, 2024 10:15am - 10:40am PDT
Grand Ballroom CD

10:15am PDT

Monarch: A Fuzzing Framework for Distributed File Systems
Thursday July 11, 2024 10:15am - 10:40am PDT
Tao Lyu, EPFL; Liyi Zhang, University of Waterloo; Zhiyao Feng, Yueyang Pan, and Yujie Ren, EPFL; Meng Xu, University of Waterloo; Mathias Payer and Sanidhya Kashyap, EPFL

Distributed file systems (DFSes) are prone to bugs. Although numerous bug-finding techniques have been applied to DFSes, static analysis does not scale well with the sheer complexity of DFS codebases while dynamic methods (e.g., regression testing) are limited by the quality of test cases. Although both can be improved by pouring in manual effort, they are less practical when facing a diverse set of real-world DFSes. Fuzzing, on the other hand, has shown great success in local systems. However, several problems exist if we apply existing fuzzers to DFSes as they 1) cannot test multiple components of DFSes holistically; 2) miss the critical testing aspects of DFSes (e.g., distributed faults); 3) have not yet explored the practical state representations as fuzzing feedback; and 4) lack checkers for asserting semantic bugs unique to DFSes.

In this paper, we introduce MONARCH, a multi-node fuzzing framework to test all POSIX-compliant DFSes under one umbrella. MONARCH pioneers push-button fuzzing for DFSes, adding a new set of building blocks to the fuzzing toolbox: 1) A multi-node fuzzing architecture for testing diverse DFSes from a holistic perspective; 2) A two-step mutator for testing DFSes with syscalls and faults; 3) Practical execution state representations with a unified coverage collection scheme across execution contexts; 4) A new DFS semantic checker, SYMSC. We applied MONARCH to six DFSes and uncovered a total of 48 bugs, including a bug whose existence can be traced back to the initial release of the DFSes.

https://www.usenix.org/conference/atc24/presentation/lyu
Thursday July 11, 2024 10:15am - 10:40am PDT
Grand Ballroom EF

10:20am PDT

ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications
Thursday July 11, 2024 10:20am - 10:40am PDT
Yuhan Liu, University of Chicago; Chengcheng Wan, East China Normal University; Kuntai Du, Henry Hoffmann, and Junchen Jiang, University of Chicago; Shan Lu, University of Chicago and Microsoft Research; Michael Maire, University of Chicago

ML APIs have greatly relieved application developers of the burden of designing and training their own neural network models—classifying objects in an image can now be as simple as one line of Python code to call an API. However, these APIs offer the same pre-trained models regardless of how their output is used by different applications. This can be suboptimal as not all ML inference errors can cause application failures, and the distinction between inference errors that can or cannot cause failures varies greatly across applications.

To tackle this problem, we first study 77 real-world applications, which collectively use six ML APIs from two providers, to reveal common patterns of how ML API output affects applications' decision processes. Inspired by the findings, we propose ChameleonAPI, an optimization framework for ML APIs, which takes effect without changing the application source code. ChameleonAPI provides application developers with a parser that automatically analyzes the application to produce an abstract of its decision process, which is then used to devise an application-specific loss function that only penalizes API output errors critical to the application. ChameleonAPI uses the loss function to efficiently train a neural network model customized for each application and deploys it to serve API invocations from the respective application via the existing interface. Compared to a baseline that selects the best-of-all commercial ML API, we show that ChameleonAPI reduces incorrect application decisions by 43%.
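
The application-aware loss idea can be illustrated with a minimal sketch under assumed names: an API output error is penalized only if it flips the branch the application would take; the decision function and the 0/1 weighting below are illustrative, not ChameleonAPI's actual loss formulation.

    # Sketch: penalize only API errors that change the application's decision.
    def app_decision(labels, blocked=frozenset({"weapon", "drugs"})):
        """Toy application logic: block a post if any blocked label is present."""
        return "block" if set(labels) & blocked else "allow"

    def application_aware_loss(pred_labels, true_labels):
        """0/1 loss on the application's decision rather than on raw labels."""
        return float(app_decision(pred_labels) != app_decision(true_labels))

    def label_level_loss(pred_labels, true_labels):
        """Conventional loss: any label mismatch counts, even harmless ones."""
        return float(set(pred_labels) != set(true_labels))

    if __name__ == "__main__":
        truth = ["cat", "weapon"]
        pred  = ["dog", "weapon"]      # wrong label, but same "block" decision
        print("label-level loss:      ", label_level_loss(pred, truth))        # 1.0
        print("application-aware loss:", application_aware_loss(pred, truth))  # 0.0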

https://www.usenix.org/conference/osdi24/presentation/liu
Thursday July 11, 2024 10:20am - 10:40am PDT
Grand Ballroom ABGH

10:40am PDT

Break with Refreshments
Thursday July 11, 2024 10:40am - 11:10am PDT
Thursday July 11, 2024 10:40am - 11:10am PDT
Grand Ballroom Foyer

11:10am PDT

SquirrelFS: using the Rust compiler to check file-system crash consistency
Thursday July 11, 2024 11:10am - 11:30am PDT
Hayley LeBlanc, Nathan Taylor, James Bornholt, and Vijay Chidambaram, University of Texas at Austin

This work introduces a new approach to building crash-safe file systems for persistent memory. We exploit the fact that Rust's typestate pattern allows compile-time enforcement of a specific order of operations. We introduce a novel crash-consistency mechanism, Synchronous Soft Updates, that boils down crash safety to enforcing ordering among updates to file-system metadata. We employ this approach to build SquirrelFS, a new file system with crash-consistency guarantees that are checked at compile time. SquirrelFS avoids the need for separate proofs, instead incorporating correctness guarantees into the typestate itself. Compiling SquirrelFS only takes tens of seconds; successful compilation indicates crash consistency, while an error provides a starting point for fixing the bug. We evaluate SquirrelFS against state-of-the-art file systems such as NOVA and WineFS, and find that SquirrelFS achieves similar or better performance on a wide range of benchmarks and applications.

https://www.usenix.org/conference/osdi24/presentation/leblanc
Thursday July 11, 2024 11:10am - 11:30am PDT
Grand Ballroom ABGH

11:10am PDT

Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
Thursday July 11, 2024 11:10am - 11:35am PDT
Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang, Kuaishou Technology

Recent advancements in training large-scale models have centered on optimizing activation strategies and exploring various parallel training options. One research avenue focuses on enhancing activation-related operations, such as offloading and recomputing. However, there is room for further refinement in these strategies to improve the balance between computation and memory utilization. Another line of work explores different training parallelisms, which often require extensive parameter tuning and achieve suboptimal combinations of parallel options.

To tackle these challenges, this paper introduces a novel method for losslessly accelerating the training of large language models. Specifically, two efficient activation rematerialization strategies are proposed: Pipeline-Parallel-Aware Offloading, which maximizes the utilization of host memory for storing activations, and Compute-Memory Balanced Checkpointing, which seeks a practical equilibrium between activation memory and computational efficiency. Additionally, the paper presents an extremely efficient search method for optimizing parameters for hybrid parallelism, considering both offloading and checkpointing to achieve optimal performance. The efficacy of the proposed method is demonstrated through extensive experiments on public benchmarks with diverse model sizes and context window sizes. For example, the method significantly increases Model FLOPs Utilization (MFU) from 32.3% to 42.7% for a 175B Llama-like model with a context window size of 32,768 on 256 NVIDIA H800 GPUs.

https://www.usenix.org/conference/atc24/presentation/yuan
Thursday July 11, 2024 11:10am - 11:35am PDT
Grand Ballroom CD

11:10am PDT

A Secure, Fast, and Resource-Efficient Serverless Platform with Function REWIND
Thursday July 11, 2024 11:10am - 11:35am PDT
Jaehyun Song and Bumsuk Kim, Sungkyunkwan University; Minwoo Kwak, Yonsei University; Byoungyoung Lee, Seoul National University; Euiseong Seo, Sungkyunkwan University; Jinkyu Jeong, Yonsei University

Serverless computing often utilizes the warm container technique to improve response times. However, this method, which allows the reuse of function containers across different function requests of the same type, creates persistent vulnerabilities in memory and file systems. These vulnerabilities can lead to security breaches such as data leaks. Traditional approaches to address these issues often suffer from performance drawbacks and high memory requirements due to extensive use of user-level snapshots and complex restoration processes.

The paper introduces REWIND, an innovative and efficient serverless function execution platform designed to address these security and efficiency concerns. REWIND ensures that after each function request, the container is reset to an initial state, free from any sensitive data, including a thorough restoration of the file system to prevent data leakage. It incorporates a kernel-level memory snapshot management system, which significantly lowers memory usage and accelerates the rewind process. Additionally, REWIND optimizes runtime by reusing memory regions and leveraging the temporal locality of function executions, enhancing performance while maintaining strict data isolation between requests. The REWIND prototype is implemented on OpenWhisk and Linux and evaluated with serverless benchmark workloads. The evaluation results demonstrate that REWIND provides substantial memory savings while maintaining high function execution performance. In particular, the low memory usage allows more warm containers to be kept alive, improving both the throughput and the latency of function execution while preserving isolation between function requests.

https://www.usenix.org/conference/atc24/presentation/song
Thursday July 11, 2024 11:10am - 11:35am PDT
Grand Ballroom EF

11:30am PDT

High-throughput and Flexible Host Networking for Accelerated Computing
Thursday July 11, 2024 11:30am - 11:50am PDT
Athinagoras Skiadopoulos, Zhiqiang Xie, and Mark Zhao, Stanford University; Qizhe Cai and Saksham Agarwal, Cornell University; Jacob Adelmann, David Ahern, Carlo Contavalli, Michael Goldflam, Vitaly Mayatskikh, Raghu Raja, and Daniel Walton, Enfabrica; Rachit Agarwal, Cornell University; Shrijeet Mukherjee, Enfabrica; Christos Kozyrakis, Stanford University

Modern network hardware is able to meet the stringent bandwidth demands of applications like GPU-accelerated AI. However, existing host network stacks offer a hard tradeoff between performance (in terms of sustained throughput when compared to network hardware capacity) and flexibility (in terms of the ability to select, customize, and extend different network protocols).

This paper explores a clean-slate approach to simultaneously offer high performance and flexibility. We present a co-design of the NIC hardware and the software stack to achieve this. The key idea in our design is the physical separation of the data path (payload transfer between network and application buffers) and the control path (header processing and transport-layer decisions). The NIC enables a high-performance zero-copy data path, independent of the placement of the application (CPU, GPU, FPGA, or other accelerators). The software stack provides a flexible control path by enabling the integration of any network protocol, executing in any environment (in the kernel, in user space, or in an accelerator).

We implement and evaluate ZeroNIC, a prototype that combines an FPGA-based NIC with a software stack that integrates the Linux TCP protocol. We demonstrate that ZeroNIC achieves RDMA-like throughput while maintaining the benefits of robust protocols like TCP under various network perturbations. For instance, ZeroNIC enables a single TCP flow to saturate a 100Gbps link while utilizing only 17% of a single CPU core. ZeroNIC improves NCCL and Redis throughput by 2.66X and 3.71X, respectively, over Linux TCP on a Mellanox ConnectX-6 NIC, without requiring application modifications.

https://www.usenix.org/conference/osdi24/presentation/skiadopoulos
Thursday July 11, 2024 11:30am - 11:50am PDT
Grand Ballroom ABGH

11:35am PDT

Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
Thursday July 11, 2024 11:35am - 12:00pm PDT
Taegeon Um, Byungsoo Oh, Minyoung Kang, Woo-Yeon Lee, Goeun Kim, Dongseob Kim, Youngtaek Kim, and Mohd Muzzammil, Samsung Research; Myeongjae Jeon, UNIST

As deep learning model sizes expand and new GPUs are released every year, the need for distributed training on heterogeneous GPUs rises to fully harness under-utilized low-end GPUs and reduce the cost of purchasing expensive high-end GPUs. In this paper, we introduce Metis, a system designed to automatically find efficient parallelism plans for distributed training on heterogeneous GPUs. Metis holistically optimizes several key system components, such as the profiler, cost estimator, and planner, which were previously limited to single GPU types, so that they efficiently leverage the compute power and memory capacity of diverse GPU types. This enables Metis to achieve fine-grained distribution of training workloads across heterogeneous GPUs, improving resource efficiency. However, a search space designed for automatic parallelism at this level of complexity would be prohibitively expensive to navigate.

To address this issue, Metis develops a new search algorithm that efficiently prunes large search spaces and balances loads with heterogeneity-awareness, while preferring data parallelism over tensor parallelism within a pipeline stage to take advantage of its superior computation and communication trade-offs. Our evaluation with three large models (GPT-3, MoE, and Wide-Resnet) on combinations of three types of GPUs demonstrates that Metis finds better parallelism plans than traditional methods, with 1.05–8.43× training speed-up, while requiring less profiling and search time. Compared to the oracle planning that delivers the fastest parallel training, Metis finds near-optimal solutions while reducing profiling and search overheads by orders of magnitude.

https://www.usenix.org/conference/atc24/presentation/um
Thursday July 11, 2024 11:35am - 12:00pm PDT
Grand Ballroom CD

11:35am PDT

SimEnc: A High-Performance Similarity-Preserving Encryption Approach for Deduplication of Encrypted Docker Images
Thursday July 11, 2024 11:35am - 12:00pm PDT
Tong Sun and Bowen Jiang, Zhejiang University; Borui Li, Southeast University; Jiamei Lv, Yi Gao, and Wei Dong, Zhejiang University

Encrypted Docker images are becoming increasingly popular in Docker registries for privacy. As the Docker registry is tasked with managing an increasing number of images, it becomes essential to implement deduplication to conserve storage space. However, deduplication for encrypted images is difficult because deduplication exploits identical content, while encryption tries to make all content look random. Existing state-of-the-art works try to decompress images and perform message-locked encryption (MLE) to deduplicate encrypted images. Unfortunately, our measurements uncover two limitations in current works: (i) even minor modifications to the image content can hinder MLE deduplication, and (ii) decompressing image layers increases the storage needed for duplicate data and significantly compromises user pull latency and deduplication throughput.

In this paper, we propose SimEnc, a high-performance similarity-preserving encryption approach for deduplication of encrypted Docker images. SimEnc is the first work to integrate the semantic hash technique into MLE, extracting semantic information among layers to improve the deduplication ratio. SimEnc builds on a fast similarity space selection mechanism for flexibility. Unlike existing works that completely decompress the layer, we explore a new similarity space via Huffman decoding that achieves a better deduplication ratio and performance. Experiments show that SimEnc outperforms both the state-of-the-art encrypted serverless platform and the plaintext Docker registry, reducing storage consumption by up to 261.7% and 54.2%, respectively. Meanwhile, SimEnc surpasses them in terms of pull latency.

https://www.usenix.org/conference/atc24/presentation/sun
Thursday July 11, 2024 11:35am - 12:00pm PDT
Grand Ballroom EF

11:50am PDT

IntOS: Persistent Embedded Operating System and Language Support for Multi-threaded Intermittent Computing
Thursday July 11, 2024 11:50am - 12:10pm PDT
Yilun Wu, Stony Brook University; Byounguk Min, Purdue University; Mohannad Ismail and Wenjie Xiong, Virginia Tech; Changhee Jung, Purdue University; Dongyoon Lee, Stony Brook University

This paper introduces INTOS, an embedded operating system and language support for multi-threaded intermittent computing on battery-less energy-harvesting platforms. INTOS simplifies programming with traditional "threads" and "transactions" with automatic undo-logging of persistent objects in non-volatile memory. While INTOS allows the use of volatile memory for performance and energy efficiency, conventional transactions do not ensure crash consistency of volatile register and memory states. To address this challenge, INTOS proposes a novel replay-and-bypass approach, eliminating the need for users to checkpoint volatile states. Upon power restoration, INTOS recovers non-volatile states by undoing the updates of power-interrupted transactions. To reconstruct volatile states, INTOS restarts each thread, bypassing committed transactions and system calls by returning recorded results without re-execution. INTOS seeks to build a persistent, full-fledged embedded OS, supporting priority-based preemptive multithreading while ensuring crash consistency even if power failure occurs during a system call or while some threads are blocked. Experiments on a commodity MSP430FR5994 platform show that, when subjected to an extreme power-failure rate of one failure per 1 ms, INTOS demonstrates 1.24x lower latency and 1.29x less energy consumption than prior work leveraging idempotent processing. This trend is even more pronounced on the Apollo 4 Blue Plus.

https://www.usenix.org/conference/osdi24/presentation/wu-yilun
Thursday July 11, 2024 11:50am - 12:10pm PDT
Grand Ballroom ABGH

12:00pm PDT

FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
Thursday July 11, 2024 12:00pm - 12:25pm PDT
Mengwei Xu, Dongqi Cai, Yaozong Wu, Xiang Li, and Shangguang Wang, Beijing University of Posts and Telecommunications (BUPT)

Large Language Models (LLMs) are transforming the landscape of mobile intelligence. Federated Learning (FL), a method to preserve user data privacy, is often employed in fine-tuning LLMs for downstream mobile tasks, i.e., FedLLM. A vital challenge of FedLLM is the tension between LLM complexity and the resource constraints of mobile devices.

In response to this challenge, this work introduces FwdFL, an innovative FL protocol designed to enhance FedLLM efficiency. The key idea of FwdFL is to employ backpropagation (BP)-free training methods, requiring devices only to execute ''perturbed inferences''. Consequently, FwdFL delivers substantially better memory and time efficiency (expedited by mobile NPUs and an expanded array of participant devices). FwdFL centers around three key designs: (1) it combines BP-free training with parameter-efficient training methods, an essential way to scale the approach to the LLM era; (2) it systematically and adaptively allocates computational loads across devices, striking a careful balance between convergence speed and accuracy; (3) it discriminatively samples perturbed predictions that are more valuable to model convergence. Comprehensive experiments illustrate FwdFL's significant advantages over conventional methods, including up to three orders of magnitude faster convergence and a 4.6× reduction in memory footprint. Uniquely, FwdFL paves the way for federated billion-parameter LLMs such as LLaMA on COTS mobile devices -- a feat previously unattained.
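
The abstract does not spell out how ''perturbed inferences'' yield gradients; a minimal sketch of one common BP-free estimator (a two-sided, SPSA-style finite difference, shown here on a toy quadratic loss purely for illustration; the loss function and update loop are placeholders, not FwdFL's API) is:

    import random

    def loss(w, data):
        # Toy quadratic loss standing in for a forward-only (inference) pass.
        return sum((wi - di) ** 2 for wi, di in zip(w, data))

    def perturbed_gradient(w, data, eps=1e-3):
        """Estimate a gradient from two perturbed inferences, with no backpropagation."""
        v = [random.gauss(0.0, 1.0) for _ in w]                    # random direction
        w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
        w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
        scale = (loss(w_plus, data) - loss(w_minus, data)) / (2 * eps)
        return [scale * vi for vi in v]                            # directional estimate

    # One illustrative on-device update step.
    w, data, lr = [0.0, 0.0], [1.0, -2.0], 0.05
    g = perturbed_gradient(w, data)
    w = [wi - lr * gi for wi, gi in zip(w, g)]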

https://www.usenix.org/conference/atc24/presentation/xu-mengwei
Thursday July 11, 2024 12:00pm - 12:25pm PDT
Grand Ballroom CD

12:00pm PDT

mmTLS: Scaling the Performance of Encrypted Network Traffic Inspection
Thursday July 11, 2024 12:00pm - 12:25pm PDT
Junghan Yoon, Seoul National University; Seunghyun Do and Duckwoo Kim, KAIST; Taejoong Chung, Virginia Tech; KyoungSoo Park, Seoul National University

Modern network-monitoring TLS middleboxes play a critical role in fighting abuse hidden in encrypted network traffic. Unfortunately, operating a TLS middlebox often incurs a huge computational overhead, as it must translate and relay encrypted traffic from one endpoint to the other. We observe that even a simple TLS proxy drops the throughput of end-to-end TLS sessions by 43% to 73%. Worse, recent TLS middlebox works that add security enhancements levy an even greater computational tax.

In this paper, we present mmTLS, a scalable TLS middlebox development framework that significantly improves traffic inspection performance and provides a TLS event programming library with which one can write a TLS middlebox with ease. mmTLS eliminates the traffic relaying cost by operating on a single end-to-end TLS session with secure session key sharing. This approach is not only beneficial to performance but also naturally guarantees all end-to-end TLS properties except confidentiality. To detect illegal content modification, mmTLS supplements a TLS record with a private tag whose key is kept secret to the TLS endpoints only. We find that the extra overhead of private tag generation and verification is minimal when augmented with the first tag generation. Our evaluation demonstrates that mmTLS outperforms the nginx TLS proxy in split-connection mode by a factor of 2.7 to 41.2, and achieves 179 Gbps of traffic relaying throughput.
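
The abstract does not give the tag construction; one plausible reading (an HMAC over each record under a key shared only by the endpoints, so a key-sharing middlebox can inspect but not silently modify content) can be sketched as follows. The function names and 32-byte tag length are illustrative assumptions, not mmTLS's wire format:

    import hashlib, hmac, os

    endpoint_key = os.urandom(32)     # assumed: known only to the two TLS endpoints

    def add_private_tag(record: bytes) -> bytes:
        tag = hmac.new(endpoint_key, record, hashlib.sha256).digest()
        return record + tag           # middlebox sees the record plus an opaque 32-byte tag

    def verify_private_tag(tagged: bytes) -> bytes:
        record, tag = tagged[:-32], tagged[-32:]
        expected = hmac.new(endpoint_key, record, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            raise ValueError("record was modified in transit")
        return record

    assert verify_private_tag(add_private_tag(b"GET / HTTP/1.1")) == b"GET / HTTP/1.1"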

https://www.usenix.org/conference/atc24/presentation/yoon
Thursday July 11, 2024 12:00pm - 12:25pm PDT
Grand Ballroom EF

12:10pm PDT

Data-flow Availability: Achieving Timing Assurance in Autonomous Systems
Thursday July 11, 2024 12:10pm - 12:30pm PDT
Ao Li and Ning Zhang, Washington University in St. Louis

Due to the continuous interaction with the physical world, autonomous cyber-physical systems (CPS) require both functional and temporal correctness. Despite recent advances in the theoretical foundation of real-time computing, leveraging these results efficiently in modern CPS platforms often requires domain expertise, and presents non-trivial challenges to many developers.

To understand the practical challenges in building real-time software, we conducted a survey of 189 software issues from 7 representative CPS open-source projects. Through this exercise, we found that most bugs are due to misalignment in time between cyber and physical states. This inspires us to abstract three key temporal properties: freshness, consistency, and stability. Using a newly developed concept, Data-flow Availability (DFA), which aims to capture temporal/availability expectation of data flow, we show how these essential properties can be represented as timing constraints on data flows. To realize the timing assurance from DFA, we designed and implemented Kairos, which automatically detects and mitigates timing constraint violations. To detect violations, Kairos translates the policy definition from the API-based annotations into run-time program instrumentation. To mitigate the violations, it provides an infrastructure to bridge semantic gaps between schedulers at different abstraction layers to allow for coordinated efforts. End-to-end evaluation on three real-world CPS platforms shows that Kairos improves timing predictability and safety while introducing a minimal 2.8% run-time overhead.

https://www.usenix.org/conference/osdi24/presentation/li
Thursday July 11, 2024 12:10pm - 12:30pm PDT
Grand Ballroom ABGH

12:25pm PDT

ATC Conference Luncheon
Thursday July 11, 2024 12:25pm - 2:00pm PDT
Thursday July 11, 2024 12:25pm - 2:00pm PDT
Santa Clara Ballroom

12:30pm PDT

Microkernel Goes General: Performance and Compatibility in the HongMeng Production Microkernel
Thursday July 11, 2024 12:30pm - 12:50pm PDT
Haibo Chen, Huawei Central Software Institute and Shanghai Jiao Tong University; Xie Miao, Ning Jia, Nan Wang, Yu Li, Nian Liu, Yutao Liu, Fei Wang, Qiang Huang, Kun Li, Hongyang Yang, Hui Wang, Jie Yin, Yu Peng, and Fengwei Xu, Huawei Central Software Institute

The virtues of security, reliability, and extensibility have made state-of-the-art microkernels prevalent in embedded and safety-critical scenarios. However, they face performance and compatibility issues when targeting more general scenarios, such as smartphones and smart vehicles.

This paper presents the design and implementation of the HongMeng kernel (HM), a commercialized general-purpose microkernel that preserves most of the virtues of microkernels while addressing the above challenges. For the sake of commercial practicality, we design HM to be compatible with the Linux API and ABI to reuse its rich application and driver ecosystems. To make it performant despite the constraints of compatibility and being general-purpose, we re-examine traditional microkernel wisdom, including IPC, capability-based access control, and userspace paging, and retrofit them accordingly. Specifically, we argue that per-invocation IPC cost is not the only concern for performance; IPC frequency, state double bookkeeping among OS services, and capabilities that hide kernel objects also contribute to significant performance degradation. We mitigate them with a set of techniques, including differentiated isolation classes, flexible composition, policy-free kernel paging, and address-token-based access control.

HM consists of a minimal core kernel and a set of least-privileged OS services, and it can run complex frameworks like AOSP and OpenHarmony. HM has been deployed in production on tens of millions of devices in emerging scenarios, including smart routers, smart vehicles and smartphones, typically with improved performance and security over their Linux counterparts.

https://www.usenix.org/conference/osdi24/presentation/chen-haibo
Thursday July 11, 2024 12:30pm - 12:50pm PDT
Grand Ballroom ABGH

12:50pm PDT

OSDI Conference Luncheon
Thursday July 11, 2024 12:50pm - 2:00pm PDT
Thursday July 11, 2024 12:50pm - 2:00pm PDT
Santa Clara Ballroom

2:00pm PDT

When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling
Thursday July 11, 2024 2:00pm - 2:20pm PDT
Abdullah Bin Faisal, Noah Martin, Hafiz Mohsin Bashir, Swaminathan Lamelas, and Fahad R. Dogar, Tufts University

In this paper, we make a case for providing job completion time estimates to GPU cluster users, similar to providing the delivery date of a package or arrival time of a booked ride. Our analysis reveals that providing predictability can come at the expense of performance and fairness. Existing GPU schedulers optimize for extreme points in the trade-off space, making them either extremely unpredictable or impractical.

To address this challenge, we present PCS, a new scheduling framework that aims to provide predictability while balancing other traditional objectives. The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., queue weights) that meets specific goals for predictability. It uses a simulation-aided search strategy to efficiently discover WFQ configurations that lie around the Pareto front of the trade-off space between these objectives. We implement and evaluate PCS in the context of scheduling ML training workloads on GPUs. Our evaluation, on a small-scale GPU testbed and larger-scale simulations, shows that PCS can provide accurate completion time estimates while marginally compromising on performance and fairness.
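
As a rough illustration of the search loop (not PCS's actual algorithm; the simulator below is a random placeholder standing in for PCS's scheduling simulator), a simulation-aided sweep that keeps the non-dominated WFQ weight configurations might look like:

    import itertools, random

    def simulate(weights):
        """Placeholder simulator: returns (prediction_error, avg_job_completion_time)."""
        rng = random.Random(hash(weights))
        return rng.uniform(0.0, 1.0), rng.uniform(1.0, 10.0)

    def search_wfq_configs(weight_choices=(1, 2, 4, 8), num_queues=3):
        pareto = []                                   # (error, jct, weights), both minimized
        for w in itertools.product(weight_choices, repeat=num_queues):
            err, jct = simulate(w)
            if any(e <= err and j <= jct for e, j, _ in pareto):
                continue                              # dominated by an existing config
            pareto = [(e, j, cw) for e, j, cw in pareto if not (err <= e and jct <= j)]
            pareto.append((err, jct, w))
        return pareto                                 # configs near the trade-off front

    for err, jct, w in search_wfq_configs():
        print(f"weights={w}  prediction_error={err:.2f}  avg_jct={jct:.2f}")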

https://www.usenix.org/conference/osdi24/presentation/bin-faisal
Thursday July 11, 2024 2:00pm - 2:20pm PDT
Grand Ballroom ABGH

2:00pm PDT

Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
Thursday July 11, 2024 2:00pm - 2:25pm PDT
Dan Graur, Oto Mraz, Muyu Li, and Sepehr Pourghannad, ETH Zurich; Chandramohan A. Thekkath, Google; Ana Klimovic, ETH Zurich

Input data preprocessing is a common bottleneck in machine learning (ML) jobs that can significantly increase training time and cost as expensive GPUs or TPUs idle waiting for input data. Previous work has shown that offloading data preprocessing to remote CPU servers successfully alleviates data stalls and improves training time. However, remote CPU workers in disaggregated data processing systems comprise a significant fraction of total training costs. Meanwhile, current disaggregated solutions often underutilize the CPU and DRAM resources available on ML accelerator nodes. We propose two approaches to alleviate ML input data stalls while minimizing costs. First, we dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth. Second, we analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput. We observe that relaxing commutativity increases throughput while maintaining high model accuracy for a variety of ML data pipelines. We build Pecan, an ML data preprocessing service that automates data preprocessing worker placement and transformation reordering decisions. Pecan reduces preprocessing costs by 87% on average and total training costs by up to 60% compared to training with state-of-the-art disaggregated data preprocessing, and reduces total training costs by 55% on average compared to collocated data preprocessing.

https://www.usenix.org/conference/atc24/presentation/mraz
Thursday July 11, 2024 2:00pm - 2:25pm PDT
Grand Ballroom CD

2:00pm PDT

QDSR: Accelerating Layer-7 Load Balancing by Direct Server Return with QUIC
Thursday July 11, 2024 2:00pm - 2:25pm PDT
Ziqi Wei, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Zhiqiang Wang, Tencent and Peng Cheng Laboratory; Qing Li, Peng Cheng Laboratory; Yuan Yang, Tsinghua University; Cheng Luo and Fuyu Wang, Tencent; Yong Jiang, Tsinghua Shenzhen International Graduate School and Peng Cheng Laboratory; Sijie Yang, Tencent; Zhenhui Yuan, Northumbria University

Layer-7 (L7) load balancing is a crucial capability for cloud service providers to maintain stable and reliable services. However, the high flexibility of L7 load balancers (LBs) and the growing volume of downlink relaying result in a heavy workload, which significantly increases the cost of cloud service providers and reduces end-to-end service quality. We propose QDSR, a new L7 load balancing scheme that uses QUIC and Direct Server Return (DSR) technology. QDSR divides a QUIC connection into independent streams and distributes them to multiple real servers (RSs), enabling real servers to send data directly to the client simultaneously. By avoiding redundant relaying, QDSR achieves high performance and low latency, and nearly eliminates additional downlink relaying overhead.

To evaluate the performance of QDSR, we implemented all its components using Nginx and Apache Traffic Server, deployed them in a real environment testbed, and conducted large-scale simulation experiments using mahimahi. The experimental results show that QDSR can process an additional 4.8%-18.5% of client requests compared to traditional L7 proxy-based load balancing schemes. It can achieve a maximum throughput that is 12.2 times higher in high-load scenarios and significantly reduce end-to-end latency and first packet latency.

https://www.usenix.org/conference/atc24/presentation/wei
Thursday July 11, 2024 2:00pm - 2:25pm PDT
Grand Ballroom EF

2:20pm PDT

Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences
Thursday July 11, 2024 2:20pm - 2:40pm PDT
Neeraj Kumar, Pol Mauri Ruiz, Vijay Menon, Igor Kabiljo, Mayank Pundir, Andrew Newell, Daniel Lee, Liyuan Wang, and Chunqiang Tang, Meta Platforms

Meta's private cloud uses millions of servers to host tens of thousands of services that power multiple products for billions of users. This complex environment has various optimization problems involving resource allocation, including hardware placement, server allocation, ML training & inference placement, traffic routing, database & container migration for load balancing, grouping serverless functions for locality, etc.

The main challenges for a reusable resource-allocation framework are its usability and scalability. Usability is impeded by practitioners struggling to translate real-life policies into precise mathematical formulas required by formal optimization methods, while scalability is hampered by NP-hard problems that cannot be solved efficiently by commercial solvers.

These challenges are addressed by Rebalancer, Meta's resource-allocation framework. It has been applied to dozens of large-scale use cases over the past seven years, demonstrating its usability, scalability, and generality. At the core of Rebalancer is an expression graph that enables its optimization algorithm to run more efficiently than past algorithms. Moreover, Rebalancer offers a high-level specification language to lower the barrier for adoption by systems practitioners.

https://www.usenix.org/conference/osdi24/presentation/kumar
Thursday July 11, 2024 2:20pm - 2:40pm PDT
Grand Ballroom ABGH

2:25pm PDT

OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model
Thursday July 11, 2024 2:25pm - 2:50pm PDT
Zheng Wang, University of California, San Diego; Yuke Wang, Boyuan Feng, and Guyue Huang, University of California, Santa Barbara; Dheevatsa Mudigere and Bharath Muthiah, Meta; Ang Li, Pacific Northwest National Laboratory; Yufei Ding, University of California, San Diego

The deployment of Deep Learning Recommendation Models (DLRMs) involves the parallelization of extra-large embedding tables (EMTs) on multiple GPUs. Existing works overlook the input-dependent behavior of EMTs and parallelize them in a coarse-grained manner, resulting in unbalanced workload distribution and inter-GPU communication.

To this end, we propose OPER, an algorithm-system co-design with OPtimality-guided Embedding table parallelization for large-scale Recommendation model training and inference. The core idea of OPER is to explore the connection between DLRM inputs and the efficiency of distributed EMTs, aiming to provide a near-optimal parallelization strategy for EMTs. Specifically, we conduct an in-depth analysis of various types of EMT parallelism and propose a heuristic search algorithm to efficiently approximate an empirically near-optimal EMT parallelization. Furthermore, we implement a distributed shared-memory-based system, which supports the lightweight but complex computation and communication pattern of fine-grained EMT parallelization, effectively converting theoretical improvements into real speedups. Extensive evaluation shows that OPER achieves 2.3× and 4.0× speedup on average in training and inference, respectively, over state-of-the-art DLRM frameworks.

https://www.usenix.org/conference/atc24/presentation/wang
Thursday July 11, 2024 2:25pm - 2:50pm PDT
Grand Ballroom CD

2:25pm PDT

Evaluating Chiplet-based Large-Scale Interconnection Networks via Cycle-Accurate Packet-Parallel Simulation
Thursday July 11, 2024 2:25pm - 2:50pm PDT
Yinxiao Feng and Yuchen Wei, Institute for Interdisciplinary Information Sciences, Tsinghua University; Dong Xiang, School of Software, Tsinghua University; Kaisheng Ma, Institute for Interdisciplinary Information Sciences, Tsinghua University

The Chiplet architecture has achieved great success in recent years. However, chiplet-based networks are significantly different from traditional networks, thus presenting new challenges in evaluation. On the one hand, on-chiplet and off-chiplet networks are tightly coupled; therefore, the entire heterogeneous network must be designed and evaluated jointly rather than separately. On the other hand, existing network simulators cannot efficiently evaluate large-scale chiplet-based networks with cycle-accurate accuracy.

In this paper, we present the design and implementation of the Chiplet Network Simulator (CNSim), a cycle-accurate packet-parallel simulator supporting efficient simulation of large-scale chiplet-based (shared-memory) networks. In CNSim, a packet-centric simulation architecture and an atomics-based hyper-threading mechanism are adopted, accelerating simulation speed by 11× ~ 14× compared with existing cycle-accurate simulators. Besides, we implement a heterogeneous router/link microarchitecture and many other features, including hierarchical topologies, adaptive routing, and real-workload trace integration. Based on CNSim, two typical chiplet-based networks, which cannot be efficiently simulated by existing simulators, are systematically evaluated. The advantages and limitations of chiplet-based networks are revealed through systematic cycle-accurate simulations. The simulator and evaluation framework are open-sourced to the community.

https://www.usenix.org/conference/atc24/presentation/feng-yinxiao
Thursday July 11, 2024 2:25pm - 2:50pm PDT
Grand Ballroom EF

2:40pm PDT

μSlope: High Compression and Fast Search on Semi-Structured Logs
Thursday July 11, 2024 2:40pm - 3:00pm PDT
Rui Wang, YScope; Devin Gibson, YScope and University of Toronto; Kirk Rodrigues, YScope; Yu Luo, YScope, Uber, and University of Toronto; Yun Zhang, Kaibo Wang, Yupeng Fu, and Ting Chen, Uber; Ding Yuan, YScope and University of Toronto

Internet-scale services can produce large amounts of logs. Such logs are increasingly appearing in semi-structured formats such as JSON. At Uber, the amount of semi-structured log data can exceed 10PB/day. It is prohibitively expensive to store and analyze them. As a result, logs are only kept searchable for a few days.

This paper proposes μSlope, a system that losslessly compresses semi-structured log data and allows search without full decompression. It concisely represents the schema structures and keeps this representation stored only once per dataset instead of interspersing it with each record. It further "structurizes" the semi-structured data by grouping records with the same schema structure into the same table, so that each table is well structured. Our evaluation shows that μSlope achieves 21.9:1 to 186.8:1 compression ratios, at least a few times higher than those of existing semi-structured data management systems (SSDMSes); the compression ratio is 2.34x that of Zstandard, and the search speed is 5.77x that of other SSDMSes.
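
A toy sketch of the "structurize" idea (grouping records that share a schema into one table so the schema is stored once; the schema encoding below is a simplification, not μSlope's actual format):

    import json
    from collections import defaultdict

    def schema_of(value):
        """Derive a record's schema structure (key names and value types), ignoring values."""
        if isinstance(value, dict):
            return tuple(sorted((k, schema_of(v)) for k, v in value.items()))
        if isinstance(value, list):
            return "list"
        return type(value).__name__

    def structurize(log_lines):
        """Group semi-structured records by schema; each table stores its schema only once."""
        tables = defaultdict(list)
        for line in log_lines:
            record = json.loads(line)
            tables[schema_of(record)].append(record)
        return tables

    tables = structurize(['{"lvl": "INFO", "msg": "ok"}',
                          '{"lvl": "WARN", "msg": "slow", "ms": 41}'])
    print(len(tables))   # 2 schemas -> 2 well-structured tables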

https://www.usenix.org/conference/osdi24/presentation/wang-rui
Thursday July 11, 2024 2:40pm - 3:00pm PDT
Grand Ballroom ABGH

2:50pm PDT

MAGPY: Compiling Eager Mode DNN Programs by Monitoring Execution States
Thursday July 11, 2024 2:50pm - 3:15pm PDT
Chen Zhang, Rongchao Dong, Haojie Wang, Runxin Zhong, Jike Chen, and Jidong Zhai, Tsinghua University

Real-world deep learning programs are often developed with dynamic programming languages like Python, which usually have complex features, such as built-in functions and dynamic typing. These programs typically execute in eager mode, where tensor operators run without compilation, resulting in poor performance. Conversely, deep learning compilers rely on operator-based computation graphs to optimize program execution. However, complexities in dynamic languages often prevent the conversion of these programs into complete operator graphs, leading to sub-optimal performance.

To address this challenge, we introduce MAGPY to optimize the generation of operator graphs from deep learning programs. MAGPY generates more complete operator graphs by collecting key runtime information through monitoring program execution. MAGPY provides a reference graph to record program execution states and leverages reference relationships to identify state changes that can impact program outputs. This approach significantly reduces analysis complexity, leading to more complete operator graphs. Experimental results demonstrate that MAGPY accelerates complex deep learning programs by up to 2.88× (1.55× on average), and successfully instantiates 93.40% of 1191 real user programs into complete operator graphs.

https://www.usenix.org/conference/atc24/presentation/zhang-chen
Thursday July 11, 2024 2:50pm - 3:15pm PDT
Grand Ballroom CD

2:50pm PDT

Config-Snob: Tuning for the Best Configurations of Networking Protocol Stack
Thursday July 11, 2024 2:50pm - 3:15pm PDT
Manaf Bin-Yahya, Yifei Zhao, and Hossein Shafieirad, Huawei Technologies Canada; Anthony Ho, Huawei Technologies Canada and University of Waterloo; Shijun Yin and Fanzhao Wang, Huawei Technologies China; Geng Li, Huawei Technologies Canada

Web servers usually use predefined configurations, yet empirical studies have shown that performance can be significantly improved when the configurations of the networking protocol stack (e.g., TCP, QUIC, and congestion control parameters) are carefully tuned, since a “one-size-fits-all” strategy does not exist. However, dynamically tuning the protocol stack's configurations is challenging: first, the configuration space is large, and parameters with complex dependencies must be tuned jointly; second, the network condition space is also large, so an adaptive solution is needed to handle client diversity and network dynamics; and finally, clients endure performance degradation during learning exploration. To this end, we propose Config-Snob, a protocol tuning solution that selects the best configurations based on historical data. Config-Snob exploits the configuration space by tuning several configuration knobs and provides practical fine-grained client grouping while handling network environment dynamics. Config-Snob uses a controlled exploration approach to minimize performance degradation, and utilizes causal inference (CI) algorithms to boost the tuning optimization. Config-Snob is implemented in a QUIC-based server and deployed in a large-scale production environment. Our extensive experiments show that the proposed solution improves completion time over the default configurations by 15% to 36% (mean) and 62% to 70% (median) in the real deployment.

https://www.usenix.org/conference/atc24/presentation/bin-yahya
Thursday July 11, 2024 2:50pm - 3:15pm PDT
Grand Ballroom EF

3:00pm PDT

ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing
Thursday July 11, 2024 3:00pm - 3:20pm PDT
Mike Chow, Meta Platforms; Yang Wang, Meta Platforms and The Ohio State University; William Wang, Ayichew Hailu, Rohan Bopardikar, Bin Zhang, Jialiang Qu, David Meisner, Santosh Sonawane, Yunqi Zhang, Rodrigo Paim, Mack Ward, Ivor Huang, Matt McNally, Daniel Hodges, Zoltan Farkas, Caner Gocmen, Elvis Huang, and Chunqiang Tang, Meta Platforms

This paper presents ServiceLab, a large-scale performance testing platform developed at Meta. Currently, the diverse set of applications and ML models it tests consumes millions of machines in production, and each year it detects performance regressions that could otherwise lead to the waste of millions of machines. A major challenge for ServiceLab is to detect small performance regressions, sometimes as tiny as 0.01%. These minor regressions matter due to our large fleet size and their potential to accumulate over time. For instance, the median regression detected by ServiceLab for our large serverless platform, running on more than half a million machines, is only 0.14%. Another challenge is running performance tests in our private cloud, which, like the public cloud, is a noisy environment that exhibits inherent performance variance even for machines of the same instance type. To address these challenges, we conduct a large-scale study with millions of performance experiments to identify machine factors, such as the kernel, CPU, and datacenter location, that introduce variance to test results. Moreover, we present statistical analysis methods to robustly identify small regressions. Finally, we share our seven years of operational experience in dealing with a diverse set of applications.
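
The abstract does not detail the statistical analysis used; as a generic illustration of the underlying problem (deciding whether a candidate build is slightly slower than a baseline amid noise), a Welch t statistic over repeated measurements looks like the following. The latency numbers are made up:

    from math import sqrt
    from statistics import mean, variance

    def welch_t(baseline, candidate):
        """Two-sample Welch t statistic; a large positive value flags a likely regression."""
        va, vb = variance(baseline), variance(candidate)
        return (mean(candidate) - mean(baseline)) / sqrt(va / len(baseline) + vb / len(candidate))

    baseline  = [100.00, 100.02, 99.98, 100.01, 99.99] * 20    # latency samples (ms)
    candidate = [100.14, 100.16, 100.12, 100.15, 100.13] * 20  # roughly 0.14% slower
    print(f"t = {welch_t(baseline, candidate):.1f}")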

https://www.usenix.org/conference/osdi24/presentation/chow
Thursday July 11, 2024 3:00pm - 3:20pm PDT
Grand Ballroom ABGH

3:15pm PDT

Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
Thursday July 11, 2024 3:15pm - 3:40pm PDT
Haojun Xia, University of Sydney; Zhen Zheng and Xiaoxia Wu, Microsoft; Shiyang Chen, Rutgers University; Zhewei Yao, Stephen Youn, Arash Bakhtiari, and Michael Wyatt, Microsoft; Donglin Zhuang and Zhongzhu Zhou, University of Sydney; Olatunji Ruwase, Yuxiong He, and Shuaiwen Leon Song, Microsoft

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. It is challenging to support FP6 quantization on GPUs due to (1) unfriendly memory access of model weights with non-power-of-two bit-width and (2) high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support of 6-bit and arbitrary bit-width quantization (5-bit, etc.). We integrate TC-FPx kernel into an existing inference system, providing new end-to-end support (called Quant-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved with 6-bit quantization. Experiments show that Quant-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69×-2.65× higher normalized inference throughput than the FP16 baseline. The source code is publicly available at https://github.com/usyd-fsalab/fp6_llm.

https://www.usenix.org/conference/atc24/presentation/xia
Thursday July 11, 2024 3:15pm - 3:40pm PDT
Grand Ballroom CD

3:15pm PDT

Conspirator: SmartNIC-Aided Control Plane for Distributed ML Workloads
Thursday July 11, 2024 3:15pm - 3:40pm PDT
Yunming Xiao, Northwestern University; Diman Zad Tootaghaj, Aditya Dhakal, Lianjie Cao, and Puneet Sharma, Hewlett Packard Labs; Aleksandar Kuzmanovic, Northwestern University

Modern machine learning (ML) workloads heavily depend on distributing tasks across clusters of server CPUs and specialized accelerators, such as GPUs and TPUs, to achieve optimal performance. Nonetheless, prior research has highlighted the inefficient utilization of computing resources in distributed ML, leading to suboptimal performance. This inefficiency primarily stems from CPU bottlenecks and suboptimal accelerator scheduling. Although numerous proposals have been put forward to address these issues individually, none have effectively tackled both inefficiencies simultaneously. In this paper, we introduce Conspirator, an innovative control plane design aimed at alleviating both bottlenecks by harnessing the enhanced computing capabilities of SmartNICs. Following the evolving role of SmartNICs, which have transitioned from their initial function of standard networking task offloading to serving as programmable connectors between disaggregated computing resources, Conspirator facilitates efficient data transfer without the involvement of host CPUs and hence circumvents the potential bottlenecks there. Conspirator further integrates a novel scheduling algorithm that takes into consideration the heterogeneity of accelerators and adapts to changing workload dynamics, enabling the flexibility to mitigate the second bottleneck. Our evaluation demonstrates that Conspirator may provide a 15% end-to-end completion time reduction compared to RDMA-based alternatives while being 17% more cost-effective and 44% more power-efficient. Our proposed scheduler also helps to save 33% of GPU hours compared to naive GPU-sharing schedulers by making close-to-optimal decisions while taking much less time than the optimal NP-Hard scheduler.

https://www.usenix.org/conference/atc24/presentation/xiao
Thursday July 11, 2024 3:15pm - 3:40pm PDT
Grand Ballroom EF

3:20pm PDT

MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
Thursday July 11, 2024 3:20pm - 3:40pm PDT
Arnab Choudhury, Meta Platforms; Yang Wang, Meta Platforms and The Ohio State University; Tuomas Pelkonen, Meta Platforms; Kutta Srinivasan, LinkedIn; Abha Jain, Shenghao Lin, Delia David, Siavash Soleimanifard, Michael Chen, Abhishek Yadav, Ritesh Tijoriwala, Denis Samoylov, and Chunqiang Tang, Meta Platforms

In public clouds, users must manually select a datacenter region to upload their ML training data and launch ML training workloads in the same region to ensure data and computation colocation. Unfortunately, isolated decisions by individual users can lead to a mismatch between workload demand and hardware supply across regions, hurting the cloud provider's hardware utilization and profitability. To address this problem in Meta's hyperscale private cloud, we provide a global-scheduling abstraction to all ML training workloads. Users simply submit their training workloads to MAST, our global scheduler, and rely on it to intelligently place both data and training workloads to different regions. We describe three design principles that enable MAST to schedule complex ML training workloads at a global scale: temporal decoupling, scope decoupling, and exhaustive search. MAST successfully balances the load across global regions. Before MAST, the most overloaded region had a GPU demand-to-supply ratio of 2.63 for high-priority workloads. With MAST, this ratio has been reduced to 0.98, effectively eliminating the overload.

https://www.usenix.org/conference/osdi24/presentation/choudhury
Thursday July 11, 2024 3:20pm - 3:40pm PDT
Grand Ballroom ABGH

3:40pm PDT

Break with Refreshments
Thursday July 11, 2024 3:40pm - 4:10pm PDT
Thursday July 11, 2024 3:40pm - 4:10pm PDT
Grand Ballroom Foyer

4:10pm PDT

Automatically Reasoning About How Systems Code Uses the CPU Cache
Thursday July 11, 2024 4:10pm - 4:30pm PDT
Rishabh Iyer, Katerina Argyraki, and George Candea, EPFL

We present a technique, called CFAR, that developers can use to reason precisely about how their code, as well as third-party code, uses the CPU cache. Given a piece of systems code P, CFAR employs program analysis and binary instrumentation to automatically "distill" how P accesses memory, and uses "projectors" on top of the extracted distillates to answer specific questions about P's cache usage. CFAR comes with three example projectors that report (1) how P's cache footprint scales across unseen inputs; (2) the cache hits and misses incurred by P for each class of inputs; and (3) potential vulnerabilities in cryptographic code caused by secret-dependent cache-access patterns.
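
As a toy illustration of the kind of question a "projector" over a memory-access distillate answers (not CFAR's implementation; the trace here is a made-up strided scan), the cache-footprint question reduces to counting distinct cache lines touched:

    CACHE_LINE = 64   # bytes

    def cache_footprint(access_trace):
        """Toy projector: distinct cache lines touched by a sequence of byte addresses."""
        return len({addr // CACHE_LINE for addr in access_trace})

    # Footprint of an 8-byte-strided scan over a 4 KiB buffer (one input class).
    print(cache_footprint(range(0, 4096, 8)))   # -> 64 cache lines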

We implemented CFAR in an eponymous tool with which we analyze a performance-critical subset of four TCP stacks (two versions of the Linux stack, a stack used by the IX kernel-bypass OS, and the lwIP TCP stack for embedded systems) as well as 7 algorithm implementations from the OpenSSL cryptographic library, all 51 system calls of the Hyperkernel, and 2 hash-table implementations. We show how CFAR enables developers to not only identify performance bugs and security vulnerabilities in their own code but also understand the performance impact of incorporating third-party code into their systems without doing elaborate benchmarking.

https://www.usenix.org/conference/osdi24/presentation/iyer
Thursday July 11, 2024 4:10pm - 4:30pm PDT
Grand Ballroom ABGH

4:10pm PDT

FBMM: Making Memory Management Extensible With Filesystems
Thursday July 11, 2024 4:10pm - 4:35pm PDT
Bijan Tabatabai, James Sorenson, and Michael M. Swift, University of Wisconsin—Madison

New memory technologies like CXL promise diverse memory configurations such as tiered memory, far memory, and processing in memory. Operating systems must be modified to support these new hardware configurations for applications to make use of them. While many parts of operating systems are extensible, memory management remains monolithic in most systems, making it cumbersome to add support for a diverse set of new memory policies and mechanisms.

Rather than creating a whole new extensible interface for memory managers, we propose to instead use the memory management callbacks provided by the Linux virtual file system (VFS) to write memory managers, called memory management filesystems (MFSs). Memory is allocated by creating and mapping a file in an MFS's mount directory and freed by deleting the file. Use of an MFS is transparent to applications. We call this system File Based Memory Management (FBMM).
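
A minimal user-space sketch of the allocation interface described above, assuming a hypothetical MFS mounted at /mnt/mfs (the mount point and file naming are invented for illustration; this is not code from FBMM):

    import mmap, os

    MFS_MOUNT = "/mnt/mfs"   # hypothetical mount point of a memory management filesystem

    def mfs_alloc(name: str, nbytes: int) -> mmap.mmap:
        """Allocate memory by creating and mapping a file inside the MFS mount directory."""
        fd = os.open(os.path.join(MFS_MOUNT, name), os.O_CREAT | os.O_RDWR, 0o600)
        os.ftruncate(fd, nbytes)
        mem = mmap.mmap(fd, nbytes, prot=mmap.PROT_READ | mmap.PROT_WRITE)
        os.close(fd)          # the mapping keeps the allocation alive
        return mem

    def mfs_free(name: str, mem: mmap.mmap) -> None:
        """Free the memory by unmapping and deleting the backing file."""
        mem.close()
        os.unlink(os.path.join(MFS_MOUNT, name))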

Using FBMM, we created a diverse set of standalone memory managers for tiered memory, contiguous allocations, and memory bandwidth allocation, each comprising 500-1500 lines of code. Unlike current approaches that require custom kernels, with FBMM, an MFS can be compiled separately from the kernel and loaded dynamically when needed. We measured the overhead of using filesystems for memory management and found it to be less than 8% when allocating a single page, and less than 0.1% when allocating as few as 128 pages. MFSs perform competitively with kernel implementations, and sometimes better due to simpler implementations.

https://www.usenix.org/conference/atc24/presentation/tabatabai
Thursday July 11, 2024 4:10pm - 4:35pm PDT
Grand Ballroom CD

4:10pm PDT

SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Thursday July 11, 2024 4:10pm - 4:35pm PDT
Yifan Xiong, Yuting Jiang, Ziyue Yang, and Lei Qu, Microsoft Research; Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, Jithin Jose, Hossein Pourreza, Jeff Baxter, Kushal Datta, Prabhat Ram, Luke Melton, and Joe Chau, Microsoft; Peng Cheng, Yongqiang Xiong, and Lidong Zhou, Microsoft Research

Reliability in cloud AI infrastructure is crucial for cloud service providers, prompting the widespread use of hardware redundancies. However, these redundancies can inadvertently lead to hidden degradation, so-called "gray failure," for AI workloads, significantly affecting end-to-end performance and concealing performance issues, which complicates root cause analysis for failures and regressions.

We introduce SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. SuperBench features a comprehensive benchmark suite, capable of evaluating individual hardware components and representing most real AI workloads. It comprises a Validator which learns benchmark criteria to clearly pinpoint defective components. Additionally, SuperBench incorporates a Selector to balance validation time and issue-related penalties, enabling optimal timing for validation execution with a tailored subset of benchmarks. Through testbed evaluation and simulation, we demonstrate that SuperBench can increase the mean time between incidents by up to 22.61×. SuperBench has been successfully deployed in Azure production, validating hundreds of thousands of GPUs over the last two years.

https://www.usenix.org/conference/atc24/presentation/xiong
Thursday July 11, 2024 4:10pm - 4:35pm PDT
Grand Ballroom EF

4:30pm PDT

VeriSMo: A Verified Security Module for Confidential VMs
Thursday July 11, 2024 4:30pm - 4:50pm PDT
Ziqiao Zhou, Microsoft Research; Anjali, University of Wisconsin-Madison; Weiteng Chen, Microsoft Research; Sishuai Gong, Purdue University; Chris Hawblitzel and Weidong Cui, Microsoft Research

Hardware vendors have introduced confidential VM architectures (e.g., AMD SEV-SNP, Intel TDX and Arm CCA) in recent years. They eliminate the trust in the hypervisor and lead to the need for security modules such as AMD Secure VMService Module (SVSM). These security modules aim to provide a guest with security features that previously were offered by the hypervisor. Since the security of such modules is critical, Rust is used to implement them for its known memory safety features. However, using Rust for implementation does not guarantee correctness, and the use of unsafe Rust compromises the memory safety guarantee.

In this paper, we introduce VERISMO, the first verified security module for confidential VMs on AMD SEV-SNP. VERISMO is fully functional and provides security features such as code integrity, runtime measurement, and secret management. More importantly, as a Rust-based implementation, VERISMO is fully verified for functional correctness, secure information flow, and VM confidentiality and integrity. The key challenge in verifying VERISMO is that the untrusted hypervisor can interrupt VERISMO's execution and modify the hardware state at any time. We address this challenge by dividing verification into two layers. The upper layer handles the concurrent hypervisor execution, while the lower layer handles VERISMO's own concurrent execution. When compared with a C-based implementation, VERISMO achieves similar performance. When verifying VERISMO, we identified a subtle requirement for VM confidentiality and found that it was overlooked by AMD SVSM. This demonstrates the necessity for formal verification.

https://www.usenix.org/conference/osdi24/presentation/zhou
Thursday July 11, 2024 4:30pm - 4:50pm PDT
Grand Ballroom ABGH

4:35pm PDT

Mangosteen: Fast Transparent Durability for Linearizable Applications using NVM
Thursday July 11, 2024 4:35pm - 5:00pm PDT
Sergey Egorov, Gregory Chockler, and Brijesh Dongol, University of Surrey, UK; Dan O'Keeffe, Royal Holloway, University of London, UK; Sadegh Keshavarzi, University of Surrey, UK

The advent of byte-addressable non-volatile memory (NVM) technologies has enabled the development of low-latency high-throughput durable applications, i.e., applications that are capable of recovering from full-system crashes. However, programming such applications is error-prone as efficiency gains often require fine-grained (programmer-controlled) management of low-level persistence instructions.

We propose Mangosteen, a high-level programming framework that allows developers to transform an existing linearizable in-memory application into a corresponding durably linearizable version using NVM. Our framework’s API consists of a set of callback hooks that interpose on an application’s request processing flow with minimal developer effort. Mangosteen executes client operations on DRAM and persists their effects using binary instrumentation and redo logging. Mangosteen’s concurrency control facilitates batching of read-write requests to minimize the cost of persistence, while allowing read-only requests to execute concurrently. A novel intra-batch deduplication mechanism further reduces persistence overheads for common OLTP workloads. Our empirical evaluation results show that Mangosteen-enabled applications outperform state-of-the-art solutions across the entire spectrum of read-write ratios. In particular, the Mangosteen-based version of Redis demonstrates throughput gains of 2× to 5× in comparison to prior work.
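
A very rough sketch of the execute-on-DRAM, persist-effects-per-batch flow described above (a file plus fsync stands in for NVM writes and flushes; the data structures and names are illustrative, not Mangosteen's API):

    import json, os

    class RedoLog:
        """Toy redo log: append the effects of a whole batch, then flush once."""
        def __init__(self, path="/tmp/redo.log"):      # NVM-backed in the real setting
            self.f = open(path, "ab", buffering=0)
        def persist_batch(self, effects):
            for key, value in effects:
                self.f.write(json.dumps({"k": key, "v": value}).encode() + b"\n")
            os.fsync(self.f.fileno())                  # stand-in for an NVM flush/fence

    def process_batch(store, batch, log):
        """Execute read-write requests on DRAM, then persist their effects once per batch."""
        effects = []
        for op, key, value in batch:
            if op == "set":
                store[key] = value
                effects.append((key, value))
        log.persist_batch(effects)                     # replies are released after this point

    store, log = {}, RedoLog()
    process_batch(store, [("set", "a", 1), ("set", "b", 2)], log)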

https://www.usenix.org/conference/atc24/presentation/egorov
Thursday July 11, 2024 4:35pm - 5:00pm PDT
Grand Ballroom CD

4:35pm PDT

Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field
Thursday July 11, 2024 4:35pm - 5:00pm PDT
Ronglong Wu, Shuyue Zhou, Jiahao Lu, Zhirong Shen, and Zikang Xu, Xiamen University; Jiwu Shu, Xiamen University and Minjiang University; Kunlin Yang and Feilong Lin, Huawei Technologies Co., Ltd; Yiming Zhang, Xiamen University

High-bandwidth memory (HBM) is regarded as a promising technology for fundamentally overcoming the memory wall. It stacks up multiple DRAM dies vertically to dramatically improve the memory access bandwidth. However, this architecture also comes with more severe reliability issues, since HBM not only inherits error patterns of the conventional DRAM, but also introduces new error causes.

In this paper, we conduct the first systematic study of HBM errors, covering over 460 million error events collected from nineteen data centers and spanning over two years of deployment under a variety of services. Through error analyses and methodology validations, we confirm that HBM exhibits different error patterns from conventional DRAM in terms of spatial locality, temporal correlation, and sensor metrics, which makes empirical DRAM error prediction models ineffective for HBM. Based on our findings, we design and implement Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures. The results demonstrate the feasibility of failure prediction across hierarchical levels.

https://www.usenix.org/conference/atc24/presentation/wu-ronglong
Thursday July 11, 2024 4:35pm - 5:00pm PDT
Grand Ballroom EF

4:50pm PDT

Validating the eBPF Verifier via State Embedding
Thursday July 11, 2024 4:50pm - 5:10pm PDT
Hao Sun and Zhendong Su, ETH Zurich

This paper introduces state embedding, a novel and highly effective technique for validating the correctness of the eBPF verifier, a critical component for Linux kernel security. To check whether a program is safe to execute, the verifier must track over-approximated program states along each potential control-flow path; any concrete state not contained in the tracked approximation may invalidate the verifier's conclusion. Our key insight is that one can effectively detect logic bugs in the verifier by embedding a program with certain approximation-correctness checks expected to be validated by the verifier. Indeed, for a program deemed safe by the verifier, our approach embeds concrete states via eBPF program constructs as correctness checks. By construction, the resulting state-embedded program allows the verifier to validate whether the embedded concrete states are correctly approximated by itself; any validation failure therefore reveals a logic bug in the verifier. We realize state embedding as a practical tool and apply it to test the eBPF verifier. Our evaluation results highlight its effectiveness. Despite the extensive scrutiny and testing undertaken on the eBPF verifier, our approach, within one month, uncovered 15 previously unknown logic bugs, 10 of which have already been fixed. Many of the detected bugs are severe, e.g., two are exploitable and can lead to local privilege escalation.
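
Stripped of eBPF specifics, the property being checked is that every concrete run-time state lies inside the verifier's tracked over-approximation; a toy interval version of such an embedded check (not the actual tool, which builds the check out of eBPF program constructs) is:

    from dataclasses import dataclass

    @dataclass
    class Interval:
        lo: int
        hi: int
        def contains(self, value: int) -> bool:
            return self.lo <= value <= self.hi

    def embedded_check(tracked: Interval, concrete_value: int) -> None:
        """If a concrete state escapes the tracked approximation, the verifier has a logic bug."""
        if not tracked.contains(concrete_value):
            raise AssertionError(f"verifier bug: state {concrete_value} escapes {tracked}")

    # Example: the verifier claims a register stays within [0, 15] after masking.
    embedded_check(Interval(0, 15), (7 * 3) & 0xF)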

https://www.usenix.org/conference/osdi24/presentation/sun-hao
Thursday July 11, 2024 4:50pm - 5:10pm PDT
Grand Ballroom ABGH

5:00pm PDT

FlexMem: Adaptive Page Profiling and Migration for Tiered Memory
Thursday July 11, 2024 5:00pm - 5:25pm PDT
Dong Xu, University of California, Merced; Junhee Ryu, Jinho Baek, and Kwangsik Shin, SK hynix; Pengfei Su and Dong Li, University of California, Merced

Tiered memory, which combines multiple memory components with different performance and capacity, provides a cost-effective solution to increase memory capacity and improve memory utilization. Existing system software to manage tiered memory often has limitations: (1) rigid memory profiling methods that cannot timely capture emerging memory access patterns or that lose profiling quality, (2) rigid page demotion (i.e., the number of pages for demotion is driven by an invariant requirement on free memory space), and (3) a rigid warm page range (i.e., emerging hot pages) that leads to unnecessary page demotion from fast to slow memory. To address these limitations, we introduce FlexMem, a page profiling and migration system for tiered memory. FlexMem combines performance counter-based and page hinting fault-based profiling methods to improve profiling quality, dynamically decides the number of pages for demotion based on the need to accommodate hot pages (i.e., frequently accessed pages), and dynamically decides the warm page range based on how often pages in the range are promoted to hot pages. We evaluate FlexMem with common memory-intensive benchmarks. Compared to the state of the art (Tiering-0.8, TPP, and MEMTIS), FlexMem improves performance by 32%, 23%, and 27% on average, respectively.
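
A one-function sketch of the demand-driven demotion policy described above (the counters and headroom constant are hypothetical; this is not FlexMem's code):

    def pages_to_demote(hot_pages_waiting: int, free_fast_pages: int, headroom: int = 64) -> int:
        """Demote just enough cold pages to fit the hot pages awaiting promotion (plus a small
        headroom), rather than demoting toward a fixed free-memory watermark."""
        return max(0, hot_pages_waiting + headroom - free_fast_pages)

    print(pages_to_demote(hot_pages_waiting=512, free_fast_pages=128))   # -> 448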

https://www.usenix.org/conference/atc24/presentation/xu-dong
Thursday July 11, 2024 5:00pm - 5:25pm PDT
Grand Ballroom CD

5:00pm PDT

MSFRD: Mutation Similarity based SSD Failure Rating and Diagnosis for Complex and Volatile Production Environments
Thursday July 11, 2024 5:00pm - 5:25pm PDT
Yuqi Zhang, Tianyi Zhang, Wenwen Hao, Shuyang Wang, Na Liu, and Xing He, Samsung R&D Institute China Xi'an, Samsung Electronics; Yang Zhang, Weixin Wang, Yongguang Cheng, Huan Wang, Jie Xu, Feng Wang, and Bo Jiang, ByteDance Inc.; Yongwong Gwon, Jongsung Na, Zoe Kim, and Geunrok Oh, Samsung Electronics

SSD failures have an increasing impact on storage reliability and performance in data centers. Some manufacturers have customized fine-grained Telemetry attributes to analyze and identify SSD failures. Based on Telemetry data, this paper proposes the mutation similarity based failure rating and diagnosis (MSFRD) scheme to predict failures in dynamic environment of data centers and improve failure handling efficiency. MSFRD dynamically detects the internal mutations of SSDs in real time and measures their similarity to the mutations of historical failed SSDs and healthy SSDs for failure prediction and early rating. Based on the rating, unavailable SSDs with serious failures are handled immediately, while available SSDs with less serious failures will be continuously tracked and diagnosed. The MSFRD is evaluated on real Telemetry datasets collected from large-scale SSDs in data centers. Compared with the existing schemes, MSFRD improves precision by 23.8% and recall by 38.9% on average for failure prediction. The results also show the effectiveness of MSFRD on failure rating and progressive diagnosis.

https://www.usenix.org/conference/atc24/presentation/zhang-yuqi
Thursday July 11, 2024 5:00pm - 5:25pm PDT
Grand Ballroom EF

5:10pm PDT

Using Dynamically Layered Definite Releases for Verifying the RefFS File System
Thursday July 11, 2024 5:10pm - 5:30pm PDT
Mo Zou, Dong Du, and Mingkai Dong, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Huawei Technologies Co. Ltd

RefFS is the first concurrent file system that guarantees both liveness and safety, backed by a machine-checkable proof. Unlike earlier concurrent file systems, RefFS provably avoids termination bugs such as livelocks and deadlocks, through the dynamically layered definite releases specification. This specification enables handling of general blocking scenarios (including ad-hoc synchronization), facilitates modular reasoning for nested blocking, and eliminates the possibility of circular blocking.

The methodology underlying the aforementioned specification is integrated into a framework called MoLi (Modular Liveness Verification). This framework helps developers verify concurrent file systems. We further validate the correctness of the locking scheme for the Linux Virtual File System (VFS). Remarkably, even without conducting code proofs, we uncovered a critical flaw in a recent version of the locking scheme, which may lead to deadlocks of the entire OS (confirmed by Linux maintainers). RefFS achieves better overall performance than AtomFS, a state-of-the-art, verified concurrent file system without the liveness guarantee.

https://www.usenix.org/conference/osdi24/presentation/zou
Thursday July 11, 2024 5:10pm - 5:30pm PDT
Grand Ballroom ABGH

5:30pm PDT

Anvil: Verifying Liveness of Cluster Management Controllers
Thursday July 11, 2024 5:30pm - 5:50pm PDT
Xudong Sun, Wenjie Ma, Jiawei Tyler Gu, and Zicheng Ma, University of Illinois Urbana-Champaign; Tej Chajed, University of Wisconsin-Madison; Jon Howell, Andrea Lattuada, and Oded Padon, VMware Research; Lalith Suresh, Feldera; Adriana Szekeres, VMware Research; Tianyin Xu, University of Illinois Urbana-Champaign

Modern clouds depend crucially on an extensible ecosystem of thousands of controllers, each managing critical systems (e.g., a ZooKeeper cluster). A controller continuously reconciles the current state of the system to a desired state according to a declarative description. However, controllers have bugs that make them never achieve the desired state, due to concurrency, asynchrony, and failures; there are cases where after an inopportune failure, a controller can make no further progress. Formal verification is promising for avoiding bugs in distributed systems, but most work so far focused on safety, whereas reconciliation is fundamentally not a safety property.

This paper develops the first tool to apply formal verification to the problem of controller correctness, with a general specification we call eventually stable reconciliation, written as a concise temporal logic liveness property. We present Anvil, a framework for developing controller implementations in Rust and verifying that the controllers correctly implement eventually stable reconciliation. We use Anvil to verify three Kubernetes controllers for managing ZooKeeper, RabbitMQ, and FluentBit, which can readily be deployed in Kubernetes platforms and are comparable in terms of features and performance to widely used unverified controllers.
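
The abstract describes eventually stable reconciliation as a concise temporal-logic liveness property; one natural way to write such a property (our paraphrase, not the paper's exact formula) is:

    % Paraphrase: if the desired state description eventually stops changing at value d,
    % then the current cluster state eventually matches d and stays matched.
    \[
      \forall d.\; \Diamond\Box\,(\mathit{desire} = d) \;\Longrightarrow\; \Diamond\Box\,(\mathit{current} = d)
    \]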

https://www.usenix.org/conference/osdi24/presentation/sun-xudong
Thursday July 11, 2024 5:30pm - 5:50pm PDT
Grand Ballroom ABGH

6:00pm PDT

USENIX ATC ’24 Poster Session and Reception
Thursday July 11, 2024 6:00pm - 7:30pm PDT
The USENIX ATC ’24 poster session and reception will feature posters by authors presenting their work in person at the conference. All USENIX ATC and OSDI conference attendees are invited to attend.
Thursday July 11, 2024 6:00pm - 7:30pm PDT
Santa Clara Ballroom

7:30pm PDT

Databricks Sponsor Event
Thursday July 11, 2024 7:30pm - 8:30pm PDT
All attendees from both USENIX ATC and OSDI are invited to join. Snacks and drinks will be provided.
Thursday July 11, 2024 7:30pm - 8:30pm PDT
Bayshore Room

7:30pm PDT

Birds-of-a-Feather Sessions (BoFs)
Thursday July 11, 2024 7:30pm - 10:30pm PDT
Registered attendees may schedule Birds-of-a-Feather sessions (BoFs) and reserve meeting rooms for them in one-hour increments via the BoFs schedule grid posted outside the badge pickup area. The Attendee Guide, which will be sent to registered attendees shortly before the event, contains more details for scheduling a BoF. Each room will be set with a projector and screen.
Thursday July 11, 2024 7:30pm - 10:30pm PDT
Central Room, Tasman Room

8:30pm PDT

Futurewei Sponsor Event
Thursday July 11, 2024 8:30pm - 9:30pm PDT
All attendees from both USENIX ATC and OSDI are invited to join. Snacks and drinks will be provided.
Thursday July 11, 2024 8:30pm - 9:30pm PDT
Bayshore Room
 
Friday, July 12
 

8:00am PDT

Continental Breakfast
Friday July 12, 2024 8:00am - 9:00am PDT
Friday July 12, 2024 8:00am - 9:00am PDT
Grand Ballroom Foyer

8:00am PDT

Badge Pickup
Friday July 12, 2024 8:00am - 12:00pm PDT
Friday July 12, 2024 8:00am - 12:00pm PDT
Lobby West

9:00am PDT

DSig: Breaking the Barrier of Signatures in Data Centers
Friday July 12, 2024 9:00am - 9:20am PDT
Marcos K. Aguilera, VMware Research Group; Clément Burgelin, Rachid Guerraoui, and Antoine Murat, École Polytechnique Fédérale de Lausanne (EPFL); Athanasios Xygkis, Oracle Labs; Igor Zablotchi, Mysten Labs

Data centers increasingly host mutually distrustful users on shared infrastructure. Digital signatures are a powerful tool for safeguarding such users: they have revolutionized Internet-scale applications, but current signatures are too slow for the growing genre of microsecond-scale systems in modern data centers. We propose DSig, the first digital signature system to achieve single-digit microsecond latency to sign, transmit, and verify signatures in data center systems. DSig is based on the observation that, in many data center applications, the signer of a message usually knows who will verify its signature. We introduce a new hybrid signature scheme that combines cheap single-use hash-based signatures verified in the foreground with traditional signatures pre-verified in the background. Compared to prior state-of-the-art signatures, DSig reduces signing time from 18.9 to 0.7 μs and verification time from 35.6 to 5.1 μs, while keeping signature transmission time below 2.5 μs. Moreover, DSig achieves 2.5× higher signing throughput and 6.9× higher verification throughput than the state of the art. We use DSig to (a) bring auditability to two key-value stores (HERD and Redis) and a financial trading system (based on Liquibook) for 86% lower added latency than the state of the art, and (b) replace signatures in BFT broadcast and BFT replication, reducing their latency by 73% and 69%, respectively.
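
As a rough, self-contained illustration of the hybrid idea sketched in this abstract, the Python fragment below pairs a toy Lamport-style one-time hash-based signature (verified on the fast foreground path) with a stand-in "traditional" signature over the one-time public key that the verifier checks ahead of time in the background; the primitives, key handling, and batching here are simplified placeholders, not DSig's actual construction.

import hashlib, hmac, os

def lamport_keygen():
    # One-time key: 256 secret pairs; the public key is their hashes.
    sk = [[os.urandom(32), os.urandom(32)] for _ in range(256)]
    pk = [[hashlib.sha256(s).digest() for s in pair] for pair in sk]
    return sk, pk

def lamport_sign(sk, msg):
    h = hashlib.sha256(msg).digest()
    bits = [(h[i // 8] >> (i % 8)) & 1 for i in range(256)]
    return [sk[i][b] for i, b in enumerate(bits)]

def lamport_verify(pk, msg, sig):
    h = hashlib.sha256(msg).digest()
    bits = [(h[i // 8] >> (i % 8)) & 1 for i in range(256)]
    return all(hashlib.sha256(sig[i]).digest() == pk[i][bits[i]] for i in range(256))

# Background phase: the verifier pre-verifies a batch of one-time public keys
# vouched for by a conventional signature. An HMAC under a shared key stands in
# for that signature purely to keep the sketch dependency-free.
LONG_TERM_KEY = os.urandom(32)

def batch_digest(pks):
    return hashlib.sha256(b"".join(h for onetime in pks for pair in onetime for h in pair)).digest()

def pre_sign_batch(pks):
    return hmac.new(LONG_TERM_KEY, batch_digest(pks), hashlib.sha256).digest()

def pre_verify_batch(pks, tag):
    return hmac.compare_digest(pre_sign_batch(pks), tag)

# Foreground (critical path): only the cheap hash-based operations remain.
sk, pk = lamport_keygen()
tag = pre_sign_batch([pk])            # done ahead of time by the signer
assert pre_verify_batch([pk], tag)    # done ahead of time by the verifier
sig = lamport_sign(sk, b"transfer 42")
assert lamport_verify(pk, b"transfer 42", sig)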

https://www.usenix.org/conference/osdi24/presentation/aguilera
Friday July 12, 2024 9:00am - 9:20am PDT
Grand Ballroom ABGH

9:00am PDT

Diagnosing Application-network Anomalies for Millions of IPs in Production Clouds
Friday July 12, 2024 9:00am - 9:25am PDT
Zhe Wang, Shanghai Jiao Tong University; Huanwu Hu, Alibaba Cloud; Linghe Kong, Shanghai Jiao Tong University; Xinlei Kang and Teng Ma, Alibaba Cloud; Qiao Xiang, Xiamen University; Jingxuan Li and Yang Lu, Alibaba Cloud; Zhuo Song, Shanghai Jiao Tong University and Alibaba Cloud; Peihao Yang, Alibaba Cloud; Jiejian Wu, Shanghai Jiao Tong University; Yong Yang and Tao Ma, Alibaba Cloud; Zheng Liu, Alibaba Cloud and Zhejiang University; Xianlong Zeng and Dennis Cai, Alibaba Cloud; Guihai Chen, Shanghai Jiao Tong University

Timely detection and diagnosis of application-network anomalies is a key challenge in operating large-scale production clouds. We reveal three practical issues in the cloud-native era. First, impact assessment of anomalies at the (micro)service level is absent from currently deployed monitoring systems. Ping systems are oblivious to the "actual weights" of application traffic, e.g., traffic volume and the number of connections/instances. Failures of critical (micro)services with large weights can easily be overlooked by probing systems under prevalent network jitters. Second, the efficiency of anomaly routing (to a blamed application/network team) is still low when multiple attribution teams are involved. Third, collecting fine-grained metrics at the (micro)service level incurs considerable computational/storage overheads, yet it is indispensable for accurate impact assessment and anomaly routing.

We introduce the application-network diagnosing (AND) system in Alibaba cloud. AND exploits the single metric of TCP retransmission (retxs) to capture anomalies at (micro)service levels and correlates applications with networks end-to-end. To resolve deployment challenges, AND further proposes three core designs: (1) a collecting tool to perform filtering/statistics on massive retxs at the (micro)service level, (2) a real-time detection procedure to extract anomalies from ‘noisy’ retxs with millions of time series, (3) an anomaly routing model to delimit anomalies among multiple target teams/scenarios. AND has been deployed in Alibaba cloud for over three years and enables minute-level anomaly detection/routing and fast failure recovery.
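
The sketch below is a toy rendition of the core idea of weighting anomalies by traffic and detecting them from TCP retransmission (retx) time series; the window, threshold, and minimum-weight knobs are invented for illustration and bear no relation to the production AND pipeline.

import statistics
from collections import defaultdict, deque

class RetxDetector:
    # Per-(micro)service rolling baseline over TCP retransmission rates.
    # Window, z-score threshold, and minimum traffic weight are illustrative.
    def __init__(self, window=60, z_thresh=4.0, min_weight=1000):
        self.z_thresh = z_thresh
        self.min_weight = min_weight
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, service, retx, total_pkts):
        # Feed one interval's counters; return True if the interval looks anomalous.
        if total_pkts < self.min_weight:      # ignore services with negligible weight
            return False
        rate = retx / total_pkts
        hist = self.history[service]
        anomalous = False
        if len(hist) >= 10:                   # need some baseline first
            mu = statistics.mean(hist)
            sigma = statistics.pstdev(hist) or 1e-9
            anomalous = (rate - mu) / sigma > self.z_thresh
        hist.append(rate)
        return anomalous

det = RetxDetector()
for _ in range(30):
    det.observe("checkout", retx=5, total_pkts=100_000)
print(det.observe("checkout", retx=4_000, total_pkts=100_000))   # True: likely anomaly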

https://www.usenix.org/conference/atc24/presentation/wang-zhe
Friday July 12, 2024 9:00am - 9:25am PDT
Grand Ballroom CD

9:00am PDT

Panorama: Optimizing Internet-scale Users’ Routes from End to End
Friday July 12, 2024 9:00am - 9:25am PDT
Geng Li, Shuihai Hu, and Kun Tan, Huawei Technologies

Network performance is critical to the user experience of many real-time interactive applications, such as video conferencing and live streaming. Empirical studies show that transport latency above 300ms becomes unacceptable, leading to a significant decline in user satisfaction. Unfortunately, due to the best-effort nature of the Internet, such strict performance requirements can hardly be fully met. Despite continuous efforts to improve the performance of the Internet (e.g., overlay routing optimization, traffic engineering, and content delivery networks), we are still far from delivering satisfying network performance for these applications. The stringent network requirements, worldwide cross-continental network transfers, and the large scale of Internet-wide users together make it a complex challenge to deliver an ideal user experience for emerging real-time interactive applications.

In this paper, we present Panorama, a scalable system for delivering the desired user experience to real-time interactive applications over a globally distributed overlay network. To achieve an ideal user experience, Panorama takes a centralized approach to global end-to-end traffic-engineering optimization, and overcomes the scalability issue through intelligent measurement-based user grouping and scalable, parallelizable route computation. Panorama has been deployed in a large global real-time overlay network since 2021. We evaluate Panorama on 81 million selected real-world traces from the deployment environment, with clients across 66 countries. The extensive evaluation demonstrates that Panorama can support a routing service for millions of users, while providing latency lower than 200ms for 96.34% of communication sessions and improving SLA satisfaction by up to 88.0%.

https://www.usenix.org/conference/atc24/presentation/li-geng
Friday July 12, 2024 9:00am - 9:25am PDT
Grand Ballroom EF

9:20am PDT

Ransom Access Memories: Achieving Practical Ransomware Protection in Cloud with DeftPunk
Friday July 12, 2024 9:20am - 9:40am PDT
Zhongyu Wang, Yaheng Song, Erci Xu, Haonan Wu, Guangxun Tong, Shizhuo Sun, Haoran Li, Jincheng Liu, Lijun Ding, Rong Liu, Jiaji Zhu, and Jiesheng Wu, Alibaba Group

In this paper, we focus on building a ransomware detection and recovery system for cloud block stores. We start by discussing the possibility of directly using existing methods or porting one to our scenario with modifications. Though these attempts failed, they led us to identify the unique I/O characteristics of ransomware and drove us to build DeftPunk, a block-level ransomware detection and recovery system. DeftPunk uses a two-layer classifier for fast and accurate detection, creates pre-/post-attack snapshots to avoid data loss, and leverages log-structured support for low-overhead recovery. Our large-scale benchmark shows that DeftPunk achieves nearly 100% recall across 13 types of ransomware with low runtime overhead.
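
A minimal sketch of the two-layer detection idea, assuming invented block-level I/O features and thresholds (the paper's actual features and classifiers differ): a cheap filter screens every volume, and only suspicious volumes pay for the heavier second-stage model.

def stage1_suspicious(features):
    # Cheap filter over coarse block-level I/O counters (thresholds are illustrative).
    return features["overwrite_ratio"] > 0.6 and features["write_entropy"] > 7.0

def stage2_score(features):
    # Stand-in for a heavier learned model; a weighted sum keeps the sketch simple.
    weights = {"overwrite_ratio": 2.0, "write_entropy": 0.5,
               "read_then_write_pairs": 1.5, "throughput_spike": 1.0}
    return sum(w * features.get(k, 0.0) for k, w in weights.items())

def detect(volume_features, threshold=8.0):
    # Only volumes flagged by the cheap stage pay for the expensive stage,
    # which keeps runtime overhead low at cloud scale.
    return stage1_suspicious(volume_features) and stage2_score(volume_features) > threshold

volume = {"overwrite_ratio": 0.85, "write_entropy": 7.8,
          "read_then_write_pairs": 3.0, "throughput_spike": 1.0}
print(detect(volume))   # True: take a post-attack snapshot and raise an alert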

https://www.usenix.org/conference/osdi24/presentation/wang-zhongyu
Friday July 12, 2024 9:20am - 9:40am PDT
Grand Ballroom ABGH

9:25am PDT

Data Caching for Enterprise-Grade Petabyte-Scale OLAP
Friday July 12, 2024 9:25am - 9:50am PDT
Chunxu Tang and Bin Fan, Alluxio; Jing Zhao and Chen Liang, Uber, Inc; Yi Wang and Beinan Wang, Alluxio; Ziyue Qiu, Carnegie Mellon University and Uber, Inc.; Lu Qiu, Bowen Ding, Shouzhuo Sun, Saiguang Che, Jiaming Mai, Shouwei Chen, Yu Zhu, and Jianjian Xie, Alluxio; Yutian (James) Sun, Meta, Inc.; Yao Li and Yangjun Zhang, Uber, Inc.; Ke Wang, Meta, Inc.; Mingmin Chen, Uber, Inc.

With the exponential growth of data and evolving use cases, petabyte-scale OLAP data platforms are increasingly adopting a model that decouples compute from storage. This shift, evident in organizations like Uber and Meta, introduces operational challenges including massive, read-heavy I/O traffic with potential throttling, as well as skewed and fragmented data access patterns. Addressing these challenges, this paper introduces the Alluxio local (edge) cache, a highly effective architectural optimization tailored for such environments. This embeddable cache, optimized for petabyte-scale data analytics, leverages local SSD resources to alleviate network I/O and API call pressures, significantly improving data transfer efficiency. Integrated with OLAP systems like Presto and storage services like HDFS, the Alluxio local cache has demonstrated its effectiveness in handling large-scale, enterprise-grade workloads over three years of deployment at Uber and Meta. We share insights and operational experiences in implementing these optimizations, providing valuable perspectives on managing modern, massive-scale OLAP workloads.

https://www.usenix.org/conference/atc24/presentation/tang
Friday July 12, 2024 9:25am - 9:50am PDT
Grand Ballroom CD

9:25am PDT

Enhancing Resource Management of the World's Largest PCDN System for On-Demand Video Streaming
Friday July 12, 2024 9:25am - 9:50am PDT
Rui-Xiao Zhang, UIUC; Haiping Wang, Shu Shi, Xiaofei Pang, Yajie Peng, and Zhichen Xue, ByteDance; Jiangchuan Liu, Simon Fraser University

The rapid growth of video services has created a significant demand for efficient content delivery. Traditional approaches mainly rely on Content Delivery Networks (CDNs), which unfortunately incur significant bandwidth costs for video providers. To resolve this problem, cost-efficient edge resources have emerged as a new solution to replace CDNs. However, their heterogeneous hardware and poor performance still present challenges to their effective utilization. In this paper, we present how ByteDance explores the use of these cost-efficient but less performant resources. Specifically, we first present an extensive overview of PCDN, ByteDance's alternative to CDNs for content delivery. Second, as PCDN encounters significant resource imbalances after years of deployment, we further introduce PCDN+, the enhanced iteration of PCDN. By integrating a well-designed centralized/decentralized framework, we evolve the previous "static" and "uncontrolled" PCDN into a "dynamic" and "controlled" system. Extensive A/B tests and real-world deployment demonstrate that PCDN+ 1) effectively alleviates overloading issues, 2) significantly improves the utilization of low-cost resources, and 3) provides higher service speed.

https://www.usenix.org/conference/atc24/presentation/zhang-rui-xiao
Friday July 12, 2024 9:25am - 9:50am PDT
Grand Ballroom EF

9:40am PDT

Secret Key Recovery in a Global-Scale End-to-End Encryption System
Friday July 12, 2024 9:40am - 10:00am PDT
Graeme Connell, Signal Messenger; Vivian Fang, UC Berkeley; Rolfe Schmidt, Signal Messenger; Emma Dauterman and Raluca Ada Popa, UC Berkeley

End-to-end encrypted messaging applications ensure that an attacker cannot read a user's message history without their decryption keys. While this provides strong privacy, it creates a usability problem: if a user loses their devices and cannot access their decryption keys, they can no longer access their message history. To solve this usability problem, users should be able to back up their decryption keys with the messaging provider. For privacy, the provider should not have access to users' decryption keys. To solve this problem, we present Secure Value Recovery 3 (SVR3), a secret key recovery system that distributes trust across different types of hardware enclaves run by different cloud providers in order to protect users' decryption keys. SVR3 is the first deployed secret key recovery system to split trust across heterogeneous enclaves managed by different cloud providers: this design ensures that a single type of enclave does not become a central point of attack. SVR3 protects decryption keys via rollback protection and fault tolerance techniques tailored to the enclaves' security guarantees. SVR3 costs $0.0025/user/year and takes 365ms for a user to recover their key, which is a rare operation. A part of SVR3 has been rolled out to millions of real users in a deployment with capacity for over 500 million users, demonstrating the ability to operate at scale.
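
As a toy illustration of the "split trust across heterogeneous backends" idea, the sketch below XOR-shares a decryption key so that no single backend (and thus no single enclave type) ever holds it; SVR3's real protocol additionally involves PIN hardening, rollback protection, and fault tolerance, none of which is shown.

import functools
import os

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(key, n):
    # XOR-based n-of-n secret sharing: all shares are required to reconstruct.
    shares = [os.urandom(len(key)) for _ in range(n - 1)]
    shares.append(functools.reduce(xor_bytes, shares, key))
    return shares

def recover_key(shares):
    return functools.reduce(xor_bytes, shares)

decryption_key = os.urandom(32)
# Each share would live in a different enclave type run by a different cloud
# provider, so compromising any single enclave type reveals nothing about the key.
sgx_share, sev_share, nitro_share = split_key(decryption_key, 3)
assert recover_key([sgx_share, sev_share, nitro_share]) == decryption_key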

https://www.usenix.org/conference/osdi24/presentation/connell
Friday July 12, 2024 9:40am - 10:00am PDT
Grand Ballroom ABGH

9:50am PDT

Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?
Friday July 12, 2024 9:50am - 10:15am PDT
Bin Yang, Tsinghua University and National Supercomputer Center in Wuxi; Hao Wei, Tsinghua University; Wenhao Zhu, Shandong University and National Supercomputer Center in Wuxi; Yuhao Zhang, Tsinghua University; Weiguo Liu, Shandong University; Wei Xue, Tsinghua University, Qinghai University and Intelligent Computing and Application Laboratory of Qinghai Province, and National Supercomputer Center in Wuxi

The system architecture of contemporary supercomputers is growing increasingly intricate with the ongoing evolution of system-wide network and storage technologies, making it challenging for application developers and system administrators to manage and utilize this escalating complexity effectively. Moreover, the limited experience of application developers and system administrators in conducting insightful analyses of diverse High-Performance Computing (HPC) workloads and the resulting array of resource utilization characteristics exacerbates the challenge. To address this issue, we undertake a comprehensive analysis of six years' worth of data (40 TB, comprising I/O performance data and job running information) from Sunway TaihuLight, which has 41,508 nodes and is currently ranked as the world's 11th-fastest supercomputer. Our study provides valuable insights into operational management strategies for HPC systems (e.g., job hanging caused by heavy-load benchmark testing, and job starvation caused by aggressive scheduling policies) and I/O workload characteristics (e.g., spikes in getattr operations caused by massive access to grid files, and large numbers of files accessed by many applications in a short period), shedding light on both challenges and opportunities for improvement in the HPC environment. This paper delineates our methodology, findings, and the significance of this study. Additionally, we discuss the potential of our research for future studies and practice within this domain.

https://www.usenix.org/conference/atc24/presentation/yang
Friday July 12, 2024 9:50am - 10:15am PDT
Grand Ballroom CD

9:50am PDT

TileClipper: Lightweight Selection of Regions of Interest from Videos for Traffic Surveillance
Friday July 12, 2024 9:50am - 10:15am PDT
Shubham Chaudhary and Aryan Taneja, IIIT Delhi; Anjali Singh, Indira Gandhi Delhi Technology University for Women; Purbasha Roy, Sohum Sikdar, Mukulika Maity, and Arani Bhattacharya, IIIT Delhi

As traffic surveillance becomes increasingly common, thousands of roadside cameras send video feeds to cloud servers to run computer vision algorithms, requiring high bandwidth. State-of-the-art techniques reduce the bandwidth requirement by either sending a limited number of frames/pixels/regions or re-encoding the important parts of the video. This imposes significant overhead on both camera-side and server-side compute, as re-encoding is expensive. In this work, we propose TILECLIPPER, a system that utilizes tile sampling, in which only a limited number of rectangular areas within the frames, known as tiles, are sent to the server. TILECLIPPER selects tiles adaptively by exploiting their correlation with the tile bitrates. We evaluate TILECLIPPER on different datasets comprising 55 videos in total and show that, on average, our technique reduces the data sent to the cloud by ≈22% while providing a detection accuracy of 92%, with minimal calibration and compute compared to prior works. We show that TILECLIPPER performs real-time tile filtering even on cheap edge devices like the Raspberry Pi 4 and NVIDIA Jetson Nano. We further create a live deployment of TILECLIPPER to show that it provides over 87% detection accuracy and over 55% bandwidth savings.
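
A toy sketch of bitrate-driven tile selection, assuming a simple top-k rule over per-tile encoded sizes (the paper's calibration and selection logic are more involved): tiles whose bitrate stands out, typically those covering moving objects, are the ones worth sending.

def select_tiles(tile_bitrates, keep_fraction=0.25):
    # Keep the tiles whose encoded size stands out; drop the rest camera-side.
    # A fixed top-k rule replaces the paper's calibration for illustration.
    k = max(1, int(len(tile_bitrates) * keep_fraction))
    ranked = sorted(tile_bitrates, key=tile_bitrates.get, reverse=True)
    return set(ranked[:k])

# A 4x4 tile grid: moving objects inflate the bitrate of the tiles they cover.
bitrates = {(row, col): 5_000 for row in range(4) for col in range(4)}
bitrates[(2, 1)] = 48_000    # e.g., a passing vehicle
bitrates[(2, 2)] = 51_000
print(select_tiles(bitrates))   # only these tiles are sent to the server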

https://www.usenix.org/conference/atc24/presentation/chaudhary
Friday July 12, 2024 9:50am - 10:15am PDT
Grand Ballroom EF

10:00am PDT

Flock: A Framework for Deploying On-Demand Distributed Trust
Friday July 12, 2024 10:00am - 10:20am PDT
Darya Kaviani and Sijun Tan, UC Berkeley; Pravein Govindan Kannan, IBM Research; Raluca Ada Popa, UC Berkeley

Recent years have seen an increase in applications that distribute trust across n servers to protect user data from a central point of attack. However, these deployments remain limited due to a core obstacle: establishing n distinct trust domains. An application provider, itself a single trust domain, cannot directly deploy multiple trust domains. As a result, application providers forge business relationships to enlist third parties as trust domains, which is a manual, lengthy, and expensive process, inaccessible to many application developers.

We introduce the on-demand distributed-trust architecture, which enables an application provider to deploy distributed trust automatically and immediately without controlling the other trust domains. The insight lies in reversing the deployment method such that each user's client, rather than the application provider, drives deployment. While at first glance this approach appears infeasible due to cost, performance, and resource-abuse concerns, our system Flock resolves these challenges. We implement and evaluate Flock on 3 major cloud providers and 8 distributed-trust applications. On average, Flock achieves 1.05× the latency and 0.68-2.27× the cloud cost of a traditional distributed-trust deployment, without reliance on third-party relationships.

https://www.usenix.org/conference/osdi24/presentation/kaviani
Friday July 12, 2024 10:00am - 10:20am PDT
Grand Ballroom ABGH

10:15am PDT

ATC Break with Refreshments
Friday July 12, 2024 10:15am - 10:50am PDT
Friday July 12, 2024 10:15am - 10:50am PDT
Grand Ballroom Foyer

10:20am PDT

OSDI Break with Refreshments
Friday July 12, 2024 10:20am - 10:50am PDT
Friday July 12, 2024 10:20am - 10:50am PDT
Grand Ballroom Foyer

10:50am PDT

FairyWREN: A Sustainable Cache for Emerging Write-Read-Erase Flash Interfaces
Friday July 12, 2024 10:50am - 11:10am PDT
Sara McAllister and Yucong "Sherry" Wang, Carnegie Mellon University; Benjamin Berg, UNC Chapel Hill; Daniel S. Berger, Microsoft Azure and University of Washington; George Amvrosiadis, Nathan Beckmann, and Gregory R. Ganger, Carnegie Mellon University

Datacenters need to reduce embodied carbon emissions, particularly for flash, which accounts for 40% of embodied carbon in servers. However, decreasing flash's embodied emissions is challenging due to flash's limited write endurance, which more than halves with each generation of denser flash. Reducing embodied emissions requires extending flash lifetime, stressing its limited write endurance even further. The legacy Logical Block-Addressable Device (LBAD) interface exacerbates the problem by forcing devices to perform garbage collection, leading to even more writes.

Flash-based caches in particular write frequently, limiting the lifetimes and densities of the devices they use. These flash caches illustrate the need to break away from LBAD and switch to the new Write-Read-Erase iNterfaces (WREN) now coming to market. WREN affords applications control over data placement and garbage collection. We present FairyWREN, a flash cache designed for WREN. FairyWREN reduces writes by co-designing caching policies and flash garbage collection. FairyWREN provides a 12.5× write reduction over state-of-the-art LBAD caches. This decrease in writes allows flash devices to last longer, decreasing flash cost by 35% and flash carbon emissions by 33%.

https://www.usenix.org/conference/osdi24/presentation/mcallister
Friday July 12, 2024 10:50am - 11:10am PDT
Grand Ballroom ABGH

10:50am PDT

Expeditious High-Concurrency MicroVM SnapStart in Persistent Memory with an Augmented Hypervisor
Friday July 12, 2024 10:50am - 11:15am PDT
Xingguo Pang, Yanze Zhang, and Liu Liu, University of Macau; Dazhao Cheng, WuHan University; Chengzhong Xu and Xiaobo Zhou, University of Macau

The industry has embraced snapshotting to tackle cold starts and efficiently manage numerous short-lived functions for microservice-native architectures, serverless computing, and machine learning inference. FaaSnap, a cutting-edge research approach, reduces page faults during on-demand paging by prefetching profiled working-set pages into DRAM, but it incurs high caching overheads and I/O demands, potentially degrading system efficiency and limiting concurrent MicroVM executions.

This paper introduces PASS, a system leveraging byte-addressable persistent memory (PMEM) for cost-effective and highly concurrent MicroVM SnapStart execution. PASS, functioning as a PMEM-aware augmented hypervisor in the user space, revolutionizes MicroVM memory restoration. It constructs complete address indexing of the guest memory mapped to single-tier PMEM space, enabling zero-copy on-demand paging by exploiting PMEM's direct access feature. This approach bypasses the cache layer and maintains guest OS transparency, avoiding invasive modifications. Experimental results, derived from real-world applications, reveal that PASS substantially decreases SnapStart execution time, achieving up to a 72% reduction compared to the Firecracker hypervisor on the PMEM filesystem and a 47% reduction compared to FaaSnap. Moreover, PASS achieves double the maximum concurrency compared to both Firecracker and FaaSnap. It improves cost-effectiveness by 2.2× and 1.6× over Firecracker and FaaSnap, respectively.

https://www.usenix.org/conference/atc24/presentation/pang
Friday July 12, 2024 10:50am - 11:15am PDT
Grand Ballroom CD

10:50am PDT

Efficient Decentralized Federated Singular Vector Decomposition
Friday July 12, 2024 10:50am - 11:15am PDT
Di Chai, Junxue Zhang, Liu Yang, and Yilun Jin, Hong Kong University of Science and Technology; Leye Wang, Peking University; Kai Chen, Hong Kong University of Science and Technology; Qiang Yang, Hong Kong University of Science and Technology and Webank

Federated singular value decomposition (SVD) is a foundation for many real-world distributed applications. Existing federated SVD studies either require external servers, which weakens privacy protection, or leverage homomorphic encryption (HE) to eliminate external servers (i.e., to be decentralized) but suffer from significant inefficiencies caused by extensive computational and communication overhead.

This paper presents Excalibur, an efficient decentralized federated SVD system. At its core, Excalibur proposes a lightweight matrix protection method to reduce the computational degradation caused by cryptographic operations, improving computation performance. Furthermore, it designs a communication-efficient decentralized SVD workflow based on a quantitative analysis of the design space, optimizing communication performance. To validate the efficiency of Excalibur, we implement a fully functional Excalibur system and evaluate it with real-world applications. Our results show that Excalibur not only removes the external servers but also achieves 3.1×-6.0× faster performance than the state-of-the-art (SOTA) server-aided method on different shapes of billion-scale data. In addition, Excalibur exhibits more than 23,000× higher throughput than the SOTA HE-based system.

https://www.usenix.org/conference/atc24/presentation/chai
Friday July 12, 2024 10:50am - 11:15am PDT
Grand Ballroom EF

11:10am PDT

Massively Parallel Multi-Versioned Transaction Processing
Friday July 12, 2024 11:10am - 11:30am PDT
Shujian Qian and Ashvin Goel, University of Toronto

Multi-version concurrency control can avoid most read-write conflicts in OLTP workloads. However, multi-versioned systems often have higher complexity and overheads compared to single-versioned systems due to the need for allocating, searching and garbage collecting versions. Consequently, single-versioned systems can often dramatically outperform multi-versioned systems.

We introduce Epic, the first multi-versioned GPU-based deterministic OLTP database. Epic utilizes a batched execution scheme, performing concurrency control initialization for a batch of transactions before executing the transactions deterministically. By leveraging the predetermined ordering of transactions, Epic eliminates version search entirely and significantly reduces version allocation and garbage collection overheads. Our approach utilizes the computational power of the GPU architecture to accelerate Epic's concurrency control initialization and efficiently parallelize batched transaction execution, while ensuring low latency. Our evaluation demonstrates that Epic achieves comparable performance under low contention and consistently higher performance under medium to high contention versus state-of-the-art single and multi-versioned systems.
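
The toy record below illustrates how a predetermined batch order can eliminate version search: version slots are laid out during the initialization phase in transaction-ID order, so a read locates its version by position. This is only a single-record, single-batch sketch with invented structure, not Epic's GPU data layout.

import bisect

class Record:
    # One version slot per writer in the current batch, laid out during the
    # initialization phase in transaction-ID order, so readers locate their
    # version by position instead of searching a version chain.
    def __init__(self, initial):
        self.base = initial
        self.writer_tids = []   # planned writers, sorted (planning runs in tid order)
        self.versions = []      # slot i holds the value written by writer_tids[i]

    def plan_write(self, tid):          # initialization phase
        self.writer_tids.append(tid)
        self.versions.append(None)

    def write(self, tid, value):        # execution phase (parallel in Epic)
        self.versions[self.writer_tids.index(tid)] = value

    def read(self, tid):
        # Latest version created by a transaction ordered before this reader.
        i = bisect.bisect_left(self.writer_tids, tid)
        return self.base if i == 0 else self.versions[i - 1]

acct = Record(initial=100)
for tid in (3, 7):                      # plan writes in batch (tid) order
    acct.plan_write(tid)
acct.write(3, 90)
acct.write(7, 120)
print(acct.read(5))                     # 90  (sees writer tid 3)
print(acct.read(2))                     # 100 (ordered before all writers)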

https://www.usenix.org/conference/osdi24/presentation/qian
Friday July 12, 2024 11:10am - 11:30am PDT
Grand Ballroom ABGH

11:15am PDT

Taming Hot Bloat Under Virtualization with HUGESCOPE
Friday July 12, 2024 11:15am - 11:40am PDT
Chuandong Li, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University and Zhongguancun Laboratory; Sai Sha, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University and Beijing Huawei Digital Technologies; Yangqing Zeng and Xiran Yang, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University; Yingwei Luo and Xiaolin Wang, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University and Zhongguancun Laboratory; Zhenlin Wang, Michigan Tech; Diyu Zhou, National Key Laboratory for Multimedia Information Processing, School of CS, Peking University and EPFL

Huge pages are effective in reducing address translation overhead under virtualization. However, huge pages suffer from the hot bloat problem, where accesses to a huge page are skewed towards a few base pages (i.e., 4KB pages), making the hypervisor (mistakenly) classify the whole huge page as hot. Hot bloat renders several critical techniques used in virtualization ineffective, including tiered memory and page sharing. Prior work addressing hot bloat either requires hardware modification or targets a specific scenario and is not applicable to a hypervisor.

This paper presents HugeScope, a lightweight, effective, and generic system that addresses the hot bloat problem under virtualization on commodity hardware. HugeScope includes an efficient and precise page tracking mechanism that leverages the extra level of indirect memory translation in the hypervisor. HugeScope provides a generic framework to support page splitting and coalescing policies, considering memory pressure as well as the recency, frequency, and skewness of page accesses. Moreover, HugeScope is general and modular and can thus be easily applied to various scenarios concerning hot bloat, including tiered memory management (HS-TMM) and page sharing (HS-Share). Evaluation shows that HugeScope incurs less than 4% overhead; by addressing hot bloat, HS-TMM improves performance by up to 61% over vTMM, while HS-Share saves 41% more memory than Ingens while offering comparable performance.
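
A minimal sketch of one possible split policy in the spirit described above, with invented thresholds: a huge page is split only when accesses concentrate on a small fraction of its 512 base pages and memory is under pressure.

def should_split(access_counts, memory_pressure, skew_thresh=0.8, hot_thresh=50):
    # access_counts: per-base-page access counters for one 2MB huge page (512 entries).
    total = sum(access_counts)
    if not memory_pressure or total < hot_thresh:
        return False
    hottest = sorted(access_counts, reverse=True)[:32]   # hottest 1/16 of base pages
    return sum(hottest) / total >= skew_thresh           # accesses concentrated: split

counts = [0] * 512
counts[7], counts[8] = 900, 80        # nearly all accesses hit two 4KB base pages
print(should_split(counts, memory_pressure=True))   # True: split, keep hot base pages, demote the rest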

https://www.usenix.org/conference/atc24/presentation/li-chuandong
Friday July 12, 2024 11:15am - 11:40am PDT
Grand Ballroom CD

11:15am PDT

Models on the Move: Towards Feasible Embedded AI for Intrusion Detection on Vehicular CAN Bus
Friday July 12, 2024 11:15am - 11:40am PDT
He Xu, Di Wu, Yufeng Lu, and Jiwu Lu, Hunan University and ExponentiAI Innovation; Haibo Zeng, Virginia Tech

The Controller Area Network (CAN) protocol is widely used in vehicles as an efficient standard enabling communication among Electronic Control Units (ECUs). However, the CAN bus is vulnerable to malicious attacks because it lacks defense features. To achieve efficient and effective intrusion detection system (IDS) design for hardware and embedded system security in vehicles, we specifically tackle the challenge that existing IDS techniques rarely consider small-batch attacks. We propose MULSAM, a model with a hardware implementation that operates on the vehicular CAN bus, employing multi-dimensional long short-term memory with a self-attention mechanism. The self-attention mechanism enhances the characteristics of CAN-bus-oriented attack behavior, and the multi-dimensional long short-term memory effectively extracts in-depth features from time-series data. We compare MULSAM with other baselines on five attacks generated from benign CAN data extracted from an actual vehicle. Our experimental results demonstrate that MULSAM has the best training stability and detection accuracy (98.98%) for identifying small-batch injection attacks. Furthermore, to speed up the inference of MULSAM as an embedded unit in vehicles, a hardware accelerator has been implemented on an FPGA, achieving better energy efficiency than other embedded platforms. Even with a certain degree of quantization, the accelerated MULSAM model still achieves a high detection accuracy of 98.81% and a low latency of 1.88 ms, providing a new cyber-physical system security solution towards feasible embedded AI for intrusion detection on the vehicular CAN bus.

https://www.usenix.org/conference/atc24/presentation/xu-he
Friday July 12, 2024 11:15am - 11:40am PDT
Grand Ballroom EF

11:30am PDT

Burstable Cloud Block Storage with Data Processing Units
Friday July 12, 2024 11:30am - 11:50am PDT
Junyi Shu, School of Computer Science, Peking University and Alibaba Cloud; Kun Qian and Ennan Zhai, Alibaba Cloud; Xuanzhe Liu and Xin Jin, School of Computer Science, Peking University

Cloud block storage (CBS) is a key pillar of public clouds. Today's CBS distinguishes itself from physical counterparts (e.g., SSDs) by offering unique burst capability as well as enhanced throughput, capacity, and availability. We conduct an initial characterization of our CBS product, a globally deployed cloud block storage service at public cloud provider Alibaba Cloud. A key observation is that the storage agent (SA), which runs on a data processing unit (DPU) and connects user VMs to the backend storage, is the major source of performance fluctuation when burst capability is provided. In this paper, we propose BurstCBS, a hardware-software co-designed I/O scheduling system, to address load imbalance and tenant interference at the SA. BurstCBS exploits high-performance queue scaling to achieve near-perfect load balancing at line rate. To mitigate tenant interference, we design a novel burstable I/O scheduler that prioritizes resource allocation for base-level usage while supporting bursts. We employ a vectorized I/O cost estimator for comprehensive measurements of the resources consumed by different types of I/Os. Our evaluation shows that BurstCBS reduces average latency by up to 85% and provides up to 5× the throughput for base-level tenants under congestion with minimal overhead. We verify the benefits brought by BurstCBS with a database service that internally relies on CBS, and show that up to 83% latency reduction is observed on customer workloads.
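
The sketch below captures only the base-versus-burst admission idea in scalar form, with made-up numbers; the real scheduler works at line rate on the DPU and uses a vectorized cost estimator over several resource dimensions.

class BurstableScheduler:
    # Each tenant is guaranteed its base IOPS; burst traffic is admitted only
    # from whatever capacity remains after all base demands are satisfied.
    def __init__(self, capacity_iops, base_iops):
        self.capacity = capacity_iops
        self.base = dict(base_iops)       # tenant -> guaranteed IOPS

    def admit(self, demands):
        # demands: tenant -> requested IOPS for this scheduling interval.
        grant = {t: min(d, self.base.get(t, 0)) for t, d in demands.items()}
        leftover = self.capacity - sum(grant.values())
        burst = {t: demands[t] - grant[t] for t in demands if demands[t] > grant[t]}
        total_burst = sum(burst.values())
        for t, want in burst.items():     # share leftover capacity proportionally
            if leftover > 0 and total_burst > 0:
                grant[t] += min(want, leftover * want / total_burst)
        return grant

sched = BurstableScheduler(capacity_iops=100_000,
                           base_iops={"vm-a": 20_000, "vm-b": 20_000})
print(sched.admit({"vm-a": 90_000, "vm-b": 15_000}))   # vm-b keeps its base; vm-a bursts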

https://www.usenix.org/conference/osdi24/presentation/shu
Friday July 12, 2024 11:30am - 11:50am PDT
Grand Ballroom ABGH

11:40am PDT

CrossMapping: Harmonizing Memory Consistency in Cross-ISA Binary Translation
Friday July 12, 2024 11:40am - 12:05pm PDT
Chen Gao and Xiangwei Meng, Lanzhou University; Wei Li, Tsinghua University; Jinhui Lai, Lanzhou University; Yiran Zhang, Beijing University of Posts and Telecommunications; Fengyuan Ren, Lanzhou University and Tsinghua University

The increasing prevalence of new Instruction Set Architectures (ISAs) necessitates the migration of closed-source binary programs across ISAs. Dynamic Binary Translation (DBT) stands out as a crucial technology for the cross-ISA emulation of binary programs. However, due to the mismatch in memory consistency between the guest and host ISAs, DBT systems face substantial challenges in guaranteeing correctness and translation performance for concurrent programs. Despite several attempts to bridge the memory-consistency gap between guest and host ISAs, prior work is either not universal across cross-ISA DBT systems or inefficient and even error-prone in translation.

This work presents CrossMapping, a general primitive mapping framework to enhance existing DBT systems for cross-ISA translation. By harmonizing memory consistency across diverse ISAs, CrossMapping enables smooth cross-ISA translation and accomplishes correct emulation. CrossMapping introduces specification tables to describe memory models in a unified and precise format, which facilitates the derivation of concurrent primitive mapping schemes based on a convenient comparison and analysis of memory models. The correctness of cross-ISA emulation is guaranteed by harmoniously integrating the derived mapping schemes with existing DBT systems. We evaluate CrossMapping for x86, ARMv8, and RISC-V on top of QEMU using the PARSEC benchmark suite. The results show that the average performance improvement can reach 8.5% when emulating x86 on ARMv8 and 7.3% when emulating x86 on RISC-V.

https://www.usenix.org/conference/atc24/presentation/gao-chen
Friday July 12, 2024 11:40am - 12:05pm PDT
Grand Ballroom CD

11:40am PDT

CPC: Flexible, Secure, and Efficient CVM Maintenance with Confidential Procedure Calls
Friday July 12, 2024 11:40am - 12:05pm PDT
Jiahao Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University; Zeyu Mi and Yubin Xia, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Haibing Guan, Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China; Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University

Confidential virtual machines (CVMs), while providing strong data privacy for cloud tenants, pose significant challenges to VM maintenance tasks such as live migration and snapshotting. Traditional host-based maintenance, while applicable to conventional VMs, is infeasible for CVMs because the host is untrusted and is prevented from performing the intrusive accesses that such maintenance requires. State-of-the-art approaches depend on non-trivial modifications to hardware and firmware and thus lead to notable compromises in security and/or performance. Furthermore, such approaches lack flexibility for upgrades and cross-platform compatibility, hindering the adoption of CVMs in the cloud.

In this paper, we introduce Confidential Procedure Calls (CPCs), a flexible approach to the efficient and secure execution of CVM maintenance modules from within the guest. We have implemented prototypes on two leading CVM platforms. Our prototype on AMD SEV showcases the high performance of CPCs, with 3× (resource reclamation) or even 138× (live migration) faster than existing approaches. Our prototype on ARM CCA further confirms CPCs' outstanding security and flexibility.

https://www.usenix.org/conference/atc24/presentation/chen-jiahao
Friday July 12, 2024 11:40am - 12:05pm PDT
Grand Ballroom EF

11:50am PDT

Motor: Enabling Multi-Versioning for Distributed Transactions on Disaggregated Memory
Friday July 12, 2024 11:50am - 12:10pm PDT
Ming Zhang, Yu Hua, and Zhijun Yang, Wuhan National Laboratory for Optoelectronics, School of Computer, Huazhong University of Science and Technology

In modern datacenters, memory disaggregation unpacks monolithic servers to build network-connected distributed compute and memory pools to improve resource utilization and deliver high performance. The compute pool leverages distributed transactions to access remote data in the memory pool to provide atomicity and strong consistency. Existing single-versioning designs are constrained by limited system concurrency and high logging overheads. The multi-versioning design used in conventional monolithic servers promises high concurrency and reduced logging overheads, but it fails to work on disaggregated memory. To bridge the gap between the multi-versioning design and disaggregated memory, we propose Motor, which holistically redesigns the version structure and transaction protocol to enable multi-versioning for fast distributed transaction processing on disaggregated memory. To efficiently organize different versions of data in the memory pool, Motor leverages a new consecutive version tuple (CVT) structure to store the versions together in a continuous manner, which allows the compute pool to obtain the target version in a single network round trip. On top of CVT, Motor leverages a fully one-sided RDMA-based MVCC protocol to support fast distributed transactions with flexible isolation levels. Experimental results demonstrate that Motor improves throughput by up to 98.1% and reduces latency by up to 55.8% compared with state-of-the-art systems.
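
As a rough illustration of the consecutive version tuple (CVT) idea, the sketch below packs a fixed number of version slots back-to-back so that one read of a contiguous region suffices to pick the visible version locally; the field layout, slot count, and "remote read" emulation are invented for illustration.

import struct

SLOTS = 4                      # version slots per consecutive version tuple (CVT)
SLOT_FMT = "<Q8s"              # commit timestamp + 8-byte payload per slot
SLOT_SIZE = struct.calcsize(SLOT_FMT)

def pack_cvt(versions):
    # versions: list of (commit_ts, payload bytes), padded to a fixed slot count
    # so the whole tuple occupies one contiguous, fixed-size region.
    padded = versions + [(0, b"")] * (SLOTS - len(versions))
    return b"".join(struct.pack(SLOT_FMT, ts, p.ljust(8, b"\x00")) for ts, p in padded)

def read_visible(cvt_bytes, snapshot_ts):
    # After one fetch of the contiguous region, pick the newest version whose
    # commit timestamp is visible to the reader's snapshot, entirely locally.
    best = None
    for i in range(SLOTS):
        ts, payload = struct.unpack_from(SLOT_FMT, cvt_bytes, i * SLOT_SIZE)
        if 0 < ts <= snapshot_ts and (best is None or ts > best[0]):
            best = (ts, payload.rstrip(b"\x00"))
    return best

remote_region = pack_cvt([(10, b"v1"), (20, b"v2"), (35, b"v3")])  # stands in for memory-pool data
print(read_visible(remote_region, snapshot_ts=25))                 # (20, b'v2')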

https://www.usenix.org/conference/osdi24/presentation/zhang-ming
Friday July 12, 2024 11:50am - 12:10pm PDT
Grand Ballroom ABGH

12:05pm PDT

ATC Lunch (on your own)
Friday July 12, 2024 12:05pm - 1:40pm PDT
N/A
Friday July 12, 2024 12:05pm - 1:40pm PDT
N/A

12:10pm PDT

OSDI Lunch (on your own)
Friday July 12, 2024 12:10pm - 1:40pm PDT
N/A
Friday July 12, 2024 12:10pm - 1:40pm PDT
N/A

1:40pm PDT

Detecting Logic Bugs in Database Engines via Equivalent Expression Transformation
Friday July 12, 2024 1:40pm - 2:00pm PDT
Zu-Ming Jiang and Zhendong Su, ETH Zurich

Database management systems (DBMSs) are crucial for storing and fetching data. To improve the reliability of such systems, approaches have been proposed to detect logic bugs that cause DBMSs to process data incorrectly. These approaches manipulate queries and check whether the query results produced by DBMSs match expectations. However, such query-level manipulation cannot handle complex query semantics and thus must limit the patterns of generated queries, degrading testing effectiveness.

In this paper, we tackle the problem using a fine-grained methodology—expression-level manipulation—which empowers the proposed approach to be applicable to arbitrary queries. To find logic bugs in DBMSs, we design a novel and general approach, equivalent expression transformation (EET). Our core idea is that manipulating expressions of a query in a semantic-preserving manner also preserves the semantics of the entire query and is independent of query patterns. EET validates DBMSs by checking whether the transformed queries still produce the same results as the corresponding original queries. We realize our approach and evaluate it on 5 widely used and extensively tested DBMSs: MySQL, PostgreSQL, SQLite, ClickHouse, and TiDB. In total, EET found 66 unique bugs, 35 of which are logic bugs. We expect the generality and effectiveness of EET to inspire follow-up research and benefit the reliability of many DBMSs.
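
A toy, end-to-end illustration of the transformation-and-compare workflow on SQLite, using one hand-picked semantics-preserving rewrite (p becomes (p AND TRUE) OR (p AND FALSE)); EET generates far richer transformations, but the oracle is the same: the transformed query must return the same rows.

import sqlite3

def equivalent(pred):
    # p -> (p AND TRUE) OR (p AND FALSE): equivalent under SQL's three-valued logic.
    return f"(({pred}) AND 1) OR (({pred}) AND 0)"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(a INT, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, None), (3, "y")])

pred = "a > 1 AND b IS NOT NULL"
original = con.execute(f"SELECT * FROM t WHERE {pred}").fetchall()
transformed = con.execute(f"SELECT * FROM t WHERE {equivalent(pred)}").fetchall()
# If the DBMS handled the two equivalent predicates differently, this would flag a logic bug.
assert sorted(original) == sorted(transformed)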

https://www.usenix.org/conference/osdi24/presentation/jiang
Friday July 12, 2024 1:40pm - 2:00pm PDT
Grand Ballroom ABGH

1:40pm PDT

RL-Watchdog: A Fast and Predictable SSD Liveness Watchdog on Storage Systems
Friday July 12, 2024 1:40pm - 2:05pm PDT
Jin Yong Ha, Seoul National University; Sangjin Lee, Chung-Ang University; Heon Young Yeom, Seoul National University; Yongseok Son, Chung-Ang University

This paper proposes a reinforcement learning-based watchdog (RLW) that examines solid-state drive (SSD) liveness and detects failures caused by faults (e.g., controller/power faults and high temperature) quickly, precisely, and online, to minimize application data loss. To do this, we first provide a lightweight watchdog (LWW) that actively and lightly examines SSD liveness by issuing a liveness-dedicated command to the SSD. Second, we introduce a reinforcement learning-based timeout predictor (RLTP), which predicts the timeout of the dedicated command, enabling the detection of the failure point regardless of the SSD model. Finally, we propose fast failure notification (FFN) to immediately notify applications of the failure and minimize their potential data loss. We implement RLW with these three techniques in Linux kernel 6.0.0 and evaluate it on a single SSD and in RAID using realistic power-fault injection. The experimental results reveal that RLW reduces data loss by up to 96.7% compared with the existing scheme, and its accuracy in predicting failure points reaches up to 99.8%.

https://www.usenix.org/conference/atc24/presentation/ha
Friday July 12, 2024 1:40pm - 2:05pm PDT
Grand Ballroom CD

1:40pm PDT

gVulkan: Scalable GPU Pooling for Pixel-Grained Rendering in Ray Tracing
Friday July 12, 2024 1:40pm - 2:05pm PDT
Yicheng Gu, Yun Wang, Yunfan Sun, Yuxin Xiang, Yufan Jiang, Xuyan Hu, Zhengwei Qi, and Haibing Guan, Shanghai Jiao Tong University

Ray tracing rendering technology enhances scene realism and offers immersive experiences. However, it demands significant computational resources to trace and compute light-object interactions. As a result, traditional local GPU rendering might not meet the demands for high image quality and low latency. Moreover, many applications are tailored to utilize the resources of a single GPU, limiting their capacity to increase computational power through additional GPUs.

This paper presents gVulkan, the first transparent multi-GPU acceleration rendering solution for Vulkan-based ray tracing. To address the bottleneck caused by limited local GPU resources, gVulkan can offload ray tracing rendering to the cloud via API-forwarding. In the cloud, gVulkan employs Split Frame Rendering (SFR) to enable an arbitrary number of GPUs to accelerate rendering in parallel, while dynamically self-rebalancing the workload at a pixel-grained level across GPUs. Experiments demonstrate that gVulkan can accelerate Vulkan-based ray tracing programs in an application-unaware manner. By dynamically rebalancing each GPU's workload, gVulkan achieves good linearity with 3.81× speedup across 4 GPUs on average.
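
A toy sketch of split-frame load rebalancing: each GPU's share of the next frame is proportional to its observed speed on the previous frame. The row-band granularity and the feedback signal here are simplifications of the pixel-grained scheme described above.

def rebalance(total_rows, last_frame_ms):
    # Give each GPU a share of the next frame proportional to its observed speed.
    speeds = {gpu: 1.0 / ms for gpu, ms in last_frame_ms.items()}
    total_speed = sum(speeds.values())
    rows = {gpu: int(total_rows * s / total_speed) for gpu, s in speeds.items()}
    fastest = max(speeds, key=speeds.get)
    rows[fastest] += total_rows - sum(rows.values())    # hand rounding leftovers to the fastest GPU
    return rows

# One GPU is roughly half as fast as the others, so it receives roughly half as many rows.
print(rebalance(2160, {"gpu0": 9.1, "gpu1": 9.3, "gpu2": 18.0, "gpu3": 8.8}))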

https://www.usenix.org/conference/atc24/presentation/gu-yicheng
Friday July 12, 2024 1:40pm - 2:05pm PDT
Grand Ballroom EF

2:00pm PDT

Inductive Invariants That Spark Joy: Using Invariant Taxonomies to Streamline Distributed Protocol Proofs
Friday July 12, 2024 2:00pm - 2:20pm PDT
Tony Nuda Zhang, University of Michigan; Travis Hance, Carnegie Mellon University; Manos Kapritsos, University of Michigan; Tej Chajed, University of Wisconsin–Madison; Bryan Parno, Carnegie Mellon University

Proving the correctness of a distributed protocol is a challenging endeavor. Central to this task is finding an inductive invariant for the protocol. Currently, automated invariant inference algorithms require developers to describe protocols using a restricted logic. If the developer wants to prove a protocol expressed without these restrictions, they must devise an inductive invariant manually.

We propose an approach that simplifies and partially automates finding the inductive invariant of a distributed protocol, as well as proving that it really is an invariant. The key insight is to identify an invariant taxonomy that divides invariants into Regular Invariants, which have one of a few simple low-level structures, and Protocol Invariants, which capture the higher-level host relationships that make the protocol work.

Building on the insight of this taxonomy, we describe the Kondo methodology for proving the correctness of a distributed protocol modeled as a state machine. The developer first manually devises the Protocol Invariants by proving a synchronous version of the protocol correct. In this simpler version, sends and receives are replaced with atomic variable assignments. The Kondo tool then automatically generates the asynchronous protocol description, Regular Invariants, and proofs that the Regular Invariants are inductive on their own. Finally, Kondo combines these with the synchronous proof into a draft proof of the asynchronous protocol, which may then require a small amount of user effort to complete. Our evaluation shows that Kondo reduces developer effort for a wide variety of distributed protocols.

https://www.usenix.org/conference/osdi24/presentation/zhang-nuda
Friday July 12, 2024 2:00pm - 2:20pm PDT
Grand Ballroom ABGH

2:05pm PDT

Exploit both SMART Attributes and NAND Flash Wear Characteristics to Effectively Forecast SSD-based Storage Failures in Clusters
Friday July 12, 2024 2:05pm - 2:30pm PDT
Yunfei Gu and Chentao Wu, Shanghai Jiao Tong University; Xubin He, Temple University

Solid State Drives (SSDs) based on flash technology are extensively employed as high-performance storage solutions in supercomputing data centers. However, SSD failures are frequent in these environments, resulting in significant performance issues. To ensure the reliability and accessibility of HPC storage systems, it is crucial to predict failures in advance, enabling timely preventive measures. Although many failure prediction methods build on SMART attributes and system telemetry logs, their predictive efficacy is constrained by the limited capacity of these logs to directly elucidate the root causes of SSD failures at the device level. In this paper, we revisit the underlying causes of SSD failures and first utilize the device-level flash wear characteristics of SSDs as a critical indicator, instead of relying solely on SMART data. We propose a novel Aging-Aware Pseudo Twin Network (APTN) based SSD failure prediction approach that exploits both SMART attributes and device-level NAND flash wear characteristics to effectively forecast SSD failures. In practice, we also adapt APTN to an online learning framework. Our evaluation results demonstrate that APTN improves the F1-score by 51.2% and the TPR by 40.1% on average compared to existing schemes. This highlights the potential of leveraging device-level wear characteristics in conjunction with SMART attributes for more accurate and reliable SSD failure prediction.

https://www.usenix.org/conference/atc24/presentation/gu-yunfei
Friday July 12, 2024 2:05pm - 2:30pm PDT
Grand Ballroom CD

2:05pm PDT

vFPIO: A Virtual I/O Abstraction for FPGA-accelerated I/O Devices
Friday July 12, 2024 2:05pm - 2:30pm PDT
Jiyang Chen, Harshavardhan Unnibhavi, Atsushi Koshiba, and Pramod Bhatotia, Technical University of Munich

Modern cloud systems have adopted a variety of FPGA-accelerated I/O devices, such as SmartNICs and computational storage, while they face programmability and portability challenges. Existing FPGA frameworks either directly expose device-specific I/O interfaces to user logic or offer virtualized I/Os limited to a single device type. The lack of I/O abstraction imposes high engineering costs, less design portability, and even unexpected throughput degradation.

We introduce vFPIO, an FPGA-based I/O acceleration framework that brings better programmability and design portability. vFPIO extends modern FPGA OSes to expose virtual I/O ports to user logic, which abstracts device-dependent I/O specifications and makes the user logic design platform-agnostic. The connectivity between virtual and physical I/O ports can be easily configured by host applications using POSIX-like file APIs. vFPIO also offers a preemptive I/O transaction scheduler that alleviates the I/O throughput degradation caused by concurrent I/O requests from multiple accelerators in a multi-tenant environment.

We implement a prototype of the vFPIO framework on x86 servers equipped with AMD Xilinx Alveo U280 cards. Our prototype supports four different I/O interfaces: PCIe, DRAM, HBM, and network. Our evaluation highlights that vFPIO incurs negligible performance overheads compared to Coyote, one of the latest FPGA OSes, while preserving the maximum I/O throughput for high-priority tasks even under resource contention.

https://www.usenix.org/conference/atc24/presentation/chen-jiyang
Friday July 12, 2024 2:05pm - 2:30pm PDT
Grand Ballroom EF

2:20pm PDT

Performance Interfaces for Hardware Accelerators
Friday July 12, 2024 2:20pm - 2:40pm PDT
Jiacheng Ma, Rishabh Iyer, Sahand Kashani, Mahyar Emami, Thomas Bourgeat, and George Candea, EPFL

Designing and building a system that reaps the performance benefits of hardware accelerators is challenging, because they provide little concrete visibility into their expected performance. Developers must invest many person-months into benchmarking, to determine if their system would indeed benefit from using a particular accelerator. This must be done carefully, because accelerators can actually hurt performance for some classes of inputs, even if they help for others.

We demonstrate that it is possible for hardware accelerators to ship with performance interfaces that provide actionable visibility into their performance, just like semantic interfaces do for functionality. We propose an intermediate representation (IR) for accelerator performance that precisely captures all performance-relevant details of the accelerator while abstracting away all other information, including functionality. We develop a toolchain (ltc) that, based on the proposed IR, automatically produces human-readable performance interfaces that help developers make informed design decisions. ltc can also automatically produce formal proofs of performance properties of the accelerator, and can act as a fast performance simulator for concrete workloads.

We evaluate our approach on accelerators used for deep learning, serialization of RPC messages, JPEG image decoding, genome sequence alignment, and on an RMT pipeline used in programmable network switches. We demonstrate that the performance IR provides an accurate and complete representation of performance behavior, and we describe a variety of use cases for ltc and the resulting performance interfaces. ltc is open-source and freely available at https://dslab.epfl.ch/research/perf.

https://www.usenix.org/conference/osdi24/presentation/ma-jiacheng
Friday July 12, 2024 2:20pm - 2:40pm PDT
Grand Ballroom ABGH

2:30pm PDT

StreamCache: Revisiting Page Cache for File Scanning on Fast Storage Devices
Friday July 12, 2024 2:30pm - 2:55pm PDT
Zhiyue Li and Guangyan Zhang, Tsinghua University

Buffered I/O via page cache is used for file scanning in many cases as page cache can provide buffering, data aggregation, I/O alignment and prefetching transparently. However, our study indicates that employing page cache for file scanning on fast storage devices presents two performance issues: it offers limited I/O bandwidth that does not align with the performance of fast storage devices, and the intensive background writeback onto fast storage devices can significantly interfere with foreground I/O requests.

In this paper, we propose StreamCache, a new page cache management system for file scanning on fast storage devices. StreamCache exploits three techniques to achieve high I/O performance. First, it uses a lightweight stream tracking method to record the states of cached pages at the granularity of sequential streams. Second, it uses a stream-based page reclaiming method to lower the interference to foreground I/O requests. Third, it uses a two-layer memory management method to accelerate page allocation by leveraging CPU cache locality.
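
A minimal sketch of stream-granularity tracking, with an invented gap-tolerance knob: runs of consecutively cached pages are folded into one stream record, which is what makes reclaiming whole streams cheap compared with per-page bookkeeping.

class StreamTracker:
    # Fold runs of consecutively cached pages into per-file stream records.
    def __init__(self, max_gap=8):
        self.max_gap = max_gap
        self.streams = {}            # file_id -> list of [first_page, last_page]

    def on_page_cached(self, file_id, page_idx):
        runs = self.streams.setdefault(file_id, [])
        if runs and 0 <= page_idx - runs[-1][1] <= self.max_gap:
            runs[-1][1] = page_idx               # extend the current sequential stream
        else:
            runs.append([page_idx, page_idx])    # a new stream begins

    def reclaim_candidates(self, file_id):
        # Whole streams are reclaimed together instead of hunting for single cold pages.
        return list(self.streams.get(file_id, []))

tracker = StreamTracker()
for page in range(1000):
    tracker.on_page_cached("ckpt-0", page)
tracker.on_page_cached("ckpt-0", 5000)
print(tracker.reclaim_candidates("ckpt-0"))      # [[0, 999], [5000, 5000]]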

We implement StreamCache in XFS. Experimental results show that compared with existing methods, StreamCache can increase the I/O bandwidth of scientific applications by 44%, and reduce the checkpoint/restart time of large language models by 15.7% on average.

https://www.usenix.org/conference/atc24/presentation/li-zhiyue
Friday July 12, 2024 2:30pm - 2:55pm PDT
Grand Ballroom CD

2:30pm PDT

ScalaCache: Scalable User-Space Page Cache Management with Software-Hardware Coordination
Friday July 12, 2024 2:30pm - 2:55pm PDT
Li Peng and Yuda An, Peking University; You Zhou, Huazhong University of Science and Technology; Chenxi Wang, University of Chinese Academy of Sciences; Qiao Li, Xiamen University; Chuanning Cheng, Huawei; Jie Zhang, Peking University and Zhongguancun Laboratory

Due to its host-centric design, existing page cache management suffers from high CPU consumption, communication costs, and garbage collection (GC) interference. To address these challenges, we propose ScalaCache, a scalable user-space page cache with software-hardware coordination. Specifically, to reduce host CPU overhead, we offload cache management into computational storage drives (CSDs) and further merge the indirection layers in both the cache and the flash firmware, which facilitates lightweight cache management. To further boost scalability, we build a lockless resource management framework that allows multiple CSD-internal cores to manage the cache space concurrently. ScalaCache also aggregates the computing power of multiple CSDs to deliver scalable I/O performance. Moreover, ScalaCache reduces communication costs by trimming the I/O control path while mitigating GC interference via a GC-aware replacement policy, thereby enhancing its efficiency and performance stability. Our evaluation results reveal that ScalaCache delivers 5.12× and 1.70× bandwidth improvements compared to the kernel page cache and the state-of-the-art user-space page cache, respectively. ScalaCache is open source and available at https://github.com/ChaseLab-PKU/ScalaCache.

https://www.usenix.org/conference/atc24/presentation/peng
Friday July 12, 2024 2:30pm - 2:55pm PDT
Grand Ballroom EF

2:40pm PDT

IronSpec: Increasing the Reliability of Formal Specifications
Friday July 12, 2024 2:40pm - 3:00pm PDT
Eli Goldweber, Weixin Yu, Seyed Armin Vakil Ghahani, and Manos Kapritsos, University of Michigan

The guarantees of formally verified systems are only as strong as their trusted specifications (specs). As observed by previous studies, bugs in formal specs invalidate the assurances that proofs provide. Unfortunately, specs—by their very nature—cannot be proven correct. Currently, the only way to identify spec bugs is by careful, manual inspection.

In this paper we introduce IronSpec, a framework of automatic and manual techniques to increase the reliability of formal specifications. IronSpec draws inspiration from classical software testing practices, which we adapt to the realm of formal specs. IronSpec facilitates spec testing with automated sanity checking, a methodology for writing SpecTesting Proofs (STPs), and automated spec mutation testing.

We evaluate IronSpec on 14 specs, including six specs of real-world verified codebases. Our results show that IronSpec is effective at flagging discrepancies between the spec and the developer's intent, and has led to the discovery of ten specification bugs across all six real-world verified systems.

https://www.usenix.org/conference/osdi24/presentation/goldweber
Friday July 12, 2024 2:40pm - 3:00pm PDT
Grand Ballroom ABGH

2:55pm PDT

Scalable Billion-point Approximate Nearest Neighbor Search Using SmartSSDs
Friday July 12, 2024 2:55pm - 3:20pm PDT
Bing Tian, Haikun Liu, Zhuohui Duan, Xiaofei Liao, Hai Jin, and Yu Zhang, Huazhong University of Science and Technology

Approximate nearest neighbor search (ANNS) in high-dimensional vector spaces has become increasingly crucial in database and machine learning applications. Most previous ANNS algorithms require TB-scale memory to store indices of billion-scale datasets, making their deployment extremely expensive for high-performance search. The emerging SmartSSD technology offers an opportunity to achieve scalable ANNS via near-data processing (NDP). However, directly adopting existing ANNS algorithms on multiple SmartSSDs remains challenging.

In this paper, we present SmartANNS, a SmartSSD-empowered billion-scale ANNS solution based on a hierarchical indexing methodology. We propose several novel designs to achieve near-linear scaling with multiple SmartSSDs. First, we propose a "host CPUs + SmartSSDs" cooperative architecture incorporated with hierarchical indices to significantly reduce data accesses and computations on SmartSSDs. Second, we propose dynamic task scheduling based on optimized data layout to achieve both load balancing and data reuse across multiple SmartSSDs. Third, we propose a learning-based shard pruning algorithm to eliminate unnecessary computations on SmartSSDs. We implement SmartANNS using Samsung's commercial SmartSSDs. Experimental results show that SmartANNS can improve queries per second (QPS) by up to 10.7× compared with the state-of-the-art SmartSSD-based ANNS solution, CSDANNS. Moreover, SmartANNS can achieve near-linear performance scalability for large-scale datasets using multiple SmartSSDs.
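
A rough sketch of the hierarchical pruning idea (purely illustrative; the real system runs the per-shard searches near the data on SmartSSDs and adds learned pruning and dynamic scheduling): the host ranks shard centroids to discard most shards, then only the surviving shards are searched.

# Hypothetical sketch: host-side centroid ranking prunes shards before search.
import numpy as np

def search(query, centroids, shards, nprobe=2, k=5):
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = []
    for sid in order:            # in the real system, this runs on the SmartSSD holding shard sid
        vecs, ids = shards[sid]
        dists = np.linalg.norm(vecs - query, axis=1)
        best = np.argsort(dists)[:k]
        candidates += list(zip(dists[best], ids[best]))
    return [int(i) for _, i in sorted(candidates)[:k]]

rng = np.random.default_rng(0)
shards = [(rng.normal(size=(100, 8)), np.arange(s * 100, (s + 1) * 100)) for s in range(4)]
centroids = np.stack([v.mean(axis=0) for v, _ in shards])
print(search(rng.normal(size=8), centroids, shards))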

https://www.usenix.org/conference/atc24/presentation/tian
Friday July 12, 2024 2:55pm - 3:20pm PDT
Grand Ballroom CD

2:55pm PDT

Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor
Friday July 12, 2024 2:55pm - 3:20pm PDT
Zhen Xie, Binghamton University; Murali Emani, Argonne National Laboratory; Xiaodong Yu, Stevens Institute of Technology; Dingwen Tao, Indiana University; Xin He, Xidian University; Pengfei Su, University of California, Merced; Keren Zhou, George Mason University; Venkatram Vishwanath, Argonne National Laboratory

For an extended period, graphics processing units (GPUs) have stood as the exclusive choice for training deep neural network (DNN) models. Over time, to serve the growing demands in a more targeted manner, various artificial intelligence-specific hardware, referred to as AI accelerators, have been vigorously developed, aiming to provide more efficient DNN acceleration solutions. However, these solutions are heterogeneous, which complicates accelerator selection. Given a DNN model and a training objective, such as throughput or price-performance ratio, it remains challenging to arrive at the optimal decision among many options due to high reimplementation costs and hard-to-predict performance.

To tackle this challenge, we propose Centimani, a performance predictor that accurately and rapidly predicts DNN training throughput on various AI accelerators, thereby facilitating the accelerator selection process. To achieve this goal, we first analyze typical AI accelerators and draw observations that abstract AI accelerator designs and guide our performance modeling approach. In particular, we construct a memory estimation model and decoupled performance models to select the most appropriate batch size and predict the execution time of DNN training. We validate our approach by applying Centimani to six common DNN models on four typical AI accelerators. Results show that Centimani predicts the throughput with an average accuracy of 93.1% on single-device training and 90.4% on multiple-device training, thus the optimal accelerator corresponding to the user's training objective can be obtained.
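
The selection flow described above might look roughly like the following sketch (all names and coefficients are invented; Centimani's actual models are fitted from accelerator measurements): a memory model picks the largest batch size that fits, then decoupled compute and communication models estimate per-iteration time and hence throughput.

# Hypothetical sketch of memory-based batch selection plus decoupled time models.
def pick_batch_size(mem_per_sample, fixed_mem, device_mem, candidates=(8, 16, 32, 64, 128)):
    fitting = [b for b in candidates if fixed_mem + b * mem_per_sample <= device_mem]
    return max(fitting) if fitting else None

def predict_throughput(batch, flops_per_sample, device_flops, comm_bytes, link_bw, overlap=0.5):
    compute_t = batch * flops_per_sample / device_flops
    comm_t = comm_bytes / link_bw
    step_t = compute_t + (1 - overlap) * comm_t        # decoupled, partially overlapped
    return batch / step_t                              # samples per second

batch = pick_batch_size(mem_per_sample=0.05e9, fixed_mem=6e9, device_mem=16e9)
print(batch, predict_throughput(batch, 2e12, 300e12, 1e9, 50e9))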

https://www.usenix.org/conference/atc24/presentation/xie
Friday July 12, 2024 2:55pm - 3:20pm PDT
Grand Ballroom EF

3:00pm PDT

Identifying On-/Off-CPU Bottlenecks Together with Blocked Samples
Friday July 12, 2024 3:00pm - 3:20pm PDT
Minwoo Ahn and Jeongmin Han, Sungkyunkwan University; Youngjin Kwon, Korea Advanced Institute of Science and Technology (KAIST); Jinkyu Jeong, Yonsei University

The rapid advancement of computer system components has necessitated a comprehensive profiling approach that covers both on-CPU and off-CPU events simultaneously. However, conventional approaches do not profile on- and off-CPU events together, so they fall short of accurately assessing the overhead of each bottleneck in modern applications.

In this paper, we propose a sampling-based profiling technique called blocked samples that is designed to capture all types of off-CPU events, such as I/O waiting, blocking synchronization, and waiting in the CPU runqueue. Using the blocked samples technique, this paper proposes two profilers, bperf and BCOZ. Leveraging blocked samples, bperf profiles applications by providing symbol-level profile information whether a thread is on the CPU or off the CPU awaiting scheduling or I/O completion. Using this information, BCOZ performs causality analysis of the collected on- and off-CPU events to precisely identify performance bottlenecks and the potential impact of optimizations. We verify the profiling capability of BCOZ using real applications. In our profiling case studies followed by actual optimizations, BCOZ precisely identified bottlenecks involving off-CPU events, and the achieved speedups aligned with the performance improvements predicted by its causality analysis.
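
The core idea behind blocked samples can be illustrated with a tiny sketch (hypothetical data and field names, not bperf's output format): every sampling tick attributes a thread either to an on-CPU symbol or to an off-CPU wait reason, so both kinds of bottleneck show up in one profile.

# Hypothetical sketch: aggregate samples into on-CPU and off-CPU profiles.
from collections import Counter

def profile(samples):
    on_cpu, off_cpu = Counter(), Counter()
    for s in samples:                       # each sample records state and symbol/reason
        if s["state"] == "running":
            on_cpu[s["symbol"]] += 1
        else:
            off_cpu[(s["reason"], s["symbol"])] += 1
    return on_cpu, off_cpu

samples = [
    {"state": "running", "symbol": "compress_block"},
    {"state": "blocked", "reason": "io_wait", "symbol": "read_chunk"},
    {"state": "blocked", "reason": "runqueue", "symbol": "worker_loop"},
]
print(profile(samples))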

https://www.usenix.org/conference/osdi24/presentation/ahn
Friday July 12, 2024 3:00pm - 3:20pm PDT
Grand Ballroom ABGH

3:20pm PDT

Break with Refreshments
Friday July 12, 2024 3:20pm - 3:40pm PDT
Friday July 12, 2024 3:20pm - 3:40pm PDT
Grand Ballroom Foyer

3:40pm PDT

dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
Friday July 12, 2024 3:40pm - 4:00pm PDT
Bingyang Wu, Ruidong Zhu, and Zili Zhang, School of Computer Science, Peking University; Peng Sun, Shanghai AI Lab; Xuanzhe Liu and Xin Jin, School of Computer Science, Peking University

Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving system for LoRA models. dLoRA achieves high serving efficiency by dynamically orchestrating requests and LoRA adapters in two respects: (i) it dynamically merges and unmerges adapters with the base model; and (ii) it dynamically migrates requests and adapters between different worker replicas. These capabilities are designed based on two insights. First, despite the allure of batching without merging a LoRA adapter into the base model, it is not always beneficial to unmerge, especially when the types of requests are skewed. Second, the autoregressive nature of LLM requests introduces load imbalance between worker replicas due to varying input and output lengths, even if the input requests are distributed uniformly to the replicas. We design a credit-based batching algorithm to decide when to merge and unmerge, and a request-adapter co-migration algorithm to decide when to migrate. The experimental results show that dLoRA improves the throughput by up to 57.9× and 26.0× compared to vLLM and Hugging Face PEFT, respectively. Compared to the concurrent work S-LoRA, dLoRA achieves up to 1.8× lower average latency.
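
One way to picture a credit-based merge/unmerge decision (a hypothetical sketch, not dLoRA's actual algorithm or thresholds): sustained skew toward one adapter builds up credit for running merged with that adapter, while diverse batches drain the credit and fall back to unmerged batching.

# Hypothetical sketch: skewed batches earn credit for merged execution.
def decide(batch_adapters, credits, threshold=4):
    # batch_adapters: the adapter id requested by each queued request in the batch
    dominant = max(set(batch_adapters), key=batch_adapters.count)
    skew = batch_adapters.count(dominant) / len(batch_adapters)
    credits = credits + 1 if skew > 0.8 else max(credits - 1, 0)
    mode = "merged" if credits >= threshold else "unmerged"
    return mode, (dominant if mode == "merged" else None), credits

mode, adapter, credits = decide(["a"] * 9 + ["b"], credits=4)
print(mode, adapter, credits)   # merged a 5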

https://www.usenix.org/conference/osdi24/presentation/wu-bingyang
Friday July 12, 2024 3:40pm - 4:00pm PDT
Grand Ballroom ABGH

3:40pm PDT

A Difference World: High-performance, NVM-invariant, Software-only Intermittent Computation
Friday July 12, 2024 3:40pm - 4:05pm PDT
Harrison Williams, Virginia Tech; Saim Ahmad, Amazon; Matthew Hicks, Virginia Tech

Supporting long-life, high-performance intermittent computation is an essential challenge in allowing energy harvesting devices to fulfill the vision of smart dust. Intermittent computation is the extension of long-running computation across the frequent, unexpected power cycles that result from replacing batteries with harvested energy. The most promising intermittent computation support strategies combine programmer direction and compiler analysis to minimize run-time overhead and provide programmer control, without specialized hardware support. While such strategies succeed in reducing the size of non-volatile memory writes due to checkpointing, they must checkpoint continuously. Unfortunately, for Flash-based devices (by far the most ubiquitous), writing checkpoints is slow and gradually kills the device. Without intervention, Flash devices and software-only intermittent computation are fundamentally incompatible.

To enable ubiquitous programmer-guided intermittent computation we design and implement Camel. The key idea behind Camel is the systematic bifurcation of program state into two "worlds" of differing volatility. Programmers compose intermittent programs by stitching together atomic units of computation called tasks. The Camel compiler ensures that all within-task data is placed in the volatile world and all between-task data is placed in the non-volatile world. Between tasks, Camel swaps the worlds, atomically locking in the forward progress of the preceding task. In preparation for the next task, Camel resolves differences in world view by copying only the differences due to the preceding task's updates. This systematic decomposition into a mixed-volatility memory allows, for the first time, long-life and high-performance programmer-guided intermittent computation on Flash devices: Camel outperforms the state-of-the-art checkpointing system for Flash-based devices by up to 5× while eliminating the need for hardware support. Beyond Flash, Camel's differential buffer system improves performance by a factor of 2× compared to existing task-based approaches on FRAM platforms.
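
The two-world commit can be pictured with a small sketch (illustrative only; Camel operates on real volatile and non-volatile memory via its compiler, and the names here are made up): each task runs against a volatile working copy, and committing copies only the entries the task changed into the non-volatile world.

# Hypothetical sketch: run tasks in a volatile world, commit only differences.
class TwoWorlds:
    def __init__(self, initial):
        self.nv = dict(initial)            # stands in for non-volatile memory
        self.work = dict(initial)          # volatile working world

    def run_task(self, task):
        task(self.work)                    # a power loss here discards only self.work

    def commit(self):
        diff = {k: v for k, v in self.work.items() if self.nv.get(k) != v}
        self.nv.update(diff)               # copy only the differences, locking in progress
        self.work = dict(self.nv)          # the next task starts from the committed state

tw = TwoWorlds({"count": 0})
tw.run_task(lambda w: w.update(count=w["count"] + 1))
tw.commit()
print(tw.nv)                               # {'count': 1}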

https://www.usenix.org/conference/atc24/presentation/williams
Friday July 12, 2024 3:40pm - 4:05pm PDT
Grand Ballroom CD

4:00pm PDT

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Friday July 12, 2024 4:00pm - 4:20pm PDT
Chaofan Lin, Shanghai Jiao Tong University; Zhenhua Han, Chengruidong Zhang, Yuqing Yang, and Fan Yang, Microsoft Research; Chen Chen, Shanghai Jiao Tong University; Lili Qiu, Microsoft Research

The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strengths of LLMs and conventional software. Diverse LLM applications from different tenants design complex workflows that use multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today's public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications.

This paper introduces Parrot, an LLM service system that focuses on the end-to-end experience of LLM-based applications. Parrot proposes Semantic Variable, a unified abstraction to expose application-level knowledge to public LLM services. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests, providing a natural way to program LLM applications. Exposing Semantic Variables to the public LLM service allows it to perform conventional data flow analysis to uncover the correlation across multiple LLM requests. This correlation opens a brand-new optimization space for the end-to-end performance of LLM-based applications. Extensive evaluations demonstrate that Parrot can achieve up to an order-of-magnitude improvement for popular and practical use cases of LLM applications.
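
A toy illustration of exposing inter-request dataflow through placeholder variables (hypothetical names throughout; llm() is a stub and this is not Parrot's API): because the service can see which request produces a variable and which consumes it, it can treat the two calls as one pipeline rather than unrelated requests.

# Hypothetical sketch: requests connected through shared placeholder variables.
def llm(prompt):
    return f"<answer to: {prompt}>"

class SemVar:
    def __init__(self, value=None):
        self.value = value

doc, summary, verdict = SemVar("quarterly sales grew 12%"), SemVar(), SemVar()
requests = [
    ("Summarize the report: {0}", [doc], summary),
    ("Is this summary accurate? {0}", [summary], verdict),   # consumes `summary`
]

# The producer/consumer edge between the two requests is visible to the
# serving side, which is what enables end-to-end scheduling optimizations.
for template, inputs, output in requests:
    output.value = llm(template.format(*[v.value for v in inputs]))
print(verdict.value)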

https://www.usenix.org/conference/osdi24/presentation/lin-chaofan
Friday July 12, 2024 4:00pm - 4:20pm PDT
Grand Ballroom ABGH

4:05pm PDT

Efficient Large Graph Processing with Chunk-Based Graph Representation Model
Friday July 12, 2024 4:05pm - 4:30pm PDT
Rui Wang, Zhejiang University and Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; Weixu Zong, Shuibing He, Xinyu Chen, Zhenxin Li, and Zheng Dang, Zhejiang University

Existing external graph processing systems suffer from low I/O efficiency, high computation overhead, and high graph-algorithm development costs when running on emerging NVMe SSDs, because they rely on complex loading and computing models that aim to convert numerous random I/Os into a few sequential I/Os. In-memory graph systems backed by memory-storage cache systems such as the OS page cache or TriCache offer a promising alternative for large graph processing with fine-grained I/Os and easy algorithm programming, but they often overlook the specific characteristics of graph applications, resulting in inefficient graph processing. To address these challenges, we introduce ChunkGraph, an I/O-efficient graph system designed for processing large-scale graphs on NVMe SSDs. ChunkGraph introduces a novel chunk-based graph representation model, featuring classified and hierarchical vertex storage and efficient chunk layout optimization. Evaluations show that ChunkGraph outperforms existing external graph systems, as well as in-memory graph systems relying on general cache systems, running several times faster.
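
A simplified sketch of a chunk-based layout (illustrative only; ChunkGraph's actual model classifies vertices and handles high-degree spill-over, which is omitted here): neighbor lists are packed into fixed-size chunks so one fine-grained read brings in many co-accessed vertices.

# Hypothetical sketch: pack adjacency lists into fixed-size chunks.
CHUNK_INTS = 1024                              # pretend each chunk holds 1024 integers

def pack_chunks(adj):                          # adj: vertex -> list of neighbor ids
    chunks, index, cur = [], {}, []
    for v, nbrs in adj.items():
        if len(cur) + len(nbrs) + 2 > CHUNK_INTS:
            chunks.append(cur)
            cur = []
        index[v] = (len(chunks), len(cur))     # (chunk id, offset inside chunk)
        cur += [v, len(nbrs)] + list(nbrs)
    chunks.append(cur)
    return chunks, index

def neighbors(v, chunks, index):
    cid, off = index[v]
    chunk = chunks[cid]
    return chunk[off + 2: off + 2 + chunk[off + 1]]

chunks, index = pack_chunks({0: [1, 2], 1: [0], 2: [0, 3], 3: [2]})
print(neighbors(2, chunks, index))             # [0, 3]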

https://www.usenix.org/conference/atc24/presentation/wang-rui
Friday July 12, 2024 4:05pm - 4:30pm PDT
Grand Ballroom CD

4:20pm PDT

USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
Friday July 12, 2024 4:20pm - 4:40pm PDT
Sudipta Saha Shubha and Haiying Shen, University of Virginia; Anand Iyer, Georgia Institute of Technology

Minimizing monetary cost and maximizing the goodput of inference serving systems are increasingly important with the ever-increasing popularity of deep learning models. While it is desirable to spatially multiplex GPU resources to improve utilization, existing techniques suffer from inter-model interference, which prevents them from achieving both high computation and memory utilization. We present USHER, a system that maximizes resource utilization in a holistic fashion while being interference-aware. USHER consists of three key components: 1) a cost-efficient and fast GPU kernel-based estimator of model resource requirements, 2) a lightweight, heuristic-based, interference-aware scheduler that maximizes resource utilization by deciding the batch size, model replication degree, and model placement so as to minimize monetary cost while satisfying latency SLOs, or to maximize goodput, and 3) a novel operator graph merger that merges multiple models to minimize interference in the GPU cache. Large-scale experiments using production workloads show that USHER achieves up to 2.6× higher goodput and 3.5× better cost-efficiency compared to existing methods, while scaling to thousands of GPUs.
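
A hypothetical packing heuristic in the spirit of the description above (not USHER's actual scheduler; all fields and weights are invented): place each replica on a GPU that can fit it and whose co-resident models cause the least estimated cache interference.

# Hypothetical sketch: interference-aware greedy placement of model replicas.
def place(replicas, gpus):
    placement = {}
    for r in sorted(replicas, key=lambda r: -r["compute"]):
        feasible = [g for g in gpus
                    if g["free_compute"] >= r["compute"] and g["free_mem"] >= r["mem"]]
        if not feasible:
            raise RuntimeError("no GPU can host " + r["name"])
        best = min(feasible,
                   key=lambda g: sum(m["cache"] for m in g["resident"]) * r["cache"])
        best["resident"].append(r)
        best["free_compute"] -= r["compute"]
        best["free_mem"] -= r["mem"]
        placement[r["name"]] = best["id"]
    return placement

gpus = [{"id": 0, "free_compute": 1.0, "free_mem": 16, "resident": []},
        {"id": 1, "free_compute": 1.0, "free_mem": 16, "resident": []}]
replicas = [{"name": "bert", "compute": 0.4, "mem": 4, "cache": 0.6},
            {"name": "resnet", "compute": 0.3, "mem": 2, "cache": 0.5},
            {"name": "gpt2", "compute": 0.5, "mem": 8, "cache": 0.7}]
print(place(replicas, gpus))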

https://www.usenix.org/conference/osdi24/presentation/shubha
Friday July 12, 2024 4:20pm - 4:40pm PDT
Grand Ballroom ABGH

4:30pm PDT

SlimArchive: A Lightweight Architecture for Ethereum Archive Nodes
Friday July 12, 2024 4:30pm - 4:55pm PDT
Hang Feng, Yufeng Hu, and Yinghan Kou, Zhejiang University; Runhuai Li and Jianfeng Zhu, BlockSec; Lei Wu and Yajin Zhou, Zhejiang University

With the rapid development of Ethereum, archive nodes that record all historical states have become a critical component of the infrastructure. However, current archive nodes suffer from enormous storage requirements and poor performance due to the inefficient authenticated Merkle Patricia Trie and coarse-grained state granularity.

This paper presents a lightweight and high-performance architecture for Ethereum archive nodes that addresses the two limitations mentioned above. The core idea of our approach is to maintain compact, flattened, and fine-grained (i.e., transaction-level) historical states by flattening the minimum state changes each transaction makes to the world state. Our method keeps an archive node's storage requirements at a minimum while providing high-performance state access. We have implemented a prototype system named SlimArchive for Ethereum. The evaluation results demonstrate that our approach reduces storage requirements by 98.1%, improves state access throughput by 19.0×, and speeds up transaction execution by an average of 1112.5×, compared to vanilla Geth.
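
The flattened, transaction-level history can be pictured with a small sketch (illustrative only; SlimArchive's on-disk format and key encoding are not shown here): store only the values each transaction writes, keyed so that "state as of transaction N" is a point lookup over a sorted version list.

# Hypothetical sketch: per-key version lists indexed by transaction number.
import bisect
from collections import defaultdict

class FlatHistory:
    def __init__(self):
        self.tx_nos = defaultdict(list)        # key -> sorted tx numbers that wrote it
        self.values = defaultdict(list)        # key -> corresponding written values

    def record(self, tx_no, changes):          # changes: {(account, slot): value} for one tx
        for k, v in changes.items():
            self.tx_nos[k].append(tx_no)
            self.values[k].append(v)

    def get(self, key, tx_no):                 # value of `key` as of transaction tx_no
        i = bisect.bisect_right(self.tx_nos[key], tx_no)
        return self.values[key][i - 1] if i else None

h = FlatHistory()
h.record(1, {("0xabc", "balance"): 10})
h.record(7, {("0xabc", "balance"): 25})
print(h.get(("0xabc", "balance"), 5))          # 10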

https://www.usenix.org/conference/atc24/presentation/feng-hang
Friday July 12, 2024 4:30pm - 4:55pm PDT
Grand Ballroom CD

4:40pm PDT

Fairness in Serving Large Language Models
Friday July 12, 2024 4:40pm - 5:00pm PDT
Ying Sheng, UC Berkeley and Stanford University; Shiyi Cao, Dacheng Li, Banghua Zhu, and Zhuohan Li, UC Berkeley; Danyang Zhuo, Duke University; Joseph E. Gonzalez and Ion Stoica, UC Berkeley

High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests, from short chat conversations to long document reading. To process client requests fairly, most major LLM inference services impose request rate limits so that no client can dominate the request queue. However, this rudimentary notion of fairness also results in under-utilization of resources and a poor client experience when there is spare capacity. While there is a rich literature on fair scheduling, serving LLMs presents new challenges due to their unpredictable request lengths and their unique batching characteristics on parallel accelerators. This paper introduces a definition of LLM serving fairness based on a cost function that accounts for the number of input and output tokens processed. To achieve fairness in serving, we propose a novel scheduling algorithm, the Virtual Token Counter (VTC), a fair scheduler based on the continuous batching mechanism. We prove a 2× tight upper bound on the service difference between two backlogged clients while remaining work-conserving. Through extensive experiments, we demonstrate the superior performance of VTC in ensuring fairness, especially in contrast to other baseline methods, which exhibit shortcomings under various conditions. The reproducible code is available at https://github.com/Ying1123/VTC-artifact.
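
A minimal sketch of a virtual-token-counter-style scheduler (assumed weights and names; see the paper and repository for the actual VTC algorithm and its guarantees): each client accrues a counter of weighted tokens it has been served, and the backlogged client with the smallest counter is admitted next.

# Hypothetical sketch: serve the backlogged client with the least weighted service.
class VTCScheduler:
    def __init__(self, w_in=1.0, w_out=2.0):
        self.counters = {}
        self.w_in, self.w_out = w_in, w_out

    def pick(self, backlogged):                     # backlogged: iterable of client ids
        return min(backlogged, key=lambda c: self.counters.get(c, 0.0))

    def account(self, client, in_tokens, out_tokens):
        self.counters[client] = (self.counters.get(client, 0.0)
                                 + self.w_in * in_tokens + self.w_out * out_tokens)

s = VTCScheduler()
s.account("alice", in_tokens=500, out_tokens=100)
print(s.pick(["alice", "bob"]))                     # bob, who has received less service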

https://www.usenix.org/conference/osdi24/presentation/sheng
Friday July 12, 2024 4:40pm - 5:00pm PDT
Grand Ballroom ABGH

4:55pm PDT

Every Mapping Counts in Large Amounts: Folio Accounting
Friday July 12, 2024 4:55pm - 5:10pm PDT
David Hildenbrand, Technical University of Munich and Red Hat GmbH; Martin Schulz, Technical University of Munich; Nadav Amit, Technion, Israel Institute of Technology

Operating systems can significantly enhance performance by using large contiguous memory regions to streamline memory management, even when the memory is not mapped using huge pages. To harness these advantages, Linux has introduced "folios," representing multiple contiguous pages. Unlike traditional huge pages, folios can be partially mapped, which complicates folio accounting and hinders both performance and memory savings.

Accurate and efficient folio accounting is crucial for optimizing memory management operations, enforcing various memory management policies, and performing Unique Set Size accounting in the operating system. In particular, determining whether a folio is exclusively mapped in a single address space is essential for avoiding unnecessary Copy-On-Write operations when memory is no longer shared.

We introduce a novel tracking scheme to determine, with negligible overhead, whether a folio is exclusively mapped in a single address space. Our solution achieves a memory overhead that grows sublinearly with the number of pages per folio. By implementing our method in Linux, we demonstrate notable improvements in fork and unmap operations of 1.9× and 4.2×, respectively, and up to a 2.2× speedup for fork-intensive workloads such as Redis.
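
One way to picture the question the tracking scheme answers (a hypothetical sketch, not the paper's encoding, which is far more compact): keep a total map count per folio plus the identity and count of a single candidate owning address space, and report "exclusive" when the owner accounts for all mappings.

# Hypothetical sketch: is this folio mapped by exactly one address space?
class FolioMapTracker:
    def __init__(self):
        self.total = 0
        self.owner = None            # candidate exclusive address space
        self.owner_count = 0

    def map(self, mm):
        self.total += 1
        if self.owner in (None, mm):
            self.owner = mm
            self.owner_count += 1

    def unmap(self, mm):
        self.total -= 1
        if mm == self.owner:
            self.owner_count -= 1
        if self.total == 0:
            self.owner, self.owner_count = None, 0

    def exclusively_mapped(self, mm):
        # May conservatively answer False after ownership changes hands,
        # which only costs an unnecessary Copy-On-Write, never correctness.
        return self.owner == mm and self.owner_count == self.total

t = FolioMapTracker()
t.map("mm_parent"); t.map("mm_parent")         # two PTEs from one address space
print(t.exclusively_mapped("mm_parent"))       # True
t.map("mm_child")                              # after fork, a second address space maps it
print(t.exclusively_mapped("mm_parent"))       # False: Copy-On-Write still needed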

https://www.usenix.org/conference/atc24/presentation/hildebrand
Friday July 12, 2024 4:55pm - 5:10pm PDT
Grand Ballroom CD

5:00pm PDT

MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
Friday July 12, 2024 5:00pm - 5:20pm PDT
Donglin Zhuang, The University of Sydney; Zhen Zheng, Alibaba Group; Haojun Xia, The University of Sydney; Xiafei Qiu, Junjie Bai, and Wei Lin, Alibaba Group; Shuaiwen Leon Song, The University of Sydney

In this work, we reveal that the kernel-by-kernel execution scheme in existing machine learning optimizing compilers is no longer effective in fully utilizing the hardware resources provided by advances in modern GPU architectures. Specifically, such a scheme suffers from severe non-computation overhead and off-chip memory traffic, which greatly attenuates the optimization efforts of state-of-the-art compiler techniques on newer generations of GPUs. To address this emerging challenge, we propose MonoNN, the first machine learning optimizing compiler that enables a new monolithic design and optimization space for common static neural network (NN) inference tasks on a single GPU. MonoNN can accommodate an entire neural network into a single GPU kernel, drastically reducing non-computation overhead and providing further fine-grained optimization opportunities from the newly formed monolithic optimization space. Most importantly, MonoNN identifies the resource incompatibility issue between various NN operators as the key design bottleneck for creating such a monolithic optimization space. MonoNN then tackles it by systematically exploring and exploiting a parallelism compensation strategy and resource trade-offs across different types of NN computations, and by proposing a novel schedule-independent group tuning technique to significantly shrink the extremely large tuning space. Finally, MonoNN provides a compiler implementation that incorporates our proposed optimizations and automatically generates highly efficient kernel code. Extensive evaluation on a set of popular production inference tasks demonstrates that MonoNN achieves an average speedup of 2.01× over state-of-the-art frameworks and compilers. Specifically, MonoNN outperforms TVM, TensorRT, XLA, and AStitch by up to 7.3×, 5.9×, 1.7×, and 2.9× in terms of end-to-end inference performance, respectively. MonoNN source code is publicly available at https://github.com/AlibabaResearch/mononn.

https://www.usenix.org/conference/osdi24/presentation/zhuang
Friday July 12, 2024 5:00pm - 5:20pm PDT
Grand Ballroom ABGH

5:10pm PDT

USENIX ATC ’24 Closing Remarks
Friday July 12, 2024 5:10pm - 5:20pm PDT
Program Co-Chairs: Saurabh Bagchi, Purdue University; Yiying Zhang, University of California, San Diego
Friday July 12, 2024 5:10pm - 5:20pm PDT
Grand Ballroom CD

5:20pm PDT

OSDI ’24 Closing Remarks
Friday July 12, 2024 5:20pm - 5:30pm PDT
Program Co-Chairs: Ada Gavrilovska, Georgia Institute of Technology; Douglas B. Terry, Amazon Web Services
Friday July 12, 2024 5:20pm - 5:30pm PDT
Grand Ballroom ABGH
 