Name: MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
Start: 2024-07-12T17:00:00-0700
End: 2024-07-12T17:20:00-0700

Friday July 12, 2024 5:00pm - 5:20pm PDT

Grand Ballroom ABGH

Donglin Zhuang, The University of Sydney; Zhen Zheng, Alibaba Group; Haojun Xia, The University of Sydney; Xiafei Qiu, Junjie Bai, and Wei Lin, Alibaba Group; Shuaiwen Leon Song, The University of Sydney

In this work, we reveal that the kernel-by-kernel execution scheme in existing machine learning optimizing compilers is no longer effective in fully utilizing the hardware resources provided by advances in modern GPU architectures. Specifically, such a scheme suffers from severe non-computation overhead and off-chip memory traffic, greatly attenuating the optimization efforts of state-of-the-art compiler techniques on newer generations of GPUs. To address this emerging challenge, we propose MonoNN, the first machine learning optimizing compiler that enables a new monolithic design and optimization space for common static neural network (NN) inference tasks on a single GPU. MonoNN can accommodate an entire neural network in a single GPU kernel, drastically reducing non-computation overhead and providing further fine-grained optimization opportunities from the newly formed monolithic optimization space. Most importantly, MonoNN identifies the resource incompatibility issue between various NN operators as the key design bottleneck for creating such a monolithic optimization space. MonoNN then tackles it by systematically exploring and exploiting the parallelism compensation strategy and resource trade-offs across different types of NN computations, and by proposing a novel schedule-independent group tuning technique to significantly shrink the extremely large tuning space. Finally, MonoNN provides a compiler implementation that incorporates our proposed optimizations and automatically generates highly efficient kernel code. Extensive evaluation on a set of popular production inference tasks demonstrates that MonoNN achieves an average speedup of 2.01× over the state-of-the-art frameworks and compilers. Specifically, MonoNN outperforms TVM, TensorRT, XLA, and AStitch by up to 7.3×, 5.9×, 1.7×, and 2.9×, respectively, in terms of end-to-end inference performance.
MonoNN source code is publicly available at https://github.com/AlibabaResearch/mononn.

https://www.usenix.org/conference/osdi24/presentation/zhuang


OSDI

USENIX ATC '24 and OSDI '24
