Name: Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
Start: 2024-07-11T15:15:00-0700
End: 2024-07-11T15:40:00-0700

Thursday July 11, 2024 3:15pm - 3:40pm PDT

Grand Ballroom CD

Haojun Xia, University of Sydney; Zhen Zheng and Xiaoxia Wu, Microsoft; Shiyang Chen, Rutgers University; Zhewei Yao, Stephen Youn, Arash Bakhtiari, and Michael Wyatt, Microsoft; Donglin Zhuang and Zhongzhu Zhou, University of Sydney; Olatunji Ruwase, Yuxiong He, and Shuaiwen Leon Song, Microsoft

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. It is challenging to support FP6 quantization on GPUs due to (1) unfriendly memory access of model weights with non-power-of-two bit-width and (2) high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support of 6-bit and arbitrary bit-width quantization (5-bit, etc.). We integrate TC-FPx kernel into an existing inference system, providing new end-to-end support (called Quant-LLM) for quantized LLM inference, where better trade-offs between inference cost and model quality are achieved with 6-bit quantization. Experiments show that Quant-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69×-2.65× higher normalized inference throughput than the FP16 baseline. The source code is publicly available at https://github.com/usyd-fsalab/fp6_llm.

https://www.usenix.org/conference/atc24/presentation/xia

Thursday July 11, 2024 3:15pm - 3:40pm PDT
Grand Ballroom CD

USENIX ATC Track 1

USENIX ATC '24 and OSDI '24

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!