Haojun Xia, University of Sydney; Zhen Zheng and Xiaoxia Wu, Microsoft; Shiyang Chen, Rutgers University; Zhewei Yao, Stephen Youn, Arash Bakhtiari, and Michael Wyatt, Microsoft; Donglin Zhuang and Zhongzhu Zhou, University of Sydney; Olatunji Ruwase, Yuxiong He, and Shuaiwen Leon Song, Microsoft

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) while preserving model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. Supporting FP6 quantization on GPUs is challenging because of (1) the unfriendly memory access pattern of model weights with a non-power-of-two bit-width and (2) the high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and other arbitrary bit-width quantization (e.g., 5-bit). We integrate the TC-FPx kernel into an existing inference system, providing new end-to-end support (called Quant-LLM) for quantized LLM inference, where 6-bit quantization achieves better trade-offs between inference cost and model quality. Experiments show that Quant-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69× to 2.65× higher normalized inference throughput than the FP16 baseline. The source code is publicly available at
https://github.com/usyd-fsalab/fp6_llm.
https://www.usenix.org/conference/atc24/presentation/xia
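
The first challenge named in the abstract, memory access for weights with a non-power-of-two bit-width, can be illustrated with a small sketch. The Python below is not the TC-FPx kernel or its weight layout. It is a naive, contiguous 6-bit packing written only to show why such weights do not line up with byte boundaries. All function names (`pack_6bit`, `unpack_6bit`, `straddles_byte`, `weight_bytes`) are hypothetical, and the footprint estimate assumes 70 billion parameters and counts weight storage only (no scales, activations, or KV cache).

```python
def pack_6bit(codes):
    """Pack integers in [0, 63] into bytes, 6 bits each, most significant bit first."""
    acc = 0      # bit accumulator holding not-yet-emitted bits
    nbits = 0    # number of valid bits currently in acc
    out = bytearray()
    for c in codes:
        if not 0 <= c < 64:
            raise ValueError("code does not fit in 6 bits")
        acc = (acc << 6) | c
        nbits += 6
        while nbits >= 8:              # emit every complete byte
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1        # keep only the leftover bits
    if nbits:                          # pad the final partial byte with zeros
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack_6bit(data, count):
    """Inverse of pack_6bit: recover `count` 6-bit codes from packed bytes."""
    acc = 0
    nbits = 0
    codes = []
    for b in data:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 6 and len(codes) < count:
            nbits -= 6
            codes.append((acc >> nbits) & 0x3F)
        acc &= (1 << nbits) - 1
    return codes


def straddles_byte(index, bits=6):
    """True if the weight at `index` crosses a byte boundary in the packed stream."""
    start = index * bits               # first bit position of this weight
    end = start + bits - 1             # last bit position of this weight
    return start // 8 != end // 8


def weight_bytes(num_params, bits):
    """Bytes needed to store the weights alone at the given bit-width."""
    return num_params * bits // 8


if __name__ == "__main__":
    packed = pack_6bit([1, 2, 3, 4])
    print(list(packed), unpack_6bit(packed, 4))
    print([straddles_byte(i) for i in range(4)])
    # Assumed parameter count of 70 billion; weight storage only.
    print(weight_bytes(70_000_000_000, 16), weight_bytes(70_000_000_000, 6))
```

In this naive layout every group of four weights occupies exactly three bytes, and two of every four weights cross a byte boundary, so a single weight cannot be fetched with one aligned byte load. The same arithmetic gives roughly 140 GB of weights at 16 bits versus 52.5 GB at 6 bits for 70 billion parameters, which is consistent with the abstract's claim that 6-bit quantization makes single-GPU inference feasible, assuming a GPU with enough memory for the smaller figure.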