USENIX ATC '24 and OSDI '24
Friday July 12, 2024 3:40pm - 4:00pm PDT
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
Bingyang Wu, Ruidong Zhu, and Zili Zhang, School of Computer Science, Peking University; Peng Sun, Shanghai AI Lab; Xuanzhe Liu and Xin Jin, School of Computer Science, Peking University

Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving system for LoRA models. dLoRA achieves high serving efficiency by dynamically orchestrating requests and LoRA adapters in two ways: (i) dynamically merging and unmerging adapters with the base model; and (ii) dynamically migrating requests and adapters between different worker replicas. These capabilities are designed based on two insights. First, despite the allure of batching without merging a LoRA adapter into the base model, it is not always beneficial to unmerge, especially when the types of requests are skewed. Second, the autoregressive nature of LLM requests introduces load imbalance between worker replicas due to varying input and output lengths, even if the input requests are distributed uniformly to the replicas. We design a credit-based batching algorithm to decide when to merge and unmerge, and a request-adapter co-migration algorithm to decide when to migrate. The experimental results show that dLoRA improves the throughput by up to 57.9× and 26.0×, compared to vLLM and HuggingFace PEFT, respectively. Compared to the concurrent work S-LoRA, dLoRA achieves up to 1.8× lower average latency.
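To make the merge/unmerge trade-off concrete, below is a minimal NumPy sketch of the underlying LoRA arithmetic. It is illustrative only and not dLoRA's implementation; the variable names, toy shapes, and scaling factor are assumptions. Merging folds the low-rank update into the base weight, so each token needs a single matrix multiply but the replica is pinned to one adapter; keeping the adapter unmerged costs two extra low-rank multiplies per token, but lets requests for different adapters share the base-model computation in one batch.

import numpy as np

d, r = 8, 2                   # hidden size and LoRA rank (toy values)
W = np.random.randn(d, d)     # pre-trained base weight
A = np.random.randn(r, d)     # LoRA down-projection
B = np.random.randn(d, r)     # LoRA up-projection
scale = 1.0                   # assumed alpha / r scaling factor

def merged_forward(x):
    # Merged mode: fold the adapter into the base weight.
    # One GEMM per token, but the replica can only serve this one adapter.
    W_merged = W + scale * (B @ A)
    return x @ W_merged.T

def unmerged_forward(x):
    # Unmerged mode: keep the adapter separate.
    # Two extra low-rank GEMMs, but requests for different adapters can
    # share the base-model multiplication x @ W.T in one batch.
    return x @ W.T + scale * (x @ A.T) @ B.T

x = np.random.randn(4, d)     # a batch of 4 token activations
assert np.allclose(merged_forward(x), unmerged_forward(x))

Both paths compute the same result; the point the abstract makes is that which one is cheaper depends on the request mix, which is why dLoRA decides dynamically rather than fixing one mode.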

https://www.usenix.org/conference/osdi24/presentation/wu-bingyang
Grand Ballroom ABGH
