USENIX ATC '24 and OSDI '24
Thursday July 11, 2024 11:10am - 11:35am PDT
Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang, Kuaishou Technology

Recent advancements in training large-scale models have centered on optimizing activation strategies and exploring various parallel training options. One research avenue focuses on improving activation-related operations such as offloading and recomputing, but these strategies still leave room for refinement in balancing computation against memory utilization. Another line of work explores different training parallelisms, which often require extensive parameter tuning and still end up with suboptimal combinations of parallel options.

To tackle these challenges, this paper introduces a novel method for losslessly accelerating the training of large language models. Specifically, two efficient activation rematerialization strategies are proposed: Pipeline-Parallel-Aware Offloading, which maximizes the use of host memory for storing activations, and Compute-Memory Balanced Checkpointing, which seeks a practical equilibrium between activation memory and computational efficiency. Additionally, the paper presents an extremely efficient search method for optimizing hybrid-parallelism parameters, taking both offloading and checkpointing into account to achieve optimal performance. The efficacy of the proposed method is demonstrated through extensive experiments on public benchmarks with diverse model sizes and context window sizes. For example, the method significantly increases Model FLOPs Utilization (MFU) from 32.3% to 42.7% for a 175B Llama-like model with a context window size of 32,768 on 256 NVIDIA H800 GPUs.
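For context, MFU is the ratio of achieved training FLOPs per second to the hardware's peak FLOPs per second, so the reported improvement from 32.3% to 42.7% corresponds to roughly a 1.3x end-to-end throughput gain on the same hardware.

To make the general idea of activation offloading concrete, the following is a minimal PyTorch sketch that moves saved activations to pinned host memory during the forward pass and reloads them for the backward pass. It is only an illustration of generic offloading via saved-tensor hooks, not the paper's Pipeline-Parallel-Aware Offloading or Compute-Memory Balanced Checkpointing; the model, sizes, and hook names are placeholders.

import torch
import torch.nn as nn

# Pack hook: copy each saved activation to pinned host memory so the
# device-to-host transfer can overlap with ongoing computation.
def offload_to_cpu(tensor):
    cpu_copy = torch.empty(tensor.size(), dtype=tensor.dtype, pin_memory=True)
    cpu_copy.copy_(tensor, non_blocking=True)
    return (tensor.device, cpu_copy)

# Unpack hook: bring the activation back to its original device just before
# it is needed by the backward pass.
def reload_from_cpu(packed):
    device, cpu_copy = packed
    return cpu_copy.to(device, non_blocking=True)

# Placeholder block and input; a real workload would be a transformer stack
# partitioned across pipeline stages, with recomputation (e.g. via
# torch.utils.checkpoint) applied to the layers that are cheapest to recompute.
block = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(offload_to_cpu, reload_from_cpu):
    loss = block(x).sum()
loss.backward()

In this toy setting every saved activation is offloaded; the paper's contribution lies in deciding, jointly with the pipeline schedule and the hybrid-parallelism configuration, which activations to offload, which to recompute, and which to keep resident on the device.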

https://www.usenix.org/conference/atc24/presentation/yuan
Grand Ballroom CD

