USENIX ATC '24 and OSDI '24
Wednesday July 10, 2024 11:35am - 12:00pm PDT
Michael Luo, Siyuan Zhuang, Suryaprakash Vengadesan, and Romil Bhardwaj, UC Berkeley; Justin Chang, UC Santa Barbara; Eric Friedman, Scott Shenker, and Ion Stoica, UC Berkeley

To efficiently tackle bursts in job demand, organizations employ hybrid cloud architectures to scale their batch workloads from their private clusters to the public cloud. This requires transforming cluster schedulers into cloud-enabled versions that navigate the tradeoff between cloud costs and scheduler objectives such as job completion time (JCT). However, our analysis of production-level traces shows that existing cloud-enabled schedulers incur inefficient cost-JCT tradeoffs due to low cluster utilization.

We present Starburst, a system that maximizes cluster utilization to streamline the cost-JCT tradeoff. Starburst's scheduler dynamically controls jobs' waiting times to improve utilization: it assigns longer waits to large jobs to increase their chances of running on the cluster, and shorter waits to small jobs to increase their chances of running on the cloud. To offer configurability, Starburst provides system administrators with a simple waiting budget framework to tune their position on the cost-JCT curve. In a departure from traditional cluster schedulers, Starburst operates as a higher-level resource manager over a private cluster and dynamic cloud clusters. Simulations over production-level traces and real-world experiments on a 32-GPU private cluster show that Starburst can reduce cloud costs by up to 54-91% over existing cluster managers, while increasing average JCT by at most 5.8%.
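
The size-dependent waiting idea can be illustrated with a minimal sketch. This is not Starburst's implementation: the `Job` fields, the `budget_s_per_gpu` parameter, and the linear rule tying the allowed wait to the number of requested GPUs are all assumptions made for illustration, since the abstract does not specify how waits are computed.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int        # requested GPUs (hypothetical measure of job size)
    waited_s: float  # seconds the job has spent in the queue so far


def wait_limit_s(job: Job, budget_s_per_gpu: float) -> float:
    """Assumed rule: the allowed wait grows linearly with job size."""
    return budget_s_per_gpu * job.gpus


def place(job: Job, free_cluster_gpus: int, budget_s_per_gpu: float) -> str:
    """Return 'cluster', 'cloud', or 'wait' for a single queued job."""
    if job.gpus <= free_cluster_gpus:
        return "cluster"  # fits on the private cluster right now
    if job.waited_s >= wait_limit_s(job, budget_s_per_gpu):
        return "cloud"    # waiting allowance exhausted, burst to the cloud
    return "wait"         # keep queueing for cluster capacity


small = Job("small", gpus=1, waited_s=120.0)
large = Job("large", gpus=8, waited_s=120.0)
print(place(small, free_cluster_gpus=0, budget_s_per_gpu=60.0))
print(place(large, free_cluster_gpus=0, budget_s_per_gpu=60.0))
```

Under these assumptions, raising `budget_s_per_gpu` lets every job wait longer for cluster capacity (lower cloud cost, higher JCT), and lowering it does the opposite, which is the kind of single knob the waiting budget framework is described as providing.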

https://www.usenix.org/conference/atc24/presentation/luo
Grand Ballroom CD
