Loading…
USENIX ATC '24 and OSDI '24
Attending this event?
Thursday July 11, 2024 4:35pm - 5:00pm PDT
Ronglong Wu, Shuyue Zhou, Jiahao Lu, Zhirong Shen, and Zikang Xu, Xiamen University; Jiwu Shu, Xiamen University and Minjiang University; Kunlin Yang and Feilong Lin, Huawei Technologies Co., Ltd; Yiming Zhang, Xiamen University

High-bandwidth memory (HBM) is regarded as a promising technology for fundamentally overcoming the memory wall. It stacks up multiple DRAM dies vertically to dramatically improve the memory access bandwidth. However, this architecture also comes with more severe reliability issues, since HBM not only inherits error patterns of the conventional DRAM, but also introduces new error causes.

In this paper, we conduct the first systematical study on HBM errors, which cover over 460 million error events collected from nineteen data centers and span over two years of deployment under a variety of services. Through error analyses and methodology validations, we confirm that the HBM exhibits different error patterns from conventional DRAM, in terms of spatial locality, temporal correlation, and sensor metrics which make empirical prediction models for DRAM error prediction ineffective for HBM. We design and implement Calchas, a hierarchical failure prediction framework for HBM based on our findings, which integrate spatial, temporal, and sensor information from various device levels to predict upcoming failures. The results demonstrate the feasibility of failure prediction across hierarchical levels.

https://www.usenix.org/conference/atc24/presentation/wu-ronglong
Thursday July 11, 2024 4:35pm - 5:00pm PDT
Grand Ballroom EF

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link