Name: Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field
Start: 2024-07-11T16:35:00-0700
End: 2024-07-11T17:00:00-0700

Thursday July 11, 2024 4:35pm - 5:00pm PDT

Grand Ballroom EF

Ronglong Wu, Shuyue Zhou, Jiahao Lu, Zhirong Shen, and Zikang Xu, Xiamen University; Jiwu Shu, Xiamen University and Minjiang University; Kunlin Yang and Feilong Lin, Huawei Technologies Co., Ltd; Yiming Zhang, Xiamen University

High-bandwidth memory (HBM) is regarded as a promising technology for fundamentally overcoming the memory wall. It stacks multiple DRAM dies vertically to dramatically improve memory access bandwidth. However, this architecture also comes with more severe reliability issues, since HBM not only inherits the error patterns of conventional DRAM but also introduces new error causes.

In this paper, we conduct the first systematic study of HBM errors, covering over 460 million error events collected from nineteen data centers and spanning over two years of deployment under a variety of services. Through error analyses and methodology validations, we confirm that HBM exhibits different error patterns from conventional DRAM in terms of spatial locality, temporal correlation, and sensor metrics, which makes empirical prediction models designed for DRAM ineffective for HBM. Based on our findings, we design and implement Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures. The results demonstrate the feasibility of failure prediction across hierarchical levels.

https://www.usenix.org/conference/atc24/presentation/wu-ronglong

USENIX ATC Track 2

USENIX ATC '24 and OSDI '24
