This competition focuses on predicting Uncorrectable Errors (UEs) in Dynamic Random-Access Memory (DRAM), a critical issue impacting data center reliability and cloud service availability. As data centers expand to meet the growing demands of cloud computing and web-based applications, hardware failures, particularly memory failures, have become a significant threat to server reliability and uptime. Participants will tackle challenges such as data noise, extreme class imbalance, heterogeneous data sources, and hardware aging, using a real-world dataset containing memory system configurations, error logs, and failure tags. The goal is to develop efficient and generalizable solutions for UE prediction, fostering advances in machine learning for web-centric and cloud environments and ultimately enhancing the reliability and resilience of modern data centers.
Uncorrectable Errors (UEs) of Dynamic Random-Access Memory (DRAM) have been identified as a major failure cause in data centers. The multi-bit UE failure of High Bandwidth Memory (HBM) poses a significant threat to the availability and reliability of servers and entire computing clusters. Forecasting UEs in advance, so that preemptive maintenance measures can be taken, has emerged as a viable strategy for reducing server outages, and several machine learning-based solutions have been proposed.
However, predicting UEs presents several challenges: data noise and extreme class imbalance, as UEs are exceptionally rare in memory events; heterogeneous data sources, since DRAMs in operational environments originate from diverse manufacturing or architectural platforms; distribution shifts caused by hardware aging; and latent factors introduced by the dynamic access mechanisms of memory.
We curated a real-world memory error dataset that contains both micro-level and bit-level information, and prepared a two-stage challenge aimed at more efficient and generalizable event prediction solutions. We believe the competition will serve as fertile ground for fostering discussions and advancing research on several important topics relevant to real-world ML applications.
Our goal is to predict whether each DRAM module will experience an Uncorrectable Error (UE) failure within the next k days, framing the problem as a binary classification task. To support this, we provide participants with a comprehensive dataset that includes memory system configurations, memory error logs, and failure tags. This dataset serves as the foundation for developing solutions to predict potential failures of individual DRAM modules during a specified observation period.
The competition is designed to combine practical relevance with an accessible entry point in the initial stage, while also introducing fresh challenges across both stages to engage participants and drive innovative solutions.
DRAM components and errors: The figure below illustrates the DRAM organization within a server. The basic unit of installation is a Dual In-Line Memory Module (DIMM). At a fundamental level, a DIMM consists of multiple DRAM chips grouped into ranks, enabling simultaneous access to all chips within the same rank during DRAM read/write operations. Each chip contains multiple banks that operate in parallel. These banks are further divided into rows and columns, with the intersection of a row and column constituting a cell capable of storing a single data bit. The data width of a chip, denoted as x4, x8, or x16, signifies the number of data bits the chip reads or writes per access.
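The hierarchy above maps directly onto the location fields reported in a memory error log. The following sketch (field names are illustrative, not an official schema) captures that nesting in a single record type:

```python
from dataclasses import dataclass

# Illustrative sketch of the DRAM address hierarchy described above:
# CPU socket -> channel -> DIMM -> rank -> chip (device) -> bank group
# -> bank -> row/column. A (row, column) pair addresses one cell, which
# stores a single bit. Field names are hypothetical, not the official schema.
@dataclass(frozen=True)
class DramErrorLocation:
    cpu_id: int         # CPU socket
    channel_id: int     # memory channel on the socket
    dimm_id: int        # DIMM slot on the channel
    rank_id: int        # rank within the DIMM
    device_id: int      # DRAM chip within the rank
    bank_group_id: int
    bank_id: int
    row_id: int
    column_id: int      # row/column intersection addresses a single cell

loc = DramErrorLocation(0, 1, 0, 1, 7, 2, 3, 0x1A2B, 0x3C)
print(loc.device_id, loc.row_id, loc.column_id)
```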
A DRAM error occurs when the DRAM exhibits abnormal behavior, resulting in one or more bits being read differently from their written values. Modern DRAM implementations use error-correcting codes (ECC) to safeguard against DRAM errors.
Memory Access and RAS: The figure below depicts the transmission process of x4 DRAM Double Data Rate 4 (DDR4) chips via DQs. Upon initiating a data request, 8 beats—each consisting of 72 bits (64 data bits and 8 ECC bits)—are transferred to the memory controller via DQ wires. By implementing contemporary ECC, the 72-bit data are distributed across 18 DRAM chips, enabling the memory controller to detect and correct errors, as shown in the figure below. Note that the addresses of the ECC check bits are decoded to locate specific errors by DQ and beat. Subsequently, all logs, including error detection and correction events and memory specifications, are archived in the Baseboard Management Controller (BMC), as illustrated in the figure below. Utilizing memory failure prediction allows for the anticipation of failures and the activation of corresponding mitigation techniques based on specific use cases.
The dataset includes log files collected via mcelog, with each file named by the serial number of the DIMM. train.zip and test.zip contain logs with 23 columns, such as CPU ID, channel ID, DIMM ID, rank ID, device ID, bank group ID, bank ID, row ID, column ID, and more, providing detailed information about DRAM errors and system configurations. Additionally, failure_ticket.csv records the failures of each DIMM, including serial number, failure time, and server type, while submission.csv requires predictions for DIMM failures, with columns for serial number, server type, and prediction timestamp (multiple timestamps are allowed for each DIMM). Note that the DIMM manufacturer, part number, and other sensitive information have been anonymized to ensure privacy.
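Because each log file in train.zip is named by the DIMM serial number, a natural first step is to assemble the per-DIMM logs into one table keyed by serial number. The sketch below assumes the per-DIMM files are plain CSVs inside the archive; the actual file layout and column names may differ:

```python
import zipfile

import pandas as pd

# Minimal sketch (not the official loader): read every per-DIMM log file
# inside the archive, tag each row with the DIMM serial number taken from
# the file name, and concatenate into one DataFrame.
def load_error_logs(zip_path) -> pd.DataFrame:
    frames = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # the file name (without extension) is the DIMM serial number
            serial = name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
            with zf.open(name) as f:
                df = pd.read_csv(f)
            df["serial_number"] = serial
            frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

The resulting table can then be joined with failure_ticket.csv on the serial number column to attach failure times for labeling.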
Our goal is to predict whether each DRAM will experience an Uncorrectable Error (UE) failure within the next \( k \) days. Thus, the problem is formulated as a binary classification problem.
The evaluation protocol, as illustrated in the figure below, is designed based on production requirements. Specifically, at the current time \( t \), an algorithm observes historical data from an observation window \( \Delta t_d \) to predict failures within the prediction interval \([t + \Delta t_l, t + \Delta t_l + \Delta t_p]\). Here, \( \Delta t_l \) is the minimum time interval between the prediction (i.e., lead time) and the failure, and \( \Delta t_p \) denotes the prediction interval. For this competition, \( \Delta t_l \) is set to 15 minutes, and \( \Delta t_p \) is set to 7 days.
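The windowing above implies a labeling rule for training samples: a sample observed at time \( t \) is positive exactly when the DIMM's failure time falls inside \([t + \Delta t_l, t + \Delta t_l + \Delta t_p]\). A minimal sketch, using the competition's settings of a 15-minute lead time and a 7-day prediction interval:

```python
from datetime import datetime, timedelta
from typing import Optional

# Sketch of the labeling rule implied by the evaluation protocol.
LEAD = timedelta(minutes=15)  # delta_t_l: minimum lead time
SPAN = timedelta(days=7)      # delta_t_p: prediction interval

def label_at(t: datetime, failure_time: Optional[datetime]) -> int:
    """Return 1 if the failure falls inside [t + lead, t + lead + span]."""
    if failure_time is None:
        return 0
    start, end = t + LEAD, t + LEAD + SPAN
    return int(start <= failure_time <= end)

t = datetime(2024, 1, 1, 12, 0)
print(label_at(t, datetime(2024, 1, 3)))         # inside the window -> 1
print(label_at(t, datetime(2024, 1, 1, 12, 5)))  # within the lead time -> 0
```

Note that a failure occurring inside the lead time (less than 15 minutes after \( t \)) counts as negative, since a prediction made at \( t \) would leave no time to act on it.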
For evaluation, the lead time (\( \Delta t_l \)) is fixed at 15 minutes and the prediction window (\( \Delta t_p \)) at 7 days. However, participants are free to choose their own observation window and labeling methods for training.
To evaluate the performance of participant solutions, we use the \( F1 \) score, which balances both precision and recall. The \( F1 \) score is calculated as: \[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
Precision and Recall are defined as: \[ \text{Precision} = \frac{TP}{TP + FP} \] \[ \text{Recall} = \frac{TP}{TP + FN} \]
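These formulas translate directly into code. A small helper computing the metric from raw counts, consistent with the definitions above:

```python
# Direct implementation of the F1 metric defined above, from raw counts of
# true positives (TP), false positives (FP), and false negatives (FN).
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 8 correctly predicted failures, 2 false alarms, 4 missed failures
# gives precision 0.8 and recall 2/3, so F1 is about 0.727.
print(f1_score(tp=8, fp=2, fn=4))
```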
The terms used in the formulas are defined as follows: