This competition focuses on predicting Uncorrectable Errors (UEs) in Dynamic Random-Access Memory (DRAM), a critical issue impacting data center reliability and cloud service availability. As data centers expand to meet the growing demands of cloud computing and web-based applications, hardware failures, particularly memory failures, have become a significant threat to server reliability and uptime. Participants will tackle challenges such as data noise, extreme class imbalance, heterogeneous data sources, and hardware aging, using a real-world dataset containing memory system configurations, error logs, and failure tags. The goal is to develop efficient and generalizable solutions for UE prediction, fostering advances in machine learning for web-centric and cloud environments and ultimately enhancing the reliability and resilience of modern data centers.
Uncorrectable Errors (UEs) of Dynamic Random-Access Memory (DRAM) have been identified as a major failure cause in data centers. The multi-bit UE failure of High Bandwidth Memory (HBM) poses a significant threat to the availability and reliability of servers and entire computing clusters. Forecasting UEs in advance, so that preemptive maintenance measures can be taken, has emerged as a viable strategy for reducing server outages, and several machine learning-based solutions have been proposed.
However, predicting UEs presents several challenges: data noise and extreme class imbalance, as UEs are exceptionally rare in memory events; heterogeneous data sources, since DRAMs in operational environments originate from diverse manufacturing or architectural platforms; distribution shifts caused by hardware aging; and latent factors introduced by the dynamic access mechanisms of memory.
We curated a real-world memory error dataset that contains both micro-level and bit-level information, and prepared a two-stage challenge aimed at more efficient and generalizable event prediction solutions. We believe the competition will serve as fertile ground for fostering discussions and advancing research on several important topics relevant to real-world ML applications.
Our goal is to predict whether each DRAM module will experience an Uncorrectable Error (UE) failure within the next k days, framing the problem as a binary classification task. To support this, we provide participants with a comprehensive dataset that includes memory system configurations, memory error logs, and failure tags. This dataset serves as the foundation for developing solutions to predict potential failures of individual DRAM modules during a specified observation period.
The competition is designed to combine practical relevance with an accessible entry point in the initial stage, while also introducing fresh challenges across both stages to engage participants and drive innovative solutions.
DRAM components and errors: The figure below illustrates the DRAM organization within a server. The basic unit of installation is a Dual In-Line Memory Module (DIMM). At a fundamental level, a DIMM consists of multiple DRAM chips grouped into ranks, enabling simultaneous access to all chips within the same rank during DRAM read/write operations. Each chip contains multiple banks that operate in parallel. These banks are further divided into rows and columns, with the intersection of a row and column constituting a cell capable of storing a single data bit. The data width of a chip, denoted as x4, x8, or x16, signifies the number of data bits the chip reads or writes per access.
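The hierarchy above maps directly onto the location fields reported in a memory error log. The following sketch (field names are illustrative, not an official schema) captures that nesting in a single record type:

```python
from dataclasses import dataclass

# Illustrative sketch of the DRAM address hierarchy described above:
# CPU socket -> channel -> DIMM -> rank -> chip (device) -> bank group
# -> bank -> row/column. A (row, column) pair addresses one cell, which
# stores a single bit. Field names are hypothetical, not the official schema.
@dataclass(frozen=True)
class DramErrorLocation:
    cpu_id: int         # CPU socket
    channel_id: int     # memory channel on the socket
    dimm_id: int        # DIMM slot on the channel
    rank_id: int        # rank within the DIMM
    device_id: int      # DRAM chip within the rank
    bank_group_id: int
    bank_id: int
    row_id: int
    column_id: int      # row/column intersection addresses a single cell

loc = DramErrorLocation(0, 1, 0, 1, 7, 2, 3, 0x1A2B, 0x3C)
print(loc.device_id, loc.row_id, loc.column_id)
```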
A DRAM error occurs when the DRAM exhibits abnormal behavior, resulting in one or more bits being read differently from their written values. Modern DRAM implementations use error-correcting codes (ECC) to safeguard against DRAM errors.
Memory Access and RAS: The figure below depicts the transmission process of x4 DRAM Double Data Rate 4 (DDR4) chips via DQs. Upon initiating a data request, 8 beats—each consisting of 72 bits (64 data bits and 8 ECC bits)—are transferred to the memory controller via DQ wires. By implementing contemporary ECC, the 72-bit data are distributed across 18 DRAM chips, enabling the memory controller to detect and correct errors, as shown in the figure below. Note that the addresses of the ECC check bits are decoded to locate specific errors by DQ and beat. Subsequently, all logs, including error detection and correction events and memory specifications, are archived in the Baseboard Management Controller (BMC), as illustrated in the figure below. Utilizing memory failure prediction allows for the anticipation of failures and the activation of corresponding mitigation techniques based on specific use cases.
The dataset includes log files collected via mcelog, with each file named by the serial number of the DIMM. train.zip and test.zip contain logs with 23 columns, such as CPU ID, channel ID, DIMM ID, rank ID, device ID, bank group ID, bank ID, row ID, column ID, and more, providing detailed information about DRAM errors and system configurations. Additionally, failure_ticket.csv records the failures of each DIMM, including serial number, failure time, and server type, while submission.csv requires predictions for DIMM failures, with columns for serial number, server type, and prediction timestamp (multiple timestamps are allowed for each DIMM). Note that the DIMM manufacturer, part number, and other sensitive information have been anonymized to ensure privacy.
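Because each log file in train.zip is named by the DIMM serial number, a natural first step is to assemble the per-DIMM logs into one table keyed by serial number. The sketch below assumes the per-DIMM files are plain CSVs inside the archive; the actual file layout and column names may differ:

```python
import zipfile

import pandas as pd

# Minimal sketch (not the official loader): read every per-DIMM log file
# inside the archive, tag each row with the DIMM serial number taken from
# the file name, and concatenate into one DataFrame.
def load_error_logs(zip_path) -> pd.DataFrame:
    frames = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # the file name (without extension) is the DIMM serial number
            serial = name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
            with zf.open(name) as f:
                df = pd.read_csv(f)
            df["serial_number"] = serial
            frames.append(df)
    return pd.concat(frames, ignore_index=True)
```

The resulting table can then be joined with failure_ticket.csv on the serial number column to attach failure times for labeling.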
Our goal is to predict whether each DRAM will experience an Uncorrectable Error (UE) failure within the next \( k \) days. Thus, the problem is formulated as a binary classification problem.
The evaluation protocol, as illustrated in the figure below, is designed based on production requirements. Specifically, at the current time \( t \), an algorithm observes historical data from an observation window \( \Delta t_d \) to predict failures within the prediction interval \([t + \Delta t_l, t + \Delta t_l + \Delta t_p]\). Here, \( \Delta t_l \) is the minimum time interval between the prediction (i.e., lead time) and the failure, and \( \Delta t_p \) denotes the prediction interval. For this competition, \( \Delta t_l \) is set to 15 minutes, and \( \Delta t_p \) is set to 7 days.
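The windowing above implies a labeling rule for training samples: a sample observed at time \( t \) is positive exactly when the DIMM's failure time falls inside \([t + \Delta t_l, t + \Delta t_l + \Delta t_p]\). A minimal sketch, using the competition's settings of a 15-minute lead time and a 7-day prediction interval:

```python
from datetime import datetime, timedelta
from typing import Optional

# Sketch of the labeling rule implied by the evaluation protocol.
LEAD = timedelta(minutes=15)  # delta_t_l: minimum lead time
SPAN = timedelta(days=7)      # delta_t_p: prediction interval

def label_at(t: datetime, failure_time: Optional[datetime]) -> int:
    """Return 1 if the failure falls inside [t + lead, t + lead + span]."""
    if failure_time is None:
        return 0
    start, end = t + LEAD, t + LEAD + SPAN
    return int(start <= failure_time <= end)

t = datetime(2024, 1, 1, 12, 0)
print(label_at(t, datetime(2024, 1, 3)))         # inside the window -> 1
print(label_at(t, datetime(2024, 1, 1, 12, 5)))  # within the lead time -> 0
```

Note that a failure occurring inside the lead time (less than 15 minutes after \( t \)) counts as negative, since a prediction made at \( t \) would leave no time to act on it.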
For evaluation, the lead time (\( \Delta t_l \)) is fixed at 15 minutes and the prediction window (\( \Delta t_p \)) at 7 days. However, participants are free to choose their own observation window and labeling methods for training.
To evaluate the performance of participant solutions, we use the \( F1 \) score, which balances both precision and recall. The \( F1 \) score is calculated as: \[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
Precision and Recall are defined as: \[ \text{Precision} = \frac{TP}{TP + FP} \] \[ \text{Recall} = \frac{TP}{TP + FN} \]
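These formulas translate directly into code. A small helper computing the metric from raw counts, consistent with the definitions above:

```python
# Direct implementation of the F1 metric defined above, from raw counts of
# true positives (TP), false positives (FP), and false negatives (FN).
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 8 correctly predicted failures, 2 false alarms, 4 missed failures
# gives precision 0.8 and recall 2/3, so F1 is about 0.727.
print(f1_score(tp=8, fp=2, fn=4))
```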
The terms used in the formulas are defined as follows: