This project quantifies the impact of silent data corruption on deep learning training.
Supercomputers have shown an unparalleled capacity to speed up deep learning (DL) training. In the coming era of exascale computing, high error rates are expected to be problematic for most HPC applications. However, the impact on emerging DL applications remains unclear given their stochastic nature. In this project, we focus on understanding the training phase of such applications in the presence of silent data corruption. We design and perform a quantification study with three representative applications by manually injecting silent data corruption (SDC) errors across the design space and comparing the training results with an error-free baseline. The results show that only 0.61–1.76% of SDCs cause training failures, and taking into account the SDC rate of modern hardware, the actual chance of a failure ranges from one in thousands to one in millions of runs. Over 75% of the SDCs that cause catastrophic errors produce a training-loss anomaly in the next iteration that can be easily detected. With our method and results, supercomputer designers can make a rational selection between error correction code (ECC) enabled hardware and ECC-free hardware, with or without error-aware DL frameworks, based on the training-failure rate they are willing to accept.
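The two mechanisms described above — injecting a single bit flip into a floating-point value and flagging a suspicious training-loss jump in the following iteration — can be sketched as follows. This is an illustrative sketch, not the project's actual injection harness; the function names and the spike threshold are assumptions for demonstration.

```python
import math
import struct

def flip_bit(value: float, bit: int) -> float:
    """Emulate a silent data corruption: flip one bit of a float32 value.

    The float is reinterpreted as a 32-bit integer, one bit is XOR-flipped,
    and the result is reinterpreted back as a float.
    """
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (corrupted,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return corrupted

def loss_spiked(prev_loss: float, curr_loss: float, factor: float = 10.0) -> bool:
    """Detect the kind of loss anomaly that follows a catastrophic SDC.

    NaN/Inf losses, or a jump beyond `factor` times the previous loss,
    are treated as suspicious. The factor of 10 is an illustrative choice.
    """
    if math.isnan(curr_loss) or math.isinf(curr_loss):
        return True
    return curr_loss > factor * max(prev_loss, 1e-12)
```

Flipping a high exponent bit tends to blow a weight up to a huge magnitude (or Inf), which is exactly the case a next-iteration loss check catches; flips in low mantissa bits perturb the value only slightly, which is why most SDCs leave stochastic training unaffected.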
Zhao Zhang Research Associate
Lei Huang Research Associate
Ruizhu Huang Research Associate
Weijia Xu Manager, Scalable Computational Intelligence
Base funding