Error Vulnerabilities and Fault Recovery in Deep-Learning Frameworks for Hardware Accelerators - Robotics Institute Carnegie Mellon University

Iljoo Baek, Zhihao Zhu, Sourav Panda, Nandha Kishore Srinivasan, Soheil Samii, and Ragunathan Raj Rajkumar
Conference Paper, Proceedings of IEEE 26th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA '20), August, 2020

Abstract

Hardware accelerators such as GP-GPUs, Tensor Cores, and Deep-Learning Accelerators (DLA) are increasingly being used in real-time settings such as autonomous vehicles (AVs). In such deployments, any software errors and process failures in hardware systems can lead to critical faults in AVs. Therefore, assessing and mitigating hardware accelerator faults are critical requirements for safety-critical systems. Past work on this subject focused on simulated and injected software and hardware faults to understand and analyze the behavior of the software stack and the entire system. However, programming errors and process failures caused when using software frameworks must also be considered. In this paper, we present experiments which show that widely used deep-learning frameworks are vulnerable to programming mistakes and errors. We first focus on memory-related programming errors caused by applications using deep-learning frameworks that facilitate high-performance inferencing. We next find that a reset to recover from any fault imposes significant time penalties in reloading a pre-trained deep neural network model. To reduce these fault recovery times, we propose fault recovery mechanisms that checkpoint and resume the network based on the inference stage when an error is detected. Finally, we substantiate the practical feasibility of our approach and evaluate the improvement in recovery times. (A demo video clip demonstrating our recovery algorithm has been uploaded to YouTube: https://www.youtube.com/watch?v=xwUYdJdA5oM.) We use a case study with real-world applications on an Nvidia GeForce GTX 1070 GPU and an Nvidia Xavier embedded platform, which is commonly used by multiple automotive OEMs.
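
The abstract does not spell out the recovery mechanism, so the following is only an illustrative sketch of the general idea of stage-wise checkpoint and resume, written in plain Python with no deep-learning framework or GPU involved. Every name here (`StageFault`, `run_with_checkpoints`, `make_flaky`, and the toy stages) is hypothetical and not taken from the paper. The sketch assumes that a fault is reported as an exception and that the output of the last completed stage can be kept in memory.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[List[float]], List[float]]


class StageFault(RuntimeError):
    """Hypothetical error type standing in for a detected accelerator fault."""


def run_with_checkpoints(
    stages: Sequence[Stage], x: List[float], max_retries: int = 3
) -> Tuple[List[float], int]:
    """Run stages in order; after a fault, retry only the failed stage.

    Returns the final output and the total number of stage executions.
    """
    checkpoint = list(x)  # output of the last stage that completed
    executions = 0
    retries = 0
    i = 0
    while i < len(stages):
        executions += 1
        try:
            out = stages[i](list(checkpoint))
        except StageFault:
            retries += 1
            if retries > max_retries:
                raise
            continue  # resume at stage i from the checkpoint, not from stage 0
        checkpoint = out
        retries = 0
        i += 1
    return checkpoint, executions


def make_flaky(stage: Stage, failures: int) -> Stage:
    """Wrap a stage so that its first `failures` calls raise StageFault."""
    remaining = [failures]

    def wrapped(v: List[float]) -> List[float]:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise StageFault("injected fault")
        return stage(v)

    return wrapped


# Toy stages standing in for groups of network layers.
def scale(v: List[float]) -> List[float]:
    return [2.0 * a for a in v]


def shift(v: List[float]) -> List[float]:
    return [a + 1.0 for a in v]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


if __name__ == "__main__":
    pipeline = [scale, make_flaky(shift, 1), relu]
    print(run_with_checkpoints(pipeline, [-1.0, 0.5]))
```

In this toy run the second stage fails once and is retried from the saved output of the first stage, so four stage executions are needed in total; restarting the whole pipeline after the fault would need five. A real deployment would also have to restore device state and model weights, which this sketch does not attempt.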

BibTeX

@conference{Baek-2020-126178,
author = {Iljoo Baek and Zhihao Zhu and Sourav Panda and Nandha Kishore Srinivasan and Soheil Samii and Ragunathan Raj Rajkumar},
title = {Error Vulnerabilities and Fault Recovery in Deep-Learning Frameworks for Hardware Accelerators},
booktitle = {Proceedings of IEEE 26th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA '20)},
year = {2020},
month = {August},
}