Improving redundant multithreading performance for soft-error detection in HPC applications

Loading...
Thumbnail Image

Date

Authors

Pérez-Arroyo, Diego Simón

Journal Title

Journal ISSN

Volume Title

Publisher

Instituto Tecnológico de Costa Rica

Abstract

As HPC systems move towards extreme scale, soft errors leading to silent data corruptions become a major concern. In this thesis, we propose a set of three optimizations to the classical Redundant Multithreading (RMT) approach to allow faster soft error detection. First, we leverage the use of Simultaneous Multithreading (SMT) to collocate sibling replicated threads on the same physical core to efficiently exchange data to expose errors. Some HPC applications cannot fully exploit SMT for performance improvement and instead, we propose to use these additional resources for fault tolerance. Second, we present variable aggregation to group several values together and use this merged value to speed up detection of soft errors. Third, we introduce selective checking to decrease the number of checked values to a minimum. The last two techniques reduce the overall performance overhead by relaxing the soft error detection scope. Our experimental evaluation, executed on recent multicore processors with representative HPC benchmarks, proves that the use of SMT for fault tolerance can enhance RMT performance. It also shows that, at constant computing power budget, with optimizations applied, the overhead of the technique can be significantly lower than the classical RMT replicated execution. Furthermore, these results show that RMT can be a viable solution for soft-error detection at extreme scale.

Description

Proyecto de Graduación (Maestría en Computación) Instituto Tecnológico de Costa Rica, Escuela de Ingeniería en Computación, 2018.

Citation

Endorsement

Review

Supplemented By

Referenced By