Improving redundant multithreading performance for soft-error detection in HPC applications
Abstract
As HPC systems move towards extreme scale, soft errors leading to silent data corruptions become
a major concern. In this thesis, we propose a set of three optimizations to the classical Redundant
Multithreading (RMT) approach to allow faster soft-error detection. First, we use
Simultaneous Multithreading (SMT) to collocate sibling replicated threads on the same physical
core so that they can exchange data efficiently to expose errors. Some HPC applications cannot fully exploit
SMT for performance improvement; we propose to use these additional resources instead
for fault tolerance. Second, we present variable aggregation to group several values together
and use this merged value to speed up detection of soft errors. Third, we introduce selective
checking to decrease the number of checked values to a minimum. The last two techniques reduce
the overall performance overhead by relaxing the soft error detection scope. Our experimental
evaluation, run on recent multicore processors with representative HPC benchmarks, shows
that using SMT for fault tolerance can improve RMT performance. It also shows that, at a
constant computing power budget and with the optimizations applied, the overhead of the technique can
be significantly lower than that of classical RMT replicated execution. Furthermore, these results
show that RMT can be a viable solution for soft-error detection at extreme scale.
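As a rough illustration of the variable aggregation idea described above, the following is a minimal sketch, not the thesis implementation. It assumes a toy kernel and a plain sum as the merge operator; the names `compute`, `aggregate`, `run_rmt`, `group`, and `fault_at` are hypothetical. A leading thread and a trailing thread run the same computation, and the trailing thread compares one aggregated value per group rather than every individual value. Pinning the two threads to SMT siblings of one physical core is not shown.

```python
import queue
import threading


def compute(chunk, offset, fault_at):
    """Toy kernel (hypothetical): one result per input value.

    If fault_at equals the global index of an element, its result is
    perturbed to simulate a soft error in that replica.
    """
    results = []
    for i, x in enumerate(chunk):
        y = x * x + 1.0
        if fault_at is not None and offset + i == fault_at:
            y += 1.0  # simulated silent data corruption
        results.append(y)
    return results


def aggregate(values):
    """Merge several values into one (assumption: a plain sum)."""
    total = 0.0
    for v in values:
        total += v
    return total


def run_rmt(data, group=4, fault_at=None):
    """Run two redundant threads; return indices of mismatching groups."""
    channel = queue.Queue()
    mismatches = []
    starts = range(0, len(data), group)

    def leading():
        # Produces one aggregated value per group of results.
        for start in starts:
            chunk = data[start:start + group]
            channel.put(aggregate(compute(chunk, start, fault_at)))

    def trailing():
        # Recomputes each group without the fault and checks the aggregate.
        for index, start in enumerate(starts):
            chunk = data[start:start + group]
            expected = aggregate(compute(chunk, start, None))
            if channel.get() != expected:
                mismatches.append(index)

    threads = [threading.Thread(target=leading),
               threading.Thread(target=trailing)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mismatches
```

Calling `run_rmt(data, group=4, fault_at=5)` on ten values would flag the second group only. The trade-off is the one the abstract mentions: the check reports which group is corrupted, not which value, and errors that cancel out inside a sum would go unnoticed, so the detection scope is relaxed in exchange for fewer comparisons and less data exchanged between threads.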
Description
Graduation Project (Master's in Computer Science), Instituto Tecnológico de Costa Rica, Escuela de Ingeniería en Computación, 2018.