Advisory Committee Chair
Purushotham V Bangalore
Advisory Committee Members
Anthony Skjellum
Peter M Pirkelbauer
George Bosilca
Jeffrey M Squyres
Steven J Bethard
Document Type
Dissertation
Date of Award
2016
Degree Name by School
Doctor of Philosophy (PhD) College of Arts and Sciences
Abstract
Failure rates rise in direct correlation with the scale of computing machines. High Performance Computing (HPC) applications provide fault-tolerance through redundancy in “resources.” However, the dominant underlying parallel programming models, such as the Message Passing Interface (MPI), are so sensitive to errors that a node failure aborts the entire application stack. Recent advances in application-level fault-tolerance require a fault-tolerant MPI implementation. Studies (including this one) have thoroughly investigated all previous approaches and determined their limitations. The purpose of this dissertation is to design a fault-tolerant MPI for data-parallel and Bulk Synchronous Parallel (BSP) applications that increases the probability of completion, with these goals: 1) relax MPI’s sensitivity to faults, 2) allow applications to exploit different recovery schemes, 3) support multiple independent and correlated fault models, and 4) provide trade-off flexibility between fault-free overhead, scalability, performance, and resilience. To achieve these goals, we first restrict MPI to nonblocking operations, a restriction made to achieve maximum scalability. Second, we discuss middle-out, fault-tolerant design requirements between MPI and its data transfer engines. Finally, we offer a design for Fault-Aware MPI (FA-MPI) that adds minimal nonblocking routines and semantics to create a transactional fault-tolerant MPI. Additionally, we provide example building-block algorithms such as reliable broadcast and consensus. This study shows that example applications incur two to three percent fault-free overhead and achieve acceptable performance under faults. In addition, our consensus and reliable broadcast algorithms scale logarithmically.
Recommended Citation
Hassani, Amin, "Toward a Scalable, Transactional, Fault-Tolerant Message Passing Interface for Petascale and Exascale Machines" (2016). All ETDs from UAB. 1887.
https://digitalcommons.library.uab.edu/etd-collection/1887