Toward a Scalable, Transactional, Fault-Tolerant Message Passing Interface for Petascale and Exascale Machines

Amin Hassani, University of Alabama at Birmingham

Advisory Committee Chair

Purushotham V Bangalore

Advisory Committee Members

Anthony Skjellum

Peter M Pirkelbauer

George Bosilca

Jeffrey M Squyres

Steven J Bethard

Document Type

Dissertation

Date of Award

2016

Degree Name by School

Doctor of Philosophy (PhD), College of Arts and Sciences

Abstract

The rate of failures rises directly with the scale of computing machines. High Performance Computing (HPC) applications provide fault tolerance through redundancy in “resources.” However, the dominant underlying parallel programming models, such as the Message Passing Interface (MPI), are so sensitive to errors that a node failure aborts the entire application stack. Recent advances in application-level fault tolerance require a fault-tolerant MPI implementation. Studies, including this one, have thoroughly investigated previous approaches and determined their limitations. The purpose of this dissertation is to design a fault-tolerant MPI for data-parallel and Bulk Synchronous Parallel (BSP) applications that increases the probability of completion, with these goals: 1) relax MPI’s sensitivity to faults, 2) allow applications to exploit different recovery schemes, 3) support multiple independent and correlated fault models, and 4) provide trade-off flexibility between fault-free overhead, scalability, performance, and resilience. To achieve these goals, we first restrict MPI to nonblocking operations in order to achieve maximum scalability. Second, we discuss middle-out, fault-tolerant design requirements between MPI and its data transfer engines. Finally, we offer a design for Fault-Aware MPI (FA-MPI) that adds minimal nonblocking routines and semantics to create a transactional, fault-tolerant MPI. Additionally, we provide example building-block algorithms such as reliable broadcast and consensus. This study shows that example applications incur two to three percent fault-free overhead and achieve acceptable performance under faults. In addition, our consensus and reliable broadcast algorithms scale logarithmically.

Recommended Citation

Hassani, Amin, "Toward a Scalable, Transactional, Fault-Tolerant Message Passing Interface for Petascale and Exascale Machines" (2016). All ETDs from UAB. 1887.
https://digitalcommons.library.uab.edu/etd-collection/1887

Included in

Arts and Humanities Commons
