All ETDs from UAB

Advisory Committee Chair

Purushotham V Bangalore

Advisory Committee Members

Anthony Skjellum

Peter M Pirkelbauer

George Bosilca

Jeffrey M Squyres

Steven J Bethard

Document Type

Dissertation

Date of Award

2016

Degree Name by School

Doctor of Philosophy (PhD) College of Arts and Sciences

Abstract

The rate of failures correlates directly with the scale of computing machines. High-performance computing (HPC) applications provide fault tolerance through redundancy in “resources.” However, the dominant underlying parallel programming models, such as the Message Passing Interface (MPI), are so sensitive to even the slightest errors that a single node failure aborts the entire application stack. Recent advances in application-level fault tolerance require a fault-tolerant MPI implementation. Studies (including this one) have thoroughly investigated all previous approaches and determined their limitations. The purpose of this dissertation is to design a fault-tolerant MPI for data-parallel and Bulk Synchronous Parallel (BSP) applications that increases the probability of completion, with four goals: 1) relax MPI’s sensitivity to faults, 2) allow applications to exploit different recovery schemes, 3) support multiple independent and correlated fault models, and 4) provide trade-off flexibility between fault-free overhead, scalability, performance, and resilience. To achieve these goals, we first restrict MPI to nonblocking operations in order to maximize scalability. Second, we discuss middle-out, fault-tolerant design requirements between MPI and its data transfer engines. Finally, we offer a design for Fault-Aware MPI (FA-MPI) that adds a minimal set of nonblocking routines and semantics to create a transactional, fault-tolerant MPI. Additionally, we provide example building-block algorithms such as reliable broadcast and consensus. This study shows that example applications incur two to three percent fault-free overhead and achieve acceptable performance under faults. In addition, our consensus and reliable broadcast algorithms scale logarithmically.
