A joint project of the Graduate School, Peabody College, and the Jean & Alexander Heard Library

Title page for ETD etd-03232009-133814


Type of Document Dissertation
Author Dubey, Abhishek
URN etd-03232009-133814
Title Using Model-Based Techniques for Improving Performance and Reliability in High Performance Scientific Computing
Degree PhD
Department Electrical Engineering
Advisory Committee
  Name                  Title
  Gabor Karsai          Committee Chair
  Paul Sheldon          Committee Member
  Sandeep Neema         Committee Member
  Sherif Abdelwahed     Committee Member
  Theodore Bapty        Committee Member
Keywords
  • scientific computing
  • software health management
  • autonomic computing
  • Electronic data processing -- Distributed processing -- Reliability
  • Computer software -- Reliability
  • Computer system failures -- Prevention
  • Fault-tolerant computing
Date of Defense 2009-03-13
Availability unrestricted
Abstract
Data processing in scientific and workflow-oriented computing is carried out as analysis campaigns, which consist of an input dataset and a set of interdependent jobs. Traditionally, these massively parallel computations required the services of supercomputers. However, recent trends show that the share of scientific computing carried out on clusters of commodity computers is on the rise. Commodity computers yield the highest performance per dollar but exhibit intermittent faults, which can result in systemic failures when operated over long continuous periods to execute analysis campaigns. Diagnosing job problems and failures in this complex environment is difficult, especially when the success of a campaign can be affected by even a single job failure. Manual administration, though essential, is slow to respond to intermittent faults. Therefore, an autonomic approach is required to ensure that cluster resources are used to the fullest possible extent and that jobs remain reliable, even in the presence of hardware and software failures.

Model-based design is a formal system design methodology that has gained momentum in recent years as a sound approach to applying computer-based modeling and synthesis methods to a variety of problem domains, including distributed systems. A benefit of using formal models is that they can be queried or transformed to produce a variety of domain-specific artifacts that are critical to the deployment and execution of the system, but tedious and error-prone to produce manually.

This dissertation presents the design, and discusses the applicability, of a model-based cluster management framework called the Scientific Computing Autonomic Reliability Framework (SCARF). The basic components of this framework are distributed monitoring units, fault-mitigation units, and a workflow-management system for dealing with workflow-specific concerns in case of failures.

Model-based techniques are used to capture workflow specifications, along with pre- and post-conditions and invariants for checking the validity of the system state during execution. Formal data models are used to provide provenance and execution tracking of workflow jobs. Health monitoring is provided by synchronized, lightweight, distributed sensors that are augmented with a real-time fault-mitigation framework. This framework consists of hierarchical fault-management entities called reflex engines, which use a timed-automaton-based abstraction for capturing failure-management strategies. These engines track the state of components under their management zone and initiate reflexive mitigation actions upon the occurrence of certain events or timeouts. The mitigation framework is verified against properties written in timed computation tree logic (TCTL).
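The reflex-engine idea described above can be illustrated with a minimal sketch of a timed automaton: discrete states, transitions driven by monitoring events, and a clock guard that fires a mitigation action on timeout. All names here (states, events, the restart action) are illustrative assumptions, not identifiers from SCARF itself.

```python
import time

class ReflexEngine:
    """Toy reflex engine for one managed component (illustrative only).

    States: HEALTHY -> SUSPECT (on a missed heartbeat) -> MITIGATING
    (when the SUSPECT clock exceeds the timeout guard).
    """
    HEALTHY, SUSPECT, MITIGATING = "HEALTHY", "SUSPECT", "MITIGATING"

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.state = self.HEALTHY
        self.timeout_s = timeout_s
        self.clock = clock          # injectable clock for deterministic testing
        self.suspect_since = None   # value of the automaton clock at SUSPECT entry
        self.actions = []           # reflexive mitigation actions issued so far

    def on_event(self, event):
        """Consume a monitoring event and take a discrete transition."""
        if event == "heartbeat_missed" and self.state == self.HEALTHY:
            self.state = self.SUSPECT
            self.suspect_since = self.clock()   # reset the clock on entry
        elif event == "heartbeat_ok":
            self.state = self.HEALTHY
            self.suspect_since = None

    def tick(self):
        """Evaluate timed guards; fire a mitigation action on timeout."""
        if (self.state == self.SUSPECT
                and self.clock() - self.suspect_since >= self.timeout_s):
            self.state = self.MITIGATING
            self.actions.append("restart_component")

# Usage with a fake clock so the timeout transition is deterministic:
now = [0.0]
engine = ReflexEngine(timeout_s=5.0, clock=lambda: now[0])
engine.on_event("heartbeat_missed")   # HEALTHY -> SUSPECT at t = 0
now[0] = 6.0
engine.tick()                         # clock guard (6.0 >= 5.0) fires
print(engine.state, engine.actions)   # MITIGATING ['restart_component']
```

In the dissertation's framework such engines are arranged hierarchically, each responsible for the components in its management zone; the sketch above shows only a single engine and a single timed guard.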

Files
  Filename                        Size      Approximate Download Time (Hours:Minutes:Seconds)
                                            28.8 Modem  56K Modem  ISDN (64 Kb)  ISDN (128 Kb)  Higher-speed Access
  Abhishek_DissertationFinal.pdf  19.77 MB  01:31:31    00:47:04   00:41:11      00:20:35       00:01:45
