A joint project of the Graduate School, Peabody College, and the Jean & Alexander Heard Library

Title page for ETD etd-07172015-094917

Type of Document Dissertation
Author Yan, Wei
Author's Email Address ywskycn@gmail.com
URN etd-07172015-094917
Title A Data-Driven Approach to Optimal Resource Management for Large-Scale Data Processing Platforms
Degree PhD
Department Computer Science
Advisory Committee
  Advisor Name            Title
  Yuan Xue                Committee Chair
  Amr A. Awadallah        Committee Member
  Aniruddha S. Gokhale    Committee Member
  Bradley A. Malin        Committee Member
  Douglas C. Schmidt      Committee Member
Keywords
  • Resource management
  • large-scale
  • data processing
  • data profiling
  • MapReduce/Hadoop
Date of Defense 2015-05-19
Availability unrestricted
Abstract
In the era of “Big Data”, a variety of frameworks (such as MapReduce/Hadoop, Dremel/Impala, and Storm) have emerged to support large-scale data processing and analysis tasks. The computing tasks from these frameworks are usually deployed and executed over shared computing infrastructures. Resource management of this shared infrastructure plays a central role in consolidating the different resource needs of these jobs, satisfying their individual performance requirements while ensuring fairness among jobs. Yet designing and implementing a scalable resource management solution for large-scale data processing platforms remains an open challenge. First, the workload of a data processing job depends heavily on its input data: not only on the data size but, more importantly, on the internal data structure and semantics, which are usually unknown a priori. Second, data processing jobs are highly diverse in their performance requirements.

To address these challenges, this dissertation proposes a data-driven optimal resource management mechanism for large-scale data processing platforms. The proposed approach integrates efficient data profiling with resource management. Using knowledge of the job workload obtained through data profiling, the mechanism makes informed scheduling and resource allocation decisions through an optimization framework.

This dissertation makes the following contributions. First, it presents an optimization-based resource management approach for the prevalent MapReduce/Hadoop data processing framework. The performance objective of a MapReduce job is captured by its job completion time, which is determined by the longest-running reducer task. To capture the data distribution statistics, a scalable data profiling structure is designed and integrated with the MapReduce framework. Based on the data profiles, a novel key assignment mechanism assigns appropriate workloads to reducers, minimizing load skew and thus optimizing the performance of a MapReduce job.
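As a rough illustration of why profiled key statistics help with reducer load skew, the sketch below assigns keys to reducers using the standard longest-processing-time-first greedy heuristic. This is a swapped-in textbook technique, not the dissertation's key assignment mechanism, and the names `assign_keys`, `key_loads`, and the sample profile are hypothetical.

```python
import heapq


def assign_keys(key_loads, num_reducers):
    """Greedy (LPT) key-to-reducer assignment; an illustrative assumption only.

    key_loads: dict mapping each key to its profiled load (e.g. record count).
    Returns (assignment, loads): key -> reducer index, and per-reducer totals.
    """
    # Min-heap of (current load, reducer index): the lightest reducer pops first.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}
    # Place the heaviest keys first so that small keys can even out the totals.
    for key, load in sorted(key_loads.items(), key=lambda kv: kv[1], reverse=True):
        total, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))

    loads = [0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += key_loads[key]
    return assignment, loads


# Hypothetical profile: key -> number of records observed for that key.
profile = {"a": 9, "b": 7, "c": 6, "d": 5, "e": 3}
assignment, loads = assign_keys(profile, num_reducers=2)
# max(loads) approximates the job completion time, since the slowest
# reducer determines when the job finishes.
print(assignment, loads, max(loads))
```

The heuristic is not optimal in general (here a perfectly balanced split exists but is not found), which is one reason an optimization-based formulation can be attractive.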

Second, it presents an optimal resource allocation solution for large-scale interactive data query systems (e.g., Dremel/Impala) using a utility-based optimization framework. The objective is to optimize cluster resource utilization while maximizing the aggregate utility. By profiling the resource consumption of each query, a price-based algorithm efficiently allocates resources across multiple concurrent queries. The utility-based framework allows different fairness criteria (e.g., weighted proportional fairness and max-min fairness) to be expressed through the choice of utility function.
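To make the idea of price-based allocation concrete, here is a minimal sketch for a single shared resource under weighted proportional fairness, where each query i has utility w_i * log(x_i). It uses the textbook dual (price) update and is an assumption for illustration, not the dissertation's algorithm; the function name, step size, iteration count, and sample weights are all hypothetical.

```python
def price_based_allocation(weights, capacity, step=0.01, iterations=500,
                           initial_price=1.0):
    """Illustrative dual-price iteration (an assumption, not the author's method).

    Maximizes sum(w_i * log(x_i)) subject to sum(x_i) <= capacity.
    At a given price p, each query's best response is x_i = w_i / p.
    """
    price = initial_price
    for _ in range(iterations):
        demands = [w / price for w in weights]
        excess = sum(demands) - capacity
        # Raise the price when demand exceeds capacity, lower it otherwise.
        # The floor keeps the price strictly positive.
        price = max(1e-9, price + step * excess)
    return price, [w / price for w in weights]


# Hypothetical example: three concurrent queries with weights 1, 2, 3
# sharing 12 units of a resource.
price, allocation = price_based_allocation([1.0, 2.0, 3.0], capacity=12.0)
print(price, allocation)
```

For this utility, the price settles where total demand equals capacity, so each query receives a share proportional to its weight. The step size here was chosen by hand for the sample numbers; a different capacity or weight scale may need a different step to converge.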

Filename   Size      Approximate Download Time (Hours:Minutes:Seconds)
                     28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access
Yan.pdf    1.63 Mb   00:07:33     00:03:53    00:03:24       00:01:42        00:00:08
