Type of Document: Dissertation
Author: Yan, Wei
Author's Email Address: email@example.com
URN: etd-07172015-094917
Title: A Data-Driven Approach to Optimal Resource Management for Large-Scale Data Processing Platforms
Degree: PhD
Department: Computer Science
Advisory Committee (Advisor Name, Title):
- Yuan Xue, Committee Chair
- Amr A. Awadallah, Committee Member
- Aniruddha S. Gokhale, Committee Member
- Bradley A. Malin, Committee Member
- Douglas C. Schmidt, Committee Member
Keywords:
- Resource management
- data processing
- data profiling
Date of Defense: 2015-05-19
Availability: unrestricted

Abstract

In the era of “Big Data”, a variety of data processing and analysis frameworks (such as MapReduce/Hadoop, Dremel/Impala, and Storm) have emerged as solutions to support large-scale data processing and analysis tasks. The computing tasks from these frameworks are usually deployed and executed over shared computing infrastructures. Resource management of this shared infrastructure plays a central role in consolidating the different resource needs of these jobs, satisfying their individual performance requirements while ensuring fairness among jobs. Yet designing and implementing a scalable resource management solution for large-scale data processing platforms remains an open challenge. First, the workload of a data processing job depends greatly on the input data: not only the data size but, more importantly, the internal data structure and semantics, which are usually unknown a priori. Second, data processing jobs are highly diverse in their performance requirements.
To address these challenges, this dissertation proposes a data-driven optimal resource management mechanism for large-scale data processing platforms. The proposed approach integrates efficient data profiling with resource management. Based on the knowledge of the job workload through data profiling, the proposed resource management mechanism makes informed scheduling and resource allocation decisions through an optimization framework.
This dissertation makes the following contributions: First, it presents an optimization-based resource management approach for the prevalent MapReduce/Hadoop data processing framework. The performance objective of a MapReduce job is captured by its job completion time, which is determined by the longest reducer task. To capture the data distribution statistics, a scalable data profiling structure is designed and integrated with the MapReduce framework. Based on the data profiles, a novel key assignment mechanism assigns appropriate workloads to reducers to minimize the load skew and thus optimize the performance of a MapReduce job.
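To make the load-skew idea concrete, the following is a minimal sketch, not the dissertation's mechanism. It assumes the data profile is a simple per-key record count and swaps in the standard greedy longest-processing-time (LPT) heuristic for the key assignment step. The names `profile_keys`, `assign_keys`, and `reducer_loads` are hypothetical and chosen only for illustration.

```python
import heapq
from collections import Counter


def profile_keys(records):
    """Count records per key; a toy stand-in for a data profile (assumption)."""
    return Counter(key for key, _ in records)


def assign_keys(profile, num_reducers):
    """Greedy LPT: give each key, heaviest first, to the least-loaded reducer."""
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending weight; break ties by key for a deterministic result.
    for key, weight in sorted(profile.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + weight, reducer))
    return assignment


def reducer_loads(profile, assignment, num_reducers):
    """Total record count each reducer would process under the assignment."""
    loads = [0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += profile[key]
    return loads


# Skewed toy input: key "a" appears 6 times, "b" 5, "c" 4, "d" 3, "e" 2.
records = [("a", 1)] * 6 + [("b", 1)] * 5 + [("c", 1)] * 4 + [("d", 1)] * 3 + [("e", 1)] * 2
profile = profile_keys(records)
assignment = assign_keys(profile, 2)
loads = reducer_loads(profile, assignment, 2)
print(loads, "longest reducer:", max(loads))
```

On this toy input the heuristic yields reducer loads of 11 and 9, whereas a perfect split would be 10 and 10. The gap illustrates why the longest reducer, rather than the average load, is the quantity to minimize, and why a greedy rule is only an approximation.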
Second, it presents an optimal resource allocation solution for large-scale interactive data query systems (e.g., Dremel/Impala) using a utility-based optimization framework. The objective is to optimize cluster resource utilization while maximizing the aggregate utility. By profiling the resource consumption of each query, a price-based algorithm allocates resources efficiently across multiple concurrent queries. The utility-based framework allows different fairness criteria (e.g., weighted proportional fairness and max-min fairness) to be expressed through the choice of utility function.
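As an illustration of how a price-based scheme can realize weighted proportional fairness, here is a minimal sketch under strong simplifying assumptions that the abstract does not confirm: a single shared resource of fixed capacity, and a utility of the form w_i * log(x_i) for each query. Under these assumptions each query's best response to a price p is the demand w_i / p, and the price is adjusted in proportion to excess demand. The function name `price_based_allocation` and all parameter values are hypothetical.

```python
def price_based_allocation(weights, capacity, step=0.01, iterations=500, price=1.0):
    """Dual-gradient sketch for maximizing sum(w_i * log(x_i)) with sum(x_i) <= capacity.

    Assumption: one resource and log utilities (weighted proportional fairness).
    """
    for _ in range(iterations):
        # Each query picks the demand that maximizes w * log(x) - price * x.
        demands = [w / price for w in weights]
        # Raise the price when demand exceeds capacity, lower it otherwise.
        excess = sum(demands) - capacity
        price = max(1e-9, price + step * excess)
    return [w / price for w in weights], price


# Three concurrent queries with weights 1, 2, 3 sharing 12 units of a resource.
allocations, price = price_based_allocation([1.0, 2.0, 3.0], capacity=12.0)
print([round(x, 3) for x in allocations], round(price, 3))
```

In this setting the price settles at the sum of the weights divided by the capacity (here 0.5), so each query receives a share proportional to its weight (2, 4, and 6 units). A different utility function would change the demand rule and hence the fairness criterion; max-min fairness, for example, is not captured by this log-utility sketch.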
Filename: Yan.pdf
Size: 1.63 Mb
Approximate Download Time (Hours:Minutes:Seconds):
- 28.8 Modem: 00:07:33
- 56K Modem: 00:03:53
- ISDN (64 Kb): 00:03:24
- ISDN (128 Kb): 00:01:42
- Higher-speed Access: 00:00:08