Optimizing Data Locality in Analytic Workloads over Distributed Computing Environments
With the explosion of data generated every second, there is a growing need for big data analytics on scalable systems and platforms for exploration, mining, and decision making. To gain better business insights, business users want to integrate different kinds of analytics, which may involve accessing the same data for different purposes. Modern data-intensive systems place computation as close as possible to the data to achieve greater efficiency. This placement of computation close to the data is called data locality. Data locality has a significant impact on the performance of jobs in a large cluster, since higher data locality means less data is transferred over the network. In this work, we examine data locality in parallel processing frameworks and propose approaches to optimize it. First, we conduct a literature review of existing systems that maximize data locality while processing big data analytics workflows. Second, we present the YARN Locality Simulator (YLocSim), a tool that simulates the interactions between YARN components in a real cluster and reports the data locality percentages. This tool gives users better insight into the expected performance of the computing cluster. Third, we develop the YARN Dynamic Replication Manager (YDRM), a new component in YARN that interacts with the existing YARN Resource Manager to improve data locality.
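To make the notion of a data locality percentage concrete, the following is a minimal sketch in Python. It is not YLocSim or YDRM code; the function name, block IDs, and node names are all hypothetical, and it assumes the simplest definition: a task counts as local when it runs on a node that holds a replica of its input block.

```python
from typing import Dict, List, Set, Tuple


def locality_percentage(
    assignments: List[Tuple[str, str]],
    replicas: Dict[str, Set[str]],
) -> float:
    """Percentage of tasks that ran on a node holding a replica of their input block.

    assignments: (input block ID, node the task ran on) pairs (hypothetical format).
    replicas: block ID -> set of nodes storing a replica of that block.
    """
    if not assignments:
        return 0.0
    local = sum(
        1 for block, node in assignments if node in replicas.get(block, set())
    )
    return 100.0 * local / len(assignments)


# Hypothetical replica placement for four blocks on three nodes.
replicas = {
    "b1": {"n1", "n2"},
    "b2": {"n2", "n3"},
    "b3": {"n1", "n3"},
    "b4": {"n3"},
}
# Hypothetical scheduling decisions: the task reading b2 ran on n1, which has no replica.
assignments = [("b1", "n1"), ("b2", "n1"), ("b3", "n3"), ("b4", "n3")]

before = locality_percentage(assignments, replicas)

# Adding a replica of b2 on n1 makes that task local as well.
replicas["b2"].add("n1")
after = locality_percentage(assignments, replicas)

print(before, after)
```

In this toy example three of the four tasks are local at first, and adding one replica where a task is already running makes all four local. This only illustrates why replica placement affects locality; how replicas are actually chosen in this work is not described here.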
URI for this record: http://hdl.handle.net/1974/15890