Achieving Consumable Big Data Analytics by Distributing Data Mining Algorithms

Authors

Khalifa, Shady

Type

thesis

Language

eng

Keyword

Big Data, Analytics, Data Mining, Distributed, Drill, Machine Learning, Classifier Ensembles, Consumable Analytics, Query Language, Weka

Abstract

Businesses see Big Data as an opportunity to gain insights that improve their services. Deriving such insights requires a range of data mining techniques. Mature data mining tools such as WEKA and R have been in development for years; they implement a large number of data mining algorithms and support sophisticated Analytics. However, these tools are designed to run on a single machine, which makes them unsuitable for Big Data. Using them also requires knowledge of data mining and statistics, and some of them, like R, are hard to learn. Businesses do not always have the technical skills required to carry out such Analytics. Even when they do, it is challenging to find a tool that offers the needed algorithms and also supports distributed processing to handle the high arrival velocity and large volumes of Big Data. These analytical requirements can be addressed by Consumable Big Data Analytics, that is, solutions that allow businesses to perform Big Data Analytics themselves using their in-house expertise. In this work, we provide a Consumable Analytics solution to meet businesses' analytical needs. First, we survey existing Analytics solutions to identify areas where they can be improved to provide Consumable Analytics. Second, instead of developing distributed data mining algorithms to handle Big Data, we develop the Data Mining Distribution (DMD) algorithm and the Label-Aware Disjoint Partitioning (LADP) algorithm to distribute the execution of all existing single-machine data mining algorithms without rewriting a single line of their code. This gives users the flexibility to use any available data mining library, lets algorithms like Hoeffding Tree run 70% to 95% faster, and achieves up to an 18% increase in prediction accuracy. Third, we develop the free and open source QDrill solution, which implements our DMD and LADP algorithms for distributed Analytics.
QDrill implements our proposed Distributed Analytics Query Language (DAQL) interface, which adds Analytics capabilities to regular SQL syntax and allows integration with Business Intelligence (BI) tools. Businesses can thus use their in-house expertise to perform Big Data Analytics through the spreadsheets and visualizations of their BI tools.
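
The abstract does not describe how DMD and LADP work internally, so the following is only a rough illustration of the general idea it names: partition labeled data into disjoint pieces, train an unmodified single-machine learner on each piece, and combine the resulting models as a classifier ensemble. It is not the thesis's algorithm. The functions `label_aware_partition`, `train_nearest_centroid`, and `train_ensemble` are hypothetical names, the round-robin dealing of each label's rows and the majority vote are assumptions, and the nearest-centroid learner merely stands in for any existing library classifier.

```python
from collections import Counter, defaultdict
from typing import Callable, Hashable, List, Sequence, Tuple

Row = Tuple[Sequence[float], Hashable]      # (features, label)
Model = Callable[[Sequence[float]], Hashable]


def label_aware_partition(rows: List[Row], k: int) -> List[List[Row]]:
    """Split rows into k disjoint partitions (assumed scheme, not the thesis's).

    Rows are grouped by label and dealt round-robin, so every row lands in
    exactly one partition and each label is spread across the partitions.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    parts: List[List[Row]] = [[] for _ in range(k)]
    i = 0
    for label_rows in by_label.values():
        for row in label_rows:
            parts[i % k].append(row)
            i += 1
    return parts


def train_nearest_centroid(rows: List[Row]) -> Model:
    """Stand-in for an unmodified single-machine learner."""
    sums, counts = {}, Counter()
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x: Sequence[float]) -> Hashable:
        # Pick the label whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def train_ensemble(rows: List[Row], k: int,
                   train: Callable[[List[Row]], Model]) -> Model:
    """Train one model per partition and combine them by majority vote."""
    # In a cluster, each train() call would run on a different node;
    # here they simply run one after another.
    models = [train(part) for part in label_aware_partition(rows, k) if part]

    def predict(x: Sequence[float]) -> Hashable:
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


rows = [
    ([0.0, 0.1], "a"), ([0.2, 0.0], "a"), ([0.1, 0.2], "a"), ([0.0, 0.0], "a"),
    ([5.0, 5.1], "b"), ([5.2, 4.9], "b"), ([4.8, 5.0], "b"), ([5.1, 5.1], "b"),
]
ensemble = train_ensemble(rows, 2, train_nearest_centroid)
print(ensemble([0.1, 0.1]), ensemble([5.0, 5.0]))
```

The learner is passed in as a plain function, which mirrors the abstract's point that existing algorithms are reused without changing their code; only the partitioning and the vote sit outside the learner.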

License

Attribution-ShareAlike 3.0 United States
Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
