Achieving Consumable Big Data Analytics by Distributing Data Mining Algorithms
Abstract
Businesses see Big Data as an opportunity to gain insights that can improve their services. Deriving such insights requires a range of data mining techniques. Mature data mining tools such as WEKA and R have been in development for years; they implement a large number of data mining algorithms and support sophisticated Analytics. However, these tools are designed to run on a single machine, which makes them unsuitable for Big Data. Using them also requires data mining and statistics knowledge, and some of them, like R, are hard to learn.
Businesses do not always have the technical skills required to carry out such Analytics. Even when they do, it is challenging to find a tool that offers the needed algorithms and supports distributed processing to handle the high arrival velocity and large volumes of Big Data. Businesses’ analytical requirements can be addressed by Consumable Big Data Analytics, that is, solutions that allow businesses to perform Big Data Analytics themselves using their in-house expertise.
In this work, we provide a Consumable Analytics solution to meet businesses’ analytical needs. First, we survey existing Analytics solutions to identify areas where they can be improved to provide Consumable Analytics. Second, instead of developing distributed data mining algorithms to handle Big Data, we develop the Data Mining Distribution (DMD) algorithm and the Label-Aware Disjoint Partitioning (LADP) algorithm, which distribute the execution of all existing single-machine data mining algorithms without rewriting a single line of their code. This gives users the flexibility to use any available data mining library, lets algorithms like Hoeffding Tree run 70% to 95% faster, and achieves up to an 18% increase in prediction accuracy. Third, we develop the free and open-source QDrill solution, which implements our DMD and LADP algorithms for distributed Analytics. QDrill also implements our proposed Distributed Analytics Query Language (DAQL) interface, which adds Analytics capabilities to the regular SQL syntax and allows integration with Business Intelligence (BI) tools. Businesses can therefore use their in-house expertise to do Big Data Analytics with the spreadsheets and visualizations of their BI tools.
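The abstract does not spell out how DMD or LADP work, so the following is only a minimal sketch of the general idea it describes: split labelled data into disjoint partitions that each keep a similar label mix, run an unmodified single-machine learner on every partition in parallel, and combine the resulting models. All names here (`label_aware_partition`, `train_centroids`, `distributed_fit`) are hypothetical, the round-robin split and majority vote are assumptions made for illustration, and the nearest-centroid learner is just a stand-in for any existing library algorithm.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def label_aware_partition(records, k):
    """Split (features, label) records into k disjoint partitions.

    Assumption for illustration: each label's records are dealt
    round-robin, so every partition sees roughly the same label mix.
    """
    by_label = defaultdict(list)
    for features, label in records:
        by_label[label].append((features, label))
    partitions = [[] for _ in range(k)]
    for label_records in by_label.values():
        for i, record in enumerate(label_records):
            partitions[i % k].append(record)
    return partitions


def train_centroids(partition):
    """Stand-in for an unmodified single-machine learner.

    Builds a nearest-centroid classifier and returns its predict function.
    """
    sums, counts = {}, Counter()
    for features, label in partition:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] += 1
    centroids = {lbl: [s / counts[lbl] for s in acc] for lbl, acc in sums.items()}

    def predict(features):
        def squared_distance(lbl):
            return sum((a - b) ** 2 for a, b in zip(features, centroids[lbl]))
        return min(centroids, key=squared_distance)

    return predict


def distributed_fit(records, k, train=train_centroids):
    """Train one model per non-empty partition in parallel, then vote."""
    partitions = [p for p in label_aware_partition(records, k) if p]
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        models = list(pool.map(train, partitions))

    def predict(features):
        votes = Counter(model(features) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Tiny synthetic data set: label "a" near (0, 0), label "b" near (10, 10).
records = [((i * 0.1, i * 0.1), "a") for i in range(6)]
records += [((10 + i * 0.1, 10 - i * 0.1), "b") for i in range(6)]
model = distributed_fit(records, k=3)
```

The learner is passed in as a plain function and never modified, which mirrors the abstract's point that existing algorithms are reused as they are; a real deployment would replace the thread pool with worker nodes.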
URI for this record
http://hdl.handle.net/1974/15460