BINARY: A Framework for Big Data Integration for Ad-Hoc Querying
Traditional relational database systems are not practical for big data workloads that require scalable architectures for efficient data storage, manipulation, and analysis. Apache Hadoop, one such big data framework, provides distributed storage and processing as well as a central repository for different types of data from different sources. Data from various sources often must be integrated before analytics can be performed. Apache Hive on Hadoop is widely used for this purpose, as well as for data summarization and analysis. It offers a SQL-like query language, a Metastore that holds metadata, and file formats that support access to various frameworks on Hadoop and beyond. For comprehensive analysis and decision-making, however, a hybrid system is required that integrates Hadoop with traditional relational database management systems, so that the valuable data stored in relational databases can be accessed. Current hybrid systems are either expensive proprietary products or must be developed by the user, which requires programming knowledge. In addition, these approaches are not sufficiently flexible to be applied to other frameworks. In this thesis, we propose a framework called BINARY (a framework for Big data INtegration for Ad-hoc queRYing). BINARY is a hybrid Software as a Service that provides a web interface, supported by a back-end infrastructure, for ad-hoc querying, accessing, visualizing, and joining data from different data sources, including relational database management systems and Apache Hive. The framework uses the scalable Hive and HDFS big data storage systems and supports different data sources via back-end resource adapters. A front-end web interface enables the use of HiveQL to query the data sources. The framework is extensible: other storage engines (e.g., HBase) and analytics engines (e.g., R) can be added as needed.
We used a REST software architecture to loosely couple the engines and the user interface programs, so that each can be updated independently without affecting the data infrastructure. Our approach is validated with a proof-of-concept prototype implemented on the OpenStack cloud system.
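The idea of back-end resource adapters that expose different data sources behind one interface, so their results can be joined, can be illustrated with a minimal Python sketch. This is not the thesis implementation: the names `ResourceAdapter`, `SqliteAdapter`, `StaticAdapter`, and `join_rows` are hypothetical, an in-memory SQLite database stands in for a relational database, and a fixed list of rows stands in for a Hive query result.

```python
import sqlite3
from abc import ABC, abstractmethod


class ResourceAdapter(ABC):
    """Hypothetical common interface implemented by every back-end data source."""

    @abstractmethod
    def query(self, statement):
        """Run a query and return the result rows as a list of dictionaries."""


class SqliteAdapter(ResourceAdapter):
    """Stand-in for a relational database adapter (assumption: in-memory SQLite)."""

    def __init__(self, connection):
        self.connection = connection
        # sqlite3.Row lets each result row be converted to a dict by column name.
        self.connection.row_factory = sqlite3.Row

    def query(self, statement):
        return [dict(row) for row in self.connection.execute(statement)]


class StaticAdapter(ResourceAdapter):
    """Stand-in for a Hive adapter: ignores the statement and returns fixed rows."""

    def __init__(self, rows):
        self.rows = rows

    def query(self, statement):
        return [dict(row) for row in self.rows]


def join_rows(left, right, key):
    """Inner hash join of two row lists on a shared column name."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Illustrative data only; table and column names are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])

rdbms = SqliteAdapter(conn)
hive = StaticAdapter([
    {"customer_id": 1, "clicks": 42},
    {"customer_id": 1, "clicks": 7},
    {"customer_id": 3, "clicks": 5},
])

joined = join_rows(
    rdbms.query("SELECT customer_id, name FROM customers ORDER BY customer_id"),
    hive.query("SELECT customer_id, clicks FROM web_logs"),
    "customer_id",
)
print(joined)
```

Because callers depend only on the `query` method, supporting another storage engine in this sketch would mean adding one more adapter class while leaving the join logic unchanged.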