Dynamic Large-Scale Graph Processing over Data Streams with Community Detection as a Case Study

Loading...
Thumbnail Image

Authors

Abughofa, Mhd

Date

Type

thesis

Language

eng

Keyword

Big Data , Analytics , Graph Processing , Community Detection , Spark , Stream Processing , Real-time Processing

Research Projects

Organizational Units

Journal Issue

Alternative Title

Abstract

Processing large graphs provides invaluable insights for the industry and research alike. The applications range from e-commerce, web, and social networking to analyzing gene expressions and cellular signaling. While numerous graph processing solutions have been developed with the capability to process graphs at the scale of a trillion edges, the ability to maintain and process a real-time graph still far from being handled. Processing data streams in real-time requires the graph to change over time which introduces several new challenges. First, the graph needs to be updated from the data stream efficiently. At the same time, applying these changes should not add an unacceptable overhead to graph queries. In addition, these changes need to be reflected in the new analytical insights, otherwise the value of the insights degrades with time. In this work, we investigated the problem of dynamic graph processing over data streams. We started by studying the feasibility of maintaining a dynamic graph on top of Apache Spark, a data processing engine. The chosen solutions included RDDs, IndexedRDDs, and Redis. Results from our experimental indicated that Redis performed the best, and thus we concluded that storing the graph in an external big data store besides Spark is the best approach in terms of performance and practicality. After that, we designed and developed Sprouter, a streaming data analytics framework which utilizes Apache Spark for dynamic graph processing. The framework enables storing enormous graph data, allows real-time graph updates, and supports efficient complex analytics and OLTP queries. Experiments showed that the system is able to support the above mentioned properties and update graphs with up to 100 million edges in under 50 seconds in a moderate underlying cluster. Finally, we selected community detection as a case study of incremental graph analytics with Sprouter. We proposed the Incremental Distributed Weighted Community Clustering (IDWCC), a novel algorithm that optimizes the Weighted Community Clustering metric to detect communities in unweighted undirected node-grained dynamic graphs. We validated the algorithm against the static centralized and distributed versions of WCC optimization. The experiments showed that the performance of IDWCC surpasses the static distributed version by up to three times while maintaining the same accuracy or better.

Description

Citation

Publisher

License

Attribution-NonCommercial-ShareAlike 3.0 United States
Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.

Journal

Volume

Issue

PubMed ID

External DOI

ISSN

EISSN