Dynamic Large-Scale Graph Processing over Data Streams with Community Detection as a Case Study

Thumbnail Image
Abughofa, Mhd
Big Data , Analytics , Graph Processing , Community Detection , Spark , Stream Processing , Real-time Processing
Processing large graphs provides invaluable insights for the industry and research alike. The applications range from e-commerce, web, and social networking to analyzing gene expressions and cellular signaling. While numerous graph processing solutions have been developed with the capability to process graphs at the scale of a trillion edges, the ability to maintain and process a real-time graph still far from being handled. Processing data streams in real-time requires the graph to change over time which introduces several new challenges. First, the graph needs to be updated from the data stream efficiently. At the same time, applying these changes should not add an unacceptable overhead to graph queries. In addition, these changes need to be reflected in the new analytical insights, otherwise the value of the insights degrades with time. In this work, we investigated the problem of dynamic graph processing over data streams. We started by studying the feasibility of maintaining a dynamic graph on top of Apache Spark, a data processing engine. The chosen solutions included RDDs, IndexedRDDs, and Redis. Results from our experimental indicated that Redis performed the best, and thus we concluded that storing the graph in an external big data store besides Spark is the best approach in terms of performance and practicality. After that, we designed and developed Sprouter, a streaming data analytics framework which utilizes Apache Spark for dynamic graph processing. The framework enables storing enormous graph data, allows real-time graph updates, and supports efficient complex analytics and OLTP queries. Experiments showed that the system is able to support the above mentioned properties and update graphs with up to 100 million edges in under 50 seconds in a moderate underlying cluster. Finally, we selected community detection as a case study of incremental graph analytics with Sprouter. We proposed the Incremental Distributed Weighted Community Clustering (IDWCC), a novel algorithm that optimizes the Weighted Community Clustering metric to detect communities in unweighted undirected node-grained dynamic graphs. We validated the algorithm against the static centralized and distributed versions of WCC optimization. The experiments showed that the performance of IDWCC surpasses the static distributed version by up to three times while maintaining the same accuracy or better.
External DOI