Scalable Subgraph Mining in a Big Graph

Guimu Guo, University of Alabama at Birmingham

Advisory Committee Chair

Da Yan

Advisory Committee Members

Purushotham Bangalore

Sidharth Kumar

Carmeliza Navasca

Chengcui Zhang

Yang Zhou

Document Type

Dissertation

Date of Award

2022

Degree Name by School

Doctor of Philosophy (PhD), College of Arts and Sciences

Abstract

Finding the subgraphs of a big graph that satisfy certain conditions is useful in many applications, such as community detection and subgraph matching. These problems have high time complexity, yet existing systems for scaling them are all IO-bound in execution. We propose the first truly CPU-bound distributed framework, called G-thinker, which adopts a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph mining algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs overlap communication with computation to minimize CPU idle time and help G-thinker achieve orders-of-magnitude speedup over the existing subgraph-centric system. However, the original G-thinker design does not sufficiently balance the workloads of different subgraph-mining tasks, leading to the straggler problem when mining expensive pseudo-clique structures such as quasi-cliques and k-plexes. Recently, we proposed a system-algorithm co-design solution that addresses this challenge by redesigning G-thinker’s execution engine to prioritize long-running tasks for mining, and by utilizing a novel time-delayed divide-and-conquer strategy to effectively decompose the workloads of long-running tasks to improve load balancing. Moreover, since cliques are defined over undirected graphs, existing pseudo-clique definitions also work only on undirected graphs, limiting their application to the many real networks that are directed. We generalized the concept of quasi-cliques to directed graphs and proposed an efficient recursive algorithm that integrates many effective pruning rules, which are validated by ablation studies. We also study finding the top-k large quasi-cliques directly by bootstrapping the search from more compact quasi-cliques, to scale the mining to larger networks.
Inspired by this parallel paradigm, I also propose a novel programming framework, called T-thinkerQ, for answering online subgraph queries in parallel following the TLAT paradigm. T-thinkerQ utilizes a novel active task-queue list to ensure fairness, so that queries are answered in the order received. To track query progress so that users are notified promptly when a query completes, T-thinkerQ also adopts a novel lineage-based design that keeps track of how subtasks are generated by straggler tasks for divide-and-conquer processing. We use four kinds of subgraph queries to demonstrate the programming friendliness of T-thinkerQ as well as its excellent CPU scalability.

Recommended Citation

Guo, Guimu, "Scalable Subgraph Mining in a Big Graph" (2022). All ETDs from UAB. 492.
https://digitalcommons.library.uab.edu/etd-collection/492

Included in

Arts and Humanities Commons
