Advisory Committee Chair
Da Yan
Advisory Committee Members
Purushotham Bangalore
Sidharth Kumar
Carmeliza Navasca
Chengcui Zhang
Yang Zhou
Document Type
Dissertation
Date of Award
2022
Degree Name by School
Doctor of Philosophy (PhD) College of Arts and Sciences
Abstract
Finding the subgraphs of a big graph that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems to scale them are all IO-bound in execution. We propose the first truly CPU-bound distributed framework called G-thinker that adopts a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph mining algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs effectively overlap communication with computation to minimize CPU idle time and help G-thinker achieve orders of magnitude speedup compared with the existing subgraph-centric system. However, the old G-thinker design does not balance the workloads of different subgraph-mining tasks sufficiently, leading to the straggler problem when mining expensive pseudo-clique structures such as quasi-cliques and k-plexes. Recently, we proposed a system-algorithm co-design solution that addresses this challenge by redesigning G-thinker’s execution engine to prioritize long-running tasks for mining, and by utilizing a novel time-delayed divide-and-conquer strategy to effectively decompose the workloads of long-running tasks to improve load balancing. Moreover, since cliques are defined over undirected graphs, existing pseudo-clique definitions also only work on undirected graphs, limiting their application in many real networks that are directed. We generalized the concept of quasi-cliques to directed graphs and proposed an efficient recursive algorithm that integrates many effective pruning rules that are validated by ablation studies. We also study the finding of top-k large quasi-cliques directly by bootstrapping the search from more compact quasi-cliques, to scale the mining to larger networks.
Inspired by this parallel paradigm, I also propose a novel programming framework, called T-thinkerQ, for answering online subgraph queries in parallel following the think-like-a-task (TLAT) paradigm. T-thinkerQ utilizes a novel active task-queue list to ensure fairness, so that queries are answered in the order in which they are received. To track query progress so that users are promptly notified when a query completes, T-thinkerQ also adopts a novel lineage-based design that keeps track of how subtasks are generated by straggler tasks for divide-and-conquer processing. We use four kinds of subgraph queries to demonstrate the programming friendliness of T-thinkerQ as well as its excellent CPU scalability.
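The abstract names quasi-cliques and k-plexes without defining them. As a rough illustration only, the sketch below checks the degree-based definitions that are common in the literature, which are assumed here and may differ in detail from those used in the dissertation: a vertex set S is a gamma-quasi-clique if every vertex in S is adjacent to at least ceil(gamma * (|S| - 1)) other vertices of S, and a k-plex if every vertex is adjacent to at least |S| - k of them. The function names and the toy graph are hypothetical and are not part of G-thinker or T-thinkerQ.

```python
from math import ceil


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_within(adj, v, members):
    """Number of neighbours of v that lie inside the vertex set `members`."""
    return sum(1 for u in members if u != v and u in adj.get(v, set()))


def is_quasi_clique(adj, members, gamma):
    """Assumed degree-based definition: each vertex has at least
    ceil(gamma * (|S| - 1)) neighbours inside S."""
    members = set(members)
    need = ceil(gamma * (len(members) - 1))
    return all(degree_within(adj, v, members) >= need for v in members)


def is_k_plex(adj, members, k):
    """Assumed definition: each vertex has at least |S| - k neighbours inside S
    (so k = 1 is an ordinary clique)."""
    members = set(members)
    need = len(members) - k
    return all(degree_within(adj, v, members) >= need for v in members)


# Toy graph: a 4-clique on {0, 1, 2, 3} with the edge (1, 3) removed.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
adj = build_adjacency(edges)
S = {0, 1, 2, 3}

print(is_quasi_clique(adj, S, 0.6))  # each vertex needs 2 neighbours in S
print(is_quasi_clique(adj, S, 0.9))  # each vertex needs 3 neighbours in S
print(is_k_plex(adj, S, 2))          # each vertex needs 2 neighbours in S
print(is_k_plex(adj, S, 1))          # each vertex needs 3 neighbours in S
```

A check like this only verifies a candidate set. The mining problems described in the abstract search over exponentially many candidates, which is why pruning rules and load balancing matter.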
Recommended Citation
Guo, Guimu, "Scalable Subgraph Mining in a Big Graph" (2022). All ETDs from UAB. 492.
https://digitalcommons.library.uab.edu/etd-collection/492