Degree Name by School: Doctor of Philosophy (PhD), College of Arts and Sciences
Finding the subgraphs of a big graph that satisfy certain conditions is useful in many applications, such as community detection and subgraph matching. These problems have high time complexity, yet the existing systems for scaling them are all IO-bound in execution. We propose G-thinker, the first truly CPU-bound distributed framework, which adopts a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph mining algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs overlap communication with computation to minimize CPU idle time, and they help G-thinker achieve orders of magnitude speedup compared with the existing subgraph-centric system. However, the original G-thinker design does not sufficiently balance the workloads of different subgraph-mining tasks, which leads to the straggler problem when mining expensive pseudo-clique structures such as quasi-cliques and k-plexes. We recently proposed a system-algorithm co-design solution that addresses this challenge by redesigning G-thinker's execution engine to prioritize long-running tasks for mining, and by using a novel time-delayed divide-and-conquer strategy to decompose the workloads of long-running tasks and thereby improve load balancing. Moreover, since cliques are defined over undirected graphs, existing pseudo-clique definitions also work only on undirected graphs, which limits their application to the many real networks that are directed. We generalize the concept of quasi-cliques to directed graphs and propose an efficient recursive algorithm that integrates many effective pruning rules, which are validated by ablation studies. We also study how to find the top-k large quasi-cliques directly by bootstrapping the search from more compact quasi-cliques, in order to scale the mining to larger networks.
Inspired by this parallel paradigm, I also propose a novel programming framework, called T-thinkerQ, for answering online subgraph queries in parallel following the TLAT paradigm. T-thinkerQ uses a novel active task-queue list to ensure fairness, so that queries are answered in the order in which they are received. To track query progress so that users are notified promptly when a query completes, T-thinkerQ also adopts a novel lineage-based design that keeps track of how subtasks are generated by straggler tasks for divide-and-conquer processing. We use four kinds of subgraph queries to demonstrate the programming friendliness of T-thinkerQ as well as its excellent CPU scalability.
Guo, Guimu, "Scalable Subgraph Mining in a Big Graph" (2022). All ETDs from UAB. 492.