Advisory Committee Chair
Da Yan
Advisory Committee Members
Chengcui Zhang
Tianyang Wang
Sidharth Kumar
Kai Zhao
Zhe Jiang
Document Type
Dissertation
Date of Award
2023
Degree Name by School
Doctor of Philosophy (PhD), College of Arts and Sciences
Abstract
Scene graphs are explicit data structures that express the semantics of an image. They are semantically richer than simple image encodings from Convolutional Neural Networks (CNNs) because objects (big or small) are represented as distinct nodes and the relationships between them are expressed as labeled edges in a graph structure. Thus, scene graphs can express not only the visual context of an image but also the relationship context between pairs of objects. This added context makes scene graphs well suited to several downstream vision-language (VL) tasks such as Visual Question Answering (VQA), image captioning, and text-to-image retrieval. Even though research on scene graph generation (SGG) models has surged in recent years, severe drawbacks still prevent scene graphs from being widely adopted for VL tasks. These drawbacks are: (1) the semantic gap between the vision and language modalities, which limits compositional performance; (2) the restriction of most SGG models to a closed set of object and relationship labels, which limits their real-life application; and (3) the poor quality of pairwise object relationship (edge) predictions, which suffer from a biased, long-tailed label distribution. This work describes a novel Open-vocabulary Scene Graph Generation model with Compositional Prompt Tuning (OCoPro) that enforces compositionality through a Cross-modal Composition Refinement (CCR) module and uses compositional prompts to efficiently predict open-vocabulary object and relationship labels that are not limited to the closed-set training data, thereby significantly improving the quality and robustness of scene graphs in real-world applications. Empirical results show that OCoPro outperforms state-of-the-art SGG models under different SGG evaluation protocols.
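To make the data-structure view of a scene graph concrete, the following is a minimal Python sketch of objects as nodes and relationships as labeled (subject, predicate, object) edges. It is a generic illustration only, not the OCoPro model or its code: the names `SceneGraph`, `ObjectNode`, `RelationEdge`, `object_vocab`, and `predicate_vocab` are hypothetical. The optional vocabularies are an assumed way to contrast a closed-set label space (labels must come from a fixed set) with an open-vocabulary one (any label string is accepted).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectNode:
    """A detected object: one node in the scene graph (hypothetical schema)."""
    node_id: int
    label: str
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class RelationEdge:
    """A labeled, directed edge: subject --predicate--> object."""
    subject_id: int
    predicate: str
    object_id: int


@dataclass
class SceneGraph:
    # None means "open vocabulary": any label string is allowed.
    # A frozenset restricts labels to a closed set, as in closed-set SGG.
    object_vocab: frozenset[str] | None = None
    predicate_vocab: frozenset[str] | None = None
    nodes: dict[int, ObjectNode] = field(default_factory=dict)
    edges: list[RelationEdge] = field(default_factory=list)

    def add_object(self, label: str, box: tuple[float, float, float, float]) -> int:
        """Add an object node and return its integer id."""
        if self.object_vocab is not None and label not in self.object_vocab:
            raise ValueError(f"object label {label!r} is outside the closed vocabulary")
        node_id = len(self.nodes)
        self.nodes[node_id] = ObjectNode(node_id, label, box)
        return node_id

    def relate(self, subject_id: int, predicate: str, object_id: int) -> None:
        """Add a labeled edge between two existing object nodes."""
        if subject_id not in self.nodes or object_id not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        if self.predicate_vocab is not None and predicate not in self.predicate_vocab:
            raise ValueError(f"predicate {predicate!r} is outside the closed vocabulary")
        self.edges.append(RelationEdge(subject_id, predicate, object_id))

    def triplets(self) -> list[tuple[str, str, str]]:
        """Return the graph as (subject label, predicate, object label) triplets."""
        return [
            (self.nodes[e.subject_id].label, e.predicate, self.nodes[e.object_id].label)
            for e in self.edges
        ]


if __name__ == "__main__":
    g = SceneGraph()  # open vocabulary: no label restrictions
    man = g.add_object("man", (10, 20, 110, 300))
    horse = g.add_object("horse", (80, 120, 400, 380))
    hat = g.add_object("hat", (30, 5, 90, 40))
    g.relate(man, "riding", horse)
    g.relate(man, "wearing", hat)
    print(g.triplets())  # [('man', 'riding', 'horse'), ('man', 'wearing', 'hat')]
```

In this sketch, constructing `SceneGraph(object_vocab=frozenset({"man", "horse"}))` rejects an unseen label such as `"zebra"`, which mirrors the closed-set limitation the abstract describes; leaving the vocabularies as `None` accepts any label.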
Recommended Citation
Sami, Mirza Tanzim, "Open-Vocabulary Scene Graph Generation With Compositional Prompt-Tuning" (2023). All ETDs from UAB. 3523.
https://digitalcommons.library.uab.edu/etd-collection/3523