Developing Informative and Faithful Image Captioning Systems with Compositional Generalizability

Shi, Zhan
Compositional Generalization, Cross-Modal Alignment, Image Captioning, Informative Image Captioning
Image captioning systems aim to generate visually grounded descriptions for given images. Image captioning has been an active multimodal research problem in artificial intelligence: humans can understand the content of images and describe it in natural language, and we expect computers to possess the same ability. Specifically, generated captions should resemble human-written ones with respect to three desired properties: fidelity, informativeness, and compositional generalization. This thesis pursues these goals from several directions: (1) designing caption-guided visual relation graphs that better bridge modalities to generate captions with high fidelity; (2) connecting the generation of informative captions with natural language inference; (3) leveraging cross-modal retrieval to describe salient information in images; and (4) improving compositional generalization with retrieve-and-edit mechanisms.

This thesis offers a set of contributions that address the challenges of generating informative and faithful captions with compositional generalizability. The first contribution develops caption-guided visual relationship graphs that introduce beneficial inductive bias through weakly supervised multi-instance learning; on top of these graphs, a multi-task learning module regularizes the network by imposing explicit constraints on objects and predicates during generation. The second contribution estimates an informativeness score for each caption by constructing directed inference graphs based on natural language inference, and uses these scores to regularize the learning objectives. The third contribution investigates salient n-grams in captions and studies how such n-grams can regularize the learning objectives in order to generate informative captions. The fourth contribution is a novel framework that effectively retrieves similar image-caption training instances and performs analogical reasoning over the retrieved instances, enhancing the compositional generalization of image captioning.