Improving image-text grounding through self-consistent explanations

7 December 2023

Computer Vision and Pattern Recognition

Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez


Key Details

Proposes self-consistency tuning (SelfEQ) to improve visual grounding without box supervision

Uses a large language model to generate paraphrases of text descriptions

Shows improved localization over baseline on Flickr30K, ReferIt and RefCOCO+

Outperforms other methods trained without box supervision by 4.69% on Flickr30K, 7.68% on ReferIt, and 3.74% on RefCOCO+

AI generated summary

This paper proposes a method that improves how well vision-and-language models locate objects in images from text descriptions. A language model generates paraphrases of each description, and the vision-and-language model is then fine-tuned so that its visual explanations agree between the original text and any paraphrase that refers to the same object.
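The consistency idea described above can be illustrated with a small sketch. It assumes each visual explanation is a 2D heatmap over image regions and measures disagreement as the mean squared difference between normalized maps. The function names, the normalization step, and the squared-error form are illustrative assumptions, not the objective used in the paper.

```python
from typing import List

Heatmap = List[List[float]]


def normalize(heatmap: Heatmap) -> Heatmap:
    """Rescale a heatmap to [0, 1] so maps for different phrases are comparable."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no localization signal; map it to all zeros.
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / span for v in row] for row in heatmap]


def consistency_loss(map_a: Heatmap, map_b: Heatmap) -> float:
    """Mean squared difference between two normalized heatmaps of equal shape.

    Hypothetical stand-in for a self-consistency objective: it is zero when
    both phrases highlight the same regions and grows as they diverge.
    """
    a, b = normalize(map_a), normalize(map_b)
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("heatmaps must have the same shape")
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy 2x2 heatmaps (made-up values, not model outputs).
original = [[0.0, 2.0], [0.0, 0.0]]          # map for the original phrase
paraphrase_same = [[0.0, 4.0], [0.0, 0.0]]   # paraphrase: same region, different scale
paraphrase_off = [[0.0, 0.0], [3.0, 0.0]]    # paraphrase: a different region

loss_same = consistency_loss(original, paraphrase_same)
loss_off = consistency_loss(original, paraphrase_off)
```

In this toy setup the paraphrase that highlights the same region yields zero loss, while the one that highlights a different region is penalized. A fine-tuning procedure would minimize a term of this kind over pairs of original and paraphrased descriptions.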

