Improving image-text grounding through self-consistent explanations

Published on:

7 December 2023

Primary Category:

Computer Vision and Pattern Recognition

Paper Authors:

Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez


Key Details

Proposes self-consistency tuning (SelfEQ) to improve visual grounding without box supervision

Uses a large language model to generate paraphrases of text descriptions

Shows improved localization over the baseline on Flickr30K, ReferIt and RefCOCO+

Outperforms other unsupervised methods by 4.69% on Flickr30K, 7.68% on ReferIt and 3.74% on RefCOCO+

AI generated summary

This paper proposes a method to improve the ability of vision-and-language models to locate objects in images based on text descriptions. The method generates paraphrases of each text description with a language model, then fine-tunes the vision-and-language model so that its visual explanations agree between the original text and any paraphrase that refers to the same object.
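The consistency idea can be illustrated with a toy example. The summary does not state the exact loss or how the visual explanations are computed, so the following is only a minimal sketch under assumptions: each explanation is taken to be a small 2D heatmap, and the hypothetical `consistency_loss` function measures disagreement as the mean squared difference after rescaling each map to [0, 1]. It is not the paper's objective.

```python
from typing import List

Heatmap = List[List[float]]


def normalize(heatmap: Heatmap) -> Heatmap:
    """Rescale a heatmap to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / span for v in row] for row in heatmap]


def consistency_loss(map_a: Heatmap, map_b: Heatmap) -> float:
    """Mean squared difference between two normalized heatmaps.

    Assumption: this stands in for whatever consistency objective the
    paper actually uses; it only illustrates the general idea.
    """
    if len(map_a) != len(map_b) or any(
        len(ra) != len(rb) for ra, rb in zip(map_a, map_b)
    ):
        raise ValueError("heatmaps must have the same shape")
    a, b = normalize(map_a), normalize(map_b)
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy 2x2 heatmaps (made-up values, not model outputs).
original = [[0.1, 0.9], [0.1, 0.1]]          # explanation for the original phrase
paraphrase_same = [[0.2, 1.8], [0.2, 0.2]]   # same region, different scale
paraphrase_diff = [[0.9, 0.1], [0.1, 0.1]]   # attention on a different region

print(consistency_loss(original, paraphrase_same))
print(consistency_loss(original, paraphrase_diff))
```

In this sketch, a paraphrase whose heatmap highlights the same region gives a loss of zero regardless of scale, while one that highlights a different region is penalized. Fine-tuning would push the model toward the first case.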
