Enhancing visual-language learning with MLLMs

Published on:

30 November 2023

Primary Category:

Computer Vision and Pattern Recognition

Paper Authors:

Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You


Key Details

Uses multiple MLLMs to extend image captions

Proposes 'text shearing' to control for MLLM bias

Obtains large gains in image retrieval, with zero-shot results comparable to fine-tuned ones

Shows MLLMs can significantly enhance representation learning

AI generated summary

This paper proposes using multiple multimodal large language models (MLLMs) to extend image captions, which improves the quality of visual-language datasets and the representations learned from them. To prevent bias from the MLLMs, the authors apply 'text shearing', which keeps the extended captions at the same length. On image retrieval tasks, the approach yields large performance gains, including zero-shot results comparable to fine-tuned ones.
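
As a rough illustration of the length-control idea behind 'text shearing', the sketch below truncates several extended captions to a common token budget. The summary does not say how the target length is chosen or how text is tokenized, so this assumes whitespace tokens and uses the original caption's token count as the budget. The function names `shear_caption` and `shear_all` are hypothetical and do not come from the paper.

```python
from typing import List


def shear_caption(extended: str, target_len: int) -> str:
    """Keep at most target_len whitespace-separated tokens of a caption."""
    tokens = extended.split()
    return " ".join(tokens[:target_len])


def shear_all(original: str, extended_captions: List[str]) -> List[str]:
    """Truncate every extended caption to the original caption's token count.

    Assumption: the budget is the length of the original caption; the
    summary only states that extended captions end up the same length.
    """
    target_len = len(original.split())
    return [shear_caption(c, target_len) for c in extended_captions]


# Illustrative captions, made up for this example.
original = "a dog runs on the beach"
extended = [
    "a brown dog runs on the sandy beach near the waves at sunset",
    "the dog is running happily along a beach",
]
sheared = shear_all(original, extended)
```

Here each extended caption is cut to the six tokens of the original, so captions from different models contribute the same amount of text regardless of how verbose each model is. A real pipeline would more likely count tokens with the text encoder's own tokenizer than with whitespace splitting.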
