Vision language models with small size, fast speed, and strong performance

Published on: 13 October 2023

Primary Category: Computer Vision and Pattern Recognition

Paper Authors: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut

Key Details

Compares classification-pretrained vs. contrastively pretrained (SigLIP) image encoders, finding the latter markedly better on localization and visually-situated text understanding tasks

Achieves state-of-the-art on 10+ vision-language benchmarks including visually-situated text and referring expression segmentation

Strong video QA results without any video pretraining, showing powerful generalization

Introduces 2B parameter multilingual SigLIP model with new state-of-the-art on multilingual retrieval

AI-generated summary

This paper introduces PaLI-3, a vision language model that achieves strong performance across diverse vision-language tasks while being 10x smaller than comparable state-of-the-art models. The key techniques are contrastive (SigLIP) pretraining of the image encoder, an improved training dataset mixture, and training at higher image resolutions.
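Since the summary names contrastive (SigLIP) pretraining of the image encoder as a key technique, here is a minimal sketch of a SigLIP-style sigmoid contrastive loss over a batch of image-text pairs. This is an illustrative assumption-laden sketch, not the authors' implementation: the function name, the fixed temperature and bias values, and the NumPy setting are placeholder choices.

```python
# Minimal sketch of a SigLIP-style sigmoid contrastive loss (illustrative only;
# not the paper's code). Assumes L2-normalised image and text embeddings.
import numpy as np

def siglip_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Sigmoid contrastive loss over a batch of image-text pairs.

    img_emb, txt_emb: (batch, dim) L2-normalised embeddings.
    Matching pairs (the diagonal) get label +1, all other pairs -1.
    """
    logits = temperature * img_emb @ txt_emb.T + bias   # (batch, batch) pairwise scores
    labels = 2.0 * np.eye(len(img_emb)) - 1.0           # +1 on diagonal, -1 elsewhere
    # Pairwise binary log-loss: log(1 + exp(-label * logit)), averaged over all pairs.
    return np.mean(np.logaddexp(0.0, -labels * logits))

# Toy usage with random, normalised embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 8)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(siglip_loss(img, txt))
```

Unlike a softmax-based contrastive loss, each image-text pair contributes an independent binary term, which is the property that lets SigLIP scale batch size without a global normalisation over the batch.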
