Vision language models with small size, fast speed, and strong performance

Published on: 13 October 2023

Primary Category: Computer Vision and Pattern Recognition

Paper Authors: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut

Key Details

Compares classification-pretrained vs. contrastively pretrained (SigLIP) image encoders, finding the latter markedly better on localization and visually-situated text understanding tasks

Achieves state-of-the-art on 10+ vision-language benchmarks including visually-situated text and referring expression segmentation

Strong video QA results without any video pretraining, showing powerful generalization

Introduces 2B parameter multilingual SigLIP model with new state-of-the-art on multilingual retrieval

AI-generated summary

This paper introduces PaLI-3, a vision language model that achieves strong performance across diverse vision-language tasks while being 10x smaller than comparable state-of-the-art models. The key techniques are contrastive (SigLIP) pretraining of the image encoder, an improved training dataset mixture, and training at higher image resolutions.
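Since the summary names contrastive (SigLIP) pretraining of the image encoder as a key technique, here is a minimal sketch of a SigLIP-style sigmoid contrastive loss over a batch of image-text pairs. This is an illustrative assumption-laden sketch, not the authors' implementation: the function name, the fixed temperature and bias values, and the NumPy setting are placeholder choices.

```python
# Minimal sketch of a SigLIP-style sigmoid contrastive loss (illustrative only;
# not the paper's code). Assumes L2-normalised image and text embeddings.
import numpy as np

def siglip_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Sigmoid contrastive loss over a batch of image-text pairs.

    img_emb, txt_emb: (batch, dim) L2-normalised embeddings.
    Matching pairs (the diagonal) get label +1, all other pairs -1.
    """
    logits = temperature * img_emb @ txt_emb.T + bias   # (batch, batch) pairwise scores
    labels = 2.0 * np.eye(len(img_emb)) - 1.0           # +1 on diagonal, -1 elsewhere
    # Pairwise binary log-loss: log(1 + exp(-label * logit)), averaged over all pairs.
    return np.mean(np.logaddexp(0.0, -labels * logits))

# Toy usage with random, normalised embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 8)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(siglip_loss(img, txt))
```

Unlike a softmax-based contrastive loss, each image-text pair contributes an independent binary term, which is the property that lets SigLIP scale batch size without a global normalisation over the batch.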
