Published on:
13 October 2023
Primary Category:
Computer Vision and Pattern Recognition
Paper Authors:
Xi Chen,
Xiao Wang,
Lucas Beyer,
Alexander Kolesnikov,
Jialin Wu,
Paul Voigtlaender,
Basil Mustafa,
Sebastian Goodman,
Ibrahim Alabdulmohsin,
Piotr Padlewski,
Daniel Salz,
Xi Xiong,
Daniel Vlasic,
Filip Pavetic,
Keran Rong,
Tianli Yu,
Daniel Keysers,
Xiaohua Zhai,
Radu Soricut
Compares classification-pretrained and contrastively pretrained image encoders, finding the latter markedly better on localization and visually situated text understanding tasks
Achieves state-of-the-art on 10+ vision-language benchmarks including visually-situated text and referring expression segmentation
Strong video QA results without any video pretraining, showing powerful generalization
Introduces a 2B-parameter multilingual SigLIP model that sets a new state of the art on multilingual retrieval
Vision-language models that are smaller, faster, and stronger
This paper introduces PaLI-3, a vision-language model that achieves strong performance across diverse tasks while being 10x smaller than current state-of-the-art models. The key techniques are contrastive pretraining of the image encoder (SigLIP), an improved dataset mixture, and training at higher resolutions.
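The contrastive pretraining referenced here follows SigLIP, which replaces the usual softmax contrastive objective with a pairwise sigmoid loss: every image-text combination in the batch is scored independently as a binary match/non-match. A minimal numpy sketch of that loss shape follows; the fixed `temperature` and `bias` values are illustrative placeholders (in the paper both are learned), and `siglip_loss` is a hypothetical helper name, not the authors' code.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid contrastive loss (SigLIP-style sketch).

    Matched image-text pairs (the diagonal of the similarity matrix)
    are positives; all other combinations are negatives. Temperature
    and bias are illustrative constants here, not learned as in the paper.
    """
    # L2-normalize so logits are scaled cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)

    logits = img @ txt.T * temperature + bias      # shape (B, B)
    labels = 2.0 * np.eye(len(img)) - 1.0          # +1 on diagonal, -1 elsewhere

    # Numerically stable mean of -log(sigmoid(labels * logits)).
    z = labels * logits
    return np.mean(np.logaddexp(0.0, -z))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(4, 8))
print(siglip_loss(img, txt))
```

Because each pair is an independent binary problem, the loss needs no batch-wide normalization, which is part of what makes sigmoid pretraining scale well.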