
Efficiently training vision-language models with less data

Published on: 15 August 2023

Primary Category: Computer Vision and Pattern Recognition

Paper Authors: Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky


Key Details

Proposes first vision-language dataset distillation method

Extends trajectory matching to data without discrete class labels

Jointly distills images and text in a contrastive formulation

Doubles image-text retrieval performance compared to coresets

Uses 10-100x fewer training pairs than coreset methods

AI generated summary


This paper proposes a method for distilling large vision-language datasets into much smaller sets of synthetic image-text pairs that preserve enough information to train new models from scratch. Because vision-language data lacks the discrete class labels that prior dataset distillation relies on, the method extends trajectory-matching techniques to this setting: it jointly distills images and text in a contrastive formulation and significantly outperforms coreset selection methods on image-text retrieval.
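The two ingredients the summary names can be sketched concretely. Below is a minimal numpy illustration, not the paper's implementation: a CLIP-style symmetric contrastive loss, which needs no discrete class labels because each image-text pair acts as its own target, and a normalized trajectory-matching loss in the style of matching-training-trajectories (MTT) methods, where parameters trained on the distilled data are pushed toward a segment of an expert trajectory. All function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over an image-text batch.

    The i-th image and i-th text form a positive pair; every other
    pairing in the batch is a negative. No class labels are needed.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (n, n) similarity matrix
    n = logits.shape[0]

    def xent(l):
        # Cross-entropy of each row against its diagonal (matching) entry.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(log_probs[np.arange(n), np.arange(n)])

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def trajectory_matching_loss(student_params, expert_start, expert_end):
    """MTT-style loss: distance from student parameters (trained a few
    steps on the distilled data, starting at expert_start) to the
    expert's endpoint, normalized by the expert segment's length."""
    num = np.sum((student_params - expert_end) ** 2)
    den = np.sum((expert_start - expert_end) ** 2)
    return num / den
```

In a full distillation loop, this trajectory loss would be backpropagated through the student's training steps to update the synthetic image pixels and text embeddings themselves; the sketch only shows the two objectives being matched.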
