Published on:
11 January 2024
Primary Category:
Computer Vision and Pattern Recognition
Paper Authors:
Yue Zhao,
Long Zhao,
Xingyi Zhou,
Jialin Wu,
Chun-Te Chu,
Hui Miao,
Florian Schroff,
Hartwig Adam,
Ting Liu,
Boqing Gong,
Philipp Krähenbühl,
Liangzhe Yuan
Adapts image VLM to video via visual then language tuning
Generates detailed video descriptions for millions of clips
Descriptions provide better supervision than alternatives
Dual-encoder model trained on them achieves state-of-the-art results
Distilling video-language models from images
The paper adapts an image-based vision-language model to video in two stages: first fine-tuning the visual encoder on video captioning data, then fine-tuning the language model on instruction-following data. The resulting video-language model generates detailed descriptions for millions of web videos. When these descriptions are used to train a video-text dual-encoder model, performance improves significantly over baseline supervision sources.
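The distillation step trains a dual encoder by contrasting paired video and text embeddings. As a rough illustration (not the paper's actual code), the sketch below implements a symmetric InfoNCE-style contrastive loss, the standard objective for dual-encoder video-text models; all names and the batch setup here are hypothetical.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired video/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(logits))           # matching pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
B, D = 8, 32  # hypothetical batch size and embedding dimension
loss = contrastive_loss(rng.standard_normal((B, D)), rng.standard_normal((B, D)))
print(float(loss))
```

In the paper's setup, the text side would be fed the detailed machine-generated descriptions rather than short alt-text, which is the claimed source of the improved supervision.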