
Distilling video-language models from images

Published on: 11 January 2024

Primary Category: Computer Vision and Pattern Recognition

Paper Authors: Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan


Key Details

Adapts an image VLM to video via visual tuning followed by language tuning

Generates detailed descriptions for millions of web video clips

The generated descriptions provide better supervision than alternative caption sources

A dual-encoder model trained on these descriptions achieves state-of-the-art results

AI-generated summary

The paper adapts an image-based vision-language model to video by first fine-tuning the visual encoder on video captioning data and then fine-tuning the language model on instruction-following data. The resulting video-language model generates detailed descriptions for millions of web videos. Training a video-text dual-encoder model on these descriptions improves performance significantly over baselines.
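The two-stage recipe can be sketched as a simple training schedule. This is a minimal illustration, assuming the standard freeze-then-tune pattern; the module names and stage labels below are hypothetical, not the paper's actual code or API.

```python
# Hypothetical sketch of the two-stage adaptation schedule described above.
# Module names and stage labels are illustrative, not from the paper's code.

def trainable_modules(stage: str) -> dict:
    """Return which component is fine-tuned at each stage of the recipe."""
    if stage == "visual_tuning":
        # Stage 1: adapt the visual encoder on video captioning data,
        # keeping the pretrained language model frozen.
        return {"visual_encoder": True, "language_model": False}
    if stage == "language_tuning":
        # Stage 2: freeze the now video-aware visual encoder and
        # fine-tune the language model on instruction-following data.
        return {"visual_encoder": False, "language_model": True}
    raise ValueError(f"unknown stage: {stage}")

print(trainable_modules("visual_tuning"))
```

In a framework like PyTorch, the returned flags would map onto setting `requires_grad` on each module's parameters before the corresponding training stage.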

