Published on:
11 January 2024
Primary Category:
Computer Vision and Pattern Recognition
Paper Authors:
Yue Zhao,
Long Zhao,
Xingyi Zhou,
Jialin Wu,
Chun-Te Chu,
Hui Miao,
Florian Schroff,
Hartwig Adam,
Ting Liu,
Boqing Gong,
Philipp Krähenbühl,
Liangzhe Yuan
Adapts image VLM to video via visual then language tuning
Generates detailed video descriptions for millions of clips
Descriptions provide better supervision than alternatives
Dual-encoder model trained on them achieves state-of-the-art results
Distilling video-language models from images
The paper adapts an image-based vision-language model to video in two stages: first fine-tuning the visual encoder on video captioning data, then fine-tuning the language model on instruction-following data. The resulting video-language model generates detailed descriptions for millions of web videos. When these descriptions are used to train a video-text dual-encoder model, performance improves significantly over baseline supervision sources.
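The distillation step trains a dual encoder by contrasting paired video and text embeddings. As a rough illustration (not the paper's actual code), the sketch below implements a symmetric InfoNCE-style contrastive loss, the standard objective for dual-encoder video-text models; all names and the batch setup here are hypothetical.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired video/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(logits))           # matching pairs lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
B, D = 8, 32  # hypothetical batch size and embedding dimension
loss = contrastive_loss(rng.standard_normal((B, D)), rng.standard_normal((B, D)))
print(float(loss))
```

In the paper's setup, the text side would be fed the detailed machine-generated descriptions rather than short alt-text, which is the claimed source of the improved supervision.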