Efficient vision-language learning with cluster masking

Published on: 14 May 2024

Primary Category: Computer Vision and Pattern Recognition

Paper Authors: Zihao Wei, Zixuan Pan, Andrew Owens


Key Details

Proposes a cluster masking strategy that drops clusters of visually similar image patches

Forces prediction of missing structures from context, aiding representation learning

Faster training than baseline methods like FLIP and CLIP

Outperforms other strategies on various downstream benchmarks

Achieves improved feature learning without extra model complexity

AI generated summary

This paper proposes a simple yet effective strategy for masking image patches during vision-language contrastive learning. In each training iteration, it randomly masks clusters of visually similar patches, so the model must predict words for the missing visual structures from the surrounding context alone. This provides an extra learning signal, and because the masked patches are dropped, it also speeds up training. When evaluated on several benchmarks, the approach outperforms other masking strategies, such as the random patch masking used in FLIP, in representation quality and downstream performance.
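To make the idea of masking clusters of visually similar patches concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `cluster_mask`, the parameters `num_anchors` and `sim_threshold`, the use of randomly chosen anchor patches, and the use of cosine similarity between flattened patch vectors are all hypothetical choices that the summary above does not specify.

```python
import numpy as np


def cluster_mask(patches, num_anchors=3, sim_threshold=0.9, rng=None):
    """Return a boolean array over patches: True = keep, False = masked.

    Illustrative assumption: a "cluster" is every patch whose cosine
    similarity to a randomly chosen anchor patch is >= sim_threshold.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = patches.shape[0]

    # L2-normalise each flattened patch so dot products are cosine similarities.
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    unit = patches / np.maximum(norms, 1e-8)

    # Pick distinct anchor patches at random.
    anchors = rng.choice(n, size=min(num_anchors, n), replace=False)

    # Similarity of every patch to every anchor: shape (num_anchors, n).
    sim = unit[anchors] @ unit.T

    # Mask a patch if it is close to at least one anchor.
    masked = (sim >= sim_threshold).any(axis=0)
    masked[anchors] = True  # anchors are always masked, even if degenerate
    return ~masked


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.random((196, 48))   # e.g. 14 x 14 patches, 48 values each
    keep = cluster_mask(patches, rng=rng)
    visible = patches[keep]           # only these would be fed to the image encoder
    print(visible.shape)
```

In contrast to masking patches independently at random, this sketch removes whole groups of look-alike patches at once, so the number of visible patches varies from image to image and depends on how redundant the image content is.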
