Published on: 28 March 2024
Primary Category: Computer Vision and Pattern Recognition
Paper Authors: Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang
Image pairs on the same web pages have diverse relations
Text instructions make implicit relations explicit
Trained on 36.7M triplets mined from the web
Outperforms prior work with 50x fewer parameters
Succeeds on complex search intents
Self-supervised image retrieval with open instructions
This paper introduces MagicLens, a series of self-supervised image retrieval models that follow open-ended text instructions to find relevant images. The key insight is that image pairs that naturally co-occur on the same web page carry diverse implicit relations beyond visual similarity. Large language models make those relations explicit as text instructions, which yields 36.7M training triplets mined from the web. Experiments show that MagicLens matches or exceeds the prior state of the art on multiple benchmarks, in some cases with 50x fewer parameters. Further analysis finds that it succeeds on complex search intents that prior methods miss.
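To make the idea of instruction-guided retrieval concrete, here is a minimal sketch rather than the MagicLens implementation. It assumes a triplet holds a query image, an instruction, and a target image; the `Triplet` field names, the `fuse` function (a toy element-wise sum standing in for a learned multimodal encoder), the file names, and the hand-made 3-d embeddings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Triplet:
    """Hypothetical training example: two co-occurring images plus the
    instruction that states how the second relates to the first."""
    query_image: str   # identifier of the query image (assumed field name)
    instruction: str   # text that makes the implicit relation explicit
    target_image: str  # identifier of the related image


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse(image_vec, text_vec):
    """Toy stand-in for a multimodal query encoder: element-wise sum."""
    return [i + t for i, t in zip(image_vec, text_vec)]


def retrieve(image_vec, text_vec, index, k=1):
    """Rank indexed images by similarity to the fused (image, instruction) query."""
    query = fuse(image_vec, text_vec)
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]


# Hand-made 3-d embeddings, purely illustrative (not model outputs).
index = {
    "charger.jpg": [0.1, 0.9, 0.2],
    "laptop_side_view.jpg": [0.9, 0.1, 0.1],
    "cat.jpg": [0.0, 0.1, 0.9],
}
example = Triplet("laptop.jpg", "find a charger for this product", "charger.jpg")
laptop_vec = [0.8, 0.1, 0.0]
instruction_vec = [-0.6, 0.9, 0.1]  # pretend embedding of example.instruction

with_instruction = retrieve(laptop_vec, instruction_vec, index)
image_only = retrieve(laptop_vec, [0.0, 0.0, 0.0], index)
print(with_instruction, image_only)
```

With these toy vectors, the image-only query returns the visually similar image, while adding the instruction shifts the result to the related but visually different target. That is the behavior the summary describes, shown here under the stated assumptions only.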