Paper Title:
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Published on:
25 April 2024
Primary Category:
Computer Vision and Pattern Recognition
Paper Authors:
Zhe Chen,
Weiyun Wang,
Hao Tian,
Shenglong Ye,
Zhangwei Gao,
Erfei Cui,
Wenwen Tong,
Kongzhi Hu,
Jiapeng Luo,
Zheng Ma,
Ji Ma,
Jiaqi Wang,
Xiaoyi Dong,
Hang Yan,
Hewei Guo,
Conghui He,
Zhenjiang Jin,
Chao Xu,
Bin Wang,
Xingjian Wei,
Wei Li,
Wenjian Zhang,
Lewei Lu,
Xizhou Zhu,
Tong Lu,
Dahua Lin,
Yu Qiao
Introduces InternVL 1.5 open-source multimodal model
Uses continuous learning to strengthen its reusable vision encoder
Supports dynamic high-resolution input up to 4K
Includes high-quality bilingual dataset
Achieves state-of-the-art results on 8 of 18 benchmarks
Bridging open-source and commercial multimodal models
This paper introduces InternVL 1.5, an open-source multimodal model that aims to match proprietary counterparts in capability. It does so through three key improvements: a reusable vision encoder strengthened by continuous learning, dynamic high-resolution input, and a high-quality bilingual dataset. Evaluated on 18 benchmarks, it achieved state-of-the-art results on 8 of them, showing that the gap to commercial models such as GPT-4V has narrowed.
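The "dynamic high resolution" idea amounts to matching an input image to a grid of fixed-size tiles whose aspect ratio is close to the image's. A minimal sketch is below, assuming the 448x448 tile size and up-to-40-tile budget described for InternVL 1.5; the function name and the exact grid-selection rule here are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of dynamic high-resolution tiling: pick the (cols, rows)
# grid of 448x448 tiles whose aspect ratio is closest to the image's,
# subject to a maximum tile count (1 to 40 tiles supports ~4K input).
# choose_grid and the closest-ratio rule are illustrative assumptions.

TILE = 448        # tile side length used by InternVL 1.5
MAX_TILES = 40    # tile budget enabling roughly 4K-resolution input

def choose_grid(width: int, height: int, max_tiles: int = MAX_TILES):
    """Return the (cols, rows) tile grid whose aspect ratio best
    matches the image, with cols * rows <= max_tiles."""
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))

cols, rows = choose_grid(1920, 1080)   # a 16:9 frame
print(cols, rows, cols * rows, cols * TILE, rows * TILE)
```

Each tile is then resized to the encoder's native 448x448 input, so the number of vision tokens grows with image resolution instead of being fixed.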
Evaluating Multimodal Language Models
Vision language models with small size, fast speed, and strong performance
Distilling video-language models from images
Visual language models with deep vision-text fusion
Benchmarking large language models on long, multi-image tasks
Understanding and Improving Models with Vision-Language Surrogates