Paper Title:
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Published on:
25 April 2024
Primary Category:
Computer Vision and Pattern Recognition
Paper Authors:
Zhe Chen,
Weiyun Wang,
Hao Tian,
Shenglong Ye,
Zhangwei Gao,
Erfei Cui,
Wenwen Tong,
Kongzhi Hu,
Jiapeng Luo,
Zheng Ma,
Ji Ma,
Jiaqi Wang,
Xiaoyi Dong,
Hang Yan,
Hewei Guo,
Conghui He,
Zhenjiang Jin,
Chao Xu,
Bin Wang,
Xingjian Wei,
Wei Li,
Wenjian Zhang,
Lewei Lu,
Xizhou Zhu,
Tong Lu,
Dahua Lin,
Yu Qiao
Introduces InternVL 1.5 open-source multimodal model
Uses continuous learning to strengthen its reusable vision encoder
Supports dynamic high-resolution input up to 4K
Includes high-quality bilingual dataset
Achieves state-of-the-art results on 8 of 18 benchmarks
Bridging open-source and commercial multimodal models
This paper introduces InternVL 1.5, an open-source multimodal model that aims to match proprietary counterparts in capability. It does so through three key improvements: a reusable vision encoder strengthened by continuous learning, dynamic high-resolution input, and a high-quality bilingual dataset. Evaluated on 18 benchmarks, it achieved state-of-the-art results on 8 of them, showing that the gap to commercial models such as GPT-4V has narrowed.
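The "dynamic high resolution" idea amounts to matching an input image to a grid of fixed-size tiles whose aspect ratio is close to the image's. A minimal sketch is below, assuming the 448x448 tile size and up-to-40-tile budget described for InternVL 1.5; the function name and the exact grid-selection rule here are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of dynamic high-resolution tiling: pick the (cols, rows)
# grid of 448x448 tiles whose aspect ratio is closest to the image's,
# subject to a maximum tile count (1 to 40 tiles supports ~4K input).
# choose_grid and the closest-ratio rule are illustrative assumptions.

TILE = 448        # tile side length used by InternVL 1.5
MAX_TILES = 40    # tile budget enabling roughly 4K-resolution input

def choose_grid(width: int, height: int, max_tiles: int = MAX_TILES):
    """Return the (cols, rows) tile grid whose aspect ratio best
    matches the image, with cols * rows <= max_tiles."""
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))

cols, rows = choose_grid(1920, 1080)   # a 16:9 frame
print(cols, rows, cols * rows, cols * TILE, rows * TILE)
```

Each tile is then resized to the encoder's native 448x448 input, so the number of vision tokens grows with image resolution instead of being fixed.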
Evaluating Multimodal Language Models
Vision language models with small size, fast speed, and strong performance
Distilling video-language models from images
Visual language models with deep vision-text fusion
Benchmarking large language models on long, multi-image tasks
Understanding and Improving Models with Vision-Language Surrogates