Bridging open-source and commercial multimodal models

Published on:

25 April 2024

Primary Category:

Computer Vision and Pattern Recognition

Paper Authors:

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao

Bullets

Key Details

Introduces InternVL 1.5, an open-source multimodal large language model

Uses a continuous learning strategy to strengthen its reusable vision encoder

Supports dynamic high-resolution input up to 4K

Trained with a high-quality bilingual dataset

Achieves state-of-the-art results on 8 of 18 benchmarks

AI generated summary

This paper introduces InternVL 1.5, an open-source multimodal model that aims to match the capabilities of proprietary commercial models. It does so through three key improvements: a strong, reusable vision encoder enhanced via continuous learning; dynamic high-resolution input, which scales to 4K images; and a high-quality bilingual dataset. Evaluated on 18 benchmarks, it achieves state-of-the-art results on 8 of them, showing that the gap with commercial models has narrowed.
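The dynamic high-resolution mechanism summarized above works by matching an input image to a grid of fixed-size tiles. The sketch below illustrates one plausible version of that idea; the 448-pixel tile size and 40-tile budget follow the paper's description, but the grid-selection heuristic and function names are illustrative assumptions, not the authors' implementation (which, for example, also appends a downscaled thumbnail view).

```python
# Illustrative sketch of dynamic high-resolution tiling: pick the tile grid
# whose aspect ratio best matches the image, within a maximum tile budget.
# Tile size (448) and budget (40) follow the paper; the heuristic is assumed.

TILE = 448  # side length of each square tile, in pixels

def choose_grid(width, height, max_tiles=40):
    """Return the (cols, rows) tile grid whose aspect ratio is closest
    to the image's, subject to cols * rows <= max_tiles."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            err = abs(cols / rows - target)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def tile_boxes(width, height, max_tiles=40):
    """Compute the resized canvas size and the crop box of every tile."""
    cols, rows = choose_grid(width, height, max_tiles)
    boxes = [(c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
             for r in range(rows) for c in range(cols)]
    return (cols * TILE, rows * TILE), boxes
```

For a 3840x2160 (4K, 16:9) image, this heuristic settles on a 7x4 grid, i.e. 28 tiles, since a full 16x9 grid would exceed the 40-tile budget; each tile is then encoded independently by the vision encoder.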
