Benchmarking large language models on long, multi-image tasks

Published on: 29 April 2024

Primary Category: Computation and Language

Paper Authors: Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang

Key Details

MileBench tests models on long texts and multi-image tasks

Includes diagnostic and realistic evaluations

Closed-source models excel, open-source models struggle

Performance declines as the number of images increases

Further research is needed to enhance long-context, multi-image abilities

AI generated summary

This paper introduces MileBench, a new benchmark that tests multimodal large language models on their ability to process long contexts containing multiple images. It includes both diagnostic and realistic evaluations spanning comprehension and generation tasks. Results show that closed-source models such as GPT-4V and Gemini 1.5 perform well, while most open-source models struggle with long, multi-image contexts. As the image count rises, the performance gap between closed- and open-source models widens, underscoring the need for research into long-context multimodal capabilities.
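To make the evaluation setup concrete, below is a minimal sketch of how a long-context, multi-image benchmark item might be scored. The `MultiImageSample` structure, the `query_model` stub, and the exact-match metric are illustrative assumptions for this summary, not MileBench's actual harness.

```python
from dataclasses import dataclass, field

@dataclass
class MultiImageSample:
    """One hypothetical benchmark item: a long text context interleaved with images."""
    context: str                                           # long textual context
    image_paths: list[str] = field(default_factory=list)   # the interleaved images
    question: str = ""
    answer: str = ""                                       # gold answer

def query_model(sample: MultiImageSample) -> str:
    """Stub: send the context, images, and question to a multimodal LLM
    and return its textual response. Plug in a real model API here."""
    raise NotImplementedError

def evaluate(samples: list[MultiImageSample]) -> float:
    """Score the model with simple exact-match accuracy over all samples."""
    correct = 0
    for sample in samples:
        prediction = query_model(sample).strip().lower()
        correct += prediction == sample.answer.strip().lower()
    return correct / len(samples) if samples else 0.0
```

Grouping samples by image count and calling `evaluate` per group would reproduce the kind of analysis described above, where accuracy is tracked as the number of images grows.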
