Benchmarking large language models on long, multi-image tasks

Published on: 29 April 2024

Primary Category: Computation and Language

Paper Authors: Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang

Key Details

MileBench tests models on long texts and multi-image tasks

Includes diagnostic and realistic evaluations

Closed-source models excel, open-source models struggle

Performance declines as the number of images increases

Further research is needed to enhance long-context, multi-image abilities

AI generated summary

This paper introduces MileBench, a new benchmark that tests multimodal large language models on their ability to process long contexts containing multiple images. It includes both diagnostic and realistic evaluations spanning comprehension and generation tasks. Results show that closed-source models such as GPT-4V and Gemini 1.5 perform well, while most open-source models struggle with long, multi-image contexts. As the image count rises, the performance gap between closed- and open-source models widens, underscoring the need for research into long-context multimodal capabilities.
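To make the evaluation setup concrete, below is a minimal sketch of how a long-context, multi-image benchmark item might be scored. The `MultiImageSample` structure, the `query_model` stub, and the exact-match metric are illustrative assumptions for this summary, not MileBench's actual harness.

```python
from dataclasses import dataclass, field

@dataclass
class MultiImageSample:
    """One hypothetical benchmark item: a long text context interleaved with images."""
    context: str                                           # long textual context
    image_paths: list[str] = field(default_factory=list)   # the interleaved images
    question: str = ""
    answer: str = ""                                       # gold answer

def query_model(sample: MultiImageSample) -> str:
    """Stub: send the context, images, and question to a multimodal LLM
    and return its textual response. Plug in a real model API here."""
    raise NotImplementedError

def evaluate(samples: list[MultiImageSample]) -> float:
    """Score the model with simple exact-match accuracy over all samples."""
    correct = 0
    for sample in samples:
        prediction = query_model(sample).strip().lower()
        correct += prediction == sample.answer.strip().lower()
    return correct / len(samples) if samples else 0.0
```

Grouping samples by image count and calling `evaluate` per group would reproduce the kind of analysis described above, where accuracy is tracked as the number of images grows.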
