Paper Title:
MileBench: Benchmarking MLLMs in Long Context
Published on:
29 April 2024
Primary Category:
Computation and Language
Paper Authors:
Dingjie Song,
Shunian Chen,
Guiming Hardy Chen,
Fei Yu,
Xiang Wan,
Benyou Wang
MileBench tests MLLMs on long-context, multi-image tasks
Includes both diagnostic and realistic evaluation sets
Closed-source models excel, while open-source models struggle
Performance declines as the number of images increases
Further research is needed to improve long-context, multi-image capabilities
Benchmarking multimodal large language models on long-context, multi-image tasks
This paper introduces MileBench, a new benchmark for testing multimodal large language models (MLLMs) on their ability to process long contexts containing multiple images. It includes both diagnostic and realistic evaluations spanning comprehension and generation tasks. Results show that closed-source models such as GPT-4V and Gemini 1.5 perform well, while most open-source models struggle with long, multi-image contexts. As the image count rises, the performance gap between closed- and open-source models widens, highlighting the need to improve the long-context multimodal capabilities of open-source models.