Long video comprehension benchmark

Published on:

14 May 2024

Primary Category:

Computer Vision and Pattern Recognition

Paper Authors:

Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein


Key Details

305,000 multiple-choice questions testing long-form video understanding

Covers visual, temporal, and multimodal reasoning abilities

Questions require interpreting video and dialogue content

State-of-the-art models underperform humans by over 25%

Highlights remaining challenges in long-form video comprehension

AI generated summary

This paper introduces CinePile, a benchmark for evaluating long-form video understanding models. It contains over 300,000 multiple-choice questions covering visual, temporal, and multimodal comprehension. Models are tested on their ability to reason about events, human-object interactions, and plot progression in long video scenes, using both visual and dialogue information. Even state-of-the-art models lag significantly behind human performance, highlighting the challenges that remain in long-form video understanding.
