
Multimodal reasoning for video question answering

Published on:

29 February 2024

Primary Category:

Computation and Language

Paper Authors:

Kate Sanders,

Nathaniel Weir,

Benjamin Van Durme


Key Details

Proposes TV-TREES, an interpretable video QA system producing multimodal reasoning trees

Introduces new task of multimodal entailment tree generation

Achieves SOTA zero-shot results on challenging TVQA benchmark

Provides transparent and reliable reasoning unlike black-box models

AI-generated summary


The authors propose TV-TREES, the first system to generate interpretable trees showing chains of reasoning across both the language and visual content of videos. Evaluated on the TVQA video question-answering benchmark, it achieves state-of-the-art zero-shot performance using full-length clips as input while also providing transparent reasoning.
