Grounding Large Language Models

Published on:

6 November 2023

Primary Category:

Computer Vision and Pattern Recognition

Paper Authors:

Hanoona Rasheed,

Muhammad Maaz,

Sahal Shaji,

Abdelrahman Shaker,

Salman Khan,

Hisham Cholakkal,

Rao M. Anwer,

Eric Xing,

Ming-Hsuan Yang,

Fahad S. Khan


Key Details

GLaMM generates grounded text responses with segmentation masks

It accepts text and visual prompts for flexible interaction

Paper introduces the new Grounded Conversation Generation (GCG) task

GranD dataset covers 7.5M unique concepts grounded in 810M regions

GLaMM is shown to be effective on various vision-language tasks

AI generated summary

This paper introduces GLaMM, a new large language model that can generate natural language responses integrated with object segmentation masks. This allows it to ground its responses visually. GLaMM can accept both text and visual prompts, enabling interaction at different levels of detail. The authors propose a new Grounded Conversation Generation task, requiring models to produce dense object groundings in natural images. They also introduce a large dataset called GranD to train models like GLaMM, comprising over 7 million concepts grounded in 810 million image regions.
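To make the idea of a "grounded" response concrete, the sketch below shows one plausible data shape for such an output: a natural-language response in which certain phrases are linked, by character span, to segmentation masks. This is an illustrative schema only; the class and field names (`GroundedPhrase`, `mask_rle`, etc.) are assumptions for the example and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GroundedPhrase:
    """A phrase in the response tied to a segmentation mask (hypothetical schema)."""
    text: str       # the grounded phrase, e.g. "a black dog"
    start: int      # character offset where the phrase begins in the response
    end: int        # character offset where the phrase ends (exclusive)
    mask_rle: str   # run-length-encoded binary segmentation mask (placeholder here)

@dataclass
class GroundedResponse:
    """A natural-language response with pixel-level groundings."""
    response: str
    groundings: List[GroundedPhrase]

    def phrase_spans(self) -> List[str]:
        # Recover each grounded phrase directly from the response text.
        return [self.response[g.start:g.end] for g in self.groundings]

# Example: a response where two noun phrases are grounded to masks.
resp = GroundedResponse(
    response="A black dog chases a red ball across the lawn.",
    groundings=[
        GroundedPhrase(text="A black dog", start=0, end=11, mask_rle="..."),
        GroundedPhrase(text="a red ball", start=19, end=29, mask_rle="..."),
    ],
)
```

A dense-grounding task in this spirit would require every object-referring phrase in the response to carry such a mask, rather than grounding only one object per reply.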
