8 February 2024
Computer Vision and Pattern Recognition
InstaGen integrates grounding into diffusion models to synthesize labeled images
Supervised pre-training aligns text and visual features on base categories
Self-training extends alignment to novel categories
As a data synthesizer, InstaGen boosts open-vocabulary (+4.5 AP) and data-sparse (+1.2 to +5.2 AP) detection
It outperforms state-of-the-art CLIP-based methods
Enhancing detection with synthetic images
This paper introduces InstaGen, a framework for generating synthetic images with object bounding boxes for arbitrary categories. An instance-level grounding module is integrated into a diffusion model to align the text embeddings of category names with the model's visual features and infer bounding-box coordinates. Through supervised pre-training on base categories and self-training on novel categories, InstaGen serves as a data synthesizer for enhancing object detectors. Experiments demonstrate superior performance over state-of-the-art CLIP-based methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to +5.2 AP) detection.
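To make the grounding idea concrete, the sketch below illustrates the general principle of aligning a category's text embedding with per-pixel visual features and reading off a box from the high-similarity region. This is a hypothetical, simplified illustration using cosine similarity and thresholding; it is not InstaGen's actual grounding head, and all names and shapes here are invented for the example.

```python
import numpy as np

def ground_category(feature_map, text_emb, thresh=0.5):
    """Conceptual sketch (not InstaGen's method): score each spatial
    location by cosine similarity between its visual feature and a
    category text embedding, then return the bounding box of the
    region whose similarity exceeds a threshold."""
    # Normalize features and embedding so the dot product is a cosine similarity.
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    sim = f @ t                       # (H, W) similarity map
    mask = sim > thresh
    if not mask.any():
        return None                   # category not present
    ys, xs = np.nonzero(mask)
    # Box as (x_min, y_min, x_max, y_max) in pixel indices.
    return (xs.min(), ys.min(), xs.max(), ys.max())

# Toy example: plant a region aligned with the embedding in random features.
rng = np.random.default_rng(0)
H, W, D = 16, 16, 64
feats = rng.normal(size=(H, W, D)) * 0.01
emb = rng.normal(size=D)
feats[4:8, 6:12] += emb               # rows 4..7, cols 6..11 match the category
print(ground_category(feats, emb))    # → (6, 4, 11, 7)
```

In InstaGen itself this alignment is learned via supervised pre-training on base categories and extended to novel categories by self-training, rather than fixed-threshold similarity as above.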