GRiT: A Generative Region-to-Text Transformer for Object Understanding


Research field: Software Development



Conference: European Conference on Computer Vision


Abstract

This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where region locates objects and text describes objects. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate natural language for objects. With the same model architecture, GRiT describes objects via not only simple nouns, but also rich descriptive sentences. We define GRiT as open-set object understanding, as it has no limit on object description output from the model architecture perspective. Experimentally, we apply GRiT to dense captioning and object detection tasks. GRiT achieves superior dense captioning performance (15.5 mAP on Visual Genome) and competitive detection accuracy (60.4 AP on COCO test-dev). Code is available at https://github.com/JialianW/GRiT.
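The three-stage pipeline described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation (that lives in the linked repository). The names `RegionText`, `describe_objects`, the `(x1, y1, x2, y2)` box format, and the toy stand-in components are all assumptions made for illustration. The sketch only shows how a visual encoder, a foreground object extractor, and a text decoder could be composed to yield <region, text> pairs.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Assumed box format for illustration: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RegionText:
    """One <region, text> pair: where the object is and how it is described."""
    region: Box
    text: str


def describe_objects(
    image: Any,
    visual_encoder: Callable[[Any], Any],
    object_extractor: Callable[[Any], List[Box]],
    text_decoder: Callable[[Any, Box], str],
) -> List[RegionText]:
    """Compose the three stages named in the abstract (hypothetical interfaces)."""
    features = visual_encoder(image)      # extract image features
    boxes = object_extractor(features)    # localize foreground objects
    # Generate free-form text for each localized region.
    return [RegionText(region=b, text=text_decoder(features, b)) for b in boxes]


# Toy stand-ins so the sketch runs; real components would be neural networks.
def toy_encoder(image: Any) -> Any:
    return image


def toy_extractor(features: Any) -> List[Box]:
    return [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 20.0, 20.0)]


def toy_noun_decoder(features: Any, box: Box) -> str:
    return "dog"


def toy_caption_decoder(features: Any, box: Box) -> str:
    return "a brown dog lying on the grass"


image = [[0, 0], [0, 0]]
nouns = describe_objects(image, toy_encoder, toy_extractor, toy_noun_decoder)
captions = describe_objects(image, toy_encoder, toy_extractor, toy_caption_decoder)
```

Because the text side is an unconstrained string, the same interface covers both short class names (object detection) and full sentences (dense captioning), which mirrors the abstract's point that the architecture places no limit on the description output.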


Author Profile
Jialian Wu

State University of New York at Buffalo, Buffalo, USA

Austria
Author Profile
Jianfeng Wang

Advanced Micro Devices, Santa Clara, USA

United States
Author Profile
Zhengyuan Yang

Microsoft, Redmond, USA

United States

Paper Information

Publication year: 2024
Citations: 0
Publication country: United States, Austria
Site: Springer
