ITRT(IT Research Trends)

Integrating Vision-Tool to Enhance Visual-Question-Answering in Special Domains

Research field: Software Development

Paper keywords: #trained #competitive #tools #recipes #puzzle

Conference: Pacific Rim International Conference on Artificial Intelligence

Abstract

Visual Question Answering (VQA) requires answering questions about visual information. Although pre-trained vision-language models (VLMs) have achieved promising results on various VQA benchmarks, they adapt poorly to VQA in special domains, which require specific vision and reasoning skills. Large language models (LLMs) possess outstanding knowledge and reasoning skills, but they cannot be applied to VQA directly because they lack vision support. We introduce a framework that enhances the performance of VLMs and enables the use of LLMs in special-domain VQA. The framework leverages computer vision (CV) tools and pre-defined tool recipes to provide the models with the information they need to solve the task. Along with the framework, we introduce three tool recipes for special VQA domains: (i) Visual Puzzle, (ii) Visual Arithmetic Reasoning, and (iii) Multilingual Scene-text. Experiments show that the proposed framework and tool recipes significantly outperform competitive VLMs on various tasks in both fine-tuning and few-shot settings, establishing new state-of-the-art results.
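The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, assuming a tool recipe is an ordered list of CV tools whose text outputs are placed in the prompt of a text-only model. Every name here (`ToolRecipe`, `run_recipe`, `build_prompt`, `fake_ocr`, `fake_detector`) is hypothetical and not taken from the paper; the tools are stubs standing in for real CV models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "tool" here is any callable that turns raw image bytes into a text description.
Tool = Callable[[bytes], str]


@dataclass
class ToolRecipe:
    """Hypothetical recipe: an ordered list of tool names for one VQA domain."""
    name: str
    steps: List[str]


def run_recipe(recipe: ToolRecipe, tools: Dict[str, Tool], image: bytes) -> List[str]:
    """Run each tool named in the recipe, in order, and label its text output."""
    return [f"{step}: {tools[step](image)}" for step in recipe.steps]


def build_prompt(question: str, tool_outputs: List[str]) -> str:
    """Combine the tool outputs and the question into a text-only prompt."""
    context = "\n".join(tool_outputs)
    return f"Visual context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Stub tools (assumptions for illustration, not real CV models).
def fake_ocr(image: bytes) -> str:
    return "text found: '3 + 4'"


def fake_detector(image: bytes) -> str:
    return "objects: whiteboard"


tools = {"ocr": fake_ocr, "detector": fake_detector}
recipe = ToolRecipe(name="visual_arithmetic", steps=["detector", "ocr"])
outputs = run_recipe(recipe, tools, b"")
prompt = build_prompt("What is the result?", outputs)
print(prompt)
```

In this sketch the resulting prompt could be passed to any text-only model, which is one way a model without vision support might be given the visual information it needs.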

📄 Paper information

Publication year	2024
Citations	0
Country of publication	Andorra
Site	Springer
Likes	0

Nguyen-Khang Le

Dieu-Hien Nguyen

Le Minh Nguyen

Related papers (72)