Integrating Vision-Tool to Enhance Visual-Question-Answering in Special Domains


Research field: Software Development



Conference: Pacific Rim International Conference on Artificial Intelligence


Abstract

Visual-Question-Answering (VQA) requires answering questions about visual information. Although pre-trained vision-language models (VLMs) have achieved promising results on various VQA benchmarks, they show limitations when adapted to VQA in special domains, which require specific vision and reasoning skills. Large language models (LLMs) possess outstanding knowledge and reasoning skills, but they cannot be applied directly to VQA because they lack vision support. We introduce a framework that enhances the performance of VLMs and enables the use of LLMs in special-domain VQA. The framework leverages computer vision (CV) tools and pre-defined tool recipes to provide the models with the information needed to solve the task. Along with the framework, we introduce three tool recipes for special VQA domains: (i) Visual Puzzle, (ii) Visual Arithmetic Reasoning, and (iii) Multilingual Scene-text. Experiments show that the proposed framework and tool recipes significantly outperform competitive VLMs on various tasks in both fine-tuning and few-shot settings, establishing new state-of-the-art results.
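
The abstract does not specify how a tool recipe is represented, so the following is only a minimal sketch of one possible reading: a recipe is an ordered list of CV tools whose text outputs are folded into a prompt that a text-only LLM could answer. All names here (`ToolRecipe`, `build_prompt`, `fake_ocr`, `fake_detector`, the prompt layout) are hypothetical and are not taken from the paper; the tools are stubs standing in for real OCR or detection components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "tool" maps an image (here just a file path) to a text description.
Tool = Callable[[str], str]


@dataclass
class ToolRecipe:
    """Hypothetical recipe: an ordered list of tool names for one VQA domain."""
    name: str
    tools: List[str]


def build_prompt(image_path: str, question: str, recipe: ToolRecipe,
                 registry: Dict[str, Tool]) -> str:
    """Run each tool in recipe order and fold its output into a text prompt."""
    lines = []
    for tool_name in recipe.tools:
        output = registry[tool_name](image_path)
        lines.append(f"[{tool_name}] {output}")
    context = "\n".join(lines)
    return f"Visual context:\n{context}\nQuestion: {question}\nAnswer:"


# Stub tools (assumptions) standing in for real CV components.
def fake_ocr(image_path: str) -> str:
    return "text: '3 + 4 = ?'"


def fake_detector(image_path: str) -> str:
    return "objects: 3 apples, 4 oranges"


registry = {"ocr": fake_ocr, "detector": fake_detector}
recipe = ToolRecipe(name="visual_arithmetic", tools=["detector", "ocr"])
prompt = build_prompt("example.png", "How many fruits are there?", recipe, registry)
print(prompt)
```

In this sketch the resulting string would be passed to an LLM or VLM of the reader's choice; keeping tools behind a simple registry means a different domain only needs a different recipe, not new pipeline code.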


Author Profile
Nguyen-Khang Le

Japan Advanced Institute of Science and Technology, Ishikawa, Japan

Andorra
Author Profile
Dieu-Hien Nguyen

Japan Advanced Institute of Science and Technology, Ishikawa, Japan

Andorra
Author Profile
Le Minh Nguyen

Japan Advanced Institute of Science and Technology, Ishikawa, Japan

Andorra

📄 Paper Information

Publication year: 2024
Citations: 0
Country of publication: Andorra
Site: Springer
