Vietnamese Automatic Speech Recognition Utilizing Audio and Visual Data


Research field: Artificial Intelligence



Conference: 2025 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)


Abstract

Inspired by humans comprehending speech in a multi-modal manner, a growing number of audio-visual speech recognition datasets have been constructed. However, most of these datasets focus on English and provide only a limited amount of multiview video data. To address these limitations, this study first constructs a comprehensive dataset for Vietnamese Audio-visual Speech Recognition (VASR). A ViAVSP-LLM speech recognition system consisting of an AV-HuBERT encoder and VinaLLaMA decoder is then proposed. When applied to the VASR dataset (1,045 hours), ViAVSP-LLM achieves a Word Error Rate of 12.03% on the test set. Comparative experiments conducted using current audio-only speech recognition models show that the addition of visual data significantly improves the speech recognition accuracy. In addition, the use of a large language model further improves the performance and contextual understanding of the model. This research is the first in Vietnam to explore automatic speech recognition using both audio and visual data and is expected to facilitate multimodal research in broader areas.
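The Word Error Rate quoted above is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate`, the whitespace tokenization, and the sample sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are separated by whitespace. For Vietnamese this
    treats each syllable as one token; the paper's own tokenization and
    text normalization are not specified here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word out of five.
print(word_error_rate("tôi đang học tiếng việt", "tôi đang học tiếng anh"))
```

A test-set figure such as 12.03% is usually computed by summing edit counts over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance rates; which convention the paper uses is not stated in the abstract.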


Authors

Tan-Thinh Duong
Department of Information Technology, FPT University, Ho Chi Minh City, Viet Nam

Van-Minh Nguyen
Department of Information Technology, FPT University, Ho Chi Minh City, Viet Nam

Hong-Duyen-Khanh Pham
Department of Information Technology, FPT University, Ho Chi Minh City, Viet Nam

Paper information

Publication year: 2025
Citations: 3
Publication country: Namibia, Andorra
Site: IEEE
