ITRT(IT Research Trends)

Vietnamese Automatic Speech Recognition Utilizing Audio and Visual Data

Research field: Artificial Intelligence

Paper keywords: #speech #audio #english #vietnam #vietnamese

Conference: 2025 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)

Abstract

Inspired by humans comprehending speech in a multi-modal manner, a growing number of audio-visual speech recognition datasets have been constructed. However, most of these datasets focus on English and provide only a limited amount of multiview video data. To address these limitations, this study first constructs a comprehensive dataset for Vietnamese Audio-visual Speech Recognition (VASR). A ViAVSP-LLM speech recognition system consisting of an AV-HuBERT encoder and VinaLLaMA decoder is then proposed. When applied to the VASR dataset (1,045 hours), ViAVSP-LLM achieves a Word Error Rate of 12.03% on the test set. Comparative experiments conducted using current audio-only speech recognition models show that the addition of visual data significantly improves the speech recognition accuracy. In addition, the use of a large language model further improves the performance and contextual understanding of the model. This research is the first in Vietnam to explore automatic speech recognition using both audio and visual data and is expected to facilitate multimodal research in broader areas.
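The Word Error Rate quoted in the abstract is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' scoring script: the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and the abstract does not state which text normalization the paper applies. Note that Vietnamese orthography separates syllables with spaces, so splitting on whitespace scores syllables rather than multi-syllable words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Tokens are obtained by whitespace splitting (an assumption for this
    sketch; real evaluations usually normalize case and punctuation first).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_tok != hyp_tok)
            deletion = prev[j] + 1       # reference token missing from output
            insertion = curr[j - 1] + 1  # extra token in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted token out of four reference tokens.
    print(word_error_rate("hôm nay trời đẹp", "hôm nay trời đep"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.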

📄 Paper information

Publication year	2025
Citations	3
Country of publication	Namibia, Andorra
Site	IEEE
Likes	0

Tan-Thinh Duong

Van-Minh Nguyen

Hong-Duyen-Khanh Pham

Thanh-Hai Le