Research field: Artificial Intelligence
Conference: 2025 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)
Inspired by the multimodal way in which humans comprehend speech, a growing number of audio-visual speech recognition datasets have been constructed. However, most of these datasets focus on English and provide only a limited amount of multiview video data. To address these limitations, this study first constructs a comprehensive dataset for Vietnamese Audio-visual Speech Recognition (VASR). It then proposes ViAVSP-LLM, a speech recognition system consisting of an AV-HuBERT encoder and a VinaLLaMA decoder. When applied to the VASR dataset (1,045 hours), ViAVSP-LLM achieves a Word Error Rate of 12.03% on the test set. Comparative experiments against current audio-only speech recognition models show that adding visual data significantly improves speech recognition accuracy. In addition, the use of a large language model further improves the performance and contextual understanding of the system. This research is the first in Vietnam to explore automatic speech recognition using both audio and visual data and is expected to facilitate multimodal research in broader areas.
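
The Word Error Rate reported above is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Illustrative sentences only; they are not taken from the VASR dataset.
wer = word_error_rate("hôm nay trời đẹp", "hôm nay trời mưa")
print(f"WER = {wer:.2%}")
```

Under this definition, a WER of 12.03% corresponds to roughly 12 word-level errors per 100 reference words. Published evaluations often normalize the text before scoring, so figures computed with this sketch may not match a paper's numbers exactly.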
| Publication year | 2025 |
|---|---|
| Citations | 3 |
| Country of publication | Namibia, Andorra |
| Site | IEEE |