Research field: Artificial Intelligence
Venue: Signal, Image and Video Processing
Automatic speech recognition systems degrade significantly in noisy environments; audio-visual speech recognition mitigates this by fusing visual cues with audio. However, existing datasets are mostly recorded in constrained environments, which limits how well models generalize to real-world scenarios. To address this gap, the Chinese Daily Scene Audio-Visual (CDSAV) dataset is introduced. Recorded on consumer-grade smartphones, this multimodal dataset reflects real-world conditions by capturing natural daily interactions across diverse indoor environments, with a mixture of frontal and profile head poses. Although existing methods have improved performance with tailored feature-extraction encoders, cross-modal modeling remains challenging, particularly in sentence-level tasks that depend on context. To address this limitation, a dual-branch context encoder is proposed that hierarchically models global contextual patterns and local temporal dynamics across the visual and audio modalities. Furthermore, the framework is trained with a hybrid of connectionist temporal classification (CTC) and recurrent neural network transducer (RNN-T) losses, combining frame-level regularization with sequence-aware optimization. Experiments show state-of-the-art performance: for audio-visual speech recognition, the character error rate (CER) is 13.45% on VispeR, 2.46% on CMLR, and 8.23% on CDSAV. For visual speech recognition, the CER is 45.23%, 20.78%, and 33.98% on VispeR, CMLR, and CDSAV, respectively. These results support the usefulness of the new dataset and the effectiveness of the proposed architecture.
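
The abstract reports all results as character error rate (CER). As a minimal sketch of how this metric is commonly computed, the snippet below divides the character-level Levenshtein distance between reference and hypothesis by the total number of reference characters. The function names `edit_distance` and `cer`, and the example sentence, are illustrative assumptions. The paper's exact scoring protocol (for example, how punctuation or whitespace is normalized) is not stated in the abstract and may differ.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


# Illustrative example (not from the paper): one wrong character out of six.
refs = ["今天天气很好"]
hyps = ["今天天汽很好"]
print(f"CER = {cer(refs, hyps):.2%}")
```

Here a single substituted character in a six-character reference gives a CER of one sixth. Because the metric is normalized by reference length, insertions can push it above 100% for very poor hypotheses.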
| Publication year | 2025 |
|---|---|
| Citations | 0 |
| Country of publication | Andorra |
| Site | Springer |