ITRT (IT Research Trends)

Audio visual chinese speech recognition in daily scene based on cross modal context encoder

Research field: Artificial Intelligence

Paper keywords: #improvements #audio #noisy #smartphones #synergizing

Venue: Signal, Image and Video Processing

Abstract

Automatic speech recognition systems face significant performance degradation in noisy environments, whereas audio-visual speech recognition mitigates this issue through visual cue fusion. However, existing datasets primarily originate from constrained environments, restricting model generalization to real-world scenarios. To address this gap, the Chinese Daily Scene Audio-Visual (CDSAV) dataset is introduced. Recorded on consumer-grade smartphones, this multimodal dataset reflects real-world conditions by capturing natural daily interactions that include diverse indoor environments and a mixture of frontal and profile head poses. Although existing methods have designed tailored encoders to extract features and achieved performance improvements, the cross-modal architecture remains challenging, particularly in sentence-level tasks requiring contextual dependencies. To resolve this limitation, a dual-branch context encoder is proposed to hierarchically model global contextual patterns and local temporal dynamics across visual and audio modalities. Furthermore, the framework integrates hybrid connectionist temporal classification (CTC) and recurrent neural network transducer (RNN-T) loss functions, synergizing frame-level regularization with sequence-aware optimization. Extensive experiments demonstrate state-of-the-art performance, achieving character error rates (CER) of 13.45% on VispeR, 2.46% on CMLR, and 8.23% on CDSAV for audio-visual speech recognition. For visual speech recognition, CER values of 45.23%, 20.78%, and 33.98% are attained on the VispeR, CMLR, and CDSAV benchmarks, respectively. The experimental results validate the effectiveness of the established dataset while demonstrating the state-of-the-art performance of the proposed architecture.
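The results above are reported as character error rate (CER). As a rough illustration of how such a figure is commonly computed, the sketch below divides the character-level Levenshtein distance between a reference and a hypothesis by the reference length. The function names `edit_distance` and `cer` are hypothetical, and the paper's exact scoring protocol (text normalization, punctuation handling, corpus-level aggregation) is not stated in the abstract, so this is an assumption rather than the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Per-utterance CER: edit distance divided by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of six gives a CER of 1/6.
    print(cer("今天天气很好", "今天天汽很好"))
```

A corpus-level score is usually obtained by summing edit distances over all utterances and dividing by the total number of reference characters, which can differ from the mean of per-utterance values.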

📄 Paper information

Publication year	2025
Citations	0
Country of publication	Andorra
Site	Springer
Likes	0

Yijun Liu

Zhihua Qu

Jie Li

Qian Zhang

Bowen Liu

Wujun Chen

Related papers (41)
