Research field: Artificial Intelligence
Venue: Signal, Image and Video Processing
Speech Emotion Recognition (SER) is a crucial task in human-computer interaction, enabling intelligent systems to interpret and respond to human emotions effectively. While deep learning techniques have shown promise in SER, they often require high computational resources and prolonged training times, which leaves room for improvement in training efficiency. To address these issues, this study introduces novel curriculum learning (CL) approaches, including Dual-Space Curriculum Learning (DSCL), Task Space Curriculum Learning (TSCL), and Input Space Curriculum Learning (ISCL), aimed at improving accuracy while reducing training time. Experiments on three benchmark datasets, the Toronto Emotional Speech Set (TESS), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion (SAVEE), demonstrated notable improvements over both traditional deep learning approaches and state-of-the-art methods. The DSCL approach achieved accuracies of 99.71%, 86.11%, and 75.01% on TESS, RAVDESS, and SAVEE, respectively, surpassing existing state-of-the-art techniques. Compared to the traditional deep learning approach, DSCL improved accuracy by 0.87%, 5.44%, and 3.79% and reduced training time by 52.66%, 49.03%, and 3.60% on TESS, RAVDESS, and SAVEE, respectively. The ISCL and TSCL approaches also demonstrated competitive accuracy, further validating the effectiveness of curriculum-based training.
| Publication year | 2025 |
|---|---|
| Citations | 0 |
| Country of publication | Andorra |
| Site | Springer |