Research field: Artificial Intelligence
Venue: Multimedia Tools and Applications
Human speech carries both linguistic information and the speaker's emotion; speech emotion is a channel through which one person conveys their mental state to another. Traditional methods for speech emotion recognition (SER) model a whole sentence with a single informative representation vector, which cannot capture dynamic temporal changes. The main objectives of the proposed system are to handle temporal changes better, process speech based on its content, provide better segmentation analysis, handle noise more effectively, and perform context-sensitive processing. We therefore propose Chunk Level Speech Emotion Recognition (CLeSER), a dynamic chunking approach that splits each audio clip into a fixed number of chunks of equal duration by adjusting the overlap between them. Feature analysis plays an important role in system performance. We use both Mel spectrograms and Gammatone-like spectrograms as input features, as they have been suggested to improve the efficiency of SER. Finally, we use a CNN to extract high-level features from the raw spectrograms and an LSTM to aggregate long-term dependencies. We tested the model on two datasets. On the RAVDESS dataset with augmentation, CLeSER achieved an accuracy of 70% with Mel spectrograms and 74% with Gammatone spectrograms. On the MSP-Podcast dataset, after splitting into chunks, it achieved an accuracy of 50% with Mel spectrograms and 52% with Gammatone spectrograms, both higher than the accuracy obtained before chunking.
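The chunking idea described in the abstract (a fixed number of equal-duration chunks, with the overlap adjusted to fit the clip length) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `chunk_starts` and `split_into_chunks`, the even spacing of chunk starts, and the zero-padding of short clips are all hypothetical choices made for this example.

```python
def chunk_starts(n_samples, chunk_len, n_chunks):
    """Return n_chunks start indices, evenly spaced over the signal.

    Assumption (not from the paper): starts are spaced so the first chunk
    begins at sample 0 and the last chunk ends at the final sample.
    """
    if chunk_len <= 0 or n_chunks <= 0:
        raise ValueError("chunk_len and n_chunks must be positive")
    if n_chunks == 1 or n_samples <= chunk_len:
        # Nothing to spread out: every chunk starts at the beginning.
        return [0] * n_chunks
    # The step shrinks for short clips (more overlap) and grows for long ones.
    step = (n_samples - chunk_len) / (n_chunks - 1)
    return [round(i * step) for i in range(n_chunks)]


def split_into_chunks(signal, chunk_len, n_chunks):
    """Split a 1-D sequence of samples into n_chunks chunks of chunk_len."""
    signal = list(signal)
    if len(signal) < chunk_len:
        # Assumption: clips shorter than one chunk are zero-padded.
        signal = signal + [0.0] * (chunk_len - len(signal))
    starts = chunk_starts(len(signal), chunk_len, n_chunks)
    return [signal[s:s + chunk_len] for s in starts]


if __name__ == "__main__":
    # A 10-sample toy signal split into 3 chunks of 4 samples each.
    for chunk in split_into_chunks(range(10), chunk_len=4, n_chunks=3):
        print(chunk)
```

In this sketch, a clip longer than `chunk_len * n_chunks` samples yields a step larger than the chunk length, so consecutive chunks leave gaps rather than overlapping; whether the original method allows that is not stated in the abstract.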
| Publication year | 2025 |
|---|---|
| Citations | 0 |
| Country of publication | India |
| Site | Springer |
| Likes | 0 |