Research field: Artificial Intelligence
Venue: Neural Computing and Applications
The speech signal is one of the most effective data sources in human–computer interaction and is widely used in applications such as speech and speaker recognition, emotion recognition, language recognition, and age and gender recognition. In this study, two convolutional neural networks, one 1D and one 2D, are designed to recognize the age and gender class of the speaker. Each model is built by stacking four feature learning blocks (FLBs) and one classification block. The models take two different feature vectors as input, both formed from mel-frequency cepstral coefficients (MFCCs). Each FLB consists of a convolution layer, a batch normalization layer, a ReLU layer, a max pooling layer, and a dropout layer, while the classification block consists of a flatten layer, two fully connected layers, and a softmax layer. In addition to tuning parameters by manual search, the study optimizes the models by trying different combinations of the basic components that make up the FLBs. In experiments on the Common Voice Turkish dataset, the highest validation accuracy is 66.26% for the 1D model and 94.40% for the 2D model. These results indicate that the proposed 2D model is effective for age and gender recognition.
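The layer stack described in the abstract can be made concrete with a small shape-tracing sketch for the 1D model. This is not the authors' implementation: the channel counts, kernel size, pooling size, "same" padding, MFCC vector length, hidden width, and number of classes are all hypothetical values chosen for illustration, since the abstract does not state them. Only the ordering of layers (four FLBs followed by flatten, two fully connected layers, and softmax) comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FLB:
    """One feature learning block: conv -> batch norm -> ReLU -> max pool -> dropout."""
    out_channels: int      # hypothetical value, not given in the abstract
    kernel_size: int = 3   # hypothetical; unused for shapes because padding is "same"
    pool_size: int = 2     # hypothetical

    def output_shape(self, shape):
        # shape is (channels, length). The convolution is assumed to use
        # "same" padding, so it changes only the channel count.
        _, length = shape
        # Batch norm, ReLU, and dropout keep the shape; max pooling
        # floor-divides the length by pool_size.
        return (self.out_channels, length // self.pool_size)


def trace_1d_model(n_mfcc, blocks, hidden_units, n_classes):
    """Return (layer name, output shape) pairs for the whole stack."""
    shape = (1, n_mfcc)  # a single-channel MFCC feature vector
    trace = [("input", shape)]
    for i, block in enumerate(blocks, start=1):
        shape = block.output_shape(shape)
        trace.append((f"FLB{i}", shape))
    # Classification block: flatten, two fully connected layers, softmax.
    trace.append(("flatten", (shape[0] * shape[1],)))
    trace.append(("fc1", (hidden_units,)))
    trace.append(("fc2 + softmax", (n_classes,)))
    return trace


# All numbers below are illustrative assumptions.
blocks = [FLB(16), FLB(32), FLB(64), FLB(128)]
trace = trace_1d_model(n_mfcc=40, blocks=blocks, hidden_units=64, n_classes=6)
for name, shape in trace:
    print(name, shape)
```

Tracing shapes this way shows how quickly four rounds of pooling shrink a short 1D feature vector, which is one reason the choice of pooling size and input length matters when the FLB components are varied.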
| Publication year | 2023 |
|---|---|
| Citations | 9 |
| Country of publication | Turkey |
| Site | Springer |