Research field: Safety
Venue: SN Computer Science
The rapid proliferation of Android-based devices has been accompanied by a surge in the volume and sophistication of Android malware. Effective classification of these malicious applications is essential for safeguarding user security and privacy. However, Android malware data are imbalanced: some malware families outnumber others, which poses a significant challenge to traditional machine learning algorithms. Imbalanced data can be handled by oversampling and undersampling. Oversampling increases the number of instances in the minority class, and undersampling decreases the number of instances in the majority class, in order to balance the class distribution. Used alone, however, oversampling can lead to overfitting and undersampling to loss of information. Hybrid sampling is used to combine the strengths of both. This paper proposes hybrid sampling techniques that pair an oversampling method with an undersampling method. Two oversampling techniques, the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS), and three undersampling techniques, Tomek links (TOMEK), edited nearest neighbours (ENN), and the neighbourhood cleaning rule (NCR), are used. The proposed hybrid methods are obtained by combining each oversampling technique with an undersampling technique. All hybrid techniques are evaluated on a self-created malware dataset comprising multiple imbalanced malware families. Four machine learning models, support vector machine (SVM), decision tree (DT), random forest (RF), and k-nearest neighbours (K-NN), are trained after applying the hybrid sampling techniques. The experimental results show that ROS-NCR outperforms the other methods in handling imbalanced Android malware families, achieving 95.81% accuracy and an F-measure of 0.95 with RF.
| Publication year | 2025 |
|---|---|
| Citations | 0 |
| Country of publication | Andorra |
| Site | Springer |
| Likes | 0 |
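
The hybrid sampling idea in the abstract (oversample the minority families, then clean the result with an undersampling rule) can be illustrated with a minimal pure-Python sketch. It chains random oversampling with a simplified edited-nearest-neighbours filter, i.e. the ROS-ENN pairing rather than the best-performing ROS-NCR. The function names, the toy 2-D features, and the family labels `famA`/`famB` are hypothetical. Applying the oversampler before the cleaner is an assumption, and the ENN variant shown (majority vote, applied to all classes) may differ from the authors' configuration.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS: duplicate random samples of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def edited_nearest_neighbours(X, y, k=3):
    """Simplified ENN: drop samples whose label differs from the majority of their k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: squared_distance(xi, X[j]),
        )[:k]
        votes = Counter(y[j] for j in neighbours)
        if votes.most_common(1)[0][0] == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: famA is the majority family, famB the minority.
# The point (5.1, 5.0) is a famA sample lying inside the famB cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5.1, 5.0), (5, 5), (5, 6)]
y = ["famA"] * 6 + ["famB"] * 2

X_ros, y_ros = random_oversample(X, y, seed=0)
X_clean, y_clean = edited_nearest_neighbours(X_ros, y_ros, k=3)

print(sorted(Counter(y).items()))        # [('famA', 6), ('famB', 2)]
print(sorted(Counter(y_ros).items()))    # [('famA', 6), ('famB', 6)]
print(sorted(Counter(y_clean).items()))  # [('famA', 5), ('famB', 6)]
```

In this toy run the oversampling step balances the two families, and the cleaning step then removes the single famA point that sits among famB samples. The source does not state which implementation the authors used; in practice a maintained library such as imbalanced-learn, which provides samplers including `RandomOverSampler` and `NeighbourhoodCleaningRule`, would be preferable to hand-written code like this.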