Providing and evaluating a model for big data anonymization streams by using in-memory processing


Research field: Databases



Venue: Knowledge and Information Systems


Abstract

Extracting valuable information from large social network data sources while protecting confidentiality and preventing data disclosure is a significant challenge in big data environments. Traditional anonymization methods often cannot cope with the volume, variety, and velocity of big data, which leads to high data loss and inefficiency. This article addresses these challenges by proposing a novel anonymization method based on K-means clustering within the Spark framework, leveraging its in-memory processing capabilities. Our model uses K-means clustering to determine optimal cluster heads, significantly reducing data loss and the risk of identity disclosure. By using Spark's RDDs and the MLlib component, our method achieves faster processing times than traditional methods that rely on non-in-memory big data tools. Performance evaluation shows that at k = 9 the cost factor reaches its minimum of 0.20, indicating the efficiency and effectiveness of our approach. The proposed method not only improves processing speed but also keeps data loss minimal, making it suitable for real-time anonymization of big data streams. This work provides a balanced solution that meets the need for high-speed data anonymization while maintaining data privacy and utility.
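The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library rather than Spark's RDDs or MLlib, it assumes that "cluster heads" correspond to K-means centroids, and it uses mean squared distance as a stand-in for the paper's cost factor, which the abstract does not define. The names `kmeans`, `anonymize`, `information_loss`, and `n_clusters` are hypothetical, and whether `n_clusters` matches the paper's k is also an assumption. Plain K-means does not enforce a minimum cluster size, so this sketch by itself gives no k-anonymity guarantee.

```python
import random
from math import dist


def kmeans(points, n_clusters, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a list of centroid tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_clusters)  # initial centroids drawn from the data
    for _ in range(iters):
        # Assignment step: attach every record to its nearest centroid.
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            nearest = min(range(n_clusters), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def anonymize(points, centroids):
    """Replace each record by its nearest centroid (assumed 'cluster head')."""
    return [min(centroids, key=lambda c: dist(p, c)) for p in points]


def information_loss(points, anonymized):
    """Mean squared distance between original and anonymized records
    (an assumed proxy for data loss, not the paper's cost factor)."""
    return sum(dist(p, a) ** 2 for p, a in zip(points, anonymized)) / len(points)


# Toy numeric quasi-identifiers, e.g. (age, postcode prefix); values are made up.
records = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (10.0, 10.0), (10.2, 9.9), (9.8, 10.1)]
heads = kmeans(records, n_clusters=2)
released = anonymize(records, heads)
print(released)
print(information_loss(records, released))
```

Using more clusters lowers the loss measured here but leaves fewer records sharing each released value, which is the privacy and utility trade-off the abstract refers to.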


Author Profile
Elham Shamsinejad

Computer Engineering (Software), Specializing in Data Science, Tehran, Iran

India
Author Profile
Hamid Banirostam

Computer Engineering (Software), Specializing in Data Science, Tehran, Iran

India
Author Profile
Touraj BaniRostam

Master of Data Analytics, University of Niagara Falls, Niagara Falls, ON, Canada

Canada

📄 Paper information

Publication year: 2025
Citations: 0
Publication countries: Andorra, India, Canada
Site: Springer

Related papers (53)