ITRT(IT Research Trends)

Big-MDC: A Multi-Dimensions Cleaning Framework to Improve Big Data Quality

Research field: Databases

Paper keywords: #algorithm #algorithms #outperforms #erroneous #changing

Venue: International Journal of Data Science and Analytics

Abstract

Over the years, data cleaning approaches have focused on traditional data quality issues without considering the scalability of their solutions. The emergence of Big Data has created new data quality challenges, driving the need for efficient and scalable solutions. In this paper, we present a new data cleaning solution that addresses the following challenges: (1) Volume: Most existing approaches struggle with large datasets. We propose a linear-time algorithm that repairs millions of records in seconds. (2) Generality: Existing solutions generally focus on a single aspect of data quality, which prevents the improvement of other aspects. We take six quality dimensions into account. (3) Automaticity: The emerging general solutions lack automaticity and efficiency. Our solution is fully automatic. (4) Lack of context: In the big data setting, quality rules (QRs) are usually not provided. We automatically discover QRs from dirty data. (5) Uncertainty: QRs may themselves be erroneous, which makes it difficult to determine the source of an anomaly. We consider repairing both the QRs and the data. (6) Repair accuracy and consistency: Repair algorithms usually apply updates that minimize data changes, which may introduce new errors and violations. We apply only trusted modifications: dirty patterns are changed to match the most similar trusted patterns, and cells involved in ambiguous cases are marked for later repair using a constraint satisfaction problem formulation, which keeps all repairs consistent. Our experiments show that our solution is efficient, provides high-quality repairs, and outperforms state-of-the-art systems in terms of both effectiveness and efficiency.
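The repair idea in point (6) of the abstract can be illustrated with a small Python sketch. This is not the authors' algorithm. It assumes, purely for illustration, that a pattern (a tuple of cell values) is "trusted" when it occurs at least `min_support` times, and that "most similar" means smallest Hamming distance. The names `hamming`, `repair_patterns`, and `min_support` are hypothetical. Rows tied between several trusted patterns are left unchanged and flagged as ambiguous; the constraint satisfaction step that would resolve them later is not shown.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def repair_patterns(rows, min_support=2):
    """Return (repaired_rows, ambiguous_indices).

    Assumption for illustration only: a pattern is trusted when it occurs
    at least `min_support` times in `rows`. Rows must be hashable tuples.
    """
    counts = Counter(rows)
    trusted = [p for p, c in counts.items() if c >= min_support]

    repaired, ambiguous = [], []
    for i, row in enumerate(rows):
        # Trusted rows are kept as-is; with no trusted pattern at all,
        # there is nothing to repair against, so the row is also kept.
        if counts[row] >= min_support or not trusted:
            repaired.append(row)
            continue

        # Find the trusted pattern(s) closest to this dirty row.
        dists = [(hamming(row, t), t) for t in trusted]
        best = min(d for d, _ in dists)
        nearest = [t for d, t in dists if d == best]

        if len(nearest) == 1:
            # Unique nearest trusted pattern: apply the repair.
            repaired.append(nearest[0])
        else:
            # Tie between trusted patterns: leave the row and flag it.
            repaired.append(row)
            ambiguous.append(i)
    return repaired, ambiguous


rows = [
    ("Paris", "FR"), ("Paris", "FR"),
    ("Berlin", "DE"), ("Berlin", "DE"),
    ("Paris", "FX"),   # one cell away from ("Paris", "FR") only
    ("Berlin", "FR"),  # one cell away from both trusted patterns
]
repaired, ambiguous = repair_patterns(rows)
print(repaired[4])  # ('Paris', 'FR')
print(ambiguous)    # [5]
```

Counting patterns takes one pass over the data, but comparing every dirty row against every trusted pattern does not; the sketch makes no attempt to reproduce the linear-time behavior claimed in the abstract.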

📄 Paper information

Publication year	2025
Citations	0
Country of publication	Algeria
Site	Springer

Nibel Nadjeh

Sabrina Abdellaoui

Fahima Nader
