Research field: Databases
Venue: International Journal of Data Science and Analytics
Over the years, data cleaning approaches have focused on traditional data quality issues without considering the scalability of their solutions. The emergence of Big Data has created new data quality challenges, driving the need for efficient and scalable solutions. In this paper, we present a new data cleaning solution that addresses the following challenges: (1) Volume: Most existing approaches struggle with large datasets. We propose a linear-time algorithm that repairs millions of records in seconds. (2) Generality: Existing solutions generally focus on a single aspect of data quality, which prevents the improvement of other aspects. We take six quality dimensions into account. (3) Automaticity: The emerging general solutions lack automation and efficiency. Our solution is fully automatic. (4) Lack of context: In big data settings, quality rules (QRs) are usually not provided. We automatically discover QRs from dirty data. (5) Uncertainty: QRs may themselves be erroneous, which makes it difficult to determine the source of an anomaly. We consider the repair of both QRs and data. (6) Repair accuracy and consistency: Repair algorithms usually apply updates that minimize data changes, which may introduce new errors and violations. We apply only trusted modifications: each dirty pattern is changed to be identical to its most similar trusted pattern, and cells involved in ambiguous cases are marked for later repair using a constraint satisfaction problem formulation, so that all repairs are consistent. Our experiments show that our solution provides high-quality repairs and outperforms state-of-the-art systems in both effectiveness and efficiency.
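The repair idea in point (6) can be illustrated with a minimal sketch. This is not the paper's algorithm. It assumes, purely for illustration, that a pattern is a whole record, that a pattern is "trusted" when it occurs at least `min_support` times, and that similarity is Hamming distance. The names `repair_patterns`, `hamming`, and `min_support` are hypothetical, and this naive version compares every dirty record against every trusted pattern, so it is not linear time.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def repair_patterns(records, min_support=2):
    """Repair rare records toward their nearest frequent ("trusted") record.

    Assumption: frequency is a stand-in for trust; the paper's actual
    trust criterion is not described in the abstract.
    """
    counts = Counter(records)
    trusted = [p for p, c in counts.items() if c >= min_support]
    repaired, ambiguous = [], []
    for row_idx, rec in enumerate(records):
        # Trusted records (or datasets with no trusted pattern) are left as-is.
        if counts[rec] >= min_support or not trusted:
            repaired.append(rec)
            continue
        best = min(hamming(rec, t) for t in trusted)
        nearest = [t for t in trusted if hamming(rec, t) == best]
        if len(nearest) == 1:
            # Unique closest trusted pattern: apply the repair.
            repaired.append(nearest[0])
        else:
            # Tie: keep the record and mark the disputed cells for later repair.
            repaired.append(rec)
            cols = sorted({i for t in nearest
                           for i, (x, y) in enumerate(zip(rec, t)) if x != y})
            ambiguous.extend((row_idx, c) for c in cols)
    return repaired, ambiguous


records = [
    ("Paris", "FR"), ("Paris", "FR"),
    ("Rome", "IT"), ("Rome", "IT"),
    ("Paris", "FX"),   # one cell away from ("Paris", "FR") only
    ("Rome", "FR"),    # one cell away from both trusted patterns
]
repaired, ambiguous = repair_patterns(records)
```

In this toy input, the record `("Paris", "FX")` has a single closest trusted pattern and is repaired, while `("Rome", "FR")` is equally close to two trusted patterns, so both of its cells are marked rather than changed. Resolving the marked cells consistently, which the abstract attributes to a constraint satisfaction formulation, is outside this sketch.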
| Publication year | 2025 |
|---|---|
| Citations | 0 |
| Country of publication | Algeria |
| Site | Springer |