GARF+: self-supervised and interpretable data cleaning with sequence generative adversarial networks


Research field: Artificial Intelligence



Venue: The VLDB Journal


Abstract

Data cleaning has always been a challenging issue in data research. As data volumes grow exponentially, manual cleaning has become increasingly impractical. Despite substantial efforts in automated data cleaning, significant human effort remains essential, either to provide prior knowledge for generating rules or to label data for training models. In this paper, we study the problem of self-supervised and interpretable data cleaning, which automatically extracts interpretable data repair rules from dirty data. We propose a novel framework, namely GARF+, based on sequence generative adversarial networks (SeqGAN). A key objective of GARF+ is to capture data repair rules (e.g., the city “Dothan” uniquely determines that the county is “Houston”). GARF+ employs a SeqGAN consisting of a generator G and a discriminator D, which trains G to learn the dependency relationships (e.g., given the city “Dothan” as input, G infers that the county should be “Houston”). After training, the generator G can be used to generate data repair rules, but some of the generated rules may be incorrect, especially when learned from dirty data. To mitigate this problem, GARF+ further updates the learned relationships with another discriminator to iteratively improve the quality of both rules and data. By taking advantage of both logical and learning-based methods, GARF+ achieves interpretable data cleaning without requiring prior knowledge or labeled training data. Furthermore, GARF+ explores the potential of open-source large language models (LLMs) in data cleaning. Through fine-tuning, LLMs can effectively assimilate both general knowledge and domain-specific information. GARF+ integrates LLMs as a knowledge enhancement module to support the rule generation and data repair processes. Extensive experiments on real-world and synthetic datasets demonstrate the effectiveness of GARF+, including its original approach (GARF) and two variants designed to tackle various scenarios. GARF+ outperforms state-of-the-art methods with high precision and recall across different datasets by learning autonomously from dirty datasets without human supervision.
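To make the notion of a data repair rule concrete, the following is a minimal sketch of how a rule such as city "Dothan" → county "Houston" could be extracted from dirty rows and then applied. It is not the GARF+ method: a simple majority vote stands in for the SeqGAN generator, and the function names (`mine_rules`, `apply_rules`), the thresholds, and the sample rows are all hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def mine_rules(rows, lhs, rhs, min_support=2, min_confidence=0.6):
    """Derive rules of the form lhs value -> rhs value by majority vote.

    Assumption: majority voting replaces the learned generator here; the
    support and confidence thresholds are illustrative, not from the paper.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[lhs]][row[rhs]] += 1

    rules = {}
    for lhs_value, rhs_counts in counts.items():
        rhs_value, hits = rhs_counts.most_common(1)[0]
        total = sum(rhs_counts.values())
        # Keep a rule only if it is seen often enough and is dominant enough.
        if total >= min_support and hits / total >= min_confidence:
            rules[lhs_value] = rhs_value
    return rules


def apply_rules(rows, rules, lhs, rhs):
    """Return repaired copies of the rows; cells that violate a rule are overwritten."""
    repaired = []
    for row in rows:
        fixed = dict(row)
        expected = rules.get(row[lhs])
        if expected is not None and row[rhs] != expected:
            fixed[rhs] = expected
        repaired.append(fixed)
    return repaired


# Made-up sample data: the third row contains a typo in the county.
rows = [
    {"city": "Dothan", "county": "Houston"},
    {"city": "Dothan", "county": "Houston"},
    {"city": "Dothan", "county": "Huston"},
    {"city": "Boaz", "county": "Marshall"},
]

rules = mine_rules(rows, "city", "county")
repaired = apply_rules(rows, rules, "city", "county")
print(rules)
print(repaired[2])
```

Because each rule is an explicit value-to-value mapping, a repair can be traced back to the rule that triggered it, which is the sense in which rule-based cleaning is interpretable. The sketch also shows why rules learned from dirty data need further validation: if typos outnumbered the correct value, the majority vote would produce a wrong rule.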


Authors

Jinfeng Peng
Northeastern University, Shenyang, China, and College of Electronic Engineering, National University of Defense Technology, Hefei, China
Listed country: Andorra

Hanghai Cui
National University of Defense Technology, Changsha, China
Listed country: China

Derong Shen
Northeastern University, Shenyang, China, and College of Electronic Engineering, National University of Defense Technology, Hefei, China
Listed country: Andorra

📄 Paper information

Publication year: 2025
Citations: 0
Publication country: Andorra, China
Site: Springer

Related papers (43)