ITRT (IT Research Trends)

GARF+: self-supervised and interpretable data cleaning with sequence generative adversarial networks

Research area: Artificial Intelligence

Paper keywords: #challenging #dirty #cleaning #impractical #iteratively

Venue: The VLDB Journal

Abstract

Data cleaning has always been a challenging issue in data research. As data volumes grow exponentially, manual cleaning has become increasingly impractical. Despite substantial efforts in automated data cleaning, significant human effort remains essential, either for providing prior knowledge to generate rules or labeling data to train models. In this paper, we study the problem of self-supervised and interpretable data cleaning, which automatically extracts interpretable data repair rules from dirty data. We propose a novel framework, namely GARF+, based on sequence generative adversarial networks (SeqGAN). A key objective of GARF+ is to capture data repair rules (e.g., the city “Dothan” can uniquely determine that the county is “Houston”). GARF+ employs a SeqGAN consisting of a generator G and a discriminator D that trains G to learn the dependency relationships (e.g., given the city “Dothan” as input, G infers that the county should be “Houston”). After training, the generator G can be used to generate data repair rules, but such generated rules may contain incorrect rules, especially when learned from dirty data. To mitigate this problem, GARF+ further updates the learned relationships with another discriminator to iteratively improve the quality of both rules and data. By taking advantage of both logical and learning-based methods, GARF+ achieves interpretable data cleaning without requiring prior knowledge or labeled training data. Furthermore, GARF+ explores the potential of open-source large language models (LLMs) in data cleaning. Through fine-tuning, LLMs can effectively assimilate both general knowledge and domain-specific information. GARF+ integrates LLMs as a knowledge enhancement module to support rule generation and data repair processes. Extensive experiments on real-world and synthetic datasets demonstrate the effectiveness of GARF+, including its original approach (GARF) and two variants designed to tackle various scenarios. 
GARF+ outperforms state-of-the-art methods with high precision and recall across different datasets, through learning from dirty datasets autonomously without human supervision.
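
To make the idea of a data repair rule concrete, here is a minimal Python sketch of what a rule such as "city Dothan determines county Houston" looks like when it is extracted from dirty data and applied back to it. This is not the GARF+ method: the abstract describes a SeqGAN generator and discriminators, whereas this sketch substitutes a simple co-occurrence count as a stand-in. The function names `mine_rules` and `repair`, the thresholds `min_support` and `min_confidence`, and the toy rows are all assumptions introduced for illustration.

```python
from collections import Counter, defaultdict


def mine_rules(rows, lhs, rhs, min_support=2, min_confidence=0.6):
    """Return {lhs_value: rhs_value} for values that dominate their group.

    Frequency-based stand-in for a learned generator (an assumption,
    not the SeqGAN described in the abstract).
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[lhs]][row[rhs]] += 1

    rules = {}
    for lhs_value, rhs_counts in counts.items():
        rhs_value, hits = rhs_counts.most_common(1)[0]
        total = sum(rhs_counts.values())
        # Keep a rule only if it is seen often enough and is dominant.
        if hits >= min_support and hits / total >= min_confidence:
            rules[lhs_value] = rhs_value
    return rules


def repair(rows, lhs, rhs, rules):
    """Return copies of rows with rhs overwritten where a rule disagrees."""
    repaired = []
    for row in rows:
        fixed = dict(row)
        expected = rules.get(row[lhs])
        if expected is not None and fixed[rhs] != expected:
            fixed[rhs] = expected
        repaired.append(fixed)
    return repaired


# Toy dirty table (invented for illustration): one county value is misspelled.
rows = [
    {"city": "Dothan", "county": "Houston"},
    {"city": "Dothan", "county": "Houston"},
    {"city": "Dothan", "county": "Huston"},
    {"city": "Boaz", "county": "Marshall"},
]

rules = mine_rules(rows, "city", "county")
cleaned = repair(rows, "city", "county", rules)
```

Because each rule is an explicit mapping from one attribute value to another, every repair can be traced back to the rule that triggered it, which is the sense in which rule-based cleaning is interpretable. A plain count like this can also learn wrong rules when errors dominate a group, which is the weakness the abstract says GARF+ addresses by iteratively refining both rules and data.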

📄 Paper information

Publication year	2025
Citations	0
Country of publication	Andorra, China
Site	Springer

Jinfeng Peng

Hanghai Cui

Derong Shen

Nan Tang

Yue Kou

Tiezheng Nie

Hang Cui

Ge Yu

Related papers (43)
