Abstract
Data cleaning is a long-standing challenge in data management. While powerful logic-based and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain experts to first manually specify data-quality constraints for a given table before the algorithms can be applied.

In this work, we observe that there is an important class of data-quality constraints, which we call Semantic-Domain Constraints, that can be reliably inferred and automatically applied to any table, without requiring domain experts to specify them manually on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests; the learned constraints can be further distilled into a core set using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used both to (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a complementary class of constraints.
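
To make the idea of a domain-level constraint concrete, the following is a minimal Python sketch, not the paper's learned constraints or its statistical tests. It assumes a constraint of the form "if most values in a column pass a domain check, flag the ones that do not". The function names (`luhn_valid`, `flag_domain_violations`), the `min_match` threshold, and the choice of a Luhn checksum as the domain check are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List


def luhn_valid(value: str) -> bool:
    """Return True if `value` is all digits and passes the Luhn checksum."""
    if not value.isdigit():
        return False
    total = 0
    # Walk the digits right to left, doubling every second digit.
    for i, ch in enumerate(reversed(value)):
        digit = int(ch)
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def flag_domain_violations(
    column: List[str],
    in_domain: Callable[[str], bool],
    min_match: float = 0.75,  # hypothetical threshold, not taken from the paper
) -> List[str]:
    """Flag values outside the domain, but only if the column mostly fits it.

    If fewer than `min_match` of the values pass `in_domain`, the constraint
    is treated as not applicable to this column and nothing is flagged.
    """
    if not column:
        return []
    matches = [in_domain(v) for v in column]
    if sum(matches) / len(column) < min_match:
        return []
    return [v for v, ok in zip(column, matches) if not ok]


card_column = [
    "4111111111111111",
    "5500000000000004",
    "6011111111111117",
    "79927398713",
    "4111111111111112",  # last digit altered, so the checksum fails
]
name_column = ["Alice", "Bob", "Carol"]

flagged_cards = flag_domain_violations(card_column, luhn_valid)
flagged_names = flag_domain_violations(name_column, luhn_valid)
```

In this sketch the same check is applied to every column without per-table configuration: it fires on `card_column`, where four of five values fit the domain, and stays silent on `name_column`, where the domain plainly does not apply.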

Our code and data are available at https://github.com/qixuchen/AutoTest for future research.


Details
Published: Jun 17, 2025
Vol/Issue: 3(3)
Pages: 1-27
Cite This Article
Qixu Chen, Yeye He, Raymond Chi-Wing Wong, et al. (2025). Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables. Proceedings of the ACM on Management of Data, 3(3), 1-27. https://doi.org/10.1145/3725396