Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables

Qixu Chen; Yeye He; Raymond Chi-Wing Wong; WeiWei Cui; Song Ge; Haidong Zhang; Dongmei Zhang; Surajit Chaudhuri

doi:10.1145/3725396

journal article Jun 17, 2025

Proceedings of the ACM on Management of Data Vol. 3 No. 3 pp. 1-27 · Association for Computing Machinery (ACM)

Abstract

Data cleaning is a long-standing challenge in data management. While powerful logical and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain experts to first manually specify data-quality constraints specific to a given table before they can be applied.

In this work, we observe that there is an important class of data-quality constraints, which we call Semantic-Domain Constraints, that can be reliably inferred and automatically applied to any table, without requiring domain experts to specify them manually on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests; the learned constraints can be further distilled into a core set using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used to both (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints.
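
To make the idea of a semantic-domain constraint more concrete, here is a minimal Python sketch. It assumes a deliberately simplified form of constraint: if most values in a column fall inside a semantic domain, the few that do not are flagged as likely errors. The domain check `looks_like_iso_date`, the function `detect_errors`, and the `min_coverage` threshold are hypothetical names and values chosen by hand for illustration. The paper learns and calibrates its constraints from table corpora using statistical tests, which this sketch does not attempt to reproduce.

```python
import re
from typing import Callable, List

# Hypothetical domain check (an assumption for illustration):
# a cell belongs to the "ISO date" domain if it is shaped like YYYY-MM-DD.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def looks_like_iso_date(value: str) -> bool:
    """Return True if the cell has the shape of an ISO date (format only)."""
    return bool(ISO_DATE.match(value.strip()))


def detect_errors(
    column: List[str],
    in_domain: Callable[[str], bool],
    min_coverage: float = 0.8,  # hand-picked threshold, not a learned one
) -> List[int]:
    """Apply one simplified constraint to one column.

    Precondition: at least `min_coverage` of the cells are in the domain.
    If it holds, return the indices of cells outside the domain.
    If it does not hold, the constraint does not apply and nothing is flagged.
    """
    if not column:
        return []
    matches = [in_domain(v) for v in column]
    coverage = sum(matches) / len(column)
    if coverage < min_coverage:
        return []
    return [i for i, ok in enumerate(matches) if not ok]


dates = ["2024-01-05", "2024-02-11", "2024-03-09",
         "2024-04-21", "2024-05-30", "May 30th"]
names = ["Alice", "Bob", "2024-01-05", "Carol"]

print(detect_errors(dates, looks_like_iso_date))  # [5]
print(detect_errors(names, looks_like_iso_date))  # []
```

The precondition is what keeps such a rule from firing on unrelated columns: the `names` column is mostly outside the date domain, so the constraint is skipped rather than flagging three of its four cells.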

Our code and data are available at https://github.com/qixuchen/AutoTest for future research.

Details

Published: Jun 17, 2025
Vol/Issue: 3(3)
Pages: 1-27

Authors

Qixu Chen
Hong Kong University of Science and Technology, Hong Kong SAR, China

Yeye He
Microsoft Research, Redmond, USA

Raymond Chi-Wing Wong
Hong Kong University of Science and Technology, Hong Kong SAR, China

WeiWei Cui
Microsoft Research, Beijing, China

Song Ge
Microsoft Research, Beijing, China

Haidong Zhang
Microsoft Research, Beijing, China

Dongmei Zhang
Microsoft Research, Beijing, China

Surajit Chaudhuri
Microsoft Research, Redmond, USA

Cite This Article

Qixu Chen, Yeye He, Raymond Chi-Wing Wong, et al. (2025). Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables. Proceedings of the ACM on Management of Data, 3(3), 1-27. https://doi.org/10.1145/3725396
