A survey on dataset quality in machine learning

Youdi Gong; Guangzhen Liu; Yunzhi Xue; Rui Li; Lingzhong Meng

doi:10.1016/j.infsof.2023.107268

Journal article · Open Access · Oct 01, 2023

Information and Software Technology, Vol. 162, Article 107268 · Elsevier BV

Metrics

Citations: 245
References: 71

Details

Published: Oct 01, 2023
Volume: 162
Article number: 107268

Authors

Department of Anesthesia, Indiana University School of Medicine, Indianapolis, IN, USA

Cite This Article

Youdi Gong, Guangzhen Liu, Yunzhi Xue, et al. (2023). A survey on dataset quality in machine learning. Information and Software Technology, 162, 107268. https://doi.org/10.1016/j.infsof.2023.107268
