journal article Open Access Oct 01, 2023

A survey on dataset quality in machine learning

DOI: 10.1016/j.infsof.2023.107268


Metrics
Citations: 245
References: 71

Details
Published: Oct 01, 2023
Vol/Issue: 162
Pages: 107268
Cite This Article
Youdi Gong, Guangzhen Liu, Yunzhi Xue, et al. (2023). A survey on dataset quality in machine learning. Information and Software Technology, 162, 107268. https://doi.org/10.1016/j.infsof.2023.107268