journal article Aug 10, 2022

A new feature selection method based on frequent and associated itemsets for text classification

Abstract
Feature selection is one of the major issues in pattern recognition. The quality of the selected features matters for classification because low‐quality data can degrade the performance of the resulting model. Selected features often carry redundant information, which is difficult to handle, so this article draws on association analysis from data mining to select important features. It proposes a novel feature selection method based on frequent and associated itemsets (FS‐FAI) for text classification. FS‐FAI seeks relevant features while also taking feature interaction into account, and it uses association as the metric for evaluating the relevance of a feature, or a set of features, to the target concept. To evaluate the proposed method, several experiments were conducted on a BBC dataset from the BBC news website and on the SMS Spam Collection dataset from the UCI Machine Learning Repository, and the results were compared with those of well‐known feature selection methods. The reported results demonstrate that the proposed method selects high‐quality features and handles redundant information in text classification.
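The general idea of scoring terms by how frequent they are and how strongly they associate with a class can be illustrated with a small sketch. This is not the authors' FS‐FAI algorithm, whose details the abstract does not give. It is a minimal illustration under stated assumptions: the function name `select_features`, the thresholds `min_support` and `min_confidence`, the brute-force enumeration of itemsets (rather than an Apriori-style search), and the toy documents are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def select_features(docs, labels, min_support=0.5, min_confidence=1.0, max_size=2):
    """Keep term itemsets that are frequent and strongly tied to one class.

    Hypothetical helper for illustration only. Returns a dict mapping each
    kept itemset to (class, itemset support, confidence of itemset -> class).
    """
    n = len(docs)
    itemset_counts = Counter()  # documents containing the itemset
    joint_counts = Counter()    # documents containing the itemset, per class

    for tokens, label in zip(docs, labels):
        terms = sorted(set(tokens))
        # Brute-force enumeration; fine for toy data, too slow for real corpora.
        for size in range(1, max_size + 1):
            for itemset in combinations(terms, size):
                itemset_counts[itemset] += 1
                joint_counts[itemset, label] += 1

    selected = {}
    for (itemset, label), joint in joint_counts.items():
        support = itemset_counts[itemset] / n
        confidence = joint / itemset_counts[itemset]
        if support < min_support or confidence < min_confidence:
            continue
        # If several classes pass, keep the one with the highest confidence.
        if itemset not in selected or confidence > selected[itemset][2]:
            selected[itemset] = (label, support, confidence)
    return selected


# Toy data, invented for this example.
docs = [
    ["win", "free", "prize"],
    ["free", "prize", "now"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
labels = ["spam", "spam", "ham", "ham"]

selected = select_features(docs, labels)
for itemset, (label, support, confidence) in sorted(selected.items()):
    print(itemset, label, support, confidence)
```

Itemsets of size two stand in for feature interaction here: a pair is kept only when the two terms co-occur often enough and the co-occurrence points to a single class. A practical implementation would prune candidates by support before generating larger itemsets.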

Details
Published
Aug 10, 2022
Vol/Issue
34(25)
Funding
Minia University
Cite This Article
Heba Mamdouh Farghaly, Tarek Abd El‐Hafeez (2022). A new feature selection method based on frequent and associated itemsets for text classification. Concurrency and Computation: Practice and Experience, 34(25). https://doi.org/10.1002/cpe.7258