A new feature selection method based on frequent and associated itemsets for text classification

Heba Mamdouh Farghaly; Tarek Abd El‐Hafeez

doi:10.1002/cpe.7258

journal article Aug 10, 2022

Concurrency and Computation: Practice and Experience Vol. 34 No. 25 · Wiley

Abstract

Feature selection is one of the major issues in pattern recognition. The quality of the selected features matters for classification, because low‐quality data can degrade the performance of the constructed model. Because selected features commonly contain redundant information that is difficult to handle, this article draws on association analysis theory from data mining to select important features. It proposes a novel feature selection method based on frequent and associated itemsets (FS‐FAI) for text classification. FS‐FAI seeks relevant features while also taking feature interaction into account, and it uses association as a metric to evaluate the relationship between the target concept and one or more features. To evaluate the proposed method, several experiments were conducted on the BBC dataset, drawn from the BBC news website, and on the SMS Spam Collection dataset from the UCI Machine Learning Repository. The results were compared with those of well‐known feature selection methods. The reported results demonstrate the effectiveness of the proposed method in selecting high‐quality features and in handling redundant information in text classification.
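The abstract does not give the FS‐FAI procedure itself, so the following is only a minimal sketch of the general idea it describes: keep terms that occur in frequent itemsets and that are strongly associated with a class. The function name `select_features`, the thresholds `min_support` and `min_confidence`, the itemset size limit `max_size`, and the toy data are all assumptions made for illustration. Itemsets are enumerated by brute force rather than with an Apriori-style search, so this is not the authors' algorithm.

```python
from itertools import combinations


def select_features(docs, labels, min_support=0.5, min_confidence=0.8, max_size=2):
    """Illustrative only: return terms found in itemsets that are both
    frequent (support) and class-associated (confidence of itemset -> class)."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    selected = set()
    for size in range(1, max_size + 1):
        # Brute-force enumeration; fine for a toy vocabulary only.
        for itemset in combinations(vocab, size):
            items = set(itemset)
            # Class labels of the documents that contain every term in the itemset.
            covered = [lab for d, lab in zip(doc_sets, labels) if items <= d]
            support = len(covered) / n
            if not covered or support < min_support:
                continue
            # Confidence of the rule itemset -> most common class among covering docs.
            best = max(covered.count(c) for c in set(covered))
            confidence = best / len(covered)
            if confidence >= min_confidence:
                selected.update(items)
    return sorted(selected)


# Hypothetical toy corpus (already tokenized) with one class label per document.
toy_docs = [
    ["win", "free", "prize"],
    ["free", "prize", "call"],
    ["meeting", "call", "tomorrow"],
    ["lunch", "tomorrow", "meeting"],
]
toy_labels = ["spam", "spam", "ham", "ham"]

print(select_features(toy_docs, toy_labels))
```

In this toy run, a term such as "call" is frequent but appears equally in both classes, so its confidence falls below the threshold and it is dropped, while rare terms are dropped on support. Allowing pairs of terms (`max_size=2`) is one simple way to let co-occurring terms be judged together rather than only one at a time.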

Details

Published: Aug 10, 2022
Vol/Issue: 34(25)

Authors

Heba Mamdouh Farghaly

Department of Computer Science, Faculty of Science, Minia University, EL‐Minia, Egypt

Tarek Abd El‐Hafeez

Department of Computer Science, Faculty of Science, Minia University, EL‐Minia, Egypt; Computer Science Unit, Deraya University, EL‐Minia, Egypt

Funding

Minia University

Cite This Article

Heba Mamdouh Farghaly, Tarek Abd El‐Hafeez (2022). A new feature selection method based on frequent and associated itemsets for text classification. Concurrency and Computation: Practice and Experience, 34(25). https://doi.org/10.1002/cpe.7258
