Using Genetic Algorithm for Breast Cancer Feature Selection

Kaan Eroltu*

United World College of the Adriatic, Collegio del Mondo Unito dell'Adriatico, Italy.

*Author to whom correspondence should be addressed.


Abstract

Breast cancer is among the most widespread cancer types in women worldwide. It can be treated effectively when detected early; otherwise, it carries one of the highest mortality rates among cancers. Many tools can be used for detection, but computer-based diagnosis systems have become popular because they are cheaper and quicker. However, these systems can also produce incorrect detections. Hence, feature selection is an important step that can enhance the accuracy of computer-based diagnosis. This study uses a genetic algorithm for feature selection within a wrapper methodology for breast cancer diagnosis. The proposed model was tested with 17 different classifiers to evaluate its effectiveness. Training accuracy increased after feature selection with the genetic algorithm was employed. The highest training accuracy, 100%, was reported for Extra Trees, MLP, Random Forest, and Logistic Regression, and the lowest, 92.5%, for GaussianNB. Furthermore, feature selection improved validation accuracy, sensitivity, specificity, F1-score, and Matthews Correlation Coefficient.
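The wrapper methodology described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the population size, selection/crossover/mutation operators, and the use of Logistic Regression as the fitness classifier are all assumptions, and scikit-learn's built-in Breast Cancer Wisconsin dataset is used only for demonstration. Each candidate solution is a binary mask over the features, and its fitness is the cross-validated accuracy of a classifier trained on the selected subset.

```python
# Minimal GA wrapper feature-selection sketch (illustrative assumptions throughout).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scaling helps the fitness classifier converge
n_features = X.shape[1]

def fitness(mask):
    # Wrapper fitness: cross-validated accuracy of the classifier on the selected subset.
    if not mask.any():
        return 0.0
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

def evolve(pop_size=12, generations=6, mutation_rate=0.05):
    pop = rng.random((pop_size, n_features)) < 0.5   # random bit-mask population
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]        # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)        # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < mutation_rate
            children.append(child ^ flip)            # bit-flip mutation
        pop = np.vstack([parents, children])
    scores = np.array([fitness(ind) for ind in pop])
    best = pop[scores.argmax()]
    return best, scores.max()

best_mask, best_acc = evolve()
```

The returned `best_mask` is the evolved feature subset; any of the 17 classifiers evaluated in the study could be substituted into `fitness` in place of Logistic Regression.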

Keywords: Genetic algorithm, Feature selection, Breast cancer, Machine learning classifiers, Random forest


How to Cite

Eroltu, Kaan. 2023. “Using Genetic Algorithm for Breast Cancer Feature Selection”. International Research Journal of Oncology 6 (2):203-26. https://journalirjo.com/index.php/IRJO/article/view/139.

