Feature Selection and Machine Learning Based Software Defect Prediction: A Performance Evaluation
Abstract
Software Defect Prediction (SDP) is a vital technique for enhancing software quality and reducing development and maintenance costs by identifying fault-prone components at early stages of the software life cycle. Despite significant progress in machine learning–based approaches, limited attention has been given to understanding which software features most strongly influence prediction performance and how different algorithms compare under standardized evaluation. This study investigates feature identification and evaluates the performance of several machine learning classifiers for SDP using the JM1 dataset from the NASA PROMISE repository, which contains 10,885 software modules described by McCabe and Halstead complexity metrics. The data were preprocessed with correlation-based feature selection and K-means clustering to reduce redundancy, mitigate multicollinearity, and improve data representation. Four classifiers (Support Vector Machine, Naïve Bayes, Random Forest, and AdaBoost), together with a stacking ensemble model, were implemented and evaluated using accuracy, precision, recall, and F1-score, both with and without Particle Swarm Optimization. Experimental results indicate that the Random Forest classifier consistently outperformed the other models, achieving up to 99.91% accuracy along with the highest precision, recall, and F1-score, while Naïve Bayes produced the lowest performance. The ensemble approach, particularly the combination of Random Forest and AdaBoost, further improved robustness. Feature analysis revealed that line count of code, cyclomatic complexity, essential complexity, total operators and operands, program volume, and program length are the most influential predictors of software defects. These findings demonstrate that careful feature selection combined with robust ensemble-based learning models can significantly improve the reliability and effectiveness of software defect prediction systems.
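As a rough illustration of two steps named in the abstract, the sketch below shows a simple pairwise correlation filter for redundant metrics and the computation of accuracy, precision, recall, and F1-score from predictions. It is a simplified stand-in, not the study's actual pipeline: the function names, the 0.9 correlation threshold, and the toy values are all assumptions made for the example, and no JM1 data is used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold.

    `columns` maps feature name -> list of values. The 0.9 threshold is an
    illustrative assumption, not a value taken from the study.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the defective (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy, made-up module metrics (hypothetical values, not JM1 data).
toy = {
    "loc": [10, 20, 30, 40, 50],
    "halstead_length": [21, 39, 62, 80, 101],  # nearly proportional to loc
    "cyclomatic": [3, 1, 4, 1, 5],
}
selected = drop_correlated(toy)

# Hypothetical true labels and predictions (1 = defective module).
metrics = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(selected, metrics)
```

The filter keeps the first feature of each highly correlated group, so the result depends on column order. A full correlation-based feature selection method would also weigh each feature's correlation with the defect label.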