Evaluating Imputation Methods to Improve Data Availability in a Software Estimation Dataset
Sreekumar P. Pillai¹, T. Radha Ramanan², S. D. Madhu Kumar³
¹Sreekumar P. Pillai, Research Scholar, School of Management Studies, NIT Calicut, Kozhikode (Kerala), India.
²Dr. T. Radha Ramanan, Associate Professor, School of Management Studies, NIT Calicut, Kozhikode (Kerala), India.
³Dr. S. D. Madhu Kumar, Professor, Department of Computer Science, NIT Calicut, Kozhikode (Kerala), India.
Manuscript received on 10 October 2019 | Revised Manuscript received on 19 October 2019 | Manuscript Published on 02 November 2019 | PP: 153-159 | Volume-8 Issue-2S11 September 2019 | Retrieval Number: B10250982S1119/2019©BEIESP | DOI: 10.35940/ijrte.B1025.0982S1119
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Partially missing data is a problem prevalent in most datasets used for statistical analysis. In this study, we analyze the missing values in the ISBSG R1 2018 dataset, available from the International Software Benchmarking Standards Group, and address the problem through imputation, a machine learning technique that can increase the availability of data. We compare the performance of four imputation methods: Classification and Regression Trees (CART), Polynomial Regression (PR), Predictive Mean Matching (PMM), and Random Forest (RF). Through imputation, we were able to increase data availability fourfold. We also evaluated the performance of these methods against the original dataset without imputation using an ensemble of Linear Regression, Gradient Boosting, Random Forest, and Artificial Neural Networks (ANN). Imputation using CART increases the availability of the overall dataset, but at the cost of some predictive capability of the model. Nevertheless, CART remains the option of choice for extending the usability of the data, because it retains rows that traditional methods would remove from the dataset. In our experiments, this approach increased the usability of the original dataset to 63%, with a 2 to 3% decrease in its overall predictive performance.
Keywords: Software Effort Estimation, Software Cost Estimation, Effort Prediction, Gradient Boosting Machines, Generalized Linear Model, Artificial Neural Networks, Random Forests, Missing Data Imputation, Ensemble Models.
Scope of the Article: Software Engineering & Its Applications
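To illustrate how imputation retains rows that listwise deletion would discard, the following is a minimal sketch of Predictive Mean Matching, one of the methods named in the abstract. It is not the authors' implementation: the toy `size` and `effort` values are hypothetical, the helper names `fit_line` and `pmm_impute` are made up for this example, and the sketch uses a single predictor with point estimates of the regression coefficients, whereas full PMM implementations typically also draw the coefficients at random.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def pmm_impute(xs, ys, k=3, seed=0):
    """Fill None entries of ys by (simplified) predictive mean matching on xs."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    # Each donor pairs a predicted value with the value actually observed.
    donors = [(a + b * x, y) for x, y in observed]
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)  # observed values are left untouched
            continue
        target = a + b * x
        # Keep the k donors whose predictions are closest to the target.
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        # Impute a real observed value rather than the raw prediction.
        filled.append(rng.choice(nearest)[1])
    return filled

# Hypothetical toy data: functional size and effort, with gaps in effort.
size = [100, 150, 200, 250, 300, 350, 400, 450, 500, 550]
effort = [820, None, 1650, 2100, None, 2900, None, 3700, None, 4600]

usable_before = sum(e is not None for e in effort)
imputed = pmm_impute(size, effort)
usable_after = sum(e is not None for e in imputed)
print(f"usable rows before imputation: {usable_before} of {len(effort)}")
print(f"usable rows after imputation:  {usable_after} of {len(imputed)}")
```

Because every imputed value is borrowed from an observed row, the filled column stays within the range of the real data. The trade-off reported in the abstract still applies: more rows become usable, but the imputed values add uncertainty to any model trained on them.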