Missing Data Imputation in High Dimensional Data Set using Local Similarity
C. Nalini¹, J. Sudeeptha²
¹C. Nalini, Professor, Department of Information Technology, Kongu Engineering College, Erode, India.
²J. Sudeeptha, Application Development Associate, Accenture, Chennai, India.
Manuscript received on 06 August 2019. | Revised Manuscript received on 10 August 2019. | Manuscript published on 30 September 2019. | PP: 8070-8074 | Volume-8 Issue-3 September 2019 | Retrieval Number: C6435098319/2019©BEIESP | DOI: 10.35940/ijrte.C6435.098319
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Data quality is important for any data mining or statistical task, and missing values in a dataset degrade it. A missing value indicates either that an event did not happen or that the value does not exist. Because data mining algorithms are not robust to incomplete data, missing values must be imputed before data mining and statistical analysis are performed. Existing methods such as Expectation Maximization Imputation (EMI) and A Framework for Imputing Missing values Using co-appearance, correlation and Similarity analysis (FIMUS) use the whole dataset to impute missing values, so irrelevant records can reduce imputation accuracy. This effect can be controlled by imputing from locally similar records only. Local similarity imputation can be carried out with a clustering algorithm such as k-means, whose efficiency depends on the number of clusters, which the user must define. To increase clustering efficiency, a distinctive value is first imputed in place of each missing value, and the resulting dataset is passed to a stacked autoencoder for dimensionality reduction, which also improves clustering efficiency. The initial number of clusters for k-means is determined using fast clustering. Because of the initial imputation, some irrelevant records may be assigned to a cluster, and using these records for imputation decreases accuracy. The proposed local similarity imputation algorithm therefore uses only the top k nearest neighbours within the cluster to impute missing values. Its performance is evaluated using Root-Mean-Squared-Error (RMSE) and Index of Agreement (d2) on University of California, Irvine (UCI) datasets.
Keywords: Data quality, missing values, clustering, Root-Mean-Squared-Error, Index of Agreement
Scope of the Article: Data Mining
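The final imputation step and the two evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes cluster labels are already available (the autoencoder, fast clustering and k-means stages are omitted), uses Euclidean distance over the attributes observed in both records, and fills each missing value with the mean of the k nearest donors in the same cluster. The index of agreement is assumed to follow Willmott's form with exponent 2. The function names `impute_local`, `rmse` and `index_of_agreement` are hypothetical.

```python
import math


def rmse(actual, imputed):
    """Root-mean-squared error between original and imputed values."""
    squared = [(p - o) ** 2 for o, p in zip(actual, imputed)]
    return math.sqrt(sum(squared) / len(squared))


def index_of_agreement(actual, imputed):
    """Index of agreement, assumed here to be Willmott's form with exponent 2."""
    mean_o = sum(actual) / len(actual)
    num = sum((p - o) ** 2 for o, p in zip(actual, imputed))
    den = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2
              for o, p in zip(actual, imputed))
    return 1.0 if den == 0 else 1.0 - num / den


def impute_local(records, labels, k=2):
    """Fill None entries from the k nearest records in the same cluster.

    Assumptions of this sketch: distance is Euclidean over attributes
    observed in both records, and the imputed value is the donors' mean.
    A value stays None if its cluster has no usable donor.
    """
    out = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not None:
                continue
            donors = []
            for m, other in enumerate(records):
                # Donors must share the cluster and have attribute j observed.
                if m == i or labels[m] != labels[i] or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(rec, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                donors.append((dist, other[j]))
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                out[i][j] = sum(nearest) / len(nearest)
    return out


# Toy data: record 3 shares cluster 0 but is far from the incomplete record 2,
# so restricting to the top k neighbours keeps it out of the estimate.
records = [[1.0, 2.0], [1.2, 2.2], [1.1, None], [3.0, 5.0],
           [8.0, 9.0], [8.2, None], [7.9, 9.1]]
labels = [0, 0, 0, 0, 1, 1, 1]
completed = impute_local(records, labels, k=2)
print(completed[2], completed[5])
```

Limiting donors to the top k neighbours, rather than averaging over the whole cluster, reflects the abstract's point that a cluster may contain irrelevant records after the initial imputation.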