Outlier Detection in Imbalanced Data Classification
M. Kamaladevi1, K. R. Sekar2, V. Venkataraman3, K. Kannan4
1M. Kamaladevi, School of Computing, SASTRA Deemed University, (Tamilnadu), India.
2K.R. Sekar, School of Computing, SASTRA Deemed University, (Tamilnadu), India.
3V. Venkataraman, School of Humanities and Science, SASTRA Deemed University, (Tamilnadu), India.
4K. Kannan, School of Humanities and Science, SASTRA Deemed University, (Tamilnadu), India.
Manuscript received on 23 March 2019 | Revised Manuscript received on 30 March 2019 | Manuscript published on 30 March 2019 | PP: 972-975 | Volume-7 Issue-6, March 2019 | Retrieval Number: F2626037619/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: In binary classification, class imbalance arises when the distribution of classes in a dataset is not uniform, so that the instances of one class significantly outnumber those of the other. Classification algorithms are then biased toward the majority class, and overall accuracy does not reflect performance on minority-class instances, which degrades the classifier. To improve performance, the characteristics of minority-class instances, such as borderline, rare, and outlier instances, have to be analyzed. An outlier, or anomaly, is a point that deviates from the normal behavior exhibited by the other points in the data. Detecting outliers among class instances is still an open research problem. In this article, two density-based outlier detection methods are compared: the KNN method and the Local Outlier Factor (LOF). The KNN algorithm, which is a classification algorithm, is a global density-based method, while LOF is a local density-based method. Both methods are applied to the imbalanced Breast Cancer-W dataset, consisting of 569 instances and 33 variables, taken from the UCI (University of California, Irvine) Machine Learning Repository. The accuracy of each algorithm (the percentage of observations correctly identified) is computed, and their performance is analyzed. The LOF method was found to provide a better view of outlier data than the KNN method.
Keywords: Outlier, LOF, KNN, distance-based, density-based
Scope of the Article: Classification
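The two scores compared in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy points, the choice k = 2, and the function names (`k_nearest`, `knn_scores`, `lof_scores`) are assumptions made for illustration, and neighbourhoods are simplified to exactly k points with ties broken by index. The KNN score here is the distance to the k-th nearest neighbour (a global measure), while LOF compares a point's local reachability density with that of its neighbours (a local measure).

```python
import math

def k_nearest(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return dists[:k]

def knn_scores(points, k):
    """Global score: distance from each point to its k-th nearest neighbour."""
    return [k_nearest(points, i, k)[-1][0] for i in range(len(points))]

def lof_scores(points, k):
    """Local Outlier Factor; values well above 1 suggest an outlier.

    Simplification: assumes no group of k+1 duplicate points, which would
    make a reachability sum zero.
    """
    n = len(points)
    neigh = [k_nearest(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neigh]  # k-distance of every point
    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), dist(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean neighbour density divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
        for i in range(n)
    ]

# Hypothetical toy data: five clustered points and one isolated point (index 5).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
knn = knn_scores(points, k=2)
lof = lof_scores(points, k=2)

for i, p in enumerate(points):
    print(f"{p}: kNN distance = {knn[i]:.3f}, LOF = {lof[i]:.3f}")
```

On a single cluster like this both scores single out the isolated point. The two measures tend to diverge when a dataset contains clusters of different densities, because the KNN distance applies one global scale while LOF judges each point against its own neighbourhood.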