A Probe on Document Clustering Methodologies and Its Performance Metrics
P. Kalpana1, P. Tamije Selvy2
1Ms. P. Kalpana, Assistant Professor, Department of CSE, Sri Krishna College of Technology, Coimbatore, (Tamil Nadu).
2Dr. P. Tamije Selvy, Professor, Department of CSE, Sri Krishna College of Technology, Coimbatore, (Tamil Nadu).
Manuscript received on 09 March 2019 | Revised Manuscript received on 17 March 2019 | Manuscript published on 30 July 2019 | PP: 2938-2942 | Volume-8 Issue-2, July 2019 | Retrieval Number: B2624078219/19©BEIESP | DOI: 10.35940/ijrteB2624.078219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The rapid growth of internet usage has greatly increased the volume of information in circulation, which leads to the problem of information congestion. Clustering is considered one of the most important problems in unsupervised learning: it focuses on identifying structure in an unlabeled data collection. Large volume, high dimensionality and complicated semantics are the main difficulties of document clustering. Clustering is a method in which data items are grouped according to their resemblance to one another, so that similar objects are separated from dissimilar ones. As for what makes a clustering good, it can be shown that there is no absolute “best” criterion independent of the final objective of the clustering. The primary objective of a good document clustering scheme is to minimize the intra-cluster distance between documents while maximizing the inter-cluster distance (using a suitable document distance measure). A distance measure (or, dually, a measure of resemblance) is therefore at the core of document clustering. This survey reviews the different methods (Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation, Singular Value Decomposition, Doc2Vec Model, Graph model), distance measures (Euclidean Distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation Coefficient) and evaluation parameters of document clustering. This work is theoretical in nature and aims to cover the overall procedure of document clustering.
Index Terms: Document Clustering, Distance Measure, Unsupervised Learning, Intra-Cluster.
Scope of the Article: Deep Learning
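The four distance measures named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the toy documents, the whitespace tokenizer, the raw term-frequency vectors and all function names are assumptions made for illustration, and the Jaccard coefficient is computed here in its set-based form (shared terms over all terms present in either document).

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Raw term-frequency vector of `text` over a fixed, ordered vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_coefficient(a, b):
    """Set-based Jaccard: terms present in both / terms present in either."""
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    either = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return shared / either if either else 0.0


def pearson_correlation(a, b):
    """Pearson correlation of two vectors; 0.0 if either has zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0


# Hypothetical toy corpus, used only to exercise the functions above.
docs = [
    "clustering groups similar documents",
    "clustering groups similar papers",
    "internet usage grows fast",
]
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

if __name__ == "__main__":
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            a, b = vectors[i], vectors[j]
            print(
                f"doc{i} vs doc{j}: "
                f"euclidean={euclidean_distance(a, b):.3f} "
                f"cosine={cosine_similarity(a, b):.3f} "
                f"jaccard={jaccard_coefficient(a, b):.3f} "
                f"pearson={pearson_correlation(a, b):.3f}"
            )
```

In this toy corpus the first two documents share three of their four terms, so they score as close under all four measures, while the third document shares no terms with either. Note that Euclidean distance is a dissimilarity (smaller means closer), whereas the other three are similarities (larger means closer), so a clustering routine would need to treat them accordingly.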