Data Mining K-Means Document Clustering using TFIDF and Word Frequency Count
Aranga Arivarasan1, M. Karthikeyan2
1Aranga Arivarasan, Assistant Professor, Department of Computer and Information Science, Annamalai University, India.
2Dr. M. Karthikeyan, Assistant Professor, Department of Computer and Information Science, Annamalai University, India.
Manuscript received on 03 March 2019 | Revised Manuscript received on 09 March 2019 | Manuscript published on 30 July 2019 | PP: 2542-2549 | Volume-8 Issue-2, July 2019 | Retrieval Number: B1718078219/19©BEIESP | DOI: 10.35940/ijrte.B1718.078219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: With the rapid development of the World Wide Web, the number of documents in use is growing quickly, creating gigabytes of text that must be processed. Indexing and retrieving the required text documents calls for efficient algorithms that achieve good accuracy. Algorithms from the field of data mining continue to produce new innovations, which has increased researchers' interest in developing models for text data mining. The proposed model is a two-step text document clustering approach based on the K-Means algorithm. The first step is pre-processing and the second step is clustering. Pre-processing tokenizes the documents. The distinct words are identified, and their frequency of occurrence and their TFIDF weights are each calculated to form a separate document feature vector. In the clustering phase, each feature vector is clustered with the K-Means algorithm using various similarity measures.
Keywords: TFIDF, Word Frequency, Probability, Tokenization, Clustering.
Scope of the Article: Data Mining
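The two-step approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tokenizer, the TFIDF variant (raw count times log(N/df)), the first-k centroid initialisation, and all function names are assumptions made for illustration. The abstract does not name the similarity measures used, so Euclidean and cosine distance appear here only as two common choices. Averaging members to update centroids is exact for Euclidean distance and only a common heuristic for cosine distance.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vectors(docs):
    """Return (vocabulary, word-frequency vectors, TFIDF vectors)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted({w for c in counts for w in c})
    n = len(docs)
    # Document frequency: number of documents containing each distinct word.
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    freq = [[c[w] for w in vocab] for c in counts]
    # Assumed weighting: raw count * log(N / df). Other variants exist.
    tfidf = [[c[w] * math.log(n / df[w]) for w in vocab] for c in counts]
    return vocab, freq, tfidf

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as maximally distant from everything.
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)

def kmeans(vectors, k, distance=euclidean, max_iter=100):
    # Deterministic initialisation for this sketch: the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

if __name__ == "__main__":
    docs = [
        "cats purr and cats sleep",
        "stock market prices fall",
        "kittens and cats purr",
        "market traders sell stock",
    ]
    vocab, freq, tfidf = build_vectors(docs)
    print(kmeans(freq, 2))                              # word-frequency features
    print(kmeans(tfidf, 2, distance=cosine_distance))   # TFIDF features
```

Building the word-frequency and TFIDF vectors over the same vocabulary keeps the two feature sets directly comparable, so the same clustering routine and distance functions can be applied to either one.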