Maximum Frequent Item Set based Clustering Algorithm for Big Text Data
K. V. Kanimozhi1, Rajakumarkrishnan2, M. Venkatesan3
1K. V. Kanimozhi, Department of Computer Science and Engineering, VIT University, Vellore (Tamil Nadu), India.
2Dr. Rajakumarkrishnan, Department of Computer Science and Engineering, VIT University, Vellore (Tamil Nadu), India.
3Dr. M. Venkatesan, Department of Computer Science and Engineering, NIT Surathkal Vellore (Tamil Nadu), India.
Manuscript received on 20 October 2019 | Revised Manuscript received on 25 October 2019 | Manuscript Published on 02 November 2019 | PP: 3970-3975 | Volume-8 Issue-2S11 September 2019 | Retrieval Number: B15390982S1119/2019©BEIESP | DOI: 10.35940/ijrte.B1539.0982S1119
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The rapid growth of the internet and the continuous expansion of World Wide Web resources such as digital libraries and online news have produced a massive amount of unstructured electronic text documents on the web. Although many traditional techniques are available to extract knowledge from large collections of text documents, improving the precision of web search retrieval and efficiently finding the most appropriate documents in huge text collections remains a big challenge. Clustering techniques help the search engine to retrieve documents. The proposed system addresses these problems with a bivariate n-gram frequent item clustering algorithm based on the concept of the maximum frequent set, which maintains the sequence and meaning of the sentence in order to reduce the huge dimensionality; the frequent item sets are then used to find similarity. The documents are then clustered based on maximum document occurrence. Our method thus obtains better-quality clusters than existing methodologies and improves efficiency. Experiments on a sample Newsgroup dataset compare the existing K-Means algorithm with FICMDO (frequent item clustering method based on maximum document occurrence) and show that the f-measure is higher for our algorithm; the higher f-measure indicates more efficient clusters. Hence it is a faster and more efficient big data method that improves performance compared with vector space model approaches such as the K-Means algorithm.
Keywords: Text Documents, Frequent Item Set, Similarity, Clustering, Map Reduce.
Scope of the Article: Textile Engineering
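Illustrative sketch: the abstract does not spell out the steps of the algorithm, so the following Python fragment is only one plausible reading of "frequent bigrams plus maximum document occurrence", not the authors' FICMDO implementation. The function names, the `min_support` threshold, the whitespace tokenizer, and the lexicographic tie-break are all assumptions made for illustration. Each document is reduced to its set of word bigrams, bigrams that occur in at least `min_support` documents are kept as frequent items, and each document is assigned to the frequent bigram it contains that occurs in the most documents.

```python
from collections import Counter, defaultdict


def bigrams(text):
    """Return the set of adjacent word pairs in a text (naive whitespace tokenizer)."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def cluster_by_frequent_bigrams(docs, min_support=2):
    """Group document indices by their most widely shared frequent bigram.

    Assumption: "maximum document occurrence" is read here as "the frequent
    bigram contained in the document that appears in the most documents".
    """
    doc_bigrams = [bigrams(d) for d in docs]

    # Document frequency: in how many documents does each bigram occur?
    doc_freq = Counter(bg for bgs in doc_bigrams for bg in bgs)
    frequent = {bg: n for bg, n in doc_freq.items() if n >= min_support}

    clusters = defaultdict(list)
    for idx, bgs in enumerate(doc_bigrams):
        candidates = [bg for bg in bgs if bg in frequent]
        if not candidates:
            # Documents without any frequent bigram go to a catch-all cluster.
            clusters[None].append(idx)
            continue
        # Highest document frequency wins; ties are broken lexicographically.
        best = max(candidates, key=lambda bg: (frequent[bg], bg))
        clusters[best].append(idx)
    return dict(clusters)


sample_docs = [
    "stock market falls",
    "stock market rises today",
    "football match today",
    "football match postponed",
]
sample_clusters = cluster_by_frequent_bigrams(sample_docs, min_support=2)
```

On this toy input, the two finance headlines share the bigram ("stock", "market") and the two sports headlines share ("football", "match"), so each pair forms its own cluster. A real pipeline would add proper tokenization, stop-word handling, and a distributed (for example MapReduce-style) counting step, none of which is shown here.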