Text Mining with Hadoop: Enforcement of Document Clustering using Non-Negative Matrix Factorization (KNMF)
E. Laxmi Lydia1, K. Vijaya Kumar2, K. Shankar3
1E. Laxmi Lydia, Department of Computer Science Engineering, Vignan’s Institute of Information Technology (Autonomous), Visakhapatnam, India.
2K. Vijaya Kumar, Department of Computer Science Engineering, Vignan’s Institute of Engineering for Women, Andhra Pradesh, India.
3K. Shankar, School of Computing, Kalasalingam Academy of Research and Education, Krishnan Koil, India.
Manuscript received on 08 April 2019 | Revised Manuscript received on 16 May 2019 | Manuscript published on 30 May 2019 | PP: 3372-3380 | Volume-8 Issue-1, May 2019 | Retrieval Number: F2962037619/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Big data is recognized as information coming from many sources that demands innovative methods of analysis. The data in documents are mostly unstructured, such as text-processing documents, audio, web pages, log results, etc. Problem Statement: To order these files manually into folders, the entire contents and the names of the files must be known so that related files can be grouped together. Another characteristic of this information is that it is prone to continuous change; hence clustering is required. Existing approach: Latent Semantic Indexing (LSI) with Singular Value Decomposition (SVD) allows unstructured documents to be quickly filtered and viewed, but the resulting representation is much harder for computer machines to comprehend. Proposed approach: A prototype is prepared that reduces redundant structures to organize the data by similarity; the updated rules of NMF combined with k-means are proposed in this paper and used to find the top terms in each cluster. For the purposes of exploration, a data set called Newsgroup20 is considered. To accomplish this, preprocessing steps such as document indexing, stop-word removal, and stemming are applied; in particular, the words of each text document must be identified for the extraction of key features. The actual work was distributed in parallel across all documents in this project, and Apache Hadoop MapReduce was used for parallel programming.
Keywords: Big Data, Hadoop, LSI, Newsgroup20, NMF, SVD.
Scope of the Article: Big Data
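The following is a minimal sketch of the clustering step outlined in the abstract: NMF with the standard multiplicative update rules, k-means applied to the document factors, and the top terms of each cluster read off the term factor matrix. The matrix A, the vocabulary, and all parameter values are placeholders for illustration only; in the paper the term-document matrix is built from the preprocessed Newsgroup20 corpus (indexing, stop-word removal, stemming) through Hadoop MapReduce jobs rather than constructed in memory as shown here.

```python
# Hedged sketch of NMF (multiplicative updates) + k-means for document clustering.
# All inputs below are synthetic placeholders, not the paper's actual data or code.
import numpy as np
from sklearn.cluster import KMeans

def nmf_multiplicative(A, k, iters=200, eps=1e-9):
    """Factor A (terms x docs) into W (terms x k) and H (k x docs)
    using the standard multiplicative update rules."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy term-document matrix (rows = terms, columns = documents); a real run
# would use the weighted matrix produced by the preprocessing stage.
A = np.abs(np.random.default_rng(1).random((100, 40)))
vocabulary = [f"term_{i}" for i in range(A.shape[0])]  # placeholder vocabulary

k = 5
W, H = nmf_multiplicative(A, k)

# Cluster documents by their k-dimensional NMF representation.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(H.T)

# Top terms per latent factor, taken from the columns of W.
for topic in range(k):
    top = np.argsort(W[:, topic])[::-1][:10]
    print(f"cluster {topic}: {[vocabulary[t] for t in top]}")
```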