Web-Based Content Extraction and Retrieval in Web Engineering
C. H. Sarada Devi1, T. Kumanan2, Prameela Devi Chillakuru3
1C. H. Sarada Devi, Research Scholar, Meenakshi Academy of Higher Education and Research, Chennai (Tamil Nadu), India.
2Dr. T. Kumanan, Professor, Department of CSE, Meenakshi Academy of Higher Education and Research, Chennai (Tamil Nadu), India.
3Prameela Devi Chillakuru, Research Scholar, Meenakshi Academy of Higher Education and Research, Chennai (Tamil Nadu), India.
Manuscript received on 10 October 2019 | Revised Manuscript received on 19 October 2019 | Manuscript Published on 02 November 2019 | PP: 71-80 | Volume-8 Issue-2S11 September 2019 | Retrieval Number: B10130982S1119/2019©BEIESP | DOI: 10.35940/ijrte.B1013.0982S1119
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The rapid and wide-ranging spread of data and information over the web has produced an enormous, highly dispersed volume of natural-language textual resources. Considerable attention has therefore turned to discovering, sharing, and retrieving this vast source of knowledge. Processing such large volumes of data within a reasonable time frame is a significant challenge and a vital necessity in many commercial and research fields. Computer clusters, distributed systems, and parallel computing paradigms have been increasingly adopted in recent years, as they deliver substantial improvements in computing performance for data-intensive contexts such as Big Data mining and analysis. Natural Language Processing (NLP) is a key technique for text annotation and initial feature extraction from an application domain, and it carries high computational requirements; such tasks can therefore benefit from parallel architectures. This study presents a distributed framework for crawling web documents and running NLP tasks in parallel. The system is built on the Apache Hadoop environment and its associated programming paradigm, MapReduce. Validation is performed by applying the annotation pipeline to extract keywords and key phrases from web documents on a multi-node Hadoop cluster. The results of the proposed work show increased storage capacity, faster data processing, reduced user search time, and accurate retrieval of content from the large dataset stored in HBase.
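The paper's own implementation is not reproduced here; the following is a minimal sketch of the kind of MapReduce keyword-frequency job the abstract describes, written against Hadoop's standard Java MapReduce API. The class name, input path (an HDFS directory of crawled page text), and output path are illustrative assumptions; the authors' actual pipeline, including key-phrase extraction and HBase storage, is more involved.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeywordCount {

  // Mapper: tokenizes each line of crawled document text and emits (term, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts per term, yielding a term-frequency table
  // from which candidate keywords can be ranked.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "keyword count");
    job.setJarByClass(KeywordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS dir of crawled pages (assumed)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output term frequencies
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job of this shape would be submitted to the cluster with, for example, `hadoop jar keywordcount.jar KeywordCount /crawl/pages /crawl/term-freq` (paths hypothetical); the combiner reduces shuffle traffic across the multi-node cluster, which is one source of the speedup the abstract reports.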
Keywords: Natural Language Processing, Hadoop, Text Parsing, Web Crawling, Big Data Mining, HBase.
Scope of the Article: Web Mining