An Ameliorate Approach for Near Duplicate Page Detection Considering Synonyms of Keyword
V. A. Narayana1, Gaddamidhi Sreevani2, K. Srujan Raju3

1Dr. V. A. Narayana, Department of CSE, CMR College of Engineering & Technology, Hyderabad (Telangana), India.
2Gaddamidhi Sreevani, Department of CSE, CMR College of Engineering & Technology, Hyderabad (Telangana), India.
3Dr. K. Srujan Raju, Department of CSE, CMR Technical Campus, Hyderabad (Telangana), India.
Manuscript received on 20 June 2019 | Revised Manuscript received on 11 July 2019 | Manuscript Published on 17 July 2019 | PP: 1232-1239 | Volume-8 Issue-1C2 May 2019 | Retrieval Number: A12200581C219/2019©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Recent years have seen the remarkable growth of the World Wide Web (WWW). Information is available at one's fingertips anytime, anywhere through this massive web repository. The performance and reliability of web search engines consequently face significant problems because of the enormous amount of web data. This volume of data has also created problems for web crawlers, with the result that search results are sometimes of much less relevance to the user. Moreover, the presence of duplicate and near-duplicate web documents has created additional overhead for search engines, critically affecting their performance. The demand for integrating data from heterogeneous sources leads to the problem of near-duplicate web pages. The detection of near-duplicate documents within a collection has recently become an area of great interest. In this paper, an efficient approach for detecting near-duplicate web pages in web crawling is presented; it uses keywords along with their synonyms and a similarity score measure between the documents being compared. Previously, Narayana et al. proposed "A Novel and Efficient Approach for Near Duplicate Page Detection in Web Crawling". In that approach, the keywords are first extracted from the crawled web pages, and the similarity score between pages is calculated from the extracted keywords. However, that approach does not take into account keywords that have the same semantic content. This decreases the accuracy and efficiency of identifying near-duplicate pages.
In the proposed system, which is designed to overcome this problem, all keywords are collected from the crawled page; then, for each frequently occurring keyword, its synonyms are considered, and the similarity score between the two pages is calculated. With this technique, duplicate pages created by replacing keywords with their synonyms are also detected and hence are not added to the repository.
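The idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the `SYNONYMS` table, the function names, the Jaccard overlap used as the similarity score, and the 0.8 threshold are all assumptions made for illustration, since the abstract does not specify the scoring formula or the synonym source. For brevity the sketch also folds synonyms for every keyword rather than only the frequently occurring ones.

```python
import re
from collections import Counter

# Hypothetical stop-word list and synonym table (assumptions for this sketch);
# a real system would likely use a fuller list and a thesaurus.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for"}
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "fast": {"quick", "rapid"},
    "buy": {"purchase"},
}


def build_canonical_map(synonyms):
    """Map each head word and each of its synonyms to the head word."""
    canonical = {}
    for head, alternatives in synonyms.items():
        canonical[head] = head
        for alt in alternatives:
            canonical[alt] = head
    return canonical


CANONICAL = build_canonical_map(SYNONYMS)


def extract_keywords(text, use_synonyms=True):
    """Return keyword counts for a page, optionally folding synonyms together."""
    tokens = re.findall(r"[a-z]+", text.lower())
    words = [t for t in tokens if t not in STOP_WORDS]
    if use_synonyms:
        words = [CANONICAL.get(w, w) for w in words]
    return Counter(words)


def similarity_score(page_a, page_b, use_synonyms=True):
    """Jaccard overlap of the two keyword sets (an assumed stand-in measure)."""
    a = set(extract_keywords(page_a, use_synonyms))
    b = set(extract_keywords(page_b, use_synonyms))
    if not a and not b:
        return 1.0  # two pages with no keywords are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(page_a, page_b, threshold=0.8):
    """Flag a page pair whose synonym-aware score reaches the threshold."""
    return similarity_score(page_a, page_b) >= threshold


if __name__ == "__main__":
    original = "Buy a fast car"
    rewritten = "Purchase a quick automobile"
    print(similarity_score(original, rewritten, use_synonyms=False))
    print(similarity_score(original, rewritten, use_synonyms=True))
    print(is_near_duplicate(original, rewritten))
```

In this toy example the two pages share no literal keywords, so a keyword-only comparison misses the pair, while the synonym-aware comparison maps both pages to the same keyword set and flags them as near duplicates.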
Keywords: Near Duplicate Documents, Similarity Score Measure, Confusion Matrix, Storage Space Complexity, Memory Usage Analysis, Computation Time Analysis.
Scope of the Article: Analysis of Algorithms and Computational Complexity