An Enhanced Unsupervised Fuzzy Expectation Maximization Clustering for Deduplication of Records in Big data
P. Selvi1, D. Shanmuga Priyaa2
1P. Selvi, Research Scholar, Department of Computer Science, Karpagam Academy of Higher Education, Coimbatore (Tamil Nadu), India.
2Dr. D. Shanmuga Priyaa, Professor, Department of CS, CA & IT, Karpagam Academy of Higher Education, Coimbatore (Tamil Nadu), India.
Manuscript received on 29 November 2019 | Revised Manuscript received on 04 December 2019 | Manuscript Published on 10 December 2019 | PP: 988-993 | Volume-8 Issue-3S2 October 2019 | Retrieval Number: C12691083S219/2019©BEIESP | DOI: 10.35940/ijrte.C1269.1083S219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: A central problem in managing records in a data warehouse or cloud storage is the presence of duplicate records, which needlessly consume storage capacity and increase computational cost. The problem is especially common when multiple databases are integrated. This paper focuses on detecting fully and partially duplicated records before they are stored in cloud storage. The proposed approach first converts the entire content of each record to numeric values using a radix method so that deduplication can be applied. Fuzzy Expectation Maximization (FEM) is then used to cluster the numeric values, which reduces the time spent comparing records. To detect and eliminate duplicates, a divide-and-conquer algorithm matches records within each cluster, which further improves the performance of the model. Simulation results indicate that the proposed model achieves a higher detection rate for duplicate records.
Keywords: Duplication, Data Warehouse, Cloud Storage, Fuzzy Expectation Maximization, Deduplication.
Scope of the Article: Clustering
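The three stages named in the abstract (radix-based numeric conversion, fuzzy clustering, and divide-and-conquer matching within clusters) can be illustrated with a minimal Python sketch. The abstract does not give the encoding or the FEM update rules, so everything here is an assumption for illustration: `radix_encode` reads the first few bytes of a record as base-256 digits, `fuzzy_cluster` is a plain one-dimensional fuzzy c-means loop standing in for FEM, and `dedupe` uses a `difflib` similarity threshold to catch partial duplicates. All function names and parameter values are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def normalize(record: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(record.lower().split())


def radix_encode(record: str, base: int = 256, width: int = 8) -> float:
    """Assumed encoding: read the first `width` bytes as digits of a
    base-`base` fraction, giving a value in [0, 1)."""
    data = normalize(record).encode("utf-8")[:width]
    return sum(byte / base ** (i + 1) for i, byte in enumerate(data))


def fuzzy_cluster(values, k=2, m=2.0, iterations=50):
    """1-D fuzzy c-means (a stand-in for FEM); returns a hard label per value.
    `values` must be non-empty and `m` must be greater than 1."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]

    def memberships():
        rows = []
        for x in values:
            # Small epsilon avoids division by zero when x sits on a center.
            inv = [(abs(x - c) + 1e-12) ** (-2.0 / (m - 1.0)) for c in centers]
            total = sum(inv)
            rows.append([w / total for w in inv])
        return rows

    for _ in range(iterations):
        u = memberships()
        for j in range(k):
            weights = [row[j] ** m for row in u]
            centers[j] = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    # Harden the soft memberships: pick the most likely cluster per value.
    return [max(range(k), key=lambda j: row[j]) for row in memberships()]


def similar(a: str, b: str, threshold: float) -> bool:
    """Treat two normalized records as duplicates above a similarity ratio."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def dedupe(records, threshold=0.85):
    """Divide and conquer: dedupe each half, then drop right-half records
    that match a record kept from the left half."""
    if len(records) <= 1:
        return list(records)
    mid = len(records) // 2
    left = dedupe(records[:mid], threshold)
    right = dedupe(records[mid:], threshold)
    kept = [r for r in right if not any(similar(r, l, threshold) for l in left)]
    return left + kept


def deduplicate(records, k=2, threshold=0.85):
    """Encode, cluster, then match records only within each cluster."""
    labels = fuzzy_cluster([radix_encode(r) for r in records], k=k)
    clusters = {}
    for record, label in zip(records, labels):
        clusters.setdefault(label, []).append(record)
    unique = []
    for label in sorted(clusters):
        unique.extend(dedupe(clusters[label], threshold))
    return unique


if __name__ == "__main__":
    sample = [
        "John Smith, Chennai",
        "john  smith, chennai",
        "John Smith, Chenai",
        "Zara Khan, Mumbai",
        "Zara Khan, Mumbai",
    ]
    print(deduplicate(sample))
```

Restricting comparisons to records in the same cluster is what reduces the number of pairwise matches. In this sketch the prefix-based encoding is a deliberate simplification: records that differ in their first few characters land in different clusters and are never compared, so a real system would need an encoding that keeps near-duplicates numerically close.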