Feature Extraction and Feature Selection Process in Authorship Identification for Tamil Language
A. Pandian1, R Ragavi2, V. V. Ramalingam3
1Dr. A. Pandian, Department of Computer Science, SRMIST, SRM University, Chennai (Tamil Nadu), India.
2R Ragavi, Department of Computer Science, SRMIST, SRM University, Chennai (Tamil Nadu), India.
3Dr. V. V. Ramalingam, Department of Computer Science, SRMIST, SRM University, Chennai (Tamil Nadu), India.
Manuscript received on 02 April 2019 | Revised Manuscript received on 18 April 2019 | Manuscript Published on 30 April 2019 | PP: 1-6 | Volume-7 Issue-6S6 April 2019 | Retrieval Number: F10010476S619/2019©BEIESP | DOI: 10.35940/ijrte.F1001.0476S619
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Authorship identification and stylometric analysis raise a number of interesting problems. Texts written by different authors can be classified by measuring attributes linked to literary style, and authorship can then be attributed to newly encountered texts. This is useful in fields such as psycholinguistics, cybercrime investigation, and political socialization. Numerous classical statistical methods have been applied to the analysis of literary style. In the text-processing stage, salient information is extracted from the Tamil dataset in the form of quantifiable parameters. When the author is unknown, classifying poems becomes tedious. The present paper suggests that the author of an unidentified Tamil poem or text can be determined by categorizing the earlier work of candidate authors and constructing a classifier that assigns the unfamiliar text to one of them. The proposed approach comprises feature extraction followed by feature selection using a decision tree. The whole process is split into two phases, training and testing. Poems by known poets form the training dataset, which contains 5 authors with 80 poems each, whereas poems by unknown poets form the testing dataset. With this methodology, different poets writing in Tamil can be identified, which is a valuable contribution to society. The work considers an author's distinctive writing style and computes the relevance of the extracted features. Experiments on the actual dataset show that the proposed method is highly effective. The proposed decision-tree technique yields better performance than other existing approaches.
Keywords: Classification, Tamil Articles, Feature Extraction, Feature Selection, Stylometry, Training Dataset, Testing Dataset, Authorship.
Scope of the Article: Classification
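
The two stages named in the abstract, feature extraction and decision-tree-based feature selection, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the three stylometric features, the function names, and the use of information gain over binary threshold splits (the criterion a typical decision tree uses when choosing a split) as the selection score.

```python
import math
from collections import Counter


def extract_features(text):
    """Compute a few simple stylometric features (hypothetical feature set).

    Note: len(word) counts Unicode code points. Tamil vowel signs are
    separate code points, so this is not a count of perceived letters.
    """
    words = text.split()
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_words = len(words)
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "avg_words_per_line": n_words / len(lines) if lines else 0.0,
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_information_gain(values, labels):
    """Largest information gain over all binary threshold splits of one feature."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate equal feature values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, base - remainder)
    return best


def rank_features(texts, authors):
    """Rank features by how well a single split on each separates the authors."""
    rows = [extract_features(t) for t in texts]
    gains = {
        name: best_information_gain([r[name] for r in rows], authors)
        for name in rows[0]
    }
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_features(training_poems, training_authors)` on the labelled training poems returns the features ordered from most to least discriminative; under these assumptions, the top-ranked features would be the ones kept for the classifier applied to the testing dataset.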