Development of Indonesian Stemming Algorithms through Modification of Grouping, Sequencing and Removing of Affixes Based on Morphophonemic
Iyan Mulyana1, Adang Suhendra2, Ernastuti3, Bheta Agus W4
1Iyan Mulyana, Department of Computer Science, Pakuan University, Bogor, Indonesia.
2Adang Suhendra, Department of Technique Informatic, Gunadarma University, Depok, Indonesia.
3Ernastuti, Department of Industrie Technology, Gunadarma University, Depok, Indonesia.
4Bheta Agus W, Department of Technique Informatic, Gunadarma University, Depok, Indonesia.
Manuscript received on 03 August 2019 | Revised Manuscript received on 26 August 2019 | Manuscript Published on 05 September 2019 | PP: 179-184 | Volume-8 Issue-2S7 July 2019 | Retrieval Number: B10440782S719/2019©BEIESP | DOI: 10.35940/ijrte.B1044.0782S719
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Text documents are stored on a system in unstructured form, so the information they contain cannot be extracted directly. Extracting it requires text processing, which begins with an initial step (text preprocessing) that converts the documents into a more structured form by selecting the words used as indexes. The smaller the index, the more text documents the system can recognize and the more easily the information is extracted. The size of the index is determined by the number of word groups formed. To avoid forming many word groups, each word is first reduced to its base word before grouping. The process of converting an affixed word into its base word using certain rules is called stemming. This research aims to produce a new Indonesian stemming algorithm, named the UG18 Stemmer, that reduces or eliminates the stemming errors (over-stemming and under-stemming) found in existing stemming algorithms, including the Enhanced Confix Stripping (ECS) Stemmer and the New Enhanced Confix Stripping (NECS) Stemmer. The method used is a morphophonemic process approach, which treats affixes as bound morphemes that undergo phoneme change, phoneme addition, and phoneme removal. These three processes were mapped, and Finite State Automata were constructed to obtain new affix groups, new affix sequences, and new removal methods, which form the basis of the UG18 Stemmer algorithm. The algorithm does not use the list of decapitation rules used in earlier algorithms; those rules are replaced with morphophonemic-based elimination rules. Based on evaluation and testing, the UG18 Stemmer has a lower error rate than the NECS Stemmer.
This result comes from a randomized test of 2,500 words using Relevance Judgment validated by Indonesian language experts: the error rates fell from 1.48% over-stemming and 16.69% under-stemming with the NECS Stemmer to 0.12% over-stemming and 0% under-stemming with the UG18 Stemmer. In addition, the UG18 Stemmer improves processing speed in an information retrieval-based document similarity measurement application by 45.47% compared with the ECS Stemmer.
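As a rough illustration of what reversing a morphophonemic process can look like, the following Python sketch strips the Indonesian meN- prefix and restores a root-initial phoneme that the prefix may have absorbed (for example, menulis comes from tulis). It is not the UG18 Stemmer: the rule table, the tiny root lexicon, and the function name `strip_men_prefix` are all assumptions made for this example, and the lexicon lookup is only a simple way to choose between candidate roots.

```python
# Toy root lexicon (hypothetical); a real stemmer would need far broader coverage.
ROOTS = {"pukul", "baca", "tulis", "dengar", "sapu", "kirim", "ambil", "lihat"}

# Each rule is (surface form of the prefix, phoneme to restore on the root).
# Longer surface forms come first so "meny" is tried before "men".
# This table is an illustrative assumption, not the rule set from the paper.
MEN_RULES = [
    ("meny", "s"),  # menyapu   -> sapu   (root-initial s was removed)
    ("meng", "k"),  # mengirim  -> kirim  (root-initial k was removed)
    ("meng", ""),   # mengambil -> ambil  (nothing removed)
    ("mem", "p"),   # memukul   -> pukul  (root-initial p was removed)
    ("mem", ""),    # membaca   -> baca
    ("men", "t"),   # menulis   -> tulis  (root-initial t was removed)
    ("men", ""),    # mendengar -> dengar
    ("me", ""),     # melihat   -> lihat
]


def strip_men_prefix(word, roots=ROOTS):
    """Return the root of a meN- prefixed word, or the word itself if no rule fits."""
    for prefix, restored in MEN_RULES:
        if word.startswith(prefix):
            candidate = restored + word[len(prefix):]
            # Accept the first candidate that is a known root.
            if candidate in roots:
                return candidate
    return word


if __name__ == "__main__":
    for w in ["memukul", "membaca", "menulis", "menyapu", "mengirim", "mengambil"]:
        print(w, "->", strip_men_prefix(w))
```

Treating the allomorphs of one prefix as a single group with per-allomorph restoration rules is one way to read the abstract's "phoneme change, addition, and removal"; the paper's actual grouping and ordering of affixes may differ.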
Keywords: Stemming, Affixes, Morphophonemic, UG18 Stemmer.
Scope of the Article: Web Algorithms