Multi-Modal RGB-D Data Based CNN Training with Uni-Modal RGB Data Testing for Real-Time Sign Language Recognition
Sunitha Ravi1, M. Suman2, P.V.V. Kishore3, E. Kiran Kumar4, M. Teja Kiran Kumar5 and D. Anil Kumar6
1Sunitha Ravi, Department of Electronics and Communication Engineering, K.L.E.F., Green Fields, Vaddeswaram, Guntur, A.P., INDIA – 522502.
2M. Suman, Department of Electronics and Communication Engineering, K.L.E.F., Green Fields, Vaddeswaram, Guntur, A.P., INDIA – 522502.
3P.V.V. Kishore, Department of Electronics and Communication Engineering, K.L.E.F., Green Fields, Vaddeswaram, Guntur, A.P., INDIA – 522502.
4E. Kiran Kumar, Department of Electronics and Communication Engineering, K.L.E.F., Green Fields, Vaddeswaram, Guntur, A.P., INDIA – 522502.
5M. Teja Kiran Kumar, Department of Electronics and Communication Engineering, K.L.E.F., Green Fields, Vaddeswaram, Guntur, A.P., INDIA – 522502.
6D. Anil Kumar, Department of Electronics and Communication Engineering, K.L.E.F., Green Fields, Vaddeswaram, Guntur, A.P., INDIA – 522502.
Manuscript received on 21 April 2019 | Revised Manuscript received on 26 May 2019 | Manuscript published on 30 May 2019 | PP: 2972-2982 | Volume-8 Issue-1, May 2019 | Retrieval Number: A1934058119/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: At present, the accuracy of translating video-based sign language into text or voice remains unclear, making it an interesting and challenging problem for computer scientists. Higher accuracies can now be achieved by applying deep learning models to sign language recognition (SLR), as was done successfully for the human action recognition problem. This inspired us to investigate convolutional neural networks (CNNs) for translating 2D sign videos into text. To this end, we propose a novel four-stream CNN architecture with multimodal training (MT) on RGB and depth data and unimodal testing (UT) on RGB data only, applied to RGB-D sign language video data. The four streams cluster into two native modal streams, RGB and depth. Based on the domain characteristics, each native modality is further divided into modality-specific spatial and temporal streams. Feeding raw sign video for training can be ineffective because the variations in the sign language data are small compared to the large background variations in the video sequence. We observed this overfitting problem in preliminary experiments, where the CNNs learned the noisy background rather than the foreground signer. The overfitting problem was solved by a feature-sharing mechanism between the RGB and depth modalities. Experimental results show that the proposed CNN is capable of predicting class labels from unimodal (RGB) data only. We tested the performance of the proposed MTUTCNN architecture on our own RGB-D sign language dataset (BVCSL3D) and three RGB-D action datasets for scale, subject and view invariance. Results were validated against current state-of-the-art deep learning based sign language (or action) recognition models. Our study shows a recognition rate of 91.93% on the BVCSL3D dataset.
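To make the multimodal-training / unimodal-testing idea concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' actual MTUTCNN. Four small CNN streams (RGB-spatial, RGB-temporal, depth-spatial, depth-temporal) are trained jointly, with feature sharing assumed here to be a simple averaging of RGB and depth features; at test time only the two RGB streams are used. All layer sizes, input shapes and the fusion rule are illustrative assumptions.

import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """One spatial or temporal stream: a small 2D CNN feature extractor (illustrative)."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
    def forward(self, x):
        return self.features(x).flatten(1)  # (batch, 64) feature vector

class MultiModalTrainUnimodalTestCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Native RGB modality: spatial (frames) and temporal (e.g. optical-flow) streams.
        self.rgb_spatial = StreamCNN(3)
        self.rgb_temporal = StreamCNN(2)
        # Native depth modality: spatial and temporal streams (used only during training).
        self.depth_spatial = StreamCNN(1)
        self.depth_temporal = StreamCNN(1)
        self.classifier = nn.Linear(64 * 2, num_classes)

    def forward(self, rgb, flow, depth=None, depth_flow=None):
        f_rgb = self.rgb_spatial(rgb)
        f_flow = self.rgb_temporal(flow)
        if self.training and depth is not None:
            # Feature sharing (assumed here as averaging) lets depth features
            # steer the RGB streams toward the foreground signer.
            f_rgb = 0.5 * (f_rgb + self.depth_spatial(depth))
            f_flow = 0.5 * (f_flow + self.depth_temporal(depth_flow))
        # At test time only the RGB streams are evaluated (unimodal testing).
        return self.classifier(torch.cat([f_rgb, f_flow], dim=1))

# Usage with random tensors standing in for one batch of video frames.
model = MultiModalTrainUnimodalTestCNN(num_classes=200)
rgb, flow = torch.randn(4, 3, 112, 112), torch.randn(4, 2, 112, 112)
depth, depth_flow = torch.randn(4, 1, 112, 112), torch.randn(4, 1, 112, 112)
model.train()
logits = model(rgb, flow, depth, depth_flow)   # multimodal training pass
model.eval()
logits = model(rgb, flow)                      # RGB-only testing pass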
Index Terms: 3D Sign Language Recognition, Feature Sharing CNNs, Convolutional Neural Nets, Multimodal Training, Unimodal Testing.
Scope of the Article: Natural Language Processing