Abstract:
Lip reading is a technique for interpreting speech from a series of images or videos that show the speaker's face and lip movements without any accompanying audio. This research aims to develop techniques for Thai lip-reading recognition using a dataset of five hundred Thai news videos collected from social media, including YouTube. These videos were used to develop a Thai lip-reading recognition model based on visual speech recognition. Each video in the dataset is extracted frame by frame, and the lip region and lip features are localized in every frame. Three or five representative frames are then selected per syllable using relative-maximum and relative-minimum functions, and the selected frames are concatenated into a single image in either a rectangular or a square layout. The models were built using Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM). The SC5-SKI + CNN + Bi-LSTM model, which uses a square concatenation of five frames in combination with a CNN and a Bi-LSTM, achieved the best performance among the evaluated models: an accuracy of 94.38%, a word error rate of 6.71%, and a character error rate of 7.55% on continuous-sentence Thai lip-reading recognition.
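For illustration, the following is a minimal sketch of the CNN + Bi-LSTM architecture described above, written in PyTorch (the paper does not specify a framework). The input shape, vocabulary size, and layer widths here are assumptions for the sketch, not the authors' configuration: each time step is taken to be one concatenated syllable image, a per-frame CNN extracts features, and a Bi-LSTM models the syllable sequence.

```python
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    """Minimal CNN + Bi-LSTM sketch for syllable-sequence lip reading.

    Assumed (hypothetical) shapes: input (batch, time, 1, 96, 96),
    where each time step is one concatenated lip image per syllable;
    output is per-step logits over a hypothetical syllable vocabulary.
    """
    def __init__(self, num_classes=500, hidden_size=256):
        super().__init__()
        # CNN feature extractor, applied independently to each time step
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed 4x4 spatial output
        )
        # Bidirectional LSTM over the sequence of per-syllable features
        self.bilstm = nn.LSTM(64 * 4 * 4, hidden_size,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):
        b, t, c, h, w = x.shape
        feats = self.cnn(x.reshape(b * t, c, h, w))  # per-step CNN features
        feats = feats.reshape(b, t, -1)              # (batch, time, 1024)
        out, _ = self.bilstm(feats)                  # (batch, time, 2*hidden)
        return self.classifier(out)                  # per-step class logits

# Usage: 2 sentences, 7 syllable images each (dummy data)
model = LipReadingNet()
dummy = torch.randn(2, 7, 1, 96, 96)
print(model(dummy).shape)  # torch.Size([2, 7, 500])
```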