Abstract:
The purpose of this research was to support the development of machine translation (MT) in Thailand. The primary requirement for improving Thai MT performance is the availability of resources, especially the parallel texts needed to train translation models. The specific objectives of this project were to create a parallel text corpus that could be used for Thai-English (or English-Thai) machine translation, and to make this resource free and publicly available. Corpus construction involved the extraction, segmentation and alignment of web text. To establish the needs of Thai machine translation, it was also necessary to survey the field and identify historical and current needs and trends, such as the use of neural networks.
Despite the importance of parallel text corpora, few such resources exist for the Thai language, due in part to the difficulty of corpus construction. This project details the issues raised at each stage of construction. The project objectives also included creating the tools needed to build the corpus. These tools covered extraction using a Python library and regular expressions, further development of existing segmentation tools, and alignment through both statistical and linguistic approaches based on sentence length and on recognizable characters called cognates. These stages are required to transform the web text into the required format. Although it is not possible to create a system that automatically extracts, segments and aligns texts suitable for the corpus, the tools may benefit researchers and are therefore also made publicly available. It is hoped that further research can benefit from the findings.
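
As an illustration of how sentence length and cognates could be combined to judge whether a Thai sentence and an English sentence are translations of each other, a minimal Python sketch follows. It is not the project's tool. The function names, the equal weighting, the expected length ratio of 1.0, and the treatment of Latin-script words and ASCII numbers as cognates are all assumptions made for this example.

```python
import re

# Assumption for this sketch: "cognates" are tokens likely to appear unchanged
# in both languages, i.e. Latin-script words and ASCII numbers.
COGNATE_RE = re.compile(r"[A-Za-z]+|[0-9]+(?:[.,][0-9]+)*")


def cognates(sentence):
    """Return the set of lower-cased cognate tokens found in a sentence."""
    return {tok.lower() for tok in COGNATE_RE.findall(sentence)}


def length_score(thai, english, expected_ratio=1.0):
    """Score in (0, 1]; 1.0 when len(english) / len(thai) equals expected_ratio.

    expected_ratio is a hypothetical parameter that would need to be
    estimated from known Thai-English sentence pairs.
    """
    if not thai or not english:
        return 0.0
    ratio = len(english) / (len(thai) * expected_ratio)
    return min(ratio, 1 / ratio)


def cognate_score(thai, english):
    """Fraction of the Thai sentence's cognates that also occur in the English one."""
    thai_cognates = cognates(thai)
    if not thai_cognates:
        return 0.0
    return len(thai_cognates & cognates(english)) / len(thai_cognates)


def pair_score(thai, english, weight=0.5):
    """Weighted mix of the length-based and cognate-based scores."""
    return (weight * length_score(thai, english)
            + (1 - weight) * cognate_score(thai, english))
```

With a Thai sentence such as "บริษัท Google เปิดตัวในปี 1998", the candidate "Google was launched in 1998" shares both cognates ("google" and "1998") and so scores higher than an unrelated sentence of similar length. An aligner would then choose, for each Thai sentence, the candidate with the best score.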