Abstract:
This research compares the efficiency of algorithms for detecting and correcting typos in Thai text in terms of accuracy and processing time. It focuses on combinations of word segmentation methods and typo detection algorithms, with the goal of identifying the most suitable approach for building Thai natural language processing (Thai NLP) tools. The experiments use three Thai datasets, Thai Toxicity Tweet, Wisesight Sentiment, and ThaiSum, which contain human-written text from social media and news articles. After preprocessing, the text was segmented into words with the newmm, deepcut, and attacut tokenizers, and typos were then checked with four algorithms: Levenshtein Distance, Hunspell, Peter Norvig's algorithm, and Word2Vec. Among the combinations tested, attacut with Peter Norvig's algorithm gave the highest accuracy, while newmm with Hunspell was the fastest. Each method has its own advantages and disadvantages, so the choice should depend on whether the application prioritizes accuracy or speed. The research also presents a reusable experimental framework for developers and researchers who want to evaluate or develop Thai typo detection systems in the future.
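As an illustration of the two-stage pipeline summarized above (segment first, then check each token), the following is a minimal sketch of the Levenshtein Distance stage only. It is not the implementation used in the experiments. It assumes the text has already been segmented into tokens by one of the tokenizers, and the function names (`levenshtein`, `suggest`, `check_tokens`), the toy vocabulary, and the `max_distance` threshold are all hypothetical choices made for this example.

```python
from typing import Iterable, List, Optional, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute; cost 1 each).

    Python compares Unicode code points, so a Thai vowel or tone mark
    counts as its own character here (a simplifying assumption).
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(token: str, vocabulary: Set[str], max_distance: int = 2) -> Optional[str]:
    """Return the token itself if known, else the closest vocabulary word
    within max_distance edits, else None."""
    if token in vocabulary:
        return token
    if not vocabulary:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda word: (levenshtein(token, word), word))
    return best if levenshtein(token, best) <= max_distance else None


def check_tokens(tokens: Iterable[str], vocabulary: Set[str],
                 max_distance: int = 2) -> List[Tuple[str, Optional[str]]]:
    """Pair every out-of-vocabulary token with its suggested correction."""
    return [(token, suggest(token, vocabulary, max_distance))
            for token in tokens if token not in vocabulary]


# Hypothetical toy vocabulary and pre-segmented input.
vocabulary = {"สวัสดี", "ครับ", "ขอบคุณ"}
tokens = ["สวัสดี", "ครัล", "ขอบคุน"]
flagged = check_tokens(tokens, vocabulary)
```

In this sketch the quality of the result depends directly on the token boundaries it receives, which is one reason the choice of tokenizer and the choice of detection algorithm are evaluated together rather than separately. The linear scan over the vocabulary is kept for clarity and would be slow on a full dictionary.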