Please use this identifier to cite or link to this item: http://cmuir.cmu.ac.th/jspui/handle/6653943832/79707
Title: Thai spelling correction investigation framework based on OCR word extraction
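Other Titles: กระบวนการสำรวจการแก้ไขคำสะกดผิดภาษาไทย กระบวนการสกัดคำตั้งแต่ต้นจนจบโดยการรู้จำอักขระด้วยแสง (Thai misspelling correction investigation process: an end-to-end word extraction process using optical character recognition)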
Authors: Jukkrit Mengkaw
Pree Thiengburanathum
Issue Date: Jun-2024
Publisher: Chiang Mai : Graduate School, Chiang Mai University
Abstract: Spelling correction (SC) detects and corrects misspelled words. It is a fundamental task in many Natural Language Processing (NLP) applications, such as machine translation, chatbots, and Optical Character Recognition (OCR) systems. Most documents are currently archived as PDF files, which humans can read easily but computers cannot analyze directly, so text extraction is needed to turn them into machine-usable data. Few studies address spelling correction in low-resource languages, particularly for the output of Thai OCR. Spelling correction in Thai is difficult not only because data is scarce but also because of the complexity of the language. In this paper, we propose a two-step spelling correction framework consisting of an error detection step and an error correction step. For error detection, a Conditional Random Field (CRF) performed best, achieving an F1-score of 93.20%. For error correction, a Bi-LSTM with an attention mechanism achieved an F1-score of 86.31% and WangchanBERTa achieved an F1-score of 81.36%. However, WangchanBERTa's inference is 40 times faster than that of the attention-based model, and it can reduce the word error rate (WER) from 11.99% to 4.51%. The experimental results show that the proposed method effectively detects and corrects spelling errors in Thai text.
URI: http://cmuir.cmu.ac.th/jspui/handle/6653943832/79707
Appears in Collections:ENG: Independent Study (IS)

Files in This Item:
File: 630632114-Jukkrit Mengkaw.pdf
Size: 2.17 MB
Format: Adobe PDF


Items in CMUIR are protected by copyright, with all rights reserved, unless otherwise indicated.