Thai spelling correction investigation framework based on OCR word extraction

Jukkrit Mengkaw

Please use this identifier to cite or link to this item: http://cmuir.cmu.ac.th/jspui/handle/6653943832/79707

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Pree Thiengburanathum	-
dc.contributor.author	Jukkrit Mengkaw	en_US
dc.date.accessioned	2024-07-12T00:53:54Z	-
dc.date.available	2024-07-12T00:53:54Z	-
dc.date.issued	2024-06	-
dc.identifier.uri	http://cmuir.cmu.ac.th/jspui/handle/6653943832/79707	-
dc.description.abstract	Spelling correction (SC) detects and corrects misspelled words. It is a fundamental task in many Natural Language Processing (NLP) applications, such as machine translation, chatbots, and Optical Character Recognition (OCR) systems. Most documents are currently archived as PDF files, which humans can read easily; text extraction is therefore needed to obtain data from such documents in a form a computer can analyze. Few studies address spelling correction for low-resource languages, particularly in the context of Thai OCR. Spelling correction in Thai is difficult not only because data are scarce but also because the language itself is complex. In this paper, we propose a two-step spelling correction framework consisting of an error detection step and an error correction step. For error detection, Conditional Random Field (CRF) performed best, achieving an F1-score of 93.20%. For error correction, Bi-LSTM with an attention mechanism achieved an F1-score of 86.31% and WangchanBERTa achieved an F1-score of 81.36%. However, WangchanBERTa's inference is 40 times faster than that of the attention-based model, and it reduces the word error rate (WER) from 11.99% to 4.51%. The experimental results show that the proposed method effectively detects and corrects spelling errors in Thai text.	en_US
dc.language.iso	en	en_US
dc.publisher	Chiang Mai : Graduate School, Chiang Mai University	en_US
dc.title	Thai spelling correction investigation framework based on OCR word extraction	en_US
dc.title.alternative	Thai misspelling correction investigation process: an end-to-end word extraction process using optical character recognition	en_US
dc.type	Independent Study (IS)
thailis.controlvocab.lcsh	Thai language -- Orthography and spelling	-
thailis.controlvocab.lcsh	Thai language -- Errors of usage	-
thailis.controlvocab.lcsh	Thai language -- Alphabet	-
thesis.degree	master	en_US
thesis.description.thaiAbstract	Spelling correction is used to detect and correct misspelled words. It is regarded as a fundamental task for various Natural Language Processing applications, such as Machine Translation, Chatbots, and Optical Character Recognition. Documents are increasingly stored as PDF files, which makes them much more convenient to access and read. Using the data in these document files for various kinds of analysis is therefore a considerable challenge, and using data from electronic documents of any kind requires first extracting the text for preprocessing. Relatively little research has studied spelling correction, especially spelling correction in Thai, because comparatively little Thai data is available for study and because the structure of the Thai language is complex. In this study, the researcher proposes an approach to Thai spelling correction with two steps: misspelling detection and misspelling correction. Misspelling detection uses Conditional Random Field (CRF), which detects misspellings quite effectively; the experiments yielded an F1-score of 93.20%. For misspelling correction, the Bi-LSTM model achieved an F1-score of 86.31% and the WangchanBERTa model achieved an F1-score of 81.36%. However, the WangchanBERTa-based correction model runs 40 times faster and reduces the Word Error Rate (WER) from 11.99% to 4.51%. The experimental results show that the proposed method detects and corrects Thai misspellings effectively.	en_US
Appears in Collections:	ENG: Independent Study (IS)

Files in This Item:

File	Description	Size	Format
630632114-Jukkrit Mengkaw.pdf		2.17 MB	Adobe PDF
