A Real-time bus arrival time predictive system based on spark framework and machine learning approaches: A case study in Chiang Mai

Ye Li

Please use this identifier to cite or link to this item: http://cmuir.cmu.ac.th/jspui/handle/6653943832/73660

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Pree Thiengburanathum	-
dc.contributor.author	Ye Li	en_US
dc.date.accessioned	2022-07-18T10:48:16Z	-
dc.date.available	2022-07-18T10:48:16Z	-
dc.date.issued	2021-05	-
dc.identifier.uri	http://cmuir.cmu.ac.th/jspui/handle/6653943832/73660	-
dc.description.abstract	As living standards have improved, more people own private vehicles, which leads to traffic congestion, environmental pollution, and even traffic accidents. Under Chiang Mai's public transportation master plan, the city's major public transport modes are red taxis, Grab, tuk-tuks, and buses. Most people choose red taxis to travel around the city. The bus system has been deployed as a new public transport option in the city. However, bus passengers in Chiang Mai are hesitant to take the bus because they lack confidence in the bus schedule. There is a need for an intelligent system, such as a bus arrival time (BAT) prediction system, to help bus passengers obtain accurate BAT. Predicting the BAT at a given bus station is challenging because of real-time data processing, numerous data inputs, and predictive accuracy. For instance, a bus's arrival time at a particular station is affected by the previous station or the previous few stations. One challenge is therefore to improve BAT prediction while accounting for the impact of previous-station data. Previous studies used a small number of features, such as bus location, to make predictions because feature collection was limited. However, BAT forecasts are affected by many features, including bus travel time, bus travel speed, and bus travel distance. Constructing these related features from a small number of collected features is therefore a second challenge. Another challenge is that previous studies analyzed only a small amount of data to save costs and increase forecasting speed, resulting in poor predictive power. Moreover, most previous studies used the Pandas library to process their data; they focused on modeling and ignored the real-time nature of data processing. This research proposes a real-time BAT prediction system.
We collected 78 days of real-world data from the Chiang Mai R3-Y bus route. The data consist of real-time bus locations and location timestamps. We also collected bus station locations from Google Maps. We used these data for feature engineering, which includes proposing algorithms to extract features involving BAT, departure time, travel time, travel speed, travel distance, and dwell time. The research workflow follows the cross-industry standard process for data mining (CRISP-DM): business understanding, data understanding, data preparation (using the Spark framework), modeling (using the Autoregressive Integrated Moving Average with Explanatory Variable (ARIMAX) and Support Vector Regression (SVR) algorithms), and evaluation (using mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), and coefficient of determination (R²)). We evaluated the real-time performance of bus data processing based on Spark. The experimental results show that when the file size is about 900 KB, Spark's running time is about 3 seconds and the running time of Pandas is about 3 times that of Spark. When the file size increases about threefold, Spark's running time is about 7 seconds and the running time of Pandas is about 60 times that of Spark. The SVR model achieved an accuracy of 99.5%, which is 25% higher than that of the ARIMAX model. This research shows that combining the Spark framework with the SVR model to predict time series data can reduce data processing time and achieve high predictive power.	en_US
dc.language.iso	en	en_US
dc.publisher	Chiang Mai : Graduate School, Chiang Mai University	en_US
dc.title	A Real-time bus arrival time predictive system based on spark framework and machine learning approaches: A case study in Chiang Mai	en_US
dc.title.alternative	A bus arrival prediction system using cluster computing with the Spark framework and machine learning: A case study in Chiang Mai Province	en_US
dc.type	Thesis
thailis.controlvocab.thash	Buses -- Chiangmai	-
thailis.controlvocab.thash	Local transit -- Chiangmai	-
thailis.controlvocab.thash	Transportation -- Chiangmai	-
thailis.controlvocab.thash	Data mining	-
thailis.controlvocab.thash	Data sets	-
thesis.degree	master	en_US
thesis.description.thaiAbstract	As people's living standards and quality of life have steadily improved, a growing number of people prefer to own private vehicles. This has led to traffic congestion and environmental pollution, and also to traffic accidents. Under Chiang Mai's public transportation master plan, the province's main public transport modes are red taxis, Grab, tuk-tuks, and buses, but most people choose red taxis to travel around the city. The bus system was therefore introduced as a new public transport option in the city. Even so, bus passengers in Chiang Mai remain hesitant to use public buses because they lack confidence in the bus timetable. An intelligent system, such as a bus arrival time (BAT) prediction system, is therefore needed to help bus passengers obtain accurate bus arrival times. Predicting the bus arrival time at a given bus station is very challenging because it involves real-time data processing, the processing of large volumes of data, and predictive accuracy. For example, the arrival time at a particular bus station may be affected by the previous station or the previous two or three stations. One challenge is to improve BAT prediction capability while taking into account the effect of data from previous stations. Previous studies used only a few independent variables, such as bus location, for prediction because of limits on variable collection. However, BAT prediction is also affected by many features related to bus travel time, bus travel speed, and bus travel distance. Using a small number of features to construct other related features for BAT prediction is therefore a major challenge.
Another challenge is that previous studies analyzed only a small amount of data in order to save costs and increase prediction speed, which resulted in low predictive power. In addition, most previous studies used the Pandas library to process data, focusing on modeling and paying no attention to the real-time nature of data processing. This research proposes a BAT prediction system. We collected 78 days of real-world data from the R3-Y bus route in Chiang Mai, consisting of real-time bus locations and location timestamps. We also collected bus station location data from Google Maps and used these data for feature engineering, which includes proposing algorithms to extract features related to BAT, departure time, travel time, travel speed, travel distance, and dwell time. The workflow of this research follows the cross-industry standard process for data mining (CRISP-DM): business understanding, data understanding, data preparation (using the Spark framework), modeling (using the Autoregressive Integrated Moving Average with Explanatory Variable (ARIMAX) and Support Vector Regression (SVR) algorithms), and evaluation (using mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), and coefficient of determination (R2)). We evaluated the real-time performance of bus data processing based on Spark. The experimental results show that when the file size is about 900 KB, Spark's running time is about 3 seconds and the running time of Pandas is about 3 times that of Spark. When the file size increases about 3 times, Spark's running time is about 7 seconds and the running time of Pandas is about 60 times that of Spark. The SVR model's accuracy is 99.5%, which is 25% higher than that of the ARIMAX model. This research shows that the Spark framework combined with the SVR model for forecasting time series data can reduce data processing time and deliver high predictive power.	en_US
Appears in Collections:	CAMT: Theses
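
Note on the evaluation metrics named in the abstract: the record does not include the thesis code, so the following is only a minimal sketch of how MAE, MSE, RMSE, and the coefficient of determination are conventionally computed for predicted versus observed arrival times. The function name `regression_metrics` and the sample values (seconds) are hypothetical and are not taken from the thesis.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences.

    Hypothetical helper for illustration; not the thesis implementation.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # root mean square error

    # R^2 = 1 - SS_res / SS_tot (undefined when all actual values are equal)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


if __name__ == "__main__":
    # Made-up arrival offsets in seconds, for illustration only.
    observed = [100.0, 200.0, 300.0, 400.0]
    forecast = [110.0, 190.0, 310.0, 390.0]
    print(regression_metrics(observed, forecast))
```

MAE and RMSE are in the same unit as the arrival times, while the coefficient of determination is unitless, which is why the four metrics are usually reported together.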

Files in This Item:

File	Description	Size	Format
612131005 YE LI.pdf		3.36 MB	Adobe PDF
