A comparative study of open source crawlers based on robustness and scalability testing on e-commerce websites

Desheng Yang

Please use this identifier to cite or link to this item: http://cmuir.cmu.ac.th/jspui/handle/6653943832/73795

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Pree Thiengburanathum	-
dc.contributor.author	Desheng Yang	en_US
dc.date.accessioned	2022-08-07T05:08:39Z	-
dc.date.available	2022-08-07T05:08:39Z	-
dc.date.issued	2021-05	-
dc.identifier.uri	http://cmuir.cmu.ac.th/jspui/handle/6653943832/73795	-
dc.description.abstract	Web crawlers are automated software for extracting data from the Internet. E-commerce websites are an essential application area for crawlers. Poorly performing web crawlers can waste many resources in development and maintenance. Nowadays, e-commerce websites deploy more complex anti-crawler mechanisms in real network environments, and they generate a large volume of data every day. For e-commerce websites, large-scale crawling involves many problems, such as exception handling and task scheduling. Many open source crawlers have been developed to handle these issues. However, choosing a suitable open source crawler is a major challenge because a systematic comparative study of them is lacking. In previous studies, crawling speed was one of the most important evaluation factors for comparing crawler performance. Because of anti-crawler mechanisms, some studies mentioned comparing the robustness and scalability of web crawlers, but a systematic study based on these two properties is lacking. Web crawlers must be able to avoid anti-crawler traps and handle exponentially increasing data on e-commerce websites. Therefore, the robustness and scalability of crawlers become key factors for evaluating the performance of web crawlers. Robustness is the ability to handle exceptions while crawling. Scalability is the property that a system can accept a continually increasing amount of data without a decrease in performance. An excellent crawler should have good robustness and scalability on e-commerce websites. This study aims to evaluate web crawlers' robustness and scalability on e-commerce websites. A novel framework for open source web crawlers on e-commerce websites was proposed.
Multiple testing environments were set up on e-commerce websites, and robustness testing and scalability testing were used to measure the robustness and scalability of web crawlers. Two scaling modes, scale up and scale out, were applied to scalability testing. Five scalability attributes (crawling throughput, CPU usage, disk IO throughput, memory usage, and network usage) were defined to evaluate scalability. Scaling efficiency was established to quantify the scalability of the crawlers based on these attributes. Meanwhile, robustness testing was designed as two experiments: a non-interference test and an interference test. The robustness failure rate was used to quantify robustness. Statistical methods such as the Friedman test and the Nemenyi test were used to analyze the significant differences among crawlers. The experimental results revealed that Nutch, Scrapy, and Heritrix have the best scalability in scale up mode. Furthermore, Nutch and Scrapy have the best scalability in scale out mode. In the non-interference test, Scrapy has the best robustness. However, WebMagic, WebCollector, and Gecco have the best robustness in the interference test based on the general test and the database test. This research can help developers find an appropriate web crawler and can serve as a reference for evaluating the performance of web crawlers.	en_US
dc.language.iso	en	en_US
dc.publisher	Chiang Mai : Graduate School, Chiang Mai University	en_US
dc.title	A comparative study of open source crawlers based on robustness and scalability testing on e-commerce websites	en_US
dc.title.alternative	A comparative study of open source software for extracting data from e-commerce websites using scalability and robustness testing	en_US
dc.type	Thesis
thailis.controlvocab.thash	Computer software	-
thailis.controlvocab.thash	Open source software	-
thailis.controlvocab.thash	Software engineering	-
thailis.controlvocab.thash	Web sites	-
thesis.degree	master	en_US
thesis.description.thaiAbstract	Web crawlers are automated software for collecting data from the Internet. E-commerce websites are an important source for crawler applications. However, using web crawlers with limited performance can waste a great deal of resources in development and maintenance. Moreover, e-commerce websites today often have increasingly complex anti-crawler mechanisms on real production networks, and the websites themselves generate more data every day. For e-commerce websites, large-scale crawling often runs into problems such as exception handling and task scheduling. Many open source web crawlers have been developed to address these problems. Nevertheless, choosing a suitable open source web crawler is a great challenge because a systematic comparative study of these automated tools is still lacking. Previous studies found crawling speed to be one of the most important evaluation factors for comparing crawler performance. Because of anti-crawler mechanisms on the Internet, some studies have mentioned comparing the robustness and scalability of web crawlers, but none has studied them systematically on that basis. Web crawlers therefore need to be able to avoid anti-crawler traps and to handle exponentially growing data on e-commerce websites. For these reasons, the robustness and scalability of web crawlers are key factors in evaluating their performance. Robustness is the ability to handle exceptions while the crawler is collecting data. Scalability is the property that a system can accept a continually growing volume of data without a decline in performance.
A good web crawler should have better robustness and scalability when working on e-commerce websites. This study aims to emphasize evaluating the robustness and scalability of web crawlers on e-commerce websites by proposing a novel framework for open source web crawlers on e-commerce websites. Multiple test environments were set up on the websites, and robustness testing and scalability testing were used to measure the robustness and scalability of web crawlers. Two scaling modes, scale up and scale out, were applied to scalability testing. The attributes used to evaluate scalability are crawling throughput, CPU usage, disk IO throughput, memory usage, and network usage. Scaling efficiency was defined to measure the scalability of web crawlers according to these scalability attributes. Meanwhile, robustness testing was designed as two experiments: a non-interference test and an interference test. The robustness failure rate was used to measure robustness. Statistical methods such as the Friedman test and the Nemenyi test were used to analyze significant differences among the web crawlers. The experimental results showed that Nutch, Scrapy, and Heritrix have the best scalability in scale up mode. In addition, Nutch and Scrapy have the best scalability in scale out mode. In the non-interference test, Scrapy showed the highest robustness. However, WebMagic, WebCollector, and Gecco have the best robustness in the interference test according to the general test and the database test. This research can support developers in finding a suitable web crawler and can serve as a reference for evaluating the performance of web crawlers.	en_US
Appears in Collections:	CAMT: Theses

Files in This Item:

File	Description	Size	Format
612131003 DESHENG YANG.pdf		8.88 MB	Adobe PDF
