Title: A comparative study of open source crawlers based on robustness and scalability testing on e-commerce websites
Other Titles (Thai, translated): A comparative study of open source software for extracting data from e-commerce websites, using scalability and robustness testing
Authors: Desheng Yang
Authors: Pree Thiengburanathum
Issue Date: May-2021
Publisher: Chiang Mai : Graduate School, Chiang Mai University
Abstract: Web crawlers are automated software for extracting data from the Internet, and e-commerce websites are an essential application area for them. Poorly performing web crawlers can waste considerable development and maintenance resources. Today, e-commerce websites deploy increasingly complex anti-crawler mechanisms in real network environments and generate large volumes of data every day, so large-scale crawling of such sites raises many problems, such as exception handling and task scheduling. Many open source crawlers have been developed to address these issues, but choosing a suitable one is a significant challenge because systematic comparative studies are lacking. In previous studies, crawling speed was one of the most important factors for comparing crawler performance. Some studies mentioned comparing the robustness and scalability of web crawlers in the face of anti-crawler mechanisms, but a systematic study based on these two properties was missing. A web crawler must be able to avoid anti-crawler traps and handle the exponentially increasing data on e-commerce websites; robustness and scalability therefore become key factors for evaluating crawler performance. Robustness is the ability to handle exceptions that occur while crawling; scalability is the ability of a system to absorb a continuously increasing amount of data without a drop in performance. An excellent crawler should exhibit both on e-commerce websites. This study emphasizes evaluating the robustness and scalability of web crawlers on e-commerce websites, and a novel evaluation framework for open source web crawlers is proposed.
Multiple testing environments were set up on e-commerce websites, and robustness testing and scalability testing were used to measure the two properties. Two scaling modes, scale-up and scale-out, were applied in the scalability testing. Scalability attributes (crawling throughput, CPU usage, disk I/O throughput, memory usage, and network usage) were defined to evaluate scalability, and a scaling efficiency measure based on these attributes was established to quantify it. Meanwhile, the robustness testing was designed around two experiments, a non-interference test and an interference test, and a robustness failure rate was used to quantify robustness. Statistical methods, namely the Friedman test and the Nemenyi post-hoc test, were used to analyze significant differences among the crawlers. The experimental results revealed that Nutch, Scrapy, and Heritrix have the best scalability in scale-up mode, while Nutch and Scrapy have the best scalability in scale-out mode. In the non-interference test, Scrapy has the best robustness; however, WebMagic, WebCollector, and Gecco have the best robustness in the interference test, based on a general test and a database test. This research can help developers find an appropriate web crawler and can serve as a reference for evaluating web crawler performance.
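The abstract's quantitative measures can be sketched in Python. The definitions below are assumptions for illustration only: the thesis's exact formulas for robustness failure rate and scaling efficiency are not reproduced here, and the crawler names are reused with hypothetical throughput numbers, not experimental data. The Friedman statistic is computed without tie correction.

```python
# Illustrative sketch of the evaluation metrics named in the abstract.
# All formulas and numbers here are assumptions, not the thesis's data.

def robustness_failure_rate(failed, total):
    """Assumed definition: fraction of crawl requests that fail."""
    return failed / total

def scaling_efficiency(base_throughput, scaled_throughput, factor):
    """Assumed definition: measured speedup divided by the ideal linear
    speedup when resources are scaled by `factor`."""
    return scaled_throughput / (base_throughput * factor)

def friedman_statistic(*samples):
    """Friedman chi-square for k related samples of equal length n.
    Ranks crawlers within each repeated measurement (block); no tie
    correction, which the tie-free data below does not need."""
    k, n = len(samples), len(samples[0])
    rank_sums = [0.0] * k
    for j in range(n):
        order = sorted(range(k), key=lambda i: samples[i][j])
        for rank, i in enumerate(order, start=1):
            rank_sums[i] += rank
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))

# Hypothetical throughput (pages/min) for three crawlers over five runs.
scrapy   = [118, 121, 119, 120, 117]
nutch    = [102,  99, 104, 101, 100]
heritrix = [ 95,  97,  96,  94,  98]

print(robustness_failure_rate(12, 400))             # → 0.03
print(scaling_efficiency(120, 420, 4))              # → 0.875
print(friedman_statistic(scrapy, nutch, heritrix))  # → 10.0
```

A statistic of 10.0 with k = 3 and n = 5 exceeds the chi-square critical value at the 0.05 level (5.99, df = 2), so in this toy data the crawlers would differ significantly and a Nemenyi post-hoc test would then identify which pairs differ.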
Appears in Collections:CAMT: Theses

Files in This Item:
File: 612131003 DESHENG YANG.pdf (8.88 MB, Adobe PDF)

Items in CMUIR are protected by copyright, with all rights reserved, unless otherwise indicated.