MAHAKIL: Diversity Based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction

Kwabena Ebo Bennin; Jacky Keung; Passakorn Phannachitta; Akito Monden; Solomon Mensah

Please use this identifier to cite or link to this item: http://cmuir.cmu.ac.th/jspui/handle/6653943832/58498

Full metadata record

DC Field	Value	Language
dc.contributor.author	Kwabena Ebo Bennin	en_US
dc.contributor.author	Jacky Keung	en_US
dc.contributor.author	Passakorn Phannachitta	en_US
dc.contributor.author	Akito Monden	en_US
dc.contributor.author	Solomon Mensah	en_US
dc.date.accessioned	2018-09-05T04:25:36Z	-
dc.date.available	2018-09-05T04:25:36Z	-
dc.date.issued	2018-06-01	en_US
dc.identifier.issn	00985589	en_US
dc.identifier.other	2-s2.0-85028936214	en_US
dc.identifier.other	10.1109/TSE.2017.2731766	en_US
dc.identifier.uri	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85028936214&origin=inward	en_US
dc.identifier.uri	http://cmuir.cmu.ac.th/jspui/handle/6653943832/58498	-
dc.description.abstract	© 1976-2012 IEEE. Highly imbalanced data typically make accurate predictions difficult. Unfortunately, software defect datasets tend to have fewer defective modules than non-defective modules. Synthetic oversampling approaches address this concern by creating new minority defective modules to balance the class distribution before a model is trained. Notwithstanding the successes achieved by these approaches, they mostly result in over-generalization (high rates of false alarms) and generate near-duplicated data instances (less diverse data). In this study, we introduce MAHAKIL, a novel and efficient synthetic oversampling approach for software defect datasets that is based on the chromosomal theory of inheritance. Exploiting this theory, MAHAKIL interprets two distinct sub-classes as parents and generates a new instance that inherits different traits from each parent and contributes to the diversity within the data distribution. We extensively compare MAHAKIL with SMOTE, Borderline-SMOTE, ADASYN, Random Oversampling and the No sampling approach using 20 releases of defect datasets from the PROMISE repository and five prediction models. Our experiments indicate that MAHAKIL improves the prediction performance for all the models and achieves better and more significant pf (probability of false alarm) values than the other oversampling approaches, based on Brunner's statistical significance test and Cliff's effect sizes. Therefore, MAHAKIL is strongly recommended as an efficient alternative for defect prediction models built on highly imbalanced datasets.	en_US
dc.subject	Computer Science	en_US
dc.title	MAHAKIL: Diversity Based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction	en_US
dc.type	Journal	en_US
article.title.sourcetitle	IEEE Transactions on Software Engineering	en_US
article.volume	44	en_US
article.stream.affiliations	City University of Hong Kong	en_US
article.stream.affiliations	Chiang Mai University	en_US
article.stream.affiliations	Okayama University	en_US
Appears in Collections:	CMUL: Journal Articles

