A Resampling Method for Imbalanced Datasets Considering Noise and Overlap

Taisho Sasada, Zhaoyu Liu, Tokiya Baba, Kenji Hatano, Yusuke Kimura
Journal / Proceedings: Procedia Computer Science
Country (English): Online
Language: Japanese
Vol.: 176
Pages: 420--429
Publication year: 2020
Publication month: 9
Publication date: 2020-09-17
DOI: 10.1016/j.procs.2020.08.043

Abstract

When the classes in a dataset contain very different numbers of instances, the predictions of a model trained on that dataset are affected. Resampling, which adjusts the number of majority and minority instances, is the usual remedy for such imbalance in training data. Although resampling can eliminate the imbalance, it may increase data complexity, which degrades classification accuracy. Noise and overlap are well-known factors of data complexity. Noise refers to instances mixed into the training data whose features could place them in another class, and overlap is the state in which classes cannot be linearly separated because they partially overlap each other. Conventional methods cannot consider both factors at the same time, so their classification accuracy is unsatisfactory. In principle, handling both noise and overlap only requires integrating the methods that handle each of them. Such methods are already established; however, a simple integration may remove instances that do not need to be removed, or may leave instances that should be removed. We therefore need to quantify these factors to account for data complexity, and to consider more effective ways of integrating the methods. In this paper, we propose a method that integrates two well-known resampling methods, SMOTE-ENN and SMOTE-Tomek. Our experimental results show that the proposed method is more effective than the latest conventional methods on four out of ten datasets.
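
To make the two cleaning rules named in the abstract concrete, the following is a minimal sketch of the neighbor tests that underlie them: Tomek links (opposite-class mutual nearest neighbors, commonly associated with class overlap) and Edited Nearest Neighbours, or ENN (instances whose label disagrees with most of their neighbors, commonly associated with noise). It is not the integrated method proposed in the paper, and it omits the SMOTE oversampling step. The function names, the choice of k = 3, and the toy points are all assumptions made for illustration.

```python
from collections import Counter
from math import dist


def nearest_neighbors(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return others[:k]


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors of opposite class."""
    nn = [nearest_neighbors(points, i, 1)[0] for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def enn_flags(points, labels, k=3):
    """Indices whose label differs from the majority label of their k neighbors."""
    flagged = []
    for i in range(len(points)):
        votes = Counter(labels[j] for j in nearest_neighbors(points, i, k))
        if votes.most_common(1)[0][0] != labels[i]:
            flagged.append(i)
    return flagged


# Hypothetical toy data: class 0 on the left, class 1 on the right.
# Point 3 is a class-1 instance sitting inside the class-0 cluster,
# and points 4 and 5 face each other across the class boundary.
points = [(0.0, 0.0), (1.0, 0.2), (0.2, 1.2), (0.6, 0.5), (3.0, 0.0),
          (3.4, 0.1), (4.5, 0.3), (4.9, 1.0), (5.2, 0.0)]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]

links = tomek_links(points, labels)
flags = enn_flags(points, labels)
print("Tomek links:", links)
print("ENN flags:", flags)
```

On this toy data the Tomek test pairs points 1 and 3 and points 4 and 5, while the ENN test flags points 3 and 4. The two rules select different instances for removal, which is the situation the abstract describes when it warns that simply chaining the methods may remove or keep the wrong instances.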

Citation

Taisho Sasada, Zhaoyu Liu, Tokiya Baba, Kenji Hatano, Yusuke Kimura, A Resampling Method for Imbalanced Datasets Considering Noise and Overlap, Procedia Computer Science, Vol.176, pp.420--429, 2020-09-17, DOI: 10.1016/j.procs.2020.08.043.
