Abstract

The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the de facto standard in the framework of learning from imbalanced data. This is due to the simplicity of its design as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications across several domains. It has inspired numerous approaches to counter the issue of class imbalance and has contributed significantly to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data, and it is featured in a number of software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems.
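To make the core idea concrete, the following is a minimal sketch of SMOTE-style interpolation in plain NumPy (an illustration under simplifying assumptions, not the authors' reference implementation): each synthetic point is formed by pairing a minority-class sample with one of its k nearest minority-class neighbors and drawing a new sample uniformly along the line segment between them. The function name smote_sketch and its parameters are hypothetical.

import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=None):
    """Generate n_synthetic samples by SMOTE-style interpolation.

    X_min : (n, d) array of minority-class samples (requires n > k).
    Note: this is an illustrative sketch, not a reference implementation.
    """
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    # Squared pairwise distances within the minority class only.
    dist2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    # Indices of each sample's k nearest neighbors (column 0 of the
    # argsort is the sample itself, so it is skipped).
    neighbors = np.argsort(dist2, axis=1)[:, 1:k + 1]
    synthetic = np.empty((n_synthetic, d))
    for s in range(n_synthetic):
        i = rng.integers(n)                  # pick a minority sample
        j = neighbors[i, rng.integers(k)]    # pick one of its neighbors
        gap = rng.random()                   # interpolation factor in [0, 1)
        # New point on the segment between sample i and neighbor j.
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Example: 20 minority samples in 2-D, oversampled with 30 synthetic points.
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_synthetic=30, k=5, seed=1)

In practice, mature implementations such as the SMOTE class in the open-source imbalanced-learn Python package are typically used instead of hand-rolled code.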

Keywords

Computer science, Machine learning, Artificial intelligence, Preprocessor, Robustness, Benchmark, Oversampling, Supervised learning, Variety, Class, Data mining, Artificial neural network

Publication Info

Year: 2018
Type: Article
Volume: 61
Pages: 863-905
Citations: 1895 (OpenAlex)
Access: Closed

Cite This

Alberto Fernández, Salvador García, Francisco Herrera et al. (2018). SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. Journal of Artificial Intelligence Research, 61, 863-905. https://doi.org/10.1613/jair.1.11192

Identifiers

DOI: 10.1613/jair.1.11192