Abstract
Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity and the related problem of imbalanced data are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions draw on examples from existing research, so that this article also provides a good survey of the literature on rarity in data mining. Finally, this article demonstrates that rare classes and rare cases are very similar phenomena: both forms of rarity are shown to cause similar problems during data mining and to benefit from the same remediation methods.
Publication Info
- Year: 2004
- Type: article
- Volume: 6
- Issue: 1
- Pages: 7-19
- Citations: 1364
- Access: Closed
Identifiers
- DOI: 10.1145/1007730.1007734