Abstract
Bagging is one of the most effective computationally intensive procedures for improving unstable estimators or classifiers, and it is especially useful for high-dimensional data problems. Here we formalize the notion of instability and derive theoretical results to analyze the variance reduction effect of bagging (or variants thereof), mainly in hard decision problems, which include estimation after testing in regression and decision trees for regression functions and classifiers. Hard decisions create instability, and bagging is shown to smooth such hard decisions, yielding smaller variance and mean squared error. With these theoretical explanations, we motivate subagging, an alternative aggregation scheme based on subsampling. It is computationally cheaper but still shows approximately the same accuracy as bagging. Moreover, our theory reveals improvements of first order, in line with simulation studies.

In particular, we obtain an asymptotic limiting distribution at the cube-root rate for the split point when fitting piecewise constant functions. Denoting the sample size by n, it follows that in a cylindric neighborhood of diameter $n^{-1/3}$ around the theoretically optimal split point, the variance and mean squared error reduction of subagging can be characterized analytically. Because of this slow rate, our reasoning also provides an explanation on the global scale for the whole covariate space in a decision tree with finitely many splits.
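The abstract's central idea can be illustrated in a few lines of code. The sketch below (not from the paper; all function names and parameter choices are illustrative assumptions) fits a one-split decision stump, a prototypical "hard decision" estimator, and then smooths its split point by averaging over bootstrap replicates (bagging) or over half-size subsamples without replacement (subagging):

```python
# Illustrative sketch: bagging vs. subagging for an unstable
# "hard decision" estimator -- a one-split regression stump.
import numpy as np

rng = np.random.default_rng(0)

def stump_split(x, y):
    """Least-squares optimal split point for a piecewise-constant fit."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)
    best_sse, best_split = np.inf, xs[0]
    for i in range(1, n):
        left, right = ys[:i], ys[i:]
        # var() * count = sum of squared errors around each segment mean
        sse = left.var() * i + right.var() * (n - i)
        if sse < best_sse:
            best_sse, best_split = sse, 0.5 * (xs[i - 1] + xs[i])
    return best_split

def aggregated_split(x, y, B=50, m=None):
    """Average the split over B replicates: bootstrap resamples of size n
    (bagging) when m is None, or size-m subsamples without replacement
    (subagging). Averaging smooths the hard split decision."""
    n = len(x)
    replace = m is None          # with replacement only for bagging
    m = n if m is None else m
    splits = []
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=replace)
        splits.append(stump_split(x[idx], y[idx]))
    return np.mean(splits)

# Step function with its jump at 0: the split location is the hard decision.
n = 200
x = rng.uniform(-1, 1, n)
y = (x > 0).astype(float) + 0.5 * rng.normal(size=n)

single = stump_split(x, y)                       # one unstable estimate
bagged = aggregated_split(x, y, B=50)            # bagging: bootstrap, size n
subagged = aggregated_split(x, y, B=50, m=n // 2)  # subagging: size n/2
```

Repeating this over many simulated datasets and comparing the spread of `single` against `bagged` and `subagged` reproduces the qualitative effect the abstract describes: the averaged split points fluctuate less around the true split, and half-sample subagging tracks bagging's accuracy at roughly half the per-replicate cost.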
Publication Info
- Year: 2002
- Type: article
- Volume: 30
- Issue: 4
- Citations: 564
- Access: Closed
Identifiers
- DOI: 10.1214/aos/1031689014