Abstract

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets, which are increasingly prevalent, the calculation of bootstrap-based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the m out of n bootstrap can be used in principle to reduce the cost of bootstrap computations, these methods are generally not robust to the specification of tuning parameters (such as the number of subsampled data points), and, unlike the bootstrap, they often require knowledge of the estimator's convergence rate. As an alternative, we introduce the ‘bag of little bootstraps’ (BLB), a new procedure that incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators. The BLB is well suited to modern parallel and distributed computing architectures and retains the generic applicability and statistical efficiency of the bootstrap. We demonstrate the BLB's favourable statistical performance via a theoretical analysis elucidating the procedure's properties, as well as a simulation study comparing the BLB with the bootstrap, the m out of n bootstrap and subsampling. In addition, we present results from a large-scale distributed implementation of the BLB demonstrating its computational superiority on massive data, a method for adaptively selecting the BLB's tuning parameters, an empirical study applying the BLB to several real data sets and an extension of the BLB to time series data.
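The abstract describes the BLB only at a high level, so the following is a minimal sketch of the commonly described form of the procedure rather than the authors' implementation. It assumes small subsets of size b = n^gamma drawn without replacement, multinomial resampling weights that sum to the full sample size n, and the standard error as the quality measure. The function names (`blb_standard_error`, `weighted_mean`) and the default parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def blb_standard_error(data, estimator, gamma=0.7, n_subsets=10,
                       n_resamples=50, seed=0):
    """Sketch of a bag-of-little-bootstraps standard-error estimate.

    `estimator(values, counts)` must accept a subset of the data and
    integer multiplicities for each point (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    b = max(2, int(n ** gamma))  # assumed subset size b = n^gamma

    subset_errors = []
    for _ in range(n_subsets):
        # Draw a small subset of b distinct points without replacement.
        subset = data[rng.choice(n, size=b, replace=False)]

        estimates = []
        for _ in range(n_resamples):
            # Resample n points from the b-point subset, represented as
            # multinomial counts so only b values are ever stored.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(estimator(subset, counts))

        # Quality measure for this subset: spread of the resampled estimates.
        subset_errors.append(np.std(estimates, ddof=1))

    # Average the per-subset quality measures.
    return float(np.mean(subset_errors))


def weighted_mean(values, counts):
    """Sample mean of a resample expressed as (values, counts)."""
    return float(np.average(values, weights=counts))


if __name__ == "__main__":
    sample = np.random.default_rng(1).normal(size=20_000)
    print(blb_standard_error(sample, weighted_mean))
```

Because each resample is stored as b counts rather than n data points, the subsets can be processed independently, which is what makes the approach a natural fit for parallel and distributed settings.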

Keywords

Estimator, Scalability, Computer science, Computation, Convergence (economics), Algorithm, Data mining, Mathematics, Statistics

Publication Info

Year: 2014
Type: article
Volume: 76
Issue: 4
Pages: 795-816
Citations: 385 (OpenAlex)
Access: Closed

Cite This

Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar et al. (2014). A Scalable Bootstrap for Massive Data. Journal of the Royal Statistical Society Series B (Statistical Methodology), 76(4), 795-816. https://doi.org/10.1111/rssb.12050

Identifiers

DOI: 10.1111/rssb.12050