The document discusses the Bag of Little Bootstraps (BLB) technique for efficiently estimating statistical properties like medians, variances, and confidence intervals through resampling. BLB addresses the computational limitations of the traditional bootstrap by drawing many small subsamples without replacement from the original dataset. This reduces storage and computation needs while maintaining theoretical guarantees like consistency. The key steps are subsampling the dataset into small "bags" multiple times, resampling each bag with replacement up to the full sample size, and aggregating statistics like medians across resamples. BLB scales efficiently to large datasets and is easily parallelized.
9. Problems with the asymptotic approach:
- The density f is hard to estimate
- The sample size needed for the Central Limit Theorem to kick in is much larger for the median than for the mean
- The true median is unknown
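For context, the first two bullets come from the standard large-sample result for the median (a textbook fact, not spelled out on the slide): the sample median is asymptotically normal,

```latex
\widehat{m} \;\approx\; \mathcal{N}\!\left(m,\; \frac{1}{4\,n\,f(m)^{2}}\right),
```

so the asymptotic variance depends on the unknown density f evaluated at the unknown true median m, which is exactly what makes the plug-in asymptotic approach awkward.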
21-23. Key: There are 3 distributions
- Population
- Actual Sample: its empirical distribution approximates the population
- Bootstrap Samples: their distribution approximates the actual sample
Bootstrapping therefore approximates the approximation:
- Is there bias?
- What's the variance?
- etc.
24. No free meals:
- Bootstrapping requires re-sampling the entire sample B times
- Each resample is of size n
- Sampling m < n values violates the sample-size properties
- The original sample size cannot be too small ("pre-asymptopia" cases)
25-27. Hope
- Each resample contains only ~0.632n unique values in expectation
- Sample less: an m-out-of-n bootstrap is possible with analytical adjustments (Bickel et al. 1997)
Intuition: each bootstrap resample needs fewer than all n values.
Problems:
- The analytical adjustment is not as automatic as desirable
- The m-out-of-n bootstrap is sensitive to the choice of m
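The 0.632 figure is just 1 - (1 - 1/n)^n → 1 - 1/e ≈ 0.632; a quick simulation confirms it (a minimal sketch, the function name is mine):

```python
import random

def unique_fraction(n, trials=20):
    """Average fraction of distinct values in a size-n resample
    drawn with replacement from n items."""
    total = 0.0
    for _ in range(trials):
        resample = (random.randrange(n) for _ in range(n))
        total += len(set(resample)) / n
    return total / trials

frac = unique_fraction(10_000)  # close to 1 - 1/e ≈ 0.632
```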
28-33. Bag of Little Bootstraps
- Sample without replacement from the full sample s times, into "bags" of size b
- Resample each bag with replacement up to the full size n, r times
- Compute the median of each resample (Med 1, ..., Med r)
- Compute a confidence interval for each bag from its r medians
- Take the average of the upper and lower endpoints across bags to form the final confidence interval
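The steps above can be sketched in Python (a minimal illustration, not the authors' implementation; the function name and defaults such as b = n**0.6 are my own choices):

```python
import random
import statistics

def blb_median_ci(data, s=10, b=None, r=50, alpha=0.05):
    """Bag of Little Bootstraps percentile CI for the median.

    s : number of bags sampled without replacement
    b : bag size (default n**0.6)
    r : number of size-n resamples per bag
    """
    n = len(data)
    b = b if b is not None else int(n ** 0.6)
    lo_idx = int((alpha / 2) * r)
    hi_idx = int((1 - alpha / 2) * r) - 1
    lowers, uppers = [], []
    for _ in range(s):
        bag = random.sample(data, b)  # without replacement, size b
        meds = sorted(
            statistics.median(random.choices(bag, k=n))  # with replacement, size n
            for _ in range(r)
        )
        lowers.append(meds[lo_idx])   # per-bag percentile endpoints
        uppers.append(meds[hi_idx])
    # average the endpoints across the s bags
    return sum(lowers) / s, sum(uppers) / s
```

For example, `blb_median_ci(list(range(1000)))` returns an interval that covers the true median 499.5 with high probability while each resample only touches ~63 distinct values.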
34-35. Bag of Little Bootstraps
Kleiner et al. (2012)
Computational gains:
- Each resample contains only b unique values!
- A resample can be drawn as a b-dimensional multinomial with n trials
- Scales in b instead of n
- Easily parallelizable
With b = n^0.6, for a dataset of size 1 TB:
- Bootstrap storage demands ~632 GB
- BLB storage demands ~4 GB
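The multinomial representation can be sketched as follows (the function name is mine; with NumPy, `numpy.random.multinomial(n, [1/b]*b)` draws the same counts in one call):

```python
import random
from collections import Counter

def multinomial_weights(b, n):
    """Represent a size-n with-replacement resample of a size-b bag
    as per-value repetition counts: only b integers are stored,
    never n data points."""
    counts = Counter(random.randrange(b) for _ in range(n))
    return [counts.get(i, 0) for i in range(b)]  # length b, sums to n

w = multinomial_weights(b=100, n=100_000)
```

Any estimator that accepts observation weights can then be run on the b unique values plus these counts, which is why BLB's per-resample cost scales in b rather than n.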
36. Bag of Little Bootstrap
Theoretical guarantees:
- Consistency
- Higher order correctness
- Fast convergence rate (same as bootstrap)
38-39. Performance
b = n^gamma, 0.5 <= gamma <= 1
These choices of gamma ensure bootstrap convergence rates.
[Figure: relative error of confidence-interval width for logistic regression coefficients, for Gamma and t-distributed residuals (Kleiner et al. 2012)]
41-42. Selecting Hyperparameters
• b, the number of unique samples in each little bootstrap
• s, the number of size-b samples drawn without replacement
• r, the number of multinomial resamples to draw
Guidance:
- b: the larger the better
- s, r: increase adaptively until convergence is reached (the median estimate stops changing)
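The adaptive rule for r can be sketched like this (a hedged illustration; the function name, tolerance, and batch size are my own):

```python
import random
import statistics

def grow_until_stable(bag, n, tol=0.01, step=20, r_max=500):
    """Add resamples in batches of `step` until the running
    median-of-medians changes by less than `tol` (relative),
    mirroring the "increase r until convergence" rule."""
    meds, prev = [], None
    while len(meds) < r_max:
        meds.extend(
            statistics.median(random.choices(bag, k=n))
            for _ in range(step)
        )
        cur = statistics.median(meds)
        if prev is not None and abs(cur - prev) <= tol * max(1.0, abs(cur)):
            break
        prev = cur
    return meds

bag = random.sample(range(1000), 60)
meds = grow_until_stable(bag, n=1000)
```

The same stopping rule applies to s: keep adding bags until the averaged confidence-interval endpoints stabilize.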
43. Bag of Little Bootstraps
Main benefits:
- Computationally friendly
- Maintains most statistical properties of the bootstrap
- Flexibility
- More robust to the choice of b than older methods
44. References
• Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap.
• Kleiner, A. et al. (2012). A Scalable Bootstrap for Massive Data.
Thanks!