Computational Techniques for the Statistical Analysis of Big Data in R

A talk presented at UP-Stat 2014 on techniques for optimizing R code for large data sets


  1. Computational Techniques for the Statistical Analysis of Big Data in R: A Case Study of the rlme Package. Herb Susmann, Yusuf Bilgic. April 12, 2014.
  2. Workflow: Identify, Rewrite, Benchmark, Test. Case Study: rlme: Identify, Wilcoxon Tau Estimator, Pairup, Covariance Estimator. Summary. Keeping Ahead.
  3. Motivation. Case study: the rlme package, rank-based regression and estimation of two- and three-level nested effects models. Goals: faster, less memory, more data. Before: 5,000 rows of data; after: 50,000 rows of data.
  4. Section 1: Workflow
  5. Workflow: Identify, Rewrite, Benchmark, Test.
  6. Identify: Know your big O!
  7. Identify: Know your big O! (O(n²) memory usage? Probably not so good for big data.)
  8. Identify: Know your big O! (O(n²) memory usage? Probably not so good for big data.) Look for error messages.
  9. Identify: Know your big O! (O(n²) memory usage? Probably not so good for big data.) Look for error messages. Profiling with Rprof.
  10. Rewrite: High level design. Algorithm design.
  11. Rewrite: High level design. Algorithm design. Statistical techniques: bootstrapping.
  12. Rewrite: Microbenchmarking. Know what R is good at.
  13. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization.
  14. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization. Preallocation.
  15. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization. Preallocation. Arguments are passed by value, not by reference.
  16. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization. Preallocation. Arguments are passed by value, not by reference. Embrace C++.
  17. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization. Preallocation. Arguments are passed by value, not by reference. Embrace C++. Be careful!
  18. Vectorizing:
      ## Bad
      vec = 1:100
      for (i in 1:length(vec)) {
        vec[i] = vec[i]^2
      }
      ## Better
      sapply(vec, function(x) x^2)
      ## Best
      vec^2
  19. Preallocation:
      ## Bad
      vec = c()
      for (i in 1:100) {
        vec = c(vec, i)
      }
      ## Better
      vec = numeric(100)
      for (i in 1:100) {
        vec[i] = i
      }
  20. Pass by value:
      square <- function(x) {
        x <- x^2
        return(x)
      }
      x <- 1:100
      square(x)
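      Since arguments are passed by value, the assignment inside square() only changes a local copy. A quick check, continuing the snippet above (purely illustrative):
      square(x)
      head(x)          # still 1 2 3 ...: the caller's x is unchanged
      x <- square(x)   # keep the squared values by reassigning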
  21. Benchmark: Write several versions of a slow function.
  22. Benchmark: Write several versions of a slow function. Test them against each other.
  23. Benchmark: Write several versions of a slow function. Test them against each other. Package: microbenchmark.
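      A minimal sketch of how microbenchmark can compare the loop, sapply, and vectorized versions from the Vectorizing slide (the vector length and times = 100 are arbitrary choices for illustration):
      library(microbenchmark)
      vec <- 1:10000
      microbenchmark(
        loop   = { out <- numeric(length(vec))
                   for (i in seq_along(vec)) out[i] <- vec[i]^2 },
        sapply = sapply(vec, function(x) x^2),
        vector = vec^2,
        times  = 100
      )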
  24. Test: Regressions.
  25. Test: Regressions. Unit Testing.
  26. Test: Regressions. Unit Testing. Package: testthat.
  27. Test: Regressions. Unit Testing. Package: testthat.
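      A small sketch of a testthat regression test guarding a rewrite; slow_square() and fast_square() are hypothetical stand-ins for an original function and its optimized replacement:
      library(testthat)
      slow_square <- function(x) sapply(x, function(v) v^2)   # original version
      fast_square <- function(x) x^2                          # rewritten version
      test_that("rewritten function matches the original", {
        x <- rnorm(1000)
        expect_equal(fast_square(x), slow_square(x))
      })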
  28. Section 2: Case Study: rlme
  29. Identify. Over to R!
      Rprof("profile")
      fit.rlme = rlme(...)
      Rprof(NULL)
      summaryRprof("profile")
  30. Wilcoxon Tau Estimator: Rank-based scale estimator of residuals. Uses pairup (so already O(n²)).
  31. Wilcoxon Tau Estimator. Original:
      dresd <- sort(abs(temp[, 1] - temp[, 2]))
      dresd = dresd[(p + 1):choose(n, 2)]
  32. Wilcoxon Tau Estimator. Original:
      dresd <- sort(abs(temp[, 1] - temp[, 2]))
      dresd = dresd[(p + 1):choose(n, 2)]
      What’s wrong?
  33. Wilcoxon Tau Estimator. Original:
      dresd <- sort(abs(temp[, 1] - temp[, 2]))
      dresd = dresd[(p + 1):choose(n, 2)]
      What’s wrong? Bad algorithm (the sort is at least O(n log n)) and the variable gets copied multiple times.
  34. Wilcoxon Tau Estimator. Original:
      dresd <- sort(abs(temp[, 1] - temp[, 2]))
      dresd = dresd[(p + 1):choose(n, 2)]
      What’s wrong? Bad algorithm (the sort is at least O(n log n)) and the variable gets copied multiple times.
      Updated with C++:
      dresd = remove.k.smallest(dresd)
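      The slides do not show the C++ behind remove.k.smallest, but one way such a helper might look with Rcpp, using std::nth_element to drop the k smallest values without repeated copying in R (a sketch under assumptions, not the rlme implementation):
      library(Rcpp)
      cppFunction('
        NumericVector remove_k_smallest(NumericVector x, int k) {
          std::vector<double> v(x.begin(), x.end());
          // move the k smallest values into v[0..k-1], O(n) on average
          std::nth_element(v.begin(), v.begin() + k, v.end());
          // keep the remainder sorted, matching the original sorted dresd
          std::sort(v.begin() + k, v.end());
          return NumericVector(v.begin() + k, v.end());
        }
      ', includes = "#include <algorithm>")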
  35. Wilcoxon Tau Estimator: Test with 2,000 residuals: better!
  36. Wilcoxon Tau: But what about really huge inputs?
  37. Wilcoxon Tau: But what about really huge inputs? Bootstrapping: when there are over 5,000 rows, repeat the estimate on 1,000 sampled points 100 times.
  38. Wilcoxon Tau: But what about really huge inputs? Bootstrapping: when there are over 5,000 rows, repeat the estimate on 1,000 sampled points 100 times. Not about speed, but about memory.
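      A rough sketch of that subsampling idea: estimate on repeated random subsamples and average, so the O(n²) pairwise step never sees more than 1,000 points at once. estimate_tau() here is only a placeholder pairwise scale statistic, not the package’s actual estimator; the cutoffs mirror the slide:
      # placeholder: any O(n^2) pairwise scale statistic stands in for the real tau estimator
      estimate_tau <- function(e) median(abs(outer(e, e, "-")))

      boot_tau <- function(resid, size = 1000, reps = 100) {
        if (length(resid) <= 5000) return(estimate_tau(resid))
        mean(replicate(reps, estimate_tau(sample(resid, size))))
      }

      boot_tau(rnorm(50000))   # never materializes a 50,000 x 50,000 matrix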
  39. Pairup: The pairup function generates every possible pair from an input vector. Some rank-based estimators require pairwise operations. O(n²) complexity.
  40. Pairup: Original version: vectorized (14 LOC).
  41. Pairup: Original version: vectorized (14 LOC). Loop version (12 LOC).
  42. Pairup: Original version: vectorized (14 LOC). Loop version (12 LOC). combn version (core R function, 1 LOC).
  43. Pairup: Original version: vectorized (14 LOC). Loop version (12 LOC). combn version (core R function, 1 LOC). C++ version (12 LOC).
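      For reference, a sketch of what the one-line combn version might look like; the exact call and output layout used by pairup are not shown on the slides, so the two-column format here is an assumption:
      x <- rnorm(10)
      pairs <- t(combn(x, 2))   # every unordered pair, one pair per row
      dim(pairs)                # choose(10, 2) = 45 rows, 2 columns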
  44. Over to R!
  45. Covariance Estimator: n × n covariance matrix; changed to use preallocation.
  46. Covariance Estimator
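      The covariance code itself is not shown, so this is only a generic sketch of the preallocation change: allocate the n × n matrix once and fill it in place instead of growing it row by row:
      n <- 1000

      ## Growing: rbind() copies the whole matrix on every iteration
      grow_build <- function(n) {
        m <- NULL
        for (i in 1:n) m <- rbind(m, rnorm(n))
        m
      }

      ## Preallocating: allocate n x n once, then fill rows in place
      prealloc_build <- function(n) {
        m <- matrix(0, n, n)
        for (i in 1:n) m[i, ] <- rnorm(n)
        m
      }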
  47. Summary: Identify, Rewrite, Benchmark, Test.
  48. Keeping Ahead: Parallelism.
  49. Keeping Ahead: Parallelism. Cluster: Rmpi, snow.
  50. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud.
  51. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark?
  52. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark? Julia Language.
  53. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark? Julia Language. Hadley Wickham (plyr, ggplot, testthat, ...).
  54. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark? Julia Language. Hadley Wickham (plyr, ggplot, testthat, ...). “Advanced R Programming”.
  55. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark? Julia Language. Hadley Wickham (plyr, ggplot, testthat, ...). “Advanced R Programming”.
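      As a pointer in that direction, a minimal sketch of snow-style cluster parallelism using base R’s parallel package (which absorbed the snow interface); the subsample estimator is just a placeholder:
      library(parallel)

      x  <- rnorm(50000)                    # stand-in for a large residual vector
      cl <- makeCluster(4)
      clusterExport(cl, "x")
      est <- parSapply(cl, 1:100, function(i) {
        median(abs(sample(x, 1000)))        # placeholder estimator on a subsample
      })
      stopCluster(cl)
      mean(est)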
  56. Questions?
