Computational Techniques for the Statistical Analysis of Big Data in R

A talk presented at UP-Stat 2014 on techniques for optimizing R code for large data sets


  1. Computational Techniques for the Statistical Analysis of Big Data in R: A Case Study of the rlme Package. Herb Susmann, Yusuf Bilgic. April 12, 2014.
  2. Workflow: Identify, Rewrite, Benchmark, Test. Case Study: rlme: Identify, Wilcoxon Tau Estimator, Pairup, Covariance Estimator. Summary. Keeping Ahead.
  3. Motivation. Case study: the rlme package, rank-based regression and estimation of two- and three-level nested effects models. Goals: faster, less memory, more data. Before: 5,000 rows of data; after: 50,000 rows of data.
  4. Section 1: Workflow
  5. Workflow: Identify, Rewrite, Benchmark, Test.
  6. Identify: Know your big O!
  7. Identify: Know your big O! (O(n²) memory usage? Probably not so good for big data.)
  8. Identify: Know your big O! (O(n²) memory usage? Probably not so good for big data.) Look for error messages.
  9. Identify: Know your big O! (O(n²) memory usage? Probably not so good for big data.) Look for error messages. Profiling with Rprof.
  10. Rewrite: High level design. Algorithm design.
  11. Rewrite: High level design. Algorithm design. Statistical techniques: bootstrapping.
  12. Rewrite: Microbenchmarking. Know what R is good at.
  13. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization.
  14. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization. Preallocation.
  15. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization. Preallocation. Arguments are passed by value, not by reference.
  16. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization. Preallocation. Arguments are passed by value, not by reference. Embrace C++.
  17. Rewrite: Microbenchmarking. Know what R is good at. Avoid loops in favor of vectorization. Preallocation. Arguments are passed by value, not by reference. Embrace C++. Be careful!
  18. Vectorizing:
      ## Bad
      vec = 1:100
      for (i in 1:length(vec)) {
        vec[i] = vec[i]^2
      }
      ## Better
      sapply(vec, function(x) x^2)
      ## Best
      vec^2
  19. Preallocation:
      ## Bad
      vec = c()
      for (i in 1:100) {
        vec = c(vec, i)
      }
      ## Better
      vec = numeric(100)
      for (i in 1:100) {
        vec[i] = i
      }
  20. Pass by value:
      square <- function(x) {
        x <- x^2
        return(x)
      }
      x <- 1:100
      square(x)
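      Since arguments are passed by value, the assignment inside square() only changes a local copy. A quick check, continuing the snippet above (purely illustrative):
      square(x)
      head(x)          # still 1 2 3 ...: the caller's x is unchanged
      x <- square(x)   # keep the squared values by reassigning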
  21. Benchmark: Write several versions of a slow function.
  22. Benchmark: Write several versions of a slow function. Test them against each other.
  23. Benchmark: Write several versions of a slow function. Test them against each other. Package: microbenchmark.
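      A minimal sketch of how microbenchmark can compare the loop, sapply, and vectorized versions from the Vectorizing slide (the vector length and times = 100 are arbitrary choices for illustration):
      library(microbenchmark)
      vec <- 1:10000
      microbenchmark(
        loop   = { out <- numeric(length(vec))
                   for (i in seq_along(vec)) out[i] <- vec[i]^2 },
        sapply = sapply(vec, function(x) x^2),
        vector = vec^2,
        times  = 100
      )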
  24. Test: Regressions.
  25. Test: Regressions. Unit Testing.
  26. Test: Regressions. Unit Testing. Package: testthat.
  27. Test: Regressions. Unit Testing. Package: testthat.
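      A small sketch of a testthat regression test guarding a rewrite; slow_square() and fast_square() are hypothetical stand-ins for an original function and its optimized replacement:
      library(testthat)
      slow_square <- function(x) sapply(x, function(v) v^2)   # original version
      fast_square <- function(x) x^2                          # rewritten version
      test_that("rewritten function matches the original", {
        x <- rnorm(1000)
        expect_equal(fast_square(x), slow_square(x))
      })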
  28. Section 2: Case Study: rlme
  29. Identify. Over to R!
      Rprof("profile")
      fit.rlme = rlme(...)
      Rprof(NULL)
      summaryRprof("profile")
  30. Wilcoxon Tau Estimator: Rank-based scale estimator of residuals. Uses pairup (so already O(n²)).
  31. Wilcoxon Tau Estimator. Original:
      dresd <- sort(abs(temp[, 1] - temp[, 2]))
      dresd = dresd[(p + 1):choose(n, 2)]
  32. Wilcoxon Tau Estimator. Original:
      dresd <- sort(abs(temp[, 1] - temp[, 2]))
      dresd = dresd[(p + 1):choose(n, 2)]
      What’s wrong?
  33. Wilcoxon Tau Estimator. Original:
      dresd <- sort(abs(temp[, 1] - temp[, 2]))
      dresd = dresd[(p + 1):choose(n, 2)]
      What’s wrong? Bad algorithm (the sort is at least O(n log n)) and the variable gets copied multiple times.
  34. Wilcoxon Tau Estimator. Original:
      dresd <- sort(abs(temp[, 1] - temp[, 2]))
      dresd = dresd[(p + 1):choose(n, 2)]
      What’s wrong? Bad algorithm (the sort is at least O(n log n)) and the variable gets copied multiple times.
      Updated with C++:
      dresd = remove.k.smallest(dresd)
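      The slides do not show the C++ behind remove.k.smallest, but one way such a helper might look with Rcpp, using std::nth_element to drop the k smallest values without repeated copying in R (a sketch under assumptions, not the rlme implementation):
      library(Rcpp)
      cppFunction('
        NumericVector remove_k_smallest(NumericVector x, int k) {
          std::vector<double> v(x.begin(), x.end());
          // move the k smallest values into v[0..k-1], O(n) on average
          std::nth_element(v.begin(), v.begin() + k, v.end());
          // keep the remainder sorted, matching the original sorted dresd
          std::sort(v.begin() + k, v.end());
          return NumericVector(v.begin() + k, v.end());
        }
      ', includes = "#include <algorithm>")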
  35. Wilcoxon Tau Estimator: Test with 2,000 residuals: better!
  36. Wilcoxon Tau: But what about really huge inputs?
  37. Wilcoxon Tau: But what about really huge inputs? Bootstrapping: when there are over 5,000 rows, repeat the estimate on 1,000 sampled points 100 times.
  38. Wilcoxon Tau: But what about really huge inputs? Bootstrapping: when there are over 5,000 rows, repeat the estimate on 1,000 sampled points 100 times. Not about speed, but about memory.
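      A rough sketch of that subsampling idea: estimate on repeated random subsamples and average, so the O(n²) pairwise step never sees more than 1,000 points at once. estimate_tau() here is only a placeholder pairwise scale statistic, not the package’s actual estimator; the cutoffs mirror the slide:
      # placeholder: any O(n^2) pairwise scale statistic stands in for the real tau estimator
      estimate_tau <- function(e) median(abs(outer(e, e, "-")))

      boot_tau <- function(resid, size = 1000, reps = 100) {
        if (length(resid) <= 5000) return(estimate_tau(resid))
        mean(replicate(reps, estimate_tau(sample(resid, size))))
      }

      boot_tau(rnorm(50000))   # never materializes a 50,000 x 50,000 matrix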
  39. Pairup: The pairup function generates every possible pair from an input vector. Some rank-based estimators require pairwise operations. O(n²) complexity.
  40. Pairup: Original version: vectorized (14 LOC).
  41. Pairup: Original version: vectorized (14 LOC). Loop version (12 LOC).
  42. Pairup: Original version: vectorized (14 LOC). Loop version (12 LOC). combn version (core R function, 1 LOC).
  43. Pairup: Original version: vectorized (14 LOC). Loop version (12 LOC). combn version (core R function, 1 LOC). C++ version (12 LOC).
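      For reference, a sketch of what the one-line combn version might look like; the exact call and output layout used by pairup are not shown on the slides, so the two-column format here is an assumption:
      x <- rnorm(10)
      pairs <- t(combn(x, 2))   # every unordered pair, one pair per row
      dim(pairs)                # choose(10, 2) = 45 rows, 2 columns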
  44. Over to R!
  45. Covariance Estimator: n × n covariance matrix; changed to use preallocation.
  46. Covariance Estimator
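      The covariance code itself is not shown, so this is only a generic sketch of the preallocation change: allocate the n × n matrix once and fill it in place instead of growing it row by row:
      n <- 1000

      ## Growing: rbind() copies the whole matrix on every iteration
      grow_build <- function(n) {
        m <- NULL
        for (i in 1:n) m <- rbind(m, rnorm(n))
        m
      }

      ## Preallocating: allocate n x n once, then fill rows in place
      prealloc_build <- function(n) {
        m <- matrix(0, n, n)
        for (i in 1:n) m[i, ] <- rnorm(n)
        m
      }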
  47. Summary: Identify, Rewrite, Benchmark, Test.
  48. Keeping Ahead: Parallelism.
  49. Keeping Ahead: Parallelism. Cluster: Rmpi, snow.
  50. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud.
  51. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark?
  52. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark? Julia Language.
  53. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark? Julia Language. Hadley Wickham (plyr, ggplot, testthat, ...).
  54. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark? Julia Language. Hadley Wickham (plyr, ggplot, testthat, ...). “Advanced R Programming”.
  55. Keeping Ahead: Parallelism. Cluster: Rmpi, snow. GPU: rpud. Probably not Hadoop, maybe Apache Spark? Julia Language. Hadley Wickham (plyr, ggplot, testthat, ...). “Advanced R Programming”.
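      As a pointer in that direction, a minimal sketch of snow-style cluster parallelism using base R’s parallel package (which absorbed the snow interface); the subsample estimator is just a placeholder:
      library(parallel)

      x  <- rnorm(50000)                    # stand-in for a large residual vector
      cl <- makeCluster(4)
      clusterExport(cl, "x")
      est <- parSapply(cl, 1:100, function(i) {
        median(abs(sample(x, 1000)))        # placeholder estimator on a subsample
      })
      stopCluster(cl)
      mean(est)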
  56. Questions?
