What We Learned Building an R-Python Hybrid Analytics Pipeline
Niels Bantilan, Pegged Software
NY R Conference April 8th 2016
Pegged Software’s Mission:
Help healthcare organizations recruit better
Core Activities
● Build, evaluate, refine, and deploy predictive models
● Work with Engineering to ingest, validate, and store data
● Work with Product Management to develop data-driven feature sets
How might we build a predictive analytics pipeline that is
reproducible, maintainable, and statistically rigorous?
Anchor Yourself to Problem Statements / Use Cases
Problem-solving Heuristic
1. Define the problem statement
2. Scope out the solution space and trade-offs
3. Make a decision, justify it, document it
4. Implement the chosen solution
5. Evaluate the working solution against the problem statement
6. Rinse and repeat
R-Python Pipeline
Read Data → Preprocess → Build Model → Evaluate → Deploy
Data Science Stack
Need                        R          Python
Maintainable codebase       Git        Git
Sync package dependencies   Packrat    Pip, Pyenv
Call R from Python          -          subprocess
Automated testing           Testthat   Nose
Reproducible pipeline       Makefile   Makefile
Git
Why? Because Version Control
● Code quality
● Incremental Knowledge Transfer
● Sanity check
Dependency Management
Why Pip + Pyenv?
1. Easily sync Python package dependencies
2. Easily manage multiple Python versions
3. Create and manage virtual environments
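One way to check that an environment is in sync is to compare pinned versions against what is actually installed. The sketch below assumes dependencies are pinned as `name==version` lines (for example the output of `pip freeze`); the deck does not show its requirements file, so the function names and the file format here are illustrative assumptions.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form 'name==version'."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Compare pins against the active environment.

    Returns {package: (pinned, installed)}; installed is None when the
    package is missing from the environment.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

An empty result from `find_mismatches` suggests the active virtual environment matches the pins; anything else lists what to install or change.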
Why Packrat? From RStudio
1. Isolated: separate system environment and repo environment
2. Portable: easily sync dependencies across data science team
3. Reproducible: easily add/remove/upgrade/downgrade as needed.
Understanding packrat
Packrat Internals
datascience_repo
├─ project_folder_a
├─ project_folder_b
├─ datascience_repo.Rproj
...
├─ .Rprofile           # points R to packrat
└─ packrat
   ├─ init.R           # initialize script
   ├─ packrat.lock     # package deps
   ├─ packrat.opts     # options config
   ├─ lib              # repo private library
   └─ src              # repo source files
packrat.lock: package version and deps
PackratFormat: 1.4
PackratVersion: 0.4.6.1
RVersion: 3.2.3
Repos: CRAN=https://cran.rstudio.com/
...
Package: ggplot2
Source: CRAN
Version: 2.0.0
Hash: 5befb1e7a9c7d0692d6c35fa02a29dbf
Requires: MASS, digest, gtable, plyr, reshape2, scales
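Because the lockfile is plain `Field: value` text, it can be inspected from the Python side of the pipeline. The following is a minimal sketch, assuming records are separated by blank lines and wrapped values continue on indented lines; that layout is inferred from the excerpt above, and `parse_lockfile` is a hypothetical helper, not part of packrat.

```python
def parse_lockfile(text):
    """Split packrat.lock-style text into a list of {field: value} records."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if not line.strip():                    # blank line ends a record
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[0].isspace() and last_key:    # indented continuation line
            current[last_key] += " " + line.strip()
        else:
            key, _, value = line.partition(":")  # split at the first colon only
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records


# Sample text copied from the lockfile excerpt (layout is an assumption).
sample = """PackratFormat: 1.4
PackratVersion: 0.4.6.1
RVersion: 3.2.3
Repos: CRAN=https://cran.rstudio.com/

Package: ggplot2
Source: CRAN
Version: 2.0.0
Hash: 5befb1e7a9c7d0692d6c35fa02a29dbf
Requires: MASS, digest, gtable, plyr,
    reshape2, scales
"""

records = parse_lockfile(sample)
# Map each package record to its pinned version; the header record is skipped.
packages = {r["Package"]: r["Version"] for r in records if "Package" in r}
```

A mapping like `packages` could then be diffed between branches to see which R dependencies changed.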
packrat.opts: project-specific configuration
auto.snapshot: TRUE
use.cache: FALSE
print.banner.on.startup: auto
vcs.ignore.lib: TRUE
vcs.ignore.src: TRUE
load.external.packages.on.startup: TRUE
quiet.package.installation: TRUE
snapshot.recommended.packages: FALSE
Packrat Workflow
● Initialize packrat with packrat::init()
● Toggle packrat in an R session with packrat::on() / packrat::off()
● Save the current state of the project with packrat::snapshot()
● Reconstitute your project with packrat::restore()
● Remove unused libraries with packrat::clean()
Packrat Issues
Problem: Unable to find source packages when restoring
Happens when there is a new version of a package on an R package repository like CRAN
> packrat::restore()
Installing knitr (1.11) ...
FAILED
Error in getSourceForPkgRecord(pkgRecord, srcDir(project), availablePkgs, :
  Couldn't find source for version 1.11 of knitr (1.10.5 is current)
Solution 1: Use R’s Installation Procedure
> install.packages(<package_name>)
> packrat::snapshot()
Solution 2: Manually Download Source File
$ wget -P repo/packrat/src <package_source_url>
> packrat::restore()
Call R from Python: Data Pipeline
Read Data → Preprocess → Build Model → Evaluate → Deploy
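Call R from Python: Example
# model_builder.R
# Read the positional arguments passed on the command line.
cmdargs <- commandArgs(trailingOnly = TRUE)
data_filepath <- cmdargs[1]
model_type <- cmdargs[2]
formula <- cmdargs[3]

# read.data() and train.model() are assumed to be defined elsewhere.
build.model <- function(data_filepath, model_type, formula) {
  df <- read.data(data_filepath)
  model <- train.model(df, model_type, formula)
  model
}

# Run the build when the script is invoked.
model <- build.model(data_filepath, model_type, formula)

# model_pipeline.py
import subprocess

# The executable must accept a script path plus arguments (e.g. Rscript).
# data_filepath, model_type, and formula are set earlier in the pipeline.
subprocess.call(['path/to/R/executable',
                 'path/to/model_builder.R',
                 data_filepath, model_type, formula])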
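A slightly more defensive variant of the call above is sketched here. It uses `subprocess.run` so that a failing R script raises an exception instead of being silently ignored. The `run_script` wrapper and the commented `Rscript` invocation are illustrative assumptions, not code from the talk.

```python
import subprocess
import sys


def run_script(executable, script_args, timeout=None):
    """Run an external script; return its stdout, raising on non-zero exit."""
    result = subprocess.run(
        [executable, *script_args],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on non-zero exit
        timeout=timeout,
    )
    return result.stdout


# Hypothetical call into the R layer (paths and arguments are placeholders):
# run_script("Rscript",
#            ["path/to/model_builder.R", "data.csv", "glm", "y ~ x1 + x2"])

# Stand-in demo that uses the Python interpreter, so it runs without R.
demo_output = run_script(sys.executable, ["-c", "print('ok')"])
```

Passing the command as a list (rather than a shell string) keeps a formula such as `y ~ x1 + x2` intact as a single argument.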
Call R from Python
Why subprocess?
1. Python for control flow, data manipulation, IO handling
2. R for model build and evaluation computations
3. main.R script (model_builder.R) as the entry point into R layer
4. No need for tight Python-R integration
Automated Testing
Tolerance to Change
Are we confident that a modification to the codebase will not silently introduce new bugs?
Working Effectively with Legacy Code - Michael Feathers
1. Identify change points
2. Break dependencies
3. Write tests
4. Make changes
5. Refactor
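The "write tests, then make changes" steps can be illustrated with a small Python test case. This sketch uses the standard-library `unittest` module rather than Nose (Nose can collect `unittest.TestCase` subclasses). The function `validate_model_args` and the set of supported model types are hypothetical, invented here only to have something to test.

```python
import unittest

SUPPORTED_MODEL_TYPES = {"glm"}  # hypothetical; the deck only shows 'glm'


def validate_model_args(model_type, formula):
    """Raise ValueError for arguments the R layer could not use."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError("unsupported model type: %r" % model_type)
    if "~" not in formula:
        raise ValueError("formula must contain '~': %r" % formula)
    return model_type, formula.strip()


class ValidateModelArgsTest(unittest.TestCase):
    def test_accepts_known_model_and_formula(self):
        self.assertEqual(
            validate_model_args("glm", " y ~ x1 + x2 "),
            ("glm", "y ~ x1 + x2"),
        )

    def test_rejects_unknown_model_type(self):
        with self.assertRaises(ValueError):
            validate_model_args("mystery", "y ~ x1")

    def test_rejects_formula_without_tilde(self):
        with self.assertRaises(ValueError):
            validate_model_args("glm", "y x1")
```

With tests like these in place, a later refactor of the validation logic either keeps them green or fails loudly, which is the "tolerance to change" the slide asks about.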
Build Management: Make
Make is a language-agnostic utility for *nix
● Enables reproducible workflow
● Serves as lightweight documentation for repo

# makefile
build-model:
	python model_pipeline.py \
		-i 'model_input' \
		-m_type 'glm' \
		-formula 'y ~ x1 + x2'

# command-line
$ make build-model

VS

$ python model_pipeline.py \
    -i input_fp \
    -m_type 'glm' \
    -formula 'y ~ x1 + x2'
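The deck does not show how `model_pipeline.py` reads its flags. A minimal sketch with the standard-library `argparse` module is given below, assuming the three flags from the Makefile above; the `dest` names, defaults, and help strings are illustrative assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in the make target."""
    parser = argparse.ArgumentParser(
        description="Build a model via the R layer.")
    parser.add_argument("-i", dest="input_fp", required=True,
                        help="path to the model input data")
    parser.add_argument("-m_type", dest="model_type", default="glm",
                        help="model type passed through to R")
    parser.add_argument("-formula", dest="formula", required=True,
                        help="R model formula, e.g. 'y ~ x1 + x2'")
    return parser


# Parse the same arguments that the make target passes on the command line.
args = build_parser().parse_args(
    ["-i", "model_input", "-m_type", "glm", "-formula", "y ~ x1 + x2"]
)
```

In a real script, `parse_args()` would be called with no argument so that it reads `sys.argv`, and the parsed values would be forwarded to the R script.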
Big Wins
By adopting the above practices, we:
1. Maintain the codebase more easily
2. Reduce cognitive load and context switching
3. Improve code quality and correctness
4. Facilitate knowledge transfer among team members
5. Encourage reproducible workflows
Costs
Necessary Time Investment
1. Climbing the learning curve
2. Breaking old habits
3. Creating fixes for issues that come with the chosen solutions
How might we build a predictive analytics pipeline that is
reproducible, maintainable, and statistically rigorous?
Questions?
niels@peggedsoftware.com
@cosmicbboy
