This document discusses building a hybrid R-Python analytics pipeline that is reproducible, maintainable, and statistically rigorous. It recommends Git for version control; Packrat (R) and pip (Python) for dependency management; subprocess for calling R from Python; testthat (R) and nose (Python) for automated testing; and Make for a reproducible workflow. Adopting these practices improves code quality, facilitates knowledge transfer, and encourages reproducible workflows, though setting them up requires an initial time investment.
3. Core Activities
● Build, evaluate, refine, and deploy predictive models
● Work with Engineering to ingest, validate, and store data
● Work with Product Management to develop data-driven feature sets
4. How might we build a predictive analytics pipeline that is
reproducible, maintainable, and statistically rigorous?
5. Anchor Yourself to Problem Statements / Use Cases
1. Define Problem statement
2. Scope out solution space and trade-offs
3. Make decision, justify it, document it
4. Implement chosen solution
5. Evaluate working solution against problem statement
6. Rinse and repeat
Problem-solving Heuristic
12. Why Packrat? From RStudio
1. Isolated: separate system environment and repo environment
2. Portable: easily sync dependencies across data science team
3. Reproducible: easily add/remove/upgrade/downgrade packages as needed
Dependency Management
16. ● Initialize packrat with packrat::init()
● Toggle packrat in R session with packrat::on() / off()
● Save current state of project with packrat::snapshot()
● Reconstitute your project with packrat::restore()
● Remove unused packages with packrat::clean()
Packrat Workflow
18. Problem: Unable to find source packages when restoring
Happens when there is a new version of a package on an R package
repository like CRAN
Packrat Issues
> packrat::restore()
Installing knitr (1.11) ... FAILED
Error in getSourceForPkgRecord(pkgRecord, srcDir(project), availablePkgs, :
  Couldn't find source for version 1.11 of knitr (1.10.5 is current)
20. Need | R | Python
Maintainable codebase | Git | Git
Sync package dependencies | Packrat | pip, pyenv
Call R from Python | - | subprocess
Automated testing | testthat | nose
Reproducible pipeline | Makefile | Makefile
21. Call R from Python: Data Pipeline
Read Data → Preprocess → Build Model → Evaluate → Deploy
22. # model_builder.R
cmdargs <- commandArgs(trailingOnly = TRUE)
data_filepath <- cmdargs[1]
model_type <- cmdargs[2]
formula <- cmdargs[3]

build.model <- function(data_filepath, model_type, formula) {
  df <- read.data(data_filepath)                 # project helper: load data from disk
  model <- train.model(df, model_type, formula)  # project helper: fit the model
  model
}

model <- build.model(data_filepath, model_type, formula)
Call R from Python: Example
# model_pipeline.py
import subprocess
subprocess.call(['path/to/R/executable',
                 'path/to/model_builder.R',
                 data_filepath, model_type, formula])
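In practice the call is worth wrapping in a small helper so failures in the R layer surface in Python. A minimal sketch, assuming Rscript is on the PATH and that model_builder.R takes the three positional arguments shown above (the helper names are illustrative, not from the original deck):

```python
# Sketch of invoking the R layer via subprocess, with error propagation.
import subprocess

def build_r_command(data_filepath, model_type, formula,
                    r_script="model_builder.R", r_exec="Rscript"):
    """Assemble the argv list for the R layer; paths here are placeholders."""
    return [r_exec, r_script, data_filepath, model_type, formula]

def run_model_builder(data_filepath, model_type, formula):
    cmd = build_r_command(data_filepath, model_type, formula)
    # check=True raises CalledProcessError if the R script exits non-zero,
    # so a failed model build is not silently ignored by the pipeline.
    return subprocess.run(cmd, check=True)
```

Passing arguments as a list (rather than one shell string) avoids quoting problems when the formula contains spaces like 'y ~ x1 + x2'.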
23. Why subprocess?
1. Python for control flow, data manipulation, IO handling
2. R for model build and evaluation computations
3. main.R script (model_builder.R) as the entry point into R layer
4. No need for tight Python-R integration
Call R from Python
25. Tolerance to Change
Are we confident that a modification to the codebase will not silently
introduce new bugs?
Automated Testing
26. Working Effectively with Legacy Code - Michael Feathers
1. Identify change points
2. Break dependencies
3. Write tests
4. Make changes
5. Refactor
Automated Testing
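Step 3 of Feathers' loop is writing tests. As a concrete sketch on the Python side, nose discovers plain `test_*` functions with bare asserts; the helper under test here (`clean_column`) is hypothetical, used only to show the shape of such a test file:

```python
# test_preprocess.py -- nose-style tests: plain asserts in test_* functions.
# clean_column is a hypothetical preprocessing helper, for illustration only.

def clean_column(values):
    """Drop missing entries and coerce the rest to float."""
    return [float(v) for v in values if v not in (None, "", "NA")]

def test_clean_column_drops_missing():
    assert clean_column(["1.5", None, "NA", "2"]) == [1.5, 2.0]

def test_clean_column_empty_input():
    assert clean_column([]) == []
```

Running `nosetests` in the repo root picks these up automatically, giving the safety net that makes the "Make changes / Refactor" steps tolerable.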
28. Make is a language-agnostic build utility for *nix systems
● Enables a reproducible workflow
● Serves as lightweight documentation for the repo
# makefile
build-model:
	python model_pipeline.py \
		-i 'model_input' \
		-m_type 'glm' \
		-formula 'y ~ x1 + x2'

# command-line
$ make build-model
Build Management: Make
$ python model_pipeline.py \
    -i input_fp \
    -m_type 'glm' \
    -formula 'y ~ x1 + x2'
VS
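For the Make target above to work, model_pipeline.py has to accept the -i/-m_type/-formula flags. One way to sketch that with argparse (the flag names come from the Makefile; the dest names and defaults are assumptions):

```python
# Argument parsing for model_pipeline.py, matching the flags the
# Makefile target passes (-i, -m_type, -formula). Sketch only.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Build a model via the R layer.")
    parser.add_argument("-i", dest="data_filepath", required=True,
                        help="path to the model input data")
    parser.add_argument("-m_type", dest="model_type", default="glm",
                        help="model family passed through to R")
    parser.add_argument("-formula", dest="formula", required=True,
                        help="R model formula, e.g. 'y ~ x1 + x2'")
    return parser.parse_args(argv)
```

With this in place, `make build-model` and a hand-typed invocation parse identically, which is exactly the duplication the Makefile exists to remove.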
29. By adopting the above practices, we:
1. Can maintain the codebase more easily
2. Reduce cognitive load and context switching
3. Improve code quality and correctness
4. Facilitate knowledge transfer among team members
5. Encourage reproducible workflows
Big Wins
30. Necessary Time Investment
1. The learning curve
2. Breaking old habits
3. Creating fixes for issues that come with the chosen solutions
Costs
31. How might we build a predictive analytics pipeline that is
reproducible, maintainable, and statistically rigorous?