SlideShare une entreprise Scribd logo
1  sur  25
Reproducibility with Checkpoint & RRO
New York R Conference
Joseph Rickert
April 25, 2015
2
OUR COMPANY
The leading provider
of advanced analytics
software and services
based on open source R,
since 2007
OUR PRODUCT
REVOLUTION R: The
enterprise-grade predictive
analytics application platform
based on the R language
SOME KUDOS
“This acquisition will help
customers use advanced
analytics within Microsoft data
platforms“
-- Joseph Sirosh, CVP C+E
What is Reproducibility?
“The goal of reproducible research is to tie specific
instructions to data analysis and experimental data so that
scholarship can be recreated, better understood and
verified.”
CRAN Task View on Reproducible Research (Kuhn)
 Method + Environment
-> Results
 A process for:
– Sharing the method
– Describing the environment
– Recreating the results
3
xkcd.com/242/
Reproducibility – why do we care?
Academic / Research
 Verify results
 Advance Research
Business
 Production code
 Reliability
 Reusability
 Collaboration
 Regulation
4
www.nytimes.com/2011/07/08/health/research/08genes.html
http://arxiv.org/pdf/1010.1092.pdf
Observations
 R versions are pretty manageable
– Major versions just once a year
– Patches rarely introduce incompatible changes
 Good solutions for literate programming
– Rstudio / knitr / Rmarkdown
 OS/Hardware not the major cause of problems
 The big problem is with packages
– CRAN is in a state of continual flux
5
Package dependency explosion
 R script file using 6 most popular packages
6
Any updated package = potential reproducibility error!
http://blog.revolutionanalytics.com/2014/10/explore-r-package-connections-at-mran.html
7
An R Reproducibility Problem
Adapted from http://xkcd.com/234/ CC BY-NC 2.5
8
Reproducible R Toolkit
 Static CRAN mirror in Revolution R Open
– CRAN packages fixed with each RRO update
 Daily CRAN snapshots
– Storing every package version since September 2014
– Hosted at mran.revolutionanalytics.com/snapshot
 Write and share scripts synced to a specific snapshot date
– checkpoint package installed with RRO
projects.revolutionanalytics.com/rrt/
9
Using checkpoint
 Add 2 lines to the top of your script
library(checkpoint)
checkpoint("2015-01-28")
 That’s it?
 Optionally, check the R version as well
library(checkpoint)
checkpoint("2015-01-28", R.version="3.1.3")
Or, whichever date you want
10
Using checkpoint
 Easy to use: add 2 lines to the top of each script
library(checkpoint)
checkpoint("2014-09-17")
 For the package author:
– Use package versions available on the chosen date
– Installs packages local to this project
• Allows different package versions to be used simultaneously
 For a script collaborator:
– Automatically installs required packages
• Detects required packages (no need to manually install!)
– Uses same package versions as script author to ensure reproducibility
11
R Script with packages having dependencies
 ## Adapted from https://gist.github.com/abresler/46c36c1a88c849b94b07
 ## Blog post Jan 21
 ## http://blog.revolutionanalytics.com/2015/01/a-beautiful-story-about-nyc-weather.html
 library(checkpoint)
 checkpoint("2015-02-05",R="3.1.2")
 library(dplyr)
 library(tidyr)
 library(magrittr)
 library(ggplot2)
12
First time script is run
 checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics
 http://projects.revolutionanalytics.com/rrt/
 > checkpoint("2015-02-05",R="3.1.2")
 Scanning for packages used in this project
 |==========================================================================================| 100%
 - Discovered 5 packages
 Installing packages used in this project
 - Installing ‘dplyr’
 also installing the dependencies ‘assertthat’, ‘R6’, ‘Rcpp’, ‘magrittr’, ‘lazyeval’, ‘DBI’, ‘BH’
 package ‘assertthat’ successfully unpacked and MD5 sums checked
 package ‘R6’ successfully unpacked and MD5 sums checked
 package ‘Rcpp’ successfully unpacked and MD5 sums checked
 package ‘magrittr’ successfully unpacked and MD5 sums checked
 package ‘lazyeval’ successfully unpacked and MD5 sums checked
 package ‘DBI’ successfully unpacked and MD5 sums checked
 package ‘BH’ successfully unpacked and MD5 sums checked
 package ‘dplyr’ successfully unpacked and MD5 sums checked
 - Installing ‘ggplot2’
 also installing the dependencies ‘colorspace’, ‘stringr’, ‘RColorBrewer’, ‘dichromat’, ‘munsell’, ‘labeling’, ‘plyr’, ‘digest’, ‘gtable’, ‘reshape2’,
‘scales’, ‘proto’, ‘MASS’
13
Lots of dependent packages loaded
 package ‘colorspace’ successfully unpacked and MD5 sums checked
 package ‘stringr’ successfully unpacked and MD5 sums checked
 package ‘RColorBrewer’ successfully unpacked and MD5 sums checked
 package ‘dichromat’ successfully unpacked and MD5 sums checked
 package ‘munsell’ successfully unpacked and MD5 sums checked
 package ‘labeling’ successfully unpacked and MD5 sums checked
 package ‘plyr’ successfully unpacked and MD5 sums checked
 package ‘digest’ successfully unpacked and MD5 sums checked
 package ‘gtable’ successfully unpacked and MD5 sums checked
 package ‘reshape2’ successfully unpacked and MD5 sums checked
 package ‘scales’ successfully unpacked and MD5 sums checked
 package ‘proto’ successfully unpacked and MD5 sums checked
 package ‘MASS’ successfully unpacked and MD5 sums checked
 package ‘ggplot2’ successfully unpacked and MD5 sums checked
 - Previously installed ‘magrittr’
 - Installing ‘tidyr’
 also installing the dependency ‘stringi’
 package ‘stringi’ successfully unpacked and MD5 sums checked
 package ‘tidyr’ successfully unpacked and MD5 sums checked
 checkpoint process complete
Checkpoint tips for script authors
 Work within a project
– Dedicated folder with scripts, data and output
– eg /Users/Joe/Rstudio_Projects/weather
 Create a master .R script file beginning with
library(checkpoint)
checkpoint("DATE")
– package versions used will be as of this date
 Don’t use install.packages directly
– Use library() and checkpoint does the rest
– You can have different package versions installed for different projects at the
same time!
14
Sharing projects with checkpoint
 Just share your script or project folder!
 Recipient only needs:
– compatible R version
– checkpoint package (installed with RRO)
– Internet connection to MRAN (at least first time)
 Checkpoint takes care of:
– Installing CRAN packages
• Binaries (ease of installation)
• Correct versions (reproducibility)
• Dependencies (ease of installation)
– Eliminating conflicts with other installed packages
15
The checkpoint magic
The checkpoint() call does all this:
 Scans project for required packages
 Installs required packages and dependencies
– Packages installed specific to project
– Versions specific to checkpoint date
• Installed in ~/.checkpoint/DATE
• Skips packages if already installed (2nd run through)
 Reconfigures package search path
– Points only to project-specific library
16
MRAN checkpoint server
Checkpoint uses MRAN’s downstream CRAN mirror
with daily snapshots.
17
CRAN
RRDaily
snapshots
checkpoint
package
library(checkpoint)
checkpoint("2015-01-28")
CRAN mirror
cran.revolutionanalytics.com
(Windows binaries virus-scanned)
checkpoint
server
Midnight
UTC
mran.revolutionanalytics.com/snapshot/
checkpoint server - implementation
Checkpoint uses MRAN’s downstream CRAN mirror with daily snapshots.
 rsync to mirror CRAN daily
– Only downloads changed packages
 zfs to store incremental snapshots
– Storage only required for new packages
 Organizes snapshots into a labelled hierarchy
– mran.revolutionanalytics.com/snapshot/YYYY-MM-DD
 MRAN hosted by high-performance cloud provider
– Provisioned for availability and latency
18
https://github.com/RevolutionAnalytics/checkpoint-server
19
Using non-CRAN packages Reproducibly
 Today, checkpoint only manages packages from CRAN
 GitHub: use install_github with a specific checkin hash
install_github("ramnathv/rblocks",
ref="a85e748390c17c752cc0ba961120d1e784fb1956")
 BioConductor: use packages from a specific BioConductor release
– Not as easy as it seems!
 Private packages / behind the firewall
– use miniCRAN to create a local, static repository
20
Why use checkpoint?
 Write and share code R whose results can be reproduced, even if new (and
possibly incompatible) package versions are released later.
 Share R scripts with others that will automatically install the appropriate package
versions (no need to manually install CRAN packages).
 Write R scripts that use older versions of packages, or packages that are no
longer available on CRAN.
 Install packages (or package versions) visible only to a specific project, without
affecting other R projects or R users on the same system.
 Manage multiple projects that use different package versions.
What about packrat?
 Packrat is flexible and powerful
– Supports non-CRAN packages (e.g. github)
– Allows mix-and-matching package versions
– Requires shipping all package source
– Requires recipients to build packages from source
 Checkpoint is simple
– Reproducibility from one script
– Simple for recipients to reproduce results
– Only allows use of CRAN packages versions that have been tested
together
– Requires Web access (and availability of MRAN)
21
rstudio.github.io/packrat/
22
Revolution R Open is:
 An enhanced open source distribution of R
 Compatible with all R-related software
 Multi-threaded for performance
 Focused on reproducibility
 Open source (GPLv2 license)
 Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE
 Includes checkpoint
 Designed to work with RStudio
 Side-by-side installations
 Download from mran.revolutionanalytics.com
CRAN mirrors and RRO
 Revolution R Open ships with a fixed default CRAN mirror
– Currently, 1 March 2015 snapshot (v 8.0.2)
– Soon: (v 8.0.3)
 All users of same version get same CRAN package versions by default
– regardless when “install.packages” is run
 Use checkpoint to access newer package versions
23
24
MRAN
The Managed R Archive Network
 Download Revolution R Open
 Learn about R and RRO
 Explore R Packages
 Explore Task Views
 R tips and applications
 Daily CRAN snapshots
mran.revolutionanalytics.com
Thank you.
Learn more at:
projects.revolutionanalytics.com/rrt/
Find archived webinars at:
revolutionanalytics.com/webinars
www.revolutionanalytics.com
1.855.GET.REVO
Twitter: @RevolutionR
Joseph Rickert
Data Scientist, Community Manager
Revolution Analytics
@RevoJoe
Joseph.rickert@revolutionanalytics.com

Contenu connexe

Tendances

12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution AnalyticsRevolution Analytics
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopRevolution Analytics
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with RTechsparks
 
Medea: Scheduling of Long Running Applications in Shared Production Clusters
Medea: Scheduling of Long Running Applications in Shared Production ClustersMedea: Scheduling of Long Running Applications in Shared Production Clusters
Medea: Scheduling of Long Running Applications in Shared Production ClustersPanagiotis Garefalakis
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...DataWorks Summit
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
SAM—streaming analytics made easy
SAM—streaming analytics made easySAM—streaming analytics made easy
SAM—streaming analytics made easyDataWorks Summit
 
An elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkAn elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkDataWorks Summit
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixDataWorks Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3DataWorks Summit
 
Richard Potts CV 2017-01-17
Richard Potts CV 2017-01-17Richard Potts CV 2017-01-17
Richard Potts CV 2017-01-17Richard Potts
 
Scaling stream data pipelines with Pravega and Apache Flink
Scaling stream data pipelines with Pravega and Apache FlinkScaling stream data pipelines with Pravega and Apache Flink
Scaling stream data pipelines with Pravega and Apache FlinkTill Rohrmann
 

Tendances (20)

Revolution R: 100% R and more
Revolution R: 100% R and moreRevolution R: 100% R and more
Revolution R: 100% R and more
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with Hadoop
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with R
 
Medea: Scheduling of Long Running Applications in Shared Production Clusters
Medea: Scheduling of Long Running Applications in Shared Production ClustersMedea: Scheduling of Long Running Applications in Shared Production Clusters
Medea: Scheduling of Long Running Applications in Shared Production Clusters
 
R reproducibility
R reproducibilityR reproducibility
R reproducibility
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with Hadoop
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
SAM—streaming analytics made easy
SAM—streaming analytics made easySAM—streaming analytics made easy
SAM—streaming analytics made easy
 
An elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache FlinkAn elastic batch-and stream-processing stack with Pravega and Apache Flink
An elastic batch-and stream-processing stack with Pravega and Apache Flink
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache Phoenix
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
Richard Potts CV 2017-01-17
Richard Potts CV 2017-01-17Richard Potts CV 2017-01-17
Richard Potts CV 2017-01-17
 
Scaling stream data pipelines with Pravega and Apache Flink
Scaling stream data pipelines with Pravega and Apache FlinkScaling stream data pipelines with Pravega and Apache Flink
Scaling stream data pipelines with Pravega and Apache Flink
 

En vedette

DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyWes McKinney
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solutionRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 

En vedette (10)

DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solution
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 

Similaire à Reproducibility with Checkpoint & RRO - NYC R Conference

Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRANRevolution Analytics
 
FASTEN: Scaling static analyses to ecosystem, presented at FOSDEM 2020 in Bru...
FASTEN: Scaling static analyses to ecosystem, presented at FOSDEM 2020 in Bru...FASTEN: Scaling static analyses to ecosystem, presented at FOSDEM 2020 in Bru...
FASTEN: Scaling static analyses to ecosystem, presented at FOSDEM 2020 in Bru...Fasten Project
 
DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...
DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...
DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...Felipe Prado
 
OWASP Dependency-Track Introduction
OWASP Dependency-Track IntroductionOWASP Dependency-Track Introduction
OWASP Dependency-Track IntroductionSergey Sotnikov
 
Introduction to r
Introduction to rIntroduction to r
Introduction to rgslicraf
 
[JOI] TOTVS Developers Joinville - Java #1
[JOI] TOTVS Developers Joinville - Java #1[JOI] TOTVS Developers Joinville - Java #1
[JOI] TOTVS Developers Joinville - Java #1Rubens Dos Santos Filho
 
Preparing and submitting a package to CRAN - June Sanderson, Sheffield R User...
Preparing and submitting a package to CRAN - June Sanderson, Sheffield R User...Preparing and submitting a package to CRAN - June Sanderson, Sheffield R User...
Preparing and submitting a package to CRAN - June Sanderson, Sheffield R User...Paul Richards
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applicationsKnoldus Inc.
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
SPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOSSPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOSinside-BigData.com
 
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWork-Bench
 
FASTEN presentation at OSS2021, by Michele Scarlato, Endocode, May 12, 2021, ...
FASTEN presentation at OSS2021, by Michele Scarlato, Endocode, May 12, 2021, ...FASTEN presentation at OSS2021, by Michele Scarlato, Endocode, May 12, 2021, ...
FASTEN presentation at OSS2021, by Michele Scarlato, Endocode, May 12, 2021, ...Fasten Project
 
Approaching package manager
Approaching package managerApproaching package manager
Approaching package managerTimur Safin
 
Apache maven, a software project management tool
Apache maven, a software project management toolApache maven, a software project management tool
Apache maven, a software project management toolRenato Primavera
 
PVS-Studio for Linux (CoreHard presentation)
PVS-Studio for Linux (CoreHard presentation)PVS-Studio for Linux (CoreHard presentation)
PVS-Studio for Linux (CoreHard presentation)Andrey Karpov
 
Nobody Knows What It’s Like To Be the Bad Man: The Development Process for th...
Nobody Knows What It’s Like To Be the Bad Man: The Development Process for th...Nobody Knows What It’s Like To Be the Bad Man: The Development Process for th...
Nobody Knows What It’s Like To Be the Bad Man: The Development Process for th...Work-Bench
 
How to implement continuous delivery with enterprise java middleware?
How to implement continuous delivery with enterprise java middleware?How to implement continuous delivery with enterprise java middleware?
How to implement continuous delivery with enterprise java middleware?Thoughtworks
 

Similaire à Reproducibility with Checkpoint & RRO - NYC R Conference (20)

Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRAN
 
FASTEN: Scaling static analyses to ecosystem, presented at FOSDEM 2020 in Bru...
FASTEN: Scaling static analyses to ecosystem, presented at FOSDEM 2020 in Bru...FASTEN: Scaling static analyses to ecosystem, presented at FOSDEM 2020 in Bru...
FASTEN: Scaling static analyses to ecosystem, presented at FOSDEM 2020 in Bru...
 
DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...
DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...
DEF CON 27 - workshop - ISAAC EVANS - discover exploit and eradicate entire v...
 
Native Hadoop with prebuilt spark
Native Hadoop with prebuilt sparkNative Hadoop with prebuilt spark
Native Hadoop with prebuilt spark
 
OWASP Dependency-Track Introduction
OWASP Dependency-Track IntroductionOWASP Dependency-Track Introduction
OWASP Dependency-Track Introduction
 
Introduction to r
Introduction to rIntroduction to r
Introduction to r
 
[JOI] TOTVS Developers Joinville - Java #1
[JOI] TOTVS Developers Joinville - Java #1[JOI] TOTVS Developers Joinville - Java #1
[JOI] TOTVS Developers Joinville - Java #1
 
Preparing and submitting a package to CRAN - June Sanderson, Sheffield R User...
Preparing and submitting a package to CRAN - June Sanderson, Sheffield R User...Preparing and submitting a package to CRAN - June Sanderson, Sheffield R User...
Preparing and submitting a package to CRAN - June Sanderson, Sheffield R User...
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
SPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOSSPACK: A Package Manager for Supercomputers, Linux, and MacOS
SPACK: A Package Manager for Supercomputers, Linux, and MacOS
 
MidSem
MidSemMidSem
MidSem
 
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
 
kishore_Nokia
kishore_Nokiakishore_Nokia
kishore_Nokia
 
FASTEN presentation at OSS2021, by Michele Scarlato, Endocode, May 12, 2021, ...
FASTEN presentation at OSS2021, by Michele Scarlato, Endocode, May 12, 2021, ...FASTEN presentation at OSS2021, by Michele Scarlato, Endocode, May 12, 2021, ...
FASTEN presentation at OSS2021, by Michele Scarlato, Endocode, May 12, 2021, ...
 
Approaching package manager
Approaching package managerApproaching package manager
Approaching package manager
 
Apache maven, a software project management tool
Apache maven, a software project management toolApache maven, a software project management tool
Apache maven, a software project management tool
 
PVS-Studio for Linux (CoreHard presentation)
PVS-Studio for Linux (CoreHard presentation)PVS-Studio for Linux (CoreHard presentation)
PVS-Studio for Linux (CoreHard presentation)
 
Nobody Knows What It’s Like To Be the Bad Man: The Development Process for th...
Nobody Knows What It’s Like To Be the Bad Man: The Development Process for th...Nobody Knows What It’s Like To Be the Bad Man: The Development Process for th...
Nobody Knows What It’s Like To Be the Bad Man: The Development Process for th...
 
How to implement continuous delivery with enterprise java middleware?
How to implement continuous delivery with enterprise java middleware?How to implement continuous delivery with enterprise java middleware?
How to implement continuous delivery with enterprise java middleware?
 

Plus de Revolution Analytics

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution Analytics
 

Plus de Revolution Analytics (13)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
 

Dernier

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 

Dernier (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

Reproducibility with Checkpoint & RRO - NYC R Conference

  • 1. Reproducibility with Checkpoint & RRO New York R Conference Joseph Rickert April 25, 2015
  • 2. 2 OUR COMPANY The leading provider of advanced analytics software and services based on open source R, since 2007 OUR PRODUCT REVOLUTION R: The enterprise-grade predictive analytics application platform based on the R language SOME KUDOS “This acquisition will help customers use advanced analytics within Microsoft data platforms“ -- Joseph Sirosh, CVP C+E
  • 3. What is Reproducibility? “The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.” CRAN Task View on Reproducible Research (Kuhn)  Method + Environment -> Results  A process for: – Sharing the method – Describing the environment – Recreating the results 3 xkcd.com/242/
  • 4. Reproducibility – why do we care? Academic / Research  Verify results  Advance Research Business  Production code  Reliability  Reusability  Collaboration  Regulation 4 www.nytimes.com/2011/07/08/health/research/08genes.html http://arxiv.org/pdf/1010.1092.pdf
  • 5. Observations  R versions are pretty manageable – Major versions just once a year – Patches rarely introduce incompatible changes  Good solutions for literate programming – Rstudio / knitr / Rmarkdown  OS/Hardware not the major cause of problems  The big problem is with packages – CRAN is in a state of continual flux 5
  • 6. Package dependency explosion  R script file using 6 most popular packages 6 Any updated package = potential reproducibility error! http://blog.revolutionanalytics.com/2014/10/explore-r-package-connections-at-mran.html
  • 7. 7 An R Reproducibility Problem Adapted from http://xkcd.com/234/ CC BY-NC 2.5
  • 8. 8 Reproducible R Toolkit  Static CRAN mirror in Revolution R Open – CRAN packages fixed with each RRO update  Daily CRAN snapshots – Storing every package version since September 2014 – Hosted at mran.revolutionanalytics.com/snapshot  Write and share scripts synced to a specific snapshot date – checkpoint package installed with RRO projects.revolutionanalytics.com/rrt/
  • 9. 9 Using checkpoint  Add 2 lines to the top of your script library(checkpoint) checkpoint("2015-01-28")  That’s it?  Optionally, check the R version as well library(checkpoint) checkpoint("2015-01-28", R.version="3.1.3") Or, whichever date you want
  • 10. 10 Using checkpoint  Easy to use: add 2 lines to the top of each script library(checkpoint) checkpoint("2014-09-17")  For the package author: – Use package versions available on the chosen date – Installs packages local to this project • Allows different package versions to be used simultaneously  For a script collaborator: – Automatically installs required packages • Detects required packages (no need to manually install!) – Uses same package versions as script author to ensure reproducibility
  • 11. 11 R Script with packages having dependencies  ## Adapted from https://gist.github.com/abresler/46c36c1a88c849b94b07  ## Blog post Jan 21  ## http://blog.revolutionanalytics.com/2015/01/a-beautiful-story-about-nyc-weather.html  library(checkpoint)  checkpoint("2015-02-05",R="3.1.2")  library(dplyr)  library(tidyr)  library(magrittr)  library(ggplot2)
  • 12. 12 First time script is run  checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics  http://projects.revolutionanalytics.com/rrt/  > checkpoint("2015-02-05",R="3.1.2")  Scanning for packages used in this project  |==========================================================================================| 100%  - Discovered 5 packages  Installing packages used in this project  - Installing ‘dplyr’  also installing the dependencies ‘assertthat’, ‘R6’, ‘Rcpp’, ‘magrittr’, ‘lazyeval’, ‘DBI’, ‘BH’  package ‘assertthat’ successfully unpacked and MD5 sums checked  package ‘R6’ successfully unpacked and MD5 sums checked  package ‘Rcpp’ successfully unpacked and MD5 sums checked  package ‘magrittr’ successfully unpacked and MD5 sums checked  package ‘lazyeval’ successfully unpacked and MD5 sums checked  package ‘DBI’ successfully unpacked and MD5 sums checked  package ‘BH’ successfully unpacked and MD5 sums checked  package ‘dplyr’ successfully unpacked and MD5 sums checked  - Installing ‘ggplot2’  also installing the dependencies ‘colorspace’, ‘stringr’, ‘RColorBrewer’, ‘dichromat’, ‘munsell’, ‘labeling’, ‘plyr’, ‘digest’, ‘gtable’, ‘reshape2’, ‘scales’, ‘proto’, ‘MASS’
  • 13. 13 Lots of dependent packages loaded  package ‘colorspace’ successfully unpacked and MD5 sums checked  package ‘stringr’ successfully unpacked and MD5 sums checked  package ‘RColorBrewer’ successfully unpacked and MD5 sums checked  package ‘dichromat’ successfully unpacked and MD5 sums checked  package ‘munsell’ successfully unpacked and MD5 sums checked  package ‘labeling’ successfully unpacked and MD5 sums checked  package ‘plyr’ successfully unpacked and MD5 sums checked  package ‘digest’ successfully unpacked and MD5 sums checked  package ‘gtable’ successfully unpacked and MD5 sums checked  package ‘reshape2’ successfully unpacked and MD5 sums checked  package ‘scales’ successfully unpacked and MD5 sums checked  package ‘proto’ successfully unpacked and MD5 sums checked  package ‘MASS’ successfully unpacked and MD5 sums checked  package ‘ggplot2’ successfully unpacked and MD5 sums checked  - Previously installed ‘magrittr’  - Installing ‘tidyr’  also installing the dependency ‘stringi’  package ‘stringi’ successfully unpacked and MD5 sums checked  package ‘tidyr’ successfully unpacked and MD5 sums checked  checkpoint process complete
  • 14. Checkpoint tips for script authors  Work within a project – Dedicated folder with scripts, data and output – eg /Users/Joe/Rstudio_Projects/weather  Create a master .R script file beginning with library(checkpoint) checkpoint("DATE") – package versions used will be as of this date  Don’t use install.packages directly – Use library() and checkpoint does the rest – You can have different package versions installed for different projects at the same time! 14
  • 15. Sharing projects with checkpoint  Just share your script or project folder!  Recipient only needs: – compatible R version – checkpoint package (installed with RRO) – Internet connection to MRAN (at least first time)  Checkpoint takes care of: – Installing CRAN packages • Binaries (ease of installation) • Correct versions (reproducibility) • Dependencies (ease of installation) – Eliminating conflicts with other installed packages 15
  • 16. The checkpoint magic The checkpoint() call does all this:  Scans project for required packages  Installs required packages and dependencies – Packages installed specific to project – Versions specific to checkpoint date • Installed in ~/.checkpoint/DATE • Skips packages if already installed (2nd run through)  Reconfigures package search path – Points only to project-specific library 16
  • 17. MRAN checkpoint server Checkpoint uses MRAN’s downstream CRAN mirror with daily snapshots. 17 CRAN RRDaily snapshots checkpoint package library(checkpoint) checkpoint("2015-01-28") CRAN mirror cran.revolutionanalytics.com (Windows binaries virus-scanned) checkpoint server Midnight UTC mran.revolutionanalytics.com/snapshot/
  • 18. checkpoint server - implementation Checkpoint uses MRAN’s downstream CRAN mirror with daily snapshots.  rsync to mirror CRAN daily – Only downloads changed packages  zfs to store incremental snapshots – Storage only required for new packages  Organizes snapshots into a labelled hierarchy – mran.revolutionanalytics.com/snapshot/YYYY-MM-DD  MRAN hosted by high-performance cloud provider – Provisioned for availability and latency 18 https://github.com/RevolutionAnalytics/checkpoint-server
  • 19. 19 Using non-CRAN packages Reproducibly  Today, checkpoint only manages packages from CRAN  GitHub: use install_github with a specific checkin hash install_github("ramnathv/rblocks", ref="a85e748390c17c752cc0ba961120d1e784fb1956")  BioConductor: use packages from a specific BioConductor release – Not as easy as it seems!  Private packages / behind the firewall – use miniCRAN to create a local, static repository
  • 20. 20 Why use checkpoint?  Write and share code R whose results can be reproduced, even if new (and possibly incompatible) package versions are released later.  Share R scripts with others that will automatically install the appropriate package versions (no need to manually install CRAN packages).  Write R scripts that use older versions of packages, or packages that are no longer available on CRAN.  Install packages (or package versions) visible only to a specific project, without affecting other R projects or R users on the same system.  Manage multiple projects that use different package versions.
  • 21. What about packrat?  Packrat is flexible and powerful – Supports non-CRAN packages (e.g. github) – Allows mix-and-matching package versions – Requires shipping all package source – Requires recipients to build packages from source  Checkpoint is simple – Reproducibility from one script – Simple for recipients to reproduce results – Only allows use of CRAN packages versions that have been tested together – Requires Web access (and availability of MRAN) 21 rstudio.github.io/packrat/
  • 22. 22 Revolution R Open is:  An enhanced open source distribution of R  Compatible with all R-related software  Multi-threaded for performance  Focused on reproducibility  Open source (GPLv2 license)  Available for Windows, Mac OS X, Ubuntu, Red Hat and OpenSUSE  Includes checkpoint  Designed to work with RStudio  Side-by-side installations  Download from mran.revolutionanalytics.com
  • 23. CRAN mirrors and RRO  Revolution R Open ships with a fixed default CRAN mirror – Currently, 1 March 2015 snapshot (v 8.0.2) – Soon: (v 8.0.3)  All users of same version get same CRAN package versions by default – regardless when “install.packages” is run  Use checkpoint to access newer package versions 23
  • 24. 24 MRAN The Managed R Archive Network  Download Revolution R Open  Learn about R and RRO  Explore R Packages  Explore Task Views  R tips and applications  Daily CRAN snapshots mran.revolutionanalytics.com
  • 25. Thank you. Learn more at: projects.revolutionanalytics.com/rrt/ Find archived webinars at: revolutionanalytics.com/webinars www.revolutionanalytics.com 1.855.GET.REVO Twitter: @RevolutionR Joseph Rickert Data Scientist, Community Manager Revolution Analytics @RevoJoe Joseph.rickert@revolutionanalytics.com

Notes de l'éditeur

  1. http://xkcd.com/242/