SlideShare une entreprise Scribd logo
1  sur  20
R and Reproducibility
A Proposal
David Smith
useR! 2014
What is Reproducibility?
“The goal of reproducible research is to tie
specific instructions to data analysis and
experimental data so that scholarship can be
recreated, better understood and verified.”
CRAN Task View on Reproducible Research (Kuhn)
• Method + Environment
-> Results
• A process for:
– Sharing the method
– Describing the environment
– Recreating the results
2 xkcd.com/242/
Why care about reproducibility?
Academic / Research
• Verify results
• Advance Research
Business
• Production code
• Reliability
• Reusability
• Regulation
3
www.nytimes.com/2011/07/08/health/research/08genes.html
http://arxiv.org/pdf/1010.1092.pdf
R and Reproducibility
4
Results
Interfaces
Platform
Packages
R Engine
• Hand-assembled
• Sweave/knitr/DeployR/Shiny
• R GUI / DevelopR / RStudio
• Batch / Web Services
• OS / Virtualization
• Hardware Architecture
• CRAN
• BioConductor / GitHub / …
• R Version
• Base + Recommended pkgs
Observations
• R versions are pretty manageable
– Major versions just once a year
– Patches rarely introduce incompatible changes
• Good solutions for literate programming
– Interfaces help
• OS/Hardware not the major cause of
problems
• The big problem is with packages
– CRAN is in a state of continual flux
5
Package Problem #1 : The User
http://xkcd.com/234/6
I heard you need to create a
TPS Report. Here, I’ve got an
R script that does that
already.
Oh, you need to
download these 5
packages first.
I already
did, and it
still
doesn’t
work!
Well, it worked when I
wrote it 3 weeks ago.
YOUR
Grr.
Package
updates…
Package Problem #2: The Author
http://xkcd.com/970/7
Time to update
my package on
CRAN!
>> Dependent
packages that
now fail to build:
67
>> Resubmit
your package
and try again
Crap.
Package Problem #3 : The Update
http://xkcd.com/664/8
3 days later…
Woot! A new version of R
is out! I have 10 minutes
now, time to download
and install!
… package not found …
… can’t install package…
… error …
The Proposal
• Change the default way R handles packages
– Install packages local to projects
• “Snapshot” CRAN daily
– Make it easy to get & use package versions used in script
development
Not a new idea!
– Ooms, “Possible Directions for Improving Dependency
Versioning in R”, R Journal 5/1
– BioConductor Project
– Revolution R Enterprise
– Linux distros
9
Example
• R script file using 6 most popular packages
10
Sharing a script reproducibly
… and simply
# Run with R 3.1.0
require(RRT)
mran_set(snapshot="2014-06-27")
# find packages used in this project
# get package versions used by script author
# install locally to this project
require(ggplot2)
require(data.table)
require(knitr) …
11
RRT: The R Reproducibility Toolkit
• Open Source R Package (GPLv2)
• From an R project folder:
– Detect packages & dependencies used in project
– Download and install from MRAN
– Versions selected according to script date
– Find and use packages from local install
github.com/RevolutionAnalytics/RRT
12
MRAN - Implementation
A downstream CRAN mirror with daily snapshots
• Use rsync to mirror CRAN daily
– Only downloads changed packages
• Use zfs to store incremental snapshots
– Storage only required for new packages
• Organize snapshots into a labelled hierarchy
– Access package versions by date of use
• CRAN snapshot server hosted by cloud provider
– Provisioned for availability and latency
13
Future work
• Just getting started!
• Snapshot binaries and source packages
• Other repos (BioConductor, GitHub, user)
• Institution-level package duplication
– CRAN “behind the firewall”
• User-defined package versions
• Checks on R versions
• Suggestions welcome!
github.com/RevolutionAnalytics/RRT
14
Thank You!
David Smith
david@revolutionanalytics.com
blog.revolutionanalytics.com
Possible Solution
• Bundle all packages with scripts
• Packrat solves this very well
– Project + package dependencies stored in Github
• But:
– Contributes to package fragmentation
– Adds friction to the sharing process
– Doesn’t address the problem for R generally
16
CRAN vs Github
CRAN
• “Repository of Record”
– Default for R users
• Strict quality checking
• Handles dependencies
• Binaries built
– But only current versions
saved
• Manual update process
• Dependent on volunteer
support
Github
• Frictionless publishing /
updates
– RStudio integration
• Social development
– Pull requests FTW
• Ease of updates
• Fragmented – no unified
directory of packages
• Permanence – accounts
closed / repos deleted
17
A downstream CRAN solution?
“I don't see why CRAN needs to be involved in
this effort at all. A third party could take
snapshots of CRAN at R release dates, and make
those available to package users in a separate
repository. It is not hard to set a different
repository than CRAN as the default location
from which to obtain packages.”
-- R-core member, r-devel, March 2014
18
Snapshot CRAN repository :
requirements
• Availability
• Latency
• Bandwidth
• Storage
• Binary package archives
• Other enhancements?
19
Proposal
“Development Branch” “Stable Branch”
Defaults are important!!20
MRANCRAN Downstram
Reproducible

Contenu connexe

Tendances

What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWork-Bench
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedRevolution Analytics
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with RTechsparks
 
Alex Liu Harvard Forest Presentation
Alex Liu Harvard Forest PresentationAlex Liu Harvard Forest Presentation
Alex Liu Harvard Forest Presentationlexicron345
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
 
Reproducibility with Revolution R Open
Reproducibility with Revolution R OpenReproducibility with Revolution R Open
Reproducibility with Revolution R OpenRevolution Analytics
 
Data Science Challenges in Personal Program Analysis
Data Science Challenges in Personal Program AnalysisData Science Challenges in Personal Program Analysis
Data Science Challenges in Personal Program AnalysisWork-Bench
 
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...Josh Levy-Kramer
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionRevolution Analytics
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution Analytics
 
Dr. Datascience or: How I Learned to Stop Munging and Love Tests
Dr. Datascience or: How I Learned to Stop Munging and Love TestsDr. Datascience or: How I Learned to Stop Munging and Love Tests
Dr. Datascience or: How I Learned to Stop Munging and Love TestsWork-Bench
 
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...Revolution Analytics
 
Reproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R ConferenceReproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R ConferenceRevolution Analytics
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Intro to Reproducible Research
Intro to Reproducible ResearchIntro to Reproducible Research
Intro to Reproducible ResearchC. Tobin Magle
 

Tendances (20)

What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with R
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Alex Liu Harvard Forest Presentation
Alex Liu Harvard Forest PresentationAlex Liu Harvard Forest Presentation
Alex Liu Harvard Forest Presentation
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 
Reproducibility with Revolution R Open
Reproducibility with Revolution R OpenReproducibility with Revolution R Open
Reproducibility with Revolution R Open
 
Data Science Challenges in Personal Program Analysis
Data Science Challenges in Personal Program AnalysisData Science Challenges in Personal Program Analysis
Data Science Challenges in Personal Program Analysis
 
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and Revolution
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
 
Dr. Datascience or: How I Learned to Stop Munging and Love Tests
Dr. Datascience or: How I Learned to Stop Munging and Love TestsDr. Datascience or: How I Learned to Stop Munging and Love Tests
Dr. Datascience or: How I Learned to Stop Munging and Love Tests
 
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
Introducing Revolution R Open: Enhanced, Open Source R distribution from Revo...
 
Reproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R ConferenceReproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R Conference
 
Spark Worshop
Spark WorshopSpark Worshop
Spark Worshop
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Intro to Reproducible Research
Intro to Reproducible ResearchIntro to Reproducible Research
Intro to Reproducible Research
 

Similaire à R reproducibility

Effectively using Open Source with conda
Effectively using Open Source with condaEffectively using Open Source with conda
Effectively using Open Source with condaTravis Oliphant
 
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformaticsStephen Turner
 
OpenStack Documentation in the Open
OpenStack Documentation in the OpenOpenStack Documentation in the Open
OpenStack Documentation in the OpenAnne Gentle
 
Managing Open Source Software in the GitHub Era
Managing Open Source Software in the GitHub EraManaging Open Source Software in the GitHub Era
Managing Open Source Software in the GitHub EranexB Inc.
 
Upgrading CentOS on the Facebook fleet
Upgrading CentOS on the Facebook fleetUpgrading CentOS on the Facebook fleet
Upgrading CentOS on the Facebook fleetDavide Cavalca
 
Docker: Containers for Data Science
Docker: Containers for Data ScienceDocker: Containers for Data Science
Docker: Containers for Data ScienceAlessandro Adamo
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"IT Event
 
Lessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersLessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersBlueData, Inc.
 
Reproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RROReproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RROWork-Bench
 
Avogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsAvogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsMarcus Hanwell
 
Guidelines for Working with Contract Developers in Evergreen
Guidelines for Working with Contract Developers in EvergreenGuidelines for Working with Contract Developers in Evergreen
Guidelines for Working with Contract Developers in Evergreenloriayre
 
Package Repositories: The Unsung Heroes of Configuration and Release Managem...
Package Repositories:  The Unsung Heroes of Configuration and Release Managem...Package Repositories:  The Unsung Heroes of Configuration and Release Managem...
Package Repositories: The Unsung Heroes of Configuration and Release Managem...IBM UrbanCode Products
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
Why It’s Important to Contribute to Open-Source Projects | Keysight Connect #10
Why It’s Important to Contribute to Open-Source Projects | Keysight Connect #10Why It’s Important to Contribute to Open-Source Projects | Keysight Connect #10
Why It’s Important to Contribute to Open-Source Projects | Keysight Connect #10IxiaRomania
 
Versioning in Pipeline Pilot - Pipeline Pilot Forum 2018
Versioning in Pipeline Pilot - Pipeline Pilot Forum 2018Versioning in Pipeline Pilot - Pipeline Pilot Forum 2018
Versioning in Pipeline Pilot - Pipeline Pilot Forum 2018Peter Schmidtke
 
Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?Bruno Capuano
 
Alfresco DevCon 2018: SDK 3 Multi Module project using Nexus 3 for releases a...
Alfresco DevCon 2018: SDK 3 Multi Module project using Nexus 3 for releases a...Alfresco DevCon 2018: SDK 3 Multi Module project using Nexus 3 for releases a...
Alfresco DevCon 2018: SDK 3 Multi Module project using Nexus 3 for releases a...Martin Bergljung
 

Similaire à R reproducibility (20)

Effectively using Open Source with conda
Effectively using Open Source with condaEffectively using Open Source with conda
Effectively using Open Source with conda
 
R development
R developmentR development
R development
 
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
 
OpenStack Documentation in the Open
OpenStack Documentation in the OpenOpenStack Documentation in the Open
OpenStack Documentation in the Open
 
Managing Open Source Software in the GitHub Era
Managing Open Source Software in the GitHub EraManaging Open Source Software in the GitHub Era
Managing Open Source Software in the GitHub Era
 
Upgrading CentOS on the Facebook fleet
Upgrading CentOS on the Facebook fleetUpgrading CentOS on the Facebook fleet
Upgrading CentOS on the Facebook fleet
 
Docker: Containers for Data Science
Docker: Containers for Data ScienceDocker: Containers for Data Science
Docker: Containers for Data Science
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
 
Lessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersLessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker Containers
 
Reproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RROReproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RRO
 
Avogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and SemanticsAvogadro, Open Chemistry and Semantics
Avogadro, Open Chemistry and Semantics
 
Guidelines for Working with Contract Developers in Evergreen
Guidelines for Working with Contract Developers in EvergreenGuidelines for Working with Contract Developers in Evergreen
Guidelines for Working with Contract Developers in Evergreen
 
Package Repositories: The Unsung Heroes of Configuration and Release Managem...
Package Repositories:  The Unsung Heroes of Configuration and Release Managem...Package Repositories:  The Unsung Heroes of Configuration and Release Managem...
Package Repositories: The Unsung Heroes of Configuration and Release Managem...
 
Developing a Framework for File Format Migrations. Joey Heinen and Andrea Goe...
Developing a Framework for File Format Migrations. Joey Heinen and Andrea Goe...Developing a Framework for File Format Migrations. Joey Heinen and Andrea Goe...
Developing a Framework for File Format Migrations. Joey Heinen and Andrea Goe...
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Why It’s Important to Contribute to Open-Source Projects | Keysight Connect #10
Why It’s Important to Contribute to Open-Source Projects | Keysight Connect #10Why It’s Important to Contribute to Open-Source Projects | Keysight Connect #10
Why It’s Important to Contribute to Open-Source Projects | Keysight Connect #10
 
Versioning in Pipeline Pilot - Pipeline Pilot Forum 2018
Versioning in Pipeline Pilot - Pipeline Pilot Forum 2018Versioning in Pipeline Pilot - Pipeline Pilot Forum 2018
Versioning in Pipeline Pilot - Pipeline Pilot Forum 2018
 
Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?
 
R meetup 20161011v2
R meetup 20161011v2R meetup 20161011v2
R meetup 20161011v2
 
Alfresco DevCon 2018: SDK 3 Multi Module project using Nexus 3 for releases a...
Alfresco DevCon 2018: SDK 3 Multi Module project using Nexus 3 for releases a...Alfresco DevCon 2018: SDK 3 Multi Module project using Nexus 3 for releases a...
Alfresco DevCon 2018: SDK 3 Multi Module project using Nexus 3 for releases a...
 

Plus de Revolution Analytics

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution Analytics
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solutionRevolution Analytics
 
Reproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageReproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageRevolution Analytics
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormRevolution Analytics
 

Plus de Revolution Analytics (18)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solution
 
Reproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageReproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint Package
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
 

Dernier

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 

Dernier (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

R reproducibility

  • 1. R and Reproducibility A Proposal David Smith useR! 2014
  • 2. What is Reproducibility? “The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.” CRAN Task View on Reproducible Research (Kuhn) • Method + Environment -> Results • A process for: – Sharing the method – Describing the environment – Recreating the results 2 xkcd.com/242/
  • 3. Why care about reproducibility? Academic / Research • Verify results • Advance Research Business • Production code • Reliability • Reusability • Regulation 3 www.nytimes.com/2011/07/08/health/research/08genes.html http://arxiv.org/pdf/1010.1092.pdf
  • 4. R and Reproducibility 4 Results Interfaces Platform Packages R Engine • Hand-assembled • Sweave/knitr/DeployR/Shiny • R GUI / DevelopR / RStudio • Batch / Web Services • OS / Virtualization • Hardware Architecture • CRAN • BioConductor / GitHub / … • R Version • Base + Recommended pkgs
  • 5. Observations • R versions are pretty manageable – Major versions just once a year – Patches rarely introduce incompatible changes • Good solutions for literate programming – Interfaces help • OS/Hardware not the major cause of problems • The big problem is with packages – CRAN is in a state of continual flux 5
  • 6. Package Problem #1 : The User http://xkcd.com/234/6 I heard you need to create a TPS Report. Here, I’ve got an R script that does that already. Oh, you need to download these 5 packages first. I already did, and it still doesn’t work! Well, it worked when I wrote it 3 weeks ago. YOUR Grr. Package updates…
  • 7. Package Problem #2: The Author http://xkcd.com/970/7 Time to update my package on CRAN! >> Dependent packages that now fail to build: 67 >> Resubmit your package and try again Crap.
  • 8. Package Problem #3 : The Update http://xkcd.com/664/8 3 days later… Woot! A new version of R is out! I have 10 minutes now, time to download and install! … package not found … … can’t install package… … error …
  • 9. The Proposal • Change the default way R handles packages – Install packages local to projects • “Snapshot” CRAN daily – Make it easy to get & use package versions used in script development Not a new idea! – Ooms, “Possible Directions for Improving Dependency Versioning in R”, R Journal 5/1 – BioConductor Project – Revolution R Enterprise – Linux distros 9
  • 10. Example • R script file using 6 most popular packages 10
  • 11. Sharing a script reproducibly … and simply # Run with R 3.1.0 require(RRT) mran_set(snapshot="2014-06-27") # find packages used in this project # get package versions used by script author # install locally to this project require(ggplot2) require(data.table) require(knitr) … 11
  • 12. RRT: The R Reproducibility Toolkit • Open Source R Package (GPLv2) • From an R project folder: – Detect packages & dependencies used in project – Download and install from MRAN – Versions selected according to script date – Find and use packages from local install github.com/RevolutionAnalytics/RRT 12
  • 13. MRAN - Implementation A downstream CRAN mirror with daily snapshots • Use rsync to mirror CRAN daily – Only downloads changed packages • Use zfs to store incremental snapshots – Storage only required for new packages • Organize snapshots into a labelled hierarchy – Access package versions by date of use • CRAN snapshot server hosted by cloud provider – Provisioned for availability and latency 13
  • 14. Future work • Just getting started! • Snapshot binaries and source packages • Other repos (BioConductor, GitHub, user) • Institution-level package duplication – CRAN “behind the firewall” • User-defined package versions • Checks on R versions • Suggestions welcome! github.com/RevolutionAnalytics/RRT 14
  • 16. Possible Solution • Bundle all packages with scripts • Packrat solves this very well – Project + package dependencies stored in Github • But: – Contributes to package fragmentation – Adds friction to the sharing process – Doesn’t address the problem for R generally 16
  • 17. CRAN vs Github CRAN • “Repository of Record” – Default for R users • Strict quality checking • Handles dependencies • Binaries built – But only current versions saved • Manual update process • Dependent on volunteer support Github • Frictionless publishing / updates – RStudio integration • Social development – Pull requests FTW • Ease of updates • Fragmented – no unified directory of packages • Permanence – accounts closed / repos deleted 17
  • 18. A downstream CRAN solution? “I don't see why CRAN needs to be involved in this effort at all. A third party could take snapshots of CRAN at R release dates, and make those available to package users in a separate repository. It is not hard to set a different repository than CRAN as the default location from which to obtain packages.” -- R-core member, r-devel, March 2014 18
  • 19. Snapshot CRAN repository : requirements • Availability • Latency • Bandwidth • Storage • Binary package archives • Other enhancements? 19
  • 20. Proposal “Development Branch” “Stable Branch” Defaults are important!!20 MRANCRAN Downstram Reproducible

Notes de l'éditeur

  1. http://xkcd.com/242/
  2. https://stat.ethz.ch/pipermail/r-devel/2014-March/068552.html