SlideShare une entreprise Scribd logo
1  sur  18
Reproducible data science
Lightning review
Into
• Data Science Lead at Outra
• 5 years in Data Science
• Main focus has been social media,
marketing and retail
Josh Levy-Kramer
Outline
1. Reproducibility crisis
2. Possible solutions
3. Comparison and roundup
Reproducibility
crisis
• Dark ages when tracking changes and building models
• ML lacking abstractions that developers have developed
• Creates problems for yourself, your team and public projects
• Partially due to cant commit large files into git
Data Scientist Manifesto
• Reproducibility – the ability to reconstruct any previous state of your
data analysis (data and execution)
• Provenance – the ability to track any result and link it the the input
and code used
• Collaboration – the ability to easily collaborate with team members
• Environment agnostic – the ability to deploy a process to in different
environments without much hindrance
Adapted from http://www.pachyderm.io/dsbor.html
Data – Code – Environment → Output
CodeData Output
Environment
Code – Environment
Git
1 numpy==1.13.3
2 pandas==0.21.1
3 scikit-learn==0.19.1
4 psutil==5.4.0
5 pyyaml==3.1
OS environment
Python eniroment
1 FROM continuumio/miniconda3:4.3.27
2 COPY requirements.txt .
3 RUN pip install -r requirements.txt
4 COPY model.py
5 ENTRYPOINT model.py
1 AWSTemplateFormatVersion: "2010-09-09"
2 Resources:
3 WebInstance:
4 Type: AWS::EC2::Instance
5 Properties:
6 InstanceType: r4.8xlarge
7 ImageId: ami-80861296
8 KeyName: my-key
9 SecurityGroupIds:
10 - sg-abc01234
11 SubnetId: subnet-acb01234
Hardware environment
Code
Dockerfile
pip requirements.txt
AWS CloudFormation template
Data?
Data Output
Emerging solutions
Data Version Control
Git LFS
• Git extension
• Allows you to commit large files into Git
• Uses custom protocol and store
• No concept of pipelining
What's similar
Data Pipelines
• Version controls data and pipelines, similar to what Git does with
code
• Two main abstractions:
alpha=0.7 Output
v1
ML Image model
Input Output
alpha=0.7 Output
v2
Change
input
alpha=0.1 Output
v3
Change
model
Version
control all
• Version controls data and pipelines, similar to what Git does with
code
Pachyderm - workflow
Data
repo
pachctl putfile
Pipeline Output
pachctl create-pipeline
Pachyderm
• I like:
• Interlinked data-pipeline-
output version control
• Automatic output generation
• Parallelisation and distribution
• Semi-mature project –
started 2014
👍 👎
• I dislike:
• Not environment agnostic
• Bloated tool:
• Not generic - highly integrated
with Kubernetes and S3
• Installation is complicated
• Not portable
• Not integrated with git
dvc
• “Git extension for data scientists – mange your code and data together”
• Same git workflow with extra commands
dvc add
Data
repo
Pipeline Output
dvc run –d input.csv
–o output.csv model.py alpha=0.1
Workflow
dvc
• I like:
• Integration with GIT
• Interlinked data-pipeline-
output version control
• Easy to install
• Environment agnostic
👍 👎
• I dislike:
• Double the actions required
compared to just GIT – easy to
get lost in workflow
• Terrible name
• Immature project – started
2017
Round up
• No solution quite there yet
• Data version control is the best
contender
🤔

Contenu connexe

Tendances

What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWork-Bench
 
SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...
SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...
SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...DevOpsDays Tel Aviv
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
WEBINAR: Proven Patterns for Loading Test Data for Managed Package Testing
WEBINAR: Proven Patterns for Loading Test Data for Managed Package TestingWEBINAR: Proven Patterns for Loading Test Data for Managed Package Testing
WEBINAR: Proven Patterns for Loading Test Data for Managed Package TestingCodeScience
 
Enterprise graph applications
Enterprise graph applicationsEnterprise graph applications
Enterprise graph applicationsDavid Colebatch
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiSlim Baltagi
 
Bachelor Practical Course Semantic Web Part 1
Bachelor Practical Course Semantic Web Part 1Bachelor Practical Course Semantic Web Part 1
Bachelor Practical Course Semantic Web Part 1Claudius Hauptmann
 

Tendances (8)

GraphQL and mule4
GraphQL and mule4 GraphQL and mule4
GraphQL and mule4
 
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics PipelineWhat We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
 
SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...
SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...
SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
WEBINAR: Proven Patterns for Loading Test Data for Managed Package Testing
WEBINAR: Proven Patterns for Loading Test Data for Managed Package TestingWEBINAR: Proven Patterns for Loading Test Data for Managed Package Testing
WEBINAR: Proven Patterns for Loading Test Data for Managed Package Testing
 
Enterprise graph applications
Enterprise graph applicationsEnterprise graph applications
Enterprise graph applications
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim Baltagi
 
Bachelor Practical Course Semantic Web Part 1
Bachelor Practical Course Semantic Web Part 1Bachelor Practical Course Semantic Web Part 1
Bachelor Practical Course Semantic Web Part 1
 

Similaire à Reproducible data science: review of Pachyderm, Data Version Control and GIT LFS tools

Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowDatabricks
 
Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?Bruno Capuano
 
Managing Changes to the Database Across the Project Life Cycle (presented by ...
Managing Changes to the Database Across the Project Life Cycle (presented by ...Managing Changes to the Database Across the Project Life Cycle (presented by ...
Managing Changes to the Database Across the Project Life Cycle (presented by ...eZ Systems
 
Managing changes to eZPublish Database
Managing changes to eZPublish DatabaseManaging changes to eZPublish Database
Managing changes to eZPublish DatabaseGaetano Giunta
 
Building a custom cms with django
Building a custom cms with djangoBuilding a custom cms with django
Building a custom cms with djangoYann Malet
 
Expert guidance on migrating from magento 1 to magento 2
Expert guidance on migrating from magento 1 to magento 2Expert guidance on migrating from magento 1 to magento 2
Expert guidance on migrating from magento 1 to magento 2James Cowie
 
Reproducible research: practice
Reproducible research: practiceReproducible research: practice
Reproducible research: practiceC. Tobin Magle
 
Open Chemistry: Input Preparation, Data Visualization & Analysis
Open Chemistry: Input Preparation, Data Visualization & AnalysisOpen Chemistry: Input Preparation, Data Visualization & Analysis
Open Chemistry: Input Preparation, Data Visualization & AnalysisMarcus Hanwell
 
Database Migrations with Gradle and Liquibase
Database Migrations with Gradle and LiquibaseDatabase Migrations with Gradle and Liquibase
Database Migrations with Gradle and LiquibaseDan Stine
 
Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Debraj GuhaThakurta
 
Git version control and trunk based approach with VSTS
Git version control and trunk based approach with VSTSGit version control and trunk based approach with VSTS
Git version control and trunk based approach with VSTSMurughan Palaniachari
 
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample TrackingBruce Kozuma
 
Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionFlorian Wilhelm
 
Reproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsReproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsSimon Cockell
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
GR8CONF Contributing Back To Grails
GR8CONF Contributing Back To GrailsGR8CONF Contributing Back To Grails
GR8CONF Contributing Back To Grailsbobbywarner
 
Git(hub) for windows developers
Git(hub) for windows developersGit(hub) for windows developers
Git(hub) for windows developersbwullems
 

Similaire à Reproducible data science: review of Pachyderm, Data Version Control and GIT LFS tools (20)

Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflow
 
Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?Que nos espera a los ALM Dudes para el 2013?
Que nos espera a los ALM Dudes para el 2013?
 
Managing Changes to the Database Across the Project Life Cycle (presented by ...
Managing Changes to the Database Across the Project Life Cycle (presented by ...Managing Changes to the Database Across the Project Life Cycle (presented by ...
Managing Changes to the Database Across the Project Life Cycle (presented by ...
 
Managing changes to eZPublish Database
Managing changes to eZPublish DatabaseManaging changes to eZPublish Database
Managing changes to eZPublish Database
 
Building a custom cms with django
Building a custom cms with djangoBuilding a custom cms with django
Building a custom cms with django
 
Git SVN Migrate Reasons
Git SVN Migrate ReasonsGit SVN Migrate Reasons
Git SVN Migrate Reasons
 
Expert guidance on migrating from magento 1 to magento 2
Expert guidance on migrating from magento 1 to magento 2Expert guidance on migrating from magento 1 to magento 2
Expert guidance on migrating from magento 1 to magento 2
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Reproducible research: practice
Reproducible research: practiceReproducible research: practice
Reproducible research: practice
 
tip oopt pse-summit2017
tip oopt pse-summit2017tip oopt pse-summit2017
tip oopt pse-summit2017
 
Open Chemistry: Input Preparation, Data Visualization & Analysis
Open Chemistry: Input Preparation, Data Visualization & AnalysisOpen Chemistry: Input Preparation, Data Visualization & Analysis
Open Chemistry: Input Preparation, Data Visualization & Analysis
 
Database Migrations with Gradle and Liquibase
Database Migrations with Gradle and LiquibaseDatabase Migrations with Gradle and Liquibase
Database Migrations with Gradle and Liquibase
 
Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017Team Data Science Process Presentation (TDSP), Aug 29, 2017
Team Data Science Process Presentation (TDSP), Aug 29, 2017
 
Git version control and trunk based approach with VSTS
Git version control and trunk based approach with VSTSGit version control and trunk based approach with VSTS
Git version control and trunk based approach with VSTS
 
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
2019-04-17 Bio-IT World G Suite-Jira Cloud Sample Tracking
 
Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to Production
 
Reproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsReproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformatics
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
GR8CONF Contributing Back To Grails
GR8CONF Contributing Back To GrailsGR8CONF Contributing Back To Grails
GR8CONF Contributing Back To Grails
 
Git(hub) for windows developers
Git(hub) for windows developersGit(hub) for windows developers
Git(hub) for windows developers
 

Dernier

Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 

Dernier (20)

Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 

Reproducible data science: review of Pachyderm, Data Version Control and GIT LFS tools

  • 2. Into • Data Science Lead at Outra • 5 years in Data Science • Main focus has been social media, marketing and retail Josh Levy-Kramer
  • 3. Outline 1. Reproducibility crisis 2. Possible solutions 3. Comparison and roundup
  • 4. Reproducibility crisis • Dark ages when tracking changes and building models • ML lacking abstractions that developers have developed • Creates problems for yourself, your team and public projects • Partially due to cant commit large files into git
  • 5. Data Scientist Manifesto • Reproducibility – the ability to reconstruct any previous state of your data analysis (data and execution) • Provenance – the ability to track any result and link it the the input and code used • Collaboration – the ability to easily collaborate with team members • Environment agnostic – the ability to deploy a process to in different environments without much hindrance Adapted from http://www.pachyderm.io/dsbor.html
  • 6. Data – Code – Environment → Output CodeData Output Environment
  • 7. Code – Environment Git 1 numpy==1.13.3 2 pandas==0.21.1 3 scikit-learn==0.19.1 4 psutil==5.4.0 5 pyyaml==3.1 OS environment Python eniroment 1 FROM continuumio/miniconda3:4.3.27 2 COPY requirements.txt . 3 RUN pip install -r requirements.txt 4 COPY model.py 5 ENTRYPOINT model.py 1 AWSTemplateFormatVersion: "2010-09-09" 2 Resources: 3 WebInstance: 4 Type: AWS::EC2::Instance 5 Properties: 6 InstanceType: r4.8xlarge 7 ImageId: ami-80861296 8 KeyName: my-key 9 SecurityGroupIds: 10 - sg-abc01234 11 SubnetId: subnet-acb01234 Hardware environment Code Dockerfile pip requirements.txt AWS CloudFormation template
  • 10. Git LFS • Git extension • Allows you to commit large files into Git • Uses custom protocol and store • No concept of pipelining
  • 11. What's similar Data Pipelines • Version controls data and pipelines, similar to what Git does with code • Two main abstractions:
  • 12. alpha=0.7 Output v1 ML Image model Input Output alpha=0.7 Output v2 Change input alpha=0.1 Output v3 Change model Version control all • Version controls data and pipelines, similar to what Git does with code
  • 13. Pachyderm - workflow Data repo pachctl putfile Pipeline Output pachctl create-pipeline
  • 14. Pachyderm • I like: • Interlinked data-pipeline- output version control • Automatic output generation • Parallelisation and distribution • Semi-mature project – started 2014 👍 👎 • I dislike: • Not environment agnostic • Bloated tool: • Not generic - highly integrated with Kubernetes and S3 • Installation is complicated • Not portable • Not integrated with git
  • 15. dvc • “Git extension for data scientists – mange your code and data together” • Same git workflow with extra commands dvc add Data repo Pipeline Output dvc run –d input.csv –o output.csv model.py alpha=0.1
  • 17. dvc • I like: • Integration with GIT • Interlinked data-pipeline- output version control • Easy to install • Environment agnostic 👍 👎 • I dislike: • Double the actions required compared to just GIT – easy to get lost in workflow • Terrible name • Immature project – started 2017
  • 18. Round up • No solution quite there yet • Data version control is the best contender 🤔

Notes de l'éditeur

  1. These can be version controlled
  2. And the output
  3. Pachyderm is the most developed
  4. Track experimentation. Even the crap trys
  5. Imagine this is a model that’s doing some Ml on the images
  6. Going tobefome a margional tool but not going to be the next git
  7. Going tobefome a margional tool but not going to be the next git