SlideShare une entreprise Scribd logo
1  sur  23
Reproducible Research with R,The
Tidyverse, Notebooks, and Spark
Bob Wakefield
Principal
bob@MassStreet.net
Twitter:
@BobLovesData
Who is Mass Street?
• Boutique data consultancy
• We work on data problems big or “small”
• We have a focus on helping organizations
move from a batch to a real time paradigm.
• Free Big Data training
Mass Street Partnerships and
Capability
•Hortonworks Partner
•Confluent Partner
•ARG Back Office
Bob’s Background
• IT professional 16 years
• Currently working as a Data Engineer
• Education
• BS Business Admin (MIS) from KState
• MBA (finance concentration) from KU
• Coursework in Mathematics at Washburn
• Graduate certificate Data Science from Rockhurst
•Addicted to everything data
Follow Me!
•Personal Twitter: @BobLovesData
•Company Twitter: @MassStreet
•Blog: DataDrivenPerspectives.com
•Website: www.MassStreet.net
•Facebook: @MassStreetAnalyticsLLC
KC Learn Big Data Objectives
•Educate people about what you
can do with all the new technology
surrounding data.
•Grow the big data career field.
•Teach skills not products
ACM Kansas City
We’re looking for a speaker
willing to talk in deep detail
about data engineering
challenges their organization is
experiencing.
This Evening’s Learning
Objectives
All Material Can Be Downloaded
from GitHub
People Associated With
Johns Hopkins
People NOT Associated
With Johns Hopkins
Tidyverse
Reproducible Research
Literate Statistical
Programming
Opinionated Analysis
Data Science as a Science
Brian Caffo
Not So Standard
Deviations Podcast
Roger Peng
Hillary Parker
Hadley Wickham
It’s All Six Degrees of Johns Hopkins’ Biostatistics Department
Jeff Leek
Simply Statistics
Website
Motivations For This Evenings Discussion
Changes to the Career Field in the Past 15
Months
• Rise of Python for Data Science/Engineering
• Rise of notebooks (Jupyter, Zeppelin, R Notebook)
• Data Science SaaS (cloud, cloud, and more cloud)
• R got a nice NLP package
• Deep Learn all the things!
• Rise of Spark.
Reproducible Research
• Introduction to the topic came from the Not So
Standard Deviations Podcast.
• Researches and software engineers approach data
science wildly differently.
• Both sides can learn from the other.
Reproducible Research
• Someone should be able to run your exact analysis and get your
result.
• Goal is to reproduce NOT replicate.
• Reproduce = validate your work
• Replicate = validate the conclusions of the study
• This is a lot harder than it sounds.
• Reproducibility hasn’t been totally figured out.
• I still struggle with dependencies
• Build tools for R?
Reproducible Research
• Elements of reproducibility
1. Analytic data (the Tidy data)
2. Analytic code
3. Documentation
4. Distribution
• Of these, distribution is the trickiest
Reproducible Research
• Literate Statistical Programming
• Combine your analysis and your code into a single
document
• There are several tools for this
• Markdown
• RMarkdown/knitr
• R Studio
• Notebooks
Reproducible Research
• A proposed structure of analysis*
• Defining the question
• Defining the ideal dataset
• Determining what data you can access
• Obtaining the data
• Cleaning the data
• Exploratory data analysis
• Statistical prediction/modeling
• Interpretation/Challenging of results
• Synthesis and write up
• Creating reproducible code
*From “Report Writing for Data Science in R”
Reproducible Research
• Reproducibility Checklist*
• Start with good science
• Don’t do things by hand
• Don’t point and click
• Teach a computer
• Use version control
• Keep track of your software environment
• Don’t save output
• Set your seed
• Think about the entire pipeline
*From “Report Writing for Data Science in R”
Opinionated Analysis Development
• Read Opinionated Analysis Development
• Link in references
• Opinionated analysis = analysis that follows certain
practices
• Follows on to the principals of reproducible research
• Lays out a framework for how an analysis should be
completed
Tidy Data
• Three rules that make data tidy:
• Each variable must have its own column.
• Each observation must have its own row.
• Each value must have its own cell.
• No you’re not crazy. Yes that’s third normal form.
• I don’t have to deal with this issue often if ever.
Tidyverse
• Used to be called the Hadleyverse
• An ecosystem of packages designed with common
APIs and a shared philosophy
• Helps you get your data tidy
• Also assumes that your data is tidy
Part II: Tools
• Git
• Modern Source Control
• Code repositories: GitHub and Bitbucket
• GUI: Sourcetree
• Rmarkdown/knitr
• Appears to be strictly an R Studio thing
• Pandoc Markdown/Jupyter
• Julia, Python, R
• Rebranded Ipython
Part II: Tools
• SparklyR/Databricks
• SparklyR provides an R API for Spark with dplyr
• AWS/S3
• Helps solve the problem of accessibility to data
• Can be annoying to manage
• Tidyverse
• A set of packages that makes working with data easier

Contenu connexe

Tendances

Scaling Data Science at Airbnb
Scaling Data Science at AirbnbScaling Data Science at Airbnb
Scaling Data Science at AirbnbWork-Bench
 
Claudia Gold: Learning Data Science Online
Claudia Gold: Learning Data Science OnlineClaudia Gold: Learning Data Science Online
Claudia Gold: Learning Data Science Onlinesfdatascience
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceCaserta
 
Landing a career in data science
Landing a career in data scienceLanding a career in data science
Landing a career in data scienceParul Pandey
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Onlinesfdatascience
 
The Data Analysis Workflow
The Data Analysis WorkflowThe Data Analysis Workflow
The Data Analysis WorkflowJonathanEarley3
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchRachel Berryman
 
Data Science and Beer - Kris peeters
Data Science and Beer - Kris peetersData Science and Beer - Kris peeters
Data Science and Beer - Kris peeterssparktc
 
A view from the ivory tower: Participating in Apache as a member of academia
A view from the ivory tower: Participating in Apache as a member of academiaA view from the ivory tower: Participating in Apache as a member of academia
A view from the ivory tower: Participating in Apache as a member of academiaMichael Mior
 
Secure360 May 2018 Lessons Learned from OWASP T10 Datacall
Secure360 May 2018 Lessons Learned from OWASP T10 DatacallSecure360 May 2018 Lessons Learned from OWASP T10 Datacall
Secure360 May 2018 Lessons Learned from OWASP T10 DatacallBrian Glas
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicinePaul Groth
 
Data Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansData Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansJameel Syed
 
from_physics_to_data_science
from_physics_to_data_sciencefrom_physics_to_data_science
from_physics_to_data_scienceMartina Pugliese
 
Data Integration Score Card
Data Integration Score CardData Integration Score Card
Data Integration Score CardSciBite Limited
 
Intro to Machine Learning
Intro to Machine LearningIntro to Machine Learning
Intro to Machine LearningCorey Chivers
 

Tendances (20)

Scaling Data Science at Airbnb
Scaling Data Science at AirbnbScaling Data Science at Airbnb
Scaling Data Science at Airbnb
 
Claudia Gold: Learning Data Science Online
Claudia Gold: Learning Data Science OnlineClaudia Gold: Learning Data Science Online
Claudia Gold: Learning Data Science Online
 
Hadoop Meets Scrum
Hadoop Meets ScrumHadoop Meets Scrum
Hadoop Meets Scrum
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Landing a career in data science
Landing a career in data scienceLanding a career in data science
Landing a career in data science
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Online
 
The Data Analysis Workflow
The Data Analysis WorkflowThe Data Analysis Workflow
The Data Analysis Workflow
 
Data science 101
Data science 101Data science 101
Data science 101
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
 
Data Science and Beer - Kris peeters
Data Science and Beer - Kris peetersData Science and Beer - Kris peeters
Data Science and Beer - Kris peeters
 
A view from the ivory tower: Participating in Apache as a member of academia
A view from the ivory tower: Participating in Apache as a member of academiaA view from the ivory tower: Participating in Apache as a member of academia
A view from the ivory tower: Participating in Apache as a member of academia
 
Secure360 May 2018 Lessons Learned from OWASP T10 Datacall
Secure360 May 2018 Lessons Learned from OWASP T10 DatacallSecure360 May 2018 Lessons Learned from OWASP T10 Datacall
Secure360 May 2018 Lessons Learned from OWASP T10 Datacall
 
Data Science using Python
Data Science using PythonData Science using Python
Data Science using Python
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicine
 
Unit 3 part 2
Unit  3 part 2Unit  3 part 2
Unit 3 part 2
 
Data Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansData Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake Fans
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
from_physics_to_data_science
from_physics_to_data_sciencefrom_physics_to_data_science
from_physics_to_data_science
 
Data Integration Score Card
Data Integration Score CardData Integration Score Card
Data Integration Score Card
 
Intro to Machine Learning
Intro to Machine LearningIntro to Machine Learning
Intro to Machine Learning
 

Similaire à Reproducible Research with R, The Tidyverse, Notebooks, and Spark

NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkVivian S. Zhang
 
02-Lifecycle.pptx
02-Lifecycle.pptx02-Lifecycle.pptx
02-Lifecycle.pptxShree Shree
 
Data Science Training and Placement
Data Science Training and PlacementData Science Training and Placement
Data Science Training and PlacementAkhilGGM
 
01-Introduction.pptx
01-Introduction.pptx01-Introduction.pptx
01-Introduction.pptxShree Shree
 
01-Introduction.pptx
01-Introduction.pptx01-Introduction.pptx
01-Introduction.pptxShree Shree
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsIlkay Altintas, Ph.D.
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?DIGITALSAI1
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification courseKumarNaik21
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)SayyedYusufali
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabadVamsiNihal
 
Data science training in Hyderabad
Data science  training in HyderabadData science  training in Hyderabad
Data science training in Hyderabadsaitejavella
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training HyderabadNithinsunil1
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabadVamsiNihal
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)SayyedYusufali
 
data science training and placement
data science training and placementdata science training and placement
data science training and placementSaiprasadVella
 
online data science training
online data science trainingonline data science training
online data science trainingDIGITALSAI1
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabadVamsiNihal
 
data science online training in hyderabad
data science online training in hyderabaddata science online training in hyderabad
data science online training in hyderabadVamsiNihal
 
Best data science training in Hyderabad
Best data science training in HyderabadBest data science training in Hyderabad
Best data science training in HyderabadKumarNaik21
 

Similaire à Reproducible Research with R, The Tidyverse, Notebooks, and Spark (20)

NYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talkNYC Open Data Meetup-- Thoughtworks chief data scientist talk
NYC Open Data Meetup-- Thoughtworks chief data scientist talk
 
02-Lifecycle.pptx
02-Lifecycle.pptx02-Lifecycle.pptx
02-Lifecycle.pptx
 
Data Science Training and Placement
Data Science Training and PlacementData Science Training and Placement
Data Science Training and Placement
 
01-Introduction.pdf
01-Introduction.pdf01-Introduction.pdf
01-Introduction.pdf
 
01-Introduction.pptx
01-Introduction.pptx01-Introduction.pptx
01-Introduction.pptx
 
01-Introduction.pptx
01-Introduction.pptx01-Introduction.pptx
01-Introduction.pptx
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable Workflows
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Data science training in Hyderabad
Data science  training in HyderabadData science  training in Hyderabad
Data science training in Hyderabad
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training Hyderabad
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabad
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 
data science training and placement
data science training and placementdata science training and placement
data science training and placement
 
online data science training
online data science trainingonline data science training
online data science training
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabad
 
data science online training in hyderabad
data science online training in hyderabaddata science online training in hyderabad
data science online training in hyderabad
 
Best data science training in Hyderabad
Best data science training in HyderabadBest data science training in Hyderabad
Best data science training in Hyderabad
 

Plus de Adaryl "Bob" Wakefield, MBA

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothAdaryl "Bob" Wakefield, MBA
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataAdaryl "Bob" Wakefield, MBA
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionAdaryl "Bob" Wakefield, MBA
 

Plus de Adaryl "Bob" Wakefield, MBA (9)

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
Getting Started With Sparklyr
Getting Started With SparklyrGetting Started With Sparklyr
Getting Started With Sparklyr
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT Data
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Hands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop EcosystemHands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop Ecosystem
 
Hands On: Linux survival skills
Hands On: Linux survival skillsHands On: Linux survival skills
Hands On: Linux survival skills
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics Solution
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
 

Dernier

TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfSayantanBiswas37
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 

Dernier (20)

TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 

Reproducible Research with R, The Tidyverse, Notebooks, and Spark

  • 1. Reproducible Research with R,The Tidyverse, Notebooks, and Spark Bob Wakefield Principal bob@MassStreet.net Twitter: @BobLovesData
  • 2. Who is Mass Street? • Boutique data consultancy • We work on data problems big or “small” • We have a focus on helping organizations move from a batch to a real time paradigm. • Free Big Data training
  • 3. Mass Street Partnerships and Capability •Hortonworks Partner •Confluent Partner •ARG Back Office
  • 4. Bob’s Background • IT professional 16 years • Currently working as a Data Engineer • Education • BS Business Admin (MIS) from KState • MBA (finance concentration) from KU • Coursework in Mathematics at Washburn • Graduate certificate Data Science from Rockhurst •Addicted to everything data
  • 5. Follow Me! •Personal Twitter: @BobLovesData •Company Twitter: @MassStreet •Blog: DataDrivenPerspectives.com •Website: www.MassStreet.net •Facebook: @MassStreetAnalyticsLLC
  • 6. KC Learn Big Data Objectives •Educate people about what you can do with all the new technology surrounding data. •Grow the big data career field. •Teach skills not products
  • 7. ACM Kansas City We’re looking for a speaker willing to talk in deep detail about data engineering challenges their organization is experiencing.
  • 9. All Material Can Be Downloaded from GitHub
  • 10. People Associated With Johns Hopkins People NOT Associated With Johns Hopkins Tidyverse Reproducible Research Literate Statistical Programming Opinionated Analysis Data Science as a Science Brian Caffo Not So Standard Deviations Podcast Roger Peng Hillary Parker Hadley Wickham It’s All Six Degrees of Johns Hopkins’ Biostatistics Department Jeff Leek Simply Statistics Website
  • 11. Motivations For This Evenings Discussion
  • 12. Changes to the Career Field in the Past 15 Months • Rise of Python for Data Science/Engineering • Rise of notebooks (Jupyter, Zeppelin, R Notebook) • Data Science SaaS (cloud, cloud, and more cloud) • R got a nice NLP package • Deep Learn all the things! • Rise of Spark.
  • 13. Reproducible Research • Introduction to the topic came from the Not So Standard Deviations Podcast. • Researches and software engineers approach data science wildly differently. • Both sides can learn from the other.
  • 14. Reproducible Research • Someone should be able to run your exact analysis and get your result. • Goal is to reproduce NOT replicate. • Reproduce = validate your work • Replicate = validate the conclusions of the study • This is a lot harder than it sounds. • Reproducibility hasn’t been totally figured out. • I still struggle with dependencies • Build tools for R?
  • 15. Reproducible Research • Elements of reproducibility 1. Analytic data (the Tidy data) 2. Analytic code 3. Documentation 4. Distribution • Of these, distribution is the trickiest
  • 16. Reproducible Research • Literate Statistical Programming • Combine your analysis and your code into a single document • There are several tools for this • Markdown • RMarkdown/knitr • R Studio • Notebooks
  • 17. Reproducible Research • A proposed structure of analysis* • Defining the question • Defining the ideal dataset • Determining what data you can access • Obtaining the data • Cleaning the data • Exploratory data analysis • Statistical prediction/modeling • Interpretation/Challenging of results • Synthesis and write up • Creating reproducible code *From “Report Writing for Data Science in R”
  • 18. Reproducible Research • Reproducibility Checklist* • Start with good science • Don’t do things by hand • Don’t point and click • Teach a computer • Use version control • Keep track of your software environment • Don’t save output • Set your seed • Think about the entire pipeline *From “Report Writing for Data Science in R”
  • 19. Opinionated Analysis Development • Read Opinionated Analysis Development • Link in references • Opinionated analysis = analysis that follows certain practices • Follows on to the principals of reproducible research • Lays out a framework for how an analysis should be completed
  • 20. Tidy Data • Three rules that make data tidy: • Each variable must have its own column. • Each observation must have its own row. • Each value must have its own cell. • No you’re not crazy. Yes that’s third normal form. • I don’t have to deal with this issue often if ever.
  • 21. Tidyverse • Used to be called the Hadleyverse • An ecosystem of packages designed with common APIs and a shared philosophy • Helps you get your data tidy • Also assumes that your data is tidy
  • 22. Part II: Tools • Git • Modern Source Control • Code repositories: GitHub and Bitbucket • GUI: Sourcetree • Rmarkdown/knitr • Appears to be strictly an R Studio thing • Pandoc Markdown/Jupyter • Julia, Python, R • Rebranded Ipython
  • 23. Part II: Tools • SparklyR/Databricks • SparklyR provides an R API for Spark with dplyr • AWS/S3 • Helps solve the problem of accessibility to data • Can be annoying to manage • Tidyverse • A set of packages that makes working with data easier