SlideShare une entreprise Scribd logo
1  sur  7
Revolution Confidential

Data Science
Not just for big data!
David Smith
Revolution Analytics
@revodavid
October 16, 2013
Big Data: the new oil?

Photo: Sarah&Boston (flickr: pocheco) Creative Commons BY-SA 2.0

Revolution Confidential

2
Big Data is just raw material

Revolution Confidential

 Data Distillation
 Extract quantities of interest
 Find complete cases
 Derive missing information

 Big Data Pitfalls:
 Data cleanliness & accuracy
 Observational bias
 Do the data I have represent the population I’m
interested in?
3
Surveys & Experiments

Revolution Confidential

 Even with Big Data, the data you need isn’t
always in the building!
 … so ask (survey)!
 Survey design
 Stratified sampling

 … or experiment!
 A/B Testing
 Experimental Design
4
Data Exploration & Visualization

Revolution Confidential

 Limited by pixels
 Big data = a big black
blob

 Extract signal from
noise





Aggregations
Heat maps
Smoothing
Small multiples
5
Statistical Modeling & Forecasting

Revolution Confidential

 You don’t always need big data
 Sampling can help with observational bias

 Model selection
 Feature extraction
 Confounding?
 Interactions?

 Model validation
 Overfitting

 Prediction
 Extrapolation
 Confidence
http://xkcd.com/605/

6
Summary

Revolution Confidential

 Big Data is great, but think of it as the “raw
materials” for data science
 After refining, “big” isn’t always so “Big”

 Use statistical insight to avoid pitfalls:
 Inferences: Observational bias / Sampling bias
 Predictions: Confounding / Overfitting
 Think about variances and means (risk!)

 Some data scientists may miss these issues
 Look for statistical expertise

 Further reading:
 ComputerWorld: 12 predictive analytics screw-ups
7

Contenu connexe

Tendances

Tendances (20)

Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
 
Unit 3 part 2
Unit  3 part 2Unit  3 part 2
Unit 3 part 2
 
Data Science presentation for elementary school students
Data Science presentation for elementary school studentsData Science presentation for elementary school students
Data Science presentation for elementary school students
 
Data science and_analytics_for_ordinary_people_ebook
Data science and_analytics_for_ordinary_people_ebookData science and_analytics_for_ordinary_people_ebook
Data science and_analytics_for_ordinary_people_ebook
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Life of a data scientist (pub)
Life of a data scientist (pub)Life of a data scientist (pub)
Life of a data scientist (pub)
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data science
Data scienceData science
Data science
 
Data Science
Data ScienceData Science
Data Science
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project Lifecycle
 
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
Introduction to Data Science (Data Science Thailand Meetup #1)
Introduction to Data Science (Data Science Thailand Meetup #1)Introduction to Data Science (Data Science Thailand Meetup #1)
Introduction to Data Science (Data Science Thailand Meetup #1)
 
Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Python for Data Science - TDC 2015
Python for Data Science - TDC 2015
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
 
Data Science using Python
Data Science using PythonData Science using Python
Data Science using Python
 
Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
 
Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
 

En vedette

Open Access, Open Data. Open Research?
Open Access, Open Data. Open Research?Open Access, Open Data. Open Research?
Open Access, Open Data. Open Research?
Cameron Neylon
 
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
Revolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 

En vedette (20)

Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Open Access, Open Data. Open Research?
Open Access, Open Data. Open Research?Open Access, Open Data. Open Research?
Open Access, Open Data. Open Research?
 
Data Tactics Analytics Brown Bag (Aug 22, 2013)
Data Tactics Analytics Brown Bag (Aug 22, 2013)Data Tactics Analytics Brown Bag (Aug 22, 2013)
Data Tactics Analytics Brown Bag (Aug 22, 2013)
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
Revolution Analytics Supports the Open Source R Community
Revolution Analytics Supports the Open Source R CommunityRevolution Analytics Supports the Open Source R Community
Revolution Analytics Supports the Open Source R Community
 
American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)
 
R2DOCX example
R2DOCX exampleR2DOCX example
R2DOCX example
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics?
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
2013 Future of Open Source - 7th Annual Survey results
2013 Future of Open Source - 7th Annual Survey results2013 Future of Open Source - 7th Annual Survey results
2013 Future of Open Source - 7th Annual Survey results
 
Webinar: Survival Analysis for Marketing Attribution - July 17, 2013
Webinar: Survival Analysis for Marketing Attribution - July 17, 2013Webinar: Survival Analysis for Marketing Attribution - July 17, 2013
Webinar: Survival Analysis for Marketing Attribution - July 17, 2013
 
Be seen! Be cited! Have impact! Open Access and the Academy of Science of SA ...
Be seen! Be cited! Have impact! Open Access and the Academy of Science of SA ...Be seen! Be cited! Have impact! Open Access and the Academy of Science of SA ...
Be seen! Be cited! Have impact! Open Access and the Academy of Science of SA ...
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph Data
 

Similaire à Data Science: Not Just For Big Data

DevelopingDataScienceProfession
DevelopingDataScienceProfessionDevelopingDataScienceProfession
DevelopingDataScienceProfession
Gary Rector
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your Role
Jay Gendron
 

Similaire à Data Science: Not Just For Big Data (20)

The REAL Impact of Big Data on Privacy
The REAL Impact of Big Data on PrivacyThe REAL Impact of Big Data on Privacy
The REAL Impact of Big Data on Privacy
 
Göteborg university(condensed)
Göteborg university(condensed)Göteborg university(condensed)
Göteborg university(condensed)
 
A data view of the data science process
A data view of the data science processA data view of the data science process
A data view of the data science process
 
Big Data and HR - Talk @SwissHR Congress
Big Data and HR - Talk @SwissHR CongressBig Data and HR - Talk @SwissHR Congress
Big Data and HR - Talk @SwissHR Congress
 
The crusade for big data in the AAL domain
The crusade for big data in the AAL domainThe crusade for big data in the AAL domain
The crusade for big data in the AAL domain
 
NPTEL BIG DATA FULL PPT BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
NPTEL BIG DATA FULL PPT  BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...NPTEL BIG DATA FULL PPT  BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
NPTEL BIG DATA FULL PPT BOOK WITH ASSIGNMENT SOLUTION RAJIV MISHRA IIT PATNA...
 
Data Mining With Big Data
Data Mining With Big DataData Mining With Big Data
Data Mining With Big Data
 
Introduction to Big Data and its Potential for Dementia Research
Introduction to Big Data and its Potential for Dementia ResearchIntroduction to Big Data and its Potential for Dementia Research
Introduction to Big Data and its Potential for Dementia Research
 
What's the Value of Data Science for Organizations: Tips for Invincibility in...
What's the Value of Data Science for Organizations: Tips for Invincibility in...What's the Value of Data Science for Organizations: Tips for Invincibility in...
What's the Value of Data Science for Organizations: Tips for Invincibility in...
 
Big Data, Big Opportunities
Big Data, Big OpportunitiesBig Data, Big Opportunities
Big Data, Big Opportunities
 
DevelopingDataScienceProfession
DevelopingDataScienceProfessionDevelopingDataScienceProfession
DevelopingDataScienceProfession
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Data mining
Data miningData mining
Data mining
 
Why Data Science is a Science
Why Data Science is a ScienceWhy Data Science is a Science
Why Data Science is a Science
 
Data Science 101
Data Science 101Data Science 101
Data Science 101
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your Role
 
Big Data and the Social Sciences
Big Data and the Social SciencesBig Data and the Social Sciences
Big Data and the Social Sciences
 
Synthetic Data for Big Data Privacy
Synthetic Data for Big Data PrivacySynthetic Data for Big Data Privacy
Synthetic Data for Big Data Privacy
 
Donders neuroimage toolkit - open science and good practices
Donders neuroimage toolkit -  open science and good practicesDonders neuroimage toolkit -  open science and good practices
Donders neuroimage toolkit - open science and good practices
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data
 

Plus de Revolution Analytics

The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
Revolution Analytics
 

Plus de Revolution Analytics (20)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint package
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Dernier (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Data Science: Not Just For Big Data

  • 1. Revolution Confidential Data Science Not just for big data! David Smith Revolution Analytics @revodavid October 16, 2013
  • 2. Big Data: the new oil? Photo: Sarah&Boston (flickr: pocheco) Creative Commons BY-SA 2.0 Revolution Confidential 2
  • 3. Big Data is just raw material Revolution Confidential  Data Distillation  Extract quantities of interest  Find complete cases  Derive missing information  Big Data Pitfalls:  Data cleanliness & accuracy  Observational bias  Do the data I have represent the population I’m interested in? 3
  • 4. Surveys & Experiments Revolution Confidential  Even with Big Data, the data you need isn’t always in the building!  … so ask (survey)!  Survey design  Stratified sampling  … or experiment!  A/B Testing  Experimental Design 4
  • 5. Data Exploration & Visualization Revolution Confidential  Limited by pixels  Big data = a big black blob  Extract signal from noise     Aggregations Heat maps Smoothing Small multiples 5
  • 6. Statistical Modeling & Forecasting Revolution Confidential  You don’t always need big data  Sampling can help with observational bias  Model selection  Feature extraction  Confounding?  Interactions?  Model validation  Overfitting  Prediction  Extrapolation  Confidence http://xkcd.com/605/ 6
  • 7. Summary Revolution Confidential  Big Data is great, but think of it as the “raw materials” for data science  After refining, “big” isn’t always so “Big”  Use statistical insight to avoid pitfalls:  Inferences: Observational bias / Sampling bias  Predictions: Confounding / Overfitting  Think about variances and means (risk!)  Some data scientists may miss these issues  Look for statistical expertise  Further reading:  ComputerWorld: 12 predictive analytics screw-ups 7