SlideShare une entreprise Scribd logo
1  sur  27
WINNING WITH BIG  DATA Secrets of the Successful Data Scientist Making Data Work June 9, 2010 Michael Driscoll @dataspora
WHY DATA MATTERS
THE INDUSTRIAL AGE  OF  DATA
WHAT IS  BIG DATA? Data that is distributed.
WHAT IS DATA  SCIENCE?
NINE WAYS  TO WIN
1.  CHOOSE THE RIGHT TOOL You don’t need a chainsaw to cut butter.
2. COMPRESS  EVERYTHING mysqldump -u myuser -p mypasssourceDB| br />gzip| sshmike@dataspora.com "cat - | br />gunzip | mysql-u myuser -p mypasstargetDB" The world is IO-bound.
3. SPLIT UP YOUR DATA Split, apply, combine. See  Hadley Wickham’s paper at http://had.co.nz/plyr/plyr-intro-090510.pdf
4. WORK  WITH SAMPLES perl -ne "print if (rand() < 0.01)"   data.csv > sample.csv Big Data is heavy,  samples are light.
5.  USE STATISTICS
COPY FROM OTHERS git clone git://github.com/kevinweil/hadoop-lzo Use open source.
7. ESCAPE CHART TYPOLOGIES Charts are compositions, not containers.
8. USE COLOR WISELY Color can enhance  or insult.
9. TELL A STORY People are listening.
ONE  SUCCESS STORY
WHY DO TELCO CUSTOMERS LEAVE? Sign up Leave Goal:  “less churn.”
DATA: BILLIONS OF CALLS … and millions of callers.
DOES CALL  QUALITY MATTER? … a difference, but not significant.
WHAT ABOUT SOCIAL NETWORKS? Hmmm...
BUILD THE  CALL GRAPH … but is it predictive?
EVOLUTION OF A CALL GRAPH April
EVOLUTION OF A CALL GRAPH May
EVOLUTION OF A CALL GRAPH June
EVOLUTION OF A CALL GRAPH July
700% INCREASE IN CHURN when a cancellation occurs in a call network.
THANKS! QUESTIONS? Michael Driscoll twitter @dataspora http://www.dataspora.com/blog Making Data Work June 9, 2010

Contenu connexe

En vedette

Introduction to R for Data Mining
Introduction to R for Data MiningIntroduction to R for Data Mining
Introduction to R for Data Mining
Revolution Analytics
 

En vedette (10)

Multi Level Modelling&amp;Weights Workshop Kiel09
Multi Level Modelling&amp;Weights Workshop Kiel09Multi Level Modelling&amp;Weights Workshop Kiel09
Multi Level Modelling&amp;Weights Workshop Kiel09
 
A Survey Of R Graphics
A Survey Of R GraphicsA Survey Of R Graphics
A Survey Of R Graphics
 
ForecastIT 6. Multi-Variable Linear Regression
ForecastIT 6. Multi-Variable Linear RegressionForecastIT 6. Multi-Variable Linear Regression
ForecastIT 6. Multi-Variable Linear Regression
 
Social Network Analysis for Telecoms
Social Network Analysis for TelecomsSocial Network Analysis for Telecoms
Social Network Analysis for Telecoms
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph Data
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Introduction to R for Data Mining
Introduction to R for Data MiningIntroduction to R for Data Mining
Introduction to R for Data Mining
 

Similaire à Winning with Big Data: Secrets of the Successful Data Scientist

Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Matt Stubbs
 
Gluecon miller horizon
Gluecon miller horizonGluecon miller horizon
Gluecon miller horizon
Mike Miller
 
Satyam open analytics nyc
Satyam open analytics nycSatyam open analytics nyc
Satyam open analytics nyc
Open Analytics
 
The New Era of Cognitive Computing
The New Era of Cognitive ComputingThe New Era of Cognitive Computing
The New Era of Cognitive Computing
IBM Research
 

Similaire à Winning with Big Data: Secrets of the Successful Data Scientist (20)

Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
 
Around Data Science
Around Data ScienceAround Data Science
Around Data Science
 
Literacy in the Age of Big Data
Literacy in the Age of Big DataLiteracy in the Age of Big Data
Literacy in the Age of Big Data
 
Version Control in AI/Machine Learning by Datmo
Version Control in AI/Machine Learning by DatmoVersion Control in AI/Machine Learning by Datmo
Version Control in AI/Machine Learning by Datmo
 
Big Data Maturity and its Evolution
Big Data Maturity and its EvolutionBig Data Maturity and its Evolution
Big Data Maturity and its Evolution
 
Gluecon miller horizon
Gluecon miller horizonGluecon miller horizon
Gluecon miller horizon
 
Around Data Science (v. 2021 ITA)
Around Data Science (v. 2021 ITA)Around Data Science (v. 2021 ITA)
Around Data Science (v. 2021 ITA)
 
Satyam open analytics nyc
Satyam open analytics nycSatyam open analytics nyc
Satyam open analytics nyc
 
Big data - Aditya Yadav
Big data - Aditya YadavBig data - Aditya Yadav
Big data - Aditya Yadav
 
Big data myths busted
Big data myths bustedBig data myths busted
Big data myths busted
 
big data
big databig data
big data
 
BigData Analytics
BigData AnalyticsBigData Analytics
BigData Analytics
 
HadoopWorkshopJuly2014
HadoopWorkshopJuly2014HadoopWorkshopJuly2014
HadoopWorkshopJuly2014
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Horizon 20110928
Horizon 20110928Horizon 20110928
Horizon 20110928
 
Introduction Data Warehouse With BigQuery
Introduction Data Warehouse With BigQueryIntroduction Data Warehouse With BigQuery
Introduction Data Warehouse With BigQuery
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computing
 
Big data
Big dataBig data
Big data
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
The New Era of Cognitive Computing
The New Era of Cognitive ComputingThe New Era of Cognitive Computing
The New Era of Cognitive Computing
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 

Winning with Big Data: Secrets of the Successful Data Scientist

Notes de l'éditeur

  1. If you had to put your finger on the beginning of the information age, it might be the creation of the first telegraph in 1792, in France, by a pair of brothers.The first time that man-made information began at the speed of light, over long distances.Cars, cash registers, subway turnstyles, gene chips, TiVos, and cell phones are streaming billions of data points.Prof. Joe Hellerstein of Berkeley has dubbed it “The Industrial Revolution of Data” – where machines, not people, are the dominant producers of data.
  2. In this talk I’m also going to be talking about tools for medium data; b/c these translate well into the Big Data space.
  3. In this talk I’m also going to be talking about tools for medium data; b/c these translate well into the Big Data space.I’m defining data Science is: applying tools to data to answer questions. It is at the intersection of these tools. And it is a growing field, because data is getting bigger, and our tools are getting better. (Suffice to say, the questions we ask have been around since time immemorial: whoAnother word for questions is hypotheses.
  4. Do you really need Hadoop for that job? Think twice about it.Can you do everything on one machine?Escalate only as necessary… don’t solve problems that don’t yet exist.At the same time, optimize for scalability, not performance. Cleverness is usually punished in the long run.
  5. Compressing gives you a 6-8x bump immediately in network and disk IO, out of the gate.This example also illustrates another piece: avoid hitting disk at all costs.If you’re working on the cloud,
  6. This is the essence of parallelism: find some independent dimension on which to split your data.* Even your data isn’t in a database, split it up the old-fashioned way – one file per hour, day, or month, depending on its size – these often form natural samples to work from.* Learn &amp; understand how to partition, shard, or otherwise distribute your data in a database.* Parallel load is your friend: Several databases have parallel load features; Hadoop has distcp.
  7. do you want to moving GBs and TBs around?sometimes you want to visualize and work on the data locally…so sample!* reservoir sampling is a fixed-memory algorithm for achieving a defined-sized sample* the above illustrates how to get a basic 1% uniform sample method in a perl one-liner
  8. When we compare two real-valued measures, they will almost always be different.The critical question is: How confident are we in the difference? Is it significant?
  9. Don’t reinvent the wheel, steal someone else’s wheels of 1s and 0s.Statistics is hard – so go ahead &amp; use someone else’s stuff. Go ahead. It’s there. That what’s great about R. 2000 statistical libraries written by professors.
  10. Not machines, people.
  11. Okay, now I want you to try and forget everything you just heard about base graphics.ggplot2 is a new visualization package formally released in 2009, developed by Professor Hadley Wickham.It is a based a different perspective of developing graphics, and has its own set of functions and parameters.
  12. Most telcos lose 1-2% of their customers every month.It’s 7x more expensive to acquire a customer, than to retain.
  13. Not machines, people.
  14. This illustrates what we said earlier: statistics matters. We needed to rule this out.(If anything the correlation occurs opposite of what we expected).
  15. “A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
  16. Windowing functions in Greenplum, which is a modified Postgres distributed database.
  17. “A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
  18. “A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
  19. “A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
  20. “A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.
  21. “A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009.Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package.I’ll also discuss some methods for visualizing large data sets.I’ll end with an overview of Rapache, a tool for embedding R in web applications.For questions beyond this talk, I can be contacted at:Michael E Driscollhttp://www.dataspora.commike@dataspora.com.