
iTrain Malaysia: Data Science by Tarun Sukhani


  1. 1. Data Science APPLICATION AND OPPORTUNITY Prepared By: Tarun Sukhani
  2. 2. WHAT IS DATA SCIENCE & BIG DATA? Data Science is an interdisciplinary field that combines statistics, computer science, and operations research. It has numerous applications in areas such as fintech, genomics, and even the social sciences, to name just a few. Big Data is data science applied to large data sets, usually in the terabyte range and above. It has its roots in Web 2.0, which emphasized user-generated content and thus produced greater variety, volume, and velocity of data.
  3. 3. DATA SCIENCE CORE COMPONENTS
  4. 4. BIG DATA – THE 4 V’S
  5. 5. BIG DATA – UNPRECEDENTED GROWTH
  6. 6. WHAT IS A DATA SCIENTIST?
  7. 7. DATA SCIENCE VENN DIAGRAM Hacking Skills: A proper mathematical background and domain expertise may not be sufficient to succeed as a data scientist. The ability to combine different tools and visualizations is key to becoming an effective data scientist. Math & Statistics: Computer science, math, statistics, and linear algebra provide a solid foundation from which a data scientist can draw the knowledge needed to analyze data sets. SME & Job Experience: There is no substitute for solid work experience as a business analyst, programmer, and/or statistician in the domain in which you are applying your skills and knowledge. The absence of such experience can lead to biased statistical models or irrelevant conclusions.
  8. 8. WHAT DOES A GOOD DATA SCIENTIST LOOK LIKE? • Inquisitive – skeptical and curious • Knowledgeable – knows machine learning, statistics, and probability • Scientific Method – creates hypotheses, tests them, and updates understanding • Coding – is good at coding, hacking, and general programming • Product Oriented – knows how to build data products and visualizations that make data understandable to mere mortals • Domain Knowledge – understands the business and how to tell the relevant story from business data; able to find answers to known unknowns
  9. 9. T-SHAPED SKILLSET Broad-range Generalist • Deep Expertise: Machine Learning, Statistics, Domain Knowledge
  10. 10. DATA SCIENTIST ROLES
  11. 11. DATA SCIENTIST ROADMAP
  12. 12. DEMAND & OPPORTUNITY Data Science has been dubbed by the Harvard Business Review (Thomas H. Davenport and D.J. Patil, October 2012) as… “The Sexiest Job of the 21st Century” https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century And by the New York Times (April 11, 2013) as a… “hot new field [that] promises to revolutionize industries from business to government, healthcare to academia” Data Science, however, is NOT NEW! It’s basically just data mining rebranded.
  13. 13. DEMAND & OPPORTUNITY Glassdoor identified Data Scientist as the top job for work-life balance in 2015 (out of 25), and the one with the highest salary (in the USA): 1. Data Scientist • Work-Life Balance Rating: 4.2 (out of 5) • Salary: $114,808 (highest salary) • Number of Job Openings: 1,315 (highest in the top 9) https://www.glassdoor.com/blog/25-jobs-worklife-balance-2015/ According to McKinsey, there will be a shortage of the talent needed to take advantage of data science and big data. By 2018, the USA alone could face a shortage of 140,000-190,000 skilled data scientists and 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions. http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
  14. 14. DATA SCIENCE PRINCIPLES 1. Socio-Technical Systems are complex! 2. Data is never at rest 3. Data is dirty, deal with it! 4. SVoT = LOL! (Single Version of Truth) 5. Data munging/wrangling & data wrestling > 70% time – this is the reality of the data scientist 6. Simplification. Reduction. Distillation. 7. Curiosity. Empiricism. Skepticism.
  15. 15. KNOWNS AND UNKNOWNS There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know. Donald Rumsfeld
  16. 16. DIKUW
  17. 17. APPLICATIONS OF DATA SCIENCE
  18. 18. APPLICATIONS OF DATA SCIENCE Data-Driven Decision Making (DDD) refers to the practice of basing decisions on data, rather than purely on intuition. Source: Data Science for Business, O’Reilly Media
  19. 19. APPLICATIONS OF DATA SCIENCE PROCESS FLOW DIAGRAM
  20. 20. APPLICATIONS OF DATA SCIENCE BUSINESS
  21. 21. APPLICATIONS OF DATA SCIENCE BUSINESS
  22. 22. APPLICATIONS OF DATA SCIENCE BUSINESS
  23. 23. APPLICATIONS OF DATA SCIENCE BUSINESS
  24. 24. APPLICATIONS OF DATA SCIENCE BUSINESS
  25. 25. APPLICATIONS OF DATA SCIENCE BUSINESS
  26. 26. APPLICATIONS OF DATA SCIENCE BUSINESS
  27. 27. APPLICATIONS OF DATA SCIENCE BUSINESS
  28. 28. APPLICATIONS OF DATA SCIENCE SPORTS
  29. 29. APPLICATIONS OF DATA SCIENCE HEALTHCARE
  30. 30. APPLICATIONS OF DATA SCIENCE RETAIL
  31. 31. APPLICATIONS OF DATA SCIENCE RETAIL
  32. 32. APPLICATIONS OF DATA SCIENCE RESEARCH
  33. 33. DATA-DRIVEN ORGANIZATION Organizations become data-driven by developing data products. What is a data product? • Curated and crafted from raw data • A result of exploration and iterations • A machine that learns from data • An answer to known unknowns or unknown unknowns • A mechanism that triggers immediate business value • A probabilistic window of future events or behavior
  34. 34. DEVELOPING DATA PRODUCTS • OBJECTIVES: What outcome am I trying to achieve? • LEVERS: What inputs can we control? • DATA: What data can we collect? • MODELS: How do the levers influence the objectives?
  35. 35. DEVELOPING DATA PRODUCTS • THE WORLD: 1. Product Manufactured 2. Goods Shipped 3. Product Purchased 4. Phone Calls Made 5. Energy Consumed 6. Fraud Committed 7. Repair Requested 8. System • INGEST RAW DATA: 1. Transactions 2. Web-scraping 3. Web-clicks & logs 4. Sensor data 5. Mobile data 6. Docs, Email, XLS 7. Social Feeds, RSS 8. Flume & Sqoop • MUNCH DATA: 1. MapReduce 2. ETL/ELT 3. Data Wrangle 4. Data Cleansing 5. Dim. Reduction 6. Sample 7. Select, Join, Bind • THE DATASET: 1. Independence? 2. Correlation? 3. Covariance? 4. Causality? 5. Dimensionality? 6. Missing Values? 7. Relevancy? • 1. Known Unknowns? 2. We’d like to know… 3. Outcomes? 4. What data? 5. Hypothesis? © Tarun Sukhani
  36. 36. DEVELOPING DATA PRODUCTS • LEARN FROM DATA: 1. Description & Inference 2. Data & Algorithm Models 3. Machine Learning 4. Networks & Graphs 5. Regression & Prediction 6. Classification & Clustering 7. Experiments & Iteration • DATA PRODUCT: 1. Objectives 2. Levers 3. Modeling 4. Simulation 5. Optimization 6. Visualization • VISUALIZE INSIGHT: 1. Actionable 2. Predictive 3. Immediate Impact 4. Business Value 5. Easy to Explain • DELIVER INSIGHT • EXPLORE DATA • THE DATASET • REPRESENT DATA • DISCOVER DATA
  37. 37. DEVELOPING DATA PRODUCTS • DATA • MODELER • SIMULATOR • OPTIMIZER • What Outcome Am I Trying to Achieve? • Actionable Outcome • The Model Assembly Line
  38. 38. DATA SCIENCE AS A CAREER
  39. 39. DATA SCIENCE AS A CAREER DJ Patil, Chief Data Scientist of the United States, is the perfect prototype of the data scientist. He brings a deep understanding of mathematics from his Ph.D. in applied mathematics. He has created multiple data products and collaborated with people in various data science roles. He has headed up strategy and led teams that built entirely new extensions of LinkedIn’s data, from the creation of “People You May Know” to Talent Match, a function that automatically sources the best candidate for any job posted on LinkedIn. Doug Cutting, creator of Hadoop and Chief Architect at Cloudera, has dedicated his time to creating technical solutions that store and process data at scale. Hadoop is widely used to distribute data across multiple servers so that huge data sets become manageable. Doug Cutting is the prototypical example of a data engineer, and Cloudera is one of the largest data engineering organizations in the world.
  40. 40. DATA SCIENCE EDUCATION FRAMEWORK LEARN TO CODE • HIGH-LEVEL: PYTHON, R, JULIA • LOWER-LEVEL: JAVA, SCALA/CLOJURE, C++/GO
  41. 41. DATA SCIENCE EDUCATION FRAMEWORK LEARN MATHEMATICS & STATISTICS • MATHEMATICS: LINEAR ALGEBRA (MATRIX FACTORIZATION), CALCULUS (INTEGRALS, DERIVATIVES, ETC), GRAPH THEORY, PROBABILITY/COMBINATORICS • STATISTICAL ANALYSIS: DISTRIBUTIONS (BINOMIAL, NORMAL, POISSON, ETC), SUMMARY STATISTICS (MEAN, VARIANCE, ETC), HYPOTHESIS TESTING (P-VALUE, CHI-SQUARE, ETC), BAYESIAN ANALYSIS
  42. 42. DATA SCIENCE EDUCATION FRAMEWORK LEARN MACHINE LEARNING AND SOFTWARE ENGINEERING • MACHINE LEARNING: SUPERVISED (SVM, RANDOM FOREST), UNSUPERVISED (K-MEANS, LDA), NLP/INFORMATION RETRIEVAL, VALIDATION, MODEL COMPARISON • SOFTWARE ENGINEERING: ALGORITHMS & DATA STRUCTURES, DATA VISUALIZATION, DATA MUNGING/WRANGLING, DISTRIBUTED COMPUTING
  43. 43. DATA SCIENCE EDUCATION FRAMEWORK YOU DON’T NEED A PHD TO DO DATA SCIENCE!
  44. 44. DATA SCIENCE EDUCATION FRAMEWORK
  45. 45. DATA SCIENCE EDUCATION FRAMEWORK
  46. 46. DEMO & Q/A
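
Slide 35 lists questions to ask of a dataset, among them missing values, covariance, and correlation. As a rough illustration only (not part of the original deck), the sketch below answers three of those questions with the Python standard library. The toy data and every function name are hypothetical.

```python
from math import sqrt

def drop_missing(rows):
    """Keep only rows where every field is present (not None)."""
    return [row for row in rows if all(value is not None for value in row)]

def missing_fraction(rows):
    """Fraction of rows that have at least one missing field."""
    if not rows:
        return 0.0
    incomplete = sum(1 for row in rows if any(value is None for value in row))
    return incomplete / len(rows)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical toy data: (ad_spend, units_sold); None marks a missing value.
rows = [(1.0, 2.0), (2.0, 4.0), (None, 5.0), (3.0, 6.0), (4.0, None), (5.0, 10.0)]

complete = drop_missing(rows)
spend = [r[0] for r in complete]
sold = [r[1] for r in complete]

print("missing fraction:", missing_fraction(rows))
print("covariance:", covariance(spend, sold))
print("correlation:", correlation(spend, sold))
```

Dropping incomplete rows is only one way to handle missing values, chosen here for brevity. A high correlation also says nothing about the "Causality?" question on the same slide, which needs experiments or domain knowledge to answer.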
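
Slides 34 and 37 describe a data product in terms of objectives, levers, data, and models, followed by a modeler, a simulator, and an optimizer. One possible reading of that pipeline is sketched below, assuming a single lever (price), a revenue objective, and a straight-line demand model fitted by least squares. The numbers, the function names, and the choice of a linear model are all assumptions made for illustration; the deck does not prescribe them.

```python
def fit_line(xs, ys):
    """Modeler: ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def simulate_revenue(price, intercept, slope):
    """Simulator: predicted revenue at a candidate price (demand floored at 0)."""
    demand = max(0.0, intercept + slope * price)
    return price * demand

def best_price(candidates, intercept, slope):
    """Optimizer: the candidate lever setting with the highest simulated objective."""
    return max(candidates, key=lambda p: simulate_revenue(p, intercept, slope))

# Data: hypothetical past observations of the lever (price) and the demand seen.
prices = [10.0, 12.0, 14.0, 16.0]
demand = [80.0, 70.0, 60.0, 50.0]

intercept, slope = fit_line(prices, demand)      # model: how the lever moves demand
candidates = [float(p) for p in range(5, 26)]    # lever settings we could choose
choice = best_price(candidates, intercept, slope)

print("fitted demand = %.1f + %.1f * price" % (intercept, slope))
print("price with highest simulated revenue:", choice)
```

The candidate grid extends beyond the observed price range, so simulated values there are extrapolations and should be treated with the skepticism that slide 14 recommends.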
