Data Science Crash Course

Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).

Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced, which attendees will be able to run in the Cloudera Data Science Workbench (CDSW).

Objective: To provide a quick, hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, know what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.

Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. The labs run in the cloud, so no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about one hour in). Basic knowledge of Python is highly recommended.

  1. Data Science Crash Course. Robert Hryniewicz (@robhryniewicz), DWS Washington, D.C. 2019. © Hortonworks Inc. 2011–2018. All rights reserved.
  2. INTRO TO DATA SCIENCE
  3. What is Data Science? The scientific exploration of data to extract meaning or insight, using statistics and mathematical models with the end goal of making smarter, quicker decisions.
  4. (no text on this slide)
  5. What is Machine Learning? Favorite cocktail party definitions:
     • 1st: Using statistical analysis of data to build predictive systems without needing to design or maintain explicit rules.
     • Machine Learning is programming with data.
     • Machine Learning is a way to use data to draw meaningful conclusions, including identifying patterns, anomalies and trends that may not be obvious to humans.
     • Machine learning is math, at scale.
  6. Examples where Machine Learning can be applied
     • Healthcare: predict diagnosis; prioritize screenings; reduce re-admittance rates
     • Financial services: fraud detection/prevention; predict underwriting risk; new account risk screens
     • Public sector: analyze public sentiment; optimize resource allocation; law enforcement & security
     • Retail: product recommendation; inventory management; price optimization
     • Telco/mobile: predict customer churn; predict equipment failure; customer behavior analysis
     • Oil & gas: predictive maintenance; seismic data management; predict well production levels
     • Insurance: risk assessment; customer insights/experience; finance real-time analysis
     • Life sciences: genome sequencing; drug development; sensor data
  7. What is an ML Model?
     • Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.
     • E.g. linear regression
       • Goal: fit a line y = mx + c to data points
       • After model training: y = 2x + 5
     • Diagram: Input (1, 0, 7, 2, …) → Model (y = 2x + 5) → Output (7, 5, 19, 9, …)
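
The linear regression example on this slide can be reproduced in a few lines. This is a minimal sketch rather than the workshop's own code: `fit_line` is a hypothetical helper that uses ordinary least squares (one common way to fit a line) on the input and output values shown on the slide.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c with a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    # Intercept: the fitted line passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


inputs = [1, 0, 7, 2]    # values from the slide
outputs = [7, 5, 19, 9]  # values from the slide
m, c = fit_line(inputs, outputs)
print(f"y = {m:g}x + {c:g}")  # prints "y = 2x + 5"
```

Here "training" amounts to estimating the two parameters m and c from the data, which is what the slide means by learning parameters.
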
  8. Types of Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning
  9. Supervised Learning
     • Use labeled (training) datasets to learn the relationship of given inputs to outputs.
     • Once the model is trained, use it to predict outputs on new input data.
     • Diagram: Input, Input, … → Output 1, Output 2, …, Output n
  10. Unsupervised Learning: Explore, classify & find patterns in the input data without being explicit about the output.
  11. Reinforcement Learning
      • The algorithm learns to maximize the rewards it receives for its actions (e.g. maximizes points for investment returns).
      • Use when you don't have lots of training data, you can't clearly define the ideal end state, or the only way to learn is by interacting with the environment.
      • Diagram labels: Algorithm, Environment, Action, Reward, State
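
The reward-maximizing loop described on this slide can be sketched with a multi-armed bandit. This is an illustrative assumption, not the slide's own example: the epsilon-greedy strategy, the three actions, and their payout probabilities are all hypothetical, and the "environment" is reduced to a random reward draw.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-known action, sometimes explore."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was taken
    values = [0.0] * len(reward_probs)  # running average reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore a random action
        else:
            action = max(range(len(values)), key=values.__getitem__)  # exploit
        # The environment returns a reward of 1 with the action's (hidden) probability.
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]  # incremental mean
        total_reward += reward
    return values, counts, total_reward


# Hypothetical payout probabilities; the agent never sees them directly.
values, counts, total_reward = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print("estimated values:", [round(v, 2) for v in values])
print("most rewarding action:", best_action)
```

The agent never receives labeled examples. It only learns from the rewards returned by its own actions, which is the situation the slide describes.
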
  12. ALGORITHMS
  13. Algorithm families
      • Regression: Linear Regression
      • Classification: Logistic Regression, Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes
      • Recommender Systems / Collaborative Filtering: Alternating Least Squares (ALS)
      • Clustering: K-Means, LDA
      • Dimensionality Reduction: Principal Component Analysis (PCA)
      • Deep Learning
        • Fully Connected Neural Nets → Tabular or Recommender Systems
        • Convolutional Neural Nets (CNNs) → Images
        • Recurrent Neural Nets (RNNs) → Natural Language Processing (NLP) / Text
  14. REGRESSION: predicting a continuous-valued output.
      Example: predicting house prices based on number of bedrooms and square footage
      Algorithms: Linear Regression
  15. CLASSIFICATION: identifying which category an object belongs to.
      Examples: spam detection, diabetes diagnosis, text labeling
      Algorithms:
      • Logistic Regression
        • Fast training (linear model)
        • Classes expressed in probabilities
        • Less overfitting [+]
        • Less fitting (accuracy) [-]
      • Support Vector Machines (SVM)
        • "Best" supervised learning algorithm, effective
        • State of the art prior to Deep Learning
        • More robust to outliers than Logistic Regression
        • Handles non-linearity
        • Check out: blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
      • Random Forest (ensemble of Decision Trees)
        • Fast training
        • Handles categorical features
        • Does not require feature scaling
        • Captures non-linearity and feature interaction, i.e. performs feature selection / PCA implicitly
      • Naïve Bayes
        • Good for text classification
        • Assumes independent variables / words
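
The "classes expressed in probabilities" property of logistic regression can be seen in a small sketch. The code below is an assumption-laden illustration, not the workshop's scikit-learn lab: it trains a one-feature model with plain gradient descent, and the toy dataset (a single standardized feature with a binary label) is hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical standardized feature (e.g. a spam score) with labels 0 = ham, 1 = spam.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(xs, ys)
probabilities = [sigmoid(w * x + b) for x in xs]
predictions = [1 if p >= 0.5 else 0 for p in probabilities]
for x, p in zip(xs, probabilities):
    print(f"x = {x:+.1f} -> P(class 1) = {p:.2f}")
```

The model outputs a probability for each input, and the class label comes from thresholding that probability (0.5 here, itself an adjustable choice).
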
  16. CLASSIFICATION: Visual Intro to Decision Trees
      • http://www.r2d3.us/visual-intro-to-machine-learning-part-1
  17. CLUSTERING: automatic grouping of similar objects into sets (clusters).
      Example: market segmentation, i.e. automatically grouping customers into different market segments
      Algorithms: K-means, LDA
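
A minimal K-means sketch for the market segmentation example follows. The customer data is hypothetical, and the deterministic farthest-first initialization is an assumption chosen so the result is reproducible. Library implementations such as scikit-learn's typically use randomized initialization instead.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_init(points, k):
    """Pick the first point, then repeatedly the point farthest from all chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def k_means(points, k, max_iters=100):
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centroids


# Hypothetical customers: (monthly visits, monthly spend in $100s).
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(customers, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No labels are supplied. The two segments emerge from the distances between customers alone, and the number of clusters k has to be chosen up front.
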
  18. COLLABORATIVE FILTERING: fill in the missing entries of a user-item association matrix.
      Applications: product/movie recommendation
      Algorithms: Alternating Least Squares (ALS)
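
To make "fill in the missing entries" concrete, here is a deliberately simplified ALS sketch. It assumes a single latent factor per user and per item (production systems use many), a small hypothetical ratings matrix with one missing entry, and a hypothetical regularization value `reg`.

```python
def als_rank1(ratings, iters=20, reg=0.01):
    """Alternating least squares with one latent factor; None marks a missing rating."""
    n_users, n_items = len(ratings), len(ratings[0])
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iters):
        # Fix the item factors and solve a least squares problem for each user.
        for i in range(n_users):
            seen = [j for j in range(n_items) if ratings[i][j] is not None]
            num = sum(ratings[i][j] * item_f[j] for j in seen)
            den = sum(item_f[j] ** 2 for j in seen) + reg
            user_f[i] = num / den
        # Fix the user factors and solve a least squares problem for each item.
        for j in range(n_items):
            seen = [i for i in range(n_users) if ratings[i][j] is not None]
            num = sum(ratings[i][j] * user_f[i] for i in seen)
            den = sum(user_f[i] ** 2 for i in seen) + reg
            item_f[j] = num / den
    return user_f, item_f


# Hypothetical user-item matrix; user 2 has not rated item 1.
ratings = [
    [2, 1, 2],
    [4, 2, 4],
    [2, None, 2],
]
user_f, item_f = als_rank1(ratings)
predicted = user_f[2] * item_f[1]
print(f"predicted rating for user 2, item 1: {predicted:.2f}")
```

Because user 2 rates the other items exactly as user 0 does, the predicted value lands close to user 0's rating of 1 for that item.
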
  19. DIMENSIONALITY REDUCTION: reducing the number of redundant features/variables.
      Applications:
      • Removing noise in images by selecting only "important" features
      • Removing redundant features, e.g. MPH & KPH are linearly dependent
      Algorithms: Principal Component Analysis (PCA)
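
The MPH and KPH example can be checked numerically. The sketch below assumes hypothetical speed readings and uses the closed-form eigen-decomposition of a 2x2 covariance matrix, which is the core of PCA for two features. A library implementation would handle any number of features.

```python
import math


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def pca_2d(a, b):
    """Eigenvalues and first principal direction of the 2x2 covariance matrix."""
    s_aa, s_bb, s_ab = covariance(a, a), covariance(b, b), covariance(a, b)
    trace = s_aa + s_bb
    det = s_aa * s_bb - s_ab ** 2
    root = math.sqrt(max(trace ** 2 - 4 * det, 0.0))
    lam1, lam2 = (trace + root) / 2, (trace - root) / 2
    # Eigenvector for the larger eigenvalue (assumes the features are correlated).
    vx, vy = s_ab, lam1 - s_aa
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)


mph = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]  # hypothetical speed readings
kph = [v * 1.609344 for v in mph]           # the same speeds in other units

(lam1, lam2), direction = pca_2d(mph, kph)
explained = lam1 / (lam1 + lam2)
print(f"variance explained by the first component: {explained:.6f}")
```

Since one feature is an exact multiple of the other, the first component carries essentially all of the variance, so the second dimension can be dropped without losing information.
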
  20. Deep Learning
  21. (no text on this slide)
  22. Simple/shallow vs Deep Neural Net
  23. Popular Neural Net Architectures
      • Convolutional Neural Nets (CNNs) → Images
      • Recurrent Neural Nets (RNNs) and Long Short-Term Memory (LSTM) → Text / Language (NLP) & Time Series
  24. Output probabilities per digit
      Number       0     1     2     3     4     5     6     7     8     9
      Probability  0.03  0.01  0.04  0.08  0.05  0.08  0.07  0.02  0.54  0.08
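
Turning the probability table on this slide into a prediction takes one step: pick the class with the highest probability. The sketch below uses the slide's numbers, and the variable names are hypothetical.

```python
# Probabilities for digits 0-9, copied from the slide.
digit_probabilities = [0.03, 0.01, 0.04, 0.08, 0.05, 0.08, 0.07, 0.02, 0.54, 0.08]

# The predicted class is the index with the largest probability (argmax).
predicted_digit = max(range(len(digit_probabilities)), key=digit_probabilities.__getitem__)
confidence = digit_probabilities[predicted_digit]

print(f"predicted digit: {predicted_digit} (probability {confidence:.2f})")
# prints "predicted digit: 8 (probability 0.54)"
```
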
  25. scs.ryerson.ca/~aharley/vis/conv/flat.html
  26. Quickly Training Deep Learning Models with Transfer Learning
  27. How to Build a Deep Learning Image Recognition System? Step 1: Download examples to train the model with.
      Example classes: African Bush Elephant, Indian Elephant, Sri Lankan Elephant, Borneo Pygmy Elephant
  28. How to Build a Deep Learning Image Recognition System? Step 2: Augment dataset to enrich training data.
      → Adds 5-10x more training examples
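
Augmentation can be sketched with simple geometric transforms. This is an illustration under assumptions: images are represented as nested lists of pixel values, and only flips and 90-degree rotations are shown. Which transforms are appropriate depends on the data (an upside-down elephant may not be a useful training example), and real pipelines usually add crops, shifts, and color changes.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rotate_90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]


def augment(img):
    """Return the original image plus five transformed copies."""
    r90 = rotate_90(img)
    r180 = rotate_90(r90)
    r270 = rotate_90(r180)
    return [img, flip_horizontal(img), flip_vertical(img), r90, r180, r270]


tiny_image = [[1, 2], [3, 4]]  # hypothetical 2x2 grayscale image
variants = augment(tiny_image)
print(len(variants), "training examples from 1 original")
```

Each transformed copy keeps the original label, so one labeled image yields six training examples here.
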
  29. How to Build a Deep Learning Image Recognition System? Step 3: Check out DAWNBench (dawn.cs.stanford.edu/benchmark), then select and download a pre-trained model.
  30. Sample Architecture of a CNN
      Source: https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html
  31. Sample Architecture of a CNN, with Pretrained Parameters and Random Parameters marked
      Source: https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html
  32. How to Build a Deep Learning Image Recognition System? Step 4: Apply transfer learning to a downloaded model.
      Diagram: INPUT (image) → Pretrained Network (millions of parameters) → Random Parameters → OUTPUT (label: Borneo Pygmy Elephant)
      • Step A: Train Parameters
      • Step B: Adjust Parameters
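
The idea behind Step 4 can be sketched without a deep learning framework. The code below is a toy under stated assumptions: `PRETRAINED_W` is a hypothetical stand-in for a downloaded network's weights, the data is invented, and only the case where the pretrained part stays frozen while a new, randomly initialized output layer is trained is shown (assumed here to correspond to Step A). Fine-tuning the pretrained weights afterwards is not shown.

```python
import math
import random

# Hypothetical "pretrained" weights standing in for a downloaded network.
PRETRAINED_W = ((1.0, -1.0), (0.5, 0.5))


def extract_features(x):
    """Frozen layer: linear map followed by ReLU. These weights are never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_W]


def predict(head, x):
    features = extract_features(x) + [1.0]  # constant 1.0 acts as the bias input
    z = sum(h * f for h, f in zip(head, features))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(head, data):
    total = sum(math.log(predict(head, x)) if y == 1 else math.log(1 - predict(head, x))
                for x, y in data)
    return -total / len(data)


def train_head(data, lr=0.1, epochs=200, seed=0):
    """Train only the new output layer, starting from small random parameters."""
    rng = random.Random(seed)
    head = [rng.uniform(-0.1, 0.1) for _ in range(3)]
    initial_head = list(head)
    for _ in range(epochs):
        grads = [0.0, 0.0, 0.0]
        for x, y in data:
            features = extract_features(x) + [1.0]
            error = predict(head, x) - y
            for k in range(3):
                grads[k] += error * features[k] / len(data)
        head = [h - lr * g for h, g in zip(head, grads)]
    return initial_head, head


# Hypothetical two-feature inputs with binary labels.
data = [((2.0, 0.0), 1), ((1.5, 0.5), 1), ((0.0, 2.0), 0), ((0.5, 1.5), 0)]
initial_head, trained_head = train_head(data)
loss_before = log_loss(initial_head, data)
loss_after = log_loss(trained_head, data)
print(f"log loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Only three parameters are trained here while the "pretrained" ones are reused as they are, which is why transfer learning needs far less data and compute than training a network from scratch.
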
  33. How to Build a Deep Learning Image Recognition System? Step 5: Save the trained model.
      Diagram: INPUT → Pretrained Network (millions of parameters) → Random Parameters → OUTPUT; after Train Parameters and Adjust Parameters, the result is the Trained Model (Neural Net)
  34. How to Build a Deep Learning Image Recognition System? Step 6: Host a trained model on a server and make it accessible via a web app.
      Diagram: the user uploads an image; the web app returns the label (Borneo Pygmy Elephant)
  35. DATA SCIENCE JOURNEY
  36. (no text on this slide)
  37. Start by Asking Relevant Questions
      • Specific (can you think of a clear answer?)
      • Measurable (quantifiable? data driven?)
      • Actionable (if you had an answer, could you do something with it?)
      • Realistic (can you get an answer with data you have?)
      • Timely (answer in reasonable timeframe?)
  38. Data Preparation
      1. Data analysis (audit for anomalies/errors)
      2. Creating an intuitive workflow (formulate sequence of prep operations)
      3. Validation (correctness evaluated against a representative sample dataset)
      4. Transformation (actual prep process takes place)
      5. Backflow of cleaned data (replace original dirty data)
      Approx. 80% of a Data Analyst's job is Data Preparation!
      Example of multiple values used for U.S. states → California, CA, Cal., Cal
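
The U.S. state example maps naturally onto a small cleaning function. This is a minimal sketch: the alias table and the `normalize_state` helper are hypothetical, and the extra value "Calif" is added only to show how unmapped values can be flagged for review rather than silently guessed.

```python
# Hypothetical alias table; a real one would cover every state and known variant.
STATE_ALIASES = {"california": "CA", "ca": "CA", "cal": "CA"}


def normalize_state(value):
    """Return the canonical state code, or None if the value is not recognized."""
    key = value.strip().rstrip(".").lower()
    return STATE_ALIASES.get(key)


raw_values = ["California", "CA", "Cal.", "Cal", "Calif"]
cleaned = [normalize_state(v) for v in raw_values]
# Audit step: collect anything the alias table could not map.
unmapped = [v for v, c in zip(raw_values, cleaned) if c is None]

print(cleaned)   # ['CA', 'CA', 'CA', 'CA', None]
print(unmapped)  # ['Calif']
```
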
  39. Visualizing Data
      • https://www.autodeskresearch.com/publications/samestats
  40. Feature Selection
      Q: Which features should you use to create a predictive model?
      • Also known as variable or attribute selection
      • Why important?
        • Simplification of models → easier to interpret by researchers/users
        • Shorter training times
        • Enhanced generalization by reducing overfitting
      • Dimensionality reduction vs feature selection
        • Dimensionality reduction: create new combinations of attributes
        • Feature selection: include/exclude attributes in data without changing them
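
A simple filter illustrates the "include/exclude attributes without changing them" side of that distinction. The sketch below assumes one basic criterion, dropping features with no variance, and uses hypothetical house data. Practical feature selection usually also weighs each feature's relationship to the target.

```python
from statistics import pvariance


def select_by_variance(columns, threshold=0.0):
    """Keep features whose variance exceeds the threshold; values are left untouched."""
    return {name: values for name, values in columns.items() if pvariance(values) > threshold}


# Hypothetical house features; "has_roof" is constant and carries no information.
houses = {
    "bedrooms": [2, 3, 4, 3],
    "square_feet": [900, 1400, 2100, 1500],
    "has_roof": [1, 1, 1, 1],
}
selected = select_by_variance(houses)
print(list(selected))  # ['bedrooms', 'square_feet']
```

The surviving columns are the original ones, unlike dimensionality reduction, which would replace them with new combinations.
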
  41. Hyperparameters
      • Define higher-level model properties, e.g. complexity or learning rate
      • Cannot be learned during training → need to be predefined
      • Can be decided by
        • setting different values
        • training different models
        • choosing the values that test better
      • Hyperparameter examples
        • Number of leaves or depth of a tree
        • Number of latent factors in a matrix factorization
        • Learning rate (in many models)
        • Number of hidden layers in a deep neural network
        • Number of clusters in a k-means clustering
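
The "set different values, train different models, choose the values that test better" recipe can be sketched as a tiny search. Everything here is an assumption for illustration: the model is a one-dimensional k-nearest-neighbors classifier, the hyperparameter is k, and the training and validation data (including one deliberately noisy label) are hypothetical.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (binary labels, odd k)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, validation, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


# Hypothetical data: (feature, label). The point (3, 1) is a noisy label.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
validation = [(2.9, 0), (3.4, 0), (6.4, 1), (8.5, 1)]

candidate_ks = [1, 3, 7]
scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(candidate_ks, key=scores.get)
print(scores)            # {1: 0.5, 3: 1.0, 7: 0.5}
print("best k:", best_k)  # best k: 3
```

With k = 1 the model follows the noisy label, and with k = 7 it averages over too many neighbors. Scoring on held-out validation data exposes both problems, which is why the candidates are compared on data not used for training.
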
  42. Residuals
      • The residual of an observed value is the difference between the observed value and the estimated value
      R² (R Squared): Coefficient of Determination
      • Indicates goodness of fit
      • An R² of 1 means the regression line perfectly fits the data
      RMSE (Root Mean Square Error)
      • Measure of the differences between values predicted by a model and values actually observed
      • Good measure of accuracy, but only for comparing the errors of different models on the same variable, because it is scale-dependent
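
The three measures on this slide follow directly from their definitions. In the sketch below the observed values are the outputs from the earlier linear regression slide, while the predicted values are hypothetical, chosen so that every prediction is off by exactly one.

```python
import math


def residuals(observed, predicted):
    """Observed value minus estimated value, per data point."""
    return [o - p for o, p in zip(observed, predicted)]


def r_squared(observed, predicted):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum(r ** 2 for r in residuals(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def rmse(observed, predicted):
    """Square root of the mean squared residual, in the units of the target."""
    res = residuals(observed, predicted)
    return math.sqrt(sum(r ** 2 for r in res) / len(res))


observed = [7, 5, 19, 9]
predicted = [6, 6, 18, 10]  # hypothetical model output

print(residuals(observed, predicted))             # [1, -1, 1, -1]
print(round(r_squared(observed, predicted), 4))   # 0.9655
print(rmse(observed, predicted))                  # 1.0
```

RMSE is expressed in the units of the target (here, an error of 1), whereas R² is unitless, which is why RMSE values are only comparable between models predicting the same variable.
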
  43. With that in mind…
      • No simple formula for "good questions", only general guidelines
      • The right data is better than lots of data
      • Understanding relationships matters
  44. Enterprise Data Science @ Scale
      • Enterprise-Grade: leverage enterprise-grade security, governance and operations
      • Tools: enhance productivity by enabling data scientists to use their favorite tools, technologies and libraries
      • Deployment: compress the time to insight by deploying models into production faster
      • Data: build more robust models by using all the data in the data lake
  45. Thanks! Robert Hryniewicz @robhryniewicz
  46. Machine Learning Books
      Easier intro books (less math):
      • The Hundred-Page Machine Learning Book by Andriy Burkov
      • Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron
      • Deep Learning with Python by François Chollet
      • Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies by John D. Kelleher, Brian Mac Namee, Aoife D'Arcy
      More thorough books (more math):
      • Deep Learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville
      • Information Theory, Inference and Learning Algorithms, 1st Edition, by David J. C. MacKay
  47. (no text on this slide)

Editor's notes

  • Specific: Can you think of what an answer to your question would look like? The more clearly you can see it, the more specific the question is.
    Measurable: Is the answer something you can quantify? It's hard to make data-driven decisions based on things you can't measure.
    Actionable: If you had the answer to your question, could you do something useful with it? If not, you don't necessarily have a bad question, but you may not want to expend a lot of resources answering it.
    Realistic: Can you get an answer to your question with the data you have? If not, can you get the data that would get you an answer?
    Timely: Can you get an answer in a reasonable time frame, or at least before you need it? This is usually not a big issue, but if you operate on a tight schedule, you may need to think about it.
