
Data Science 101

We've been taught that "data science" is the esoteric domain of PhDs,
but like anything else, it's easy once you understand it. This talk
explains the basics of data science, covering concepts in supervised
learning (including a detailed explanation of decision trees and
random forests) as well as examples of unsupervised learning
algorithms. Far from being a dry and academic topic, data science and machine learning are useful and practical analytical tools. (This talk is intended for a general audience.)

Topics will include:

1) An introduction to supervised learning using the popular decision
tree algorithm

2) The concepts of training and scoring, and the meaning of "real time"
machine learning

3) Model validation using holdout sets

4) Model complexity and overfitting; understanding bias and variance;
using ensembles to reduce variance

5) An overview of unsupervised learning models including clustering,
topic modeling and anomaly detection

and more!


Data Science 101

  1. Data Science 101 (David Gerster, Strategic Advisory Board)
  2. About me • 10+ years of experience in data science at various consumer web companies • Worked on web search at Yahoo and Microsoft • Led the mobile data science team at Groupon • Joined BigML as VP Data Science in July 2013 • Joined JLL Spark as VP Data in July 2017 • Advisor to High Fidelity Genetics
  3. Finding meaningful patterns in data • The famous “Iris” data set has measurements for 150 flowers • Given a flower’s measurements, can we predict its species? [Photos: Iris setosa, Iris versicolor, Iris virginica]
  4. [Scatter plot: Petal Width (cm) vs. Petal Length (cm); Iris setosa in red, Iris versicolor in green, Iris virginica in blue]
  5. [Scatter plot: Petal Width (cm) vs. Petal Length (cm)] Congratulations! You just trained a model.
  6. [Scatter plots: Petal Width (cm) vs. Petal Length (cm), with four new points] Prediction: Iris setosa; Prediction: Iris versicolor; Prediction: Iris virginica; Prediction: Iris virginica
  7. [Scatter plot: Petal Width (cm) vs. Petal Length (cm)] Prediction: Iris setosa; Prediction: Iris versicolor; Prediction: Iris virginica; Prediction: Iris virginica. Congratulations! You just scored four new flowers using your model, and made a prediction about the species of each one.
  8. Training versus Scoring • This process had two steps: training and scoring • When training on historical data, you’re using data gathered over some length of time • When scoring new data points, you want the answer immediately (in “real time”)
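To make the two steps concrete, here is a minimal sketch in Python with scikit-learn (an assumption; the talk demos BigML and shows no code):

```python
# Minimal sketch of training vs. scoring on the Iris data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Training: fit the model once on historical data.
model = DecisionTreeClassifier().fit(iris.data, iris.target)

# Scoring: predict a new flower's species immediately ("real time").
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # sepal length/width, petal length/width (cm)
print(iris.target_names[model.predict(new_flower)[0]])
```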
  9. [Two rectangles drawn on the scatter plot: one predicts “blue” with high confidence and explains a large chunk of the data (high support); the other predicts “blue” with low confidence and explains a small chunk of the data (low support)]
  10. Support and Confidence • A rectangle with a large number of data points has high “support” • A rectangle that is purely one color has high “confidence” • If there is a small number of data points, confidence is low even if it’s purely one color
  11. “Decision Tree” [Diagram over the scatter plot. Root (50 red, 50 blue, 50 green): Width <= 0.8? → leaf with 50 red; Width > 0.8? → node with 50 blue, 50 green. That node: Width > 1.75? → leaf with 45 blue; Width <= 1.75? → node with 5 blue, 50 green. That node: Length <= 5? → leaf with 1 blue, 48 green; Length > 5? → leaf with 4 blue, 2 green. The terminal boxes are the “Leaf Nodes”]
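A sketch of how such a tree could be trained and inspected in Python (scikit-learn assumed; on Iris it typically recovers the same 0.8 and 1.75 thresholds shown above, but treat the output as illustrative):

```python
# Fit a small decision tree on petal length and width only,
# then print its split rules (compare with the slide's diagram).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:4]  # columns 2 and 3: petal length (cm), petal width (cm)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, iris.target)
print(export_text(tree, feature_names=["petal length", "petal width"]))
```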
  12. • Data is just a table of values • Each row is an instance, an example of the concept to be learned • Each column is an attribute or feature of the instance • The column we want to predict is the label or output • Because we have a label, this is supervised learning [Table annotations: rows marked “instance”, columns marked “feature”, the last column marked “label”]
  13. Demo: The General Social Survey • Sociology survey given in the United States since 1972 • Data is 39,000 responses, almost 400 questions each • Demographic data like income, race, gender, education, marital status • Many questions about personal beliefs • “Should an atheist be allowed to teach college, or not?” • “Are we spending the right amount of money on education?” • Can we predict income from these responses?
  14. How good is our model? • The model looks good, but how do we quantify this?
  15. [Diagram: 100% of data split into an 80% training set and a 20% holdout set; 3 out of 4 holdout predictions are correct, so Accuracy = 75%] 1. Train a model using the 80% training set 2. Pretend the 20% holdout is new data, and feed it to the model 3. Check the accuracy of the predictions
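The same three steps in code (a sketch; scikit-learn's train_test_split is assumed, but any split utility works):

```python
# Holdout evaluation: train on 80% of the data, score the held-out 20%.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 1. Split: 80% training set, 20% holdout set.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Train on the 80%, then pretend the 20% is new data.
model = DecisionTreeClassifier().fit(X_train, y_train)

# 3. Accuracy = fraction of holdout predictions that are correct.
print("Accuracy:", model.score(X_hold, y_hold))
```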
  16. Predicting political views • What happens if we predict political views instead of income? • A different subset of variables becomes important!
  17. [Image-only slide]
  18. Finding the important variables
  19. [Image-only slide]
  20. The Value of Predictive Modeling • Provides deep insight into your data • Finds the small subset of important variables • Extremely useful for business!
  21. Demo: The StumbleUpon Dataset • StumbleUpon is an app that recommends web pages • A dataset of 7,400 web pages is provided, with each page labeled as either “evergreen” or “ephemeral” • We want to predict the page’s class using this historical data • “While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as ‘ephemeral’ or ‘evergreen’.”
  22. Training a model on StumbleUpon data • Live demo: training a model on StumbleUpon data • Key concepts: • “Bag of words” text analysis • Evaluating the model using a holdout set • Combining multiple models to improve accuracy • The “ensemble” of multiple models has better accuracy!
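A sketch of the “bag of words” idea (hypothetical page snippets; scikit-learn's CountVectorizer stands in for BigML's built-in text analysis):

```python
# Bag of words: each page's text becomes a vector of word counts,
# ignoring word order entirely.
from sklearn.feature_extraction.text import CountVectorizer

pages = [
    "easy chocolate cake recipe",              # likely evergreen
    "election results announced late today",   # likely ephemeral
    "classic chocolate cake recipe for kids",  # likely evergreen
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(pages)  # rows = pages, columns = words
print(vectorizer.get_feature_names_out())
print(counts.toarray())
```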
  23. “Ensembles” of Models • Training multiple models on random subsets of the data gave us a better result! • Why?
  24. Bias and Variance • We train a model with the goal of fitting it correctly to the data • When a model isn’t flexible enough, it may underfit the data, and we say it has high bias • When a model is too flexible, it may overfit the data, and we say it has high variance • For a formal definition of bias and variance, see Thomas Dietterich’s paper on the subject
  25. [Figure: an underfit model, labeled High Bias]
  26. [Figure: an overfit model, labeled High Variance]
  27. Decision trees have high variance • Decision trees can represent complex functions • But they are prone to overfitting; they have high variance • If you draw enough lines, you can create a “model” that just memorizes the dataset!
  28. Decision trees have high variance • We can reduce this problem by: • Taking several random samples from the original data set • Training a decision tree on each sample • Having these trees vote on the class • Goal: get the expressiveness of a decision tree, with less overfitting
  29. Single Tree [Diagram: 100% of data → one tree → Prediction]
  30. Ensemble of Trees [Diagram: 100% of data → five bootstrap samples → one tree per sample → Vote on Prediction]
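The diagram, sketched in code (plain scikit-learn and NumPy; in practice RandomForestClassifier bundles this bootstrap-and-vote recipe, plus random feature selection at each split):

```python
# Bagging by hand: train each tree on a bootstrap sample, then vote.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(5):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap: sample rows with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Score one new point: each tree votes, and the majority class wins.
votes = np.array([t.predict(X[:1])[0] for t in trees])
print("Votes:", votes, "-> majority:", np.bincount(votes).argmax())
```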
  31.-34. [Image-only slides]
  35. [Diagram: the voted boundary, with regions labeled Blue side and Red side; Vote: 2-1, Blue; Vote: 2-1, Red; Vote: 2-1, Blue]
  36. Benefits of a Decision Tree Ensemble • Voted boundary is more accurate than for a single tree • “Best of both worlds”: get most of the expressiveness of decision trees with lower variance • We’re actually taking advantage of the variance by feeding a different random sample to each tree and seeing what happens!
  37. Why draw straight lines in decision trees? • Imagine you have 400 variables in your dataset • You only need to examine 400 variables to draw the “best” straight line between the dots • If you want a diagonal line in two dimensions, there are (400 choose 2) or 79,800 combinations of variables to examine • Some biology datasets have 100,000 variables! • (100,000 choose 2) = 4,999,950,000 combinations of 2 variables!
  38. Popular algorithms for supervised learning • We got pretty deep into decision trees and ensembles of trees • Other popular algorithms for supervised learning: • Support Vector Machines • Neural Nets (“Deep Learning”) • Check out BigML’s automated deep learning!
  39. Recap: Supervised Learning Topics • Definition of supervised learning • Training and scoring a model • Support and confidence • Model evaluation using a holdout set • Bias and variance, underfitting and overfitting • Using ensembles to improve models • … And a whole lot about decision trees!
  40. [Scatter plot: Petal Width (cm) vs. Petal Length (cm)]
  41. What if we don’t have labels? • Can we still get insight into our data if we don’t know the colors of the dots? • Since we don’t have labels, this is unsupervised learning • Clustering: find “clumps” of unlabeled data that might be interesting • Anomaly detection: find outliers in unlabeled data • Topic modeling: identify topics in free text
  42. Clustering • Concept: find “clumps” of data that exist in distinct clusters • K-means clustering: 1. Choose the number of clusters k that you are looking for 2. Choose initial “centroids” for the clusters 3. Compute which data points are closest to each centroid 4. Move each centroid to the actual center of its data points 5. Repeat until the k centroids stop moving
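The five steps above, sketched in plain NumPy (illustrative only; it ignores edge cases such as a centroid losing all of its points):

```python
# K-means by hand, following the five steps on the slide.
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: initial centroids
    for _ in range(iters):
        # Step 3: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the center of its points.
        moved = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(moved, centroids):  # step 5: centroids stopped moving
            break
        centroids = moved
    return centroids, labels
```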
  43. Demo: The Whisky Dataset • Data on the flavors of 86 single-malt Scotch whiskies • No labels, just a bunch of taste information • Can we get insight into this dataset?
  44. Demo: Breast Cancer Dataset • Train a predictive model using the 699 biopsies • The “label” of benign or malignant is known for each one • We can train a highly accurate predictive model with this data
  45. Demo: Breast Cancer Dataset • What if we remove the labels of “benign” and “malignant”?
  46. [Diagram: 10 lines are needed to isolate this data point (not anomalous)]
  47. [Diagram: only 4 lines are needed to isolate this data point (highly anomalous)]
  48. Demo: Anomaly Detection • Remove the labels of benign or malignant • Train an anomaly detector on this unlabeled data • Create a new dataset with the anomaly scores as “labels” • Use these “labels” to train a predictive model!
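A sketch of this idea with scikit-learn's IsolationForest, which scores points by how few random splits isolate them (the talk used BigML's anomaly detector, and its 699-biopsy dataset differs from the 569-sample one bundled with scikit-learn):

```python
# Train an anomaly detector on the biopsies with the labels removed.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X, _ = load_breast_cancer(return_X_y=True)  # discard the labels entirely

detector = IsolationForest(random_state=0).fit(X)
scores = detector.score_samples(X)  # lower score = easier to isolate = more anomalous
print("Most anomalous biopsy:", scores.argmin())
```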
  49. Who Needs Labels?
  50. Minority Report • Anomaly detection works great on large unlabeled datasets, especially if you expect to find an (adversarial) minority class • Millions of credit card transactions, billions of network events … • Doesn’t require you to know what you’re looking for!
  51. Topic Modeling using LDA • Uncovers groups of related words (“topics”) in documents • Does not require an external corpus (e.g. training on Wikipedia) • No semantic parsing of text • Unsupervised
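A sketch of LDA on a few hypothetical review snippets (scikit-learn assumed; note the input is nothing but word counts, with no parsing and no labels):

```python
# LDA: unsupervised topics from raw word counts.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "the ogre and the talking donkey are hilarious",
    "a tense thriller about a nuclear bomb plot",
    "wonderful animation that children and adults enjoy",
]

counts = CountVectorizer(stop_words="english").fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))  # one topic distribution per review
```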
  52. Topic modeling on IMDB reviews • 52,000 reviews • 883 movies
  53. Top 3 Topics in Shrek Reviews (n=26)
  54. [Figure: topics, and the topic distribution for one document; borrowed/stolen from Prof. David Blei, with apologies …]
  55. The (assumed) generative process [Diagram: a distribution over topic distributions, fixed for this corpus, generates a distribution over topics specific to each document; a distribution over word distributions, fixed for this corpus, generates the topics (Topic 1, Topic 2, Topic 3), each a distribution over words (Word 1, Word 2, Word 3); together these generate each word in a document, e.g. “children”]
  56. What we observe [Diagram: only the words in each document, e.g. “children”]
  57. [Topics for Shrek reviews, n = 26]
  58. [Topics for Shrek reviews, n = 26]
  59. [Topics for The Sum of All Fears reviews, n = 31]
  60. [Topics for The Sum of All Fears reviews, n = 31]
  61. [Topics for Love, Actually reviews, n = 100]
  62. How do we get such “good” topics? • Imagine that each document can only belong to one topic • Does that make it easier or harder to find “good” clusters of words? • LDA allows documents to belong to multiple topics
  63. Recap: Unsupervised Learning Topics • Unsupervised learning uses unlabeled data • Clustering: finding clumps in unlabeled data • Anomaly detection: finding “weird” instances in unlabeled data • Topic modeling: extracting meaningful topics from free text
  64. Final Thought • Supervised learning has many different algorithms to solve one problem (predicting the output) • Unsupervised learning has many different algorithms to solve many different problems • David Gerster, gerster@bigml.com
  65. Backup Slides
  66. [Image-only slide]
