
VSSML17 L3. Clusters and Anomaly Detection

Valencian Summer School in Machine Learning 2017 - Day 1
Lecture 3: Clusters and Anomaly Detection. By Poul Petersen (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017



  1. 1. Valencian Summer School in Machine Learning 3rd edition September 14-15, 2017
  2. 2. BigML, Inc 2 Clusters Finding Similarities Poul Petersen CIO, BigML, Inc
  3. 3. BigML, Inc 3Clusters What is Clustering? • An unsupervised learning technique • No labels necessary • Useful for finding similar instances • Smart sampling/labelling • Finds “self-similar" groups of instances • Customer: groups with similar behavior • Medical: patients with similar diagnostic measurements • Defines each group by a “centroid” • Geometric center of the group • Represents the “average” member • Number of centroids (k) can be specified or determined
  4. 4. BigML, Inc 4 Clusters Cluster Centroids

        date  customer  account  auth  class    zip    amount
        Mon   Bob       3421     pin   clothes  46140     135
        Tue   Bob       3421     sign  food     46140     401
        Tue   Alice     2456     pin   food     12222     234
        Wed   Sally     6788     pin   gas      26339      94
        Wed   Bob       3421     pin   tech     21350    2459
        Wed   Bob       3421     pin   gas      46140      83
        Thr   Sally     6788     sign  food     26339      51
  5. 5. BigML, Inc 5 Clusters Cluster Centroids (same transaction table, comparing two highlighted rows and summarizing one group by its centroid)
        Same:      auth = pin; amount ~ $100
        Different: date: Mon != Wed; customer: Sally != Bob; account: 6788 != 3421; class: clothes != gas; zip: 26339 != 46140
        Centroid of the similar group: date = Wed (2 out of 3), customer = Bob, account = 3421, auth = pin, class = gas, zip = 46140, amount = $104
  6. 6. BigML, Inc 6Clusters Use Cases • Customer segmentation • Which customers are similar? • How many natural groups are there? • Item discovery • What other items are similar to this one? • Similarity • What other instances share a specific property? • Recommender (almost) • If you like this item, what other items might you like? • Active learning • Labelling unlabelled data efficiently
  7. 7. BigML, Inc 7 Clusters Customer Segmentation GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell. • Dataset of mobile game users. • Data for each user consists of usage statistics and an LTV based on in-game purchases • Assumption: usage correlates to LTV (chart: clusters containing 0%, 1%, and 3% high-LTV users)
  8. 8. BigML, Inc 8 Clusters Similarity GOAL: Cluster the loans by application profile to rank loan quality by the percentage of trouble loans in each cluster • Dataset of Lending Club loans • Mark any loan that is currently or has ever been late as “trouble” (chart: clusters containing 0%, 1%, 3%, and 7% trouble loans)
  9. 9. BigML, Inc 9Clusters Active Learning GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data. • Dataset of diagnostic measurements of 768 patients. • Want to test each patient for diabetes and label the dataset to build a model but the test is expensive*.
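
The clustering-for-labelling idea above can be sketched in a few lines. This is an illustrative stand-in rather than the lecture's workflow (the lecture uses BigML): scikit-learn's KMeans on a synthetic 768-row matrix, labelling one sampled instance per cluster instead of the whole dataset.

```python
# Illustrative active-learning sketch: cluster the patients, then run the
# expensive diabetes test on one sampled member per cluster instead of on all.
# Assumes scikit-learn; the feature matrix is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))            # stand-in for the 768-patient measurements

k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

to_label = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    to_label.append(int(rng.choice(members)))   # label one representative per cluster

print("instances to label:", sorted(to_label))
```
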
  10. 10. BigML, Inc 10 Clusters Active Learning *For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not-fraud, or a million images which need to be labeled as cat/not-cat.
  11. 11. BigML, Inc 11 Clusters Item Discovery GOAL: Cluster the whiskies by flavor profile to discover whiskies that have a similar taste. • Dataset of 86 whiskies • Each whiskey is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics (scatter plot axes: Smoky vs. Fruity)
  12. 12. BigML, Inc 12Clusters Clusters Demo #1
  13. 13. BigML, Inc 13Clusters Human Expert Cluster into 3 groups…
  14. 14. BigML, Inc 14Clusters Human Expert
  15. 15. BigML, Inc 15Clusters Human Expert • Jesa used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “edges”, “hard”, etc • Items were then clustered based on the chosen features • Separation quality was then tested to ensure: • met criteria of K=3 • groups were sufficiently “distant” • no crossover
  16. 16. BigML, Inc 16Clusters Human Expert • Length/Width • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count Create features that capture these object differences
  17. 17. BigML, Inc 17 Clusters Clustering Features

        Object    Length / Width   Num Surfaces
        penny     1                3
        dime      1                3
        knob      1                4
        eraser    2.75             6
        box       1                6
        block     1.6              6
        screw     8                3
        battery   5                3
        key       4.25             3
        bead      1                2
  18. 18. BigML, Inc 18 Clusters Plot by Features (scatter plot of the objects, Num Surfaces vs. Length / Width, with K = 3) K-Means Key Insight: we can find clusters using distances in n-dimensional feature space
  19. 19. BigML, Inc 19 Clusters Plot by Features (same scatter plot) K-Means: find the “best” (minimum distance) circles that include all points
  20. 20. BigML, Inc 20Clusters K-Means Algorithm K=3
  21. 21. BigML, Inc 21Clusters K-Means Algorithm K=3 Repeat until centroids stop moving
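
As a concrete illustration of the loop on the last two slides (assign points to the nearest centroid, recompute centroids, repeat until they stop moving), here is a minimal NumPy sketch. It is not BigML's implementation; the toy data reuses the length/width and surface-count features from the object example.

```python
# Minimal k-means sketch: nearest-centroid assignment, centroid update,
# repeat until the centroids stop moving. Pure NumPy, toy data.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random starting points
    for _ in range(max_iter):
        # distance of every point to every centroid, then nearest-centroid labels
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):               # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 3], [1, 3], [1, 4], [2.75, 6], [1, 6],
              [1.6, 6], [8, 3], [5, 3], [4.25, 3], [1, 2]])     # length/width, num surfaces
centroids, labels = kmeans(X, k=3)
print(centroids)
print(labels)
```
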
  22. 22. BigML, Inc 22 Clusters Features Matter (with different features, the same objects group into Metal / Wood / Other clusters instead)
  23. 23. BigML, Inc 23 Clusters Convergence: convergence is guaranteed, but the result is not necessarily unique; the starting points are important (k-means++)
  24. 24. BigML, Inc 24 Clusters Starting Points • Random points or instances in n-dimensional space • Choose points “farthest” away from each other • but this is sensitive to outliers • k-means++ • the first center is chosen randomly from the instances • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from that instance's closest existing cluster center (see the sketch below)
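
A minimal sketch of that seeding rule, assuming plain NumPy and a numeric matrix X; BigML's actual k++ implementation may differ in its details.

```python
# k-means++-style seeding: first center is a random instance, each later center
# is drawn with probability proportional to its squared distance from the
# nearest already-chosen center.
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                  # first center: random instance
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                            # proportional to squared distance
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```
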
  25. 25. BigML, Inc 25 Clusters Scaling Matters: plotting houses by price and number of bedrooms, the distance along the price axis (d = 160,000) dwarfs the distance along the bedrooms axis (d = 1), so without scaling the price field dominates the clustering.
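
A tiny numeric illustration of the point above, using the made-up house example; the min/max values used for scaling are assumptions.

```python
# Why scaling matters: with raw units, the distance between two houses is
# dominated entirely by price; after per-field scaling both fields contribute.
import numpy as np

a = np.array([250_000, 3])      # price, bedrooms
b = np.array([410_000, 4])

print(np.linalg.norm(a - b))    # ~160000: bedrooms contribute almost nothing

# Scale each field to [0, 1] over assumed dataset min/max values.
lo, hi = np.array([100_000, 1]), np.array([500_000, 6])
print(np.linalg.norm((a - lo) / (hi - lo) - (b - lo) / (hi - lo)))   # ~0.45
```
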
  26. 26. BigML, Inc 26Clusters Other Tricks • What is the distance to a “missing value”? • What is the distance between categorical values? • What is the distance between text features? • Does it have to be Euclidean distance? • Unknown “K”?
  27. 27. BigML, Inc 27Clusters Distance to Missing? • Nonsense! Try replacing missing values with: • Maximum • Mean • Median • Minimum • Zero • Ignore instances with missing values
  28. 28. BigML, Inc 28 Clusters Distance to Categorical? • Define a special distance function: for two instances x and y and a categorical field a, distance(x, y) = 0 if x_a = y_a (or the field scaling value), else distance(x, y) = 1. Approach: similar to “k-prototypes”
  29. 29. BigML, Inc 29 Clusters Distance to Categorical?

        animal   favorite toy   toy color
        cat      ball           red
        cat      ball           green       per-field d = 0, 0, 1
        cat      laser          red
        dog      squeaky        red         per-field d = 1, 1, 0

        Then compute the Euclidean distance between the vectors: D = 1 for the first pair, D = √2 for the second.
        Note: the centroid is assigned the most common category of the member instances.
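
A sketch of such a mixed, k-prototypes-style distance in Python; the helper name mixed_distance and the field names are illustrative, not BigML's API. Numeric fields are included only to show how the two kinds of terms combine.

```python
# Mixed distance: 0/1 contribution for categorical fields, squared difference
# for numeric fields, combined Euclidean-style.
import math

def mixed_distance(x, y, categorical):
    total = 0.0
    for field in x:
        if field in categorical:
            total += 0.0 if x[field] == y[field] else 1.0   # 0 if equal, else 1
        else:
            total += (x[field] - y[field]) ** 2
    return math.sqrt(total)

a = {"animal": "cat", "favorite_toy": "ball", "toy_color": "red"}
b = {"animal": "cat", "favorite_toy": "ball", "toy_color": "green"}
c = {"animal": "dog", "favorite_toy": "squeaky", "toy_color": "red"}

cats = {"animal", "favorite_toy", "toy_color"}
print(mixed_distance(a, b, cats))   # 1.0, as on the slide
print(mixed_distance(a, c, cats))   # sqrt(2)
```
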
  30. 30. BigML, Inc 30 Clusters Text Vectors (diagram: each text field becomes a vector over thousands of term features such as "hippo", "safari", "zebra"; e.g. Text Field #1 = [1, 0, 1, …], Text Field #2 = [1, 1, 0, …]) • Cosine Similarity • cos() between the two vectors • 1 if collinear, 0 if orthogonal • with only positive vectors: 0 ≤ CS ≤ 1 • Cosine Distance = 1 - Cosine Similarity • CD(TF1, TF2) = 0.5
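
The 0.5 figure on the slide can be reproduced directly; here is a small NumPy sketch with made-up term vectors.

```python
# Cosine similarity / distance between two text fields represented as
# term-presence vectors over the features "hippo", "safari", "zebra".
import numpy as np

tf1 = np.array([1.0, 0.0, 1.0])     # Text Field #1
tf2 = np.array([1.0, 1.0, 0.0])     # Text Field #2

cos_sim = tf1 @ tf2 / (np.linalg.norm(tf1) * np.linalg.norm(tf2))
cos_dist = 1.0 - cos_sim
print(cos_sim, cos_dist)            # 0.5 and 0.5, matching CD(TF1, TF2) = 0.5
```
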
  31. 31. BigML, Inc 31Clusters Finding K: G-Means
  32. 32. BigML, Inc 32Clusters Finding K: G-Means
  33. 33. BigML, Inc 33Clusters Finding K: G-Means Let K=2 Keep 1, Split 1 New K=3
  34. 34. BigML, Inc 34Clusters Finding K: G-Means Let K=3 Keep 1, Split 2 New K=5
  35. 35. BigML, Inc 35Clusters Finding K: G-Means Let K=5 K=5
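
A rough, simplified G-means-style sketch under stated assumptions: scikit-learn's KMeans for the splits and SciPy's Anderson-Darling test as the Gaussianity check, with made-up size and significance thresholds. BigML's actual G-means implementation and critical values differ.

```python
# Simplified G-means loop: for each cluster, fit 2-means to its members, project
# them onto the axis joining the two child centers, and test that projection for
# normality; Gaussian-looking clusters are kept, the rest are split and we repeat.
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def looks_gaussian(points, significance_index=2):    # index 2 ~ 5% critical value
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    v = km.cluster_centers_[0] - km.cluster_centers_[1]
    if np.allclose(v, 0):
        return True
    projection = points @ v / np.linalg.norm(v)
    result = anderson(projection, dist="norm")
    return result.statistic < result.critical_values[significance_index]

def gmeans(X, max_k=16):
    centers = [X.mean(axis=0)]
    while len(centers) < max_k:
        km = KMeans(n_clusters=len(centers), init=np.array(centers),
                    n_init=1, random_state=0).fit(X)
        new_centers = []
        for j, c in enumerate(km.cluster_centers_):
            members = X[km.labels_ == j]
            if len(members) < 8 or looks_gaussian(members):
                new_centers.append(c)                 # keep this centroid
            else:                                     # split it in two
                child = KMeans(n_clusters=2, n_init=10, random_state=0).fit(members)
                new_centers.extend(child.cluster_centers_)
        if len(new_centers) == len(centers):
            break                                     # every cluster looked Gaussian
        centers = new_centers
    return np.array(centers)
```
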
  36. 36. BigML, Inc 36Clusters Clusters Demo #2
  37. 37. BigML, Inc 37 Clusters Summary • Cluster Purpose • Unsupervised technique for finding self-similar groups of instances • Number of centroids (k) can be specified or computed • Outputs a list of centroids • Configuration: • Algorithm: K-means / G-means • Cluster parameter: k or critical value • Default missing / Summary fields / Scales / Weights • Model Clusters • Centroid / Batch centroids (see the sketch below)
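
If you drive this from the BigML Python bindings, creating a cluster and scoring a centroid might look roughly like the following. The dataset id, field names, and the k value are placeholders, and the exact argument names should be checked against the bindings documentation.

```python
# Rough sketch with the BigML Python bindings (pip install bigml).
# The dataset id and input field names below are placeholders.
from bigml.api import BigML

api = BigML()                                   # reads BIGML_USERNAME / BIGML_API_KEY

dataset = "dataset/59c153eab95b3905a3000054"    # placeholder dataset id
cluster = api.create_cluster(dataset, {"name": "whiskey flavors", "k": 8})
api.ok(cluster)                                 # wait until the cluster is ready

# Score a new instance against the cluster: which centroid is it closest to?
centroid = api.create_centroid(cluster, {"Smoky": 2, "Fruity": 3})
api.ok(centroid)
print(centroid["object"]["centroid_name"])
```
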
  38. 38. BigML, Inc 2 Anomaly Detection Finding the Unusual Poul Petersen CIO, BigML, Inc
  39. 39. BigML, Inc 3 Anomaly Detection What is Anomaly Detection? • An unsupervised learning technique • No labels necessary • Useful for finding unusual instances • Filtering, finding mistakes, 1-class classifiers • Finds instances that do not match • Customer: big or small spender for profile • Medical: healthy patient despite indicative diagnostics • Assigns each instance an “anomaly score” • in BigML: 0 = normal, 1 = unusual, and a score of 0.7 is more unusual than 0.6, which is more unusual than 0.5 • Standard deviation, distributions, etc
  40. 40. BigML, Inc 4 Anomaly Detection Clusters (recap: the transaction table from the Clusters section)
  41. 41. BigML, Inc 5 Anomaly Detection Clusters (same table, with a group of similar transactions highlighted)
  42. 42. BigML, Inc 6 Anomaly Detection Anomaly Detection (the same transaction table, now asking which transaction is unusual)
  43. 43. BigML, Inc 7 Anomaly Detection Anomaly Detection (same table, with the Wed $2,459 “tech” transaction marked as the anomaly) • The amount $2,459 is higher than all other transactions • It is the only transaction in zip 21350 and the only one for the purchase class "tech"
  44. 44. BigML, Inc 8Anomaly Detection Use Cases • Unusual instance discovery - "exploration" • Intrusion Detection - "looking for unusual usage patterns" • Fraud - "looking for unusual behavior" • Identify Incorrect Data - "looking for mistakes" • Remove Outliers - "improve model quality" • Model Competence / Input Data Drift
  45. 45. BigML, Inc 9 Anomaly Detection Removing Outliers • Models need to generalize • Outliers negatively impact generalization GOAL: Use an anomaly detector to identify the most anomalous points and remove them before modeling. Workflow: DATASET → ANOMALY DETECTOR → FILTERED DATASET → CLEAN MODEL
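
A sketch of that filtering workflow, using scikit-learn's IsolationForest as a stand-in for the BigML anomaly detector and synthetic data; the 5% cut-off is an arbitrary illustration.

```python
# Filter-then-model workflow: score the training data with an anomaly detector,
# drop the most anomalous rows, and fit the model on the cleaned dataset.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

iso = IsolationForest(random_state=0).fit(X)
scores = -iso.score_samples(X)              # higher = more anomalous
keep = scores < np.quantile(scores, 0.95)   # drop the top 5% most anomalous rows

clean_model = LogisticRegression().fit(X[keep], y[keep])
```
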
  46. 46. BigML, Inc 10 Anomaly Detection Diabetes Anomalies (workflow diagram: DIABETES SOURCE → DIABETES DATASET → TRAIN SET and TEST SET; the train set is used both for an ALL MODEL and, via an ANOMALY DETECTOR and FILTER, for a CLEAN DATASET and CLEAN MODEL; the resulting ALL EVALUATION and CLEAN EVALUATION on the test set are then compared)
  47. 47. BigML, Inc 11Anomaly Detection Anomaly Demo #1
  48. 48. BigML, Inc 12 Anomaly Detection Intrusion Detection GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion. • Dataset of command line history for users • Data for each user consists of commands, flags, working directories, etc. • Assumption: users typically issue the same flag patterns and work in certain directories (detectors built per user / per directory and across all users / all directories)
  49. 49. BigML, Inc 13 Anomaly Detection Fraud • Dataset of credit card transactions • Additional user profile information GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels (card level, user level, and similar-users level).
  50. 50. BigML, Inc 14 Anomaly Detection Model Competence • After putting a model into production, the data being predicted can become statistically different from the training data. • Train an anomaly detector at the same time as the model (at training time: DATASET → MODEL + ANOMALY DETECTOR). GOAL: For every prediction, also compute an anomaly score. If the anomaly score is high, the model may not be competent and should not be trusted. Example at prediction time: Prediction T, Confidence 86%, Anomaly Score 0.5367 → competent (Y); Prediction T, Confidence 84%, Anomaly Score 0.7124 → not competent (N).
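
A sketch of that competence check, assuming scikit-learn's IsolationForest as the detector and a synthetic model; the 0..1 score mapping and the 0.65 threshold are illustrative and only roughly comparable to BigML's anomaly scores.

```python
# Train an anomaly detector alongside the model; at prediction time, refuse to
# trust predictions whose anomaly score is too high.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)
detector = IsolationForest(random_state=0).fit(X_train)

def predict_with_competence(x, threshold=0.65):
    # -score_samples is the 0..1 isolation-forest anomaly score (higher = more unusual)
    score = float(-detector.score_samples([x])[0])
    return {"prediction": int(model.predict([x])[0]),
            "anomaly_score": round(score, 4),
            "competent": score < threshold}

print(predict_with_competence(X_train[0]))          # in-distribution input
print(predict_with_competence(np.full(4, 8.0)))     # drifted input, likely flagged
```
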
  51. 51. BigML, Inc 15 Anomaly Detection Univariate Approach • Single variable: heights, test scores, etc • Assume the value is distributed “normally” • Compute the standard deviation • a measure of how “spread out” the numbers are • the square root of the variance (the average of the squared differences from the mean) • Depending on the number of instances, choose a “multiple” of standard deviations to indicate an anomaly. A multiple of 3 for 1,000 instances flags roughly 3 outliers.
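
A minimal sketch of the standard-deviation rule with made-up height data:

```python
# Univariate outliers: flag values more than m standard deviations from the mean.
import numpy as np

def univariate_outliers(values, m=3):
    values = np.asarray(values, dtype=float)
    mu, sigma = values.mean(), values.std()
    return values[np.abs(values - mu) > m * sigma]

# 1,000 synthetic heights plus one obvious outlier at 230 cm.
heights = np.concatenate([np.random.default_rng(0).normal(170, 8, 1000), [230.0]])
print(univariate_outliers(heights, m=3))     # the 230 cm entry (and few others) stand out
```
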
  52. 52. BigML, Inc 16 Anomaly Detection Univariate Approach (histogram of measurement vs. frequency, with outliers in both tails) • Available in the BigML API
  53. 53. BigML, Inc 17Anomaly Detection Benford’s Law • In real-life numeric sets the small digits occur disproportionately often as leading significant digits. • Applications include: • accounting records • electricity bills • street addresses • stock prices • population numbers • death rates • lengths of rivers • Available in BigML API
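
A small sketch that compares the observed leading-digit frequencies of a numeric column against Benford's expected log10(1 + 1/d) distribution; the sample values are made up.

```python
# Benford's law check: observed vs. expected leading-digit frequencies.
import math
from collections import Counter

def leading_digit(x):
    s = str(abs(x)).lstrip("0.")
    return int(s[0]) if s and s[0].isdigit() else None

def benford_report(values):
    counts = Counter(d for d in map(leading_digit, values) if d)
    total = sum(counts.values())
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts.get(d, 0) / total if total else 0.0
        print(f"digit {d}: observed {observed:.3f}  expected {expected:.3f}")

benford_report([3421, 135, 401, 234, 94, 2459, 83, 51, 6788, 2456])
```
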
  54. 54. BigML, Inc 18Anomaly Detection Multivariate Matters
  55. 55. BigML, Inc 19Anomaly Detection Multivariate Matters
  56. 56. BigML, Inc 20Anomaly Detection Human Expert Most Unusual?
  57. 57. BigML, Inc 21 Anomaly Detection Human Expert (objects partitioned into “Round”, “Skinny”, and “Corners” groups; the most unusual object is “skinny” but not “smooth”, has no “corners”, and is not “round”) Key Insight: The “most unusual” object is different in some way from every partition of the features.
  58. 58. BigML, Inc 22Anomaly Detection Human Expert • Human used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “smooth”, “corners” • Items were then separated based on the chosen features • Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster
  59. 59. BigML, Inc 23Anomaly Detection Human Expert • Length/Width • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count • Smooth - true or false Create features that capture these object differences
  60. 60. BigML, Inc 24 Anomaly Detection Anomaly Features

        Object    Length / Width   Num Surfaces   Smooth
        penny     1                3              TRUE
        dime      1                3              TRUE
        knob      1                4              TRUE
        eraser    2.75             6              TRUE
        box       1                6              TRUE
        block     1.6              6              TRUE
        screw     8                3              FALSE
        battery   5                3              TRUE
        key       4.25             3              FALSE
        bead      1                2              TRUE
  61. 61. BigML, Inc 25 Anomaly Detection Random Splits (decision-tree diagram: random true/false splits such as length/width > 5, smooth?, num surfaces = 6, length/width = 1, and length/width < 2 separate the screw, battery, key, box, block, eraser, knob, penny/dime, and bead) We know that “splits” matter; we don't know the order.
  62. 62. BigML, Inc 26 Anomaly Detection Isolation Forest Grow a random decision tree until each instance is in its own leaf: unusual instances are “easy” to isolate (shallow depth), typical instances are “hard” to isolate (deep). Now repeat the process several times and use the average depth to compute an anomaly score: 0 (similar) -> 1 (dissimilar)
  63. 63. BigML, Inc 27 Anomaly Detection Isolation Forest Scoring
        For an instance, find its depth in each tree and map the average depth to the final score,
        e.g. for instance i2: D = 3, D = 6, D = 2 across three trees gives S = 0.45.

             f1    f2   f3
        i1   red   cat  ball
        i2   red   cat  ball
        i3   red   cat  box
        i4   blue  dog  pen
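
A toy isolation-tree sketch of the idea on the last two slides, in plain Python with the object data from earlier; it reports raw average depths rather than BigML's normalized 0..1 score, and the number of trees and the depth cap are arbitrary.

```python
# Toy isolation trees: split on random fields/values until the instance is
# isolated, record the depth, and average over many trees.
# A shallow average depth means the instance is easy to isolate, i.e. unusual.
import random

def isolation_depth(instance, data, depth=0, max_depth=12):
    if len(data) <= 1 or depth >= max_depth:
        return depth
    field = random.randrange(len(instance))
    values = [row[field] for row in data]
    if min(values) == max(values):
        return depth
    split = random.uniform(min(values), max(values))
    # keep only the rows that fall on the same side of the split as the instance
    side = [row for row in data if (row[field] < split) == (instance[field] < split)]
    return isolation_depth(instance, side, depth + 1, max_depth)

def avg_depth(instance, data, trees=100):
    return sum(isolation_depth(instance, data) for _ in range(trees)) / trees

data = [[1, 3], [1, 3], [1, 4], [2.75, 6], [1, 6],
        [1.6, 6], [8, 3], [5, 3], [4.25, 3], [1, 2]]    # length/width, num surfaces
for row in data:
    print(row, round(avg_depth(row, data), 2))          # smaller depth = more unusual
```
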
  64. 64. BigML, Inc 28 Anomaly Detection Model Competence • A low anomaly score means the loan is similar to the modeled loans. • A high anomaly score means you should not trust the model. Workflow: a CLOSED LOAN MODEL and a CLOSED LOAN ANOMALY DETECTOR are both built from closed loans; each OPEN LOAN gets a PREDICTION from the model and an ANOMALY SCORE from the detector (e.g. Prediction T, Confidence 86%, Anomaly Score 0.5367 -> competent; Prediction T, Confidence 84%, Anomaly Score 0.7124 -> not competent).
  65. 65. BigML, Inc 29Anomaly Detection Anomaly Demo #2
  66. 66. BigML, Inc 30 Anomaly Detection 1-Class Classifier? • You place an advertisement in a local newspaper • You collect demographic information about all responders • Now you want to market in a new locality with direct letters • To optimize mailing costs, you need to predict who will respond • But you cannot distinguish “not interested” from “didn’t see the ad” • Train an anomaly detector on the 1-class (responders-only) data • Pick the households with the lowest scores for mailing: • If a household has a low anomaly score, it is “similar” to enough of your positive responders and may respond as well • If a household has a high anomaly score, it is dissimilar from all previous responders and is less likely to respond.
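
A sketch of the 1-class idea, assuming scikit-learn's IsolationForest as a stand-in detector and synthetic age/income data; the mailing cut-off of 200 households is arbitrary.

```python
# 1-class marketing sketch: train an anomaly detector on responders only, then
# mail the households that look least anomalous (most similar to known responders).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
responders = rng.normal(loc=[45, 60_000], scale=[5, 8_000], size=(300, 2))       # age, income
new_households = rng.normal(loc=[40, 55_000], scale=[15, 25_000], size=(1000, 2))

detector = IsolationForest(random_state=0).fit(responders)
anomaly = -detector.score_samples(new_households)     # higher = less like past responders

mail_to = np.argsort(anomaly)[:200]                   # 200 most responder-like households
print(mail_to[:10])
```
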
  67. 67. BigML, Inc 31Anomaly Detection Summary • Anomaly detection is the process of finding unusual instances • Some techniques and how they work: • Univariate: standard deviation • Benford’s law • Isolation Forest • Applications • Filtering to improve models • Finding mistakes, fraud, and intruders • Knowing when to retrain a model (competence) • 1-class classifiers • In general… unsupervised learning techniques: • Require more finesse and interpretation • Are more commonly part of a multistep workflow
