Presented at the eBay Inc Data Conference 2013:
“Survival of the Fittest: Using Genetic Algorithm for Data Mining Optimization”
Showed a Genetic Algorithm based method to optimize cluster analysis and developed a demo, applying this algorithm, for grouping similar items on eBay into a catalog of unique products.
1. Survival of the Fittest - Using Genetic
Algorithm for Data Mining Optimization
July 25, 2013
Or Levi
2. Introduction
Survival of the Fittest - Using Genetic Algorithm for Data Mining Optimization 2
•Better Results
•Higher Accuracy
•Knowledge
•Insights
Big Data
Machine Learning on eBay
Data Mining Optimization
Genetic Algorithm
3. Agenda
What is a Genetic Algorithm?
How can GA help improve Cluster Analysis?
Where might it be useful? An eBay Use Case
Questions and Answers
10. Genetic Algorithm
Crossover
1 1 0 1 1 1 0
Ron
0 1 0 1 0 1 1
Zoe
1 1 0 1 0 1 1
Ron Junior
0 1 0 1 1 1 0
Zoe Junior
5’8
4’2
6’0
5’3
Crossover Probability
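The crossover on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming single-point crossover on bit-string chromosomes; the slide does not name the operator or the cut point, and the function names are hypothetical. A cut after the fourth gene is one choice that reproduces Ron Junior and Zoe Junior from Ron and Zoe.

```python
import random

def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes after `point` genes."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def maybe_crossover(parent_a, parent_b, crossover_probability, rng=random):
    """Cross the parents with the given probability, otherwise copy them unchanged."""
    if rng.random() < crossover_probability:
        # Assumed: the cut falls strictly inside the chromosome.
        point = rng.randint(1, len(parent_a) - 1)
        return single_point_crossover(parent_a, parent_b, point)
    return parent_a[:], parent_b[:]

ron = [1, 1, 0, 1, 1, 1, 0]
zoe = [0, 1, 0, 1, 0, 1, 1]
ron_junior, zoe_junior = single_point_crossover(ron, zoe, point=4)
print(ron_junior)  # [1, 1, 0, 1, 0, 1, 1]
print(zoe_junior)  # [0, 1, 0, 1, 1, 1, 0]
```

With a crossover probability below 1, some pairs (like Joe and Adi later in the deck) pass through unchanged.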
11. Genetic Algorithm
Mutation
1 1 0 1 0 1 1
Ron Junior
0 1 0 1 1 1 0
Zoe Junior
No Mutation
6’0
5’3
Mutation Probability: 0.1
Fitness
Chromosome
12. Genetic Algorithm
Crossover
1 0 0 1 1 0 0
Joe
0 0 1 0 0 1 0
Adi
No Crossover
5’6
5’1
Crossover Probability
13. Genetic Algorithm
Mutation
1 0 0 1 1 0 0
Joe
0 0 1 0 0 1 0
Adi
0 0 1 0 1 1 0
Adi Junior 5’5
5’6
5’1
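Mutation can be sketched as a bit flip. This is an illustrative sketch only: the slides give a mutation probability but do not say whether it applies per gene or per chromosome, so the per-gene version below is an assumption, and both function names are hypothetical. Flipping the fifth gene of Adi reproduces Adi Junior.

```python
import random

def bit_flip_mutation(chromosome, mutation_probability, rng=random):
    """Flip each gene independently with the given probability (assumed per gene)."""
    return [1 - gene if rng.random() < mutation_probability else gene
            for gene in chromosome]

def flip_gene(chromosome, index):
    """Deterministic helper: flip exactly one gene, leaving the original untouched."""
    mutated = chromosome[:]
    mutated[index] = 1 - mutated[index]
    return mutated

adi = [0, 0, 1, 0, 0, 1, 0]
adi_junior = flip_gene(adi, index=4)
print(adi_junior)  # [0, 0, 1, 0, 1, 1, 0]
```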
14. Genetic Algorithm
New Generation
1 1 0 1 1 1 0
Ron
1 1 0 1 0 1 1
Ron Junior
1 0 0 1 1 0 0
Joe
0 0 1 0 1 1 0
5’8
6’0
5’6
Neck Length
Previous
Adi Junior 5’5
0 1 0 1 1 1 0
Zoe Junior 5’3
5’1 5’5
New
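The step from the previous generation to the new one can be sketched as a simple ranking. The slides do not spell out the selection rule or the population size, so this assumes truncation selection (keep the tallest individuals) with four survivors; the function names are hypothetical and the neck lengths are the ones shown on the earlier slides.

```python
def to_inches(feet, inches):
    """Convert a (feet, inches) neck length to inches so values can be compared."""
    return 12 * feet + inches

# Neck lengths as (feet, inches), taken from the crossover and mutation slides.
candidates = {
    "Ron": (5, 8), "Zoe": (4, 2), "Joe": (5, 6), "Adi": (5, 1),
    "Ron Junior": (6, 0), "Zoe Junior": (5, 3), "Adi Junior": (5, 5),
}

def select_fittest(candidates, survivors):
    """Assumed rule: rank by fitness (neck length) and keep the top `survivors`."""
    ranked = sorted(candidates, key=lambda name: to_inches(*candidates[name]),
                    reverse=True)
    return ranked[:survivors]

new_generation = select_fittest(candidates, survivors=4)
print(new_generation)  # ['Ron Junior', 'Ron', 'Joe', 'Adi Junior']
```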
21. Structured Data
Data Vendors
eBay Sellers
eBay Items
Products
Everywhere
Products
ISBN
UPI
22. Choose Aggregation Set
Brand
Model
Color
Creating Products from Items
Product Features
Network: 4G
Camera: 8.0MP
Screen Size: 4 in.
Used iOS
16GB New
Unlocked
$525.00
17 Bids
$649.99
Buy It Now
$579.99
or Best Offer
Storage
Carrier
Apple iPhone 5 – Black
Smartphones
Product Type eBay View Items Product
Apple iPhone 5 Black
Apple iPhone 5 Black
Black Apple iPhone 5
Other Features
Bluetooth: Yes
GPS: Yes
Dimensions:
Height: 4.87 in.
Depth: 0.30 in.
Width: 2.31 in.
Choose Aggregation Set Extract Relevant Attributes Aggregate Similar Items
23. Creating Products from Items
Structured Aggregation
Top-Down
Unstructured Clustering
Bottom-Up
Items
Products
Aggregation Set
Aggregation Set
24. Overview – eBay Use Case
Use Case Example
26. K-Means Cluster Analysis
Model
Observation: $x_j$
Cluster: $C_i$
Center: $\mu_i$
Objective: minimize the Total Within Cluster Variance
$$\min \sum_{i=1}^{K} SSE_i, \qquad SSE_i = \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$
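The objective can be computed directly from a partition of the data. This is a minimal sketch, assuming observations are numeric tuples and each center is the mean of its cluster; the function names and the toy data are hypothetical.

```python
def squared_distance(point, center):
    """Squared Euclidean distance between an observation and a center."""
    return sum((p - c) ** 2 for p, c in zip(point, center))

def cluster_mean(points):
    """Center of a cluster: the coordinate-wise mean of its observations."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def total_within_cluster_variance(clusters):
    """Sum over clusters of the squared distances of members to their center."""
    total = 0.0
    for points in clusters:
        center = cluster_mean(points)
        total += sum(squared_distance(p, center) for p in points)
    return total

# Toy data: two clusters of two observations each.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
print(total_within_cluster_variance(clusters))  # 4.0
```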
27. Standard K-Means Algorithm
Choose K Random Points
Initial Center
28. Standard K-Means Algorithm
Assign Points to Clusters
Cluster
Center
30. Standard K-Means Algorithm
Recalculate the Cluster Means
Solution in Iteration: 1
Solution Score (Total Within Cluster Variance): 33.7
31. Standard K-Means Algorithm
Solution in Iteration: 2
Solution Score (Total Within Cluster Variance): 26.8
32. Standard K-Means Algorithm
Solution in Iteration: 3
Solution Score (Total Within Cluster Variance): 23.6
33. Standard K-Means Algorithm
Solution in Iteration: 4
Solution Score (Total Within Cluster Variance): 21.6
34. Standard K-Means Algorithm
Solution in Iteration: 5
Solution Score (Total Within Cluster Variance): 19.5
35. Standard K-Means Algorithm
Solution in Iteration: 6
Solution Score (Total Within Cluster Variance): 18.8
36. Standard K-Means Algorithm
Solution in Iteration: 7
Solution Score (Total Within Cluster Variance): 18.7
Local Optimum
37. Standard K-Means Algorithm
Initial Cluster Centers
Initial Center
Local Optimum
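The standard K-Means loop (choose initial centers, assign points, recalculate means, repeat) and its sensitivity to the initial centers can be sketched as follows. This is an illustrative pure-Python version, not the code used in the talk; the function names and the four-point data set are hypothetical. The second run starts from a poor pair of initial centers and converges to a worse solution than the first.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def assign(points, centers):
    """Assign every point to the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def random_initial_centers(points, k, seed=None):
    """Choose K random points as the initial centers."""
    return random.Random(seed).sample(points, k)

def k_means(points, initial_centers, max_iterations=100):
    """Alternate assignment and mean recalculation until the centers stop moving."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else old for c, old in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    clusters = assign(points, centers)
    score = sum(sq_dist(p, c) for cluster, c in zip(clusters, centers) for p in cluster)
    return centers, score

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
_, good = k_means(points, [(0.0, 0.0), (10.0, 0.0)])
_, stuck = k_means(points, [(0.0, 0.0), (0.0, 1.0)])
print(good, stuck)  # 1.0 100.0
```

Both runs stop because no center moves any more, yet only the first reaches the best partition of this toy data; the second is a local optimum.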
38. Overview – Standard K-Means
Use Case
Standard K-Means
Local Optimum
40. Genetic K-Means Algorithm
0 0 1 0 0 1 0
Chromosome
Adi
7 Genes
Solution Representation
43. Genetic K-Means Algorithm
Solution Fitness
Neck Length
44. Genetic K-Means Algorithm
Solution Fitness: Total Within Cluster Variance
$$\sum_{i=1}^{K} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$
45. Genetic K-Means Algorithm
K-Means Iteration
Solution Fitness
46. Genetic K-Means Algorithm
Arithmetic Crossover
Solution 1
Solution 2
Crossover
Ron Zoe
47. Genetic K-Means Algorithm
Offspring 1
Offspring 2
Crossover
Ron Junior Zoe Junior
Arithmetic Crossover
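Arithmetic crossover on cluster centers can be sketched as a weighted average of the two parent solutions. The slides name the operator but not its details, so the blending weight `alpha` and the pairing of centers by index are assumptions, and the function name is hypothetical.

```python
def arithmetic_crossover(solution_1, solution_2, alpha):
    """Blend two solutions (lists of centers) into two complementary offspring.

    Offspring 1 = alpha * solution_1 + (1 - alpha) * solution_2
    Offspring 2 = alpha * solution_2 + (1 - alpha) * solution_1
    Centers are paired by position, which is an assumption of this sketch.
    """
    def blend(weight, first, second):
        return [tuple(weight * a + (1 - weight) * b for a, b in zip(p, q))
                for p, q in zip(first, second)]
    return blend(alpha, solution_1, solution_2), blend(alpha, solution_2, solution_1)

ron = [(0.0, 0.0), (8.0, 4.0)]  # two cluster centers
zoe = [(4.0, 4.0), (0.0, 0.0)]
ron_junior, zoe_junior = arithmetic_crossover(ron, zoe, alpha=0.25)
print(ron_junior)  # [(3.0, 3.0), (2.0, 1.0)]
print(zoe_junior)  # [(1.0, 1.0), (6.0, 3.0)]
```

Unlike the bit-string crossover shown earlier, the offspring here are new real-valued centers that lie between the parents' centers.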
50. Overview – Genetic K-Means Algorithm
Use Case
Apply GA to K-Means
Standard K-Means
Local Optimum
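Putting the pieces together, a genetic K-Means loop might look like the sketch below. It follows what the deck states (each solution is a set of cluster centers, fitness is the total within-cluster variance, one K-Means iteration per solution per generation, arithmetic crossover, mutation) and uses the appendix parameters as defaults. The mating scheme, the mutation operator (replace one center with a random observation) and the selection rule (keep the best of parents plus offspring) are assumptions, as are all function names.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def assign(points, centers):
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def fitness(points, centers):
    """Total within-cluster variance of the partition induced by `centers` (lower is better)."""
    total = 0.0
    for cluster in assign(points, centers):
        if cluster:
            mu = mean(cluster)
            total += sum(sq_dist(p, mu) for p in cluster)
    return total

def k_means_step(points, centers):
    """One K-Means iteration; an empty cluster keeps its previous center."""
    clusters = assign(points, centers)
    return [mean(c) if c else old for c, old in zip(clusters, centers)]

def arithmetic_crossover(a, b, alpha):
    def blend(w, s, t):
        return [tuple(w * x + (1 - w) * y for x, y in zip(p, q)) for p, q in zip(s, t)]
    return blend(alpha, a, b), blend(alpha, b, a)

def mutate(centers, points, rng):
    """Assumed mutation: replace one random center with a random observation."""
    mutated = list(centers)
    mutated[rng.randrange(len(mutated))] = rng.choice(points)
    return mutated

def genetic_k_means(points, k, population_size=10, generations=10,
                    crossover_probability=0.75, mutation_probability=0.09, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(points, k) for _ in range(population_size)]
    history = []  # best fitness per generation
    for _ in range(generations):
        population = [k_means_step(points, s) for s in population]
        mates = population[:]
        rng.shuffle(mates)
        offspring = []
        for a, b in zip(mates[::2], mates[1::2]):
            if rng.random() < crossover_probability:
                offspring.extend(arithmetic_crossover(a, b, rng.random()))
        offspring = [mutate(s, points, rng) if rng.random() < mutation_probability else s
                     for s in offspring]
        # Assumed selection: keep the fittest of parents and offspring.
        ranked = sorted(population + offspring, key=lambda s: fitness(points, s))
        population = ranked[:population_size]
        history.append(fitness(points, population[0]))
    return population[0], history

# Hypothetical toy data: three small square blobs.
blobs = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (5, 9)]
         for dx, dy in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]]
best_centers, history = genetic_k_means(blobs, k=3)
print(history)
```

Because the best solution of each generation always survives under the assumed selection rule, the best fitness in `history` never gets worse from one generation to the next.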
52. Demo
Best Solution in Generation: 1
Solution Fitness (Total Within Cluster Variance): 32.5
Cluster
Center
53. Demo
Best Solution in Generation: 2
Solution Fitness (Total Within Cluster Variance): 21.9
54. Demo
Best Solution in Generation: 3
Solution Fitness (Total Within Cluster Variance): 14.7
55. Demo
Best Solution in Generation: 4
Solution Fitness (Total Within Cluster Variance): 12.3
56. Demo
Best Solution in Generation: 5
Solution Fitness (Total Within Cluster Variance): 9.7
57. Demo
Best Solution in Generation: 6
Solution Fitness (Total Within Cluster Variance): 9.1
58. Demo
Best Solution in Generation: 7
Solution Fitness (Total Within Cluster Variance): 8.9
59. Genetic Algorithm VS Standard K-Means
[Chart: Total Within Cluster Variance Per Generation. X-axis: Generations; Y-axis: Total Within Cluster Variance; series: Genetic Algorithm, K-Means]
60. Genetic Algorithm VS Standard K-Means
[Chart: Total Within-Cluster Variance on Different Runs. Y-axis: Total Within Cluster Variance; series: K-Means, Multiple K-Means, Genetic Algorithm]
K-Means: High Volatility
Multiple K-Means: Local Optimum
Genetic Algorithm: Global Optimum
Average Improvement Across 20 Different Runs:
51% VS Standard K-Means
32% VS Multiple K-Means
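The improvement figures are relative reductions in total within-cluster variance. The 20-run averages cannot be recomputed from the slides, but the single-run walkthroughs give a comparable number. The sketch below assumes the two walkthroughs (standard K-Means ending at 18.7, genetic algorithm ending at 8.9) cluster the same data; the function name is hypothetical.

```python
def relative_improvement(baseline, improved):
    """Fraction by which `improved` lowers `baseline` (lower variance is better)."""
    return (baseline - improved) / baseline

standard_k_means = 18.7   # local optimum reached in iteration 7
genetic_algorithm = 8.9   # best solution in generation 7
print(f"{relative_improvement(standard_k_means, genetic_algorithm):.1%}")  # 52.4%
```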
61. Overview – GA VS Standard K-Means
Use Case
Apply GA to K-Means
Global Optimum
Standard K-Means
Local Optimum
63. eBay Use Case
Lumia 920 Red 32GB Lumia 520 Yellow 8GB Lumia 620 Green 8GB
Lumia 800 Blue 16GB
64. eBay Use Case
Nokia Lumia 800 Blue 8GB phone
0.05 0.12 0 0.31 0 0.20 0.12 0 0.14
Clean Up
TF-IDF Weights
NOKIA | LUMIA | 800 | BLUE | 8GB | PHONE
NOKIA LUMIA 800 BLUE 8GB PHONE
Number of Unique Terms in All Titles
Original Title
9 7 9
25
520 620 800 920
50 Random Items
Text Dictionary: All Titles
Importance of
A term to a title
{Stop Words}
brand new
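The title-to-vector step (clean up, build a dictionary from all titles, weight each term by TF-IDF) can be sketched as follows. The slides do not give the exact TF-IDF variant, so this assumes term frequency divided by title length and an unsmoothed logarithmic inverse document frequency. The stop-word list contains only the two words shown on the slide, the first title is a hypothetical original title, and the function names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"BRAND", "NEW"}  # only the words shown on the slide; a real list would be longer

def clean_up(title):
    """Upper-case the title, split it into terms and drop stop words."""
    return [term for term in title.upper().split() if term not in STOP_WORDS]

def tf_idf_vectors(titles):
    """Return the dictionary of unique terms and one TF-IDF vector per title."""
    docs = [clean_up(title) for title in titles]
    vocabulary = sorted({term for doc in docs for term in doc})
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against titles made only of stop words
        vectors.append([(counts[term] / length) * math.log(len(docs) / doc_freq[term])
                        for term in vocabulary])
    return vocabulary, vectors

titles = [
    "Brand new Nokia Lumia 800 Blue 8GB phone",  # hypothetical original title
    "Lumia 920 Red 32GB",
    "Lumia 520 Yellow 8GB",
    "Lumia 620 Green 8GB",
    "Lumia 800 Blue 16GB",
]
vocabulary, vectors = tf_idf_vectors(titles)
print(clean_up(titles[0]))  # ['NOKIA', 'LUMIA', '800', 'BLUE', '8GB', 'PHONE']
```

A term that appears in every title, such as LUMIA here, gets weight 0 under this variant, which matches the idea that it carries no information for telling products apart.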
65. eBay Use Case
Aggregation Set:
Model Color Storage
0.08 0.11 0 0.13 0 0.06 0.05 0 0.03
8GB 620
Average
Weight
GREEN 5MP CAMERA PHONE
Cluster Center
1 Item
NOKIA | LUMIA | 800 | BLUE | 8GB | PHONE
Accurate Item Classifications
Average Improvement Across 20 Different Runs:
46% VS Standard K-Means
23% VS Multiple K-Means
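A cluster center built from average term weights can be sketched as below. This is an illustration under assumptions: the center is taken as the coordinate-wise mean of the item vectors in the cluster, and the highest-weighted terms are read off as the product's attributes. The vocabulary, the weights and the function names are hypothetical.

```python
def cluster_center(vectors):
    """Average weight of every term across the items in the cluster."""
    return [sum(weights) / len(vectors) for weights in zip(*vectors)]

def top_terms(center, vocabulary, n):
    """The n terms with the highest average weight in the cluster center."""
    ranked = sorted(zip(center, vocabulary), reverse=True)
    return [term for _, term in ranked[:n]]

vocabulary = ["620", "8GB", "GREEN", "LUMIA", "PHONE"]
items = [
    [0.5, 0.25, 0.5, 0.0, 0.0],     # hypothetical TF-IDF weights of one item
    [0.5, 0.25, 0.25, 0.0, 0.125],  # hypothetical TF-IDF weights of another item
]
center = cluster_center(items)
print(center)                            # [0.5, 0.25, 0.375, 0.0, 0.0625]
print(top_terms(center, vocabulary, 3))  # ['620', 'GREEN', '8GB']
```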
66. Overview – Example
Use Case Example
Apply GA to K-Means
Global Optimum
Standard K-Means
Local Optimum
69. Conclusion
Use Case Example
Apply GA to K-Means
Global Optimum
Standard K-Means
Local Optimum
+50% Accuracy
70. Thank You!
Or Levi
Data Analyst
Catalog & Classification
eBay Structured Data
olevi@ebay.com
Linked
71. Appendix – Genetic Algorithm Parameters
[Chart: Normalized Score (0 to 100) of Total Within Cluster Variance, Average of 5 Runs. Axes: Crossover Probability (65% to 90%), Mutation Probability (5% to 10%)]
Population Size: 10
Number of Generations: 10
Crossover Probability: 75%
Mutation Probability: 9%
Editor's Notes
Still, you could say, “Of course the genetic algorithm is better. In GA you have a population size of, say, 10 individuals, and in each generation you run 1 iteration of k-means for all the solutions, so it’s like you’re running k-means 10 times simultaneously.” So I’ve also compared the genetic algorithm to what you could call multiple k-means, and what I’ve found was really interesting. The genetic algorithm mostly returned the optimal solution, multiple k-means kept getting stuck around local optima, and standard k-means was just all over the place.
So you can see that multiple k-means can help us reduce the volatility of the results, but it still can’t get past local optima, and this really shows the added value of evolution, and in particular of the crossover and mutation operators.
What it means is that, on average, the genetic algorithm can help us reduce the total variance within each cluster by more than half compared to standard k-means.
(And across 20 different runs, on average, the genetic algorithm was able to find solutions that are more than 50% better than standard k-means and more than 30% better than multiple k-means.)
* Don’t just describe what you analyzed and the results