Title:
Real-time, Advanced Analytics and Recommendations using Machine Learning, Natural Language Processing, Graph Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
*Bio*
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer. Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com. Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
*Related Links*
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
1. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Spark and Recommendations
Spark, Streaming, Machine Learning, Graph Processing,
Approximations, Probabilistic Data Structures, NLP
USF Seminar Series
Thanks, USF!!
Feb 5th, 2016
Chris Fregly
Principal Data Solutions Engineer
We’re Hiring! (Only Nice People)
advancedspark.com
Who Am I?
Streaming Data Engineer, Netflix OSS Committer
Data Solutions Engineer, Apache Spark Contributor
Principal Data Solutions Engineer, IBM Spark Technology Center
Meetup Organizer, Advanced Apache Spark Meetup
Book Author, Advanced Spark (due 2016)
Advanced Apache Spark Meetup
http://advancedspark.com
Meetup Metrics
Top 5 Most-active Spark Meetup!
2400+ Members in just 6 mos!!
2500+ Docker image downloads
Meetup Mission
Deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance
Live, Interactive Demo!!
Audience Participation Required
(cell phone or laptop)
demo.advancedspark.com
[Diagram labels, left: End User, ElasticSearch, Spark ML, Data Scientist]
[Diagram labels, right: Kafka, Spark Streaming, Cassandra and Redis, Zeppelin and iPython]
Presentation Outline
Scaling with Parallelism and Composability
Similarity and Recommendations
When to Approximate
Common Algorithms and Data Structures
Common Libraries and Tools
Scaling with Parallelism
[Figure labels: Peter, O(log n), O(log n)]
Scaling with Composability
Max: (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition: (a + b + c + d) == (a + b) + (c + d)
Multiply: (a * b * c * d) == (a * b) * (c * d)
Division??
What about Division?
Division: (a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857
Not Composable
What were the Egyptians thinking?! “Divide like an Egyptian”
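The composability argument can be checked with a small sketch. The helper name `chunked_reduce` and the chunk size are assumptions made for illustration; it reduces each chunk on its own (as parallel workers would) and then combines the partial results.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, chunk_size=2):
    """Reduce each chunk independently, then reduce the partial results."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)


values = [3, 4, 7, 8]

# Associative operations: chunked and sequential reductions agree.
for op in (max, operator.add, operator.mul):
    assert chunked_reduce(op, values) == reduce(op, values)

# Set union is associative as well.
sets = [{1}, {2}, {1, 3}, {4}]
assert chunked_reduce(operator.or_, sets) == reduce(operator.or_, sets)

# Division is not: regrouping changes the answer.
sequential = reduce(operator.truediv, values)         # ((3 / 4) / 7) / 8
regrouped = chunked_reduce(operator.truediv, values)  # (3 / 4) / (7 / 8)
print(round(sequential, 4), round(regrouped, 4))      # 0.0134 0.8571
```

Any operation that passes the first check can be split across partitions and merged in any grouping, which is what makes it safe to parallelize.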
What about Average?
Overall AVG([3, 1], [5, 1], [5, 1], [7, 1]) == ((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5
(each pair is [value, count])
Pairwise AVG: (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
Divide, Add, Divide? Not Composable
Add, Add, Add? Composable!
Single Divide at the End? Doesn’t need to be Composable!
AVG (3, 5, 5, 7) == 5
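A minimal sketch of the "add, add, add, single divide at the end" idea, assuming each value is carried as a (sum, count) pair. The function names `to_pair`, `combine`, and `finish` are illustrative, not taken from any library.

```python
from functools import reduce


def to_pair(value):
    """Lift a raw value into a (sum, count) pair."""
    return (value, 1)


def combine(left, right):
    """Associative merge: add the sums and add the counts."""
    return (left[0] + right[0], left[1] + right[1])


def finish(pair):
    """The single, final divide."""
    total, count = pair
    return total / count


values = [3, 5, 5, 7]
left = reduce(combine, map(to_pair, values[:2]))   # (8, 2)
right = reduce(combine, map(to_pair, values[2:]))  # (12, 2)
average = finish(combine(left, right))             # 20 / 4
print(average)  # 5.0
```

Because `combine` only adds, partial pairs from different partitions can be merged in any grouping before the final divide.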
Similarity
Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude
Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalizes to unit vectors
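To contrast the magnitude bias of Euclidean distance with the angular cosine measure, here is a small sketch. The three viewing-count vectors are made-up illustration data.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


casual = [1.0, 2.0, 0.0]    # hypothetical views per genre
binge = [10.0, 20.0, 0.0]   # same taste, 10x the volume
other = [2.0, 1.0, 0.0]     # different taste, similar volume

# Euclidean distance says "casual" is closer to "other" than to "binge".
print(euclidean_distance(casual, binge), euclidean_distance(casual, other))

# Cosine similarity says "casual" and "binge" point in the same direction.
print(cosine_similarity(casual, binge), cosine_similarity(casual, other))
```

Cosine similarity assumes neither vector is all zeros; a production version would need to guard against that.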
Jaccard Similarity
Set similarity measurement
Set intersection / set union
Based on Jaccard distance
Bias towards popularity
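The intersection-over-union definition can be sketched as follows. The movie sets are made-up examples, and returning 1.0 for two empty sets is a convention assumed here.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance is one minus Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


alice = {"Top Gun", "Taps", "A Few Good Men"}
bob = {"Top Gun", "A Few Good Men", "Inception"}
print(jaccard_similarity(alice, bob))  # 2 shared / 4 total = 0.5
```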
Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem
Word Similarity
Edit Distance
Calculate char differences between words
Deletes, transposes, replaces, inserts
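One way to count deletes, transposes, replaces, and inserts is the optimal string alignment variant of edit distance, sketched below. The slide does not name a specific variant, so this choice is an assumption.

```python
def edit_distance(source, target):
    """Optimal string alignment distance between two words.

    Counts deletes, inserts, replaces, and transposes of adjacent characters.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source
    for j in range(cols):
        dist[0][j] = j  # insert every character of target
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete
                dist[i][j - 1] + 1,         # insert
                dist[i - 1][j - 1] + cost,  # replace (or match)
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # transpose
    return dist[-1][-1]


print(edit_distance("spark", "sprak"))     # 1 (one transpose)
print(edit_distance("kitten", "sitting"))  # 3
```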
Document Similarity
TF/IDF
Term Freq / Inverse Document Freq
Used by most search engines
Word2Vec
Words embedded in a vector space near similar words
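A minimal sketch of term frequency times inverse document frequency, assuming whitespace tokenization, raw term frequency, and a natural-log IDF. Real search engines use smoothed variants, and the three documents are made-up examples.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["top gun pilot", "top chef cooking", "pilot training school"]
weights = tf_idf(docs)
# "gun" is rarer across documents than "top", so it weighs more in document 0.
print(weights[0])
```

A term that appears in every document gets weight 0, which is how common words drop out of the similarity calculation.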
Similarity Pathway
i.e., closest recommendations between 2 people
Calculating Similarity
Exact Brute-Force
“All-pairs similarity”
aka “Pair-wise similarity”, “Similarity join”
Cartesian O(n^2) shuffle and comparison
Approximate
Sampling
Bucketing (aka “Partitioning”, “Clustering”)
Remove data with low probability of similarity
Reduce shuffle and comparisons
Bonus: Document Summary
Text Rank
aka “Sentence Rank”
TF/IDF + Similarity Graph + PageRank
Intuition
Surface summary sentences (abstract)
Most similar to all others (TF/IDF + Similarity Graph)
Most influential sentences (PageRank)
Similarity Graph
Vertex is movie, tag, actor, plot summary, etc.
Edges are relationships and weights
Topic-Sensitive PageRank
Graph diffusion algorithm
Pre-process graph, add vector of probabilities to each vertex
Probability of ending up at this vertex from every other vertex
Recommendations
Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like or rating
Implicit User Feedback: search, click, hover, view, scroll
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features
Feature Engineering
Dimension Reduction
Reduce number of features (aka “feature space”)
Principal Component Analysis (PCA)
Find principal features that describe the data in terms of variance
Peel the dimensional layers back until you describe the data
Example: One-Hot Encoding
Convert categorical feature values to 0’s, 1’s
Remove any hint of a relationship between the categories
Bears -> 1 --> Bears -> [1,0,0]
49’ers -> 2 --> 49’ers -> [0,1,0]
Steelers -> 3 --> Steelers -> [0,0,1]
1 binary column per category
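The one-hot example above can be sketched in a few lines. The function name `one_hot_encode` is illustrative, and categories are assumed to get their column in order of first appearance.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary column per category."""
    categories = list(dict.fromkeys(values))  # unique, in first-seen order
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


teams = ["Bears", "49'ers", "Steelers"]
categories, rows = one_hot_encode(teams)
for team, row in zip(teams, rows):
    print(team, "->", row)
```

Unlike the integer codes 1, 2, 3, the binary columns do not imply that Steelers is "greater than" Bears.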
Features
Binary Features: True or False
Numeric Discrete Features: Integers
Numeric Features: Real values
Ordinal Features: Maintains order (S -> M -> L -> XL -> XXL)
Temporal Features: Time-based (Time of Day, Binge Watching)
Categorical Features: Finite, unique set of categories (NFL teams)
Non-Personalized Recommendations
Cold Start Problem
“Cold Start” problem
New user, don’t know their pref, must show them something!
Movies with highest-rated actors
Top K Aggregations
Most desirable singles
PageRank of like activity
Facebook social graph
Recommend friend activity
Personalized Recommendations
Clustering (aka. Nearest Neighbors)
User-to-User Clustering
Similar movies watched or rated
Similar viewing pattern (i.e., binge or casual)
Item-to-Item Clustering
Similar tags/genres on movies
Similar textual description (TF/IDF, Word2Vec, NLP, Image)
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
My OKCupid Profile, My Hinge Profile
User-to-Item Collaborative Filtering
Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Fill in the missing values in the large matrix
Item-to-Item Collaborative Filtering
Made famous by Amazon Paper ~2003
Problem
As # of users grew, Matrix Factorization couldn’t scale
Solution
Offline/Batch
Generate itemId -> List[customerId] vectors
Online/Real-time
For each item in cart, recommend similar items from vector space
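The offline and online steps above might look like the following sketch. It assumes cosine similarity over binary customer sets, and the purchase data and function names are made up for illustration; this is not the exact method of the Amazon paper.

```python
import math
from collections import defaultdict


def build_item_vectors(purchases):
    """Offline step: map each itemId to the set of customerIds who bought it."""
    item_customers = defaultdict(set)
    for customer_id, item_id in purchases:
        item_customers[item_id].add(customer_id)
    return dict(item_customers)


def similar_items(item_id, item_customers, top_k=2):
    """Online step: rank other items by cosine similarity of customer sets."""
    target = item_customers[item_id]
    scores = []
    for other_id, customers in item_customers.items():
        if other_id == item_id:
            continue
        overlap = len(target & customers)
        if overlap:
            score = overlap / math.sqrt(len(target) * len(customers))
            scores.append((score, other_id))
    return [other_id for _, other_id in sorted(scores, reverse=True)[:top_k]]


# Hypothetical (customerId, itemId) purchase log.
purchases = [(1, "tent"), (1, "stove"), (2, "tent"),
             (2, "stove"), (3, "tent"), (3, "novel")]
vectors = build_item_vectors(purchases)
print(similar_items("tent", vectors))  # ['stove', 'novel']
```

The expensive pass over all purchases happens in batch; the online lookup only touches the precomputed vectors for the items in the cart.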
When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (# errors between then and now)
Using machine learning or graph algos
Inherently probabilistic and approximate
Finding topics in documents (LDA)
Finding similar pairs of users, items, words at scale (LSH)
Finding top influencers (PageRank)
Streaming aggregations (distinct count or top k)
Inherently sloppy means of collecting (at least once delivery)
Approximate as much as you can get away with!
Ask for forgiveness later !!
When NOT to Approximate?
If you’ve ever heard the term…
“Sarbanes-Oxley”
…in-that-order, at the office, after 2002.
A Few Good Algorithms
You can’t handle the approximate!
Common to These Algos & Data Structs
Low, fixed size in memory
Known error bounds
Store large amount of data
Less memory than Java/Scala collections
Tunable tradeoff between size and error
Rely on multiple hash functions or operations
Size of hash range defines error
Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”
Bloom Filter
Approximate set membership for key
False positives possible: filter reports contains(), actual !contains()
No false negatives: when the filter reports !contains(), actual is always !contains()
Elements only added, never removed
Bloom Filter in Action
set(key)
contains(key): Boolean
Images by @avibryant
TRUE -> maybe contains
FALSE -> definitely does not contain.
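"Hash multiple times and flip the bits wherever you land" can be sketched as below. The bit-array size, the number of hashes, and the use of seeded SHA-256 as the hash family are assumptions chosen for illustration; production libraries use faster hashes.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: elements are only added, never removed."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # One bit position per hash function, derived by seeding SHA-256.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def contains(self, key):
        # True means "maybe contains"; False means "definitely does not".
        return all(self.bits[position] for position in self._positions(key))


seen = BloomFilter()
seen.add("user1001")
print(seen.contains("user1001"))  # True (an added key is never reported missing)
```

Making `num_bits` larger lowers the false-positive rate at the cost of memory, which is the tunable size/error tradeoff described earlier.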
CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”
CountMin Sketch (CMS)
Approximate frequency count and TopK for key
i.e., “Heavy Hitters” on Twitter
Johnny Hallyday
Martin Odersky
Donald Trump
CountMin Sketch In Action
Use Case: TopK movies using total views
add(Top Gun, 2)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
Find minimum of all rows
Can overestimate, but never underestimate
[Figure: sketch grid with cells for Top Gun (x 2), A Few Good Men, and Taps; colliding cells are labeled “Overlap Top Gun” and “Overlap A Few Good Men”; 2 occurrences of “Top Gun” are used for slightly additional complexity]
Images derived from @avibryant
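"Hash multiple times and add 1 wherever you land" can be sketched as follows. The depth, width, and seeded SHA-256 hash family are assumptions for illustration, and the method names only mirror the `add` and `getCount` calls shown on the slide.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in a fixed depth x width table."""

    def __init__(self, depth=4, width=256):
        self.depth = depth  # one hash function per row
        self.width = width  # one counter per column
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        # Collisions only inflate counters, so the minimum across rows is
        # the tightest estimate: it can overestimate, never underestimate.
        return min(self.table[row][self._column(row, key)]
                   for row in range(self.depth))


views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
print(views.get_count("Top Gun"))  # at least 2
```

For TopK, the sketch is typically paired with a small heap of the current heaviest keys; that part is omitted here.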
HyperLogLog
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”
HyperLogLog (HLL)
Approximate count distinct
Slight twist
Special hash function creates uniform distribution
Error estimate
14 bits for size of range
m = 2^14 = 16,384 slots
error = 1.04/(sqrt(16,384)) = .81%
[Figure annotation: Not many of these]
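The error estimate above follows directly from the number of slots. A small sketch of the arithmetic, where the function name `hll_standard_error` is illustrative:

```python
import math


def hll_standard_error(index_bits):
    """Standard error 1.04 / sqrt(m) for m = 2^index_bits slots."""
    slots = 2 ** index_bits
    return 1.04 / math.sqrt(slots)


print(f"{hll_standard_error(14):.4%}")  # 0.8125%

# More bits means more slots, more memory, and a smaller error.
for bits in (10, 12, 14, 16):
    print(bits, 2 ** bits, f"{hll_standard_error(bits):.4%}")
```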
HyperLogLog In Action
Use Case: Distinct number of views per movie
[Figure: two panels, “Top Gun: Hour 1” and “Top Gun: Hour 2”, each showing user IDs hashed onto a numeric range (tick labels 0, 16, 32)]
Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Composable: Hour 1 + 2 (lose a bit of precision)
Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
Locality Sensitive Hashing (LSH)
Approximate set similarity
Hash designed to cluster similar items
Avoids cartesian all-pairs comparison
Pre-process m rows into b buckets
b << m
Hash items multiple times
Similar items hash to overlapping buckets
Compare just contents of buckets
Much smaller cartesian … and parallel !!
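The bucketing steps above can be sketched with MinHash signatures split into bands, a common LSH scheme for set similarity. The band and row counts, the seeded SHA-256 hash family, and the function names are assumptions for illustration; every input set is assumed to be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes):
    """For each hash function, keep the minimum hash over the set's items."""
    return [min(_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def candidate_pairs(named_sets, bands=10, rows_per_band=2):
    """Hash each signature band to a bucket; pair up names sharing a bucket."""
    buckets = defaultdict(set)
    for name, items in named_sets.items():
        signature = minhash_signature(items, bands * rows_per_band)
        for band in range(bands):
            start = band * rows_per_band
            key = (band, tuple(signature[start:start + rows_per_band]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Hypothetical viewing histories.
histories = {
    "alice": {"Top Gun", "Taps", "A Few Good Men"},
    "bob": {"Top Gun", "Taps", "A Few Good Men"},
    "carol": {"Inception", "Memento"},
}
print(candidate_pairs(histories))  # includes ('alice', 'bob')
```

Only the candidate pairs need an exact similarity check, which replaces the full cartesian comparison with many small per-bucket comparisons that can run in parallel.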
DIMSUM
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity
RowMatrix.columnSimilarities(threshold)
Twitter DIMSUM Case Study
40% efficiency gain over brute-force cosine sim
Common Tools to Approximate
Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing
Twitter Algebird
Rooted in Algebraic Fundamentals!
Parallel
Associative
Composable
Examples
Min, Max, Avg
BloomFilter (Set.contains(key))
HyperLogLog (Count Distinct)
CountMin Sketch (TopK Count)
Redis
Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error (Tunable)
Add user views for given movie
PFADD TopGun_HLL user1001 user2009 user3005
PFADD TopGun_HLL user3003 user1001
Get distinct count (cardinality) of set
PFCOUNT TopGun_HLL
Returns: 4 (distinct users viewed this movie)
Duplicates are ignored (user1001 is counted once)
Union 2 HyperLogLog Data Structures
PFMERGE TopGun_HLL Taps_HLL
Spark Approximations
Spark Core
RDD.count*Approx()
Spark SQL
PartialResult
HyperLogLogPlus
approxCountDistinct(column)
Spark ML
Stratified sampling
PairRDD.sampleByKey(withReplacement, fractions: Map[K, Double])
DIMSUM sampling
Probabilistic sampling reduces amount of comparison shuffle
RowMatrix.columnSimilarities(threshold)
Spark Streaming
A/B testing
StreamingTest.setTestMethod("welch").registerStream(dstream)
Demos!
Counting
Exact Count vs. Approx HyperLogLog, CountMin Sketch
HashSet vs. HyperLogLog
HashSet vs. CountMin Sketch
Set Similarity
Exact Jaccard Similarity vs. Approx Locality Sensitive Hashing
Brute Force Cartesian All Pair Similarity
90 mins!
All Pairs & Locality Sensitive Hashing
<< 90 mins!
Many More Demos Available!
http://advancedspark.com
Download Docker
or Clone Github
Bonus: Netflix Recommendations
From Offline DVD Ratings to Real-time Trending Now
$1 Million Netflix Prize (2006-2009)
Goal
Improve movie predictions by 10% (RMSE)
Dataset
(userId, movieId, rating, timestamp)
Test data withheld to calculate RMSE upon submission
Winning algorithm
10.06% improvement (RMSE)
Ensemble of 500+ ML models
Combined using GBDTs
Computationally impractical
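For reference, the RMSE metric and the relative improvement used to score submissions can be sketched as follows. The ratings and the baseline/candidate values are toy numbers, not Netflix Prize data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def improvement(baseline_rmse, candidate_rmse):
    """Relative RMSE reduction; 0.10 corresponds to a 10% improvement."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


actual = [4.0, 2.0, 5.0, 3.0]     # hypothetical held-out ratings
predicted = [3.5, 2.5, 4.5, 3.5]  # hypothetical model output
print(rmse(predicted, actual))    # 0.5
print(improvement(1.0, 0.9))      # about 0.10
```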
Secret to the Winning Algorithms
Adjust for the following…
Human bias
“Alice effect”: Alice tends to rate lower than average user
“Inception effect”: Inception is rated higher than average
“Alice-Inception effect”: Combo of Alice and Inception
Time-based bias
Number of days since a user’s first rating
Number of days since a movie’s first rating
Number of people who have rated a movie
A movie’s overall mean rating
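The "Alice effect" and "Inception effect" can be sketched as a simple baseline predictor: global mean plus a per-movie bias plus a per-user bias. This is a simplified illustration with made-up ratings and unregularized averages, not the winning team's actual model, and it leaves out the time-based terms.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """ratings: list of (user, movie, rating) tuples.

    Returns (global_mean, user_bias, movie_bias).
    """
    global_mean = sum(rating for _, _, rating in ratings) / len(ratings)

    # Movie bias ("Inception effect"): average offset from the global mean.
    by_movie = defaultdict(list)
    for _, movie, rating in ratings:
        by_movie[movie].append(rating - global_mean)
    movie_bias = {m: sum(d) / len(d) for m, d in by_movie.items()}

    # User bias ("Alice effect"): average offset left after the movie bias.
    by_user = defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append(rating - global_mean - movie_bias[movie])
    user_bias = {u: sum(d) / len(d) for u, d in by_user.items()}

    return global_mean, user_bias, movie_bias


def predict(model, user, movie):
    """Unknown users or movies fall back to a bias of zero."""
    global_mean, user_bias, movie_bias = model
    return global_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)


ratings = [("alice", "Inception", 4), ("bob", "Inception", 5),
           ("alice", "Taps", 2), ("bob", "Taps", 3)]
model = fit_baseline(ratings)
print(predict(model, "alice", "Inception"))  # 3.5 - 0.5 + 1.0 = 4.0
```

Falling back to the global mean for unseen users and movies is one simple answer to the cold start problem described earlier.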
Current Netflix Recommendations
Throw away offline-generated user factors (U)
Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
…
Bonus: Netflix Search
No results? No problem… Show similar results!
Used as implicit feedback for future decision making
Netflix and Data
Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like Politics and Kevin Spacey.
My favorite movie: “Harold and Kumar Go to White Castle”
The UK doesn’t have any White Castles, so they renamed my favorite movie “Harold and Kumar Get the Munchies”.
(This broke all of my unit tests.)
Summary: Buy NFLX Stock!
Thank You!!
Chris Fregly @cfregly
IBM Spark Tech Center
http://spark.tc
San Francisco, California, USA
http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me: LinkedIn, Twitter, Github, Email, Fax
Image derived from http://www.duchess-france.org/
advancedspark.com
@cfregly