SlideShare une entreprise Scribd logo
1  sur  74
Télécharger pour lire hors ligne
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Spark and Recommendations
Spark, Streaming, Machine Learning, Graph Processing,
Approximations, Probabilistic Data Structures, NLP 
USF Seminar Series
Thanks, USF!!
Feb 5th, 2016
Chris Fregly
Principal Data Solutions Engineer
We’re Hiring! (Only Nice People)
advancedspark.com!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Who Am I?
2

Streaming Data Engineer
Netflix OSS Committer


Data Solutions Engineer

Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Meetup Organizer
Advanced Apache Meetup
Book Author
Advanced .
Due 2016
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Advanced Apache Spark Meetup
http://advancedspark.com
Meetup Metrics
Top 5 Most-active Spark Meetup!
2400+ Members in just 6 mos!!
2500+ Docker image downloads
Meetup Mission
Deep-dive into Spark and related open source projects
Surface key patterns and idioms
Focus on distributed systems, scale, and performance

 

3
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Live, Interactive Demo!!
Audience Participation Required
(cell phone or laptop)
4
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
demo.advancedspark.com
End User ->



ElasticSearch ->


Spark ML ->


Data Scientist ->

5
<- Kafka


<- Spark

Streaming


<- Cassandra,
Redis


<- Zeppelin, 
 
iPython
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
6
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Scaling with Parallelism
7
Peter
O(log n)
O(log n)
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Scaling with Composability

 
Max (a max b max c max d) == (a max b) max (c max d) 
Set Union (a U b U c U d) 
 
== (a U b) U (c U d)
Addition (a + b + c + d) 
 == (a + b) 
 +

 (c + d)
Multiply 
 
 (a * b * c * d) 
 
== (a * b) * (c * d) 
Division??
8
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
What about Division?
Division
(a / b / c / d) 
!= (a / b) / (c / d)

 
 
 
(3 / 4 / 7 / 8) 
!= (3 / 4) / (7 / 8) 

 
 
 (((3 / 4) / 7) / 8)
!= ((3 * 8) / (4 * 7)) 

 
 
 
 
 
0.134 
 
!= 
 0.857


9
What were the Egyptians thinking?!
Not Composable
“Divide like
an Egyptian”
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
What about Average?
Overall AVG ( 

 
 
 
 
 
[3, 1] 
 
 
 
 ((3 + 5) + (5 + 7)) 
 
 
 
 20

 
 
 
 
 
[5, 1] == ----------------------- == --- == 5

 
 
 
 
[5, 1] 
 
 
 
 
 ((1 + 2) + 1) 
 
 
 

 
 4 


 
 
 
 
 
[7, 1] 
 


 
 
 
 
 )
10
value
count
Pairwise AVG

 (3 + 5) (5 + 7) 8 12 20

 ------- + ------- == --- + --- == --- == 10 != 5

 2 
 2 
 2 2
 2
Divide, Add, Divide?
Not 
 Composable
Single Divide at the End?
Doesn’t need to be Composable!
AVG (3, 5, 5, 7) == 5
Add, Add, Add?
Composable!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
11
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Similarity
12
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance 
Linear measure
Bias towards magnitude
13
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
14
Normalizes to unit vectors
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Jaccard Similarity
Set similarity measurement
Set intersection / set union ->
Based on Jaccard distance
Bias towards popularity
15
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem
16
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Word Similarity
Edit Distance

Calculate char differences between words

Deletes, transposes, replaces, inserts
17
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Document Similarity
TD/IDF

Term Freq / Inverse Document Freq

Used by most search engines

Word2Vec

Words embedded in vector space nearby similars

18
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Similarity Pathway
ie. Closest recommendations between 2 people
19
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Calculating Similarity
Exact Brute-Force

“All-pairs similarity” 

aka “Pair-wise similarity”, “Similarity join”

Cartesian O(n^2) shuffle and comparison

Approximate

Sampling

Bucketing (aka “Partitioning”, “Clustering”)

Remove data with low probability of similarity

Reduce shuffle and comparisons
20
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Bonus: Document Summary
Text Rank

aka “Sentence Rank”

TF/IDF + Similarity Graph + PageRank

Intuition

Surface summary sentences (abstract)

 
Most similar to all others (TF/IDF + Similarity Graph)

 
Most influential sentences (PageRank)
21
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Similarity Graph
Vertex is movie, tag, actor, plot summary, etc.
Edges are relationships and weights
22
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Topic-Sensitive PageRank
Graph diffusion algorithm
Pre-process graph, add vector of probabilities to each vertex

Probability of ending up at this vertex from every other
vertex
23
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Recommendations
24
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: like or rating
Implicit User Feedback: search, click, hover, view, scroll 
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting 
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features
25
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Feature Engineering
Dimension Reduction

Reduce number of features (aka “feature space”)

Principle Component Analysis (PCA)

Find principle features that describe the data in terms of variance

Peel the dimensional layers back until you describe the data

Example: One-Hot Encoding

Convert categorical feature values to 0’s, 1’s

Remove any hint of a relationship between the categories

 
 
 
Bears 
-> 1 
 
 
 
Bears -> 
[1,0,0]

 
 
 
49’ers -> 2 
 --> 
49’ers ->
[0,1,0]

 
 
 
Steelers-> 3 
 
 
 
Steelers-> [0,0,1]
26
1 binary column 
per category
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Features
Binary Features: True or False
Numeric Discrete Features: Integers
Numeric Features: Real values
Ordinal Features: Maintains order (S -> M -> L -> XL -> XXL)
Temporal Features: Time-based (Time of Day, Binge Watching)
Categorical Features: Finite, unique set of categories (NFL teams)

27
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Non-Personalized Recommendations
28
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Cold Start Problem
“Cold Start” problem

New user, don’t know their pref, must show them something!

Movies with highest-rated actors

Top K Aggregations

 

Most desirable singles

PageRank of like activity

Facebook social graph

Recommend friend activity

29
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Personalized Recommendations
30
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Clustering (aka. Nearest Neighbors)
User-to-User Clustering

Similar movies watched or rated

Similar wiewing pattern (ie. binge or casual)
Item-to-Item Clustering

Similar tags/genres on movies

Similar textual description (TF/IDF, Word2Vec, NLP, Image)


31
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html!My OKCupid Profile! My Hinge Profile!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
User-to-Item Collaborative Filtering
Matrix Factorization
①  Factor the large matrix (left) into 2 smaller matrices (right)
②  Fill in the missing values with in the large matrix
32
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Item-to-Item Collaborative Filtering
Made famous by Amazon Paper ~2003

Problem

As # of users grew, Matrix Factorization couldn’t scale

Solution

Offline/Batch

 
Generate itemId -> List[customerId] vectors


Online/Real-time

 
For each item in cart, recommend similar items from vector space

33
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
34
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
When to Approximate?
Memory or time constrained queries

Relative vs. exact counts are OK (# errors between then and now)
Using machine learning or graph algos

Inherently probabilistic and approximate

Finding topics in documents (LDA)

Finding similar pairs of users, items, words at scale (LSH)

Finding top influencers (PageRank)
Streaming aggregations (distinct count or top k)

Inherently sloppy means of collecting (at least once delivery)
35
Approximate as much as you can get away with!
Ask for forgiveness later !!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
When NOT to Approximate?
If you’ve ever heard the term…

“Sarbanes-Oxley”

…in-that-order, at the office, after 2002.
36
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
37
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
A Few Good Algorithms
38
You can’t handle 

the approximate!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Common to These Algos & Data Structs
Low, fixed size in memory
Known error bounds
Store large amount of data
Less memory than Java/Scala collections
Tunable tradeoff between size and error
Rely on multiple hash functions or operations
Size of hash range defines error
39
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”
40
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Bloom Filter
Approximate set membership for key

False positive: expect contains(), actual !contains()

True negative: expect !contains(), actual !contains()
Elements only added, never removed
41
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Bloom Filter in Action
42
set(key)
 contains(key): Boolean
Images by @avibryant
TRUE -> maybe contains
FALSE -> definitely does not contain.
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”
43
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CountMin Sketch (CMS)
Approximate frequency count and TopK for key 
ie. “Heavy Hitters” on Twitter
44
Johnny Hallyday
 Martin Odersky
 Donald Trump
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
CountMin Sketch In Action
45
Images derived from @avibryant
Find minimum of all rows
…
…
Can overestimate, 

but never underestimate
Multiple hash functions
(1 hash function per row)
Binary hash output
(1 element per column)
x 2 occurrences of 
“Top Gun” for slightly 
additional complexity
Top Gun
Top Gun
Top Gun
(x 2)
A Few

Good Men
Taps
Top Gun
(x 2)
add(Top Gun, 2)
getCount(Top Gun): Long
Use Case: TopK movies using total views
add(A Few Good Men, 1)
add(Taps, 1)
A Few

Good Men
Taps
…
…
Overlap Top Gun
Overlap A Few Good Men
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
HyperLogLog
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”
46
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
HyperLogLog (HLL)
Approximate count distinct
Slight twist

Special hash function creates uniform distribution
Error estimate

14 bits for size of range

m = 2^14 = 16,384 slots

error = 1.04/(sqrt(16,384)) = .81% 

47
Not many of these
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
HyperLogLog In Action
Use Case: Distinct number of views per movie
48
0
 32
Top Gun: Hour 2
user

2001
user
4009 
user
3002
user
7002
user
1005
user
6001
User
8001
User
8002
user
1001
user
2009 
user
3005
user
3003
Top Gun: Hour 1
user
3001
user
7009
0
 16
Uniform Distribution:
Estimate distinct # of users by 
inspecting just the beginning
Uniform Distribution:
Estimate distinct # of users 

by inspecting just the beginning
Composable: Hour 1 + 2
(lose a bit of precision)
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
49
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Locality Sensitive Hashing (LSH)
Approximate set similarity
Hash designed to cluster similar items
Avoids cartesian all-pairs comparison
Pre-process m rows into b buckets

b << m
Hash items multiple times

Similar items hash to overlapping buckets
Compare just contents of buckets

Much smaller cartesian … and parallel !!
50
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
DIMSUM
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”
51
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity

RowMatrix.columnSimiliarites(threshold)
Twitter DIMSUM Case Study

40% efficiency gain over bruce-force cosine sim
52
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Presentation Outline
  Scaling with Parallelism and Composability

  Similarity and Recommendations
  When to Approximate
  Common Algorithms and Data Structures

  Common Libraries and Tools
53
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Common Tools to Approximate
Twitter Algebird
Redis
Apache Spark
54
Composable Library
Distributed Cache
Big Data Processing
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Twitter Algebird
Rooted in Algebraic Fundamentals!
Parallel
Associative
Composable
Examples

Min, Max, Avg

BloomFilter (Set.contains(key))

HyperLogLog (Count Distinct)

CountMin Sketch (TopK Count)

55
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Redis
Implementation of HyperLogLog (Count Distinct)
12KB per item count
2^64 max # of items
0.81% error (Tunable)

Add user views for given movie

PFADD TopGun_HLL user1001 user2009 user3005

PFADD TopGun_HLL user3003 user1001

Get distinct count (cardinality) of set

PFCOUNT TopGun_HLL

Returns: 4 (distinct users viewed this movie)

56
ignore duplicates
Tunable
Union 2 HyperLogLog Data Structures
PFMERGE TopGun_HLL Taps_HLL
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Spark Approximations
Spark Core

RDD.count*Approx()
Spark SQL

PartialResult

HyperLogLogPlus

approxCountDistinct(column)
Spark ML

Stratified sampling

 
PairRDD.sampleByKey(fractions: Double[ ])

DIMSUM sampling

 
Probabilistic sampling reduces amount of comparison shuffle

 
RowMatrix.columnSimilarities(threshold)
Spark Streaming

A/B testing

 
StreamingTest.setTestMethod(“welch”).registerStream(dstream)
57
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Demos!
58
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Counting
Exact Count vs. Approx HyperLogLog, CountMin Sketch
59
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
HashSet vs. HyperLogLog
60
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
HashSet vs. CountMin Sketch
61
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Set Similarity
Exact Jaccard Similarity vs. Approx Locality Sensitive Hashing
62
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Brute Force Cartesian All Pair Similarity
63
90 mins!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
All Pairs & Locality Sensitive Hashing
64
<< 90 mins!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Many More Demos Available!
http://advancedspark.com


Download Docker 
 
 
 or Clone Github
65
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Bonus: Netflix Recommendations
From Offline DVD Ratings to Real-time Trending Now
66
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
$1 Million Netflix Prize (2006-2009)
Goal

Improve movie predictions by 10% (RMSE)

Dataset

(userId, movieId, rating, timestamp)

Test data withheld to calculate RMSE upon submission

Winning algorithm

10.06% improvement (RMSE)

Ensemble of 500+ ML 

Combined using GBDT’s 

Computationally impractical
67
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Secret to the Winning Algorithms
Adjust for the following…

Human bias

 
“Alice effect”: Alice tends to rate lower than average user

 
“Inception effect”: Inception is rated higher than average

 
“Alice-Inception effect”: Combo of Alice and Inception

Time-based bias

 
Number of days since a user’s first rating

 
Number of days since a movie’s first rating

 
Number of people who have rated a movie

 
A movie’s overall mean rating
68
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Current Netflix Recommendations
69
Throw away 
loffline-generated
user factors (U)
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
…
70
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Bonus: Netflix Search
No results? No problem… Show similar results!
Used as implicit feedback for future decision making
71
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Netflix and Data
Netflix has a lot of data about a lot of users and a lot of movies.


Netflix can use this data to buy new movies.


Netflix is global.


Netflix can use this data to choose original programming.


Netflix knows that a lot of people like Politics and Kevin Spacey.
72
The UK doesn’t have any White Castles.
So they renamed my favourite movie, 
“Harold and Kumar Get the Munchies”
(This broke all of my unit tests.)
My favorite movie, 
“Harold and Kumar Go to White Castle”
Summary: Buy NFLX Stock!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
Thank You!!
Chris Fregly @cfregly
IBM Spark Tech Center 
http://spark.tc
San Francisco, California, USA
http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me: LinkedIn, Twitter, Github, Email, Fax
73
Image derived from http://www.duchess-france.org/
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
 spark.tc
Power of data. Simplicity of design. Speed of innovation.
IBM Spark
advancedspark.com
@cfregly

Contenu connexe

Tendances

Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Chris Fregly
 
Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Chris Fregly
 
Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Chris Fregly
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016 Chris Fregly
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Chris Fregly
 
Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Chris Fregly
 
Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Chris Fregly
 
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...Athens Big Data
 
Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015Chris Fregly
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Chris Fregly
 
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Chris Fregly
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Chris Fregly
 
Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Chris Fregly
 
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Chris Fregly
 
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...Chris Fregly
 
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Chris Fregly
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Chris Fregly
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Chris Fregly
 

Tendances (20)

Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015
 
Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015Istanbul Spark Meetup Nov 28 2015
Istanbul Spark Meetup Nov 28 2015
 
Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015
 
Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016  Spark Summit East NYC Meetup 02-16-2016
Spark Summit East NYC Meetup 02-16-2016
 
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
 
Similarity at Scale
Similarity at ScaleSimilarity at Scale
Similarity at Scale
 
Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015
 
Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015
 
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
 
Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015Budapest Big Data Meetup Nov 26 2015
Budapest Big Data Meetup Nov 26 2015
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016Boston Spark Meetup May 24, 2016
Boston Spark Meetup May 24, 2016
 
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLl...
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
 
Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015Helsinki Spark Meetup Nov 20 2015
Helsinki Spark Meetup Nov 20 2015
 
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015
 
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
 
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
 
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
Practical Data Science Workshop - Recommendation Systems - Collaborative Filt...
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015
 

En vedette

Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Chris Fregly
 
Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016Chris Fregly
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Chris Fregly
 
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...Chris Fregly
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Chris Fregly
 
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...Chris Fregly
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Chris Fregly
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Chris Fregly
 
Advanced Spark and TensorFlow Meetup May 26, 2016
Advanced Spark and TensorFlow Meetup May 26, 2016Advanced Spark and TensorFlow Meetup May 26, 2016
Advanced Spark and TensorFlow Meetup May 26, 2016Chris Fregly
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Chris Fregly
 

En vedette (11)

Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
 
Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
 
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
Big Data Spain - Nov 17 2016 - Madrid Continuously Deploy Spark ML and Tensor...
 
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24...
 
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
Deploy Spark ML and Tensorflow AI Models from Notebooks to Microservices - No...
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
 
Advanced Spark and TensorFlow Meetup May 26, 2016
Advanced Spark and TensorFlow Meetup May 26, 2016Advanced Spark and TensorFlow Meetup May 26, 2016
Advanced Spark and TensorFlow Meetup May 26, 2016
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 

Similaire à USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016

Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...Chris Fregly
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL PerformanceTakuya UESHIN
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...GeeksLab Odessa
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...Anya Bida
 
Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Databricks
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 

Similaire à USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016 (16)

Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...Brussels Spark Meetup Oct 30, 2015:  Spark After Dark 1.5:  Real-time, Advanc...
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5:  Real-time, Advanc...
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL Performance
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
DataScienceLab2017_Cервинг моделей, построенных на больших данных с помощью A...
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When ...
 
Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)Just enough DevOps for Data Scientists (Part II)
Just enough DevOps for Data Scientists (Part II)
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 

Plus de Chris Fregly

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataChris Fregly
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfChris Fregly
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupChris Fregly
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedChris Fregly
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine LearningChris Fregly
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon BraketChris Fregly
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-PersonChris Fregly
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapChris Fregly
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...Chris Fregly
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Chris Fregly
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Chris Fregly
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...Chris Fregly
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...Chris Fregly
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Chris Fregly
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...Chris Fregly
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Chris Fregly
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...Chris Fregly
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...Chris Fregly
 

Plus de Chris Fregly (20)

AWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and DataAWS reInvent 2022 reCap AI/ML and Data
AWS reInvent 2022 reCap AI/ML and Data
 
Pandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdfPandas on AWS - Let me count the ways.pdf
Pandas on AWS - Let me count the ways.pdf
 
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS MeetupRay AI Runtime (AIR) on AWS - Data Science On AWS Meetup
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup
 
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds UpdatedSmokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated
 
Amazon reInvent 2020 Recap: AI and Machine Learning
Amazon reInvent 2020 Recap:  AI and Machine LearningAmazon reInvent 2020 Recap:  AI and Machine Learning
Amazon reInvent 2020 Recap: AI and Machine Learning
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
 
Quantum Computing with Amazon Braket
Quantum Computing with Amazon BraketQuantum Computing with Amazon Braket
Quantum Computing with Amazon Braket
 
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
 
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -...
 
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer...
 
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
 
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern...
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
PipelineAI + TensorFlow AI + Spark ML + Kuberenetes + Istio + AWS SageMaker +...
 

Dernier

Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfryanfarris8
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024Mind IT Systems
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 

Dernier (20)

Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 

USF Seminar Series: Apache Spark, Machine Learning, Recommendations Feb 05 2016

  • 1. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Spark and Recommendations Spark, Streaming, Machine Learning, Graph Processing, Approximations, Probabilistic Data Structures, NLP USF Seminar Series Thanks, USF!! Feb 5th, 2016 Chris Fregly Principal Data Solutions Engineer We’re Hiring! (Only Nice People) advancedspark.com!
  • 2. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Who Am I? 2 Streaming Data Engineer Netflix OSS Committer
 Data Solutions Engineer
 Apache Contributor Principal Data Solutions Engineer IBM Technology Center Meetup Organizer Advanced Apache Meetup Book Author Advanced . Due 2016
  • 3. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Advanced Apache Spark Meetup http://advancedspark.com Meetup Metrics Top 5 Most-active Spark Meetup! 2400+ Members in just 6 mos!! 2500+ Docker image downloads Meetup Mission Deep-dive into Spark and related open source projects Surface key patterns and idioms Focus on distributed systems, scale, and performance 3
  • 4. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Live, Interactive Demo!! Audience Participation Required (cell phone or laptop) 4
  • 5. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark demo.advancedspark.com End User -> ElasticSearch -> Spark ML -> Data Scientist -> 5 <- Kafka <- Spark
 Streaming <- Cassandra, Redis <- Zeppelin, iPython
  • 6. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Presentation Outline   Scaling with Parallelism and Composability   Similarity and Recommendations   When to Approximate   Common Algorithms and Data Structures   Common Libraries and Tools 6
  • 7. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Scaling with Parallelism 7 Peter O(log n) O(log n)
  • 8. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Scaling with Composability Max (a max b max c max d) == (a max b) max (c max d) Set Union (a U b U c U d) == (a U b) U (c U d) Addition (a + b + c + d) == (a + b) + (c + d) Multiply (a * b * c * d) == (a * b) * (c * d) Division?? 8
  • 9. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark What about Division? Division (a / b / c / d) != (a / b) / (c / d) (3 / 4 / 7 / 8) != (3 / 4) / (7 / 8) (((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7)) 0.134 != 0.857 9 What were the Egyptians thinking?! Not Composable “Divide like an Egyptian”
  • 10. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark What about Average? Overall AVG ( [3, 1] ((3 + 5) + (5 + 7)) 20 [5, 1] == ----------------------- == --- == 5 [5, 1] ((1 + 2) + 1) 4 [7, 1] ) 10 value count Pairwise AVG (3 + 5) (5 + 7) 8 12 20 ------- + ------- == --- + --- == --- == 10 != 5 2 2 2 2 2 Divide, Add, Divide? Not Composable Single Divide at the End? Doesn’t need to be Composable! AVG (3, 5, 5, 7) == 5 Add, Add, Add? Composable!
  • 11. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Presentation Outline   Scaling with Parallelism and Composability   Similarity and Recommendations   When to Approximate   Common Algorithms and Data Structures   Common Libraries and Tools 11
  • 12. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Similarity 12
  • 13. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Euclidean Similarity Exists in Euclidean, flat space Based on Euclidean distance Linear measure Bias towards magnitude 13
  • 14. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Cosine Similarity Angular measure Adjusts for Euclidean magnitude bias 14 Normalizes to unit vectors
  • 15. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Jaccard Similarity Set similarity measurement Set intersection / set union -> Based on Jaccard distance Bias towards popularity 15
  • 16. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Log Likelihood Similarity Adjusts for popularity bias Netflix “Shawshank” problem 16
  • 17. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Word Similarity Edit Distance Calculate char differences between words Deletes, transposes, replaces, inserts 17
  • 18. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Document Similarity TD/IDF Term Freq / Inverse Document Freq Used by most search engines Word2Vec Words embedded in vector space nearby similars 18
  • 19. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Similarity Pathway ie. Closest recommendations between 2 people 19
  • 20. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Calculating Similarity Exact Brute-Force “All-pairs similarity” aka “Pair-wise similarity”, “Similarity join” Cartesian O(n^2) shuffle and comparison Approximate Sampling Bucketing (aka “Partitioning”, “Clustering”) Remove data with low probability of similarity Reduce shuffle and comparisons 20
  • 21. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Bonus: Document Summary Text Rank aka “Sentence Rank” TF/IDF + Similarity Graph + PageRank Intuition Surface summary sentences (abstract) Most similar to all others (TF/IDF + Similarity Graph) Most influential sentences (PageRank) 21
  • 22. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Similarity Graph Vertex is movie, tag, actor, plot summary, etc. Edges are relationships and weights 22
  • 23. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Topic-Sensitive PageRank Graph diffusion algorithm Pre-process graph, add vector of probabilities to each vertex Probability of ending up at this vertex from every other vertex 23
  • 24. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Recommendations 24
  • 25. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Basic Terminology User: User seeking recommendations Item: Item being recommended Explicit User Feedback: like or rating Implicit User Feedback: search, click, hover, view, scroll Instances: Rows of user feedback/input data Overfitting: Training a model too closely to the training data & hyperparameters Hold Out Split: Holding out some of the instances to avoid overfitting Features: Columns of instance rows (of feedback/input data) Cold Start Problem: Not enough data to personalize (new) Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations) Model Evaluation: Compare predictions to actual values of hold out split Feature Engineering: Modify, reduce, combine features 25
  • 26. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Feature Engineering Dimension Reduction Reduce number of features (aka “feature space”) Principle Component Analysis (PCA) Find principle features that describe the data in terms of variance Peel the dimensional layers back until you describe the data Example: One-Hot Encoding Convert categorical feature values to 0’s, 1’s Remove any hint of a relationship between the categories Bears -> 1 Bears -> [1,0,0] 49’ers -> 2 --> 49’ers -> [0,1,0] Steelers-> 3 Steelers-> [0,0,1] 26 1 binary column per category
  • 27. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Features Binary Features: True or False Numeric Discrete Features: Integers Numeric Features: Real values Ordinal Features: Maintains order (S -> M -> L -> XL -> XXL) Temporal Features: Time-based (Time of Day, Binge Watching) Categorical Features: Finite, unique set of categories (NFL teams) 27
  • 28. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Non-Personalized Recommendations 28
  • 29. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Cold Start Problem “Cold Start” problem New user, don’t know their pref, must show them something! Movies with highest-rated actors Top K Aggregations Most desirable singles PageRank of like activity Facebook social graph Recommend friend activity 29
  • 30. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Personalized Recommendations 30
  • 31. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Clustering (aka. Nearest Neighbors) User-to-User Clustering Similar movies watched or rated Similar wiewing pattern (ie. binge or casual) Item-to-Item Clustering Similar tags/genres on movies Similar textual description (TF/IDF, Word2Vec, NLP, Image) 31 http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html!My OKCupid Profile! My Hinge Profile!
  • 32. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark User-to-Item Collaborative Filtering Matrix Factorization ①  Factor the large matrix (left) into 2 smaller matrices (right) ②  Fill in the missing values with in the large matrix 32
  • 33. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Item-to-Item Collaborative Filtering Made famous by Amazon Paper ~2003 Problem As # of users grew, Matrix Factorization couldn’t scale Solution Offline/Batch Generate itemId -> List[customerId] vectors Online/Real-time For each item in cart, recommend similar items from vector space 33
  • 34. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Presentation Outline   Scaling with Parallelism and Composability   Similarity and Recommendations   When to Approximate   Common Algorithms and Data Structures   Common Libraries and Tools 34
  • 35. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark When to Approximate? Memory or time constrained queries Relative vs. exact counts are OK (# errors between then and now) Using machine learning or graph algos Inherently probabilistic and approximate Finding topics in documents (LDA) Finding similar pairs of users, items, words at scale (LSH) Finding top influencers (PageRank) Streaming aggregations (distinct count or top k) Inherently sloppy means of collecting (at least once delivery) 35 Approximate as much as you can get away with! Ask for forgiveness later !!
  • 36. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark When NOT to Approximate? If you’ve ever heard the term… “Sarbanes-Oxley” …in-that-order, at the office, after 2002. 36
  • 37. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Presentation Outline   Scaling with Parallelism and Composability   Similarity and Recommendations   When to Approximate   Common Algorithms and Data Structures   Common Libraries and Tools 37
  • 38. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc A Few Good Algorithms 38 You can’t handle 
 the approximate!
  • 39. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Common to These Algos & Data Structs Low, fixed size in memory Known error bounds Store large amount of data Less memory than Java/Scala collections Tunable tradeoff between size and error Rely on multiple hash functions or operations Size of hash range defines error 39
  • 40. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Bloom Filter Set.contains(key): Boolean “Hash Multiple Times and Flip the Bits Wherever You Land” 40
  • 41. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Bloom Filter Approximate set membership for key False positive: expect contains(), actual !contains() True negative: expect !contains(), actual !contains() Elements only added, never removed 41
  • 42. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Bloom Filter in Action 42 set(key) contains(key): Boolean Images by @avibryant TRUE -> maybe contains FALSE -> definitely does not contain.
  • 43. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc CountMin Sketch Frequency Count and TopK “Hash Multiple Times and Add 1 Wherever You Land” 43
  • 44. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CountMin Sketch (CMS) Approximate frequency count and TopK for key ie. “Heavy Hitters” on Twitter 44 Johnny Hallyday Martin Odersky Donald Trump
  • 45. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark CountMin Sketch In Action 45 Images derived from @avibryant Find minimum of all rows … … Can overestimate, 
 but never underestimate Multiple hash functions (1 hash function per row) Binary hash output (1 element per column) x 2 occurrences of “Top Gun” for slightly additional complexity Top Gun Top Gun Top Gun (x 2) A Few
 Good Men Taps Top Gun (x 2) add(Top Gun, 2) getCount(Top Gun): Long Use Case: TopK movies using total views add(A Few Good Men, 1) add(Taps, 1) A Few
 Good Men Taps … … Overlap Top Gun Overlap A Few Good Men
  • 46. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc HyperLogLog Count Distinct “Hash Multiple Times and Uniformly Distribute Where You Land” 46
  • 47. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark HyperLogLog (HLL) Approximate count distinct Slight twist Special hash function creates uniform distribution Error estimate 14 bits for size of range m = 2^14 = 16,384 slots error = 1.04/(sqrt(16,384)) = .81% 47 Not many of these
  • 48. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark HyperLogLog In Action Use Case: Distinct number of views per movie 48 0 32 Top Gun: Hour 2 user
 2001 user 4009 user 3002 user 7002 user 1005 user 6001 User 8001 User 8002 user 1001 user 2009 user 3005 user 3003 Top Gun: Hour 1 user 3001 user 7009 0 16 Uniform Distribution: Estimate distinct # of users by inspecting just the beginning Uniform Distribution: Estimate distinct # of users 
 by inspecting just the beginning Composable: Hour 1 + 2 (lose a bit of precision)
  • 49. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Locality Sensitive Hashing Set Similarity “Pre-process Items into Buckets, Compare Within Buckets” 49
  • 50. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Locality Sensitive Hashing (LSH) Approximate set similarity Hash designed to cluster similar items Avoids cartesian all-pairs comparison Pre-process m rows into b buckets b << m Hash items multiple times Similar items hash to overlapping buckets Compare just contents of buckets Much smaller cartesian … and parallel !! 50
  • 51. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc DIMSUM Set Similarity “Pre-process Items into Buckets, Compare Within Buckets” 51
  • 52. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark DIMSUM “Dimension Independent Matrix Square Using MR” Remove vectors with low probability of similarity RowMatrix.columnSimiliarites(threshold) Twitter DIMSUM Case Study 40% efficiency gain over bruce-force cosine sim 52
  • 53. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Presentation Outline   Scaling with Parallelism and Composability   Similarity and Recommendations   When to Approximate   Common Algorithms and Data Structures   Common Libraries and Tools 53
  • 54. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Common Tools to Approximate Twitter Algebird Redis Apache Spark 54 Composable Library Distributed Cache Big Data Processing
  • 55. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Twitter Algebird Rooted in Algebraic Fundamentals! Parallel Associative Composable Examples Min, Max, Avg BloomFilter (Set.contains(key)) HyperLogLog (Count Distinct) CountMin Sketch (TopK Count) 55
  • 56. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Redis Implementation of HyperLogLog (Count Distinct) 12KB per item count 2^64 max # of items 0.81% error (Tunable) Add user views for given movie PFADD TopGun_HLL user1001 user2009 user3005 PFADD TopGun_HLL user3003 user1001 Get distinct count (cardinality) of set PFCOUNT TopGun_HLL Returns: 4 (distinct users viewed this movie) 56 ignore duplicates Tunable Union 2 HyperLogLog Data Structures PFMERGE TopGun_HLL Taps_HLL
  • 57. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Spark Approximations Spark Core RDD.count*Approx() Spark SQL PartialResult HyperLogLogPlus approxCountDistinct(column) Spark ML Stratified sampling PairRDD.sampleByKey(fractions: Double[ ]) DIMSUM sampling Probabilistic sampling reduces amount of comparison shuffle RowMatrix.columnSimilarities(threshold) Spark Streaming A/B testing StreamingTest.setTestMethod(“welch”).registerStream(dstream) 57
  • 58. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Demos! 58
  • 59. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Counting Exact Count vs. Approx HyperLogLog, CountMin Sketch 59
  • 60. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark HashSet vs. HyperLogLog 60
  • 61. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark HashSet vs. CountMin Sketch 61
  • 62. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Set Similarity Exact Jaccard Similarity vs. Approx Locality Sensitive Hashing 62
  • 63. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Brute Force Cartesian All Pair Similarity 63 90 mins!
  • 64. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark All Pairs & Locality Sensitive Hashing 64 << 90 mins!
  • 65. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Many More Demos Available! http://advancedspark.com Download Docker or Clone Github 65
  • 66. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Bonus: Netflix Recommendations From Offline DVD Ratings to Real-time Trending Now 66
  • 67. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark $1 Million Netflix Prize (2006-2009) Goal Improve movie predictions by 10% (RMSE) Dataset (userId, movieId, rating, timestamp) Test data withheld to calculate RMSE upon submission Winning algorithm 10.06% improvement (RMSE) Ensemble of 500+ ML Combined using GBDT’s Computationally impractical 67
  • 68. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Secret to the Winning Algorithms Adjust for the following… Human bias “Alice effect”: Alice tends to rate lower than average user “Inception effect”: Inception is rated higher than average “Alice-Inception effect”: Combo of Alice and Inception Time-based bias Number of days since a user’s first rating Number of days since a movie’s first rating Number of people who have rated a movie A movie’s overall mean rating 68
  • 69. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Current Netflix Recommendations 69 Throw away loffline-generated user factors (U)
  • 70. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Netflix Common ML Algorithms Logistic Regression Linear Regression Gradient Boosted Decision Trees Random Forest Matrix Factorization SVD Restricted Boltzmann Machines Deep Neural Nets Markov Models LDA Clustering … 70
  • 71. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Bonus: Netflix Search No results? No problem… Show similar results! Used as implicit feedback for future decision making 71
  • 72. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Netflix and Data Netflix has a lot of data about a lot of users and a lot of movies. Netflix can use this data to buy new movies. Netflix is global. Netflix can use this data to choose original programming. Netflix knows that a lot of people like Politics and Kevin Spacey. 72 The UK doesn’t have any White Castles. So they renamed my favourite movie, “Harold and Kumar Get the Munchies” (This broke all of my unit tests.) My favorite movie, “Harold and Kumar Go to White Castle” Summary: Buy NFLX Stock!
  • 73. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark Thank You!! Chris Fregly @cfregly IBM Spark Tech Center http://spark.tc San Francisco, California, USA http://advancedspark.com Sign up for the Meetup and Book Contribute to Github Repo Run all Demos using Docker Find me: LinkedIn, Twitter, Github, Email, Fax 73 Image derived from http://www.duchess-france.org/
  • 74. Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark advancedspark.com @cfregly