Customers are adopting Apache Spark ‒ an open-source distributed processing framework ‒ on Amazon EMR for large-scale machine learning workloads, especially for applications that power customer segmentation and content recommendation. By leveraging Spark ML, a set of machine learning algorithms included with Spark, customers can quickly build and execute massively parallel machine learning jobs. Additionally, Spark applications can train models in streaming or batch contexts, and can access data from Amazon S3, Amazon Kinesis, Amazon Redshift, and other services. This session explains how to quickly and easily create scalable Spark clusters with Amazon EMR, build and share models using Apache Zeppelin and Jupyter notebooks, and use the Spark ML pipelines API to manage your training workflow. In addition, Jasjeet Thind, Senior Director of Data Science and Engineering at Zillow Group, will discuss his organization's development of personalization algorithms and platforms at scale using Spark on Amazon EMR.
2. What to Expect from the Session
• Apache Spark and Spark ML overview
• Running Spark ML on Amazon EMR
• Interactive notebook options
• Building recommendation engines at Zillow Group
3. Spark for fast processing
[Diagram: RDD lineage DAG with map, join, filter, and groupBy operations grouped into Stages 1-3 across RDDs A-F; shaded boxes mark cached partitions]
• Massively parallel
• Uses DAGs instead of MapReduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffles
5. Spark ML addresses the full ML pipeline
- Built on top of DataFrame API
- Extract, transform, and select features
- Distributed algorithms
- Classification and Regression
- Clustering
- Collaborative Filtering
- Model selection tools
- Pipelines
Process Data → Feature Extraction → Model Training → Model Testing → Model Validation
16. Creating a Spark ML pipeline
import org.apache.spark.ml.Pipeline

// Chain the feature assembler, label indexer, and decision tree into one pipeline
val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt))
val model = pipeline.fit(df)           // fits every stage on the training DataFrame
val predictions = model.transform(df)  // runs the fitted stages end to end
Save and load machine learning models and full Pipelines
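The fit/transform contract that makes stage chaining work can be sketched in plain Python. This is a toy illustration of the pattern, not the Spark API; the Scaler stage and the Pipeline class here are hypothetical stand-ins:

```python
class Scaler:
    """Toy stage: learns a max during fit, divides by it during transform."""
    def fit(self, data):
        self.max_ = max(data)
        return self

    def transform(self, data):
        return [x / self.max_ for x in data]

class Pipeline:
    """Runs fit/transform through each stage in order, like Spark ML's Pipeline."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self  # the fitted stages together play the role of the PipelineModel

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

pipeline = Pipeline([Scaler()])
model = pipeline.fit([2.0, 4.0, 8.0])
print(model.transform([4.0]))  # [0.5]
```

Because each stage exposes the same fit/transform interface, the whole chain can be fitted, applied, and persisted as a single object.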
17. Tools to pick the right model
- CrossValidator and TrainValidationSplit select the Model produced by the best-performing set of parameters
- Split the input data into separate training and test datasets
- For each (training, test) pair, iterate through the set of ParamMaps
- Fit the Estimator using those parameters, get the fitted Model, and evaluate the Model's performance using the Evaluator
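The selection loop above can be sketched in plain Python: a toy stand-in for TrainValidationSplit with a hypothetical one-parameter model, where the "ParamMaps" are candidate slopes and the "Evaluator" is mean squared error:

```python
# Toy model: predict y = slope * x; each candidate slope is one "ParamMap".
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
test  = [(4.0, 8.1), (5.0, 9.8)]

param_grid = [1.5, 2.0, 2.5]

def mse(slope, data):
    """Evaluator: mean squared error of the fitted model on a dataset."""
    return sum((slope * x - y) ** 2 for x, y in data) / len(data)

# For each parameter setting, score on the held-out set and keep the best model:
best_slope = min(param_grid, key=lambda s: mse(s, test))
print(best_slope)  # 2.0
```

CrossValidator does the same thing over k folds instead of a single (training, test) pair, which costs k times more compute but gives a less noisy estimate.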
18. Why Amazon EMR?
• Easy to use: launch a cluster in minutes
• Low cost: pay an hourly rate
• Open-source variety: latest versions of software
• Managed: spend less time monitoring
• Secure: easy-to-manage options
• Flexible: customize the cluster
20. Spark on YARN
• Run the Spark Driver in Client or Cluster mode
• The Spark application runs as a YARN application
• SparkContext runs as a library in your program, one instance per Spark application
• Spark Executors run in YARN Containers on NodeManagers in your cluster
• Access the Spark UI through the Resource Manager or Spark History Server
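Deploy mode is chosen when the application is submitted; illustrative spark-submit invocations (the class and JAR names are placeholders):

```shell
# Cluster mode: the driver runs inside a YARN container on the cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar

# Client mode (e.g., from the EMR master node): the driver runs in your local JVM
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar
```

Cluster mode is the usual choice for production steps, since the job survives the submitting shell disconnecting; client mode is convenient for interactive work on the master node.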
23. Coming soon: advanced Spot provisioning
[Diagram: master node, core instance fleet, and task instance fleet]
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal AZ based on capacity/price
• Spot Block support
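In API terms, each fleet is described by a target capacity and a list of candidate instance types; a fragment of a hypothetical RunJobFlow request (field names follow the EMR InstanceFleets API, but every value here is an illustrative placeholder):

```json
{
  "InstanceFleets": [
    { "InstanceFleetType": "MASTER",
      "TargetOnDemandCapacity": 1,
      "InstanceTypeConfigs": [{ "InstanceType": "m4.large" }] },
    { "InstanceFleetType": "CORE",
      "TargetSpotCapacity": 4,
      "InstanceTypeConfigs": [
        { "InstanceType": "m4.xlarge",  "WeightedCapacity": 1 },
        { "InstanceType": "m4.2xlarge", "WeightedCapacity": 2 }
      ],
      "LaunchSpecifications": {
        "SpotSpecification": {
          "TimeoutDurationMinutes": 20,
          "TimeoutAction": "SWITCH_TO_ON_DEMAND"
        }
      } }
  ]
}
```

EMR fills the target capacity from whichever listed types have Spot capacity at the best price, falling back to On-Demand if the Spot request times out.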
24. Productionizing your pipeline
• Amazon EMR Step API: submit a Spark application to Amazon EMR
• AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
• AWS Lambda: use Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
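A Step API submission boils down to a step definition handed to the cluster. A minimal sketch with boto3; the cluster ID, bucket, JAR path, and class name are placeholders:

```python
# Step definition for the EMR Step API: command-runner.jar invokes
# spark-submit on the cluster's master node.
step = {
    "Name": "My Spark job",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "--deploy-mode", "cluster",
                 "--class", "com.example.MyApp", "s3://my-bucket/my-app.jar"],
    },
}

# Submitting it would look like this (needs AWS credentials, so left commented out):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
print(step["HadoopJarStep"]["Jar"])  # command-runner.jar
```

The same dict can be passed at cluster creation time or triggered from a Lambda function on a schedule or an S3 event.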
26. Agenda
Intro to Zillow Group
Recommendation Use Cases
Architecture
Algorithms
Training & Scoring Pipeline
Metrics
27. Zillow Group
Build the world's largest, most trusted, and vibrant home-related marketplace.
28. Recommendation use cases
Email - homes for sale / for rent
Home Details - homes for sale / homes like this
Personalized Search
Mobile - smart SMS and push notifications
Home owner / pre-seller predictions
Lender selection algorithm
Similar photos / video
29. Architecture
[Diagram: recommendation pipeline components]
• Recommendation API (Python, R, Flask)
• Zillow Group Data Lake (S3 / Kinesis)
• Property Featurization (Spark EMR)
• User Profiles (Spark EMR)
• Ranking (Spark EMR)
• Wedge Counting / Collaborative Filtering (Spark EMR)
• Property Aggregate Features (Spark EMR)
• Data Collection Systems (Java / Python / SQL)
30. Like vs. dislike
Predict homes per user using behavior of similar users
Like = user actively engaged with property
Dislike = user viewed property but weak engagement
[Diagram: Spencer and Stan connected by like (+) and dislike (-) edges to the $22M, $19M, and $664K homes]

Feature        Description
uid            unique id of user
pid            property id
first_visit    timestamp or 0
num_views      sigmoid(#views)
time_spent     time on page
num_contacts   # leads sent
num_saves      # saves on zpid
num_shares     # shares on zpid
num_photos     # photos viewed
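The num_views feature squashes raw view counts through a sigmoid so that heavy browsers saturate instead of dominating the feature. A minimal sketch; the slide only says sigmoid(#views), so the plain logistic form below is an assumption:

```python
import math

def sigmoid(x):
    """Logistic squash: maps any raw count into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# A user with 10 views ends up barely above one with a handful:
print(round(sigmoid(1), 4))   # 0.7311
print(round(sigmoid(10), 4))  # 1.0
```

In practice the count is often scaled before squashing so the interesting range of view counts lands on the steep part of the curve.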
31. Wedge count
For all user & property pairs to form a prediction, perform a wedge count
- http://www.jmlr.org/proceedings/papers/v18/kong12a/kong12a.pdf
Does Stan like the $19M home?
[Diagram: two example wedges through Spencer, each with signed (+/-) edges: wedge #3 (wedge03_cnt) via the $22M home and wedge #5 (wedge05_cnt) via the $664K home, both ending at the $19M home]
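Wedge counting can be sketched in pure Python. This uses toy data and an arbitrary encoding of the eight sign patterns into types 0-7; the paper linked above defines the actual wedge taxonomy:

```python
from collections import defaultdict

# Signed user->property engagement: +1 = like, -1 = dislike (toy data)
ratings = {
    ("Stan", "$22M"): +1, ("Spencer", "$22M"): -1,
    ("Stan", "$664K"): -1, ("Spencer", "$664K"): +1,
    ("Spencer", "$19M"): +1,
}

def wedge_counts(user, prop):
    """Count user -> shared property -> other user -> target property paths,
    bucketed by the sign pattern of the three edges."""
    counts = defaultdict(int)
    for (u, q), s1 in ratings.items():
        if u != user or q == prop:
            continue
        for (v, q2), s2 in ratings.items():
            if q2 != q or v == user:
                continue
            s3 = ratings.get((v, prop))
            if s3 is None:
                continue
            # Encode the sign triple (s1, s2, s3) as a wedge type 0..7
            wtype = sum(b << i for i, b in enumerate((s1 > 0, s2 > 0, s3 > 0)))
            counts[f"wedge{wtype:02d}_cnt"] += 1
    return dict(counts)

print(wedge_counts("Stan", "$19M"))  # {'wedge05_cnt': 1, 'wedge06_cnt': 1}
```

The quadratic loop is fine for a toy graph; at Zillow scale this is exactly the kind of join-and-aggregate computation Spark distributes.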
32. Classifier
Gradient Boosting Classifier (sklearn)
Popular users / properties:
- Divide wedge counts by the degree product ju * ki
Prediction for all user / property pairs; limit the candidate set by
- Top 10 zip codes
- 300 properties per user
Does Stan like the $19M home? The feature vector for (uid: Stan, pid: $19M) is:
wedge00_cnt, wedge01_cnt, wedge02_cnt, wedge03_cnt, wedge04_cnt, wedge05_cnt, wedge06_cnt, wedge07_cnt,
wedge00_norm_cnt, wedge01_norm_cnt, wedge02_norm_cnt, wedge03_norm_cnt, wedge04_norm_cnt, wedge05_norm_cnt, wedge06_norm_cnt, wedge07_norm_cnt
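The degree-product normalization is a one-liner per feature. A sketch; reading ju and ki as the user's and the property's engagement degrees is an assumption on my part, following the slide's notation:

```python
def normalize_wedges(counts, ju, ki):
    """Divide each raw wedge count by the degree product ju * ki so that
    very active users and very popular properties don't dominate."""
    return {name.replace("_cnt", "_norm_cnt"): c / (ju * ki)
            for name, c in counts.items()}

raw = {"wedge03_cnt": 6, "wedge05_cnt": 3}
print(normalize_wedges(raw, ju=3, ki=2))
# {'wedge03_norm_cnt': 1.0, 'wedge05_norm_cnt': 0.5}
```

Both the raw and normalized counts are kept as features, letting the boosted trees decide which view of popularity is informative.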
33. User profile
Signals: website, mobile app, and search queries
Binary classification
- labels (like/dislike) same as the collaborative filtering model
The user profile model determines preference scores

Feature    Categorical variables
Bath       0_bath, 0.5_bath, 1_Bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath
Bed        0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
Price      100_125_price, 125_150_price, 150_175_price
Use Code   condo, single_family, farm_land
Zipcode    zip_98109

Training rows have the form (pid, uid, features, label), with label 0 or 1 and the
categorical features above; a learned profile looks like 0_bed: 0, 1_bed: 0.01, 2_bed: 0.8, 3_bed: 0.6
34. Ranking
Property matrix: feature space same as the user profile
Dot product of the property matrix with the user profile vector
Age decay for older listings
Output is a score per (uid, pid) pair, e.g. {"uId":"10307499", "pId":"1044183744"} -> 0.3364

         0_bed 1_bed 2_bed 3_bed      uid_0
pid_0  [   1     0     0     0  ]   [ 0    ]   [ 0   ]
pid_1  [   0     0     1     0  ] x [ 0.01 ] = [ 0.8 ]
pid_2  [   1     0     0     0  ]   [ 0.8  ]   [ 0   ]
pid_3  [   0     0     0     1  ]   [ 0.6  ]   [ 0.6 ]
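The matrix-vector scoring on this slide, as a minimal pure-Python sketch using the same toy numbers:

```python
# One-hot property rows over (0_bed, 1_bed, 2_bed, 3_bed)
property_matrix = [
    [1, 0, 0, 0],  # pid_0
    [0, 0, 1, 0],  # pid_1
    [1, 0, 0, 0],  # pid_2
    [0, 0, 0, 1],  # pid_3
]
user_profile = [0, 0.01, 0.8, 0.6]  # learned preference score per feature

# Score each property by the dot product of its feature row with the profile
scores = [sum(f * w for f, w in zip(row, user_profile)) for row in property_matrix]
print(scores)  # [0, 0.8, 0, 0.6]
```

In the real pipeline an age-decay factor would then multiply each score before the per-user ranking is taken.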
35. Training & scoring
Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions.
[Diagram: training and scoring pipeline]
• User Behavior (Kinesis / S3) and Public Record (Kinesis / S3) feed the Event API (Java) and a Producer (Python)
• Filter (Spark) writes to the User Store (Hive / S3): a Spark job creates a Hive table with user events (uid, pid) partitioned by date
• Active Listings (Kinesis / S3) flow in through a Producer (Python)
• Training Data (Spark) builds the Training Set (Hive / S3): a pid -> uid reverse index plus past and current user events
• Train Models (Spark) produces the Models (Python): collaborative filtering / user profile models
• Score (Spark) combines property data with wedge features or property features (user profile) and writes Recommendations to a Hashmap (Redis)
36. Offline evaluation
Hyperparameter tuning with a validation set
Training/test data sets for model evaluation

Offline Metric  Description
Precision       rk / k, where rk = # recommended properties in the test set in the top k
Recall          rk / n, where n = total properties in the test set
Freshness       # listings recommended w/ modified date < y days old in top k
Coverage        # unique listings recommended across all users / total # unique listings
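With precision read as rk / k and recall as rk / n (the standard reading of the definitions above), the two metrics can be sketched on toy data:

```python
def precision_recall_at_k(recommended, test_set, k):
    """precision@k = r_k / k and recall@k = r_k / n, where r_k is the number
    of top-k recommendations that appear in the held-out test set (size n)."""
    top_k = recommended[:k]
    r_k = sum(1 for pid in top_k if pid in test_set)
    return r_k / k, r_k / len(test_set)

recommended = ["p1", "p2", "p3", "p4"]   # ranked output for one user
test_set = {"p2", "p4", "p9"}            # properties the user actually engaged with
print(precision_recall_at_k(recommended, test_set, k=4))
# (0.5, 0.6666666666666666)
```

Freshness and coverage follow the same pattern: count the recommended listings that satisfy a predicate (recently modified, or never yet recommended) and divide by the relevant total.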
37. Future work
Classifiers for listing descriptions
Deep learning on listing images
Structured streaming on Spark 2.0
Cross-brand user signals - Zillow, Trulia, Hotpads, & StreetEasy
Real-time scoring