© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Sr. Product Manager, Amazon EMR
Jasjeet Thind, Sr. Director, Data Science & Engineering, Zillow Group
November 29, 2016
MAC303
Zillow Group: Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark
What to Expect from the Session
• Apache Spark and Spark ML overview
• Running Spark ML on Amazon EMR
• Interactive notebook options
• Building recommendation engines at Zillow Group
Spark for fast processing
[Figure: Spark execution DAG. RDDs A through F are grouped into Stage 1, Stage 2, and Stage 3 and connected by map, join, filter, and groupBy operations; the legend distinguishes RDDs from cached partitions.]
• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffle
Spark components to match your use case
Spark ML addresses the full ML pipeline
- Built on top of DataFrame API
- Extract, transform, and select features
- Distributed algorithms
  - Classification and Regression
  - Clustering
  - Collaborative Filtering
- Model selection tools
- Pipelines
Process Data -> Feature Extraction -> Model Training -> Model Testing -> Model Validation
Extracting features in DataFrames
- Feature Extractors
  - CountVectorizer
- Feature Transformers
  - Tokenizer
  - Binarizer
  - StandardScaler
- Feature Selectors
  - VectorSlicer
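To make the feature stages listed above concrete, here is a minimal plain-Python sketch of what each one computes. It does not use the Spark ML API; the function names are hypothetical, the vocabulary is ordered alphabetically for simplicity, and the scaler both centers and scales, which is only one possible configuration.

```python
import statistics
from collections import Counter


def tokenize(text):
    """Tokenizer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def count_vectorize(docs):
    """CountVectorizer: build a vocabulary, then emit per-document term counts."""
    vocab = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def binarize(values, threshold):
    """Binarizer: 1.0 for values strictly above the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


def standard_scale(values):
    """StandardScaler (assumed config): subtract the mean, divide by sample std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def vector_slice(vector, indices):
    """VectorSlicer: keep only the features at the given indices."""
    return [vector[i] for i in indices]
```

For example, `count_vectorize([tokenize("Two bed condo"), tokenize("Condo with two bath")])` builds a five-term vocabulary and one count vector per listing description.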
Many storage layers to choose from
Amazon DynamoDB
Amazon RDS
Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR
Training data: bank loan write-off predictions
Classification algorithms in Spark ML
- Logistic regression
- Decision tree classifier
- Random forest classifier
- Gradient-boosted tree classifier
- Multilayer perceptron classifier
- One-vs-Rest classifier
- Naive Bayes
What is logistic regression?
What are decision trees?
Weather predictors for Golf
Decision trees: tree induction
Decision trees: partition data with hyperplanes
Spark ML pipelines - training
Spark ML pipelines - testing
Creating a Spark ML pipeline
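import org.apache.spark.ml.Pipeline

// assembler, indexer, and dt are pipeline stages defined elsewhere (not shown on the slide)
val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt))
val model = pipeline.fit(df)              // fit each stage in order on the DataFrame
val predictions = model.transform(df)     // apply the fitted PipelineModel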
Save and load machine learning models and full Pipelines
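The fit/transform contract behind a pipeline can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Spark's implementation: the two stages are hypothetical toy stages, and `fit` returns the pipeline itself rather than a separate fitted-model object.

```python
class MeanCenterer:
    """Toy stage: learns the mean in fit() and subtracts it in transform()."""

    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self

    def transform(self, rows):
        return [r - self.mean for r in rows]


class ThresholdClassifier:
    """Toy final stage: predicts 1 for centered values above zero, else 0."""

    def fit(self, rows):
        return self  # nothing to learn in this toy stage

    def transform(self, rows):
        return [1 if r > 0 else 0 for r in rows]


class Pipeline:
    """Chains stages: each stage is fit on the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = Pipeline([MeanCenterer(), ThresholdClassifier()]).fit([1.0, 2.0, 3.0, 6.0])
predictions = model.transform([2.0, 5.0])
```

Because the fitted stages travel together, the same preprocessing learned during training is applied unchanged at prediction time.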
Tools to pick the right model
- CrossValidator and TrainValidationSplit select the Model produced by the best-performing set of parameters
- Split the input data into separate training and test datasets
- For each (training, test) pair, iterate through the set of ParamMaps
- Fit the Estimator using those parameters, get the fitted Model, and evaluate the Model’s performance using the Evaluator
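The selection loop described above can be sketched without Spark. The following is a simplified illustration under stated assumptions: the "estimator" is a hypothetical threshold rule, the evaluator is plain accuracy, and folds are contiguous slices rather than random splits.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n rows."""
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = n if i == k - 1 else start + fold
        test = list(range(start, stop))
        train = [j for j in range(n) if j < start or j >= stop]
        yield train, test


def fit_threshold_model(params, xs, ys):
    """Toy estimator: the 'model' predicts 1 when x exceeds params['threshold']."""
    threshold = params["threshold"]
    return lambda x: 1 if x > threshold else 0


def accuracy(model, xs, ys):
    """Evaluator: fraction of rows where the prediction matches the label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


def cross_validate(param_maps, xs, ys, k=2):
    """Return the parameter map with the best mean score across the k folds."""
    best_params, best_score = None, -1.0
    for params in param_maps:
        scores = []
        for train, test in k_fold_indices(len(xs), k):
            model = fit_threshold_model(
                params, [xs[i] for i in train], [ys[i] for i in train]
            )
            scores.append(
                accuracy(model, [xs[i] for i in test], [ys[i] for i in test])
            )
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

A call such as `cross_validate([{"threshold": 0}, {"threshold": 5}], xs, ys)` returns the winning parameter map together with its mean validation score.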
Why Amazon EMR?
Easy to Use: launch a cluster in minutes
Low Cost: pay an hourly rate
Open-Source Variety: latest versions of software
Managed: spend less time monitoring
Secure: easy to manage options
Flexible: customize the cluster
Develop fast using notebooks and IDEs
Spark on YARN
• Run Spark Driver in Client or Cluster mode
• Spark application runs as a YARN application
• SparkContext runs as a library in your program, one instance per Spark application
• Spark Executors run in YARN Containers on NodeManagers in your cluster
• Access Spark UI through the Resource Manager or Spark History Server
Spark UI
Monitor your Spark jobs
Auto Scaling for data science on-demand
YARN metrics
Coming soon: advanced Spot provisioning
Master Node | Core Instance Fleet | Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal AZ based on capacity/price
• Spot Block support
Productionizing your pipeline
Amazon EMR Step API: submit a Spark application
AWS Data Pipeline; Airflow, Luigi, or other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
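As a rough illustration of submitting a Spark application through the EMR Step API, the sketch below builds a step definition in the shape accepted by boto3's `add_job_flow_steps`. The helper name, the S3 path, the cluster ID, and the application arguments are all hypothetical; the actual API call is left as a comment because it needs AWS credentials and a running cluster.

```python
def build_spark_step(name, app_path, app_args=()):
    """Build an EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", app_path, *app_args],
        },
    }


# Hypothetical application path and arguments, for illustration only.
step = build_spark_step(
    "score-recommendations",
    "s3://example-bucket/jobs/score.py",
    ["--date", "2016-11-29"],
)

# With boto3 installed and credentials configured, the step could be submitted as:
#   import boto3
#   emr = boto3.client("emr")
#   emr.add_job_flow_steps(JobFlowId="j-EXAMPLE", Steps=[step])
```

The same dictionary could be produced inside an AWS Lambda handler or a scheduler task, which is how the three options on this slide relate to one another.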
Recommendation Systems @
Zillow Group
Jasjeet Thind
Sr Director, Data Science & Engineering
Agenda
Intro to Zillow Group
Recommendation Use Cases
Architecture
Algorithms
Training & Scoring Pipeline
Metrics
Zillow Group
Build the world's largest, most trusted, and vibrant home-related marketplace.
Recommendation use cases
Email - homes for sale / for rent
Home Details - homes for sale / homes like this
Personalized Search
Mobile - smart SMS and push notifications
Home owner / pre-seller predictions
Lender selection algorithm
Similar photos / video
Architecture
Recommendation API (Python, R, Flask)
Zillow Group Data Lake (S3 / Kinesis)
Property Featurization (Spark EMR)
User Profiles (Spark EMR)
Ranking (Spark EMR)
Wedge Counting Collaborative Filtering (Spark EMR)
Property Aggregate Features (Spark EMR)
Data Collection Systems (Java/Python/SQL)
Like vs. dislike
Predict homes per user using behavior of similar users
Like = user actively engaged with property
Dislike = user viewed property but weak engagement
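[Figure: signed bipartite graph linking users Spencer and Stan to homes priced $22M, $19M, and $664K. Edges are labeled + (like) or - (dislike); one edge is marked ? as the value to predict.]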
Feature        Description
uid            unique id of user
pid            property id
first_visit    timestamp or 0
num_views      sigmoid(# views)
time_spent     time on page
num_contacts   # leads sent
num_saves      # saves on zpid
num_shares     # shares on zpid
num_photos     # photos viewed
Wedge count
To form a prediction for each user and property pair, compute wedge counts
- http://www.jmlr.org/proceedings/papers/v18/kong12a/kong12a.pdf
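Does Stan like $19M?
[Figure: two example wedges connecting Stan to the $19M home through Spencer.]
- Wedge #3 (wedge03_cnt): Spencer and Stan are linked through the $22M home (edge signs + and -); the $19M home has a + edge and the ? edge to predict
- Wedge #5 (wedge05_cnt): Spencer and Stan are linked through the $664K home (edge signs - and +); the $19M home has a + edge and the ? edge to predict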
Classifier
Gradient Boosting Classifier (sklearn)
Popular users / properties:
- Divide wedge counts by degree product ju * ki
Prediction for all user / property pairs, limit candidate set by
- Top 10 zip codes
- 300 properties per user
features
wedge00_cnt
wedge01_cnt
wedge02_cnt
wedge03_cnt
wedge04_cnt
wedge05_cnt
wedge06_cnt
wedge07_cnt
wedge00_norm_cnt
wedge01_norm_cnt
wedge02_norm_cnt
wedge03_norm_cnt
wedge04_norm_cnt
wedge05_norm_cnt
wedge06_norm_cnt
wedge07_norm_cnt
Does Stan like the $19M home? features
(uid: Stan, pid: $19M) (see right side)
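A minimal sketch of how the sixteen wedge features could be computed for one (user, property) pair. Several things here are assumptions rather than Zillow's implementation: a wedge is taken to be a three-edge path user -> other property -> other user -> target property, the wedge index uses one bit per edge (1 = like, 0 = dislike), the degree product is read as (user degree) x (property degree), and the edge signs in the toy graph are an assumed reading of the slide's figure.

```python
from collections import defaultdict


def wedge_features(edges, user, prop):
    """Return (raw counts, degree-normalized counts) for the 8 wedge types.

    edges maps (user, property) -> +1 (like) or -1 (dislike).
    """
    by_user = defaultdict(dict)  # user -> {property: sign}
    by_prop = defaultdict(dict)  # property -> {user: sign}
    for (u, p), sign in edges.items():
        by_user[u][p] = sign
        by_prop[p][u] = sign

    counts = [0] * 8
    for other_prop, s1 in by_user[user].items():
        if other_prop == prop:
            continue
        for other_user, s2 in by_prop[other_prop].items():
            if other_user == user:
                continue
            s3 = by_user[other_user].get(prop)
            if s3 is None:
                continue  # the other user has no opinion on the target property
            # Assumed encoding: one bit per edge, most significant bit first.
            index = (int(s1 > 0) << 2) | (int(s2 > 0) << 1) | int(s3 > 0)
            counts[index] += 1

    degree_product = len(by_user[user]) * len(by_prop[prop])
    normalized = [c / degree_product if degree_product else 0.0 for c in counts]
    return counts, normalized


# Toy graph; the signs are assumed, not taken from Zillow data.
edges = {
    ("Stan", "$22M"): -1,
    ("Spencer", "$22M"): +1,
    ("Spencer", "$19M"): +1,
    ("Stan", "$664K"): +1,
    ("Spencer", "$664K"): -1,
}
counts, normalized = wedge_features(edges, "Stan", "$19M")
```

Under these assumptions the two paths through Spencer land in wedge types 3 and 5, and both lists would then be fed to the classifier as the `wedgeNN_cnt` and `wedgeNN_norm_cnt` features.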
User profile
Signals - website, mobile app, and search queries
Binary classification
- labels (like/dislike) same as collab filtering model
User profile model determines preference scores
Features (categorical variables)
Bath       0_bath, 0.5_bath, 1_bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath
Bed        0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
Price      100_125_price, 125_150_price, 150_175_price
Use Code   condo, single_family, farm_land
Zipcode    zip_98109

Training rows
pid, uid   features: 0 or 1 per categorical variable   label: 0 or 1

Example user profile (preference scores)
0_bed: 0   1_bed: 0.01   2_bed: 0.8   3_bed: 0.6
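The categorical encoding above can be sketched as follows. The bucket width of 25 (in thousands of dollars) is inferred from names like `100_125_price` and is an assumption, as are the function names and the example listing.

```python
def property_features(beds, baths, price, use_code, zipcode, bucket=25_000):
    """Map raw listing attributes to categorical feature names."""
    low = int(price // bucket) * bucket // 1000  # bucket lower bound, in $K
    high = low + bucket // 1000                  # bucket upper bound, in $K
    return {
        f"{baths:g}_bath",
        f"{beds}_bed",
        f"{low}_{high}_price",
        use_code,
        f"zip_{zipcode}",
    }


def one_hot(features, vocabulary):
    """Encode a feature-name set as a 0/1 vector in vocabulary order."""
    return [1 if name in features else 0 for name in vocabulary]


# Hypothetical listing: 2 bed, 1.5 bath condo priced at $110,000 in 98109.
features = property_features(2, 1.5, 110_000, "condo", "98109")
bed_vector = one_hot(features, ["0_bed", "1_bed", "2_bed", "3_bed"])
```

Each training row then pairs such a 0/1 vector with the like/dislike label, and the fitted model's per-feature weights play the role of the preference scores in the user profile.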
Ranking
Property matrix - feature space same as user profile
Dot product of property matrix with user profile vector
Age decay for older listings
Example output
(uid, pid)                                     score
{"uId":"10307499", "pId":"1044183744"}         0.3364

Property matrix x user profile vector = scores
         0_bed  1_bed  2_bed  3_bed       uid_0       score
pid_0      1      0      0      0          0           0
pid_1      0      0      1      0     x    0.01    =   0.8
pid_2      1      0      0      0          0.8         0
pid_3      0      0      0      1          0.6         0.6
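The ranking step can be reproduced with the numbers from this slide. The dot product follows the slide directly; the age decay is only named there, so the exponential half-life form and the 30-day default below are assumptions for illustration.

```python
def rank_scores(property_matrix, user_profile, ages_days=None, half_life_days=30.0):
    """Score each property as the dot product of its row with the user profile.

    If ages_days is given, each score is multiplied by an assumed exponential
    decay so that older listings rank lower.
    """
    scores = []
    for i, row in enumerate(property_matrix):
        score = sum(x * w for x, w in zip(row, user_profile))
        if ages_days is not None:
            score *= 0.5 ** (ages_days[i] / half_life_days)
        scores.append(score)
    return scores


# Rows are pid_0..pid_3; columns are 0_bed, 1_bed, 2_bed, 3_bed.
property_matrix = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
user_profile = [0, 0.01, 0.8, 0.6]  # preference scores for uid_0

scores = rank_scores(property_matrix, user_profile)
decayed = rank_scores(property_matrix, user_profile, ages_days=[0, 30, 0, 0])
```

Without decay the scores match the slide's result vector; with the assumed decay, a 30-day-old listing keeps half of its score.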
Training & scoring
Collect user behavior and real-estate data, train the various models, generate the
candidate set, and make predictions.
[Figure: training and scoring pipeline diagram; component labels follow.]
User Behavior (Kinesis / S3)
Public Record (Kinesis / S3)
Event API (Java)
Producer (Python)
Filter (Spark)
User Store (Hive / S3)
Spark job creates Hive table with user events (uid, pid) partitioned by date
Active Listings (Kinesis / S3)
Producer (Python)
Training Data (Spark)
Training Set (Hive / S3)
pid -> uid reverse index
Past and current user events
Models (Python)
Train Models (Spark)
Score (Spark)
Recommendations
Property Data
Collaborative Filtering / User Profile Models
Hashmap (Redis)
Wedge features or property features (user profile)
Offline evaluation
Hyperparameter tuning with validation set
Training/test data sets for model evaluation
Offline Metrics   Description
Precision         r_k = # recommended properties in test set in top k
Recall            n = total properties in the test set
Freshness         # listings recommended w/ modified date < y days old in top k
Coverage          # unique listings recommended across all users / total # unique listings
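A small sketch of these offline metrics, assuming the usual definitions precision = r_k / k and recall = r_k / n (the slide defines r_k and n but the formulas themselves did not survive). Function names and the sample IDs are hypothetical.

```python
def precision_recall_at_k(recommended, test_set, k):
    """Return (r_k / k, r_k / n) for a ranked recommendation list."""
    r_k = sum(1 for pid in recommended[:k] if pid in test_set)
    return r_k / k, r_k / len(test_set)


def freshness_at_k(recommended_ages_days, k, y):
    """Count top-k recommendations whose listing was modified under y days ago."""
    return sum(1 for age in recommended_ages_days[:k] if age < y)


def coverage(recommendations_by_user, all_listings):
    """Unique listings recommended across all users / total unique listings."""
    recommended = set().union(*recommendations_by_user.values())
    return len(recommended) / len(set(all_listings))


precision, recall = precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=2)
fresh = freshness_at_k([1, 10, 3, 40], k=3, y=7)
cov = coverage({"u1": ["a", "b"], "u2": ["b", "c"]}, ["a", "b", "c", "d"])
```

Precision and recall are computed per user and would typically be averaged across users before comparing models.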
Future work
Classifiers for listing descriptions
Deep learning on listing images
Structured streaming on Spark 2.0
Cross-brand user signals - Zillow, Trulia, Hotpads, & StreetEasy
Real-time scoring
Thank you!
jonfritz@amazon.com
aws.amazon.com/emr/
aws.amazon.com/blogs/big-data/
http://www.zillow.com/data-science/
Come join us @ Zillow Group!
Hiring:
- SDE, ML, Data Scientist
- Big Data Engineer
- Analytic Engineer
- Product Management
Remember to complete your evaluations!
Related Sessions

AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark (MAC303)

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jonathan Fritz, Sr. Product Manager, Amazon EMR Jasjeet Thind, Sr. Director, Data Science & Engineering, Zillow Group November 29, 2016 MAC303 Zillow Group: Developing Classification and Recommendation Engines With Amazon EMR and Apache Spark
  • 2. What to Expect from the Session • Apache Spark and Spark ML overview • Running Spark ML on Amazon EMR • Interactive notebook options • Building recommendation engines at Zillow Group
  • 3. Spark for fast processing • Massively parallel • Uses DAGs instead of MapReduce for execution • Minimizes I/O by storing data in DataFrames in memory • Partitioning-aware to avoid network-intensive shuffle [Diagram: RDDs A to F flowing through map, filter, groupBy, and join across Stages 1 to 3; the legend marks RDDs and cached partitions]
  • 4. Spark components to match your use case
  • 5. Spark ML addresses the full ML pipeline - Built on top of DataFrame API - Extract, transform, and select features - Distributed algorithms - Classification and Regression - Clustering - Collaborative Filtering - Model selection tools - Pipelines [Diagram stages: Process Data, Feature Extraction, Model Training, Model Testing, Model Validation]
  • 6. Extracting features in DataFrames - Feature Extractors - CountVectorizer - Feature Transformers - Tokenizer - Binarizer - StandardScaler - Feature Selectors - VectorSlicer
  • 7. Many storage layers to choose from Amazon DynamoDB Amazon RDS Amazon Kinesis Amazon Redshift Amazon S3 Amazon EMR
  • 9. Classification algorithms in Spark ML - Logistic regression - Decision tree classifier - Random forest classifier - Gradient-boosted tree classifier - Multilayer perceptron classifier - One-vs-Rest classifier - Naive Bayes
  • 10. What is logistic regression?
  • 11. What are decision trees? Weather predictors for Golf
  • 12. Decision trees: tree induction
  • 13. Decision trees: partition data with hyperplanes
  • 14. Spark ML pipelines - training
  • 15. Spark ML pipelines - testing
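  • 16. Creating a Spark ML pipeline
      import org.apache.spark.ml.Pipeline
      // assembler, indexer, and dt are pipeline stages assumed to be defined earlier (not shown on the slide)
      val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt))
      val model = pipeline.fit(df)             // fits each stage in order
      val predictions = model.transform(df)    // runs the fitted stages over df
    Save and load machine learning models and full Pipelines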
  • 17. Tools to pick the right model - CrossValidator and TrainValidationSplit select the Model produced by the best-performing set of parameters - Split the input data into separate training and test datasets - For each (training, test) pair, iterate through the set of ParamMaps - Fit the Estimator using those parameters, get the fitted Model, and evaluate the Model’s performance using the Evaluator
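The selection loop described on this slide can be sketched in plain Python. This is not the Spark API: `k_fold_splits`, `cross_validate`, and the toy threshold model are hypothetical names made up for illustration, and the parameter grid is a plain dict standing in for a set of ParamMaps.

```python
from itertools import product
from statistics import mean


def k_fold_splits(rows, k):
    """Yield (training, test) pairs, holding out one fold at a time."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]


def cross_validate(rows, fit, evaluate, param_grid, k=3):
    """Return (best_params, best_score) over every parameter combination."""
    keys = sorted(param_grid)
    best = None
    for values in product(*(param_grid[key] for key in keys)):
        params = dict(zip(keys, values))
        # Fit on each training fold and score on the matching held-out fold.
        scores = [evaluate(fit(train, params), test)
                  for train, test in k_fold_splits(rows, k)]
        avg = mean(scores)
        if best is None or avg > best[0]:
            best = (avg, params)
    return best[1], best[0]


# Toy estimator (an assumption): predict 1 when x >= threshold.
def fit(train, params):
    return lambda x: int(x >= params["threshold"])


# Toy evaluator: accuracy on the held-out rows.
def evaluate(model, test):
    return mean(1.0 if model(x) == y else 0.0 for x, y in test)


rows = [(x, int(x >= 5)) for x in range(10)]
best_params, best_score = cross_validate(rows, fit, evaluate,
                                         {"threshold": [2, 5, 8]})
```

In Spark ML, CrossValidator additionally refits the Estimator on the full dataset with the winning parameters; the sketch stops at choosing them.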
  • 18. Why Amazon EMR? • Easy to use: launch a cluster in minutes • Low cost: pay an hourly rate • Open-source variety: latest versions of software • Managed: spend less time monitoring • Secure: easy-to-manage options • Flexible: customize the cluster
  • 19. Develop fast using notebooks and IDEs
  • 20. Spark on YARN • Run Spark Driver in Client or Cluster mode • Spark application runs as a YARN application • SparkContext runs as a library in your program, one instance per Spark application • Spark Executors run in YARN Containers on NodeManagers in your cluster • Access Spark UI through the Resource Manager or Spark History Server [Screenshot: Spark UI]
  • 22. Auto Scaling for data science on-demand YARN metrics
  • 23. Coming soon: advanced Spot provisioning Master Node Core Instance Fleet Task Instance Fleet • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal AZ based on capacity/price • Spot Block support
  • 24. Productionizing your pipeline • Amazon EMR Step API: submit a Spark application • AWS Data Pipeline, Airflow, Luigi, or other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows • AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
  • 25. Recommendation Systems @ Zillow Group Jasjeet Thind Sr Director, Data Science & Engineering
  • 26. Agenda • Intro to Zillow Group • Recommendation Use Cases • Architecture • Algorithms • Training & Scoring Pipeline • Metrics
  • 27. Zillow Group Build the world's largest, most trusted, and vibrant home-related marketplace.
  • 28. Recommendation use cases • Email - homes for sale / for rent • Home Details - homes for sale / homes like this • Personalized Search • Mobile - smart SMS and push notifications • Home owner / pre-seller predictions • Lender selection algorithm • Similar photos / video
  • 29. Architecture [Diagram components: RECOMMENDATION API (Python, R, Flask) • Zillow Group Data Lake (S3 / Kinesis) • Property Featurization (Spark EMR) • User Profiles (Spark EMR) • Ranking (Spark EMR) • Wedge Counting Collaborative Filtering (Spark EMR) • Property Aggregate Features (Spark EMR) • Data Collection Systems (Java/Python/SQL)]
  • 30. Like vs. dislike Predict homes per user using behavior of similar users • Like = user actively engaged with property • Dislike = user viewed property but weak engagement [Diagram: users Spencer and Stan linked to homes priced $22M, $19M, and $664K by like (+) and dislike (-) edges; one edge is unknown (?)]
      Feature        Description
      uid            unique id of user
      pid            property id
      first_visit    timestamp or 0
      num_views      sigmoid(#views)
      time_spent     time on page
      num_contacts   # leads sent
      num_saves      # saves on zpid
      num_shares     # shares on zpid
      num_photos     # photos viewed
  • 31. Wedge count To form a prediction for each user and property pair, perform a wedge count (http://www.jmlr.org/proceedings/papers/v18/kong12a/kong12a.pdf). Example: does Stan like $19M? [Diagram: two example wedges linking Spencer and Stan. Wedge #3 (wedge03_cnt): homes $22M and $19M, edge signs +, -, +, ?. Wedge #5 (wedge05_cnt): homes $664k and $19M, edge signs -, +, +, ?.]
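A rough sketch of how such counts might be computed, written in plain Python rather than Spark. Several things here are assumptions rather than facts from the slides: a wedge is taken to be a three-edge path target user - home - other user - target home; the wedge number is taken to be a 3-bit code of the three edge signs (like = 1, dislike = 0); and the toy graph is one plausible reading of the Spencer/Stan diagram. The function name `wedge_counts` is hypothetical.

```python
from collections import defaultdict


def wedge_counts(edges, user, prop):
    """Count paths user - h - v - prop, bucketed by edge signs.

    edges maps (uid, pid) -> 1 (like) or 0 (dislike).
    Assumed encoding: index = sign(user,h)*4 + sign(v,h)*2 + sign(v,prop).
    """
    by_user = defaultdict(dict)
    by_prop = defaultdict(dict)
    for (u, p), sign in edges.items():
        by_user[u][p] = sign
        by_prop[p][u] = sign

    counts = [0] * 8
    for h, s_uh in by_user[user].items():
        if h == prop:
            continue
        for v, s_vh in by_prop[h].items():
            if v == user:
                continue
            s_vp = by_user[v].get(prop)
            if s_vp is None:
                continue  # v has no opinion on the target home
            counts[(s_uh << 2) | (s_vh << 1) | s_vp] += 1
    return counts


# Toy graph (assumed reading of the slide diagram).
edges = {
    ("Spencer", "$22M"): 1,
    ("Stan", "$22M"): 0,
    ("Spencer", "$19M"): 1,
    ("Spencer", "$664K"): 0,
    ("Stan", "$664K"): 1,
}
stan_19m = wedge_counts(edges, "Stan", "$19M")
```

Under this assumed encoding the toy graph yields one wedge of type 3 and one of type 5 for the (Stan, $19M) pair, matching the two wedge numbers shown on the slide.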
  • 32. Classifier Gradient Boosting Classifier (sklearn) • Popular users / properties: divide wedge counts by degree product ju * ki • Prediction for all user / property pairs; limit candidate set by top 10 zip codes and 300 properties per user • Features: wedge00_cnt, wedge01_cnt, wedge02_cnt, wedge03_cnt, wedge04_cnt, wedge05_cnt, wedge06_cnt, wedge07_cnt, wedge00_norm_cnt, wedge01_norm_cnt, wedge02_norm_cnt, wedge03_norm_cnt, wedge04_norm_cnt, wedge05_norm_cnt, wedge06_norm_cnt, wedge07_norm_cnt • Example: does Stan like the $19M home? The features for (uid: Stan, pid: $19M) are the 16 values listed above
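The normalization step can be illustrated with a small helper. This is a sketch under assumptions: the "degree product" is read as (number of properties the user has interacted with) times (number of users who have interacted with the property), and `wedge_features` is a hypothetical name. Only the feature names come from the slide.

```python
def wedge_features(counts, user_degree, prop_degree):
    """Build the 16 classifier features from 8 raw wedge counts.

    Normalized counts divide by the degree product so that very active
    users and very popular properties do not dominate.
    """
    denom = user_degree * prop_degree
    features = {}
    for i, count in enumerate(counts):
        features[f"wedge{i:02d}_cnt"] = count
        features[f"wedge{i:02d}_norm_cnt"] = count / denom if denom else 0.0
    return features


# Illustrative numbers only (not taken from the deck).
example = wedge_features([0, 0, 0, 1, 0, 1, 0, 0], user_degree=2, prop_degree=1)
```

The resulting dict could then be turned into a feature row for a gradient boosting classifier such as the sklearn one named on the slide.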
  • 33. User profile Signals - website, mobile app, and search queries • Binary classification - labels (like/dislike) same as collab filtering model • User profile model determines preference scores
      Features (categorical variables)
      Bath       0_bath, 0.5_bath, 1_Bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath
      Bed        0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
      Price      100_125_price, 125_150_price, 150_175_price
      Use Code   condo, single_family, farm_land
      Zipcode    zip_98109
      Training row columns: pid, uid, features (0 or 1 each), label (0 or 1)
      Example preference scores: 0_bed: 0, 1_bed: 0.01, 2_bed: 0.8, 3_bed: 0.6
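As a sketch of how a listing might be mapped onto these categorical features: the function below is hypothetical, and it assumes that price buckets are 25-wide in thousands of dollars (inferred from names like 100_125_price, not stated on the slide).

```python
def property_features(bed, bath, price, use_code, zipcode):
    """Return the set of categorical feature names that are 1 for a listing."""
    low = (price // 25_000) * 25  # assumed: buckets of $25K, named in thousands
    return {
        f"{bath:g}_bath",
        f"{bed}_bed",
        f"{low}_{low + 25}_price",
        use_code,
        f"zip_{zipcode}",
    }


# Illustrative listing (values are made up).
listing = property_features(bed=3, bath=2.0, price=110_000,
                            use_code="condo", zipcode="98109")
```

Every feature not in the returned set is 0 for that listing, which gives the 0-or-1 feature row described on the slide.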
  • 34. Ranking Property matrix - feature space same as user profile • Dot product of property matrix with user profile vector • Age decay for older listings • Example output: (uid, pid) {"uId":"10307499", "pId":"1044183744"}, score 0.3364
               0_bed 1_bed 2_bed 3_bed      uid_0
      pid_0      1     0     0     0         0           0
      pid_1      0     0     1     0    ·    0.01    =   0.8
      pid_2      1     0     0     0         0.8         0
      pid_3      0     0     0     1         0.6         0.6
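The scoring step can be sketched as a dot product per listing followed by a decay factor. The matrix and profile values are the ones on the slide; the exponential half-life form of the age decay, the 30-day default, and the name `rank_properties` are assumptions, since the deck does not say how decay is computed.

```python
def rank_properties(property_matrix, user_profile, ages_days=None,
                    half_life_days=30.0):
    """Score each property by dot product with the user profile vector.

    If ages_days is given, scores are halved every half_life_days
    (an assumed decay form).
    """
    scores = {}
    for pid, row in property_matrix.items():
        score = sum(x * w for x, w in zip(row, user_profile))
        if ages_days is not None:
            score *= 0.5 ** (ages_days[pid] / half_life_days)
        scores[pid] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Columns: 0_bed, 1_bed, 2_bed, 3_bed (values from the slide).
property_matrix = {
    "pid_0": [1, 0, 0, 0],
    "pid_1": [0, 0, 1, 0],
    "pid_2": [1, 0, 0, 0],
    "pid_3": [0, 0, 0, 1],
}
uid_0 = [0, 0.01, 0.8, 0.6]

ranked = rank_properties(property_matrix, uid_0)
# Same scores, but pid_1 is assumed to be one half-life old.
decayed = rank_properties(property_matrix, uid_0,
                          ages_days={"pid_0": 0, "pid_1": 30,
                                     "pid_2": 0, "pid_3": 0})
```

Without decay the two-bedroom listing pid_1 ranks first; once it is a half-life old its score halves and the three-bedroom listing pid_3 overtakes it.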
  • 35. Training & scoring Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions. [Pipeline diagram labels: User Behavior (Kinesis / S3) • Public Record (Kinesis / S3) • Event API (Java) • Producer (Python) • Filter (Spark) • User Store (Hive / S3) • Spark job creates Hive table with user events (uid, pid) partitioned by date • Active Listings (Kinesis / S3) • Producer (Python) • Training Data (Spark) • Training Set (Hive / S3) • pid -> uid reverse index • Past and current user events • Models (Python) • Train Models (Spark) • Score (Spark) • Recommendations • Property Data • Collaborative Filtering / User Profile Models • Hashmap (Redis) • Wedge features or property features (user profile)]
  • 36. Offline evaluation • Hyperparameter tuning with validation set • Training/test data sets for model evaluation
      Offline metric   Description
      Precision        rk = # recommended properties in test set in top k
      Recall           n = total properties in the test set
      Freshness        # listings recommended w/ modified date < y days old in top k
      Coverage         # unique listings recommended across all users / total # unique listings
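A minimal sketch of three of these metrics. The slide's formulas did not survive in the transcript, so the usual definitions are assumed here: precision at k as rk / k and recall at k as rk / n. Function names and sample data are hypothetical.

```python
def precision_recall_at_k(recommended, test_set, k):
    """recommended: ranked list of pids; test_set: pids the user engaged with."""
    r_k = sum(1 for pid in recommended[:k] if pid in test_set)
    precision = r_k / k if k else 0.0
    recall = r_k / len(test_set) if test_set else 0.0
    return precision, recall


def coverage(recs_by_user, all_listings):
    """Share of all unique listings recommended to at least one user."""
    recommended = set()
    for pids in recs_by_user.values():
        recommended.update(pids)
    return len(recommended & set(all_listings)) / len(all_listings)


# Made-up example data.
precision, recall = precision_recall_at_k(
    ["p1", "p2", "p3", "p4"], {"p2", "p4", "p9"}, k=2)
cov = coverage({"u1": ["p1", "p2"], "u2": ["p2", "p3"]},
               ["p1", "p2", "p3", "p4"])
```

Freshness could be computed the same way as precision, counting listings in the top k whose modified date falls within the chosen window.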
  • 37. Future work • Classifiers for listing descriptions • Deep learning on listing images • Structured streaming on Spark 2.0 • Cross-brand user signals - Zillow, Trulia, Hotpads, & StreetEasy • Real-time scoring
  • 38. Thank you! jonfritz@amazon.com aws.amazon.com/emr/ aws.amazon.com/blogs/big-data/ http://www.zillow.com/data-science/ Come join us @ Zillow Group! Hiring: - SDE, ML, Data Scientist - Big Data Engineer - Analytic Engineer - Product Management