ZESTIMATE + LAMBDA ARCHITECTURE
Steven Hoelscher, Machine Learning Engineer
How we produce low-latency, high-quality home estimates
Goals of the Zestimate
• Independent
• Transparent
• High Accuracy
• Low Bias
• Stable over time
• Respond quickly to data updates
• High coverage (about 100M homes)
www.zillow.com/zestimate
In early 2015, we shared the original architecture of the
Zestimate…
…but a lot has changed
So, what’s changed?
Then (2015)
• Languages: R and Python
• Data Storage: on-prem RDBMSs
• Compute: on-prem hosts
• Framework: in-house parallelization library (ZPL)
• People: Data Analysts and Scientists
Now (2017)
• Languages: Python and R
• Data Storage: AWS Simple Storage Service (S3), Redis
• Compute: AWS Elastic MapReduce (EMR)
• Framework: Apache Spark
• People: Data Analysts, Scientists, and Engineers
Lambda Architecture
• Introduced by Nathan Marz (Apache Storm) and highlighted in his book, Big Data (2015)
• An architecture for scalable, fault-tolerant, low-latency big data systems
(Diagram: the two ends of the latency-accuracy tradeoff, labeled "Low Latency, Accuracy" and "High Latency, Accuracy".)
Latency-Accuracy Tradeoff
www.databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
>>> review_lengths.approxQuantile("lengths", quantiles, relative_error)
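For context, here is a minimal sketch of how that call trades accuracy for speed; the DataFrame is a stand-in for the review-length data in the Databricks post, and the error bounds are arbitrary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("approx-quantile-demo").getOrCreate()

# Stand-in for the review-length data used in the Databricks post.
review_lengths = spark.range(0, 1_000_000).withColumnRenamed("id", "lengths")

quantiles = [0.25, 0.5, 0.75]

# A relative error of 0.0 requests exact quantiles (expensive on large data);
# a looser bound trades accuracy for a cheaper, faster computation.
exact = review_lengths.approxQuantile("lengths", quantiles, 0.0)
approx = review_lengths.approxQuantile("lengths", quantiles, 0.05)
print(exact)
print(approx)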
High-level Lambda Architecture
• We can process new data with only a batch layer, but for computationally expensive queries the results will be out-of-date by the time they are computed
• The speed layer compensates for this lack of timeliness by computing views that are, in general, approximate
Master Data Architecture
Lock down permissions to prevent data deletes and updates!
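As a minimal sketch of what "locking down permissions" could look like with boto3; the bucket name is a placeholder and this is not necessarily the policy used in production:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-raw-master-data"  # placeholder, not the real bucket

# Versioning preserves prior object versions, guarding against overwrites.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Deny object deletion so the master dataset stays append-only.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyDeletes",
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
    }],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))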
Data is immutable
Below, we see the evolution of a home over time:
• Constructed in 2010 with 2 bedrooms and 1 bath
• A full bath was added five years later, increasing the square footage
• Finally, another bedroom was added, as well as a half-bath

PropertyId  Bedrooms  Bathrooms  SquareFootage  UpdateDate
1           2.0       1.0        1450           2010-03-13
1           2.0       2.0        1500           2015-05-15
1           3.0       2.5        1800           2016-06-24
Data is eternally true
PropertyId  Bathrooms  UpdateTime
1           2.0        2015-05-15
1           2.5        2016-06-24

PropertyId  SaleValue  SaleTime
1           450000     2015-08-19

This bathroom value would have been overwritten in our mutable data view.
This transaction in our training data would erroneously use a bathroom upgrade from the future (the join sketched below makes this concrete).
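An illustrative point-in-time ("as-of") join in pandas that attaches to each sale only the facts known at sale time; the column names follow the tables above, but the code is an illustration rather than the actual pipeline:

import pandas as pd

facts = pd.DataFrame({
    "PropertyId": [1, 1],
    "Bathrooms": [2.0, 2.5],
    "UpdateTime": pd.to_datetime(["2015-05-15", "2016-06-24"]),
})
sales = pd.DataFrame({
    "PropertyId": [1],
    "SaleValue": [450000],
    "SaleTime": pd.to_datetime(["2015-08-19"]),
})

# Each sale picks up the most recent fact at or before the sale time,
# so training never sees the 2016 half-bath addition "from the future".
training = pd.merge_asof(
    sales.sort_values("SaleTime"),
    facts.sort_values("UpdateTime"),
    left_on="SaleTime",
    right_on="UpdateTime",
    by="PropertyId",
    direction="backward",
)
print(training[["PropertyId", "SaleValue", "Bathrooms"]])  # Bathrooms == 2.0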
Batch Layer Architecture
Batch Layer Highlights
ETL
• Ingests master data
• Standardizes data across many sources
• Dedupes, cleanses, and performs sanity checks on data
• Stores partitioned training and scoring sets in Parquet format (see the sketch below)
Train
• Large memory requirements (caching training sets for various models)
Score
• Scoring set partitioned in uniform chunks for parallelization
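A hypothetical sketch of the tail end of such an ETL step: deduplicate facts, apply a simple sanity check, and write the training set as Parquet partitioned by region. Paths, column names, and partition keys are placeholders, not the production job:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("zestimate-etl-sketch").getOrCreate()

facts = spark.read.json("s3://example-raw-master-data/property_facts/")

# Keep the latest record per (PropertyId, UpdateDate) and drop obvious garbage.
latest = Window.partitionBy("PropertyId", "UpdateDate").orderBy(F.col("IngestTime").desc())
clean = (
    facts.withColumn("rank", F.row_number().over(latest))
    .filter("rank = 1")
    .drop("rank")
    .filter("SquareFootage > 100 AND SquareFootage < 50000")  # simple sanity check
)

# Partitioned Parquet output keeps training reads cheap and parallel.
(clean.repartition("RegionId")
      .write.mode("overwrite")
      .partitionBy("RegionId")
      .parquet("s3://example-transformed-data/training_set/"))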
Responding to data changes quickly
• The number one source of Zestimate error is the facts that flow into it: bedrooms, bathrooms, and square footage
• To combat data issues, we give homeowners the ability to update such facts and immediately see a change to their Zestimate
• Beyond that, we want to recalculate Zestimates when homes are listed on the market
Speed Layer Architecture: Kinesis Consumer
• The Kinesis consumer is responsible for low-latency transformations to the data
• Much of the data cleansing in the batch layer relies on a longitudinal view of the data, so we cannot afford those computations in the speed layer
• The consumer looks up pertinent property information in Redis and decides whether to update the Zestimate by calling the API
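A simplified stand-in for that consumer loop, using boto3 rather than the Kinesis Client Library the production consumer is built on; the stream name, record payload, Redis key layout, and API endpoint are all placeholders:

import json
import time

import boto3
import redis
import requests

kinesis = boto3.client("kinesis")
cache = redis.Redis(host="localhost", port=6379)

shard_iterator = kinesis.get_shard_iterator(
    StreamName="example-home-facts",        # placeholder stream
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in out["Records"]:
        update = json.loads(record["Data"])  # assumed JSON fact update
        cached = cache.hgetall(f"property:{update['PropertyId']}")
        if cached:  # only rescore homes we already hold features for
            requests.post(
                "https://example.internal/zestimate/rescore",  # placeholder API
                json={"property_id": update["PropertyId"]},
            )
    shard_iterator = out["NextShardIterator"]
    time.sleep(1)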
Speed Layer Architecture: Zestimate API
• Uses the latest pre-trained models from the batch layer to avoid costly retraining
• All property information required for scoring is stored in Redis, reusing a majority of the exact calculations from the batch layer
• Relies on sharding of pre-trained region models due to individual model memory requirements
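A rough sketch of what that scoring path could look like, assuming scikit-learn-style region models synced to local disk by the batch layer; the paths, key layout, and feature set are invented for illustration:

import pickle

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

FEATURES = ("Bedrooms", "Bathrooms", "SquareFootage", "LotSize")  # assumed feature set

def load_region_model(region_id: int):
    # Region models are sharded across hosts because no single machine can
    # hold them all; assume this host has its shard synced to disk.
    with open(f"/models/region_{region_id}.pkl", "rb") as f:
        return pickle.load(f)

def rescore(property_id: int, region_id: int) -> float:
    model = load_region_model(region_id)               # pre-trained by the batch layer
    facts = cache.hgetall(f"property:{property_id}")   # exact batch-layer feature values
    row = [float(facts[name]) for name in FEATURES]
    return float(model.predict([row])[0])              # assumes a scikit-learn-style model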
Remember: Eventual Accuracy
• The speed layer is not meant to be perfect; it’s meant to be lightning fast. Your batch layer will correct mistakes, eventually.
• As a result, we can think of the speed layer view as ephemeral

Toy Example: Square feet or Acres?
Imagine a GIS model for validating lot size by looking at a given property’s parcel and its neighboring parcels. But what happens if that model is too slow to compute in the speed layer? (A toy version of such a check is sketched below.)

PropertyId  LotSize
0           21
1           16
2           5
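As a toy illustration of the kind of cheap, approximate check the speed layer might run instead of the full GIS model; the data and the threshold are made up:

# Pre-computed neighborhood medians (in square feet) that the batch layer
# could publish; the numbers here are invented.
neighborhood_median_lot_sqft = {0: 8500, 1: 7200, 2: 9000}

def plausible_lot_size_sqft(property_id: int, reported: float) -> bool:
    """Cheap sanity check: is the reported lot size wildly out of line
    with the neighborhood median? (Arbitrary 100x threshold.)"""
    ratio = reported / neighborhood_median_lot_sqft[property_id]
    return 0.01 < ratio < 100

print(plausible_lot_size_sqft(2, 5))          # False: "5" is almost surely acres
print(plausible_lot_size_sqft(2, 5 * 43560))  # True once converted to square feet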
Serving Layer Architecture
• We still rely on our on-prem SQL Server for serving Zestimates on Zillow.com
• Reconciliation of views requires knowing when the batch layer started: if a home fact comes in after the batch layer began, we serve the speed layer’s calculation
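The reconciliation rule itself is simple enough to sketch; the timestamps and record shapes below are illustrative, not the production serving code:

from datetime import datetime

def serve_zestimate(batch_view: dict, speed_view: dict,
                    batch_started_at: datetime) -> float:
    # If the underlying home fact arrived after the batch run began, the
    # fresher speed-layer estimate wins; otherwise serve the batch value,
    # which benefited from the heavier cleansing and retraining.
    speed = speed_view.get("zestimate")
    if speed is not None and speed_view["fact_updated_at"] > batch_started_at:
        return speed
    return batch_view["zestimate"]

batch_started_at = datetime(2017, 4, 1, 2, 0)
batch_view = {"zestimate": 452000.0}
speed_view = {"zestimate": 468000.0,
              "fact_updated_at": datetime(2017, 4, 1, 9, 30)}
print(serve_zestimate(batch_view, speed_view, batch_started_at))  # 468000.0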
The Big Picture
(1) The master dataset: data is immutable and human-fault tolerant
(2) The batch layer performs the heavy lifting: cleaning and training
(3) The speed layer reduces latency and improves timeliness
(4) The serving layer reconciles views to ensure the better estimate is chosen
SO DID YOU FIX MY ZESTIMATE?
Andrew Martin, Zestimate Research Manager
Accuracy Metrics for Real-Estate Valuation
• Median Absolute Percent Error (MAPE)
  • Measures the “average” amount of error in prediction, in terms of the percentage off the correct answer in either direction
  • Measuring error in percentages is more natural for home prices since they are heteroscedastic
• Percent Error Within 5%, 10%, 20%
  • Measures how many predictions fell within +/-X% of the true value
\[
\mathrm{MAPE} = \operatorname{median}_{i \in \mathrm{Sales}} \left( \frac{\lvert \mathrm{Saleprice}_i - \mathrm{Zestimate}_i \rvert}{\mathrm{Saleprice}_i} \right)
\]
\[
\mathrm{Within}~X\% = \frac{1}{\lvert \mathrm{Sales} \rvert} \sum_{i \in \mathrm{Sales}} \mathbf{1}\!\left[ \frac{\lvert \mathrm{Saleprice}_i - \mathrm{Zestimate}_i \rvert}{\mathrm{Saleprice}_i} < X\% \right]
\]
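In code, both metrics are a few lines of pandas; the sale prices and Zestimates below are invented examples:

import pandas as pd

df = pd.DataFrame({
    "SalePrice": [300000, 450000, 520000, 275000],   # invented sales
    "Zestimate": [310000, 430000, 560000, 270000],   # invented predictions
})

pct_error = (df["SalePrice"] - df["Zestimate"]).abs() / df["SalePrice"]

mape = pct_error.median()                            # Median Absolute Percent Error
within = {x: (pct_error < x).mean() for x in (0.05, 0.10, 0.20)}

print(f"MAPE: {mape:.1%}")
print({f"within {int(x * 100)}%": f"{share:.1%}" for x, share in within.items()})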
Did you know we keep a public scorecard?
www.zillow.com/zestimate/
Comparing Accuracy at 10,000 ft
• Let’s focus on King County, WA, since the new architecture has been live here since January 2017
• We compute accuracy by using the Zestimate at the end of the month prior to when a home was sold as our prediction
  • i.e., if a home sold in Kent for $300,000 on April 10th, we’d use the Zestimate from March 31st
• We went back and recomputed Zestimates at month ends with the new architecture for all homes and months in 2016
• We compare architectures by looking at error on the same set of sales
Architecture MAPE Within 5% Within 10% Within 20%
2015 (Z5.4) 5.1% 49.0% 75.0% 92.5%
2017 (Z6) 4.5% 54.1% 81.0% 94.9%
Breaking Accuracy out by Price
(Chart: MAPE by sale-price bucket, alongside the number of sales in each bucket, comparing 2015 (Z5.4) and 2017 (Z6); the MAPE axis runs from 0% to 8% and sale counts from 0 to 6,000.)
Breaking Accuracy out by Home Type
Architecture  Home Type  MAPE  Within 5%  Within 10%  Within 20%
2015 (Z5.4)   SFR        5.1%  49.2%      74.8%       92.4%
2015 (Z5.4)   Condo      5.1%  49.5%      76.8%       93.7%
2017 (Z6)     SFR        4.5%  54.6%      81.1%       94.6%
2017 (Z6)     Condo      4.6%  53.4%      81.6%       96.0%
Think you might have an idea for how to improve the Zestimate? We’re all ears...
www.zillow.com/promo/zillow-prize
We are hiring!
• Data Scientist
• Machine Learning Engineer
• Data Scientist, Computer Vision and Deep Learning
• Software Development Engineer, Computer Vision
• Economist
• Data Analyst
www.zillow.com/jobs
Speaker notes
  1. Hi everyone, thanks for joining me here at Zillow for today’s meetup. My name is Steven Hoelscher, and I’m a machine learning engineer on the data science and engineering team. I’ve been with Zillow for 2.5 years now and had the opportunity to work on the team responsible for building and rearchitecting a new Zestimate pipeline, largely inspired by Lambda Architecture. It’s my hope that you’ll walk away from this presentation with a better understanding of what lambda architecture means and will have seen an in-production example for actually realizing it.
  2. Without further ado, let’s start with the Zestimate itself and its goals. For those who aren’t familiar, the Zestimate is simply our estimated market value for individual homes nationwide. We strive to put a Zestimate on every rooftop, just as we see in this screenshot. Every day, the Zestimate team thinks about how we can improve our algorithm, and from a data science perspective, improvement is based on whether we achieve these goals. To talk about a few: obviously, we would like our Zestimates to have high accuracy; when a home sells, it’s our goal for the Zestimate to be near that sale price. The Zestimate, as an algorithm, should also be stable over time and not exhibit erratic behavior day-to-day. The Zestimate should also be able to respond quickly to data updates. Users can supply us with more accurate data to improve our estimates, and their Zestimate should immediately reflect fact updates. In a sense, these are the goals that our pipeline must support, and we’re going to spend some more time talking about how to balance these goals in a big data system.
  3. In early 2015, right around the time I started at Zillow, a few of my colleagues presented on the Zestimate architecture…as it was then. But a lot has changed since that presentation, only just 2 years ago.
  4. At the core, the Zestimate in 2015 was largely written in R. Our team was composed of R language experts and we even built an in-house R framework for parallelization a la MapReduce. We were a smaller team back then, mostly data scientists who also had a knack for engineering. We relied on collaboration with other teams, especially our database administrators, to interface with on-premises relational databases. Two years later, we’ve made a hiring push across all skill sets and invited engineers to join the fray. Python has become the new language of choice, thanks mostly to its long history of support in Apache Spark. We started leveraging more and more cloud-based services, such as Amazon’s Simple Storage Service for storing our data and Elastic MapReduce for compute. No longer are we bottlenecked by the size of a single machine. With all of these changes, we had the opportunity to start afresh and design a system that would handle large amounts of data in the cloud, that would rely on horizontal scaling, and most importantly would meet the goals of the Zestimate.
  5. Enter Lambda Architecture. The idea of Lambda Architecture was introduced by Nathan Marz, the creator of Apache Storm. I highly recommend the book he published in 2015 with the title *Big Data*. For the uninitiated, this book provided the foundations for Lambda Architecture, with great case studies for understanding how to achieve this architecture. Simply put, Lambda Architecture is a generic data processing architecture that is horizontally scalable, fault-tolerant (in the face of both human and hardware failures), and capable of low-latency responses. Shortly, we’ll see what a high-level lambda architecture looks like. But before we dive into that, I want to talk about making a tradeoff between latency and accuracy. In some cases, we cannot expect to have low-latency responses when dealing with enormous amounts of data. As such, we have to trade off some degree of accuracy to reduce our latency. This idea underpins Lambda Architecture.
  6. Let’s look at an example, highlighted by the Databricks team. Apache Spark implements an algorithm for calculating approximate percentiles of numerical data, with a function called approxQuantile. This algorithm requires a user to specify a target error bound and the result is guaranteed to be within this bound. This algorithm can be adjusted to trade accuracy against computation time and memory. In the example here, the Databricks team studies the length of the text in each Amazon review. On the x-axis, we have the targeted residual. As we would guess, the higher the residual, the less computationally expensive our calculation becomes, but the tradeoff is accuracy.
  7. Let’s start thinking about what this means for a big data processing system. We could start simple by building a batch system with low complexity. It reads directly from a master dataset that contains all of the data so far. This batch layer, as it’s called, will virtually freeze the data at the time the job begins and start running computations. The problem is that once the batch layer finishes computing a query, the data is already out-of-date: new changes have come in and were not accounted for. This is the gap that the lambda architecture is trying to solve. We can rely on a speed layer that will compensate for the batch layer’s lack of timeliness. But the speed layer, generally speaking, cannot rely on the same algorithms that the batch layer did. In the example before, we would want our batch layer to calculate a correct and highly accurate quantile, but the speed layer should rely on approximation to be more nimble. In this way, at any given moment, we could have two different views: one view from the batch layer that is accurate but not so timely and one view from the speed layer that is less accurate but timely. Reconciling these two views, we can answer a query in a relatively accurate and timely fashion.
  8. At this point, we’re going to explore a few of the layers of the Lambda Architecture and see how we implement each layer for the Zestimate itself. To begin, we start with the data. As I mentioned before, most of our data in 2015 was only stored on premises in relational databases. Our first goal, then, was to move this data to the cloud and have new data-generating processes write directly to the cloud store. At Zillow, we use AWS S3 for our data lake / master dataset. It is optimized to handle a large, constantly growing set of data. In our case, we have a bucket specifically designated for raw data. In this design, we don’t want to actually modify or update the raw data, and I’ll talk about why we don’t want to do this in a second. As such, we set permissions on the bucket itself to prevent data deletes and updates. Any generic data-generating process is responsible for only appending new records to this object store, never deleting. Most data-generating processes are writing JSON data. We do mandate a schema contract between the producers and consumers of the data, to ensure data types are conformed to.
  9. Data is immutable. Let’s understand what this means by working through this example. We have a sample home and how it has evolved over time. In 2010, it was constructed with 2 bedrooms and 1 bathroom. Five years later, the homeowner added a full bath, therefore increasing the square footage. This was done right before selling the home a few months later in 2015. A new owner purchased the home, and nearly a year later, decided to add another bedroom and half-bath. With mutable data, this story is lost. One way of storing these attributes in a relational database would be to update records with the new attributes.
  10. Data is eternally true. Now let’s introduce the transaction that I referred to. It occurred before the number of bathrooms changed again. In our mutable data view, this transaction would have been tied with a bathroom upgrade from the future. Once we attach a timestamp to data, we ensure it is eternally true. It is eternally true that in 2015, this home had 2 bathrooms, but in 2016, a half bath was added. This story is extremely important for data scientists. And while this example may be trivial, you can imagine tying a sale value to a larger set of home facts that weren’t actually true at that point in time. Immutability of data allows us to retain this story. We’re no longer updating data, and as a benefit, we are less prone to human mistakes, especially when it comes to what all data scientists hold dear: the raw data itself.
  11. After migrating our data to AWS S3, we began work on the batch layer for the Zestimate pipeline. From a high level, the Zestimate batch layer has a few components: first, we need to make available the raw master dataset. Apache Spark allows us to read directly from S3, but some of our raw data sources suffer from the painful small-files problem in Hadoop. Simply put, big data systems expect to consume fewer large files rather than a lot of small files. Apache Spark suffers from this same problem. We rely heavily on vacuuming applications, such as Hadoop’s distcp, to aggregate data into larger files by pulling from S3 and storing the aggregates on HDFS. From there, our jobs read directly from HDFS: we begin with an ETL layer, responsible for producing training and scoring sets for our various models. Then, training and scoring takes place for about 100M homes in the nation. Models, training and scoring sets, and performance metrics are all stored in a different bucket in S3, one for transformed data. This ensures that we’re distinguishing between the raw data (our master dataset) and the data derived from the raw data.
  12. The ETL layer is responsible for interfacing with the master dataset and transforming it in order to arrive at cleaner, standardized datasets that are consumable by our Zestimate models. We have a wide variety of data sources that we deal with and so need to pull appropriate features from each to build a rich feature set. We invest a lot of time into ensuring our data is clean. As we know, garbage in, garbage out, and this holds true for the Zestimate algorithm. One example we always talk about is the case of fat fingers. You can imagine that typing 500 square feet instead of 5000 square feet could drastically change how we perceive that home’s value. This cleaning process, in addition to the partitioning required, can be very expensive computationally. This is one area where a speed layer would need to be more nimble, as it won’t be able to look at historical data to make inferences about the quality of new data. After the ETL step, we can begin training models. Training, in our case, requires large amounts of memory to support caching of training sets for various models. We train models on various geographies, making tradeoffs between data skew and volume of data available. Scoring is then done in parallel, using data partitioned in uniform chunks. At this point, we have a view created (the Zestimates for about 100M homes in the nation) as well as pre-trained models for the speed layer. But at this point, some of the facts that went into our model training and scoring could be out of date.
  13. The number one source of Zestimate error is the facts that flow into it, like bedroom and bathroom counts and square footage. We provide homeowners with a means for proactively making adjustments to their Zestimate. They can update a bathroom count or square footage and immediately see a change in their Zestimate. Beyond that, we want to recalculate Zestimates when homes are listed on the market, because in these cases an off-the-market home is updated with all of the latest facts so that it is represented accurately on the market.
  14. In lambda architecture, we want our speed layer to read from the same generic data-generating processes that our batch layer does. Amazon Kinesis (Firehose and Streams) makes it easy to both write to S3 as well as have consumers read directly from the stream. At this stage, you have the choice of which consumer to use. Spark Streaming can be used directly to enable code sharing (specifically, code relying on the Spark API) between the batch layer and the speed layer, but if Spark-specific code sharing is not a requirement, Amazon’s Kinesis Client Library (which Spark Streaming relies on) is a good solution. In our case, we built our Kinesis Consumer with just the Kinesis Client Library, for three reasons: (1) simplicity, (2) lack of Spark processing, and (3) Elastic MapReduce would be more expensive than a small Elastic Compute Cloud (EC2) instance.
  15. Steven