ZESTIMATE + LAMBDA ARCHITECTURE
Steven Hoelscher, Machine Learning Engineer
How we produce low-latency, high-quality home estimates
Goals of the Zestimate
• Independent
• Transparent
• High Accuracy
• Low Bias
• Stable over time
• Respond quickly to data updates
• High coverage (about 100M homes)
www.zillow.com/zestimate
In early 2015, we shared the original architecture of the
Zestimate…
…but a lot has changed
So, what’s changed?
Then (2015)
• Languages: R and Python
• Data Storage: on-prem RDBMSs
• Compute: on-prem hosts
• Framework: in-house parallelization library (ZPL)
• People: Data Analysts and Scientists
Now (2017)
• Languages: Python and R
• Data Storage: AWS Simple Storage Service (S3), Redis
• Compute: AWS Elastic MapReduce (EMR)
• Framework: Apache Spark
• People: Data Analysts, Scientists, and Engineers
Lambda Architecture
• Introduced by Nathan Marz (Apache Storm) and highlighted in his book, Big Data (2015)
• An architecture for scalable, fault-tolerant, low-latency big data systems
(Diagram: the two ends of the latency-accuracy tradeoff, labeled "Low Latency, Accuracy" and "High Latency, Accuracy".)
Latency-Accuracy Tradeoff
www.databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
>>> review_lengths.approxQuantile("lengths", quantiles, relative_error)
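For context, here is a minimal sketch of how that call trades accuracy for speed; the DataFrame is a stand-in for the review-length data in the Databricks post, and the error bounds are arbitrary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("approx-quantile-demo").getOrCreate()

# Stand-in for the review-length data used in the Databricks post.
review_lengths = spark.range(0, 1_000_000).withColumnRenamed("id", "lengths")

quantiles = [0.25, 0.5, 0.75]

# A relative error of 0.0 requests exact quantiles (expensive on large data);
# a looser bound trades accuracy for a cheaper, faster computation.
exact = review_lengths.approxQuantile("lengths", quantiles, 0.0)
approx = review_lengths.approxQuantile("lengths", quantiles, 0.05)
print(exact)
print(approx)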
High-level Lambda Architecture
• We can process new data with only a batch layer, but for computationally expensive queries the results will be out-of-date by the time they are computed
• The speed layer compensates for this lack of timeliness by computing views that are, in general, approximate
Master Data Architecture
Lock down permissions to prevent data deletes and updates!
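As a minimal sketch of what "locking down permissions" could look like with boto3; the bucket name is a placeholder and this is not necessarily the policy used in production:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-raw-master-data"  # placeholder, not the real bucket

# Versioning preserves prior object versions, guarding against overwrites.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Deny object deletion so the master dataset stays append-only.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyDeletes",
        "Effect": "Deny",
        "Principal": "*",
        "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
    }],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))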
Data is immutable
Below, we see the evolution of a home over time:
• Constructed in 2010 with 2 bedrooms and 1 bath
• A full bath was added five years later, increasing the square footage
• Finally, another bedroom was added, as well as a half-bath

PropertyId  Bedrooms  Bathrooms  SquareFootage  UpdateDate
1           2.0       1.0        1450           2010-03-13
1           2.0       2.0        1500           2015-05-15
1           3.0       2.5        1800           2016-06-24
Data is eternally true
PropertyId  Bathrooms  UpdateTime
1           2.0        2015-05-15
1           2.5        2016-06-24

PropertyId  SaleValue  SaleTime
1           450000     2015-08-19

This bathroom value would have been overwritten in our mutable data view.
This transaction in our training data would erroneously use a bathroom upgrade from the future (the join sketched below makes this concrete).
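An illustrative point-in-time ("as-of") join in pandas that attaches to each sale only the facts known at sale time; the column names follow the tables above, but the code is an illustration rather than the actual pipeline:

import pandas as pd

facts = pd.DataFrame({
    "PropertyId": [1, 1],
    "Bathrooms": [2.0, 2.5],
    "UpdateTime": pd.to_datetime(["2015-05-15", "2016-06-24"]),
})
sales = pd.DataFrame({
    "PropertyId": [1],
    "SaleValue": [450000],
    "SaleTime": pd.to_datetime(["2015-08-19"]),
})

# Each sale picks up the most recent fact at or before the sale time,
# so training never sees the 2016 half-bath addition "from the future".
training = pd.merge_asof(
    sales.sort_values("SaleTime"),
    facts.sort_values("UpdateTime"),
    left_on="SaleTime",
    right_on="UpdateTime",
    by="PropertyId",
    direction="backward",
)
print(training[["PropertyId", "SaleValue", "Bathrooms"]])  # Bathrooms == 2.0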
Batch Layer Architecture
Batch Layer Highlights
ETL
• Ingests master data
• Standardizes data across many sources
• Dedupes, cleanses, and performs sanity checks on data
• Stores partitioned training and scoring sets in Parquet format (see the sketch below)
Train
• Large memory requirements (caching training sets for various models)
Score
• Scoring set partitioned in uniform chunks for parallelization
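A hypothetical sketch of the tail end of such an ETL step: deduplicate facts, apply a simple sanity check, and write the training set as Parquet partitioned by region. Paths, column names, and partition keys are placeholders, not the production job:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("zestimate-etl-sketch").getOrCreate()

facts = spark.read.json("s3://example-raw-master-data/property_facts/")

# Keep the latest record per (PropertyId, UpdateDate) and drop obvious garbage.
latest = Window.partitionBy("PropertyId", "UpdateDate").orderBy(F.col("IngestTime").desc())
clean = (
    facts.withColumn("rank", F.row_number().over(latest))
    .filter("rank = 1")
    .drop("rank")
    .filter("SquareFootage > 100 AND SquareFootage < 50000")  # simple sanity check
)

# Partitioned Parquet output keeps training reads cheap and parallel.
(clean.repartition("RegionId")
      .write.mode("overwrite")
      .partitionBy("RegionId")
      .parquet("s3://example-transformed-data/training_set/"))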
Responding to data changes quickly
• The number one source of Zestimate error is the facts that flow into it: bedrooms, bathrooms, and square footage
• To combat data issues, we give homeowners the ability to update such facts and immediately see a change to their Zestimate
• Beyond that, we want to recalculate Zestimates when homes are listed on the market
Speed Layer Architecture: Kinesis Consumer
• The Kinesis consumer is responsible for low-latency transformations to the data
• Much of the data cleansing in the batch layer relies on a longitudinal view of the data, so we cannot afford those computations in the speed layer
• The consumer looks up pertinent property information in Redis and decides whether to update the Zestimate by calling the API
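A simplified stand-in for that consumer loop, using boto3 rather than the Kinesis Client Library the production consumer is built on; the stream name, record payload, Redis key layout, and API endpoint are all placeholders:

import json
import time

import boto3
import redis
import requests

kinesis = boto3.client("kinesis")
cache = redis.Redis(host="localhost", port=6379)

shard_iterator = kinesis.get_shard_iterator(
    StreamName="example-home-facts",        # placeholder stream
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    out = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in out["Records"]:
        update = json.loads(record["Data"])  # assumed JSON fact update
        cached = cache.hgetall(f"property:{update['PropertyId']}")
        if cached:  # only rescore homes we already hold features for
            requests.post(
                "https://example.internal/zestimate/rescore",  # placeholder API
                json={"property_id": update["PropertyId"]},
            )
    shard_iterator = out["NextShardIterator"]
    time.sleep(1)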
Speed Layer Architecture: Zestimate API
• Uses the latest pre-trained models from the batch layer to avoid costly retraining
• All property information required for scoring is stored in Redis, reusing a majority of the exact calculations from the batch layer
• Relies on sharding of pre-trained region models due to individual model memory requirements
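A rough sketch of what that scoring path could look like, assuming scikit-learn-style region models synced to local disk by the batch layer; the paths, key layout, and feature set are invented for illustration:

import pickle

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

FEATURES = ("Bedrooms", "Bathrooms", "SquareFootage", "LotSize")  # assumed feature set

def load_region_model(region_id: int):
    # Region models are sharded across hosts because no single machine can
    # hold them all; assume this host has its shard synced to disk.
    with open(f"/models/region_{region_id}.pkl", "rb") as f:
        return pickle.load(f)

def rescore(property_id: int, region_id: int) -> float:
    model = load_region_model(region_id)               # pre-trained by the batch layer
    facts = cache.hgetall(f"property:{property_id}")   # exact batch-layer feature values
    row = [float(facts[name]) for name in FEATURES]
    return float(model.predict([row])[0])              # assumes a scikit-learn-style model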
Remember: Eventual Accuracy
• The speed layer is not meant to be perfect; it’s meant to be lightning fast. Your batch layer will correct mistakes, eventually.
• As a result, we can think of the speed layer view as ephemeral

Toy Example: Square feet or Acres?
Imagine a GIS model for validating lot size by looking at a given property’s parcel and its neighboring parcels. But what happens if that model is too slow to compute in the speed layer? (A toy version of such a check is sketched below.)

PropertyId  LotSize
0           21
1           16
2           5
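As a toy illustration of the kind of cheap, approximate check the speed layer might run instead of the full GIS model; the data and the threshold are made up:

# Pre-computed neighborhood medians (in square feet) that the batch layer
# could publish; the numbers here are invented.
neighborhood_median_lot_sqft = {0: 8500, 1: 7200, 2: 9000}

def plausible_lot_size_sqft(property_id: int, reported: float) -> bool:
    """Cheap sanity check: is the reported lot size wildly out of line
    with the neighborhood median? (Arbitrary 100x threshold.)"""
    ratio = reported / neighborhood_median_lot_sqft[property_id]
    return 0.01 < ratio < 100

print(plausible_lot_size_sqft(2, 5))          # False: "5" is almost surely acres
print(plausible_lot_size_sqft(2, 5 * 43560))  # True once converted to square feet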
Serving Layer Architecture
• We still rely on our on-prem SQL Server for serving Zestimates on Zillow.com
• Reconciliation of views requires knowing when the batch layer started: if a home fact comes in after the batch layer began, we serve the speed layer’s calculation
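The reconciliation rule itself is simple enough to sketch; the timestamps and record shapes below are illustrative, not the production serving code:

from datetime import datetime

def serve_zestimate(batch_view: dict, speed_view: dict,
                    batch_started_at: datetime) -> float:
    # If the underlying home fact arrived after the batch run began, the
    # fresher speed-layer estimate wins; otherwise serve the batch value,
    # which benefited from the heavier cleansing and retraining.
    speed = speed_view.get("zestimate")
    if speed is not None and speed_view["fact_updated_at"] > batch_started_at:
        return speed
    return batch_view["zestimate"]

batch_started_at = datetime(2017, 4, 1, 2, 0)
batch_view = {"zestimate": 452000.0}
speed_view = {"zestimate": 468000.0,
              "fact_updated_at": datetime(2017, 4, 1, 9, 30)}
print(serve_zestimate(batch_view, speed_view, batch_started_at))  # 468000.0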
The Big Picture
(1) The master dataset: data is immutable and human-fault tolerant
(2) The batch layer performs the heavy lifting: cleaning and training
(3) The speed layer reduces latency and improves timeliness
(4) The serving layer reconciles views to ensure the better estimate is chosen
SO DID YOU FIX MY ZESTIMATE?
Andrew Martin, Zestimate Research Manager
Accuracy Metrics for Real-Estate Valuation
• Median Absolute Percent Error (MAPE)
  • Measures the “average” amount of error in prediction, in terms of the percentage off the correct answer in either direction
  • Measuring error in percentages is more natural for home prices since they are heteroscedastic
• Percent Error Within 5%, 10%, 20%
  • Measures how many predictions fell within +/-X% of the true value
\[
\mathrm{MAPE} = \operatorname{median}_{i \in \mathrm{Sales}} \left( \frac{\lvert \mathrm{Saleprice}_i - \mathrm{Zestimate}_i \rvert}{\mathrm{Saleprice}_i} \right)
\]
\[
\mathrm{Within}~X\% = \frac{1}{\lvert \mathrm{Sales} \rvert} \sum_{i \in \mathrm{Sales}} \mathbf{1}\!\left[ \frac{\lvert \mathrm{Saleprice}_i - \mathrm{Zestimate}_i \rvert}{\mathrm{Saleprice}_i} < X\% \right]
\]
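In code, both metrics are a few lines of pandas; the sale prices and Zestimates below are invented examples:

import pandas as pd

df = pd.DataFrame({
    "SalePrice": [300000, 450000, 520000, 275000],   # invented sales
    "Zestimate": [310000, 430000, 560000, 270000],   # invented predictions
})

pct_error = (df["SalePrice"] - df["Zestimate"]).abs() / df["SalePrice"]

mape = pct_error.median()                            # Median Absolute Percent Error
within = {x: (pct_error < x).mean() for x in (0.05, 0.10, 0.20)}

print(f"MAPE: {mape:.1%}")
print({f"within {int(x * 100)}%": f"{share:.1%}" for x, share in within.items()})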
Did you know we keep a public scorecard?
www.zillow.com/zestimate/
Comparing Accuracy at 10,000 ft
• Let’s focus on King County, WA, since the new architecture has been live here since January 2017
• We compute accuracy by using the Zestimate at the end of the month prior to when a home was sold as our prediction
  • i.e., if a home sold in Kent for $300,000 on April 10th, we’d use the Zestimate from March 31st
• We went back and recomputed Zestimates at month ends with the new architecture for all homes and months in 2016
• We compare architectures by looking at error on the same set of sales
Architecture MAPE Within 5% Within 10% Within 20%
2015 (Z5.4) 5.1% 49.0% 75.0% 92.5%
2017 (Z6) 4.5% 54.1% 81.0% 94.9%
Breaking Accuracy out by Price
(Chart: MAPE by sale-price bucket, alongside the number of sales in each bucket, comparing 2015 (Z5.4) and 2017 (Z6); the MAPE axis runs from 0% to 8% and sale counts from 0 to 6,000.)
Breaking Accuracy out by Home Type
Architecture  Home Type  MAPE  Within 5%  Within 10%  Within 20%
2015 (Z5.4)   SFR        5.1%  49.2%      74.8%       92.4%
2015 (Z5.4)   Condo      5.1%  49.5%      76.8%       93.7%
2017 (Z6)     SFR        4.5%  54.6%      81.1%       94.6%
2017 (Z6)     Condo      4.6%  53.4%      81.6%       96.0%
Think you might have an idea for how to improve the Zestimate? We’re all ears...
www.zillow.com/promo/zillow-prize
We are hiring!
• Data Scientist
• Machine Learning Engineer
• Data Scientist, Computer Vision and Deep Learning
• Software Development Engineer, Computer Vision
• Economist
• Data Analyst
www.zillow.com/jobs
Speaker notes
  1. Hi everyone, thanks for joining me here at Zillow for today’s meetup. My name is Steven Hoelscher, and I’m a machine learning engineer on the data science and engineering team. I’ve been with Zillow for 2.5 years now and had the opportunity to work on the team responsible for building and rearchitecting a new Zestimate pipeline, largely inspired by Lambda Architecture. It’s my hope that you’ll walk away from this presentation with a better understanding of what lambda architecture means and will have seen an in-production example for actually realizing it.
  2. Without further ado, let’s start with the Zestimate itself and its goals. For those who aren’t familiar, the Zestimate is simply our estimated market value for individual homes nationwide. We strive to put a Zestimate on every rooftop, just as we see in this screenshot. Every day, the Zestimate team thinks about how we can improve our algorithm, and from a data science perspective, improvement is based on whether we achieve these goals. To talk about a few: obviously, we would like our Zestimates to have high accuracy; when a home sells, it’s our goal for the Zestimate to be near that sale price. The Zestimate, as an algorithm, should also be stable over time and not exhibit erratic behavior day-to-day. The Zestimate should also be able to respond quickly to data updates. Users can supply us with more accurate data to improve our estimates, and their Zestimate should immediately reflect fact updates. In a sense, these are the goals that our pipeline must support, and we’re going to spend some more time talking about how to balance these goals in a big data system.
  3. In early 2015, right around the time I started at Zillow, a few of my colleagues presented on the Zestimate architecture…as it was then. But a lot has changed since that presentation, only just 2 years ago.
  4. At the core, the Zestimate in 2015 was largely written in R. Our team was composed of R language experts and we even built an in-house R framework for parallelization a la MapReduce. We were a smaller team back then, mostly data scientists who also had a knack for engineering. We relied on collaboration with other teams, especially our database administrators, to interface with on-premises relational databases. Two years later, we’ve made a hiring push across all skill sets and invited engineers to join the fray. Python has become the new language of choice, thanks mostly to its long history of support in Apache Spark. We started leveraging more and more cloud-based services, such as Amazon’s Simple Storage Service for storing our data and Elastic MapReduce for compute. No longer are we bottlenecked by the size of a single machine. With all of these changes, we had the opportunity to start afresh and design a system that would handle large amounts of data in the cloud, that would rely on horizontal scaling, and most importantly would meet the goals of the Zestimate.
  5. Enter Lambda Architecture. The idea of Lambda Architecture was introduced by Nathan Marz, the creator of Apache Storm. I highly recommend the book he published in 2015 with the title *Big Data*. For the uninitiated, this book provided the foundations for Lambda Architecture, with great case studies for understanding how to achieve this architecture. Simply put, Lambda Architecture is a generic data processing architecture that is horizontally scalable, fault-tolerant (in the face of both human and hardware failures), and capable of low-latency responses. Shortly, we’ll see what a high-level lambda architecture looks like. But before we dive into that, I want to talk about making a tradeoff between latency and accuracy. In some cases, we cannot expect to have low-latency responses when dealing with enormous amounts of data. As such, we have to trade off some degree of accuracy to reduce our latency. This idea underpins Lambda Architecture.
  6. Let’s look at an example, highlighted by the Databricks team. Apache Spark implements an algorithm for calculating approximate percentiles of numerical data, with a function called approxQuantile. This algorithm requires a user to specify a target error bound and the result is guaranteed to be within this bound. This algorithm can be adjusted to trade accuracy against computation time and memory. In the example here, the Databricks team studies the length of the text in each Amazon review. On the x-axis, we have the targeted residual. As we would guess, the higher the residual, the less computationally expensive our calculation becomes, but the tradeoff is accuracy.
  7. Let’s start thinking about what this means for a big data processing system. We could start simple by building a batch system with low complexity. It reads directly from a master dataset that contains all of the data so far. This batch layer, as it’s called, will virtually freeze the data at the time the job begins and start running computations. The problem is that once the batch layer finishes computing a query, the data is already out-of-date: new changes have come in and were not accounted for. This is the gap that the lambda architecture is trying to solve. We can rely on a speed layer that will compensate for the batch layer’s lack of timeliness. But the speed layer, generally speaking, cannot rely on the same algorithms that the batch layer did. In the example before, we would want our batch layer to calculate a correct and highly accurate quantile, but the speed layer should rely on approximation to be more nimble. In this way, at any given moment, we could have two different views: one view from the batch layer that is accurate but not so timely and one view from the speed layer that is less accurate but timely. Reconciling these two views, we can answer a query in a relatively accurate and timely fashion.
  8. At this point, we’re going to explore a few of the layers of the Lambda Architecture and see how we implement each layer for the Zestimate itself. To begin, we start with the data. As I mentioned before, most of our data in 2015 was only stored on premises in relational databases. Our first goal, then, was to move this data to the cloud and have new data-generating processes write directly to the cloud store. At Zillow, we use AWS S3 for our data lake / master dataset. It is optimized to handle a large, constantly growing set of data. In our case, we have a bucket specifically designated for raw data. In this design, we don’t want to actually modify or update the raw data, and I’ll talk about why we don’t want to do this in a second. As such, we set permissions on the bucket itself to prevent data deletes and updates. Any generic data-generating process is responsible for only appending new records to this object store, never deleting. Most data-generating processes are writing JSON data. We do mandate a schema contract between the producers and consumers of the data, to ensure data types are conformed to.
  9. Data is immutable. Let’s understand what this means by working through this example. We have a sample home and how it has evolved over time. In 2010, it was constructed with 2 bedrooms and 1 bathroom. Five years later, the homeowner added a full bath, therefore increasing the square footage. This was done right before selling the home a few months later in 2015. A new owner purchased the home, and nearly a year later, decided to add another bedroom and half-bath. With mutable data, this story is lost. One way of storing these attributes in a relational database would be to update records with the new attributes.
  10. Data is eternally true. Now let’s introduce the transaction that I referred to. It occurred before the number of bathrooms changed again. In our mutable data view, this transaction would have been tied with a bathroom upgrade from the future. Once we attach a timestamp to data, we ensure it is eternally true. It is eternally true that in 2015, this home had 2 bathrooms, but in 2016, a half bath was added. This story is extremely important for data scientists. And while this example may be trivial, you can imagine tying a sale value to a larger set of home facts that weren’t actually true at that point in time. Immutability of data allows us to retain this story. We’re no longer updating data, and as a benefit, we are less prone to human mistakes, especially when it comes to what all data scientists hold dear: the raw data itself.
  11. After migrating our data to AWS S3, we began work on the batch layer for the Zestimate pipeline. From a high level, the Zestimate batch layer has a few components: first, we need to make available the raw master dataset. Apache Spark allows us to read directly from S3, but some of our raw data sources suffer from the painful small-files problem in Hadoop. Simply put, big data systems expect to consume fewer large files rather than a lot of small files. Apache Spark suffers from this same problem. We rely heavily on vacuuming applications, such as Hadoop’s distcp, to aggregate data into larger files by pulling from S3 and storing the aggregates on HDFS. From there, our jobs read directly from HDFS: we begin with an ETL layer, responsible for producing training and scoring sets for our various models. Then, training and scoring takes place for about 100M homes in the nation. Models, training and scoring sets, and performance metrics are all stored in a different bucket in S3, one for transformed data. This ensures that we’re distinguishing between the raw data (our master dataset) and the data derived from the raw data.
  12. The ETL layer is responsible for interfacing with the master dataset and transforming it in order to arrive at cleaner, standardized datasets that are consumable by our Zestimate models. We have a wide variety of data sources that we deal with and so need to pull appropriate features from each to build a rich feature set. We invest a lot of time into ensuring our data is clean. As we know, garbage in, garbage out, and this holds true for the Zestimate algorithm. One example we always talk about is the case of fat fingers. You can imagine that typing 500 square feet instead of 5000 square feet could drastically change how we perceive that home’s value. This cleaning process, in addition to the partitioning required, can be very expensive computationally. This is one area where a speed layer would need to be more nimble, as it won’t be able to look at historical data to make inferences about the quality of new data. After the ETL step, we can begin training models. Training, in our case, requires large amounts of memory to support caching of training sets for various models. We train models on various geographies, making tradeoffs between data skew and volume of data available. Scoring is then done in parallel, using data partitioned in uniform chunks. At this point, we have a view created (the Zestimates for about 100M homes in the nation) as well as pre-trained models for the speed layer. But at this point, some of the facts that went into our model training and scoring could be out of date.
  13. The number one source of Zestimate error is the facts that flow into it, like bedroom and bathroom counts and square footage. We provide homeowners with a means for proactively making adjustments to their Zestimate. They can update a bathroom count or square footage and immediately see a change in their Zestimate. Beyond that, we want to recalculate Zestimates when homes are listed on the market, because in these cases an off-the-market home is updated with all of the latest facts so that it is represented accurately on the market.
  14. In lambda architecture, we want our speed layer to read from the same generic data-generating processes that our batch layer does. Amazon Kinesis (Firehose and Streams) makes it easy to both write to S3 as well as have consumers read directly from the stream. At this stage, you have the choice of which consumer to use. Spark Streaming can be used directly to enable code sharing (specifically, code relying on the Spark API) between the batch layer and the speed layer, but if Spark-specific code sharing is not a requirement, Amazon’s Kinesis Client Library (which Spark Streaming relies on) is a good solution. In our case, we built our Kinesis Consumer with just the Kinesis Client Library, for three reasons: (1) simplicity, (2) lack of Spark processing, and (3) Elastic MapReduce would be more expensive than a small Elastic Compute Cloud (EC2) instance.
  15. Steven