DevOps and Machine Learning: How do you test and deploy real-time machine learning services given the challenge that machine learning algorithms produce nondeterministic behaviors even for the same input.
Tata AIG General Insurance Company - Insurer Innovation Award 2024
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
1. DevOps and Machine Learning
Jasjeet Thind
VP, Data Science & Engineering, Zillow Group
@JasjeetThind
2. Agenda
Overview of Zillow Group (ZG)
Machine Learning (ML) at ZG
Architecture
DevOps for ML
@JasjeetThind
3. Zillow Group Composed of 9 Brands
Build the world's largest, most trusted and vibrant home-related marketplace.
@JasjeetThind
RealEstate.com
4. Machine Learning at ZG
Personalization
Ad Targeting
Zestimate (AVMs)
Premier Agent (B2B)
Mortgages
Deep Learning
Demographics & Community
Business Analytics
Forecasting home price trends
@JasjeetThind
6. DevOps & ML at ZG
Tight collaboration, fast iterations, and lots of automation.
CONTINUOUS INTEGRATION (CI) CONTINUOUS DELIVERY (CD)
@JasjeetThind
Research Dev Test Ship ML Workflow
Real-time
ML
7. Research
Data scientists prototype features and models to solve business problems
Models
● Random Forest
● Gradient Boosted Machines
● CNN / Deep Learning
● NLP / TF-IDF / Word2vec / Bag of Words
@JasjeetThind
● Linear Regression
● K-means clustering
● Collaborative Filtering
8. DevOps Challenges in Research
(1) Data scientists iterate through lots of prototype code, models, features and datasets
that never get into production
(2) How do data scientists validate experiments on production system?
@JasjeetThind
9. How to apply DevOps to experiments?
Check-in prototypes
● Organize around research area
● Document heavily. Experiments often revisited
● Check-in subsets of training & scoring datasets
For large training datasets, store in the data lake
● Pointers to data in README.md
@JasjeetThind
10. DevOps enables experiments in production
Build tools for scientists to package and deploy ML models and features into
production A/B tests
● Once submitted via the tool the model goes through the same release / validation
pipeline as ML production models go through.
Real world prototyping can be done before production quality ML code.
@JasjeetThind
11. DEV
ML engineers operationalize features and models for scale.
Automatically build and upload artifacts
ML
Developer
Machine
Git Jenkins
- build
- unit tests
- upload artifacts
AWS EMR
Build
Cluster
Local
Artifact
Repository
AWS Cloud
Repository
(S3)
- AWS S3- Squid proxy
Upload
Artifact
GIT
push
Trigger
Check-in unit tests,
source code
Run unit tests to test
machine learning
models
@JasjeetThind
12. DevOps ML Challenges in Dev
(1) Unit tests non-trivial as model outputs not deterministic
output = f (inputs)
(2) CI requires fast builds. But accurate predictions require large training data sets, and
training models is time consuming.
@JasjeetThind
13. Unit Tests w/ ML
Don’t test “real” model outputs.
● Leave that to integration /
regression tests
● Focus on code coverage, build
and unit tests are fast.
Train models w/ mock training data
● Test schema, type, # records,
expected values
Predictions on mock scoring data sets
// load mock small training dataset
training_data = read(“training_small.csv”)
// model trained on mock training data
model = train(training_data)
// schema test
assertHasColumn(model.trainingData, “zipcode”)
// type test
assertIsInstance(model.trainingData, Vector)
// # of records test
assertEqual(model.trainingData.size, 32)
// expected value test
assertEqual(model.trainingData[0].key, 207)
// load mock small scoring dataset
predictions = score(model, scoring_data)
@JasjeetThind
14. Mock training data facilitates fast builds
Model prediction accuracy tested in the Test phase
@JasjeetThind
15. TEST
After artifact uploaded by build process, deploy and auto-trigger its regression tests
DevOps
API
Jenkins
Deployment
ShipIt
- REST API to to
deploy packages
- Hosts config
Jenkins host that
deploys & saves history
Upload
new job
AWS EMR
Regression
Cluster
Tool that executes
deployments
Unified
Versioning
SystemPin
@JasjeetThind
Kick off regression tests after deployment
16. DevOps ML Challenges in Test
(1) Training and scoring models can take many hours or more
(2) Models have very large memory footprint
(3) Regression tests non-trivial as model outputs not deterministic
(4) Real time model outputs vs that of backend systems
@JasjeetThind
17. How to train and score in Test?
For regression testing, create “golden” datasets, smaller subsets of training and scoring
data, for training and scoring models.
Makes training and scoring time much faster
@JasjeetThind
18. Reducing Model Memory Footprint in Test
Model dynamically generated using golden datasets
Model itself not in GIT, treated as binary
@JasjeetThind
19. Regression Tests and ML
Adjust with fuzz factor
● May need to run multiple
times and average results
Model validation tests to
ensure model metrics meet
minimum bar
● More on this later
// load “real” subset training data from data lake
training_data = read(“training_golden_dataset.csv”)
// model trained on golden training data
model = train(training_data)
// generate model output
scoring_data = read(“scoring_golden_dataset.csv”)
predictions = score(model, scoring_data)
// fuzz factor
id = 5
zestimate = predictions[id]
assertAlmostEqual(zestimate, 238000, delta=7140)
@JasjeetThind
20. Verify Real-time Matches Backend
Real-time machine learning services thousands / millions of concurrent requests
// get model output from web service API for scoring row
id = 5
prediction1 = getFromMLApi(id)
// get model output from batch pipeline for same scoring row
prediction2 = getFromBatchSystem(id)
// get model output from near real time pipeline for same scoring row
prediction3 = getFromNRTSystem(id)
// verify the values are equal
assertAlmostEqual(prediction1, prediction2, prediction3, delta=7140)
@JasjeetThind
21. SHIP
Automated deployment of ML code to production
DevOps
API
Jenkins
Deployment
ShipIt
Upload
new job
AWS EMR
Production
Cluster
@JasjeetThind
22. DevOps ML Challenges in Ship
(1) Deploy starts during model training and scoring
(2) Backward compatibility issues with code and models for real-time ML API
@JasjeetThind
23. 4 Ways to Handle Deploy during Training
Terminate ML workflow and start the deploy
Wait for ML workflow to finish and start the deploy
Deploy at certain times (e.g. once per day at midnight)
Deploy to another cluster, shutdown the previous cluster
@JasjeetThind
24. Shipping real-time ML services
Complete successful run of batch pipeline.
● Then ship code and models to web services hosts.
● More on deployment to web services hosts later
@JasjeetThind
25. ML Workflow
Production ML backends generate models and predictions.
@JasjeetThind
Raw Data
Sources
Data
Collection
Generate
datasets
and
features
Model
Learning
Evaluate
Model
metrics
Scoring Deployment
26. DevOps Challenges w/ML Workflow
(1) Data quality given high velocity petabyte scale data continually being pushed into
the system
(2) Ensuring models aren’t stale as data continues to evolve
(3) Partial failures due to intermediate model training pipelines failing
(4) Model deployments don’t cause bad user experience
@JasjeetThind
27. Build Analytical Systems for Data Quality
Monitor
● what is expected data
● outlier detection
● missing data
● expected # of records
● latency
Reports / alerts that drive action
@JasjeetThind
Consume
Generate
Metrics
Models Data
Quality DB
Historical
Metrics
Write
Alert
for
Issues
Raw Data
Sources
Data Quality Architecture
28. Automate ML to Keep Models Fresh
Run continuously or frequently
Workflow engine
● Oozie, Airflow, Azkaban, SWF, etc
Logging, monitoring, and alerts throughout the pipeline
@JasjeetThind
29. Persist Intermediate Train Data for Failures
Breakup ML pipelines into appropriate abstraction layers
Leverage Spark which can recover from intermediate failures by replaying the DAG
@JasjeetThind
30. Model validation tests to verify quality
Don’t replace existing model if one or more metrics fail. Trigger alert instead.
Replace if all tests pass.
if (mean_squared_error(y_actual, y_pred) > 3) // mean squared error
alert()
else if (median_error(y_actual, y_pred) > 6) // median error
alert()
else if (auc(y_actual, y_pred)) // area under curve
alert()
else
deploy_models() // all model metrics passed so replace old models
@JasjeetThind
31. Real-time ML
Models need to be deployed into production
Models need to be executed per request with concurrent users under strict SLA
@JasjeetThind
32. DevOps ML Challenges w/ Real-time ML
(1) Deployment of models can take many seconds or minutes
(2) Latency to execute ML model for each request and return responses must meet
strict SLA (< 100 milliseconds)
(3) Need to ensure features at serving time same as backend systems
@JasjeetThind
33. Automate Deployment of Models
Replicate hosts
During deploy, take host offline, update models, put back into rotation, repeat for
replicated hosts.
For small models, do in-memory swap
@JasjeetThind
34. CPU is Bottleneck for Real-time ML
Real-time feature generation requires data serialization
● Minimize as much as possible
● If C++ code needs to prep data for ML in Python, consider re-writing Python code
into C++
Hyperparameters - balance latency and accuracy
● 1000 trees for RF that meets SLA might be a better business decision than 2000
trees with “slight” accuracy gain
@JasjeetThind
35. Features at serving should equal backend
Save feature set used at serving time and pipe into a log or queue
● Use features directly in training
● Or at least verify they match training sets
@JasjeetThind
36. Free Zillow Data
@JasjeetThind
Zillow Home Value Index (ZHVI)
• Top / Middle / Bottom Thirds
• Single Family / Condo / Co-op
• Median Home Value Per Sq Ft
Zillow Rent Index (ZRI)
• Multi-family / SFR / Condo / Co-op
• Median ZRI Per Sq ft
• Median Rent List Price
Other Metrics
• Median List Price
• Price-to-Rent ratio
• Homes Foreclosed
• For-sale Inventory / Age Inventory
• Negative Equity
• And many more…
Time Series: national, state, metro, county, city and ZIP
code levels
ZTRAX: Zillow Transaction and Assessment Dataset
Previously inaccessible or prohibitively expensive housing
data for academic and institutional researchers FOR FREE.
∙ More than 100 gigabytes
∙ 374 million detailed public records across more than
2,750 U.S. counties
∙ 20+ years of deed transfers, mortgages, foreclosures,
auctions, property tax delinquencies and more for
residential and commercial properties.
∙ Assessor data including property characteristics,
geographic information, and prior valuations on
approximately 200 million parcels in more than 3,100
counties.
Email ZTRAX@zillow.com for more information
Zillow.com/data
37. We’re Hiring!!!
Roles
● Principal DevOps Engineer
● Machine Learning Engineer
● Data Scientist
● Product Manager
● Data Engineer
@JasjeetThind
Related Blogs
● Zillow.com/data-science
● Trulia.com/blog/tech/
38. Zillow Prize - $1 million dollars
The brightest scientific minds have a unique opportunity to work on the algorithm that
changed the world of real estate - the Zestimate® home valuation algorithm.
www.zillow.com/promo/zillow-prize
@JasjeetThind
Notes de l'éditeur
DevOps and Machine Learning
How do you test and deploy real-time machine learning services given the challenge that machine learning algorithms produce nondeterministic behaviors even for the same input.
Zillow Group - conglomerate of real-estate websites, mobile apps, and services.
170 million monthly unique users
Hundreds of petabytes of data in our data lake
Brands
Trulia - 2005, unprecedented information on comps, prior sale, and listing data
Naked Apartments - rentals in NYC
StreetEasy - sales and rentals in NYC
Zillow - 2006, Zestimate - first freely available home valuation
Hotpads - 2005, rentals nationwide
Mortech - loan pricing from multiple vendors
Dotloop - allows paperwork in real-estate transaction to be paper-less
Retsly - platform and APIs that power real-estate software
Email - homes for sale / for rent
Home Details - homes for sale / homes like this
Personalized Search
Mobile - smart SMS and push notifications
Home owner predictions
Pre-Seller predictions
Lender selection algorithm
Real-estate content personalization
Similar photos / video
Here is a list of models that we use, everything from RF, to deep learning, and collab filtering
- Master r3.2xlarge
- Slave r3.4xlarge
https://stash.atl.zillow.net/projects/ZDA/repos/generic-models/browse/generic_models/tests/test_generic_models.py
https://stash.atl.zillow.net/projects/ZDA/repos/tax-valuation/browse/tax_valuation/tests/test_tax_valuation_model.py#46
We can use the same artifact repository for trained models that we use for our services and libraries. It gives us the ability to use a set of coordinates (group ID, artifact ID, and version) to define an artifact, which gets stored in our artifact repository that is backed by S3.
Here's how it looks for our built code (link points to a lightweight frontend to S3):
http://repo-storage-proxy.in.zillow.net/release/com/zillow/datascience/prior-sale-model/0.0.27-master.71390e6/
group ID = com.zillow.datascience
artifact ID = prior-sale-model
version = 0.0.27-master.71390e6