Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
2. Who We Are
Data Governance Platform Team @ Zillow
Smit Shah, Senior Software Development Engineer, Big Data
Yuliana Havryshchuk, Software Development Engineer, Big Data
3. Agenda
● What is Zillow?
● Data Quality Challenges
● Centralized Data Quality Platform
○ Architecture
○ Self-Service
○ Pipeline integration
● Key Takeaways
5. About Zillow
● Reimagining real estate to make it easier to unlock life’s next chapter
● Offer customers an on-demand experience for selling, buying, renting and financing with transparency and nearly seamless end-to-end service
● Most-visited real estate website in the United States
* As of Q4-2020
7. Why Monitor Data Quality?
● Many customer-facing and internal services at Zillow rely on high-quality data
○ Zestimate
○ Zillow Offers
○ Zillow Premier Agent
○ Econ and many more
● Reliable performance of ML models and services requires a certain level of data quality
8. Challenges We Faced
● No standard way to monitor quality
● Lack of visibility into data health
● No known lineage between data and processes
10. 5 Pillars for Data Quality Platform
● Increase Visibility of Data Health
● Integrate with Data Lineage
● Support Built-in Alerting
● Enable Safe Evolution of Rules
● Standardize Data Quality Rules
21. Self-Service Onboarding - Example
* These values are simulated
id  name               type       page_views  data_date
1   123 Green St       house      709         2021-05-01
2   47 Walker Rd       townhouse  132         2021-05-01
    1225 City St #901  condo      800         2021-05-01
4   47 Walker Ave      test       600         2021-05-01
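The simulated rows above hint at the kind of row-level expectations a producer might register during onboarding. The following is a minimal plain-Python sketch, not the platform's actual library (which the talk describes as built to work with Spark). The rule names, the `ALLOWED_TYPES` set, and the choice to represent the third row's absent id as `None` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]

# Simulated rows from the slide. The third row shows no id in the source,
# so it is represented here as None (an assumption).
ROWS: List[Row] = [
    {"id": 1, "name": "123 Green St", "type": "house",
     "page_views": 709, "data_date": "2021-05-01"},
    {"id": 2, "name": "47 Walker Rd", "type": "townhouse",
     "page_views": 132, "data_date": "2021-05-01"},
    {"id": None, "name": "1225 City St #901", "type": "condo",
     "page_views": 800, "data_date": "2021-05-01"},
    {"id": 4, "name": "47 Walker Ave", "type": "test",
     "page_views": 600, "data_date": "2021-05-01"},
]

# Hypothetical set of accepted property types.
ALLOWED_TYPES = {"house", "townhouse", "condo"}

# Each rule maps a (hypothetical) name to a predicate that is True when a row passes.
RULES: Dict[str, Callable[[Row], bool]] = {
    "id_not_null": lambda row: row["id"] is not None,
    "type_in_allowed_set": lambda row: row["type"] in ALLOWED_TYPES,
}


def validate(rows: List[Row],
             rules: Dict[str, Callable[[Row], bool]]) -> Dict[str, List[int]]:
    """Return, for each rule, the zero-based indexes of the rows that fail it."""
    return {
        name: [i for i, row in enumerate(rows) if not check(row)]
        for name, check in rules.items()
    }


failures = validate(ROWS, RULES)
```

Under these assumed rules, the row without an id fails `id_not_null` and the row typed `test` fails `type_in_allowed_set`. Keeping rules as named predicates is one way to let producers and consumers read the same expectations.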
24. Self-Service Onboarding - Example
* These values are simulated
id  name          type   page_views  data_date
1   123 Green St  house  709         2021-05-01
1   123 Green St  house  820         2021-05-02
1   123 Green St  house  12          2021-05-03
1   123 Green St  house  760         2021-05-04
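This second example tracks one listing across several days, where the page_views value of 12 on 2021-05-03 stands out from its neighbors. The slides do not say which check the platform applies here. The sketch below assumes a simple rule that compares each day's value with the median of the other days, and the `max_relative_deviation` threshold of 0.5 is a hypothetical choice.

```python
from statistics import median
from typing import List, Tuple

# (data_date, page_views) for id 1, taken from the simulated rows above.
PAGE_VIEWS: List[Tuple[str, int]] = [
    ("2021-05-01", 709),
    ("2021-05-02", 820),
    ("2021-05-03", 12),
    ("2021-05-04", 760),
]


def flag_anomalies(series: List[Tuple[str, int]],
                   max_relative_deviation: float = 0.5) -> List[str]:
    """Return dates whose value differs from the median of the other days
    by more than max_relative_deviation (as a fraction of that median)."""
    flagged = []
    for i, (date, value) in enumerate(series):
        others = [v for j, (_, v) in enumerate(series) if j != i]
        if not others:
            continue  # a single observation has nothing to compare against
        baseline = median(others)
        if baseline and abs(value - baseline) / baseline > max_relative_deviation:
            flagged.append(date)
    return flagged


anomalous_dates = flag_anomalies(PAGE_VIEWS)
```

With the assumed threshold, only 2021-05-03 is flagged. Comparing against the other days' median keeps the outlier from distorting its own baseline. A production check would likely need a longer history and some handling of seasonality.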
32. Validation Results
● Alert data users if any checks fail
● Integrate with pipeline execution to prevent propagation
● Provide visibility through data discovery tool
● Provide common understanding between producers and consumers
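The first two points, alerting on failures and stopping bad data from propagating, can be pictured as a gate between validation and publication. This is a sketch under stated assumptions rather than the platform's real interface: `CheckResult`, `DataQualityError`, `gate`, and `run_pipeline` are hypothetical names, and the alert channel is modeled as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    name: str      # hypothetical rule name, e.g. "id_not_null"
    passed: bool


class DataQualityError(Exception):
    """Raised to halt a pipeline when one or more checks fail."""


def gate(results: List[CheckResult], alert: Callable[[str], None]) -> None:
    """Alert on failed checks and raise so that downstream steps do not run."""
    failed = [r.name for r in results if not r.passed]
    if failed:
        alert(f"Data quality checks failed: {', '.join(failed)}")
        raise DataQualityError(failed)


def run_pipeline(results: List[CheckResult],
                 publish: Callable[[], None],
                 alert: Callable[[str], None]) -> None:
    """Publish the dataset only if every check passed."""
    gate(results, alert)
    publish()
```

In this arrangement a failed check both notifies data users and prevents `publish` from running, so producers can resolve the issue before consumers read the data.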
33. Future Direction
● Tighter integration between components
● Expand libraries to support more use-cases
● Move from detection to diagnosis
● Validation for streaming data
35. Key Takeaways
● 5 pillars that helped us build a robust platform: standardization,
visibility, evolution, alerting, lineage
● Alerting on data quality issues early allows proactive response
● Producing quality data increases trust in data and improves the decisions
made with it
● Data quality is a shared responsibility, and collaboration is needed to
be successful