SlideShare une entreprise Scribd logo
1  sur  22
Télécharger pour lire hors ligne
Data Quality with or
without Apache Spark
and its ecosystem
Serge Smertin
Sr. Resident Solutions Architect at
Databricks
▪ Intro
▪ Dimensions
▪ Frameworks
▪ TLDR
▪ Outro
About me
▪ Worked in all stages of data
lifecycle for the past 14 years
▪ Built data science platforms from
scratch
▪ Tracked cyber criminals through
massively scaled data forensics
▪ Built anti-PII analysis measures
for payments industry
▪ Bringing Databricks strategic
customers to next level as
full-time job now
Colleen Graham
“Performance Management Driving BI Spending”, InformationWeek, February 14, 2006
https://www.informationweek.com/performance-management-driving-bi-spending/d/d-id/10405
52
Data quality requires certain
level of sophistication within a
company to even understand
that it’s a problem.
Data
Catalogs
Data
Profiling
ETL
Quality Checks
Metrics repository
Alerting
Noise filtering
Dashboards Oncall
Data
Catalogs
Data
Profiling
ETL
Metrics repository
Alerting
Noise filtering
Dashboards Oncall
Completeness
Consistency
Uniqueness
Timeliness
Relevance
Accuracy
Validity
Quality Checks
Record
level
Database
level
- Stream-friendly
- Quarantine invalid data
- Debug and re-process
- Make sure to (re-)watch
“Make reliable ETL easy
on Delta Lake” talk
- Batch-friendly
- See health of the entire pipeline
- Detect processing anomalies
- Reconciliation testing
- Mutual information analysis
- This talk
Data owners and
Subject Matter
Experts define
ideal shape of the
data
May not fully cover
all aspects, when
number of
datasets is bigger
that SME team
Often is the only way
for larger orgs,
where expertise still
has to be developed
internally
May lead to
incomplete data
coverage and
missed signals about
problems in data
pipelines
Exploration
Expertise
Semi-supervised
code generation
based on data
profiling results
May overfit
alerting with rules
that are too strict
by default,
resulting in more
noise than signal
Automation
Few solutions exist in the open-source
community either in the form of libraries or
complete stand-alone platforms, which can be
used to assure a certain data quality, especially
when continuous imports happen.
“1” if check(s)
succeeded for a given
row. Result is
averaged.
Streaming friendly.
Success
Keys
Check compares
incoming batch
with existing
dataset - e.g.
unique keys
Domain
Keys
Materialised
synthetic
aggregations - e.g.
is this batch |2σ|
records different
than previous?
Dataset
Metrics
Repeat computation
in a separate,
simplified pipeline
and validate results -
e.g. double-entry
bookkeeping
Reconciliation
Tests
If you “build your own everything” - consider
embedding Deequ.It has has constraint
suggestion among advanced enterprise
features like data profiling and anomaly
detection out of the box, though documentation
is not that extensive. And you may want to fork it
internally.
Deequ code
generation
from pydeequ.suggestions import *
suggestionResult = (
ConstraintSuggestionRunner(spark)
.onData(spark.table('demo'))
.addConstraintRule(DEFAULT())
.run())
print('from pydeequ.checks import *')
print('check = (Check(spark, CheckLevel.Warning, "Generated check")')
for suggestion in suggestionResult['constraint_suggestions']:
if 'Fractional' in suggestion['suggesting_rule']:
continue
print(f' {suggestion["code_for_constraint"]}')
print(')')
from pydeequ.checks import *
check = (Check(spark, CheckLevel.Warning,
"Generated check")
.isComplete("b")
.isNonNegative("b")
.isComplete("a")
.isNonNegative("a")
.isUnique("a")
.hasCompleteness("c", lambda x: x >= 0.32,
"It should be above 0.32!"))
Great Expectations is less enterprise'y data
validation platform written in Python, that
focuses on supporting Apache Spark among
other data sources, like Postgres, Pandas,
BigQuery, and so on.
Pandas Profiling
▪ Exploratory Data Analysis
simplified by generating HTML
report
▪ Native bi-directional
integration with Great
Expectations
▪ great_expectations
profile DATASOURCE
▪ (pandas_profiling
.ProfileReport(pandas_df)
.to_expectation_suite())
https://pandas-profiling.github.io/pandas-profiling/
Apache Griffin may be the most
enterprise-oriented solution with user interface
available, given the fact it being Apache
top-level project and backed up by eBay since
2016, but it is not as easily embeddable into
existing applications, because it requires
standalone deployment along with JSON DSL
definitions for rules.
Completeness
SELECT AVG(IF(c IS NOT NULL, 1, 0)) AS isComplete FROM demo
Deequ
PySpark
Great Expectations
SQL
Uniqueness
SELECT (COUNT(DISTINCT c) / COUNT(1)) AS isUnique FROM demo
Deequ
Great Expectations
PySpark
SQL
Validity
SELECT AVG(IF(a < b, 1, 0)) AS isValid FROM demo
Deequ
Great Expectations
PySpark
SQL
Timeliness
SELECT NOW() - MAX(rawEventTime) AS delay
FROM processed_events
raw events processed
events
Honorable Mentions
• https://github.com/FRosner/drunken-data-quality
• https://github.com/databrickslabs/dataframe-rules-engine
Make sure to (re-)watch
“Make reliable ETL easy
on Delta Lake” talk
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

Contenu connexe

Tendances

Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
Denodo
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 

Tendances (20)

Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data Catalog
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Data Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced AnalyticsData Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced Analytics
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Intro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on SnowflakeIntro to Data Vault 2.0 on Snowflake
Intro to Data Vault 2.0 on Snowflake
 

Similaire à Data Quality With or Without Apache Spark and Its Ecosystem

"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
Pavel Hardak
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Databricks
 
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiouslyQA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QAFest
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
Miguel González-Fierro
 

Similaire à Data Quality With or Without Apache Spark and Its Ecosystem (20)

AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
 
Testing Big Data solutions fast and furiously
Testing Big Data solutions fast and furiouslyTesting Big Data solutions fast and furiously
Testing Big Data solutions fast and furiously
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
 
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiouslyQA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
 
Prediction io 架構與整合 -DataCon.TW-2017
Prediction io 架構與整合 -DataCon.TW-2017Prediction io 架構與整合 -DataCon.TW-2017
Prediction io 架構與整合 -DataCon.TW-2017
 
Scalable and Resilient Security Ratings Platform with ScyllaDB
Scalable and Resilient Security Ratings Platform with ScyllaDBScalable and Resilient Security Ratings Platform with ScyllaDB
Scalable and Resilient Security Ratings Platform with ScyllaDB
 
Building functional Quality Gates with ReportPortal
Building functional Quality Gates with ReportPortalBuilding functional Quality Gates with ReportPortal
Building functional Quality Gates with ReportPortal
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction ProfilerSplunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...
Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...
Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Maximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and PracticesMaximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and Practices
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
How to build an automated customer data onboarding pipeline
How to build an automated customer data onboarding pipelineHow to build an automated customer data onboarding pipeline
How to build an automated customer data onboarding pipeline
 

Plus de Databricks

Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Databricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
 

Dernier

Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 

Dernier (20)

Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 

Data Quality With or Without Apache Spark and Its Ecosystem

  • 1. Data Quality with or without Apache Spark and its ecosystem Serge Smertin Sr. Resident Solutions Architect at Databricks
  • 2. ▪ Intro ▪ Dimensions ▪ Frameworks ▪ TLDR ▪ Outro
  • 3. About me ▪ Worked in all stages of data lifecycle for the past 14 years ▪ Built data science platforms from scratch ▪ Tracked cyber criminals through massively scaled data forensics ▪ Built anti-PII analysis measures for payments industry ▪ Bringing Databricks strategic customers to next level as full-time job now
  • 4. Colleen Graham “Performance Management Driving BI Spending”, InformationWeek, February 14, 2006 https://www.informationweek.com/performance-management-driving-bi-spending/d/d-id/10405 52 Data quality requires certain level of sophistication within a company to even understand that it’s a problem.
  • 6. Data Catalogs Data Profiling ETL Metrics repository Alerting Noise filtering Dashboards Oncall Completeness Consistency Uniqueness Timeliness Relevance Accuracy Validity Quality Checks
  • 7. Record level Database level - Stream-friendly - Quarantine invalid data - Debug and re-process - Make sure to (re-)watch “Make reliable ETL easy on Delta Lake” talk - Batch-friendly - See health of the entire pipeline - Detect processing anomalies - Reconciliation testing - Mutual information analysis - This talk
  • 8. Data owners and Subject Matter Experts define ideal shape of the data May not fully cover all aspects, when number of datasets is bigger that SME team Often is the only way for larger orgs, where expertise still has to be developed internally May lead to incomplete data coverage and missed signals about problems in data pipelines Exploration Expertise Semi-supervised code generation based on data profiling results May overfit alerting with rules that are too strict by default, resulting in more noise than signal Automation
  • 9. Few solutions exist in the open-source community either in the form of libraries or complete stand-alone platforms, which can be used to assure a certain data quality, especially when continuous imports happen.
  • 10. “1” if check(s) succeeded for a given row. Result is averaged. Streaming friendly. Success Keys Check compares incoming batch with existing dataset - e.g. unique keys Domain Keys Materialised synthetic aggregations - e.g. is this batch |2σ| records different than previous? Dataset Metrics Repeat computation in a separate, simplified pipeline and validate results - e.g. double-entry bookkeeping Reconciliation Tests
  • 11. If you “build your own everything” - consider embedding Deequ.It has has constraint suggestion among advanced enterprise features like data profiling and anomaly detection out of the box, though documentation is not that extensive. And you may want to fork it internally.
  • 12. Deequ code generation from pydeequ.suggestions import * suggestionResult = ( ConstraintSuggestionRunner(spark) .onData(spark.table('demo')) .addConstraintRule(DEFAULT()) .run()) print('from pydeequ.checks import *') print('check = (Check(spark, CheckLevel.Warning, "Generated check")') for suggestion in suggestionResult['constraint_suggestions']: if 'Fractional' in suggestion['suggesting_rule']: continue print(f' {suggestion["code_for_constraint"]}') print(')') from pydeequ.checks import * check = (Check(spark, CheckLevel.Warning, "Generated check") .isComplete("b") .isNonNegative("b") .isComplete("a") .isNonNegative("a") .isUnique("a") .hasCompleteness("c", lambda x: x >= 0.32, "It should be above 0.32!"))
  • 13. Great Expectations is less enterprise'y data validation platform written in Python, that focuses on supporting Apache Spark among other data sources, like Postgres, Pandas, BigQuery, and so on.
  • 14. Pandas Profiling ▪ Exploratory Data Analysis simplified by generating HTML report ▪ Native bi-directional integration with Great Expectations ▪ great_expectations profile DATASOURCE ▪ (pandas_profiling .ProfileReport(pandas_df) .to_expectation_suite()) https://pandas-profiling.github.io/pandas-profiling/
  • 15. Apache Griffin may be the most enterprise-oriented solution with user interface available, given the fact it being Apache top-level project and backed up by eBay since 2016, but it is not as easily embeddable into existing applications, because it requires standalone deployment along with JSON DSL definitions for rules.
  • 16.
  • 17. Completeness SELECT AVG(IF(c IS NOT NULL, 1, 0)) AS isComplete FROM demo Deequ PySpark Great Expectations SQL
  • 18. Uniqueness SELECT (COUNT(DISTINCT c) / COUNT(1)) AS isUnique FROM demo Deequ Great Expectations PySpark SQL
  • 19. Validity SELECT AVG(IF(a < b, 1, 0)) AS isValid FROM demo Deequ Great Expectations PySpark SQL
  • 20. Timeliness SELECT NOW() - MAX(rawEventTime) AS delay FROM processed_events raw events processed events
  • 21. Honorable Mentions • https://github.com/FRosner/drunken-data-quality • https://github.com/databrickslabs/dataframe-rules-engine Make sure to (re-)watch “Make reliable ETL easy on Delta Lake” talk
  • 22. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.