WHOOPS, THE NUMBERS ARE WRONG!
SCALING DATA QUALITY @ NETFLIX
MICHELLE UFFORD
DATA ENGINEERING & ANALYTICS, NETFLIX
HADOOP SUMMIT 2017
Overview.
The business.
20170612
● 100+ million members
● $6 billion on content
● 125+ million hours watched every. day.
● launched in 1997
Anytime. Anywhere.*
* Well, almost anywhere.
Any device.
The data.
● 60+ petabyte data warehouse
● 700+ billion events written
● 300 terabyte DW writes
● 5 petabyte DW reads
Big Data Platform
[Architecture diagram. Labels: events data, operational data, elastic storage, fast storage, data processing, data services, data access, data viz. Components named: AWS S3, Amazon Redshift, Apache Pig, Metacat.]
Data Quality.
Federated metastore &
extensible data catalog
Metacat
Federated Metastore
dw.fact_table_f
s3://…/dw/fact_table_f/utc_date=20170101/batchid=1483229855
…
s3://…/dw/fact_table_f/utc_date=20170611/batchid=1497226702
s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541
utc_date=20170101
…
utc_date=20170611
utc_date=20170612
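The partition layout shown for dw.fact_table_f can be read as key=value path segments. The following is a minimal sketch of that reading, not Metacat's API: `parse_partition_path` and `batch_time` are hypothetical helpers, and treating `batchid` as a Unix timestamp in seconds is an assumption the slides do not confirm.

```python
from datetime import datetime, timezone


def parse_partition_path(path):
    """Collect the key=value segments of a Hive-style partition path."""
    spec = {}
    for segment in path.rstrip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            spec[key] = value
    return spec


def batch_time(batchid):
    """Assumption: batchid is a Unix timestamp in seconds (UTC)."""
    return datetime.fromtimestamp(int(batchid), tz=timezone.utc)


# Path copied from the slide; the elided prefix ("…") is left as-is.
path = "s3://…/dw/fact_table_f/utc_date=20170612/batchid=1497312541"
spec = parse_partition_path(path)
print(spec)
print(batch_time(spec["batchid"]).isoformat())
```

Under that assumption, each batchid records when a partition's data was written, so a rewritten partition would get a new batchid directory rather than overwriting the old one in place.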
Extended table attributes
● primary key(s)
● column types
● lifecycle
● audience
● “valid-thru” timestamp
● … and much more
Data Quality Service.
Quinto
Data Quality Service
Write - Audit - Publish (WAP)
ETL pattern for high-quality big data jobs
WAP ETL Pattern
Stage-0: Prep
dw.my_table_f
audit.my_table_f_1497312000
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
WAP ETL Pattern
Stage-1: Write
dw.my_table_f
audit.my_table_f_1497312000
$TABLE
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
s3://…/utc_date=20170612/batchid=1497312541
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
WAP ETL Pattern
Stage-2: Audit
dw.my_table_f
audit.my_table_f_1497312000
Quinto
utc_date=20170611
com.netflix.dse.mds.metric.RowCount: 16135
com.netflix.dse.mds.metric.NullCount: 21
...
utc_date=20170612
com.netflix.dse.mds.metric.RowCount: 17240
com.netflix.dse.mds.metric.NullCount: 17240
...
Quinto configuration
metric     eval             behavior   result
--------------------------------------------------
RowCount   >= zero          fail job   pass
RowCount   >= prior value   fail job   pass
NullCount  normal dist      warn job   fail
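The three Quinto checks in the audit stage can be sketched in a few lines of Python. This is an illustrative approximation, not Quinto's implementation: the function names, the 3-sigma threshold for "normal dist", and every history value except the utc_date=20170611 figures are assumptions. It also assumes that a failed "fail job" rule blocks the job while a failed "warn job" rule only raises a warning.

```python
from statistics import mean, stdev


def at_least_zero(current, history):
    return current >= 0


def at_least_prior(current, history):
    return current >= history[-1]


def within_normal_range(current, history, z=3.0):
    """Pass when current lies within z sample standard deviations of the historical mean."""
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) <= z * sigma


# (metric, eval label, check function, behavior), mirroring the slide's table.
RULES = [
    ("RowCount", ">= zero", at_least_zero, "fail job"),
    ("RowCount", ">= prior value", at_least_prior, "fail job"),
    ("NullCount", "normal dist", within_normal_range, "warn job"),
]

# The last value in each list is the utc_date=20170611 figure from the slide;
# the earlier values are invented so the distribution check has a history.
history = {
    "RowCount": [15980, 16020, 16100, 16090, 16135],
    "NullCount": [19, 23, 20, 22, 21],
}
current = {"RowCount": 17240, "NullCount": 17240}  # utc_date=20170612


def run_audit(current, history, rules=RULES):
    """Evaluate every rule and derive an overall job status."""
    results, status = [], "pass"
    for metric, label, check, behavior in rules:
        ok = check(current[metric], history[metric])
        results.append((metric, label, behavior, "pass" if ok else "fail"))
        if not ok:
            if behavior == "fail job":
                status = "fail"
            elif status != "fail":
                status = "warn"
    return results, status


results, status = run_audit(current, history)
for row in results:
    print("%-10s %-15s %-9s %s" % row)
print("job status:", status)
```

With the slide's numbers, both RowCount rules pass and the NullCount rule fails, because a null count equal to the row count is far outside anything in the history. Since that rule is configured as "warn job", the sketch reports a warning rather than a hard failure.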
WAP ETL Pattern
Stage-3: Publish
dw.my_table_f
s3://…/utc_date=20170101/batchid=1483229855
…
s3://…/utc_date=20170611/batchid=1497226702
s3://…/utc_date=20170612/batchid=1497312541
valid_thru_ts = 20170613 00:00:00
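Taken together, the four WAP stages can be sketched over a toy in-memory catalog. This is a rough illustration under stated assumptions, not the Netflix WAP library: `run_wap`, the dictionary layout, and the choice to drop the audit table after publishing and keep it after a failed audit are all hypothetical.

```python
def run_wap(catalog, table, partition, location, metrics, audit, run_ts, valid_thru_ts):
    """Write-Audit-Publish over a toy in-memory catalog (not a real metastore API)."""
    # Stage-0: Prep. Create an empty audit table named after the target and run timestamp.
    audit_table = "audit.%s_%d" % (table.split(".", 1)[1], run_ts)
    catalog[audit_table] = {"partitions": {}, "valid_thru_ts": None}

    # Stage-1: Write. The new partition is registered on the audit table only.
    catalog[audit_table]["partitions"][partition] = location

    # Stage-2: Audit. Readers of `table` cannot see the partition yet.
    if not audit(metrics):
        return False  # assumption: the audit table is left in place for inspection

    # Stage-3: Publish. Point the production table at the audited partition.
    catalog[table]["partitions"][partition] = location
    catalog[table]["valid_thru_ts"] = valid_thru_ts
    del catalog[audit_table]
    return True


catalog = {
    "dw.my_table_f": {
        "partitions": {"utc_date=20170611": "s3://…/utc_date=20170611/batchid=1497226702"},
        "valid_thru_ts": None,
    }
}
published = run_wap(
    catalog,
    table="dw.my_table_f",
    partition="utc_date=20170612",
    location="s3://…/utc_date=20170612/batchid=1497312541",
    metrics={"RowCount": 17240},
    audit=lambda m: m["RowCount"] > 0,  # stand-in for the Quinto checks
    run_ts=1497312000,
    valid_thru_ts="20170613 00:00:00",
)
print(published, sorted(catalog["dw.my_table_f"]["partitions"]))
```

The point of the pattern is that publishing is only a metadata change: the data is written once, and the production table never references a partition that has not passed its audit.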
Quinto evaluations
● intelligent recommendations
● multiple tiers of coverage
● configurable rules
Jumpstarter.
Python Library
WAP.
Python Library
Minimal requirements
● parameterized destination table
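One way to read "parameterized destination table" is that the job never hard-codes its target, so the same statement can write to the audit table first. A minimal sketch using Python's standard `string.Template`, whose `$TABLE` placeholder syntax happens to match the slide; the SQL text and `staging.my_source` are invented for illustration.

```python
from string import Template

# Hypothetical job statement: the destination is a placeholder, not a fixed table name.
job_sql = Template(
    "INSERT OVERWRITE TABLE $TABLE PARTITION (utc_date=20170612) "
    "SELECT * FROM staging.my_source"
)

# During the Write stage the placeholder resolves to the audit table.
write_sql = job_sql.substitute(TABLE="audit.my_table_f_1497312000")
print(write_sql)
```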
Running WAP.
What’s Next.
● additional Metacat statistics
● robust anomaly detection (RAD)
● complete migration for all prod tables
Tips & Lessons Learned.
● Query-based solution may be “good enough” for many.
● Not all tables need quality coverage.
● One size rarely fits all tables.
● Build components, not “all-or-nothing” frameworks.
MICHELLE UFFORD
mufford@netflix.com
twitter.com/MichelleUfford
DATA
techblog.netflix.com
medium.com/netflix-techblog
twitter.com/NetflixData
tinyurl.com/NetflixData
Thank you!
WE’RE HIRING! jobs.netflix.com

Editor's notes

  1. https://dataworkssummit.com/san-jose-2017/sessions/whoops-the-numbers-are-wrong-scaling-data-quality-netflix/
     WHOOPS, THE NUMBERS ARE WRONG! SCALING DATA QUALITY @ NETFLIX
     Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies?
     In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into details of our data quality solution. We’ll cover what worked, what didn’t work so well, and what we plan to work on next. We’ll conclude with some tips and lessons learned for ensuring data quality on big data.
     DETAILS: This session is an Intermediate talk in our Data Processing and Warehousing track. It focuses on Apache Hadoop, Apache Hive, Apache Pig, Apache Spark and is geared towards Architect, Data Analyst, Developer / Engineer, Operations / IT audiences.
  2. 1500+ devices as of Q1 2017
  3. Goal is to provide behind-the-scenes look at how we’re approaching DQ. We’re sharing ideas, not code – no open-source announcement.
  4. That’s cool, but it sounds like a lot of work. I need to:
     ● know what stats are available in Metacat
     ● know what quality templates exist
     ● figure out which ones I should use
     ● figure out a good configuration for each
     ● and do everything we just walked through in WAP
  5. Takes ~5 minutes to enable WAP with 108 audits on a new Spark job. Added 42 seconds to hourly processing time.
  6. It’s the combination of these solutions that allows us to scale not only the processing time but the engineering time too.
  7. Stats: cardinality, histograms, map keys.
     RAD: high-cardinality dimensions, seasonality beyond week-over-week, atypical data distributions, reduce false positives.
  8. Query-based solution:
     ● Not efficient, but much less complicated to implement
     ● Good place to start
     ● Works well for small-to-medium datasets and/or nightly batch ETL
     Transient, experimental, single-user
     Content vs. Streaming
  9. 2 main motivations:
     Confidence
     ● Notify users when quality issues arise
     ● Only make data available after some basic validation
     ● Increase confidence for data consumers that data is good to use
     Efficiency
     ● Catch issues faster
     ● Less business impact
     ● Much easier to simply not update downstream dependencies than to fix it after the fact