SlideShare une entreprise Scribd logo
1  sur  41
GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
@fzk
frisovanvollenhoven@godatadriven.com
Dirty Data
Friso van Vollenhoven
It’s a mess. It’s your problem.
'februari-22 2013'
A: Yes, sometimes as
often as 1 in every 10K
calls. Or about once a
week at 3K files / day.
þ
þ
TSV ==
thorn separated values?
þ == 0xFE
or -2, in Hive
CREATE TABLE browsers (
browser_id STRING,
browser STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-2';
• The format will change
• Faulty deliveries will occur
• Your parser will break
• Records will be mistakingly produced (over-logging)
• Other people test in production too (and you get the
data from it)
• Etc., etc.
•Simple deployment of ETL code
•Scheduling
•Scalable
•Independent jobs
•Fixable data store
•Incremental where possible
•Metrics
EXTRACT
TRANSFORM
LOAD
• No JVM startup overhead for Hadoop API usage
• Relatively concise syntax (Python)
• Mix Python standard library with any Java libs
• Flexible scheduling with dependencies
• Saves output
• E-mails on errors
• Scales to multiple nodes
• REST API
• Status monitor
• Integrates with version control
Deployment
git push jenkins master
Independent jobs
source (external)
staging (HDFS)
hive-staging (HDFS)
Hive
HDFS upload + move in place
MapReduce + HDFS move
Hive map external table + SELECT INTO
Out of order jobs
• At any point, you don’t really know what ‘made it’
to Hive
• Will happen anyway, because some days the data
delivery is going to be three hours late
• Or you get half in the morning and the other half
later in the day
• It really depends on what you do with the data
• This is where metrics + fixable data store help...
Fixable data store
• Using Hive partitions
• Jobs that move data from staging create partitions
• When new data / insight about the data arrives,
drop the partition and re-insert
• Be careful to reset any metrics in this case
• Basically: instead of trying to make everything
transactional, repair afterwards
• Use metrics to determine whether data is fit for
purpose
Metrics
Metrics service
• Job ran, so may units processed, took so much
time
• e.g. 10GB imported, took 1 hr
• e.g. 60M records transformed, took 10 minutes
• Dropped partition
• Inserted X records into partition
GoDataDriven
We’re hiring / Questions? / Thank you!
@fzk
frisovanvollenhoven@godatadriven.com
Friso van Vollenhoven

Contenu connexe

Plus de GoDataDriven

DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
GoDataDriven
 

Plus de GoDataDriven (20)

Building a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlowBuilding a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlow
 
Training Taster: Leading the way to become a data-driven organization
Training Taster: Leading the way to become a data-driven organizationTraining Taster: Leading the way to become a data-driven organization
Training Taster: Leading the way to become a data-driven organization
 
My Path From Data Engineer to Analytics Engineer
My Path From Data Engineer to Analytics EngineerMy Path From Data Engineer to Analytics Engineer
My Path From Data Engineer to Analytics Engineer
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchez
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
 
How to create a Devcontainer for your Python project
How to create a Devcontainer for your Python projectHow to create a Devcontainer for your Python project
How to create a Devcontainer for your Python project
 
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
 
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
 
MLOps CodeBreakfast on AWS - GoDataFest 2022
MLOps CodeBreakfast on AWS - GoDataFest 2022MLOps CodeBreakfast on AWS - GoDataFest 2022
MLOps CodeBreakfast on AWS - GoDataFest 2022
 
MLOps CodeBreakfast on Azure - GoDataFest 2022
MLOps CodeBreakfast on Azure - GoDataFest 2022MLOps CodeBreakfast on Azure - GoDataFest 2022
MLOps CodeBreakfast on Azure - GoDataFest 2022
 
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
 
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
 
AWS Well-Architected Webinar Security - Ben de Haan
AWS Well-Architected Webinar Security - Ben de HaanAWS Well-Architected Webinar Security - Ben de Haan
AWS Well-Architected Webinar Security - Ben de Haan
 
The 7 Habits of Effective Data Driven Companies
The 7 Habits of Effective Data Driven CompaniesThe 7 Habits of Effective Data Driven Companies
The 7 Habits of Effective Data Driven Companies
 
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
 
Artificial intelligence in actions: delivering a new experience to Formula 1 ...
Artificial intelligence in actions: delivering a new experience to Formula 1 ...Artificial intelligence in actions: delivering a new experience to Formula 1 ...
Artificial intelligence in actions: delivering a new experience to Formula 1 ...
 
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't HofSmart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
 
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
 
The world runs on AI - Tony Krijnen (Microsoft) at GoDataFest 2019
The world runs on AI - Tony Krijnen (Microsoft) at GoDataFest 2019The world runs on AI - Tony Krijnen (Microsoft) at GoDataFest 2019
The world runs on AI - Tony Krijnen (Microsoft) at GoDataFest 2019
 
CI/CD with Azure DevOps and Azure Databricks
CI/CD with Azure DevOps and Azure DatabricksCI/CD with Azure DevOps and Azure Databricks
CI/CD with Azure DevOps and Azure Databricks
 

Dernier

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Dernier (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Dirty data by friso van vollenhoven

  • 1. GoDataDriven PROUDLY PART OF THE XEBIA GROUP @fzk frisovanvollenhoven@godatadriven.com Dirty Data Friso van Vollenhoven It’s a mess. It’s your problem.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 8.
  • 9. A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.
  • 10.
  • 11.
  • 12. þ
  • 13. þ
  • 16. or -2, in Hive CREATE TABLE browsers ( browser_id STRING, browser STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-2';
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. • The format will change • Faulty deliveries will occur • Your parser will break • Records will be mistakingly produced (over-logging) • Other people test in production too (and you get the data from it) • Etc., etc.
  • 24. •Simple deployment of ETL code •Scheduling •Scalable •Independent jobs •Fixable data store •Incremental where possible •Metrics
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31. • No JVM startup overhead for Hadoop API usage • Relatively concise syntax (Python) • Mix Python standard library with any Java libs
  • 32.
  • 33. • Flexible scheduling with dependencies • Saves output • E-mails on errors • Scales to multiple nodes • REST API • Status monitor • Integrates with version control
  • 34.
  • 36. Independent jobs source (external) staging (HDFS) hive-staging (HDFS) Hive HDFS upload + move in place MapReduce + HDFS move Hive map external table + SELECT INTO
  • 37. Out of order jobs • At any point, you don’t really know what ‘made it’ to Hive • Will happen anyway, because some days the data delivery is going to be three hours late • Or you get half in the morning and the other half later in the day • It really depends on what you do with the data • This is where metrics + fixable data store help...
  • 38. Fixable data store • Using Hive partitions • Jobs that move data from staging create partitions • When new data / insight about the data arrives, drop the partition and re-insert • Be careful to reset any metrics in this case • Basically: instead of trying to make everything transactional, repair afterwards • Use metrics to determine whether data is fit for purpose
  • 40. Metrics service • Job ran, so may units processed, took so much time • e.g. 10GB imported, took 1 hr • e.g. 60M records transformed, took 10 minutes • Dropped partition • Inserted X records into partition
  • 41. GoDataDriven We’re hiring / Questions? / Thank you! @fzk frisovanvollenhoven@godatadriven.com Friso van Vollenhoven