@martin_loetzsch
Dr. Martin Loetzsch
PyData Berlin September meetup
https://github.com/mara
Lightweight ETL pipelines with Mara
All the data of the company in one place


Data is
the single source of truth
easy to access
documented
embedded into the organisation
Integration of different domains
Main challenges
Consistency & correctness
Changeability
Complexity
Transparency
!2
Data warehouse = integrated data
@martin_loetzsch
Nowadays required for running a business
[Diagram: application databases, events, CSV files, and APIs (orders, users, products, price histories, emails, clicks, operation events, …) feed the DWH, which in turn feeds reporting, CRM, marketing, search, pricing, …]
!3
Data Engineering
@martin_loetzsch
!4
Which technology?
@martin_loetzsch
Avoid click-tools
hard to debug
hard to change
hard to scale with team size / data complexity / data volume
Data pipelines as code
SQL files, python & shell scripts
Structure & content of the data warehouse are the result of running code

Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
!5
Make changing and testing things easy
@martin_loetzsch
Apply standard software engineering best practices
Megabytes
Plain scripts
Petabytes
Apache Airflow
In between
Mara
!6
Mara: Things that worked for us
@martin_loetzsch
Open source (MIT license)
[Screenshot: data warehouse schemas, e.g. m_dim, marketing]
!7
Focus: complex data
@martin_loetzsch
Many sources, complex business logic
!8
Mara
@martin_loetzsch
Runnable app
Integrates PyPI project download stats with GitHub repo events
!9
Example: Python project stats data warehouse
@martin_loetzsch
https://github.com/mara/mara-example-project
Example pipeline
pipeline = Pipeline(
    id="pypi",
    description="Builds a PyPI downloads cube using the public ..")

# ..

pipeline.add(
    Task(id="transform_python_version", description='..',
         commands=[
             ExecuteSQL(sql_file_name="transform_python_version.sql")
         ]),
    upstreams=['read_download_counts'])

pipeline.add(
    ParallelExecuteSQL(
        id="transform_download_counts", description="..",
        sql_statement="SELECT pypi_tmp.insert_download_counts(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download_counts.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer",
               "transform_python_version"])
!10
ETL pipelines as code
@martin_loetzsch
Pipeline = list of tasks with dependencies between them. Task = list of commands
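The "pipeline = tasks + dependencies, task = commands" model can be sketched in plain Python. This is a conceptual illustration only, not Mara's actual implementation:

```python
# Conceptual sketch: a pipeline is a set of tasks with upstream
# dependencies; a task is an ordered list of commands. Tasks run in
# dependency order; each task runs its commands in order.
from graphlib import TopologicalSorter


class Task:
    def __init__(self, id, commands):
        self.id, self.commands = id, commands


class Pipeline:
    def __init__(self):
        self.tasks, self.upstreams = {}, {}

    def add(self, task, upstreams=()):
        self.tasks[task.id] = task
        self.upstreams[task.id] = set(upstreams)

    def run(self):
        executed = []
        # topological order guarantees upstreams run before downstreams
        for task_id in TopologicalSorter(self.upstreams).static_order():
            for command in self.tasks[task_id].commands:
                command()
            executed.append(task_id)
        return executed


pipeline = Pipeline()
pipeline.add(Task('extract', [lambda: None]))
pipeline.add(Task('transform', [lambda: None]), upstreams=['extract'])
pipeline.add(Task('load', [lambda: None]), upstreams=['transform'])
```

Calling `pipeline.run()` here executes extract, then transform, then load.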
Target of computation

CREATE TABLE m_dim_next.region (
  region_id    SMALLINT PRIMARY KEY,
  region_name  TEXT NOT NULL UNIQUE,
  country_id   SMALLINT NOT NULL,
  country_name TEXT NOT NULL,
  _region_name TEXT NOT NULL
);

Do computation and store result in table

WITH raw_region AS (
  SELECT DISTINCT
    country,
    region
  FROM m_data.ga_session
  ORDER BY country, region)

INSERT INTO m_dim_next.region
SELECT
  row_number() OVER (ORDER BY country, region) AS region_id,
  CASE WHEN (SELECT count(DISTINCT country)
             FROM raw_region r2
             WHERE r2.region = r1.region) > 1
       THEN region || ' / ' || country
       ELSE region END AS region_name,
  dense_rank() OVER (ORDER BY country) AS country_id,
  country AS country_name,
  region AS _region_name
FROM raw_region r1;

INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');

Speed up subsequent transformations

SELECT util.add_index(
  'm_dim_next', 'region',
  column_names := ARRAY ['_region_name', 'country_name', 'region_id']);

SELECT util.add_index(
  'm_dim_next', 'region',
  column_names := ARRAY ['country_id', 'region_id']);

ANALYZE m_dim_next.region;
!11
PostgreSQL as a data processing engine
@martin_loetzsch
Leave data in the DB; tables as (intermediate) results of processing steps
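The CASE expression in the SQL above disambiguates region names that occur in more than one country ("Berlin / DE" vs. plain "Paris"). The same rule in Python, as an illustrative sketch:

```python
# Sketch of the region-name disambiguation rule: a region name that
# occurs in more than one country gets " / <country>" appended.
from collections import Counter


def disambiguated_region_names(country_region_pairs):
    pairs = set(country_region_pairs)
    # how many distinct countries each region name appears in
    countries_per_region = Counter(region for _, region in pairs)
    return {(country, region):
                f"{region} / {country}"
                if countries_per_region[region] > 1 else region
            for country, region in pairs}


names = disambiguated_region_names(
    [("DE", "Berlin"), ("US", "Berlin"), ("FR", "Paris")])
```

"Berlin" exists in both DE and US, so both get the country suffix; "Paris" stays unchanged.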
Execute query

ExecuteSQL(sql_file_name="preprocess-ad.sql")

cat app/data_integration/pipelines/facebook/preprocess-ad.sql \
  | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
    psql --username=mloetzsch --host=localhost --echo-all \
         --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl

Read file

ReadFile(file_name="country_iso_code.csv",
         compression=Compression.NONE,
         target_table="os_data.country_iso_code",
         mapper_script_file_name="read-country-iso-codes.py",
         delimiter_char=";")

cat "dwh-data/country_iso_code.csv" \
  | .venv/bin/python3.6 "app/data_integration/pipelines/load_data/read-country-iso-codes.py" \
  | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
    psql --username=mloetzsch --host=localhost --echo-all \
         --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl \
         --command="COPY os_data.country_iso_code FROM STDIN WITH CSV DELIMITER AS ';'"

Copy from other databases

Copy(sql_file_name="pdm/load-product.sql", source_db_alias="pdm",
     target_table="os_data.product",
     replace={"@@db@@": "K24Pdm", "@@dbschema@@": "ps",
              "@@client@@": "kfzteile24 GmbH"})

cat app/data_integration/pipelines/load_data/pdm/load-product.sql \
  | sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/kfzteile24 GmbH/g" \
  | sed 's/$/$/g;s/$/$/g' | (cat && echo ';') \
  | (cat && echo ';
go') \
  | sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv \
  | PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning \
    psql --username=mloetzsch --host=localhost --echo-all \
         --no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl \
         --command="COPY os_data.product FROM STDIN WITH CSV HEADER"
!12
Shell commands as interface to data & DBs
@martin_loetzsch
Nothing is faster than a unix pipe
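A mapper script such as read-country-iso-codes.py typically sits in the middle of such a pipe: it reads rows from stdin, transforms them, and writes CSV to stdout. The slides don't show the real script, so the following is a hypothetical sketch of that shape:

```python
# Hypothetical sketch of a stdin -> stdout CSV mapper script (the real
# read-country-iso-codes.py may do something different).
import csv
import io
import sys


def map_lines(in_stream, out_stream, delimiter=';'):
    # read CSV rows from the input stream, transform each cell
    # (here: just trim whitespace), write CSV rows to the output
    reader = csv.reader(in_stream, delimiter=delimiter)
    writer = csv.writer(out_stream, delimiter=delimiter)
    for row in reader:
        writer.writerow([cell.strip() for cell in row])


if __name__ == '__main__':
    map_lines(sys.stdin, sys.stdout)
```

Because the script only touches streams, it composes with `cat`, `psql --command="COPY … FROM STDIN"`, and anything else in the pipe.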
Read a set of files

pipeline.add(
    ParallelReadFile(
        id="read_download",
        description="Loads PyPI downloads from pre_downloaded csv files",
        file_pattern="*/*/*/pypi/downloads-v1.csv.gz",
        read_mode=ReadMode.ONLY_NEW,
        compression=Compression.GZIP,
        target_table="pypi_data.download",
        delimiter_char="\t", skip_header=True, csv_format=True,
        file_dependencies=read_download_file_dependencies,
        date_regex="^(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/",
        partition_target_table_by_day_id=True,
        timezone="UTC",
        commands_before=[
            ExecuteSQL(
                sql_file_name="create_download_data_table.sql",
                file_dependencies=read_download_file_dependencies)
        ]))

Split large joins into chunks

pipeline.add(
    ParallelExecuteSQL(
        id="transform_download",
        description="Maps downloads to their dimensions",
        sql_statement="SELECT pypi_tmp.insert_download(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer"])
!13
Incremental & parallel processing
@martin_loetzsch
You can’t join all clicks with all customers at once
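The chunking idea behind parameter_function can be illustrated as follows. This is a sketch of the principle only; the details of etl_tools.utils.chunk_parameter_function are assumptions, not Mara's actual code:

```python
# Illustration of chunked processing: partition rows by id modulo the
# number of chunks and transform each chunk independently (and thus in
# parallel), instead of joining everything at once.
NUMBER_OF_CHUNKS = 4


def chunk_parameters():
    """One parameter tuple per chunk, substituted for @chunk@."""
    return [(chunk,) for chunk in range(NUMBER_OF_CHUNKS)]


def select_chunk(row_ids, chunk):
    # equivalent of a SQL predicate like "WHERE row_id % 4 = :chunk"
    return [row_id for row_id in row_ids
            if row_id % NUMBER_OF_CHUNKS == chunk]


row_ids = list(range(10))
chunks = [select_chunk(row_ids, chunk) for (chunk,) in chunk_parameters()]
```

Every row lands in exactly one chunk, so the union of the per-chunk results equals the result of the one big join.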
Consider Mara for your next ETL project
@martin_loetzsch
!14
!15
Refer us a data person, earn 1000€
@martin_loetzsch
Also sales people, developers, product managers, etc.
Thank you
@martin_loetzsch
!16
