3D: DBT using Databricks and Delta
Fokko Driesprong
Principal Code Connoisseur

whoami
▪ Fokko Driesprong
▪ Master's in Distributed Systems & Software Engineering
▪ Code Connoisseur at GoDataDriven
▪ Mostly doing {data,software} engineering
▪ Open source enthusiast
▪ ASF Member
▪ Committer + PMC member on Apache {Airflow, Avro, Druid}
▪ Committer on Apache Parquet

GoDataDriven
▪ Amsterdam-based consultancy
▪ Data {Engineering,Science,Strategy}
▪ And now also analytics engineering!
▪ Around 50 consultants
▪ Used to do Hadoop, now Cloud

Agenda
What is DBT?
DBT + Delta Lake
DBT + Azure Databricks

What’s DBT? And why I ❤ it so much

Data Build Tool
▪ Tool for building data pipelines following the DataOps principles
▪ Simple tool to build complex pipelines
▪ Best practices from software engineering
▪ Linting / Peer reviews / DRY principle / Data testing
▪ SQL First
▪ Encodes organisational knowledge into the pipeline
Try it yourself: https://godatadriven.com/blog/tutorial-for-dbt-analytics-engineering-made-easy/
But everyone just calls it DBT

Data Build Tool
▪ Main focus on integration with DWH platforms
▪ Postgres, Redshift, Snowflake and BigQuery
▪ Support for Spark / Databricks
▪ Created by Fishtown Analytics
▪ Huge open source community
▪ Apache 2.0 Open Source license
But everyone just calls it DBT

DBT
The T in Extract-Load-Transform (ELT)
Analysts using dbt can transform their data by simply writing select statements, while
dbt handles turning these statements into tables and views in a data warehouse.

Let’s go through a small example
Combine orders and order_lines into a revenue table

SQL with some Jinja2 sauce
Revenue table
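
The SQL on this slide did not survive extraction, so here is a minimal sketch of what combining orders and order_lines into a revenue table could look like. The column names (order_id, customer, quantity, unit_price) are assumptions for illustration, and Python's built-in sqlite3 stands in for the warehouse. In a dbt model the select would live in its own .sql file, with the table names written as Jinja2 ref() calls.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; all table and column
# names below are illustrative assumptions, not taken from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_lines (order_id INTEGER, quantity INTEGER, unit_price REAL);
    INSERT INTO orders VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO order_lines VALUES (1, 2, 10.0), (1, 1, 5.0), (2, 3, 4.0);
""")

# The model itself is just a select statement: one revenue row per order.
REVENUE_SQL = """
    SELECT o.order_id, o.customer, SUM(l.quantity * l.unit_price) AS revenue
    FROM orders AS o
    JOIN order_lines AS l ON l.order_id = o.order_id
    GROUP BY o.order_id, o.customer
    ORDER BY o.order_id
"""

revenue = conn.execute(REVENUE_SQL).fetchall()
print(revenue)
```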

DBT as a SQL Runner
Executes the pipeline from the command line

Seamless integration with the Databricks Metastore
▪ Requires a Hive Metastore
▪ Analyze table

DBT as a SQL Compiler
Compiled SQL: target/compiled/dbtpreprocessing/models/revenue.sql
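
To illustrate the compile step, here is a toy sketch of how a ref() placeholder could be resolved into a schema-qualified table name. dbt does this with the Jinja2 templating engine. The regular expression, the compile_refs helper and the "analytics" schema below are hypothetical stand-ins, not dbt's implementation.

```python
import re

# Hypothetical target schema; dbt would take this from the active profile.
TARGET_SCHEMA = "analytics"

def compile_refs(model_sql: str, schema: str = TARGET_SCHEMA) -> str:
    """Replace {{ ref('name') }} placeholders with schema-qualified names."""
    pattern = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")
    return pattern.sub(lambda m: f"{schema}.{m.group(1)}", model_sql)

model = "select * from {{ ref('orders') }} join {{ ref('order_lines') }} using (order_id)"
compiled = compile_refs(model)
print(compiled)
```

Because every model names its inputs through ref(), the references also describe the dependency graph between models, which lets the runner decide the execution order.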

Next to the SQL there is documentation
Give meaning to the columns and add constraints

Looking at the docs
dbt docs generate
dbt docs serve
▪ Columns including types
▪ Test constraints
▪ Statistics
▪ The compiled query

Testing
dbt test
▪ Not-null
▪ Uniqueness
▪ Accepted values
▪ Referential constraints
▪ Custom tests
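
A data test boils down to a query that selects the offending rows: the test passes when the query returns nothing. The sketch below shows that idea for the not-null and uniqueness checks, assuming a hypothetical revenue table and using sqlite3 as a stand-in for Spark. The failing_rows helper is illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revenue (order_id INTEGER, revenue REAL);
    -- Deliberately dirty data: one duplicate order_id and one NULL.
    INSERT INTO revenue VALUES (1, 25.0), (2, 12.0), (2, 12.0), (NULL, 3.0);
""")

def failing_rows(sql: str) -> int:
    """Count the rows returned by a test query; zero means the test passes."""
    return len(conn.execute(sql).fetchall())

# Not-null test: select every row that violates the constraint.
not_null_failures = failing_rows("SELECT * FROM revenue WHERE order_id IS NULL")

# Uniqueness test: select every key that occurs more than once.
unique_failures = failing_rows("""
    SELECT order_id FROM revenue
    WHERE order_id IS NOT NULL
    GROUP BY order_id HAVING COUNT(*) > 1
""")
print(not_null_failures, unique_failures)
```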

How does DBT communicate with Spark?
▪ SQL Over HTTP
▪ Authenticate using the token
▪ Parallel execution

Switch to incremental ingestion
Using the Delta format
▪ ACID data format by Databricks
▪ Linux Foundation project
▪ Allows MERGE INTO
▪ Enables incremental imports

Switch to incremental Delta
If the table doesn’t exist (yet)

Switch to incremental Delta
Incremental MERGE INTO if the table exists

In practice
Incremental imports
▪ Watermark column
▪ Only load the changed orders
▪ Also interesting for Users table
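
The watermark pattern can be sketched as follows, assuming hypothetical source_orders and target_orders tables with an updated_at watermark column. SQLite's INSERT OR REPLACE stands in for Delta's MERGE INTO here. The semantics are not identical, but the flow is the same: read the highest watermark already loaded, then upsert only the newer rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER, status TEXT, updated_at TEXT);
    CREATE TABLE target_orders (order_id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT);
    INSERT INTO source_orders VALUES
        (1, 'shipped', '2021-01-01'), (2, 'new', '2021-01-02');
""")

def incremental_load() -> int:
    """Upsert source rows newer than the target's watermark; return rows loaded."""
    # The empty string sorts before every ISO date, so an empty target loads everything.
    watermark = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM target_orders"
    ).fetchone()[0]
    new_rows = conn.execute(
        "SELECT order_id, status, updated_at FROM source_orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    # Upsert on the primary key: new orders are inserted, changed ones overwritten.
    conn.executemany("INSERT OR REPLACE INTO target_orders VALUES (?, ?, ?)", new_rows)
    return len(new_rows)

first = incremental_load()   # target is empty, so both orders are loaded
conn.execute(
    "UPDATE source_orders SET status = 'shipped', updated_at = '2021-01-03' WHERE order_id = 2"
)
second = incremental_load()  # only the changed order passes the watermark filter
rows = conn.execute("SELECT * FROM target_orders ORDER BY order_id").fetchall()
print(first, second, rows)
```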

DBT Macros
Running it a second time
▪ Don’t Repeat Yourself
▪ Write a Macro instead

Observability is king
Keeping track of your pipelines
▪ Building trust
▪ Track aggregated metrics computed by Spark
▪ Application Insights
▪ Centralized system

Very simple Hive UDF
Keep track of stats over time

Small snippet of Scala
Sends the metrics to Application Insights

Use the UDF in DBT
Sends the metrics to Application Insights
▪ Register the UDF
▪ Keep track of
▪ Seconds since last order
▪ Number of orders
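
In the talk these metrics are computed by Spark and shipped through the UDF. The sketch below only pins down what the two metrics mean, using plain Python datetimes. The order_metrics function and the sample timestamps are hypothetical, and nothing is sent to Application Insights.

```python
from datetime import datetime, timezone

def order_metrics(order_times, now):
    """Return the two metrics from the slide for a list of order timestamps."""
    return {
        "number_of_orders": len(order_times),
        "seconds_since_last_order": (now - max(order_times)).total_seconds(),
    }

orders = [
    datetime(2021, 5, 1, 12, 0, tzinfo=timezone.utc),
    datetime(2021, 5, 1, 12, 30, tzinfo=timezone.utc),
]
metrics = order_metrics(orders, now=datetime(2021, 5, 1, 13, 0, tzinfo=timezone.utc))
print(metrics)
```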

Be proactive
Before there are angry managers at your desk
▪ Keep track of the metric
▪ Send alerts on business rules
▪ Outliers based on historical distributions
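
One simple way to flag an outlier against a historical distribution is a z-score check. This is a sketch under assumptions: the three-standard-deviation threshold and the sample daily order counts are illustrative, and the slides do not say which method was used.

```python
from statistics import mean, stdev

def is_outlier(history, value, threshold=3.0):
    """Flag value when it lies more than `threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu  # constant history: any deviation is suspicious
    return abs(value - mu) / sigma > threshold

daily_orders = [100, 104, 98, 101, 97, 102, 99]
normal_day = is_outlier(daily_orders, 103)  # within the usual spread
bad_day = is_outlier(daily_orders, 20)      # far below anything seen before
print(normal_day, bad_day)
```

An alert would then fire only on days where the check returns True, ideally before a manager notices the missing orders.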

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
▪ Code available at:
▪ https://github.com/godatadriven/dbt-data-ai-summit
▪ https://github.com/godatadriven/azure-dbt-logger
