Data Build Tool (DBT) is an open source technology for setting up your data lake using best practices from software engineering. This SQL-first technology is a great match for Databricks and Delta, and lets you maintain high-quality data and documentation during the entire data lake life cycle. In this talk I’ll give an introduction to DBT and show how we can leverage Databricks to do the actual heavy lifting. Next, I’ll present how DBT supports Delta to enable upserting using SQL. Finally, we show how we integrate DBT and Databricks into the Azure cloud, and how we emit the pipeline metrics to Azure Monitor so that you have observability over your pipeline.
2. whoami
▪ Fokko Driesprong
▪ Master Distributed Systems & Software Engineering
▪ Code Connoisseur at GoDataDriven
▪ Mostly doing {data,software} engineering
▪ Open source enthusiast
▪ ASF Member
▪ Committer + PMC member on Apache {Airflow, Avro, Druid}
▪ Committer on Apache Parquet
3. GoDataDriven
▪ Amsterdam based consultancy
▪ Data {Engineering,Science,Strategy}
▪ And now also analytics engineering!
▪ Around 50 consultants
▪ Used to do Hadoop, now Cloud
6. Data Build Tool
▪ Tool for building data pipelines following the DataOps principles
▪ Simple tool to build complex pipelines
▪ Best practices from software engineering
▪ Linting / Peer reviews / DRY principle / Data testing
▪ SQL First
▪ Encodes organisational knowledge into the pipeline
Try it yourself: https://godatadriven.com/blog/tutorial-for-dbt-analytics-engineering-made-easy/
But everyone just calls it DBT
7. Data Build Tool
▪ Main focus on integration with DWH platforms
▪ Postgres, Redshift, Snowflake and BigQuery
▪ Support for Spark / Databricks
▪ Created by Fishtown Analytics
▪ Huge open source community
▪ Apache 2.0 Open Source license
8. DBT
The T in Extract-Load-Transform (ELT)
Analysts using dbt can transform their data by simply writing select statements, while
dbt handles turning these statements into tables and views in a data warehouse.
9. Let’s go through a small example
Combine orders and order_lines into a revenue table
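A minimal sketch of what this example could look like, not the code from the slides. It uses Python's built-in sqlite3 as a stand-in for the warehouse; the column names, the sample rows, and the `materialize` helper are assumptions made for illustration. The point is that the model itself is only the SELECT, and the tool wraps it in the DDL that creates the table.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, ordered_at TEXT);
    CREATE TABLE order_lines (order_id INTEGER, product TEXT, quantity INTEGER, unit_price REAL);
    INSERT INTO orders VALUES (1, 'alice', '2020-11-01'), (2, 'bob', '2020-11-02');
    INSERT INTO order_lines VALUES
        (1, 'book', 2, 10.0),
        (1, 'pen',  1,  2.5),
        (2, 'book', 1, 10.0);
""")

# The "model": a plain SELECT that combines orders and order_lines.
REVENUE_SELECT = """
    SELECT o.order_id, o.customer, SUM(l.quantity * l.unit_price) AS revenue
    FROM orders AS o
    JOIN order_lines AS l ON l.order_id = o.order_id
    GROUP BY o.order_id, o.customer
"""

def materialize(conn, name, select_sql):
    """Hypothetical helper: rebuild table `name` from a SELECT statement."""
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {select_sql}")

materialize(conn, "revenue", REVENUE_SELECT)
revenue = conn.execute(
    "SELECT order_id, customer, revenue FROM revenue ORDER BY order_id"
).fetchall()
print(revenue)
```

Rebuilding the whole table on every run keeps the logic simple, at the cost of reprocessing all the data each time.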
19. Switch to incremental ingestion
Using the Delta format
▪ ACID data format by Databricks
▪ Linux Foundation project
▪ Allows MERGE INTO
▪ Enables incremental imports
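The upsert behaviour that MERGE INTO provides can be sketched as follows. This is an illustration under stated assumptions, not Delta itself: SQLite has no MERGE INTO, so `INSERT OR REPLACE` on a primary key is swapped in to mimic "update when matched, insert when not matched". The table layout and the batch contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revenue (order_id INTEGER PRIMARY KEY, revenue REAL);
    INSERT INTO revenue VALUES (1, 22.5), (2, 10.0);
""")

# Hypothetical incremental batch: order 2 was corrected, order 3 is new.
new_batch = [(2, 12.0), (3, 7.5)]

# Stand-in for a merge keyed on order_id:
# rows with an existing key are replaced, rows with a new key are inserted.
conn.executemany(
    "INSERT OR REPLACE INTO revenue (order_id, revenue) VALUES (?, ?)",
    new_batch,
)

merged = conn.execute(
    "SELECT order_id, revenue FROM revenue ORDER BY order_id"
).fetchall()
print(merged)
```

Only the rows in the batch are touched, which is what makes incremental imports cheaper than rebuilding the full table.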
26. Observability is king
Keeping track of your pipelines
▪ Building trust
▪ Track aggregated metrics computed by Spark
▪ Application Insights
▪ Centralized system
28. Small snippet of Scala
Sends the metrics to Application Insights
29. Use the UDF in DBT
Sends the metrics to Application Insights
▪ Register the UDF
▪ Keep track of
▪ Seconds since last order
▪ Number of orders
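The two metrics listed above can be computed as in the sketch below. In the talk they are computed by Spark and sent through a Scala UDF; this Python version is only a stand-in that shows the arithmetic. The function names are hypothetical, and `emit` prints instead of calling any Application Insights API.

```python
from datetime import datetime, timezone

def order_metrics(order_timestamps, now):
    """Compute the number of orders and the seconds since the last order."""
    if not order_timestamps:
        return {"number_of_orders": 0, "seconds_since_last_order": None}
    latest = max(order_timestamps)
    return {
        "number_of_orders": len(order_timestamps),
        "seconds_since_last_order": (now - latest).total_seconds(),
    }

def emit(metrics, sink=print):
    """Hypothetical emitter: a real pipeline would forward these to a monitoring backend."""
    for name, value in metrics.items():
        sink(f"{name}={value}")

# Fixed timestamps (assumed sample data) so the result is reproducible.
now = datetime(2020, 11, 18, 12, 0, 0, tzinfo=timezone.utc)
orders = [
    datetime(2020, 11, 18, 11, 0, 0, tzinfo=timezone.utc),
    datetime(2020, 11, 18, 11, 59, 30, tzinfo=timezone.utc),
]
metrics = order_metrics(orders, now)
emit(metrics)
```

A steadily growing "seconds since last order" is a cheap signal that ingestion has stalled, even when the pipeline itself reports success.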
30. Be proactive
Before there are angry managers at your desk
▪ Keep track of the metric
▪ Send alerts on business rules
▪ Outlier detection based on historical distributions
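One simple way to flag an outlier against a historical distribution is a z-score check, sketched below. The talk does not say which method is used, so the z-score, the threshold of 3, and the sample numbers are all assumptions.

```python
from statistics import mean, stdev

def is_outlier(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate a spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # history is constant: any deviation is unusual
    return abs(value - mu) / sigma > threshold

# Hypothetical daily order counts from the past week.
daily_orders = [100, 104, 98, 101, 97, 103, 99]

normal = is_outlier(daily_orders, 102)  # close to the historical mean
alarm = is_outlier(daily_orders, 40)    # far below anything seen before
print(normal, alarm)
```

A check like this can back an alert rule, so that a sudden drop in orders is noticed before anyone has to ask about it.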
31. Feedback
▪ Code available at:
▪ https://github.com/godatadriven/dbt-data-ai-summit
▪ https://github.com/godatadriven/azure-dbt-logger