Practical Kimball Data Patterns
Antoni Ivanov
Lead maintainer of Versatile Data Kit
2
Where are we going?
Data Science & Data Engineering Process
Dimensional modelling
Versatile Data Kit
Create our own dimension and facts (exercise)
Create ML model (exercise)
3
Data Science & Data Engineering Process
https://neptune.ai/blog/best-practices-for-data-science-project-workflows-and-file-organizations
4
Where are we in the agenda?
Data Science Process
Dimensional modelling
Versatile Data Kit
Create our own dimension and facts (exercise)
Create ML model using our fact and dimension tables (exercise)
Cons and Pros of dimensional modelling in ML
5
Data modeling
Data Integration and Transformation
Data Sources
3rd party SaaS products
Corporate systems/DBs
Data driven products
Insights
BI
Data Science tools
Business model
6
Dimensional modeling
Dimension and fact tables
7
Why dimensional modelling?
Performance
Extensibility
Consistency
Ease of understanding
8
What is Kimball?
https://www.kimballgroup.com/
Architecture
Process
Design Patterns (Techniques)
9
Kimball Architecture
Quick mention
Data Integration and Transformation
Insights
Data Sources
3rd party SaaS products
Corporate systems/DBs
Business model
Staging (Data Lake)
Data driven products
BI
Data Science tools
Back room (the kitchen)
Front room (the dining room)
Metadata
See more in https://bit.ly/kimball-architecture
10
Kimball Dimensional Design Process
Data modelling steps consider both business needs and data realities.
Identify the business process
  Bank: checking account balance
  Airline: boarding a plane
Identify the grain
  Bank: the bank account balance each month
  Airline: the boarding pass of a passenger scanned at the gate
Identify the dimensions
  Bank: Date, Customer, Bank
  Airline: Date, Passenger, Flight, Airline
Identify the facts
  Bank: monthly account balance snapshot
  Airline: passenger boarding event
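As a minimal sketch, the bank example of the four-step process could translate into a small star schema like the one below. It uses Python's built-in sqlite3 module. All table and column names (dim_date, fact_monthly_account_balance and so on) are hypothetical and not taken from the talk, and the composite primary key assumes one account per customer per bank so that the declared grain (one balance per month) can be enforced by the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Dimensions: Date, Customer, Bank. Fact: the monthly account balance snapshot.
# All names below are hypothetical, chosen only for this illustration.
conn.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, month TEXT NOT NULL);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dim_bank     (bank_key INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE fact_monthly_account_balance (
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    bank_key     INTEGER NOT NULL REFERENCES dim_bank(bank_key),
    balance      REAL NOT NULL,
    -- The grain: one row per month, customer and bank.
    PRIMARY KEY (date_key, customer_key, bank_key)
);
INSERT INTO dim_date VALUES (202301, '2023-01');
INSERT INTO dim_customer VALUES (1, 'Example Customer');
INSERT INTO dim_bank VALUES (1, 'Example Bank');
""")


def insert_balance(date_key, customer_key, bank_key, balance):
    """Insert one snapshot row; return False if it violates the grain or a key."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO fact_monthly_account_balance VALUES (?, ?, ?, ?)",
                (date_key, customer_key, bank_key, balance),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = insert_balance(202301, 1, 1, 1500.0)      # accepted
duplicate = insert_balance(202301, 1, 1, 1700.0)  # rejected: same month, customer, bank
row_count = conn.execute(
    "SELECT COUNT(*) FROM fact_monthly_account_balance"
).fetchone()[0]
```

Declaring the grain as a key is one design choice among several; a real warehouse may instead check the grain in the loading job.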
11
Kimball Data Modelling Design Patterns
Kimball Dimensional Modelling Techniques
Transaction fact tables
Periodic snapshot fact tables
Accumulating snapshot fact tables
Slowly Changing Dimension Type 1 to 6
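To make the slowly changing dimension idea concrete, here is a minimal sketch of Type 1 (overwrite, no history) and Type 2 (expire the current row and add a new version). It is plain Python over a list of dictionaries; the column names (customer_key, customer_id, city, valid_from, valid_to, is_current) are assumptions made for this illustration, and the other SCD types and the fact table patterns are not shown.

```python
from datetime import date


def apply_scd_type1(rows, customer_id, city):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = city


def apply_scd_type2(rows, customer_id, city, change_date):
    """Type 2: close the current row and append a new version with a new surrogate key."""
    current = next(
        r for r in rows if r["customer_id"] == customer_id and r["is_current"]
    )
    if current["city"] == city:
        return  # nothing changed, keep the current version
    current["valid_to"] = change_date
    current["is_current"] = False
    rows.append({
        "customer_key": max(r["customer_key"] for r in rows) + 1,  # surrogate key
        "customer_id": customer_id,                                # natural key
        "city": city,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    })


# Hypothetical customer dimension with a single customer.
dim_customer = [{
    "customer_key": 1, "customer_id": "C-1", "city": "Sofia",
    "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True,
}]

apply_scd_type2(dim_customer, "C-1", "Plovdiv", date(2023, 5, 1))
```

After the call the dimension holds two rows for the same customer: the expired Sofia version and the current Plovdiv version, so facts recorded before the move still join to the city that was valid at the time.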
12
Where are we in the agenda?
Data Science Process
Dimensional modelling
Versatile Data Kit
Create our own dimension and facts (exercise)
Create ML model using our fact and dimension tables (exercise)
Cons and Pros of dimensional modelling in ML
13
Versatile Data Kit
Data lifecycle (Data Journey) and where VDK fits in
Ingest Data Job
Transform Data Job
Export Data Job
Data Integration and Transformation
Insights
Data Sources
3rd party SaaS products
Corporate systems/DBs
Business model
Raw Data (Data Lake)
Data driven products
BI & Data Science tools
Automate DevOps for Data
14
Where are we in the agenda?
Data Science Process
Dimensional modelling
Versatile Data Kit
Create our own dimension and facts (exercise)
Create ML model (exercise)
Cons and Pros of dimensional modelling in ML
15
The Data & ML Journey
Data Modeling
Insights
Data Sources
Product events
Corporate systems
Data model (Dimensional model)
Raw Data (Data Lake)
Data driven products
BI & Data Science tools
Ingest
Transform
Publish
Export
ML Modeling
Train & Validation
Train
Model Object
Confidential │ ©2021 VMware, Inc. 16
Use Case: EV Prediction Model
• Follow Dimensional Design Process
• Process data into dimension and facts
• Read from local files (CSV/Excel)
• Create data jobs using Versatile Data Kit
• Build and test a linear regression model (using VDK)
• Create an interactive visualization (using Streamlit)
17
We work for Volkswagen (or Tesla or…) and
our customers complain that their battery is
drained in the middle of a trip.
We want to provide them with an app that answers questions such as:
How much should you expect your battery to be
drained if you drive 60 km at 50 km per hour, using
heated seats?
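A question like this can be answered with a linear regression over trip data. The sketch below is only an illustration under stated assumptions: the trips are synthetic (generated from a made-up rule, not real measurements), the feature names are hypothetical, and the fit is a small pure-Python least-squares solver rather than whatever library the exercise itself uses.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    X = [[1.0, *row] for row in rows]  # prepend an intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefficients, features):
    """Apply the fitted coefficients (intercept first) to one observation."""
    return sum(c * x for c, x in zip(coefficients, [1.0, *features]))


# Synthetic trips: (distance_km, speed_kmh, heated_seats_on). NOT real measurements.
trips = [(10, 30, 0), (20, 50, 1), (40, 80, 0), (60, 50, 1), (80, 100, 0), (30, 60, 1)]
# Synthetic target from a made-up rule, so the fit can be checked:
# drain_pct = 1.0 + 0.2 * distance + 0.05 * speed + 2.0 * heated_seats
drain_pct = [1.0 + 0.2 * d + 0.05 * s + 2.0 * h for d, s, h in trips]

coefficients = fit_linear_regression(trips, drain_pct)
# Expected battery drain for 60 km at 50 km/h with heated seats on.
expected_drain = predict(coefficients, (60, 50, 1))
```

Because the synthetic data contain no noise, the fit recovers the made-up rule; with real trip data the coefficients would only approximate the underlying relationship and the model would need a separate validation set.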
18
Expected application would look like
19
Try it yourself at home
What you need: a laptop and Internet connectivity
Open: bit.ly/dsc-demo
20
Where are we in the agenda?
Dimensional modelling
Versatile Data Kit
Create our own dimension and facts (exercise)
Create ML model (exercise)
21
Feedback is appreciated
https://bit.ly/vdk-dsc
22
Thank you!
https://github.com/vmware/versatile-data-kit/#contacts
https://www.linkedin.com/in/antoni-ivanov
26
Meeting business needs with quality and efficiency
Challenges:
• Efficiently processing the data and making it ready for BI and Data Science
• Troubleshooting and debugging data issues
• Quickly enhancing existing analytics
• Transforming raw data into business KPIs
• Productionizing the data analytics
Product Telemetry, Billing Data, NPS Customer Success, Customer Data, Support Data
Integrate data from diverse data sources
Clean & pre-process data
Reporting, Advanced Analytics and Data Science
Troubleshoot & debug
Deploy & Operate

[DSC Adria 23] Antoni Ivanov Practical Kimball Data Patterns.pptx


Editor's notes

  1. In this course we will create our own data warehouse with a star schema, following Kimball, and then use it in a simple ML model. We will discuss the benefits and downsides of using warehousing design patterns in ML.
     https://www.kimballgroup.com/2008/11/fact-tables
     https://www.mighty.digital/blog/data-modeling-techniques-explained
     https://www.educba.com/fact-table-vs-dimension-table/
     https://www.softwaretestinghelp.com/dimensional-data-model-in-data-warehouse/
     https://www.bluegranite.com/blog/dimensional-modeling-in-the-advanced-analytics-age#:~:text=Dimensional%20models%20aren't%20just,they%20also%20benefit%20data%20scientists.
     https://towardsdatascience.com/dimensional-modelling-for-customer-churn-9d0148548f04
     https://www.astera.com/type/blog/automate-dimensional-modeling-data-warehouse/
     https://github.com/chrthomsen/pygrametl/tree/master/docs/examples
     Missing here is best practice for DS. DS tools use "observation sets", which blend all variables, item level (fact) and context (aggregate), onto the same flat tuple set in order to drive independent-to-dependent variable inference and other analysis. I have not seen any tooling that helps data scientists do this correctly. Capturing the aggregation level as metadata on the resulting columns, so that downstream aggregates of aggregates are computed correctly, is also completely unaddressed. Best practices are hard because data scientists are not historically code-disciplined. (We run into this a lot during platform and pipeline migrations; AWS to GCP is the moment when the flashlight shines on everything.)
  2. Before we continue with understanding what the problems are, let’s look at the typical data science process flow. Data scientists usually start by being asked an interesting question.
  3. Data Dimensional Modelling (DDM) is a technique that uses dimensions and facts to store the data in a data warehouse efficiently.
     https://en.wikipedia.org/wiki/Dimensional_modelling
  4. Dimensional modeling always uses the concepts of facts (measures) and dimensions (context).
     Fact: measurements or metrics of a business process. Facts are the performance metrics that business users are concerned about. They must be defined in accordance with the declared grain. Facts are usually numerical data, such as total cost or order quantity.
     Dimension: a companion table to the fact table that contains descriptive attributes used to constrain queries. Dimensions can usually be identified easily because they represent the “who, what, where, when, why, and how” associated with the event. A robust set of dimensions representing all possible descriptions should be identified. Some examples: Date, Customer, Employee, Facility.
     http://mis587mozhou.blogspot.com/2014/02/the-four-step-dimensional-design-process.html
  5. Performance: Dimension tables in particular are often highly denormalized. For example, a customer table might store the customer's zip code, town, and state. If you have 20 customers in Sofia, the customer dimension table stores the fact that Sofia is in Bulgaria a total of 20 times. By denormalizing and simplifying the schema (fewer joins), we were able to obtain better and more predictable performance from our data warehouse. This is especially important in modern data architectures that adopt column-oriented storage (where joins are very expensive).
Extensibility: Dimensional modelling is modular by nature; many components can and should be reused. Data warehouses are built incrementally, avoiding a big-bang approach.
Consistency: A dimensional model is designed to integrate various business processes, regardless of the source. For example, a conformed customer dimension gives finance, engineering, and sales teams one common customer reference regardless of the source application.
Ease of understanding: First, the consistent and fairly clear structure of the database allows even a non-technical end user (an accountant or a marketing analyst) to query the model without wondering whether a relationship is 1-n or n-n, or whether there is a loop in the model, and without needing to know that those could be a problem. Second, the data is generally queried in the same way: you join the fact table to the dimensions you need and aggregate some of the metrics.
Most data scientists spend around 80% of their time wrangling, cleaning, and organizing data to obtain a tidy dataset (Wickham, 2014): one observation per row and one variable per column. This type of data structure is extremely easy to obtain from a dimensional model: join the relevant dimensions, aggregate the indicators, and you have a tidy tabular dataset.
Cleaned, organized data ensures that data scientists can focus on actual data science rather than on engineering tasks.
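The "join the dimensions, aggregate the indicators" pattern can be sketched with Python's built-in `sqlite3` module. The tables `dim_customer` and `fact_sales`, their columns, and the rows below are hypothetical; note how the denormalized dimension repeats "Bulgaria" for every customer in Sofia, which is what keeps the query to a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name TEXT,
    town TEXT,
    country TEXT  -- denormalized: repeated for every customer in the same town
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    month TEXT,
    amount REAL
);
INSERT INTO dim_customer VALUES
    (1, 'Ana', 'Sofia', 'Bulgaria'),
    (2, 'Boris', 'Sofia', 'Bulgaria'),
    (3, 'Clara', 'Vienna', 'Austria');
INSERT INTO fact_sales VALUES
    (1, '2024-01', 10.0),
    (2, '2024-01', 20.0),
    (3, '2024-01', 5.0),
    (1, '2024-02', 30.0);
""")


def sales_by_country_month(connection):
    """One join plus one aggregation yields a tidy result:
    one observation per row, one variable per column."""
    return connection.execute("""
        SELECT d.country, f.month, SUM(f.amount) AS total_amount
        FROM fact_sales AS f
        JOIN dim_customer AS d USING (customer_key)
        GROUP BY d.country, f.month
        ORDER BY d.country, f.month
    """).fetchall()


tidy_rows = sales_by_country_month(conn)
```

Each tuple in `tidy_rows` is one (country, month) observation with its aggregated amount, ready to be loaded into a data frame or passed to a model.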
  6. There are many approaches to data modelling. We focus on Kimball: Ralph Kimball introduced the data warehouse/business intelligence industry to dimensional modelling. We should note, however, that other approaches to data modeling are commonly mentioned. One is known as Inmon data modeling, named after data warehouse pioneer Bill Inmon. It focuses on normalized schemas, in contrast to Kimball’s more denormalized approach. A third approach, Data Vault, was released in the early 2000s and aims to tackle changes. https://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
  7. https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/technical-dw-bi-system-architecture/
The Kimball technical system architecture separates the data and processes comprising the DW system into the back room extract, transformation and load (ETL) environment and the front room presentation area, as illustrated in the following diagram.
https://www.kimballgroup.com/2004/03/differences-of-opinion/
https://www.kimballgroup.com/2004/01/data-warehouse-dining-experience/
Data warehouses should have an area that focuses exclusively on data staging and extract, transform, and load (ETL) activities. A separate layer of the warehouse environment should be optimized for presentation of the data to the business constituencies and application developers.
This division is underscored if you consider the similarities between a data warehouse and a restaurant.
The Kitchen
The kitchen of a fine restaurant is a world unto itself. It’s where the magic happens. Talented chefs take raw materials and transform them into appetizing, delicious multi-course meals for the restaurant’s diners.
The layout must be highly efficient
Quality must be high (delicious food)
Food must also be of high integrity (nobody likes poison)
Procured products must meet quality standards
Given the dangerous surroundings, the kitchen is off-limits to patrons. Likewise, the data warehouse’s staging area should be off-limits to business users and reporting/delivery application developers.
The data warehouse’s staging area is very similar to the restaurant’s kitchen. The staging area is where source data is magically transformed into meaningful, presentable information. Like the kitchen, the staging area is designed to ensure throughput. It must transform raw source data into the target model efficiently, minimizing unnecessary movement where possible.
The Dining Room
Food: quality, presentation
Menu: easy to access
Service: prompt, good support
  8. https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/four-4-step-design-process/
The answers to these questions are determined by considering the needs of the business along with the realities of the underlying source data during the collaborative modeling sessions. Following the business process, grain, dimension, and fact declarations, the design team determines the table and column names, sam…
Identify the business process
  9. https://github.com/vmware/versatile-data-kit/wiki/SQL-Data-Processing-templates-examples
  10. Versatile Data Kit (VDK) eases data ingestion, data transformation, and data publishing jobs, while at the same time allowing data users to benefit from good DevOps and DataOps practices.
Versatile Data Kit supports data jobs written in SQL, Python, or both. It comes with a Data SDK, which is used to develop data jobs locally. VDK provides the main building blocks to ingest from any source and to transform data using Python or SQL. For transformations, for example, VDK supports creating Kimball’s dimensional model through templates that build facts and dimensions with SQL only. The VDK Data SDK also provides native DB connections.
With the VDK Data SDK, data users can choose to develop their jobs locally only, or use the Versatile Data Kit Control Service, which provides them with a production setup. The Data SDK comes with data lineage and data quality support and can be used entirely on its own.
The VDK Control Service manages the whole data job lifecycle. It allows data users to productionize Versatile Data Kit data jobs by deploying them. It comes with out-of-the-box deployment, versioning, monitoring, alerting, and notifications, among many other features.
TODO: Showcase that send_object for ingestion works the same way regardless of the underlying infrastructure.
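The ingestion point in the TODO above could be sketched as follows. This is an illustration under assumptions, not a verified VDK example: the notes only say "send_object", and the method name `send_object_for_ingestion` with `payload` and `destination_table` arguments is how we understand the VDK job input API, so check the VDK documentation before relying on it. `RecordingJobInput`, the table name `customer_events`, and the sample rows are hypothetical; the stand-in exists only so the sketch runs without VDK installed.

```python
class RecordingJobInput:
    """Hypothetical stand-in for the job input object VDK passes to a step.

    It only records what it is asked to ingest; the real object would route
    the payload to whichever ingestion target is configured.
    """

    def __init__(self):
        self.sent = []

    def send_object_for_ingestion(self, payload, destination_table):
        # Assumed method name and arguments (see the note above).
        self.sent.append((destination_table, payload))


def run(job_input):
    """A data job step: the code does not mention any infrastructure.

    Where the rows end up is decided by configuration, not by this function.
    """
    rows = [
        {"customer_id": 1, "event": "signup"},
        {"customer_id": 2, "event": "purchase"},
    ]
    for row in rows:
        job_input.send_object_for_ingestion(
            payload=row,
            destination_table="customer_events",  # hypothetical table name
        )


# Local dry run against the stand-in.
fake_input = RecordingJobInput()
run(fake_input)
```

Because `run` depends only on the job input interface, the same step could be exercised locally against a stand-in like this one and later deployed unchanged.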
  12. We would be very happy if you would like to contribute, raise an issue, submit a product request, etc. We are actively looking for partners who wish to collaborate with us, help us understand requirements, discuss common problems, and solve them jointly.
  13. We need a consolidated view of how the service is performing. That view includes information regarding customer count, overall consumption, customer sentiment (e.g. NPS Score), customer onboarding metrics, SLA metrics, etc.