According to Gartner, 85% of Machine Learning projects fail.
Most data scientists spend around 80% of their time wrangling, cleaning, and organizing data to obtain a clean dataset: one observation per row and one variable per column. This type of data structure is straightforward to get from dimensional modeling.
In this session, Antoni will demo the creation of a Data Warehouse, build a star schema using the Kimball methodology, and then use it in a simple ML model. He will discuss the benefits and downsides of using warehousing design patterns in ML.
Where are we going?
Data Science & Data Engineering Process
Dimensional modelling
Versatile Data Kit
Create our own dimension and facts (exercise)
Create ML model (exercise)
Data Science & Data Engineering Process
https://neptune.ai/blog/best-practices-for-data-science-project-workflows-and-file-organizations
Where are we in the agenda?
Data Science Process
Dimensional modelling
Versatile Data Kit
Create our own dimension and facts (exercise)
Create ML model using our fact and dimension tables (exercise)
Pros and Cons of dimensional modelling in ML
Data modeling
[Diagram: Data Sources (3rd-party SaaS products, corporate systems/DBs) → Data Integration and Transformation (business model, data modeling) → Insights (BI, Data Science tools) → Data-driven products]
What is Kimball?
https://www.kimballgroup.com/
Architecture
Process
Design Patterns (Techniques)
Kimball Architecture
Quick mention
[Diagram: Data Sources (3rd-party SaaS products, corporate systems/DBs) → Staging (Data Lake) → Data Integration and Transformation (business model) → Insights (BI, Data Science tools) → Data-driven products. The back room (the kitchen) covers staging and integration; the front room (the dining room) covers the presentation area; metadata spans both.]
See more in https://bit.ly/kimball-architecture
Kimball Dimensional Design Process
Data modelling steps consider both business needs and data realities.
Identify the business process (e.g., checking account balance; boarding a plane)
Identify the grain (e.g., the monthly account balance snapshot; the passenger boarding event)
Identify the dimensions (e.g., Date, Customer, Bank; Date, Passenger, Flight, Airline)
Identify the facts (e.g., the bank account balance each month; the boarding pass scanned at the gate for a passenger)
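To make the grain concrete, here is a minimal sketch of a star schema for the boarding example, written as SQL DDL executed from Python via the standard sqlite3 module. The table and column names are illustrative assumptions, not taken from the session materials.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway warehouse, for illustration only
conn.executescript("""
-- Dimensions: the who/what/when/where context of the boarding event
CREATE TABLE dim_date      (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_passenger (passenger_key INTEGER PRIMARY KEY, name TEXT, frequent_flyer_tier TEXT);
CREATE TABLE dim_flight    (flight_key INTEGER PRIMARY KEY, flight_number TEXT, origin TEXT, destination TEXT);
CREATE TABLE dim_airline   (airline_key INTEGER PRIMARY KEY, airline_name TEXT);

-- Fact table at the declared grain: one row per boarding pass scanned at the gate
CREATE TABLE fact_boarding (
    date_key       INTEGER REFERENCES dim_date(date_key),
    passenger_key  INTEGER REFERENCES dim_passenger(passenger_key),
    flight_key     INTEGER REFERENCES dim_flight(flight_key),
    airline_key    INTEGER REFERENCES dim_airline(airline_key),
    boarding_count INTEGER DEFAULT 1  -- the event itself is the measurement
);
""")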
Kimball Data Modelling Design Patterns
Kimball Dimensional Modelling Techniques
Transaction fact tables
Periodic snapshot fact tables
Accumulating snapshot fact tables
Slowly Changing Dimensions, Types 1 to 6
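As an illustration of one of these techniques, below is a minimal sketch of a Slowly Changing Dimension Type 2 update, run as plain SQL from Python with sqlite3: when a customer attribute changes, the current row is closed and a new versioned row is inserted, so history is preserved. The table, columns, and data are assumptions for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  TEXT,   -- natural (business) key
    city         TEXT,
    valid_from   TEXT,
    valid_to     TEXT,   -- NULL marks the current version
    is_current   INTEGER
);
INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
VALUES ('C42', 'Sofia', '2020-01-01', NULL, 1);
""")

def scd2_update(conn, customer_id, new_city, change_date):
    # Close the current version instead of overwriting it (overwriting would be Type 1)
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (change_date, customer_id),
    )
    # Insert the new version; point-in-time joins can pick the right row by date range
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )

scd2_update(conn, "C42", "Plovdiv", "2021-06-15")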
Where are we in the agenda?
Data Science Process
Dimensional modelling
Versatile Data Kit
Create our own dimension and facts (exercise)
Create ML model using our fact and dimension tables (exercise)
Pros and Cons of dimensional modelling in ML
Versatile Data Kit
Data lifecycle (Data Journey) and where VDK fits in
Ingest (Data Job) → Transform (Data Job) → Export (Data Job)
[Diagram: Data Sources (3rd-party SaaS products, corporate systems/DBs) → Raw Data (Data Lake) → Data Integration and Transformation (business model) → Insights (BI & Data Science tools) → Data-driven products]
Automate DevOps for Data
Where are we in the agenda?
Data Science Process
Dimensional modelling
Versatile Data Kit
Create our own dimension and facts (exercise)
Create ML model (exercise)
Pros and Cons of dimensional modelling in ML
The Data & ML Journey
[Diagram: Data Sources (product events, corporate systems) → Ingest → Raw Data (Data Lake) → Transform → Data model (dimensional model) → Publish → BI & Data Science tools → Export → Data-driven products; the ML modeling branch goes train & validation data → train → model object]
Meeting business needs with quality and efficiency
Challenges:
• Efficiently processing the data and making it ready for BI and Data Science
• Troubleshooting and debugging data issues
• Quickly enhancing existing analytics
• Transforming raw data into business KPIs
• Productionizing the data analytics
[Diagram: data sources (product telemetry, billing data, NPS/customer success, customer data, support data) → integrate data from diverse data sources → clean & pre-process data → reporting, advanced analytics and Data Science, supported by troubleshoot & debug and deploy & operate]
Speaker notes
In this course we will create our own Data Warehouse and build a star schema using Kimball. Then we will use it in a simple ML model and discuss the benefits and downsides of using warehousing design patterns in ML.
https://www.kimballgroup.com/2008/11/fact-tables
https://www.mighty.digital/blog/data-modeling-techniques-explained
https://www.educba.com/fact-table-vs-dimension-table/
https://www.softwaretestinghelp.com/dimensional-data-model-in-data-warehouse/
https://www.bluegranite.com/blog/dimensional-modeling-in-the-advanced-analytics-age#:~:text=Dimensional%20models%20aren't%20just,they%20also%20benefit%20data%20scientists.
https://towardsdatascience.com/dimensional-modelling-for-customer-churn-9d0148548f04
https://www.astera.com/type/blog/automate-dimensional-modeling-data-warehouse/
https://github.com/chrthomsen/pygrametl/tree/master/docs/examples
Missing here is best practice for data science. DS tools use "observation sets", which blend all variables, item-level (fact) and context (aggregate), onto the same flat tuple set in order to drive independent → dependent variable inference and other analysis.
Helping data scientists do this correctly has no tooling support that I have seen.
Also, capturing the aggregation level as metadata on the resulting columns, so that downstream aggregations of aggregations are done correctly, is completely unaddressed. Best practices are hard because data scientists are not historically code-disciplined. (We run into this a lot during platform and pipeline migrations; AWS to GCP is the moment when the flashlight shines on everything.)
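There is no standard tooling for this yet, but as a sketch of what capturing aggregation level as column metadata might look like, one could tag each column of a pandas DataFrame with its grain and refuse unsafe re-aggregation. Everything below (the attrs convention, the helper name) is a hypothetical illustration, not an existing library feature.

import pandas as pd

# Hypothetical convention: record each column's aggregation level in DataFrame.attrs
df = pd.DataFrame({"customer_id": ["C1", "C2"], "monthly_avg_balance": [120.0, 85.5]})
df.attrs["column_grain"] = {
    "customer_id": "item",               # raw, item-level value
    "monthly_avg_balance": "avg:month",  # already an average at monthly grain
}

def safe_mean(frame, column):
    # Guard against averaging an average: a mean of monthly means is not the mean
    # over the underlying transactions unless every month has the same row count.
    grain = frame.attrs.get("column_grain", {}).get(column, "item")
    if grain.startswith("avg:"):
        raise ValueError(f"{column} is already an average ({grain}); "
                         "re-aggregate from the fact table instead")
    return frame[column].mean()

try:
    safe_mean(df, "monthly_avg_balance")
except ValueError as err:
    print(err)  # the guard fires, as intended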
Okay, so before we continue with understanding the problems, let's see what a typical data science process flow looks like. Data scientists usually start by being asked an interesting question.
https://en.wikipedia.org/wiki/Dimensional_modelling
Dimensional modelling is a technique that uses dimensions and facts to store data in a Data Warehouse efficiently.
http://mis587mozhou.blogspot.com/2014/02/the-four-step-dimensional-design-process.html
Dimensional modeling always uses the concepts of facts (measures) and dimensions (context).
Fact:
Measurements, metrics or facts about a business process.
The facts are the performance metrics that business users are concerned about. These must be appropriately defined in accordance with the declared grain. Usually, facts are numerical data, such as total cost or order quantity.
Dimension:
A companion table to the fact table, containing descriptive attributes used to constrain queries.
The dimensions can typically be identified easily, as they represent the "who, what, where, when, why, and how" associated with the event.
A robust set of dimensions representing all possible descriptions should be identified. The following are some examples:
Date
Customer
Employee
Facility
Performance: The dimension tables in particular are often highly de-normalized. For example, a customer table might store the zip code of the customer, their town, and state. If you have 20 customers in Sofia, then the customer dimension table will store the fact that Sofia is in Bulgaria a total of 20 times.
By denormalizing and simplifying the schema (fewer joins), we obtain better performance and can better predict the performance of our data warehouse. This is especially important in modern data architectures with the adoption of column-oriented storage (where joins are very expensive).
Extensibility: Dimensional modelling is modular by nature; many components can and should be re-used. Data warehouses are built incrementally, avoiding a big-bang approach.
Consistency: The dimensional model is designed to integrate various business processes, regardless of the source. For example, a conformed customer dimension allows finance, engineering, and sales teams to share one common customer reference regardless of the source application.
Ease of understanding: The consistent and fairly clear structure of the database allows even a non-technical end user (an accountant or a marketing analyst) to query the model without wondering whether a relationship is 1-n or n-n, or whether there is a loop in the model, and without needing to know that those could be a problem.
And second, the way the data is queried is generally the same: you join the fact table to the dimensions you need and aggregate some of the metrics.
Most data scientists spend around 80% of their time wrangling, cleaning, and organizing data to obtain a tidy dataset (Wickham, 2014): one observation per row and one variable per column. This type of data structure is extremely easy to obtain from dimensional modeling. A simple join between the fact table and the relevant dimensions, an aggregation of the indicators, and you have a tidy tabular dataset.
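As a minimal sketch of that pattern, the snippet below stands in for a fact table and a dimension with small pandas DataFrames, joins and aggregates them into a tidy observation set, and fits a simple scikit-learn model on it. All table contents, column names, and the churn label are invented for illustration.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-ins for warehouse tables (illustrative data only)
fact_txn = pd.DataFrame({
    "customer_key": [1, 1, 2, 2, 2, 3],
    "amount":       [10.0, 25.0, 5.0, 7.5, 3.0, 90.0],
})
dim_customer = pd.DataFrame({
    "customer_key": [1, 2, 3],
    "tier":         ["gold", "basic", "basic"],
    "churned":      [0, 1, 0],  # hypothetical label
})

# Join the fact to the dimension, then aggregate to one row per customer:
# one observation per row, one variable per column.
tidy = (fact_txn
        .groupby("customer_key", as_index=False)
        .agg(txn_count=("amount", "size"), total_spent=("amount", "sum"))
        .merge(dim_customer, on="customer_key"))

X = pd.get_dummies(tidy[["txn_count", "total_spent", "tier"]])
y = tidy["churned"]
model = LogisticRegression().fit(X, y)
print(model.predict(X))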
Cleaned, organized data ensures that data scientists can focus on actual data science, rather than on engineering tasks.
There are many approaches to data modelling. We focus on Kimball. Ralph Kimball introduced the data warehouse/business intelligence industry to dimensional modelling.
But we should note that there are other commonly mentioned approaches to data modeling. One is known as Inmon data modeling, named after data warehouse pioneer Bill Inmon; it focuses on normalized schemas, instead of Kimball's more denormalized approach.
A third data modeling approach, named Data Vault, was released in the early 2000s; it aims to handle change more gracefully.
https://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/technical-dw-bi-system-architecture/
The Kimball technical system architecture separates the data and processes comprising the DW system into the backroom extract, transformation and load (ETL) environment and the front room presentation area, as illustrated in the following diagram.
https://www.kimballgroup.com/2004/03/differences-of-opinion/
https://www.kimballgroup.com/2004/01/data-warehouse-dining-experience/
Data warehouses should have an area that focuses exclusively on data staging and extract, transform, and load (ETL) activities. A separate layer of the warehouse environment should be optimized for presentation of the data to the business constituencies and application developers.
This division is underscored if you consider the similarities between a data warehouse and a restaurant.
The kitchen of a fine restaurant is a world unto itself. It's where the magic happens. Talented chefs take raw materials and transform them into appetizing, delicious multi-course meals for the restaurant's diners.
The layout must be highly efficient
Quality must be high (delicious food)
Food must also be of high integrity (nobody likes poison)
Procured products must meet quality standards
Given the dangerous surroundings, the kitchen is off-limits to patrons. Likewise, the data warehouse's staging area should be off-limits to business users and reporting/delivery application developers.
The data warehouse’s staging area is very similar to the restaurant’s kitchen. The staging area is where source data is magically transformed into meaningful, presentable information. Like the kitchen, the staging area is designed to ensure throughput. It must transform raw source data into the target model efficiently, minimizing unnecessary movement if possible.
The Dining Room
Food – quality, presentation
Menu – easy to access
Service – prompt, good support
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/four-4-step-design-process/
The answers to these questions are determined by considering the needs of the business along with the realities of the underlying source data during collaborative modeling sessions. Following the business process, grain, dimension, and fact declarations, the design team determines the table and column names, sample domain values, and business rules.
It eases and tackles data ingestion jobs, data transformation jobs, and data publishing jobs, while at the same time allowing data users to benefit from good DevOps and DataOps practices.
Versatile Data Kit supports data jobs written in SQL, Python, or both. It comes with a Data SDK, which is used to develop data jobs locally. VDK provides the main building blocks to ingest from any source and transform data using Python or SQL.
For example, for transformations, VDK provides support for creating Kimball's dimensional model using templates that create facts and dimensions with SQL only. The VDK Data SDK also provides native DB connections.
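For instance, a transformation step might invoke one of these templates. This is a hedged sketch: execute_template is part of the VDK job API, but the template name and argument keys below vary by database plugin and are assumptions here, so check the VDK documentation.

# 20_load_dim_customer.py — hypothetical VDK transformation step
from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    # Load a dimension with a Kimball-style template (name and args assumed)
    job_input.execute_template(
        template_name="scd1",
        template_args={
            "source_schema": "staging",
            "source_view": "vw_customer",
            "target_schema": "dw",
            "target_table": "dim_customer",
        },
    )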
By using the VDK Data SDK, data users can choose to develop their jobs locally only, or use the Versatile Data Kit Control Service, which provides them with a production setup.
The Data SDK comes with data lineage and quality features and is entirely usable on its own.
The VDK Control Service manages the whole data job lifecycle. It allows data users to productionize Versatile Data Kit data jobs by deploying them. The Control Service comes with out-of-the-box deployment, versioning, monitoring, alerting, notifications, and more.
TODO:
Showcase that send_object for ingestion works the same way regardless of the infrastructure.
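A minimal sketch of such an ingestion step follows, based on VDK's documented convention of a run(job_input) entry point per step; the payload and destination table are invented, and parameter names should be checked against the VDK docs.

# 10_ingest_example.py — one step of a VDK data job (hypothetical example)
from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    # send_object_for_ingestion queues a payload for ingestion; VDK routes it to
    # the configured target the same way locally and in production.
    payload = {"customer_id": "C42", "city": "Sofia", "nps_score": 9}
    job_input.send_object_for_ingestion(
        payload=payload,
        destination_table="raw_customer_feedback",  # assumed table name
    )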
We would be very happy if you would like to contribute, raise an issue, file a product request, etc.
We are actively looking for partners who wish to collaborate with us, participate in requirements gathering, discuss common problems, and jointly solve them.
We need a consolidated view of how the service is performing. That view includes information regarding customer count, overall consumption, customer sentiment (e.g. NPS Score), customer onboarding metrics, SLA metrics, etc.