2. About Me
• Associate Director, IQVIA, DW & DS
• More than 10 years in data warehouse technology
• Worked as Developer, Lead, Consultant, and Solutions Architect on DW & MDM projects
• Ex-Microsoft – worked as a Technical Consultant
• Presenter at various technical forums; 2000+ answers on MSDN
• Certifications: Microsoft Certified IT Professional, Amazon Certified Architect
• LinkedIn: https://www.linkedin.com/in/rakeshjayaram
3. Agenda
Solution Models
• Traditional IT vs SaaS
• Why cloud?
• Migrate to cloud
Modern DW Architecture
• Storage
• Prep
• Serve
Data Lakes vs Data Warehouse
• Debunk myths
Demo
• Data Lake
• Databricks
• Synapse
Cloud-optimized DW Solutions
• Compute
• Storage
Q&A
• Thank you!
5. Solution Models
WHY CLOUD?
Elasticity
• Increase/decrease resources on demand
• Auto-scaling options
• Virtually unlimited resource availability
Pay as you go
• No upfront capital cost
• Eliminates dedicated maintenance resources
Secure
• Encrypted at rest and in transit
• Azure AD integration
• ACLs & security groups
• IP address black/white listing
• Virtual network integration
And more
• Fast time to market
• Integration with Azure services
• Support for familiar languages (T-SQL)
• Continuous build & integration
6. Solution Models
HOW TO MOVE TO CLOUD?
• Lift/Shift – SQL Server on Azure VM
• Remodel – Azure SQL Database
• Build for Cloud – Databricks & Synapse
7. Modern DW Architecture
[Architecture diagram]
INGEST: Azure Data Factory – sources: Text, Parquet, JSON, CRM
STORE (Storage): Azure Data Lake Gen1, Azure Blob, Azure Data Lake Gen2
PREP (Compute): Azure Databricks, Azure HDInsight
SERVE (Compute + Storage): Azure SQL Synapse, SQL Server on VM
Consumers: BI + Reporting, Downstream apps, Advanced Analytics
8. Storage
[Same architecture diagram as slide 7, with the STORE stage highlighted]
9. Storage
Blob Storage
Purpose: General-purpose object store for a wide variety of storage scenarios, including big data analytics
Use cases: Any type of text or binary data, such as application back ends, backup data, media storage for streaming, and general-purpose data
Encryption (at rest): Transparent, server-side – with service-managed keys; with customer-managed keys in Azure Key Vault (preview); client-side encryption
Lifecycle management: Yes
Authentication: Access keys
Store type: Object store with flat namespace
Redundancy: Locally redundant (LRS), zone-redundant (ZRS), geo-redundant (GRS)
Limits: 2 PB/account in the US
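A minimal sketch of working with Blob storage from Python, using the azure-storage-blob (v12) SDK; the connection string, container, and blob names are placeholders:

    from azure.storage.blob import BlobServiceClient, StandardBlobTier

    # Access keys (via the connection string) are the auth model listed above.
    service = BlobServiceClient.from_connection_string("<connection-string>")
    container = service.get_container_client("raw-data")

    # Upload a file and land it directly in the Cool tier to cut storage cost.
    with open("sales.parquet", "rb") as data:
        container.upload_blob(
            name="2019/12/sales.parquet",  # flat namespace: '/' is only a naming convention
            data=data,
            standard_blob_tier=StandardBlobTier.Cool,  # Hot / Cool / Archive
            overwrite=True,
        )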
10. Storage
ADLS Gen1
Purpose: Storage optimized for big data analytics workloads
Use cases: Batch, interactive, and streaming analytics, and machine learning data such as log files, IoT data, clickstreams, and large datasets
Encryption (at rest): Transparent, server-side – with service-managed keys; with customer-managed keys in Azure Key Vault (preview)
Lifecycle management: No
Authentication: Azure AD
Store type: Hierarchical file system
Redundancy: Locally redundant (LRS) only
Limits: No support for hot/cool storage tiers; no ZRS/GRS redundancy. Don't use for new projects – upgrade to ADLS Gen2.
11. Storage
ADLS Gen2
Purpose: Storage optimized for big data analytics workloads
Use cases: Batch, interactive, and streaming analytics, and machine learning data such as log files, IoT data, clickstreams, and large datasets
Encryption (at rest): Transparent, server-side – with service-managed keys; with customer-managed keys in Azure Key Vault (preview)
Lifecycle management: Yes (in preview)
Authentication: Azure AD
Store type: Hierarchical file system
Redundancy: Locally redundant (LRS); ZRS & GRS (preview)
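A minimal sketch of two Gen2 capabilities from the table – Azure AD authentication and the hierarchical file system – using the azure-storage-file-datalake SDK; the account and path names are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Azure AD is the auth model for ADLS Gen2 (see table above).
    service = DataLakeServiceClient(
        account_url="https://<account>.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    fs = service.get_file_system_client("datalake")

    # Hierarchical namespace: directories are first-class objects.
    directory = fs.create_directory("raw/sales/2019")

    # POSIX-style ACL set at the directory level.
    directory.set_access_control(acl="user::rwx,group::r-x,other::---")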
13. Modern Data Architecture
[Same architecture diagram as slide 7, with the PREP stage (Compute – Databricks) highlighted]
14. Prep
Azure Databricks
Purpose: Platform for massive-scale data processing
Highlights: Integration with ADLS; Spark & notebooks; auto-scaling & auto-termination
Limits: Steep learning curve for data engineers
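A minimal PySpark sketch of a Databricks prep job – read raw files from ADLS, validate, and write curated Parquet; the abfss paths and the column names are placeholders:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # pre-created on Databricks

    raw = (spark.read
           .option("header", "true")
           .csv("abfss://raw@<account>.dfs.core.windows.net/sales/"))

    # Minimal validation/transformation before the data is served.
    curated = (raw
               .filter(F.col("amount").isNotNull())
               .withColumn("amount", F.col("amount").cast("double")))

    (curated.write
     .mode("overwrite")
     .parquet("abfss://curated@<account>.dfs.core.windows.net/sales/"))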
15. Prep
Azure HDInsight
Highlights: Complete open-source platform for Hadoop clusters on Azure; preferred for Pig/Kafka/Hive, etc.; Hortonworks has merged with Cloudera; no auto-termination or auto-scaling
Power BI Dataflow
Highlights: Self-service BI; good for smaller workloads; primarily for data analysts/business analysts
Azure Data Lake Analytics
Highlights: Query as a service; pay only for the processing you use; U-SQL (~ SQL/C#)
16. Modern Data Architecture
[Same architecture diagram as slide 7, with the SERVE stage (Compute – Synapse) highlighted]
17. Serve
Azure Synapse
Purpose: Petabyte-scale, cloud-based DW
Highlights: Massively parallel processing technology; decoupled storage & compute architecture; pause/resume options for cost efficiency; indexing, distribution, and partitioning options for faster query performance
Limits: Requires an external reference to ADLS to fetch data using PolyBase
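A sketch of the PolyBase pattern the Limits row refers to: Synapse reaches into ADLS through an external data source. The T-SQL is run from Python via pyodbc; the server, credential name, and paths are placeholders, and a database-scoped credential named LakeCredential is assumed to exist:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<server>.database.windows.net;DATABASE=<dw>;UID=<user>;PWD=<pwd>"
    )
    cur = conn.cursor()

    # External data source pointing at the lake (credential assumed to exist).
    cur.execute("""
    CREATE EXTERNAL DATA SOURCE LakeSource
    WITH (TYPE = HADOOP,
          LOCATION = 'abfss://curated@<account>.dfs.core.windows.net',
          CREDENTIAL = LakeCredential);
    """)
    cur.execute("CREATE EXTERNAL FILE FORMAT ParquetFormat WITH (FORMAT_TYPE = PARQUET);")

    # The external table reads the Parquet files in the lake at query time.
    cur.execute("""
    CREATE EXTERNAL TABLE ext.Sales (SaleId INT, Amount FLOAT)
    WITH (LOCATION = '/sales/', DATA_SOURCE = LakeSource, FILE_FORMAT = ParquetFormat);
    """)
    conn.commit()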
18. Data Lake vs Data Warehouse
Data format: Raw data (unstructured/semi-structured) vs. structured, cleansed, processed
Purpose: Any purpose (ML, AI, data warehouse) vs. mostly reporting & BI
Sources: Native raw form (logs, data files) vs. historical & relational form
Volume: PB scale vs. less than a data lake
Ingestion: Stored with minimal validation & transformation vs. must be cleansed, validated, refined
Users: Data engineers, data analysts (advanced) vs. business analysts
Use case: Batch & stream processing vs. batch processing
20. Demo
Production: Power Platform World Tour
Scene: Azure Data Platform demo
Take: Hopefully only 1!
Actors: Azure Data Lake, Azure Databricks, Azure Synapse
Date: 05-12-2019
21. Optimize for Cost & Performance
[Chart: traffic vs. Synapse server capacity (DWU) over time, with shutdown windows when traffic drops]
[Chart: cluster size depending on source volume – Metadata, Workday, CRM, ERP, National Sales, Zip Sales]
• Synapse has a de-coupled architecture
• Shut down compute when the production run cycle is complete
• Choose the right distribution strategy
• Partition fact tables where required
• Choose the right cluster size according to the source volume, to avoid resource under-utilization
• Auto-scale if required
• Auto-terminate on completion
• Use the Synapse shutdown feature
• Use Spark for validations & source-specific jobs, with Small/Medium/Large clusters sized by source volume
22. Optimize for Cost & Performance
Enable an ADLS Gen2 lifecycle management policy
Relative cost of each operation on the same volume X (Hot = $100 baseline):
Store: Hot $100 / Cool ~$55 / Archive ~$5.50
Write: Hot $100 / Cool ~$200 / Archive ~$200
Read: Hot $100 / Cool ~$250 / Archive ~$120,000
Tiering progression: Hot → Cool → Archive
Time : 9:05
Data has become the strategic asset used to transform businesses and uncover new insights.
IDC projects that this explosion of data will result in a 40-zettabyte digital universe by 2020.
To drive the business forward, the enterprise needs to integrate and adapt its enterprise data warehouse so it evolves into <pause> a modern data warehouse.
Solution Models – 10 min
Modern Data Architecture - 5 min
Storage – 5 min
Prep – 5 min
Serve – 5 min
Data Lakes vs Data Warehouse – 5 min
Demo – 15 min
Cloud Optimized Solutions - 5 min
Q&A – 5 min
Section End Time (1/3) – 9 : 10
Traditional IT
The traditional IT data warehouse was designed to be a central repository for all of a company's data. Disparate data from transactional systems and ERP, CRM, and LOB applications is cleansed – that is, extracted, transformed, and loaded (ETL) – into the warehouse within an overall relational schema. The predictable data structure and quality optimized processing for operational reporting, but queries were largely IT-prepared and based on scheduled batch processing.
The traditional data warehouse was built on symmetric multi-processing (SMP) technology. With SMP, adding capacity meant procuring larger, more powerful hardware and forklifting the prior data warehouse onto it, because once the warehouse approached capacity, performance degraded and there was no way to add incremental processor power or keep the caches between processors synchronized.
SaaS
The cloud has quickly become an integral part of many IT organizations. Recent research from cloud solutions provider RightScale shows 93% of businesses using cloud technology.
A recent Forrester study found 47% of organizations increasing their cloud deployments specifically for big data.
This makes sense, because the cloud not only enables cost efficiencies; it gives you the scale to meet demands and SLAs for processing any amount of data, now and in the future.
Section End Time (2/3) – 9 : 10
A defining characteristic of cloud computing is elasticity – the ability to rapidly provision and release resources to match what a workload requires – so that a user pays no more and no less than what they need to for the task at hand. Such just-in-time provisioning can save customers enormous amounts of money when their workloads are intermittent and heavily spiked.
In the modern enterprise, few workloads need such elastic capabilities as badly as data warehousing and big data. Traditionally built on-premises with very expensive hardware and software, most enterprise data warehouse (DW) systems have very low utilization except during peak periods of data loading, transformation, and report generation.
The Microsoft Modern Data Warehouse offers the most comprehensive options to deploy data warehousing and big data directly to the cloud with the elastic scalability of Azure.
Data lake security:
Encrypted at rest (Azure- or client-managed keys)
Encrypted in transit (HTTPS)
Azure Active Directory integration
ACLs & security groups
IP address black/white listing
Virtual network integration
Section End Time (3/3) – 9 : 10
Solution Models
As seen in the diagram, each offering can be characterized by the level of administration you have over the infrastructure, and by the degree of cost efficiency.
Factors that can influence your decision between the different data offerings:
Cost
Lift/Shift (IaaS): you need to invest additional time and resources to manage your database, but you can shut down resources while they are not in use to decrease cost.
Remodel (PaaS): the PaaS version is always running, unless you drop and re-create your resources when they are needed.
Administration
Lift/Shift (IaaS): you administer the database yourself; supports CLR.
Remodel (PaaS): reduces the amount of time you need to invest to administer the database.
SLA
Both IaaS and PaaS provide high, industry-standard SLAs. The PaaS option guarantees a 99.99% SLA, while IaaS guarantees a 99.95% SLA for the infrastructure.
Time to move to Azure
SQL Server on an Azure VM is an exact match for your environment, so migrating from on-premises to an Azure SQL VM is no different from moving databases from one on-premises server to another.
Azure SQL Database Managed Instance also enables extremely easy migration; however, there may be some changes you need to apply before migrating to a managed instance.
Section End Time (1/1) – 9 : 10
Increasing data volume
Real-time performance
Integration of public and business-application data sources (public: Facebook, Twitter, LinkedIn; business applications: CRM, Sales, Supply Chain, Workday)
Capital cost of infrastructure
Ad hoc analytics / client self-service analytics
Section End Time (1/5) – 9 : 20
Azure Data Lake Store is a single repository for building cloud-based data lakes that capture any type of data for high-performance processing, analytics, and low-latency workloads with enterprise-grade security. This lets you store data in a single place and process it with any type of analytics engine, such as Azure HDInsight (Hadoop and Spark), R Server, Hortonworks, Cloudera, and Azure SQL Data Warehouse.
Section End Time (2/5) – 9 : 20
Azure storage offers different access tiers, which allow you to store blob object data in the most cost-effective manner. The available access tiers include:
Hot - Optimized for storing data that is accessed frequently.
Cool - Optimized for storing data that is infrequently accessed and stored for at least 30 days.
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements (on the order of hours).
Lifecycle management policies: you can now set policies to tier or delete data in Data Lake Storage. To learn more, see the documentation "Manage the Azure Blob storage lifecycle."
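A sketch of what such a policy looks like; the rule below (the policy's JSON shape expressed in Python) moves blobs to Cool after 30 days and Archive after 180, matching the tiers described above. The rule name, prefix, and day counts are illustrative assumptions; apply the policy via the Portal, CLI, or an ARM template:

    lifecycle_policy = {
        "rules": [{
            "name": "age-out-raw-data",  # hypothetical rule name
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["raw/"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool":    {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                        "delete":        {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }]
    }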
Section End Time (3/5) – 9 : 20
No client-side encryption.
The hierarchical store performs file operations – rename, move, copy, delete – faster than Blob storage.
No support for zone-redundant (ZRS) or geo-redundant (GRS) storage.
Migrating data via Azure Data Factory is currently the easiest way to do a one-time data migration, as no dedicated migration tool is available yet.
Any files in ADLS Gen1 larger than 5 TB will need to be split into multiple files before migration.
Section End Time (4/5) – 9 : 20
No client-side encryption.
The hierarchical store performs file operations – rename, move, copy, delete – faster than Blob storage.
No support for zone-redundant (ZRS) or geo-redundant (GRS) storage yet.
Query Performance. When sending a query that is only retrieving a subset of data, with a hierarchical file system like ADLS Gen2 it is possible to leverage partition scans for data pruning (predicate pushdown). This can improve query performance dramatically for compute engines that understand how to take advantage of partition scans.
Data Load Performance. Sometimes it is necessary to rename files or relocate them from one directory to another; with a hierarchical namespace these are fast metadata operations rather than full object copies.
Granular Security at the Directory and File Level. The hierarchical file system of ADLS Gen2 (and Gen1) is POSIX-compliant. Access control lists (ACLs) can be defined at the directory and file level, which offers much-needed flexibility for controlling data-level security.
Object storage, such as Azure Blob storage, is known for being highly economical. With respect to direct storage cost, Microsoft has released ADLS Gen2 at the same price as Azure Blob storage (i.e., block blob pricing). You only pay for the storage that you use; there is no concept of reserving a specific size.
However, the transaction costs are somewhat higher for storage accounts which have the hierarchical namespace enabled. Transaction costs are usually measured in batches of 10,000.
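A PySpark sketch of the partition-scan point above: write the data partitioned by a date column, then filter on that column so the engine prunes whole directories instead of scanning everything. The paths and the sale_date column are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Layout on ADLS Gen2 becomes .../sales/sale_date=2019-12-05/part-*.parquet
    df = spark.read.parquet("abfss://raw@<account>.dfs.core.windows.net/sales/")
    (df.write
       .partitionBy("sale_date")
       .mode("overwrite")
       .parquet("abfss://curated@<account>.dfs.core.windows.net/sales/"))

    # Predicate on the partition column -> only matching directories are read.
    dec5 = (spark.read
            .parquet("abfss://curated@<account>.dfs.core.windows.net/sales/")
            .filter("sale_date = '2019-12-05'"))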
Section End Time (5/5) – 9 : 20
Multi-protocol access to the same data, via the Azure Blob storage API and the Azure Data Lake Storage API, lets you leverage existing object-storage capabilities on Data Lake Storage accounts, which are hierarchical-namespace-enabled storage accounts built on top of Blob storage.
Blob storage could only mimic a filesystem directory hierarchy, by adopting naming conventions in which blob names contain slashes (/). This was inefficient because applications had to iterate through potentially millions of individual blobs to achieve directory-level tasks: for example, deleting a directory with several million objects in Blob storage required as many delete operations as there were objects in that directory. In contrast, with ADLS Gen2, deleting a directory is a single operation regardless of the number of files it contains.
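A sketch of those single-operation directory semantics with the azure-storage-file-datalake SDK; the account and paths are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://<account>.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    fs = service.get_file_system_client("datalake")

    # One metadata operation, no matter how many files the directory holds.
    fs.get_directory_client("staging/2019-12-05").rename_directory(
        new_name="datalake/archive/2019-12-05"  # format: "<filesystem>/<new path>"
    )

    # Deleting a directory is likewise a single operation in ADLS Gen2.
    fs.delete_directory("scratch/tmp")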
Section End Time (1/3) – 9 : 25
Section End Time (2/3) – 9 : 25
Data engineers from a SQL background need to learn new languages and frameworks – Spark/Scala, Flink, Beam!
Section End Time (3/3) – 9 : 25
Considering that U-SQL within Azure Data Lake Analytics (ADLA) is not one of the initial services to be supported by the optimized ABFS driver, that says something about where we should be placing our bets. Microsoft has not announced the future roadmap for ADLA, but we are observing that open source technologies such as Spark appeal to a wider customer base vs. proprietary tools and languages.
Section End Time (1/2) – 9 : 30
Section End Time (2/2) – 9 : 30
In the cloud, Azure SQL Data Warehouse leverages the same MPP architecture as the Analytics Platform System, letting you combine the scaling power of that architecture with the elasticity of the cloud. A defining characteristic of cloud computing is elasticity – the ability to rapidly provision and release resources to match what a workload requires – so that a user pays no more and no less than what they need for the task at hand. Such just-in-time provisioning can save customers enormous amounts of money when their workloads are intermittent and heavily spiked.
Azure SQL Data Warehouse is a fully managed DW as a Service that you can provision in minutes and scale up to 60 times larger in seconds. With a few clicks in the Azure Portal, you can launch a data warehouse, and start analyzing or querying data at the scale of hundreds of terabytes. Our architecture separates compute and storage so that you can independently scale them.
A unique pause feature allows you to suspend compute in seconds and resume when needed, while your data remains intact in Azure storage.
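A sketch of pause/resume from Python, assuming the azure-mgmt-sql management SDK (where SQL DW pause/resume surface as long-running operations on the database); all resource names are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.sql import SqlManagementClient

    client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Suspend compute after the run cycle; the data stays intact in storage.
    client.databases.begin_pause("<resource-group>", "<server>", "<dw-name>").wait()

    # Resume in seconds when the next load or reporting window opens.
    client.databases.begin_resume("<resource-group>", "<server>", "<dw-name>").wait()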
Section End Time (1/1) – 9 : 35
Myth: You need a data lake OR a data warehouse
A data lake and a data warehouse serve different purposes. They are not mutually exclusive; in fact, they work in conjunction for optimal results and outcomes.
Myth: Data warehouses are easy to build, while data lakes are difficult
It’s true that data lakes require the specific skills of data engineers and data scientists (or experts with similar skill sets) to sort and make use of the data stored within. The unstructured nature of the data makes it less readily accessible to those without a full understanding of how the data lake works.
However, once data scientists and data engineers build data models or pipelines, business users can often leverage integrations (custom or pre-built) with popular business tools to explore the data. Likewise, most business users access data stored within data warehouses through connected business intelligence (BI) tools like Tableau and Looker. With the help of third-party BI tools, business users should be able to access and analyze data, whether that data is stored in a data warehouse or a data lake.
Section End Time (1/1) – 9 : 35
Demo start – 9 :35
Demo end – 9 : 50
Section End Time (1/2) – 9:55
Run multiple Databricks Spark clusters to meet SLAs if required.
E.g. (Synapse):
Largest source volume: 100 GB
Resource purchased: 100 DWU (sized for the largest source)
Resource utilized – largest source (100 GB): completely utilized
Resource utilized – smallest source (10 MB): under-utilized
E.g. (Databricks):
Largest source volume: 100 GB – resource purchased: large cluster – completely utilized
Smallest source volume: 10 MB – resource purchased: smaller cluster – completely utilized
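A toy sketch of the sizing rule this example illustrates: pick a cluster size per source volume rather than one fixed-size resource for everything. The thresholds and volumes below are illustrative assumptions:

    def cluster_size(volume_gb: float) -> str:
        # Hypothetical cut-offs; tune to your own workloads.
        if volume_gb < 1:
            return "Small"
        if volume_gb < 50:
            return "Medium"
        return "Large"

    sources_gb = {"Metadata": 0.01, "Workday": 1, "CRM": 10,
                  "ERP": 25, "National Sales": 60, "Zip Sales": 100}

    for source, gb in sources_gb.items():
        # Each source gets a right-sized cluster, so small feeds no longer
        # under-utilize a resource purchased for the largest workload.
        print(f"{source}: {gb} GB -> {cluster_size(gb)} cluster")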
Section End Time (2/2) – 9:55
Rehydrate an archived blob to an online tier – rehydrate an archived blob to Hot or Cool by changing its tier with the Set Blob Tier operation.
Copy an archived blob to an online tier – create a new copy of an archived blob with the Copy Blob operation, specifying a different blob name and a destination tier of Hot or Cool.
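A sketch of both rehydration paths with the azure-storage-blob (v12) SDK; the connection string and blob names are placeholders:

    from azure.storage.blob import BlobServiceClient, StandardBlobTier

    service = BlobServiceClient.from_connection_string("<connection-string>")

    # Option 1: Set Blob Tier – rehydrate the archived blob in place.
    blob = service.get_blob_client("raw-data", "2018/12/sales.parquet")
    blob.set_standard_blob_tier(StandardBlobTier.Hot)

    # Option 2: Copy Blob – copy the archived blob to a new name in an online tier.
    dest = service.get_blob_client("raw-data", "rehydrated/sales.parquet")
    dest.start_copy_from_url(blob.url, standard_blob_tier=StandardBlobTier.Hot)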