2. About Me
• Associate Director, IQVIA, DW & DS
• More than 10 years in data warehouse technology
• Worked as Developer, Lead, Consultant, and Solutions Architect on DW & MDM projects
• Ex-Microsoft – worked as a Technical Consultant
• Presenter at various technical forums; 2000+ answers on MSDN
• Certifications: Microsoft Certified IT Professional, Amazon Certified Architect
• LinkedIn: https://www.linkedin.com/in/rakeshjayaram
3. Agenda
Solution Models
• Traditional IT vs SaaS
• Why cloud?
• Migrate to cloud
Modern DW Architecture
• Storage
• Prep
• Serve
Data Lakes vs Data Warehouse
• Debunk myths
Demo
• Data Lake
• Databricks
• Synapse
Cloud-optimized DW Solutions
• Compute
• Storage
Q&A
• Thank you!
5. Solution Models
WHY CLOUD?
Elasticity
• Increase/decrease resources on demand
• Auto-scaling options
• Virtually unlimited resource availability
Pay as you go
• No upfront capital cost
• Eliminates dedicated maintenance resources
Secure
• Encrypted at rest and in transit
• Azure AD integration
• ACLs & security groups
• IP address black/white listing
• Virtual network integration
And more
• Fast time to market
• Integration with Azure services
• Support for familiar languages (T-SQL)
• Continuous build & integration
6. Solution Models
HOW TO MOVE TO CLOUD?
• Lift/Shift – SQL Server on Azure VM
• Remodel – Azure SQL Database
• Build for Cloud – Databricks & Synapse
7. Modern DW Architecture
[Architecture diagram]
INGEST: Azure Data Factory – sources: Text, Parquet, JSON, CRM
STORE (Storage): Azure Data Lake Gen1, Azure Blob, Azure Data Lake Gen2
PREP (Compute): Azure Databricks, Azure HDInsight
SERVE (Compute + Storage): Azure SQL Synapse, SQL Server on VM
Consumers: BI + Reporting, Downstream apps, Advanced Analytics
8. Storage
[Same architecture diagram as slide 7, with the STORE stage highlighted]
9. Storage
Blob Storage
Purpose: General-purpose object store for a wide variety of storage scenarios, including big data analytics
Use cases: Any type of text or binary data, such as application back ends, backup data, media storage for streaming, and general-purpose data
Encryption (at rest): Transparent, server-side – with service-managed keys; with customer-managed keys in Azure Key Vault (preview); client-side encryption
Lifecycle management: Yes
Authentication: Access keys
Store type: Object store with flat namespace
Redundancy: Locally redundant (LRS), zone-redundant (ZRS), geo-redundant (GRS)
Limits: 2 PB/account in the US
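A minimal sketch of working with Blob storage from Python, using the azure-storage-blob (v12) SDK; the connection string, container, and blob names are placeholders:

    from azure.storage.blob import BlobServiceClient, StandardBlobTier

    # Access keys (via the connection string) are the auth model listed above.
    service = BlobServiceClient.from_connection_string("<connection-string>")
    container = service.get_container_client("raw-data")

    # Upload a file and land it directly in the Cool tier to cut storage cost.
    with open("sales.parquet", "rb") as data:
        container.upload_blob(
            name="2019/12/sales.parquet",  # flat namespace: '/' is only a naming convention
            data=data,
            standard_blob_tier=StandardBlobTier.Cool,  # Hot / Cool / Archive
            overwrite=True,
        )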
10. Storage
ADLS Gen1
Purpose: Storage optimized for big data analytics workloads
Use cases: Batch, interactive, and streaming analytics, and machine learning data such as log files, IoT data, clickstreams, and large datasets
Encryption (at rest): Transparent, server-side – with service-managed keys; with customer-managed keys in Azure Key Vault (preview)
Lifecycle management: No
Authentication: Azure AD
Store type: Hierarchical file system
Redundancy: Locally redundant (LRS) only
Limits: No support for hot/cool storage tiers; no ZRS/GRS redundancy. Don't use for new projects – upgrade to ADLS Gen2.
11. Storage
ADLS Gen2
Purpose: Storage optimized for big data analytics workloads
Use cases: Batch, interactive, and streaming analytics, and machine learning data such as log files, IoT data, clickstreams, and large datasets
Encryption (at rest): Transparent, server-side – with service-managed keys; with customer-managed keys in Azure Key Vault (preview)
Lifecycle management: Yes (in preview)
Authentication: Azure AD
Store type: Hierarchical file system
Redundancy: Locally redundant (LRS); ZRS & GRS (preview)
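A minimal sketch of two Gen2 capabilities from the table – Azure AD authentication and the hierarchical file system – using the azure-storage-file-datalake SDK; the account and path names are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Azure AD is the auth model for ADLS Gen2 (see table above).
    service = DataLakeServiceClient(
        account_url="https://<account>.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    fs = service.get_file_system_client("datalake")

    # Hierarchical namespace: directories are first-class objects.
    directory = fs.create_directory("raw/sales/2019")

    # POSIX-style ACL set at the directory level.
    directory.set_access_control(acl="user::rwx,group::r-x,other::---")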
13. Modern Data Architecture
[Same architecture diagram as slide 7, with the PREP stage (Compute – Databricks) highlighted]
14. Prep
Azure Databricks
Purpose: Platform for massive-scale data processing
Highlights: Integration with ADLS; Spark & notebooks; auto-scaling & auto-termination
Limits: Steep learning curve for data engineers
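A minimal PySpark sketch of a Databricks prep job – read raw files from ADLS, validate, and write curated Parquet; the abfss paths and the column names are placeholders:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # pre-created on Databricks

    raw = (spark.read
           .option("header", "true")
           .csv("abfss://raw@<account>.dfs.core.windows.net/sales/"))

    # Minimal validation/transformation before the data is served.
    curated = (raw
               .filter(F.col("amount").isNotNull())
               .withColumn("amount", F.col("amount").cast("double")))

    (curated.write
     .mode("overwrite")
     .parquet("abfss://curated@<account>.dfs.core.windows.net/sales/"))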
15. Prep
Azure HDInsight
Highlights: Complete open-source platform for Hadoop clusters on Azure; preferred for Pig/Kafka/Hive, etc.; Hortonworks has merged with Cloudera; no auto-termination or auto-scaling
Power BI Dataflow
Highlights: Self-service BI; good for smaller workloads; primarily for data analysts/business analysts
Azure Data Lake Analytics
Highlights: Query as a service; pay only for the processing you use; U-SQL (~ SQL/C#)
16. Modern Data Architecture
[Same architecture diagram as slide 7, with the SERVE stage (Compute – Synapse) highlighted]
17. Serve
Azure Synapse
Purpose: Petabyte-scale, cloud-based DW
Highlights: Massively parallel processing technology; decoupled storage & compute architecture; pause/resume options for cost efficiency; indexing, distribution, and partitioning options for faster query performance
Limits: Requires an external reference to ADLS to fetch data using PolyBase
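A sketch of the PolyBase pattern the Limits row refers to: Synapse reaches into ADLS through an external data source. The T-SQL is run from Python via pyodbc; the server, credential name, and paths are placeholders, and a database-scoped credential named LakeCredential is assumed to exist:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<server>.database.windows.net;DATABASE=<dw>;UID=<user>;PWD=<pwd>"
    )
    cur = conn.cursor()

    # External data source pointing at the lake (credential assumed to exist).
    cur.execute("""
    CREATE EXTERNAL DATA SOURCE LakeSource
    WITH (TYPE = HADOOP,
          LOCATION = 'abfss://curated@<account>.dfs.core.windows.net',
          CREDENTIAL = LakeCredential);
    """)
    cur.execute("CREATE EXTERNAL FILE FORMAT ParquetFormat WITH (FORMAT_TYPE = PARQUET);")

    # The external table reads the Parquet files in the lake at query time.
    cur.execute("""
    CREATE EXTERNAL TABLE ext.Sales (SaleId INT, Amount FLOAT)
    WITH (LOCATION = '/sales/', DATA_SOURCE = LakeSource, FILE_FORMAT = ParquetFormat);
    """)
    conn.commit()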
18. Data Lake vs Data Warehouse
Data format: Raw data (unstructured/semi-structured) vs. structured, cleansed, processed
Purpose: Any purpose (ML, AI, data warehouse) vs. mostly reporting & BI
Sources: Native raw form (logs, data files) vs. historical & relational form
Volume: PB scale vs. less than a data lake
Ingestion: Stored with minimal validation & transformation vs. must be cleansed, validated, refined
Users: Data engineers, data analysts (advanced) vs. business analysts
Use case: Batch & stream processing vs. batch processing
20. Demo
Production: Power Platform World Tour
Scene: Azure Data Platform demo
Take: Hopefully only 1!
Actors: Azure Data Lake, Azure Databricks, Azure Synapse
Date: 05-12-2019
21. Optimize for Cost & Performance
[Chart: traffic vs. Synapse server capacity (DWU) over time, with shutdown windows when traffic drops]
[Chart: cluster size depending on source volume – Metadata, Workday, CRM, ERP, National Sales, Zip Sales]
• Synapse has a de-coupled architecture
• Shut down compute when the production run cycle is complete
• Choose the right distribution strategy
• Partition fact tables where required
• Choose the right cluster size according to the source volume, to avoid resource under-utilization
• Auto-scale if required
• Auto-terminate on completion
• Use the Synapse shutdown feature
• Use Spark for validations & source-specific jobs, with Small/Medium/Large clusters sized by source volume
22. Optimize for Cost & Performance
Enable an ADLS Gen2 lifecycle management policy
Relative cost of each operation on the same volume X (Hot = $100 baseline):
Store: Hot $100 / Cool ~$55 / Archive ~$5.50
Write: Hot $100 / Cool ~$200 / Archive ~$200
Read: Hot $100 / Cool ~$250 / Archive ~$120,000
Tiering progression: Hot → Cool → Archive
Time : 9:05
Data has become the strategic asset used to transform businesses and uncover new insights.
IDC projects that this explosion of data will result in a 40-zettabyte digital universe by 2020.
To drive the business forward, the enterprise needs to integrate and adapt its enterprise data warehouse so it evolves into <pause> a modern data warehouse.
Solution Models – 10 min
Modern Data Architecture - 5 min
Storage – 5 min
Prep – 5 min
Serve – 5 min
Data Lakes vs Data Warehouse – 5 min
Demo – 15 min
Cloud Optimized Solutions - 5 min
Q&A – 5 min
Section End Time (1/3) – 9 : 10
Traditional IT
The traditional IT data warehouse was designed to be a central repository for all of a company's data. Disparate data from transactional systems and ERP, CRM, and LOB applications is cleansed – that is, extracted, transformed, and loaded (ETL) – into the warehouse within an overall relational schema. The predictable data structure and quality optimized processing for operational reporting, but queries were largely IT-prepared and based on scheduled batch processing.
The traditional data warehouse was built on symmetric multi-processing (SMP) technology. With SMP, adding capacity meant procuring larger, more powerful hardware and forklifting the prior data warehouse onto it, because once the warehouse approached capacity, performance degraded and there was no way to add incremental processor power or keep the caches between processors synchronized.
SaaS
The cloud has quickly become an integral part of many IT organizations. Recent research from cloud solutions provider RightScale shows 93% of businesses using cloud technology.
A recent Forrester study found 47% of organizations increasing their cloud deployments specifically for big data.
This makes sense, because the cloud not only enables cost efficiencies; it gives you the scale to meet demands and SLAs for processing any amount of data, now and in the future.
Section End Time (2/3) – 9 : 10
A defining characteristic of cloud computing is elasticity – the ability to rapidly provision and release resources to match what a workload requires – so that a user pays no more and no less than what they need to for the task at hand. Such just-in-time provisioning can save customers enormous amounts of money when their workloads are intermittent and heavily spiked.
In the modern enterprise, few workloads need such elastic capabilities as badly as data warehousing and big data. Traditionally built on-premises with very expensive hardware and software, most enterprise data warehouse (DW) systems have very low utilization except during peak periods of data loading, transformation, and report generation.
The Microsoft Modern Data Warehouse offers the most comprehensive options to deploy data warehousing and big data directly to the cloud with the elastic scalability of Azure.
Data lake security:
Encrypted at rest (Azure- or client-managed keys)
Encrypted in transit (HTTPS)
Azure Active Directory integration
ACLs & security groups
IP address black/white listing
Virtual network integration
Section End Time (3/3) – 9 : 10
Solution Models
As seen in the diagram, each offering can be characterized by the level of administration you have over the infrastructure, and by the degree of cost efficiency.
Factors that can influence your decision between the different data offerings:
Cost
Lift/Shift (IaaS): you need to invest additional time and resources to manage your database, but you can shut down resources while they are not in use to decrease cost.
Remodel (PaaS): the PaaS version is always running, unless you drop and re-create your resources when they are needed.
Administration
Lift/Shift (IaaS): you administer the database yourself; supports CLR.
Remodel (PaaS): reduces the amount of time you need to invest to administer the database.
SLA
Both IaaS and PaaS provide high, industry-standard SLAs. The PaaS option guarantees a 99.99% SLA, while IaaS guarantees a 99.95% SLA for the infrastructure.
Time to move to Azure
SQL Server on an Azure VM is an exact match for your environment, so migrating from on-premises to an Azure SQL VM is no different from moving databases from one on-premises server to another.
Azure SQL Database Managed Instance also enables extremely easy migration; however, there may be some changes you need to apply before migrating to a managed instance.
Section End Time (1/1) – 9 : 10
Increasing data volume
Real-time performance
Integration of public and business-application data sources (public: Facebook, Twitter, LinkedIn; business applications: CRM, Sales, Supply Chain, Workday)
Capital cost of infrastructure
Ad hoc analytics / client self-service analytics
Section End Time (1/5) – 9 : 20
Azure Data Lake Store is a single repository for building cloud-based data lakes that capture any type of data for high-performance processing, analytics, and low-latency workloads with enterprise-grade security. This lets you store data in a single place and process it with any type of analytics engine, such as Azure HDInsight (Hadoop and Spark), R Server, Hortonworks, Cloudera, and Azure SQL Data Warehouse.
Section End Time (2/5) – 9 : 20
Azure storage offers different access tiers, which allow you to store blob object data in the most cost-effective manner. The available access tiers include:
Hot - Optimized for storing data that is accessed frequently.
Cool - Optimized for storing data that is infrequently accessed and stored for at least 30 days.
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements (on the order of hours).
Lifecycle management policies: you can now set policies to tier or delete data in Data Lake Storage. To learn more, see the documentation "Manage the Azure Blob storage lifecycle."
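A sketch of what such a policy looks like; the rule below (the policy's JSON shape expressed in Python) moves blobs to Cool after 30 days and Archive after 180, matching the tiers described above. The rule name, prefix, and day counts are illustrative assumptions; apply the policy via the Portal, CLI, or an ARM template:

    lifecycle_policy = {
        "rules": [{
            "name": "age-out-raw-data",  # hypothetical rule name
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["raw/"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool":    {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                        "delete":        {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }]
    }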
Section End Time (3/5) – 9 : 20
No client-side encryption.
The hierarchical store performs file operations – rename, move, copy, delete – faster than Blob storage.
No support for zone-redundant (ZRS) or geo-redundant (GRS) storage.
Migrating data via Azure Data Factory is currently the easiest way to do a one-time data migration, as no dedicated migration tool is available yet.
Any files in ADLS Gen1 larger than 5 TB will need to be split into multiple files before migration.
Section End Time (4/5) – 9 : 20
No client-side encryption.
The hierarchical store performs file operations – rename, move, copy, delete – faster than Blob storage.
No support for zone-redundant (ZRS) or geo-redundant (GRS) storage yet.
Query Performance. When sending a query that is only retrieving a subset of data, with a hierarchical file system like ADLS Gen2 it is possible to leverage partition scans for data pruning (predicate pushdown). This can improve query performance dramatically for compute engines that understand how to take advantage of partition scans.
Data Load Performance. Sometimes it is necessary to rename files or relocate them from one directory to another; with a hierarchical namespace these are fast metadata operations rather than full object copies.
Granular Security at the Directory and File Level. The hierarchical file system of ADLS Gen2 (and Gen1) is POSIX-compliant. Access control lists (ACLs) can be defined at the directory and file level, which offers much-needed flexibility for controlling data-level security.
Object storage, such as Azure Blob storage, is known for being highly economical. With respect to direct storage cost, Microsoft has released ADLS Gen2 at the same price as Azure Blob storage (i.e., block blob pricing). You only pay for the storage that you use; there is no concept of reserving a specific size.
However, the transaction costs are somewhat higher for storage accounts which have the hierarchical namespace enabled. Transaction costs are usually measured in batches of 10,000.
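A PySpark sketch of the partition-scan point above: write the data partitioned by a date column, then filter on that column so the engine prunes whole directories instead of scanning everything. The paths and the sale_date column are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Layout on ADLS Gen2 becomes .../sales/sale_date=2019-12-05/part-*.parquet
    df = spark.read.parquet("abfss://raw@<account>.dfs.core.windows.net/sales/")
    (df.write
       .partitionBy("sale_date")
       .mode("overwrite")
       .parquet("abfss://curated@<account>.dfs.core.windows.net/sales/"))

    # Predicate on the partition column -> only matching directories are read.
    dec5 = (spark.read
            .parquet("abfss://curated@<account>.dfs.core.windows.net/sales/")
            .filter("sale_date = '2019-12-05'"))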
Section End Time (5/5) – 9 : 20
Multi-protocol access to the same data, via the Azure Blob storage API and the Azure Data Lake Storage API, lets you leverage existing object-storage capabilities on Data Lake Storage accounts, which are hierarchical-namespace-enabled storage accounts built on top of Blob storage.
Blob storage could only mimic a filesystem directory hierarchy, by adopting naming conventions in which blob names contain slashes (/). This was inefficient because applications had to iterate through potentially millions of individual blobs to achieve directory-level tasks: for example, deleting a directory with several million objects in Blob storage required as many delete operations as there were objects in that directory. In contrast, with ADLS Gen2, deleting a directory is a single operation regardless of the number of files it contains.
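A sketch of those single-operation directory semantics with the azure-storage-file-datalake SDK; the account and paths are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://<account>.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    fs = service.get_file_system_client("datalake")

    # One metadata operation, no matter how many files the directory holds.
    fs.get_directory_client("staging/2019-12-05").rename_directory(
        new_name="datalake/archive/2019-12-05"  # format: "<filesystem>/<new path>"
    )

    # Deleting a directory is likewise a single operation in ADLS Gen2.
    fs.delete_directory("scratch/tmp")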
Section End Time (1/3) – 9 : 25
Section End Time (2/3) – 9 : 25
Data engineers from a SQL background need to learn new languages and frameworks – Spark/Scala, Flink, Beam!
Section End Time (3/3) – 9 : 25
Considering that U-SQL within Azure Data Lake Analytics (ADLA) is not one of the initial services to be supported by the optimized ABFS driver, that says something about where we should be placing our bets. Microsoft has not announced the future roadmap for ADLA, but we are observing that open source technologies such as Spark appeal to a wider customer base vs. proprietary tools and languages.
Section End Time (1/2) – 9 : 30
Section End Time (2/2) – 9 : 30
In the cloud, Azure SQL Data Warehouse leverages the same MPP architecture as the Analytics Platform System, letting you combine the scaling power of that architecture with the elasticity of the cloud. A defining characteristic of cloud computing is elasticity – the ability to rapidly provision and release resources to match what a workload requires – so that a user pays no more and no less than what they need for the task at hand. Such just-in-time provisioning can save customers enormous amounts of money when their workloads are intermittent and heavily spiked.
Azure SQL Data Warehouse is a fully managed DW as a Service that you can provision in minutes and scale up to 60 times larger in seconds. With a few clicks in the Azure Portal, you can launch a data warehouse, and start analyzing or querying data at the scale of hundreds of terabytes. Our architecture separates compute and storage so that you can independently scale them.
A unique pause feature allows you to suspend compute in seconds and resume when needed, while your data remains intact in Azure storage.
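A sketch of pause/resume from Python, assuming the azure-mgmt-sql management SDK (where SQL DW pause/resume surface as long-running operations on the database); all resource names are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.sql import SqlManagementClient

    client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Suspend compute after the run cycle; the data stays intact in storage.
    client.databases.begin_pause("<resource-group>", "<server>", "<dw-name>").wait()

    # Resume in seconds when the next load or reporting window opens.
    client.databases.begin_resume("<resource-group>", "<server>", "<dw-name>").wait()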
Section End Time (1/1) – 9 : 35
Myth: You need a data lake OR a data warehouse
A data lake and a data warehouse serve different purposes. They are not mutually exclusive; in fact, they work in conjunction for optimal results and outcomes.
Myth: Data warehouses are easy to build, while data lakes are difficult
It’s true that data lakes require the specific skills of data engineers and data scientists (or experts with similar skill sets) to sort and make use of the data stored within. The unstructured nature of the data makes it less readily accessible to those without a full understanding of how the data lake works.
However, once data scientists and data engineers build data models or pipelines, business users can often leverage integrations (custom or pre-built) with popular business tools to explore the data. Likewise, most business users access data stored within data warehouses through connected business intelligence (BI) tools like Tableau and Looker. With the help of third-party BI tools, business users should be able to access and analyze data, whether that data is stored in a data warehouse or a data lake.
Section End Time (1/1) – 9 : 35
Demo start – 9 :35
Demo end – 9 : 50
Section End Time (1/2) – 9:55
Run multiple Databricks Spark clusters to meet SLAs if required.
E.g. (Synapse):
Largest source volume: 100 GB
Resource purchased: 100 DWU (sized for the largest source)
Resource utilized – largest source (100 GB): completely utilized
Resource utilized – smallest source (10 MB): under-utilized
E.g. (Databricks):
Largest source volume: 100 GB – resource purchased: large cluster – completely utilized
Smallest source volume: 10 MB – resource purchased: smaller cluster – completely utilized
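A toy sketch of the sizing rule this example illustrates: pick a cluster size per source volume rather than one fixed-size resource for everything. The thresholds and volumes below are illustrative assumptions:

    def cluster_size(volume_gb: float) -> str:
        # Hypothetical cut-offs; tune to your own workloads.
        if volume_gb < 1:
            return "Small"
        if volume_gb < 50:
            return "Medium"
        return "Large"

    sources_gb = {"Metadata": 0.01, "Workday": 1, "CRM": 10,
                  "ERP": 25, "National Sales": 60, "Zip Sales": 100}

    for source, gb in sources_gb.items():
        # Each source gets a right-sized cluster, so small feeds no longer
        # under-utilize a resource purchased for the largest workload.
        print(f"{source}: {gb} GB -> {cluster_size(gb)} cluster")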
Section End Time (2/2) – 9:55
Rehydrate an archived blob to an online tier – rehydrate an archived blob to Hot or Cool by changing its tier with the Set Blob Tier operation.
Copy an archived blob to an online tier – create a new copy of an archived blob with the Copy Blob operation, specifying a different blob name and a destination tier of Hot or Cool.
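A sketch of both rehydration paths with the azure-storage-blob (v12) SDK; the connection string and blob names are placeholders:

    from azure.storage.blob import BlobServiceClient, StandardBlobTier

    service = BlobServiceClient.from_connection_string("<connection-string>")

    # Option 1: Set Blob Tier – rehydrate the archived blob in place.
    blob = service.get_blob_client("raw-data", "2018/12/sales.parquet")
    blob.set_standard_blob_tier(StandardBlobTier.Hot)

    # Option 2: Copy Blob – copy the archived blob to a new name in an online tier.
    dest = service.get_blob_client("raw-data", "rehydrated/sales.parquet")
    dest.start_copy_from_url(blob.url, standard_blob_tier=StandardBlobTier.Hot)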