From theory to implementation: follow the steps of implementing an end-to-end analytics solution, illustrated with best practices and examples on Azure Data Lake.
During this full training day we share the architecture patterns, tooling, learnings, and tips and tricks for building such services on Azure Data Lake. We take you through anti-patterns and best practices for data loading and organization, give you hands-on time to develop your own U-SQL scripts to process your data, and discuss the pros and cons of files versus tables.
These are the slides presented at the SQLBits 2018 Training Day on Feb 21, 2018.
6. The Data Lake Approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using Hadoop, Spark, R, Azure Data Lake Analytics (ADLA)
(Diagram: data from devices and other sources flows into the lake and out to interactive queries, batch queries, machine learning, the data warehouse and real-time analytics.)
7. Microsoft’s Big Data Journey
We needed to better leverage data and analytics to do more experimentation.
So, we built a Data Lake for Microsoft:
• A data lake for everyone to put their data
• Tools approachable by any developer
• Batch, Interactive, Streaming, ML
By the numbers
• Exabytes of data under management
• 100Ks of Physical Servers
• 100Ks of Batch Jobs, Millions of Interactive Queries
• Huge Streaming Pipelines
• 10K+ Developers running diverse workloads and scenarios
(Timeline 2010–2017: data stored grows as internal workloads onboard — Windows, SMSG, Live, Bing, CRM/Dynamics, Xbox Live, Office365, Malware Protection, Microsoft Stores, Commerce Risk, Skype, LCA, Exchange, Yammer.)
8. Culture Changes
Engineering: How is the system performing? What is the experience my customers are having? How does that correlate to other actions? Is my feature successful?
Marketing: What can we observe from our customers to increase revenues?
Management: How do I drive my business based on the data?
Field: Where are there new opportunities? How can I connect with my customers more deeply?
Support: How does this customer's experience compare with others?
9. (Diagram: ADL Store, exposed through an HDFS-compatible REST API, with multiple analytics engines on top — ADL Analytics (.NET, SQL, Python, R scaled out by U-SQL), Azure Databricks, HDInsight with Hive, and the open-source Apache Hadoop ADL client.)
• Performance at scale
• Optimized for analytics
• Multiple analytics engines
• Single repository for sharing
10. ADL Store — Storage (HDFS-compatible REST API)
• Architected and built for very high throughput at scale for Big Data workloads
• No limits to file size, account size or number of files
• Single repository for sharing
• Cloud-scale distributed filesystem with file/folder ACLs and RBAC
• Encryption-at-rest by default with Azure Key Vault
• Authenticated access with Azure Active Directory integration
• Formal certifications incl. ISO, SOC, PCI, HIPAA
11. ADL Store with commercial Hadoop distributions (HDFS-compatible REST API)
• Open Source Apache® ADL client for commercial and custom Hadoop: Cloudera CDH, Hortonworks HDP, Qubole QDS
• Cloud IaaS and hybrid
12. Azure Databricks — a fast, easy, and collaborative Apache Spark-based analytics platform
Best of Databricks, best of Microsoft:
• Designed in collaboration with the founders of Apache Spark
• One-click set up; streamlined workflows
• Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts
• Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
• Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
13. HDInsight (HDFS-compatible REST API on ADL Store)
• Fully managed Hadoop, Hive, Spark and R
• Clusters deployed in minutes
• SLA-managed, monitored and supported by Microsoft
• 63% lower TCO than on-premises*
*IDC study "The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight"
14. ADL Analytics — .NET, SQL, Python, R scaled out by U-SQL (HDFS-compatible REST API on ADL Store)
• Serverless. Pay per job. Starts in seconds. Scales instantly.
• Develop massively parallel programs with simplicity
• Federated query from multiple data sources
18. Ingress
• Event Hubs
• IoT Hub
• Kafka
Analytics
• Stream Analytics
• Spark Streaming
• Storm
Sinks
• Data Lake Store
• Blob Store
• SQL Database
• SQL Data Warehouse
• Event Hub
• Power BI
• Table Storage
• Service Bus Queues
• Service Bus Topics
• Cosmos DB
• Azure Functions
• …..
23. • Copy
• SDK
• Tools (Storage Explorer, Visual Studio, 3rd party)
• Data Factory
• SQL Server Integration Services (SSIS)
• Streaming from external sources
• Generated by cloud analytics
26. U-SQL — a framework for Big Data
Scales out your custom code in .NET, Python, R over your Data Lake
Familiar syntax to millions of SQL & .NET developers
Unifies
• The declarative nature of SQL with the imperative power of your language of choice (e.g., C#, Python)
• Processing of structured, semi-structured and unstructured data
• Querying multiple Azure data sources (federated query)
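A minimal sketch of that combination, reusing the SearchLog sample from the next slide (the bucketing threshold and output path are illustrative): the rowset is expressed declaratively while the projections are ordinary C# expressions.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int,
            Urls string,
            ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Declarative SQL shape; the projected values are plain C# expressions.
@normalized =
    SELECT UserId,
           Query.ToUpperInvariant() AS NormalizedQuery,
           (Duration > 1000 ? "slow" : "fast") AS Bucket   // illustrative threshold
    FROM @searchlog;

OUTPUT @normalized
TO "/Samples/Output/NormalizedQueries.tsv"
USING Outputters.Tsv();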
27. Develop massively parallel programs with simplicity
A simple U-SQL script can scale from gigabytes to petabytes without learning complex big data programming techniques.
U-SQL automatically generates a scaled-out and optimized execution plan to handle any amount of data.
Execution nodes are allocated immediately to run the program.
Error handling, network issues, and runtime optimization are handled automatically.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int,
            Urls string,
            ClickedUrls string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
TO @"/Samples/Output/SearchLog_output.tsv"
USING Outputters.Tsv();
29. • Automatic "in-lining", optimized out-of-the-box
• Per-job parallelization
• Visibility into execution
• Heatmap to identify bottlenecks
30. "Unstructured" Files
• Schema on read
• Write to file
• Built-in and custom Extractors and Outputters
• ADL Storage and Azure Blob Storage
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in Extractors: Csv, Tsv, Text with lots of options, Parquet
• Custom Extractors: e.g., JSON, XML, etc. (see http://usql.io)
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text, Parquet
• Custom Outputters: e.g., JSON, XML, etc. (see http://usql.io)
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
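A hedged example of the custom-extractor route: this assumes the U-SQL sample format library from http://usql.io (Microsoft.Analytics.Samples.Formats plus its Newtonsoft.Json dependency) has already been registered as assemblies in the target database; the file path and column names are illustrative.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

@items =
    EXTRACT id string,
            name string
    FROM "/Samples/Data/items.json"        // illustrative path
    USING new JsonExtractor();             // from the usql.io sample library

OUTPUT @items
TO "/Samples/Output/items.csv"
USING Outputters.Csv();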
31. File Sets
• Simple patterns
• Virtual columns
• GA on EXTRACT for now; OUTPUT is in private preview
Simple pattern language on filename and path
DECLARE @pattern string =
    "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns: date and suffix
• Wildcards the filename
• Limits on the number of files and file sizes can be improved with
SET @@FeaturePreviews = "FileSetV2Dot5:on,InputFileGrouping:on,AsyncCompilerStoreAccess:on";
(Will become the default between now and the middle of the year)
Virtual columns
EXTRACT name string
, suffix string // virtual column
, date DateTime // virtual column
FROM @pattern
USING Extractors.Csv();
• Refer to virtual columns in predicates to get partition elimination
• Warning gets raised if no partition elimination was found
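A minimal sketch of such a predicate, using the pattern above: the date range is expressed through constant-foldable variables so the compiler can eliminate non-matching folders.
DECLARE @pattern string =
    "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";

DECLARE @start DateTime = new DateTime(2018, 1, 1);
DECLARE @end   DateTime = new DateTime(2018, 2, 1);

@rows =
    EXTRACT name string,        // regular column from the file contents
            suffix string,      // virtual column bound from the file name
            date DateTime       // virtual column bound from the folder path
    FROM @pattern
    USING Extractors.Csv();

// Constant-foldable predicate on the virtual column: only the matching
// yyyy/MM/dd folders are read (partition elimination over the file set).
@january =
    SELECT name, suffix, date
    FROM @rows
    WHERE date >= @start AND date < @end;

OUTPUT @january
TO "/output/january.csv"        // illustrative path
USING Outputters.Csv();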
36. (Diagram comparing aggregation strategies: over raw file extents — EXTENT 1–3, each containing a mix of CNN, FB and WH rows — every node must Read, Partition and Partial Agg, and an expensive repartition is needed before the Full Agg and Write. With a U-SQL table distributed by Domain, each extent holds a single domain (FB, WH, CNN), so each node can Read, Full Agg and Write locally without the expensive repartition.)
37. ADLA Account/Catalog
(Diagram: an ADLA account/catalog contains [1,n] databases; each database contains [1,n] schemas; each schema contains [0,n] user objects — tables (with clustered index and partitions), views, TVFs, procedures, table types, statistics, packages, credentials, external tables and data sources — plus C# assemblies, which implement and name C# functions, UDAggs, UDTs, extractors, outputters, processors, reducers, combiners and appliers. The legend distinguishes metadata names from C# names and the relationships contains / refers to / implemented and named by.)
38. U-SQL Catalog
• Naming
• Discovery
• Sharing
• Securing
Naming
• Default Database and Schema context: master.dbo
• Quote identifiers with []: [my table]
• Stores data in ADL Storage /catalog folder
Discovery
• Visual Studio Server Explorer
• Azure Data Lake Analytics Portal
• SDKs and Azure PowerShell commands
• Catalog Views: usql.databases, usql.tables etc.
Sharing
• Within an Azure Data Lake Analytics account
• Across ADLA accounts that share same Azure Active Directory:
• Referencing Assemblies
• Calling TVFs, Procedures and referencing tables and views
• Inserting into tables
Securing
• Secured with AAD principals at catalog and Database level
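A minimal sketch of the naming rules above, with an illustrative database name: it creates a database, switches the context away from master.dbo, and uses a quoted identifier.
CREATE DATABASE IF NOT EXISTS MyAnalyticsDB;   // illustrative name
USE DATABASE MyAnalyticsDB;
USE SCHEMA dbo;

// Quoted identifier, as described above; stored under the ADL Storage /catalog folder.
CREATE TABLE IF NOT EXISTS [daily stats]
(
    EventDate DateTime,
    Hits long,
    INDEX idx CLUSTERED (EventDate ASC) DISTRIBUTED BY HASH (EventDate)
);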
39. CREATE TABLE T (col1 int
                  , col2 string
                  , col3 SQL.MAP<string,string>
                  , INDEX idx CLUSTERED (col2 ASC)
                    PARTITIONED BY (col1)
                    DISTRIBUTED BY HASH (col2)
                   );
• Structured Data, built-in Data types only (no UDTs)
• Clustered Index (needs to be specified): row-oriented
• Fine-grained distribution (needs to be specified):
• HASH, DIRECT HASH, RANGE, ROUND ROBIN
• Addressable Partitions (optional)
CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;
CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
• Infer the schema from the query
• Still requires index and distribution (does not support partitioning)
40. Data Partitioning — Tables
Distribution scheme — when to use?
• HASH(keys): automatic hash for fast item lookup
• DIRECT HASH(id): exact control of the hash bucket value
• RANGE(keys): keeps ranges together
• ROUND ROBIN: to get equal distribution (if the others give skew) — see the sketch below
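A minimal sketch of the last option, with illustrative table and column names: if hashing on the natural key (here a domain column) produces skewed buckets, ROUND ROBIN spreads the rows evenly at the cost of hash locality.
CREATE TABLE IF NOT EXISTS dbo.ClickLog      // illustrative table
(
    Domain string,
    Url string,
    Clicks long,
    INDEX idx CLUSTERED (Domain ASC)
    DISTRIBUTED BY ROUND ROBIN               // even buckets where HASH(Domain) would skew
);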
41. Partitions, Distributions and Clusters
TABLE T (id …, C …, date DateTime, …, INDEX i CLUSTERED (id, C) PARTITIONED BY (date) DISTRIBUTED BY HASH(id) INTO 4)
(Diagram — logical view: one partition per date value (PARTITION (@date1), (@date2), (@date3)); within each partition the rows are spread over hash distributions 1–4 and clustered on (id, C) into clusters C1, C2, …. Physical view: each partition is stored as its own file under /catalog/…/tables/Guid(T)/ — Guid(T.p1).ss, Guid(T.p2).ss, Guid(T.p3).ss.)
42. Benefits of table clustering and distribution
• Faster lookup of data provided by distribution and clustering when the right distribution/cluster is chosen
• Data distribution provides better localized scale-out
• Used for filters, joins and grouping
Benefits of table partitioning
• Provides data life cycle management ("expire" old partitions): partition on a date/time dimension
• Partial re-computation of data at partition level
• Query predicates can provide partition elimination
Do not use when…
• No filters, joins and grouping
• No reuse of the data for future queries
If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test — see the sketch below.
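A minimal sketch of that sampling step, following the SAMPLE ANY form mentioned on the slide (the file path and column names are illustrative).
@data =
    EXTRACT vehicle_id int,
            event_date DateTime,
            lat float,
            lon float
    FROM "/Samples/Data/vehicles.csv"        // illustrative path
    USING Extractors.Csv();

// Take an arbitrary sample and test candidate distribution/cluster keys on it
// before committing to a table design.
@sample =
    SELECT *
    FROM @data
    SAMPLE ANY (1000);

OUTPUT @sample
TO "/Samples/Output/design_sample.csv"
USING Outputters.Csv();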
43. Benefits of Distribution in Tables
• Design for the most frequent/costly queries
• Manage data skew in a partition/table
• Manage parallelism in querying (by the number of distributions)
• Minimize data movement in joins
• Provide distribution seeks and range scans for query predicates (distribution bucket elimination)
Distribution in tables is mandatory; choose according to the desired benefits.
44. Benefits of Clustered Index in Distribution
• Design for the most frequent/costly queries
• Manage data skew in a distribution bucket
• Provide locality of same data values
• Provide seeks and range scans for query predicates (index lookup)
A clustered index in tables is mandatory; choose according to the desired benefits.
Pro tip:
Distribution keys should be a prefix of the clustered index keys, especially for RANGE distribution, so the optimizer can make use of the global ordering:
If you make the RANGE distribution key a prefix of the index key, U-SQL will repartition on demand to align any UNION ALLed or JOINed tables or partitions.
Split points of table distribution partitions are chosen independently, so any partitioned table can do UNION ALL in this manner if the data is to be processed subsequently on the distribution key.
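A minimal sketch of the pro tip, with illustrative table and column names: the RANGE distribution key (EventDate) is a prefix of the clustered index key (EventDate, UserId).
CREATE TABLE IF NOT EXISTS dbo.Events        // illustrative table
(
    EventDate DateTime,
    UserId int,
    Payload string,
    INDEX idx CLUSTERED (EventDate ASC, UserId ASC)   // index key: (EventDate, UserId)
    DISTRIBUTED BY RANGE (EventDate)                   // distribution key is a prefix of it
);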
45. Benefits of Partitioned Tables
• Partitions are addressable
• Enables finer-grained data lifecycle management at partition level (see the sketch after this list)
• Manage parallelism in querying by the number of partitions
• Query predicates provide partition elimination (the predicate has to be constant-foldable)
Use partitioned tables for
• Managing large amounts of incrementally growing structured data
• Queries with strong locality predicates (point in time, for a specific market, etc.)
• Managing windows of data (provide data for the last x months for processing)
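A minimal sketch of the lifecycle step called out above, assuming the vehiclesP table from the next slide and the ALTER TABLE … DROP PARTITION form; the date value is illustrative.
// "Expire" a partition whose window has passed; its data is removed with it.
DECLARE @expired DateTime = new DateTime(2014, 9, 14, 0, 0, 0, 0, DateTimeKind.Utc);

ALTER TABLE vehiclesP DROP PARTITION (@expired);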
46. Partitioned tables
Use partitioned tables for querying parts of large amounts of incrementally growing structured data.
Get partition elimination optimizations with the right query predicates.
Creating a partitioned table
CREATE TABLE vehiclesP(vehicle_id int, event_date DateTime, lat float, long float
                      , INDEX idx CLUSTERED (vehicle_id ASC)
                        PARTITIONED BY(event_date) DISTRIBUTED BY HASH (vehicle_id) INTO 4);
Creating partitions
DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 00,00,00,00,DateTimeKind.Utc);
DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00,00,00,00,DateTimeKind.Utc);
ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@pdate2);
Loading data into partitions dynamically
DECLARE @date1 DateTime = DateTime.Parse("2014-09-14");
DECLARE @date2 DateTime = DateTime.Parse("2014-09-16");
INSERT INTO vehiclesP ON INTEGRITY VIOLATION IGNORE
SELECT vehicle_id, event_date, lat, long FROM @data
WHERE event_date >= @date1 AND event_date <= @date2;
• Filters and inserts clean data only, ignore “dirty” data
Loading data into partitions statically
DECLARE @baddate DateTime = new DateTime(1900, 1, 1, 0, 0, 0, 0, DateTimeKind.Utc); // illustrative catch-all partition key
ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@baddate);
INSERT INTO vehiclesP ON INTEGRITY VIOLATION MOVE TO PARTITION (@baddate)
SELECT vehicle_id, event_date, lat, long FROM @data
WHERE event_date >= @date1 AND event_date <= @date2;
• Filters and inserts clean data only; puts "dirty" data into a special partition
47. What is table fragmentation?
• ADLS is an append-only store!
• Every INSERT statement creates a new file (INSERT fragment)
Why is it bad?
• Every INSERT fragment contains data in its own distribution buckets, so query processing loses the ability to get "localized" fast access
• Query generation has to read from many files, leading to a slow preparation phase that may time out
• Reading from too many files is disallowed — current limit: 3,000 table partitions and INSERT fragments per job!
What if I have to add data incrementally?
• Batch inserts into the table
• Use ALTER TABLE REBUILD / ALTER TABLE REBUILD PARTITION regularly to reduce fragmentation and keep performance (see the sketch below)
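A minimal sketch of that maintenance step, using the vehiclesP table from the earlier slide (the partition date is illustrative).
// Compact all INSERT fragments back into the table's distribution layout.
ALTER TABLE vehiclesP REBUILD;

// Or rebuild only the partitions that received incremental inserts.
DECLARE @pdate DateTime = new DateTime(2014, 9, 14, 0, 0, 0, 0, DateTimeKind.Utc);
ALTER TABLE vehiclesP REBUILD PARTITION (@pdate);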
53. High-level Roadmap
• Worldwide region availability (currently US and EU)
• Interactive access with T-SQL query
• Scale out your custom code in the language of choice (.NET, Java, Python, etc.)
• Process the data formats of your choice (incl. Parquet, ORC; larger string values)
• Continued ADF, AAS, ADC, SQL DW, Event Hub, SSIS integration
• Administrative policies to control usage/cost for storage & compute
• Secure data sharing between common AAD and public read-only sharing, fine-grained ACLing
• Intense focus on developer productivity for authoring, debugging, and optimization
• General customer feedback: http://aka.ms/adlfeedback