3. Free online webinar
events
Free 1-day local
training events
Local user groups
around the world
Online special
interest user groups
Business analytics
training
Free Online Resources
PASS Blog
White Papers
Session Recordings
Newsletter www.pass.org
Explore everything PASS has to offer
PASS Connector
BA Insights
Get involved
4. Session evaluations
Download the GuideBook App
and search: PASS Summit 2017
Follow the QR code link
displayed on session signage
throughout the conference
venue and in the program guide
Your feedback is important and valuable.
Go to passSummit.com
Submit by 5pm Friday, November 10th to win prizes. 3 Ways to Access:
5. Ike Ellis
Partner, Crafting Bytes
Microsoft MVP
Since 2011
Frequent PASS Summit Speaker
And speak often at SQL Saturdays + maker
of youtube videos – Ike’s tips!
Author of Developing Azure
Solutions
Now in 2nd Edition
/ikeellis @ike_ellis ellisike
7. What is a data lake as a data architecture
term?
“If you think of a datamart as a store of bottled water
– cleansed, packaged, and structured for easy
consumption – the data lake is a large body of water
in a more natural state.” – James Dixon – creator of
the term
8. Data Lake Attributes
Contains unstructured,
structured, semi-structured
and multi-structured data.
Data lakes have been
traditionally Azure Blob
Storage, Amazon S3, and
Hadoop HDFS.
Data is transformed in place
with no movement.
10. Azure Data Lake Store
Cloud based file system that is unlimited in size
Compatible with Hadoop/Spark
• Hive
• Pig
• Sqoop
11.
12. Azure Data Lake Store
• Unlimited Storage – a single
petabyte file
• Tuned for batch jobs
• Use active directory for security
• Store all data in native format
• Uses ADL:// and a URI
• Exposes WebHDFS API
13. Import data
• Use the Azure Portal
• .NET SDK
• Data Factory
• DistCp (Hadoop
Distributed Copy)
• Apache Sqoop
14. Azure Data Lake Store vs Azure Blob
Storage
• ADLS is optimized for analytics
• Blob Storage holds tons of data inappropriate for analytics like VHDs
• ADLS has folders, Blob Storage has containers
• ADLS uses WebHDFS Rest API, Azure Blob Storage uses the Azure
SDK Rest APIs
• ADLS has no size limits
• ADLS has no geo-redundancy yet
• ADLS is more expensive
• Azure Blob Storage can be better for analytics if there are a lot of
small files
16. Azure Data Lake Analytics
Service that queries data for analysis
ADLA runs jobs for:
querying
cleaning
aggregating
transforming
ADLA uses U-SQL as a language
18. U-SQL combines SQL and C#
SQL is a declarative language
• Don’t need to write all the steps
• Just write what data you want and let the optimizer get the data for
you
• Multi-threading is done for you in the background
C# is a procedural language that is often a better fit
• Custom Functions
• Multi-threading is very difficult
U-SQL unifies both of these languages
• Highly customizable and extensible way to interact with data
19. U-SQL – Start with a simple script
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
OUTPUT @hitters
TO "/output/hitters.txt"
USING Outputters.Text();
19
20. Let’s run this script?
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballS tats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = select * from @hitters
where HomeRuns != null;
OUTPUT @bestHitters
TO "/output/hitters.txt"
USING Outputters.Text();
20
21. OK, here’s the good scripts
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = SELECT * FROM @hitters
WHERE HomeRuns != null AND HomeRuns != 0;
OUTPUT @bestHitters
TO "/output/besthitters.txt"
USING Outputters.Text();
21
22. We can do GROUP BY@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal
FROM @hitters
WHERE HomeRuns != null AND HomeRuns != 0
GROUP BY Team;
OUTPUT @bestHitters
TO "/output/besthitters.txt"
USING Outputters.Text();
22
23. JOINS, TOO! Does this script work?@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters
WHERE HomeRuns != null AND HomeRuns != 0
GROUP BY Team;
@bestHitterWithName = SELECT PlayerName, Team, HomeRuns
FROM @hitters h
JOIN @bestHitters bh
ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal
OUTPUT @bestHitterWithName
TO "/output/besthitterswithnames.txt" 23
24. Nope! Need AS for alias, ambiguious
column names, INNER for JOINS@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OneBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters
WHERE HomeRuns != null AND HomeRuns != 0
GROUP BY Team;
@bestHitterWithName = SELECT h.PlayerName, h.Team, h.HomeRuns
FROM @hitters AS h
INNER JOIN @bestHitters AS bh
ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal;
OUTPUT @bestHitterWithName
TO "/output/besthitterswithnames.txt" 24
25. You can process every file in a folder like
this
@log = EXTRACT date string,
time string,
client_ip string,
username string,
server_ip string,
port int,
method string,
stem string,
query string,
status string,
server_bytes int,
client_bytes int,
time_taken int,
user_agent string,
referrer string
FROM "/iislogs/{*}.txt" USING Extractors.Text(' ', silent:true);
@dailysummary = SELECT date, COUNT(*) AS hits, SUM(server_bytes) AS bytes_sent, SUM(client_bytes) AS bytes_received
FROM @log
GROUP BY date;
25
27. Typical U-SQL Job
1) Extract
2) Assign results to variable
3) Create a schema on read
4) Use schema when we process it
5) Filter data that we need
6) Apply the results to a variable
7) Output the variable someplace, like a CSV file
38. DEMO U-SQL Custom Assemblies
1. Create a class library
2. Name project Data Utilities
3. Implement class
4. View the Cloud Explorer pane
5. In the Solution Explorer pane, right-click the DataUtilities project and
click Register Assembly.
6. select the webdata database.
7. In the Azure portal, on the blade for your Azure Data Lake Analytics
account, click Data Explorer; and then browse to the Assemblies
folder in your webdata database to verify that the assembly has been
registered.
8. Use in a query
39. U-SQL Job Architecture
Each job step is divided into vertices.
Each vertex represents a single piece of work
Each unit (the slider bar at the top) is functionally similar
to a physical node.
Each available unit is going to grab a vertex and work
through it until the task is completed.
Then it will move on to the next task.
42. Add ADLS data for Hive in HDInsight
• Connect ADLS with HDInsight
• Provide a security principal
• Query WebADFS with the ADL:// prefix
42
43. ALL DONE!
Ike Ellis
@ike_ellis
Crafting Bytes
We’re hiring a Data Experts!
Microsoft MVP
Chairperson of the San Diego TIG
Book co-author – Developing Azure Solutions
Upcoming course on Azure Data Lake
www.craftingbytes.com
www.ikeellis.com
Notes de l'éditeur
CAPITALIZE THE KEYWORDS!
CREATE DATABASE BaseballStats;
USE DATABASE BaseballStats;
CREATE SCHEMA stats;
CREATE TABLE stats.BestHomeRunHitterPerTeam
( Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OnBasePer decimal?
, INDEX idx_Rank CLUSTERED(Rank ASC) DISTRIBUTED BY HASH(Rank));
@hitters =
EXTRACT Rank int
, PlayerName string
, Age int
, Team string
, League string
, Games int
, HomeRuns int?
, BattingAvg decimal?
, OnBasePer decimal?
FROM "/BaseballStats/Baseball.csv"
USING Extractors.Csv(encoding: Encoding.[ASCII], skipFirstNRows: 1);
@bestHitters = SELECT Team, MAX(HomeRuns) AS HomeRunTotal FROM @hitters
WHERE HomeRuns != null AND HomeRuns != 0
GROUP BY Team;
@bestHitterWithName = SELECT h.Rank, h.PlayerName, h.Age, h.Team, h.League, h.Games, h.HomeRuns, h.BattingAvg, h.OnBasePer
FROM @hitters AS h
INNER JOIN @bestHitters AS bh
ON h.Team == bh.Team AND h.HomeRuns == bh.HomeRunTotal;
INSERT INTO stats.BestHomeRunHitterPerTeam
SELECT * FROM @bestHitterWithName;
@results = SELECT * FROM stats.BestHomeRunHitterPerTeam;
OUTPUT @results TO "/output/BestHRHitters.csv"
USING Outputters.Csv();
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
namespace USQLApplication1
{
public class IkeTools
{
public static string FindGreatTeam(string teamName)
{
string temp = teamName;
if (teamName == "SDP")
{
temp = "The Great San Diego Padres!!!";
}
return temp;
}
}
}
@results =
SELECT hpt.PlayerName,
hpt.HomeRuns,
USQLApplication1.IkeTools.FindGreatTeam(hpt.Team) AS Team
FROM stats.BestHomeRunHitterPerTeam AS hpt;
OUTPUT @results
TO "/output/BestHRHitters.csv"
USING Outputters.Csv();
// Auto-generated header code
// Generated Code Behind Header
USE DATABASE [BaseballStats];
REFERENCE ASSEMBLY IkeADLTools;
@results =
SELECT hpt.PlayerName,
hpt.HomeRuns,
IkeADLTools.IkeTools.FindGreatTeam(hpt.Team) AS Team
FROM stats.BestHomeRunHitterPerTeam AS hpt;
OUTPUT @results
TO "/output/BestHRHitters.csv"
USING Outputters.Csv();
Add file to folder
Create tables
set hive.execution.engine=tez;
DROP TABLE BaseBallStats;
CREATE TABLE BaseBallStats
( Team string
, HomeRuns string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION 'adl://adlaikeadls.azuredatalakestore.net/passdemo’;
Query Table in Ambari
SELECT * FROM BaseBallStats;