Introduction to Azure Data Lake and U-SQL, presented at the Seattle Scalability Meetup, January 2016. Demo code is available at https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Please sign up for the preview at http://www.azure.com/datalake. Install Visual Studio Community Edition and the Azure Data Lake Tools (http://aka.ms/adltoolvs) to use U-SQL locally for free.
3. ADLA complements HDInsight
Target the same scenarios, tools, and customers
HDInsight
For developers familiar with open source: Java, Eclipse, Hive, etc.
Clusters offer customization, control, and flexibility in a managed Hadoop cluster
ADLA
Enables customers to leverage existing experience with C#, SQL & PowerShell
Offers convenience, efficiency, automatic scale, and management in a “job service” form factor
6. Azure Data Lake Analytics
All data
Productivity from day one
Easy and powerful data preparation
Limitless scale
Enterprise-grade
7. Azure Data Lake Analytics Service
A new distributed analytics service
Built on Apache YARN
Scales dynamically with the turn of a dial
Pay by the query
Supports Azure AD for access control, roles, and integration with on-prem identity systems
Built with U-SQL to unify the benefits of SQL with the power of C#
Processes data across Azure
8. Work across all cloud data
Azure Data Lake Analytics
Azure SQL DW
Azure SQL DB
Azure Storage Blobs
Azure Data Lake Store
SQL DB in an Azure VM
13. Hard to work with anything other than structured data
Difficult to extend with custom code
14. User often has to care about scale and performance
SQL is second class, embedded within strings
Often no code reuse/sharing across queries
15. Get benefits of both!
Makes it easy for you by unifying:
• Unstructured and structured data processing
• Declarative SQL and custom imperative Code
• Local and remote Queries
• Increase productivity and agility from Day 1 and at Day 100 for YOU!
19. U-SQL Language Philosophy
Declarative Query and Transformation Language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions
• Optimizable, Scalable
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Operates on Unstructured & Structured Data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined Aggregators (C#)
• User-defined Operators (UDO) (C#)
U-SQL provides the Parallelization and Scale-out Framework for user code
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
Federated query across distributed data sources
REFERENCE MyDB.MyAssembly;

// Target table for the per-customer summary.
CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int
              , order_amount float );

// Schema-on-read extraction from files.
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

// Join, filter with C# expressions, and aggregate per customer.
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , AGG<MyAgg.MySum>(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
           && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

// Write with a custom outputter, then load the table.
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;
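To make the data flow of the script above concrete, here is a minimal sketch in Python (not U-SQL) of schema-on-read extraction followed by a per-customer summary. Everything in it is an assumption for illustration: the sample rows, the helper names extract and summarize, and the use of a plain sum in place of the custom MyAgg.MySum aggregator. The MyNamespace.MyFunction filter is omitted, and cid is assumed to be unique per customer.

```python
import csv
import io
from datetime import datetime


def extract(text, schema):
    """Apply a schema to raw CSV text at read time (schema on read)."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, record)})
    return rows


def parse_date(value):
    return datetime.strptime(value, "%Y-%m-%d")


# Hypothetical sample data standing in for the two input files.
ORDERS = "1,10,2016-01-05,20.5\n2,10,2016-01-09,4.5\n3,11,2016-01-07,7.0\n"
CUSTOMERS = "10,Ann,New York\n11,Bob,Seattle\n12,Cy,Newark\n"

orders = extract(ORDERS, [("oid", int), ("cid", int),
                          ("odate", parse_date), ("amount", float)])
customers = extract(CUSTOMERS, [("cid", int), ("name", str), ("city", str)])


def summarize(customers, orders):
    """Left outer join customers to orders, keep cities starting with
    "New", and produce one summary row per customer (cid assumed unique)."""
    result = []
    for c in customers:
        if not c["city"].startswith("New"):
            continue
        matched = [o for o in orders if o["cid"] == c["cid"]]
        dates = [o["odate"] for o in matched]
        result.append({
            "cid": c["cid"],
            "firstorder": min(dates) if dates else None,
            "lastorder": max(dates) if dates else None,
            "ordercnt": len(matched),
            "totalamount": sum(o["amount"] for o in matched),
        })
    return result


summary = summarize(customers, orders)
```

Customers without orders still appear in the result with a count of zero, which is the point of the left outer join. The sketch runs on one machine; the deck's claim is that U-SQL parallelizes the equivalent work without changes to the script.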
20. Intro Blog entry: http://aka.ms/usql-intro
Blog entry on UDFs: http://aka.ms/usql-udf
U-SQL Reference Doc (beta): http://aka.ms/usql_reference
U-SQL Community & Team site: http://usql.io/
Videos: https://channel9.msdn.com/Series/AzureDataLake
21. Additional Resources
• Blogs and community page:
• http://usql.io
• https://blogs.msdn.microsoft.com/azuredatalake/
• http://blogs.msdn.com/b/visualstudio/
• http://azure.microsoft.com/en-us/blog/topics/big-data/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation:
• http://aka.ms/usql_reference
• https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
• ADL forums and feedback
• http://aka.ms/adlfeedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
22. Natively unifies SQL’s declarativity and C#’s extensibility
Unifies querying structured and unstructured data
Unifies local and remote queries
Increase productivity and agility from Day 1 forward for YOU!
Sign up for an Azure Data Lake account and join the Public Preview at http://www.azure.com/datalake and give us your feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
Editor's notes
All data
Unstructured, Semi structured, Structured
Domain-specific user defined types using C#
Queries over Data Lake and Azure Blobs
Federated Queries over Operational and DW SQL stores removing the complexity of ETL
Productive from day one
Effortless scale and performance without need to manually tune/configure
Best developer experience throughout development lifecycle for both novices and experts
Leverage your existing skills with SQL and .NET
Easy and powerful data preparation
Easy to use built-in connectors for common data formats
Simple and rich extensibility model for adding customer-specific data transformations, both existing and new
No limits scale
Scales on demand with no change to code
Automatically parallelizes SQL and custom code
Designed to process petabytes of data
Enterprise grade
Managing, securing, sharing, and discovery of familiar data and code objects (tables, functions etc.)
Role based authorization of Catalogs and storage accounts using AAD security
Auditing of catalog objects (databases, tables, etc.)
A new distributed analytics service
Built on Apache YARN
Dynamically scales
Handles jobs of any scale instantly by simply setting the dial for how much power you need.
You only pay for the cost of the query
Supports Azure Active Directory for Access Control, Roles, Integration with on-premises identity systems
It also includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
U-SQL’s scalable runtime processes data across multiple Azure data sources
ADLA allows you to compute on data anywhere and join data from multiple cloud sources.
Hard to operate on unstructured data: even Hive requires metadata to be created before it can operate on unstructured data. Adding custom Java functions, aggregators, and SerDes involves many steps, often requires access to the server's head node, and differs based on the type of operation. Requires many tools and steps.
Some examples:
Hive UDAgg
Code and compile .java into .jar
Extend AbstractGenericUDAFResolver class: Does type checking, argument checking and overloading
Extend GenericUDAFEvaluator class: implements logic in 8 methods.
Deploy:
Deploy jar into class path on server
Edit FunctionRegistry.java to register as built-in
Update the content of show functions with ant
Hive UDF (as of v0.13)
Code
Load JAR into head node or at URI
CREATE FUNCTION USING JAR to register and load the jar into the classpath for every function (instead of registering the jar once and just using the functions)
Spark supports Custom “inputters and outputters” for defining custom RDDs
No UDAGGs
Simple integration of UDFs but only for duration of program. No reuse/sharing.
Cloud Dataflow? User has to care about scale and perf
Spark UDAgg
Is not yet supported (SPARK-3947)
Spark UDF
Write inline function:
def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state)
For SQL usage, need to register the table:
customerTable.registerTempTable("customerTable")
Register each UDF:
sqlContext.udf.register("westernState", westernState _)
Call it:
val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
Offers Auto-scaling and performance
Operates on unstructured data without tables needed
Easy to extend declaratively with custom code: consistent model for UDO, UDF and UDAgg.
Easy to query remote sources even without external tables
U-SQL UDAgg
Code and compile .cs file:
Implement IAggregate’s 3 methods: Init(), Accumulate(), Terminate()
C# takes care of type checking, generics, etc.
Deploy:
Tooling: one click registration in user db of assembly
By Hand:
Copy file to ADL
CREATE ASSEMBLY to register assembly
Use via AGG<MyNamespace.MyAggregate<T>>(a)
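The three-method aggregator lifecycle named above can be sketched as follows. This is a Python illustration of the Init/Accumulate/Terminate pattern only, not the C# IAggregate interface itself: the class MySum, the driver run_aggregate, and the sample rows are hypothetical, and a real runtime would distribute the groups across nodes rather than loop over them in one process.

```python
class MySum:
    """Hypothetical aggregator following an Init/Accumulate/Terminate lifecycle."""

    def init(self):
        # Called once per group before any rows are seen.
        self.total = 0.0

    def accumulate(self, value):
        # Called once for every row in the group.
        self.total += value

    def terminate(self):
        # Called once per group to produce the final value.
        return self.total


def run_aggregate(agg_cls, rows, key, column):
    """Drive one aggregator instance per group key (assumed driver logic)."""
    groups = {}
    for row in rows:
        k = row[key]
        if k not in groups:
            agg = agg_cls()
            agg.init()
            groups[k] = agg
        groups[k].accumulate(row[column])
    return {k: agg.terminate() for k, agg in groups.items()}


# Hypothetical input rows.
rows = [
    {"cid": 1, "amount": 2.0},
    {"cid": 1, "amount": 3.0},
    {"cid": 2, "amount": 4.5},
]
totals = run_aggregate(MySum, rows, "cid", "amount")
```

Because the aggregator only has to describe per-group state and its updates, the grouping and scheduling stay with the engine, which is what lets the same user code scale out unchanged.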
U-SQL UDF
Code in C#, register assembly once, call by C# name.
Extensions require .NET assemblies to be registered with a database