SQLSaturday Montreal 2017
Azure Data Lake
Big Data 2.0

Jean-Pierre Riehl
Practice Manager Data & BI
@djeepy1
http://blog.djeepy1.net
MVP Data Platform
French Data Community Leader

Agenda
• What is Azure Data Lake?
• Azure Data Lake Store
• Azure Data Lake Analytics
• Tooling
• Coding in U-SQL
• Extending ADLA
• ADL with PowerShell
• Q&A

What is Azure Data Lake?

A bit of history
• Microsoft needed a technology to analyze petabytes of data
• 2007-2008: Microsoft Research creates "Cosmos"
• 2011-2012: the Big Data phenomenon takes off

Azure Data Lake
A managed "Big Data" solution offered on Azure
• ADL Store: an HDFS-like distributed storage system
• ADL Analytics: an "analytical" query engine
• U-SQL: the simplicity of SQL, the power of .NET

Cortana Analytics Suite
Transform data into intelligent action
[Diagram labels: Business apps, Custom apps, Sensors and devices, People, Automated Systems, Data Collection Tools]

Azure Data Lake
Analytics: Azure Data Lake Analytics (U-SQL), HDInsight ("managed clusters")
Storage: Azure Data Lake Storage, Azure Blob

Azure Data Lake Store
• Pushing the limits!
• No limit on file size or total storage
• "Massive throughput, low latency"
• Advanced security (NTFS/POSIX-style)
• HDFS and WebHDFS compatible
• "Optimized for analytics"
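WebHDFS compatibility means files and folders in the store can be addressed through REST-style URLs. The following is a minimal sketch of how such a URL is put together, assuming a hypothetical account named `myadls` and the `azuredatalakestore.net` host suffix; it only builds the string and leaves out authentication (an OAuth bearer token would be required on a real request).

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style URL for an ADL Store account (no request is sent)."""
    # Assumption: the account is reachable at <account>.azuredatalakestore.net
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1"
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact and escapes characters such as spaces
    return f"{base}/{quote(path.lstrip('/'))}?{query}"


# "myadls" is a hypothetical account name; /Samples/Data is a folder used later in this deck
url = webhdfs_url("myadls", "/Samples/Data", "LISTSTATUS")
print(url)
```

LISTSTATUS is the standard WebHDFS operation for listing a directory; other operations (OPEN, CREATE, and so on) follow the same URL shape.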

Azure Blob vs. ADL Store
• Price:
  • ADL: €34 / TB / month
  • Azure Blob: €20 / TB / month*
* Hot LRS, first 100 TB
?
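To put the price gap in numbers, here is a small sketch using the two per-terabyte prices quoted on the slide. The 50 TB volume is a made-up example, and the calculation deliberately ignores transaction and egress charges.

```python
# Prices per TB per month in euros, as quoted on the slide (2017 figures)
ADL_EUR_PER_TB = 34
BLOB_EUR_PER_TB = 20  # Hot LRS tier, valid for the first 100 TB only


def monthly_cost(terabytes: float, eur_per_tb: float) -> float:
    """Flat monthly storage cost; ignores transactions and egress."""
    return terabytes * eur_per_tb


volume_tb = 50  # hypothetical volume, within the first 100 TB
adl = monthly_cost(volume_tb, ADL_EUR_PER_TB)
blob = monthly_cost(volume_tb, BLOB_EUR_PER_TB)
print(f"ADL Store: {adl:.0f} EUR, Azure Blob: {blob:.0f} EUR, ratio: {adl / blob:.2f}")
```

At these list prices ADL Store costs 1.7 times as much as Blob storage for the same volume, which is the trade-off the slide raises.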

Azure Data Lake Store: security focus
• Encrypted, with keys from Azure Key Vault
• Modern authentication (OAuth, MFA, etc.)
• Azure Active Directory integration
• Authorization with POSIX-style ACLs
• Auditing
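As an illustration of what "POSIX-style ACLs" means, the sketch below evaluates read/write/execute permissions for an owner, named users, an owning group and everyone else. It is a simplified model written for this deck, not the ADL Store implementation: the `Acl` class and `is_allowed` function are hypothetical names, and the POSIX mask and default ACLs are left out. In ADL Store the identities would be Azure Active Directory users and groups.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Acl:
    owner: str
    owner_perms: str   # e.g. "rwx"
    group: str
    group_perms: str   # e.g. "r-x"
    other_perms: str   # e.g. "---"
    named_users: Dict[str, str] = field(default_factory=dict)


def is_allowed(acl: Acl, user: str, groups: Set[str], action: str) -> bool:
    """Return True if `user` may perform `action` ('r', 'w' or 'x')."""
    # The first matching class decides: owner, named user, owning group, other
    if user == acl.owner:
        perms = acl.owner_perms
    elif user in acl.named_users:
        perms = acl.named_users[user]
    elif acl.group in groups:
        perms = acl.group_perms
    else:
        perms = acl.other_perms
    return action in perms


# Hypothetical users and groups
acl = Acl(owner="alice", owner_perms="rwx", group="analysts",
          group_perms="r-x", other_perms="---", named_users={"bob": "r--"})
print(is_allowed(acl, "bob", set(), "r"))            # named user with read access
print(is_allowed(acl, "carol", {"analysts"}, "w"))   # group has no write access
```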

Azure Data Lake Analytics
The marketing pitch
• "Elastic analytics service"
• "All data, at any size"
• "No limits to scale"

Azure Data Lake Analytics
Strengths:
• PaaS service
• Batch mode (work is submitted as "jobs")
• Pay-per-execution pricing model
• Security and auditing
• Optimized for ADL Store
• Dedicated language: U-SQL

U-SQL

The U-SQL language
SQL basics
• Core clauses
  • SELECT, FROM, WHERE
  • GROUP BY, JOIN, OVER
• Works on structured and unstructured data
• Relational model for metadata
The power of .NET
• C# expressions
• Code-behind
  • Types
  • Functions
  • Aggregates
  • Extractors / Outputters
  • Processors
• Reuse of .NET assemblies

U-SQL use cases
Source: @DoktorKermit

My first U-SQL query
// Data extraction (schema-on-read): the schema is declared in the query, not stored with the file
@checkins =
    EXTRACT [Date] DateTime,
            [Checkins] int,
            [DenRatio] string, [MayorRatio] string,
            [Category] string, [Subcategory] string,
            Venue string, Country string, City string,
            Latitude string, Longitude string
    FROM "/Samples/Data/Djeepy1Foursquare/Export-ADL-20170305.csv"
    USING Extractors.Csv(skipFirstNRows : 1); // skip the header row

// Data manipulation
@resByCat =
    SELECT [Category],
           COUNT(*) AS NbCheckins
    FROM @checkins
    GROUP BY [Category];

// Output
OUTPUT @resByCat
TO "/Samples/Data/Djeepy1Foursquare/Out-ByCat-FirstQuery.csv"
USING Outputters.Csv();
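For readers more at home in a general-purpose language, the same three steps (extract with schema-on-read, group and count, output) can be sketched in plain Python. This is only an analogy to show the logic, not how ADLA executes the query; the three-column schema and the sample rows are made up and much smaller than the real export.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Schema-on-read: column names and types live in the query, not in the file
SCHEMA = [("Date", datetime.fromisoformat), ("Checkins", int), ("Category", str)]

# Tiny made-up sample standing in for the exported check-ins file
RAW = """Date,Checkins,Category
2017-03-01,1,Food
2017-03-02,1,Travel
2017-03-03,1,Food
"""


def extract(text, schema, skip_first_n_rows=0):
    """Parse CSV text and cast each column according to the schema."""
    rows = list(csv.reader(io.StringIO(text)))[skip_first_n_rows:]
    return [{name: cast(value) for (name, cast), value in zip(schema, row)}
            for row in rows]


# EXTRACT ... USING Extractors.Csv(skipFirstNRows : 1)
checkins = extract(RAW, SCHEMA, skip_first_n_rows=1)

# SELECT [Category], COUNT(*) ... GROUP BY [Category]
res_by_cat = Counter(row["Category"] for row in checkins)

# OUTPUT ... USING Outputters.Csv()
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(sorted(res_by_cat.items()))
print(out.getvalue())
```

The point of schema-on-read is visible in `SCHEMA`: the file itself is untyped text, and the types are applied only when the data is read.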

Running a job

How a U-SQL job runs
[Diagram labels: Author, Front-End Service, U-SQL Compiler Service & U-SQL Catalog, Compiler, Optimizer, Plan, Optimized Plan, Job Scheduler & Queue, Job Manager, Vertex Scheduling on containers, Vertexes running in YARN Containers, Vertex Execution, U-SQL Runtime, Consume, Local Storage, Data Lake Store]

Execution plan (aka "Job Graph")
The job is split into vertices.
Vertices are grouped by "type of work" (SuperVertex).
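One way to picture a job graph is as a list of stages, each holding a number of vertices that do the same kind of work. The sketch below uses a made-up three-stage graph and two strong simplifying assumptions (stages run strictly one after another, and every vertex takes the same time) to estimate how many scheduling "waves" a job needs for a given number of execution units (the ADLAUs introduced later in the deck). The stage names and the `estimated_waves` function are hypothetical.

```python
import math

# Hypothetical job graph: each stage (SuperVertex) groups vertices doing the same kind of work
STAGES = [("Extract", 8), ("Aggregate", 4), ("Output", 1)]


def estimated_waves(stages, units):
    """Number of scheduling 'waves' if each unit runs one vertex at a time.

    Assumes stages run one after another and all vertices take equal time.
    """
    return sum(math.ceil(vertices / units) for _, vertices in stages)


for units in (1, 4, 8, 16):
    print(units, estimated_waves(STAGES, units))
```

Under these assumptions, going from 8 to 16 units changes nothing, because no stage has more than 8 vertices to run in parallel.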

My first run

Analyzing the run
5 ADLAUs allocated
1 ADLAU consumed

Data Lake Analytics Unit
• ADLAU: the execution unit of a job
• 1 ADLAU = 1 VM with 2 cores and 6 GB of RAM
• Declarative: you specify how many ADLAUs you want
• Vertices are assigned to ADLAUs for execution
• Billing is based on the ADLAUs allocated, not on those actually used
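Because billing follows the allocation, over-allocating has a direct cost. The sketch below works through the run analyzed earlier (5 units allocated, 1 consumed), assuming the bill is the number of allocated units multiplied by the job's duration. The job duration and the price per unit-hour are made-up values; real prices should be taken from the Azure pricing page.

```python
def billed_au_hours(allocated_units: int, duration_minutes: float) -> float:
    """Unit-hours billed for a job: allocation times wall-clock duration."""
    return allocated_units * duration_minutes / 60


def utilization(consumed_units: float, allocated_units: int) -> float:
    """Share of the allocation the job actually used."""
    return consumed_units / allocated_units


# Figures from the run analyzed earlier: 5 units allocated, 1 consumed
allocated, consumed = 5, 1
minutes = 6                # hypothetical job duration
price_per_au_hour = 2.0    # hypothetical price, for illustration only

hours = billed_au_hours(allocated, minutes)
print(f"billed: {hours:.2f} unit-hours, cost: {hours * price_per_au_hour:.2f}")
print(f"utilization: {utilization(consumed, allocated):.0%}")
```

With these numbers the job pays for five units while using one, so requesting a single unit would have cut the bill to a fifth, assuming the duration stayed the same.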

Pricing

Tooling

Visual Studio
• IntelliSense
• Local execution
• Job visualization
  • Optimization
  • Replay
  • Debugging
Download Azure Data Lake Tools

Extending ADLA

Extending U-SQL with .NET
• C# expressions
• UDFs: user-defined functions
• UDAGGs: user-defined aggregates
• UDOs: user-defined operators (Extractors, Outputters)
• PROCESS: custom row processing

Extending U-SQL

ADL with PowerShell

ADL Store - commands

ADL Analytics - commands

Azure Data Lake & PowerShell

Thank you!
Questions…
…and answers

Azure Data Lake, Big Data 2.0 - SQLSaturday Montreal 2017

Editor's notes

  • #8 Cortana Analytics Suite delivers an end-to-end platform with an integrated and comprehensive set of tools and services to help you build intelligent applications that take advantage of advanced analytics. First, Cortana Analytics Suite provides services to bring data in so that you can analyze it. Information management capabilities such as Azure Data Factory let you pull data from any source (relational databases like SQL, or non-relational ones like a Hadoop cluster) in an automated, scheduled way while performing the necessary data transformations (for example, typing certain columns as dates or currency). Think ETL (Extract, Transform, Load) in the cloud. Event Hubs does the same for IoT-style ingestion of data streaming in from many endpoints. The data brought in can then be persisted in flexible big data storage services such as Data Lake and Azure SQL DW. From there, a wide range of analytics services, from Azure ML to Azure HDInsight to Azure Stream Analytics, can analyze the stored data, so you can create analytics services and models specific to your business need (say, real-time demand forecasting). The resulting services and models can be surfaced as interactive dashboards and visualizations via Power BI. They can also be integrated into various UIs (web, mobile, or rich client apps) and with Cortana, so that end users can interact with them naturally via speech and can be proactively notified by Cortana when a model finds a new anomaly (unusual growth in certain product purchases, in the demand forecasting example above) or anything else that deserves the attention of business users.
  • #9 HDFS, YARN
  • #11 https://docs.microsoft.com/en-us/azure/storage/storage-scalability-targets 500 TB / 50k blocks / 100 MB per block
  • #14 Portal, .NET project
  • #18 Portal, .NET project
  • #23 Portal, .NET project
  • #28 Portal, .NET project
  • #34 https://www.microsoft.com/en-us/download/details.aspx?id=49504
  • #37 Portal, .NET project