U-SQL CASE STUDY
Paris Datageeks, 05/10/2016
• Michel Caradec
  • mcaradec@cegid.fr
• Project Manager, Software/Data Engineer at Cegid
• Background
  • Business Intelligence, ETL, OLAP, Data Manipulation
  • C#, R, Python
06/10/2016 CEGID3
About Me
Business Case
Azure Data Lake
U-SQL Case Study
Questions
Agenda
BUSINESS CASE
• Cegid websites equipped with tracking solutions
  • Extend web analytics data
• Data Engineer: collects and prepares data
• Data Scientists: consume data in Azure ML Studio
  • Understand how visitors use the sites (browsing)
  • Provide a better experience (recommendations)
Business Case
Business Case
Visitors clustering (metrics built from a subset of the dataset; they do not represent real traffic)
Cluster sizes:
• Tactile visitor: 3.5%
• Returning visitor: 7%
• Addict visitor: 3.5%
• One shot visitor: 86%
Cluster profiles, in slide order:
• 1 single visit; bounce rate: 63%; session length: 2 min; mainly on Homepage
• Average of 4.5 sessions; high screen resolution; Windows OS; use of internal Search; during the day
• Average of 2.5 sessions; 1.5 pages / session; bounce rate: 80%; iOS (iPhone, iPad); in the evening and at weekends; no conversion
• Average of 7 sessions; 5 pages / session; session length: 6 min; mainly Solutions pages; more conversions
AZURE DATA LAKE
• ADL Store: repository, schema-on-read, WebHDFS
• ADL Analytics: distributed processing using U-SQL
Azure Data Lake = Big Data as a Service
© Microsoft Azure
REFERENCE ASSEMBLY [Cegid.DigitalAnalytics.Commons];
@data =
    EXTRACT user string, timestamp DateTime, heart int
    FROM "quantified-{user}.tsv"
    USING Extractors.Tsv();
@agg =
    SELECT *, timestamp.Date AS date, timestamp.Hour AS hour
    FROM @data;
@agg =
    SELECT user, date, hour,
           AVG(heart) AS avg, MIN(heart) AS min, MAX(heart) AS max
    FROM @agg
    GROUP BY user, date, hour;
OUTPUT @agg
TO "quantified.csv"
USING Outputters.Csv();
U-SQL = SQL + C#
• Extract → Transform → Output
ADLS, Azure Storage Blobs, Azure SQL
• File schema-on-read
• SQL-like data manipulation
• C# data types
• C# integration
• Can store as relational
And many more…
• User-defined aggregators (C#)
• User-defined operators (C#)
• Custom Extractors, Outputters (C#)
• File sets for multiple input files access patterns
• Credentials
Inspired by Michael Rys's presentation at SQL Server PASS Deutschland 2016
Output’ (alternative output, stored as a relational table):
CREATE TABLE quantified(
    user string, date DateTime, hour int,
    avg long?, min int?, max int?,
    INDEX idx CLUSTERED(user, date, hour) DISTRIBUTED BY HASH(user));
INSERT INTO quantified SELECT * FROM @agg;
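For readers more at home in Python than in U-SQL, the grouping performed by the script above can be approximated as follows. This is only an illustrative sketch, not the code used at Cegid: the function name `aggregate_heart_rate` and the sample rows are hypothetical, and everything runs in memory rather than as a distributed job.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_heart_rate(rows):
    """Group (user, timestamp, heart) rows by user, date and hour.

    Returns {(user, date, hour): (avg, min, max)}, which mirrors the
    GROUP BY user, date, hour step of the U-SQL script.
    """
    groups = defaultdict(list)
    for user, timestamp, heart in rows:
        groups[(user, timestamp.date(), timestamp.hour)].append(heart)
    return {
        key: (sum(values) / len(values), min(values), max(values))
        for key, values in groups.items()
    }


# Hypothetical sample rows, not taken from the talk.
sample = [
    ("alice", datetime(2016, 10, 5, 9, 15), 60),
    ("alice", datetime(2016, 10, 5, 9, 45), 80),
    ("alice", datetime(2016, 10, 5, 10, 5), 70),
]
result = aggregate_heart_rate(sample)
```

The U-SQL version expresses the same idea declaratively, which is what lets the engine distribute the work.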
CASE STUDY
• JSON Record
Data sources
“Custom Dimensions”: contains arrays
U-SQL Pipeline
1. JSON / TSV conversion → cegid-<site>-raw.tsv
2. Sessions aggregation → cegid-<site>-sessions.tsv
3. Visitors aggregation → cegid-<site>-visitors.tsv
U-SQL Script 1 - JSON / TSV Conversion
1. *.json files extraction
2. Records to JSON objects conversion
3. Cegid IPs filtering
4. JSON objects fields extraction
5. Custom dimensions extraction
6. Custom dimensions pivot
7. Output to TSV files
Data extracted as raw text
@json_records_raw =
    EXTRACT json_raw string
    FROM "{*}/{date:yyyy}{date:MM}{date:dd}.json"
    USING Extractors.Text(delimiter : Char.MinValue, quoting : false);
JSON / TSV Conversion - Step 1/7
Azure Storage Blobs → @json_records_raw
• 20160901/20160901.json
• 20160902/20160902.json
• 20160903/20160903.json
• 20160904/20160904.json
• …
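The file set pattern used in this step encodes a date in each file name. A rough Python equivalent is sketched below, assuming that `{*}` stands for exactly one folder name (the slide does not say so explicitly). `date_from_path` is a hypothetical helper, not part of the talk.

```python
import re
from datetime import date

# Rough equivalent of the pattern "{*}/{date:yyyy}{date:MM}{date:dd}.json".
# Assumption: "{*}" matches exactly one folder name.
FILE_PATTERN = re.compile(r"^[^/]+/(\d{4})(\d{2})(\d{2})\.json$")


def date_from_path(path):
    """Return the date encoded in the file name, or None if the path does not match."""
    match = FILE_PATTERN.match(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None


paths = ["20160901/20160901.json", "20160902/20160902.json", "20160902/notes.txt"]
dates = [date_from_path(p) for p in paths]
```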
DECLARE @fields = new SqlArray<string>() { "timestamp", "eventId", "customerId", ... };
@json_records =
    SELECT JsonFunctions.JsonTuple(json_raw, @fields.ToArray()) AS json_object
    FROM @json_records_raw;
JSON / TSV Conversion - Step 2/7
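Judging from the `json_object["ipAddress"]` lookups in the following step, `JsonTuple` appears to yield a string-to-string map of the requested fields. The Python sketch below mimics that behaviour under this assumption; `json_tuple` and the sample record are hypothetical, and nested values are kept as JSON text so that a later step can parse them.

```python
import json


def json_tuple(json_raw, fields):
    """Parse one raw JSON record and keep the requested top-level fields as strings."""
    record = json.loads(json_raw)
    result = {}
    for name in fields:
        if name in record:
            value = record[name]
            # Strings are kept as-is; anything else is re-serialised as JSON text.
            result[name] = value if isinstance(value, str) else json.dumps(value)
    return result


# Hypothetical record; the real field list is elided on the slide.
raw = '{"timestamp": "2016-09-01T10:00:00Z", "eventId": 42, "ipAddress": "192.0.2.10", "other": true}'
json_object = json_tuple(raw, ["timestamp", "eventId", "ipAddress"])
```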
@json_records =
    SELECT json_object["ipAddress"] AS ipAddress, json_object
    FROM @json_records;
@json_records =
    SELECT *
    FROM @json_records
    LEFT ANTISEMIJOIN (SELECT ip FROM CegidIps) AS ips
    ON ipAddress == ips.ip;
JSON / TSV Conversion - Step 3/7
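A left anti-semijoin keeps only the left-hand rows that have no match on the right. Here that means dropping events coming from Cegid's own IP addresses. The idea can be sketched in Python as follows; the function name and the sample addresses (taken from documentation-reserved ranges) are hypothetical.

```python
def left_anti_semi_join(records, excluded_ips):
    """Keep the records whose ipAddress is NOT in the excluded list."""
    excluded = set(excluded_ips)
    return [r for r in records if r.get("ipAddress") not in excluded]


# Hypothetical data: 192.0.2.1 plays the role of an internal Cegid address.
records = [
    {"ipAddress": "192.0.2.1", "eventId": "1"},
    {"ipAddress": "203.0.113.7", "eventId": "2"},
    {"ipAddress": "198.51.100.3", "eventId": "3"},
]
external = left_anti_semi_join(records, ["192.0.2.1"])
```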
@events =
    SELECT json_object["timestamp"] AS timestamp,
           json_object["eventId"] AS eventId,
           json_object["customerId"] AS customerId,
           // ...
    FROM @json_records;
JSON / TSV Conversion - Step 4/7
@events =
    SELECT *, CustomDimensions.ParseFromJson(cd) AS cd_array
    FROM @events;
JSON / TSV Conversion - Step 5/7
index  value
3      bigFan
4      powerBuyer
2      1443654335461
1      ComkzbXvn8g
→ value (ordered): ComkzbXvn8g, 1443654335461, bigFan, powerBuyer
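The slide suggests that custom dimensions arrive as unordered (index, value) pairs and are turned into an array ordered by index. A minimal Python sketch of that idea follows. The JSON layout (keys named `index` and `value`), the 1-based indexing and the fixed size of 20 are assumptions inferred from the slides, and `parse_custom_dimensions` is a hypothetical stand-in for the C# `CustomDimensions.ParseFromJson`.

```python
import json


def parse_custom_dimensions(cd_json, size=20):
    """Return a list of `size` values ordered by 1-based index (None when missing)."""
    values = [None] * size
    for item in json.loads(cd_json):
        position = int(item["index"])
        if 1 <= position <= size:
            values[position - 1] = item["value"]
    return values


# Same values as on the slide, in the same (unordered) sequence.
cd_json = (
    '[{"index": 3, "value": "bigFan"}, {"index": 4, "value": "powerBuyer"},'
    ' {"index": 2, "value": "1443654335461"}, {"index": 1, "value": "ComkzbXvn8g"}]'
)
cd_array = parse_custom_dimensions(cd_json)
```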
@events =
    SELECT timestamp,
           eventId,
           customerId,
           verb,
           // ...
           cd_array[0] AS cd_01,
           cd_array[1] AS cd_02,
           // ...
           cd_array[19] AS cd_20
    FROM @events;
JSON / TSV Conversion - Step 6/7
value: ComkzbXvn8g, 1443654335461, bigFan, powerBuyer
→
cd_01        cd_02          cd_03   cd_04
ComkzbXvn8g  1443654335461  bigFan  powerBuyer
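The pivot turns the ordered array into one column per custom dimension. In Python the same reshaping could be sketched as below; `pivot_custom_dimensions` is a hypothetical name, and the column count of 20 is taken from the `cd_20` column on the slide.

```python
def pivot_custom_dimensions(cd_array, size=20):
    """Map an ordered list of values to columns cd_01 .. cd_<size>."""
    # Truncate or pad with None so that every row has the same columns.
    padded = list(cd_array[:size]) + [None] * (size - len(cd_array))
    return {f"cd_{i + 1:02d}": padded[i] for i in range(size)}


row = pivot_custom_dimensions(["ComkzbXvn8g", "1443654335461", "bigFan", "powerBuyer"])
```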
OUTPUT
(
    SELECT *
    FROM @events
    WHERE fullUrl.StartsWith("http://www.cegid.com/uk/")
)
TO "cegid-uk-raw.tsv"
ORDER BY clientId, timestamp ASC
USING Outputters.Tsv(quoting : false);
JSON / TSV Conversion - Step 7/7
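The output step filters events by site URL, orders them by client and timestamp, and writes unquoted tab-separated text. A small Python sketch of the same three operations is given below. `write_site_tsv`, the column list and the sample events are hypothetical, and timestamps are assumed to be ISO-8601 strings so that text ordering matches chronological ordering.

```python
def write_site_tsv(events, url_prefix, columns):
    """Filter events by URL prefix, sort them and render them as TSV text."""
    selected = [e for e in events if e["fullUrl"].startswith(url_prefix)]
    selected.sort(key=lambda e: (e["clientId"], e["timestamp"]))
    lines = ["\t".join(str(e[c]) for c in columns) for e in selected]
    return "".join(line + "\n" for line in lines)


# Hypothetical events, not taken from the talk.
events = [
    {"clientId": "b", "timestamp": "2016-09-01T10:00:00", "fullUrl": "http://www.cegid.com/uk/home"},
    {"clientId": "a", "timestamp": "2016-09-01T11:00:00", "fullUrl": "http://www.cegid.com/uk/solutions"},
    {"clientId": "a", "timestamp": "2016-09-01T09:00:00", "fullUrl": "http://www.cegid.com/uk/home"},
    {"clientId": "c", "timestamp": "2016-09-01T09:00:00", "fullUrl": "http://www.cegid.com/fr/home"},
]
tsv = write_site_tsv(events, "http://www.cegid.com/uk/", ["clientId", "timestamp"])
```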
JSON / TSV Conversion
• TSV format
JSON / TSV Conversion - Output
• N events → M sessions (M >= 1)
  • New session if no activity for 30 minutes
• Aggregation using U-SQL REDUCE
  • Microsoft.Analytics.Interfaces.IReducer
• User Agent parsing: UAParser (C#)
• Geocoding (lon/lat, timezone, time lag): Cegid.GeoTools (C#)
U-SQL Script 2 - Sessions Aggregation
1. cegid-<site>-raw.tsv files extraction
2. Events to sessions aggregation
3. Output to TSV files
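In the talk the sessionization rule lives in a C# `IReducer`; the Python sketch below only illustrates the rule itself for a single visitor. The function name is hypothetical, and treating a gap of exactly 30 minutes as the start of a new session is an assumption, since the slide does not specify the boundary case.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_sessions(timestamps, timeout=SESSION_TIMEOUT):
    """Split one visitor's event timestamps into sessions.

    A new session starts when the gap since the previous event
    is at least `timeout` (boundary handling is an assumption).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical events for one visitor.
events = [
    datetime(2016, 9, 1, 10, 0),
    datetime(2016, 9, 1, 10, 10),
    datetime(2016, 9, 1, 10, 45),
    datetime(2016, 9, 1, 11, 0),
    datetime(2016, 9, 1, 12, 0),
]
sessions = split_into_sessions(events)
```

Because the rule only needs the events of one visitor, sorted by time, it fits a reducer that receives all rows sharing the same key.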
• N sessions → 1 visitor
• Aggregation using U-SQL REDUCE
  • Microsoft.Analytics.Interfaces.IReducer
U-SQL Script 3 - Visitors Aggregation
1. cegid-<site>-sessions.tsv files extraction
2. Sessions to visitors aggregation
3. Output to TSV files
• Custom dimensions aggregation (C#)
Visitors Aggregation
Input:
visitor_id  cd_01      cd_02
A           homepage   articles
A           solutions  detail
A           homepage   articles
A           form       news
B           homepage   articles
B           blog       news
Output (- : no occurrence):
visitor_id  cat_homepage  cat_solutions  cat_blog  cat_form  typ_articles  typ_detail  typ_news
A           2             1              -         1         2             1           1
B           1             -              1         -         1             -           1
cd_01 = page category
cd_02 = content type
...
cd_20 = …
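The table above amounts to counting, per visitor, how often each custom dimension value occurs. The talk implements this in C#; the Python sketch below reproduces the same counts on the slide's example. The `cat` and `typ` prefixes come from the slide's column names, while the function name and the `PREFIXES` mapping are hypothetical.

```python
from collections import Counter, defaultdict

# Column prefix per custom dimension, following the slide's output columns.
PREFIXES = {"cd_01": "cat", "cd_02": "typ"}


def aggregate_custom_dimensions(rows):
    """Count occurrences of each custom dimension value per visitor."""
    counts = defaultdict(Counter)
    for row in rows:
        for column, prefix in PREFIXES.items():
            value = row.get(column)
            if value:
                counts[row["visitor_id"]][f"{prefix}_{value}"] += 1
    return {visitor: dict(counter) for visitor, counter in counts.items()}


# Rows copied from the slide's input table.
rows = [
    {"visitor_id": "A", "cd_01": "homepage", "cd_02": "articles"},
    {"visitor_id": "A", "cd_01": "solutions", "cd_02": "detail"},
    {"visitor_id": "A", "cd_01": "homepage", "cd_02": "articles"},
    {"visitor_id": "A", "cd_01": "form", "cd_02": "news"},
    {"visitor_id": "B", "cd_01": "homepage", "cd_02": "articles"},
    {"visitor_id": "B", "cd_01": "blog", "cd_02": "news"},
]
visitors = aggregate_custom_dimensions(rows)
```

Values that never occur for a visitor are simply absent from the result, which corresponds to the empty cells of the output table.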
• Keep C# code centralized in dedicated assemblies
  • Code-behind is good for prototyping
• Prefer U-SQL to C# for better optimization
  • Distributed execution is built from the U-SQL script, not from C#
• Use parallelism properly
• Know your data
  • Statistics (cardinality, distribution, skewness)
  • Growth
• Understand U-SQL concepts (vertex, partitioning, etc.); knowing MapReduce design patterns will help
• Local mode is good, but do not forget to test on ADL
Lessons Learned
Pros
• Packaged solution, zero deployment
• Cloud agility (scalability, elasticity)
• WebHDFS storage = Hadoop compatible (Sqoop, etc.)
• Integrated development (Visual Studio, local/debug mode)
• U-SQL = SQL + C# (business code reuse)
Azure Data Lake Review
Cons
• Proprietary solution
• Not on-premises
• Learning curve
QUESTIONS
Thank you for your attention

Contenu connexe

Tendances

MongoDB - General Purpose Database
MongoDB - General Purpose DatabaseMongoDB - General Purpose Database
MongoDB - General Purpose DatabaseAshnikbiz
 
MongoDB .local Paris 2020: Adéo @MongoDB : MongoDB Atlas & Leroy Merlin : et ...
MongoDB .local Paris 2020: Adéo @MongoDB : MongoDB Atlas & Leroy Merlin : et ...MongoDB .local Paris 2020: Adéo @MongoDB : MongoDB Atlas & Leroy Merlin : et ...
MongoDB .local Paris 2020: Adéo @MongoDB : MongoDB Atlas & Leroy Merlin : et ...MongoDB
 
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...Big Data Spain
 
Brandon obrien streaming_data
Brandon obrien streaming_dataBrandon obrien streaming_data
Brandon obrien streaming_dataNitin Kumar
 
Implementing and Visualizing Clickstream data with MongoDB
Implementing and Visualizing Clickstream data with MongoDBImplementing and Visualizing Clickstream data with MongoDB
Implementing and Visualizing Clickstream data with MongoDBMongoDB
 
MongoDB World 2016: The Best IoT Analytics with MongoDB
MongoDB World 2016: The Best IoT Analytics with MongoDBMongoDB World 2016: The Best IoT Analytics with MongoDB
MongoDB World 2016: The Best IoT Analytics with MongoDBMongoDB
 
How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to...
How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to...How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to...
How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to...MongoDB
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorHenrik Ingo
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupMárton Kodok
 
Hermes: Free the Data! Distributed Computing with MongoDB
Hermes: Free the Data! Distributed Computing with MongoDBHermes: Free the Data! Distributed Computing with MongoDB
Hermes: Free the Data! Distributed Computing with MongoDBMongoDB
 
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Big Data Spain
 
Code Camp - Building a Glass app with Wakanda
Code Camp - Building a Glass app with WakandaCode Camp - Building a Glass app with Wakanda
Code Camp - Building a Glass app with Wakandatroxell
 
Joins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsJoins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsAndrew Morgan
 
13 Programación Web con .NET y C#
13 Programación Web con .NET y C#13 Programación Web con .NET y C#
13 Programación Web con .NET y C#guidotic
 
N1QL workshop: Indexing & Query turning.
N1QL workshop: Indexing & Query turning.N1QL workshop: Indexing & Query turning.
N1QL workshop: Indexing & Query turning.Keshav Murthy
 
What and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual GrallWhat and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual Gralldistributed matters
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
 
SharePoint Search Queries Explained - SPSSthlm 2015
SharePoint Search Queries Explained - SPSSthlm 2015SharePoint Search Queries Explained - SPSSthlm 2015
SharePoint Search Queries Explained - SPSSthlm 2015Mikael Svenson
 

Tendances (20)

MongoDB - General Purpose Database
MongoDB - General Purpose DatabaseMongoDB - General Purpose Database
MongoDB - General Purpose Database
 
MongoDB .local Paris 2020: Adéo @MongoDB : MongoDB Atlas & Leroy Merlin : et ...
MongoDB .local Paris 2020: Adéo @MongoDB : MongoDB Atlas & Leroy Merlin : et ...MongoDB .local Paris 2020: Adéo @MongoDB : MongoDB Atlas & Leroy Merlin : et ...
MongoDB .local Paris 2020: Adéo @MongoDB : MongoDB Atlas & Leroy Merlin : et ...
 
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
 
Brandon obrien streaming_data
Brandon obrien streaming_dataBrandon obrien streaming_data
Brandon obrien streaming_data
 
Implementing and Visualizing Clickstream data with MongoDB
Implementing and Visualizing Clickstream data with MongoDBImplementing and Visualizing Clickstream data with MongoDB
Implementing and Visualizing Clickstream data with MongoDB
 
MongoDB World 2016: The Best IoT Analytics with MongoDB
MongoDB World 2016: The Best IoT Analytics with MongoDBMongoDB World 2016: The Best IoT Analytics with MongoDB
MongoDB World 2016: The Best IoT Analytics with MongoDB
 
How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to...
How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to...How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to...
How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to...
 
MongoDB and Spark
MongoDB and SparkMongoDB and Spark
MongoDB and Spark
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop Connector
 
Complex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch WarmupComplex realtime event analytics using BigQuery @Crunch Warmup
Complex realtime event analytics using BigQuery @Crunch Warmup
 
Hermes: Free the Data! Distributed Computing with MongoDB
Hermes: Free the Data! Distributed Computing with MongoDBHermes: Free the Data! Distributed Computing with MongoDB
Hermes: Free the Data! Distributed Computing with MongoDB
 
Siddhi - cloud-native stream processor
Siddhi - cloud-native stream processorSiddhi - cloud-native stream processor
Siddhi - cloud-native stream processor
 
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
 
Code Camp - Building a Glass app with Wakanda
Code Camp - Building a Glass app with WakandaCode Camp - Building a Glass app with Wakanda
Code Camp - Building a Glass app with Wakanda
 
Joins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsJoins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation Enhancements
 
13 Programación Web con .NET y C#
13 Programación Web con .NET y C#13 Programación Web con .NET y C#
13 Programación Web con .NET y C#
 
N1QL workshop: Indexing & Query turning.
N1QL workshop: Indexing & Query turning.N1QL workshop: Indexing & Query turning.
N1QL workshop: Indexing & Query turning.
 
What and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual GrallWhat and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual Grall
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
SharePoint Search Queries Explained - SPSSthlm 2015
SharePoint Search Queries Explained - SPSSthlm 2015SharePoint Search Queries Explained - SPSSthlm 2015
SharePoint Search Queries Explained - SPSSthlm 2015
 

En vedette

Panama papers - Investigation et Big Data
Panama papers - Investigation et Big DataPanama papers - Investigation et Big Data
Panama papers - Investigation et Big DataMichel Caradec
 
Flink Case Study: Bouygues Telecom
Flink Case Study: Bouygues TelecomFlink Case Study: Bouygues Telecom
Flink Case Study: Bouygues TelecomFlink Forward
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleFlink Forward
 
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkAnwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkFlink Forward
 
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & KafkaMohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & KafkaFlink Forward
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingRobert Metzger
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 

En vedette (7)

Panama papers - Investigation et Big Data
Panama papers - Investigation et Big DataPanama papers - Investigation et Big Data
Panama papers - Investigation et Big Data
 
Flink Case Study: Bouygues Telecom
Flink Case Study: Bouygues TelecomFlink Case Study: Bouygues Telecom
Flink Case Study: Bouygues Telecom
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkAnwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
 
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & KafkaMohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 

Similaire à Paris Datageeks meetup 05102016

CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryMárton Kodok
 
Optimizing a React application for Core Web Vitals
Optimizing a React application for Core Web VitalsOptimizing a React application for Core Web Vitals
Optimizing a React application for Core Web VitalsJuan Picado
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileRoy Kim
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureMark Kromer
 
Building event-driven Serverless Apps with Azure Functions and Azure Cosmos DB
Building event-driven Serverless Apps with Azure Functions and Azure Cosmos DBBuilding event-driven Serverless Apps with Azure Functions and Azure Cosmos DB
Building event-driven Serverless Apps with Azure Functions and Azure Cosmos DBMicrosoft Tech Community
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Tech Community
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Tech Community
 
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...Riccardo Zamana
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...Cisco DevNet
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB
 
WSO2 Analytics Platform - The one stop shop for all your data needs
WSO2 Analytics Platform - The one stop shop for all your data needsWSO2 Analytics Platform - The one stop shop for all your data needs
WSO2 Analytics Platform - The one stop shop for all your data needsSriskandarajah Suhothayan
 
Coud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AICoud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AITorsten Steinbach
 
Webinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerWebinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerIBM Cloud Data Services
 
Stay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft AzureStay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft AzureHARMAN Services
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21JDA Labs MTL
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT_MTL
 

Similaire à Paris Datageeks meetup 05102016 (20)

CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
 
Optimizing a React application for Core Web Vitals
Optimizing a React application for Core Web VitalsOptimizing a React application for Core Web Vitals
Optimizing a React application for Core Web Vitals
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
ADF+Course+Deck.pdf
ADF+Course+Deck.pdfADF+Course+Deck.pdf
ADF+Course+Deck.pdf
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Building event-driven Serverless Apps with Azure Functions and Azure Cosmos DB
Building event-driven Serverless Apps with Azure Functions and Azure Cosmos DBBuilding event-driven Serverless Apps with Azure Functions and Azure Cosmos DB
Building event-driven Serverless Apps with Azure Functions and Azure Cosmos DB
 
Implementing Real-Time IoT Stream Processing in Azure
Implementing Real-Time IoT Stream Processing in Azure Implementing Real-Time IoT Stream Processing in Azure
Implementing Real-Time IoT Stream Processing in Azure
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needs
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needs
 
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
DW on AWS
DW on AWSDW on AWS
DW on AWS
 
WSO2 Analytics Platform - The one stop shop for all your data needs
WSO2 Analytics Platform - The one stop shop for all your data needsWSO2 Analytics Platform - The one stop shop for all your data needs
WSO2 Analytics Platform - The one stop shop for all your data needs
 
Coud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AICoud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AI
 
Webinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerWebinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data Layer
 
Stay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft AzureStay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft Azure
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 

Dernier

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Paris Datageeks meetup 05102016

  • 2. U-SQL CASE STUDY Paris Datageeks, 05/10/2016
  • 3. About Me
      Michel Caradec (mcaradec@cegid.fr)
      Project Manager, Software/Data Engineer at Cegid
      Background:
      • Business Intelligence, ETL, OLAP, Data Manipulation
      • C#, R, Python
  • 4. Agenda
      • Business Case
      • Azure Data Lake
      • U-SQL Case Study
      • Questions
  • 6. Business Case
      Cegid web sites armed with tracking solutions
      • Extend web analytics data
      Data Engineer: collect and prepare data
      Data Scientists: consume data in Azure ML Studio
      • Visitors usage knowledge (browsing)
      • Provide better experience (recommendations)
  • 7. Business Case: visitors clustering
      Cluster shares: Tactile visitor 3.5%, Returning visitor 7%, Addict visitor 3.5%, One shot visitor 86%
      Cluster profiles, in the order they appear on the slide:
      • 1 single visit; bounce rate: 63%; session length: 2 min; mainly on Homepage
      • Average of 4.5 sessions; high screen resolution; Windows OS; use of internal Search; during the day
      • Average of 2.5 sessions; 1.5 pages / session; bounce rate: 80%; iOS (iPhone, iPad); in the evening and on weekends; no conversion
      • Average of 7 sessions; 5 pages / session; session length: 6 min; mainly Solutions pages; more conversions
      Metrics built from a subset of the dataset. Do not represent real traffic.
  • 9. Azure Data Lake = Big Data as a Service
      ADL Store: repository, schema-on-read, Web HDFS
      ADL Analytics: distributed processing using U-SQL
      © Microsoft Azure
  • 10. U-SQL = SQL + C#
      Extract, Transform, Output
      ADLS, Azure Storage Blobs, Azure SQL

        REFERENCE ASSEMBLY [Cegid.DigitalAnalytics.Commons];

        @data = EXTRACT user string, timestamp DateTime, heart int
                FROM "quantified-{user}.tsv"
                USING Extractors.Tsv();

        @agg = SELECT *, timestamp.Date AS date, timestamp.Hour AS hour
               FROM @data;

        @agg = SELECT user, date, hour, AVG(heart) AS avg, MIN(heart) AS min, MAX(heart) AS max
               FROM @agg
               GROUP BY user, date, hour;

        OUTPUT @agg TO "quantified.csv"
        USING Outputters.Csv();

      Output’

        CREATE TABLE quantified(user string, date DateTime, hour int, avg long?, min int?, max int?,
            INDEX idx CLUSTERED(user, date, hour) DISTRIBUTED BY HASH(user));
        INSERT INTO quantified SELECT * FROM @agg;

      • File schema-on-read
      • SQL-like data manipulation
      • C# data types
      • C# integration
      • Can store as relational
      And many more…
      • User-defined aggregators (C#)
      • User-defined operators (C#)
      • Custom Extractors, Outputters (C#)
      • File sets for multiple input files access patterns
      • Credentials
      Inspired by Michael Rys presentation at SQL Server PASS Deutschland 2016
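For readers without access to Azure Data Lake, the Transform step of slide 10 can be approximated locally. The following is a minimal Python sketch, not the deck's code: the sample rows and the function name `aggregate_by_hour` are hypothetical, and only the grouping logic (user, date, hour, then average, minimum, and maximum of `heart`) mirrors the U-SQL script.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows (user, timestamp, heart), mirroring the EXTRACT schema.
rows = [
    ("alice", datetime(2016, 10, 5, 9, 5), 60),
    ("alice", datetime(2016, 10, 5, 9, 40), 80),
    ("alice", datetime(2016, 10, 5, 10, 0), 70),
    ("bob", datetime(2016, 10, 5, 9, 15), 90),
]


def aggregate_by_hour(rows):
    """Group rows by (user, date, hour) and compute avg, min, and max of heart."""
    groups = defaultdict(list)
    for user, timestamp, heart in rows:
        groups[(user, timestamp.date(), timestamp.hour)].append(heart)
    return {
        key: {"avg": sum(values) / len(values), "min": min(values), "max": max(values)}
        for key, values in groups.items()
    }


agg = aggregate_by_hour(rows)
```

The two-step shape of the U-SQL script (derive `date` and `hour` first, then group) collapses here into a single grouping key.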
  • 12. Data sources
      JSON Record
      “Custom Dimensions” contains arrays
  • 13. U-SQL Pipeline
      1. JSON / TSV conversion: cegid-<site>-raw.tsv
      2. Sessions aggregation: cegid-<site>-sessions.tsv
      3. Visitors aggregation: cegid-<site>-visitors.tsv
  • 14. U-SQL Script 1 - JSON / TSV Conversion
      1. *.json files extraction
      2. Records to JSON objects conversion
      3. Cegid IPs filtering
      4. JSON objects fields extraction
      5. Custom dimensions extraction
      6. Custom dimensions pivot
      7. Output to TSV files
  • 15. JSON / TSV Conversion - Step 1/7: *.json files extraction
      Data extracted as raw text

        @json_records_raw =
            EXTRACT json_raw string
            FROM "{*}/{date:yyyy}{date:MM}{date:dd}.json"
            USING Extractors.Text(delimiter : Char.MinValue, quoting : false);

      Input files (Azure Storage Blobs), read into @json_records_raw:
        20160901/20160901.json
        20160902/20160902.json
        20160903/20160903.json
        20160904/20160904.json
        …
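The file set pattern on slide 15 selects one file per day and exposes the date encoded in the file name. The sketch below imitates that behavior in Python with a regular expression; it is only an approximation for illustration, not the U-SQL file set matcher, and `PATTERN` and `match_file_set` are hypothetical names.

```python
import re
from datetime import date

# Rough regex equivalent of "{*}/{date:yyyy}{date:MM}{date:dd}.json".
# Assumption: the wildcard stands for a single folder name.
PATTERN = re.compile(r"^[^/]+/(?P<yyyy>\d{4})(?P<MM>\d{2})(?P<dd>\d{2})\.json$")


def match_file_set(paths):
    """Return (path, date) pairs for the paths that fit the pattern."""
    matched = []
    for path in paths:
        m = PATTERN.match(path)
        if m:
            day = date(int(m["yyyy"]), int(m["MM"]), int(m["dd"]))
            matched.append((path, day))
    return matched


paths = ["20160901/20160901.json", "20160902/20160902.json", "20160902/readme.txt"]
files = match_file_set(paths)
```

Having the date available as a value is what makes it possible to filter a run to a date range without listing files by hand.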
  • 16. JSON / TSV Conversion - Step 2/7: Records to JSON objects conversion

        DECLARE @fields = new SqlArray<string>() { "timestamp", "eventId", "customerId", ... };

        @json_records =
            SELECT JsonFunctions.JsonTuple(json_raw, @fields.ToArray()) AS json_object
            FROM @json_records_raw;
  • 17. JSON / TSV Conversion - Step 3/7: Cegid IPs filtering

        @json_records =
            SELECT json_object["ipAddress"] AS ipAddress,
                   json_object
            FROM @json_records;

        @json_records =
            SELECT *
            FROM @json_records
                 LEFT ANTISEMIJOIN (SELECT ip FROM CegidIps) AS ips
                 ON ipAddress == ips.ip;
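A left anti-semijoin keeps the rows of the left input that have no match in the right input, which is how slide 17 drops traffic coming from Cegid's own IP addresses. A minimal Python sketch of the same idea follows; the function name and the sample records are hypothetical, and the IP addresses come from ranges reserved for documentation.

```python
def left_anti_semi_join(records, excluded_ips):
    """Keep the records whose ipAddress has no match in excluded_ips."""
    excluded = set(excluded_ips)  # set lookup keeps the filter linear in len(records)
    return [record for record in records if record["ipAddress"] not in excluded]


# Hypothetical events; 192.0.2.10 plays the role of an internal Cegid address.
records = [
    {"ipAddress": "192.0.2.10", "eventId": "e1"},
    {"ipAddress": "198.51.100.7", "eventId": "e2"},
    {"ipAddress": "192.0.2.10", "eventId": "e3"},
]
cegid_ips = ["192.0.2.10"]

external = left_anti_semi_join(records, cegid_ips)
```

Only left-side columns survive, as in the U-SQL statement: the right side acts purely as an exclusion list.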
  • 18. JSON / TSV Conversion - Step 4/7: JSON objects fields extraction

        @events =
            SELECT json_object["timestamp"] AS timestamp,
                   json_object["eventId"] AS eventId,
                   json_object["customerId"] AS customerId,
                   // ...
            FROM @json_records;
  • 19. JSON / TSV Conversion - Step 5/7: Custom dimensions extraction

        @events =
            SELECT *,
                   CustomDimensions.ParseFromJson(cd) AS cd_array
            FROM @events;

      Custom dimensions as received:
        index  value
        3      bigFan
        4      powerBuyer
        2      1443654335461
        1      ComkzbXvn8g

      Values ordered by index:
        ComkzbXvn8g
        1443654335461
        bigFan
        powerBuyer
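The deck does not show the body of `CustomDimensions.ParseFromJson`, so the following Python sketch only illustrates the reordering pictured on slide 19. It assumes, hypothetically, that the custom dimensions arrive as a JSON array of `{"index": ..., "value": ...}` objects with 1-based indexes, and that 20 slots are needed (slide 20 goes up to `cd_20`).

```python
import json


def parse_custom_dimensions(cd_json, size=20):
    """Place each value at slot index - 1; slots without an entry stay None."""
    slots = [None] * size
    for item in json.loads(cd_json):
        index = int(item["index"])
        if 1 <= index <= size:  # ignore indexes outside the expected range
            slots[index - 1] = item["value"]
    return slots


# Sample values taken from the slide; the JSON shape itself is an assumption.
cd = json.dumps([
    {"index": 3, "value": "bigFan"},
    {"index": 4, "value": "powerBuyer"},
    {"index": 2, "value": "1443654335461"},
    {"index": 1, "value": "ComkzbXvn8g"},
])

cd_array = parse_custom_dimensions(cd)
```

A fixed-size array makes the next step trivial: each slot maps to exactly one output column.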
  • 20. JSON / TSV Conversion - Step 6/7: Custom dimensions pivot

        @events =
            SELECT timestamp, eventId, customerId, verb,
                   // ...
                   cd_array[0] AS cd_01,
                   cd_array[1] AS cd_02,
                   // ...
                   cd_array[19] AS cd_20
            FROM @events;

      Ordered values pivoted into columns:
        cd_01        cd_02          cd_03   cd_04
        ComkzbXvn8g  1443654335461  bigFan  powerBuyer
  • 21. JSON / TSV Conversion - Step 7/7: Output to TSV files

        OUTPUT
        (
            SELECT *
            FROM @events
            WHERE fullUrl.StartsWith("http://www.cegid.com/uk/")
        )
        TO "cegid-uk-raw.tsv"
        ORDER BY clientId, timestamp ASC
        USING Outputters.Tsv(quoting : false);
  • 22. JSON / TSV Conversion
  • 23. JSON / TSV Conversion - Output
      TSV format
  • 24. U-SQL Script 2 - Sessions Aggregation
      Steps: cegid-<site>-raw.tsv files extraction; events to sessions aggregation; output to TSV files
      N events → M sessions (M >= 1)
      • New session if no activity for 30 minutes
      Aggregation using U-SQL REDUCE
      • Microsoft.Analytics.Interfaces.IReducer
      User Agent parsing: UAParser (C#)
      Geocoding (lon/lat, timezone, time lag): Cegid.GeoTools (C#)
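The 30-minute rule on slide 24 is the core of the reducer. The deck does not include the `IReducer` source, so this is a minimal Python sketch of the rule alone, for one visitor's events. It assumes that a gap strictly longer than 30 minutes starts a new session (the slide does not say how an exact 30-minute gap is treated); `split_into_sessions` is a hypothetical name.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_sessions(timestamps, timeout=SESSION_TIMEOUT):
    """Split one visitor's event timestamps into sessions.

    A new session starts when the gap since the previous event exceeds timeout.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # first event, or gap too long
    return sessions


# Hypothetical events for one visitor: the 35-minute gap opens a second session.
events = [
    datetime(2016, 9, 1, 9, 0),
    datetime(2016, 9, 1, 9, 10),
    datetime(2016, 9, 1, 9, 45),
    datetime(2016, 9, 1, 9, 50),
]
sessions = split_into_sessions(events)
```

The function sorts its input itself; in a distributed reducer the equivalent ordering has to be guaranteed per visitor before the rule is applied.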
  • 25. U-SQL Script 3 - Visitors Aggregation
      Steps: cegid-<site>-sessions.tsv files extraction; sessions to visitors aggregation; output to TSV files
      N sessions → 1 visitor
      Aggregation using U-SQL REDUCE
      • Microsoft.Analytics.Interfaces.IReducer
  • 26. Visitors Aggregation
      Custom dimensions aggregation (C#)
      cd_01 = page category, cd_02 = content type, ..., cd_20 = …

      Input:
        visitor_id  cd_01      cd_02
        A           homepage   articles
        A           solutions  detail
        A           homepage   articles
        A           form       news
        B           homepage   articles
        B           blog       news

      Output ("-" marks a cell left blank on the slide):
        visitor_id  cat_homepage  cat_solutions  cat_blog  cat_form  typ_articles  typ_detail  typ_news
        A           2             1              -         1         2             1           1
        B           1             -              1         -         1             -           1
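The pivot on slide 26 counts, per visitor, how often each custom dimension value occurs. The C# implementation is not shown in the deck; the Python sketch below reproduces the slide's example under the assumption that the `cat_` and `typ_` prefixes are simply prepended to the values of `cd_01` and `cd_02`. The function name is hypothetical.

```python
from collections import Counter, defaultdict

# Input rows from the slide: (visitor_id, cd_01, cd_02).
rows = [
    ("A", "homepage", "articles"),
    ("A", "solutions", "detail"),
    ("A", "homepage", "articles"),
    ("A", "form", "news"),
    ("B", "homepage", "articles"),
    ("B", "blog", "news"),
]


def pivot_custom_dimensions(rows):
    """Count occurrences of each page category and content type per visitor."""
    counts = defaultdict(Counter)
    for visitor_id, cd_01, cd_02 in rows:
        counts[visitor_id]["cat_" + cd_01] += 1  # cd_01 = page category
        counts[visitor_id]["typ_" + cd_02] += 1  # cd_02 = content type
    return {visitor: dict(counter) for visitor, counter in counts.items()}


visitors = pivot_custom_dimensions(rows)
```

Values that never occur for a visitor are simply absent from that visitor's dictionary, which corresponds to the blank cells on the slide.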
  • 27. Lessons Learned
      Keep C# code centralized in dedicated assemblies
      • Code-behind good for prototyping
      Prefer U-SQL to C# for better optimization
      • Distributed execution built from U-SQL script, not C#
      Properly use parallelism
      Know your data
      • Statistics (cardinality, distribution, skewness)
      • Growth
      Understand U-SQL concepts (vertex, partitioning, etc.) and MapReduce design patterns (will help)
      Local mode is good, but do not forget to test on ADL
  • 28. Azure Data Lake Review
      Pros
      • Packaged solution, zero deployment
      • Cloud agility (scalability, elasticity)
      • Web HDFS storage = Hadoop compatible (sqoop, etc.)
      • Integrated development (Visual Studio, local/debug mode)
      • U-SQL = SQL + C# (business code reuse)
      Cons
      • Proprietary solution
      • Not on-premises
      • Learning curve

Editor's notes

  1. Cegid has deployed multiple web sites as interfaces with its customers. To improve its services and better understand its customers, Cegid has equipped its web sites with a tracking solution, which generates data from user events collected on each web site. Cegid aims to set up an efficient way to make this data available to data analysts so they can consume it through machine learning, business intelligence, and data visualization. Google Analytics, in its basic version, only allows working on aggregated information.
  2. Exploring Azure Data Lake: http://tomkerkhove.ghost.io/2015/10/22/exploring-azures-data-lake/ About compute to data vs. data to compute: https://dennyglee.com/2013/03/18/why-use-blob-storage-with-hdinsight-on-azure/ https://azure.microsoft.com/fr-fr/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/ Schema-on-read: any assumptions about the structure of stored data are implicitly encoded in the application/script logic and not explicitly defined through a data definition language (schema-on-write).
  3. “U-SQL - Azure Data Lake Analytics for Developers” by Michael Rys: http://www.slideshare.net/MichaelRys/usql-azure-data-lake-analytics-for-developers Microsoft Machine Learning & Data Science Summit 2016 http://www.slideshare.net/MichaelRys/taming-the-data-science-monster-with-a-new-sword-usql http://www.slideshare.net/MichaelRys/killer-scenarios-with-data-lake-in-azure-with-usql
  4. Use of file set patterns (one file per day). Color conventions: blue = input, red = output, green = accessors.
  5. Raw text is converted to a JSON object using a C# function provided by a Microsoft sample project. The @fields variable was programmatically generated.
  6. The IP address is extracted from the JSON object (ipAddress) so it can be used by the ANTISEMIJOIN operator. CegidIps is a table containing Cegid IP addresses.
  7. All required fields must be explicitly extracted.
  8. The custom dimensions JSON array is parsed and ordered using a custom C# function.
  9. Data is dispatched to the file of the corresponding site (country). The ORDER BY clause should be avoided if later steps do not require sorted data.
  10. U-SQL REDUCE: https://msdn.microsoft.com/en-us/library/azure/mt621336.aspx Same case as in Michael Rys post "How do I combine overlapping ranges using U-SQL?": https://blogs.msdn.microsoft.com/mrys/2016/06/08/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/
  11. U-SQL REDUCE: https://msdn.microsoft.com/en-us/library/azure/mt621336.aspx
  12. Each custom dimension (cd_01, cd_02, …, cd_20) has a pre-defined set of values, which are pivoted and aggregated.
  13. Prefer U-SQL to C# for better optimization: use the PRESORT option on REDUCE rather than sorting data in the IReducer. C# can sometimes be optimized (post filters on file sets). U-SQL Query Execution: https://channel9.msdn.com/Series/AzureDataLake/USQL-QE