U-SQL CASE STUDY
Paris Datageeks, 05/10/2016
About Me
Michel Caradec
• mcaradec@cegid.fr
Project Manager, Software/Data Engineer at Cegid
Background
• Business Intelligence, ETL, OLAP, Data Manipulation
• C#, R, Python
06/10/2016 CEGID3
Agenda
Business Case
Azure Data Lake
U-SQL Case Study
Questions
BUSINESS CASE
Cegid websites are equipped with tracking solutions
• Extend web analytics data
Data Engineer: collects and prepares data
Data Scientists: consume data in Azure ML Studio
• Understand visitor usage (browsing)
• Provide a better experience (recommendations)
Visitors clustering
One shot visitor (86%)
• 1 single visit
• Bounce rate: 63%
• Session length: 2 min
• Mainly on Homepage
Returning visitor (7%)
• Average of 4.5 sessions
• High screen resolution
• Windows OS
• Use of internal Search
• During the day
Tactile visitor (3.5%)
• Average of 2.5 sessions
• 1.5 pages / session
• Bounce rate: 80%
• iOS (iPhone, iPad)
• In the evening and at weekends
• No conversion
Addict visitor (3.5%)
• Average of 7 sessions
• 5 pages / session
• Session length: 6 min
• Mainly Solutions pages
• More conversions
Metrics built from a subset of the dataset; they do not represent real traffic.
AZURE DATA LAKE
Azure Data Lake = Big Data as a Service
ADL Store: repository, schema-on-read, WebHDFS
ADL Analytics: distributed processing using U-SQL
© Microsoft Azure
U-SQL = SQL + C#
Extract → Transform → Output
ADLS, Azure Storage Blobs, Azure SQL

REFERENCE ASSEMBLY [Cegid.DigitalAnalytics.Commons];

// Extract
@data =
    EXTRACT user string,
            timestamp DateTime,
            heart int
    FROM "quantified-{user}.tsv"
    USING Extractors.Tsv();

// Transform
@agg =
    SELECT *,
           timestamp.Date AS date,
           timestamp.Hour AS hour
    FROM @data;

@agg =
    SELECT user, date, hour,
           AVG(heart) AS avg,
           MIN(heart) AS min,
           MAX(heart) AS max
    FROM @agg
    GROUP BY user, date, hour;

// Output
OUTPUT @agg
TO "quantified.csv"
USING Outputters.Csv();

// Output'
CREATE TABLE quantified(user string, date DateTime, hour int,
                        avg long?, min int?, max int?,
                        INDEX idx CLUSTERED(user, date, hour) DISTRIBUTED BY HASH(user));
INSERT INTO quantified SELECT * FROM @agg;

File schema-on-read
SQL-like data manipulation
C# data types
C# integration
Can store as relational

And many more…
User-defined aggregators (C#)
User-defined operators (C#)
Custom Extractors, Outputters (C#)
File sets for multiple input files access patterns
Credentials

Inspired by Michael Rys's presentation at SQL Server PASS Deutschland 2016
CASE STUDY
Data sources
JSON Record
• “Custom Dimensions” contains arrays
U-SQL Pipeline
JSON / TSV conversion
• cegid-<site>-raw.tsv
Sessions aggregation
• cegid-<site>-sessions.tsv
Visitors aggregation
• cegid-<site>-visitors.tsv
U-SQL Script 1 - JSON / TSV Conversion
Step 1/7: *.json files extraction
Step 2/7: Records to JSON objects conversion
Step 3/7: Cegid IPs filtering
Step 4/7: JSON objects fields extraction
Step 5/7: Custom dimensions extraction
Step 6/7: Custom dimensions pivot
Step 7/7: Output to TSV files
JSON / TSV Conversion - Step 1/7: *.json files extraction
Data extracted as raw text

@json_records_raw =
    EXTRACT json_raw string
    FROM "{*}/{date:yyyy}{date:MM}{date:dd}.json"
    USING Extractors.Text(delimiter : Char.MinValue, quoting : false);

Azure Storage Blobs → @json_records_raw
20160901/20160901.json
20160902/20160902.json
20160903/20160903.json
20160904/20160904.json
…
JSON / TSV Conversion - Step 2/7: Records to JSON objects conversion

DECLARE @fields = new SqlArray<string>() { "timestamp", "eventId",
    "customerId", ... };

@json_records =
    SELECT JsonFunctions.JsonTuple(
               json_raw, @fields.ToArray()
           ) AS json_object
    FROM @json_records_raw;
JSON / TSV Conversion - Step 3/7: Cegid IPs filtering

@json_records =
    SELECT json_object["ipAddress"] AS ipAddress,
           json_object
    FROM @json_records;

@json_records =
    SELECT *
    FROM @json_records
         LEFT ANTISEMIJOIN (SELECT ip FROM CegidIps) AS ips
         ON ipAddress == ips.ip;
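The Cegid IPs filtering step relies on a LEFT ANTISEMIJOIN: a record is kept only when its ipAddress has no match in CegidIps. As a rough illustration of that semantics outside U-SQL, here is a minimal Python sketch. The sample records, the IP values and the function name are hypothetical and are not taken from the talk.

```python
def left_anti_semi_join(records, excluded_ips):
    """Keep only the records whose ipAddress has no match in excluded_ips."""
    excluded = set(excluded_ips)  # set lookup instead of a nested loop
    return [r for r in records if r.get("ipAddress") not in excluded]


# Hypothetical sample data: two internal addresses, one external visitor.
records = [
    {"ipAddress": "10.0.0.1", "eventId": "e1"},
    {"ipAddress": "203.0.113.7", "eventId": "e2"},
    {"ipAddress": "10.0.0.2", "eventId": "e3"},
]
internal_ips = ["10.0.0.1", "10.0.0.2"]

external = left_anti_semi_join(records, internal_ips)
```

Only the left-hand columns survive, which is why the U-SQL version can select `*` from @json_records without pulling in any column from CegidIps.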
JSON / TSV Conversion - Step 4/7: JSON objects fields extraction

@events =
    SELECT json_object["timestamp"] AS timestamp,
           json_object["eventId"] AS eventId,
           json_object["customerId"] AS customerId,
           // ...
    FROM @json_records;
JSON / TSV Conversion - Step 5/7: Custom dimensions extraction

@events =
    SELECT *, CustomDimensions.ParseFromJson(cd) AS cd_array
    FROM @events;

index  value
3      bigFan
4      powerBuyer
2      1443654335461
1      ComkzbXvn8g

value (ordered)
ComkzbXvn8g
1443654335461
bigFan
powerBuyer
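The index/value example above suggests that the custom dimensions are sorted by index before being exposed as an array. The real CustomDimensions.ParseFromJson lives in the Cegid assembly and is not shown in the deck, so the following Python sketch only illustrates the ordering idea. It assumes, hypothetically, that the cd field is a JSON array of objects with "index" and "value" keys.

```python
import json


def parse_custom_dimensions(cd_json):
    """Return the custom dimension values sorted by their index."""
    # Assumed input shape: '[{"index": 3, "value": "bigFan"}, ...]'
    entries = json.loads(cd_json) if cd_json else []
    ordered = sorted(entries, key=lambda entry: int(entry["index"]))
    return [entry["value"] for entry in ordered]


# Same values as the slide, in the same unordered sequence.
cd = json.dumps([
    {"index": 3, "value": "bigFan"},
    {"index": 4, "value": "powerBuyer"},
    {"index": 2, "value": "1443654335461"},
    {"index": 1, "value": "ComkzbXvn8g"},
])

cd_array = parse_custom_dimensions(cd)
```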
JSON / TSV Conversion - Step 6/7: Custom dimensions pivot

@events =
    SELECT timestamp,
           eventId,
           customerId,
           verb,
           // ...
           cd_array[0] AS cd_01,
           cd_array[1] AS cd_02,
           // ...
           cd_array[19] AS cd_20
    FROM @events;

value
ComkzbXvn8g
1443654335461
bigFan
powerBuyer

cd_01        cd_02          cd_03   cd_04
ComkzbXvn8g  1443654335461  bigFan  powerBuyer
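The pivot turns the ordered array of values into fixed columns cd_01 to cd_20. A small Python sketch of the same idea follows. Padding missing positions with None is an assumption made here for illustration, since the deck does not say how arrays shorter than 20 items are handled, and the function name is hypothetical.

```python
def pivot_custom_dimensions(cd_array, width=20):
    """Map an ordered list of values to columns cd_01 .. cd_<width>."""
    values = list(cd_array[:width])
    # Assumption: positions without a value are filled with None.
    padded = values + [None] * (width - len(values))
    return {f"cd_{i + 1:02d}": value for i, value in enumerate(padded)}


row = pivot_custom_dimensions(
    ["ComkzbXvn8g", "1443654335461", "bigFan", "powerBuyer"]
)
```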
JSON / TSV Conversion - Step 7/7: Output to TSV files

OUTPUT
(
    SELECT *
    FROM @events
    WHERE fullUrl.StartsWith("http://www.cegid.com/uk/")
)
TO "cegid-uk-raw.tsv"
ORDER BY clientId, timestamp ASC
USING Outputters.Tsv(quoting : false);
JSON / TSV Conversion - Output
TSV format
U-SQL Script 2 - Sessions Aggregation
N events → M sessions (M >= 1)
• New session if no activity for 30 minutes
Aggregation using U-SQL REDUCE
• Microsoft.Analytics.Interfaces.IReducer
User Agent parsing: UAParser (C#)
Geocoding (lon/lat, timezone, time lag): Cegid.GeoTools (C#)
cegid-<site>-raw.tsv files extraction → Events to sessions aggregation → Output to TSV files
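The session rule (a new session starts after 30 minutes without activity) is implemented in the talk as a C# IReducer, which is not shown. The Python sketch below only illustrates the splitting logic for one visitor's events. It assumes that a gap of exactly 30 minutes already starts a new session, and the function name and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_sessions(timestamps, timeout=SESSION_TIMEOUT):
    """Group one visitor's event timestamps into sessions."""
    sessions = []
    for ts in sorted(timestamps):
        # Same session only if the gap since the previous event is below the timeout.
        if sessions and ts - sessions[-1][-1] < timeout:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical events for a single visitor.
events = [
    datetime(2016, 9, 1, 9, 0),
    datetime(2016, 9, 1, 9, 10),
    datetime(2016, 9, 1, 9, 45),   # 35 min after the previous event
    datetime(2016, 9, 1, 10, 0),
    datetime(2016, 9, 1, 11, 0),   # 60 min after the previous event
]

sessions = split_into_sessions(events)
session_sizes = [len(s) for s in sessions]
```

In a REDUCE, the same loop would run once per visitor key, which is why the raw TSV is written ordered by visitor and timestamp in the previous script.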
U-SQL Script 3 - Visitors Aggregation
N sessions → 1 visitor
Aggregation using U-SQL REDUCE
• Microsoft.Analytics.Interfaces.IReducer
cegid-<site>-sessions.tsv files extraction → Sessions to visitors aggregation → Output to TSV files
Visitors Aggregation
Custom dimensions aggregation (C#)
cd_01 = page category
cd_02 = content type
...
cd_20 = …

visitor_id  cd_01      cd_02
A           homepage   articles
A           solutions  detail
A           homepage   articles
A           form       news
B           homepage   articles
B           blog       news

visitor_id  cat_homepage  cat_solutions  cat_blog  cat_form  typ_articles  typ_detail  typ_news
A           2             1              -         1         2             1           1
B           1             -              1         -         1             -           1
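The visitor-level table counts, for each visitor, how often each page category (cd_01) and content type (cd_02) occurs. The deck does this in C#; the Python sketch below reproduces the counting on the slide's sample rows. The function name is hypothetical, and the cat_/typ_ prefixes follow the column names shown above.

```python
from collections import Counter, defaultdict


def aggregate_custom_dimensions(rows):
    """Count page categories (cd_01) and content types (cd_02) per visitor."""
    counts = defaultdict(Counter)
    for visitor_id, cd_01, cd_02 in rows:
        counts[visitor_id]["cat_" + cd_01] += 1
        counts[visitor_id]["typ_" + cd_02] += 1
    return {visitor: dict(counter) for visitor, counter in counts.items()}


# Sample rows from the slide: (visitor_id, cd_01, cd_02).
rows = [
    ("A", "homepage", "articles"),
    ("A", "solutions", "detail"),
    ("A", "homepage", "articles"),
    ("A", "form", "news"),
    ("B", "homepage", "articles"),
    ("B", "blog", "news"),
]

per_visitor = aggregate_custom_dimensions(rows)
```

Values that never occur for a visitor are simply absent from that visitor's dictionary, which corresponds to the empty cells in the table.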
Lessons Learned
Keep C# code centralized in dedicated assemblies
• Code-behind is good for prototyping
Prefer U-SQL to C# for better optimization
• Distributed execution is built from the U-SQL script, not from C#
Use parallelism properly
Know your data
• Statistics (cardinality, distribution, skewness)
• Growth
Understand U-SQL concepts (vertex, partitioning, etc.); knowing MapReduce design patterns helps
Local mode is good, but do not forget to test on ADL
Azure Data Lake Review
Pros
Packaged solution, zero deployment
Cloud agility (scalability, elasticity)
WebHDFS storage = Hadoop compatible (sqoop, etc.)
Integrated development (Visual Studio, local/debug mode)
U-SQL = SQL + C# (business code reuse)
Cons
Proprietary solution
Not on-premises
Learning curve
QUESTIONS
Thank you for your attention

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Paris Datageeks meetup 05102016

  • 1.
  • 2. U-SQL CASE STUDY Paris Datageeks, 05/10/2016
  • 3. About Me
       Michel Caradec
         • mcaradec@cegid.fr
       Project Manager, Software/Data Engineer at Cegid
       Background
         • Business Intelligence, ETL, OLAP, Data Manipulation
         • C#, R, Python
  • 4. Agenda
       Business Case
       Azure Data Lake
       U-SQL Case Study
       Questions
  • 6. Business Case
       Cegid web sites armed with tracking solutions
         • Extend web analytics data
       Data Engineer: collect and prepare data
       Data Scientists: consume data in Azure ML Studio
         • Visitors usage knowledge (browsing)
         • Provide better experience (recommendations)
  • 7. Business Case
       Visitors clustering
       One shot visitor (86%)
         • 1 single visit
         • Bounce rate: 63%
         • Session length: 2 min
         • Mainly on Homepage
       Returning visitor (7%)
         • Average of 4.5 sessions
         • High screen resolution
         • Windows OS
         • Use of internal Search
         • During the day
       Tactile visitor (3.5%)
         • Average of 2.5 sessions
         • 1.5 pages / session
         • Bounce rate: 80%
         • iOS (iPhone, iPad)
         • In the evening and at weekends
         • No conversion
       Addict visitor (3.5%)
         • Average of 7 sessions
         • 5 pages / session
         • Session length: 6 min
         • Mainly Solutions pages
         • More conversions
       Metrics built from a subset of the dataset. Do not represent real traffic.
  • 9. Azure Data Lake = Big Data as a Service
       ADL Store: repository, schema-on-read, Web HDFS
       ADL Analytics: distributed processing using U-SQL
       © Microsoft Azure
  • 10. U-SQL = SQL + C#
       Extract / Transform / Output
       ADLS, Azure Storage Blobs, Azure SQL
         • File schema-on-read
         • SQL-like data manipulation
         • C# data types
         • C# integration
         • Can store as relational
       And many more…
         • User-defined aggregators (C#)
         • User-defined operators (C#)
         • Custom Extractors, Outputters (C#)
         • File sets for multiple input files access patterns
         • Credentials
       Inspired by Michael Rys presentation at SQL Server PASS Deutschland 2016

           REFERENCE ASSEMBLY [Cegid.DigitalAnalytics.Commons];

       Extract:
           @data = EXTRACT user string, timestamp DateTime, heart int
                   FROM "quantified-{user}.tsv"
                   USING Extractors.Tsv();

       Transform:
           @agg = SELECT *, timestamp.Date AS date, timestamp.Hour AS hour
                  FROM @data;
           @agg = SELECT user, date, hour, AVG(heart) AS avg, MIN(heart) AS min,
                         MAX(heart) AS max
                  FROM @agg
                  GROUP BY user, date, hour;

       Output:
           OUTPUT @agg TO "quantified.csv"
           USING Outputters.Csv();

       Output’:
           CREATE TABLE quantified(user string, date DateTime, hour int,
                                   avg long?, min int?, max int?,
                                   INDEX idx CLUSTERED(user, date, hour)
                                   DISTRIBUTED BY HASH(user));
           INSERT INTO quantified SELECT * FROM @agg;
  • 12. Data sources
       JSON Record
         • “Custom Dimensions”
         • Contains arrays
  • 13. U-SQL Pipeline
       1. JSON / TSV conversion
         • cegid-<site>-raw.tsv
       2. Sessions aggregation
         • cegid-<site>-sessions.tsv
       3. Visitors aggregation
         • cegid-<site>-visitors.tsv
  • 14. U-SQL Script 1 - JSON / TSV Conversion
       1. *.json files extraction
       2. Records to JSON objects conversion
       3. Cegid IPS filtering
       4. JSON objects fields extraction
       5. Custom dimensions extraction
       6. Custom dimensions pivot
       7. Output to TSV files
  • 15. JSON / TSV Conversion - Step 1/7 (*.json files extraction)
       Data extracted as raw text:
           @json_records_raw =
               EXTRACT json_raw string
               FROM "{*}/{date:yyyy}{date:MM}{date:dd}.json"
               USING Extractors.Text(delimiter : Char.MinValue, quoting : false);
       Azure Storage Blobs files read into @json_records_raw:
           20160901/20160901.json
           20160902/20160902.json
           20160903/20160903.json
           20160904/20160904.json
           …
  • 16. JSON / TSV Conversion - Step 2/7 (Records to JSON objects conversion)
           DECLARE @fields = new SqlArray<string>() { "timestamp", "eventId", "customerId", ... };
           @json_records =
               SELECT JsonFunctions.JsonTuple(json_raw, @fields.ToArray()) AS json_object
               FROM @json_records_raw;
  • 17. JSON / TSV Conversion - Step 3/7 (Cegid IPS filtering)
           @json_records =
               SELECT json_object["ipAddress"] AS ipAddress,
                      json_object
               FROM @json_records;
           @json_records =
               SELECT *
               FROM @json_records
                    LEFT ANTISEMIJOIN (SELECT ip FROM CegidIps) AS ips
                    ON ipAddress == ips.ip;
  • 18. JSON / TSV Conversion - Step 4/7 (JSON objects fields extraction)
           @events =
               SELECT json_object["timestamp"] AS timestamp,
                      json_object["eventId"] AS eventId,
                      json_object["customerId"] AS customerId,
                      // ...
               FROM @json_records;
  • 19. JSON / TSV Conversion - Step 5/7 (Custom dimensions extraction)
           @events =
               SELECT *,
                      CustomDimensions.ParseFromJson(cd) AS cd_array
               FROM @events;
       Custom dimensions as received:
           index   value
           3       bigFan
           4       powerBuyer
           2       1443654335461
           1       ComkzbXvn8g
       After parsing:
           value (ordered)
           ComkzbXvn8g
           1443654335461
           bigFan
           powerBuyer
  • 20. JSON / TSV Conversion - Step 6/7 (Custom dimensions pivot)
           @events =
               SELECT timestamp, eventId, customerId, verb,
                      // ...
                      cd_array[0] AS cd_01,
                      cd_array[1] AS cd_02,
                      // ...
                      cd_array[19] AS cd_20
               FROM @events;
       Before pivot:
           value
           ComkzbXvn8g
           1443654335461
           bigFan
           powerBuyer
       After pivot:
           cd_01         cd_02           cd_03    cd_04
           ComkzbXvn8g   1443654335461   bigFan   powerBuyer
  • 21. JSON / TSV Conversion - Step 7/7 (Output to TSV files)
           OUTPUT
           (
               SELECT *
               FROM @events
               WHERE fullUrl.StartsWith("http://www.cegid.com/uk/")
           )
           TO "cegid-uk-raw.tsv"
           ORDER BY clientId, timestamp ASC
           USING Outputters.Tsv(quoting : false);
  • 22. JSON / TSV Conversion
  • 23. JSON / TSV Conversion - Output
       TSV format
  • 24. U-SQL Script 2 - Sessions Aggregation
       N events → M sessions (M >= 1)
         • New session if no activity for 30 minutes
       Aggregation using U-SQL REDUCE
         • Microsoft.Analytics.Interfaces.IReducer
       User Agent parsing: UAParser (C#)
       Geocoding (lon/lat, timezone, time lag): Cegid.GeoTools (C#)
       Steps:
         1. cegid-<site>-raw.tsv files extraction
         2. Events to sessions aggregation
         3. Output to TSV files
  • 25. U-SQL Script 3 - Visitors Aggregation
       N sessions → 1 visitor
       Aggregation using U-SQL REDUCE
         • Microsoft.Analytics.Interfaces.IReducer
       Steps:
         1. cegid-<site>-sessions.tsv files extraction
         2. Sessions to visitors aggregation
         3. Output to TSV files
  • 26. Visitors Aggregation
       Custom dimensions aggregation (C#)
       cd_01 = page category, cd_02 = content type, ... cd_20 = …
       Input:
           visitor_id   cd_01       cd_02
           A            homepage    articles
           A            solutions   detail
           A            homepage    articles
           A            form        news
           B            homepage    articles
           B            blog        news
       Output (blank = no occurrence):
           visitor_id   cat_homepage   cat_solutions   cat_blog   cat_form   typ_articles   typ_detail   typ_news
           A            2              1                          1          2              1            1
           B            1                              1                     1                           1
  • 27. Lessons Learned
       Keep C# code centralized in dedicated assemblies
         • Code-behind good for prototyping
       Prefer U-SQL to C# for better optimization
         • Distributed execution built from U-SQL script, not C#
       Properly use parallelism
       Know your data
         • Statistics (cardinality, distribution, skewness)
         • Growth
       Understand U-SQL concepts (vertex, partitioning, etc.) and MapReduce design patterns (will help)
       Local mode is good, but do not forget to test on ADL
  • 28. Azure Data Lake Review
       Pros
         • Packaged solution, zero deployment
         • Cloud agility (scalability, elasticity)
         • Web HDFS storage = Hadoop compatible (sqoop, etc.)
         • Integrated development (Visual Studio, local/debug mode)
         • U-SQL = SQL + C# (business code reuse)
       Cons
         • Proprietary solution
         • Not on-premises
         • Learning curve
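
The sessions aggregation rule from slide 24 (a new session starts when there has been no activity for 30 minutes) can be illustrated outside U-SQL. The following is a minimal Python sketch of that rule only; it is not the IReducer used in the deck. The function name `split_sessions`, the `(client_id, timestamp)` event shape and the choice to treat a gap of exactly 30 minutes as a new session are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Inactivity threshold taken from the slides: 30 minutes.
SESSION_TIMEOUT = timedelta(minutes=30)


def split_sessions(events, timeout=SESSION_TIMEOUT):
    """Group (client_id, timestamp) events into per-client sessions.

    Returns a list of (client_id, session_start, session_end, event_count).
    Assumption: a gap greater than or equal to `timeout` opens a new session.
    """
    sessions = []
    # Sort by client, then by time, so each client's events are contiguous
    # and chronological before they are scanned.
    ordered = sorted(events, key=itemgetter(0, 1))
    for client_id, group in groupby(ordered, key=itemgetter(0)):
        start = end = None
        count = 0
        for _, ts in group:
            if start is not None and ts - end >= timeout:
                # Inactivity gap reached: close the current session.
                sessions.append((client_id, start, end, count))
                start, count = None, 0
            if start is None:
                start = ts
            end = ts
            count += 1
        # Close the last open session for this client.
        sessions.append((client_id, start, end, count))
    return sessions
```

Sorting before scanning mirrors the advice in the notes to let the engine order the rows (PRESORT on REDUCE) rather than sorting inside the reducer.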

Editor's notes

  1. Cegid has deployed multiple web sites as interfaces with its customers. To improve its services and better understand its customers, Cegid has equipped these web sites with a tracking solution, which generates data from the user events collected on each site. Cegid aims to set up an efficient way to make this data available to data analysts so they can consume it through Machine Learning, Business Intelligence and DataViz. Google Analytics, in its basic version, only allows working on aggregated information.
  2. Exploring Azure Data Lake: http://tomkerkhove.ghost.io/2015/10/22/exploring-azures-data-lake/ About compute to data vs data to compute: https://dennyglee.com/2013/03/18/why-use-blob-storage-with-hdinsight-on-azure/ https://azure.microsoft.com/fr-fr/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/ Schema-on-read: any assumptions about the structure of stored data are implicitly encoded in the application/script logic, rather than explicitly defined through a data definition language as with schema-on-write.
  3. “U-SQL - Azure Data Lake Analytics for Developers” by Michael Rys: http://www.slideshare.net/MichaelRys/usql-azure-data-lake-analytics-for-developers Microsoft Machine Learning & Data Science Summit 2016 http://www.slideshare.net/MichaelRys/taming-the-data-science-monster-with-a-new-sword-usql http://www.slideshare.net/MichaelRys/killer-scenarios-with-data-lake-in-azure-with-usql
  4. Use of file set patterns (1 file a day). Color conventions: blue = input, red = output, green = accessors.
  5. Raw text is converted to a JSON object using a C# function provided by a Microsoft sample project. The @fields variable was programmatically generated.
  6. The IP address is extracted from the JSON object (ipAddress) so it can be used by the ANTISEMIJOIN operator. CegidIps is a table containing Cegid IP addresses.
  7. All required fields must be explicitly extracted.
  8. The custom dimensions JSON array is parsed and ordered using a custom C# function.
  9. Data is dispatched to the corresponding site (country) file. The ORDER BY statement should be avoided if it is not required for later steps.
  10. U-SQL REDUCE: https://msdn.microsoft.com/en-us/library/azure/mt621336.aspx Same case as in Michael Rys's post "How do I combine overlapping ranges using U-SQL?": https://blogs.msdn.microsoft.com/mrys/2016/06/08/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/
  11. U-SQL REDUCE: https://msdn.microsoft.com/en-us/library/azure/mt621336.aspx
  12. Each custom dimension (cd_01, cd_02, …, cd_20) has a pre-defined set of values, which are pivoted and aggregated.
  13. Prefer U-SQL to C# for better optimization: use the PRESORT option on REDUCE rather than sorting data in the IReducer. C# can sometimes be optimized (post filters on file sets). U-SQL Query Execution: https://channel9.msdn.com/Series/AzureDataLake/USQL-QE
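
Notes 8 and 12 describe two custom-dimension steps: ordering the parsed index/value pairs into a fixed-width array (slides 19 and 20), then pivoting the pre-defined values into per-visitor counts (slide 26). The sketch below shows one way that logic could look in plain Python. It is not the C# code used in the pipeline: the function names, the `(index, value)` input shape and the `prefixes` mapping from dimension to column prefix are all hypothetical.

```python
from collections import Counter, defaultdict

CD_SLOTS = 20  # the slides expose cd_01 .. cd_20


def order_custom_dimensions(pairs, slots=CD_SLOTS):
    """Place (index, value) pairs into a fixed-size list ordered by index.

    Indexes are 1-based, so slot 0 holds cd_01. Missing indexes stay None;
    indexes outside 1..slots are ignored (an assumption of this sketch).
    """
    cd_array = [None] * slots
    for index, value in pairs:
        if 1 <= index <= slots:
            cd_array[index - 1] = value
    return cd_array


def count_dimension_values(rows, prefixes):
    """Pivot per-event dimension values into per-visitor counts.

    rows: iterable of (visitor_id, {dimension_name: value}) tuples.
    prefixes: hypothetical mapping from dimension name to column prefix,
              for example {"cd_01": "cat", "cd_02": "typ"}.
    Returns {visitor_id: {"<prefix>_<value>": count}}.
    """
    counts = defaultdict(Counter)
    for visitor_id, dims in rows:
        for dim, value in dims.items():
            if value is not None and dim in prefixes:
                counts[visitor_id][f"{prefixes[dim]}_{value}"] += 1
    return {visitor: dict(counter) for visitor, counter in counts.items()}
```

Applied to the slide 19 example, `order_custom_dimensions` returns ComkzbXvn8g, 1443654335461, bigFan and powerBuyer in the first four slots. Applied to the six input rows of slide 26 with the prefixes `cat` and `typ`, `count_dimension_values` gives visitor A `cat_homepage = 2` and visitor B `cat_blog = 1`, matching the output table.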