3. About Me
Michel Caradec
• mcaradec@cegid.fr
Project Manager, Software/Data Engineer at Cegid
Background
• Business Intelligence, ETL, OLAP, Data Manipulation
• C#, R, Python
06/10/2016 CEGID3
6. Business Case
Cegid web sites armed with tracking solutions
• Extend web analytics data
Data Engineer: collect and prepare data
Data Scientists: consume data in Azure ML Studio
• Visitors usage knowledge (browsing)
• Provide better experience (recommendations)
7. Business Case
Visitors clustering
One shot visitor: 86%
• 1 single visit
• Bounce rate: 63%
• Session length: 2 min
• Mainly on Homepage
Returning visitor: 7%
• Average of 4.5 sessions
• High screen resolution
• Windows OS
• Use of internal Search
• During the day
Tactile visitor: 3.5%
• Average of 2.5 sessions
• 1.5 pages / session
• Bounce rate: 80%
• iOS (iPhone, iPad)
• In the evening and on weekends
• No conversion
Addict visitor: 3.5%
• Average of 7 sessions
• 5 pages / session
• Session length: 6 min
• Mainly Solutions pages
• More conversions
Metrics built from a subset of the dataset; they do not represent real traffic.
10. U-SQL = SQL + C#
    // Reference a dedicated C# assembly.
    REFERENCE ASSEMBLY [Cegid.DigitalAnalytics.Commons];
    // Extract: schema-on-read; {user} in the file name feeds the user column.
    @data = EXTRACT user string, timestamp DateTime, heart int
            FROM "quantified-{user}.tsv"
            USING Extractors.Tsv();
    // Transform: derive date and hour, then aggregate per user, date and hour.
    @agg = SELECT *, timestamp.Date AS date, timestamp.Hour AS hour
           FROM @data;
    @agg = SELECT user, date, hour, AVG(heart) AS avg, MIN(heart) AS min,
                  MAX(heart) AS max
           FROM @agg
           GROUP BY user, date, hour;
    // Output: write the aggregates to a CSV file.
    OUTPUT @agg TO "quantified.csv"
    USING Outputters.Csv();
Extract Transform Output
ADLS, Azure Storage Blobs, Azure SQL
File schema-on-read
SQL-like data manipulation
C# data types
C# integration
Can store as relational
And many more…
User-defined aggregators (C#)
User-defined operators (C#)
Custom Extractors, Outputters (C#)
File sets for multiple input files access patterns
Credentials
Inspired by Michael Rys's presentation at SQL Server PASS Deutschland 2016
Output’ (store as relational):
    CREATE TABLE quantified(user string, date DateTime, hour int,
                            avg long?, min int?, max int?,
        INDEX idx CLUSTERED(user, date, hour) DISTRIBUTED BY HASH(user));
    INSERT INTO quantified SELECT * FROM @agg;
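As a rough illustration of what the Transform step of the script on this slide computes, here is a minimal Python sketch of the same grouping (user, date, hour) with average, minimum and maximum. It says nothing about how U-SQL actually executes the query; the function name `aggregate_heart_rate` and the sample rows are made up for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_heart_rate(rows):
    """Group (user, timestamp, heart) rows by user, date and hour.

    Mirrors the GROUP BY user, date, hour of the U-SQL script and returns
    one (user, date, hour, avg, min, max) tuple per group.
    """
    groups = defaultdict(list)
    for user, timestamp, heart in rows:
        groups[(user, timestamp.date(), timestamp.hour)].append(heart)
    return [
        (user, date, hour, mean(values), min(values), max(values))
        for (user, date, hour), values in sorted(groups.items())
    ]


# Hypothetical sample data, not taken from the presentation.
sample = [
    ("alice", datetime(2016, 10, 6, 9, 5), 60),
    ("alice", datetime(2016, 10, 6, 9, 45), 80),
    ("alice", datetime(2016, 10, 6, 10, 0), 70),
    ("bob", datetime(2016, 10, 6, 9, 30), 90),
]
result = aggregate_heart_rate(sample)
```

With this sample, the two readings of alice between 9:00 and 10:00 collapse into a single row, while her 10:00 reading and bob's reading each form their own group.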
24. U-SQL Script 2 - Sessions Aggregation
N events -> M sessions (M >= 1)
• New session if no activity for 30 minutes
Aggregation using U-SQL REDUCE
• Microsoft.Analytics.Interfaces.IReducer
User Agent parsing: UAParser (C#)
Geocoding (lon/lat, timezone, time lag): Cegid.GeoTools (C#)
cegid-<site>-raw.tsv files extraction -> Events to sessions aggregation -> Output to TSV files
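The 30-minute rule can be sketched in a few lines of Python. This only illustrates the logic, not the C# IReducer used in the project. The function name `split_into_sessions` is hypothetical, and treating a gap of exactly 30 minutes as a new session is an assumption, since the slide does not say which side of the boundary it falls on.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # threshold stated on the slide


def split_into_sessions(timestamps, timeout=SESSION_TIMEOUT):
    """Split one visitor's event timestamps into sessions.

    Assumption: a gap of `timeout` or more since the previous event
    starts a new session.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < timeout:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # first event, or inactivity gap reached
    return sessions


# Hypothetical events for a single visitor.
events = [
    datetime(2016, 10, 6, 9, 0),
    datetime(2016, 10, 6, 9, 10),
    datetime(2016, 10, 6, 9, 50),   # 40 minutes after the previous event
    datetime(2016, 10, 6, 10, 5),
]
sessions = split_into_sessions(events)
```

The events have to be ordered by time before the gaps can be measured, which is why the sketch sorts them first. Any non-empty list of events yields at least one session, matching M >= 1.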
25. U-SQL Script 3 - Visitors Aggregation
N sessions -> 1 visitor
Aggregation using U-SQL REDUCE
• Microsoft.Analytics.Interfaces.IReducer
cegid-<site>-sessions.tsv files extraction -> Sessions to visitors aggregation -> Output to TSV files
26. Visitors Aggregation
Custom dimensions aggregation (C#)
Input:
visitor_id  cd_01      cd_02
A           homepage   articles
A           solutions  detail
A           homepage   articles
A           form       news
B           homepage   articles
B           blog       news
Output:
visitor_id  cat_homepage  cat_solutions  cat_blog  cat_form  typ_articles  typ_detail  typ_news
A           2             1              -         1         2             1           1
B           1             -              1         -         1             -           1
(- = no occurrence)
cd_01 = page category
cd_02 = content type
...
cd_20 = …
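The pivot shown in the table above can be reproduced with a short Python sketch. It is not the C# code used in the project: the function name `pivot_custom_dimensions` is hypothetical, and only the two dimensions and the `cat_` / `typ_` prefixes visible on the slide are handled.

```python
from collections import Counter, defaultdict

# Prefix of the pivoted columns for each custom dimension,
# following the column names shown on the slide.
PREFIXES = {"cd_01": "cat_", "cd_02": "typ_"}


def pivot_custom_dimensions(rows):
    """Count, per visitor, the occurrences of each custom dimension value."""
    counts = defaultdict(Counter)
    for row in rows:
        for dimension, prefix in PREFIXES.items():
            counts[row["visitor_id"]][prefix + row[dimension]] += 1
    return {visitor: dict(counter) for visitor, counter in counts.items()}


# Input rows copied from the slide.
rows = [
    {"visitor_id": "A", "cd_01": "homepage", "cd_02": "articles"},
    {"visitor_id": "A", "cd_01": "solutions", "cd_02": "detail"},
    {"visitor_id": "A", "cd_01": "homepage", "cd_02": "articles"},
    {"visitor_id": "A", "cd_01": "form", "cd_02": "news"},
    {"visitor_id": "B", "cd_01": "homepage", "cd_02": "articles"},
    {"visitor_id": "B", "cd_01": "blog", "cd_02": "news"},
]
pivoted = pivot_custom_dimensions(rows)
```

Values a visitor never produced are simply absent from the resulting dictionary, which corresponds to the empty cells of the output table.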
27. Lessons Learned
Keep C# code centralized in dedicated assemblies
• Code-behind good for prototyping
Prefer U-SQL to C# for better optimization
• Distributed execution built from U-SQL script, not C#
Properly use parallelism
Know your data
• Statistics (cardinality, distribution, skewness)
• Growth
Understand U-SQL concepts (vertex, partitioning, etc.) and MapReduce design patterns (will help)
Local mode is good, but do not forget to test on ADL
28. Azure Data Lake Review
Pros
• Packaged solution, zero deployment
• Cloud agility (scalability, elasticity)
• WebHDFS storage = Hadoop compatible (Sqoop, etc.)
• Integrated development (Visual Studio, local/debug mode)
• U-SQL = SQL + C# (business code reuse)
Cons
• Proprietary solution
• Not on-premises
• Learning curve
Cegid has deployed multiple web sites as interfaces with its customers.
In an effort to improve its services, and better understand its customers, Cegid has armed its web sites with a tracking solution.
This tracking solution generates data from the user events collected on each web site.
Cegid aims to set up an efficient way to make this data available to data analysts so they can consume it through:
• Machine Learning.
• Business Intelligence and DataViz.
Google Analytics, in its basic version, only allows working on aggregated information.
Exploring Azure Data Lake: http://tomkerkhove.ghost.io/2015/10/22/exploring-azures-data-lake/
About compute to data vs data to compute:
https://dennyglee.com/2013/03/18/why-use-blob-storage-with-hdinsight-on-azure/
https://azure.microsoft.com/fr-fr/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/
Schema-on-read: any assumptions about the structure of stored data are implicitly encoded in the application/script logic and not explicitly defined through a data definition language (schema-on-write).
“U-SQL - Azure Data Lake Analytics for Developers” by Michael Rys: http://www.slideshare.net/MichaelRys/usql-azure-data-lake-analytics-for-developers
Microsoft Machine Learning & Data Science Summit 2016
http://www.slideshare.net/MichaelRys/taming-the-data-science-monster-with-a-new-sword-usql
http://www.slideshare.net/MichaelRys/killer-scenarios-with-data-lake-in-azure-with-usql
Use of file set patterns (1 file a day).
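To make the idea of a file set pattern more concrete, here is a small Python sketch that matches file names against a pattern such as "quantified-{user}.tsv" (the one used in the sample script) and recovers the placeholder value, much as the script obtains its user column from the file name. It only handles simple {name} placeholders; date-based patterns like the one-file-a-day case are not covered, and both function names are hypothetical.

```python
import re


def file_set_regex(pattern):
    """Turn a pattern such as "quantified-{user}.tsv" into a compiled regex.

    Each {name} placeholder becomes a named group; everything else is
    matched literally.
    """
    # re.split with one capture group alternates literal text and names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/]+)"
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{regex}$")


def match_file_set(pattern, file_names):
    """Return (file_name, placeholder_values) for every matching file."""
    regex = file_set_regex(pattern)
    matches = []
    for name in file_names:
        match = regex.match(name)
        if match:
            matches.append((name, match.groupdict()))
    return matches


# Hypothetical file names.
files = ["quantified-alice.tsv", "quantified-bob.tsv", "other.tsv"]
matched = match_file_set("quantified-{user}.tsv", files)
```

Files that do not fit the pattern are ignored, and each matching file contributes its placeholder value as an extra field.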
Color conventions:
Blue = input
Red = output
Green = accessors
Raw text is converted to a JSON object using a C# function provided by a Microsoft sample project.
The @fields variable was generated programmatically.
The IP address is extracted from the JSON object (ipAddress) so it can be used by the ANTISEMIJOIN operator.
CegidIps is a table containing Cegid IP addresses.
All required fields must be explicitly extracted.
The custom dimensions JSON array is parsed and organized using a custom C# function.
Data is dispatched to the corresponding site (country) file.
ORDER BY statement should be avoided if not required for future steps.
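The role of the ANTISEMIJOIN against CegidIps in the notes above (dropping events that come from Cegid's own IP addresses) can be illustrated with a minimal Python sketch. The function name `left_anti_semi_join`, the `page` field and the sample addresses (taken from documentation-reserved ranges) are assumptions made for the example; only the `ipAddress` field name comes from the notes.

```python
def left_anti_semi_join(events, excluded_ips, key="ipAddress"):
    """Keep the events whose `key` value has no match in `excluded_ips`."""
    excluded = set(excluded_ips)
    return [event for event in events if event.get(key) not in excluded]


# Hypothetical stand-in for the CegidIps table.
cegid_ips = ["192.0.2.10", "192.0.2.11"]

# Hypothetical events after JSON parsing.
events = [
    {"ipAddress": "192.0.2.10", "page": "homepage"},
    {"ipAddress": "198.51.100.7", "page": "solutions"},
    {"ipAddress": "203.0.113.5", "page": "blog"},
]
external_events = left_anti_semi_join(events, cegid_ips)
```

Only the left-hand rows are returned, and only those without a match, which is what distinguishes an anti-semi-join from a regular join.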
U-SQL REDUCE: https://msdn.microsoft.com/en-us/library/azure/mt621336.aspx
Same case as in Michael Rys post "How do I combine overlapping ranges using U-SQL?": https://blogs.msdn.microsoft.com/mrys/2016/06/08/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/
Each custom dimension (cd_01, cd_02, …, cd_20) has a pre-defined set of values, which are pivoted and aggregated.
Prefer U-SQL to C# for better optimization:
Use the PRESORT option on REDUCE rather than sorting data in the IReducer.
C# can sometimes be optimized (post filters on file sets).
U-SQL Query Execution: https://channel9.msdn.com/Series/AzureDataLake/USQL-QE