Paris Datageeks meetup 05102016

U-SQL CASE STUDY
Paris Datageeks, 05/10/2016

About Me
▪ Michel Caradec
  • mcaradec@cegid.fr
▪ Project Manager, Software/Data Engineer at Cegid
▪ Background
  • Business Intelligence, ETL, OLAP, Data Manipulation
  • C#, R, Python
06/10/2016 CEGID

Agenda
▪ Business Case
▪ Azure Data Lake
▪ U-SQL Case Study
▪ Questions

Business Case
▪ Cegid websites are instrumented with tracking solutions
  • Extends web analytics data
▪ Data engineers: collect and prepare data
▪ Data scientists: consume data in Azure ML Studio
  • Learn how visitors use the sites (browsing)
  • Provide a better experience (recommendations)

Business Case
Visitors clustering
▪ One shot visitor (86%)
  • 1 single visit
  • Bounce rate: 63%
  • Session length: 2 min
  • Mainly on Homepage
▪ Returning visitor (7%)
  • Average of 4.5 sessions
  • High screen resolution
  • Windows OS
  • Use of internal Search
  • During the day
▪ Tactile visitor (3.5%)
  • Average of 2.5 sessions
  • 1.5 pages / session
  • Bounce rate: 80%
  • iOS (iPhone, iPad)
  • In the evening and at weekends
  • No conversion
▪ Addict visitor (3.5%)
  • Average of 7 sessions
  • 5 pages / session
  • Session length: 6 min
  • Mainly Solutions pages
  • More conversions
Metrics built from a subset of the dataset. They do not represent real traffic.

Azure Data Lake = Big Data as a Service
▪ ADL Store: repository, schema-on-read, WebHDFS
▪ ADL Analytics: distributed processing using U-SQL
© Microsoft Azure

U-SQL = SQL + C#
▪ Extract → Transform → Output
  ADLS, Azure Storage Blobs, Azure SQL
▪ File schema-on-read
▪ SQL-like data manipulation
▪ C# data types
▪ C# integration
▪ Can store as relational
And many more…
▪ User-defined aggregators (C#)
▪ User-defined operators (C#)
▪ Custom Extractors, Outputters (C#)
▪ File sets for multiple input files access patterns
▪ Credentials
Inspired by Michael Rys's presentation at SQL Server PASS Deutschland 2016

    REFERENCE ASSEMBLY [Cegid.DigitalAnalytics.Commons];

    // Extract
    @data =
        EXTRACT user string, timestamp DateTime, heart int
        FROM "quantified-{user}.tsv"
        USING Extractors.Tsv();

    // Transform
    @agg =
        SELECT *, timestamp.Date AS date, timestamp.Hour AS hour
        FROM @data;
    @agg =
        SELECT user, date, hour, AVG(heart) AS avg, MIN(heart) AS min, MAX(heart) AS max
        FROM @agg
        GROUP BY user, date, hour;

    // Output
    OUTPUT @agg TO "quantified.csv"
    USING Outputters.Csv();

    // Output' (alternative: store as relational)
    CREATE TABLE quantified(user string, date DateTime, hour int,
        avg long?, min int?, max int?,
        INDEX idx CLUSTERED(user, date, hour) DISTRIBUTED BY HASH(user));
    INSERT INTO quantified SELECT * FROM @agg;
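The Extract / Transform / Output script above groups heart-rate readings by user, date and hour. As a rough illustration of the same grouping logic outside U-SQL, here is a minimal Python sketch. The function name `aggregate_heart_rate` and the sample rows are made up for this example and do not come from the talk.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_heart_rate(rows):
    """Group (user, timestamp, heart) rows by user, date and hour."""
    groups = defaultdict(list)
    for user, timestamp, heart in rows:
        groups[(user, timestamp.date(), timestamp.hour)].append(heart)
    # One output entry per (user, date, hour), like the GROUP BY in the script.
    return {
        key: {"avg": sum(values) / len(values), "min": min(values), "max": max(values)}
        for key, values in groups.items()
    }


# Made-up sample rows, for illustration only.
sample_rows = [
    ("alice", datetime(2016, 10, 5, 9, 5), 60),
    ("alice", datetime(2016, 10, 5, 9, 45), 80),
    ("alice", datetime(2016, 10, 5, 10, 0), 72),
]
hourly = aggregate_heart_rate(sample_rows)
```

In U-SQL the grouping is distributed by the engine; this sketch only shows the per-group arithmetic.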

Data sources
▪ JSON Record
  • “Custom Dimensions”: contains arrays

U-SQL Pipeline
1. JSON / TSV conversion → cegid-<site>-raw.tsv
2. Sessions aggregation → cegid-<site>-sessions.tsv
3. Visitors aggregation → cegid-<site>-visitors.tsv

U-SQL Script 1 - JSON / TSV Conversion
1. *.json files extraction
2. Records to JSON objects conversion
3. Cegid IPs filtering
4. JSON objects fields extraction
5. Custom dimensions extraction
6. Custom dimensions pivot
7. Output to TSV files

JSON / TSV Conversion - Step 1/7: *.json files extraction
Data extracted as raw text
Azure Storage Blobs → @json_records_raw
    20160901/20160901.json
    20160902/20160902.json
    20160903/20160903.json
    20160904/20160904.json
    …

    @json_records_raw =
        EXTRACT json_raw string
        FROM "{*}/{date:yyyy}{date:MM}{date:dd}.json"
        USING Extractors.Text(delimiter : Char.MinValue, quoting : false);
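The file set pattern in this step encodes a date in each blob name. To make that mapping concrete, here is a small Python sketch that recovers the date from a path such as 20160901/20160901.json. It assumes `{*}` stands for a single folder name, and `date_from_path` is a hypothetical helper, not part of U-SQL.

```python
import re
from datetime import date

# Mirrors the pattern "{*}/{date:yyyy}{date:MM}{date:dd}.json":
# one folder name (assumption), then a file name made of an 8-digit date.
FILE_PATTERN = re.compile(r"^[^/]+/(\d{4})(\d{2})(\d{2})\.json$")


def date_from_path(path):
    """Return the date encoded in the file name, or None if the path does not match."""
    match = FILE_PATTERN.match(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


first_day = date_from_path("20160901/20160901.json")
```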

JSON / TSV Conversion - Step 2/7: Records to JSON objects conversion

    DECLARE @fields = new SqlArray<string>() { "timestamp", "eventId",
        "customerId", ... };
    @json_records =
        SELECT JsonFunctions.JsonTuple(json_raw, @fields.ToArray()) AS json_object
        FROM @json_records_raw;
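This step turns each raw text record into a keyed object restricted to the listed fields. A minimal Python sketch of that idea follows. It assumes the result is a map from field name to string value. The function `json_tuple` and the sample record are illustrative only and are not the API used in the script.

```python
import json


def json_tuple(json_raw, fields):
    """Parse one JSON record and keep only the requested top-level fields as strings."""
    record = json.loads(json_raw)
    result = {}
    for name in fields:
        value = record.get(name)
        if value is None:
            result[name] = None            # field missing from the record
        elif isinstance(value, str):
            result[name] = value
        else:
            result[name] = json.dumps(value)  # keep nested values as JSON text
    return result


# Made-up record, for illustration only.
raw = '{"timestamp": "2016-09-01T10:00:00Z", "eventId": 42, "cd": [{"index": 1, "value": "x"}]}'
json_object = json_tuple(raw, ["timestamp", "eventId", "cd", "customerId"])
```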

JSON / TSV Conversion - Step 3/7: Cegid IPs filtering

    @json_records =
        SELECT json_object["ipAddress"] AS ipAddress, json_object
        FROM @json_records;
    @json_records =
        SELECT *
        FROM @json_records
        LEFT ANTISEMIJOIN (SELECT ip FROM CegidIps) AS ips
        ON ipAddress == ips.ip;
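The anti-semijoin in this step keeps only the records whose IP address has no match in the Cegid IP list. The same filtering can be sketched in Python as below. The function name and the sample addresses (taken from documentation-reserved ranges) are made up for the example.

```python
def left_anti_semi_join(records, excluded_ips):
    """Keep the records whose ipAddress is NOT present in excluded_ips."""
    excluded = set(excluded_ips)  # set lookup keeps the filter linear
    return [record for record in records if record["ipAddress"] not in excluded]


# Made-up data, for illustration only.
records = [
    {"ipAddress": "192.0.2.10", "eventId": "1"},
    {"ipAddress": "198.51.100.7", "eventId": "2"},
    {"ipAddress": "192.0.2.11", "eventId": "3"},
]
internal_ips = ["192.0.2.10", "192.0.2.11"]
external_records = left_anti_semi_join(records, internal_ips)
```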

JSON / TSV Conversion - Step 4/7: JSON objects fields extraction

    @events =
        SELECT json_object["timestamp"] AS timestamp,
               json_object["eventId"] AS eventId,
               json_object["customerId"] AS customerId,
               // ...
        FROM @json_records;

JSON / TSV Conversion - Step 5/7: Custom dimensions extraction

    @events =
        SELECT *, CustomDimensions.ParseFromJson(cd) AS cd_array
        FROM @events;

index  value
3      bigFan
4      powerBuyer
2      1443654335461
1      ComkzbXvn8g

→ value (ordered)
ComkzbXvn8g
1443654335461
bigFan
powerBuyer
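The table above shows custom dimensions arriving as unordered (index, value) pairs and leaving as values ordered by index. Here is a hedged Python sketch of that reordering. The JSON shape (a list of objects with "index" and "value" keys) is an assumption, as is the choice to leave unused slots empty. `parse_custom_dimensions` is a hypothetical stand-in, not the Cegid implementation.

```python
import json


def parse_custom_dimensions(cd_json, size=20):
    """Place each custom dimension value at position index - 1 of a fixed-size list."""
    slots = [None] * size
    for item in json.loads(cd_json):
        index = int(item["index"])
        if 1 <= index <= size:       # ignore indexes outside the expected range
            slots[index - 1] = item["value"]
    return slots


# Values taken from the slide; the JSON layout is assumed.
cd = json.dumps([
    {"index": 3, "value": "bigFan"},
    {"index": 4, "value": "powerBuyer"},
    {"index": 2, "value": "1443654335461"},
    {"index": 1, "value": "ComkzbXvn8g"},
])
cd_array = parse_custom_dimensions(cd)
```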

JSON / TSV Conversion - Step 6/7: Custom dimensions pivot

    @events =
        SELECT timestamp,
               eventId,
               customerId,
               verb,
               // ...
               cd_array[0] AS cd_01,
               cd_array[1] AS cd_02,
               // ...
               cd_array[19] AS cd_20
        FROM @events;

value
ComkzbXvn8g
1443654335461
bigFan
powerBuyer

→
cd_01        cd_02          cd_03   cd_04
ComkzbXvn8g  1443654335461  bigFan  powerBuyer
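The pivot maps array positions 0 to 19 onto the columns cd_01 to cd_20. A short Python sketch of the same mapping is given below. The function name is hypothetical, and padding short arrays with empty values is an assumption.

```python
def pivot_custom_dimensions(cd_array, size=20):
    """Turn a list of values into a dict of columns cd_01 .. cd_<size>."""
    padded = list(cd_array[:size]) + [None] * (size - len(cd_array))
    return {"cd_{:02d}".format(position + 1): padded[position] for position in range(size)}


# Values taken from the slide.
row = pivot_custom_dimensions(["ComkzbXvn8g", "1443654335461", "bigFan", "powerBuyer"])
```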

JSON / TSV Conversion - Step 7/7: Output to TSV files

    OUTPUT
    (
        SELECT *
        FROM @events
        WHERE fullUrl.StartsWith("http://www.cegid.com/uk/")
    )
    TO "cegid-uk-raw.tsv"
    ORDER BY clientId, timestamp ASC
    USING Outputters.Tsv(quoting : false);
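The output step filters events by URL prefix, orders them by client and timestamp, and writes unquoted tab-separated lines. The sketch below reproduces that sequence in Python on in-memory data. The function name, the sample events and the ISO timestamp strings are assumptions made for the example.

```python
def to_site_tsv(events, url_prefix, columns):
    """Filter events by URL prefix, sort by (clientId, timestamp), render as TSV text."""
    selected = [event for event in events if event["fullUrl"].startswith(url_prefix)]
    selected.sort(key=lambda event: (event["clientId"], event["timestamp"]))
    lines = ["\t".join(str(event[column]) for column in columns) for event in selected]
    return "\n".join(lines) + "\n" if lines else ""


# Made-up events, for illustration only.
events = [
    {"clientId": "b", "timestamp": "2016-09-01T10:05:00", "fullUrl": "http://www.cegid.com/uk/solutions"},
    {"clientId": "a", "timestamp": "2016-09-01T11:00:00", "fullUrl": "http://www.cegid.com/uk/"},
    {"clientId": "a", "timestamp": "2016-09-01T09:00:00", "fullUrl": "http://www.cegid.com/fr/"},
    {"clientId": "a", "timestamp": "2016-09-01T10:00:00", "fullUrl": "http://www.cegid.com/uk/blog"},
]
uk_tsv = to_site_tsv(events, "http://www.cegid.com/uk/", ["clientId", "timestamp", "fullUrl"])
```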

JSON / TSV Conversion

JSON / TSV Conversion - Output
▪ TSV format

U-SQL Script 2 - Sessions Aggregation
1. cegid-<site>-raw.tsv files extraction
2. Events to sessions aggregation
3. Output to TSV files
▪ N events → M sessions (M >= 1)
  • New session if no activity for 30 minutes
▪ Aggregation using U-SQL REDUCE
  • Microsoft.Analytics.Interfaces.IReducer
▪ User Agent parsing: UAParser (C#)
▪ Geocoding (lon/lat, time zone, time difference): Cegid.GeoTools (C#)
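The 30-minute inactivity rule can be illustrated with a short Python sketch that splits one visitor's event timestamps into sessions. This is not the reducer used in the talk. The function name is hypothetical, and treating a gap of exactly 30 minutes as the same session is an assumption.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_sessions(timestamps, timeout=SESSION_TIMEOUT):
    """Group event timestamps into sessions; a gap longer than timeout starts a new one."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # first event, or inactivity gap exceeded
    return sessions


# Made-up events for one visitor, for illustration only.
visit = [
    datetime(2016, 9, 1, 10, 0),
    datetime(2016, 9, 1, 10, 10),
    datetime(2016, 9, 1, 10, 50),
    datetime(2016, 9, 1, 11, 5),
]
sessions = split_into_sessions(visit)
```

In a reducer, the same loop would run once per visitor key, over that visitor's rows ordered by timestamp.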

U-SQL Script 3 - Visitors Aggregation
1. cegid-<site>-sessions.tsv files extraction
2. Sessions to visitors aggregation
3. Output to TSV files
▪ N sessions → 1 visitor
▪ Aggregation using U-SQL REDUCE
  • Microsoft.Analytics.Interfaces.IReducer
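Collapsing N sessions into 1 visitor amounts to a per-visitor reduction. The Python sketch below shows one possible shape for it, computing the kind of metrics quoted in the clustering slide (sessions, pages per session, session length). The field names `visitor_id`, `pages` and `minutes` are hypothetical and the data is made up.

```python
from collections import defaultdict


def aggregate_visitors(sessions):
    """Reduce session rows to one summary row per visitor."""
    grouped = defaultdict(list)
    for session in sessions:
        grouped[session["visitor_id"]].append(session)
    visitors = {}
    for visitor_id, items in grouped.items():
        visitors[visitor_id] = {
            "sessions": len(items),
            "pages_per_session": sum(s["pages"] for s in items) / len(items),
            "avg_session_minutes": sum(s["minutes"] for s in items) / len(items),
        }
    return visitors


# Made-up session rows, for illustration only.
session_rows = [
    {"visitor_id": "A", "pages": 5, "minutes": 6},
    {"visitor_id": "A", "pages": 3, "minutes": 2},
    {"visitor_id": "B", "pages": 1, "minutes": 1},
]
visitors = aggregate_visitors(session_rows)
```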

Visitors Aggregation
▪ Custom dimensions aggregation (C#)

cd_01 = page category
cd_02 = content type
...
cd_20 = …

visitor_id  cd_01      cd_02
A           homepage   articles
A           solutions  detail
A           homepage   articles
A           form       news
B           homepage   articles
B           blog       news

→ (- = no occurrence)
visitor_id  cat_homepage  cat_solutions  cat_blog  cat_form  typ_articles  typ_detail  typ_news
A           2             1              -         1         2             1           1
B           1             -              1         -         1             -           1
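The custom dimensions aggregation counts, per visitor, how often each page category and content type occurs. The talk implements this in C#. The Python sketch below only illustrates the counting on the rows shown in the slide, and the function and constant names are hypothetical.

```python
from collections import Counter, defaultdict

# Column prefix per custom dimension: cd_01 = page category, cd_02 = content type.
PREFIXES = {"cd_01": "cat_", "cd_02": "typ_"}


def count_custom_dimensions(rows):
    """Count occurrences of each custom dimension value per visitor."""
    counts = defaultdict(Counter)
    for row in rows:
        for column, prefix in PREFIXES.items():
            counts[row["visitor_id"]][prefix + row[column]] += 1
    return {visitor_id: dict(counter) for visitor_id, counter in counts.items()}


# Rows taken from the slide.
rows = [
    {"visitor_id": "A", "cd_01": "homepage", "cd_02": "articles"},
    {"visitor_id": "A", "cd_01": "solutions", "cd_02": "detail"},
    {"visitor_id": "A", "cd_01": "homepage", "cd_02": "articles"},
    {"visitor_id": "A", "cd_01": "form", "cd_02": "news"},
    {"visitor_id": "B", "cd_01": "homepage", "cd_02": "articles"},
    {"visitor_id": "B", "cd_01": "blog", "cd_02": "news"},
]
visitor_counts = count_custom_dimensions(rows)
```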

Lessons Learned
▪ Keep C# code centralized in dedicated assemblies
  • Code-behind is good for prototyping
▪ Prefer U-SQL to C# for better optimization
  • Distributed execution is built from the U-SQL script, not from C#
▪ Use parallelism properly
▪ Know your data
  • Statistics (cardinality, distribution, skewness)
  • Growth
▪ Understand U-SQL concepts (vertex, partitioning, etc.); knowing MapReduce design patterns will help
▪ Local mode is good, but do not forget to test on ADL
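As one way to act on "know your data", the Python sketch below computes two simple statistics for a candidate partitioning key: its cardinality and the share of rows held by its most frequent value, which gives a rough hint of skew. The function name and the sample keys are made up. This is an illustration, not a tool used in the talk.

```python
from collections import Counter


def key_statistics(keys):
    """Return cardinality and the share of rows taken by the most frequent key."""
    counts = Counter(keys)
    if not counts:
        return {"cardinality": 0, "largest_key": None, "largest_share": 0.0}
    total = sum(counts.values())
    largest_key, largest_count = counts.most_common(1)[0]
    return {
        "cardinality": len(counts),
        "largest_key": largest_key,
        "largest_share": largest_count / total,
    }


# Made-up key column: one value dominates, so a hash partition on it would be skewed.
stats = key_statistics(["a"] * 8 + ["b", "c"])
```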

Azure Data Lake Review
Pros
▪ Packaged solution, zero deployment
▪ Cloud agility (scalability, elasticity)
▪ WebHDFS storage = Hadoop compatible (Sqoop, etc.)
▪ Integrated development (Visual Studio, local/debug mode)
▪ U-SQL = SQL + C# (business code reuse)
Cons
▪ Proprietary solution
▪ Not on-premises
▪ Learning curve
