www.dataiku.com 
Take back control of your Web Tracking 
@ClementStenac 
CTO, Dataiku
Give me dashboards !
Choose one 
Raw data Do what you want 
Your money 
Access to raw data is a premium feature
Who cares about raw data ? 
•SaaS analytics are full-featured 
•Custom variables to link with your backend data 
•Did you really join all data for your future needs ? 
•Do you have access / want to push to the JS all necessary data ? 
•What kinds of analysis will you do later on ?
A real example: segmentation and tracking user satisfaction 
Raw tracking data 
User-level stats 
User base segmentation 
Metrics per segments 
Tracking over time 
TB 
GB
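The second stage of this pipeline, going from raw tracking data to user-level stats, could be sketched as follows. This is only an illustration: the event fields (`visitor_id`, `session_id`, `page`) and the chosen statistics are assumptions, not something the slides specify.

```python
from collections import defaultdict


def user_level_stats(events):
    """Aggregate raw tracking events into one row of statistics per visitor."""
    stats = defaultdict(lambda: {"events": 0, "sessions": set(), "pages": set()})
    for event in events:
        row = stats[event["visitor_id"]]
        row["events"] += 1
        row["sessions"].add(event["session_id"])
        row["pages"].add(event["page"])
    # Replace the sets with counts so each row is a flat record.
    return {
        visitor: {
            "events": row["events"],
            "sessions": len(row["sessions"]),
            "distinct_pages": len(row["pages"]),
        }
        for visitor, row in stats.items()
    }


# Hypothetical raw events, one dict per tracked page view.
raw_events = [
    {"visitor_id": "v1", "session_id": "s1", "page": "/"},
    {"visitor_id": "v1", "session_id": "s1", "page": "/news"},
    {"visitor_id": "v1", "session_id": "s2", "page": "/"},
    {"visitor_id": "v2", "session_id": "s3", "page": "/news"},
]
per_user = user_level_stats(raw_events)
```

Rows of this shape are what a later clustering step would take as input.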
User-level data
Clustering
Labeling 
Search for a specific Topic 
Newcomer from Google News 
Foreigner Discovering The Site 
Fan who loves to comment 
Home Page Wanderer 
Dark Bot (Competitor?) 
Here you need your business intelligence
Compute metrics per segment 
Search for a specific Topic 
Newcomer from Google News 
Foreigner Discovering The Site 
Fan that loves to comment 
Home Page Wanderer 
Dark Bot (Competitor?) 
0.3€ per session 
0.23€ acquisition costs 
13k sessions 
1.3€ per session 
0.23€ acquisition costs 
938k sessions 
938k sessions 
0.3€ per session 
0.23€ acquisition costs 
738k sessions 
0.83€ per session 
0.73€ acquisition costs 
68k sessions 
0.3€ per session 
1.23€ acquisition costs 
1k sessions 
0€ per session 
0€ acquisition costs 
Here you need to cross with your CRM
Track metrics over time 
Search for a specific Topic 
Newcomer from Google News 
Foreigner Discovering The Site 
Fan that loves to comment 
Home Page Wanderer 
Dark Bot (Competitor?) 
Using your already-computed segments 
Damn, our latest release has diverging effects on segments
A few other examples 
•Churn prediction and explanation 
•Customer lifetime value prediction
OK I WANT TO DO IT
So, I have these Apache logs 
•First level of web tracking 
•"Nothing required"
Are backend logs a solution ? 
Challenge 1 : Identify a visitor 
•IP ? 
•NAT / Proxy 
•Not everyone has a public IP address 
•IP + user-agent ? 
•Big companies !
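To see why IP + user-agent is a weak identifier, here is a small sketch. The hashing scheme, the helper name `visitor_key` and the sample values are all hypothetical; the point is only that two people behind the same corporate proxy with the same standard browser build end up with the same key.

```python
import hashlib


def visitor_key(ip, user_agent):
    """Derive a pseudo visitor identifier from IP address and user-agent."""
    digest = hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()
    return digest[:12]


# Two different employees of a big company: same outgoing IP, same browser build.
alice = visitor_key("203.0.113.7", "Mozilla/5.0 (CorpBuild)")
bob = visitor_key("203.0.113.7", "Mozilla/5.0 (CorpBuild)")
collision = alice == bob  # distinct people, identical key
```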
Are backend logs a solution ? 
Challenge 2 : Re-create sessions 
•Using expiration times 
•Advanced SQL / Hive / … makes this easier
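A minimal sketch of session re-creation with an expiration time, assuming a 30-minute inactivity timeout and hits already reduced to (visitor, timestamp) pairs; both are illustrative choices, not values from the talk. In SQL or Hive the same gap test is typically written with a window function such as LAG.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed expiration time


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Assign a per-visitor session number to each (visitor, timestamp) hit.

    A new session starts when the gap since the visitor's previous hit
    exceeds `timeout`.
    """
    last_seen = {}
    session_no = {}
    result = []
    for visitor, ts in sorted(hits, key=lambda h: (h[0], h[1])):
        previous = last_seen.get(visitor)
        if previous is None or ts - previous > timeout:
            session_no[visitor] = session_no.get(visitor, 0) + 1
        last_seen[visitor] = ts
        result.append((visitor, ts, session_no[visitor]))
    return result


t0 = datetime(2014, 1, 1, 9, 0)
hits = [
    ("v1", t0),
    ("v1", t0 + timedelta(minutes=10)),
    ("v1", t0 + timedelta(minutes=55)),  # 45 minutes of inactivity: new session
    ("v2", t0 + timedelta(minutes=5)),
]
sessions = [number for _, _, number in sessionize(hits)]
```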
Are backend logs a solution ? 
Challenge 3 : single-page webapps 
•Track behaviour within each page 
•Track events, not pages 
Also: getting logs from IT is sometimes another challenge 
Client-side tracking 
•visitor_id and session_id handled with cookies 
•Tracking page loads and various events 
•Historically, "tracking" = fetching a 1x1 image 
•AJAX 
www.website.com 
Browser 
tracker.com 
JS tracking code 
Tracking calls
Are cookies good for your (web) health ? 
•Each cookie belongs to a domain (and its subdomains) 
•Who can write a cookie ? 
–The HTTP server, who becomes owner (via the Set-Cookie HTTP header) 
–JS code running on the "owner" domain 
•Who can read a cookie ? 
–The owner HTTP server (sent by the browser) 
–JS code running on the "owner" domain
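The ownership rule "a domain and its subdomains" can be sketched as a simple suffix test. This is a deliberate simplification of what browsers do: it ignores host-only cookies, IP addresses and public-suffix checks, and the function name is made up for the example.

```python
def domain_matches(cookie_domain, request_host):
    """Return True if a cookie owned by `cookie_domain` is sent to `request_host`."""
    cookie_domain = cookie_domain.lstrip(".").lower()
    request_host = request_host.lower()
    # Exact match, or the host is a subdomain of the cookie's domain.
    return request_host == cookie_domain or request_host.endswith("." + cookie_domain)


sent_to_subdomain = domain_matches("website.com", "www.website.com")
sent_to_tracker = domain_matches("website.com", "tracker.com")
```

The leading dot in the suffix test matters: without it, a host such as `evilwebsite.com` would wrongly match `website.com`.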
First-party cookies 
•Set by the originating server (HTTP) or JS code 
•Belong to the originating domain 
•Sent by HTTP to the originating domain only 
•Readable by JS code 
www.website.com 
Browser 
Cookies for www.website.com: 
None 
tracker.com 
GET / Cookies: none 
Fetch tracking script 
Tracking JS code: read cookies for www.website.com 
Tracking JS code: create visitor id and set cookie 
Contents
First-party cookies 
•Set by the originating server (HTTP) or JS code 
•Belong to the originating domain 
•Sent by HTTP to the originating domain only 
•Readable by JS code 
www.website.com 
Browser 
tracker.com 
GET /track?visitor_id=d37ecba Cookies: None 
JS code: send AJAX request to tracker.com with visitor_id 
Cookies for www.website.com: 
visitor_id=d37ecba
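The first-party flow of the last two slides can be simulated in a few lines. In a browser this logic runs as JS inside the page; the sketch below only mirrors it in Python, with a plain dict standing in for the cookies of www.website.com. The function name and the identifier format are assumptions.

```python
import uuid
from urllib.parse import urlencode


def first_party_track(site_cookies, tracker_base="https://tracker.com"):
    """Mimic the tracking script: reuse or create visitor_id, then build the call."""
    if "visitor_id" not in site_cookies:
        # First visit: the script creates the identifier and stores it
        # as a cookie of the visited site, not of the tracker.
        site_cookies["visitor_id"] = uuid.uuid4().hex
    # The tracker receives no cookie; the identifier travels in the query string.
    query = urlencode({"visitor_id": site_cookies["visitor_id"]})
    return f"{tracker_base}/track?{query}"


cookies_for_website = {}
first_call = first_party_track(cookies_for_website)
second_call = first_party_track(cookies_for_website)  # same visitor_id is reused
```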
Third-party cookies 
•Set (in HTTP) by the tracker's domain – Belong to the tracker's domain 
•Not sent by HTTP to the originating domain (does not belong) 
•NOT readable by JS code (does not belong) 
www.website.com 
Browser 
tracker.com 
GET / Cookies: none 
Fetch tracking script 
Contents 
Cookies for www.website.com: 
None 
Cookies for tracker.com: None
www.website.com 
Browser 
Cookies for www.website.com: None 
tracker.com 
Cookies for tracker.com: 
None 
GET /track Cookies: None 
200 OK Set-Cookie: visitor_id=33d7 
Tracker code: assign visitor_id 
Third-party cookies 
•Set (in HTTP) by the tracker's domain – Belong to the tracker's domain 
•Not sent by HTTP to the originating domain (does not belong) 
•NOT readable by JS code (does not belong)
Third-party cookies 
•Set (in HTTP) by the tracker's domain – Belong to the tracker's domain 
•Not sent by HTTP to the originating domain (does not belong) 
•NOT readable by JS code (does not belong) 
www.website.com 
Browser 
tracker.com 
Cookies for tracker.com: 
visitor_id=33d7 
GET /track Cookies: visitor_id=33d7 
200 OK 
Tracker code: read visitor_id 
Cookies for www.website.com: None
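On the tracker side, the third-party exchange shown in these slides (assign a visitor_id on the first call, read it back on later calls) might look like the sketch below. It uses Python's standard `http.cookies` module; the function name `handle_track` and the return shape are assumptions, and a real tracker would also set attributes such as expiry on the cookie.

```python
import uuid
from http.cookies import SimpleCookie


def handle_track(cookie_header):
    """Return (visitor_id, response_headers) for a GET /track request."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    if "visitor_id" in cookies:
        # Returning visitor: the browser sent the tracker's own cookie back.
        return cookies["visitor_id"].value, []
    # New visitor: assign an identifier and ask the browser to store it.
    visitor_id = uuid.uuid4().hex
    return visitor_id, [("Set-Cookie", f"visitor_id={visitor_id}")]


new_id, first_headers = handle_track(None)
same_id, later_headers = handle_track(f"visitor_id={new_id}")
```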
Why each ? 
First party cookie 
•Tracks on a single website 
•Requires JS code for tracking 
•Reduced privacy impact: No exchange of information between sites 
•Usage: track your users' behaviour 
•Rarely blocked (used for logins) 
Third party cookie 
•Tracks across all websites using the same tracker 
•More frowned upon 
•Usage: generally, ads but also multi-website 
•Blocked by up to 40% of visitors
What are your obligations ? 
With ALL cookies 
•You should ask users whether they accept cookies 
•Even non-tracking related cookies 
•Yes, even login-related ones
What are your obligations ? 
With third party cookies 
•Obey the Do-Not-Track header 
www.website.com 
Browser 
tracker.com 
GET /track Cookies: None 
DNT: 1 
200 OK 
Tracker code: DO NOTHING
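Honouring the header comes down to one check before any tracking work is done. A minimal sketch, assuming request headers are available as a dict; the helper name is hypothetical.

```python
def should_track(request_headers):
    """Return False when the browser sent the Do-Not-Track header with value 1."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value.strip() for name, value in request_headers.items()}
    return normalized.get("dnt") != "1"


dnt_request = should_track({"DNT": "1"})
plain_request = should_track({"User-Agent": "Mozilla/5.0"})
```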
What are your obligations ? 
With third party cookies 
•Provide an opt-out URL 
•Lets the user call /optin, /optout or /status 
See in action : www.youronlinechoices.com
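One possible shape for the three endpoints, assuming the choice is stored in a tracker-domain cookie named `optout` (a hypothetical name) and that opting out also discards the visitor identifier:

```python
def handle_opt_request(path, cookies):
    """Handle /optout, /optin and /status; returns (status_text, updated_cookies)."""
    cookies = dict(cookies)  # work on a copy of the incoming cookies
    if path == "/optout":
        cookies["optout"] = "1"
        cookies.pop("visitor_id", None)  # stop identifying this browser
    elif path == "/optin":
        cookies.pop("optout", None)
    elif path != "/status":
        raise ValueError(f"unknown path: {path}")
    status = "opted-out" if cookies.get("optout") == "1" else "opted-in"
    return status, cookies


status_after_optout, stored = handle_opt_request("/optout", {"visitor_id": "33d7"})
status_check, _ = handle_opt_request("/status", stored)
status_after_optin, cleared = handle_opt_request("/optin", stored)
```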
What are your obligations ? 
With third party cookies 
•Provide a P3P policy 
•Else, older IE blocks you 
"What are you doing with my data ?" 
Looks like this: 
CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
Tracking in mobile apps 
•Preserve battery 
–Each network call is costly 
–Do not track everything synchronously 
•Network access is intermittent 
–Queue events and wait for network access
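The two constraints above (few network calls, intermittent connectivity) suggest a local queue that is flushed in batches. The sketch below illustrates the idea in Python rather than in mobile code; the class name, the batch size and the `send_batch` callback are assumptions, not the API of any particular tracking library.

```python
from collections import deque


class EventQueue:
    """Buffer events locally and send them in batches when the network is available."""

    def __init__(self, send_batch, batch_size=20):
        self._send_batch = send_batch  # callable that uploads a list, returns True on success
        self._batch_size = batch_size
        self._pending = deque()

    def track(self, event):
        # Recording an event never touches the network.
        self._pending.append(event)

    def flush(self, network_available):
        """Send pending events in batches; keep them queued if offline or on failure."""
        sent = 0
        while network_available and self._pending:
            size = min(self._batch_size, len(self._pending))
            batch = [self._pending[i] for i in range(size)]
            if not self._send_batch(batch):
                break  # upload failed: retry on the next flush
            for _ in batch:
                self._pending.popleft()
            sent += len(batch)
        return sent

    def __len__(self):
        return len(self._pending)


uploaded = []


def fake_upload(batch):
    """Stand-in for the real network call."""
    uploaded.append(list(batch))
    return True


queue = EventQueue(fake_upload, batch_size=2)
for name in ["open", "scroll", "click"]:
    queue.track({"event": name})
offline_sent = queue.flush(network_available=False)
online_sent = queue.flush(network_available=True)
```

Events are removed from the queue only after a successful upload, so a dropped connection costs a retry rather than lost data.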
So, what are my choices ? 
•You might really want to be your own web tracker 
•Most widely used open source web tracker: Piwik 
•Provides both raw data and nice dashboards 
–MySQL backend 
–Raw data via API 
–Slightly less suited for analytics
WT1 YOUR OWN TRACKER IN MINUTES
WT1 
An open source (Apache License) server to build your own web tracking 
https://github.com/dataiku/wt1 
•Designed to provide you with raw data, directly usable for analytics 
•Very high performance and scalability
Features 
•1st or 3rd party cookies 
–Handling of DNT and opt-out 
–Helps handling P3P 
•Track events or pages with key-value data 
•Visitor-scope and session-scope variables 
•"Live view" debugging console
Features 
•Dashboards: None  
•Events processing and storage 
–Filesystem, S3 
–Event queues: Flume 
–Custom processors 
•JSON API for custom tracking 
•iOS library
Architecture 
Client-side JS tracker 
iOS library 
•1st or 3rd party cookies 
•Event-level tracking 
•Automatic batching 
•Queuing to deal with network interruptions 
JSON POST 
WT1 Server 
•Java 
•> 20K events / second 
•Handles DNT, P3P, opt-out, … 
Raw storage 
•Filesystem 
•S3 
Event processors: 
•Real-time aggregations 
•Custom code 
Event queues 
•Flume 
•Kafka, RabbitMQ, … 
Future work 
•Android library 
•More event queues supported OOTB 
–Kafka 
–RabbitMQ 
•Avro storage
Thank you ! 
Clément Stenac clement.stenac@dataiku.com @ClementStenac 
www.dataiku.com

OWF14 - Big Data Track : Take back control of your web tracking Go further by doing it yourself

  • 1. www.dataiku.com Take back control of your Web Tracking @ClementStenac CTO, Dataiku
  • 2. www.dataiku.com Give me dashboards !
  • 3. www.dataiku.com Choose one Raw data Do what you want Your money Access to raw data is a premium feature
  • 4. www.dataiku.com Who cares about raw data ? •SAAS analytics are full-featured •Custom variables to link with your backend data •Did you really join all data for your future needs ? •Do you have access / want to push to the JS all necessary data ? •What kinds of analysis will you do later on ?
  • 5. www.dataiku.com A real example Segmentation and tracking user-satisfaction Raw tracking data User-level stats User base segmentation Metrics per segments Tracking over time TB GB
  • 8. www.dataiku.com Labeling Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan who loves to comment Home Page Wanderer Dark Bot (Competitor?) Here you need your business intelligence
  • 9. www.dataiku.com Compute metrics per segment Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) 0.3€ per session 0.23€ acquisition costs `` ` 13k sessions 1.3€ per session 0.23€ acquisition costs 938k sessions 938k sessions 0.3€ per session 0.23€ acquisition costs 738k sessions 0.83€ per session 0.73€ acquisition costs 68k sessions 0.3€ per session 1.23€ acquisition costs 1k sessions 0€ per session 0€ acquisition costs Here you need to cross with your CRM
  • 10. www.dataiku.com Track metrics over time Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) Using your already-computed segments Damn our latest release has diverging effects on segments
  • 11. www.dataiku.com A few other examples •Churn prediction and explanation •Customer lifetime value prediction
  • 12. www.dataiku.com OK I WANT TO DO IT
  • 13. www.dataiku.com So, I have these Apache logs •First level of web tracking •"Nothing required"
  • 14. www.dataiku.com Are backend logs a solution ? Challenge 1 : Identify a visitor •IP ? •NAT / Proxy •Not everyone has a public IP address •IP + user-agent ? •Big companies !
  • 15. www.dataiku.com Are backend logs a solution ? Challenge 2 : Re-create sessions •Using expiration times •Advanced SQL / Hive / … makes this easier
  • 16. www.dataiku.com Are backend logs a solution ? Challenge 3 : single-page webapps •Track behaviour within each page •Track events, not pages Also: getting logs from IT is sometimes another challenge 
  • 17. www.dataiku.com Client-side tracking •visitor_id and session_id handled with cookies •Tracking page loads and various events •Historically, "tracking" = fetching a 1x1 image •AJAX www.website.com Browser tracker.com JS tracking code Tracking calls
  • 18. www.dataiku.com Are cookies good for your (web) health ? •Each cookie belongs to a domain (and its subdomains) •Who can write a cookie ? –The HTTP server, who becomes owner (via the Set-Cookie HTTP header) –JS code running on the "owner" domain •Who can read a cookie ? –The owner HTTP server (sent by the browser) –JS code running on the "owner" domain
  • 19. www.dataiku.com First-party cookies •Set by the originating server (HTTP) or JS code •Belong to the originating domain •Sent by HTTP to the originating domain only •Readable by JS code www.website.com Browser Cookies for www.website.com: None tracker.com GET / Cookies: none Fetch tracking script Tracking JS code: read cookies for www.website.com Tracking JS code: create visitor id and set cookie Contents
  • 20. www.dataiku.com First-party cookies •Set by the originating server (HTTP) or JS code •Belong to the originating domain •Sent by HTTP to the originating domain only •Readable by JS code www.website.com Browser tracker.com GET /track?visitor_id=d37ecba Cookies: None JS code: send AJAX request to tracker.com with visitor_id Cookies for www.website.com: visitor_id=d37ecba
  • 21. www.dataiku.com Third-party cookies •Set (in HTTP) by the tracker's domain – Belong to the tracker's domain •Not sent by HTTP to the originating domain (does not belong) •NOT readable by JS code (does not belong) www.website.com Browser tracker.com GET / Cookies: none Fetch tracking script Contents Cookies for www.website.com: None Cookies for tracker.com: None
  • 22. www.dataiku.com www.website.com Browser Cookies for www.website.com: None tracker.com Cookies for tracker.com: None GET /track Cookies: None 200 OK Set-Cookie: visitor_id=33d7 Tracker code: assign visitor_id Third-party cookies •Set (in HTTP) by the tracker's domain – Belong to the tracker's domain •Not sent by HTTP to the originating domain (does not belong) •NOT readable by JS code (does not belong)
  • 23. www.dataiku.com Third-party cookies •Set (in HTTP) by the tracker's domain – Belong to the tracker's domain •Not sent by HTTP to the originating domain (does not belong) •NOT readable by JS code (does not belong) www.website.com Browser tracker.com Cookies for tracker.com: visitor_id=33d7 GET /track Cookies: visitor_id=33d7 200 OK Tracker code: read visitor_id Cookies for www.website.com: None
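The two third-party exchanges above (assign a visitor_id on the first call, read it back on later calls) can be sketched on the tracker side with Python's standard `http.cookies` module. The `handle_track` function, the 8-character id and the one-year lifetime are illustrative assumptions, not how any particular tracker implements it.

```python
import uuid
from http.cookies import SimpleCookie


def handle_track(cookie_header):
    """Handle one GET /track call on the tracker's domain.

    `cookie_header` is the raw value of the request's Cookie header (or None).
    Returns (visitor_id, response_headers).
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")

    # Returning visitor: the browser sent the tracker's cookie back.
    if "visitor_id" in cookies:
        return cookies["visitor_id"].value, []

    # New visitor: assign an id and hand it over via Set-Cookie.
    visitor_id = uuid.uuid4().hex[:8]          # assumed id format
    out = SimpleCookie()
    out["visitor_id"] = visitor_id
    out["visitor_id"]["path"] = "/"
    out["visitor_id"]["max-age"] = 365 * 24 * 3600  # assumed lifetime: one year
    return visitor_id, [("Set-Cookie", out["visitor_id"].OutputString())]


if __name__ == "__main__":
    first_id, headers = handle_track(None)
    print(first_id, headers)
    print(handle_track("visitor_id=" + first_id))
```

Because the cookie is set by the tracker's own HTTP response, it belongs to the tracker's domain and is never visible to www.website.com or to JS code running there.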
  • 24. www.dataiku.com Why each ? First party cookie •Tracks on a single website •Requires JS code for tracking •Reduced privacy impact: no exchange of information between sites •Usage: track your users' behaviour •Rarely blocked (used for logins) Third party cookie •Tracks across all websites using the same tracker •More frowned upon •Usage: generally ads, but also multi-website tracking •Blocked by up to 40% of visitors
  • 25. www.dataiku.com What are your obligations ? With ALL cookies •You should ask the user whether they want cookies •Even non-tracking-related cookies •Yes, even login-related ones
  • 26. www.dataiku.com What are your obligations ? With third party cookies •Obey the Do-Not-Track header www.website.com Browser tracker.com GET /track Cookies: None DNT: 1 200 OK Tracker code: DO NOTHING
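A minimal sketch of the Do-Not-Track check, assuming the request headers arrive as a plain dictionary. `should_track` is a hypothetical helper name; a real server would read the header through its own request object.

```python
def should_track(headers):
    """Return False when the request carries 'DNT: 1', True otherwise.

    Header names are case-insensitive in HTTP, so they are normalised first.
    """
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("dnt") != "1"


if __name__ == "__main__":
    print(should_track({"DNT": "1"}))          # visitor opted out of tracking
    print(should_track({"User-Agent": "x"}))   # no preference expressed
```

When `should_track` returns False, the tracker answers 200 OK but neither sets a cookie nor records the event, which is the "DO NOTHING" branch on the slide.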
  • 27. www.dataiku.com What are your obligations ? With third party cookies •Provide an opt-out URL •Allows the user to /optin , /optout or /status See in action : www.youronlinechoices.com
  • 28. www.dataiku.com What are your obligations ? With third party cookies •Provide a P3P policy •Else, older IE blocks you "What are you doing with my data ?" Looks like this: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
  • 29. www.dataiku.com Tracking in mobile apps •Preserve battery –Each network call is costly –Do not track everything synchronously •Network access is intermittent –Queue events and wait for network access
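The queue-and-wait approach can be sketched as follows, in Python for readability rather than in a mobile language. The `EventQueue` class, the `send` callback and the batch size are assumptions made for illustration; they are not the API of any tracking library.

```python
from collections import deque
from itertools import islice


class EventQueue:
    """Buffer tracking events and send them in batches when the network is up."""

    def __init__(self, send, max_batch=20):
        self._send = send            # callable(list_of_events) -> True on success
        self._pending = deque()
        self._max_batch = max_batch  # assumed batch size

    def __len__(self):
        return len(self._pending)

    def track(self, event):
        # Never touches the network: recording an event is cheap.
        self._pending.append(event)

    def flush(self, network_available):
        """Send queued events in batches; return how many were sent."""
        sent = 0
        while network_available and self._pending:
            batch = list(islice(self._pending, self._max_batch))
            if not self._send(batch):
                break                # keep the events and retry on the next flush
            for _ in batch:
                self._pending.popleft()
            sent += len(batch)
        return sent


if __name__ == "__main__":
    queue = EventQueue(send=lambda batch: print("sending", batch) or True, max_batch=2)
    for name in ("open", "scroll", "close"):
        queue.track({"event": name})
    queue.flush(network_available=False)  # offline: nothing is sent
    queue.flush(network_available=True)   # online: two batches go out
```

Events are only removed from the queue after a successful send, so a failed or interrupted call costs a retry rather than lost data.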
  • 30. www.dataiku.com So, what are my choices ? •You might really want to be your own web tracker •Most-used open source web tracker: Piwik •Provides both raw data and nice dashboards –MySQL backend –Raw data via API –Slightly less suited for analytics
  • 31. www.dataiku.com WT1 YOUR OWN TRACKER IN MINUTES
  • 32. www.dataiku.com WT1 An open source (Apache License) server to build your own web tracking https://github.com/dataiku/wt1 •Designed to provide you with raw data, directly usable for analytics •Very high performance and scalability
  • 33. www.dataiku.com Features •1st or 3rd party cookies –Handling of DNT and opt-out –Helps handling P3P •Track events or pages with key-value data •Visitor-scope and session-scope variables •"Live view" debugging console
  • 34. www.dataiku.com Features •Dashboards: None •Events processing and storage –Filesystem, S3 –Event queues: Flume –Custom processors •JSON API for custom tracking •iOS library
  • 35. www.dataiku.com Architecture Client-side JS tracker iOS library •1st or 3rd party cookies •Event-level tracking •Automatic batching •Queuing to deal with network interruptions WT1 Server Raw storage •Filesystem •S3 Event processors: •Real-time aggregations •Custom code Event queues •Flume •Kafka, RabbitMQ, … •Java •> 20K events / second •Handles DNT, P3P, opt-out, … JSON POST
  • 36. www.dataiku.com Future work •Android library •More event queues supported OOTB –Kafka –RabbitMQ •Avro storage
  • 37. www.dataiku.com Thank you ! Clément Stenac clement.stenac@dataiku.com @ClementStenac