SlideShare une entreprise Scribd logo
1  sur  31
Télécharger pour lire hors ligne
User Behaviour Tracking
Track - Store - Process
!
//Florian Pfeiffer - Head of Data&Infrastructure - gutefrage.net

!
Vision
„Let’s build our own Google
Analytics“
Why

analytics does sampling
we want the (raw) data
Ideas,Thoughts&Goals
fast / minimal impact on page loading time
high availability
track user over multiple platforms
storage engine? -> hbase
Infrastructure
Numbers!
10-20ms Response Time per pixel
record for now: ~2500 concurrent reqs
1,5 billion entries in Hbase
10 Nodes in Hadoop Cluster
Serving Infrastructure

Loadbalancers & RR DNS
nginx with empty_gif module (~2ms)
data is written to logfile
Storing Infrastructure
every nginx node has flume-ng
flume ingests logfile
AsyncHBaseSink with custom Serializer
direct writes to HBase
why flume?

we had it already in production ;)
Storm might be an interesting alternative
HBase rowkey design
Why?

You can scan through all data and use filters
for selecting specific data
But scanning with start & stop row speeds
things up (a lot)
HBase rowkey design
Do I need a fast user or a fast timespan
lookup?
User - clientid,ts<,connectionId>
Timespan - ts,clientid<,connectionId>
Inverse Timestamps
Data in HBase is stored lexicographicaly
sorted
Normal TS - scan would yield oldest results
first
Inverse TS - newer entries come first (and you
can cancel the scan if you have enough data)
Cross Domain Tracking
(Flash)Cookies
Fingerprinting
Etag
HTML5 Storage
The olden times…
or
Cookies

Easy to drop a 3rd party cookie with userId
on different websites
Gets more and more blocked (Safari, FF..)
Fingerprinting
Yields interesting results on desktop, difficult
on e.g. iPhone
invisible to user
Last resort if everything else fails?
Etag

Quite new, based on browser cache
sounds interesting
HTML5 Storage

Store data in local HTML5 storage
Retrieve data with Cross Domain Messaging
Store data

e.g. UserId, SessionId, GeoIP, URL, action,
data
Batch Processing

Calculate how many users are active on
platform A and also on B
Get Traffic of all Questions belonging to
Channel X sorted by Country
Now to something
completely different…
demo
Recommendations
with Myrrix
Myrrix

Evolution: taste -> mahout -> myrrix (-> oryx)
Recommender based on ALS
Recommendations @
GF.net
User emit signals on questions
view, like, gives answer, answer is voted best
Application sends signals through RabbitMQ
to recommendation servers
YEAH
but what happens, when a new user signs up?
?
Fetch data from tracking
and feed it into myrrix
Collecting&Storing data
works great
using & processing is another thing ;)
Datageeks

Contenu connexe

Similaire à Datageeks

Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 
Krug Fat Client
Krug Fat ClientKrug Fat Client
Krug Fat Client
Paul Klipp
 
Big data on_aws in korea by abhishek sinha (lunch and learn)
Big data on_aws in korea by abhishek sinha (lunch and learn)Big data on_aws in korea by abhishek sinha (lunch and learn)
Big data on_aws in korea by abhishek sinha (lunch and learn)
Amazon Web Services Korea
 

Similaire à Datageeks (20)

UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Krug Fat Client
Krug Fat ClientKrug Fat Client
Krug Fat Client
 
Aerospike for machine learning
Aerospike for machine learningAerospike for machine learning
Aerospike for machine learning
 
HTML5 Technical Executive Summary
HTML5 Technical Executive SummaryHTML5 Technical Executive Summary
HTML5 Technical Executive Summary
 
Splunk Stream - Einblicke in Netzwerk Traffic
Splunk Stream - Einblicke in Netzwerk TrafficSplunk Stream - Einblicke in Netzwerk Traffic
Splunk Stream - Einblicke in Netzwerk Traffic
 
Scaling 101 test
Scaling 101 testScaling 101 test
Scaling 101 test
 
Scaling 101
Scaling 101Scaling 101
Scaling 101
 
Would Mr. Spok choose Open Source
Would Mr. Spok choose Open SourceWould Mr. Spok choose Open Source
Would Mr. Spok choose Open Source
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
Html5 workshop part 1
Html5 workshop part 1Html5 workshop part 1
Html5 workshop part 1
 
Big data on_aws in korea by abhishek sinha (lunch and learn)
Big data on_aws in korea by abhishek sinha (lunch and learn)Big data on_aws in korea by abhishek sinha (lunch and learn)
Big data on_aws in korea by abhishek sinha (lunch and learn)
 
Html5
Html5Html5
Html5
 
Server Monitoring (Scaling while bootstrapped)
Server Monitoring  (Scaling while bootstrapped)Server Monitoring  (Scaling while bootstrapped)
Server Monitoring (Scaling while bootstrapped)
 
Making it fast: Zotonic & Performance
Making it fast: Zotonic & PerformanceMaking it fast: Zotonic & Performance
Making it fast: Zotonic & Performance
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
 
HTML5 - The Python Angle (PyCon Ireland 2010)
HTML5 - The Python Angle (PyCon Ireland 2010)HTML5 - The Python Angle (PyCon Ireland 2010)
HTML5 - The Python Angle (PyCon Ireland 2010)
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Datageeks