Agora owns dozens of themed, classified, entertainment and social services: news and sports portals, forums, advertising services, blogs and many other thematic websites. Together these sites generate over 400 page views per second under normal conditions, and considerably more events (likes, focus, clicks and scrolling events). This raises one question: how can user profiles be built in real time in such a dynamic and changing environment?
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-16.html
Agenda
1. What is user profiling?
2. User profiling system using big data technologies (Spark, HBase)
– recipe for profiling system
– algorithm
– technical issues and solutions
3. Enrichment of user profiles by machine learning methods
BigData Spain Conference, October 15, 2015
What is a user profile?
Example profile (Single Customer View): Tom
• male
• has 2 children
• politics, volleyball, cars
• most active on Monday morning
• City: Cracow
• Device: iPhone
• 14 articles read last week
Application domains
• Classification issues:
– propensity to buy
– propensity to churn
– propensity to default (credit scores)
– anomaly detection (e.g., fraud detection)
• User grouping (segmentation)
• Personalised advertising and marketing messaging
• Content personalisation
• Recommendations
Our case
• Input data:
– online data: page views, events
– meta-data of items (articles, blogs, posts, …)
• The problem of data sparsity
• User content engagement in a certain period of time
• Specific user behaviour should be a necessary condition to assign the user to a specific segment (feature) in real time
Example:
– User behaviour: the user has been reading forum threads about care for young children since last week
– Segment/feature: Parent of 0 to 3-year-old child
Our system: Stage 1
Building user daily profiles
User identification
• Main issue: most of our users are not logged in.
• Requirement: storing only non-PII data.
We rely on cookies only.
STEP 1: Tracking and queuing
• JavaScript tracks:
– page views
– events
• Tracking application generates Global User ID (GUID) and session ID
• Data queuing using Apache Kafka:
– Open-source message broker project
– Unified, high-throughput, low-latency platform
• We keep data on Kafka for 3 days.
Diagram: page views, events and cookies → tracking application → Kafka (page views stream, events stream)
tomcat.apache.org, kafka.apache.org
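The slides do not say how the tracking application creates its identifiers. A minimal sketch, assuming random UUIDs kept in two cookies (the cookie names `guid` and `session_id` and the function `identify` are hypothetical), could look like this:

```python
import uuid

# Hypothetical cookie names; the talk does not name them.
GUID_COOKIE = "guid"
SESSION_COOKIE = "session_id"


def identify(cookies):
    """Return (guid, session_id, cookies_to_set) for one tracking request.

    `cookies` is a dict of the cookies sent by the browser. Missing
    identifiers are generated as random UUIDs, which carry no personal data.
    """
    cookies_to_set = {}
    guid = cookies.get(GUID_COOKIE)
    if guid is None:
        guid = uuid.uuid4().hex
        cookies_to_set[GUID_COOKIE] = guid
    session_id = cookies.get(SESSION_COOKIE)
    if session_id is None:
        session_id = uuid.uuid4().hex
        cookies_to_set[SESSION_COOKIE] = session_id
    return guid, session_id, cookies_to_set


# First request: no cookies yet, so both identifiers are created.
guid, session_id, new_cookies = identify({})
# Later request: the browser sends the cookies back and nothing changes.
same_guid, same_session, nothing_new = identify(new_cookies)
```

A real tracker would also give the two cookies different lifetimes (long for the GUID, short for the session); that part is left out here.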
What is Spark Streaming?
• Apache Spark: open-source cluster computing framework.
• Spark Streaming: library for streaming computation as a series of small and deterministic batch jobs:
– Splits the stream into batches of X seconds
– Each batch is treated as an RDD and is processed by RDD operations
– Processed results are returned in batches
Diagram: live data stream → batches of X seconds → Spark RDD engine → processed results of RDD operations
spark.apache.org/streaming, https://databricks-training.s3.amazonaws.com/slides/Spark%20Summit%202014%20-%20Spark%20Streaming.pdf
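The micro-batch idea can be illustrated without Spark. The plain-Python sketch below is not the Spark Streaming API: `split_into_batches` and `process_batch` are hypothetical helpers that cut a list of timestamped records into windows of X seconds and then run one small deterministic job per window.

```python
from collections import defaultdict


def split_into_batches(events, batch_seconds):
    """Group (timestamp, payload) pairs into consecutive windows of batch_seconds."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        batches[int(timestamp // batch_seconds)].append(payload)
    # Windows without any record are skipped in this sketch.
    return [batches[index] for index in sorted(batches)]


def process_batch(batch):
    """Stand-in for RDD operations: count the records of each type in one batch."""
    counts = defaultdict(int)
    for payload in batch:
        counts[payload] += 1
    return dict(counts)


# Timestamps are seconds since the start of the stream (made-up data).
stream = [(0.5, "page_view"), (1.2, "click"), (4.9, "page_view"),
          (5.1, "scroll"), (9.0, "page_view"), (12.3, "click")]

results = [process_batch(batch) for batch in split_into_batches(stream, batch_seconds=5)]
```

Each element of `results` corresponds to one batch, which mirrors the slide's point that processed results are also returned batch by batch.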
STEP 2: Spark Streaming
• We have 2 streams (each with 6 partitions)
– page views
– events
• Streaming (batch) duration: 5 seconds
• Page views (and events) are converted and parsed to obtain: business ID, domain, parts of URL and referer, geolocation, User-Agent, GUID, Visit ID, etc.
Diagram: page views and events → batches of union of input streams → foreachRDD → flatMap → Engine call(page_view/event): processing of a single page view or event
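As an illustration of the parsing step, here is a minimal sketch that uses only Python's standard library. The input record layout and the names `parse_page_view` and `domain_of` are assumptions; stripping a leading "www." is also an assumption, made only so that the result matches the domain form used in the talk's examples (wyborcza.pl).

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host name without a leading 'www.', or None if there is none."""
    host = urlsplit(url or "").hostname
    if host is None:
        return None
    return host[4:] if host.startswith("www.") else host


def parse_page_view(raw):
    """Turn one raw tracking record (a dict) into the fields used downstream."""
    url = urlsplit(raw["url"])
    return {
        "guid": raw["guid"],
        "domain": domain_of(raw["url"]),
        "url_parts": [part for part in url.path.split("/") if part],
        "referer_domain": domain_of(raw.get("referer")),
        "user_agent": raw.get("user_agent"),
    }


# Made-up record; the field names are hypothetical.
raw = {
    "guid": "123ABC",
    "url": "http://www.wyborcza.pl/sport/article.html",
    "referer": "http://www.google.pl/search?q=news",
    "user_agent": "Mozilla/5.0",
}
parsed = parse_page_view(raw)
```

Geolocation, business ID and visit ID are omitted because they need data sources that the slides do not describe.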
Fact definition
Example: page view:
• GUID: 123ABC
• time: 2015-10-15, 15:23:15
• URL: http://www.wyborcza.pl/...
• Referrer: http://www.google.pl/...
• Geo city: Madrid
• Article tags: news, sport
Example of facts, written as <GUID, type of fact, value of fact, time>:
<123ABC, page view on domain, wyborcza.pl, 2015-10-15 15:23:15>
<123ABC, referer domain, google.pl, 2015-10-15 15:23:15>
<123ABC, geolocation city, Madrid, 2015-10-15 15:23:15>
<123ABC, article tag, news, 2015-10-15 15:23:15>
<123ABC, article tag, sport, 2015-10-15 15:23:15>
Fact definition – more formally
Fact – the smallest piece of information describing a relation between a user (GUID) and some feature/element of a page view or event.
The programmer and the data steward decide which types of facts should be extracted.
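The definition can be made concrete with a small sketch. It assumes that a page view has already been parsed into a dict (the field names are hypothetical) and emits one fact per relation between the GUID and a feature of the page view; the fact types are the ones shown in the talk's example.

```python
from collections import namedtuple

# A fact: <GUID, type of fact, value of fact, time>
Fact = namedtuple("Fact", ["guid", "fact_type", "value", "time"])


def extract_facts(page_view):
    """Return the list of facts contained in one parsed page view."""
    guid, time = page_view["guid"], page_view["time"]
    facts = [
        Fact(guid, "page view on domain", page_view["domain"], time),
        Fact(guid, "referer domain", page_view["referer_domain"], time),
        Fact(guid, "geolocation city", page_view["geo_city"], time),
    ]
    # One fact per article tag.
    facts += [Fact(guid, "article tag", tag, time) for tag in page_view["article_tags"]]
    return facts


page_view = {
    "guid": "123ABC",
    "time": "2015-10-15 15:23:15",
    "domain": "wyborcza.pl",
    "referer_domain": "google.pl",
    "geo_city": "Madrid",
    "article_tags": ["news", "sport"],
}
facts = extract_facts(page_view)
```

Adding a new fact type means adding one more line to `extract_facts`, which is one way to read the slide's remark that the set of extracted fact types is a design decision.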
STEP 4: Profiling algorithm
Facts to check:
<123ABC, page view on domain, wyborcza.pl, 15:23>
<123ABC, referer domain, google.pl, 15:23>
<123ABC, geolocation city, Madrid, 15:23>
<123ABC, article tag, news, 15:23>
<123ABC, article tag, sport, 15:23>
Profiling rules:
IF article tag == ‘news’ THEN update segment ‘News’ by 1
IF referer domain == ‘google.pl’ THEN update segment ‘Search’ by ‘Google’
Segments to be updated for GUID 123ABC:
• ‘News’ by value 1
• ‘Search’ by value ‘Google’
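The rule check can be sketched in a few lines. This is an illustration under assumptions, not the system's actual rule engine: rules are modelled as plain tuples (`Rule` and `segments_to_update` are hypothetical names) and every fact is compared against every rule.

```python
from collections import namedtuple

Fact = namedtuple("Fact", ["guid", "fact_type", "value", "time"])
# IF <fact_type> == <value> THEN update segment <segment> by <update>
Rule = namedtuple("Rule", ["fact_type", "value", "segment", "update"])

RULES = [
    Rule("article tag", "news", "News", 1),
    Rule("referer domain", "google.pl", "Search", "Google"),
]


def segments_to_update(facts, rules=RULES):
    """Return (guid, segment, update) for every fact that satisfies a rule."""
    updates = []
    for fact in facts:
        for rule in rules:
            if fact.fact_type == rule.fact_type and fact.value == rule.value:
                updates.append((fact.guid, rule.segment, rule.update))
    return updates


facts = [
    Fact("123ABC", "page view on domain", "wyborcza.pl", "15:23"),
    Fact("123ABC", "referer domain", "google.pl", "15:23"),
    Fact("123ABC", "geolocation city", "Madrid", "15:23"),
    Fact("123ABC", "article tag", "news", "15:23"),
    Fact("123ABC", "article tag", "sport", "15:23"),
]
updates = segments_to_update(facts)
```

Updates come out in the order of the facts, so the ‘Search’ update precedes the ‘News’ update here. With many rules, indexing them by fact type would avoid the nested loop.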
STEP 5: Storing profiles in HBase
• Data are stored in HBase by bulk operations after each successfully processed Spark batch
• Statistics are stored in Redis
HBase
• open source, non-relational, distributed database
• provides BigTable-like capabilities for Hadoop
• fault-tolerant way of storing large quantities of sparse data
Diagram: foreachRDD → Engine call(page_view/event) → Parser → Fact Extraction Modules → facts → Profiling (returns segments to be updated) → Database Manager
hbase.apache.org
Resources in Spark – tips and tricks
• Resource managers as singletons.
• The first call() method on a worker initializes:
– a singleton with resources (for example database connections),
– a shutdown hook which will close all resources on application exit or fault.
• Each worker manages and keeps its own resources independently.
• Each resource on each worker is initialized only once.
Diagram: Driver (SparkContext) → Cluster Manager (for example YARN) → Worker Node 1 … Worker Node N; on a worker node, the executor's tasks (Task 1, Task 2) use the same HBase singleton, which holds the HBase connection
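The pattern itself is independent of Spark and HBase. The sketch below shows it in plain Python under stated assumptions: `ResourceManager` is a hypothetical class, `FakeConnection` stands in for a real database connection, and `atexit` plays the role of the shutdown hook.

```python
import atexit
import threading


class ResourceManager:
    """Process-wide singleton that owns the worker's resources."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, connect):
        self.connection = connect()
        # Shutdown hook: release the resources when the process exits.
        atexit.register(self.close)

    @classmethod
    def get(cls, connect):
        """Create the singleton on first use, then always return the same object."""
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(connect)
            return cls._instance

    def close(self):
        if self.connection is not None:
            self.connection.close()
            self.connection = None


class FakeConnection:
    """Stand-in for a database connection; counts how often one is opened."""

    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def close(self):
        pass


def call(record):
    """Per-record function: the first invocation in a process opens the connection."""
    manager = ResourceManager.get(FakeConnection)
    return record, id(manager.connection)


first = call("page view 1")
second = call("page view 2")
```

Both calls see the same connection object, and `FakeConnection.opened` stays at 1 however many records the process handles.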
HBase – row key design
• Rows are sorted lexicographically by row key
• Hash keys if you want to distribute rows across the regions (on the servers of the cluster)
• For efficient scanning use some suffixes (separated by dashes):
– For time series data use a timestamp or a [year]-[month]-[day]-[hour]-[minute] structure.
Our HBase row key format:
[GUID]-[year]-[month]-[day]
http://hbase.apache.org/0.94/book/rowkey.design.html
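A short sketch of how such keys could be built. `daily_row_key` follows the [GUID]-[year]-[month]-[day] format from the slide. `salted_row_key` illustrates the general "hash keys" advice with a hash-derived bucket prefix; that prefix is an assumption and not something the talk says the system uses.

```python
import hashlib
from datetime import date


def daily_row_key(guid, day):
    """Row key in the [GUID]-[year]-[month]-[day] format, zero-padded so keys sort by date."""
    return "{}-{:04d}-{:02d}-{:02d}".format(guid, day.year, day.month, day.day)


def salted_row_key(guid, day, buckets=16):
    """Variant with a bucket prefix derived from a hash of the GUID.

    The prefix spreads GUIDs over `buckets` key ranges while keeping all
    rows of one GUID adjacent, because the prefix depends only on the GUID.
    """
    digest = hashlib.sha256(guid.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return "{:02d}-{}".format(bucket, daily_row_key(guid, day))


key = daily_row_key("123ABC", date(2015, 10, 15))
days = [date(2015, 10, 15), date(2015, 9, 30), date(2015, 10, 1)]
sorted_keys = sorted(daily_row_key("123ABC", d) for d in days)
```

Because the date parts are zero-padded, sorting the keys as strings also sorts them chronologically, so one user's daily profiles for a period can be read with a single range scan.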
Our system: Stage 2
Aggregation of daily user profiles and sharing
Architecture
• Final user profiles are shared by a REST web service
• Spring as a web application framework
• Spring Hadoop library for HBase connection management
• Statistics are stored in a MySQL database
• We take permission issues into account:
– aggregated data are divided by business IDs
Diagram: an external system / client sends a REST query with a GUID to the User Profile Web Service (Spring Framework, Spring Hadoop library, config); the service aggregates the daily profiles stored in HBase, returns the profile as JSON and stores statistics in MySQL
https://spring.io, http://docs.spring.io/spring-hadoop/docs/2.3.0.M3/reference/html/springandhadoop-hbase.html
Profile aggregation algorithm
• For a specified input GUID we aggregate each existing segment (feature)
• Each segment is aggregated for a specific period of time
• There are many aggregation methods corresponding to different output formats
Example: daily profiles for GUID 123ABC
– 2015-09-12: Sport 3, City Poznan
– 2015-10-13: City Gdansk
– 2015-10-14: Sport 7, City Poznan
– Today: Sport 2, City Poznan
Aggregation settings:
– Sport: 7 days, output: true if value>5
– City: 14 days, output: mode
– Number of articles: 3 days, output: sum
Aggregated profile for 123ABC: Sport true, City Poznan
Number of articles values shown on the slide: 3, 1, 3, 2
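The aggregation can be sketched as one generic function plus one combiner per output format. Several things here are assumptions: "Today" is taken to be 2015-10-15 (the date of the talk), a window of N days is taken to include today, and the per-day article counts are made up because the slide does not make them recoverable. The Sport and City values are the ones from the example.

```python
from collections import Counter
from datetime import date, timedelta


def aggregate(daily, today, days, combine):
    """Combine the daily values of one segment over the last `days` days (today included)."""
    first_day = today - timedelta(days=days - 1)
    values = [value for day, value in sorted(daily.items())
              if first_day <= day <= today]
    return combine(values)


def mode(values):
    """Most frequent value, or None for an empty window."""
    return Counter(values).most_common(1)[0][0] if values else None


TODAY = date(2015, 10, 15)  # assumption: "Today" on the slide

sport = {date(2015, 9, 12): 3, date(2015, 10, 14): 7, TODAY: 2}
city = {date(2015, 9, 12): "Poznan", date(2015, 10, 13): "Gdansk",
        date(2015, 10, 14): "Poznan", TODAY: "Poznan"}
articles = {date(2015, 10, 14): 1, TODAY: 2}  # hypothetical counts

profile = {
    "Sport": aggregate(sport, TODAY, 7, lambda values: sum(values) > 5),
    "City": aggregate(city, TODAY, 14, mode),
    "Number of articles": aggregate(articles, TODAY, 3, sum),
}
```

The September value for Sport falls outside the 7-day window, so only 7 and 2 are summed; their total exceeds 5 and the segment aggregates to true, in line with the slide's result.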
Solved issues
• Kafka integration with Spark Streaming
• Parallelism of data streams (stream division)
• Resource management in Spark
• Processing time
• Security of Spark: Kerberos integration
Enrichment of user profiles by machine learning methods
How can users be classified into segments?
Matching segments to users
• We want to assign a user (with a specific profile) to further segments
– the user vector consists of the user’s segments
• All segments are treated as classes (labels)
• Online classification:
– the model is learnt in real time
– the model can be used for real-time prediction
Diagram: user 123ABC with the segments Football, Motorbikes, has children, Handball, Swimming, Politics, Economy, Toys, Child car seats, Mobile, Cars and Animals; five of them are marked "?" (membership to be predicted)
Multi-label classification
• Learning algorithm: Binary Relevance
– independent binary classifier for each label (segment)
– each classifier is trained on existing user profiles (the user belongs to the segment or not)
• Each model returns a boolean or a probability
• Prediction algorithm returns the results of the binary models for a given user profile vector
Diagram: profiles 1 … N train one binary classifier per segment (Segment 1, Segment 2, …, Segment K); for Profile n, the prediction algorithm selects the ‘1’s or the labels with probability>0.5 and returns the list of recommended segments for Profile n
Tsoumakas, Grigorios; Katakis, Ioannis (2007). "Multi-label classification: an overview". International Journal of Data Warehousing & Mining 3 (3): 1–13.
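Binary Relevance can be sketched with any binary learner. The example below is a toy illustration, not the talk's implementation: it uses a hand-written online logistic regression as the per-label classifier (the slides do not name the base classifier), and the three input features and two target segments are made up.

```python
import math


class OnlineLogistic:
    """Binary logistic regression updated one example at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


class BinaryRelevance:
    """One independent binary classifier per label (segment)."""

    def __init__(self, labels, n_features):
        self.models = {label: OnlineLogistic(n_features) for label in labels}

    def update(self, x, user_labels):
        # Each classifier sees the same vector with its own 0/1 target.
        for label, model in self.models.items():
            model.update(x, 1.0 if label in user_labels else 0.0)

    def predict(self, x, threshold=0.5):
        return [label for label, model in self.models.items()
                if model.predict_proba(x) > threshold]


# Hypothetical features: [has children, Football, Politics]
training = [
    ([1, 0, 0], {"Toys"}),
    ([0, 1, 0], {"Cars"}),
    ([1, 1, 0], {"Toys", "Cars"}),
    ([0, 0, 1], set()),
]
model = BinaryRelevance(["Toys", "Cars"], n_features=3)
for _ in range(300):
    for x, labels in training:
        model.update(x, labels)

recommended = model.predict([1, 1, 0])
```

Because every label has its own model, a new segment can be added by training one more classifier without touching the others; the price is that correlations between labels are ignored.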
Online learning with Spark
MLlib: Spark’s machine learning (ML) library
• scalable ML algorithms
• classification
• regression
• clustering
• collaborative filtering
• dimensionality reduction
• lower-level optimization primitives
• higher-level pipeline APIs
Streaming linear regression in MLlib
• allows regression models to be fitted online
• model parameters are fitted in a way similar to offline fitting
• fitting occurs on each batch of data
spark.apache.org/mllib
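The idea of fitting on each batch can be shown without MLlib. The plain-Python sketch below is an illustration only: `update_on_batch` (a hypothetical name) runs a few gradient-descent steps on the current batch, starting from the parameters left by the previous batch. The data are synthetic and noise-free, generated from y = 2x + 1.

```python
def update_on_batch(weights, intercept, batch, step_size=0.5, iterations=20):
    """Refine the current linear model with gradient descent on one batch.

    `batch` is a list of (features, target) pairs; the loss is mean squared error.
    """
    n = len(batch)
    for _ in range(iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in batch:
            error = intercept + sum(w * xi for w, xi in zip(weights, x)) - y
            grad_w = [g + error * xi / n for g, xi in zip(grad_w, x)]
            grad_b += error / n
        weights = [w - step_size * g for w, g in zip(weights, grad_w)]
        intercept -= step_size * grad_b
    return weights, intercept


def make_batch(k):
    """Synthetic batch number k with targets y = 2x + 1 (made-up data)."""
    offset = (k % 5) * 0.05
    return [([x + offset], 2.0 * (x + offset) + 1.0) for x in (0.0, 0.25, 0.5, 0.75)]


# The model persists between batches; each batch only refines it.
weights, intercept = [0.0], 0.0
for k in range(30):
    weights, intercept = update_on_batch(weights, intercept, make_batch(k))
```

After the 30 batches the parameters are close to the generating values (slope 2, intercept 1). The update is the same gradient step an offline fit would use, applied to whatever data the current batch contains.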