Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jachnik at Big Data Spain 2015

Real-time User Profiling
based on Spark Streaming and HBase
Arkadiusz Jachnik
BigData Spain Conference, October 15, 2015

Arkadiusz Jachnik
•  Data Scientist, AGORA S.A.
•  PhD Student, Poznan University of Technology
•  User Profiling, Text Classification, Big Data, Machine Learning, Multi-class classification, Multi-label classification, Recommendation

Polish Media Company
•  Press, Magazines, Internet, Cinemas, Advertising, Radio, TV, Books

Agenda
1.  What is user profiling?
2.  User profiling system using big data technologies (Spark, HBase)
–  recipe for profiling system
–  algorithm
–  technical issues and solutions
3.  Enrichment of user profiles by machine learning methods

What is a user profile?
Single Customer View, example user Tom:
•  male
•  has 2 children
•  politics, volleyball, cars
•  most active on Monday morning
•  City: Cracow
•  Device: iPhone
•  14 articles read last week

Application domains
•  Classification issues:
–  propensity to buy
–  propensity to churn
–  propensity to default (credit scores)
–  anomaly detection (e.g., fraud detection)
•  User grouping (segmentation)
•  Personalised advertising and marketing messaging
•  Content personalisation
•  Recommendations

Our system
Introduction

Our case
•  Input data:
–  online data: page views, events
–  meta-data of items (articles, blogs, posts, …)
•  The problem of data sparsity
•  User content engagement in a certain period of time
•  Specific user behaviour should be a necessary condition to assign the user to a specific segment (feature) in real-time
Example:
•  User behaviour: the user has been reading forum threads about care for young children since last week
•  Segment/feature: Parent of 0 to 3-year-old child

Workflow
1.  Building daily profiles
2.  Daily profiles aggregation and sharing

Our system: Stage 1
Building user daily profiles

STEP 1: Tracking and queuing ...

User identification
•  Main issue: most of our users are not logged in.
•  Requirement: storing only non-PII data.
We rely on cookies only.

STEP 1: Tracking and queuing
•  JavaScript tracks:
–  page views
–  events
•  Tracking application generates Global User ID (GUID) and session ID
•  Data queuing using Apache Kafka:
–  Open-source message broker project
–  Unified, high-throughput, low-latency platform
•  We keep data on Kafka for 3 days.
Diagram: page views, events and cookies → Tracking application → page views stream, events stream
tomcat.apache.org, kafka.apache.org

STEP 2: Spark Streaming ...

What is Spark Streaming?
•  Apache Spark: open source cluster computing framework.
•  Spark Streaming: library for streaming computation as a series of small and deterministic batch jobs:
–  Splits stream into batches of X seconds
–  Each batch is treated as RDD and is processed by RDD operations
–  Processed results are returned in batches
Diagram: live data stream → batches of X seconds → Spark RDD Engine → processed results of RDD operations
spark.apache.org/streaming, https://databricks-training.s3.amazonaws.com/slides/Spark%20Summit%202014%20-%20Spark%20Streaming.pdf
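The micro-batch idea can be sketched without Spark. The plain-Python example below is only an illustration of splitting a stream into batches of X seconds and processing each batch as one unit; the function name and the sample events are hypothetical, and it does not use the Spark Streaming API.

```python
from collections import defaultdict


def split_into_batches(events, batch_seconds):
    """Group (timestamp_seconds, payload) pairs into fixed-length batches.

    Returns a list of (batch_start, [payloads]) sorted by batch_start.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        batch_start = (ts // batch_seconds) * batch_seconds
        batches[batch_start].append(payload)
    return sorted(batches.items())


# Hypothetical stream: (seconds since start, event name)
stream = [(0, "pv1"), (2, "pv2"), (5, "ev1"), (9, "pv3"), (11, "pv4")]

# Each batch is processed as a whole, similar to one RDD per interval
results = [(start, len(items)) for start, items in split_into_batches(stream, 5)]
print(results)
```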

STEP 2: Spark Streaming
•  We have 2 streams (each with 6 partitions)
–  page views
–  events
•  Streaming duration: 5 seconds
•  Page views (and events) are converted and parsed to obtain: business ID, domain, parts of URL and referer, geolocation, User-Agent, GUID, Visit ID, etc.
Diagram: page views, events → batches of union of input streams → foreachRDD → flatMap → Engine call(page_view/event): a single page view or event processing
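As a rough illustration of the parsing step, the sketch below turns one raw page view into a few of the fields listed above. The record layout, the field names and the example URLs are assumptions made for this example; the real parser and its input format are not shown in the slides.

```python
from urllib.parse import urlparse


def parse_page_view(raw):
    """Extract domain, URL parts and referer domain from a raw record.

    `raw` is a hypothetical dict with 'guid', 'url' and optional
    'referer' and 'user_agent' keys.
    """
    url = urlparse(raw["url"])
    referer = urlparse(raw.get("referer", ""))
    return {
        "guid": raw["guid"],
        "domain": url.hostname,
        "url_parts": [part for part in url.path.split("/") if part],
        "referer_domain": referer.hostname,  # None when there is no referer
        "user_agent": raw.get("user_agent"),
    }


raw = {
    "guid": "123ABC",
    "url": "http://www.example.com/sport/some-article",
    "referer": "http://www.example.org/search?q=news",
    "user_agent": "Mozilla/5.0",
}
parsed = parse_page_view(raw)
print(parsed)
```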

STEP 3: Fact Extraction ...

Fact definition
Example: page view:
•  GUID: 123ABC
•  time: 2015-10-15, 15:23:15
•  URL: http://www.wyborcza.pl/...
•  Referrer: http://www.google.pl/...
•  Geo city: Madrid
•  Article tags: news, sport
Example of facts:
<GUID, type of fact, value of fact, time>
<123ABC, page view on domain, wyborcza.pl, 2015-10-15 15:23:15>
<123ABC, referer domain, google.pl, 2015-10-15 15:23:15>
<123ABC, geolocation city, Madrid, 2015-10-15 15:23:15>
<123ABC, article tag, news, 2015-10-15 15:23:15>
<123ABC, article tag, sport, 2015-10-15 15:23:15>
Fact definition – more formally
Fact – the smallest piece of information describing a relation between a user (GUID) and some feature/element of a page view or event.
Programmer and data steward decide what types of facts should be extracted.
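One possible way to represent facts in code is a four-field tuple (GUID, type of fact, value of fact, time). The sketch below rebuilds the five example facts from an already parsed page view; the `Fact` class, the `extract_facts` function and the dictionary keys are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    guid: str
    fact_type: str
    value: str
    time: str


def extract_facts(page_view):
    """Turn one parsed page view into a list of facts."""
    guid, time = page_view["guid"], page_view["time"]
    facts = [
        Fact(guid, "page view on domain", page_view["domain"], time),
        Fact(guid, "referer domain", page_view["referer_domain"], time),
        Fact(guid, "geolocation city", page_view["geo_city"], time),
    ]
    # One fact per article tag
    facts += [Fact(guid, "article tag", tag, time) for tag in page_view["article_tags"]]
    return facts


page_view = {
    "guid": "123ABC",
    "time": "2015-10-15 15:23:15",
    "domain": "wyborcza.pl",
    "referer_domain": "google.pl",
    "geo_city": "Madrid",
    "article_tags": ["news", "sport"],
}
facts = extract_facts(page_view)
for fact in facts:
    print(fact)
```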

STEP 4: Profiling algorithm ...

Profiling rules
IF referer contains ‘google.pl’
THEN update feature ‘Search’ by ‘1’
where:
•  referer: type of fact to check
•  ‘google.pl’: value to check in fact
•  contains: symbol
•  update feature ‘Search’ by ‘1’: which segment and how to update if rule is fulfilled
Rules can be stored in DB rows:
•  type of fact,
•  value to check,
•  symbol,
•  ID of segment to update,
•  value to update in segment
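A rule row with these five fields could be evaluated roughly as follows. This is a minimal sketch: the tuple layout follows the list above, but the set of supported symbols, the names `SYMBOLS`, `RULES` and `matching_updates`, and the example referer strings are assumptions.

```python
import operator

# Supported comparison symbols; this set is an assumption for illustration
SYMBOLS = {
    "==": operator.eq,
    "contains": lambda fact_value, rule_value: rule_value in fact_value,
}

# Each rule row: (type of fact, value to check, symbol, segment ID, update value)
RULES = [
    ("referer", "google.pl", "contains", "Search", 1),
]


def matching_updates(fact_type, fact_value, rules=RULES):
    """Return (segment, update value) pairs for every rule the fact fulfils."""
    return [
        (segment, update)
        for rule_type, rule_value, symbol, segment, update in rules
        if rule_type == fact_type and SYMBOLS[symbol](fact_value, rule_value)
    ]


print(matching_updates("referer", "http://www.google.pl/search"))
print(matching_updates("referer", "http://www.example.com/"))
```

Keeping the comparison functions in a lookup table means a new symbol can be added without touching the matching loop.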

STEP 4: Profiling algorithm
Facts to check:
<123ABC, page view on domain, wyborcza.pl, 15:23>
<123ABC, referer domain, google.pl, 15:23>
<123ABC, geolocation city, Madrid, 15:23>
<123ABC, article tag, news, 15:23>
<123ABC, article tag, sport, 15:23>
Profiling rules:
IF article tag == ‘news’
THEN update segment ‘News’ by 1
IF referer domain == ‘google.pl’
THEN update segment ‘Search’ by ‘Google’
Segments to be updated for GUID 123ABC:
•  ‘News’ by value 1
•  ‘Search’ by value ‘Google’
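The walk-through can be reproduced with a small sketch that checks every fact against the two example rules and collects the segment updates per GUID. Treating numeric update values as increments and string values as overwrites is an assumption of this example, as are the function and variable names.

```python
from collections import defaultdict

facts = [
    ("123ABC", "page view on domain", "wyborcza.pl"),
    ("123ABC", "referer domain", "google.pl"),
    ("123ABC", "geolocation city", "Madrid"),
    ("123ABC", "article tag", "news"),
    ("123ABC", "article tag", "sport"),
]

# (type of fact, expected value) -> (segment, update value); equality rules only
rules = {
    ("article tag", "news"): ("News", 1),
    ("referer domain", "google.pl"): ("Search", "Google"),
}


def segments_to_update(facts, rules):
    """Collect segment updates per GUID for all facts that fulfil a rule."""
    updates = defaultdict(dict)
    for guid, fact_type, value in facts:
        rule = rules.get((fact_type, value))
        if rule is None:
            continue  # no rule matches this fact
        segment, update = rule
        if isinstance(update, int):
            # Assumption: numeric updates are added to the current value
            updates[guid][segment] = updates[guid].get(segment, 0) + update
        else:
            # Assumption: non-numeric updates replace the current value
            updates[guid][segment] = update
    return dict(updates)


print(segments_to_update(facts, rules))
```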

STEP 5: Storing profiles in HBase
•  Data are stored in HBase by bulk operations after each successfully processed Spark batch
•  Statistics are stored in Redis
HBase
•  open source, non-relational, distributed database
•  provides BigTable-like capabilities for Hadoop
•  fault-tolerant way of storing large quantities of sparse data
Diagram: foreachRDD → ... → Engine call(page_view/event) → Parser → Fact Extraction Modules → facts → Profiling → returns segments to be updated → Database Manager
hbase.apache.org

Resources in Spark – tips and tricks
•  Resource managers as singletons.
•  First call() method on a worker initializes:
–  singleton with resources (for example database connections),
–  shutdown hook which will close all resources on application exit or fault.
•  Each worker manages and keeps its own resources independently.
•  Each resource on each worker is initialized only once.
Diagram: Driver (SparkContext) → Cluster Manager (for example Yarn) → Worker Node 1 … Worker Node N; on each worker node an Executor runs tasks (Task 1, …) that use the HBase Singleton holding the HBase Connection
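The singleton pattern described here can be sketched in plain Python. The example below is an analogy rather than the system's code: `ResourceManager` and `call` are hypothetical names, the "connection" is a stand-in dictionary, and `atexit` plays the role of a shutdown hook (it runs on normal interpreter exit, not when the process is killed).

```python
import atexit
import threading


class ResourceManager:
    """Process-wide holder for expensive resources (hypothetical example)."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        # Stand-in for opening a real database connection
        self.connection = {"open": True}
        # Close resources when the worker process exits normally
        atexit.register(self.close)

    @classmethod
    def get(cls):
        # Double-checked locking: the instance is created on the first call only
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def close(self):
        self.connection["open"] = False


def call(record):
    """Per-record function; every call reuses the same resources."""
    manager = ResourceManager.get()
    return id(manager), record


first, _ = call("page_view_1")
second, _ = call("page_view_2")
print(first == second)
```

Because the instance lives at class level, each worker process builds its own copy once and later tasks in that process reuse it.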

HBase – row key design
•  Rows are sorted lexicographically by row key
•  Hash keys if you want to distribute rows across the regions (and thus across the servers of the cluster)
•  For efficient scanning use some suffixes (separated by dashes):
–  For time series data use a timestamp or [year]-[month]-[day]-[hour]-[minute] structure.
Our HBase row key format:
[GUID]-[year]-[month]-[day]
http://hbase.apache.org/0.94/book/rowkey.design.html
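A minimal sketch of building such keys, assuming zero-padded date parts so that lexicographic order matches chronological order. The helper names are hypothetical; the start and stop keys follow the usual HBase scan convention of an inclusive start row and an exclusive stop row.

```python
from datetime import date, timedelta


def row_key(guid, day):
    """Build a key in the [GUID]-[year]-[month]-[day] format."""
    return f"{guid}-{day:%Y-%m-%d}"


def scan_range(guid, last_day, days):
    """Start (inclusive) and stop (exclusive) keys for `days` days ending at last_day."""
    start = row_key(guid, last_day - timedelta(days=days - 1))
    stop = row_key(guid, last_day + timedelta(days=1))
    return start, stop


print(row_key("123ABC", date(2015, 10, 15)))
print(scan_range("123ABC", date(2015, 10, 15), 7))
```

With the GUID as prefix, all daily profiles of one user sit next to each other, so a period of days can be read with a single range scan.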

Our system: Stage 2
Aggregation of daily user profiles
and sharing

Architecture
•  Final user profiles are shared by REST web service
•  Spring as a web application framework
•  Spring Hadoop library for HBase connection management
•  Statistics are stored in MySQL database
•  We take permission issues into account:
–  aggregated data are divided by business IDs
Diagram: External system / Client sends REST query with GUID → User Profile Web Service (Spring Framework, SpringHadoop Library, Config) → daily profiles aggregation over Daily Profiles (HBase) → Profile (JSON) returned to the client; Statistics (MySQL)
https://spring.io, http://docs.spring.io/spring-hadoop/docs/2.3.0.M3/reference/html/springandhadoop-hbase.html

Profile aggregation algorithm
•  For a specified input GUID we aggregate each existing segment (feature)
•  Each segment is aggregated for a specific period of time
•  There are many aggregation methods corresponding to different output formats
Aggregation settings per segment:
•  Sport: 7 days, output: true if value>5
•  City: 14 days, output: mode
•  Number of articles: 3 days, output: sum
Example (daily profiles of GUID 123ABC):
<123ABC, 2015-09-12, Sport: 3, City: Poznan>
<123ABC, 2015-10-13, City: Gdansk>
<123ABC, 2015-10-14, Sport: 7, City: Poznan>
<123ABC, Today, Sport: 2, City: Poznan>
<123ABC, AGGREGATED, Sport: true, City: Poznan>
Number of articles column (values as extracted, row order unclear): 3, 1, 3, 2
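The Sport and City aggregations from the example can be sketched as below. Two points are assumptions of this sketch: "Today" is taken to be 2015-10-15, and "true if value>5" is read as a threshold on the sum over the window. The Number of articles segment would follow the same pattern with a 3-day sum and is left out because its daily values are not recoverable here.

```python
from collections import Counter
from datetime import date, timedelta

# Daily profiles of GUID 123ABC; the last date stands for "Today" (assumed)
daily = {
    date(2015, 9, 12): {"Sport": 3, "City": "Poznan"},
    date(2015, 10, 13): {"City": "Gdansk"},
    date(2015, 10, 14): {"Sport": 7, "City": "Poznan"},
    date(2015, 10, 15): {"Sport": 2, "City": "Poznan"},
}


def window(daily, segment, today, days):
    """Values of one segment over the last `days` days, today included."""
    first = today - timedelta(days=days - 1)
    return [
        profile[segment]
        for day, profile in daily.items()
        if first <= day <= today and segment in profile
    ]


def aggregate(daily, today):
    sport = window(daily, "Sport", today, 7)
    city = window(daily, "City", today, 14)
    return {
        # Assumption: the threshold applies to the sum over the window
        "Sport": sum(sport) > 5,
        # Mode: the most frequent value in the window
        "City": Counter(city).most_common(1)[0][0] if city else None,
    }


print(aggregate(daily, date(2015, 10, 15)))
```

The 2015-09-12 profile falls outside both windows, so it does not contribute to the aggregated result.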

Solved issues
•  Kafka integration with Spark Streaming
•  Parallelism of data streams (stream division)
•  Resources management in Spark
•  Processing time
•  Security of Spark: Kerberos integration

Enrichment of user profiles
by machine learning methods
How to classify users to the segments?

Matching segments to users
•  We want to classify a user (of a specific profile) to other segments
–  user vector consists of user’s segments
•  All segments are treated as classes (labels)
•  Online classification:
–  model is learnt in real time
–  model can be used for real-time prediction
Example: user 123ABC
•  known segments: Football, Motorbikes, has children, Handball, Swimming, Politics, Economy
•  candidate segments (?): Toys, Child car seats, Mobile, Cars, Animals

Multi-label classification
•  Learning algorithm: Binary Relevance
–  independent binary classifier for each label (segment)
–  each classifier is learnt from existing user profiles (user belongs to the segment or not)
•  Each model returns boolean or probability
•  Prediction algorithm returns results of binary models for a given user profile vector
Diagram: Profile 1, Profile 2, …, Profile N → Binary classifier Segment 1, Binary classifier Segment 2, …, Binary classifier Segment K; Profile n → Prediction algorithm (select ‘1’s or labels with probability>0.5) → List of recommended segments for Profile n
Tsoumakas, Grigorios; Katakis, Ioannis (2007). "Multi-label classification: an overview". International Journal of Data Warehousing & Mining 3 (3): 1–13.
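Binary Relevance can be illustrated with a small self-contained sketch: one logistic-regression model per segment, trained by stochastic gradient descent on the other segments as features. The training profiles, the choice of logistic regression, the hyperparameters and all names are assumptions made for this example and are not taken from the described system.

```python
import math

SEGMENTS = ["has children", "Toys", "Child car seats", "Football", "Cars"]

# Hypothetical training profiles: the set of segments each user belongs to
profiles = [
    {"has children", "Toys", "Child car seats"},
    {"has children", "Toys"},
    {"has children", "Toys", "Cars"},
    {"Football", "Cars"},
    {"Football"},
    {"Cars"},
]


def features(profile, label):
    """Binary vector over all segments except the one being predicted."""
    return [1.0 if s in profile else 0.0 for s in SEGMENTS if s != label]


def predict(model, x):
    w, b = model
    return 1.0 / (1.0 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))


def train_binary(label, profiles, epochs=200, lr=0.5):
    """Fit one independent binary classifier for a single label."""
    w, b = [0.0] * (len(SEGMENTS) - 1), 0.0
    for _ in range(epochs):
        for profile in profiles:
            x = features(profile, label)
            y = 1.0 if label in profile else 0.0
            err = predict((w, b), x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Binary Relevance: one model per segment, trained independently
models = {label: train_binary(label, profiles) for label in SEGMENTS}


def recommend(profile, threshold=0.5):
    """Segments the user is not in yet whose predicted probability exceeds the threshold."""
    return [
        label
        for label, model in models.items()
        if label not in profile and predict(model, features(profile, label)) > threshold
    ]


print(recommend({"has children"}))
```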

Online learning by Spark
MLlib: Spark’s machine learning (ML) library
•  scalable ML algorithms
•  classification
•  regression
•  clustering
•  collaborative filtering
•  dimensionality reduction
•  lower-level optimization primitives
•  higher-level pipeline APIs
Streaming linear regression in MLlib
•  allows fitting regression models online
•  model parameter fitting is similar to that performed offline
•  fitting occurs on each batch of data
spark.apache.org/mllib
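The idea of fitting on each batch can be shown with a plain-Python sketch of online linear regression, where every incoming batch triggers one gradient step on the current parameters. In MLlib this role is played by `StreamingLinearRegressionWithSGD`; the code below does not use MLlib, and the synthetic data (generated from y = 2x + 1), the step size and the function names are assumptions for illustration.

```python
def update(weights, batch, step=0.5):
    """One gradient-descent step on the mean squared error of a single batch."""
    w, b = weights
    n = len(batch)
    grad_w = sum((w * x + b - y) * x for x, y in batch) / n
    grad_b = sum((w * x + b - y) for x, y in batch) / n
    return w - step * grad_w, b - step * grad_b


def make_batch(start):
    """Hypothetical batch of five noise-free points on the line y = 2x + 1."""
    xs = [(start + i) / 10 for i in range(5)]
    return [(x, 2 * x + 1) for x in xs]


# The model is updated batch by batch, as the stream arrives
weights = (0.0, 0.0)
for i in range(1000):
    weights = update(weights, make_batch(0 if i % 2 == 0 else 5))

print(weights)
```

The parameters after each batch are immediately usable for prediction, which is what makes the same model suitable for both online learning and real-time scoring.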

Thank you!
Questions?
