Real-time User Profiling
based on Spark Streaming and HBase
Arkadiusz Jachnik
BigData Spain Conference, October 15, 2015
Arkadiusz Jachnik
•  Data Scientist, AGORA S.A.
•  PhD Student, Poznan University of Technology
•  User Profiling, Text Classification, Big Data, Machine Learning, Multi-class classification, Multi-label classification, Recommendation
Polish Media Company
Press
Magazines
Internet
Cinemas
Advertising
Radio
TV
Books
Agenda
1.  What is user profiling?
2.  User profiling system using big data
technologies (Spark, HBase)
–  recipe for profiling system
–  algorithm
–  technical issues and solutions
3.  Enrichment of user profiles
by machine learning methods
What is a user profile?
Example (Single Customer View): Tom
•  male
•  has 2 children
•  politics, volleyball, cars
•  most active on Monday morning
•  City: Cracow
•  Device: iPhone
•  14 articles read last week
Application domains
•  Classification issues:
–  propensity to buy
–  propensity to churn
–  propensity to default (credit scores)
–  anomaly detection (e.g., fraud detection)
•  User grouping (segmentation)
•  Personalised advertising and marketing
messaging
•  Content personalisation
•  Recommendations
Our system
Introduction
Our case
•  Input data:
–  online data: page views, events
–  meta-data of items (articles, blogs, posts, …)
•  The problem of data sparsity
•  User content engagement in a certain period of time
•  Specific user behaviour should be a necessary
condition to assign the user to a specific segment
(feature) in real-time
Example:
•  User behaviour: the user has been reading forum threads about care for young children since last week
•  Segment/feature: Parent of 0 to 3-year-old child
Workflow
1.  Building daily profiles
2.  Daily profiles aggregation and sharing
Our system: Stage 1
Building user daily profiles
STEP 1: Tracking and queuing ...
User identification
•  Main issue: most of our users are not
logged in.
•  Requirement: storing only non-PII data.
We rely on cookies only.
STEP 1: Tracking and queuing
•  JavaScript tracks:
–  page views
–  events
•  Tracking application
generates Global User ID
(GUID) and session ID
•  Data queuing using Apache
Kafka:
–  Open-source message broker
project
–  Unified, high-throughput, low-
latency platform
•  We keep data on Kafka for
3 days.
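A minimal sketch of what the tracking application might produce for one page view or event. The field names (`guid`, `session_id`, `type`, `url`, `referrer`, `ts`), the topic names and the choice to key messages by GUID are assumptions for illustration; the slides do not show the actual message schema. Sending to Kafka is left out: the function only prepares the topic, key and value that a Kafka producer would take.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional, Tuple

# Hypothetical topic names, one per stream.
TOPICS = {"page_view": "page-views", "event": "events"}


def new_id() -> str:
    """Random identifier carrying no personal data; the tracker keeps it in a cookie."""
    return uuid.uuid4().hex


def build_message(kind: str, url: str, referrer: str,
                  guid: Optional[str] = None,
                  session_id: Optional[str] = None) -> dict:
    """Build one tracking message, creating IDs when the cookies are missing."""
    if kind not in TOPICS:
        raise ValueError(f"unknown message kind: {kind}")
    return {
        "type": kind,
        "guid": guid or new_id(),
        "session_id": session_id or new_id(),
        "url": url,
        "referrer": referrer,
        "ts": datetime.now(timezone.utc).isoformat(),
    }


def serialize(message: dict) -> Tuple[str, bytes, bytes]:
    """Return (topic, key, value) ready to hand to a producer."""
    topic = TOPICS[message["type"]]
    key = message["guid"].encode("utf-8")  # assumption: messages are keyed by GUID
    value = json.dumps(message, sort_keys=True).encode("utf-8")
    return topic, key, value
```

A returning visitor would pass the GUID and session ID read from the cookies, so only first-time visitors get fresh identifiers.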
[Diagram: cookies, page views and events → Tracking application → Kafka: page views stream, events stream]
tomcat.apache.org, kafka.apache.org
STEP 2: Spark Streaming ...
What is Spark Streaming?
•  Apache Spark: open source
cluster computing
framework.
•  Spark Streaming: library for
streaming computation as a
series of small and
deterministic batch jobs:
–  Splits stream into batches of
X seconds
–  Each batch is treated as an RDD
and is processed by RDD
operations
–  Processed results are
returned in batches
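The micro-batch idea can be illustrated without Spark. The sketch below is a plain-Python simulation, not Spark code: it groups timestamped records into windows of X seconds and runs the same job on every window. The function names and the record layout are assumptions made for this example.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple

Record = Tuple[float, str]  # (arrival time in seconds, payload)


def split_into_batches(records: Iterable[Record],
                       batch_seconds: float) -> List[List[str]]:
    """Group records into consecutive windows of `batch_seconds`."""
    buckets = defaultdict(list)
    for ts, payload in records:
        buckets[int(ts // batch_seconds)].append(payload)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    # A window without records still yields an (empty) batch.
    return [buckets.get(i, []) for i in range(first, last + 1)]


def process_stream(records: Iterable[Record], batch_seconds: float,
                   batch_job: Callable[[List[str]], object]) -> list:
    """Run the same deterministic job on every batch; return per-batch results."""
    return [batch_job(batch)
            for batch in split_into_batches(records, batch_seconds)]
```

In Spark Streaming the windowing is done by the framework once a batch interval is configured, and the per-batch job is expressed with RDD operations.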
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark RDD Engine → processed results of RDD operations]
spark.apache.org/streaming, https://databricks-training.s3.amazonaws.com/slides/Spark%20Summit%202014%20-%20Spark%20Streaming.pdf
STEP 2: Spark Streaming
•  We have 2 streams (each
with 6 partitions)
–  page views
–  events
•  Batch duration (streaming interval):
5 seconds
•  Page views (and
events) are converted
and parsed to obtain:
business ID, domain, parts of
URL and referer, geolocation,
User-Agent, GUID, Visit ID, etc. ...
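As an illustration of the parsing step, the sketch below derives the domain, URL parts and referrer domain from a raw page view. The input keys (`guid`, `url`, `referrer`, `user_agent`) and the output layout are assumptions for this example; geolocation and business ID lookups are not shown.

```python
from urllib.parse import urlsplit


def strip_www(host) -> str:
    """Normalize a host name: '' for a missing host, no leading 'www.'."""
    if host is None:
        return ""
    return host[4:] if host.startswith("www.") else host


def parse_page_view(raw: dict) -> dict:
    """Extract the domain and URL parts from a raw page view record."""
    url = urlsplit(raw["url"])
    referrer = urlsplit(raw.get("referrer", ""))
    return {
        "guid": raw["guid"],
        "domain": strip_www(url.hostname),
        "path_parts": [part for part in url.path.split("/") if part],
        "referrer_domain": strip_www(referrer.hostname),
        "user_agent": raw.get("user_agent", ""),
    }
```

With a flatMap-style operation, a function like this would be applied to every record in a batch.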
[Diagram: page views, events → batches of union of input streams → foreachRDD → flatMap → Engine: call(page_view/event), a single page view or event processing]
STEP 3: Fact Extraction ...
Fact definition
Example: page view:
•  GUID: 123ABC
•  time: 2015-10-15, 15:23:15
•  URL: http://www.wyborcza.pl/...
•  Referrer: http://www.google.pl/...
•  Geo city: Madrid
•  Article tags: news, sport
Example of facts:
< GUID, type of fact, value of fact, time >
< 123ABC, page view on domain, wyborcza.pl, 2015-10-15 15:23:15 >
< 123ABC, referer domain, google.pl, 2015-10-15 15:23:15 >
< 123ABC, geolocation city, Madrid, 2015-10-15 15:23:15 >
< 123ABC, article tag, news, 2015-10-15 15:23:15 >
< 123ABC, article tag, sport, 2015-10-15 15:23:15 >
Fact definition – more formally
Fact – the smallest piece of information
describing a relation between a user
(GUID) and some feature/element of a
page view or event.
Programmer and data steward decide what types
of facts should be extracted.
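A fact as defined above maps naturally onto a four-field tuple. The sketch below turns one already parsed page view into facts; the dictionary keys of the input (`domain`, `referer_domain`, `geo_city`, `article_tags`) are assumptions for this example, while the fact types are the ones shown on the slide.

```python
from typing import List, NamedTuple


class Fact(NamedTuple):
    guid: str
    fact_type: str
    value: str
    time: str


# (type of fact, hypothetical input field it is read from)
SINGLE_VALUE_FACTS = [
    ("page view on domain", "domain"),
    ("referer domain", "referer_domain"),
    ("geolocation city", "geo_city"),
]


def extract_facts(page_view: dict) -> List[Fact]:
    """Emit one fact per available feature of the page view."""
    guid, time = page_view["guid"], page_view["time"]
    facts = []
    for fact_type, field in SINGLE_VALUE_FACTS:
        value = page_view.get(field)
        if value:  # missing features simply produce no fact
            facts.append(Fact(guid, fact_type, value, time))
    for tag in page_view.get("article_tags", []):
        facts.append(Fact(guid, "article tag", tag, time))
    return facts
```

Adding a new type of fact then means adding one entry to the table or one more loop, which keeps the decision about what to extract in a single place.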
STEP 4: Profiling algorithm ...
Profiling rules
IF
    referer contains ‘google.pl’
THEN
    update feature ‘Search’ by ‘1’
where
•  referer: type of fact to check
•  contains: symbol
•  ‘google.pl’: value to check in fact
•  update feature ‘Search’ by ‘1’: which segment and how to update if rule is fulfilled
Rules can be stored in DB rows:
•  type of fact,
•  value to check,
•  symbol,
•  ID of segment to update,
•  value to update in segment
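A rule row with these five fields can be evaluated against facts in a few lines. This is a sketch under assumptions: the class and function names are invented for the example, facts are reduced to (type, value) pairs, and only the two symbols that appear on the slides (`==` and `contains`) are supported.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Rule(NamedTuple):
    fact_type: str   # type of fact to check
    value: str       # value to check in fact
    symbol: str      # comparison symbol
    segment: str     # ID of segment to update
    update: object   # value to update in segment


SYMBOLS: Dict[str, Callable[[str, str], bool]] = {
    "==": lambda fact_value, rule_value: fact_value == rule_value,
    "contains": lambda fact_value, rule_value: rule_value in fact_value,
}


def segments_to_update(facts: List[Tuple[str, str]],
                       rules: List[Rule]) -> List[Tuple[str, object]]:
    """Return (segment, update value) for every rule fulfilled by some fact."""
    updates = []
    for fact_type, fact_value in facts:
        for rule in rules:
            if (rule.fact_type == fact_type
                    and SYMBOLS[rule.symbol](fact_value, rule.value)):
                updates.append((rule.segment, rule.update))
    return updates
```

Because the rules are plain data, they can be loaded from database rows and changed without redeploying the streaming job.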
STEP 4: Profiling algorithm
Facts to check:
< 123ABC, page view on domain, wyborcza.pl, 15:23 >
< 123ABC, referer domain, google.pl, 15:23 >
< 123ABC, geolocation city, Madrid, 15:23 >
< 123ABC, article tag, news, 15:23 >
< 123ABC, article tag, sport, 15:23 >
Profiling rules:
IF article tag == ‘news’
THEN update segment ‘News’ by 1
IF referer domain == ‘google.pl’
THEN update segment ‘Search’ by ‘Google’
Segments to be updated for GUID 123ABC:
•  ‘News’ by value 1
•  ‘Search’ by value ‘Google’
STEP 5: Storing profiles in HBase
•  Data are stored in HBase by bulk
operations after each Spark batch
has been processed successfully
•  Statistics are stored in Redis
HBase
•  open source, non-relational,
distributed database
•  provides BigTable-like
capabilities for Hadoop
•  fault-tolerant way of storing
large quantities of sparse
data
[Diagram: foreachRDD → ... → Engine: call(page_view/event) → Parser → Fact Extraction Modules → Profiling (takes facts, returns segments to be updated) → Database Manager]
hbase.apache.org
Resources in Spark – tips and tricks
•  Resource managers as
singletons.
•  The first call() on a
worker initializes:
–  a singleton with resources (for
example database connections),
–  a shutdown hook which will close
all resources on application exit or
fault.
•  Each worker manages and
keeps its own resources
independently.
•  Each resource on each worker
is initialized only once.
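The pattern in these bullets can be sketched in Python as a lazily created per-process singleton with an exit hook. This is an analogue, not the talk's code: `FakeConnection` stands in for a real HBase connection, the class names are hypothetical, and `atexit` plays the role of a shutdown hook.

```python
import atexit
import threading


class FakeConnection:
    """Stand-in for a real database connection (hypothetical)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ResourceManager:
    """Holds the resources of one worker process; created at most once."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self.connection = FakeConnection()
        # Close resources when the worker process exits.
        atexit.register(self.close)

    @classmethod
    def get(cls):
        # Double-checked locking: no lock on the hot path, safe on first call.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def close(self):
        if not self.connection.closed:
            self.connection.close()


def call(record):
    """Per-record function; the first call in a process creates the resources."""
    manager = ResourceManager.get()
    return record, id(manager.connection)
```

Every task running in the same process reuses the one connection, which avoids opening a connection per record or per batch.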
[Diagram: Driver (SparkContext) → Cluster Manager (for example YARN) → Worker Nodes 1…N; on each worker node the Executor runs tasks (Task 1, Task 2, …) that use one HBase Singleton holding the HBase Connection]
HBase – row key design
•  Rows are sorted lexicographically by row key
•  Hash keys if you want to distribute rows evenly across
the regions (and so across the servers of the cluster)
•  For efficient scanning use some suffixes
(separated by dashes):
–  For time series data use a timestamp or a [year]-[month]-[day]-[hour]-[minute] structure.
Our HBase row key format:
[GUID]-[year]-[month]-[day]
http://hbase.apache.org/0.94/book/rowkey.design.html
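A sketch of how such row keys might be built and used for range scans. Zero-padding of month and day is an assumption (it keeps lexicographic order equal to chronological order), and the hashed-prefix variant is only an illustration of the "hash keys" advice, not necessarily what the system does.

```python
import hashlib
from datetime import date
from typing import Tuple


def row_key(guid: str, day: date) -> str:
    """[GUID]-[year]-[month]-[day], zero-padded so that keys sort by date."""
    return f"{guid}-{day.year:04d}-{day.month:02d}-{day.day:02d}"


def scan_range(guid: str, first_day: date, last_day: date) -> Tuple[str, str]:
    """Start key (inclusive) and stop key (exclusive) covering both days."""
    start = row_key(guid, first_day)
    stop = row_key(guid, last_day) + "\x00"  # smallest key after the last day
    return start, stop


def hashed_row_key(guid: str, day: date) -> str:
    """Variant with a short hash prefix to spread users across regions."""
    prefix = hashlib.md5(guid.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}-{row_key(guid, day)}"
```

With the GUID first and the date last, all daily profiles of one user are adjacent, so reading the last N days is a single short scan.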
Our system: Stage 2
Aggregation of daily user profiles
and sharing
Architecture
•  Final user profiles are
shared by REST web service
•  Spring as a web
application framework
•  Spring Hadoop library for
HBase connection
management
•  Statistics are stored in
MySQL database
•  We take permission
issues into account:
–  aggregated data are divided
by business IDs
[Diagram: External system / Client sends a REST query with GUID to the User Profile Web Service (Spring Framework, Spring Hadoop library; Config, JSON), which runs daily profiles aggregation over Daily Profiles (HBase) and returns a Profile (JSON); Statistics (MySQL)]
https://spring.io, http://docs.spring.io/spring-hadoop/docs/2.3.0.M3/reference/html/springandhadoop-hbase.html
Profile aggregation algorithm
•  For a specified input
GUID we aggregate
each existing
segment (feature)
•  Each segment is
aggregated for a
specific period of time
•  There are many
aggregation methods
corresponding to
different output
formats
Example segments:
•  Sport: 7 days, output: true if value>5
•  City: 14 days, output: mode
•  Number of articles: 3 days, output: sum
Daily profiles of GUID 123ABC:
< 123ABC, 2015-09-12, Sport: 3, City: Poznan >
< 123ABC, 2015-10-13, City: Gdansk >
< 123ABC, 2015-10-14, Sport: 7, City: Poznan >
< 123ABC, Today, Sport: 2, City: Poznan >
< 123ABC, AGGREGATED, Sport: true, City: Poznan >
Number of articles: 3, 1, 3, 2 (values as listed on the slide)
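The three aggregation methods named in the example (threshold, mode, sum) can be sketched as follows. Several details are assumptions rather than facts from the slides: the window includes today, "value" in "true if value>5" is read as the sum over the window, and daily profiles are modelled as a dictionary keyed by date.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Dict, List

DailyProfiles = Dict[date, dict]  # day -> {segment: value}


def window(daily: DailyProfiles, segment: str, today: date, days: int) -> List:
    """Values of one segment over the last `days` days, today included."""
    first = today - timedelta(days=days - 1)
    return [daily[d][segment] for d in sorted(daily)
            if first <= d <= today and segment in daily[d]]


def agg_threshold(values: List[float], threshold: float) -> bool:
    return sum(values) > threshold


def agg_mode(values: List):
    return Counter(values).most_common(1)[0][0] if values else None


def agg_sum(values: List[float]) -> float:
    return sum(values)


def aggregate(daily: DailyProfiles, today: date) -> dict:
    """Aggregate each segment with its own period and output format."""
    return {
        "Sport": agg_threshold(window(daily, "Sport", today, 7), 5),
        "City": agg_mode(window(daily, "City", today, 14)),
        "Number of articles": agg_sum(window(daily, "Number of articles", today, 3)),
    }
```

Keeping the period and the aggregation method per segment makes it possible to describe segments in configuration instead of code.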
Solved issues
•  Kafka integration with Spark Streaming
•  Parallelism of data streams (stream
division)
•  Resources management in Spark
•  Processing time
•  Security of Spark: Kerberos integration
Enrichment of user profiles
by machine learning methods
How to classify users to the segments?
Matching segments to users
•  We want to assign a
user (with a specific
profile) to further
segments
–  user vector consists of
user’s segments
•  All segments are treated
as classes (labels)
•  Online classification:
–  model is learnt in real time
–  model can be used for
real-time prediction
[Diagram: user 123ABC surrounded by segments, five of them marked "?": Football, Motorbikes, has children, Handball, Swimming, Politics, Economy, Toys, Child car seats, Mobile, Cars, Animals]
Multi-label classification
•  Learning algorithm:
Binary Relevance
–  independent binary
classifier for each label
(segment)
–  each classifier is trained
on existing user profiles
(user belongs to the segment or not)
•  Each model returns
boolean or probability
•  Prediction algorithm
returns results of binary
models for a given user
profile vector
[Diagram: Profile 1, Profile 2, …, Profile N train one binary classifier per segment (Segment 1, Segment 2, …, Segment K); for Profile n the prediction algorithm selects ‘1’s or labels with probability>0.5 and returns the list of recommended segments for Profile n]
Tsoumakas, Grigorios; Katakis, Ioannis (2007). "Multi-label classification: an overview". International Journal of Data Warehousing & Mining 3 (3): 1–13.
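A small sketch of Binary Relevance with online updates. The choice of logistic regression trained by stochastic gradient descent as the per-segment binary classifier is an assumption for this example (the slides do not name the base classifier), and all class names are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Sequence


class OnlineLogistic:
    """Binary logistic regression updated one example at a time (SGD)."""

    def __init__(self, n_features: int, lr: float = 0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x: Sequence[float]) -> float:
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x: Sequence[float], y: int) -> None:
        error = self.predict_proba(x) - y  # gradient of the log-loss w.r.t. z
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error


class BinaryRelevance:
    """One independent binary classifier per label (segment)."""

    def __init__(self, labels: Iterable[str], n_features: int):
        self.models: Dict[str, OnlineLogistic] = {
            label: OnlineLogistic(n_features) for label in labels
        }

    def update(self, x: Sequence[float], user_labels: Iterable[str]) -> None:
        user_labels = set(user_labels)
        for label, model in self.models.items():
            model.update(x, 1 if label in user_labels else 0)

    def recommend(self, x: Sequence[float], threshold: float = 0.5) -> List[str]:
        """Labels whose classifier returns a probability above the threshold."""
        return [label for label, model in self.models.items()
                if model.predict_proba(x) > threshold]
```

Because the classifiers are independent, a new segment can be added by creating one more model without retraining the others.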
Online learning by Spark
MLlib: Spark’s machine
learning (ML) library
•  scalable ML algorithms
•  classification
•  regression
•  clustering
•  collaborative filtering
•  dimensionality reduction
•  lower-level optimization
primitives
•  higher-level pipeline APIs
Streaming linear regression
in MLlib
•  allows fitting regression
models online
•  fitting of model parameters
is similar to that
performed offline
•  fitting occurs on each
batch of data
spark.apache.org/mllib
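To make "fitting occurs on each batch of data" concrete, here is a plain-Python sketch of a linear model updated by one gradient step per batch. It is not MLlib code; the class name, the absence of an intercept and the step size are assumptions chosen to keep the example short.

```python
from typing import List, Sequence, Tuple

Batch = List[Tuple[Sequence[float], float]]  # (features, target)


class StreamingLinearRegression:
    """Linear model without intercept, updated once per arriving batch."""

    def __init__(self, n_features: int, step_size: float = 0.1):
        self.weights = [0.0] * n_features
        self.step_size = step_size

    def predict(self, x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(self.weights, x))

    def train_on(self, batch: Batch) -> None:
        """One gradient step on half the mean squared error of this batch."""
        if not batch:
            return  # an empty batch leaves the model unchanged
        grad = [0.0] * len(self.weights)
        for x, y in batch:
            error = self.predict(x) - y
            for i, xi in enumerate(x):
                grad[i] += error * xi
        n = len(batch)
        self.weights = [w - self.step_size * g / n
                        for w, g in zip(self.weights, grad)]
```

The model is usable for prediction between batches, and each new batch only nudges the current weights, so no history has to be kept.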
Thank you!
Questions?
BigData Spain Conference, October 15, 2015

Contenu connexe

Tendances

How to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudHow to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudVMware Tanzu
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, HadoopSenthil Pandurangan
 
Customer Experience at Disney+ Through Data Perspective
Customer Experience at Disney+ Through Data PerspectiveCustomer Experience at Disney+ Through Data Perspective
Customer Experience at Disney+ Through Data PerspectiveDatabricks
 
Building a Self-Service Big Data Pipeline
Building a Self-Service Big Data PipelineBuilding a Self-Service Big Data Pipeline
Building a Self-Service Big Data PipelineDataWorks Summit
 
DBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data ApplicationsDBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data Applicationsdecode2016
 
MongoDB Europe 2016 - Choosing Between 100 Billion Travel Options – Instant S...
MongoDB Europe 2016 - Choosing Between 100 Billion Travel Options – Instant S...MongoDB Europe 2016 - Choosing Between 100 Billion Travel Options – Instant S...
MongoDB Europe 2016 - Choosing Between 100 Billion Travel Options – Instant S...MongoDB
 
NoSQL for the SQL Server Pro
NoSQL for the SQL Server ProNoSQL for the SQL Server Pro
NoSQL for the SQL Server ProLynn Langit
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream AnalyticsMarco Parenzan
 
[Strata] Sparkta
[Strata] Sparkta[Strata] Sparkta
[Strata] SparktaStratio
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsMars Lan
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big dataTrieu Nguyen
 
Survey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataSurvey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataLuiz Henrique Zambom Santana
 
Implementing a canonical IoT backend in Azure with Azure Stream Analytics
Implementing a canonical IoT backend in Azure with Azure Stream AnalyticsImplementing a canonical IoT backend in Azure with Azure Stream Analytics
Implementing a canonical IoT backend in Azure with Azure Stream AnalyticsMarco Parenzan
 
Building a Hadoop Powered Commerce Data Pipeline
Building a Hadoop Powered Commerce Data PipelineBuilding a Hadoop Powered Commerce Data Pipeline
Building a Hadoop Powered Commerce Data PipelineDataWorks Summit
 
How Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIHow Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIDenny Lee
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeDatabricks
 

Tendances (20)

How to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudHow to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the Cloud
 
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, HadoopMonitoring @ scale over diverse data sources @ PayPal  - Druid, TSDB, Hadoop
Monitoring @ scale over diverse data sources @ PayPal - Druid, TSDB, Hadoop
 
Customer Experience at Disney+ Through Data Perspective
Customer Experience at Disney+ Through Data PerspectiveCustomer Experience at Disney+ Through Data Perspective
Customer Experience at Disney+ Through Data Perspective
 
Building a Self-Service Big Data Pipeline
Building a Self-Service Big Data PipelineBuilding a Self-Service Big Data Pipeline
Building a Self-Service Big Data Pipeline
 
DBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data ApplicationsDBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data Applications
 
MongoDB Europe 2016 - Choosing Between 100 Billion Travel Options – Instant S...
MongoDB Europe 2016 - Choosing Between 100 Billion Travel Options – Instant S...MongoDB Europe 2016 - Choosing Between 100 Billion Travel Options – Instant S...
MongoDB Europe 2016 - Choosing Between 100 Billion Travel Options – Instant S...
 
NoSQL for the SQL Server Pro
NoSQL for the SQL Server ProNoSQL for the SQL Server Pro
NoSQL for the SQL Server Pro
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in Motion
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream Analytics
 
[Strata] Sparkta
[Strata] Sparkta[Strata] Sparkta
[Strata] Sparkta
 
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data PlatformsWhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
WhereHows: Taming Metadata for 150K Datasets Over 9 Data Platforms
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Lambda architecture for real time big data
Lambda architecture for real time big dataLambda architecture for real time big data
Lambda architecture for real time big data
 
Survey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big DataSurvey of Real-time Processing Systems for Big Data
Survey of Real-time Processing Systems for Big Data
 
Implementing a canonical IoT backend in Azure with Azure Stream Analytics
Implementing a canonical IoT backend in Azure with Azure Stream AnalyticsImplementing a canonical IoT backend in Azure with Azure Stream Analytics
Implementing a canonical IoT backend in Azure with Azure Stream Analytics
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Building a Hadoop Powered Commerce Data Pipeline
Building a Hadoop Powered Commerce Data PipelineBuilding a Hadoop Powered Commerce Data Pipeline
Building a Hadoop Powered Commerce Data Pipeline
 
How Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIHow Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BI
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
 

Similaire à Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jachnik at Big Data Spain 2015

Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Analytics in Your Enterprise
Analytics in Your EnterpriseAnalytics in Your Enterprise
Analytics in Your EnterpriseWSO2
 
KPI definition with Business Activity Monitor 2.0
KPI definition with Business Activity Monitor 2.0KPI definition with Business Activity Monitor 2.0
KPI definition with Business Activity Monitor 2.0WSO2
 
Adding Velocity to BigBench
Adding Velocity to BigBenchAdding Velocity to BigBench
Adding Velocity to BigBencht_ivanov
 
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...DataBench
 
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...Databricks
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking VN
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineTrieu Nguyen
 
Self-Service IoT Data Analytics with StreamPipes
Self-Service IoT Data Analytics with StreamPipesSelf-Service IoT Data Analytics with StreamPipes
Self-Service IoT Data Analytics with StreamPipesApache StreamPipes
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyYaroslav Tkachenko
 
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2
 
From measurement to knowledge with sofia2 Platform
From measurement to knowledge with sofia2 PlatformFrom measurement to knowledge with sofia2 Platform
From measurement to knowledge with sofia2 PlatformSofia2 Smart Platform
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the artStavros Kontopoulos
 
Understanding Business APIs through statistics
Understanding Business APIs through statisticsUnderstanding Business APIs through statistics
Understanding Business APIs through statisticsWSO2
 
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics PlatformWSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics PlatformWSO2
 
WSO2 Workshop Sydney 2016 - Analytics
WSO2 Workshop Sydney 2016 -  AnalyticsWSO2 Workshop Sydney 2016 -  Analytics
WSO2 Workshop Sydney 2016 - AnalyticsDassana Wijesekara
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateIntroduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateSrinath Perera
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
 

Similaire à Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jachnik at Big Data Spain 2015 (20)

Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Analytics in Your Enterprise
Analytics in Your EnterpriseAnalytics in Your Enterprise
Analytics in Your Enterprise
 
KPI definition with Business Activity Monitor 2.0
KPI definition with Business Activity Monitor 2.0KPI definition with Business Activity Monitor 2.0
KPI definition with Business Activity Monitor 2.0
 
Adding Velocity to BigBench
Adding Velocity to BigBenchAdding Velocity to BigBench
Adding Velocity to BigBench
 
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
 
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli...
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
Building Reactive Real-time Data Pipeline
Building Reactive Real-time Data PipelineBuilding Reactive Real-time Data Pipeline
Building Reactive Real-time Data Pipeline
 
Self-Service IoT Data Analytics with StreamPipes
Self-Service IoT Data Analytics with StreamPipesSelf-Service IoT Data Analytics with StreamPipes
Self-Service IoT Data Analytics with StreamPipes
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
 
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics PlatformWSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
WSO2Con EU 2015: An Introduction to the WSO2 Data Analytics Platform
 
From measurement to knowledge with sofia2 Platform
From measurement to knowledge with sofia2 PlatformFrom measurement to knowledge with sofia2 Platform
From measurement to knowledge with sofia2 Platform
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the art
 
Understanding Business APIs through statistics
Understanding Business APIs through statisticsUnderstanding Business APIs through statistics
Understanding Business APIs through statistics
 
The Evolution of Big Data Pipelines at Intuit
The Evolution of Big Data Pipelines at Intuit The Evolution of Big Data Pipelines at Intuit
The Evolution of Big Data Pipelines at Intuit
 
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics PlatformWSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
WSO2Con USA 2015: An Introduction to the WSO2 Analytics Platform
 
WSO2 Workshop Sydney 2016 - Analytics
WSO2 Workshop Sydney 2016 -  AnalyticsWSO2 Workshop Sydney 2016 -  Analytics
WSO2 Workshop Sydney 2016 - Analytics
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 UpdateIntroduction to WSO2 Analytics Platform: 2016 Q2 Update
Introduction to WSO2 Analytics Platform: 2016 Q2 Update
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 


Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jachnik at Big Data Spain 2015

  • 2. Real-time User Profiling based on Spark Streaming and HBase. Arkadiusz Jachnik. BigData Spain Conference, October 15, 2015
  • 3. Arkadiusz Jachnik: Data Scientist at AGORA S.A.; PhD Student at Poznan University of Technology. Keywords: User Profiling, Text Classification, Big Data, Machine Learning, Multi-class classification, Multi-label classification, Recommendation
  • 5. Agenda
      1. What is user profiling?
      2. User profiling system using big data technologies (Spark, HBase)
          – recipe for profiling system
          – algorithm
          – technical issues and solutions
      3. Enrichment of user profiles by machine learning methods
  • 6. What is a user profile? (Single Customer View)
      Example profile: Tom; male; has 2 children; politics, volleyball, cars; most active on Monday morning; City: Cracow; Device: iPhone; 14 articles read last week
  • 7. Application domains
      • Classification issues:
          – propensity to buy
          – propensity to churn
          – propensity to default (credit scores)
          – anomaly detection (e.g., fraud detection)
      • User grouping (segmentation)
      • Personalised advertising and marketing messaging
      • Content personalisation
      • Recommendations
  • 8. Our system: Introduction
  • 9. Our case
      • Input data:
          – online data: page views, events
          – meta-data of items (articles, blogs, posts, …)
      • The problem of data sparsity
      • User content engagement in a certain period of time
      • Specific user behaviour should be a necessary condition to assign the user to a specific segment (feature) in real-time
      Example:
          User behaviour: The user has been reading forum threads about care for young children since last week
          Segment/feature: Parent of 0 to 3-year-old child
  • 10. Workflow: (1) Building daily profiles; (2) Daily profiles aggregation and sharing
  • 11. Our system: Stage 1. Building user daily profiles
  • 12. STEP 1: Tracking and queuing
  • 13. User identification
      • Main issue: most of our users are not logged in.
      • Requirement: storing only non-PII data.
      • We rely on cookies only.
  • 14. STEP 1: Tracking and queuing
      • JavaScript tracks:
          – page views
          – events
      • Tracking application generates Global User ID (GUID) and session ID
      • Data queuing using Apache Kafka:
          – Open-source message broker project
          – Unified, high-throughput, low-latency platform
      • We keep data on Kafka for 3 days.
      Diagram: page views, events and cookies → Tracking application → page views stream, events stream
      Sources: tomcat.apache.org, kafka.apache.org
  • 15. STEP 2: Spark Streaming
  • 16. What is Spark Streaming?
      • Apache Spark: open source cluster computing framework.
      • Spark Streaming: library for streaming computation as a series of small and deterministic batch jobs:
          – Splits stream into batches of X seconds
          – Each batch is treated as an RDD and is processed by RDD operations
          – Processed results are returned in batches
      Diagram: live data stream → Spark Streaming → batches of X seconds → Spark RDD Engine → processed results of RDD operations
      Sources: spark.apache.org/streaming, https://databricks-training.s3.amazonaws.com/slides/Spark%20Summit%202014%20-%20Spark%20Streaming.pdf
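The micro-batch idea on this slide can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark code: `split_into_batches`, `process_batch`, the sample stream, and the domain `sport.pl` are all hypothetical names and data chosen for illustration, and the per-batch counting merely stands in for whatever RDD operations a real job would run.

```python
from collections import defaultdict


def split_into_batches(events, batch_seconds):
    """Group (timestamp_seconds, payload) pairs into fixed-length batches.

    Returns a list of payload lists ordered by batch start time.
    Unlike a real streaming engine, intervals with no events are skipped.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        batch_index = int(timestamp // batch_seconds)
        batches[batch_index].append(payload)
    return [batches[i] for i in sorted(batches)]


def process_batch(batch):
    """Stand-in for RDD operations: count page views per domain in one batch."""
    counts = defaultdict(int)
    for domain in batch:
        counts[domain] += 1
    return dict(counts)


# Hypothetical stream: (arrival time in seconds, domain of the page view).
stream = [(0.5, "wyborcza.pl"), (1.2, "sport.pl"), (4.9, "wyborcza.pl"),
          (5.1, "sport.pl"), (9.8, "sport.pl"), (12.0, "wyborcza.pl")]

# X = 5 seconds is an arbitrary choice for this sketch.
results = [process_batch(b) for b in split_into_batches(stream, batch_seconds=5)]
print(results)
```

Each element of `results` corresponds to one batch, which mirrors the slide's point that processed results come back batch by batch rather than record by record.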
  • 17. STEP 2: Spark Streaming
      • We have 2 streams (each with 6 partitions):
          – page views
          – events
      • Streaming duration: 5 seconds
      • Page views (and events) are converted and parsed to obtain: business ID, domain, parts of URL and referer, geolocation, User-Agent, GUID, Visit ID, etc.
      Diagram: page views + events → batches of the union of input streams → foreachRDD → flatMap → call(page_view/event): processing of a single page view or event
  • 18. STEP 3: Fact Extraction
  • 19. Fact definition
      Example page view:
          • GUID: 123ABC
          • time: 2015-10-15, 15:23:15
          • URL: http://www.wyborcza.pl/...
          • Referrer: http://www.google.pl/...
          • Geo city: Madrid
          • Article tags: news, sport
      Example of facts, each written as <GUID, type of fact, value of fact, time>:
          <123ABC, page view on domain, wyborcza.pl, 2015-10-15 15:23:15>
          <123ABC, referer domain, google.pl, 2015-10-15 15:23:15>
          <123ABC, geolocation city, Madrid, 2015-10-15 15:23:15>
          <123ABC, article tag, news, 2015-10-15 15:23:15>
          <123ABC, article tag, sport, 2015-10-15 15:23:15>
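The fact extraction shown on this slide can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the speaker's implementation: the `Fact` tuple, the dictionary field names, and the helpers `domain_of` and `extract_facts` are hypothetical, and the URL paths are left empty because the slide elides them.

```python
from collections import namedtuple
from urllib.parse import urlparse

# A fact as defined on the slide: <GUID, type of fact, value of fact, time>.
Fact = namedtuple("Fact", ["guid", "fact_type", "value", "time"])


def domain_of(url):
    """Return the host name of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def extract_facts(page_view):
    """Turn one parsed page view into a list of facts (hypothetical field names)."""
    guid, time = page_view["guid"], page_view["time"]
    facts = [
        Fact(guid, "page view on domain", domain_of(page_view["url"]), time),
        Fact(guid, "referer domain", domain_of(page_view["referrer"]), time),
        Fact(guid, "geolocation city", page_view["geo_city"], time),
    ]
    # One fact per article tag.
    facts += [Fact(guid, "article tag", tag, time) for tag in page_view["article_tags"]]
    return facts


page_view = {
    "guid": "123ABC",
    "time": "2015-10-15 15:23:15",
    "url": "http://www.wyborcza.pl/",      # path elided on the slide
    "referrer": "http://www.google.pl/",   # path elided on the slide
    "geo_city": "Madrid",
    "article_tags": ["news", "sport"],
}

facts = extract_facts(page_view)
for fact in facts:
    print(fact)
```

A single page view fans out into several facts, which is why a one-to-many operation such as the `flatMap` mentioned on the earlier slide is a natural fit for this step.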
  • 20. Fact definition – more formally
      Fact: the smallest piece of information describing a relation between a user (GUID) and some feature/element of a page view or event.
      Programmer and data steward decide what types of facts should be extracted.
  • 21. STEP 4: Profiling algorithm
  • 22. Profiling rules
      Example rule: IF referer contains ‘google.pl’ THEN update feature ‘Search’ by ‘1’
      where:
          • referer: type of fact to check
          • contains: symbol
          • ‘google.pl’: value to check in fact
          • update feature ‘Search’ by ‘1’: which segment and how to update if rule is fulfilled
      Rules can be stored in DB rows:
          • type of fact,
          • value to check,
          • symbol,
          • ID of segment to update,
          • value to update in segment
  • 23. STEP 4: Profiling algorithm
      Facts to check:
          <123ABC, page view on domain, wyborcza.pl, 15:23>
          <123ABC, referer domain, google.pl, 15:23>
          <123ABC, geolocation city, Madrid, 15:23>
          <123ABC, article tag, news, 15:23>
          <123ABC, article tag, sport, 15:23>
      Profiling rules:
          IF article tag == ‘news’ THEN update segment ‘News’ by 1
          IF referer domain == ‘google.pl’ THEN update segment ‘Search’ by ‘Google’
      Segments to be updated for GUID 123ABC: (none yet)
  • 24. STEP 4: Profiling algorithm (same facts and rules; the first rule matches)
      Segments to be updated for GUID 123ABC:
          ‘News’ by value 1
  • 25. STEP 4: Profiling algorithm (same facts and rules; both rules match)
      Segments to be updated for GUID 123ABC:
          ‘News’ by value 1
          ‘Search’ by value ‘Google’
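The rule check walked through on the last three slides can be sketched as follows. This is a simplified plain-Python illustration under assumptions: the `Rule` tuple mirrors the DB row fields listed on the "Profiling rules" slide, while the `SYMBOLS` table and the function `segments_to_update` are hypothetical names, and only the two symbols that appear on the slides (`==` and `contains`) are supported.

```python
import operator
from collections import namedtuple

# One rule row: type of fact, symbol, value to check, segment to update, update value.
Rule = namedtuple("Rule", ["fact_type", "symbol", "value", "segment", "update"])

# Comparison symbols a rule may use (only the two seen on the slides).
SYMBOLS = {
    "==": operator.eq,
    "contains": lambda fact_value, needle: needle in fact_value,
}


def segments_to_update(facts, rules):
    """Check every fact against every rule; return (segment, update value) pairs."""
    updates = []
    for fact_type, fact_value in facts:
        for rule in rules:
            if rule.fact_type == fact_type and SYMBOLS[rule.symbol](fact_value, rule.value):
                updates.append((rule.segment, rule.update))
    return updates


rules = [
    Rule("article tag", "==", "news", "News", 1),
    Rule("referer domain", "==", "google.pl", "Search", "Google"),
]

# Facts for GUID 123ABC as (type of fact, value of fact); GUID and time omitted for brevity.
facts = [
    ("page view on domain", "wyborcza.pl"),
    ("referer domain", "google.pl"),
    ("geolocation city", "Madrid"),
    ("article tag", "news"),
    ("article tag", "sport"),
]

updates = segments_to_update(facts, rules)
print(updates)
```

Because the rules are plain data, new segments can be introduced by adding rows rather than changing code, which is presumably the motivation for keeping them in a database.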
  • 26. STEP 5: Storing profiles in HBase
      • Data are stored in HBase by bulk operations after each successfully processed Spark batch
      • Statistics are stored in Redis
      HBase:
          • open source, non-relational, distributed database
          • provides BigTable-like capabilities for Hadoop
          • fault-tolerant way of storing large quantities of sparse data
      Diagram: foreachRDD → call(page_view/event) → Parser → Fact Extraction Modules → Profiling (takes facts, returns segments to be updated) → Database Manager
      Source: hbase.apache.org
  • 27. Resources in Spark – tips and tricks
      • Resource managers as singletons.
      • The first call() method on a worker initializes:
          – a singleton with resources (for example database connections),
          – a shutdown hook which will close all resources on application exit or fault.
      • Each worker manages and keeps its own resources independently.
      • Each resource on each worker is initialized only once.
      Diagram: SparkContext (Driver) → Cluster Manager (for example Yarn) → Worker Node 1 … Worker Node N; on each worker an Executor runs Task 1, which uses the HBase Singleton holding the HBase Connection
  • 28. Resources in Spark – tips and tricks (same text as the previous slide; the diagram now shows Task 2 on the Executor of Worker Node 1 with the same HBase Singleton and HBase Connection)
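The singleton-plus-shutdown-hook pattern from these two slides can be sketched in plain Python. This is an illustration under assumptions rather than the speaker's code: `FakeConnection` is a stand-in for a real database connection (it is not an HBase client), `ResourceManager` is a hypothetical name, and Python's `atexit` plays the role of the shutdown hook within a single worker process.

```python
import atexit
import threading


class FakeConnection:
    """Stand-in for a database connection (not a real HBase client)."""
    opened = 0  # counts how many connections were ever created in this process

    def __init__(self):
        FakeConnection.opened += 1
        self.closed = False

    def close(self):
        self.closed = True


class ResourceManager:
    """Per-process singleton that owns the worker's resources."""
    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self.connection = FakeConnection()
        # Shutdown hook: close the resources when the process exits.
        atexit.register(self.close)

    @classmethod
    def get(cls):
        # The first caller creates the instance; later callers reuse it.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def close(self):
        self.connection.close()


def call(record):
    """Per-record processing function; obtains the shared resources lazily."""
    manager = ResourceManager.get()
    return (record, id(manager.connection))


results = [call(r) for r in ["page_view_1", "event_1", "page_view_2"]]
print(FakeConnection.opened)
```

Creating the connection lazily inside the processing function, instead of on the driver, avoids having to serialize it and ship it to the workers; each process pays the connection cost once and then reuses it for every record.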
  • 29. HBase – row key design
      • The rows are sorted in alphanumeric order by key names
      • Hash keys if you want to distribute rows across the regions (on servers of cluster)
      • For efficient scanning use some suffixes (separated by dashes):
          – For time series data use a timestamp or [year]-[month]-[day]-[hour]-[minute] structure.
      Our HBase row key format: [GUID]-[year]-[month]-[day]
      Source: http://hbase.apache.org/0.94/book/rowkey.design.html
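A small plain-Python sketch can show why the `[GUID]-[year]-[month]-[day]` format scans well. The helper names `row_key`, `scan_range`, and `hashed_guid` are hypothetical, zero-padding of month and day is an assumption (the slide does not state it), and the hashing helper only illustrates the "hash keys" tip, not how GUIDs are actually produced in the system described.

```python
import hashlib
from datetime import date


def row_key(guid, day):
    """Build a key in the [GUID]-[year]-[month]-[day] format.

    Zero-padding (an assumption here) makes string order match date order.
    """
    return "{}-{:04d}-{:02d}-{:02d}".format(guid, day.year, day.month, day.day)


def scan_range(guid, first_day, last_day):
    """Start key (inclusive) and stop key (exclusive) covering a date range.

    A zero byte is appended to the last key so the final day is included.
    """
    return row_key(guid, first_day), row_key(guid, last_day) + "\x00"


def hashed_guid(raw_id):
    """Illustrative hashed identifier that spreads keys across the key space."""
    return hashlib.md5(raw_id.encode("utf-8")).hexdigest()[:8]


keys = sorted([
    row_key("123ABC", date(2015, 10, 14)),
    row_key("123ABC", date(2015, 9, 12)),
    row_key("123ABC", date(2015, 10, 13)),
])

# Select one user's daily profiles for the week ending 2015-10-14.
start, stop = scan_range("123ABC", date(2015, 10, 8), date(2015, 10, 14))
in_range = [k for k in keys if start <= k < stop]
print(in_range)
```

Since all rows of one user are adjacent and ordered by day, reading the last N daily profiles becomes a single contiguous range scan rather than N separate lookups.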
  • 30. Our system: Stage 2. Aggregation of daily user profiles and sharing
  • 31. Architecture
      • Final user profiles are shared by a REST web service
      • Spring as a web application framework
      • Spring Hadoop library for HBase connection management
      • Statistics are stored in a MySQL database
      • We take permission issues into account:
          – aggregated data are divided by business IDs
      Diagram: External system / Client → REST query with GUID → User Profile Web Service (Spring Framework, Spring Hadoop library, Config) → daily profiles aggregation over Daily Profiles (HBase) → Profile (JSON) returned to the client; Statistics (MySQL)
      Sources: https://spring.io, http://docs.spring.io/spring-hadoop/docs/2.3.0.M3/reference/html/springandhadoop-hbase.html
  • 32. Profile aggregation algorithm
      • For a specified input GUID we aggregate each existing segment (feature)
      • Each segment is aggregated for a specific period of time
      • There are many aggregation methods corresponding to different output formats
      Example segment settings:
          • Sport: 7 days, output: true if value>5
          • City: 14 days, output: mode
          • Number of articles: 3 days, output: sum
      Example rows as they appear on the slide (GUID, day, values; the Number of articles column is not fully recoverable):
          123ABC | 2015-09-12 | 3 | Poznan
          123ABC | 2015-10-13 |   | Gdansk
          123ABC | 2015-10-14 | 7 | Poznan
          123ABC | Today      | 2 | Poznan
          123ABC | AGGREGATED | true | Poznan | 3
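The per-segment aggregation described on this slide can be sketched in plain Python. Several things here are assumptions and are labeled as such: "Today" is taken to be 2015-10-15, the Number of articles values are hypothetical (the slide's column is not recoverable), the Sport threshold is applied to the sum over the window, and the helper names `values_in_window`, `aggregate`, `mode`, and `sum_above` are invented for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def values_in_window(daily_profiles, segment, today, days):
    """Collect a segment's daily values from the last `days` days (today included)."""
    first_day = today - timedelta(days=days - 1)
    return [segments[segment]
            for day, segments in sorted(daily_profiles.items())
            if first_day <= day <= today and segment in segments]


def mode(values):
    """Most frequent value in the window."""
    return Counter(values).most_common(1)[0][0]


def sum_above(threshold):
    """Build an aggregator returning True if the window sum exceeds the threshold."""
    return lambda values: sum(values) > threshold


def aggregate(daily_profiles, today, specs):
    """Aggregate each segment with its own window length and method."""
    profile = {}
    for segment, (days, method) in specs.items():
        values = values_in_window(daily_profiles, segment, today, days)
        if values:
            profile[segment] = method(values)
    return profile


today = date(2015, 10, 15)  # assumption: "Today" on the slide
daily_profiles = {          # article counts below are hypothetical
    date(2015, 9, 12): {"Sport": 3, "City": "Poznan", "Number of articles": 3},
    date(2015, 10, 13): {"City": "Gdansk", "Number of articles": 1},
    date(2015, 10, 14): {"Sport": 7, "City": "Poznan"},
    today: {"Sport": 2, "City": "Poznan", "Number of articles": 2},
}
specs = {
    "Sport": (7, sum_above(5)),
    "City": (14, mode),
    "Number of articles": (3, sum),
}

profile = aggregate(daily_profiles, today, specs)
print(profile)
```

Keeping the window length and the aggregation method together in one per-segment setting lets the same daily rows yield a boolean, a category, or a number depending on the output format a segment needs.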
  • 33. Solved issues
      • Kafka integration with Spark Streaming
      • Parallelism of data streams (stream division)
      • Resources management in Spark
      • Processing time
      • Security of Spark: Kerberos integration
  • 34. Enrichment of user profiles by machine learning methods. How to classify users to the segments?
  • 35. Matching segments to users
      • We want to classify a user (of a specific profile) into other segments
          – the user vector consists of the user’s segments
      • All segments are treated as classes (labels)
      • Online classification:
          – the model is learnt in real time
          – the model can be used for real-time prediction
      Example: user 123ABC has known segments (Football, Motorbikes, has children, Handball, …) and unknown ones marked "?"; segments shown on the slide: Football, Motorbikes, has children, Handball, Swimming, Politics, Economy, Toys, Child car seats, Mobile, Cars, Animals
  • 36. Multi-label classification
      • Learning algorithm: Binary Relevance
          – an independent binary classifier for each label (segment)
          – each classifier is learnt from existing user profiles (belongs to the segment or not)
      • Each model returns a boolean or a probability
      • The prediction algorithm returns the results of the binary models for a given user profile vector
      Diagram: Profile 1 … Profile N train the binary classifiers for Segment 1 … Segment K; for Profile n the prediction algorithm selects the ‘1’s or the labels with probability>0.5 and returns the list of recommended segments for Profile n
      Reference: Tsoumakas, Grigorios; Katakis, Ioannis (2007). "Multi-label classification: an overview". International Journal of Data Warehousing & Mining 3 (3): 1–13.
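Binary Relevance, as described on this slide, can be sketched in plain Python with one small classifier per label. The choice of logistic regression trained by per-sample gradient steps is an assumption made for this sketch (the slide does not say which binary classifier was used), and the toy profiles, feature segments, and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(X, y, step_size=0.5, epochs=200):
    """Train one logistic-regression classifier with per-sample gradient steps."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for features, target in zip(X, y):
            p = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
            error = target - p
            weights = [w + step_size * error * f for w, f in zip(weights, features)]
            bias += step_size * error
    return weights, bias


def train_binary_relevance(X, Y, labels):
    """Binary Relevance: an independent binary model per label (segment)."""
    return {label: train_binary(X, [row[k] for row in Y])
            for k, label in enumerate(labels)}


def recommend(models, features, threshold=0.5):
    """Return the labels whose model gives a probability above the threshold."""
    recommended = []
    for label, (weights, bias) in models.items():
        p = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
        if p > threshold:
            recommended.append(label)
    return recommended


# Hypothetical toy data. Features: [Football, Motorbikes, has children].
X = [[1, 0, 1], [0, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 0], [0, 1, 1]]
# Labels per profile: [Toys, Cars].
Y = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [1, 1]]
labels = ["Toys", "Cars"]

models = train_binary_relevance(X, Y, labels)
print(recommend(models, [0, 1, 1]))
```

Because the per-label models are independent, they can be trained and updated separately, at the cost of ignoring correlations between labels.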
  • 37. Online learning by Spark
      MLlib: Spark’s machine learning (ML) library
          • scalable ML algorithms
          • classification
          • regression
          • clustering
          • collaborative filtering
          • dimensionality reduction
          • lower-level optimization primitives
          • higher-level pipeline APIs
      Streaming linear regression in MLlib:
          • allows fitting regression models online
          • fitting of model parameters is similar to that performed offline
          • fitting occurs on each batch of data
      Source: spark.apache.org/mllib
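The "fitting occurs on each batch of data" idea can be illustrated with a minimal plain-Python sketch of online linear regression. This is not MLlib code (MLlib exposes the technique through `StreamingLinearRegressionWithSGD`); the function names, the step size, the number of iterations per batch, and the noise-free synthetic data generated from y = 2x + 1 are all assumptions made for illustration.

```python
def update_on_batch(weights, batch, step_size=0.5, num_iterations=50):
    """Refine (slope, intercept) with gradient descent on one batch of (x, y) pairs.

    The model from the previous batch is the starting point, so the fit
    carries over from batch to batch instead of restarting from scratch.
    """
    w, b = weights
    n = len(batch)
    for _ in range(num_iterations):
        grad_w = sum((w * x + b - y) * x for x, y in batch) / n
        grad_b = sum((w * x + b - y) for x, y in batch) / n
        w -= step_size * grad_w
        b -= step_size * grad_b
    return w, b


def make_batch(k):
    """Hypothetical k-th batch: five noise-free points on the line y = 2x + 1."""
    xs = [0.2 * i + 0.02 * k for i in range(5)]
    return [(x, 2.0 * x + 1.0) for x in xs]


weights = (0.0, 0.0)  # initial model
for k in range(10):   # ten batches arriving one after another
    weights = update_on_batch(weights, make_batch(k))

print(weights)
```

The per-batch update is the same gradient computation an offline fit would use, which matches the slide's remark that online fitting resembles offline fitting; the difference is only that the data arrives in pieces and the model is usable for prediction between batches.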
  • 38. Thank you! Questions?