© 2013 RichRelevance, Inc. All Rights Reserved. Confidential.
How we solved Real-Time User Segmentation using HBase
Murtaza Doctor, Principal Architect
Giang Nguyen, Sr. Software Engineer
Outline
• {rr} Story
• {rr} Personalization Platform
• User Segmentation: Problem Statement
• Design & Architecture
• Performance Metrics
• Q&A
Multiple Personalization Placements
• Personalized product recommendations: 100+ algorithms dynamically targeted by page, context, and user behavior, across all retail channels
• Targeted promotions: targeted content optimization to customer segments
• Relevant brand advertising: monetization through relevant brand advertising
Input data
• Inventory and margin data
• Catalog attribute data
• Real-time user behavior
• Multi-channel purchase history
• 3rd-party data sources
• Future data sources
RichRelevance DataMesh Cloud Platform
Delivering a Single View of your Customer
• One place where all data about your customer lives (e.g. loyalty, customer service, offline, catalogue)
• Direct access to data for your entire enterprise via API
• Real-time actionable data and business intelligence
• Link customer activity across any channel
• Leverage 3rd-party data (e.g. weather, geography, census)
Delivering a customer-centric, omni-channel data model
RichRelevance DataMesh components:
• Real-time Segmentation
• DMP Integration
• Event-based Triggers
• Ad Hoc Reporting
• Omni-channel Personalization
• Loyalty Integration
Did You Know?
• Our cloud-based platform supports both real-time processes and analytical use cases, using technologies such as Crunch, Hive, HBase, Avro, Azkaban, Voldemort, and Kafka
• Someone clicks on a {rr} recommendation every 21 milliseconds
• Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real time
• In the US, we serve 7000 requests per second with an average response time of 50 ms
Real-Time User Segmentation
Finding and Targeting the Right Audience
Meet Amanda and Jessica
Because We Know What They Like!
What is the {rr} Segment Builder?
• Uses valuable targeting data such as event logs, offline data, DMA, and demographics
• Finds highly qualified consumers and buckets them into segments
Example: for retail, Segment Builder supports view, click, and purchase behaviors on products, categories, and brands
Segment Builder
Segment’s List page
Segment Builder
Add/Edit Segment page
Design: Segment Evaluation Engine
• Create segments to capture the audience via the UI
• A segment definition consists of one or more rules
• Each behavior is captured by a rule
• Each rule corresponds to a row key in HBase
• Each rule returns the set of users who exhibited that behavior
• Rules are joined using set union or intersection
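The rule model of the segment evaluation engine can be sketched in a few lines. This is only an illustration under stated assumptions: the `INDEX` dictionary is a hypothetical in-memory stand-in for the HBase table, and the row key strings, `Rule`, and `evaluate` are invented names, not the {rr} implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical stand-in for the HBase index: row key -> user ids.
# The key strings are made up for illustration only.
INDEX: Dict[str, Set[str]] = {
    "view:brand:Chanel": {"u1", "u2", "u3"},
    "purchase:category:Electronics": {"u2", "u3", "u4"},
    "click:product:1234": {"u3", "u5"},
}


@dataclass(frozen=True)
class Rule:
    """One behavior, addressed by exactly one row key."""
    row_key: str

    def users(self) -> Set[str]:
        # Return a copy so callers cannot mutate the index.
        return set(INDEX.get(self.row_key, set()))


def evaluate(rules: List[Rule], op: str) -> Set[str]:
    """Combine the user sets of all rules with 'AND' (intersection) or 'OR' (union)."""
    if not rules:
        return set()
    user_sets = [rule.users() for rule in rules]
    if op == "AND":
        return set.intersection(*user_sets)
    if op == "OR":
        return set.union(*user_sets)
    raise ValueError(f"unknown operator: {op}")
```

For example, `evaluate([Rule("view:brand:Chanel"), Rule("purchase:category:Electronics")], "AND")` returns the users who both viewed the brand and purchased in the category.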
Real-Time Data Ingestion
Our choice: Avro and Kafka
Real-time data ingestion
• User interacts with a retail site
• Events are triggered in real time; each event is converted into an Avro record
• Events are ingested using our real-time framework built on Apache Kafka
Design Principles: Real-Time Solution
• Streaming: support for streaming versus batch
• Reliability: no loss of events and at-least-once delivery of messages
• Scalability: add more brokers, add more consumers, partition data
• Distributed system with a central coordinator such as ZooKeeper
• Persistence of messages
• Push & pull mechanism
• Support for compression
• Low operational complexity & easy to monitor
Our Decision – Apache Kafka
• Distributed pub-sub messaging system
• More generic concept (Ingest & Publish)
• Can support both offline & online use-cases
• Designed for persistent messages as the common case
• Guarantees ordering of events within a partition
• Supports gzip and snappy (0.8) compression codecs
• Supports rewind offset & re-consumption of data
• No concept of master node, brokers are peers
Kafka Architecture
Common Consumer Framework
Volume Facts
Daily clickstream event data: 150 GB
Average message size: 2 KB
Batch size: 5000 messages
Producer throughput: 5000 messages/sec
Real-time HBase consumer throughput: 7000 messages/sec
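A quick back-of-the-envelope check puts the volume figures in relation to each other. The calculation below assumes decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes) and a load spread evenly over the day; the slides state neither, and real traffic is likely peakier than the average.

```python
# Assumption: decimal units and evenly distributed traffic.
GB = 10**9
KB = 10**3
SECONDS_PER_DAY = 86_400

daily_bytes = 150 * GB
avg_message_bytes = 2 * KB

# About 75 million messages per day.
messages_per_day = daily_bytes // avg_message_bytes

# Average ingest rate, roughly 868 messages/sec.
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY

# How far the quoted throughputs sit above the average rate.
producer_headroom = 5000 / avg_messages_per_sec   # about 5.8x
consumer_headroom = 7000 / avg_messages_per_sec   # about 8.1x

print(f"{messages_per_day:,} messages/day, {avg_messages_per_sec:.0f} messages/sec on average")
```

Under these assumptions the quoted producer and consumer throughputs leave several times the average rate as headroom for traffic peaks.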
End to End real-time story…
• User exhibits a behavior
• Events are generated at the front-end data center
• Events are streamed to the back-end data center via Kafka
• Events are consumed and indexed in HBase tables
• It takes seconds from event origination to indexing in HBase
End to End real-time story…
User Segmentation Engines
Features | {rr} Engine | Other Engines
User's behavior ingestion | Real-time | Not real-time
Batch-style processing is done | Immediately | At the end of the day
Notifications when segment membership changes | Event driven | N/A
Technologies used | Scalable and open source | Unscalable and proprietary
Use Cases
Use Case #1
• Users exhibit behaviors
• Behaviors are ingested and indexed in real time
• Users are now in the corresponding segments
• Retrieving users takes seconds
Use Case #2
• Users exhibit behaviors
• Segment membership is calculated in real time
• Notifications are sent on segment membership change
Our choice: HBase
• Real-time segment evaluation
• Optimized for reads and scans
• Column cells support the frequency use case
• Eventual consistency does not work for this use case
• Seamless integration with Hadoop
• Feasible with a good row key design
HBase Row Key Design
• Took a few attempts
• Design considerations
– Timestamp in row or columns
– Partition behavior by date
– Optimized for read or write
– Hot spotting issues
– Uniform key distribution
Design: First Attempt
• Row key represents behavior
• Columns store the user id
• Cell stores behavior time and capture frequency
• One column family U
RowKey | Columns
338VBChanel | 23b93laddf82:1370377143973, Hd92jslahd0a:1313323414192
338CCElectronic | z3be3la2dfa2:1370477142970, kd9zjsla3d01:1313323414192
Design: First Attempt Issues
• Row too wide
• May exceed HFile size
• Terrible write/read performance
Design: Second Attempt
• Partition behavior by date
• Reduce row size
• Gained ability to scan across dates
RowKey | Columns
338VBChanel1370377143 | 23b93laddf82:1370377143973, Hd92jslahd0a:1313323414192
338CCElectronic1370377143 | z3be3la2dfa2:1370477142970, kd9zjsla3d01:1313323414192
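Partitioning by date is what makes a scan across dates possible: rows for one behavior sort next to each other by day, so a date range maps to a start row and a stop row. The sketch below is an assumption-laden illustration. It uses a zero-padded YYYYMMDD suffix rather than the epoch-seconds suffix shown in the slide's example keys, and `day_key` and `scan_range` are hypothetical helper names.

```python
from datetime import date, timedelta
from typing import Tuple


def day_key(behavior: str, day: date) -> bytes:
    """Row key = behavior prefix + zero-padded day (YYYYMMDD).

    A fixed-width date keeps lexicographic order equal to date order.
    """
    return f"{behavior}{day:%Y%m%d}".encode("utf-8")


def scan_range(behavior: str, first: date, last: date) -> Tuple[bytes, bytes]:
    """Start row (inclusive) and stop row (exclusive) covering first..last."""
    return day_key(behavior, first), day_key(behavior, last + timedelta(days=1))


# One week of "viewed brand Chanel" rows for site 338.
start_row, stop_row = scan_range("338VBChanel", date(2013, 6, 1), date(2013, 6, 7))
print(start_row, stop_row)
```

The start and stop rows would then be passed to an HBase scan; with variable-length fields a bare prefix like this can be ambiguous, which is one reason to make field boundaries explicit in the key.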
Design: Second Attempt Issues
• Hot-spotting
• Popular products or high-level categories can have millions of users each day
• One region serves all rows of the same dimension type
• Terrible write/read performance
OK…I need a BREAK!!!
Design: Final
• Shard rows to prevent hot-spotting
• Prepend a salt to each key
• Shard into N regions
• Significant improvement in read/write performance
Design: Final
Row key format: [salt]_len_[siteId]_len_[metric]_len_[dimension]_len_[value][timestamp]
• _len_ is the integer length of the field following it
• timestamp is stored at day granularity
• [salt] is computed by first creating a hash from the siteId, metric, and dimension, then combining this with a random number between 0 and N (the number of shards) for sharding
Row key contains: attribute value, siteId, metric, attribute, timestamp, userGUID
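The salted, length-prefixed key layout can be sketched as follows. Several details are assumptions, since the slides do not specify them: the hash function (CRC32 here), how the hash and the random shard number are combined (simple concatenation here), the two-digit text encoding of `_len_`, the YYYYMMDD day format, and the reading of the example key `338VBChanel` as siteId 338, metric V, dimension B, value Chanel. All function names are hypothetical.

```python
import random
import zlib
from datetime import date
from typing import List, Optional

NUM_SHARDS = 8  # N, the number of shards; illustrative value


def _field(value: str) -> bytes:
    """Length-prefixed field: 2-digit byte length followed by the value."""
    raw = value.encode("utf-8")
    if len(raw) > 99:
        raise ValueError("field too long for a 2-digit length prefix")
    return f"{len(raw):02d}".encode("ascii") + raw


def _hash_prefix(site_id: str, metric: str, dimension: str) -> bytes:
    """Two hex digits derived from siteId, metric, and dimension."""
    digest = zlib.crc32(f"{site_id}|{metric}|{dimension}".encode("utf-8"))
    return f"{digest & 0xFF:02x}".encode("ascii")


def row_key(site_id: str, metric: str, dimension: str, value: str,
            day: date, shard: Optional[int] = None) -> bytes:
    """Build [salt][len][siteId][len][metric][len][dimension][len][value][day]."""
    if shard is None:
        # Write path: pick a random shard so one hot behavior spreads out.
        shard = random.randrange(NUM_SHARDS)
    salt = _hash_prefix(site_id, metric, dimension) + f"{shard:02d}".encode("ascii")
    body = b"".join(_field(f) for f in (site_id, metric, dimension, value))
    return salt + body + f"{day:%Y%m%d}".encode("ascii")


def read_keys(site_id: str, metric: str, dimension: str, value: str,
              day: date) -> List[bytes]:
    """Read path: one key per shard, because the writer's shard is unknown."""
    return [row_key(site_id, metric, dimension, value, day, shard=s)
            for s in range(NUM_SHARDS)]
```

The trade-off in this kind of design is that writes for a popular behavior spread over N rows, while each read must fetch all N rows and merge the resulting user sets.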
Design: Behavior Joins
• Complex segments contain many rules
• Each rule = one behavior = one row key
• Each row key returns a set of users
• OR = full outer join
• AND = inner join
• Done in memory for small rules
• Merged on disk for large rules
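For rules too large to hold in memory, one common approach is a streaming merge over sorted runs of user ids. The slides do not describe how the on-disk merge is done, so the following is only a sketch of that general technique; it assumes each input stream is already sorted and free of duplicates, and the function names are hypothetical.

```python
import heapq
from typing import Iterable, Iterator


def union_sorted(*streams: Iterable[str]) -> Iterator[str]:
    """OR: merge sorted streams of user ids, dropping duplicates."""
    previous = None
    for user in heapq.merge(*streams):
        if user != previous:
            yield user
            previous = user


def intersect_sorted(left: Iterable[str], right: Iterable[str]) -> Iterator[str]:
    """AND: walk two sorted streams in lockstep, yielding ids present in both."""
    left_it, right_it = iter(left), iter(right)
    a = next(left_it, None)
    b = next(right_it, None)
    while a is not None and b is not None:
        if a == b:
            yield a
            a = next(left_it, None)
            b = next(right_it, None)
        elif a < b:
            a = next(left_it, None)
        else:
            b = next(right_it, None)
```

Both functions consume their inputs lazily, so the streams could be file readers over sorted spill files rather than in-memory lists; only one id per stream is held at a time.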
HBase Consumer Sync versus Async API
Segmentation Performance
• Latency measured in seconds
• 40K puts/sec over 2 tables, 8 regions per table
• Scaling achieved by adding regions
• Small segments calculated in milliseconds
• Mid-size segments calculated in seconds
• Large segments calculated in tens of seconds
Thank You

Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Dernier (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

How we solved Real-time User Segmentation using HBase

  • 1. How we solved Real-Time User Segmentation using HBase? Murtaza Doctor, Principal Architect; Giang Nguyen, Sr. Software Engineer. © 2013 RichRelevance, Inc. All Rights Reserved. Confidential.
  • 2. Outline
      • {rr} Story
      • {rr} Personalization Platform
      • User Segmentation: Problem Statement
      • Design & Architecture
      • Performance Metrics
      • Q&A
  • 4. Multiple Personalization Placements
      • Placements: Personalized Product Recommendations; Targeted Promotions; Relevant Brand Advertising
      • Input data: Inventory and Margin Data; Catalog Attribute Data; Real-time user behavior; Multi-channel purchase history; 3rd Party Data Sources; Future Data Sources
      • 100+ algorithms dynamically targeted by page, context, and user behavior, across all retail channels
      • Targeted content optimization to customer segments
      • Monetization through relevant brand advertising
  • 5. RichRelevance DataMesh Cloud Platform: Delivering a Single View of your Customer
      • One place where all data about your customer lives (e.g. loyalty, customer service, offline, catalogue)
      • Direct access to data for your entire enterprise via API
      • Real-time actionable data and business intelligence
      • Link customer activity across any channel
      • Leverage 3rd party data (e.g. weather, geography, census)
      • RichRelevance DataMesh: delivering a customer-centric omni-channel data model
      • Capabilities: Real-time Segmentation; DMP Integration; Event-based Triggers; Ad Hoc Reporting; Omni-channel Personalization; Loyalty Integration
  • 6. Did You Know?
      • Our cloud-based platform supports both real-time processes and analytical use cases, using technologies such as Crunch, Hive, HBase, Avro, Azkaban, Voldemort, and Kafka, to name a few
      • Someone clicks on a {rr} recommendation every 21 milliseconds
      • Our data capacity includes a 1.5 PB Hadoop infrastructure, which enables us to employ 100+ algorithms in real time
      • In the US, we serve 7000 requests per second with an average response time of 50 ms
  • 7. Real-Time User Segmentation: Finding and Targeting the Right Audience
  • 8. Meet Amanda and Jessica
  • 9. Because We Know What They Like!
  • 10. What is the {rr} Segment Builder?
      • Utilizes valuable targeting data such as event logs, offline data, DMA, demographics, etc.
      • Finds highly qualified consumers and buckets them into segments
      • Example: for retail, Segment Builder supports view, click, and purchase on products, categories, and brands
  • 11. Segment Builder: Segment’s List page
  • 12. Segment Builder: Add/Edit Segment page
  • 13. Design: Segment Evaluation Engine
      • Create segments to capture the audience via UI
      • Each behavior is captured by a rule
      • Each rule corresponds to a row key in HBase
      • Each rule returns the users
      • Rules are joined using set union or intersection
      • Segment definition consists of one or more rules
  • 14. Real-Time Data Ingestion. Our choice: Avro and Kafka
  • 15. Real-time data ingestion
      • User interacts with a retail site
      • Events are triggered in real time; each event is converted into an Avro record
      • Events are ingested using our real-time framework built using Apache Kafka
  • 16. Design Principles: Real-Time Solution
      • Streaming: support for streaming versus batch
      • Reliability: no loss of events and at-least-once delivery of messages
      • Scalable: add more brokers, add more consumers, partition data
      • Distributed system with a central coordinator like ZooKeeper
      • Persistence of messages
      • Push & pull mechanism
      • Support for compression
      • Low operational complexity & easy to monitor
  • 17. Our Decision: Apache Kafka
      • Distributed pub-sub messaging system
      • More generic concept (Ingest & Publish)
      • Can support both offline & online use cases
      • Designed for persistent messages as the common case
      • Guarantees ordering of events
      • Supports gzip & snappy (0.8) compression protocols
      • Supports rewind offset & re-consumption of data
      • No concept of master node, brokers are peers
  • 18. Kafka Architecture
  • 19. Common Consumer Framework
  • 20. Volume Facts?
      • Daily clickstream event data: 150 GB
      • Average size of message: 2 KB
      • Batch size: 5000 messages
      • Producer throughput: 5000 messages/sec
      • Real-time HBase consumer: 7000 messages/sec
  • 21. End to End real-time story…
      • User exhibits a behavior
      • Events are generated at the front-end data center
      • Events are streamed to the backend data center via Kafka
      • Events are consumed and indexed in HBase tables
      • Takes seconds from event origination to indexing in HBase
  • 22. End to End real-time story…
  • 23. User Segmentation Engines
          Features                                                   | {rr} Engine                | Other Engines
          User’s behavior ingestion                                  | Real-time                  | Not real-time
          Batch style processing is done                             | Immediately                | At end of a day
          When segment membership is changed, notifications will be  | Event driven               | N/A
          Technologies used                                          | Scalable and open source   | Unscalable and proprietary
  • 24. Use Cases
      • Use Case #1
          • Users exhibit behaviors
          • Behaviors ingested and indexed in real time
          • Users are now in corresponding segments
          • Retrieving users takes seconds
      • Use Case #2
          • Users exhibit behaviors
          • Segment membership calculated in real time
          • Notifications are sent on segment membership change
  • 25. Our choice HBase?
      • Real time segment evaluation
      • Optimized for read and scan
      • Column cell supports frequency use case
      • Eventual consistency does not work
      • Seamless integration with Hadoop
      • Possible with good row key design
  • 26. HBase Row Key Design
      • Took a few attempts
      • Design considerations
          – Timestamp in row or columns
          – Partition behavior by date
          – Optimized for read or write
          – Hot spotting issues
          – Uniform key distribution
  • 27. Design: First Attempt
      • Row key represents behavior
      • Columns store the user id
      • Cell stores behavior time and capture frequency
      • One column family U
          RowKey           | Columns
          338VBChanel      | 23b93laddf82:1370377143973, Hd92jslahd0a:1313323414192
          338CCElectronic  | z3be3la2dfa2:1370477142970, kd9zjsla3d01:1313323414192
  • 28. Design: First Attempt Issues
      • Row too wide
      • May exceed HFile size
      • Terrible write/read performance
  • 29. Design: Second Attempt
      • Partition behavior by date
      • Reduce row size
      • Gained ability to scan across dates
          Rowkey                     | Columns
          338VBChanel1370377143      | 23b93laddf82:1370377143973, Hd92jslahd0a:1313323414192
          338CCElectronic1370377143  | z3be3la2dfa2:1370477142970, kd9zjsla3d01:1313323414192
  • 30. Design: Second Attempt Issues
      • Hot spotting
      • Popular products or high level categories can have millions of users, each day
      • One region serving same dimension type
      • Terrible write/read performance
  • 31. OK…I need a BREAK!!!
  • 32. Design: Final
      • Shard row to prevent hot-spotting
      • Shard into N number of regions
      • Significant improvement in read/write
      • Prepend a salt to each key
  • 33. Design: Final
      • Row key format: [salt]_len_[siteId]_len_[metric]_len_[dimension]_len_[value][timestamp]
      • _len_ is the integer length of the field following it
      • timestamp is stored in day granularity
      • [salt] is computed by first creating a hash from the siteId, metric, and dimension, then combining this with a random number between 0 and N (number of shards) for sharding
      • Row key contains (diagram labels): attribute value, siteId, metric, attribute, timestamp, userGUID
  • 34. Design: Behavior Joins
      • Complex segments contain many rules
      • Each rule = one behavior = one row key
      • Each row key returns a set of users
      • OR = Full outer join
      • AND = Inner join
      • Done in memory for small rules
      • Merged on disk for large rules
  • 35. HBase Consumer: Sync versus Async API
  • 36. Segmentation Performance
      • Seconds latency
      • 40K puts/sec over 2 tables, 8 regions per table
      • Scaling achieved through addition of regions
      • Small segments calculated in milliseconds
      • Mid-size segments in seconds
      • Large segments calculated in 10s of seconds
  • 38. Thank You
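
The salted row key layout from slides 32 and 33 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names, the choice of MD5, the two-character hash prefix, the shard count of 8, the `YYYYMMDD` day format, and the way the hash is combined with the shard number are all assumptions, since the slides only say that the salt combines a hash of siteId, metric, and dimension with a random number up to N. Reading the earlier example key `338VBChanel` as site `338`, metric `V`, dimension `B` is also a guess.

```python
import hashlib
import random
from datetime import datetime, timezone

NUM_SHARDS = 8  # "N" on the slide; the value 8 is an assumption


def day_bucket(epoch_millis: int) -> str:
    """Truncate a millisecond timestamp to day granularity (UTC, YYYYMMDD)."""
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y%m%d")


def salt_for(site_id: str, metric: str, dimension: str, shard: int) -> str:
    """Hash prefix of (siteId, metric, dimension) plus a shard number.

    How the two parts are combined is an assumption; here they are concatenated.
    """
    digest = hashlib.md5(f"{site_id}|{metric}|{dimension}".encode()).hexdigest()
    return f"{digest[:2]}{shard:02d}"


def row_key(site_id, metric, dimension, value, epoch_millis, shard=None):
    """Build [salt]_len_[siteId]_len_[metric]_len_[dimension]_len_[value][timestamp]."""
    if shard is None:
        # Writes pick a random shard so a popular behavior spreads over N rows.
        shard = random.randrange(NUM_SHARDS)
    parts = [salt_for(site_id, metric, dimension, shard)]
    for field in (site_id, metric, dimension, value):
        parts.append(f"_{len(field)}_{field}")  # _len_ is the length of the next field
    parts.append(day_bucket(epoch_millis))
    return "".join(parts)


def read_keys(site_id, metric, dimension, value, epoch_millis):
    """All N keys a reader would have to fetch for one behavior on one day."""
    return [
        row_key(site_id, metric, dimension, value, epoch_millis, shard=s)
        for s in range(NUM_SHARDS)
    ]
```

One consequence of a random salt component is that a reader cannot know which shard a given user was written to, so in this sketch `read_keys` enumerates every shard and the caller would merge the results.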

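Slides 13 and 34 describe each rule returning a set of users, with OR acting as a full outer join (set union) and AND as an inner join (set intersection). The in-memory case can be sketched as below. This is a minimal illustration under stated assumptions: the rule names, the user `sam`, and the dict standing in for HBase row lookups are hypothetical, and the on-disk merge used for large rules is not shown.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for HBase: one rule (row key) -> users who showed that behavior.
RULE_USERS: Dict[str, Set[str]] = {
    "viewed:brand:Chanel": {"amanda", "jessica"},
    "purchased:category:Electronics": {"jessica", "sam"},
}


def users_for_rule(rule: str) -> Set[str]:
    """Return the users matching one rule; unknown rules match nobody."""
    return RULE_USERS.get(rule, set())


def evaluate_segment(rules: List[str], operator: str) -> Set[str]:
    """Join the user sets of all rules with AND (intersection) or OR (union)."""
    if not rules:
        return set()
    user_sets = [users_for_rule(rule) for rule in rules]
    if operator == "AND":
        return set.intersection(*user_sets)
    if operator == "OR":
        return set.union(*user_sets)
    raise ValueError(f"unsupported operator: {operator}")
```

For example, `evaluate_segment(["viewed:brand:Chanel", "purchased:category:Electronics"], "AND")` keeps only users present in both rule sets, while `"OR"` keeps users present in either.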
Editor's notes

  1. How we plan to go over stuff
  2. The RichRelevance DataMesh Cloud Platform delivers a single view of your customer by giving you one single place to house unlimited sets of data. Example use cases:
      • Create your own run-time strategies (predictive models)
      • Create and manage segments via tool
      • Automatic & real-time segment creation
      • View performance of strategies against KPIs
      • Run ad hoc queries using SQL-like tool
      • Import into offline tools
      • OLAP capabilities
      • Market Basket Analysis
      • Customer Lifetime Value
      • Sequential Pattern mining
      • Manage APIs, build products & applications
  3. Nuggets or data points: 1.5 PB is not as big as Yahoo or Facebook, but it is huge from a retail industry perspective.
  4. Notes on the design principles:
      • Distributed system: producer, broker, and consumer entities can all be deployed to different hosts in different colos in a truly distributed fashion, with coordination controlled through ZooKeeper.
      • Persistence of messages: messages need to be persisted on the broker for reliability, replay, and temporary storage.
      • Push & pull mechanism: push data to the Kafka server and pull data from it using a consumer. This allows for two different rates: the rate at which messages are transferred to the Kafka server and the rate at which the messages are consumed.
      • Support for compression: Kafka supports GZIP, and version 0.8 will additionally support Snappy compression.
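
Because producers push and consumers pull, the two rates can differ. A rough back-of-the-envelope check using the figures from slide 20 is sketched below. It assumes decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes), traffic spread evenly over the day, and that the quoted producer and consumer throughputs are sustained rates; none of these assumptions is stated on the slides, and `drain_seconds` is a hypothetical helper.

```python
DAILY_BYTES = 150 * 10**9        # daily clickstream event data: 150 GB
AVG_MESSAGE_BYTES = 2 * 10**3    # average message size: 2 KB
PRODUCER_RATE = 5000             # messages/sec pushed to Kafka
CONSUMER_RATE = 7000             # messages/sec pulled by the HBase consumer
SECONDS_PER_DAY = 86_400

# Messages per day implied by the daily volume and the average message size.
messages_per_day = DAILY_BYTES // AVG_MESSAGE_BYTES

# Average arrival rate if traffic were spread evenly over the day.
average_rate = messages_per_day / SECONDS_PER_DAY


def drain_seconds(backlog_messages: int) -> float:
    """Time for the consumer to clear a backlog while the producer keeps pushing."""
    surplus = CONSUMER_RATE - PRODUCER_RATE  # spare consumer capacity per second
    return backlog_messages / surplus
```

Under these assumptions the average arrival rate sits well below both quoted throughputs, and the consumer's spare capacity is what lets it catch up after a pause, since messages persist on the broker in the meantime.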