At RichRelevance, we service 10 of the top 20 Internet retailer chains and deliver more than $5.5 billion in attributable sales. Every 21 milliseconds a shopper clicks on a recommendation that we have delivered, and we serve over 850 million product recommendations daily. Our Hadoop infrastructure has the capacity to handle upwards of 1.5 PB. Behavioral targeting, specifically user segmentation and persona building, is critical for us in generating triggers when a user is added to a segment or switches from one segment to another. In this presentation, we demonstrate not only how the events are captured, but also how they are stored in HBase in real time. The system must be designed to handle thousands of writes per second while still being able to query any combination of behavioral attributes in HBase through real-time APIs. This session walks attendees through the entire design and architecture, starting from data ingestion, schema design, and access patterns, and covers major problems such as sharding and hot spotting. Performance metrics will also be presented, including the number of reads and writes per second and details of the cluster configuration.
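The abstract does not spell out the row-key scheme, so the following is only a minimal sketch of one common way to reduce hot spotting in HBase: prefixing each row key with a hash-derived bucket so that writes spread across regions. The function name `salted_row_key`, the bucket count, and the key layout are all assumptions made for illustration, not the design used in the talk.

```python
import hashlib

# Hypothetical bucket count; in practice it is often sized to the number of regions.
NUM_BUCKETS = 16
MAX_LONG = 2**63 - 1


def salted_row_key(user_id: str, event_ts_ms: int, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Build a row key shaped like <bucket>|<user_id>|<reversed timestamp>.

    The bucket is derived from a hash of the user id, so all events of one
    user share a prefix (cheap per-user scans) while different users are
    spread over num_buckets key ranges.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    # Reversing the timestamp makes a user's newest events sort first.
    reversed_ts = MAX_LONG - event_ts_ms
    return f"{bucket:02d}|{user_id}|{reversed_ts:019d}".encode("utf-8")
```

Because the prefix is computed from the user id rather than chosen at random, a reader can recompute it and fetch one user's events with a single prefix scan; the trade-off is that one extremely active user still lands in a single bucket.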
The RichRelevance DataMesh Cloud Platform delivers a single view of your customer by:
- Giving you one single place to house unlimited sets of data

Example use cases:
- Create your own run-time strategies (predictive models)
- Create and manage segments via tool
- Automatic & real-time segment creation
- View performance of strategies against KPIs
- Run ad hoc queries using SQL-like tool
- Import into offline tools
- OLAP capabilities
- Market Basket Analysis
- Customer Lifetime Value
- Sequential Pattern Mining
- Manage APIs, build products & applications
Nuggets or Data Points
1.5 PB: not as big as Yahoo or Facebook, but huge from a retail industry perspective
- Distributed System: producers, brokers, and consumer entities can all be deployed to different hosts in different colos in a truly distributed fashion, with coordination controlled through ZooKeeper.
- Persistence of Messages: messages need to be persisted on the broker for reliability, replay, and temporary storage.
- Push & Pull Mechanism: producers push data to the Kafka server and consumers pull data from it. This allows for two different rates: the rate at which messages are transferred to the Kafka server and the rate at which the messages are consumed.
- Compression: Kafka supports GZIP, and version 0.8 will additionally support Snappy compression.
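The push and pull behavior and message persistence described above can be illustrated with a toy, in-memory model. This is a sketch only: `ToyBroker` and its methods are hypothetical names, not the Kafka client API. It shows how an append-only log plus consumer-held offsets lets producers and consumers run at different rates and lets a consumer replay old messages.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def push(self, topic, message):
        """Producer side: append a message and return its offset."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def pull(self, topic, offset, max_messages=10):
        """Consumer side: read up to max_messages starting at offset.

        Messages are not deleted on read, so any offset can be re-read.
        """
        return self._logs[topic][offset:offset + max_messages]


broker = ToyBroker()

# The producer pushes at its own rate.
for i in range(5):
    broker.push("clicks", f"event-{i}")

# The consumer pulls in small batches and tracks its own offset.
offset = 0
batch = broker.pull("clicks", offset, max_messages=2)
offset += len(batch)

# Replay: reading again from offset 0 returns the same messages.
replayed = broker.pull("clicks", 0, max_messages=2)
```

Keeping the offset on the consumer side is what decouples the two rates: the broker only appends, and a slow consumer simply lags behind without blocking the producer.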