White Paper: Causata Big Data Architecture
TABLE OF CONTENTS
· Introduction
· Event Storage in HBase
· Writing Data into Causata
· Data Principles
· Customer Identities & Event Timelines
· Predictive Profiles
· Model Scores & Behavioral Predictors
· Reading Data from Causata
· Summary
· Contact
© 2012 Causata Inc. · All Rights Reserved
Introduction
Causata’s customer experience management applications are built upon parallel big data storage that enables the
efficient analysis of terabytes of diverse, granular, multi-structured customer data.
Stitching together unstructured and structured customer interaction data from any digital source or channel,
Causata then assembles it into concise, structured customer records suitable for ad-hoc analysis, predictive
modeling, and advanced machine learning.
Causata’s data storage layer is customer- and event-oriented, so every single customer interaction is stored in full
detail, using parallel storage and computation to provide low-latency access to each customer’s record set to drive
real-time actions and decisions.
Event Storage in HBase
At its lowest architectural level, Causata utilizes HBase to store a vast set of granular event data. HBase is a highly
scalable data store that forms part of the open-source Hadoop product suite, and provides a robust, inexpensive
way to store every individual customer interaction.
Causata stores detailed customer interaction records from any digital channel, such as a web click, a product
purchase, an email or a tweet. Each data point is recorded as a simple set of key-value pairs called an event. For
example, a product purchase might have a SKU, a brand, a price, a size and color; a web click might have a URL, a
page category, a browser type, a language setting and a time zone. Causata turns this messy, multi-structured event
data into structured data for analysis, sometimes called ‘rectangular data’ because each customer record has the
same set of computed fields.
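As a rough illustration of the event model described above, the following Python sketch represents two events as plain key-value dictionaries and reduces them to a fixed-width ("rectangular") customer record. The field names, the `customer_id` key, and the `to_rectangular` helper are assumptions made for illustration; they are not Causata's actual schema or API.

```python
from typing import Any

# Two events with different key sets (hypothetical field names).
events: list[dict[str, Any]] = [
    {"customer_id": "c1", "type": "purchase", "sku": "A-100", "brand": "Acme",
     "price": 25.0, "size": "M", "color": "blue"},
    {"customer_id": "c1", "type": "web_click", "url": "/shoes",
     "page_category": "footwear", "browser": "Firefox",
     "language": "en", "time_zone": "UTC"},
]

def to_rectangular(customer_id: str, events: list[dict[str, Any]]) -> dict[str, Any]:
    """Compute the same fixed set of fields for any customer."""
    mine = [e for e in events if e["customer_id"] == customer_id]
    purchases = [e for e in mine if e["type"] == "purchase"]
    clicks = [e for e in mine if e["type"] == "web_click"]
    return {
        "customer_id": customer_id,
        "purchase_count": len(purchases),
        "total_spend": sum(e["price"] for e in purchases),
        "click_count": len(clicks),
    }

record = to_rectangular("c1", events)
```

Every customer yields a record with the same four fields, regardless of which event types appear in that customer's data.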
Causata’s implementation of HBase supports the flexibility to add new customer interaction data types
easily. Causata does not have a traditional fixed or relational data schema. Data from any source can be loaded or
streamed into Causata, and the structure and signal extraction are applied later, when the data is read.
In order to enable fast access to individual customer records, data is stored redundantly in Causata across
multiple servers. This protects against data loss and enables high-volume data retrieval and analysis through the
use of parallel processing.
Writing Data into Causata
Causata has a simple HTTP Data Connector, to which an event is written as a JSON object. Because Causata is
schema-free, it is easy to input any digital customer interaction – behavioral, social or transactional.
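The sketch below shows what writing one event as a JSON object over HTTP could look like from a Python client. The endpoint URL, the use of POST, and the event fields are hypothetical, since this paper does not document the connector's address or payload format. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical endpoint: the white paper does not give the connector's URL.
CONNECTOR_URL = "http://causata.example.invalid/events"

def build_event_request(event: dict) -> urllib.request.Request:
    """Serialize one event as JSON and wrap it in an HTTP POST request."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        CONNECTOR_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# An event is just a set of key-value pairs; no schema is declared up front.
event = {"type": "purchase", "email": "jane@example.com",
         "sku": "A-100", "price": 25.0}
request = build_event_request(event)
# urllib.request.urlopen(request) would send it; omitted because the URL is made up.
```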
Causata consumes real-time feeds or streams, log and CSV files, and ODBC connections to databases and data
warehouses, and plugs easily into any ETL tool, including the open-source tools Pentaho and Talend. Data can be loaded or
streamed into Causata from an existing Hadoop or HBase data store by running a MapReduce job to generate input
events for Causata.
Examples of data sets feeding Causata include web and email analytics, web tags and tag management systems,
mobile apps, social data streams, CRM and ERP data, machine logs, and data management platforms (DMPs).
Data Principles
Causata was designed with three data principles in mind:
Scalability, Flexibility, and Low Latency
Scalability across terabytes of unstructured customer interaction records relies on parallel computing – sharing the
data storage across horizontally scaling servers and performing the analytic processes in parallel, close to the data.
Flexibility is essential to cope with rapid and unpredictable changes in how customer data is generated and
consumed. Causata does not impose a fixed database schema, and allows the definition of customer records for
analysis to be made dynamically at query time.
Low Latency data access is critical both to allow business analysts and marketing scientists to perform
interactive investigation of the data, and to drive real-time personalized marketing decisions from the data
analysis. This means retrieval and assembly of customer profiles in 50 milliseconds or less, including their very
latest interactions across multiple channels.
Customer Identities & Event Timelines
A key element of Causata’s big data engine is its Identity Graph. By observing patterns of identifiers that occur
together, Causata builds up a graph connecting identifiers to an individual and ascribes each data fragment to
the correct customer. This picture becomes richer over time as new pieces of linking customer information
are recorded.
For example, if a customer logs into her web account from home and then a week later does the same from her
work computer, both cookies become linked and the two sets of web activity data are merged into a single event
stream, providing a richer profile for that customer.
Data from email, mobile, social, and bricks-and-mortar channels are easily combined in the same way, by
matching identifiers such as credit and loyalty cards, account numbers, email addresses, and telephone numbers.
The Identity Graph adjusts to new connection events, providing as complete a picture as possible of an individual
customer at any point in time.
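One common way to maintain such a graph of linked identifiers is a union-find (disjoint-set) structure, sketched below in Python. This is an illustrative technique chosen for the example, not a description of how Causata implements the Identity Graph; the class name and identifier strings are hypothetical.

```python
class IdentityGraph:
    """Union-find over identifiers; each connected set is one customer."""

    def __init__(self) -> None:
        self._parent: dict[str, str] = {}

    def _find(self, identifier: str) -> str:
        # Register unseen identifiers as their own root.
        self._parent.setdefault(identifier, identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path at the root.
        while self._parent[identifier] != root:
            self._parent[identifier], identifier = root, self._parent[identifier]
        return root

    def link(self, a: str, b: str) -> None:
        """Record that two identifiers were observed together."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_customer(self, a: str, b: str) -> bool:
        return self._find(a) == self._find(b)

graph = IdentityGraph()
graph.link("cookie:home", "account:42")  # login from the home computer
graph.link("cookie:work", "account:42")  # login from work a week later
linked = graph.same_customer("cookie:home", "cookie:work")
```

After the second login, both cookies resolve to the same customer, so their event streams can be merged.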
Causata organizes and stores interaction data by individual customer, forming a single event-based Customer
Timeline. Retaining the detailed event sequence, in chronological time order, allows business analysts to analyze
cause and effect in customer behavior, and to investigate specific scenarios or path analyses. This essential time
ordering is typically lost in other data systems, such as when data is pre-aggregated in a data warehouse.
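A minimal Python sketch of a Customer Timeline follows: events are grouped by customer and kept in chronological order. The `customer_id` and `timestamp` keys are assumed names, and the timestamps are ISO 8601 strings in one fixed format so that string order matches time order.

```python
from collections import defaultdict

def build_timelines(events):
    """Group events by customer and sort each group chronologically."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event["customer_id"]].append(event)
    for timeline in timelines.values():
        timeline.sort(key=lambda e: e["timestamp"])
    return dict(timelines)

events = [
    {"customer_id": "c1", "timestamp": "2012-03-02T10:00:00", "type": "purchase"},
    {"customer_id": "c2", "timestamp": "2012-03-01T12:00:00", "type": "web_click"},
    {"customer_id": "c1", "timestamp": "2012-03-01T09:00:00", "type": "web_click"},
]
timelines = build_timelines(events)
```

Because the sequence is preserved, a path question such as "did a web click precede the purchase?" can still be answered, which is not possible once the events have been pre-aggregated into counts.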
Predictive Profiles
Event streams or Customer Timelines are valuable for path analysis, but are difficult to consume for ad-hoc
analysis or statistical modeling. Causata distills customer event streams and their descriptive attributes into a set
of predictive variables, or aggregates, computed over specific timescales.
For example, total spend in the past month is computed by summing the prices of all of a customer’s purchase
events in that period. Useful industry-specific variables for Financial Services, Communications, and Digital Media
are pre-built within Causata and are also easily set up and managed by business analysts.
Causata leverages its parallel compute power to calculate these variables on demand as customer data is read.
Calculation on demand ensures that customer profiles are always up to date and takes into account the customer’s
most recent activity. New predictors or variables can be defined in seconds and are then immediately available
through customer profiles.
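The total-spend example can be sketched in Python as a function evaluated at read time over a customer's timeline. The event field names are assumptions, and "the past month" is approximated here as a 30-day window.

```python
from datetime import datetime, timedelta

def total_spend(timeline, now, window=timedelta(days=30)):
    """Sum purchase prices for events within `window` before `now`."""
    cutoff = now - window
    return sum(
        e["price"]
        for e in timeline
        if e["type"] == "purchase" and cutoff <= e["timestamp"] <= now
    )

timeline = [
    {"type": "purchase", "price": 40.0, "timestamp": datetime(2012, 1, 5)},
    {"type": "web_click", "timestamp": datetime(2012, 3, 1)},
    {"type": "purchase", "price": 25.0, "timestamp": datetime(2012, 3, 10)},
    {"type": "purchase", "price": 10.0, "timestamp": datetime(2012, 3, 20)},
]
spend = total_spend(timeline, now=datetime(2012, 3, 25))  # 35.0
```

Since nothing is precomputed, a purchase recorded a moment ago is included the next time the variable is read, and a new variable is simply a new function over the same timeline.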
Model Scores & Behavioral Predictors
Causata provides pre-built regression models to determine the accuracy or predictive power of variables based on
cause and effect. These linear and logistic regression models enable analysts and marketing scientists to quickly
identify the most valuable variables for their customer analyses.
Once an analyst or modeler builds a statistical, predictive model, it can be imported and deployed in seconds to
Causata for real-time, on-demand execution. Each time an individual customer profile is requested or updated, any
applicable model is evaluated for that customer, ensuring that the scores in the customer’s predictive profile are
always up-to-date. Model execution is performed in parallel across the cluster as profiles are assembled, and model
scores are computed just like any other variable.
Since a predictive model score is just like any other variable in a customer’s Predictive Profile, it can be used in
queries, for example, to retrieve event streams, predictive profiles or even just a list of all customers with a high
predicted probability of churn. Scores can also be used in real-time decision-making — for instance, to determine
what content to show on a web page or to guide a call-center agent towards the optimal cross-sell offer for a
customer.
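To make the idea of "a score is just another variable" concrete, here is a Python sketch that evaluates a logistic regression model while a profile is assembled and then uses the score in a selection. The coefficients, variable names, and the 0.5 threshold are invented for illustration and do not come from Causata.

```python
import math

def logistic_score(profile, weights, intercept):
    """Evaluate a logistic regression model on a profile's variables."""
    z = intercept + sum(w * profile[name] for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical churn model coefficients.
CHURN_WEIGHTS = {"days_since_last_login": 0.1, "total_spend_30d": -0.02}
CHURN_INTERCEPT = -2.0

def assemble_profile(variables):
    """Return the profile with the model score added as one more variable."""
    profile = dict(variables)
    profile["churn_probability"] = logistic_score(
        profile, CHURN_WEIGHTS, CHURN_INTERCEPT
    )
    return profile

profiles = [
    assemble_profile({"customer_id": "c1", "days_since_last_login": 40,
                      "total_spend_30d": 0.0}),
    assemble_profile({"customer_id": "c2", "days_since_last_login": 1,
                      "total_spend_30d": 100.0}),
]
# The score can be filtered on like any other profile variable.
high_risk = [p["customer_id"] for p in profiles if p["churn_probability"] > 0.5]
```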
Reading Data from Causata
Data is retrieved from Causata at either the customer or event level.
At the customer level, a familiar Causata SQL query language allows queries to be framed around customer
behavior, enabling the business analyst or data scientist to ask structured questions of unstructured data. These
queries are executed in parallel across all the data stores, returning event streams, predictive profiles or modeled
scores. The queries may include combinations of specific events, profile variables, and predictive scores to select
customer records.
A simple query by an analyst in a retail bank, for example, might select all customers who have utilized
online bill pay from a mobile device in the last week, and who have downloaded a promotional bank email in the
last 90 days. The output is a structured set of records for every customer who satisfies this query, in a predictive
record set for analysis. By allowing the analyst to ask new questions of a massive data set, Causata saves a huge
amount of time traditionally wasted in ‘data-wrangling.’
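The logic of the retail bank query can be expressed as a Python filter over customer timelines, as sketched below. This is not Causata SQL, whose syntax is not shown in this paper; the event type names (`bill_pay`, `promo_email_download`) and the `device` field are assumptions.

```python
from datetime import datetime, timedelta

def has_event(timeline, now, days, **criteria):
    """True if any event within `days` before `now` matches all criteria."""
    cutoff = now - timedelta(days=days)
    return any(
        cutoff <= e["timestamp"] <= now
        and all(e.get(k) == v for k, v in criteria.items())
        for e in timeline
    )

def select_customers(timelines, now):
    """Mobile bill pay in the last week AND promotional email in the last 90 days."""
    return sorted(
        cid for cid, timeline in timelines.items()
        if has_event(timeline, now, 7, type="bill_pay", device="mobile")
        and has_event(timeline, now, 90, type="promo_email_download")
    )

now = datetime(2012, 6, 30)
timelines = {
    "c1": [
        {"type": "promo_email_download", "timestamp": datetime(2012, 5, 1)},
        {"type": "bill_pay", "device": "mobile", "timestamp": datetime(2012, 6, 28)},
    ],
    "c2": [  # bill pay was not from a mobile device
        {"type": "bill_pay", "device": "desktop", "timestamp": datetime(2012, 6, 29)},
        {"type": "promo_email_download", "timestamp": datetime(2012, 6, 1)},
    ],
    "c3": [  # email download is older than 90 days
        {"type": "bill_pay", "device": "mobile", "timestamp": datetime(2012, 6, 29)},
        {"type": "promo_email_download", "timestamp": datetime(2012, 1, 1)},
    ],
}
matches = select_customers(timelines, now)
```

Only `c1` satisfies both conditions, so the result is a one-element list.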
Analysts and marketing scientists can choose to run a complete query for all customers who meet specific criteria
or just retrieve a sample for initial analysis. Causata arranges the customer data to ensure that any sample is
statistically unbiased and can be used for reliable analysis.
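The paper does not say how unbiased samples are produced. One widely used approach, sketched here in Python purely as an assumption, is to hash each customer identifier and keep those whose hash falls in a fixed range, so that sample membership does not depend on any customer attribute or on arrival order.

```python
import hashlib

def in_sample(customer_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of customers in the sample."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

# Hypothetical customer identifiers.
customer_ids = [f"customer-{i}" for i in range(10_000)]
sample = [cid for cid in customer_ids if in_sample(cid, 10)]
```

The same customer always lands in or out of the sample, so an initial analysis on the sample can be repeated and later extended to the full population.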
Causata SQL enables analysts to leverage data visualization tools such as Tableau, QlikView, and Excel for further
analysis, dashboarding and reporting. Statistical modelers can query and access Causata data directly from their
R environment, and then easily import their R models into Causata for real-time operational scoring.
Causata event data can also be queried using Hadoop tools such as Hive and Cloudera Impala, which respectively
enable batch and interactive querying of Causata’s raw event data. This is valuable for queries not structured
specifically around individual customer behavior, but rather for traditional macro segmentation business
intelligence analyses.
Summary
Causata consumes multi-structured customer data from all digital channels, connects and stores it by customer
event, and assembles it into an optimal format for customer analysis and prediction.
A powerful Causata SQL query language allows the retrieval of customer records in a predictive record set structure
for predictive analysis, and the underlying HBase event storage may be queried using standard Hadoop tools.
Causata scales to millions of customer records and is a highly flexible application, making it easy to add new data
sources and ask new questions of the data. Low latency access to individual predictive profiles enables real-time
actions, tailored to the individual customer.
To learn more about us, visit us at causata.com, follow us on Twitter @Causata, or contact us for a demo.