This document discusses real-time big data analytics from deployment to production. It covers:
1) Distilling raw data like log files and sensor streams into structured data using Hadoop for analytics.
2) Developing predictive models using techniques like decision trees, clustering, and ensembles on structured data.
3) Deploying models for real-time scoring via SQL, code, or PMML, either against batch lookup tables or against factors arriving as streaming data.
4) Scoring billions of predictions daily for applications like determining why customers buy products and attributing sales to marketing channels.
5) Regularly refreshing models to incorporate new data and outcomes, using techniques like exploratory analysis and time-to-event modeling.
5. Predictive Factors, Analytics Model, Scores
Predictive Factors
• User ID
• Browser
• Time/Date/Location
• Any known information
• Previous purchases
• Friend data
Analytics Model
• Decision Tree
• Logistic Regression
• Neural Network
• K-means clustering
• Ensemble Model
Predictive Model / Scoring Rules
Scores (selection or prediction)
• Product of most interest
• Offer of most likely sale
• Most relevant link
• Forecast sale value
• Optimal Bid
"IO VAPOURA" by Jaya Prime flickr.com/photos/sanjayaprime/4924462993 CC-BY 2.0
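As a rough illustration of how predictive factors could turn into a score such as "offer of most likely sale", here is a minimal Python sketch. It assumes a logistic regression (one of the model types named on the slide) has already been fitted elsewhere. The offer names, factor names, and coefficient values are all hypothetical and do not come from the talk.

```python
import math

# Hypothetical coefficients per offer, as if exported from a fitted
# logistic regression. All names and numbers are illustrative only.
MODELS = {
    "blender": {"intercept": -2.0, "prev_purchases": 0.5, "is_mobile": -0.3},
    "toaster": {"intercept": -1.0, "prev_purchases": 0.1, "is_mobile": 0.4},
}

def score(factors, coefs):
    """Return the predicted purchase probability for one offer."""
    z = coefs["intercept"] + sum(
        coefs[name] * value for name, value in factors.items() if name in coefs
    )
    return 1.0 / (1.0 + math.exp(-z))

def best_offer(factors):
    """Score every offer and return the one with the highest probability."""
    scores = {offer: score(factors, coefs) for offer, coefs in MODELS.items()}
    return max(scores, key=scores.get), scores

# Factors known at request time (hypothetical visitor).
offer, scores = best_offer({"prev_purchases": 4, "is_mobile": 0})
print(offer, scores)
```

Because scoring reduces to a few multiplications and one exponential per offer, a model in this form is cheap enough to evaluate on every request.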
6. Real-time Deployment
1. Data distillation
2. Model development and validation
3. Model deployment
4. Real-time model scoring
5. Model refresh
"CLOCK" by Heiko Klingele flickr.com/photos/divdax/3458668053/ CC-BY 2.0
7. 1. Data Distillation in Hadoop
Unstructured Data
• Log Files
• Sensor Streams
• Language Text
HDFS Load → Map-Reduce (rmr) → Structured Data → Analytics Data Mart
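To make the distillation step concrete, here is a small Python sketch of the map and reduce pattern applied to raw log lines. It runs in memory and does not use Hadoop or rmr. The log format, field names, and sample lines are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw log lines: "<timestamp> <user_id> <action> <product>"
RAW_LOGS = [
    "2013-02-26T10:00:01 u1 view blender",
    "2013-02-26T10:00:05 u2 view toaster",
    "2013-02-26T10:01:12 u1 buy blender",
    "malformed line",
    "2013-02-26T10:02:40 u1 view toaster",
]

def map_line(line):
    """Map step: emit ((user, action), 1) for a valid line, nothing for junk."""
    parts = line.split()
    if len(parts) != 4:
        return []
    _timestamp, user, action, _product = parts
    return [((user, action), 1)]

def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def distill(lines):
    """Turn raw lines into one structured row of counts per user."""
    pairs = [pair for line in lines for pair in map_line(line)]
    counts = reduce_counts(pairs)
    rows = defaultdict(lambda: {"view": 0, "buy": 0})
    for (user, action), n in counts.items():
        if action in ("view", "buy"):
            rows[user][action] = n
    return {user: dict(row) for user, row in rows.items()}

structured = distill(RAW_LOGS)
print(structured)
```

On a real cluster the map and reduce functions would run in parallel over HDFS blocks. The point here is only the shape of the transformation: messy lines in, one tidy row per entity out.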
8. 2. The Model Development Cycle
Structured Data
• Feature Selection
• Sampling
• Aggregation
• Variable Transformation
• Model Estimation
• Model Refinement
• Model Comparison / Benchmarking
Predictive Model
R White Paper: bit.ly/r-is-hot
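A toy Python sketch of three of the cycle's steps (sampling, model estimation, and model comparison / benchmarking) might look like the following. The synthetic data, the two candidate "models", and all function names are hypothetical. They stand in for the richer techniques the talk has in mind.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic (visits, bought) rows: more visits means more likely to buy."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        visits = rng.randint(0, 10)
        bought = 1 if visits + rng.gauss(0, 2) > 6 else 0
        rows.append((visits, bought))
    return rows

def split(rows, test_fraction=0.3, seed=1):
    """Sampling: shuffle, then hold out a fraction for benchmarking."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, rows):
    """Share of rows where the prediction matches the outcome."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def fit_majority(train):
    """Baseline: always predict the most common outcome in the training data."""
    majority = 1 if 2 * sum(y for _, y in train) > len(train) else 0
    return lambda x: majority

def fit_threshold(train):
    """Estimation: pick the visit threshold with the best training accuracy."""
    best = max(range(12), key=lambda t: accuracy(lambda x: int(x >= t), train))
    return lambda x: int(x >= best)

train, test = split(make_data())
candidates = [("majority", fit_majority), ("threshold", fit_threshold)]
results = {name: accuracy(fit(train), test) for name, fit in candidates}
print(results)
```

Benchmarking every candidate on the same held-out rows is what makes the comparison fair. Refinement then means going back to change features or model form and running the loop again.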
10. Why did I buy that blender?
Just browsing in the mall
TV ad / magazine ad
Coupon in the mail
“Just moved” promo email
Webstore recommendation
Browsing catalog
12. 4. Model
• Exploratory data analysis
• Time-to-event models
• GAM survival models
Scoring
• Scoring for inference
• Scoring for prediction
• 5 billion scores per day
UPSTREAM DATA FORMAT / CUSTOM VARIABLES (PMML)
• ETL
• Marketing channel data
• Behavioral variables
• Promotional data per retailer
• Overlay data
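To show what a time-to-event model estimates when the event is a purchase, here is a minimal Python sketch of the Kaplan-Meier estimator. This is a deliberately simpler, nonparametric technique than the GAM survival models named on the slide, and it is not presented as the method used in the talk. The sample durations and the function name are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time t where at least one event occurred.

    durations: time until the event or until observation stopped
    observed:  1 if the event (a purchase) was seen, 0 if censored
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        removed = 0
        # Gather everyone whose duration equals t.
        while i < len(data) and data[i][0] == t:
            events += data[i][1]
            removed += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical customers: days from first exposure until purchase.
# bought == 0 means no purchase had happened when observation ended.
days = [2, 3, 3, 5, 8, 8]
bought = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(days, bought)
for t, s in curve:
    print(t, round(s, 4))
```

Here S(t) is the estimated share of customers who have not yet bought by day t. Censored customers still count as "at risk" up to the day they drop out of observation.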
16. Real-Time Big Data Predictive Analytics: From Deployment to Production
David Smith, @revodavid
The leading enterprise provider of software and services for Open Source R
Booth 618 / Office Hours Weds 1:30PM
www.revolutionanalytics.com +1 650 646 9545 Twitter: @RevolutionR
Editor's notes
Get out your buzzword bingo cards!
Data as "new oil": a valuable commodity. Big Data is crude oil: messy, hard to get at, and full of contaminants.
Start off with stuff we know in real time.
Model development process. Not just about computational speed, but also about the productivity of the developer.
Demographics: consumer, product, market
Actions: web clicks, email clicks, mobile app usage, call center logs, social, search …
Outcomes: impressions, touches, orders (retail, online, mobile)
Strategic allocation
Outcome is “buying” instead of “dying”
From Revolution Analytics. We help companies deploy predictive models created in R to real-time production systems.