5. 1. Data Distillation in Hadoop
Unstructured Data: Log Files, Sensor Streams, Language Text
HDFS Load, Map-Reduce (RHadoop, rmr)
Structured Data
Analytics Data Mart
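The distillation step on this slide (raw logs in, structured rows out, via map-reduce) can be sketched in plain Python. This is a minimal in-memory illustration, not RHadoop or rmr code: the log format, the field names, and the helper names `map_line`, `reduce_counts`, and `distill` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw log lines (unstructured input); the format is an assumption.
LOG_LINES = [
    "2012-06-01 10:00:01 user=42 action=view item=blender",
    "2012-06-01 10:00:07 user=42 action=buy item=blender",
    "2012-06-01 10:01:15 user=7 action=view item=toaster",
    "malformed line without fields",
]


def map_line(line):
    """Map step: parse one raw line into zero or more (key, value) pairs."""
    fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
    if "user" in fields and "action" in fields:
        yield (fields["user"], fields["action"]), 1


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single row."""
    return key, sum(values)


def distill(lines):
    """Run map, shuffle (group by key), and reduce entirely in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return sorted(reduce_counts(k, v) for k, v in groups.items())


# Each row is ((user, action), count): a small structured table.
structured_rows = distill(LOG_LINES)
```

On a real cluster the map and reduce functions would run in parallel over HDFS blocks; only the shape of the computation is shown here.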
6. 2. The Model Development Cycle
Feature Selection
Sampling
Aggregation
Variable Transformation
Model Estimation
Model Refinement
Model Comparison / Benchmarking
Predictive Model
R White Paper: bit.ly/r-is-hot
Structured Data
9. 4. Real-Time Scoring
Factors
Scores
“IO VAPOURA” by Jaya Prime, flickr.com/photos/sanjayaprime/4924462993, CC-BY 2.0
Decision Tree
Logistic Regression
Neural Network
K-means clustering
Ensemble Model
Predictive Model
User ID
Browser
Time/Date / Location
Previous purchases
Friend data
Any known information
Product of most interest
Offer of most likely sale
Most relevant link
Forecast sale value
Optimal Bid
Prediction or Selection
Scoring Rules
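The flow on this slide (known factors in, a fitted predictive model in the middle, scores out) can be sketched as follows. This is a minimal illustration using a logistic regression, one of the model types listed above. The coefficients, the feature names, and the functions `score` and `forecast_sale_value` are hypothetical; in practice the coefficients would come from a model fitted offline.

```python
import math

# Hypothetical coefficients from an offline-fitted logistic regression.
# Feature names and values are illustrative assumptions, not from the deck.
INTERCEPT = -2.0
COEFFICIENTS = {
    "previous_purchases": 0.4,
    "is_mobile_browser": -0.3,
    "hour_is_evening": 0.5,
}


def score(factors):
    """Return the predicted probability of a sale for one visitor.

    Factors missing from the request are treated as 0, so the scorer
    can run on "any known information".
    """
    linear = INTERCEPT + sum(
        weight * float(factors.get(name, 0.0))
        for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


def forecast_sale_value(factors, price):
    """Scoring rule: expected sale value = P(sale) * price."""
    return score(factors) * price


# Example request: a returning visitor browsing in the evening.
visitor = {"previous_purchases": 5, "hour_is_evening": 1}
probability = score(visitor)
expected_value = forecast_sale_value(visitor, price=40.0)
```

Scoring is only a dot product and one exponential per request, which is why a model estimated offline can be applied in real time.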
13. Why did I buy that blender?
Just browsing in the mall
TV ad / magazine ad
Coupon in the mail
“Just moved” promo email
Webstore recommendation
Browsing catalog
15. • ETL
• Marketing channel data
• Behavioral variables
• Promotional data
• Overlay data
• Exploratory data analysis
• Time-to-event models
• GAM survival models
• Scoring for inference
• Scoring for prediction
• 5 billion scores per day per retailer
UPSTREAM DATA
FORMAT
CUSTOM VARIABLES
(PMML)
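To make the time-to-event idea above concrete, here is a minimal sketch of a Kaplan-Meier estimate of the probability that a customer has not yet purchased by a given day. Kaplan-Meier is a simpler stand-in chosen for illustration; it is not the GAM survival model named on the slide. The function name `kaplan_meier` and the sample data are assumptions.

```python
def kaplan_meier(durations, purchased):
    """Return [(time, survival)] at each observed purchase time.

    durations: days until purchase, or until observation ended
    purchased: True if a purchase was observed, False if censored
    "Survival" here means "has not purchased yet".
    """
    event_times = sorted({t for t, e in zip(durations, purchased) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Customers still under observation and not yet purchased at time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Purchases observed exactly at time t.
        events = sum(1 for d, e in zip(durations, purchased) if e and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: five customers, two of them censored (no purchase seen).
days = [3, 5, 5, 8, 10]
bought = [True, True, False, True, False]
curve = kaplan_meier(days, bought)
```

Censored customers still count toward the at-risk total until they drop out of observation, which is the feature that separates time-to-event models from a plain purchase-rate calculation.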
17. • Collaboration
• Speed
• Deployment Process
• Adoption
• Results
Analytics Function Library
rACI Package (w/ RevoR)
Model Building Function Library
Data Acquisition Function Library
Portfolio Optimization and Simulation API
Market Data from Thomson Reuters (QA-Direct)
American Century Quant Proprietary Data
Additional 3rd Party Data Vendors
Live Analytics
PRODUCTION MODEL GENERATION AND TRADING PROCESSES
Data Feeds
19. Big-Data, Real-Time R? Yes, you can!
www.revolutionanalytics.com +1 650 646 9545 Twitter: @RevolutionR
The leading enterprise provider of software and services for Open Source R
David Smith
@revodavid
Editor's notes
Fast, scalable, in production.
Data as the “new oil”: a valuable commodity. Big Data is crude oil: messy, hard to get at, and full of contaminants.
Model development process. It is not just about computational speed; it is also about developer productivity.
Start with what we know in real time.
Demographics: consumer, product, market
Actions: web clicks, email clicks, mobile app usage, call center logs, social, search …
Outcomes: impressions, touches, orders (retail, online, mobile)
Strategic allocation
In these survival models, the outcome is “buying” instead of “dying”.
From Revolution Analytics. We help companies deploy predictive models created in R to real-time production systems.