7. Most SAS/STAT PROCs (including PROC GENMOD) run single-threaded.
SAS/STAT: 91 PROCs
• 69 single threaded
• 13 multi-threaded
• 9 distributed (if you license SAS HP Statistics)
9. 2013: SAS Benchmark
PROC HPGENSELECT
– SAS/STAT
– SAS High Performance Statistics
Massive grid (140/144 nodes)
– 16 cores per node
– 2,240/2,304 cores
Conclusion: SAS on 2,304 cores is competitive
with RRE on 20 cores.
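The core counts on this slide follow directly from the node counts. A minimal sketch of that arithmetic, using only the figures quoted above (the function name is illustrative, not from the deck):

```python
# Core counts implied by the 2013 SAS benchmark grid (figures from the slide).
CORES_PER_NODE = 16
NODE_COUNTS = (140, 144)   # the slide quotes both grid sizes
RRE_CORES = 20             # core count quoted for RRE


def total_cores(nodes: int, cores_per_node: int = CORES_PER_NODE) -> int:
    """Total cores in a grid of identical nodes."""
    return nodes * cores_per_node


for nodes in NODE_COUNTS:
    cores = total_cores(nodes)
    ratio = cores / RRE_CORES
    print(f"{nodes} nodes x {CORES_PER_NODE} cores = {cores:,} cores "
          f"(about {ratio:.0f}x the RRE core count)")
```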
10. Honest Benchmarking
Compare RRE and SAS/STAT performance
– Same data
– Same environment
– Same tasks
Test under real-world conditions
Make the test fair and transparent
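One way to keep a comparison "same data, same tasks" is to run every task through a single timing harness. The sketch below is an assumption about how such a harness could look, not the script used for this benchmark; `time_tasks`, the best-of-N policy, and the stand-in tasks are all hypothetical.

```python
import time
from typing import Callable, Dict


def time_tasks(tasks: Dict[str, Callable[[], object]],
               repeats: int = 3) -> Dict[str, float]:
    """Run each task `repeats` times and keep the best wall-clock time in seconds."""
    results: Dict[str, float] = {}
    for name, task in tasks.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            task()
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


# Stand-in tasks over one shared dataset (hypothetical; a real run would call
# each product's own procedure on the same input file).
data = list(range(100_000))
timings = time_tasks({
    "sum": lambda: sum(data),
    "sorted": lambda: sorted(data, reverse=True),
})
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Reporting the full task list and the raw timings, rather than a single headline number, is what makes such a test transparent.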
11. Data
Manufactured data
Reproducible in any environment
Designed to emulate “typical” working data
“Entity” tables: 1MM, 5MM rows
“Predict” tables: 10MM, 50MM rows
[Diagram: Fact and Predict tables linked to Entity 1 and Entity 2 tables by the entity key. Column counts shown: 571 Columns, 21 Columns]
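"Reproducible in any environment" usually comes down to generating the data from a fixed seed. A minimal sketch under that assumption; the column names, distributions, seed, and tiny table sizes are hypothetical and do not reflect the benchmark's actual generator:

```python
import random


def make_tables(n_entities: int, n_predict: int, seed: int = 2014):
    """Return (entity, predict) tables as lists of dicts; same seed, same data."""
    rng = random.Random(seed)  # private generator, independent of global state
    entity = [
        {"entity_key": key, "x1": rng.gauss(0.0, 1.0), "segment": rng.choice("ABC")}
        for key in range(n_entities)
    ]
    # Each "predict" row points back to an entity through the entity key.
    predict = [
        {"entity_key": rng.randrange(n_entities), "y": rng.random()}
        for _ in range(n_predict)
    ]
    return entity, predict


entity, predict = make_tables(n_entities=10, n_predict=100)
# Re-running with the same arguments reproduces the tables exactly.
assert (entity, predict) == make_tables(n_entities=10, n_predict=100)
print(len(entity), "entity rows,", len(predict), "predict rows")
```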
13. Analytic Tasks
Task | SAS Capability | RRE Capability
Descriptive Statistics | PROC SURVEYMEANS | rxSummary
Median and Deciles | PROC SURVEYMEANS | rxQuantile
Frequency Distribution | PROC FREQ | rxCube
Linear Regression (Numeric predictors) | PROC REG, HPREG | rxLinMod
Linear Regression (Mixed predictors) | PROC GENMOD | rxLinMod
Stepwise Linear (100 predictors) | PROC REG | rxLinMod/rxStepControl
Logistic Regression | PROC LOGISTIC | rxLogit
Generalized Linear | PROC GENMOD | rxGLM
K-Means Clustering | PROC FASTCLUS | rxKMeans
Score | PROC SCORE | rxPredict
14. Preparation
Generated data with randomized procedure
Loaded data into native formats:
– RRE: XDF file
– SAS: SAS DATA set
Generation and load times not included
No meaningful differences
15. RRE: 42 Times Faster Than SAS 9.4
[Bar chart: Runtime, Seconds (N=5,000,000). RRE: 124; SAS 9.4: 5,192]
RRE: ~2 minutes
SAS: ~1 hour, 26 minutes
Complete script: ten analytic tasks.
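The headline multiple and the rounded durations can be checked from the two runtimes on this slide (124 and 5,192 seconds). A short sketch of that arithmetic; the helper names are illustrative:

```python
RRE_SECONDS = 124     # complete script, RRE (from the slide)
SAS_SECONDS = 5192    # complete script, SAS 9.4 (from the slide)


def speed_multiple(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the fast run is than the slow run."""
    return slow_seconds / fast_seconds


def hours_minutes(seconds: int) -> tuple:
    """Whole hours and whole minutes, discarding leftover seconds."""
    return divmod(seconds // 60, 60)


print(f"Speed multiple: {speed_multiple(SAS_SECONDS, RRE_SECONDS):.1f}x")
print("RRE: %d h %d min" % hours_minutes(RRE_SECONDS))
print("SAS: %d h %d min" % hours_minutes(SAS_SECONDS))
```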
16. RRE: Linear Scalability
[Line chart: Runtime, Seconds vs. # Rows in Entity Table (0 to 5,000,000). RRE 7: 68 and 124; SAS 9.4: 623 and 5,192]
RRE: consistent
performance with
increased data volume.
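The scaling claim can be expressed as runtime growth relative to data growth. The sketch below assumes that the chart's four data labels are RRE at 68 and 124 seconds and SAS 9.4 at 623 and 5,192 seconds, for the 1MM-row and 5MM-row entity tables respectively; that reading of the chart is an assumption, as are the function and variable names.

```python
# (rows in entity table, runtime in seconds); the pairing of labels to
# series and table sizes is assumed from the chart, not stated on the slide.
RUNS = {
    "RRE 7": [(1_000_000, 68), (5_000_000, 124)],
    "SAS 9.4": [(1_000_000, 623), (5_000_000, 5192)],
}


def growth(small: float, large: float) -> float:
    """Factor by which a quantity grows from the small run to the large run."""
    return large / small


for product, ((rows_a, secs_a), (rows_b, secs_b)) in RUNS.items():
    data_growth = growth(rows_a, rows_b)
    time_growth = growth(secs_a, secs_b)
    print(f"{product}: {data_growth:.0f}x the rows took {time_growth:.2f}x the time")
```

A runtime growth factor at or below the data growth factor indicates that runtime scales no worse than linearly over this range.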
17. RRE: Up to 350X Faster Than SAS
[Bar chart: RRE Speed Multiple by task, N=5MM]
Task | RRE Speed Multiple
Stats | 213
Quintiles | 185
Freq | 351
Lin Reg 1 | 39
Lin Reg 2 | 37
Step Lin | 19
Logistic | 58
GLM | 18
Kmeans 1 | 101
Kmeans 2 | 32
18. Why is RRE faster than SAS?
RRE supports scalable computing out of the
box
– Multi-threaded processing
– Distributed processing
Legacy SAS is mostly single-threaded
– DATA Step processing
– Most SAS/STAT PROCs
19. SAS HP PROCs
9 new SAS PROCs
Bundled into SAS 9.4
Designed for scalability
Multiple operating modes:
– Single machine
– Distributed (must license SAS HP
Statistics)
20. HP PROCs: Minimal Improvement
[Bar chart: Runtime, Seconds (N=5,000,000). RRE: rxLinMod: 6.8; SAS: PROC REG: 267.17; SAS: PROC HPREG: 253.82]
Linear regression, 20 predictors
HPREG running in single machine mode.
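To put "minimal improvement" in numbers, the sketch below assumes the three chart values map to rxLinMod at 6.8 seconds, PROC REG at 267.17 seconds, and PROC HPREG at 253.82 seconds. That mapping is inferred from the slide title and legend, so treat it as an assumption; the names are illustrative.

```python
# Runtime in seconds, linear regression with 20 predictors, N=5,000,000.
# The value-to-label mapping is assumed from the chart.
RUNTIME = {
    "RRE: rxLinMod": 6.8,
    "SAS: PROC REG": 267.17,
    "SAS: PROC HPREG": 253.82,
}


def percent_faster(old: float, new: float) -> float:
    """Runtime reduction of `new` relative to `old`, in percent."""
    return 100.0 * (old - new) / old


hp_gain = percent_faster(RUNTIME["SAS: PROC REG"], RUNTIME["SAS: PROC HPREG"])
print(f"HPREG vs. REG: {hp_gain:.1f}% less runtime")

for label in ("SAS: PROC REG", "SAS: PROC HPREG"):
    multiple = RUNTIME[label] / RUNTIME["RRE: rxLinMod"]
    print(f"{label} takes {multiple:.1f}x as long as rxLinMod")
```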
21. Summary
RRE is faster than Legacy SAS:
– Same tasks
– Same hardware
RRE speed:
– Efficient engineering
– Multi-threaded and distributed processing
SAS performance claims:
– Massive hardware requirements
– Force you to license more software from SAS
– Don’t apply to Legacy SAS
22. Polling Question
Which of the following analytic software
benefits is most important to you?
– A) Completing projects faster
– B) Building better predictive models
– C) High performance with low infrastructure costs
24. Background
Approaching $1 trillion in revenue analyzed. $3 billion in marketing spend under our lens.
Experienced 60+ person team based in San Francisco with offices in Seattle, Los Angeles,
Singapore, and India.
Founded in 2003 with a proven history of solving difficult analytics problems. Evolved from
consulting through close partnerships with our clients.
Our Offerings
Customer interaction insight that powers applications for customer-level revenue attribution,
targeting, media optimization.
Descriptive and predictive modeling of hidden trends and relationships in big data.
Custom development including applications, process automation, and decision support solutions.
DataSong at a Glance
26. DataSong Architecture
• ETL
• N marketing channels
• Behavioral variables
• Promotional data
• Overlay data
• Functions to read Hadoop output;
xdf creation
• Exploratory data analysis
• GAM survival models
• Scoring for inference
• Scoring for prediction
• 5 billion scores per day
per customer
DATASONG DATA
FORMAT (DDF)
CUSTOM VARIABLES
(PMML)
27. Where Speed Matters
3 key dimensions
● how many rows
● how many variables
● how many iterations of a model
Trade-offs for speed
● Sampling variance
● Test fewer features
● Have less understanding of the signal
The third dimension (model iterations) means we must multiply any benchmark by N
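Multiplying a benchmark by N can be made concrete with the complete-script runtimes quoted earlier in the deck (124 seconds for RRE, 5,192 seconds for SAS 9.4). The iteration count below is a hypothetical value chosen only for illustration:

```python
PER_RUN_SECONDS = {"RRE": 124, "SAS 9.4": 5192}   # from the benchmark slides
N_ITERATIONS = 50                                  # hypothetical model iterations


def total_seconds(per_run_seconds: float, iterations: int) -> float:
    """Wall-clock cost of repeating one benchmark run `iterations` times."""
    return per_run_seconds * iterations


for product, per_run in PER_RUN_SECONDS.items():
    hours = total_seconds(per_run, N_ITERATIONS) / 3600
    print(f"{product}: {N_ITERATIONS} iterations take about {hours:.1f} hours")
```

The ratio between the two products stays the same for any N, but the absolute gap grows in proportion to N.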