Revolution Analytics brings big data analytics to Teradata database. Presentation from Teradata Partners, October 2013 overviewing Revolution R Enterprise for Teradata by Bill Jacobs, Director, Product Marketing, Revolution Analytics.
4. Our view: Big Data meets Big Math = New Business Outcomes
THE PERFECT STORM
+ Computing Power
+ Data
+ Pace of Business
+ Customer Expectations
+ Data Science
+ Computer Science
+ Management Science
Confidential to Revolution Analytics and shared
with Siemens under the NDA dated 27/9/2013
Better Business Decisions
New Business Outcomes
5. Big Analytics Delivers Value from Big Data
The three Vs of Big Data: Volume, Variety, Velocity
The three Vs of Big Data Big Analytics: maximizing Value, accommodating data Volatility, while assuring Veracity of insights
6. R Open Source
• Language, Community, Collaboration
• Robert Gentleman & Ross Ihaka, 1993
• Version 1.0 released 2000
• 2.5 Million Global Users
• Over 4,800 add-on “Packages”
Why R?
• R in Universities = New Talent
• Emerging Modeling/Visualization
• Lower Cost Alternative
• Open Source = Flexible & Innovative
• Access to Free Packages
7. R is Exploding in Popularity & Functionality
[Figure: four panels comparing R, SAS, SPSS, S-Plus and Stata]
• Internet Discussion: mean monthly traffic on email discussion list
• Package Growth: number of R packages listed on CRAN
• Web Site Popularity: number of links to main web site
• Scholarly Activity: Google Scholar hits (’05-’09 CAGR): R 46%, SAS -11%, SPSS -27%
Source: http://r4stats.com/popularity
8. R is exploding in popularity & functionality
R Usage Growth: Rexer Data Miner Survey, 2007-2013
• 70% of data miners report using R
• 24% use R as primary tool
Source: www.rexeranalytics.com
“I’ve been astonished by the rate at which R has been adopted. Four years ago, everyone in my economics department [at the University of Chicago] was using Stata; now, as far as I can tell, R is the standard tool, and students learn it first.”
- Deputy Editor for New Products at Forbes
“A key benefit of R is that it provides near-instant availability of new and experimental methods created by its user base, without waiting for the development/release cycle of commercial software. SAS recognizes the value of R to our customer base…”
- Product Marketing Manager, SAS Institute, Inc
9. R Is The Most Commonly Used Primary Analytics Tool
70% of data miners report using R
24% use R as primary tool
Source: www.rexeranalytics.com
11. R Community, collaboration and breadth:
CRAN task views (subset of 4800+ packages)
Source: http://www.maths.lancs.ac.uk/~rowlings/R/TaskViews/
12. Key Big Data Challenge: The Analytics Talent Pool
14. R is open source and drives analytic innovation but… has some limitations for Enterprises
Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time
Enterprise Readiness | Community support | Commercial support | Delivers full service production support
Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R
Commercial Viability | Risk of deployment of open source | Commercial license | Eliminate risk with open source
15. Our History & Our Future
[Timeline figure; years shown: 2007, 2013, 2015, 2017]
• Chapter 1: Capture Mindshare
• Chapter 2: Mobilize with Market Focus
• Chapter 3: Scalable Growth
• Releases: Revolution R Enterprise V1 through V6.1; V6.2 through V9; V10 through V11
• Company milestones: Company Founding; NA Offices (NYC, Dallas); Relocate HQ to Palo Alto
• Customer milestones: 250 Customers; 500 Customers; 1000 Customers
16. 200+ Customer Stories
• Finance & Insurance
• Academic & Gov’t
• Healthcare & Life Sciences
• Digital Media & Retail
• Manufacturing & High Tech
17. Revolution Analytics - Overview
We are the only provider of a commercial analytics platform based on the open source
R statistical computing language.
• Power: distributed, high performance analytical algorithms
• Productivity: easier to build and deploy analytic applications
• Enterprise Readiness: stable, scalable, multi-platform, with professional services enablement and world-wide support
World Wide Support Teams
• Standard and Premium Programs
• Technical Account Managers
• Customer Success Managers
Professional Services
• Architecture planning
• Systems Integration
• Advanced analytic applications
• Full life cycle projects
18. Customers Revolutionize their Business
Power
• 4X performance
• 50M records scored daily
“…we saw about a 4x performance improvement on 50 million records. It works brilliantly.”
- CEO, John Wallace, DataSong
Scalability
• TBs of data from 200+ data sources
• 10s of thousands of attributes
• 100s of millions of scores daily
“We’ve been able to scale our solution to a problem that’s so big that most companies could not address it…”
- SVP Analytics, Kevin Lyons, eXelate
Performance
• 2X data
• 2X attributes
• No impact on performance
“We need a high-performance analytics … we can now identify opportunities for our clients that would otherwise be lost.”
- Chief Analytics Officer, Leon Zemel, [x+1]
19. Revolution R Enterprise
• What is Revolution R Enterprise?
• How does Revolution R Enterprise work with Teradata Database?
20. Revolution R Enterprise is…
the only big data big analytics platform based on open source R, the de facto statistical computing language for modern analytics
• High Performance, Scalable Analytics
• Portable Across Enterprise Platforms
• Easier to Build & Deploy Analytics
21. How is RRE Used?
Discovering Patterns with Big Data | Building Models Efficiently | Flexibly Deploying Models to Consumers
• Customer segmentation
• Market basket analysis
• Social networking analysis
• Fraud detection
• Marketing attribution
• Sentiment analysis
• …and much more
• Customer lifetime value
• Pricing optimization
• Recommendation engines
• …and much more
• Credit risk
• Customer churn
• Propensity to buy
• Market risk
• Operational risk
• …and much more
22. Introducing Revolution R Enterprise (RRE)
The Big Data Big Analytics Platform
Big Data Big Analytics Ready
– Enterprise readiness
– High performance analytics
– Multi-platform architecture
– Data source integration
– Development tools
– Deployment tools
Components: DevelopR, ConnectR, DeployR, ScaleR, DistributedR
23. The Platform Step by Step:
R Capabilities
R+CRAN
RevoR
• Open source R interpreter
• UPDATED R 3.0.2
• Freely-available R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% Compatible with existing
R scripts, functions and
packages
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math
Available On:
• Platform™ LSF™ Linux®
• Microsoft® HPC Clusters
• Microsoft Azure Burst
• Windows® & Linux Servers
• Windows & Linux Workstations
• Teradata® Database
• IBM® Netezza®
• IBM BigInsights™
• Cloudera Hadoop®
• Hortonworks Hadoop
• Intel® Hadoop
24. Big Data Speed @ Scale with Revolution R Enterprise (RRE)
First, we enhance and accelerate the Open Source R interpreter.
• In-Hadoop Execution
• In-Database Execution
• Parallelized User Code
• Parallelized Algorithms
• Multi-Core Processing
• Multi-Threaded Execution
• Memory Management
• Fast Math Libraries
25. Open Source R Performance: Multi-threaded Math
Customers report 5-50x performance improvements compared to Open Source R, without changing any code
Computation (4-core laptop) | Open Source R | Revolution R | Speedup
Linear Algebra (1)
Matrix Multiply | 176 sec | 9.3 sec | 18x
Cholesky Factorization | 25.5 sec | 1.3 sec | 19x
Linear Discriminant Analysis | 189 sec | 74 sec | 3x
General R Benchmarks (2)
R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x
R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable
1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
2. http://r.research.att.com/benchmarks/
26. The Platform Step by Step: Parallelization & Data Sourcing
ConnectR
• High-speed & direct connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBC
ScaleR
• Ready-to-Use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Correlation & covariance matrices
• Predictive Models – linear, logistic, GLM
• Machine learning
• Monte Carlo simulation
• NEW Tools for distributing customized algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers portability across platforms
Available on:
• Windows Servers
• Red Hat and NEW SuSE Linux Servers
• IBM Platform LSF Linux
• Microsoft HPC Clusters
• Microsoft Azure Burst
• NEW Teradata Database
• NEW Cloudera Hadoop
• NEW Hortonworks Hadoop
27. Big Data Speed @ Scale with Revolution R Enterprise (RRE)
Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.
• In-Hadoop Execution
• In-Database Execution
• Parallelized User Code
• Parallelized Algorithms
• Multi-Core Processing
• Multi-Threaded Execution
• Memory Management
• Fast Math Libraries
29. SAS HPA Speed comparison*: Logistic Regression
Measure | SAS HPA | Revolution R | Comparison
Rows of data | 1 billion | 1 billion |
Parameters | “just a few” | 7 | Double
Time | 80 seconds | 44 seconds | 45%
Data location | In memory | On disk |
Nodes | 32 | 5 | 1/6th
Cores | 384 | 20 | 5%
RAM | 1,536 GB | 80 GB | 5%
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
Revolution R Enterprise Delivers Performance at 2% of the Cost
*As published by SAS in HPC Wire, April 21, 2011
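The ratios quoted on this slide can be checked with a few lines of arithmetic. The sketch below is in Python purely for illustration; the variable names are hypothetical and the only inputs are the figures listed on the slide. It assumes the slide's "45%" is meant as the reduction in run time.

```python
# Figures as listed on the slide; the SAS HPA numbers are those attributed to HPC Wire.
sas_hpa = {"seconds": 80, "nodes": 32, "cores": 384, "ram_gb": 1536}
revolution_r = {"seconds": 44, "nodes": 5, "cores": 20, "ram_gb": 80}


def ratio(key):
    """Revolution R figure as a fraction of the SAS HPA figure."""
    return revolution_r[key] / sas_hpa[key]


# Fraction of the SAS HPA run time that is saved (assumed reading of "45%").
time_saved = 1 - ratio("seconds")

for key in ("nodes", "cores", "ram_gb"):
    print(f"{key}: {ratio(key):.1%} of the SAS HPA figure")
print(f"run time reduced by {time_saved:.0%}")
```

Nodes come to roughly 16% (about a 6th), cores and RAM to roughly 5% (about a 20th), and the run time is 45% shorter, which matches the slide's wording.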
30. Analytics Layer: High Performance Big Data Analytics with ScaleR
• R Data Step
• Descriptive Statistics
• Statistical Tests
• Sampling
• Predictive Modeling
• Data Visualization
• Machine Learning
• Simulation
31. ScaleR: Fast Parallel External Memory Algorithms
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort
• Merge
• Split
• Aggregate by category (means, sums)
• Use any of the functionality of the R language to transform and clean data row by row!
Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
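The median and quantiles are marked "approx." in this list. The deck does not say how ScaleR computes them, so the following is only an assumed illustration (in Python, with hypothetical function names) of one common way to estimate quantiles when data arrives in chunks: count values into fixed-width bins, add the per-chunk counts together, and interpolate inside the bin that contains the target rank. It assumes the data range is already known, for example from a Min / Max pass.

```python
def bin_counts(chunk, lo, hi, n_bins):
    """Count the values of one chunk into n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x in chunk:
        # Values equal to hi fall into the last bin.
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts


def merge_counts(a, b):
    """Combine the bin counts of two chunks; order of merging does not matter."""
    return [x + y for x, y in zip(a, b)]


def approx_quantile(counts, lo, hi, q):
    """Estimate quantile q (0..1) by linear interpolation inside the target bin."""
    total = sum(counts)
    target = q * total
    width = (hi - lo) / len(counts)
    running = 0
    for i, c in enumerate(counts):
        if c > 0 and running + c >= target:
            return lo + (i + (target - running) / c) * width
        running += c
    return hi


# Example: 1,000 values processed as ten chunks of 100.
chunks = [list(range(start, start + 100)) for start in range(0, 1000, 100)]
total_counts = [0] * 100
for chunk in chunks:
    total_counts = merge_counts(total_counts, bin_counts(chunk, 0, 1000, 100))
median_estimate = approx_quantile(total_counts, 0, 1000, 0.5)
```

The error of such an estimate is bounded by the bin width, which is the trade-off for never holding or sorting the full data set in memory.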
Statistical Tests
• Chi Square Test
• t-Test
• F-Test
• Plus 100’s of other tests available in R!
Sampling
• Subsample (observations & variables)
• Random Sampling
• High quality, fast, parallel random number generators
32. ScaleR: Fast Parallel External Memory Algorithms
Statistical Modeling
Predictive Models
• Covariance, Correlation, Sums of Squares (cross product matrix for set variables) matrices
• Multiple Linear Regression
• Generalized Linear Models (GLM): all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user defined distributions & link functions
• Logistic Regression
• Classification & Regression Trees
• Decision Forests
• Predictions/scoring for models
• Residuals for all models
Data Visualization
• Histogram
• Line Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)
• Plus numerous tools in R and ScaleR to generate big data visualizations
Machine Learning
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
Simulation
• High quality, fast, parallel random number generators
• Use the rich functionality of R for simulations
33. The Power of Revolution R Enterprise
[Figure: Value plotted against Performance & Scalability]
• ScaleR: Moves computation to data
• ScaleR: Leverage CRAN
• ScaleR: Labor saving power
• DistributedR: Maximizes computation
• DistributedR: Powerful divide & conquer
• DistributedR: Effective memory utilization
• RevoR: 3-50X faster
• Open Source: Leverage latest innovation
34. Why Teradata And Revolution R Enterprise?
Teradata User Demand
Data Movement Penalty Growing
New Analytics Requiring MPP Approach
R Popularity
Open Source Limitations
Arrival of Teradata v14.10
35. +
Revolution Analytics coupled with the Teradata Unified Data Architecture accelerates big data analytics using the widely-accepted R language.
Available Today: Teradata Version 14.0 + Revolution R Enterprise 6.2 + High-Speed TPT Connector
• Scalable R analytics on servers connected to Teradata
• High speed, parallel data transfer, 5x faster than RODBC
• Integrated parallel analytics solution
Upcoming Capabilities (4Q13): Teradata Version 14.10 + Revolution R Enterprise V7
• Parallel R in-database for big data analytics on Teradata
• R programmers can immediately build parallel R models completely in R
• Revolution parallel in-database algorithms exclusively available on Teradata
36. Introducing Revolution R Enterprise Version 7 on Teradata Database
• New Teradata Table Operators
• New Parallelized Algorithms
• In-Database Execution of Parallelized Algorithms
• Executes R Scripts From R Workstations or Servers
• Provides Orders of Magnitude Performance Gains
• Supports Multiple Platforms in UDA
• Available Late 2013
37. Revolution Analytics in the UDA
UNIFIED DATA ARCHITECTURE
With Revolution R Enterprise
RODBC
Seamless use of R analytics across the
Teradata UDA
39. Understanding R’s Compute Workload
Computational Workload Breakdown
• R Script: < 1% (Compute Burden from Script or Command)
• Algorithms: 99.xxx% (Compute Burden from Algorithmic Computations)
40. ScaleR PEMAs (Parallel External Memory Algorithms): High Performance Analytical Algorithms
User’s Script Calls ScaleR PEMA
– No Unique Code or Setup for Parallelism
– ScaleR Algorithms are “just another R package”
– Using PEMAs is Transparent, Automatic, Fast and Scales Linearly
PEMAs Transparently Parallelize Algorithm Execution
– Parallelized Versions of Statistics, Predictive Modeling and Machine Learning Algorithms
– PEMAs Transparently Distribute Computations Across AMPs
– Results are Consolidated Into A Single Result Set
– Provides Write Once Deploy Anywhere (WODA) Portability
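Why an algorithm can be split across partitions and still return a single result set is easiest to see with a small example. The sketch below is an assumption-laden illustration in Python, not ScaleR code: it fits a one-predictor linear regression by computing sums and cross products on each partition and adding them together, which gives the same coefficients as fitting on all rows at once. The names `Partial`, `summarize` and `fit` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Sums and cross products for one partition of (x, y) rows."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def __add__(self, other):
        # Combining two partitions is plain addition, so order does not matter.
        return Partial(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                       self.sxx + other.sxx, self.sxy + other.sxy)


def summarize(rows):
    """Per-partition step: one pass over the rows, constant memory."""
    p = Partial()
    for x, y in rows:
        p.n += 1
        p.sx += x
        p.sy += y
        p.sxx += x * x
        p.sxy += x * y
    return p


def fit(p):
    """Final step: least-squares intercept and slope from the combined sums."""
    slope = (p.n * p.sxy - p.sx * p.sy) / (p.n * p.sxx - p.sx ** 2)
    intercept = (p.sy - slope * p.sx) / p.n
    return intercept, slope


# Three partitions of data that lie exactly on y = 2x + 1.
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)], [(4, 9)]]
combined = sum((summarize(part) for part in partitions), Partial())
intercept, slope = fit(combined)
```

Because only the small summary objects travel between partitions, the amount of data moved does not grow with the number of rows.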
41. Transparent Distributed Computing with RRE ScaleR
In Revolution R Enterprise:
• Script Calls ScaleR PEMA
• Algorithm Executes
• Algorithm Returns to Script
• Script Continues Execution
Transparent to the Script:
• Algorithm Starts A Master Process
• Master Identifies Environment: Threading? Cores? Chips? Distributed Nodes?
• Master Initializes Algorithm, Prepares Instructions for Nodes
• Master Executes Table Operators In Each VAMP
• VAMPs process each data segment
• Table Operator runs in each VAMP
• Table Operator returns Intermediate Result Object (IRO) to master process
• Master Process Combines IROs
• Returns Consolidated Answer to Script
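The master and worker flow on this slide can be mimicked in a few lines. This is a rough sketch in Python with a local thread pool standing in for the VAMPs; `table_operator` and `master` are hypothetical names, the column names are made up, and nothing here reflects the actual Teradata or ScaleR interfaces. The analytic chosen for illustration is a cross-tabulation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def table_operator(segment):
    """Worker step: process one data segment and return an intermediate result."""
    return Counter((row["region"], row["churned"]) for row in segment)


def master(segments, max_workers=4):
    """Master step: dispatch one task per segment, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        intermediate_results = list(pool.map(table_operator, segments))
    combined = Counter()
    for partial in intermediate_results:
        combined.update(partial)  # adds counts key by key
    return combined


# The calling script sees one request and one answer.
segments = [
    [{"region": "east", "churned": True}, {"region": "west", "churned": False}],
    [{"region": "east", "churned": True}, {"region": "east", "churned": False}],
]
crosstab = master(segments)
```

From the caller's point of view `master(segments)` behaves like an ordinary function call, which is the sense in which the distribution is transparent to the script.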
42. ScaleR PEMAs on Teradata: Transparent Distribution of R Analytics
For Each Call to a ScaleR Algorithm:
– One Request
– Many Subtasks
– One Answer
[Diagram: Desktops & Servers and Corporate Applications, each running Revolution R Enterprise, connect over ODBC to Teradata Database + Revolution R Enterprise, where an Extended Stored Procedure drives Table Operators on the AMPs]
43. Revolution R Enterprise Ecosystem
Power of Integration
SI / Service
Deployment / Consumption
MSP / DSP
Advanced Analytics
ETL
Corios
Data / Infrastructure
44. The Platform Step by Step:
Tools & Deployment
DevelopR
DeployR
• Freely-available R algorithms
• Callable by RevoR
• Embeddable in R scripts
• Web services software
development kit
• Integrates R Into application
infrastructures
Available on:
• Can be called by RevoR
• Can be run single-node using RevoR
• Analyze large data using
RDataStep package
• Run on multiple nodes using
rxEXEC package
DevelopR
DeployR
Capabilities:
• Invokes R Scripts from
web services calls
• RESTful interface for
easy integration
• Works with leading desktop
& BI tools
45. DevelopR Integrated Development Environment
• Script with type ahead and code snippets
• Sophisticated debugging with breakpoints, variable values, etc.
• Solutions window for organizing code and data
• Objects loaded in the R Environment
• Packages installed and loaded
• Object details
http://www.revolutionanalytics.com/demos/revolution-productivity-environment/demo.htm
46. Data Analysis
DeployR
R / Statistical Modeling Expert | Deployment Expert
Business Intelligence | Mobile Web Apps | Cloud / SaaS
• Seamless: Bring the power of R to any web enabled application
• Simple: Leverage common APIs including JS, Java, .NET
• Scalable: Robustly scale user and compute workloads
• Secure: Manage enterprise security with LDAP & SSO
47. Create Custom, On-Demand Analytical Apps
Some Examples:
• On-demand sales forecasting
• Leveraging the power of R from Microsoft tools
• Real-time social media sentiment analysis
48. Alteryx and Revolution Analytics
Making Predictive Analytics More Accessible and Scalable
• Empowering Analysts with Easy-to-Use Predictive Tools combined with the Leading R Platform
• Delivering Enterprise-Scale Predictive Analytics to Line of Business Analysts
• Enabling a Broader Audience to Harness the Universe of R
49. Summary.
R is Hot.
– Most Broadly Used Analytical Language
– Its Popularity Addresses Critical Talent Gap
– Vast Functionality Via CRAN
– R Needs a Platform For Big Data Big Analytics
Revolution Provides Enterprise-Capable Platforms for R.
– High Performance.
– Scalable via Transparent Distributed Execution
– Portable – Write Once Deploy Anywhere - WODA
– Commercial Support & Services Cut Project Risks
Teradata + Revolution Provide a Robust Solution
– Teradata provides stable, high-performance big data environment
– Revolution provides speed, scale, portability and stability for the enterprise
50. Next steps?
The leading commercial provider of software and support for the popular
open source R statistics language.
www.revolutionanalytics.com
650.646.9545
Twitter: @RevolutionR
Enterprise readiness, Performance architecture, Big Data analytics, Data source integration, Development tools, Deployment tools
• Enterprise readiness: Build assurance (continuous testing, custom validation); Implementation tools (validation utility); Technical support, documentation, training
• Performance architecture: Fast math libraries; Better memory management; Multi-core processing; Distributed computing architecture
• Big Data analytics: Descriptive Statistics; Cross Tabulation; Statistical Tests; Correlation, Covariance and SSCP Matrices; Linear Regression; Logistic Regression; Generalized Linear Models; Decision Trees; K-Means Clustering
• Data source integration: ODBC; Teradata (high speed); Text Files (Delimited & Fixed format); SAS; SPSS; Hadoop (HDFS & HBase)
• Development tools: Visual Debugger; Script Editor; R Snippets; Object Browser; Solution Explorer; Customizable Workspace; Version Control Plug-In
• Deployment tools: R objects as JSON, XML; Supports Java, JavaScript, .NET; RESTful web services API; Security (LDAP, SSO); Built-In load balancing; Asynchronous scheduling; Management console; Accelerators (Jaspersoft, Qlikview)
• A Revolution R Enterprise ScaleR analytic is provided a data source as input.
• The analytic loops over data, reading a block at a time. Blocks of data are read by a separate worker thread (Thread 0).
• Worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory.
• When all of the data is processed, a master results object is created from the intermediate results objects.
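The block-at-a-time loop described in these notes can be sketched as follows. This is an illustrative Python analogue, not ScaleR's implementation: a reader thread plays the role of Thread 0 and feeds blocks through a small bounded queue, worker threads update their own intermediate results, and the final step combines them into a mean and sample variance. The function names and the queue size are assumptions made for the example.

```python
import queue
import threading


def reader(blocks, q, n_workers):
    """Thread 0: read one block at a time and hand it to the workers."""
    for block in blocks:
        q.put(block)          # blocks while the queue is full
    for _ in range(n_workers):
        q.put(None)           # one stop signal per worker


def worker(q, results, idx):
    """Threads 1..n: update an intermediate result for each block received."""
    n, total, total_sq = 0, 0.0, 0.0
    while True:
        block = q.get()
        if block is None:
            break
        n += len(block)
        total += sum(block)
        total_sq += sum(x * x for x in block)
    results[idx] = (n, total, total_sq)


def chunked_mean_var(blocks, n_workers=2):
    """Combine the workers' intermediate results into mean and sample variance."""
    q = queue.Queue(maxsize=2)    # at most two blocks are buffered at a time
    results = [None] * n_workers
    threads = [threading.Thread(target=reader, args=(blocks, q, n_workers))]
    threads += [threading.Thread(target=worker, args=(q, results, i))
                for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    n = sum(r[0] for r in results)
    total = sum(r[1] for r in results)
    total_sq = sum(r[2] for r in results)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


mean, variance = chunked_mean_var(iter([[1, 2], [3, 4], [5]]))
```

The bounded queue is what lets reading the next block overlap with processing the previous one. The sum-of-squares formula is used here for brevity; a production implementation would likely prefer a numerically more stable update.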
RRE environments on workstations and servers submit execution requests to Teradata (TD) just as they do for other platforms. Steps in the execution include:
1. In response to an execution request from a user or a tool, the user’s RRE instance packages the R script and environment metadata into a request.
2. The request is transferred to Teradata via the ODBC interface and contains the R code and environment details.
3. Teradata’s Gateway Process receives the ODBC request, enqueuing it for one of the Parsing Engines (PE), typically the least busy of them across the machine.
4. The PE invokes the RRE engine (deployed inside of an Extended Stored Procedure, XSP).
5. The RRE instance inside the XSP is essentially a master process. It decomposes the run request into one or more sets of executions.
6. The execution requests are delivered by the XSP to PEMAs running as table operators.
7. ScaleR PEMAs run as table operators on chunks of data provided by the Teradata database (part of the new native Table Operator functionality).
8. PEMAs run in VAMPs as table operators and can be run iteratively. TD chunks the data via AMPs.
9. Intermediate result objects are produced and returned to the XSP.
• Type ahead: the IDE recognizes an R function as you type in the first few characters and shows the completed formula and parameters.
• Code snippets: templates for common R functions, e.g. for loop, xy plot. These are written in XML and users can add their own.
• Solution Window: the RPE organizes R scripts and data files in folders by Solution. This facilitates but does not implement versioning.
• The list of installed packages and the list of loaded packages are available for inspection. Clicking on these packages shows their components in the object window.
• The top right Object Browser window shows all of the objects available in the R environment.
• The bottom right object window shows the details of particular objects.
• Debugging Tools: when running in debugging mode, the RPE supports breakpoints, stepping in and out of code, and shows the contents of variables upon “mouse over”. Users may step through all code available in the Solution that is active.
Alteryx:
• Alteryx has always been about enabling business users to create powerful analytic applications through simple drag and drop functions.
• As our support for R has developed and we have added more functionality, we have seen it fervently adopted by our customers.
• As we see more and more customers like WalMart using R and doing predictive analytics, we are starting to see these customers come up against the limitations of open source R. To address the scalability issues, we started talking to Revolution Analytics.
Revolution: Delivering Enterprise-Scale Predictive Analytics to Line of Business Analysts
• We make R enterprise ready in a number of ways. The end result is a powerful, scalable way to run predictive analytics that leverages the open source community’s innovation and broad pool of experts.
Furthermore, we are Enabling a Broader Audience to Harness the Power of R
• R is the most widely adopted statistical language, with over 2M users.
• R is the standard statistical platform at universities around the world.
• R has a vibrant ecosystem that is constantly improving and innovating and offering new ways to use R.
• Alteryx enables experts to adopt these new innovations, then make them available to analysts by incorporating them back into the workflow, and then enables those applications to be made available to business users through the Alteryx Gallery.