User 2013-oracle-big-data-analytics-1971985
- 1. Big Data Analytics – Scaling R to Enterprise Data
useR! 2013 – Albacete Spain #useR2013
Luis Campos, Big Data Solutions Lead, Oracle EMEA (@luigicampos)
Mark Hornick, Director, Oracle Database Advanced Analytics (@MarkHornick)
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
- 3. The girl with all the questions!
“The real innovation here is that we can ask questions and get the answer back before we have forgotten why we asked the question in the first place.”
– Hilary Mason, Chief Scientist at Bit.ly and member of NYC Mayor Bloomberg’s Technology and Innovation Advisory Council
- 4. Nexus of Forces, Platform 3.0, Four Pillars
What are analysts and industry groups saying?
- 5. New Information Challenges
Data Explosion
A Decade of Digital Universe Growth: Storage in Exabytes (Source: IDC’s Digital Universe Study, June 2011)
Combinatory Explosion
Dimension Explosion
- 6. Big Data Solution = Data + Analytics + Tools
Source: McKinsey study “Big data: What’s your plan?” (March 2013)
http://www.mckinsey.com/insights/business_technology/big_data_whats_your_plan
DATA: Any Data, Any Source
ANALYTICS: Out-of-the-box Analytics, New Models
TOOLS: Self Service Data Discovery
On Premise, On Cloud, On Mobile
- 7. Oracle Complete Business Analytics Solution
BIG DATA APPLIANCE
BIG DATA CONNECTORS
NoSQL DB
Oracle Advanced Analytics (DATA MINING, ORACLE R Ent.)
SPATIAL, GRAPH
Real Time Decisions (RTD)
OBIEE
ENDECA
Collective Intellect (CI)
On Premise, Oracle Cloud, On Mobile
- 8. Apply Advanced Analytics on All Data
Visualise it with any BI Tool
Hadoop HDFS
Relational Data
BI Tools
- 9. Oracle R Advantages
1. Keep the R tools
2. Keep the data where it sits (Relational or HDFS)
3. Keep the SQL Based BI Tools
4. Scale to LARGE data sets
Development: R workspace console
Function push-down – data transformation & statistics
Production: Oracle statistics engine
Consumption: OBIEE, Web Services
- 10. Oracle’s Advanced Analytics Strategic Offerings
Deliver enterprise-level advanced analytics in the Database
Oracle in-Database Data Mining algorithms
– Access through Free GUI from SQL Developer or programmatically from SQL, PL/SQL, R or Java
– Predictive model APIs for the Oracle R Enterprise
– Exadata architecture advantages for up to 5x improvement with Smart Scan
Oracle R Distribution
– Free download, pre-installed on Oracle Big Data Appliance, bundled with Oracle Linux
– Enhanced linear algebra performance: Intel’s Math Kernel Library, AMD’s Core Math Library (Windows and Linux), SUN Solaris and IBM AIX
– Enterprise support for customers of Oracle Advanced Analytics, Big Data Appliance, and Oracle Linux
- 11. Oracle’s Advanced Analytics Strategic Offerings
Deliver enterprise-level R in the Database or Hadoop
Oracle R Enterprise
– Transparent access to database-resident data from R
– Embedded R script execution through database managed R engines
– Statistics engine
– Enhanced support for high-speed Exadata scoring
Oracle R Connector for Hadoop [ORCH] (Part of Oracle Big Data Connectors)
– R interface to Oracle Hadoop Cluster on BDA and non-Oracle Hadoop clusters
– Access and manipulate data in HDFS, database, and file system
– Write MapReduce functions using R and execute through natural R interface
– Predictive models with execution in-Cluster against Hadoop-stored data
- 12. Oracle R Components
Component layout
[Figure: component layout across Analyst Laptop, Oracle Database, Big Data Appliance, and Exadata, showing where Oracle R Distribution, Oracle R Enterprise Server Components, Oracle R Enterprise Client Packages (optional with ORCH), Oracle R Connector for Hadoop, and Oracle R Connector for Hadoop Client are installed]
- 13. Knowledge Exploitation Process
Typical stages in a Big Data Project
[Figure: process cycle centered on the Data Scientist, with stages Business Understanding, Data Selection, Discovery, Data Preparation, Model Building, Evaluation, Deployment]
- 14. Data Loading with Oracle R Enterprise
R> library(ORE)
R> df <- data.frame(A=1:26, B=letters[1:26])
R> dim(df)
[1] 26 2
R> class(df)
[1] "data.frame"
R> ore.create(df, table="DF_TABLE")
R> ore.ls()
[1] "DF_TABLE"
R> class(DF_TABLE)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> dim(DF_TABLE)
[1] 26 2
- 15. Discovery with Oracle R in-DB and HDFS
# In-database discovery with ORE
library(ORE)
ore.ls() # list tables in DB
class(MY_TABLE) # ore.frame
dim(MY_TABLE)
# overloaded R functions
head(MY_TABLE)
sample(MY_TABLE)
summary(MY_TABLE)

# HDFS discovery with ORCH
library(ORCH)
hdfs.ls()
hdfs.dim("myHDFSdata")
hdfs.head("myHDFSdata")
hdfs.sample("myHDFSdata")
hdfs.toHive("myHDFSdata", tablename="my_hive_data")
summary(my_hive_data)
- 16. Data Prep with Oracle R in-DB and HDFS
library(ORE) / library(ORCH)
# join
merge(MY_TABLE1, MY_TABLE2, by.x="x1", by.y="x2")
# project columns
df <- MY_TABLE[,c("X","Y","Z")]
# filter rows (the condition must use columns kept by the projection)
df <- df[df$Z<=4.3 | df$X=="B",1:3]
# binning
IRIS_TAB <- ore.push(iris[1:4])
IRIS_TAB$PetalBins =
  ifelse(IRIS_TAB$Petal.Length < 2.0, "SMALL PETALS",
  ifelse(IRIS_TAB$Petal.Length < 4.0, "MEDIUM PETALS", "LARGE PETALS"))
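The binning step can be tried locally before pushing data to the database. What follows is a minimal sketch in base R, assuming the built-in iris data frame as a stand-in for the ore.frame IRIS_TAB (the names iris_tab and PetalBins2 are illustrative, not from the slides). It also shows cut() as one possible alternative to nested ifelse calls; whether cut() is overloaded for ore.frame objects is not stated in the deck.

```r
# Local stand-in for IRIS_TAB: a plain data.frame, not an ore.frame
iris_tab <- iris[1:4]

# Nested ifelse, mirroring the slide
iris_tab$PetalBins <- ifelse(iris_tab$Petal.Length < 2.0, "SMALL PETALS",
                      ifelse(iris_tab$Petal.Length < 4.0, "MEDIUM PETALS",
                             "LARGE PETALS"))

# Same bins with cut(); right = FALSE makes each interval [lower, upper)
iris_tab$PetalBins2 <- as.character(cut(
  iris_tab$Petal.Length,
  breaks = c(-Inf, 2.0, 4.0, Inf),
  labels = c("SMALL PETALS", "MEDIUM PETALS", "LARGE PETALS"),
  right  = FALSE))

# Count rows per bin
print(table(iris_tab$PetalBins))
```

With more than three bins, cut() tends to stay readable where nested ifelse calls do not.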
- 17. “Densifying” data: custom MapReduce jobs
Count occurrence of hash tags in tweets per customer for select tags
mapHashTags <- function (k,v) {
  x <- strsplit(v$text, " ")
  # drop empty tokens left by repeated spaces
  x <- lapply(x, function(w) w[w != ''])
  importantTags <- tolower(importantTags)
  for(twt in seq_along(x)) {
    for(tag in x[[twt]]) {
      if(substr(tag,1,1) == "#") {
        tagL <- tolower(tag)
        if(tagL %in% importantTags) {
          orch.keyval(v[twt,"screenName"],tagL)
        }
      }
    }
  }
}
reduceHashTags <- function(k,vals) { # k = screenName, vals = vector(tags)
  importantTags <- tolower(importantTags)
  vals <- factor(vals$val,levels=importantTags)
  x <- as.data.frame(t(as.matrix(table(vals))))
  orch.keyval(k,x) # k = screenName, x = df(importantTags as cols) with counts
}
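The map and reduce logic can be checked on a laptop before submitting a Hadoop job. The sketch below is a base R simulation only: it does not use ORCH, it replaces orch.keyval with plain data frames, and it stands in for the shuffle phase with split(). The sample tweets and the names map_hash_tags, reduce_hash_tags and tag_summary are hypothetical.

```r
# Hypothetical sample input; the real job reads tweets from HDFS
tweets <- data.frame(
  screenName = c("user.a", "user.a", "user.b"),
  text = c("Loving #BigData and #SQL", "more #bigdata  today", "#oracle #java news"),
  stringsAsFactors = FALSE)
importantTags <- c("#bigdata", "#database", "#oracle", "#sql")

# Map step: emit one (screenName, tag) pair per important hashtag
map_hash_tags <- function(v, tags) {
  words <- strsplit(v$text, " ")
  pairs <- lapply(seq_along(words), function(i) {
    w <- tolower(words[[i]])
    w <- w[w != "" & substr(w, 1, 1) == "#" & w %in% tags]
    data.frame(key = rep(v$screenName[i], length(w)), val = w,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

# Reduce step: one row of tag counts per key, one column per tag
reduce_hash_tags <- function(k, vals, tags) {
  counts <- table(factor(vals, levels = tags))
  data.frame(key = k, t(as.matrix(counts)), check.names = FALSE,
             stringsAsFactors = FALSE)
}

pairs <- map_hash_tags(tweets, tolower(importantTags))
# Stand-in for the shuffle: group emitted values by key, then reduce each group
groups <- split(pairs$val, pairs$key)
tag_summary <- do.call(rbind, lapply(names(groups), function(k)
  reduce_hash_tags(k, groups[[k]], importantTags)))
rownames(tag_summary) <- NULL
print(tag_summary)
```

Fixing the factor levels to the full tag list is what "densifies" the output: every key gets a column for every tag, including those with a zero count.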
- 18. ORCH: Create your own MapReduce jobs
Count occurrence of hash tags in tweets per customer for select tags
importantTags <- c("#bigdata","#database","#oracle","#sql")
tag.summary <- hadoop.exec(tweets.id,
  mapper=mapHashTags,
  reducer=reduceHashTags,
  export=orch.export(importantTags=importantTags),
  config=new("mapred.config",
    job.name = "TwitterScreenNameHashTags",
    reduce.tasks = 5,
    map.output = data.frame(key='a', val='a'),
    reduce.output = data.frame(key='a', bigdata=0, database=0, oracle=0, sql=0)))
hdfs.get(tag.summary)

> hdfs.get(tag.summary)
             key bigdata database oracle sql
1 twitter.user.1       4        7     37  91
2 twitter.user.2      15       19      1  32
3 twitter.user.3     104       57      8   0
4 twitter.user.4       0       64    549   0
- 19. Modelling with Oracle R in-DB and HDFS
# Clustering with ORE
X <- ore.push(data.frame(x))
km.mod1 <- ore.odmKMeans(~., X, num.centers=2, num.bins=5)
summary(km.mod1)
rules(km.mod1)
clusterhists(km.mod1)

# Regression with ORCH
mod.lm <- orch.lm(myFormula, myData, nReducers = 2)
summary(mod.lm)
pred <- predict.orch.lm(mod.lm, newdata = myData)
res.pred <- hdfs.get(pred)
head(res.pred)
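For orientation, the same two modelling steps can be run client-side in base R. This is a sketch under stated assumptions rather than the ORE or ORCH API: it uses stats::kmeans and stats::lm in place of ore.odmKMeans and orch.lm, and the synthetic data and the names x, d, km and mod are hypothetical.

```r
set.seed(42)  # reproducible synthetic data (hypothetical)

# Two well-separated groups of 50 points in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

# Client-side clustering with stats::kmeans (not ore.odmKMeans)
km <- kmeans(x, centers = 2)
print(table(km$cluster))

# Client-side regression with stats::lm (not orch.lm)
d <- data.frame(x1 = runif(200), x2 = runif(200))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(200, sd = 0.1)
mod <- lm(y ~ x1 + x2, data = d)
print(coef(mod))

# Score the training data, as the ORCH example does with predict.orch.lm
pred <- predict(mod, newdata = d)
print(head(pred))
```

Both calls here hold all data in client memory, which is the limit the in-database and in-cluster versions are meant to remove.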
- 20. In-database performance advantage
R lm vs. ORE ore.lm
Data: 500k to 1.5m records, 3 predictors
Performance: 2x-3x improvement for build, 4x improvement for scoring
- 21. In-database performance advantage – lm
More tests at http://blogs.oracle.com/R/entry/oracle_r_enterprise_1_32
- 22. Deploying with Oracle R Enterprise
Load R scripts into ORE script repository
Invoke R scripts by name from SQL
Store R objects directly in Oracle Database (no separate files)
Optional return values:
• Data frame consumable by any SQL-ready application
• XML containing structured data, complex R objects, PNG images
• PNG table with BLOB column containing images for immediate consumption
Schedule for automatic execution
- 23. Oracle Advanced Analytics: Embedded R Execution
SQL interface rqEval – generate XML string for graphic output
Oracle PL/SQL (wrapping an R Language function):
begin
  sys.rqScriptCreate('Example6',
    'function(){
       res <- 1:10
       plot( 1:100, rnorm(100), pch = 21,
             bg = "red", cex = 2 )
       res
     }');
end;
/
Oracle SQL:
select value
from table(rqEval(NULL,'XML','Example6'));
Oracle BI Publisher
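Before registering a function in the script repository, it can help to run it in a local R session. The sketch below assumes nothing about ORE: it defines the same function body under the hypothetical name example6 and sends the plot to a null PDF device so it also runs without a display.

```r
# Same body as the function registered as 'Example6'
example6 <- function() {
  res <- 1:10
  plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)
  res
}

# Open a graphics device that writes no file, run the function, close the device
pdf(NULL)
res <- example6()
invisible(dev.off())

# The function returns its last value; the plot is a side effect
print(res)
```

The split between return value and plot is the same one the slide relies on: the XML output carries both the returned object and the generated image.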
- 24. Summary
Oracle R Enterprise (ORE)
• A comprehensive, database-centric environment for end-to-end analytical processes in R with immediate deployment to production environments
• Wide range of in-database advanced analytics algorithms exposed through R
• Eliminate R client memory limits

Oracle R Connector for Hadoop (ORCH)
• A collection of R packages enabling Big Data analytics from an R environment
• Allows R users to leverage a Hadoop Cluster with HDFS and MapReduce from R
• Prepackaged advanced analytics algorithms
• Transparent manipulation of HIVE data

• Enable R users to conduct Big Data projects from R
• Eliminate client R engine memory barrier
• Scale to large data sets
• Deploy R-based solutions without translation to other languages or environments
- 25. Resources
• Blog:
http://www.oracle.com/goto/R
https://blogs.oracle.com/R/
• Forum: https://forums.oracle.com/forums/forum.jspa?forumID=1397
• Oracle R Distribution:
http://www.oracle.com/technetwork/indexes/downloads/r-distribution-1532464.html
• ROracle:
http://cran.r-project.org/web/packages/ROracle
• Oracle R Enterprise:
http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise
• Oracle R Connector for Hadoop:
http://www.oracle.com/us/products/database/big-data-connectors/overview