3. What is R?
An object-oriented and functional language for statistics, mathematics and
data science, created by statisticians, with comprehensive
data visualisation and statistical modelling capabilities;
5000+ (and growing) freely available specialised algorithms for
finance, economics, genomics, linguistics and so on;
2M+ users with specialised domain skills;
… but some drawbacks are:
- limited by RAM
- single-threaded
4. R development environment
RStudio is the de facto standard IDE for R development and is
available in local or server mode.
It can be used not only for coding, but also for visualisation.
Suitable for developing R solutions on top of Hadoop.
5. What is Hadoop?
Apache Hadoop is a software framework that supports data-
intensive distributed applications based on the MapReduce
algorithm (MR). Main idea: move computation to data.
MR idea:
- Map step: Map(k1, v1) → list(k2, v2)
- Shuffle ("magic" here: sort by k2, data transfer between
nodes, etc.)
- Reduce step: Reduce(k2, list(v2)) → (k3, v3)
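The Map, shuffle and Reduce steps can be imitated in a single R process to make the data flow concrete. This is only a toy sketch, not Hadoop: the input lines, the names map_fn and reduce_fn, and the "product_page,purchased" record layout are all made up for illustration.

```r
# Toy, single-process imitation of the MapReduce flow (no Hadoop involved).
# Hypothetical input: one "product_page,purchased" record per line.
lines <- c("phone_1,1", "phone_2,0", "phone_1,1", "phone_2,1", "phone_1,0")

# Map step: Map(k1, v1) -> list(k2, v2); k1 is the line number, v1 the line.
map_fn <- function(k1, v1) {
  fields <- strsplit(v1, ",", fixed = TRUE)[[1]]
  list(list(key = fields[1], value = as.integer(fields[2])))
}

# Reduce step: Reduce(k2, list(v2)) -> (k3, v3); here the sum per key.
reduce_fn <- function(k2, v2s) {
  list(key = k2, value = sum(unlist(v2s)))
}

# Run all map calls and flatten the emitted pairs into one list.
mapped <- unlist(Map(map_fn, seq_along(lines), lines), recursive = FALSE)

# Shuffle: group the values by k2 (split() also orders the groups by key).
keys    <- vapply(mapped, function(p) p$key, character(1))
values  <- lapply(mapped, function(p) p$value)
grouped <- split(values, keys)

# Run one reduce call per distinct key.
reduced <- Map(reduce_fn, names(grouped), grouped)
str(reduced)
```

On a real cluster the map calls run on the nodes that hold the data, and the shuffle moves the pairs between nodes; here everything happens in memory.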
6. Linear regression
A web store might use linear
regression to predict sales of
goods or to discover trends:
sale(Product) ~ visitors(Product)
The linear model used here
(α is the slope, β is the intercept):
sale = α * visitors + β
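For a single predictor, the least-squares estimates of α and β have a closed form, which can be checked against lm(). A minimal sketch, assuming made-up numbers (the visitors and sale vectors below are synthetic, not taken from the slides' data):

```r
# Synthetic data for illustration only.
visitors <- c(100, 150, 200, 250, 300)
sale     <- c(11, 18, 21, 28, 32)

# Closed-form least squares for sale = alpha * visitors + beta.
alpha <- cov(visitors, sale) / var(visitors)     # slope
beta  <- mean(sale) - alpha * mean(visitors)     # intercept

# The same model fitted with lm(); coefficients should agree.
fit <- lm(sale ~ visitors)
print(c(alpha = alpha, beta = beta))
print(coef(fit))
```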
7. Linear regression in R
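library(ggplot2)  # provides qplot() and geom_smooth()
df <- read.csv("Phone.csv", header = TRUE)
# Scatter plot of purchases against visits, one colour per product page
qq <- qplot(visited, purchased, colour = product_page, data = df)
# Overlay a fitted regression line for each product page
qq + geom_smooth(method = "lm", formula = y ~ x)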
8. Linear regression in R
df.p2 <- df[df$product_page == 'phone_2', ]
m <- lm(purchased ~ visited, data=df.p2)
summary(m)
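Beyond summary(), the fitted model can be queried directly. The sketch below assumes a made-up data frame with the same column names as on the slide (product_page, visited, purchased), because Phone.csv itself is not available here; the value 400 passed to predict() is also just an example.

```r
# Synthetic stand-in for Phone.csv (values invented for illustration).
df <- data.frame(
  product_page = rep(c("phone_1", "phone_2"), each = 5),
  visited      = c(100, 150, 200, 250, 300, 120, 160, 220, 260, 310),
  purchased    = c(9, 16, 19, 27, 30, 11, 18, 21, 28, 32)
)

df.p2 <- df[df$product_page == "phone_2", ]
m <- lm(purchased ~ visited, data = df.p2)

coefs <- coef(m)                 # intercept and slope
r2    <- summary(m)$r.squared    # share of variance explained
# Predicted purchases for a hypothetical 400 visits
pred  <- predict(m, newdata = data.frame(visited = 400))

print(coefs)
print(r2)
print(pred)
```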
9. R on Hadoop
Several options:
• Hadoop streaming
• RHadoop
• RHIPE
• SparkR
• Oracle R Advanced Analytics for Hadoop
• etc.
10. R Hadoop streaming
Hadoop was mainly designed for Java and
provides a comprehensive Java API.
Other languages can be used through the “Streaming
API”. The Streaming API uses the operating system's standard
input (for reading) and standard output (for writing). It
provides a lightweight API for MapReduce compared
to the Java API.
Streaming requires writing two separate scripts (a
mapper and a reducer) in any language (Python,
Ruby, R, C#, Go, OCaml, Lisp, etc.)
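A streaming mapper in R is just a script that reads lines and prints tab-separated key/value pairs. The sketch below assumes CSV input of the form "product_page,visited,purchased"; the function names map_line and run_mapper are hypothetical. To keep it runnable anywhere, it reads from an in-memory connection; a real script would pass file("stdin", open = "r") instead.

```r
# Turn one CSV line into "key<TAB>value", or NULL if the line is unusable.
map_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  if (length(fields) < 3) return(NULL)
  purchased <- suppressWarnings(as.numeric(fields[3]))
  if (is.na(purchased)) return(NULL)   # also drops the header row
  paste(fields[1], purchased, sep = "\t")
}

# Read an already open connection line by line and print the mapped pairs.
run_mapper <- function(con) {
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    out <- map_line(line)
    if (!is.null(out)) cat(out, "\n", sep = "")
  }
}

# Demo on sample lines instead of real standard input.
sample_input <- textConnection(c(
  "product_page,visited,purchased",
  "phone_1,120,3",
  "phone_2,80,1"
))
mapped <- capture.output(run_mapper(sample_input))
close(sample_input)
print(mapped)
```

The connection must be opened before the loop; otherwise readLines() reopens it on every call and never advances.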
11. R Hadoop streaming
Streaming API drawbacks:
● while the inputs to the reducer are grouped by key, they are still iterated
over line-by-line, and the boundaries between keys must be detected by the
user
● no way to use different mappers in one MapReduce job
● no way to create different outputs from the reducer
● counters are updated through stderr
Additional disadvantage of implementing streaming in R:
● the output of R functions must be strictly controlled, because they are “noisy”, while
only meaningful data must be written to standard output
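Several of these drawbacks show up in even a small reducer. The sketch below assumes sorted "key<TAB>value" input as produced by a mapper; the function name reduce_stream and the counter group and name (Sales, KeysReduced) are made up. It detects key boundaries by hand, writes only data to standard output, and reports a counter on stderr using the reporter:counter:group,name,amount line format of Hadoop streaming.

```r
# Keep package start-up chatter away from standard output.
suppressPackageStartupMessages(library(stats))

reduce_stream <- function(con) {
  current_key <- NULL
  total <- 0
  emit <- function(key, value) {
    cat(key, "\t", value, "\n", sep = "")    # data goes to stdout only
    # Counter update goes to stderr (group and name are hypothetical).
    cat("reporter:counter:Sales,KeysReduced,1\n", file = stderr())
  }
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
    if (length(parts) != 2) next
    # Key boundary: the key changed, so flush the previous group.
    if (!is.null(current_key) && parts[1] != current_key) {
      emit(current_key, total)
      total <- 0
    }
    current_key <- parts[1]
    total <- total + as.numeric(parts[2])
  }
  if (!is.null(current_key)) emit(current_key, total)  # flush the last group
}

# Demo on sorted sample lines instead of real standard input.
sorted_input <- textConnection(c("phone_1\t3", "phone_1\t2", "phone_2\t1"))
reduced <- capture.output(reduce_stream(sorted_input))
close(sorted_input)
print(reduced)
```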
14. RHadoop
RHadoop is a set of libraries (written in R)
that aims to make it easier to use R
with Hadoop streaming to develop MR
jobs. As a result, it shares the general drawbacks of Hadoop
streaming.
15. RHadoop
RHadoop is still R through Hadoop Streaming
Advantages compared to Streaming:
● no need to detect key changes in the reducer
● no need to control function output manually
● a simple R API wraps the Streaming API
● R code can be run in a local environment or on Hadoop without
changes
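As a rough illustration of these points, the sketch below uses rmr2, the MapReduce package from the RHadoop collection, with its local backend. It assumes rmr2 is installed and only runs the job if it is; the function name run_sales_job and the sample data are made up, and the exact shape of the result may differ between rmr2 versions.

```r
# Sum purchases per product page with rmr2 (assumed to be installed).
run_sales_job <- function(df) {
  input <- rmr2::to.dfs(rmr2::keyval(df$product_page, df$purchased))
  result <- rmr2::mapreduce(
    input  = input,
    map    = function(k, v) rmr2::keyval(k, v),
    # The reducer receives all values of one key; no boundary detection.
    reduce = function(k, vv) rmr2::keyval(k, sum(vv))
  )
  rmr2::from.dfs(result)
}

# Synthetic data for illustration only.
sales <- data.frame(
  product_page = c("phone_1", "phone_2", "phone_1"),
  purchased    = c(3, 1, 2)
)

if (requireNamespace("rmr2", quietly = TRUE)) {
  # "local" runs the job in this R session; "hadoop" would submit it
  # to a cluster with the same map and reduce functions.
  rmr2::rmr.options(backend = "local")
  print(run_sales_job(sales))
} else {
  message("rmr2 is not installed; skipping the job")
}
```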
17. R on Hadoop in Real Life
Several steps are required to achieve the goal:
1. Data ingestion
2. Data preparation
3. R processing
4. Postprocessing
18. Lessons Learned
R is slow… for millions of calculations
it is slow even with Hadoop!
How to improve the speed?
Rewrite the flow: do as much preprocessing as possible before the R
step.
Hadoop streaming supports a mapper and a reducer written in
different languages.
Think twice. R is great for exploratory analysis and
research, but in production it might incur a performance
penalty.