R ON HADOOP
Kostiantyn Kudriavtsev
Lviv Hadoop User Group, June 19, 2014
Agenda
• What is R?
• Linear Regression
• R on Hadoop
• Summary
What is R?
An object-oriented and functional language for statistics, mathematics
and data science, created by statisticians, with comprehensive
data visualisation and statistical modelling capabilities;
5000+ (and growing) freely available specialised algorithms for
finance, economics, genomics, linguistics and so on;
2M+ users with specialised domain skills;
… but it has some drawbacks:
- limited by RAM
- single-threaded
R development environment
RStudio is the de facto standard IDE for R development and is available in local or server mode.
It can be used not only for coding, but also for visualisation.
It is suitable for developing R solutions on top of Hadoop.
What is Hadoop?
Apache Hadoop is a software framework that supports data-intensive
distributed applications based on the MapReduce (MR)
programming model. Main idea: move the computation to the data.
MR idea:
- Map step: Map(k1, v1) → list(k2, v2)
- Shuffle step (the "magic"): sort by k2, transfer data between
nodes, etc.
- Reduce step: Reduce(k2, list(v2)) → (k3, v3)
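The three MR steps can be imitated in memory with plain R. This is only a sketch to show the data flow, not how Hadoop is implemented; the function name `map_reduce` and the word-count example are assumptions made for illustration.

```r
# In-memory imitation of Map -> shuffle -> Reduce (illustrative only).
map_reduce <- function(records, map_fn, reduce_fn) {
  # Map step: every record yields a list of (key, value) pairs.
  pairs <- do.call(c, lapply(records, map_fn))
  keys <- vapply(pairs, function(p) p$key, character(1))
  values <- lapply(pairs, function(p) p$value)
  # Shuffle step: group the values by key (split() also orders the keys).
  grouped <- split(values, keys)
  # Reduce step: one call per key with all values collected for that key.
  mapply(reduce_fn, names(grouped), grouped, SIMPLIFY = FALSE)
}

# Word count: the mapper emits (word, 1), the reducer sums the ones.
word_map <- function(line) {
  lapply(strsplit(line, " ", fixed = TRUE)[[1]],
         function(w) list(key = w, value = 1))
}
word_reduce <- function(key, values) sum(unlist(values))

counts <- map_reduce(c("a b a", "b c"), word_map, word_reduce)
print(unlist(counts))
```

On a real cluster the shuffle step is where data moves between nodes; here it is a single `split()` call.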
Linear regression
A web store might use linear
regression to predict sales of
goods or to discover trends.
sale(Product) ~ visitors(Product)
Linear regression might be
used here:
sale = α * visitors + β
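For a single predictor, α and β have a closed form: α = cov(visitors, sale) / var(visitors) and β = mean(sale) − α · mean(visitors). The sketch below checks this against `lm()`; the numbers are made up for illustration and do not come from the talk.

```r
# Hypothetical data for one product (not from the original slides).
visitors <- c(100, 150, 200, 250, 300)
sales    <- c(11, 18, 21, 28, 32)

# Closed-form least-squares estimates for sale = alpha * visitors + beta.
alpha <- cov(visitors, sales) / var(visitors)
beta  <- mean(sales) - alpha * mean(visitors)

# The same model fitted by lm(); coef() returns (intercept, slope).
fit <- lm(sales ~ visitors)
print(c(alpha = alpha, beta = beta))
print(coef(fit))
```

Both approaches should agree up to floating-point error, which is a quick sanity check before moving the calculation to a cluster.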
Linear regression in R
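library(ggplot2)  # provides qplot() and geom_smooth()
df <- read.csv("Phone.csv", header=TRUE)
# Scatter plot of purchases against visits, coloured by product page
qq <- qplot(visited, purchased, colour=product_page, data=df)
# Overlay the fitted least-squares lines
qq + geom_smooth(method='lm', formula=y~x)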
Linear regression in R
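# Keep only the rows for the 'phone_2' product page
df.p2 <- df[df$product_page == 'phone_2', ]
# Fit purchased = α * visited + β by least squares
m <- lm(purchased ~ visited, data=df.p2)
# Show coefficients, standard errors and R-squared
summary(m)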
R on Hadoop
Several options:
• Hadoop streaming
• RHadoop
• RHipe
• RSpark
• Oracle R Advanced Analytics for Hadoop
• etc.
R Hadoop streaming
Hadoop was designed mainly for Java and
provides a comprehensive Java API.
Other languages can be used through the “Streaming
API”. The Streaming API uses the operating system's standard input
(for reading) and standard output (for writing). It
provides a lightweight API for MapReduce compared
to the Java API.
Streaming requires writing two separate scripts (one for the
mapper and one for the reducer) in any language (Python,
Ruby, R, C#, Go, OCaml, Lisp, etc.)
R Hadoop streaming
Streaming API drawbacks:
● although the inputs to the reducer are grouped by key, they are still iterated
over line by line, and the boundaries between keys must be detected by the
user
● no way to use different mappers in one MapReduce job
● no way to produce different outputs from the reducer
● counters are updated through stderr
Additional disadvantage of implementing streaming in R:
• R functions are “noisy”, so their output must be strictly controlled:
only meaningful data may be written to standard output
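Two of these points can be handled with small helpers. The sketch below is one possible approach, not the author's code: `quietly` and `increment_counter` are hypothetical names, while the `reporter:counter:<group>,<counter>,<amount>` line written to stderr is the format Hadoop Streaming expects for counter updates.

```r
# Build the counter-update line understood by Hadoop Streaming.
counter_line <- function(group, counter, amount = 1) {
  sprintf("reporter:counter:%s,%s,%d", group, counter, as.integer(amount))
}

# Counters go to stderr so they never mix with the job's data on stdout.
increment_counter <- function(group, counter, amount = 1) {
  cat(counter_line(group, counter, amount), "\n", sep = "", file = stderr())
}

# Evaluate an expression while discarding anything it prints, plus its
# messages and warnings, and return only its value.
quietly <- function(expr) {
  result <- NULL
  discarded <- capture.output(
    result <- suppressMessages(suppressWarnings(expr))
  )
  result
}

# A deliberately noisy function, standing in for a chatty library call.
noisy_double <- function(x) {
  print("debug output")
  message("progress note")
  x * 2
}

value <- quietly(noisy_double(21))
cat(value, "\n")  # only the meaningful result reaches stdout
```

Wrapping every library call this way is tedious, which is part of the motivation for higher-level packages.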
R Hadoop streaming: Mapper
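A minimal sketch of what such a mapper could look like. It is an illustration, not the code shown in the talk, and it assumes a headerless CSV input with the hypothetical column order `product_page,visited,purchased`. The mapper emits tab-separated `key, visited, purchased` lines and silently skips malformed rows.

```r
# Turn one CSV line into one "key<TAB>visited<TAB>purchased" line,
# or into nothing when the line is malformed.
map_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  if (length(fields) != 3) return(character(0))
  visited   <- suppressWarnings(as.numeric(fields[2]))
  purchased <- suppressWarnings(as.numeric(fields[3]))
  if (is.na(visited) || is.na(purchased)) return(character(0))
  paste(fields[1], visited, purchased, sep = "\t")
}

# Read the input one line at a time and write the mapped lines out.
run_mapper <- function(input, output = stdout()) {
  while (length(line <- readLines(input, n = 1, warn = FALSE)) > 0) {
    out <- map_line(line)
    if (length(out) > 0) writeLines(out, output)
  }
}

# In a real streaming script the last line would be:
# run_mapper(file("stdin", open = "r"))
```

Keeping the per-line logic in `map_line` makes it testable locally without a cluster.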
R Hadoop streaming: Reducer
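A minimal sketch of a matching reducer, again an illustration rather than the code from the talk. It assumes the input lines are sorted by key and have the tab-separated form `key, visited, purchased`. The reducer keeps running sums per key, detects key boundaries itself, and emits the least-squares slope and intercept for each key.

```r
# Closed-form least squares from running sums: y = alpha * x + beta.
fit_from_sums <- function(s) {
  denom <- s$n * s$sxx - s$sx^2
  if (denom == 0) return(c(alpha = NA_real_, beta = NA_real_))
  alpha <- (s$n * s$sxy - s$sx * s$sy) / denom
  beta  <- (s$sy - alpha * s$sx) / s$n
  c(alpha = alpha, beta = beta)
}

new_sums <- function() list(n = 0, sx = 0, sy = 0, sxy = 0, sxx = 0)

run_reducer <- function(input, output = stdout()) {
  current <- NULL
  sums <- new_sums()
  emit <- function() {
    fit <- fit_from_sums(sums)
    writeLines(paste(current, fit[["alpha"]], fit[["beta"]], sep = "\t"), output)
  }
  while (length(line <- readLines(input, n = 1, warn = FALSE)) > 0) {
    f <- strsplit(line, "\t", fixed = TRUE)[[1]]
    if (length(f) != 3) next            # skip malformed lines
    key <- f[1]; x <- as.numeric(f[2]); y <- as.numeric(f[3])
    # Key boundary: flush the finished group and start a new one.
    if (!is.null(current) && key != current) {
      emit()
      sums <- new_sums()
    }
    current <- key
    sums$n   <- sums$n + 1
    sums$sx  <- sums$sx + x
    sums$sy  <- sums$sy + y
    sums$sxy <- sums$sxy + x * y
    sums$sxx <- sums$sxx + x * x
  }
  if (!is.null(current)) emit()         # flush the last group
}

# In a real streaming script the last line would be:
# run_reducer(file("stdin", open = "r"))
```

Working from running sums means the reducer never holds a whole group in memory, which matters given that R is limited by RAM.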
RHadoop
RHadoop is a set of libraries (written in R)
that aims to make it easier to use R
with Hadoop streaming to develop MR
jobs. As a result, it shares the general drawbacks of Hadoop
streaming.
RHadoop
RHadoop is still R running through Hadoop Streaming
Advantages compared to plain Streaming:
● no need to handle key changes in the reducer
● no need to control function output manually
● a simple R API wraps the Streaming API
● R code can run in a local environment or on Hadoop without
changes
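A small sketch of these advantages using the `rmr2` package from the RHadoop project. It assumes `rmr2` is installed (it is not distributed on CRAN) and follows the pattern of the package tutorial; the product names are made up. The reducer receives all values for one key at once, and the same code is meant to run on a cluster by switching the backend.

```r
# Runs only when rmr2 is available; otherwise the block is skipped.
if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  rmr.options(backend = "local")  # "hadoop" would target a cluster

  # Hypothetical page views, one product name per visit.
  pages <- c("phone_1", "phone_2", "phone_1", "phone_3", "phone_1")
  input <- to.dfs(pages)

  job <- mapreduce(
    input  = input,
    map    = function(k, v) keyval(v, 1),         # emit (product, 1)
    reduce = function(k, vv) keyval(k, sum(vv))   # all values for key k
  )

  counts <- from.dfs(job)
  print(counts)
}
```

There is no key-boundary detection and no manual output filtering here; `keyval()` carries the data between the steps.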
Demo time
R on Hadoop in Real Life
Several steps are required to achieve the goal:
1. Data ingestion
2. Data preparation
3. R processing
4. Postprocessing
http://static.vroomgirls.com/website/wp-content/uploads/2011/09/Route66Road%C2%A9-Dmitry-Rogozhin.jpg
Learned Lessons
R is slow… for millions of calculations
it is slow even with Hadoop!
How to improve the speed?
Rewrite the flow: do as much preprocessing as possible before the R
step.
Hadoop streaming supports mappers and reducers written in
different languages.
Think twice. R is great for exploratory analysis and
research, but in production it might incur a performance
penalty.
Q&A
• Thank you for your attention
