2. Agenda
• When Should You Use R?
• When Should You Consider Hadoop?
• How to use R on Hadoop?
– RHadoop
– R + Hadoop Streaming
– RHIPE
• Demo
• Conclusions
12. Installation (in textbook), using devtools
> library(devtools)
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz")
Installing rmr_1.3.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz
Installing rmr
Installing dependencies for rmr:
...
> # make sure to set HADOOP_HOME to the location of your HADOOP installation,
> # HADOOP_CONF to the location of your hadoop config files, and make sure
> # that the hadoop bin directory is on your path
> Sys.setenv(HADOOP_HOME="/Users/jadler/src/hadoop-0.20.2-cdh3u4")
> Sys.setenv(HADOOP_CONF=paste(Sys.getenv("HADOOP_HOME"),
+ "/conf", sep=""))
> Sys.setenv(PATH=paste(Sys.getenv("PATH"), ":", Sys.getenv("HADOOP_HOME"),
+ "/bin", sep=""))
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz")
Installing rhdfs_1.0.4.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz
Installing rhdfs
...
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz")
(Refer to page 581)
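The three Sys.setenv() calls above can be collected into one small helper, sketched below. The name hadoop_env and the path /opt/hadoop are placeholders for illustration, not part of RHadoop or the textbook; the sketch also assumes a Unix-style ":" path separator, as the slide does.

```r
# Hypothetical helper (not part of RHadoop): derive the three settings
# from the Hadoop install directory without changing the session.
hadoop_env <- function(home, path = Sys.getenv("PATH")) {
  c(
    HADOOP_HOME = home,
    HADOOP_CONF = file.path(home, "conf"),
    # Append Hadoop's bin directory to the existing search path.
    PATH = paste(path, file.path(home, "bin"), sep = ":")
  )
}

# "/opt/hadoop" is a placeholder; use your own install location.
env <- hadoop_env("/opt/hadoop")
print(env[c("HADOOP_HOME", "HADOOP_CONF")])

# Apply the settings only if the directory really exists.
if (dir.exists(env[["HADOOP_HOME"]])) {
  do.call(Sys.setenv, as.list(env))
}
```

Building the values first and applying them in a separate step makes it easy to inspect them before the session environment is modified.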
13. Installation
http://blog.fens.me/rhadoop-rhadoop/
• Download the RHadoop packages from
https://github.com/RevolutionAnalytics/RHadoop/wiki
• $ R CMD javareconf
• $ R
– Install rJava, reshape2, Rcpp, iterators, itertools, digest,
RJSONIO, functional, and bitops.
• > q()
• $ R CMD INSTALL rhdfs_1.0.6.tar.gz
• $ R CMD INSTALL rmr2_2.2.2.tar.gz
• Check whether the installation succeeded
– > library(rhdfs)
– > hdfs.init()
– > hdfs.ls("/user")
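The dependency step in the list above could be scripted along these lines. This is only a sketch: missing_deps is a hypothetical helper name, and the install call is left commented out because it needs network access and a repository that still carries all nine packages.

```r
# Packages named on the slide as prerequisites for rhdfs and rmr2.
deps <- c("rJava", "reshape2", "Rcpp", "iterators", "itertools",
          "digest", "RJSONIO", "functional", "bitops")

# Hypothetical helper: which of `pkgs` are absent from `installed`?
missing_deps <- function(pkgs,
                         installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_deps(deps)
print(to_install)

# Uncomment to install whatever is missing (needs network access):
# if (length(to_install) > 0) install.packages(to_install)
```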
16. An example RHadoop application
• Mortality Public Use File Documentation
– The dataset contains a record of every death in the United States,
including the cause of death and demographic information about the
deceased. In 2009, the mortality data file was 1.1 GB and contained
2,441,219 records.
$ wget ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2009us.zip
$ unzip mort2009us.zip
$ hadoop fs -mkdir mort09
$ hadoop fs -copyFromLocal VS09MORT.DUSMCPUB mort09
$ hadoop fs -ls mort09
Found 1 items
-rw-r--r-- 3 jadler supergroup 1196197310 2012-08-02 16:31 /user/jadler/mort09/VS09MORT.DUSMCPUB
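As a quick sanity check, the file size in the listing can be divided by the record count quoted above. The sketch below uses only those two numbers. The result is 2 bytes more than the 488-character line length given in the documentation; a two-byte line terminator would explain the difference, but that is an assumption and is not stated in the slides.

```r
file_bytes <- 1196197310  # size reported by `hadoop fs -ls`
records    <- 2441219     # record count quoted for the 2009 file

bytes_per_record <- file_bytes / records
print(bytes_per_record)

# The division is exact, so every record occupies the same number of bytes.
print(file_bytes %% records == 0)
```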
17. /home/spndc/src/Rhadoop/mort09.R (1/3)
mort.schema <- c(
.X0=19, ResidentStatus=1, .X1=40, Education1989=2, Education2003=1,
EducationFlag=1,MonthOfDeath=2,.X2=2,Sex=1,AgeDetail=4, AgeSubstitution=1,
AgeRecode52=2,AgeRecode27=2,AgeRecode12=2,AgeRecodeInfant22=2,
PlaceOfDeath=1,MaritalStatus=1,DayOfWeekofDeath=1,.X3=16,
CurrentDataYear=4, InjuryAtWork=1, MannerOfDeath=1, MethodOfDisposition=1,
Autopsy=1,.X4=34,ActivityCode=1,PlaceOfInjury=1,ICDCode=4,
CauseRecode358=3,.X5=1,CauseRecode113=3,CauseRecode130=3,
CauseRecode39=2,.X6=1,Conditions=281,.X8=1,Race=2,BridgeRaceFlag=1,
RaceImputationFlag=1,RaceRecode3=1,RaceRecode5=1,.X9=33,
HispanicOrigin=3,.X10=1,HispanicOriginRecode=1)
> # according to the documentation, each line is 488 characters long
> sum(mort.schema)
[1] 488
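To illustrate how a named width vector like mort.schema can drive fixed-width parsing, here is a minimal sketch on a toy schema. The helper unpack_line and the toy data are hypothetical, and treating fields whose names start with ".X" as filler to be dropped is an assumption based on the naming, not something the slide states.

```r
# Hypothetical helper: split one fixed-width line into named fields.
unpack_line <- function(line, schema) {
  ends   <- cumsum(schema)        # last column of each field
  starts <- ends - schema + 1     # first column of each field
  fields <- substring(line, starts, ends)
  names(fields) <- names(schema)
  # Assumption: names beginning with ".X" mark filler columns.
  fields[!grepl("^\\.X", names(fields))]
}

# Toy schema in the same style as mort.schema (widths in characters).
toy.schema <- c(.X0 = 2, Sex = 1, .X1 = 3, Age = 2)
print(unpack_line("zzMqqq42", toy.schema))
```

The same cumulative-sum trick gives the start and end column of every field, which is why checking that the widths sum to the documented line length is a useful first test.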
37. Conclusions
• RHadoop is a good way to scale out, but it may not be the
best way.
• RHadoop is still under rapid development, so watch out for
backward-compatibility issues.
• So far, SPN has no plan to adopt RHadoop for data
analysis.
• One R fan suggests that using Pig with R works better
than using RHadoop directly.
38. References
• RHadoop Wiki
– https://github.com/RevolutionAnalytics/RHadoop/wiki
• RHIPE
– http://www.datadr.org/
• RHadoop in Practice article series:
– http://blog.fens.me/series-rhadoop/
• 阿貝好威的實驗室 (blog)
– http://lab.howie.tw/2013/01/Big-Data-Analytic-Weka-vs-Mahout-vs-R.html
• R and Hadoop: A First Integration Experience
– http://michaelhsu.tw/2013/05/01/r-and-hadoop-%E5%88%9D%E9%AB%94%E9%A9%97/