RBelgiumStat'Rgy
Introduction and docs
Hadoop For Dummies - Dirk deRoos
Hadoop: The Definitive Guide - Tom White
RHadoop: make use of Hadoop framework from R
https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
Big Data Analytics with R and Hadoop - Vignesh Prajapati
Get started
Download the Cloudera QuickStart VM
http://www.cloudera.com/content/cloudera/en/documentation/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html
This VM runs
CentOS
CDH5.3
R 3.x
Java v1.7.x
Download RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
Debugging
Start with the local backend and use debug().
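A minimal sketch of this first step, assuming rmr2 is installed. The function name square.map and the toy input 1:10 are illustrative and not taken from the slides.

```r
# Sketch only: run a tiny job on the local backend and step through the mapper.
library(rmr2)

# The local backend runs the whole job inside the current R session, without Hadoop.
rmr.options(backend = "local")

# Hypothetical mapper: emit each value as key and its square as value.
square.map <- function(k, v) keyval(v, v^2)

# Flag the mapper for debugging; only useful in an interactive session,
# where the browser opens as soon as the mapper is called.
if (interactive()) debug(square.map)

out <- from.dfs(mapreduce(input = to.dfs(1:10), map = square.map))

if (interactive()) undebug(square.map)

str(out)
```

Once the mapper behaves as expected here, the same code can be rerun after changing only the backend option.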
Switch to the Hadoop backend, with Hadoop running in standalone mode. In standalone mode, R errors are reported
in the console, that is, in your regular R session. More information on setting up the different
Hadoop modes is available at
http://www.rdatamining.com/big-data/r-hadoop-setup-guide
Once your program runs on the Hadoop backend with Hadoop in standalone mode,
you are ready to switch to pseudo-distributed or distributed mode. debug() is not
available there.
In these two modes, to find R errors you have to dig through the logs, specifically
those called "userlogs". See
http://blog.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/
In parallel, grow the size of your test data sets. New bugs can show up with
larger files.
To print variable values, use rmr.str or cat(var1, …, varN, file=stderr()). See
https://github.com/RevolutionAnalytics/RHadoop/wiki/user-rmr-Debugging-rmr-programs
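The printing tip above can be sketched as follows, assuming rmr2 is installed. The mapper traced.map and its input are hypothetical. Messages go to stderr because the Hadoop backend relies on Hadoop streaming, where standard output carries the data itself.

```r
# Sketch only: trace variable values from inside a mapper.
library(rmr2)

# Local backend, so the messages appear directly in the R console.
rmr.options(backend = "local")

# Hypothetical mapper: key each value by its parity and log what it receives.
traced.map <- function(k, v) {
  # Plain message written to stderr, never to stdout.
  cat("map call, length(v) =", length(v), "\n", file = stderr())
  # rmr.str prints the structure of its argument, in the spirit of str().
  rmr.str(v)
  keyval(v %% 2, v)
}

out <- from.dfs(mapreduce(input = to.dfs(1:10), map = traced.map))
```

In pseudo-distributed or distributed mode, the same messages should end up in the task logs rather than in the console.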