RHive tutorial - apply functions and map/reduce
RHive connects R to Hive so that massive data can be processed with R's familiar syntax and idioms.
To that end, RHive provides several functions similar to R's apply family,
and supports writing map/reduce code as scripts, in much the same way as Hadoop streaming.
These functions and features will be expanded in future versions.
rhive.napply, rhive.sapply
rhive.napply and rhive.sapply are essentially the same function; they differ only in
whether the returned values are numeric or character.
The function passed to rhive.napply must return a numeric type, while the function
passed to rhive.sapply must return a character type.
Both functions take a table name, an R function to run for each record, and column
names as arguments. The number of column arguments that follow the "function to be
applied" must equal the number of arguments that function itself takes.
Under the surface, these functions use RHive's UDF support, so if you already know
how RHive's UDFs work, you will easily understand them as well.
The following examples use the two functions.
First, make a table for testing.
rhive.write.table('iris')
The following is an example of using rhive.napply.
rhive.napply('iris',
             function(column1) {
               column1 * 10
             },
             'sepallength')
[1] "iris_napply1323970435_table"

rhive.desc.table("iris_napply1323970435_table")
  col_name data_type comment
1      _c0    double
The following is an example of using rhive.sapply.
rhive.sapply('iris',
             function(column1) {
               as.character(column1 * 10)
             },
             'sepallength')
[1] "iris_sapply1323970891_table"

rhive.desc.table("iris_sapply1323970891_table")
  col_name data_type comment
1      _c0    string
Note that these functions do not return a data.frame; they return the name of a
temporary table created during processing.
The user should process and then delete these returned tables.
This is because it is generally impossible to pass massive data back through
standard output or a data.frame.
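To see what these functions compute, here is a plain base-R analogy that needs no Hive at all. This is only an illustration, not the RHive API: R's own sapply over the iris dataset produces the same per-record values that rhive.napply/rhive.sapply would write into their result tables.

```r
# Base-R analogy of the rhive.napply example above (no Hive needed).
# rhive.napply('iris', function(column1) column1 * 10, 'sepallength')
# computes, per record, the same values as:
napply_local <- sapply(iris$Sepal.Length, function(column1) column1 * 10)

# rhive.sapply additionally converts each result to character:
sapply_local <- sapply(iris$Sepal.Length,
                       function(column1) as.character(column1 * 10))

head(napply_local)  # numeric values: Sepal.Length * 10
head(sapply_local)  # the same values as strings
```

The difference in the resulting column type (double vs. string in the rhive.desc.table output above) mirrors the numeric vs. character return values here.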
rhive.mrapply, rhive.mapapply, rhive.reduceapply
These functions have names similar to the ones above, but they actually run
Hadoop map/reduce in a form that resembles Hadoop streaming.
You can use them to implement the wordcount frequently seen in Hadoop streaming
examples, and users who wish to write code in the traditional map/reduce style
will need these functions.
They are very easy to use.
rhive.mapapply takes a table and columns as arguments and runs the supplied R
function as the map step only.
rhive.reduceapply performs only the reduce step, and rhive.mrapply performs both
map and reduce.
Use rhive.mapapply if you only need the map procedure, rhive.reduceapply if you
only need the reduce procedure, and rhive.mrapply if you need both.
You'll probably find yourself using rhive.mrapply more often than not.
The following is a wordcount example using rhive.mrapply.
First, let's make a dataset to apply wordcount to.
We'll use a text web browser called lynx to crawl the R introduction page and
save the text to a Hive table.
First install lynx.
If you already have some other text file and prefer to use it rather than
installing lynx, that is fine as well.
yum install lynx
Save the downloaded file to a Hive table. Open R:

system("lynx --dump http://cran.r-project.org/doc/manuals/R-intro.html > /tmp/r-intro.txt")
rintro <- readLines("/tmp/r-intro.txt")
unlink("/tmp/r-intro.txt")
rintro <- data.frame(rintro)
colnames(rintro) <- c("rawtext")
rhive.write.table(rintro)
[1] "rintro"

rhive.desc.table("rintro")
  col_name data_type comment
1  rowname    string
2  rawtext    string
The RHive code that performs a wordcount on the "rintro" table is as follows:
map <- function(key, value) {
  if (is.null(value)) {
    put(NA, 1)
  }
  lapply(value, function(v) {
    lapply(strsplit(x = v, split = " ")[[1]],
           function(word) put(word, 1))
  })
}

reduce <- function(key, values) {
  put(key, sum(as.numeric(values)))
}

result <- rhive.mrapply("rintro", map, reduce,
                        c("rowname", "rawtext"), c("word", "one"),
                        by = "word",
                        c("word", "one"), c("word", "count"))
head(result)
          word count
1              26927
2            "     1
3        "%!%"     1
4         "+")     1
5          "."     1
6  ".GlobalEnv"     3
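The logic of the map and reduce functions above can be understood without Hive. Below is a small, local simulation in plain R, written only for illustration: the put function here is a hypothetical stand-in for RHive's emit/collector function, and the "shuffle" step that Hive performs between map and reduce is simulated with split.

```r
# A local, Hive-free simulation of the wordcount map/reduce above.
# 'put' is a hypothetical stand-in for RHive's collector function.
emitted <- list()
put <- function(key, value) {
  emitted[[length(emitted) + 1]] <<- list(key = as.character(key), value = value)
}

# Same body as the tutorial's map function (minus the NULL guard).
map_local <- function(key, value) {
  lapply(value, function(v) {
    lapply(strsplit(x = v, split = " ")[[1]],
           function(word) put(word, 1))
  })
}

# Map phase: emit (word, 1) for every word in the input lines.
map_local(NULL, c("to be or not", "to be"))

# Shuffle phase: group the emitted values by key (done by Hive in RHive).
keys    <- sapply(emitted, function(p) p$key)
values  <- sapply(emitted, function(p) p$value)
grouped <- split(values, keys)

# Reduce phase: the same sum as the tutorial's reduce function.
result_local <- sapply(grouped, function(vals) sum(as.numeric(vals)))
result_local  # counts per word: be 2, not 1, or 1, to 2
```

In the real rhive.mrapply call, the grouping is what the by="word" argument controls; here it is made explicit with split.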
The above example is similar to map/reduce code in Hadoop streaming style, but the
rhive.mrapply call in its last part is probably unfamiliar to users.
RHive processes map/reduce through Hive.
Thus the input must be given as a table name and columns.
And because it is quite difficult to automatically determine the inputs and
outputs of each map and reduce step, the user must specify them, along with the
aliases to be used for each.
Take a look at the last part of the above example again.

rhive.mrapply("rintro", map, reduce, c("rowname", "rawtext"), c("word","one"),
              by="word", c("word","one"), c("word","count"))
The first argument, "rintro", is the name of the table to be processed.
The map and reduce that follow are the R functions that process the map and
reduce steps, respectively. The c("rowname", "rawtext") after that names the
columns of the "rintro" table passed as arguments to the map function: the first
value is the column used as the key, the second the column used as the value.
The fifth argument, c("word", "one"), describes map's output, and the sixth,
by="word", names the column in that output to aggregate on. The seventh and
eighth are the input and output of reduce, respectively.
So many arguments may cause confusion, but they are all necessary for Hive-based
map/reduce. This function will see many improvements in the future.
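As a recap, here is the same call once more with each argument annotated, following the explanation above:

```r
result <- rhive.mrapply(
  "rintro",                 # 1: table to process
  map,                      # 2: R function for the map step
  reduce,                   # 3: R function for the reduce step
  c("rowname", "rawtext"),  # 4: input columns for map (key column, value column)
  c("word", "one"),         # 5: aliases for map's output columns
  by = "word",              # 6: column of map's output to aggregate on
  c("word", "one"),         # 7: input columns for reduce
  c("word", "count")        # 8: aliases for reduce's output columns
)
```

This requires a running Hive connection, so it is shown here only as an annotated restatement of the example, not as standalone runnable code.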
Under the surface, rhive.mrapply actually performs its map/reduce-related tasks
by generating Hive SQL.
This means the user does not have to use rhive.mrapply to process the above
example; the user can instead work directly with the functions RHive provides
and the Hive table.
But that approach involves unfamiliar and difficult syntax as well, which is why
RHive provides functions such as these.
rhive.mapapply and rhive.reduceapply are used in almost the same way as
rhive.mrapply, so this tutorial omits explaining them.
If you already have a map/reduce module written with Hadoop streaming or the
Hadoop library and wish to convert it to RHive, the conversion process may
involve many differences.
RHive does not replace Hadoop streaming, nor does it replace the Hadoop library;
it is merely a convenient aid for helping R users approach Hadoop and Hive.
If you need to take a high-performance map/reduce module or a binary executable
written in C/C++ (or similar) and run it as map/reduce, then the Hadoop library
or Hadoop streaming may yet be the better choice.