RHive tutorial - apply functions and map/reduce

RHive supports the use of R's syntax and characteristics by connecting to
Hive for massive data processing.
Accordingly, RHive provides several functions similar to the apply family of
functions in R, and supports writing map/reduce code as R scripts, in a way
similar to Hadoop streaming.
These functions and features will be expanded in future versions.
rhive.napply, rhive.sapply
rhive.napply and rhive.sapply are essentially the same function; their only
difference is whether the returned value is of numeric or character type.
The function passed to rhive.napply must return a numeric type, while the
function passed to rhive.sapply must return a character type.
Both functions take a table name, an R function to run for each record, and
the columns to pass to that function.
In rhive.napply/rhive.sapply, the number of column arguments listed after the
function to be applied must equal the number of arguments that function
itself accepts.
Under the surface, these functions use RHive's UDF support feature, so if you
already know how RHive's UDF support works, you will easily understand these
as well.
The following examples use the two functions.
First, make a table for testing.

rhive.write.table('iris')
The following is an example of using rhive.napply.
rhive.napply('iris',
  function(column1) {
    column1 * 10
  },
  'sepallength')
[1] "iris_napply1323970435_table"

rhive.desc.table("iris_napply1323970435_table")
  col_name data_type comment
1      _c0    double
The following is an example of using rhive.sapply.
rhive.sapply('iris',
  function(column1) {
    as.character(column1 * 10)
  },
  'sepallength')
[1] "iris_sapply1323970891_table"

rhive.desc.table("iris_sapply1323970891_table")
  col_name data_type comment
1      _c0    string
Note that these functions do not return a data.frame; they return the name of
a temporary table created internally while they run.
The user should process and then delete the returned tables that the
functions produce as their results.
This is because it is generally impossible to return massive data through
standard output or a data.frame.
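For example, a typical follow-up looks like the sketch below. It assumes an
established RHive connection and uses rhive.query and rhive.drop.table, which
RHive provides for running HiveQL and dropping tables; adjust the names to
your RHive version if they differ.

tmp <- rhive.napply('iris',
  function(column1) { column1 * 10 },
  'sepallength')

# Pull back only what is needed (here, a small sample) rather than
# fetching the entire result.
sample <- rhive.query(paste("SELECT * FROM", tmp, "LIMIT 10"))

# Clean up the temporary table once it has been processed.
rhive.drop.table(tmp)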
rhive.mrapply, rhive.mapapply, rhive.reduceapply
These functions have names similar to the ones mentioned above, but they
actually expose Hadoop's map/reduce in a form that resembles Hadoop
streaming.
You can use them to implement wordcount, which is frequently seen in Hadoop
streaming examples. Users who wish to write code in traditional map/reduce
style will need these functions.
They are very easy to use.
rhive.mapapply takes a table and columns as arguments and runs the R function
passed to it against them, but only as the map step.
rhive.reduceapply only performs a reduce, and rhive.mrapply performs both map
and reduce.
Use rhive.mapapply if you are making something that only requires the map
procedure, rhive.reduceapply if you only require the reduce procedure, and
rhive.mrapply if you need both.
You'll probably find yourself using rhive.mrapply more often than not.
The following is a wordcount example using rhive.mrapply.
First let's make a dataset for applying wordcount.
We'll use a text web browser called lynx to crawl the R introduction page and
save it to a Hive table.
First install lynx.
If you prefer to use a different text file rather than installing lynx, that
is fine as well.
yum install lynx
Save the downloaded file to a Hive table.
Open R:

system("lynx --dump http://cran.r-project.org/doc/manuals/R-intro.html > /tmp/r-intro.txt")
rintro <- readLines("/tmp/r-intro.txt")
unlink("/tmp/r-intro.txt")
rintro <- data.frame(rintro)
colnames(rintro) <- c("rawtext")
rhive.write.table(rintro)
[1] "rintro"

rhive.desc.table("rintro")
  col_name data_type comment
1  rowname    string
2  rawtext    string
The RHive code that performs a wordcount on the "rintro" table is as follows:
map <- function(key, value) {
  if(is.null(value)) {
    put(NA, 1)
  }
  lapply(value, function(v) {
    lapply(strsplit(x=v, split=" ")[[1]], function(word)
      put(word, 1))
  })
}

reduce <- function(key, values) {
  put(key, sum(as.numeric(values)))
}

result <- rhive.mrapply("rintro", map, reduce,
  c("rowname", "rawtext"), c("word","one"), by="word",
  c("word","one"), c("word","count"))
head(result)
  
          word count
1              26927
2            "     1
3        "%!%"     1
4         "+")     1
5          "."     1
6 ".GlobalEnv"     3
The above example is similar to map/reduce code in Hadoop streaming style,
but the rhive.mrapply call in the last part of the example is probably
unfamiliar to users.
RHive processes map/reduce by way of Hive.
Thus the input must be given as a table name and columns as its basic input
parameters.
And because it is quite difficult to automatically determine the inputs and
outputs of each map and reduce step, the user must specify which ones are the
inputs and outputs and what names will be used as their aliases.
Take a look at the last paragraph of the above example again.

rhive.mrapply("rintro", map, reduce, c("rowname", "rawtext"), c("word","one"),
by="word", c("word","one"), c("word","count"))
The first argument, "rintro", is the name of the table to be processed.
The map and reduce that come after it are the R functions that process the
map and reduce steps, respectively. The c("rowname", "rawtext") after that
lists the "rintro" table's columns to be passed as arguments to the map
function: the first value is the column used as the key, and the second is
the column used as the value.
The fifth argument, c("word", "one"), describes map's output, and the sixth
argument, by="word", names the column to aggregate on among map's output.
The seventh and the eighth are the input and output of reduce, respectively.
So many arguments can cause confusion, but they are all necessary for Hive's
map/reduce.
This function will see many improvements in the future.
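Mapped onto the wordcount call above, the arguments line up like this (a
commented restatement of the same call, not new functionality; the comments
reflect the argument roles as described in this tutorial):

result <- rhive.mrapply(
  "rintro",                 # table to process
  map,                      # R function for the map step
  reduce,                   # R function for the reduce step
  c("rowname", "rawtext"),  # input columns for map: key, then value
  c("word", "one"),         # output columns (aliases) of map
  by = "word",              # column to aggregate on between map and reduce
  c("word", "one"),         # input columns for reduce
  c("word", "count"))       # output columns (aliases) of reduce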
Under the surface, rhive.mrapply actually processes map/reduce-related tasks
by generating Hive SQL.
This means the user does not have to use rhive.mrapply to process the above
example; the functions RHive provides can be used directly against Hive
tables instead.
But that, too, is an unfamiliar and difficult syntax, which is why RHive came
to support functions such as these.
rhive.mapapply and rhive.reduceapply are used in almost the same way as
rhive.mrapply, so this tutorial omits explaining them.
If you already have a map/reduce module written with Hadoop streaming or the
Hadoop library and wish to convert it to RHive, you may encounter many
differences during the conversion process.
RHive does not replace Hadoop streaming, nor does it replace the Hadoop
library; it is merely a convenient means of helping R users approach Hadoop
and Hive. Should you be attempting to take a high-performance map/reduce
module or a binary executable written in C/C++ (or another language) and run
it as map/reduce, then the Hadoop library or Hadoop streaming may yet be the
better choice.

Contenu connexe

Tendances

Hive Functions Cheat Sheet
Hive Functions Cheat SheetHive Functions Cheat Sheet
Hive Functions Cheat SheetHortonworks
 
Apache Scoop - Import with Append mode and Last Modified mode
Apache Scoop - Import with Append mode and Last Modified mode Apache Scoop - Import with Append mode and Last Modified mode
Apache Scoop - Import with Append mode and Last Modified mode Rupak Roy
 
Stata cheat sheet analysis
Stata cheat sheet analysisStata cheat sheet analysis
Stata cheat sheet analysisTim Essam
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in RJeffrey Breen
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat SheetLaura Hughes
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply FunctionSakthi Dasans
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformationTim Essam
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageJanuary 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageZurich_R_User_Group
 
Stata Cheat Sheets (all)
Stata Cheat Sheets (all)Stata Cheat Sheets (all)
Stata Cheat Sheets (all)Laura Hughes
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformationLaura Hughes
 
Presentation on use of r statistics
Presentation on use of r statisticsPresentation on use of r statistics
Presentation on use of r statisticsKrishna Dhakal
 
Export Data using R Studio
Export Data using R StudioExport Data using R Studio
Export Data using R StudioRupak Roy
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in RFlorian Uhlitz
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON Padma shree. T
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoopDavid Chiu
 

Tendances (20)

Hive Functions Cheat Sheet
Hive Functions Cheat SheetHive Functions Cheat Sheet
Hive Functions Cheat Sheet
 
Apache Scoop - Import with Append mode and Last Modified mode
Apache Scoop - Import with Append mode and Last Modified mode Apache Scoop - Import with Append mode and Last Modified mode
Apache Scoop - Import with Append mode and Last Modified mode
 
Stata cheat sheet analysis
Stata cheat sheet analysisStata cheat sheet analysis
Stata cheat sheet analysis
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformation
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageJanuary 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
 
Stata Cheat Sheets (all)
Stata Cheat Sheets (all)Stata Cheat Sheets (all)
Stata Cheat Sheets (all)
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformation
 
Data Management in Python
Data Management in PythonData Management in Python
Data Management in Python
 
Presentation on use of r statistics
Presentation on use of r statisticsPresentation on use of r statistics
Presentation on use of r statistics
 
Export Data using R Studio
Export Data using R StudioExport Data using R Studio
Export Data using R Studio
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
 
Sparklyr
SparklyrSparklyr
Sparklyr
 
R language introduction
R language introductionR language introduction
R language introduction
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON
 
Unit 2
Unit 2Unit 2
Unit 2
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoop
 

En vedette

RHive tutorials - Basic functions
RHive tutorials - Basic functionsRHive tutorials - Basic functions
RHive tutorials - Basic functionsAiden Seonghak Hong
 
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정Aiden Seonghak Hong
 
R hive tutorial supplement 3 - Rstudio-server setup for rhive
R hive tutorial supplement 3 - Rstudio-server setup for rhiveR hive tutorial supplement 3 - Rstudio-server setup for rhive
R hive tutorial supplement 3 - Rstudio-server setup for rhiveAiden Seonghak Hong
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 

En vedette (7)

RHive tutorials - Basic functions
RHive tutorials - Basic functionsRHive tutorials - Basic functions
RHive tutorials - Basic functions
 
RHive tutorial - HDFS functions
RHive tutorial - HDFS functionsRHive tutorial - HDFS functions
RHive tutorial - HDFS functions
 
RHive tutorial - Installation
RHive tutorial - InstallationRHive tutorial - Installation
RHive tutorial - Installation
 
RHadoop, R meets Hadoop
RHadoop, R meets HadoopRHadoop, R meets Hadoop
RHadoop, R meets Hadoop
 
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
RHive tutorial 1: RHive 튜토리얼 1 - 설치 및 설정
 
R hive tutorial supplement 3 - Rstudio-server setup for rhive
R hive tutorial supplement 3 - Rstudio-server setup for rhiveR hive tutorial supplement 3 - Rstudio-server setup for rhive
R hive tutorial supplement 3 - Rstudio-server setup for rhive
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 

Similaire à R hive tutorial - apply functions and map reduce

Get started with R lang
Get started with R langGet started with R lang
Get started with R langsenthil0809
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)Revolution Analytics
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Cloudera, Inc.
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
map reduce Technic in big data
map reduce Technic in big data map reduce Technic in big data
map reduce Technic in big data Jay Nagar
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Guy Lebanon
 
Map Reduce
Map ReduceMap Reduce
Map Reduceschapht
 

Similaire à R hive tutorial - apply functions and map reduce (20)

Unit 3
Unit 3Unit 3
Unit 3
 
Get started with R lang
Get started with R langGet started with R lang
Get started with R lang
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
Introduction to R for beginners
Introduction to R for beginnersIntroduction to R for beginners
Introduction to R for beginners
 
MapReduce-Notes.pdf
MapReduce-Notes.pdfMapReduce-Notes.pdf
MapReduce-Notes.pdf
 
Lecture_R.ppt
Lecture_R.pptLecture_R.ppt
Lecture_R.ppt
 
Data analytics with R
Data analytics with RData analytics with R
Data analytics with R
 
Inroduction to r
Inroduction to rInroduction to r
Inroduction to r
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 
Special topics in finance lecture 2
Special topics in finance   lecture 2Special topics in finance   lecture 2
Special topics in finance lecture 2
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
map reduce Technic in big data
map reduce Technic in big data map reduce Technic in big data
map reduce Technic in big data
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
Data Analysis with R (combined slides)
Data Analysis with R (combined slides)Data Analysis with R (combined slides)
Data Analysis with R (combined slides)
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
R basics
R basicsR basics
R basics
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 

Plus de Aiden Seonghak Hong

RHive tutorial supplement 3: RHive 튜토리얼 부록 3 - RStudio 설치
RHive tutorial supplement 3: RHive 튜토리얼 부록 3 - RStudio 설치RHive tutorial supplement 3: RHive 튜토리얼 부록 3 - RStudio 설치
RHive tutorial supplement 3: RHive 튜토리얼 부록 3 - RStudio 설치Aiden Seonghak Hong
 
RHive tutorial supplement 2: RHive 튜토리얼 부록 2 - Hive 설치
RHive tutorial supplement 2: RHive 튜토리얼 부록 2 - Hive 설치RHive tutorial supplement 2: RHive 튜토리얼 부록 2 - Hive 설치
RHive tutorial supplement 2: RHive 튜토리얼 부록 2 - Hive 설치Aiden Seonghak Hong
 
RHive tutorial supplement 1: RHive 튜토리얼 부록 1 - Hadoop 설치
RHive tutorial supplement 1: RHive 튜토리얼 부록 1 - Hadoop 설치RHive tutorial supplement 1: RHive 튜토리얼 부록 1 - Hadoop 설치
RHive tutorial supplement 1: RHive 튜토리얼 부록 1 - Hadoop 설치Aiden Seonghak Hong
 
RHive tutorial 5: RHive 튜토리얼 5 - apply 함수와 맵리듀스
RHive tutorial 5: RHive 튜토리얼 5 - apply 함수와 맵리듀스RHive tutorial 5: RHive 튜토리얼 5 - apply 함수와 맵리듀스
RHive tutorial 5: RHive 튜토리얼 5 - apply 함수와 맵리듀스Aiden Seonghak Hong
 
RHive tutorial 4: RHive 튜토리얼 4 - UDF, UDTF, UDAF 함수
RHive tutorial 4: RHive 튜토리얼 4 - UDF, UDTF, UDAF 함수RHive tutorial 4: RHive 튜토리얼 4 - UDF, UDTF, UDAF 함수
RHive tutorial 4: RHive 튜토리얼 4 - UDF, UDTF, UDAF 함수Aiden Seonghak Hong
 
RHive tutorial 3: RHive 튜토리얼 3 - HDFS 함수
RHive tutorial 3: RHive 튜토리얼 3 - HDFS 함수RHive tutorial 3: RHive 튜토리얼 3 - HDFS 함수
RHive tutorial 3: RHive 튜토리얼 3 - HDFS 함수Aiden Seonghak Hong
 
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수Aiden Seonghak Hong
 
R hive tutorial supplement 2 - Installing Hive
R hive tutorial supplement 2 - Installing HiveR hive tutorial supplement 2 - Installing Hive
R hive tutorial supplement 2 - Installing HiveAiden Seonghak Hong
 
R hive tutorial supplement 1 - Installing Hadoop
R hive tutorial supplement 1 - Installing HadoopR hive tutorial supplement 1 - Installing Hadoop
R hive tutorial supplement 1 - Installing HadoopAiden Seonghak Hong
 

Plus de Aiden Seonghak Hong (11)

IoT and Big data with R
IoT and Big data with RIoT and Big data with R
IoT and Big data with R
 
RHive tutorial supplement 3: RHive 튜토리얼 부록 3 - RStudio 설치
RHive tutorial supplement 3: RHive 튜토리얼 부록 3 - RStudio 설치RHive tutorial supplement 3: RHive 튜토리얼 부록 3 - RStudio 설치
RHive tutorial supplement 3: RHive 튜토리얼 부록 3 - RStudio 설치
 
RHive tutorial supplement 2: RHive 튜토리얼 부록 2 - Hive 설치
RHive tutorial supplement 2: RHive 튜토리얼 부록 2 - Hive 설치RHive tutorial supplement 2: RHive 튜토리얼 부록 2 - Hive 설치
RHive tutorial supplement 2: RHive 튜토리얼 부록 2 - Hive 설치
 
RHive tutorial supplement 1: RHive 튜토리얼 부록 1 - Hadoop 설치
RHive tutorial supplement 1: RHive 튜토리얼 부록 1 - Hadoop 설치RHive tutorial supplement 1: RHive 튜토리얼 부록 1 - Hadoop 설치
RHive tutorial supplement 1: RHive 튜토리얼 부록 1 - Hadoop 설치
 
RHive tutorial 5: RHive 튜토리얼 5 - apply 함수와 맵리듀스
RHive tutorial 5: RHive 튜토리얼 5 - apply 함수와 맵리듀스RHive tutorial 5: RHive 튜토리얼 5 - apply 함수와 맵리듀스
RHive tutorial 5: RHive 튜토리얼 5 - apply 함수와 맵리듀스
 
RHive tutorial 4: RHive 튜토리얼 4 - UDF, UDTF, UDAF 함수
RHive tutorial 4: RHive 튜토리얼 4 - UDF, UDTF, UDAF 함수RHive tutorial 4: RHive 튜토리얼 4 - UDF, UDTF, UDAF 함수
RHive tutorial 4: RHive 튜토리얼 4 - UDF, UDTF, UDAF 함수
 
RHive tutorial 3: RHive 튜토리얼 3 - HDFS 함수
RHive tutorial 3: RHive 튜토리얼 3 - HDFS 함수RHive tutorial 3: RHive 튜토리얼 3 - HDFS 함수
RHive tutorial 3: RHive 튜토리얼 3 - HDFS 함수
 
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
RHive tutorial 2: RHive 튜토리얼 2 - 기본 함수
 
R hive tutorial 1
R hive tutorial 1R hive tutorial 1
R hive tutorial 1
 
R hive tutorial supplement 2 - Installing Hive
R hive tutorial supplement 2 - Installing HiveR hive tutorial supplement 2 - Installing Hive
R hive tutorial supplement 2 - Installing Hive
 
R hive tutorial supplement 1 - Installing Hadoop
R hive tutorial supplement 1 - Installing HadoopR hive tutorial supplement 1 - Installing Hadoop
R hive tutorial supplement 1 - Installing Hadoop
 

Dernier

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Dernier (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
RHive tutorial - apply functions and map/reduce
RHive lets you keep using R's syntax and idioms while delegating massive data processing to Hive. To that end, RHive provides several functions similar to R's apply family, and supports writing map/reduce code in scripts, much as you would with Hadoop streaming. These functions and features will be expanded in future versions.

rhive.napply, rhive.sapply

rhive.napply and rhive.sapply are essentially the same function; the only difference is whether the returned value is numeric or character. The function passed to rhive.napply must return a numeric type, while the function passed to rhive.sapply must return a character type.

Both functions take a table name, an R function to run for each record, and the columns to pass to that function. The number of column arguments listed after the "function to be applied" must equal the number of arguments that function itself accepts.

Under the surface, these functions use RHive's UDF support, so if you already know how RHive's UDFs work, this will be easy to understand.

The following examples use the two functions. First, make a table for testing:

```r
rhive.write.table('iris')
```

The following is an example of using rhive.napply:

```r
rhive.napply('iris', function(column1) {
  column1 * 10
}, 'sepallength')
[1] "iris_napply1323970435_table"

rhive.desc.table("iris_napply1323970435_table")
  col_name data_type comment
1      _c0    double
```
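To illustrate the rule that the columns listed after the function must match its arity, a function of two arguments would take two column names. This is a sketch, not from the original tutorial; it assumes the 'sepallength' and 'sepalwidth' columns created by rhive.write.table('iris') above:

```r
# Sketch: a two-argument function requires exactly two column names.
# Column names are assumed from the 'iris' table written above.
rhive.napply('iris', function(len, wid) {
  len * wid          # one numeric value per record
},
'sepallength', 'sepalwidth')
```

As with the single-column case, the call returns the name of a temporary result table rather than the data itself.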
The following is an example of using rhive.sapply:

```r
rhive.sapply('iris', function(column1) {
  as.character(column1 * 10)
}, 'sepallength')
[1] "iris_sapply1323970891_table"

rhive.desc.table("iris_sapply1323970891_table")
  col_name data_type comment
1      _c0    string
```

Note that these functions do not return a data.frame; they return the name of a temporary table created by the function itself. The user should process and then delete the returned tables. This is because it is generally impossible to pass massive data back through standard output or a data.frame.

rhive.mrapply, rhive.mapapply, rhive.reduceapply

These functions have names similar to the ones above, but they expose Hadoop's map/reduce in a form that resembles Hadoop streaming. You can use them to implement the word count frequently seen in Hadoop streaming examples. Users who wish to write code in traditional map/reduce style will need these functions, and they are very easy to use.

rhive.mapapply takes a table and columns as input and runs the function passed as an argument as the map step only. rhive.reduceapply performs only a reduce.
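Since rhive.napply/rhive.sapply return a temporary table name, a typical follow-up is to query what you need from that table and then drop it. The sketch below uses rhive.query and rhive.drop.table from the RHive API; exact behavior may vary by version:

```r
# Sketch: capture the temporary table name, fetch a small aggregate,
# then remove the table so it does not accumulate in Hive.
tmp <- rhive.napply('iris', function(x) x * 10, 'sepallength')
res <- rhive.query(paste("SELECT COUNT(*) FROM", tmp))
rhive.drop.table(tmp)   # clean up the temporary result table
```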
And rhive.mrapply performs both map and reduce. Use rhive.mapapply if you only need the map step, rhive.reduceapply if you only need the reduce step, and rhive.mrapply if you need both. You'll probably find yourself using rhive.mrapply more often than not.

The following is a word-count example using rhive.mrapply. First, let's make a dataset to apply the word count to. We'll use a text web browser called lynx to download the R introduction page and save it to a Hive table.

First install lynx. If you have a different text file and prefer it to installing lynx, that is fine as well:

```
yum install lynx
```

Save the downloaded file to a Hive table. In R:

```r
system("lynx --dump http://cran.r-project.org/doc/manuals/R-intro.html > /tmp/r-intro.txt")
rintro <- readLines("/tmp/r-intro.txt")
unlink("/tmp/r-intro.txt")
rintro <- data.frame(rintro)
colnames(rintro) <- c("rawtext")

rhive.write.table(rintro)
[1] "rintro"

rhive.desc.table("rintro")
  col_name data_type comment
1  rowname    string
2  rawtext    string
```

The RHive code that performs a word count on the "rintro" table is as follows:
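Before running map/reduce, it can be reassuring to sanity-check the loaded table. This quick query is a sketch (not from the original tutorial), using rhive.query against the column names created above:

```r
# Peek at the first few rows of 'rintro' to confirm the load succeeded.
rhive.query("SELECT rowname, rawtext FROM rintro LIMIT 5")
```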
```r
map <- function(key, value) {
  if (is.null(value)) {
    put(NA, 1)
  }
  lapply(value, function(v) {
    lapply(strsplit(x = v, split = " ")[[1]], function(word)
      put(word, 1))
  })
}

reduce <- function(key, values) {
  put(key, sum(as.numeric(values)))
}

result <- rhive.mrapply("rintro", map, reduce,
                        c("rowname", "rawtext"), c("word", "one"),
                        by = "word", c("word", "one"), c("word", "count"))
head(result)
          word count
1              26927
2            "     1
3        "%!%"     1
4          "+")     1
5          "."     1
6  ".GlobalEnv"     3
```

The above example is similar to map/reduce code written in Hadoop streaming style, but the rhive.mrapply call at the end is probably unfamiliar to users. RHive processes map/reduce through Hive, so the input must be given as a table name and columns. And because it is quite difficult to infer the inputs and outputs of the map and reduce steps automatically, the user must state what they are and which names to use as aliases.

Take a look at the last part of the above example again:

```r
rhive.mrapply("rintro", map, reduce, c("rowname", "rawtext"),
              c("word", "one"), by = "word", c("word", "one"),
              c("word", "count"))
```
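The word-count logic above can be emulated in plain R on a small character vector, which may help clarify what the map and reduce steps each contribute. This local sketch is an illustration only: it replaces RHive's put() with ordinary R accumulation and does not touch Hive:

```r
# Local, RHive-free emulation of the word count:
# the "map" step splits each line into words (each worth 1),
# the "reduce" step sums the 1s per distinct word.
lines <- c("to be or not", "to be")

# map: one entry per word occurrence
words <- unlist(lapply(lines, function(v) strsplit(v, split = " ")[[1]]))

# reduce: sum the per-occurrence counts, grouped by word
counts <- tapply(rep(1, length(words)), words, sum)

counts[["to"]]   # 2
counts[["be"]]   # 2
counts[["not"]]  # 1
```

In RHive, put() plays the role of emitting a (key, value) pair from inside the map or reduce function, and the grouping that tapply does locally is performed by Hive between the two steps.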
The first argument, "rintro", is the name of the table to be processed. The map and reduce that follow are symbols for the R functions that process the map and reduce steps, respectively. The c("rowname", "rawtext") after them names the columns of the "rintro" table passed to the map function: the first value is the column used as the key, the second the column used as the value. The fifth argument, c("word", "one"), describes map's output, and the sixth, by="word", names the column to aggregate on among map's outputs. The seventh and eighth arguments are, respectively, the input and output of reduce.

So many arguments may cause confusion, but they are necessary for Hive's map/reduce. This function will see many improvements in the future.

Under the surface, rhive.mrapply processes map/reduce tasks by generating Hive SQL. This means the user does not have to use rhive.mrapply for the examples above: you can work directly with the functions RHive provides and with Hive tables. But that, too, involves an unfamiliar and difficult syntax, which is why RHive came to provide these functions. rhive.mapapply and rhive.reduceapply are used in almost the same way as rhive.mrapply, so this tutorial omits explaining them.

If you already have a map/reduce module written with Hadoop streaming or the Hadoop library and wish to convert it to RHive, there may be many differences in the conversion process. RHive does not replace Hadoop streaming, nor does it replace the Hadoop library; it is merely a convenient measure to help R users approach Hadoop and Hive. If you need to take a high-performance map/reduce module or a binary executable written in C/C++ (or another language), turn it into map/reduce, and run it, then the Hadoop library or Hadoop streaming may yet be the better choice.