SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
RHive : Integrating R and Hive
                             Introduction



                                                     JunHo Cho
                                           Data Analysis Platform Team




Friday, November 11, 11
Analysis of Data




Friday, November 11, 11
Analysis of Data




                                                            CF      Classifier
                                                                                 Decision Tree
                    MapReduce                      Recommendation
                                                                                Graph
                                                      Clustering




Friday, November 11, 11
Related Works


                    •     RHIPE

                    •     RHadoop
                                                                      duce
                                                                  R e
                    •     hive (Hadoop InteractiVE)
                                                               Map
                                                           and
                    •     seuge
                                                      r st
                                               un de
                                        u st
                                       M



Friday, November 11, 11
RHive is inspired by ...

                    •     Many analysts have been used R for a long time

                    •     Many analysts can use SQL language

                    •     There are already a lot of statistical functions in R

                    •     R needs a capability to analyze big data

                    •     Hive supports SQL-like query language (HQL)

                    •     Hive supports MapReduce to execute HQL




                          R is the best solution for familiarity
                          Hive is the best solution for capability


Friday, November 11, 11
RHive Components


                   •      Hadoop

                          •   store and analyze big data

                   •      Hive

                          •   use HQL instead of MapReduce programming

                   •      R

                          •   support friendly environment to analysts




Friday, November 11, 11
RHive - Architecture
                   Execute R Function Objects and R Objects
                   through Hive Query

                 Execute Hive Query through R
                                                                              rcal <- function(arg1,arg2) {
                                       SELECT R(‘rcal’,col1,col2)                 coeff * sum(arg1,arg2)
                                       from tab1                              }



                                                                                     RServe


                                           01010100101   01010100101   01010100101
                                           01010010101   01010010101   01010010101
                                           01001010101   01001010101   01001010101
                                           10101000111   10101000111   10101000111




                                                      R Function          R Object         RUDF           RUDAF




Friday, November 11, 11
RHive API

                •         Extension R Functions
                      •     rhive.connect       •   rhive.napply        •   rhive.load.table
                      •     rhive.query         •   rhive.sapply        •   rhive.desc.table
                      •     rhive.assign        •   rhive.aggregate
                      •     rhive.export        •   rhive.list.tables




                •         Extension Hive Functions
                      •     RUDF

                      •     RUDAF

                      •     GenericUDTFExpand

                      •     GenericUDTFUnFold


Friday, November 11, 11
RUDF - R User-defined Functions

                   SELECT R(‘R function-name’,col1,col2,...,TYPE)
                    •     UDF doesn’t know return type until calling R function

                          •    TYPE : return type

                Example : R function which sums all passed columns


                sumCols <- function(arg1,...) {
                   sum(arg1,...)
                }
                rhive.assign(‘sumCols’,sumCols)
                rhive.exportAll(‘sumCols’,hadoop-clusters)
                result <- rhive.query(“SELECT R(‘sumCols’, col1, col2, col3, col4, 0.0) FROM tab”)
                plot(result)



Friday, November 11, 11
RUDAF - R User-defined Aggregation Function
                            SELECT RA(‘R function-name’,col1,col2,...)
                    •        R can not manipulate large dataset

                    •        Support UDAF’s life cycle

                           •    iterate, partial-merge, merge, terminate

                    •        Return type is only string delimited by ‘,’ - “data1,data2,data3,...”

                          partial aggregation                                                                  partial aggregation
                                                                 aggregation values



                    FUN                FUN.partial                                                   FUN.merge           FUN.terminate

                 01010100101   01010100101   01010100101   01010100101   01010100101   01010100101   01010100101
                 01010010101   01010010101   01010010101   01010010101   01010010101   01010010101   01010010101
                 01001010101   01001010101   01001010101   01001010101   01001010101   01001010101   01001010101
                 10101000111   10101000111   10101000111   10101000111   10101000111   10101000111   10101000111




Friday, November 11, 11
UDTF : unfold and expand
                    •     RUDAF only returns string delimited by ‘,’

                    •     Convert RUDAF’s result to R data.frame



                   unfold(string_value,type1,type2,...,delimiter)
                   expand(string_value,type,delimiter)

                  RA(‘newcenter’,...) return “num1,num2,num3” per cluster-key

                  select unfold(tb1.x,0.0,0.0,0.0,’,’) as (col1,col2,col3) from (select RA(‘newcenter’,
                  attr1,attr2,attr3,attr4) as x from table group by cluster-key




Friday, November 11, 11
napply and sapply

                   rhive.napply(table-name,FUN,col1,...)
                   rhive.sapply(table-name,FUN,col1,...)
                  •       napply : R apply function for Numeric type

                  •       sapply : R apply function for String type



                      Example : R function which sums all passed columns

                      sumCols <- function(arg1,...) {
                         sum(arg1,...)
                      }
                      result <- rhive.napply(“tab”, sumCols, col1, col2, col3, col4)
                      rhive.load.table(result)




Friday, November 11, 11
napply

         •       ‘napply’ is similar to R apply function

         •       Store big result to HDFS as Hive table


   rhive.napply       <- function(tablename, FUN, col = NULL, ...) {
         if(is.null(col))
              cols <- ""
         else
              cols <- paste(",",col)

         for(element in c(...)) {
              cols <- paste(cols,",",element)
         }

         exportname <- paste(tablename,"_sapply",as.integer(Sys.time()),sep="")

   !     rhive.assign(exportname,FUN)
   !     rhive.exportAll(exportname)
         tmptable <- paste(exportname,”_table”)
   !     rhive.query(
                   paste("CREATE TABLE ", tmptable," AS SELECT ","R('",exportname,"'",cols,",0.0) FROM ",tablename,sep=""))

   !     tmptable
   }




Friday, November 11, 11
aggregate

                 rhive.aggregate(table-name,hive-FUN,...,goups)

            •      RHive aggregation function to aggregate data stored in HDFS using HIVE Function




                  Example : Aggregate using SUM (Hive aggregation function)

                  result <- rhive.aggregate(“emp”, “SUM”, sal,groups=”deptno”)
                  rhive.load.table(result)




Friday, November 11, 11
Examples - predict flight delay
    library(RHive)
    rhive.connect()

    - Retrieve training set from large dataset stored in HDFS
    train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand())

    train$arrdelay <- as.numeric(train$arrdelay)

    train$distance <- as.numeric(train$distance)

    train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),]                   Native R code
    model <- lm(arrdelay ~ distance + dayofweek,data=train)

    - Export R object data
    rhive.assign("model", model)

    - Analyze big data using model calculated by R
    predict_table <- rhive.napply(“airlines”,function(arg1,arg2,arg3) {

            if(is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0)                HiveQuery + R code
            res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3))

            return(as.numeric(res)) }, ‘dayofweek’, ‘arrdelay’, ‘distance’)


Friday, November 11, 11
DEMO

Friday, November 11, 11
Conclusion

                    •     RHive supports HQL, not MapReduce model style

                    •     RHive allows analytics to do everything in R console

                    •     RHive interacts R data and HDFS data



                    •     Future & Current Works
                          •   Integrate Hadoop HDFS

                          •   Support Transform/Map-Reduce Scripts

                          •   Distributed Rserve

                          •   Support more R style API

                          •   Support machine learning algorithms (k-means, classifier, ...)


Friday, November 11, 11
Cooperators


                   •      JunHo Cho

                   •      Seonghak Hong

                   •      Choonghyun Ryu




                                      YO U !
Friday, November 11, 11
How to join RHive project


               •          Logo




               •          github (https://github.com/nexr/RHive)

               •          CRAN (http://cran.r-project.org/web/packages/RHive)

               •          Welcome to join RHive project




Friday, November 11, 11
References



              •       Recardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf)

              •       RHIPE (http://ml.stat.purdue.edu/rhipe)

              •       Hive (http://hive.apache.org)

              •       Parallels R by Q. Ethan McCallum and Stephen Weston




Friday, November 11, 11
jun.cho@nexr.com




Friday, November 11, 11
Appendix




Friday, November 11, 11
RHIPE

                •         the R and Hadoop Integrated Processing Environment

                •         Must understand the MapReduce model

            map <- function() {...}                                             shuffle / sort
            reduce <- function() {...}                                Mapper                    Reducer
            rmr <- rhmr(map,reduce,...)
                                                                                ProtocolBuf
                            R
                     Fork                                                R                        R
                          RHMR
                                                           R Objects (map)       R Objects (reduce)

                                R Objects (map, reduce)                        PersonalServer
                                R Conf


                                                           HDFS

Friday, November 11, 11
RHadoop
                  •       Manipulate Hadoop data stores and HBASE directly from R
                  •       Write MapReduce models in R using Hadoop Streaming
                  •       Must understand the MapReduce model


           map <- function() {...}
           reduce <- function() {...}
           mapreduce(input,output,map,reduce,...)


                              R
              rhbase           rhdfs          rmr
                                                           execute hadoop
                                                           streaming
                                                                              R                         R
                                                                                     Hadoop Streaming

                      manipulate                                                      shuffle / sort
                                       store R objecs as file                Mapper                    Reducer


                                   HBASE                                             HDFS

Friday, November 11, 11
hive(Hadoop InteractiVE)
                •         R extension facilitating distributed computing via the MapReduce
                          paradigm
                •         Provide an interface to Hadoop, HDFS and Hadoop Streaming
                •         Must understand the MapReduce model

           map <- function() {...}
           reduce <- function() {...}
           hive_stream(map,reduce,...)


                             R
                                                          execute hadoop     R                         R
               hive        DFS     hive_stream
                                                          streaming
                                                                                    Hadoop Streaming

           manipulate                                                                shuffle / sort
                                 save R script on local                    Mapper                    Reducer


                                                                 HDFS

Friday, November 11, 11
seuge

            •       Simple parallel processing with a fast and easy setup on Amazon’s WS.
            •       Parallel lapply function for the EMR engine using Hadoop streaming.
            •       Does not support MapReduce model but only Map model.



           data <- list(...)
           emrlapply(clusterObject,data,FUN,..)                               Amazon S3
                                                  upload R objects
                                  R

                          emrlapply   awsFunctions                     EMR              Hadoop Streaming
                                                                                                           R



                                                                                                     Mapper
                  save R objects (data + FUN) on local
                                                                       bootstrap (setup R)
                                                                       mapper.R




Friday, November 11, 11
Ricardo
                    •     Integrate R and Jaql (JSON Query Language)
                    •     Must know how to use uncommon query, Jaql
                    •     Not open-source




                                                                       Ref : Ricardo-SIGMOD10



Friday, November 11, 11

Contenu connexe

Tendances

Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Jeremy Walsh
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of HadoopAsif Ali
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDataWorks Summit
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
Large Scale Data Processing & Storage
Large Scale Data Processing & StorageLarge Scale Data Processing & Storage
Large Scale Data Processing & StorageIlayaraja P
 
Map Reduce Execution Architecture
Map Reduce Execution Architecture Map Reduce Execution Architecture
Map Reduce Execution Architecture Rupak Roy
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsZubair Nabi
 
Ecossistema Hadoop no Magazine Luiza
Ecossistema Hadoop no Magazine LuizaEcossistema Hadoop no Magazine Luiza
Ecossistema Hadoop no Magazine LuizaNelson Forte
 

Tendances (20)

Unit 2
Unit 2Unit 2
Unit 2
 
Apache pig
Apache pigApache pig
Apache pig
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
Enabling R on Hadoop
Enabling R on HadoopEnabling R on Hadoop
Enabling R on Hadoop
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of Hadoop
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache Giraph
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Large Scale Data Processing & Storage
Large Scale Data Processing & StorageLarge Scale Data Processing & Storage
Large Scale Data Processing & Storage
 
Map Reduce Execution Architecture
Map Reduce Execution Architecture Map Reduce Execution Architecture
Map Reduce Execution Architecture
 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
Ecossistema Hadoop no Magazine Luiza
Ecossistema Hadoop no Magazine LuizaEcossistema Hadoop no Magazine Luiza
Ecossistema Hadoop no Magazine Luiza
 

Similaire à Integrate Hive and R

Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Cloudera, Inc.
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)Revolution Analytics
 
How to use hadoop and r for big data parallel processing
How to use hadoop and r for big data  parallel processingHow to use hadoop and r for big data  parallel processing
How to use hadoop and r for big data parallel processingBryan Downing
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Implementing R7RS on R6RS Scheme
Implementing R7RS on R6RS SchemeImplementing R7RS on R6RS Scheme
Implementing R7RS on R6RS SchemeKato Takashi
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Yukinori Suda
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
 
サンプルから見るMap reduceコード
サンプルから見るMap reduceコードサンプルから見るMap reduceコード
サンプルから見るMap reduceコードShinpei Ohtani
 
サンプルから見るMapReduceコード
サンプルから見るMapReduceコードサンプルから見るMapReduceコード
サンプルから見るMapReduceコードShinpei Ohtani
 
Hadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionHadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionClaudio Martella
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and RRadek Maciaszek
 
Machine Learning with Mahout
Machine Learning with MahoutMachine Learning with Mahout
Machine Learning with Mahoutbigdatasyd
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 

Similaire à Integrate Hive and R (20)

Rhive 0.0 3
Rhive 0.0 3Rhive 0.0 3
Rhive 0.0 3
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)
 
R and-hadoop
R and-hadoopR and-hadoop
R and-hadoop
 
How to use hadoop and r for big data parallel processing
How to use hadoop and r for big data  parallel processingHow to use hadoop and r for big data  parallel processing
How to use hadoop and r for big data parallel processing
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Implementing R7RS on R6RS Scheme
Implementing R7RS on R6RS SchemeImplementing R7RS on R6RS Scheme
Implementing R7RS on R6RS Scheme
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)
 
Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
MapReduce and NoSQL
MapReduce and NoSQLMapReduce and NoSQL
MapReduce and NoSQL
 
サンプルから見るMap reduceコード
サンプルから見るMap reduceコードサンプルから見るMap reduceコード
サンプルから見るMap reduceコード
 
サンプルから見るMapReduceコード
サンプルから見るMapReduceコードサンプルから見るMapReduceコード
サンプルから見るMapReduceコード
 
Hadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionHadoop: A Hands-on Introduction
Hadoop: A Hands-on Introduction
 
Special topics in finance lecture 2
Special topics in finance   lecture 2Special topics in finance   lecture 2
Special topics in finance lecture 2
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and R
 
Machine Learning with Mahout
Machine Learning with MahoutMachine Learning with Mahout
Machine Learning with Mahout
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
R - the language
R - the languageR - the language
R - the language
 

Dernier

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Dernier (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Integrate Hive and R

  • 1. RHive : Integrating R and Hive Introduction JunHo Cho Data Analysis Platform Team Friday, November 11, 11
  • 2. Analysis of Data Friday, November 11, 11
  • 3. Analysis of Data CF Classifier Decision Tree MapReduce Recommendation Graph Clustering Friday, November 11, 11
  • 4. Related Works • RHIPE • RHadoop duce R e • hive (Hadoop InteractiVE) Map and • seuge r st un de u st M Friday, November 11, 11
  • 5. RHive is inspired by ... • Many analysts have been used R for a long time • Many analysts can use SQL language • There are already a lot of statistical functions in R • R needs a capability to analyze big data • Hive supports SQL-like query language (HQL) • Hive supports MapReduce to execute HQL R is the best solution for familiarity Hive is the best solution for capability Friday, November 11, 11
  • 6. RHive Components • Hadoop • store and analyze big data • Hive • use HQL instead of MapReduce programming • R • support friendly environment to analysts Friday, November 11, 11
  • 7. RHive - Architecture Execute R Function Objects and R Objects through Hive Query Execute Hive Query through R rcal <- function(arg1,arg2) { SELECT R(‘rcal’,col1,col2) coeff * sum(arg1,arg2) from tab1 } RServe 01010100101 01010100101 01010100101 01010010101 01010010101 01010010101 01001010101 01001010101 01001010101 10101000111 10101000111 10101000111 R Function R Object RUDF RUDAF Friday, November 11, 11
  • 8. RHive API • Extension R Functions • rhive.connect • rhive.napply • rhive.load.table • rhive.query • rhive.sapply • rhive.desc.table • rhive.assign • rhive.aggregate • rhive.export • rhive.list.tables • Extension Hive Functions • RUDF • RUDAF • GenericUDTFExpand • GenericUDTFUnFold Friday, November 11, 11
  • 9. RUDF - R User-defined Functions SELECT R(‘R function-name’,col1,col2,...,TYPE) • UDF doesn’t know return type until calling R function • TYPE : return type Example : R function which sums all passed columns sumCols <- function(arg1,...) { sum(arg1,...) } rhive.assign(‘sumCols’,sumCols) rhive.exportAll(‘sumCols’,hadoop-clusters) result <- rhive.query(“SELECT R(‘sumCols’, col1, col2, col3, col4, 0.0) FROM tab”) plot(result) Friday, November 11, 11
  • 10. RUDAF - R User-defined Aggregation Function SELECT RA(‘R function-name’,col1,col2,...) • R can not manipulate large dataset • Support UDAF’s life cycle • iterate, partial-merge, merge, terminate • Return type is only string delimited by ‘,’ - “data1,data2,data3,...” partial aggregation partial aggregation aggregation values FUN FUN.partial FUN.merge FUN.terminate 01010100101 01010100101 01010100101 01010100101 01010100101 01010100101 01010100101 01010010101 01010010101 01010010101 01010010101 01010010101 01010010101 01010010101 01001010101 01001010101 01001010101 01001010101 01001010101 01001010101 01001010101 10101000111 10101000111 10101000111 10101000111 10101000111 10101000111 10101000111 Friday, November 11, 11
  • 11. UDTF : unfold and expand • RUDAF only returns string delimited by ‘,’ • Convert RUDAF’s result to R data.frame unfold(string_value,type1,type2,...,delimiter) expand(string_value,type,delimiter) RA(‘newcenter’,...) return “num1,num2,num3” per cluster-key select unfold(tb1.x,0.0,0.0,0.0,’,’) as (col1,col2,col3) from (select RA(‘newcenter’, attr1,attr2,attr3,attr4) as x from table group by cluster-key Friday, November 11, 11
  • 12. napply and sapply rhive.napply(table-name,FUN,col1,...) rhive.sapply(table-name,FUN,col1,...) • napply : R apply function for Numeric type • sapply : R apply function for String type Example : R function which sums all passed columns sumCols <- function(arg1,...) { sum(arg1,...) } result <- rhive.napply(“tab”, sumCols, col1, col2, col3, col4) rhive.load.table(result) Friday, November 11, 11
  • 13. napply • ‘napply’ is similar to R apply function • Store big result to HDFS as Hive table rhive.napply <- function(tablename, FUN, col = NULL, ...) { if(is.null(col)) cols <- "" else cols <- paste(",",col) for(element in c(...)) { cols <- paste(cols,",",element) } exportname <- paste(tablename,"_sapply",as.integer(Sys.time()),sep="") ! rhive.assign(exportname,FUN) ! rhive.exportAll(exportname) tmptable <- paste(exportname,”_table”) ! rhive.query( paste("CREATE TABLE ", tmptable," AS SELECT ","R('",exportname,"'",cols,",0.0) FROM ",tablename,sep="")) ! tmptable } Friday, November 11, 11
  • 14. aggregate rhive.aggregate(table-name,hive-FUN,...,goups) • RHive aggregation function to aggregate data stored in HDFS using HIVE Function Example : Aggregate using SUM (Hive aggregation function) result <- rhive.aggregate(“emp”, “SUM”, sal,groups=”deptno”) rhive.load.table(result) Friday, November 11, 11
  • 15. Examples - predict flight delay library(RHive) rhive.connect() - Retrieve training set from large dataset stored in HDFS train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand()) train$arrdelay <- as.numeric(train$arrdelay) train$distance <- as.numeric(train$distance) train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),] Native R code model <- lm(arrdelay ~ distance + dayofweek,data=train) - Export R object data rhive.assign("model", model) - Analyze big data using model calculated by R predict_table <- rhive.napply(“airlines”,function(arg1,arg2,arg3) { if(is.null(arg1) | is.null(arg2) | is.null(arg3)) return(0.0) HiveQuery + R code res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3)) return(as.numeric(res)) }, ‘dayofweek’, ‘arrdelay’, ‘distance’) Friday, November 11, 11
  • 17. Conclusion • RHive supports HQL, not MapReduce model style • RHive allows analytics to do everything in R console • RHive interacts R data and HDFS data • Future & Current Works • Integrate Hadoop HDFS • Support Transform/Map-Reduce Scripts • Distributed Rserve • Support more R style API • Support machine learning algorithms (k-means, classifier, ...) Friday, November 11, 11
  • 18. Cooperators • JunHo Cho • Seonghak Hong • Choonghyun Ryu YO U ! Friday, November 11, 11
  • 19. How to join RHive project • Logo • github (https://github.com/nexr/RHive) • CRAN (http://cran.r-project.org/web/packages/RHive) • Welcome to join RHive project Friday, November 11, 11
  • 20. References • Recardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf) • RHIPE (http://ml.stat.purdue.edu/rhipe) • Hive (http://hive.apache.org) • Parallels R by Q. Ethan McCallum and Stephen Weston Friday, November 11, 11
  • 23. RHIPE • the R and Hadoop Integrated Processing Environment • Must understand the MapReduce model map <- function() {...} shuffle / sort reduce <- function() {...} Mapper Reducer rmr <- rhmr(map,reduce,...) ProtocolBuf R Fork R R RHMR R Objects (map) R Objects (reduce) R Objects (map, reduce) PersonalServer R Conf HDFS Friday, November 11, 11
  • 24. RHadoop • Manipulate Hadoop data stores and HBASE directly from R • Write MapReduce models in R using Hadoop Streaming • Must understand the MapReduce model map <- function() {...} reduce <- function() {...} mapreduce(input,output,map,reduce,...) R rhbase rhdfs rmr execute hadoop streaming R R Hadoop Streaming manipulate shuffle / sort store R objecs as file Mapper Reducer HBASE HDFS Friday, November 11, 11
  • 25. hive(Hadoop InteractiVE) • R extension facilitating distributed computing via the MapReduce paradigm • Provide an interface to Hadoop, HDFS and Hadoop Streaming • Must understand the MapReduce model map <- function() {...} reduce <- function() {...} hive_stream(map,reduce,...) R execute hadoop R R hive DFS hive_stream streaming Hadoop Streaming manipulate shuffle / sort save R script on local Mapper Reducer HDFS Friday, November 11, 11
  • 26. seuge • Simple parallel processing with a fast and easy setup on Amazon’s WS. • Parallel lapply function for the EMR engine using Hadoop streaming. • Does not support MapReduce model but only Map model. data <- list(...) emrlapply(clusterObject,data,FUN,..) Amazon S3 upload R objects R emrlapply awsFunctions EMR Hadoop Streaming R Mapper save R objects (data + FUN) on local bootstrap (setup R) mapper.R Friday, November 11, 11
  • 27. Ricardo • Integrate R and Jaql (JSON Query Language) • Must know how to use uncommon query, Jaql • Not open-source Ref : Ricardo-SIGMOD10 Friday, November 11, 11