SlideShare une entreprise Scribd logo
1  sur  15
Télécharger pour lire hors ligne
R and Hadoop

    Ram Venkat
   Dawn Analytics
What is Hadoop?
•   Hadoop is an open source Apache software for
    running distributed applications on 'big data'
•   It contains a distributed file system (HDFS) and a
    parallel processing 'batch' framework
•   Hadoop is written in java, runs on unix/linux for
    development and production
•   Windows and Mac can be used as development
    platform
•   Yahoo has > 43000 nodes hadoop cluster and
    Facebook has over 100 PB(PB= 1 M GB) of data in
    hadoop clusters
Hadoop overview (1/2)
     Central Idea: Moving computation to data and
    compute across nodes in parallel
•   Data Loading
Hadoop overview (2/2)
Parallel Computation: MapReduce
Map Reduce : Example 'Hello word'
•   Mathematically, this is what MapReduce is about:
    —map (k1, v1) ➔ list(k2, v2)
    —reduce(k2, list(v2)) ➔ list(<k3, v3>)
•   Implementation of the 'hello word' (word count):
    Mapper: K1 -> file name, v1 -> text of the file
             K2 -> word, V2 -> “1”
    Reducer: Sums up the '1' s and produces a list of
    words and their counts
Word Count (slide from Yahoo)
R libraries to work with Hadoop
•    'Hadoop Streaming' - An alternative to the Java
     MapReduce API
•    Hadoop Streaming allows you to write jobs in any
     language supporting stdin/stdout.
•    R has several libraries/ways that help you to work with
     Hadoop:
      – Write your mapper.R and reducer.R and run a shell
        script
      – 'rmr' and 'rhadoop' from revolution analytics
     – 'rhipe' from Purdue University statistical computing
     – 'RHive' interacts with Hadoop via Hive query
Word Count Demo with R(rmr)
  mapper.wordcount = function(key, val) {

                lapply(
                  strsplit( x = val, split = " ")[[1]],
                  function(w) keyval(w,1)
                )
  }

   reducer.wordcount = function(key, val.list) {
             output.key = key
             output.val = sum(unlist(val.list))
                return( keyval(output.key, output.val) )
   }
More advanced example –
    Sentiment Analysis in R(rmr)
•   One area where Hadoop could help out traders is in
    sentiment analysis
•   Oreilly Strata blog 'Trading on sentiment' :
    http://strata.oreilly.com/2011/05/sentiment-analysis-finance.html

•   Demo2 is modified code from this example from Jeffrey
    Breen on airlines sentiment analysis :
    http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/

•   Jeffrey has been very active in R groups in Chicago
    area, This is another tutorial last month on R and
    Hadoop by Jeffrey : http://www.slideshare.net/jeffreybreen/getting-started-
    with-r-hadoop
Demo2 Sentiment Analysis with rmr
 mapper.score = function(key, val) {

 # clean up tweets with R's regex-driven global substitute, gsub():
   val = gsub('[[:punct:]]', '', val)
   val = gsub('[[:cntrl:]]', '', val)
   val = gsub('d+', '', val)

 # Key is the Airline we added as tag to the tweets
   airline = substr(val,1,2)

 # Run the sentiment analysis
   output.key = c(as.character(airline), score.sentiment(val,pos.words,neg.words))

 # our interest is in computing the counts by airlines and scores, so 'this' count is 1
   output.val = 1
   return( keyval(output.key, output.val) )
 }
Demo3 - Hive

•   Hive is a data warehousing infrastructure for Hadoop
•   Provides a familiar SQL like interface to create tables,
    insert and query data
•   Behind the scene , it implements map-reduce
•    Hive is an alternative to our hadoop streaming we
    covered before
•   Demo3 – stock query with Hive
Use cases for Traders
•    Stock sentiment analysis
•    Stock trading pattern analysis
•    Default prediction
•    Fraud/anomoly detection
•    NextGen data warehousing
Hadoop support - Cloudera
•   Cloudera distribution of hadoop is one of the most
    popular distribution (I used their CDH3v5 in my 2
    demos above)
•   Doug Cutting, the creator of Hadoop is the architect
    with Cloudera
•   Adam Muise , a Cloudera engineer at Toronto is the
    organizer of Toronto Hadoop user Group (TOHUG)
•   Upcoming meeting organized by TOHUG on the 30th
    October - “PIG-fest”
Hadoop Tutorials and Books
•   http://hadoop.apache.org/docs/r0.20.2/quickstart.html
•   Cloudera: http://university.cloudera.com/
•   Book: “Hadoop in Action” – Manning
•   Book: “Hadoop - The Definitive Guide” – Oreilly
•   Hadoop Streaming:
    http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html
•   Google Code University:
    http://code.google.com/edu/parallel/mapreduce-tutorial.html
•   Yahoo's Tutorial :
    http://developer.yahoo.com/hadoop/tutorial/module1.html
Thank You

For any clarification, send e-mail to
ram@dawnanalytics.com

Contenu connexe

Tendances

Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
Hadoop eco system-first class
Hadoop eco system-first classHadoop eco system-first class
Hadoop eco system-first classalogarg
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveMike Frampton
 
Hadoop course curriculm
Hadoop course curriculm Hadoop course curriculm
Hadoop course curriculm alogarg
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemRajkumar Singh
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache HadoopKMS Technology
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen ChinaAllen Day, PhD
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 

Tendances (20)

Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Hadoop eco system-first class
Hadoop eco system-first classHadoop eco system-first class
Hadoop eco system-first class
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
 
Hadoop course curriculm
Hadoop course curriculm Hadoop course curriculm
Hadoop course curriculm
 
Big data and tools
Big data and tools Big data and tools
Big data and tools
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
 
R for data analytics
R for data analyticsR for data analytics
R for data analytics
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 

Similaire à R and Hadoop: An Introduction to Using R with Hadoop for Big Data Analysis

Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User GroupCsaba Toth
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionMapR Technologies
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014rpbrehm
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?rhatr
 

Similaire à R and Hadoop: An Introduction to Using R with Hadoop for Big Data Analysis (20)

Hive paris
Hive parisHive paris
Hive paris
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Data Science
Data ScienceData Science
Data Science
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Hadoop
HadoopHadoop
Hadoop
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop Solution
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 

Dernier

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 

Dernier (20)

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 

R and Hadoop: An Introduction to Using R with Hadoop for Big Data Analysis

  • 1. R and Hadoop Ram Venkat Dawn Analytics
  • 2. What is Hadoop? • Hadoop is an open source Apache software for running distributed applications on 'big data' • It contains a distributed file system (HDFS) and a parallel processing 'batch' framework • Hadoop is written in java, runs on unix/linux for development and production • Windows and Mac can be used as development platform • Yahoo has > 43000 nodes hadoop cluster and Facebook has over 100 PB(PB= 1 M GB) of data in hadoop clusters
  • 3. Hadoop overview (1/2) Central Idea: Moving computation to data and compute across nodes in parallel • Data Loading
  • 4. Hadoop overview (2/2) Parallel Computation: MapReduce
  • 5. Map Reduce : Example 'Hello word' • Mathematically, this is what MapReduce is about: —map (k1, v1) ➔ list(k2, v2) —reduce(k2, list(v2)) ➔ list(<k3, v3>) • Implementation of the 'hello word' (word count): Mapper: K1 -> file name, v1 -> text of the file K2 -> word, V2 -> “1” Reducer: Sums up the '1' s and produces a list of words and their counts
  • 6. Word Count (slide from Yahoo)
  • 7. R libraries to work with Hadoop • 'Hadoop Streaming' - An alternative to the Java MapReduce API • Hadoop Streaming allows you to write jobs in any language supporting stdin/stdout. • R has several libraries/ways that help you to work with Hadoop: – Write your mapper.R and reducer.R and run a shell script – 'rmr' and 'rhadoop' from revolution analytics – 'rhipe' from Purdue University statistical computing – 'RHive' interacts with Hadoop via Hive query
  • 8. Word Count Demo with R(rmr) mapper.wordcount = function(key, val) { lapply( strsplit( x = val, split = " ")[[1]], function(w) keyval(w,1) ) } reducer.wordcount = function(key, val.list) { output.key = key output.val = sum(unlist(val.list)) return( keyval(output.key, output.val) ) }
  • 9. More advanced example – Sentiment Analysis in R(rmr) • One area where Hadoop could help out traders is in sentiment analysis • Oreilly Strata blog 'Trading on sentiment' : http://strata.oreilly.com/2011/05/sentiment-analysis-finance.html • Demo2 is modified code from this example from Jeffrey Breen on airlines sentiment analysis : http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/ • Jeffrey has been very active in R groups in Chicago area, This is another tutorial last month on R and Hadoop by Jeffrey : http://www.slideshare.net/jeffreybreen/getting-started- with-r-hadoop
  • 10. Demo2 Sentiment Analysis with rmr mapper.score = function(key, val) { # clean up tweets with R's regex-driven global substitute, gsub(): val = gsub('[[:punct:]]', '', val) val = gsub('[[:cntrl:]]', '', val) val = gsub('d+', '', val) # Key is the Airline we added as tag to the tweets airline = substr(val,1,2) # Run the sentiment analysis output.key = c(as.character(airline), score.sentiment(val,pos.words,neg.words)) # our interest is in computing the counts by airlines and scores, so 'this' count is 1 output.val = 1 return( keyval(output.key, output.val) ) }
  • 11. Demo3 - Hive • Hive is a data warehousing infrastructure for Hadoop • Provides a familiar SQL like interface to create tables, insert and query data • Behind the scene , it implements map-reduce • Hive is an alternative to our hadoop streaming we covered before • Demo3 – stock query with Hive
  • 12. Use cases for Traders • Stock sentiment analysis • Stock trading pattern analysis • Default prediction • Fraud/anomoly detection • NextGen data warehousing
  • 13. Hadoop support - Cloudera • Cloudera distribution of hadoop is one of the most popular distribution (I used their CDH3v5 in my 2 demos above) • Doug Cutting, the creator of Hadoop is the architect with Cloudera • Adam Muise , a Cloudera engineer at Toronto is the organizer of Toronto Hadoop user Group (TOHUG) • Upcoming meeting organized by TOHUG on the 30th October - “PIG-fest”
  • 14. Hadoop Tutorials and Books • http://hadoop.apache.org/docs/r0.20.2/quickstart.html • Cloudera: http://university.cloudera.com/ • Book: “Hadoop in Action” – Manning • Book: “Hadoop - The Definitive Guide” – Oreilly • Hadoop Streaming: http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html • Google Code University: http://code.google.com/edu/parallel/mapreduce-tutorial.html • Yahoo's Tutorial : http://developer.yahoo.com/hadoop/tutorial/module1.html
  • 15. Thank You For any clarification, send e-mail to ram@dawnanalytics.com