R, Hadoop and Amazon Web
         Services
    Portland R Users Group
     December 20th, 2011
A general disclaimer
• Good programmers learn fast and develop expertise in
  technologies and methodologies in a rather intrepid,
  exploratory manner.
• I am by no means an expert in the paradigm we are
  discussing this evening, but I’d like to share what I
  have learned in the last year while developing
  MapReduce applications in R on AWS.
  Translation: ask anything and everything but reserve
  the right to say “I don’t know, yet.”
• Also, this is a meetup.com meeting – seems only
  appropriate to keep this short, sweet, high-level and
  full of open discussion points.
The whole point of this presentation
• I am selfish (and you should be too!)
    – I like collaborators
    – I like collaborators interested in things I am interested in
    – I believe that dissemination of information related to sophisticated,
      numerical decision making processes generally makes the world a
      better place
    – I believe that the more people use Open Source technology, the more
      people contribute to Open Source technology and the better Open
      Source technology gets in general. Hence, my life gets easier and
      cheaper which is presumably analogous to “better” in some respect.
    – There is beer at this meetup. Cue short intermission.
• Otherweiser® (brought to you by the aforementioned speaking point), I’d
  really be very happy if people said to themselves at the end of this
  presentation “Hadoop seems easy! I’m going to give it a try.”
Why are we talking about this
                    anyhow?
“Every two days now we create as much information as we did from the dawn of
   civilization up until 2003.” – Eric Schmidt, August 2010

•   We aggregate a lot of data (and have been)
     – Particularly businesses like Google, Amazon, Apple etc…
     – Presumably the government is doing awful things with data too
•   But aggregation isn’t understanding
     – Lawnmower Man aside
     – We need to UNDERSTAND the data – that is, take raw data and make it interoperable.
     – Hence the need for a marriage of Statistics and Programming directed at understanding
       phenomena expressed in these large data sets
     – Can’t recommend this book enough:
          •   The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert
              Tibshirani and Jerome Friedman
          •   http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576/ref=pd_sim_b_1
•   So everybody is going crazy about this in general.
Also, who is this “self” I speak of?
• ’tis I, Timothy Dalbey
     • I work for the Emerging Technologies Group of News
       Corporation
     • I live in North East Portland and keep an office on 53rd
       and 5th in New York City
     • Studied Mathematics and Economics as an
       undergraduate student and Statistics as a graduate
       student at the University of Virginia
     • 2 awesome kids and an awesome partner at home: Liam,
       Juniper and Lindsay
     • Enthusiastic about technology, science and futuristic
       endeavors in general
Elastic MapReduce
• Elastic MapReduce is
  – A service of Amazon Web Services
  – Composed of Amazon Machine Images
     • ssh capability
     • Debian Linux
     • Preloaded with ancient versions of R
  – A complementary set of Ruby Client Tools
  – A web interface
  – Preconfigured to run Hadoop
Hadoop
• Popular framework for controlling distributed cluster computations
     – Popularity is important – cue story about MPI at Levy Laboratory
       and Beowulf clusters…
• Hadoop is an Apache Project product
     – http://hadoop.apache.org/
•   Open Source
•   Java
•   Configurable (mostly uses XML config files)
•   Fault Tolerant
•   Lots of ways to interact with Hadoop
     –   Pig
     –   Hive
     –   Streaming
     –   Custom .jar
Hadoop is MapReduce
• What is MapReduce?
   – The term was coined by Google Labs in 2004
   – A super simplified single-node version of the paradigm is as follows:
       cat input.txt | ./mapper.R | sort | ./reducer.R > output.txt
• That is, MapReduce follows a general process:
   –   Read input (cat input)
   –   Map (mapper.R)
   –   Partition
   –   Comparison (sort)
   –   Reduce (reducer.R)
   –   Output (output.txt)
• You can use most popular scripting languages
   – Perl, PHP, Python etc…
   – R
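The single-node pipeline above can be simulated end-to-end with standard Unix tools. A minimal sketch, using awk as a stand-in for the R mapper and reducer (file names are invented for illustration):

```shell
# Build a tiny input file.
printf 'it was the best of times\nit was the worst of times\n' > input.txt

# map: emit "word<TAB>1" per word | sort: group identical keys |
# reduce: sum the counts per key (awk stands in for mapper.R / reducer.R).
cat input.txt \
  | awk '{for (i = 1; i <= NF; i++) print $i "\t1"}' \
  | sort \
  | awk -F'\t' '{sum[$1] += $2} END {for (w in sum) print w "\t" sum[w]}' \
  | sort > output.txt

cat output.txt
```

Swapping the two awk stages for ./mapper.R and ./reducer.R gives exactly the shape of pipeline that Hadoop streaming distributes across nodes.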
But – that sort of misses the point
• MapReduce is a computational paradigm intended for
   – Large Datasets
   – Multi-Node Computation
   – Truly Parallel Processing
• Master/Slave architecture
   – Nodes are agnostic of one another, only the master
     node(s) have any idea about the greater scheme of things.
      • The importance of truly parallel processing
• A good first question before engaging in creating a
  Hadoop job is:
   – Is this process a good candidate for Hadoop processing in
     the first place?
Benefits to using AWS for Hadoop Jobs
• Preconfigured to run Hadoop
   – This in itself is something of a miracle
• Virtual Servers
   – Use the servers for only as long as you need
   – Configurability
• Handy command line tools
• S3 is sitting in the same cloud
   – Your data is sitting in the same space
• Servers come at $0.06 per hour of compute time
  – dirt cheap
Specifics
•   Bootstrapping
     –   Bootstrapping is a process by which you may customize the nodes via bash shell
          •   Acquiring data
          •   Updating R
          •   Installing packages
          •   For example:

#!/bin/bash
# Debian R upgrade
gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
gpg -a --export 06F90DE5381BA480 | sudo apt-key add -
echo "deb http://streaming.stat.iastate.edu/CRAN/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get -t lenny-cran install --yes --force-yes r-base r-base-dev



•   Input file
     –   Mapper specific
          •   Classic example in WordCounter.py
               –   Example: “It was the best of times, it was the worst of times…”
               –   Note: Big data set!
          •   An example from a recent application of mine:
               –   "25621"\r"23803"\r"31712"\r…
               –   Note: Not such a big data set
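Inputs like the carriage-return-delimited IDs above usually need a normalization pass before the streaming job splits them into one record per line. A small sketch (the quoted-ID, CR-separated format is from the slide; file names are invented):

```shell
# Recreate the CR-delimited, quoted input shown above.
printf '"25621"\r"23803"\r"31712"' > ids.txt

# Convert carriage returns to newlines and strip the quotes so each
# line becomes one record for the mapper to read from STDIN.
tr '\r' '\n' < ids.txt | tr -d '"' > records.txt

cat records.txt
```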


•       Mapper & Reducer
           –       Both typically draw from STDIN and write to STDOUT
           –       Please see the following examples
The typical “Hello World” MapReduce: Mapper
#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep=""), sep="")
}

close(con)
The typical “Hello World” MapReduce: Reducer
#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
        val <- unlist(strsplit(line, "\t"))
        list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
       line <- trimWhiteSpace(line)
       split <- splitLine(line)
       word <- split$word
       count <- split$count
       if (exists(word, envir = env, inherits = FALSE)) {
           oldcount <- get(word, envir = env)
           assign(word, oldcount + count, envir = env)
       }else{
           assign(word, count, envir = env)
       }
}

close(con)
for (w in ls(env, all = TRUE)){
       cat(w, "\t", get(w, envir = env), "\n", sep = "")
}
MapReduce and R: Forecasting data
       for News Corporation
• 50k+ products with historical unit sales data of roughly
  2.5MM rows
• Some of the titles require heavy computational processing
   – Titles with insufficient data require augmented or surrogate
     data in order to make “good” predictions – thus identifying good
     candidate data was also necessary in addition to prediction
     methods
   – Took lots of time (particularly in R)
      • But R had the analysis tools I needed!
• Key observation: The predictions were independent of one
  another which made the process truly parallel.
• Thus, Hadoop and Elastic MapReduce were merited
My Experience Learning and Using
            Hadoop with AWS
•   Debugging is something of a nightmare.
     –   SSH onto the nodes to figure out what’s really going on
     –   STDERR is your enemy – it will cause your job to fail rather completely
     –   STDERR is your best friend – no errors and failed jobs are rather frustrating
•   Most of the work is in the transactional details with AWS Elastic MapReduce
•   I followed conventional advice which is “move data to the nodes.”
     –   This meant moving data into csv’s in S3 and importing the data into R via standard read methods
     –   This also meant that my processes were database agnostic
     –   JSON is a great way of structuring input and output between phases of the MapReduce Process
            •   To that effect, check out the rjson package – great package.
•   In general, the following rule seems to apply:
     –   Data frame bad.
     –   Data table good.
           •   http://cran.r-project.org/web/packages/data.table/index.html
•   Packages to simplify R make my skin crawl
     –   Ever see Jurassic Park?
      –   Just a stubborn programmer – of course the logical extension leads me to a contradiction. Never mind that I
          said that.
R Package to Utilize Map Reduce
• Segue – written by J.D. Long
  – http://www.cerebralmastication.com
     • P.s. We all realize that www is a subdomain, right?
       World Wide Web… is that really necessary?
  – Handles much of the transactional details and
    allows the use of Elastic MapReduce through
    apply() and lapply() wrappers
• Seems like this is a good tutorial too:
  – http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
Other stuff
• Distributed Cache
  – Load your data the smart way!
• Ruby Command Tools
  – Interact with AWS the smart way!
• Web interface
  – Simple.
  – Helpful when monitoring jobs when you wake up
    at 3:30AM and wonder “is my script still running?”
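The Ruby command line tools mentioned above let you launch a streaming job, with the bootstrap script attached, in a single invocation. A hypothetical sketch of the era's elastic-mapreduce client (bucket names and paths are placeholders, not from the slides):

```shell
# Hypothetical: launch a streaming job with the elastic-mapreduce Ruby client.
elastic-mapreduce --create --stream \
  --name "wordcount" \
  --bootstrap-action s3://mybucket/bootstrap.sh \
  --mapper s3://mybucket/mapper.R \
  --reducer s3://mybucket/reducer.R \
  --input   s3://mybucket/input/ \
  --output  s3://mybucket/output/ \
  --num-instances 4
```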

Contenu connexe

Tendances

The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
J Singh
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Simplilearn
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
Donald Miner
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
Milind Bhandarkar
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
yhadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 

Tendances (20)

The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
OpenLSH - a framework for locality sensitive hashing
OpenLSH  - a framework for locality sensitive hashingOpenLSH  - a framework for locality sensitive hashing
OpenLSH - a framework for locality sensitive hashing
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for Hadoop
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
Giraph
GiraphGiraph
Giraph
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Apache Giraph: Large-scale graph processing done better
Apache Giraph: Large-scale graph processing done betterApache Giraph: Large-scale graph processing done better
Apache Giraph: Large-scale graph processing done better
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worlds
 

En vedette

Apresentação nivel i bovespa
Apresentação nivel i bovespaApresentação nivel i bovespa
Apresentação nivel i bovespa
Braskem_RI
 
Entrevista ao gigante adamastor
Entrevista ao gigante adamastorEntrevista ao gigante adamastor
Entrevista ao gigante adamastor
dsa97
 
Public interviews
Public interviewsPublic interviews
Public interviews
Emma Garner
 
Practice soap evaluation
Practice soap evaluationPractice soap evaluation
Practice soap evaluation
Emma Garner
 
Galeano-Gutierrez-Roldan-Rombolá
Galeano-Gutierrez-Roldan-RomboláGaleano-Gutierrez-Roldan-Rombolá
Galeano-Gutierrez-Roldan-Rombolá
Damian
 
Registro de estrangeiro no brasil
Registro de estrangeiro no brasilRegistro de estrangeiro no brasil
Registro de estrangeiro no brasil
juramentado05
 
Case Brilux - Limpeza Para Toda Família
Case Brilux - Limpeza Para Toda FamíliaCase Brilux - Limpeza Para Toda Família
Case Brilux - Limpeza Para Toda Família
gruponove_promonove
 

En vedette (18)

Roald dahl
 Roald dahl Roald dahl
Roald dahl
 
7Guia1
7Guia17Guia1
7Guia1
 
LM3405 : Constant Current Regulator for Powering LEDs
LM3405 : Constant Current Regulator for Powering LEDs LM3405 : Constant Current Regulator for Powering LEDs
LM3405 : Constant Current Regulator for Powering LEDs
 
Que saudades
Que saudadesQue saudades
Que saudades
 
Apresentação nivel i bovespa
Apresentação nivel i bovespaApresentação nivel i bovespa
Apresentação nivel i bovespa
 
Life is not about finding
Life is not about findingLife is not about finding
Life is not about finding
 
Entrevista ao gigante adamastor
Entrevista ao gigante adamastorEntrevista ao gigante adamastor
Entrevista ao gigante adamastor
 
Proactive and reactive thermal optimization techniques to improve energy effi...
Proactive and reactive thermal optimization techniques to improve energy effi...Proactive and reactive thermal optimization techniques to improve energy effi...
Proactive and reactive thermal optimization techniques to improve energy effi...
 
It’s all about priorities.”
It’s all about priorities.”It’s all about priorities.”
It’s all about priorities.”
 
Public interviews
Public interviewsPublic interviews
Public interviews
 
Mingus High School Teacher In-Service
Mingus High School Teacher In-ServiceMingus High School Teacher In-Service
Mingus High School Teacher In-Service
 
Practice soap evaluation
Practice soap evaluationPractice soap evaluation
Practice soap evaluation
 
Mobile marketing- Hvad ønsker brugeren sig af den mobile platform?
Mobile marketing- Hvad ønsker brugeren sig af den mobile platform?Mobile marketing- Hvad ønsker brugeren sig af den mobile platform?
Mobile marketing- Hvad ønsker brugeren sig af den mobile platform?
 
Galeano-Gutierrez-Roldan-Rombolá
Galeano-Gutierrez-Roldan-RomboláGaleano-Gutierrez-Roldan-Rombolá
Galeano-Gutierrez-Roldan-Rombolá
 
O que é tradução juramentada
O que é tradução juramentadaO que é tradução juramentada
O que é tradução juramentada
 
Registro de estrangeiro no brasil
Registro de estrangeiro no brasilRegistro de estrangeiro no brasil
Registro de estrangeiro no brasil
 
Taller grado 5
Taller grado 5Taller grado 5
Taller grado 5
 
Case Brilux - Limpeza Para Toda Família
Case Brilux - Limpeza Para Toda FamíliaCase Brilux - Limpeza Para Toda Família
Case Brilux - Limpeza Para Toda Família
 

Similaire à "R, Hadoop, and Amazon Web Services (20 December 2011)"

Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 

Similaire à "R, Hadoop, and Amazon Web Services (20 December 2011)" (20)

Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
ENAR short course
ENAR short courseENAR short course
ENAR short course
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
Introduction to hadoop V2
Introduction to hadoop V2Introduction to hadoop V2
Introduction to hadoop V2
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
2013 year of real-time hadoop
2013 year of real-time hadoop2013 year of real-time hadoop
2013 year of real-time hadoop
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 

Dernier

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Dernier (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

"R, Hadoop, and Amazon Web Services (20 December 2011)"

  • 1. R, Hadoop and Amazon Web Services Portland R Users Group December 20th, 2011
  • 2. A general disclaimer • Good programmers learn fast and develop expertise in technologies and methodologies in a rather intrepid, exploratory manner. • I am by no means a expert in the paradigm which we are discussing this evening but I’d like to share what I have learned in the last year while developing MapReduce applications in R within the AWS. Translation: ask anything and everything but reserve the right to say “I don’t know, yet.” • Also, this is a meetup.com meeting – seems only appropriate to keep this short, sweet, high-level and full of solicitous discussion points.
  • 3. The whole point of this presentation • I am selfish (and you should be too!) – I like collaborators – I like collaborators interested in things I am interested in – I believe that dissemination of information related to sophisticated, numerical decision making processes generally makes the world a better place – I believe that the more people use Open Source technology, the more people contribute to Open Source technology and the better Open Source technology gets in general. Hence, my life gets easier and cheaper which is presumably analogous to “better” in some respect. – There is beer at this meetup. Queue short intermission. • Otherweiser® (brought by the aforementioned speaking point,) I’d really be very happy if people said to themselves at the end of this presentation “Hadoop seems easy! I’m going to give it a try.”
  • 4. Why are we talking about this anyhow? “Every two days now we create as much information as we did from the dawn of civilization up until 2003.“ -Eric Schmidt, August 2010 • We aggregate a lot of data (and have been) – Particularly businesses like Google, Amazon, Apple etc… – Presumably the government is doing awful things with data too • But aggregation isn’t understanding – Lawnmower Man aside – We need to UNDERSTAND the data- that is take raw data and make it interoperable. – Hence the need for a marriage of Statistics and Programming directed at understanding phenomena expressed in these large data sets – Can’t recommend this book enough: • The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Freidman • http://www.amazon.com/Elements-Statistical-Learning-Prediction- Statistics/dp/0387848576/ref=pd_sim_b_1 • So everybody is going crazy about this in general.
  • 5. Also, who is this “self” I speak of? • tis’ I, Timothy Dalbey • I work for the Emerging Technologies Group of News Corporation • I live in North East Portland and keep an office on 53rd and 5th in New York City • Studied Mathematics and Economics as a undergraduate student and Statistics as a graduate student at University of Virginia • 2 awesome kids and a awesome partner at home: Liam, Juniper and Lindsay • Enthusiastic about technology, science and futuristic endeavors in general
  • 6. Elastic MapReduce • Elastic Map reduce is – A service of Amazon Web Services – Is composed of Amazon Machine Images • ssh capability • Debian Linux • Preloaded with ancient versions of R – A complimentary set of Ruby Client Tools – A web interface – Preconfigured to run Hadoop
  • 7. Hadoop • Popular framework for controlling distributed cluster computations – Popularity is important – queue story about MPI at Levy Laboratory and Beowulf clusters… • Hadoop is a Apache Project product – http://hadoop.apache.org/ • Open Source • Java • Configurable (mostly uses XML config files) • Fault Tolerant • Lots of ways to interact with Hadoop – Pig – Hive – Streaming – Custom .jar
  • 8. Hadoop is MapReduce • What is a MapReduce? – Originally coined by Google Labs in 2004 – A super simplified single-node version of the paradigm is as follows: cat input.txt | ./mapper.R | sort | reducer.R > output.txt • That is, MapReduce has follows a general process: – Read input (cat input) – Map (mapper.R) – Partition – Comparison (sort) – Reduce (reducer.R) – Output (output.txt) • You can use most popular scripting languages – Perl, PHP, Python etc… – R
But – that sort of misses the point
• MapReduce is a computational paradigm intended for
    – Large Datasets
    – Multi-Node Computation
    – Truly Parallel Processing
• Master/Slave architecture
    – Nodes are agnostic of one another; only the master node(s) have any idea about the greater scheme of things.
• The importance of truly parallel processing
• A good first question before engaging in creating a Hadoop job is:
    – Is this process a good candidate for Hadoop processing in the first place?
Benefits to using AWS for Hadoop Jobs
• Preconfigured to run Hadoop
    – This in itself is something of a miracle
• Virtual Servers
    – Use the servers for only as long as you need – configurability
• Handy command line tools
• S3 is sitting in the same cloud
    – Your data is sitting in the same space
• Servers come at $0.06 per hour of compute time – dirt cheap
Specifics
• Bootstrapping
    – Bootstrapping is a process by which you may customize the nodes via bash shell
        • Acquiring data
        • Updating R
        • Installing packages
    – For example:

      #!/bin/bash
      # debian R upgrade
      gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
      gpg -a --export 06F90DE5381BA480 | sudo apt-key add -
      echo "deb http://streaming.stat.iastate.edu/CRAN/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
      sudo apt-get update
      sudo apt-get -t lenny-cran install --yes --force-yes r-base r-base-dev

• Input file
    – Mapper specific
    – Classic example in WordCounter.py
        • Example: “It was the best of times, it was the worst of times…”
        • Note: big data set!
    – An example from a recent application of mine:
        • "25621"\r"23803"\r"31712"\r…
        • Note: not such a big data set
• Mapper & Reducer
    – Both typically read from STDIN and write to STDOUT
    – Please see the following examples
The typical “Hello World” MapReduce Mapper

#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    cat(paste(words, "\t1\n", sep = ""), sep = "")
}
close(con)
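The mapper's two helper functions are easy to sanity-check interactively before shipping the script to a cluster (a quick sketch; the input string is made up):

```r
# Same helpers as in the mapper script
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

line <- "  it was the best of times  "
words <- splitIntoWords(trimWhiteSpace(line))
print(words)

# Emit the tab-separated "word\t1" records the reducer expects
cat(paste(words, "\t1\n", sep = ""), sep = "")
```

Testing the helpers this way is much faster than discovering a regex bug through failed Hadoop jobs.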
The typical “Hello World” MapReduce Reducer

#! /usr/bin/env Rscript

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    } else {
        assign(word, count, envir = env)
    }
}
close(con)
for (w in ls(env, all = TRUE)) {
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
}
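The reducer's environment-as-hash-table trick can be exercised on its own, without STDIN, by feeding it a vector of mapper-style records (a sketch; the records are made up and already sorted, as Hadoop would deliver them):

```r
# Simulated, already-sorted mapper output: "word\t1" records
records <- c("best\t1", "it\t1", "it\t1", "times\t1", "times\t1", "was\t1")

env <- new.env(hash = TRUE)  # environments give O(1) keyed lookup
for (line in records) {
    val <- unlist(strsplit(line, "\t"))
    word <- val[1]
    count <- as.integer(val[2])
    if (exists(word, envir = env, inherits = FALSE)) {
        assign(word, get(word, envir = env) + count, envir = env)
    } else {
        assign(word, count, envir = env)
    }
}

for (w in ls(env)) cat(w, "\t", get(w, envir = env), "\n", sep = "")
```

Using an environment with `hash = TRUE` is the idiomatic base-R way to get a mutable hash map; a named list would copy on every assignment.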
MapReduce and R: Forecasting data for News Corporation
• 50k+ products with historical unit sales data of roughly 2.5MM rows
• Some of the titles require heavy computational processing
    – Titles with insufficient data require augmented or surrogate data in order to make “good” predictions – thus identifying good candidate data was also necessary in addition to prediction methods
    – Took lots of time (particularly in R)
• But R had the analysis tools I needed!
• Key observation: the predictions were independent of one another, which made the process truly parallel.
• Thus, Hadoop and Elastic MapReduce were merited
My Experience Learning and Using Hadoop with AWS
• Debugging is something of a nightmare.
    – SSH onto nodes to figure out what’s really going on
    – STDERR is your enemy – it will cause your job to fail rather completely
    – STDERR is your best friend. No errors and failed jobs are rather frustrating
• Most of the work is transactional with AWS Elastic MapReduce
• I followed conventional advice, which is “move data to the nodes.”
    – This meant moving data into CSVs in S3 and importing the data into R via standard read methods
    – This also meant that my processes were database agnostic
    – JSON is a great way of structuring input and output between phases of the MapReduce process
        • To that effect, check out rjson – great package.
• In general, the following rule seems to apply:
    – Data frame bad.
    – Data table good.
        • http://cran.r-project.org/web/packages/data.table/index.html
• Packages to simplify R make my skin crawl
    – Ever see Jurassic Park?
    – Just a stubborn programmer – of course the logical extension leads me to contradiction. Never mind that I said that.
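A minimal sketch of the "data frame bad, data table good" rule, assuming the data.table package is installed; the toy sales table and column names are invented, but the keyed grouped aggregation is the kind of per-title work the forecasting job does:

```r
library(data.table)

# Toy unit-sales records keyed by product title (invented data)
sales <- data.table(
    title = c("A", "A", "B", "B", "B"),
    units = c(10, 20, 5, 5, 5)
)
setkey(sales, title)  # index the grouping column

# Grouped aggregation; at the 2.5MM-row scale this is markedly
# faster than the data.frame equivalents (aggregate/tapply)
totals <- sales[, .(total = sum(units)), by = title]
print(totals)
```

The win comes from data.table's keyed indexing and by-reference semantics, which avoid the repeated copying that data.frame operations incur.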
R Package to Utilize MapReduce
• Segue
    – Written by J.D. Long
    – http://www.cerebralmastication.com
        • P.s. We all realize that www is a subdomain, right? World Wide Web… is that really necessary?
    – Handles much of the transactional details and allows the use of Elastic MapReduce through apply() and lapply() wrappers
• Seems like this is a good tutorial too:
    – http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
Other stuff
• Distributed Cache
    – Load your data the smart way!
• Ruby Command Tools
    – Interact with AWS the smart way!
• Web interface
    – Simple.
    – Helpful when monitoring jobs when you wake up at 3:30AM and wonder “is my script still running?”
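As a sketch of what the Ruby command tools look like in practice, here is roughly how a streaming job with a bootstrap action gets submitted; the flags are from memory of the elastic-mapreduce CLI of that era, and every bucket and file name is a placeholder, so treat the whole invocation as an assumption to check against the current docs:

```shell
# Hypothetical streaming job submission (placeholder bucket/file names)
elastic-mapreduce --create --stream \
  --input            s3n://my-bucket/input/ \
  --output           s3n://my-bucket/output/ \
  --mapper           s3n://my-bucket/scripts/mapper.R \
  --reducer          s3n://my-bucket/scripts/reducer.R \
  --bootstrap-action s3n://my-bucket/scripts/bootstrap.sh \
  --num-instances 4
```

The bootstrap action here is where the R-upgrade script from the Specifics slide would run on each node before the job starts.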