Running R on the Amazon Cloud
Ian Cook
Raleigh-Durham-Chapel Hill R Users Group
June 20, 2013
Why?
• Some R jobs are RAM- and CPU-intensive
• Powerful hardware is expensive to buy
• Institutional cluster compute resources can be
difficult to procure and to use
• Amazon Web Services (AWS) provides a
fast, cheap, and easy way to use
computational resources in the cloud
• AWS offers a free usage tier that you can use
to try this: http://aws.amazon.com/free/
What is AWS?
• A collection of cloud computing services
• Billed based on usage
• The best-known AWS service is Amazon
Elastic Compute Cloud (EC2) which provides
scalable virtual private servers
• Other AWS services include Elastic
MapReduce (EMR) (a hosted Hadoop service)
and Simple Storage Service (S3) (for online file
storage)
How much RAM/CPU can I use on EC2?
• Up to 32 virtual CPU cores per instance
• Up to 244 GB RAM per instance
• Can distribute a task across multiple instances
• Can resize instances (start small, grow as
needed)
• Instance details at
http://aws.amazon.com/ec2/instance-types/instance-details/
• Pricing at
http://aws.amazon.com/ec2/pricing/#on-demand
When not to use AWS?
• It is often cheaper, easier, and more elegant to
use tools and techniques to make your R code
less RAM- and CPU-intensive:
– R package bigmemory allows analysis of datasets
larger than available RAM
http://www.bigmemory.org/
– R package data.table enables faster operations on
large data
http://cran.r-project.org/web/packages/data.table/index.html
– Good R programming techniques (e.g. vectorization)
can make your code run drastically faster on just one
CPU core
http://www.noamross.net/blog/2013/4/25/faster-talk.html
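As a minimal sketch of the vectorization point above: the same quantity can be computed with an explicit loop or with one vectorized expression. The functions sum_squares_loop and sum_squares_vec and the test vector are hypothetical illustrations, not part of the original talk.

```r
# Hypothetical example: sum of squares of a numeric vector.

# Loop version: one interpreted iteration per element.
sum_squares_loop <- function(v) {
  total <- 0
  for (i in seq_along(v)) {
    total <- total + v[i]^2
  }
  total
}

# Vectorized version: the arithmetic runs in compiled code.
sum_squares_vec <- function(v) sum(v^2)

v <- as.numeric(1:100000)
system.time(r1 <- sum_squares_loop(v))
system.time(r2 <- sum_squares_vec(v))
all.equal(r1, r2)
```

Timings vary by machine, so none are quoted here; the point is only that both versions return the same value.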
More ways to speed up R code
• Rewrite key functions in C++ for much
improved performance, and use Dirk
Eddelbuettel’s Rcpp package to embed the
C++ code in your R program:
– http://dirk.eddelbuettel.com/code/rcpp.html
– https://github.com/hadley/devtools/wiki/Rcpp
• Radford Neal’s pqR is a faster version of R
– http://radfordneal.wordpress.com/2013/06/22/announcing-pqr-a-faster-version-of-r/
Free Commercial R Distributions
• Two (very different) commercial distributions of R
are freely available. Both have much improved
performance vs. plain R in many cases
– Revolution R
An enhanced distribution of open source R with an IDE
http://www.revolutionanalytics.com/products/revolution-r.php
– TIBCO Enterprise Runtime for R
A high-performance R-compatible statistical engine
http://spotfire.tibco.com/en/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-enterprise-runtime-for-r-terr.aspx
RStudio Server AMIs
• Louis Aslett maintains a set of Amazon
Machine Images (AMIs) available for anyone
to use
• These AMIs include the latest versions of R
and RStudio Server on Ubuntu
• These AMIs make it very fast and easy to use R
on EC2
• Thanks Louis!
Launch EC2 Instance
• Sign up for an AWS account at
https://portal.aws.amazon.com/gp/aws/developer/registration/index.html
• Go to http://www.louisaslett.com/RStudio_AMI/
and click the AMI for your region (e.g. US East, Virginia)
• Complete the process to launch the instance
– Choose instance type t1.micro for free usage tier
– Open port 80, and optionally port 22 (to use SSH)
– After you finish, the instance may take about 5 minutes to launch
Use RStudio on EC2 Instance
• Copy the “Public DNS” for your EC2 instance into
your web browser address field
(e.g. ec2-xx-xx-xxx-xxx.compute-1.amazonaws.com)
• Log in with username rstudio and password
rstudio and start using RStudio
• Remember to stop your instance when finished
• Video instructions at
http://www.louisaslett.com/RStudio_AMI/video_guide.html
How to use all those CPU cores?
• R package parallel enables some tasks in R to run
in parallel across multiple CPU cores
– This is explicit parallelism—the task must be
parallelizable
– CPU cores can be on one machine or across multiple
machines
• The parallel package has been included directly in
R since version 2.14.0. It derives from the two R
packages snow and multicore.
• http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
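A minimal sketch of explicit parallelism with the parallel package, assuming a machine that can start two local worker processes. The function slow_square and the choice of two workers are illustrative assumptions, not from the talk.

```r
library(parallel)

# Hypothetical task: each call is independent of the others,
# which is what makes the job parallelizable.
slow_square <- function(x) {
  Sys.sleep(0.1)   # stand-in for real work
  x^2
}

cl <- makeCluster(2)                      # two local worker processes
res <- parLapply(cl, 1:8, slow_square)    # split the inputs across the workers
stopCluster(cl)                           # always release the workers

unlist(res)
```

makeCluster can also be given a character vector of host names instead of a worker count, which is how the same code reaches cores on other machines; that setup depends on network and SSH configuration and is not shown here.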
Example: Parallel numerical integration
• Calculate the volume under a
three-dimensional function
• Adapted from the example in
Appendix B, part 4 of
http://www.jstatsoft.org/v31/i01/
“State of the Art in Parallel Computing with R.”
Schmidberger, Morgan, Eddelbuettel, Yu, Tierney, and Mansmann.
Journal of Statistical Software. August 2009, Volume 31, Issue 1.
Note that the paper by Schmidberger et al. was written before the package parallel was included in R.
The examples in the paper use other packages, including snow, that were precursors of the package parallel.
Example: Parallel numerical integration
Define a three-dimensional function and limits on its domain:
func <- function(x, y) x^3-3*x + y^3-3*y
xint <- c(-1, 2)
yint <- c(-1, 2)
Plot a figure of the function:
library(lattice)
g <- expand.grid(x = seq(xint[1], xint[2], 0.1),
                 y = seq(yint[1], yint[2], 0.1))
g$z <- func(g$x, g$y)
print( wireframe(z ~ x + y, data = g) )
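For the function and limits defined above, the exact volume can be worked out by hand, which gives a reference value for the numerical results on the following slides. This derivation is not part of the original slides.

```latex
\int_{-1}^{2}\!\int_{-1}^{2} \left(x^3 - 3x + y^3 - 3y\right)\,dx\,dy
  = 2 \cdot 3 \int_{-1}^{2} \left(t^3 - 3t\right)\,dt
  = 6\left[\frac{t^4}{4} - \frac{3t^2}{2}\right]_{-1}^{2}
  = 6\left(-2 + \frac{5}{4}\right)
  = -4.5
```

The factor 3 is the width of the interval [-1, 2], and the factor 2 appears because the x and y terms contribute equally.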
Example: Parallel numerical integration
Define the number of increments for integration
n <- 10000
Calculate with nested for loops (very slow!)
erg <- 0
xincr <- ( xint[2]-xint[1] ) / n
yincr <- ( yint[2]-yint[1] ) / n
for(xi in seq(xint[1], xint[2], length.out = n)){
  for(yi in seq(yint[1], yint[2], length.out = n)){
    box <- func(xi, yi) * xincr * yincr
    erg <- erg + box
  }
}
erg
Example: Parallel numerical integration
Use nested sapply (much faster)
applyfunc <- function(xrange, xint, yint, n, func)
{
  yrange <- seq(yint[1], yint[2], length.out = n)
  xincr <- ( xint[2]-xint[1] ) / n
  yincr <- ( yint[2]-yint[1] ) / n
  erg <- sum( sapply(xrange, function(x)
    sum( func(x, yrange) )
  ) ) * xincr * yincr
  return(erg)
}
xrange <- seq(xint[1], xint[2], length.out = n)
erg <- sapply(xrange, applyfunc, xint, yint, n, func)
sum(erg)
Example: Parallel numerical integration
Define a worker function for parallel calculation
workerfunc <-
  function(id, nworkers, xint, yint, n, func)
{
  xrange <- seq(xint[1], xint[2],
                length.out = n)[seq(id, n, nworkers)]
  yrange <- seq(yint[1], yint[2], length.out = n)
  xincr <- ( xint[2]-xint[1] ) / n
  yincr <- ( yint[2]-yint[1] ) / n
  erg <- sapply(xrange, function(x)
    sum( func(x, yrange ) )
  ) * xincr * yincr
  return( sum(erg) )
}
Example: Parallel numerical integration
Start a cluster of local R engines using all your CPU cores
library(parallel)
nworkers <- detectCores()
cluster <- makeCluster(nworkers)
Run the calculation in parallel (faster than serial calculation)
erg <- clusterApplyLB(cluster, 1:nworkers,
                      workerfunc, nworkers, xint, yint, n, func)
sum(unlist(erg))
Stop the cluster
stopCluster(cluster)
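The parallel steps above can be checked against a serial run of the same worker function. The sketch below is self-contained and makes two assumptions that differ from the slides: n is reduced to 2000 so it runs quickly, and the number of workers is fixed at 2 rather than taken from detectCores(). It also wraps the parallel call in tryCatch so the cluster is stopped even if the call fails.

```r
library(parallel)

func <- function(x, y) x^3 - 3*x + y^3 - 3*y
xint <- c(-1, 2)
yint <- c(-1, 2)
n <- 2000        # assumption: smaller than the slides' 10000
nworkers <- 2    # assumption: fixed; the slides use detectCores()

# Same logic as the slides: worker `id` takes every nworkers-th x value.
workerfunc <- function(id, nworkers, xint, yint, n, func) {
  xrange <- seq(xint[1], xint[2], length.out = n)[seq(id, n, nworkers)]
  yrange <- seq(yint[1], yint[2], length.out = n)
  xincr <- (xint[2] - xint[1]) / n
  yincr <- (yint[2] - yint[1]) / n
  sum(sapply(xrange, function(x) sum(func(x, yrange)))) * xincr * yincr
}

# Serial reference: run each worker's share in the current R session.
serial_total <- sum(sapply(seq_len(nworkers), workerfunc,
                           nworkers, xint, yint, n, func))

# Parallel run; `finally` stops the cluster whether or not the call succeeds.
cluster <- makeCluster(nworkers)
par_res <- tryCatch(
  clusterApplyLB(cluster, seq_len(nworkers), workerfunc,
                 nworkers, xint, yint, n, func),
  finally = stopCluster(cluster)
)
parallel_total <- sum(unlist(par_res))

all.equal(serial_total, parallel_total)
```

Both totals approximate the same Riemann sum, so they should agree up to floating-point rounding.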
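Vectorized Code
Use vectorized R code (the fastest method!)
xincr <- ( xint[2]-xint[1] ) / n
yincr <- ( yint[2]-yint[1] ) / n
# func is evaluated only on the n paired points (x[i], y[i]). Multiplying by n
# recovers the full grid sum because func(x, y) is the sum of an x-only term
# and a y-only term; this shortcut does not hold for a general func.
erg <- sum(
  func( seq(xint[1], xint[2], length.out = n),
        seq(yint[1], yint[2], length.out = n) )
) * xincr * yincr * n
erg
Refer back to slide: “When not to use AWS?” This problem is
best solved through vectorization instead of using larger
computational resources.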
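For a function that is not a sum of separate x and y terms, one vectorized alternative is outer(), which evaluates the function on every pair of grid points. This is a sketch that is not part of the original slides; it assumes a reduced n of 1000, because outer() builds an n-by-n matrix and memory use grows with n squared.

```r
func <- function(x, y) x^3 - 3*x + y^3 - 3*y
xint <- c(-1, 2)
yint <- c(-1, 2)
n <- 1000   # assumption: reduced from 10000 to keep the n-by-n matrix small

x <- seq(xint[1], xint[2], length.out = n)
y <- seq(yint[1], yint[2], length.out = n)
xincr <- (xint[2] - xint[1]) / n
yincr <- (yint[2] - yint[1]) / n

# General form: evaluate func on every (x, y) pair of the grid.
erg_outer <- sum(outer(x, y, func)) * xincr * yincr

# Paired-points form from the slide, valid here because func separates
# into an x-only term plus a y-only term.
erg_paired <- sum(func(x, y)) * xincr * yincr * n

all.equal(erg_outer, erg_paired)
```

The two results agree for this func; for a function with cross terms such as x*y, only the outer() form computes the grid sum.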
Reminder to Stop EC2 Instances
• Stop your EC2 instances after use to avoid
charges
– After one year free usage of one micro
instance, running one micro instance 24x7 will
result in charges of about $15/month
• If you use EC2 regularly, configure CloudWatch
alarms to notify you or stop your instances
automatically after a period of low CPU
utilization
R with Amazon Elastic MapReduce
• The R package segue provides an integration
with Amazon Elastic MapReduce (EMR) for
simple parallel computation
– https://code.google.com/p/segue/
– http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
Other Useful Links
• CRAN Task View: High-Performance and
Parallel Computing with R:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
• R package AWS.tools:
http://cran.r-project.org/web/packages/AWS.tools/index.html
Join the Raleigh-Durham-Chapel Hill R Users Group at:
http://www.meetup.com/Triangle-useR/
