6. Learn more about Spark
• http://spark.apache.org/documentation.html
– Great documentation (with video tutorials)
– June 2014 Conference http://spark-summit.org
• Keynote talk at Strata 2014
– Use cases by Yahoo and other companies
– http://youtu.be/KspReT2JjeE
– Matei Zaharia – core Spark developer, now at Databricks
– 30-minute detailed talk: http://youtu.be/nU6vO2EJAb4?t=20m42s
7. Motivation to use R
• Great community
– R: The most powerful and most widely used statistical software
– https://www.youtube.com/watch?v=TR2bHSJ_eck
• Statistics
• Packages
– There’s an R package for that
– Roger Peng, Johns Hopkins
– https://www.youtube.com/watch?v=yhTerzNFLbo
• Plots
16. Learn more about SparkR
• GitHub repository
– https://github.com/amplab-extras/SparkR-pkg
– How to install
– Examples
• An old but still good talk introducing SparkR
– http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a
– Shows an MNIST demo
18. Hands on Exercises
• http://spark-summit.org/2013/exercises/index.html
– Walk through the tutorial
– Set up a cluster on EC2
– Data exploration
– Stream processing with Spark Streaming
– Machine learning
19. Local box
• Start with a micro dev box using the latest public build on Amazon EC2
– spark.ami.pvm.v9 - ami-5bb18832
• Or start by just installing it on your laptop
– wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop1.tgz
• Add your AWS keys as environment variables (the spark-ec2 script reads these)
– export AWS_ACCESS_KEY_ID=
– export AWS_SECRET_ACCESS_KEY=
20. Run the examples
• Load pyspark and work interactively
– /root/spark-0.9.1-bin-hadoop1/bin/pyspark
– >>> help(sc)
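– e.g. >>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count() should return 50 (a hedged one-liner to confirm the shell's SparkContext works)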
• Estimate pi
– ./bin/pyspark python/examples/pi.py local[4] 20
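• The bundled pi.py performs, roughly, the following Monte Carlo estimate (a minimal sketch, assuming sc is the SparkContext the pyspark shell provides; NUM_SAMPLES is illustrative):
import random
NUM_SAMPLES = 100000
def inside(_):
    # Draw a random point in the unit square and keep it
    # if it lands inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return x * x + y * y < 1
# The fraction of points inside approximates pi/4
count = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))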
21. Start Cluster
• Configure the cluster and start it
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem -s 1 launch spark-test-cluster
• Log onto the master
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem login spark-test-cluster
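• When finished, tear the cluster down to stop EC2 charges (a hedged example; destroy is one of the spark-ec2 actions, alongside launch and login)
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 destroy spark-test-cluster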
23. Run these Demos
• http://spark.apache.org/docs/latest/mllib-guide.html
– Talks about each of the algorithms
– Gives some demos in Scala
– More demos in Python
24. Clustering
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data (space-separated numeric vectors, one per line)
data = sc.textFile("data/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data into k=2 clusters)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=30, initializationMode="random")

# Evaluate clustering by computing the Within Set Sum of Squared Errors
def error(point):
    # Distance from the point to the center of its assigned cluster
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
26. Clustering Skullcandy Followers
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data; despite the .csv name, the parser below
# expects one space-separated numeric vector per line
data = sc.textFile("../skullcandy.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the followers into k=2 clusters)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=30, initializationMode="random")

# Evaluate clustering by computing the Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
29. Apply model to all followers
• predictions = parsedData.map(lambda follower: clusters.predict(follower))
• Save this out for visualization
– predictions.saveAsTextFile("predictions.csv")
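• Note: saveAsTextFile creates a directory named predictions.csv containing part files, not a single CSV file
• To eyeball a few labeled followers on the driver first (a hedged sketch that recomputes each point's cluster id, just as the WSSSE step above does):
labeled = parsedData.map(lambda follower: (clusters.predict(follower), list(follower)))
print(labeled.take(5))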