6. Learn more about Spark
• http://spark.apache.org/documentation.html
– Great documentation (with video tutorials)
– June 2014 Conference http://spark-summit.org
• Keynote talk at Strata 2014
– Use cases by Yahoo and other companies
– http://youtu.be/KspReT2JjeE
– Matei Zaharia – core Spark developer, now at Databricks
– 30-minute detailed talk: http://youtu.be/nU6vO2EJAb4?t=20m42s
7. Motivation to use R
• Great community
– R: The most powerful and most widely used statistical software
– https://www.youtube.com/watch?v=TR2bHSJ_eck
• Statistics
• Packages
– There’s an R package for that
– Roger Peng, Johns Hopkins
– https://www.youtube.com/watch?v=yhTerzNFLbo
• Plots
16. Learn more about SparkR
• GitHub repository
– https://github.com/amplab-extras/SparkR-pkg
– How to install
– Examples
• An old but still good talk introducing SparkR
– http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a
– Shows an MNIST demo
18. Hands on Exercises
• http://spark-summit.org/2013/exercises/index.html
– Walk through the tutorial
– Set up a cluster on EC2
– Data exploration
– Stream processing with Spark Streaming
– Machine learning
19. Local box
• Start with a micro dev box using the latest public build on Amazon EC2
– spark.ami.pvm.v9 - ami-5bb18832
• Or start by just installing it on your laptop
– wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop1.tgz
• Add your AWS keys as environment variables (the spark-ec2 script reads these)
– export AWS_ACCESS_KEY_ID=
– export AWS_SECRET_ACCESS_KEY=
20. Run the examples
• Load pyspark and work interactively
– /root/spark-0.9.1-bin-hadoop1/bin/pyspark
– >>> help(sc)
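– e.g. >>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count() should return 50 (a hedged one-liner to confirm the shell's SparkContext works)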
• Estimate pi
– ./bin/pyspark python/examples/pi.py local[4] 20
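• The bundled pi.py performs, roughly, the following Monte Carlo estimate (a minimal sketch, assuming sc is the SparkContext the pyspark shell provides; NUM_SAMPLES is illustrative):
import random
NUM_SAMPLES = 100000
def inside(_):
    # Draw a random point in the unit square and keep it
    # if it lands inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return x * x + y * y < 1
# The fraction of points inside approximates pi/4
count = sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))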
21. Start Cluster
• Configure the cluster and start it
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem -s 1 launch spark-test-cluster
• Log onto the master
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem login spark-test-cluster
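• When finished, tear the cluster down to stop EC2 charges (a hedged example; destroy is one of the spark-ec2 actions, alongside launch and login)
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 destroy spark-test-cluster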
23. Run these Demos
• http://spark.apache.org/docs/latest/mllib-guide.html
– Talks about each of the algorithms
– Gives some demos in Scala
– More demos in Python
24. Clustering
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data (space-separated numeric vectors, one per line)
data = sc.textFile("data/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data into k=2 clusters)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=30, initializationMode="random")

# Evaluate clustering by computing the Within Set Sum of Squared Errors
def error(point):
    # Distance from the point to the center of its assigned cluster
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
26. Clustering Skullcandy Followers
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data; despite the .csv name, the parser below
# expects one space-separated numeric vector per line
data = sc.textFile("../skullcandy.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the followers into k=2 clusters)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=30, initializationMode="random")

# Evaluate clustering by computing the Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
29. Apply model to all followers
• predictions = parsedData.map(lambda follower: clusters.predict(follower))
• Save this out for visualization
– predictions.saveAsTextFile("predictions.csv")
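• Note: saveAsTextFile creates a directory named predictions.csv containing part files, not a single CSV file
• To eyeball a few labeled followers on the driver first (a hedged sketch that recomputes each point's cluster id, just as the WSSSE step above does):
labeled = parsedData.map(lambda follower: (clusters.predict(follower), list(follower)))
print(labeled.take(5))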