2. Outline
I. Hadoop/MapReduce Recap and Limitations
II. Complex Workflows and RDDs
III. The Spark Framework
IV. Spark on Gordon
V. Practical Limitations of Spark
3. Map/Reduce Parallelism
[Figure: independent map tasks (task 0 through task 5), each operating on its own block of data]
6. Shuffle/Sort
1. Map – convert raw input into key/value pairs. Output to local disk ("spill")
2. Shuffle/Sort – All reducers retrieve all spilled records from all mappers over the network
3. Reduce – For each unique key, do something with all the corresponding values. Output to HDFS
[Figure: three Map tasks spill to local disk; the Shuffle/Sort moves spilled records to three Reduce tasks]
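To make the three steps concrete, here is a minimal, self-contained word-count sketch in plain Python that mimics the map, shuffle/sort, and reduce phases on a single machine; the function names and the in-memory sort are illustrative stand-ins, not part of Hadoop itself.
import sys
from itertools import groupby

def mapper(lines):
    """Map: convert raw lines of input into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(sorted_pairs):
    """Reduce: for each unique key, combine all of its values (here, sum them)."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))

# The shuffle/sort phase is emulated by sorting the mapped pairs by key; real
# MapReduce instead spills pairs to local disk and moves them over the network.
pairs = sorted(mapper(sys.stdin), key=lambda pair: pair[0])
for word, count in reducer(pairs):
    print('%s\t%d' % (word, count))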
7. MapReduce: Two Fundamental Limitations
1. MapReduce prescribes workflow.
• You map, then you reduce.
• You cannot reduce, then map...
• ...or anything else. See first point.
2. Full* data dump to disk between workflow steps.
• Mappers deliver output on local disk (mapred.local.dir)
• Reducers pull input over network from other nodes' local disks
• Output goes right back to local disks via HDFS
* Combiners do local reductions to prevent a full, unreduced dump of data to local disk
[Figure: Map tasks spill to local disk, the Shuffle/Sort moves data over the network, and Reduce tasks write output to HDFS]
8. Beyond MapReduce
• What if workflow could be arbitrary in length?
• map-map-reduce
• reduce-map-reduce
• What if higher-level map/reduce operations
could be applied?
• sampling or filtering of a large dataset
• mean and variance of a dataset
• sum/subtract all elements of a dataset
• SQL JOIN operator
9. Beyond MapReduce: Complex
Workflows
• What if workflow could be arbitrary in length?
• map-map-reduce
• reduce-map-reduce
How can you do this without flushing intermediate
results to disk after every operation?
• What if higher-level map/reduce operations
could be applied?
• sampling or filtering of a large dataset
• mean and variance of a dataset
• sum/subtract all elements of a dataset
• SQL JOIN operator
How can you ensure fault tolerance for all of these
baked-in operations?
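As a sketch of what such higher-level operations look like once a framework provides them, here are PySpark equivalents (PySpark syntax and the SparkContext sc are introduced later in this deck; the small RDDs built with sc.parallelize are made-up examples):
# A small example RDD of numbers
numbers = sc.parallelize(range(1000))

# Sampling or filtering a large dataset
sample = numbers.sample(False, 0.1)            # ~10% sample without replacement
evens = numbers.filter(lambda x: x % 2 == 0)

# Mean and variance of a dataset
mean, variance = numbers.mean(), numbers.variance()

# Sum all elements of a dataset
total = numbers.sum()

# SQL JOIN operator on key/value RDDs
left = sc.parallelize([('a', 1), ('b', 2)])
right = sc.parallelize([('a', 'x'), ('b', 'y')])
joined = left.join(right)                      # [('a', (1, 'x')), ('b', (2, 'y'))]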
10. MapReduce Fault Tolerance
Mapper Failure:
1. Re-run map task and spill to disk
2. Block until finished
3. Reducers proceed as normal
Reducer Failure:
1. Re-fetch spills from all mappers' disks
2. Re-run reducer task
11. Performing Complex Workflows
How can you do complex workflows without
flushing intermediate results to disk after every
operation?
1. Cache intermediate results in-memory
2. Allow users to specify persistence in memory and
partitioning of dataset across nodes
How can you ensure fault tolerance?
1. Coarse-grained atomicity via partitions (transform
chunks of data, not record-by-record)
2. Use transaction logging instead of replication
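A minimal PySpark sketch of points 1 and 2 above, caching an intermediate result and specifying its partitioning (the toy RDD and the partition count are illustrative assumptions):
# Build a small key/value RDD and control how it is partitioned across nodes
intermediate = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
partitioned = intermediate.partitionBy(16)   # keep all records for a key together
partitioned.persist()                        # hold the partitions in memory for reuse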
12. Resilient Distributed Dataset (RDD)
• Comprised of distributed, atomic partitions of elements
• Apply transformations to generate new RDDs
• RDDs are immutable (read-only)
• RDDs can only be created from persistent storage (e.g.,
HDFS, POSIX, S3) or by transforming other RDDs
# Create an RDD from a file on HDFS
text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt')
# Transform the RDD of lines into an RDD of words (one word per element)
words = text.flatMap( lambda line: line.split() )
# Transform the RDD of words into an RDD of key/value pairs
keyvals = words.map( lambda word: (word, 1) )
sc is a SparkContext object that describes our Spark cluster
lambda declares a "lambda function" in Python (an anonymous function, as in Perl and other languages)
14. RDD Transformation vs. Action
• Transformations are lazy: nothing actually happens when
this code is evaluated
• RDDs are computed only when an action is called on
them, e.g.,
• Calculate statistics over the elements of an RDD (count, mean)
• Save the RDD to a file (saveAsTextFile)
• Reduce elements of an RDD into a single object or value (reduce)
• Allows you to define partitioning/caching behavior after
defining the RDD but before calculating its contents
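A short sketch of this ordering, reusing the text RDD from the earlier RDD slide: every line below is lazy except the final action, which triggers the whole pipeline.
# Transformations only: nothing is read or computed yet
words = text.flatMap(lambda line: line.split())
keyvals = words.map(lambda word: (word, 1))

# Declare caching behavior before anything is materialized
keyvals.cache()

# The action finally forces evaluation of the pipeline
print(keyvals.count())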
15. RDD Transformation vs. Action
• Must insert an action to get the pipeline to execute.
• Actions create files or objects:
# The saveAsTextFile action dumps the contents of an RDD to disk
>>> rdd.saveAsTextFile('hdfs://master.ibnet0/user/glock/output.txt')
# The count action returns the number of elements in an RDD
>>> num_elements = rdd.count()
>>> num_elements
215136
>>> type(num_elements)
<type 'int'>
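The reduce action mentioned on the previous slide behaves the same way; a minimal sketch, assuming the keyvals RDD of (word, 1) pairs defined earlier:
# reduce folds all elements of an RDD into a single value; here, summing the
# 1s attached to every word gives a total word count
>>> total_words = keyvals.map(lambda pair: pair[1]).reduce(lambda a, b: a + b)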
16. Resiliency: The 'R' in 'RDD'
• No replication of in-memory data
• Restrict transformations to coarse granularity
• Partition-level operations simplify data lineage
17. Resiliency: The 'R' in 'RDD'
• Reconstruct missing data from its lineage
• Data in RDDs are deterministic since partitions
are immutable and atomic
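One way to inspect the lineage Spark would replay after a failure is the RDD's toDebugString method; a minimal sketch, assuming the keyvals RDD from the earlier slide:
# Print the chain of transformations (the lineage) that Spark would use to
# recompute any lost partitions of this RDD
>>> print(keyvals.toDebugString())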
18. Resiliency: The 'R' in 'RDD'
• Long lineages or complex interactions
(reductions, shuffles) can be checkpointed
• RDD immutability makes checkpointing nonblocking (it happens in the background)
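A minimal sketch of explicit checkpointing in PySpark; the HDFS checkpoint path and the reuse of the keyvals RDD are assumptions for illustration:
# Tell Spark where checkpointed partitions should be written
sc.setCheckpointDir('hdfs://master.ibnet0:54310/user/glock/checkpoints')

# Mark an RDD with a long or complex lineage for checkpointing; because RDDs
# are immutable, the checkpoint can be written without blocking new work
keyvals.checkpoint()
keyvals.count()   # the next action materializes the RDD and saves the checkpoint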
19. Introduction to Spark
SPARK: AN IMPLEMENTATION
OF RDDS
20. Spark Framework
• Master/worker model
• Spark Master is analogous to the Hadoop JobTracker (MRv1)
or ApplicationMaster (MRv2)
• Spark Worker is analogous to the Hadoop TaskTracker
• Relies on "3rd party" storage for RDD generation
(hdfs://, s3n://, file://, http://)
• Spark clusters take three forms:
• Standalone mode - workers communicate directly with
master via spark://master:7077 URI
• Mesos - mesos://master:5050 URI
• YARN - no HA; complicated job launch
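In standalone mode, the spark:// URI is what the driver uses when it creates its SparkContext; a minimal PySpark sketch (the host name is hypothetical):
from pyspark import SparkConf, SparkContext

# Point the driver at a standalone-mode Spark Master rather than Mesos or YARN
conf = SparkConf().setMaster('spark://master:7077').setAppName('rdd-example')
sc = SparkContext(conf=conf)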
21. Spark on Gordon: Configuration
1. Standalone mode is the simplest configuration
and execution model (similar to MRv1)
2. Leverage existing HDFS support in myHadoop
for storage
3. Combine #1 and #2 to extend myHadoop to
support Spark:
$ export HADOOP_CONF_DIR=/home/glock/hadoop.conf
$ myhadoop-configure.sh
...
myHadoop: Enabling experimental Spark support
myHadoop: Using SPARK_CONF_DIR=/home/glock/hadoop.conf/spark
myHadoop:
To use Spark, you will want to type the following commands:
source /home/glock/hadoop.conf/spark/spark-env.sh
myspark start
22. Spark on Gordon: Storage
• Spark can use HDFS
$ start-dfs.sh # after you run myhadoop-configure.sh, of course
...
$ pyspark
>>> mydata = sc.textFile('hdfs://localhost:54310/user/glock/mydata.txt')
>>> mydata.count()
982394
• Spark can use POSIX file systems too
$ pyspark
>>> mydata = sc.textFile('file:///oasis/scratch/glock/temp_project/mydata.txt')
>>> mydata.count()
982394
• S3 Native (s3n://) and HTTP (http://) also work
• file:// input will be served in chunks to Spark
workers via the Spark driver's built-in httpd
23. Spark on Gordon: Running
Spark treats several languages as first-class
citizens:
Feature        Scala   Java   Python
Interactive    YES     NO     YES
Shark (SQL)    YES     YES    YES
Streaming      YES     YES    NO
MLlib          YES     YES    YES
GraphX         YES     YES    NO
R is a second-class citizen; basic RDD API is
available outside of CRAN
(http://amplab-extras.github.io/SparkR-pkg/)
24. myHadoop/Spark on Gordon (1/2)
#!/bin/bash
#PBS -l nodes=2:ppn=16:native:flash
#PBS -l walltime=00:30:00
#PBS -q normal
### Environment setup for Hadoop
export MODULEPATH=/home/glock/apps/modulefiles:$MODULEPATH
module load hadoop/2.2.0
export HADOOP_CONF_DIR=$HOME/mycluster.conf
myhadoop-configure.sh
### Start HDFS. Starting YARN isn't necessary since Spark will be running in
### standalone mode on our cluster.
start-dfs.sh
### Load in the necessary Spark environment variables
source $HADOOP_CONF_DIR/spark/spark-env.sh
### Start the Spark masters and workers. Do NOT use the start-all.sh provided
### by Spark, as it does not correctly honor $SPARK_CONF_DIR
myspark start
25. myHadoop/Spark on Gordon (2/2)
### Run our example problem.
### Step 1. Load data into HDFS (Hadoop 2.x does not make the user's HDFS home
### dir by default which is different from Hadoop 1.x!)
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -put /home/glock/hadoop/run/gutenberg.txt /user/$USER/gutenberg.txt
### Step 2. Run our Python Spark job. Note that Spark implicitly requires
### Python 2.6 (some features, like MLlib, require 2.7)
module load python scipy
/home/glock/hadoop/run/wordcount-spark.py
### Step 3. Copy output back out
hdfs dfs -get /user/$USER/output.dir $PBS_O_WORKDIR/
### Shut down Spark and HDFS
myspark stop
stop-dfs.sh
### Clean up
myhadoop-cleanup.sh
Wordcount submit script and Python code online:
https://github.com/glennklockwood/sparktutorial
27. Major Problems with Spark
1. Still smells like a CS project
2. Debugging is a dark art
3. Not battle-tested at scale
28. #1: Spark Smells Like CS
• Components are constantly breaking
• Graph.partitionBy broken in 1.0.0 (SPARK-1931)
• Some components never worked
• SPARK_CONF_DIR (start-all.sh) doesn't work (SPARK-2058)
• stop-master.sh doesn't work
• Spark with YARN will break with large data sets (SPARK-2398)
• spark-submit for standalone mode doesn't work (SPARK-2260)
29. #1: Spark Smells Like CS
• Really obvious usability issues:
>>> data = sc.textFile('file:///oasis/scratch/glock/temp_project/gutenberg.txt')
>>> data.saveAsTextFile('hdfs://gcn-8-42.ibnet0:54310/user/glock/output.dir')
14/04/30 16:23:07 ERROR Executor: Exception in task ID 19
scala.MatchError: 0 (of class java.lang.Integer)
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
...
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Read an RDD, then write it out = unhandled exception with
cryptic Scala errors from Python (SPARK-1690)
30. #2: Debugging is a Dark Art
>>> data.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/N/u/glock/apps/spark-0.9.0/python/pyspark/rdd.py", line 682, in
saveAsTextFile
keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-
src.zip/py4j/java_gateway.py", line 537, in __call__
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with
client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy7.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
Cause: Spark built against Hadoop 2 DFS trying to access data
on Hadoop 1 DFS
31. #2: Debugging is a Dark Art
>>> data.count()
14/04/30 16:15:11 ERROR Executor: Exception in task ID 12
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/worker.py",
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
self.serializer.dump_stream(self._batched(iterator), stream)
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
for obj in iterator:
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
for item in iterator:
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", lin
if acc is None:
TypeError: an integer is required
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
...
Cause: Master was using Python 2.6, but workers were only
able to find Python 2.4
32. #2: Debugging is a Dark Art
>>> data.saveAsTextFile('hdfs://user/glock/output.dir/')
14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.p
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/worker.py",
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers.
for obj in iterator:
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", lin
if not isinstance(x, basestring):
SystemError: unknown opcode
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
...
Cause: Master was using Python 2.6, but workers were only
able to find Python 2.4
33. #2: Spark Debugging Tips
• $SPARK_LOG_DIR/app-* contains master/worker
logs with failure information
• Try to find the salient error amidst the stack traces
• Google that error; odds are it is a known issue
• Put any required environment variables ($PATH, $PYTHONPATH, $JAVA_HOME) in $SPARK_CONF_DIR/spark-env.sh to rule out these problems
• If all else fails, look at the Spark source code
34. #3: Spark Isn't Battle Tested
• Companies (Cloudera, SAP, etc.) are jumping on the Spark bandwagon with disclaimers about scaling
• Spark does not handle multitenancy well at all. Wait scheduling is considered the best way to achieve memory/disk data locality
• Largest Spark clusters are on the order of hundreds of nodes
35. Spark Take-Aways
• FACTS
• Data is represented as resilient distributed datasets
(RDDs) which remain in-memory and read-only
• RDDs are comprised of elements
• Elements are distributed across physical nodes in user-defined
groups called partitions
• RDDs are subject to transformations and actions
• Fault tolerance achieved by lineage, not replication
• Opinions
• Spark is still in its infancy, but its progress is promising
• Good for evaluating; good for Gordon and Comet
37. Lazy Evaluation + In-Memory Caching =
Optimized JOIN Operations
Start every webpage with a rank R = 1.0
1. For each webpage that links to N neighbor webpages, have it "contribute" R/N to each of those N neighbors
2. Then, for each webpage, set its rank R to (0.15 + 0.85 * contributions)
3. Repeat
38. Lazy Evaluation + In-Memory Caching =
Optimized JOIN Operations
from operator import add   # reduceByKey(add) below needs this import

lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/links.txt')

# Load key/value pairs of (url, link), eliminate duplicates, and partition them such
# that all common keys are kept together. Then retain this RDD in memory.
links = lines.map(lambda urls: urls.split()).distinct().groupByKey().cache()

# Create a new RDD of key/value pairs of (url, rank) and initialize all ranks to 1.0
ranks = links.map(lambda (url, neighbors): (url, 1.0))

# Calculate and update URL ranks
for iteration in range(10):
    # Calculate URL contributions to their neighbors
    contribs = links.join(ranks).flatMap(
        lambda (url, (urls, rank)): computeContribs(urls, rank))
    # Recalculate URL ranks based on neighbor contributions
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: 0.15 + 0.85*rank)

# Print all URLs and their ranks
for (link, rank) in ranks.collect():
    print '%s has rank %s' % (link, rank)
Editor's notes
groupByKey: group the values for each key in the RDD into a single sequence
mapValues: apply map function to all values of key/value pairs without modifying keys (or their partitioning)
collect: return a list containing all elements of the RDD
def computeContribs(urls, rank):
    """Calculates URL contributions to the rank of other URLs."""
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)