Introduction to Spark 
Glenn K. Lockwood 
July 2014 
SAN DIEGO SUPERCOMPUTER CENTER
Outline 
I. Hadoop/MapReduce Recap and Limitations 
II. Complex Workflows and RDDs 
III. The Spark Framework 
IV. Spark on Gordon 
V. Practical Limitations of Spark 
SAN DIEGO SUPERCOMPUTER CENTER
Map/Reduce Parallelism 
[Diagram: input data blocks distributed across independent parallel tasks (task 0 through task 5)]
SAN DIEGO SUPERCOMPUTER CENTER
Magic of HDFS 
SAN DIEGO SUPERCOMPUTER CENTER
Hadoop Workflow 
SAN DIEGO SUPERCOMPUTER CENTER
MapReduce Disk Spill
1. Map – convert raw input into key/value pairs. Output to local disk ("spill")
2. Shuffle/Sort – all reducers retrieve all spilled records from all mappers over the network
3. Reduce – for each unique key, do something with all the corresponding values. Output to HDFS
[Diagram: Map → Shuffle/Sort → Reduce data flow]
SAN DIEGO SUPERCOMPUTER CENTER
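To make the three steps concrete, here is a minimal wordcount sketch in the Hadoop Streaming style (illustrative only, not part of the original slides): the mapper turns input lines into (word, 1) pairs, the framework shuffles and sorts them by key, and the reducer sums the values for each word.

import sys

def mapper():
    # Map: convert raw input lines into (word, 1) key/value pairs on stdout
    for line in sys.stdin:
        for word in line.split():
            print '%s\t%d' % (word, 1)

def reducer():
    # Reduce: input arrives grouped by key (courtesy of the shuffle/sort),
    # so the values for each word can be summed in a single streaming pass
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip('\n').split('\t')
        if word != current_word:
            if current_word is not None:
                print '%s\t%d' % (current_word, count)
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print '%s\t%d' % (current_word, count)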
MapReduce: Two Fundamental Limitations
1. MapReduce prescribes the workflow.
• You map, then you reduce.
• You cannot reduce, then map...
• ...or anything else. See first point.
2. Full* data dump to disk between workflow steps.
• Mappers deliver output on local disk (mapred.local.dir)
• Reducers pull input over the network from other nodes' local disks
• Output goes right back to local disks via HDFS
* Combiners do local reductions to prevent a full, unreduced dump of data to local disk
[Diagram: Map → Shuffle/Sort → Reduce]
SAN DIEGO SUPERCOMPUTER CENTER
Beyond MapReduce 
• What if workflow could be arbitrary in length? 
• map-map-reduce 
• reduce-map-reduce 
• What if higher-level map/reduce operations 
could be applied? 
• sampling or filtering of a large dataset 
• mean and variance of a dataset 
• sum/subtract all elements of a dataset 
• SQL JOIN operator (see the sketch after this slide)
SAN DIEGO SUPERCOMPUTER CENTER
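Each of these higher-level operations corresponds to a built-in RDD method in Spark. A minimal PySpark sketch, assuming an existing SparkContext sc (the datasets below are hypothetical):

# Hypothetical numeric dataset
nums = sc.parallelize(range(1000))

sampled  = nums.sample(False, 0.1)             # sampling (10%, without replacement)
filtered = nums.filter(lambda x: x % 2 == 0)   # filtering
mean     = nums.mean()                         # mean of the dataset
var      = nums.variance()                     # variance of the dataset
total    = nums.sum()                          # sum of all elements

# SQL-style JOIN on two key/value RDDs
left   = sc.parallelize([('a', 1), ('b', 2)])
right  = sc.parallelize([('a', 'x'), ('b', 'y')])
joined = left.join(right)                      # joined.collect() -> pairs like ('a', (1, 'x'))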
Beyond MapReduce: Complex 
Workflows 
• What if workflow could be arbitrary in length? 
• map-map-reduce 
• reduce-map-reduce 
How can you do this without flushing intermediate 
results to disk after every operation? 
• What if higher-level map/reduce operations 
could be applied? 
• sampling or filtering of a large dataset 
• mean and variance of a dataset 
• sum/subtract all elements of a dataset 
• SQL JOIN operator 
How can you ensure fault tolerance for all of these 
baked-in operations? 
SAN DIEGO SUPERCOMPUTER CENTER
MapReduce Fault Tolerance
Mapper Failure:
1. Re-run map task and spill to disk
2. Block until finished
3. Reducers proceed as normal
Reducer Failure:
1. Re-fetch spills from all mappers' disks
2. Re-run reducer task
[Diagram: three Map tasks feeding three Reduce tasks]
SAN DIEGO SUPERCOMPUTER CENTER
Performing Complex Workflows 
How can you do complex workflows without 
flushing intermediate results to disk after every 
operation? 
1. Cache intermediate results in-memory 
2. Allow users to specify persistence in memory and 
partitioning of the dataset across nodes (see the sketch after this slide) 
How can you ensure fault tolerance? 
1. Coarse-grained atomicity via partitions (transform 
chunks of data, not record-by-record) 
2. Use transaction logging--forget replication 
SAN DIEGO SUPERCOMPUTER CENTER
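A minimal PySpark sketch of both ideas, assuming an existing SparkContext sc (the input path and names below are hypothetical):

# Build a key/value RDD from a hypothetical input file
pairs = sc.textFile('hdfs://master.ibnet0/user/glock/input.txt') \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word, 1))

# Hash-partition the RDD across 64 partitions and keep it resident in memory,
# so later stages reuse the in-memory partitions instead of recomputing from disk
partitioned = pairs.partitionBy(64).cache()

# Subsequent operations run against the cached, partitioned data
totals = partitioned.reduceByKey(lambda a, b: a + b)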
Resilient Distributed Dataset (RDD) 
• Comprised of distributed, atomic partitions of elements 
• Apply transformations to generate new RDDs 
• RDDs are immutable (read-only) 
• RDDs can only be created from persistent storage (e.g., 
HDFS, POSIX, S3) or by transforming other RDDs 
# Create an RDD from a file on HDFS 
text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') 
# Transform the RDD of lines into an RDD of words (one word per element) 
words = text.flatMap( lambda line: line.split() ) 
# Transform the RDD of words into an RDD of key/value pairs 
keyvals = words.map( lambda word: (word, 1) ) 
sc is a SparkContext object that describes our Spark cluster 
lambda declares a "lambda function" in Python (akin to an anonymous function in Perl and other languages) 
SAN DIEGO SUPERCOMPUTER CENTER
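Continuing the wordcount above, an aggregation is itself just another transformation; a minimal sketch using the keyvals RDD defined on this slide:

# Transform the (word, 1) pairs into (word, count) pairs.
# This is still only a transformation -- nothing has been computed yet.
counts = keyvals.reduceByKey( lambda a, b: a + b )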
Potential RDD Workflow 
SAN DIEGO SUPERCOMPUTER CENTER
RDD Transformation vs. Action 
• Transformations are lazy: nothing actually happens when 
this code is evaluated 
• RDDs are computed only when an action is called on 
them, e.g., 
• Calculate statistics over the elements of an RDD (count, mean) 
• Save the RDD to a file (saveAsTextFile) 
• Reduce elements of an RDD into a single object or value (reduce) 
• Allows you to define partitioning/caching behavior after 
defining the RDD but before calculating its contents (sketched after this slide) 
SAN DIEGO SUPERCOMPUTER CENTER
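For example, picking up the counts RDD from the earlier wordcount sketch: caching can be declared while the RDD is still just a recipe, and only the first action triggers real work.

# Declare caching behavior before anything has been computed
counts.cache()

# First action: triggers the whole pipeline (textFile -> flatMap -> map -> reduceByKey)
num_unique_words = counts.count()

# Second action: reuses the in-memory copy rather than recomputing the pipeline
sample_counts = counts.take(5)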
RDD Transformation vs. Action 
• Must insert an action here to get pipeline to execute. 
• Actions create files or objects: 
# The saveAsTextFile action dumps the contents of an RDD to disk 
>>> rdd.saveAsTextFile('hdfs://master.ibnet0/user/glock/output.txt') 
# The count action returns the number of elements in an RDD 
>>> num_elements = rdd.count(); num_elements; type(num_elements) 
215136 
<type 'int'> 
SAN DIEGO SUPERCOMPUTER CENTER
Resiliency: The 'R' in 'RDD' 
• No replication of in-memory data 
• Restrict transformations to coarse granularity 
• Partition-level operations simplify data lineage 
SAN DIEGO SUPERCOMPUTER CENTER
Resiliency: The 'R' in 'RDD' 
• Reconstruct missing data from its lineage 
• Data in RDDs are deterministic since partitions 
are immutable and atomic 
SAN DIEGO SUPERCOMPUTER CENTER
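That lineage can be inspected directly; a minimal sketch, again assuming the counts RDD from the wordcount sketch:

# Print the chain of transformations (the lineage) behind this RDD.
# A lost partition is rebuilt by replaying exactly these steps on its inputs.
print counts.toDebugString()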
Resiliency: The 'R' in 'RDD' 
• Long lineages or complex interactions 
(reductions, shuffles) can be checkpointed 
• RDD immutability → checkpointing can be nonblocking (done in the background) 
SAN DIEGO SUPERCOMPUTER CENTER
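A minimal checkpointing sketch, assuming an existing SparkContext sc and a hypothetical HDFS directory:

# Tell Spark where durable checkpoints should live (hypothetical path)
sc.setCheckpointDir('hdfs://master.ibnet0/user/glock/checkpoints')

# Mark the RDD for checkpointing; it is written out the next time an action
# materializes it, which truncates its lineage at that point
counts.checkpoint()
counts.count()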
Introduction to Spark 
SPARK: AN IMPLEMENTATION 
OF RDDS 
SAN DIEGO SUPERCOMPUTER CENTER
Spark Framework 
• Master/worker Model 
• Spark Master is analogous to Hadoop Jobtracker (MRv1) 
or Application Master (MRv2) 
• Spark Worker is analogous to Hadoop Tasktracker 
• Relies on "3rd party" storage for RDD generation 
(hdfs://, s3n://, file://, http://) 
• Spark clusters take three forms: 
• Standalone mode - workers communicate directly with 
master via spark://master:7077 URI 
• Mesos - mesos://master:5050 URI 
• YARN - no HA; complicated job launch 
SAN DIEGO SUPERCOMPUTER CENTER
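For example, a driver can be pointed at a standalone-mode master from PySpark; a minimal sketch (the host and application names below are hypothetical):

from pyspark import SparkConf, SparkContext

# Connect to a standalone-mode Spark master
conf = SparkConf().setMaster('spark://master:7077').setAppName('MyApp')
sc = SparkContext(conf=conf)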
Spark on Gordon: Configuration 
1. Standalone mode is the simplest configuration 
and execution model (similar to MRv1) 
2. Leverage existing HDFS support in myHadoop 
for storage 
3. Combine #1 and #2 to extend myHadoop to 
support Spark: 
$ export HADOOP_CONF_DIR=/home/glock/hadoop.conf 
$ myhadoop-configure.sh 
... 
myHadoop: Enabling experimental Spark support 
myHadoop: Using SPARK_CONF_DIR=/home/glock/hadoop.conf/spark 
myHadoop: To use Spark, you will want to type the following commands: 
source /home/glock/hadoop.conf/spark/spark-env.sh 
myspark start 
SAN DIEGO SUPERCOMPUTER CENTER
Spark on Gordon: Storage 
• Spark can use HDFS 
$ start-dfs.sh # after you run myhadoop-configure.sh, of course 
... 
$ pyspark 
>>> mydata = sc.textFile('hdfs://localhost:54310/user/glock/mydata.txt') 
>>> mydata.count() 
982394 
• Spark can use POSIX file systems too 
$ pyspark 
>>> mydata = sc.textFile('file:///oasis/scratch/glock/temp_project/mydata.txt') 
>>> mydata.count() 
982394 
• S3 Native (s3n://) and HTTP (http://) also work 
• file:// input will be served in chunks to Spark 
workers via the Spark driver's built-in httpd 
SAN DIEGO SUPERCOMPUTER CENTER
Spark on Gordon: Running 
Spark treats several languages as first-class 
citizens: 
Feature Scala Java Python 
Interactive YES NO YES 
Shark (SQL) YES YES YES 
Streaming YES YES NO 
MLlib YES YES YES 
GraphX YES YES NO 
R is a second-class citizen; basic RDD API is 
available outside of CRAN 
(http://amplab-extras.github.io/SparkR-pkg/) 
SAN DIEGO SUPERCOMPUTER CENTER
myHadoop/Spark on Gordon (1/2) 
#!/bin/bash 
#PBS -l nodes=2:ppn=16:native:flash 
#PBS -l walltime=00:30:00 
#PBS -q normal 
### Environment setup for Hadoop 
export MODULEPATH=/home/glock/apps/modulefiles:$MODULEPATH 
module load hadoop/2.2.0 
export HADOOP_CONF_DIR=$HOME/mycluster.conf 
myhadoop-configure.sh 
### Start HDFS. Starting YARN isn't necessary since Spark will be running in 
### standalone mode on our cluster. 
start-dfs.sh 
### Load in the necessary Spark environment variables 
source $HADOOP_CONF_DIR/spark/spark-env.sh 
### Start the Spark masters and workers. Do NOT use the start-all.sh provided 
### by Spark, as they do not correctly honor $SPARK_CONF_DIR 
myspark start 
SAN DIEGO SUPERCOMPUTER CENTER
myHadoop/Spark on Gordon (2/2) 
### Run our example problem. 
### Step 1. Load data into HDFS (Hadoop 2.x does not make the user's HDFS home 
### dir by default which is different from Hadoop 1.x!) 
hdfs dfs -mkdir -p /user/$USER 
hdfs dfs -put /home/glock/hadoop/run/gutenberg.txt /user/$USER/gutenberg.txt 
### Step 2. Run our Python Spark job. Note that Spark implicitly requires 
### Python 2.6 (some features, like MLlib, require 2.7) 
module load python scipy 
/home/glock/hadoop/run/wordcount-spark.py 
### Step 3. Copy output back out 
hdfs dfs -get /user/$USER/output.dir $PBS_O_WORKDIR/ 
### Shut down Spark and HDFS 
myspark stop 
stop-dfs.sh 
### Clean up 
myhadoop-cleanup.sh 
SAN DIEGO SUPERCOMPUTER CENTER 
Wordcount submit script and Python code online: 
https://github.com/glennklockwood/sparktutorial
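The wordcount-spark.py script itself lives in the repository above; as a rough sketch (illustrative only, not the actual script), a PySpark wordcount for this job might look like:

#!/usr/bin/env python
# Illustrative PySpark wordcount -- not the actual wordcount-spark.py from the repo
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName='wordcount')

text = sc.textFile('hdfs://localhost:54310/user/glock/gutenberg.txt')
counts = ( text.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(add) )
counts.saveAsTextFile('hdfs://localhost:54310/user/glock/output.dir')

sc.stop()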
Introduction to Spark 
PRACTICAL LIMITATIONS 
SAN DIEGO SUPERCOMPUTER CENTER
Major Problems with Spark 
1. Still smells like a CS project 
2. Debugging is a dark art 
3. Not battle-tested at scale 
SAN DIEGO SUPERCOMPUTER CENTER
#1: Spark Smells Like CS 
• Components are constantly breaking 
• Graph.partitionBy broken in 1.0.0 (SPARK-1931) 
• Some components never worked 
• SPARK_CONF_DIR (start-all.sh) doesn't work (SPARK-2058) 
• stop-master.sh doesn't work 
• Spark with YARN will break with large data sets (SPARK-2398) 
• spark-submit for standalone mode doesn't work (SPARK-2260) 
SAN DIEGO SUPERCOMPUTER CENTER
#1: Spark Smells Like CS 
• Really obvious usability issues: 
>>> data = sc.textFile('file:///oasis/scratch/glock/temp_project/gutenberg.txt') 
>>> data.saveAsTextFile('hdfs://gcn-8-42.ibnet0:54310/user/glock/output.dir') 
14/04/30 16:23:07 ERROR Executor: Exception in task ID 19 
scala.MatchError: 0 (of class java.lang.Integer) 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
... 
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:722) 
Read an RDD, then write it out = unhandled exception with cryptic Scala errors from Python (SPARK-1690)
SAN DIEGO SUPERCOMPUTER CENTER
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out') 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "/N/u/glock/apps/spark-0.9.0/python/pyspark/rdd.py", line 682, in 
saveAsTextFile 
keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path) 
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1- 
src.zip/py4j/java_gateway.py", line 537, in __call__ 
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile. 
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with 
client version 4 
at org.apache.hadoop.ipc.Client.call(Client.java:1070) 
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) 
at $Proxy7.getProtocolVersion(Unknown Source) 
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) 
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) 
Cause: Spark built against Hadoop 2 DFS trying to access data 
on Hadoop 1 DFS 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Debugging is a Dark Art 
>>> data.count() 
14/04/30 16:15:11 ERROR Executor: Exception in task ID 12 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/worker.py", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
self.serializer.dump_stream(self._batched(iterator), stream) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for item in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", lin 
if acc is None: 
TypeError: an integer is required 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
... 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://user/glock/output.dir/') 
14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.p 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/worker.py", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", lin 
if not isinstance(x, basestring): 
SystemError: unknown opcode 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) 
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) 
... 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Spark Debugging Tips 
• $SPARK_LOG_DIR/app-* contains master/worker 
logs with failure information 
• Try to find the salient error amidst the stack traces 
• Google that error--odds are, it is a known issue 
• Stick any required environment variables ($PATH, 
$PYTHONPATH, $JAVA_HOME) in 
$SPARK_CONF_DIR/spark-env.sh to rule out 
these problems 
• If all else fails, look at the Spark source code 
SAN DIEGO SUPERCOMPUTER CENTER
#3: Spark Isn't Battle Tested 
• Companies (Cloudera, SAP, etc.) are jumping on the 
Spark bandwagon with disclaimers about scaling 
• Spark does not handle multitenancy well at all. 
Wait scheduling is considered the best way to achieve 
memory/disk data locality 
• Largest Spark clusters ~ hundreds of nodes 
SAN DIEGO SUPERCOMPUTER CENTER
Spark Take-Aways 
• Facts 
• Data is represented as resilient distributed datasets (RDDs) which remain in-memory and read-only 
• RDDs are composed of elements 
• Elements are distributed across physical nodes in user-defined groups called partitions 
• RDDs are subject to transformations and actions 
• Fault tolerance is achieved by lineage, not replication 
• Opinions 
• Spark is still in its infancy, but its progress is promising 
• Worth evaluating--a good fit for Gordon and Comet 
SAN DIEGO SUPERCOMPUTER CENTER
Introduction to Spark 
PAGERANK EXAMPLE 
(INCOMPLETE) 
SAN DIEGO SUPERCOMPUTER CENTER
Lazy Evaluation + In-Memory Caching = 
Optimized JOIN Operations 
Start every webpage with a rank R = 1.0 
1. For each webpage that links to N neighbor webpages, have it "contribute" R/N to each of those N neighbors 
2. Then, for each webpage, set its rank R to 0.15 + 0.85 * (sum of contributions received) 
3. Repeat 
[insert flow diagram here] 
SAN DIEGO SUPERCOMPUTER CENTER
Lazy Evaluation + In-Memory Caching = 
Optimized JOIN Operations 
from operator import add   # 'add' is used in the reduceByKey below

lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/links.txt') 

# Load key/value pairs of (url, link), eliminate duplicates, and partition them such 
# that all common keys are kept together. Then retain this RDD in memory. 
links = lines.map(lambda urls: urls.split()).distinct().groupByKey().cache() 

# Create a new RDD of key/value pairs of (url, rank) and initialize all ranks to 1.0 
ranks = links.map(lambda (url, neighbors): (url, 1.0)) 

# Calculate and update URL rank 
for iteration in range(10): 
    # Calculate URL contributions to their neighbors (computeContribs is 
    # defined in the editor's notes at the end of this deck) 
    contribs = links.join(ranks).flatMap( 
        lambda (url, (urls, rank)): computeContribs(urls, rank)) 

    # Recalculate URL ranks based on neighbor contributions 
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: 0.15 + 0.85*rank) 

# Print all URLs and their ranks 
for (link, rank) in ranks.collect(): 
    print '%s has rank %s' % (link, rank) 
SAN DIEGO SUPERCOMPUTER CENTER


Editor's notes

groupByKey: group the values for each key in the RDD into a single sequence 
mapValues: apply a map function to all values of key/value pairs without modifying the keys (or their partitioning) 
collect: return a list containing all elements of the RDD 

def computeContribs(urls, rank): 
    """Calculates URL contributions to the rank of other URLs.""" 
    num_urls = len(urls) 
    for url in urls: 
        yield (url, rank / num_urls)