Spark is a fast and general cluster computing system that improves on MapReduce by keeping data in memory between jobs. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark Core provides in-memory computing capabilities and a programming model in which users write programs as transformations on distributed datasets.
An Introduction to Apache Spark
1.
An Introduction to Apache Spark
By Amir Sedighi
Datis Pars Data Technology
Slides adopted from Databricks
(Paco Nathan and Aaron Davidson)
@amirsedighi
http://hexican.com
2.
History
● Developed in 2009 at UC Berkeley AMPLab.
● Open sourced in 2010.
● Spark has become one of the largest big-data
projects, with more than 400 contributors from 50+
organizations such as:
– Databricks, Yahoo!, Intel, Cloudera, IBM, …
3.
What is Spark?
● Fast and general cluster computing system
interoperable with Hadoop datasets.
4.
What are Spark improvements?
● Improves efficiency through:
– In-memory computing primitives.
– General computation graphs.
● Improves usability through:
– Rich APIs in Scala, Java, Python
– Interactive shell (Scala/Python)
6.
MapReduce
● MapReduce is great for single-pass batch jobs,
but many use cases need multiple passes, and
chaining MapReduce jobs forces intermediate
results to disk between passes...
7.
What improvements does Spark make
over MapReduce?
● It generalizes MapReduce into a multi-pass,
interactive, near-real-time distributed
computation model on top of Hadoop.
Note:
– Spark can be seen as a successor to Hadoop
MapReduce (it still commonly runs on Hadoop
storage such as HDFS).
12.
Spark Programming Model
● At a high level, every Spark application consists
of a driver program that runs the user’s main
function.
● Encourages you to write programs in terms of
transformations on distributed datasets.
13.
Spark Programming Model
● The main abstraction Spark provides is a
resilient distributed dataset (RDD).
– A collection of elements partitioned across the cluster
(in memory or on disk)
– Can be operated on in parallel (map,
filter, ...)
– Automatically rebuilt on failure
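The "automatically rebuilt on failure" point rests on lineage: an RDD remembers the recipe that derived it, so a lost partition can be recomputed rather than replicated. A minimal plain-Scala sketch of that idea (an analogy with a hypothetical `Lineage` class, not Spark's internals):

```scala
// Hypothetical minimal "RDD": a parent dataset plus the function that
// derives this dataset from it. Losing the computed result is harmless,
// because the recipe can simply be re-run.
final case class Lineage[A, B](parent: List[A], derive: A => B) {
  def compute: List[B] = parent.map(derive)  // normal evaluation
  def recompute: List[B] = compute           // after a failure, re-run the recipe
}

val doubled = Lineage(List(1, 2, 3), (n: Int) => n * 2)
val recovered = doubled.recompute  // List(2, 4, 6), rebuilt from lineage
```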
14.
Spark Programming Model
● RDDs Operations
– Transformations: Create a new dataset from an
existing one.
● Example: map()
– Actions: Return a value to the driver program after
running a computation on the dataset.
● Example: reduce()
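The transformation/action split can be sketched with plain Scala collections, which expose similar combinators (an analogy only, not Spark itself): a lazy view plays the role of a transformation pipeline, and forcing it plays the role of an action.

```scala
val data = (1 to 10).toList

// "Transformations": nothing is computed yet, just a recipe,
// like map() and filter() building up a new RDD lazily.
val squaresOfEvens = data.view.filter(_ % 2 == 0).map(n => n * n)

// "Action": the pipeline runs only now and returns a value to the caller,
// just as reduce() returns a value to the Spark driver program.
val total = squaresOfEvens.sum  // 4 + 16 + 36 + 64 + 100 = 220
```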
16.
Spark Programming Model
● Another abstraction is Shared Variables
– Broadcast Variables, which can be used to cache a
value in memory on all nodes.
– Accumulators, which tasks can only add to, e.g.
to implement counters and sums.
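The two shared-variable patterns can be sketched in plain Scala (an analogy, not the Spark API, where these would be `SparkContext.broadcast` and an accumulator): a broadcast variable is a read-only value every task can see; an accumulator is only added to from tasks and read back by the driver.

```scala
val lookup = Map("a" -> 1, "b" -> 2)  // "broadcast": read-only lookup table
var badRecords = 0L                   // "accumulator": tasks only add to it

val records = List("a", "x", "b", "y")
val resolved = records.flatMap { r =>
  lookup.get(r) match {
    case Some(v) => Some(v)
    case None    => badRecords += 1; None  // count records that failed the lookup
  }
}
// resolved == List(1, 2), badRecords == 2
```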
20.
Ease of Use
● Spark offers over 80 high-level operators that
make it easy to build parallel apps.
● It offers Scala and Python shells for interactive use.
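The classic word count shows how little code those operators need. Below it is written against plain Scala collections using the same operator names the RDD API exposes (with a real SparkContext the pipeline would start from `sc.textFile(...)` and use `reduceByKey`; `groupMapReduce` is the local-collections analogue):

```scala
val lines = List("to be or not to be", "to be is to do")

val counts = lines
  .flatMap(_.split(" "))              // RDD-style flatMap: lines -> words
  .map(word => (word, 1))             // map: word -> (word, 1)
  .groupMapReduce(_._1)(_._2)(_ + _)  // local analogue of reduceByKey(_ + _)
// counts("to") == 4, counts("be") == 3
```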
23.
Apache Spark Core
● Spark Core is the general engine for the Spark
platform.
– In-memory computing capabilities deliver speed
– A general execution model supports a wide range of
use cases
– Ease of development – native APIs in Java, Scala,
Python (+ SQL, Clojure, R)
35.
Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of distributed datasets (RDDs)
representing a distributed stream of data
36.
Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
transformation: modify the data in one DStream to
create another (new) DStream
37.
Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
window(Minutes(1), Seconds(1)) is a sliding window
operation: the window length is one minute and the
sliding interval is one second
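The windowed count can be sketched with plain Scala (an analogy, not the DStream API): each element of `batches` below stands for the tags seen in one micro-batch, and the window covers the last 3 batches, sliding one batch at a time.

```scala
val batches = List(
  List("#spark"),
  List("#spark", "#scala"),
  List("#scala"),
  List("#spark")
)

// window length 3 batches, sliding interval 1 batch; each window is
// flattened and counted, like countByValue() over a DStream window.
val windowedCounts = batches
  .sliding(3, 1)
  .map(window => window.flatten.groupMapReduce(identity)(_ => 1)(_ + _))
  .toList
// two windows: batches 0-2 and batches 1-3
```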
40.
MLlib
● MLlib is Spark's scalable machine learning
library.
● MLlib works on any Hadoop data source, such
as HDFS, HBase, or local files.
41.
MLlib
● Algorithms:
– linear SVM and logistic regression
– classification and regression tree
– k-means clustering
– recommendation via alternating least squares
– singular value decomposition
– linear regression with L1- and L2-regularization
– multinomial naive Bayes
– basic statistics
– feature transformations
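To make one entry on the list concrete, here is a toy one-dimensional k-means in plain Scala, just to illustrate the algorithm MLlib distributes (the real API is MLlib's `KMeans` over RDDs of vectors; this local sketch is not it):

```scala
// Assign each point to its nearest center, move each center to the mean
// of its assigned points, and repeat for a fixed number of iterations.
def kmeans(points: List[Double], centers: List[Double], iters: Int): List[Double] =
  if (iters == 0) centers
  else {
    val assigned = points.groupBy(p => centers.minBy(c => math.abs(p - c)))
    val updated  = centers.map(c =>
      assigned.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
    kmeans(points, updated, iters - 1)
  }

val points  = List(1.0, 1.2, 0.8, 9.0, 9.5, 8.5)
val centers = kmeans(points, List(0.0, 5.0), iters = 10)
// converges to roughly (1.0, 9.0)
```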
46.
Spark Runs Everywhere
● Spark runs on Hadoop, Mesos, standalone, or
in the cloud.
● Spark accesses diverse data sources including
HDFS, Cassandra, HBase, S3.
47.
Resources
● http://spark.apache.org
● Intro to Apache Spark by Paco Nathan
● Building a Unified Data Pipeline in Spark by Aaron
Davidson.
● http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
● Deep Dive with Spark Streaming - Tathagata Das - Spark
Meetup