Spark is a fast and general cluster computing system that improves on MapReduce by keeping data in memory between jobs. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark Core provides in-memory computing capabilities and a programming model in which users write programs as transformations on distributed datasets.
An Introduction to Apache Spark
1.
An Introduction to Apache Spark
By Amir Sedighi
Datis Pars Data Technology
Slides adopted from Databricks
(Paco Nathan and Aaron Davidson)
@amirsedighi
http://hexican.com
2.
History
● Developed in 2009 at UC Berkeley AMPLab.
● Open sourced in 2010.
● Spark has become one of the largest big-data
projects, with more than 400 contributors from 50+
organizations such as:
– Databricks, Yahoo!, Intel, Cloudera, IBM, …
3.
What is Spark?
● Fast and general cluster computing system
interoperable with Hadoop datasets.
4.
What are Spark improvements?
● Improves efficiency through:
– In-memory computing primitives.
– General computation graphs.
● Improves usability through:
– Rich APIs in Scala, Java, Python
– Interactive shell (Scala/Python)
6.
MapReduce
● MapReduce is great for single-pass batch jobs,
but many use cases need multiple passes, and
chaining MapReduce jobs forces intermediate
results to disk between passes...
7.
What improvements does Spark make
over MapReduce?
● It generalizes MapReduce into a multi-pass,
interactive, near-real-time distributed
computation model on top of Hadoop.
Note:
– Spark can be seen as a successor to Hadoop
MapReduce (it still commonly runs on Hadoop
storage such as HDFS).
12.
Spark Programming Model
● At a high level, every Spark application consists
of a driver program that runs the user’s main
function.
● Encourages you to write programs in terms of
transformations on distributed datasets.
13.
Spark Programming Model
● The main abstraction Spark provides is a
resilient distributed dataset (RDD).
– A collection of elements partitioned across the cluster
(in memory or on disk)
– Can be operated on in parallel (map,
filter, ...)
– Automatically rebuilt on failure
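The "automatically rebuilt on failure" point rests on lineage: an RDD remembers the recipe that derived it, so a lost partition can be recomputed rather than replicated. A minimal plain-Scala sketch of that idea (an analogy with a hypothetical `Lineage` class, not Spark's internals):

```scala
// Hypothetical minimal "RDD": a parent dataset plus the function that
// derives this dataset from it. Losing the computed result is harmless,
// because the recipe can simply be re-run.
final case class Lineage[A, B](parent: List[A], derive: A => B) {
  def compute: List[B] = parent.map(derive)  // normal evaluation
  def recompute: List[B] = compute           // after a failure, re-run the recipe
}

val doubled = Lineage(List(1, 2, 3), (n: Int) => n * 2)
val recovered = doubled.recompute  // List(2, 4, 6), rebuilt from lineage
```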
14.
Spark Programming Model
● RDDs Operations
– Transformations: Create a new dataset from an
existing one.
● Example: map()
– Actions: Return a value to the driver program after
running a computation on the dataset.
● Example: reduce()
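The transformation/action split can be sketched with plain Scala collections, which expose similar combinators (an analogy only, not Spark itself): a lazy view plays the role of a transformation pipeline, and forcing it plays the role of an action.

```scala
val data = (1 to 10).toList

// "Transformations": nothing is computed yet, just a recipe,
// like map() and filter() building up a new RDD lazily.
val squaresOfEvens = data.view.filter(_ % 2 == 0).map(n => n * n)

// "Action": the pipeline runs only now and returns a value to the caller,
// just as reduce() returns a value to the Spark driver program.
val total = squaresOfEvens.sum  // 4 + 16 + 36 + 64 + 100 = 220
```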
16.
Spark Programming Model
● Another abstraction is Shared Variables
– Broadcast Variables, which can be used to cache a
value in memory on all nodes.
– Accumulators, which tasks can only add to, e.g.
to implement counters and sums.
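The two shared-variable patterns can be sketched in plain Scala (an analogy, not the Spark API, where these would be `SparkContext.broadcast` and an accumulator): a broadcast variable is a read-only value every task can see; an accumulator is only added to from tasks and read back by the driver.

```scala
val lookup = Map("a" -> 1, "b" -> 2)  // "broadcast": read-only lookup table
var badRecords = 0L                   // "accumulator": tasks only add to it

val records = List("a", "x", "b", "y")
val resolved = records.flatMap { r =>
  lookup.get(r) match {
    case Some(v) => Some(v)
    case None    => badRecords += 1; None  // count records that failed the lookup
  }
}
// resolved == List(1, 2), badRecords == 2
```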
20.
Ease of Use
● Spark offers over 80 high-level operators that
make it easy to build parallel apps.
● It offers Scala and Python shells for interactive use.
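The classic word count shows how little code those operators need. Below it is written against plain Scala collections using the same operator names the RDD API exposes (with a real SparkContext the pipeline would start from `sc.textFile(...)` and use `reduceByKey`; `groupMapReduce` is the local-collections analogue):

```scala
val lines = List("to be or not to be", "to be is to do")

val counts = lines
  .flatMap(_.split(" "))              // RDD-style flatMap: lines -> words
  .map(word => (word, 1))             // map: word -> (word, 1)
  .groupMapReduce(_._1)(_._2)(_ + _)  // local analogue of reduceByKey(_ + _)
// counts("to") == 4, counts("be") == 3
```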
23.
Apache Spark Core
● Spark Core is the general engine for the Spark
platform.
– In-memory computing capabilities deliver speed
– A general execution model supports a wide range of
use cases
– Ease of development – native APIs in Java, Scala,
Python (+ SQL, Clojure, R)
35.
Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of distributed datasets (RDDs)
representing a distributed stream of data
36.
Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
transformation: modify the data in one DStream to
create another (new) DStream
37.
Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
window(Minutes(1), Seconds(1)) is a sliding window
operation: the window length is one minute and the
sliding interval is one second
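The windowed count can be sketched with plain Scala (an analogy, not the DStream API): each element of `batches` below stands for the tags seen in one micro-batch, and the window covers the last 3 batches, sliding one batch at a time.

```scala
val batches = List(
  List("#spark"),
  List("#spark", "#scala"),
  List("#scala"),
  List("#spark")
)

// window length 3 batches, sliding interval 1 batch; each window is
// flattened and counted, like countByValue() over a DStream window.
val windowedCounts = batches
  .sliding(3, 1)
  .map(window => window.flatten.groupMapReduce(identity)(_ => 1)(_ + _))
  .toList
// two windows: batches 0-2 and batches 1-3
```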
40.
MLlib
● MLlib is Spark's scalable machine learning
library.
● MLlib works on any Hadoop data source, such
as HDFS, HBase, or local files.
41.
MLlib
● Algorithms:
– linear SVM and logistic regression
– classification and regression tree
– k-means clustering
– recommendation via alternating least squares
– singular value decomposition
– linear regression with L1- and L2-regularization
– multinomial naive Bayes
– basic statistics
– feature transformations
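To make one entry on the list concrete, here is a toy one-dimensional k-means in plain Scala, just to illustrate the algorithm MLlib distributes (the real API is MLlib's `KMeans` over RDDs of vectors; this local sketch is not it):

```scala
// Assign each point to its nearest center, move each center to the mean
// of its assigned points, and repeat for a fixed number of iterations.
def kmeans(points: List[Double], centers: List[Double], iters: Int): List[Double] =
  if (iters == 0) centers
  else {
    val assigned = points.groupBy(p => centers.minBy(c => math.abs(p - c)))
    val updated  = centers.map(c =>
      assigned.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
    kmeans(points, updated, iters - 1)
  }

val points  = List(1.0, 1.2, 0.8, 9.0, 9.5, 8.5)
val centers = kmeans(points, List(0.0, 5.0), iters = 10)
// converges to roughly (1.0, 9.0)
```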
46.
Spark Runs Everywhere
● Spark runs on Hadoop, Mesos, standalone, or
in the cloud.
● Spark accesses diverse data sources including
HDFS, Cassandra, HBase, S3.
47.
Resources
● http://spark.apache.org
● Intro to Apache Spark by Paco Nathan
● Building a Unified Data Pipeline in Spark by Aaron
Davidson.
● http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
● Deep Dive with Spark Streaming - Tathagata Das - Spark
Meetup