This document provides an introduction and overview of Apache Spark. It covers what Spark is, its performance advantages over Hadoop MapReduce, its core abstraction of resilient distributed datasets (RDDs), and how Spark programs are executed. Key features such as the interactive shell, transformations and actions on RDDs, and Spark SQL are explained, along with recent additions like DataFrames, external data sources, and the Tungsten performance optimizer. The document aims to give attendees an understanding of Spark's capabilities and how it can outperform Hadoop for certain applications.
5. What is Apache Spark?
• It is an open-source cluster computing framework
• In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100× faster for certain applications.
6. Databricks
• Founded in late 2013
• By the creators of Apache Spark
• Original team from UC Berkeley AMPLab (Algorithms, Machines, People)
• Contributed more than 75% of the code added to Spark in 2014
22. http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
“The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.”
What is an RDD?
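As a hedged illustration of the explicit caching the paper describes (a minimal sketch, assuming a SparkContext sc as in the examples below; the path is hypothetical):

# Cache an RDD in memory so later parallel operations reuse it
data = sc.textFile("/path/to/events.log")          # hypothetical path
parsed = data.map(lambda line: line.split("\t"))
parsed.cache()                  # keep parsed partitions in memory across machines
total = parsed.count()          # first action computes and caches the partitions
again = parsed.count()          # reuses the cached partitions instead of re-reading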
24. # Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
Read From TextFile
36. Lifecycle of a Spark program
1. Create some input RDDs from external data, or parallelize a collection in your driver program.
2. Lazily transform them to define new RDDs, using transformations like filter() or map().
3. Ask Spark to cache() any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark (see the sketch below).
http://www.slideshare.net/SparkSummit/intro-to-spark-development
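A minimal PySpark sketch of these four steps (the path and the "ERROR" pattern are hypothetical; a SparkContext sc is assumed, as in the earlier examples):

# 1. Create an input RDD from external data
lines = sc.textFile("/path/to/app.log")

# 2. Lazily define new RDDs with transformations
errors = lines.filter(lambda line: "ERROR" in line)

# 3. Cache an intermediate RDD that will be reused
errors.cache()

# 4. Launch actions to kick off the parallel computation
print(errors.count())   # total number of error lines
print(errors.take(5))   # a small sample of them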
59. Spark 1.5
• A large part of Spark 1.5 focuses on under-the-hood changes that improve Spark’s performance, usability, and operational stability.
• Spark 1.5 delivers the first phase of Project Tungsten
Reference: https://databricks.com/blog/2015/08/18/spark-1-5-preview-now-available-in-databricks.html
In the slide's diagram, this RDD has 5 partitions.
An RDD is simply a distributed collection of elements. You can think of a distributed collection as something like an array or list in a single-machine program, except that it’s spread out across multiple nodes in the cluster.
In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
So, Spark gives you APIs and functions that let you operate on the whole collection in parallel, using all the nodes.
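A small hedged example of such a distributed collection (again assuming a SparkContext sc): parallelize() splits a local list across partitions, and map() runs on all of them in parallel.

nums = sc.parallelize(range(10), 5)    # distribute a local collection into 5 partitions
print(nums.getNumPartitions())         # 5
squares = nums.map(lambda x: x * x)    # applied in parallel across the nodes
print(squares.collect())               # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]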
Spark has two kinds of operations on RDDs: transformations and actions.
In the slide's diagram, the four green blocks are the unique blocks of a single HDFS file.
Here we are filtering out the warnings and info messages so we are left with just errors in the RDD.
This doesn’t actually read the file from HDFS just yet… we’re just building out a lineage graph: a directed acyclic graph (DAG).
That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again.
A collection of tasks that must be ordered into a sequence, subject to constraints that certain tasks must be performed earlier than others, may be represented as a DAG with a vertex for each task and an edge for each constraint
https://en.wikipedia.org/wiki/Directed_acyclic_graph
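As a hedged sketch of how a chain of transformations forms such a DAG (assuming sc exists; the path is hypothetical), toDebugString() prints the lineage Spark has recorded:

lines = sc.textFile("/path/to/app.log")              # hypothetical path
errors = lines.filter(lambda l: "ERROR" in l)
pairs = errors.map(lambda l: (l.split()[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.toDebugString())   # shows the chain of parent RDDs (the lineage DAG)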
This is a stage (which we'll talk about later).
Now the RDDs disappear and are destroyed.
It’s okay if only part of the RDD actually fits in memory
Lineage connects each child RDD back to the parent RDD(s) it was derived from.
Also note that an application can run through steps 1 through 4 many times.
Actions force the evaluation of the transformations required for the RDD they are called on, since they must actually produce output.
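A hedged sketch of this laziness (assuming a SparkContext sc; the log path is hypothetical): the filter() below does no work until the count() action runs.

# Transformations are lazy: nothing is read or computed here
lines = sc.textFile("/path/to/app.log")           # hypothetical path
errors = lines.filter(lambda l: "ERROR" in l)     # just extends the lineage

# The action forces evaluation of the whole chain and returns a result
print(errors.count())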