5. Apache Spark History
• Started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMP Lab.
• Spark researchers found that Hadoop MapReduce was inefficient for iterative and interactive computing.
• Spark was designed from the beginning to be fast for interactive and iterative workloads, with support for in-memory storage and fault tolerance.
• Apart from UC Berkeley, Databricks, Yahoo! and Intel are major contributors.
• Spark was open sourced in March 2010 and became an Apache Foundation project in June 2013. It is one of the most active projects in Apache.
6. Apache Spark Stack
• Cluster managers at the bottom: Standalone Scheduler, YARN, others
• Spark Core on top of the cluster manager
• Libraries sharing one API on top of the core: Spark SQL, Spark Streaming, MLlib, GraphX
Full stack and great ecosystem.
One stack to rule them all.
8. Apache Spark Advantages
• Fast by leveraging memory
• General-purpose framework with a rich ecosystem
• Good and wide API support – native integration with Java, Python, Scala
• Spark handles batch, interactive, and real-time workloads within a single framework
• Leverages the Resilient Distributed Dataset (RDD) to process data in memory
  Resilient – data can be recreated if it is lost from memory
  Distributed – stored in memory across the cluster
  Dataset – can be created from a file or programmatically
9. Apache Spark Hall of Fame
Largest cluster: more than 8,000 nodes
Largest single-day intake: 1PB+/day
Longest-running job: 1 week on 1PB+ data
Largest shuffle job: 1PB sort in 4hrs
10. How to Work with Spark
• Scala IDE
- Add the Hadoop dependency or use mvn
- Compile to a jar; build it yourself or download one
(http://spark.apache.org/docs/latest/building-spark.html)
- Run with spark-submit
• spark-shell
- Interactive command-line application
• spark-sql
- Shell where Spark uses the Hive metastore
11. Demo – Word Count
MR code Hive code
How about spark code?
12. Demo – Word Count Cont.
MR code Hive code
How about spark code?
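A minimal word count in Spark Scala, as a hedged sketch of the "spark code" the slide asks about; the HDFS input and output paths are illustrative assumptions, not from the deck.

```scala
// Word count: read lines, split into words, pair each word with 1,
// then sum the counts per key. reduceByKey introduces the only shuffle.
val counts = sc.textFile("hdfs:///tmp/input.txt")   // input path is an assumption
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs:///tmp/wordcount_out")  // output path is an assumption
```

Compared with the MapReduce and Hive versions, the whole job fits in a few lines and can be typed directly into spark-shell or packaged for spark-submit.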
14. Spark Core Components
Application – User program built on Spark. Consists of a driver program and executors on the cluster.
Driver program – The process running the main() function of the application and creating the SparkContext.
Cluster manager – An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Worker node – Any node that can run application code in the cluster.
Executor – A process launched for an application on a worker node; it runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task – A unit of work that will be sent to one executor.
Job – A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage – Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
15. Example of Partial Word Count
WordCount example (diagram): HDFS input splits feed the tasks of Stage 1, where narrow operations run as a pipeline; the resulting partitions are shuffled to the tasks of Stage 2, which write the result back to HDFS.
• Each core executes one task at a time by default (spark.task.cpus)
• One task per partition; one partition per input split
• The number of stages comes from the shuffle/wide transformations
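The bullets above can be checked interactively. A sketch assuming a running SparkContext; the file path and split count are illustrative:

```scala
// One task per partition: each stage launches as many tasks as the RDD
// it computes has partitions.
val rdd = sc.textFile("hdfs:///tmp/in.txt", 8) // request at least 8 input splits
rdd.partitions.size                  // number of tasks in this RDD's stage
rdd.map(_.length).partitions.size    // a narrow map keeps the same partitioning
```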
16. Two Main Spark Abstractions
RDD – Resilient Distributed Dataset
DAG – Directed Acyclic Graph
17. RDD Concept
Simple explanation
• An RDD is a collection of data items split into partitions and stored in memory on the worker nodes of the cluster, on which operations can be performed in parallel.
Complex explanation
• An RDD is an interface for data transformation
• An RDD refers to data stored either in a persistent store (HDFS, Cassandra, HBase, etc.), in cache (memory, memory+disk, disk only, etc.), or in another RDD
• Partitions are recomputed on failure or cache eviction
18. Spark Program Flow by RDD
1. Create input RDD – via the SparkContext; create from an external source or from a SEQ
2. Transform RDD – using map, flatMap, distinct, filter; wide/narrow transformations
3. Persist intermediate RDD – cache / persist
4. Action on RDD – produce the result
Example:
myRDD = sc.parallelize([1,2,3,4])
myRDD = sc.textFile("hdfs://tmp/shakespeare.txt")
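The four-step flow above, sketched end to end in Scala (the HDFS path is an assumption):

```scala
val lines = sc.textFile("hdfs:///tmp/shakespeare.txt")        // 1. create input RDD
val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty) // 2. transform (narrow)
words.cache()                                                 // 3. persist (lazy, no work yet)
val total = words.count()                                     // 4. action triggers execution
```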
19. RDD Partitions
• A partition is one of the chunks that an RDD is split into; each partition is sent to a node
• The more partitions we have, the more parallelism we get
• Each partition is a candidate to be spread out to different worker nodes
RDDs – Partitions (diagram): an RDD with 8 partitions (P1–P8) spread across 4 worker nodes; each executor holds two partitions (P1/P5, P2/P6, P3/P7, P4/P8).
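A sketch of inspecting and changing an RDD's partitioning from spark-shell, assuming only a running SparkContext:

```scala
val rdd = sc.parallelize(1 to 1000, 8) // ask for 8 partitions
rdd.partitions.size                    // 8
rdd.repartition(16).partitions.size    // 16 (incurs a full shuffle)
rdd.coalesce(4).partitions.size        // 4 (narrow; avoids a shuffle)
```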
20. RDD Operations
• The RDD is the only data manipulation tool in core Apache Spark
• Lazy and non-lazy operations
• There are two types of RDD operations
  Transformations – lazy – return a new RDD; results are materialized as Array[T] by the collect action
  Actions – non-lazy – return a value to the driver
• Narrow and wide transformations
21. RDD Transformations - map
• Applies a transformation function on each item of the RDD and
returns the result as a new RDD.
• A narrow transformation
(diagram: each element v in a partition is mapped to f(v); partitioning is preserved)
f: T => U
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val b = a.map(x => (x, x.length))
b.collect()        // Array((dog,3), (salmon,6), (pig,3))
b.partitions.size  // 3
22. RDD Transformations - flatMap
• Similar to map, but each input item can be mapped to 0 or more
output items.
• A narrow transformation
f: T => Seq[U]
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val c = a.flatMap(x => List(x, x))
c.collect  // Array(dog, dog, salmon, salmon, pig, pig)
val d = a.map(x => List(x, x)).collect  // Array(List(dog, dog), ...) — note the nesting map keeps
23. RDD Transformations - filter
• Creates a new RDD by passing in a function used to filter the results. It is similar to the WHERE clause in SQL
• A narrow transformation
f: T => Boolean
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val e = a.filter(x => x.length > 3)
e.collect  // Array(salmon)
24. RDD Transformations - sample
• Returns a random sample subset of the input RDD as a new RDD
• A narrow transformation
val sam = sc.parallelize(1 to 10, 3)
sam.sample(false, 0.1, 0).collect  // without replacement, ~10% of the elements
sam.sample(true, 0.3, 1).collect   // with replacement, ~30%
def sample(withReplacement: Boolean, fraction: Double, seed: Int): RDD[T]
25. RDD Transformations - distinct
• Returns a new RDD containing the distinct elements of the source RDD, optionally with a specific number of partitions
• A narrow/wide transformation
val dis = sc.parallelize(List(1,2,3,4,5,6,7,8,9,9,2,6))
dis.distinct.collect
dis.distinct(2).partitions.length  // 2
26. RDD Transformations - reduceByKey
• Operates on (K,V) pairs; the reduce function must be of type (V,V) => V
• A narrow/wide transformation
f: (V, V) => V
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x => (x.length, x))
f.reduceByKey((a, b) => a + b).collect
f.reduceByKey(_ + _).collect  // same, with placeholder syntax
f.groupByKey.collect
Compared with groupByKey, reduceByKey is more efficient for grouping because it uses a combiner in the map phase.
27. RDD Transformations - sortByKey
• Simply sorts the (K,V) pairs by K. The default is ascending; pass false for descending.
• A narrow/wide transformation
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x => (x.length, x))
f.sortByKey().collect        // ascending by key
f.sortByKey(false).collect   // descending by key
28. RDD Transformations - union
• Returns the union of two RDDs as a new RDD (duplicates are kept)
• A narrow transformation
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 4)
u1.union(u2).collect  // Array(1, 2, 3, 3, 4)
(u1 ++ u2).collect    // ++ is an alias for union
29. RDD Transformations - intersection
• Returns a new RDD containing only the elements common to the source RDD and the argument
• A narrow/wide transformation
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 4)
u1.intersection(u2).collect  // Array(3)
30. RDD Transformations - join
• When invoked on (K,V) and (K,W) pairs, this operation creates a new RDD of (K,(V,W)). Must be applied to key-value pairs.
• A wide transformation
val j1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))
val j2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))
j1.join(j2).collect          // Array((apple,(1,1)))
j1.leftOuterJoin(j2).collect
j1.rightOuterJoin(j2).collect
31. RDD Actions - count
• Gets the number of data elements in the RDD
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 10)
u1.count  // 3
u2.count  // 8
32. RDD Actions - take
• Fetches the first n data elements from the RDD
val t1 = sc.parallelize(3 to 10)
t1.take(3)  // Array(3, 4, 5)
33. RDD Actions - foreach
• Executes a function for each data element in the RDD. In cluster mode, the function runs on the executors, so println output appears in the executor logs, not the driver console.
val c = sc.parallelize(List("cat", "dog", "tiger", "lion"))
c.foreach(x => println(x + "s are fun"))
34. RDD Actions - saveAsTextFile
• Writes the content of the RDD to a text file, or a set of text files, in the local file system or HDFS. Paths without a scheme default to HDFS (hdfs://).
val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("file:/Users/will/Downloads/mydata")
35. RDD Persistence – cache/persist
• Caches data in memory/on disk. It is a lazy operation
• cache() = persist(StorageLevel.MEMORY_ONLY); persist() can take other levels, e.g. persist(StorageLevel.MEMORY_AND_DISK)
val u1 = sc.parallelize(1 to 1000).filter(_ % 2 == 0)
u1.cache()
u1.count  // the first action materializes the cache
36. Storage Level for persist
Storage Level – Purpose
MEMORY_ONLY (default level) – Keeps the RDD in available cluster memory as deserialized Java objects. Some partitions may not be cached if there is not enough cluster memory; those partitions are recalculated on the fly as needed.
MEMORY_AND_DISK – Stores the RDD as deserialized Java objects. If the RDD does not fit in cluster memory, the remaining partitions are stored on disk and read as needed.
MEMORY_ONLY_SER – Stores the RDD as serialized Java objects (one byte array per partition). This is more CPU intensive but saves memory, as it is more space efficient. Some partitions may not be cached; those are recalculated on the fly as needed.
MEMORY_AND_DISK_SER – Same as above, except that disk is used when memory is not sufficient.
DISK_ONLY – Stores the RDD only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. – Same as the corresponding levels, but each partition is replicated on 2 cluster nodes.
By default, Spark uses a Least Recently Used (LRU) algorithm to evict old and unused RDDs to release memory. You can also manually remove an unused RDD from memory by calling unpersist().
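A sketch of choosing an explicit storage level and evicting it manually instead of waiting for LRU, assuming a running SparkContext:

```scala
import org.apache.spark.storage.StorageLevel

val r = sc.parallelize(1 to 1000000)
r.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized in memory, spilled to disk if needed
r.count()     // the first action materializes the cache
r.unpersist() // manually release the cached partitions
```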
37. Spark Variable - Broadcast
• Like Hadoop's distributed cache, broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks — for example, to give every node a copy of a large input dataset efficiently
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
val fn = sc.textFile("hdfs://filename").map(_.split("\\|")) // escape |, a regex metacharacter
val fn_bc = sc.broadcast(fn.collect()) // collect first: an RDD itself cannot be broadcast
38. Spark Variable - Accumulators
• Accumulators are variables that can only be "added" to through an associative operation; they are used to implement counters and sums efficiently in parallel
• Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend them for new types
• Only the driver program can read an accumulator's value, not the tasks
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
39. Two Main Spark Abstractions
RDD – Resilient Distributed Dataset
DAG – Directed Acyclic Graph
40. DAG
• Directed Acyclic Graph – a sequence of computations performed on data
  Node: an RDD partition
  Edge: a transformation on top of the data
  Acyclic: the graph cannot return to an older partition
  Directed: a transformation is an action that transitions a data partition from one state to another (from A to B)
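One way to see the DAG Spark builds is RDD.toDebugString, which prints the lineage; a sketch assuming a running SparkContext:

```scala
val counts = sc.parallelize(Seq("a b", "b c"))
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _) // wide transformation: introduces a stage boundary

// Each indentation level in the printed lineage marks a shuffle (stage) boundary.
println(counts.toDebugString)
```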
43. Spark SQL History
Timeline: released in April 2014; the predecessor (Shark) was stopped in mid-2014; ready at the beginning of 2015.
44. Spark SQL with Hive Meta
Supports almost all Hive queries via HiveContext, which is a superset of SQLContext
Spark SQL uses Hive's metastore, but its own version of the Thrift server
48. Spark SQL – How It Works
Catalyst
Frontend → Catalyst → Backend
Picture from databricks.com
49. DataFrame (DF) RDD
• Schema on read
• Works with structured and semi-structured data
• Simplifies working with structured data
• Keeps track of schema information
• Reads/writes structured data like JSON, Hive tables, Parquet, etc.
• SQL inside your Spark app
• Best performance and a more powerful operations API
• Accessed through the SQLContext
• Integrated with ML
51. DF Example – Direct from files
val df = sqlContext.read.json("/demo/data/data_json/customer.json")
df.printSchema()
df.show()
df.select("name").show
df.filter(df("age") >= 35).show()
df.groupBy("age").count().show()
SQLContext deals directly with DataFrames
52. DF Example – Direct from Hive
val a = sqlContext.table("sample_07")
a.printSchema()
a.show(1)
53. DF Example – Indirect from Other RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val rdd = sc.textFile("/demo/data/data_cust/data1.csv").
  flatMap(_.split(",")).map(x => (x, 1)).reduceByKey(_ + _)
// auto schema (inferred from the tuple)
val df1 = sqlContext.createDataFrame(rdd)
val df2 = rdd.toDF
// specify schema
val schemaString = "word cnt"
val schema = StructType(StructField("word", StringType, true) ::
  StructField("cnt", IntegerType, false) :: Nil)
val rowRDD = rdd.map(p => Row(p._1, p._2))
val dfcust = sqlContext.createDataFrame(rowRDD, schema)
54. www.sparkera.ca
BIG DATA is not only about data,
but about understanding data
and how people actively use it to improve their lives.
LARGEST CLUSTER – Tencent (8,000+ nodes)
LARGEST SINGLE-DAY INTAKE – Tencent (1PB+/day)
LONGEST-RUNNING JOB – Alibaba (1 week on 1PB+ data)
LARGEST SHUFFLE – Databricks PB sort (1PB)
MOST INTERESTING APP – Jeremy Freeman, Mapping the Brain at Scale (with lasers!)
The first layer is the interpreter; Spark uses a Scala interpreter with some modifications.
As you enter your code in the Spark console (creating RDDs and applying operators), Spark builds an operator graph.
When the user runs an action (like collect), the graph is submitted to the DAG scheduler.
The DAG scheduler divides the operator graph into (map and reduce) stages.
A stage is comprised of tasks based on partitions of the input data.
The DAG scheduler pipelines operators together to optimize the graph. For example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages.
The stages are passed on to the task scheduler, which launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies among stages.
The worker executes the tasks. A new JVM is started per application. The worker knows only about the code that is passed to it.
Catalyst optimizes query planning for Spark: it takes DataFrame operations and "compiles" them down to RDD operations, which is often faster than writing RDD code manually. Use DataFrames whenever possible (v1.4+).
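Catalyst's plans can be inspected with DataFrame.explain; a sketch reusing the customer.json path from the earlier DataFrame slide:

```scala
val df = sqlContext.read.json("/demo/data/data_json/customer.json")
// explain(true) prints the parsed, analyzed, optimized, and physical plans
// that Catalyst produces before the query runs as RDD operations.
df.filter(df("age") >= 35).select("name").explain(true)
```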