2. In this talk…
• Introduction to Spark
• Resilient Distributed Datasets (RDDs)
• How to build a statistical model
• Lessons Learned
3. History
• Started as a project of the AMP Lab at UC Berkeley in 2009
• Open sourced in 2010
• Apache Incubator project since June 2013
• Databricks founded in 2013 as the company behind Spark
• Top Level Project at Apache in February 2014
6. Why a new programming model for Big Data analysis?
[Diagram: in the iterative model, every iteration writes its output to HDFS and the next iteration reads it back in; in ad-hoc querying, each query (Query 1–3) re-reads the input from HDFS to produce its result (Result 1–3).]
7. Why a new programming model for Big Data analysis?
[Diagram: with in-memory computing, the input is read from HDFS (and pre-processed) once; iterations and ad-hoc queries (Query 1–3) then operate on data held in memory.]
8. Facts
• Implemented in ~14,000 lines of Scala
• APIs for Scala, Python and Java
• Vertically and horizontally scalable
• Fault tolerance and fast recomputation
• Load balancing
• Built on an in-memory cluster-computing data structure with a rich set of operations
• Libraries for machine learning, graph computation, stream processing and ad-hoc querying
• API to control computation flow and persistence management
• Different deployment options (Standalone, YARN, MR)
• Interoperability with many systems (Hive, EC2, Mesos, HBase, …)
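The persistence management mentioned above can be sketched with Spark's `persist` API. A minimal, hedged example — the path and RDD names are illustrative, and a live `SparkContext` (`sc`) is assumed:

```scala
import org.apache.spark.storage.StorageLevel

// Keep a dataset in memory, spilling partitions to disk if it does not fit.
// The input path is hypothetical.
val lines  = sc.textFile("hdfs:///data/events.log")
val parsed = lines.map(_.split("\t"))
parsed.persist(StorageLevel.MEMORY_AND_DISK)

// Note: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// a storage level can only be set once per RDD.
```

Choosing `MEMORY_AND_DISK` trades recomputation time for disk I/O when the working set exceeds cluster memory.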
10. Programming Model
[Diagram: a Driver running a Spark Context sends tasks to several Workers; each Worker processes its part of the input data and returns results to the Driver.]
• Input: HDFS cluster, Hadoop, Hive, …
• Input Data: RDD[T]
• Tasks: serialized Java objects
• Result: RDD[A]
• Computing: in memory
• Driver: the Spark program
11. Programming Model
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ScalaMeetup")
  .set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)

val data = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
val biggerThanFive = data.filter(_ > 5)

biggerThanFive.cache().collect()

val exp = biggerThanFive.map(x => x * x)
val result = exp.reduce(_ + _)

INFO DAGScheduler: Stage 62 (reduce at <console>:18) finished in 0.021 s
INFO SparkContext: Job finished: reduce at <console>:18, took 0.024262042 s
result: Int = 330
12. Launch with Spark-Submit
You can run your packaged apps with the spark-submit command:

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.ScalaMeetupMain \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
# --master can also be `yarn-client` for client mode
13. Not covered today…
• MLlib, GraphX, Spark SQL, Spark Streaming
• Third-party systems communication
• …
15. RDD Creation
• Distributed, immutable in-memory abstraction
• Can only be created through:
  • reading from stable storage
  • transforming another RDD

dataFile.txt:
line1 with
line2 some
line3 data

val data: RDD[T] = sc.textFile("dataFile.txt")
// T => String
data.foreach(println)
// line1 with
// line2 some
// line3 data

val data: RDD[T] = sc.textFile("dataFile.txt")
// T => String
val data_ : RDD[A] = data.filter(_.startsWith("line1"))
// A => String
data_.foreach(println)
// line1 with
16. RDD Operations
Picture from the Resilient Distributed Datasets paper.

val data: RDD[String] = sc.parallelize(
  List(
    "a",
    "b",
    "I_AM_A_VERY_LONG_STRING"))

case class LongString(element: String)

val nothingFound = "NO LONG STRING FOUND"

val result = data.map { x =>
  if (x.length > 10)
    LongString(x)
  else
    nothingFound
}

result.collect()
// Array(NO LONG STRING FOUND, NO LONG STRING FOUND,
//       LongString(I_AM_A_VERY_LONG_STRING))

• Transformations are lazy operations
• Actions (e.g. reduce, collect) trigger the actual computation
• Closures passed to reduce should be associative
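Why associativity matters can be shown with plain Scala collections: a distributed reduce combines per-partition results in an unspecified order, so a non-associative closure gives grouping-dependent results. A minimal local sketch:

```scala
val nums = List(1, 2, 3, 4)

// Addition is associative: grouping does not change the result.
assert(nums.reduceLeft(_ + _) == nums.reduceRight(_ + _))  // both 10

// Subtraction is not: ((1-2)-3)-4 = -8, but 1-(2-(3-4)) = -2.
// A distributed reduce with such a closure is non-deterministic.
assert(nums.reduceLeft(_ - _) != nums.reduceRight(_ - _))
```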
17. RDD Representation
• Partitions: list of the atomic pieces of the given dataset
• Dependencies between parent and child RDDs
  • Narrow (e.g. map, …)
  • Wide (e.g. join, …)
• RDDs are immutable
• Lineage graph to ensure fault recovery
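The narrow/wide distinction and the lineage graph can be sketched as follows; the datasets are illustrative and a live `SparkContext` (`sc`) is assumed:

```scala
// Pair RDDs, since join operates on key-value pairs.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (2, 20.0)))

// Narrow dependency: each output partition depends on exactly one
// parent partition, so no data moves between nodes.
val upper = users.mapValues(_.toUpperCase)

// Wide dependency: join may require shuffling records with the same
// key across partitions.
val joined = upper.join(orders)

// The lineage graph used for fault recovery can be inspected:
println(joined.toDebugString)
```

If a partition of `joined` is lost, Spark recomputes only that partition from its parents by replaying the lineage, rather than restarting the whole job.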
25. JAR Assembling
• Use consistent Scala, Akka and Spark versions
• If you combine Akka & Spark, use the same Akka (shaded-protobuf)
• Even if your project runs fine unpackaged, that does not mean the JAR will behave the same way
26. Work with Data
• Cache your data: 1 TB of data in memory takes 5–7 seconds vs. ~3 minutes
• Work with a small subset of the data to validate complex transformations
• MLlib is not enough; consult a statistician to validate your model
• Well-placed logging can save you a lot of time
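Validating on a small subset can be done with Spark's `sample`. A hedged sketch — `events`, `parseAndEnrich`, the fraction and the seed are all hypothetical:

```scala
// Draw a reproducible 1% sample without replacement and keep it cached.
val sample = events.sample(withReplacement = false, fraction = 0.01, seed = 42L)
sample.cache()

// Validate a complex transformation on the small sample first.
val transformed = sample.map(parseAndEnrich)
transformed.take(10).foreach(println)
```

Only once the sampled output looks right is the transformation worth running over the full dataset.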
30. Conclusion
• Fits best into batch-processing use cases
• Scala was absolutely the right choice for Spark
• Large community support is helpful
• RDDs are a powerful & flexible data structure