2. In this talk…
• Introduction to Spark
• Resilient Distributed Datasets (RDDs)
• How to build a statistical model
• Lessons Learned
3. History
• Started as a project of the AMP Lab at UC Berkeley in 2009
• Open sourced in 2010
• Apache Incubator project since June 2013
• Databricks founded in 2013 as the company behind Spark
• Top Level Project at Apache in February 2014
6. Why a new programming model for Big Data analysis?
[Diagram: in the iterative model, every iteration writes its output to HDFS and the next iteration reads it back in; in ad-hoc querying, each query (Query 1–3) re-reads the input from HDFS to produce its result (Result 1–3).]
7. Why a new programming model for Big Data analysis?
[Diagram: with in-memory computing, the input is read from HDFS (and pre-processed) once; iterations and ad-hoc queries (Query 1–3) then operate on data held in memory.]
8. Facts
• Implemented in ~14,000 lines of Scala
• APIs for Scala, Python and Java
• Vertically and horizontally scalable
• Fault tolerance and fast recomputation
• Load balancing
• Built on an in-memory cluster-computing data structure with a rich set of operations
• Libraries for machine learning, graph computation, stream processing and ad-hoc querying
• API to control computation flow and persistence management
• Different deployment options (Standalone, YARN, MR)
• Interoperability with many systems (Hive, EC2, Mesos, HBase, …)
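The persistence management mentioned above can be sketched with Spark's `persist` API. A minimal, hedged example — the path and RDD names are illustrative, and a live `SparkContext` (`sc`) is assumed:

```scala
import org.apache.spark.storage.StorageLevel

// Keep a dataset in memory, spilling partitions to disk if it does not fit.
// The input path is hypothetical.
val lines  = sc.textFile("hdfs:///data/events.log")
val parsed = lines.map(_.split("\t"))
parsed.persist(StorageLevel.MEMORY_AND_DISK)

// Note: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// a storage level can only be set once per RDD.
```

Choosing `MEMORY_AND_DISK` trades recomputation time for disk I/O when the working set exceeds cluster memory.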
10. Programming Model
[Diagram: a Driver running a Spark Context sends tasks to several Workers; each Worker processes its part of the input data and returns results to the Driver.]
• Input: HDFS cluster, Hadoop, Hive, …
• Input Data: RDD[T]
• Tasks: serialized Java objects
• Result: RDD[A]
• Computing: in memory
• Driver: the Spark program
11. Programming Model
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ScalaMeetup")
  .set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)

val data = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
val biggerThanFive = data.filter(_ > 5)

biggerThanFive.cache().collect()

val exp = biggerThanFive.map(x => x * x)
val result = exp.reduce(_ + _)

INFO DAGScheduler: Stage 62 (reduce at <console>:18) finished in 0.021 s
INFO SparkContext: Job finished: reduce at <console>:18, took 0.024262042 s
result: Int = 330
12. Launch with Spark-Submit
You can run your packaged apps with the spark-submit command:

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.ScalaMeetupMain \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
# --master can also be `yarn-client` for client mode
13. Not covered today…
• MLlib, GraphX, Spark SQL, Spark Streaming
• Third-party systems communication
• …
15. RDD Creation
• Distributed, immutable in-memory abstraction
• Can only be created through:
  • reading from stable storage
  • transforming another RDD

dataFile.txt:
line1 with
line2 some
line3 data

val data: RDD[T] = sc.textFile("dataFile.txt")
// T => String
data.foreach(println)
// line1 with
// line2 some
// line3 data

val data: RDD[T] = sc.textFile("dataFile.txt")
// T => String
val data_ : RDD[A] = data.filter(_.startsWith("line1"))
// A => String
data_.foreach(println)
// line1 with
16. RDD Operations
Picture from the Resilient Distributed Datasets paper.

val data: RDD[String] = sc.parallelize(
  List(
    "a",
    "b",
    "I_AM_A_VERY_LONG_STRING"))

case class LongString(element: String)

val nothingFound = "NO LONG STRING FOUND"

val result = data.map { x =>
  if (x.length > 10)
    LongString(x)
  else
    nothingFound
}

result.collect()
// Array(NO LONG STRING FOUND, NO LONG STRING FOUND,
//       LongString(I_AM_A_VERY_LONG_STRING))

• Transformations are lazy operations
• Actions (e.g. reduce, collect) trigger the actual computation
• Closures passed to reduce should be associative
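Why associativity matters can be shown with plain Scala collections: a distributed reduce combines per-partition results in an unspecified order, so a non-associative closure gives grouping-dependent results. A minimal local sketch:

```scala
val nums = List(1, 2, 3, 4)

// Addition is associative: grouping does not change the result.
assert(nums.reduceLeft(_ + _) == nums.reduceRight(_ + _))  // both 10

// Subtraction is not: ((1-2)-3)-4 = -8, but 1-(2-(3-4)) = -2.
// A distributed reduce with such a closure is non-deterministic.
assert(nums.reduceLeft(_ - _) != nums.reduceRight(_ - _))
```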
17. RDD Representation
• Partitions: list of the atomic pieces of the given dataset
• Dependencies between parent and child RDDs
  • Narrow (e.g. map, …)
  • Wide (e.g. join, …)
• RDDs are immutable
• Lineage graph to ensure fault recovery
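The narrow/wide distinction and the lineage graph can be sketched as follows; the datasets are illustrative and a live `SparkContext` (`sc`) is assumed:

```scala
// Pair RDDs, since join operates on key-value pairs.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (2, 20.0)))

// Narrow dependency: each output partition depends on exactly one
// parent partition, so no data moves between nodes.
val upper = users.mapValues(_.toUpperCase)

// Wide dependency: join may require shuffling records with the same
// key across partitions.
val joined = upper.join(orders)

// The lineage graph used for fault recovery can be inspected:
println(joined.toDebugString)
```

If a partition of `joined` is lost, Spark recomputes only that partition from its parents by replaying the lineage, rather than restarting the whole job.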
25. JAR Assembling
• Use consistent Scala, Akka and Spark versions
• If you combine Akka & Spark, use the same Akka (shaded-protobuf)
• Even if your project runs fine unpackaged, that does not mean the JAR will behave the same way
26. Work with Data
• Cache your data: 1 TB of data in memory takes 5–7 seconds vs. ~3 minutes
• Work with a small subset of the data to validate complex transformations
• MLlib is not enough; consult a statistician to validate your model
• Well-placed logging can save you a lot of time
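Validating on a small subset can be done with Spark's `sample`. A hedged sketch — `events`, `parseAndEnrich`, the fraction and the seed are all hypothetical:

```scala
// Draw a reproducible 1% sample without replacement and keep it cached.
val sample = events.sample(withReplacement = false, fraction = 0.01, seed = 42L)
sample.cache()

// Validate a complex transformation on the small sample first.
val transformed = sample.map(parseAndEnrich)
transformed.take(10).foreach(println)
```

Only once the sampled output looks right is the transformation worth running over the full dataset.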
30. Conclusion
• Fits best into batch-processing use cases
• Scala was absolutely the right choice for Spark
• Large community support is helpful
• RDDs are a powerful & flexible data structure