This document provides an overview of Spark and Scala. It discusses why Scala is a good language for functional programming and distributed computing. It then introduces Spark and how it provides a Scala-like API for distributed, parallel computations using RDDs across a cluster. Some key points made are that Spark transformations are lazy while actions are eager, and that it is important to cache, apply transformations efficiently, and avoid shuffling for best performance of Spark jobs.
2. Who is who?
Alexey Zvolinskiy
@Fruzenshtein
~4 years of Scala experience
Working through the Functional Programming in Scala Specialization on Coursera
I write a lot on my blog, www.Fruzenshtein.com
Currently interested in Scala, Akka, Spark…
4. What makes Scala so great?
1. Functional programming language*
2. Immutability
3. Type system
4. Collections API
5. Pattern matching
6. Implicits
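Two of the items above, pattern matching and implicits, can be sketched in a few lines. This is only an illustration: the `Shape` hierarchy, `area`, and `IntOps` are hypothetical names invented for the example and do not come from the talk.

```scala
object PatternMatchingDemo {
  // Hypothetical shape hierarchy, used only for illustration.
  sealed trait Shape
  final case class Circle(radius: Double) extends Shape
  final case class Rectangle(width: Double, height: Double) extends Shape

  // Pattern matching destructures each case class. Because Shape is sealed,
  // the compiler can warn when a case is missing.
  def area(shape: Shape): Double = shape match {
    case Circle(r)       => math.Pi * r * r
    case Rectangle(w, h) => w * h
  }

  // An implicit class adds a method to an existing type (an extension method).
  implicit class IntOps(n: Int) {
    def squared: Int = n * n
  }

  val rectangleArea: Double = area(Rectangle(2.0, 3.0))
  val three: Int = 3
  val nine: Int = three.squared // resolved through the implicit IntOps wrapper
}
```

Here `three.squared` compiles only because `IntOps` is in implicit scope, which is the mechanism behind many Scala library extension methods.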
5. Functional programming language
1. Function is a first class citizen
2. Totality
3. Determinism
4. Purity
[Diagram: a function A => B maps inputs A1, A2, …, An to outputs B1, B2, …, Bn; each Ai goes to Bi]
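The four properties on this slide can be made concrete with a small sketch. All names below (`timesTwo`, `applyTwice`, `unsafeHead`, `safeHead`, `impureNext`, `pureAdd`) are made up for illustration and are not from the talk.

```scala
object FunctionPropertiesDemo {
  // First-class functions: a function is a value that can be stored and passed around.
  val timesTwo: Int => Int = _ * 2
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Not total: throws NoSuchElementException for an empty list.
  def unsafeHead(xs: List[Int]): Int = xs.head

  // Total: returns a value for every possible input.
  def safeHead(xs: List[Int]): Option[Int] = xs.headOption

  // Not pure and not deterministic: the result depends on hidden mutable state,
  // and every call changes that state.
  private var counter = 0
  def impureNext(x: Int): Int = { counter += 1; x + counter }

  // Pure and deterministic: the same arguments always give the same result,
  // and nothing outside the function is read or modified.
  def pureAdd(x: Int, y: Int): Int = x + y
}
```

Calling `impureNext(1)` twice gives two different results, while `pureAdd(1, 2)` can always be replaced by its value. That substitutability is what makes pure functions safe to run in parallel or on another machine.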
6. Immutability
1. Makes code more predictable
2. Reduces the effort needed to understand code
3. Key to thread safety
Books:
Java Concurrency in Practice
Effective Java, 2nd Edition
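A minimal sketch of what immutability looks like in everyday Scala, assuming a hypothetical `Account` case class that is not part of the talk: "updating" a value produces a new value and leaves the original untouched.

```scala
object ImmutabilityDemo {
  // Hypothetical domain type, used only for illustration.
  final case class Account(owner: String, balance: Int)

  val original: Account = Account("alice", 100)

  // copy builds a new Account; `original` is not modified.
  val updated: Account = original.copy(balance = original.balance + 50)

  // Immutable collections behave the same way: prepending returns a new list.
  val xs: List[Int] = List(1, 2, 3)
  val ys: List[Int] = 0 :: xs
}
```

Because neither `original` nor `xs` can change after construction, they can be shared between threads without locks.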
7. Type system
1. Static typing
2. Type inference
3. Bounds
Examples:
Map[V, K]
List[T1 <: T2]
Set[+T]
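The notation above can be tried out in a short sketch. The `Animal`, `Dog`, and `Cat` types and the `names` function are hypothetical names chosen for the example. It uses the standard `List`, which is declared covariant as `List[+A]`.

```scala
object TypeSystemDemo {
  // Hypothetical hierarchy, used only for illustration.
  sealed trait Animal { def name: String }
  final case class Dog(name: String) extends Animal
  final case class Cat(name: String) extends Animal

  // Type inference: no annotation needed, the compiler infers List[Dog].
  val dogs = List(Dog("Rex"), Dog("Fido"))

  // Upper bound: T must be a subtype of Animal, so `.name` is known to exist.
  def names[T <: Animal](animals: List[T]): List[String] = animals.map(_.name)

  // Covariance: because List is List[+A], a List[Dog] is accepted as a List[Animal].
  val animals: List[Animal] = dogs

  val dogNames: List[String] = names(dogs)
}
```

Static typing shows up in what does not compile: `names(List(1, 2, 3))` is rejected at compile time because `Int` is not a subtype of `Animal`.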
8. Collections API
val numbers = List(1,2,3,4,5,6,7,8,9,10)
numbers.filter(_ % 2 == 0)
.map(_ * 10)
//List(20, 40, 60, 80, 100)
filter(n: Int => Boolean) //(n => n % 2 == 0)
map(n: Int => Int) //(n => n * 10)
13. Scala parallel collections
val from0to100000: Range = 0 until 100000
val list = from0to100000.toList
//scala.collection.parallel.immutable.ParSeq[Int]
val parList = list.par
14. Some benchmarks
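// Naive primality test: n is prime if it is greater than 1
// and no number in 2 until n-1 divides it.
def isPrime(n: Int): Boolean =
  n > 1 && !((2 until n-1) exists (n % _ == 0))
// from0to100000 is the Range defined on the previous slide.
val list = from0to100000.toList
// Sequential run: filter the List ten times and print each duration in ms.
for (i <- 1 to 10) {
  val t0 = System.currentTimeMillis()
  list.filter(isPrime(_))
  println(System.currentTimeMillis - t0)
}
// Parallel run: the same filter on the parallel collection.
val parList = list.par
for (i <- 1 to 10) {
  val t1 = System.currentTimeMillis()
  parList.filter(isPrime(_))
  println(System.currentTimeMillis - t1)
}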
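Printed timings in ms, first loop (sequential List):
7106, 6467, 6315, 6275, 6478, 8732, 6543, 6296, 6299, 6286
Printed timings in ms, second loop (parallel collection):
5130, 5106, 4649, 4568, 4580, 4446, 4447, 4437, 4290, 4476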
16. Why distributed computations?
Single machine (shared memory): parallel collections (Scala)
Multiple nodes (network): RDDs (Spark)
Almost the same API