4. Ericsson Internal | 2015-08-11 | Page 4
“Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics (machine learning).”
5. Ericsson Internal | 2015-08-11 | Page 5
WHAT?
Originally developed in 2009 at UC Berkeley's AMPLab
Fully open sourced in 2010 –
now at Apache Software
Foundation
http://spark.apache.org
6. Ericsson Internal | 2015-08-11 | Page 6
Spark is the Most Active
Open Source Project in
Big Data
[Bar chart: project contributors in the past year, comparing Spark with Giraph, Storm, and Tez; y-axis from 0 to 140]
7. Ericsson Internal | 2015-08-11 | Page 7
The Spark Community
[Diagram: distributors and applications in the Spark ecosystem]
9. Ericsson Internal | 2015-08-11 | Page 9
WHY SPARK?
Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of Use: Supports multiple languages (Java, Scala, Python) for developing Spark applications.
Generality: Combine SQL, streaming, and complex analytics into one platform.
Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.
10. Ericsson Internal | 2015-08-11 | Page 10
Easy: Get Started
Immediately
Interactive Shell
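The interactive shell starts with a SparkContext already bound to the variable sc. As a minimal sketch (assuming a local ./bin/spark-shell session; the numbers are only an illustration), a first interaction could look like this:

// Inside ./bin/spark-shell the SparkContext is already available as `sc`.
val nums = sc.parallelize(1 to 100)          // distribute a local range
println(nums.sum())                          // 5050.0
println(nums.filter(_ % 2 == 0).count())     // 50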
19. Ericsson Internal | 2015-08-11 | Page 19
THE BIG QUESTION?
Is Spark going to replace Hadoop?
Answer: Yes, Spark will be used on top of Hadoop and will replace MapReduce.
Reasons:
1. Hadoop MapReduce cannot handle real-time processing
2. Hadoop MapReduce is slower than Spark
3. With the rise of IoT, Spark is a must
21. Ericsson Internal | 2015-08-11 | Page 21
Resilient Distributed Dataset (RDD)
RDDs track lineage information that can be used to efficiently
re-compute lost data
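As a sketch of what lineage means in practice (assuming the shell's SparkContext sc and a local README.md), each transformation below only records how it derives from its parent; toDebugString prints that chain, which Spark replays to rebuild any lost partition:

val lines = sc.textFile("README.md")         // assumed input file
val words = lines.flatMap(_.split(" "))
val pairs = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// Prints the lineage (chain of parent RDDs) behind `counts`; nothing is
// computed here, the lineage is just metadata carried by the RDD.
println(counts.toDebugString)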
22. Ericsson Internal | 2015-08-11 | Page 22
Partitions in the
cluster
[Diagram: an RDD's partitions spread across Spark worker nodes (SparkW), coordinated by the Spark master (SparkM)]
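A minimal sketch of how an RDD's data is split into partitions across the workers (assuming a SparkContext sc; the partition count of 8 is arbitrary):

val data = sc.parallelize(1 to 1000, 8)      // request 8 partitions

// Each partition is processed as one task on some worker in the cluster.
println(data.partitions.length)              // 8

// Count how many elements ended up in each partition.
val sizes = data.mapPartitions(it => Iterator(it.size)).collect()
println(sizes.mkString(", "))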
30. Ericsson Internal | 2015-08-11 | Page 30
Spark Shell
./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed cluster, or local to run
locally with one thread, or local[N] to run locally with N threads. You should start by
using local for testing.
31. Ericsson Internal | 2015-08-11 | Page 31
Basic operations…
scala> val textFile = sc.textFile("../README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 126

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))

Simpler:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
32. Ericsson Internal | 2015-08-11 | Page 32
Map - Reduce
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15

scala> import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8

scala> wordCounts.collect()
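As a small follow-up sketch (not on the original slide), the word counts can be sorted to show the most frequent words:

// Swap to (count, word), sort by count descending, and take the top 5.
val top5 = wordCounts.map(_.swap).sortByKey(ascending = false).take(5)
top5.foreach { case (count, word) => println(s"$word: $count") }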
33. Ericsson Internal | 2015-08-11 | Page 33
With Caching…
scala> linesWithSpark.cache()
res7: spark.RDD[String] =
spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15
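A brief aside on what cache() does: for RDDs it is shorthand for persist at the default memory-only storage level, so an equivalent explicit form (a sketch reusing linesWithSpark from above) is:

import org.apache.spark.storage.StorageLevel

// Equivalent to linesWithSpark.cache(): keep the computed partitions in memory.
linesWithSpark.persist(StorageLevel.MEMORY_ONLY)

linesWithSpark.count()      // first action computes and caches the RDD
linesWithSpark.count()      // answered from memory, no recomputation

linesWithSpark.unpersist()  // release the cached partitions when done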
34. Ericsson Internal | 2015-08-11 | Page 34
With HDFS…
val lines = sc.textFile("hdfs://...")
val errors = lines.filter(line => line.startsWith("ERROR"))
println("Total errors: " + errors.count())
36. Ericsson Internal | 2015-08-11 | Page 36
Configuration
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
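As a brief follow-up (a sketch, assuming the conf above), the effective settings can be read back from the created context:

println(sc.getConf.get("spark.app.name"))         // CountingSheep
println(sc.getConf.get("spark.executor.memory"))  // 1g
sc.stop()  // stop the context when the application is finished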
37. Ericsson Internal | 2015-08-11 | Page 37
SQL to RDD Translation
Projection & selection
SELECT name, age
FROM people
WHERE age >= 13 AND age <= 19

val people: RDD[Person]
val teenagers: RDD[(String, Int)] =
  people
    .filter(p => p.age >= 13 && p.age <= 19)
    .map(p => (p.name, p.age))

// or, equivalently, project first and then select on the projected tuple:
val teenagers2: RDD[(String, Int)] =
  people
    .map(p => (p.name, p.age))
    .filter { case (_, age) => age >= 13 && age <= 19 }
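For comparison, the same query can also be handed to Spark SQL rather than translated by hand. A minimal sketch assuming the Spark 1.x API (SQLContext and registerTempTable); the Person case class and the sample rows are made up for illustration:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical sample data, registered as a temporary table named "people".
val people = sc.parallelize(Seq(Person("Alice", 15), Person("Bob", 30))).toDF()
people.registerTempTable("people")

// Spark SQL plans this into the same projection and selection over the data.
val teenagers = sqlContext.sql(
  "SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
teenagers.collect().foreach(println)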