6. INTRODUCTION
Highly accessible through standard APIs in Java,
Scala, Python, or SQL (for interactive queries), plus a rich
set of machine learning libraries
Compatibility with the existing Hadoop v1 (SIMR) and
2.x (YARN) ecosystems so companies can leverage their
existing infrastructure.
7. INSTALLATION
Install JDK 1.7+, Scala 2.10.x, sbt 0.13.7, Maven 3.0+
Download and unzip the Apache Spark 1.1.0 sources
or clone the development version:
git clone git://github.com/apache/spark.git
Run Maven to build Apache Spark:
mvn -DskipTests clean package
Launch the Apache Spark standalone REPL:
[spark_home]/bin/spark-shell
Go to the Spark UI at
http://localhost:4040
8. SPARK-SHELL
We’ll run Spark’s interactive shell… within the “spark”
directory, run:
./bin/spark-shell
then from the “scala>” REPL prompt, let’s create some
data…
scala> val data = 1 to 10000
create an RDD based on that data…
scala> val distData = sc.parallelize(data)
then use a filter to select values less than 10…
scala> distData.filter(_ < 10).collect()
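If everything is set up, that last action should return the values below (the exact res variable number depends on your session):
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)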
9. SPARKCONTEXT
The first thing a Spark program must do is to create a
SparkContext object, which tells Spark how to access a
cluster.
In the shell for either Scala or Python, this is the sc
variable, which is created automatically
Other programs must use a constructor to instantiate a
new SparkContext
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
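Putting it together, a minimal sketch of a standalone application (the app name and master URL are placeholders to adapt):
import org.apache.spark.{SparkConf, SparkContext}
object MyApp {
  def main(args: Array[String]): Unit = {
    // "local[2]" is only an example master; pass your cluster's master URL in practice
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // ... build and operate on RDDs with sc ...
    sc.stop()
  }
}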
10. RDDS
Resilient Distributed Datasets (RDDs) are the primary
abstraction in Spark: an immutable, distributed
collection of data, partitioned across the machines
of a cluster
There are currently two types:
Parallelized collections: take an existing Scala collection and
run functions on it in parallel
External datasets: Spark can create distributed datasets from
any storage source supported by Hadoop, including the local file
system, HDFS, Cassandra, HBase, Amazon S3, etc.
11. RDDS
Parallelized collections
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at
parallelize at <console>:14
External datasets
scala> val distFile = sc.textFile("README.md")
distFile: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[7] at
textFile at <console>:12
12. RDDS
Two types of operations on RDDs:
transformations and actions
A transformation is a lazy (not computed immediately)
operation on an RDD that yields another RDD
An action is an operation that triggers a computation,
returns a value back to the driver program, or writes to a
stable storage system
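A short illustration of this laziness (a sketch for the Scala shell; the file name is only an example):
val lines = sc.textFile("README.md")                // transformation: nothing is read yet
val sparkLines = lines.filter(_.contains("Spark"))  // transformation: still lazy
val n = sparkLines.count()                          // action: triggers the actual computation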
13. RDDS : COMMONLY USED TRANSFORMATIONS
Transformation & Purpose Example & Result
filter(func)
Purpose: new RDD by selecting
those data elements on which
func returns true
scala> val rdd = sc.parallelize(List("ABC", "BCD", "DEF"))
scala> val filtered = rdd.filter(_.contains("C"))
scala> filtered.collect()
Result:
Array[String] = Array(ABC, BCD)
map(func)
Purpose: return new RDD by
applying func on each data
element
scala> val rdd=sc.parallelize(List(1,2,3,4,5))
scala> val times2 = rdd.map(_*2)
scala> times2.collect()
Result:
Array[Int] = Array(2, 4, 6, 8, 10)
flatMap(func)
Purpose: Similar to map but func
returns a Seq instead of a value.
For example, mapping a sentence
into a Seq of words
scala> val rdd = sc.parallelize(List("Spark is awesome", "It is fun"))
scala> val fm = rdd.flatMap(str => str.split(" "))
scala> fm.collect()
Result:
Array[String] = Array(Spark, is, awesome, It, is, fun)
14. RDDS : COMMONLY USED TRANSFORMATIONS
Transformation & Purpose Example & Result
reduceByKey(func,[numTasks])
Purpose: To aggregate values of a
key using a function. “numTasks”
is an optional parameter to specify the
number of reduce tasks
scala> val word1=fm.map(word=>(word,1))
scala> val wrdCnt=word1.reduceByKey(_+_)
scala> wrdCnt.collect()
Result:
Array[(String, Int)] = Array((is,2), (It,1),
(awesome,1), (Spark,1), (fun,1))
groupByKey([numTasks])
Purpose: To convert (K,V) to
(K,Iterable<V>)
scala> val cntWrd = wrdCnt.map{case (word,
count) => (count, word)}
scala> cntWrd.groupByKey().collect()
Result:
Array[(Int, Iterable[String])] =
Array((1,ArrayBuffer(It, awesome, Spark,
fun)), (2,ArrayBuffer(is)))
distinct([numTasks])
Purpose: Eliminate duplicates
from RDD
scala> fm.distinct().collect()
Result:
Array[String] = Array(is, It, awesome, Spark,
fun)
15. RDDS : COMMONLY USED ACTIONS
Action & Purpose Example & Result
count()
Purpose: Get the number of
data elements in the RDD
scala> val rdd = sc.parallelize(List('A', 'B', 'C'))
scala> rdd.count()
Result:
Long = 3
collect()
Purpose: get all the data elements
in an RDD as an Array
scala> val rdd = sc.parallelize(List('A', 'B', 'C'))
scala> rdd.collect()
Result:
Array[Char] = Array(A, B, C)
reduce(func)
Purpose: Aggregate the data
elements in an RDD using this
function which takes two
arguments and returns one
scala> val rdd = sc.parallelize(List(1,2,3,4))
scala> rdd.reduce(_+_)
Result:
Int = 10
take(n)
Purpose: fetch the first n data
elements of an RDD. Computed by the
driver program.
scala> val rdd = sc.parallelize(List(1,2,3,4))
scala> rdd.take(2)
Result:
Array[Int] = Array(1, 2)
16. RDDS : COMMONLY USED ACTIONS
Action & Purpose Example & Result
foreach(func)
Purpose: execute a function for
each data element in the RDD.
Usually used to update an
accumulator (discussed later) or to
interact with external systems.
scala> val rdd = sc.parallelize(List(1,2))
scala> rdd.foreach(x => println("%s*10=%s".format(x, x*10)))
Result:
1*10=10
2*10=20
first()
Purpose: retrieves the first
data element in RDD. Similar to
take(1)
scala> val rdd = sc.parallelize(List(1,2,3,4))
scala> rdd.first()
Result:
Int = 1
saveAsTextFile(path)
Purpose: write the content of the
RDD to a text file, or a set of text
files, on the local file system/HDFS
scala> val hamlet = sc.textFile("readme.txt")
scala> hamlet.filter(_.contains("Spark")).saveAsTextFile("filtered")
Result:
…/filtered$ ls
_SUCCESS part-00000 part-00001
17. RDDS
For a more detailed list of actions and transformations,
please refer to:
http://spark.apache.org/docs/latest/programming-guide.html#transformations
http://spark.apache.org/docs/latest/programming-guide.html#actions
18. PERSISTENCE
Spark can persist (or cache) a dataset in memory across
operations
Each node stores in memory any partitions of the dataset
that it computes and reuses them in other actions on that
dataset, often making future actions more than 10x
faster
The cache is fault-tolerant: if any partition of an RDD is
lost, it will automatically be recomputed using the
transformations that originally created it
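A minimal caching sketch (the file name and filter are illustrative):
val logs = sc.textFile("access.log")
val errors = logs.filter(_.contains("ERROR")).cache() // keep the filtered RDD in memory
errors.count()   // first action: computes the RDD and caches its partitions
errors.take(10)  // later actions reuse the cached partitions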
21. PERSISTENCE : STORAGE LEVEL
Storage Level & Purpose
MEMORY_ONLY (default level): Store the RDD as deserialized Java objects in the JVM. If the RDD does not
fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're
needed. This is the default level.
MEMORY_AND_DISK: Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in
memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER: Store the RDD as serialized Java objects (one byte array per partition). This is
generally more space-efficient than deserialized objects, especially when using a fast serializer, but more
CPU-intensive to read.
MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to
disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY: Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on
two cluster nodes.
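To pick a level explicitly, use persist() instead of cache(); a short sketch:
import org.apache.spark.storage.StorageLevel
val distFile = sc.textFile("README.md")
distFile.persist(StorageLevel.MEMORY_AND_DISK) // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)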
22. SIMPLE SPARK APPS : WORDCOUNT
Download project from github:
https://github.com/MohamedHedi/SparkSamples
WordCount.scala:
val logFile = args(0)
val conf = new SparkConf().setAppName("WordCount")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numApache = logData.filter(line => line.contains("apache")).count()
val numSpark = logData.filter(line => line.contains("spark")).count()
println("Lines with apache: %s, Lines with spark: %s".format(numApache,
numSpark))
sbt
compile
assembly
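For reference, a minimal build.sbt sketch for such a project (the versions and the sbt-assembly plugin setup are assumptions; the exact build definition is in the linked repository):
name := "SparkSamples"
version := "1.0"
scalaVersion := "2.10.4"
// Spark itself is provided by spark-submit at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"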
24. SPARK-SUBMIT : LOCAL MODE
./bin/spark-submit
--class com.ebiznext.spark.examples.WordCount
--master local[4]
--deploy-mode client
--conf <key>=<value>
... # other options
./target/scala-2.10/SparkSamples-assembly-1.0.jar
./ressources/README.md
25. CLUSTER MANAGER TYPES
Spark supports three cluster managers:
Standalone – a simple cluster manager included with Spark
that makes it easy to set up a cluster.
Apache Mesos – a general cluster manager that can also run
Hadoop MapReduce and service applications.
Hadoop YARN – the resource manager in Hadoop 2.
26. MASTER URLS
Master URL & Meaning
local: Run Spark locally with one worker thread (no parallelism at all).
local[K]: Run Spark locally with K worker threads (ideally, set this to the number of cores on your
machine).
local[*]: Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT: Connect to the given Spark standalone cluster master. Default master port: 7077.
mesos://HOST:PORT: Connect to the given Mesos cluster. Default Mesos port: 5050.
yarn-client: Connect to a YARN cluster in client mode. The cluster location will be found based on the
HADOOP_CONF_DIR variable.
yarn-cluster: Connect to a YARN cluster in cluster mode. The cluster location will be found based on
HADOOP_CONF_DIR.
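The master can also be set in code; a small sketch (the URL shown is a placeholder for your own master):
import org.apache.spark.SparkConf
// Equivalent to passing --master to spark-submit
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master-host:7077") // or "local[*]", "yarn-client", ...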
27. SPARK-SUBMIT : STANDALONE CLUSTER
./sbin/start-master.sh
(Windows users: spark-class.cmd org.apache.spark.deploy.master.Master)
Go to the master’s web UI
28. SPARK-SUBMIT : STANDALONE CLUSTER
Connect workers to the master
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
Go to the master’s web UI
30. SPARK SQL
Shark is being migrated to Spark SQL
Spark SQL blurs the lines between RDDs and relational
tables
val conf = new SparkConf().setAppName("SparkSQL")
val sc = new SparkContext(conf)
val peopleFile = args(0)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// Define the schema using a case class.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile(peopleFile).map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
31. SPARK GRAPHX
GraphX is the new (alpha) Spark API for graphs and graph-parallel
computation.
GraphX extends the Spark RDD by introducing the Resilient Distributed
Property Graph
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
case class Peep(name: String, age: Int)
val vertexArray = Array(
(1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
(3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
(5L, Peep("Leslie", 45)))
val edgeArray = Array(
Edge(2L, 1L, 7), Edge(2L, 4L, 2),
Edge(3L, 2L, 4), Edge(3L, 5L, 3),
Edge(4L, 1L, 1), Edge(5L, 3L, 9))
val conf = new SparkConf().setAppName("SparkGraphx")
val sc = new SparkContext(conf)
val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD)
val results = g.triplets.filter(t => t.attr > 7)
for (triplet <- results.collect) {
println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
32. SPARK MLLIB
MLlib is Spark’s scalable machine learning library
consisting of common learning algorithms and utilities.
Use cases :
Recommendation Engine
Content classification
Ranking
Algorithms
Classification and regression : linear regression, decision
trees, naive Bayes
Collaborative filtering : alternating least squares (ALS)
Clustering : k-means
…
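As an illustration of the clustering support, a minimal k-means sketch using MLlib's built-in API, run from the shell (the file name and parameters are placeholders):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Each input line is expected to hold space-separated numeric features
val parsedData = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = KMeans.train(parsedData, 2, 20) // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)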
33. SPARK MLLIB
SparkKMeans.scala
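// Note: parseVector, closestPoint and squaredDistance are helpers defined in the full example file
// (squaredDistance comes from the breeze linear algebra library it imports)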
val sparkConf = new SparkConf().setAppName("SparkKMeans")
val sc = new SparkContext(sparkConf)
val lines = sc.textFile(args(0))
val data = lines.map(parseVector _).cache()
val K = args(1).toInt
val convergeDist = args(2).toDouble
val kPoints = data.takeSample(withReplacement = false, K, 42).toArray
var tempDist = 1.0
while (tempDist > convergeDist) {
val closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))
val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
val newPoints = pointStats.map { pair =>
(pair._1, pair._2._1 * (1.0 / pair._2._2))
}.collectAsMap()
tempDist = 0.0
for (i <- 0 until K) {
tempDist += squaredDistance(kPoints(i), newPoints(i))
}
for (newP <- newPoints) yield {
kPoints(newP._1) = newP._2
}
println("Finished iteration (delta = " + tempDist + ")")
}
println("Final centers:")
kPoints.foreach(println)
sc.stop()
34. SPARK STREAMING
Spark Streaming extends the core API to allow high-throughput, fault-tolerant
stream processing of live data streams
Data can be ingested from many sources: Kafka, Flume, Twitter,
ZeroMQ, TCP sockets…
Results can be pushed out to filesystems, databases, live dashboards…
Spark’s MLlib algorithms and graph-processing algorithms can be
applied to data streams
35. SPARK STREAMING
Create a StreamingContext by providing the configuration and the batch
duration:
val ssc = new StreamingContext(sparkConf, Seconds(10))
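A minimal DStream sketch to show the shape of a streaming job (sparkConf as above; the socket source and port are placeholders):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sparkConf, Seconds(10))
// Count words arriving on a TCP socket, one 10-second batch at a time
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()           // print the first results of each batch
ssc.start()              // start receiving and processing data
ssc.awaitTermination()   // block until the job is stopped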
36. TWITTER - SPARK STREAMING - ELASTICSEARCH
1. Twitter access
val keys = ssc.sparkContext.textFile(args(0), 2).cache()
val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = keys.take(4)
// Set the system properties so that the Twitter4j library used by the twitter stream
// can use them to generate OAuth credentials
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
System.setProperty("twitter4j.oauth.accessToken", accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)
2. Streaming from Twitter
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
sparkConf.set("es.index.auto.create", "true")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val keys = ssc.sparkContext.textFile(args(0), 2).cache()
val stream = TwitterUtils.createStream(ssc, None)
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
val topCounts10 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(10))
.map { case (topic, count) => (count, topic) }
.transform(_.sortByKey(false))
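To see the result on the console, one option is a foreachRDD over each window (a sketch; names follow the snippet above):
topCounts10.foreachRDD(rdd => {
  val topList = rdd.take(10)
  println("\nPopular topics in the last 10 seconds (%s total):".format(rdd.count()))
  topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
})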
37. TWITTER - SPARK STREAMING - ELASTICSEARCH
Index into Elasticsearch
Add the elasticsearch-spark dependency to build.sbt:
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.0.Beta3"
Write an RDD to Elasticsearch:
import org.elasticsearch.spark._ // brings saveToEs into scope
val sparkConf = new SparkConf().setAppName(appName).setMaster(master)
sparkConf.set("es.index.auto.create", "true")
val apache = Map("hashtag" -> "#Apache", "count" -> 10)
val spark = Map("hashtag" -> "#Spark", "count" -> 15)
val rdd = ssc.sparkContext.makeRDD(Seq(apache, spark))
rdd.saveToEs("spark/hashtag")
Editor's notes
Hadoop is a Java framework that eases the creation of scalable distributed applications. It lets applications work with thousands of nodes and petabytes of data.
MapReduce is an architectural design pattern invented by Google
It consists of:
Map phase (computation): the Map processing is applied to each data set.
An intermediate phase where the data is sorted and related data is grouped so it can be processed by the same node.
Reduce phase (aggregation): the data is optionally aggregated.
The results of each node are combined to compute the final result.