INTRODUCTION TO 
APACHE SPARK 
Mohamed Hedi Abidi - Software Engineer @ebiznext 
@mh_abidi
CONTENT 
 Spark Introduction 
 Installation 
 Spark-Shell 
 SparkContext 
 RDD 
 Persistence 
 Simple Spark Apps 
 Deployment 
 Spark SQL 
 Spark GraphX 
 Spark MLlib 
 Spark Streaming 
 Spark & Elasticsearch
INTRODUCTION 
An open source cluster computing framework for data analytics 
In-memory data processing 
Up to 100x faster than Hadoop MapReduce (for in-memory workloads) 
Supports the MapReduce programming model
INTRODUCTION 
 Handles batch, interactive, and real-time within a single 
framework
INTRODUCTION 
 Programming at a higher level of abstraction: faster, easier development
INTRODUCTION 
 Highly accessible through standard APIs built in Java, 
Scala, Python, or SQL (for interactive queries), and a rich 
set of machine learning libraries 
 Compatibility with the existing Hadoop v1 (SIMR) and 
2.x (YARN) ecosystems so companies can leverage their 
existing infrastructure.
INSTALLATION 
 Install JDK 1.7+, Scala 2.10.x, sbt 0.13.7, Maven 3.0+ 
 Download and unzip Apache Spark 1.1.0 sources 
Or clone development Version : 
git clone git://github.com/apache/spark.git 
 Run Maven to build Apache Spark 
mvn -DskipTests clean package 
 Launch Apache Spark standalone REPL 
[spark_home]/bin/spark-shell 
 Go to SparkUI @ 
http://localhost:4040
SPARK-SHELL 
 we’ll run Spark’s interactive shell… within the “spark” 
directory, run: 
./bin/spark-shell 
 then from the “scala>” REPL prompt, let’s create some 
data… 
scala> val data = 1 to 10000 
 create an RDD based on that data… 
scala> val distData = sc.parallelize(data) 
 then use a filter to select values less than 10… 
scala> distData.filter(_ < 10).collect()
SPARKCONTEXT 
 The first thing a Spark program must do is to create a 
SparkContext object, which tells Spark how to access a 
cluster. 
 In the shell for either Scala or Python, this is the sc 
variable, which is created automatically 
 Other programs must use a constructor to instantiate a 
new SparkContext 
val conf = new SparkConf().setAppName(appName).setMaster(master) 
new SparkContext(conf)
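For illustration, here is a minimal self-contained application built around that constructor – a sketch only; the object name, master URL and computation are placeholders, not part of the original deck: 
import org.apache.spark.{SparkConf, SparkContext} 

object SimpleApp { 
  def main(args: Array[String]): Unit = { 
    // Describe the application and where to run it (local[2] = two local threads) 
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[2]") 
    val sc = new SparkContext(conf) 
    // Any RDD work goes here 
    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count() 
    println(s"Even numbers: $evens") 
    sc.stop() 
  } 
}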
RDDS 
 Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark – an RDD is an immutable distributed collection of data, partitioned across machines in a cluster 
 There are currently two types: 
 parallelized collections : Take an existing Scala collection and 
run functions on it in parallel 
 External datasets : Spark can create distributed datasets from 
any storage source supported by Hadoop, including local file 
system, HDFS, Cassandra, HBase, Amazon S3, etc.
RDDS 
 Parallelized collections 
scala> val data = Array(1, 2, 3, 4, 5) 
data: Array[Int] = Array(1, 2, 3, 4, 5) 
scala> val distData = sc.parallelize(data) 
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at 
parallelize at <console>:14 
 External datasets 
scala> val distFile = sc.textFile("README.md") 
distFile: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[7] at 
textFile at <console>:12
RDDS 
 Two types of operations on RDDs: 
transformations and actions 
 A transformation is a lazy (not computed immediately) 
operation on an RDD that yields another RDD 
 An action is an operation that triggers a computation, 
returns a value back to the Master, or writes to a stable 
storage system
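As a quick illustration of the distinction (a minimal sketch, reusing the README.md file from the earlier examples): 
scala> val lines = sc.textFile("README.md") // transformation: nothing is read yet 
scala> val sparkLines = lines.filter(_.contains("Spark")) // still lazy, only the lineage is recorded 
scala> sparkLines.count() // action: triggers the computation and returns a value to the driver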
RDDS : COMMONLY USED TRANSFORMATIONS 

filter(func) 
Purpose: returns a new RDD containing only the data elements for which func returns true 
scala> val rdd = sc.parallelize(List("ABC","BCD","DEF")) 
scala> val filtered = rdd.filter(_.contains("C")) 
scala> filtered.collect() 
Result: Array[String] = Array(ABC, BCD) 

map(func) 
Purpose: returns a new RDD by applying func to each data element 
scala> val rdd = sc.parallelize(List(1,2,3,4,5)) 
scala> val times2 = rdd.map(_*2) 
scala> times2.collect() 
Result: Array[Int] = Array(2, 4, 6, 8, 10) 

flatMap(func) 
Purpose: similar to map, but func returns a Seq instead of a single value. For example, mapping a sentence into a Seq of words 
scala> val rdd = sc.parallelize(List("Spark is awesome","It is fun")) 
scala> val fm = rdd.flatMap(str => str.split(" ")) 
scala> fm.collect() 
Result: Array[String] = Array(Spark, is, awesome, It, is, fun)
RDDS : COMMONLY USED TRANSFORMATIONS 

reduceByKey(func, [numTasks]) 
Purpose: aggregates the values of each key using a function. "numTasks" is an optional parameter to specify the number of reduce tasks 
scala> val word1 = fm.map(word => (word, 1)) 
scala> val wrdCnt = word1.reduceByKey(_+_) 
scala> wrdCnt.collect() 
Result: Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1)) 

groupByKey([numTasks]) 
Purpose: converts (K, V) pairs to (K, Iterable<V>) 
scala> val cntWrd = wrdCnt.map{ case (word, count) => (count, word) } 
scala> cntWrd.groupByKey().collect() 
Result: Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is))) 

distinct([numTasks]) 
Purpose: eliminates duplicates from an RDD 
scala> fm.distinct().collect() 
Result: Array[String] = Array(is, It, awesome, Spark, fun)
RDDS : COMMONLY USED ACTIONS 

count() 
Purpose: get the number of data elements in the RDD 
scala> val rdd = sc.parallelize(List('A','B','C')) 
scala> rdd.count() 
Result: Long = 3 

collect() 
Purpose: get all the data elements of an RDD as an Array 
scala> val rdd = sc.parallelize(List('A','B','C')) 
scala> rdd.collect() 
Result: Array[Char] = Array(A, B, C) 

reduce(func) 
Purpose: aggregate the data elements of an RDD using a function that takes two arguments and returns one 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.reduce(_+_) 
Result: Int = 10 

take(n) 
Purpose: fetch the first n data elements of an RDD, computed by the driver program 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.take(2) 
Result: Array[Int] = Array(1, 2)
RDDS : COMMONLY USED ACTIONS 

foreach(func) 
Purpose: execute func for each data element of the RDD. Usually used to update an accumulator (discussed later) or to interact with external systems 
scala> val rdd = sc.parallelize(List(1,2)) 
scala> rdd.foreach(x => println("%s*10=%s".format(x, x*10))) 
Result: 
1*10=10 
2*10=20 

first() 
Purpose: retrieve the first data element of the RDD. Similar to take(1) 
scala> val rdd = sc.parallelize(List(1,2,3,4)) 
scala> rdd.first() 
Result: Int = 1 

saveAsTextFile(path) 
Purpose: write the content of the RDD to a text file, or a set of text files, on the local file system or HDFS 
scala> val hamlet = sc.textFile("readme.txt") 
scala> hamlet.filter(_.contains("Spark")).saveAsTextFile("filtered") 
Result: 
…/filtered$ ls 
_SUCCESS part-00000 part-00001
RDDS 
 For a more detailed list of actions and transformations, please refer to: 
http://spark.apache.org/docs/latest/programming-guide.html#transformations 
http://spark.apache.org/docs/latest/programming-guide.html#actions
PERSISTENCE 
 Spark can persist (or cache) a dataset in memory across operations 
 Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster 
 The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it
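A minimal sketch of how persistence is used (cache() keeps the RDD at the default MEMORY_ONLY level; persist() takes an explicit storage level from the table below – the file name is just an example): 
import org.apache.spark.storage.StorageLevel 

val logs = sc.textFile("README.md") 
val sparkLines = logs.filter(_.contains("Spark")).cache() // keep in memory (MEMORY_ONLY) 
// or pick a level explicitly, e.g. spill to disk when it does not fit in memory: 
// sparkLines.persist(StorageLevel.MEMORY_AND_DISK) 
sparkLines.count() // the first action computes and caches the partitions 
sparkLines.count() // later actions reuse the cached partitions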
PERSISTENCE : STORAGE LEVELS 

MEMORY_ONLY (default level) 
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. 

MEMORY_AND_DISK 
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. 

MEMORY_ONLY_SER 
Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. 

MEMORY_AND_DISK_SER 
Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. 

DISK_ONLY 
Store the RDD partitions only on disk. 

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. 
Same as the levels above, but replicate each partition on two cluster nodes.
SIMPLE SPARK APPS : WORDCOUNT 
Download project from github: 
https://github.com/MohamedHedi/SparkSamples 
WordCount.scala: 
val logFile = args(0) 
val conf = new SparkConf().setAppName("WordCount") 
val sc = new SparkContext(conf) 
val logData = sc.textFile(logFile, 2).cache() 
val numApache = logData.filter(line => line.contains("apache")).count() 
val numSpark = logData.filter(line => line.contains("spark")).count() 
println("Lines with apache: %s, Lines with spark: %s".format(numApache, 
numSpark)) 
 sbt 
 compile 
 assembly
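For reference, a minimal build.sbt sketch for producing the assembly jar – the library versions and the sbt-assembly plugin line are assumptions, not taken from the SparkSamples project: 
name := "SparkSamples" 
version := "1.0" 
scalaVersion := "2.10.4" 
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided" 
// project/assembly.sbt (assumed): addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")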
SPARK-SUBMIT 
./bin/spark-submit 
--class <main-class> 
--master <master-url> 
--deploy-mode <deploy-mode> 
--conf <key>=<value> 
... # other options 
<application-jar> 
[application-arguments]
SPARK-SUBMIT : LOCAL MODE 
./bin/spark-submit 
--class com.ebiznext.spark.examples.WordCount 
--master local[4] 
--deploy-mode client 
--conf <key>=<value> 
... # other options 
.\target\scala-2.10\SparkSamples-assembly-1.0.jar 
.\ressources\README.md
CLUSTER MANAGER TYPES 
 Spark supports three cluster managers: 
 Standalone – a simple cluster manager included with Spark 
that makes it easy to set up a cluster. 
 Apache Mesos – a general cluster manager that can also run 
Hadoop MapReduce and service applications. 
 Hadoop YARN – the resource manager in Hadoop 2.
MASTER URLS 

local 
Run Spark locally with one worker thread (no parallelism at all). 

local[K] 
Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). 

local[*] 
Run Spark locally with as many worker threads as logical cores on your machine. 

spark://HOST:PORT 
Connect to the given Spark standalone cluster master. Default master port: 7077. 

mesos://HOST:PORT 
Connect to the given Mesos cluster. Default Mesos port: 5050. 

yarn-client 
Connect to a YARN cluster in client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable. 

yarn-cluster 
Connect to a YARN cluster in cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.
SPARK-SUBMIT : STANDALONE CLUSTER 
 ./sbin/start-master.sh 
(Windows users: run spark-class.cmd org.apache.spark.deploy.master.Master) 
 Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER 
 ConnectWorkers to Master 
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT 
 Go to the master’s web UI
SPARK-SUBMIT : STANDALONE CLUSTER 
./bin/spark-submit --class com.ebiznext.spark.examples.WordCount 
--master spark://localhost:7077 
.\target\scala-2.10\SparkSamples-assembly-1.0.jar .\ressources\README.md
SPARK SQL 
 Shark is being migrated to Spark SQL 
 Spark SQL blurs the lines between RDDs and relational 
tables 
val conf = new SparkConf().setAppName("SparkSQL") 
val sc = new SparkContext(conf) 
val peopleFile = args(0) 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
import sqlContext._ 
// Define the schema using a case class. 
case class Person(name: String, age: Int) 
// Create an RDD of Person objects and register it as a table. 
val people = sc.textFile(peopleFile).map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) 
people.registerAsTable("people") 
// SQL statements can be run by using the sql methods provided by sqlContext. 
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") 
// The results of SQL queries are SchemaRDDs and support all the normal RDD operations. 
// The columns of a row in the result can be accessed by ordinal. 
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
SPARK GRAPHX 
 GraphX is the new (alpha) Spark API for graphs and graph-parallel 
computation. 
 GraphX extends the Spark RDD by introducing the Resilient Distributed 
Property Graph 
case class Peep(name: String, age: Int) 
val vertexArray = Array( 
(1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)), 
(3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)), 
(5L, Peep("Leslie", 45))) 
val edgeArray = Array( 
Edge(2L, 1L, 7), Edge(2L, 4L, 2), 
Edge(3L, 2L, 4), Edge(3L, 5L, 3), 
Edge(4L, 1L, 1), Edge(5L, 3L, 9)) 
val conf = new SparkConf().setAppName("SparkGraphx") 
val sc = new SparkContext(conf) 
val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray) 
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray) 
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD) 
val results = g.triplets.filter(t => t.attr > 7) 
for (triplet <- results.collect) { 
println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}") 
}
SPARK MLLIB 
MLlib is Spark’s scalable machine learning library, consisting of common learning algorithms and utilities. 
Use cases: 
 Recommendation engines 
 Content classification 
 Ranking 
Algorithms: 
 Classification and regression: linear regression, decision trees, naive Bayes 
 Collaborative filtering: alternating least squares (ALS) 
 Clustering: k-means 
…
SPARK MLLIB 
SparkKMeans.scala 
val sparkConf = new SparkConf().setAppName("SparkKMeans") 
val sc = new SparkContext(sparkConf) 
val lines = sc.textFile(args(0)) 
val data = lines.map(parseVector _).cache() 
val K = args(1).toInt 
val convergeDist = args(2).toDouble 
val kPoints = data.takeSample(withReplacement = false, K, 42).toArray 
var tempDist = 1.0 
while (tempDist > convergeDist) { 
val closest = data.map(p => (closestPoint(p, kPoints), (p, 1))) 
val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) } 
val newPoints = pointStats.map { pair => 
(pair._1, pair._2._1 * (1.0 / pair._2._2)) 
}.collectAsMap() 
tempDist = 0.0 
for (i <- 0 until K) { 
tempDist += squaredDistance(kPoints(i), newPoints(i)) 
} 
for (newP <- newPoints) yield { 
kPoints(newP._1) = newP._2 
} 
println("Finished iteration (delta = " + tempDist + ")") 
} 
println("Final centers:") 
kPoints.foreach(println) 
sc.stop()
SPARK STREAMING 
 Spark Streaming extends the core API to allow high-throughput, fault-tolerant 
stream processing of live data streams 
 Data can be ingested from many sources: Kafka, Flume, Twitter, 
ZeroMQ, TCP sockets… 
 Results can be pushed out to filesystems, databases, live dashboards… 
 Spark’s MLlib algorithms and graph processing algorithms can be applied to data streams
SPARK STREAMING 
 Create a StreamingContext by providing the configuration and the batch duration 
val ssc = new StreamingContext(sparkConf, Seconds(10))
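A minimal word-count sketch on top of such a context (the socket source on localhost:9999 is an assumed example, not part of the original deck): 
import org.apache.spark.streaming.{Seconds, StreamingContext} 

val lines = ssc.socketTextStream("localhost", 9999) // one batch of lines every 10 seconds 
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _) 
wordCounts.print() // print the first elements of each batch 
ssc.start() // start receiving and processing data 
ssc.awaitTermination() // block until the streaming job is stopped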
TWITTER - SPARK STREAMING - ELASTICSEARCH 
1. Twitter access 
val keys = ssc.sparkContext.textFile(args(0), 2).cache() 
val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = keys.take(4) 
// Set the system properties so that Twitter4j library used by twitter stream 
// can use them to generate OAuth credentials 
System.setProperty("twitter4j.oauth.consumerKey", consumerKey) 
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret) 
System.setProperty("twitter4j.oauth.accessToken", accessToken) 
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret) 
2. Streaming from Twitter 
val sparkConf = new SparkConf().setAppName("TwitterPopularTags") 
sparkConf.set("es.index.auto.create", "true") 
val ssc = new StreamingContext(sparkConf, Seconds(10)) 
val keys = ssc.sparkContext.textFile(args(0), 2).cache() 
val stream = TwitterUtils.createStream(ssc, None) 
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#"))) 
val topCounts10 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(10)) 
.map { case (topic, count) => (count, topic) } 
.transform(_.sortByKey(false))
TWITTER - SPARK STREAMING - ELASTICSEARCH 
 Index in Elasticsearch 
 Adding the elasticsearch-spark jar to build.sbt: 
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.0.Beta3" 
 Writing an RDD to Elasticsearch: 
import org.elasticsearch.spark._ // brings saveToEs into scope 
val sparkConf = new SparkConf().setAppName(appName).setMaster(master) 
sparkConf.set("es.index.auto.create", "true") 
val apache = Map("hashtag" -> "#Apache", "count" -> 10) 
val spark = Map("hashtag" -> "#Spark", "count" -> 15) 
val rdd = ssc.sparkContext.makeRDD(Seq(apache, spark)) 
rdd.saveToEs("spark/hashtag")


Editor's notes

  1. Hadoop is a Java framework that makes it easy to build scalable distributed applications. It lets applications work with thousands of nodes and petabytes of data. MapReduce is an architectural design pattern invented by Google, composed of: a Map phase (computation), where the Map processing is applied to each data subset; an intermediate phase, where the data is sorted and related data is grouped so that it is processed by the same node; and a Reduce phase (aggregation), where the data is aggregated and the results from each node are combined to compute the final result.