@doanduyhai
Introduction to Spark
DuyHai DOAN, Technical Advocate
@doanduyhai
Shameless self-promotion!
2
Duy Hai DOAN
Cassandra technical advocate
•  talks, meetups, confs
•  open-source devs (Achilles, …)
•  OSS Cassandra point of contact
•  production troubleshooting
☞ duy_hai.doan@datastax.com
@doanduyhai
Datastax!
3
•  Founded in April 2010 
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 200+ employees
•  Headquartered in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features
@doanduyhai
Agenda!
4
•  Spark eco-system
•  RDD abstraction
•  Spark architecture & job life-cycle
•  Spark core API
•  Spark SQL
•  Spark Streaming
Spark eco-system!
Technology landscape!
Spark eco-system!
@doanduyhai
What is Apache Spark?!
6
Apache Project since 2010

General data processing framework

MapReduce is not the α & ω

One-framework-many-components approach
@doanduyhai
Data processing landscape!
7-10

Technology   Category   Vendor
Pregel       Graph      Google
Dremel       SQL        Google
GraphLab     Graph      Dato
Giraph       Graph      Apache
Storm        Stream     Apache
Drill        SQL        Apache
Tez          DAG        Apache
Impala       SQL        Cloudera
…

Stop the arms race now!
@doanduyhai
Spark characteristics!
11
Fast
•  10x-100x faster than Hadoop MapReduce
•  In-memory storage
•  Single JVM process per node, multi-threaded

Easy
•  Rich Scala, Java and Python APIs (R is coming …)
•  2x-5x less code
•  Interactive shell
@doanduyhai
Spark comparison!
12
[Chart: lines of code (LOC) comparison]
@doanduyhai
Spark eco-system!
13
Spark SQL | Spark Streaming | MLLib | GraphX
Spark Core Engine (Scala/Java/Python)
Cluster Manager: Local | Standalone cluster | YARN | Mesos | …
Persistence
Q & A
RDD abstraction!
Code example!
RDD interface!
@doanduyhai
Code example!
17
Setup

val conf = new SparkConf(true)
  .setAppName("basic_example")
  .setMaster("local[3]")

val sc = new SparkContext(conf)

Data-set (can be from text, CSV, JSON, Cassandra, HDFS, …)

val people = List(("jdoe", "John DOE", 33),
                  ("hsue", "Helen SUE", 24),
                  ("rsmith", "Richard Smith", 33))
@doanduyhai
Code example!
18
Processing

// count users by age
val countByAge = sc.parallelize(people)
                   .map(tuple => (tuple._3, tuple))
                   .groupByKey()
                   .countByKey()

println("Count by age: " + countByAge)
@doanduyhai
Code example!
19
Decomposition

// count users by age
val countByAge = sc.parallelize(people)   // split into different chunks (partitions)
    .map(tuple => (tuple._3, tuple))      // ("jdoe", "John DOE", 33) => (33, ("jdoe", …))
    .groupByKey()                         // {33 -> [("jdoe", …), ("rsmith", …)], 24 -> [("hsue", …)]}
    .countByKey()                         // {33 -> 2, 24 -> 1}

println("Count by age: " + countByAge)    // Count by age = Map(33 -> 2, 24 -> 1)
@doanduyhai
RDDs!
20
RDD = Resilient Distributed Dataset

val parallelPeople: RDD[(String, String, Int)] = sc.parallelize(people)

val extractAge: RDD[(Int, (String, String, Int))] = parallelPeople
    .map(tuple => (tuple._3, tuple))

val groupByAge: RDD[(Int, Iterable[(String, String, Int)])] = extractAge.groupByKey()

val countByAge: Map[Int, Long] = groupByAge.countByKey()
@doanduyhai
RDDs in action!
21
RDD[(String, String, Int)]:                   (jdoe…) (hsue…) (rsmith…)
RDD[(Int, (String, String, Int))]:            (33,(jdoe…)) (24,(hsue…)) (33,(rsmith…))
RDD[(Int, Iterable[(String, String, Int)])]:  (33,[(jdoe…), (rsmith…)]) (24,[(hsue…)])
Map[Int, Long]:                               {33 -> 2} {24 -> 1}
@doanduyhai
RDD interface!
22
Interface RDD[A]
•  ≈ collection of A objects
•  lazy, pulled by child transformations

Operations
•  transformations (lazy)
•  actions (trigger computation, see the sketch below)
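A minimal sketch of this laziness (hypothetical data, standard RDD API): transformations only record lineage, and nothing executes until an action pulls the results through.

val numbers = sc.parallelize(1 to 1000)     // RDD[Int]
val squares = numbers.map(x => x * x)       // transformation: lazy, nothing runs yet
val evens   = squares.filter(_ % 2 == 0)    // transformation: still lazy

val howMany = evens.count()                 // action: triggers the whole chain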
@doanduyhai
RDD interface!
23
protected def getPartitions: Array[Partition]
•  defines the partitions (chunks)

protected def getDependencies: Seq[Dependency[_]]
•  defines the parent RDDs (the lineage)
•  direct transformations (map, filter, …): 1-to-1 relationship
•  aggregations (join, groupByKey, …): n-to-n relationship

def compute(split: Partition, context: TaskContext): Iterator[T]
•  computes the elements of one partition
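To make this interface concrete, here is a minimal, hypothetical custom RDD (not from the talk) that slices a local Seq into partitions: getPartitions declares the chunks, and compute materializes one of them.

import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative only: distributes a local Seq[A] over `numSlices` partitions
class LocalSeqRDD[A: ClassTag](sc: SparkContext, data: Seq[A], numSlices: Int)
  extends RDD[A](sc, Nil) {   // Nil = no parent RDDs, hence no dependencies

  private case class LocalPartition(index: Int) extends Partition

  // One Partition object per slice
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => LocalPartition(i): Partition).toArray

  // Materialize the elements belonging to this slice
  override def compute(split: Partition, context: TaskContext): Iterator[A] =
    data.zipWithIndex
        .collect { case (a, idx) if idx % numSlices == split.index => a }
        .iterator
}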
@doanduyhai
RDD interface!
24
Two optimization hooks:

protected def getPreferredLocations(split: Partition): Seq[String]
•  for data locality (HDFS, Cassandra, …)

@transient val partitioner: Option[Partitioner]
•  HashPartitioner, RangePartitioner
•  Murmur3Partitioner for Cassandra
•  can be None
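A short sketch of applying a partitioner by hand (a pair RDD is assumed; standard core API):

import org.apache.spark.HashPartitioner

val byAge  = sc.parallelize(people).map(t => (t._3, t))   // pair RDD keyed by age
val hashed = byAge.partitionBy(new HashPartitioner(8))    // spread keys over 8 partitions

println(hashed.partitioner)   // Some(org.apache.spark.HashPartitioner@…)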
@doanduyhai
Partitions!
25
Definition
•  the chunks of an RDD
•  allow operations on an RDD to run in parallel
•  1 RDD has [1, n] partitions
•  partitions are distributed across Spark workers

Partitioning
•  impacts parallelism
•  impacts performance
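A sketch of inspecting and changing the partitioning (standard core API; the numbers are arbitrary):

val rdd = sc.parallelize(1 to 1000, 4)   // ask for 4 partitions up front
println(rdd.partitions.size)             // 4

val more  = rdd.repartition(16)          // reshuffle into 16 partitions
val fewer = more.coalesce(2)             // merge down to 2 without a full shuffle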
@doanduyhai
Partitions in the cluster!
26

[Diagram: the partitions of an RDD distributed across Spark workers (SparkW), coordinated by the Spark master (SparkM)]
@doanduyhai
Partitions transformations!
27

[Diagram: map(tuple => (tuple._3, tuple)) is a direct transformation, partition to partition; groupByKey() then countByKey() require a shuffle of the RDD's partitions]
@doanduyhai
Stages!
28

[Diagram: Stage 1, then a shuffle operation, then Stage 2; shuffle operations delimit the frontiers between stages]
@doanduyhai
Tasks!
29

[Diagram: Task1-Task4 in one stage, Task5-Task7 in the next; transformations inside a stage are pipelinable into a single task]
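To see stage frontiers on a real RDD, the standard toDebugString method prints the lineage; each shuffle boundary in its output starts a new stage (a sketch, with a hypothetical input file):

val wordCounts = sc.textFile("words.txt")   // hypothetical input
  .flatMap(_.split(" "))                    // pipelinable
  .map(w => (w, 1))                         // pipelinable
  .reduceByKey(_ + _)                       // shuffle: new stage

println(wordCounts.toDebugString)           // lineage, with shuffle boundaries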
Q & A
Architecture & job life-cycle!
Spark components!
Spark job life-cycle!
@doanduyhai
Spark Components!
32
[Diagram: a Driver Program, the Spark Master/Cluster Manager, and Spark Workers hosting Executors that hold the RDD partitions]
@doanduyhai
Spark Components!
33
Driver Program: the program that creates a SparkContext (your application), JVM (512Mb)

Spark Worker: JVM (16Mb - 256Mb)
@doanduyhai
Spark Components!
34
The Driver Program asks the Spark Master/Cluster Manager for resources
@doanduyhai
Spark Components!
35
The Spark Master/Cluster Manager assigns Spark Workers to the program
@doanduyhai
Spark Components!
36
Each assigned Spark Worker starts an Executor, a JVM (512Mb + … ≤ ≈ 30Gb)
@doanduyhai
Spark Components!
37
Executor (JVM, 512Mb + … ≤ ≈ 30Gb)

Executor has:
1.  a thread pool
2.  local disk storage

Executor creates:
1.  RDDs
2.  task threads
@doanduyhai
Spark job life-cycle!
38
rdd1.join(rdd2)
    .groupBy(…)
    .filter(…)

RDD Objects: build the operator DAG

  ↓ DAG

DAGScheduler: splits the graph into stages of tasks and submits each stage as ready; agnostic to operators

  ↓ Set<Task>

TaskScheduler: launches tasks via the cluster manager and retries failed or straggling tasks; doesn't know about stages (a failed stage goes back to the DAGScheduler)

  ↓ Task

Worker: executes tasks in threads; the Block manager stores and serves blocks (on-disk access)

© Matei Zaharia
Q & R
! "!
Spark Core API!
Operations!
Performance considerations!
Proper partitioning!
@doanduyhai
Spark Core API!
41
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...

+ Scala collection API ...
@doanduyhai
Direct transformations!
42
Direct transformations

map(f: A => B): RDD[B]
    Return a new RDD by applying a function to all elements of this RDD

filter(f: A => Boolean): RDD[A]
    Return a new RDD containing only the elements that satisfy a predicate

flatMap(f: A => Seq[B]): RDD[B]
    Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results

mapPartitions(f: Iterator[A] => Iterator[B]): RDD[B]
    Return a new RDD by applying a function to each partition of this RDD

mapPartitionsWithIndex(f: (Int, Iterator[A]) => Iterator[B]): RDD[B]
    Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition

sample(withReplacement: Boolean, fraction: Double, seed: Long): RDD[A]
    Return a sampled subset of this RDD

…
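A few of these in action (a sketch on a throwaway RDD):

val lines = sc.parallelize(Seq("a b", "c d e"))

lines.map(_.toUpperCase)       // "A B", "C D E"
lines.filter(_.length > 3)     // "c d e"
lines.flatMap(_.split(" "))    // "a", "b", "c", "d", "e"
lines.mapPartitionsWithIndex((i, it) => it.map(s => s"partition $i: $s"))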
@doanduyhai
Transformations with shuffle!
43
Transformations with shuffle

union(otherRDD: RDD[A]): RDD[A]
    Return the union of this RDD and another one

intersection(otherRDD: RDD[A]): RDD[A]
    Return the intersection of this RDD and another one

distinct(): RDD[A]
    Return a new RDD containing the distinct elements in this RDD

groupByKey(numTasks: Int): RDD[(K, Iterable[V])]
    Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with the existing partitioner/parallelism level

reduceByKey(f: (V, V) => V, numTasks: Int): RDD[(K, V)]
    Merge the values for each key using an associative reduce function. Output will be hash-partitioned with numTasks partitions

join[W](otherRDD: RDD[(K, W)]): RDD[(K, (V, W))]
    Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and (k, v2) is in `other`. Performs a hash join across the cluster
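And a sketch of the pair-RDD operations above (illustrative data):

val ages  = sc.parallelize(Seq(("jdoe", 33), ("hsue", 24)))
val mails = sc.parallelize(Seq(("jdoe", "jdoe@example.com")))

ages.reduceByKey((a, b) => math.max(a, b))   // one value per key
ages.join(mails)                             // ("jdoe", (33, "jdoe@example.com"))
ages.union(ages).distinct()                  // back to the two original pairs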
@doanduyhai
Actions!
44
Actions

reduce(f: (A, A) => A): A
    Reduces the elements of this RDD using the specified commutative and associative binary operator

collect(): Array[A]
    Return an array that contains all of the elements in this RDD

count(): Long
    Return the number of elements in the RDD

first(): A
    Return the first element in this RDD

take(num: Int): Array[A]
    Take the first num elements of the RDD

countByKey(): Map[K, Long]
    Count the number of elements for each key, and return the result to the master as a Map

foreach(f: A => Unit): Unit
    Applies a function f to all elements of this RDD

…
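The actions on a small example (a sketch; each call below returns results to the driver):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.count()                    // 3
pairs.first()                    // ("a", 1)
pairs.take(2)                    // Array(("a", 1), ("b", 2))
pairs.countByKey()               // Map("a" -> 2, "b" -> 1)
pairs.map(_._2).reduce(_ + _)    // 6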
@doanduyhai
Performance considerations!
45
Filter early to minimize memory usage

Fetch only necessary data

Minimize "shuffle" operations

Co-partition data whenever possible
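A before/after sketch of these rules (hypothetical data; what matters is where the filter sits and which aggregation runs):

val events = sc.parallelize(Seq(("ok", 1), ("ko", 1), ("ok", 1)))
def interesting(key: String) = key == "ok"

// Wasteful: shuffles every record, then discards most of them
val slow = events.groupByKey().filter { case (k, _) => interesting(k) }

// Better: filter early; reduceByKey also pre-aggregates before the shuffle
val fast = events.filter { case (k, _) => interesting(k) }.reduceByKey(_ + _)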
@doanduyhai
Proper partitioning!
46
Too few partitions
•  poor parallelism
•  sensitivity to data skew
•  memory pressure for groupBy(), reduceByKey(), …

Too many partitions
•  framework overhead (more CPU spent on Spark bookkeeping than on the job itself)
•  lots of CPU context switching
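Both knobs in a sketch: the partition count at creation time, and the numTasks argument that shuffle operations accept (file name and numbers are hypothetical):

val pairs = sc.textFile("big.txt", 64)   // 64 input partitions
  .map(line => (line.take(1), 1))

pairs.reduceByKey(_ + _, 8)              // shuffle into 8 partitions
pairs.reduceByKey(_ + _, 1024)           // likely too many: scheduling overhead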
Q & A
Spark SQL!
Architecture!
SchemaRDD!
SQL to RDD translation!
@doanduyhai
General ideas!
49
SQL-like query abstraction over RDDs

Introduce schema to raw objects

SchemaRDD = RDD[Row]

Declarative vs imperative data transformations

Let the engine optimize the query!
@doanduyhai
Integration with Spark!
50
@doanduyhai
Architecture!
51
Catalyst
•  relational algebra
•  query planner, query optimizer
•  predicates push-down
•  re-ordering

SQL Core
•  SchemaRDD
•  new methods
•  schema handling + metadata
•  Parquet, JSON integration

Cassandra
•  custom DataStax connector
•  custom plan for CQL
•  predicates push-down for CQL
@doanduyhai
Code example!
52
Setup

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

Data-set (can be from Parquet, JSON, Cassandra, Hive, …)

case class Person(name: String, age: Int)

val people = sc.textFile("people.txt").map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")
@doanduyhai
Code example!
53
Query

val teenagers: SchemaRDD = sqlContext.sql(
    "SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// or

val teenagers: SchemaRDD = people.where('age >= 13).where('age <= 19).select('name)

teenagers.map(row => "Name: " + row(0)).collect().foreach(println)
@doanduyhai
SchemaRDD!
54
Schema discovery by reflection: an RDD[Person] becomes a SchemaRDD whose columns (name, age) are read off the Person case class

From a JSON file: sqlContext.jsonFile("people.txt") yields a SchemaRDD of Rows with the same columns (John DOE, 33; Helen SUE, 24; Alice, 19; Bob, 18; …)
@doanduyhai
SQL to RDD translation!
55
Projection & selection

SELECT name, age
FROM people
WHERE age >= 13 AND age <= 19

val people: RDD[Person]
val teenagers: RDD[(String, Int)]
    = people
      .filter(p => p.age >= 13 && p.age <= 19)
      .map(p => (p.name, p.age))
@doanduyhai
SQL to RDD translation!
56
Joins (naive version)

SELECT p.name, e.content, e.date
FROM people p JOIN emails e
ON p.login = e.login
WHERE p.age >= 28
AND p.age <= 32
AND e.date >= '2015-01-01 00:00:00'
@doanduyhai
SQL to RDD translation!
57
Joins (naive version)

SELECT p.name, e.content, e.date
FROM people p JOIN emails e
ON p.login = e.login
WHERE p.age >= 28
AND p.age <= 32
AND e.date >= '2015-01-01 00:00:00'

val people: RDD[Person]
val emails: RDD[Email]
val p = people.map(p => (p.login, p))
val e = emails.map(e => (e.login, e))
val eavesdrop: RDD[(String, String, Date)]
    = p.join(e)
      .filter { case (login, (p, e)) =>
          p.age >= 28 && p.age <= 32 &&
          e.date >= "2015-01-01 00:00:00"
      }
      .map { case (login, (p, e)) =>
          (p.name, e.content, e.date)
      }
@doanduyhai
SQL to RDD translation!
58
Joins (optimized version: selection & projection push-down)

SELECT p.name, e.content, e.date
FROM people p JOIN emails e
ON p.login = e.login
WHERE p.age >= 28
AND p.age <= 32
AND e.date >= '2015-01-01 00:00:00'

val p = people
    .filter(p => p.age >= 28 && p.age <= 32)
    .map(p => (p.login, p.name))

val e = emails
    .filter(e => e.date >= "2015-01-01 00:00:00")
    .map(e => (e.login, (e.content, e.date)))

val eavesdrop: RDD[(String, String, Date)]
    = p.join(e)
      .map { case (login, (name, (content, date))) =>
          (name, content, date)
      }
Q & A
Spark Streaming!
General ideas!
Streaming example!
Window-based reduce!
Outputs!
@doanduyhai
General ideas!
61
Micro batching, each batch = RDD

Fault tolerant, exactly-once processing

Unified stream and batch processing framework
[Diagram: a DStream discretizes the data stream into a sequence of RDDs, one per batch]
@doanduyhai
General ideas!
62
Base unit = time window

Enables computations by time window

[Diagram: a time window over the DStream spans several consecutive RDDs of the data stream]
@doanduyhai
Streaming Example!
63
Set up

val ssc = new SparkStreamContext("local", "test")
ssc.setBatchDuration(Seconds(1))

Words stream

val words = ssc.createNetworkStream("http://...")
val ones = words.map(w => (w, 1))
val freqs = ones.reduceByKey { case (count1, count2) => count1 + count2 }

freqs.print()

// Start the stream computation
ssc.run
@doanduyhai
Window-based reduce!
64
val freqs = ones.reduceByKey { case (count1, count2) => count1 + count2 }
val freqs_60s = freqs.window(Seconds(60), Seconds(1))
    .reduceByKey { case (count1, count2) => count1 + count2 }

// or

val freqs_60s = ones.reduceByKeyAndWindow((count1, count2) => count1 + count2,
    Seconds(60), Seconds(1))
@doanduyhai
Window-based reduce!
65
[Diagram: one-minute windows (minute1 … minute6) sliding over the batch RDDs]

•  keeps the RDDs of the whole time window in memory (automatic caching)
•  computes the sum using all the RDDs in the time window
@doanduyhai
Optimized window-based reduce!
66
val freqs_60s = freqs.window(Seconds(60), Seconds(1))
    .reduceByKey { case (count1, count2) => count1 + count2 }

// or

val freqs_60s = ones
    .reduceByKeyAndWindow(
        (count1, count2) => count1 + count2,   // reduce
        (count1, count2) => count1 - count2,   // inverse reduce
        Seconds(60), Seconds(1))
@doanduyhai
Optimized window-based reduce!
67
Non-optimized computation: every window is reduced from scratch
•  minute1 = reduce(RDD1, RDD2, RDD3)
•  minute2 = reduce(RDD2, RDD3, RDD4)
•  minute3 = reduce(RDD3, RDD4, RDD5)

Optimized with inverse reduce: each window reuses the previous one
•  minute2 = reduce(inverseReduce(minute1, RDD1), RDD4), i.e. minute1 minus RDD1 plus RDD4
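The same idea with plain numbers (a local Scala sketch, not the streaming API): a sliding 3-batch sum need not re-add all three batches; it can reuse the previous window.

val batches = Seq(1, 2, 3, 4, 5)                       // hypothetical per-batch counts

// Naive: recompute each window of 3 from scratch
val naive = batches.sliding(3).map(_.sum).toList       // List(6, 9, 12)

// Incremental: newWindow = oldWindow - expired + arriving
var window = batches.take(3).sum                       // 6
val incremental = window :: (3 until batches.size).map { i =>
  window = window - batches(i - 3) + batches(i)        // inverse reduce, then reduce
  window
}.toList
// incremental == List(6, 9, 12)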
@doanduyhai
Outputs!
68
Actions

print(): Unit
    Prints the first ten elements of every batch of data in a DStream on the driver. This is useful for development and debugging

saveAsTextFiles(prefix: String, suffix: String = ""): Unit
    Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]"

saveAsHadoopFiles(prefix: String, suffix: String = ""): Unit
    Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]"

foreachRDD(f: RDD[A] => Unit): Unit
    Apply a function to each RDD in this DStream

…
Q & A
Thank You
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/
