Apache Spark Overview
Ten Tools for Ten Big Data Areas
Series 03 Big Data Analytics
www.sparkera.ca
Ten Tools for Ten Big Data Areas – Overview
10 Tools, 10 Areas
[Overview graphic with tool descriptors: Programming; Search and Index; First ETL fully on YARN; Data storing platform; Data computing platform; SQL & Metadata; Visualize with just a few clicks; As powerful as Java, as simple as Python; Real-time streaming made easier; Yours; Google; Lightning-fast cluster computing; Real-time distributed data store; High-throughput distributed messaging]
Agenda
1 About Apache Spark History
2 About Apache Spark Stack and Architecture
3 Resilient Distributed Dataset - RDD
4 RDD Transformation and Action
5 Spark SQL and DataFrame
A Little About DI – Data Integration
• DI involves combining data residing in different sources and providing users with a unified view of that data.
• The DI process is also called Enterprise Information Integration (EII).
• DI usually means ETL - data extract, transform, load.
• About 80% of enterprise data projects' effort is spent on DI work.
• Data cleansing, auditing, and master data management are usually considered part of DI.
Apache Spark History
• Started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab.
• Spark researchers found that Hadoop MapReduce was inefficient for iterative and interactive computing.
• Spark was designed from the beginning to be fast for interactive and iterative workloads, with support for in-memory storage and fault tolerance.
• Apart from UC Berkeley, Databricks, Yahoo!, and Intel are major contributors.
• Spark was open sourced in March 2010 and moved to the Apache Software Foundation in June 2013. It is one of the most active Apache projects.
Apache Spark Stack
[Stack diagram: Spark SQL, Spark Streaming, MLlib, and GraphX sit on the Spark Core API; Spark Core runs on the Standalone Scheduler, YARN, or other cluster managers]
Full stack and great ecosystem. One stack to rule them all.
Apache Spark Ecosystem
(Ecosystem diagram, Spark Summit Europe 2015)
Apache Spark Advantages
• Fast by leveraging memory
• General-purpose framework – ecosystem
• Good and wide API support - native integration with Java, Python, Scala
• Spark handles batch, interactive, and real-time workloads within a single framework
• Leverages the Resilient Distributed Dataset – RDD – to process data in memory
 Resilient – data can be recreated if it is lost from memory
 Distributed – stored in memory across the cluster
 Dataset – can be created from a file or programmatically
Apache Spark Hall of Fame
Largest cluster: more than 8,000 nodes
Largest single-day intake: 1 PB+/day
Longest-running job: 1 week on 1 PB+ of data
Largest shuffle job: 1 PB sort in 4 hours
How to Work with Spark
• Scala IDE
- Add the Hadoop dependency or use mvn
- Compile to a jar; build Spark or download it (http://spark.apache.org/docs/latest/building-spark.html)
- Run with spark-submit (see the example below)
• spark-shell
- Interactive command-line application
• spark-sql
- Spark SQL CLI using Hive metadata
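A typical spark-submit invocation, as a minimal sketch (the class name, jar, and paths below are placeholders, not part of the deck):
spark-submit --class com.example.WordCount --master yarn wordcount.jar hdfs:///tmp/input hdfs:///tmp/output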
Demo – Word Count
[Images: MapReduce code and Hive code for word count]
How about Spark code?
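For comparison, the same word count in Spark is only a few lines of Scala; a minimal sketch with a hypothetical input path:
val lines = sc.textFile("hdfs:///tmp/shakespeare.txt")
val counts = lines.flatMap(_.split("\\s+"))    // split each line into words
  .map(word => (word, 1))                      // pair each word with a count of 1
  .reduceByKey(_ + _)                          // sum the counts per word
counts.saveAsTextFile("hdfs:///tmp/wordcount") // the action triggers the job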
Spark Architecture
[Diagram: a driver application holding the SparkContext talks to the cluster manager (with a ZooKeeper-coordinated standby master); each worker node runs one or more executors that hold cached data and run tasks]
• One worker node by default has one executor; this can be changed with SPARK_WORKER_INSTANCES.
• Each worker will try to use all cores; limit this with SPARK_WORKER_CORES.
Spark Core Components
Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
Example of Partial Word Count
[Diagram: WordCount on HDFS input splits; Stage 1 pipelines the per-partition tasks, a shuffle re-partitions the data, and Stage 2 writes the results back to HDFS]
• Each core by default executes one task at a time (spark.task.cpus)
• One task per partition, one partition per input split
• The number of stages comes from shuffle/wide transformations (see the sketch below)
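A small sketch of the task/partition/stage relationship in the shell (the input path is hypothetical); reduceByKey introduces the shuffle boundary that splits the job into two stages:
val words = sc.textFile("hdfs:///tmp/shakespeare.txt", 4)   // ask for at least 4 input partitions
words.partitions.size                                        // one task per partition in stage 1
val counts = words.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _, 8)
counts.partitions.size                                       // 8 reduce-side tasks in stage 2
counts.count()                                               // the action runs the job as two stages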
Two Main Spark Abstractions
RDD – Resilient Distributed Dataset
DAG – Directed Acyclic Graph
RDD Concept
Simple explanation
• An RDD is a collection of data items split into partitions and stored in memory on the worker nodes of the cluster, on which operations can be performed.
Complex explanation
• An RDD is an interface for data transformation
• An RDD refers to data stored either in a persistent store (HDFS, Cassandra, HBase, etc.), in cache (memory, memory+disk, disk only, etc.), or in another RDD
• Partitions are recomputed on failure or cache eviction
Spark Program Flow by RDD
Create input RDD
• spark context
• create from external storage
• create from a sequence (parallelize)
Transform RDD
• using map, flatMap, distinct, filter
• wide/narrow
Persist intermediate RDD
• cache
• persist
Action on RDD
• produce result
Example:
myRDD = sc.parallelize([1,2,3,4])
myRDD = sc.textFile("hdfs://tmp/shakespeare.txt")
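Putting the four steps together in Scala, as a minimal sketch (the input path is hypothetical):
val lines = sc.textFile("hdfs:///tmp/shakespeare.txt")         // 1. create the input RDD
val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)  // 2. transform (narrow)
words.cache()                                                  // 3. persist the intermediate RDD
words.count()                                                  // 4. action produces a result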
RDD Partitions
• A partition is one of the chunks that an RDD is split into, and it is sent to a node
• The more partitions we have, the more parallelism we get (see the example after the diagram below)
• Each partition is a candidate to be spread out to different worker nodes
RDDs - Partitions
[Diagram: an RDD with 8 partitions (P1–P8) distributed across 4 worker nodes, each executor holding 2 partitions]
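A quick sketch of inspecting and changing the number of partitions from the shell:
val data = sc.parallelize(1 to 1000, 4)   // create an RDD with 4 partitions explicitly
data.partitions.size                      // 4
val wider = data.repartition(8)           // reshuffle into 8 partitions for more parallelism
wider.partitions.size                     // 8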
RDD Operations
• The RDD is the core data manipulation abstraction in Apache Spark
• Lazy and non-lazy operations
• There are two types of RDD operations
 Transformations – lazy – return a new RDD; results can be materialized as Array[T] with the collect action
 Actions – non-lazy – return a value
• Narrow and wide transformations
RDD Transformations - map
• Applies a transformation function to each item of the RDD and returns the result as a new RDD.
• A narrow transformation
[Diagram: f: T => U applied element-wise within each partition]
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val b = a.map(x => (x, x.length))
b.collect()
b.partitions.size
RDD Transformations - flatMap
• Similar to map, but each input item can be mapped to 0 or more output items.
• A narrow transformation
[Diagram: f: T => Seq[U] applied element-wise, with the results flattened]
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val c = a.flatMap(x => List(x, x))
c.collect
val d = a.map(x => List(x, x)).collect   // compare: map keeps the nested lists
RDD Transformations - filter
• Creates a new RDD by passing in a function used to filter the results. It is similar to the WHERE clause in SQL.
• A narrow transformation
[Diagram: elements that fail the predicate are dropped from each partition]
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val e = a.filter(x => x.length > 3)
e.collect
RDD Transformations - sample
• Returns a random sample subset RDD of the input RDD
• A narrow transformation
[Diagram: a fraction of the elements is sampled from each partition]
val sam = sc.parallelize(1 to 10, 3)
sam.sample(false, 0.1, 0).collect
sam.sample(true, 0.3, 1).collect
def sample(withReplacement: Boolean, fraction: Double, seed: Long): RDD[T]
RDD Transformations - distinct
• Returns a new RDD with the distinct elements of the source RDD, optionally with a specific number of partitions
• A narrow/wide transformation
[Diagram: duplicate elements are removed across partitions]
val dis = sc.parallelize(List(1,2,3,4,5,6,7,8,9,9,2,6))
dis.distinct.collect
dis.distinct(2).partitions.length
RDD Transformations - reduceByKey
• Operates on (K,V) pairs; the function must be of type (V,V) => V
• A narrow/wide transformation
[Diagram: values are combined per key, with map-side combining before the shuffle]
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x => (x.length, x))
f.reduceByKey((a, b) => (a + b)).collect
f.reduceByKey(_ + _).collect
f.groupByKey.collect
For grouping, reduceByKey is more efficient than groupByKey because it uses a combiner in the map phase (see the comparison below).
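A sketch of the difference for a simple count: both produce the same result, but reduceByKey combines counts inside each partition before shuffling, while groupByKey ships every (word, 1) pair across the network:
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(word => (word, 1))
val viaReduce = pairs.reduceByKey(_ + _)             // map-side combine, shuffles partial sums
val viaGroup = pairs.groupByKey().mapValues(_.sum)   // shuffles every pair, then sums
viaReduce.collect()   // Array((a,3), (b,1), (c,1)) in some order
viaGroup.collect()    // same result, more shuffle traffic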
RDD Transformations - sortByKey
• Sorts the (K,V) pairs by K. The default is ascending; pass false for descending.
• A narrow/wide transformation
[Diagram: pairs are re-ordered by key across partitions]
val a = sc.parallelize(List("dog", "salmon", "pig"), 3)
val f = a.map(x => (x.length, x))
f.sortByKey().collect
f.sortByKey(false).collect
RDD Transformations - union
• Returns the union of two RDDs as a new RDD
• A narrow transformation
[Diagram: the partitions of both RDDs are combined without a shuffle]
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 4)
u1.union(u2).collect
(u1 ++ u2).collect
RDD Transformations - intersection
• Returns a new RDD containing only the elements common to the source RDD and the argument.
• A narrow/wide transformation
[Diagram: only elements present in both RDDs are kept]
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 4)
u1.intersection(u2).collect
RDD Transformations - join
• When invoked on (K,V) and (K,W) pairs, this operation creates a new RDD of (K,(V,W)). It must be applied to key-value pairs.
• A wide transformation
[Diagram: pairs with matching keys from both RDDs are combined]
val j1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))
val j2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))
j1.join(j2).collect
j1.leftOuterJoin(j2).collect
j1.rightOuterJoin(j2).collect
RDD Actions - count
• Gets the number of data elements in the RDD
val u1 = sc.parallelize(1 to 3)
val u2 = sc.parallelize(3 to 10)
u1.count   // 3
u2.count   // 8
RDD Actions - take
• Fetches the first n data elements from the RDD
val t1 = sc.parallelize(3 to 10)
t1.take(3)   // Array(3, 4, 5)
RDD Actions - foreach
• Executes a function for each data element in the RDD.
val c = sc.parallelize(List("cat", "dog", "tiger", "lion"))
c.foreach(x => println(x + "s are fun"))
RDD Actions - saveAsTextFile
• Writes the content of the RDD to a text file or a set of text files on the local file system or HDFS. The default scheme is hdfs://.
val a = sc.parallelize(1 to 10000, 3)
a.saveAsTextFile("file:/Users/will/Downloads/mydata")
RDD Persistence – cache/persist
• Caches data to memory/disk. It is a lazy operation.
• cache = persist(MEMORY_ONLY); persist(MEMORY_AND_DISK) can spill to disk
val u1 = sc.parallelize(1 to 1000).filter(_ % 2 == 0)
u1.cache()
u1.count
[Diagram: some partitions held in memory, the rest spilled to disk]
Storage Levels for persist
MEMORY_ONLY (default level): Keeps the RDD in available cluster memory as deserialized Java objects. Some partitions may not be cached if there is not enough cluster memory; those partitions will be recalculated on the fly as needed.
MEMORY_AND_DISK: Stores the RDD as deserialized Java objects. If the RDD does not fit in cluster memory, the remaining partitions are stored on disk and read from there as needed.
MEMORY_ONLY_SER: Stores the RDD as serialized Java objects (one byte array per partition). This is more CPU intensive but saves memory as it is more space efficient. Some partitions may not be cached; those will be recalculated on the fly as needed.
MEMORY_AND_DISK_SER: Same as above except that disk is used when memory is not sufficient.
DISK_ONLY: Stores the RDD only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the other levels, but each partition is replicated on 2 cluster nodes.
By default, Spark uses a Least Recently Used (LRU) algorithm to evict old and unused RDDs and release memory. You can also manually remove an unused RDD from memory with unpersist() (see the example below).
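A minimal sketch of choosing a non-default level and releasing it afterwards:
import org.apache.spark.storage.StorageLevel
val evens = sc.parallelize(1 to 1000000).filter(_ % 2 == 0)
evens.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spilling to disk when needed
evens.count()                                      // the first action materializes and caches the partitions
evens.count()                                      // later actions read from the cache
evens.unpersist()                                  // manually release the cached partitions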
Spark Variable - Broadcast
• Like the Hadoop Distributed Cache, broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks; for example, to give every node a copy of a large input dataset efficiently
• Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
val fn = sc.textFile("hdfs://filename").map(_.split("\\|"))   // escape | so the regex splits on the character
val fn_bc = sc.broadcast(fn.collect())   // broadcast local data, not the RDD itself: collect to the driver first
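The typical pattern is to broadcast a small lookup table and reference it inside a transformation; a sketch with made-up data:
val countryNames = Map("CA" -> "Canada", "US" -> "United States")   // small lookup table on the driver
val lookup = sc.broadcast(countryNames)                             // shipped once per executor, not per task
val codes = sc.parallelize(Seq("CA", "US", "CA"))
codes.map(code => lookup.value.getOrElse(code, "unknown")).collect()   // Array(Canada, United States, Canada)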
Spark Variable - Accumulators
• Accumulators are variables that can only be "added" to through an associative operation; they are used to implement counters and sums efficiently in parallel
• Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend them for new types
• Only the driver program can read an accumulator's value, not the tasks
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Two Main Spark Abstractions
RDD – Resilient Distributed Dataset
DAG – Directed Acyclic Graph
DAG
• Directed Acyclic Graph – the sequence of computations performed on the data
 Node: an RDD partition
 Edge: a transformation on top of the data
 Acyclic: the graph cannot return to an older partition
 Directed: a transformation is an action that transitions a data partition's state (from A to B)
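The DAG for an RDD can be inspected from the shell through its lineage; a minimal sketch with a hypothetical input path:
val counts = sc.textFile("hdfs:///tmp/shakespeare.txt")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.toDebugString   // prints the lineage, e.g. a ShuffledRDD depending on pipelined MapPartitionsRDDs over a HadoopRDD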
How Spark Works by DAG
Apache Spark SQL Stack
[Diagram: Spark applications (Python/Scala/Java/MLlib/R) and BI tools (Tableau/Qlik/ZoomData/…, via JDBC) sit on top of Spark SQL (SchemaRDD/DataFrame), which reads Hive, JSON, CSV, Parquet, JDBC, HBase, and other sources]
Spark SQL History
[Timeline graphic: released in April 2014; the predecessor effort stopped in mid-2014; production-ready by the beginning of 2015]
Spark SQL with Hive Meta
Supports almost all Hive queries through HiveContext, which is a superset of SQLContext
Spark SQL uses Hive's metastore, but its own version of the Thrift server
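A minimal sketch of querying a Hive table through HiveContext in the Spark 1.x shell (the table name is taken from the later demo):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = hiveContext.sql("SELECT * FROM sample_07 LIMIT 10")
df.show()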
Spark SQL with BI Tools
Leverage existing BI Tools
Spark SQL with Other Integrations
Picture from databricks.com
Spark SQL Supported Keywords
Picture from databricks.com
Supports SQL-92
Spark SQL – How It Works
Catalyst
Frontend Backend
Picture from databricks.com
DataFrame (DF)
• Schema on read
• Works with structured and semi-structured data
• Simplifies working with structured data
• Keeps track of schema information
• Read/write from structured data sources like JSON, Hive tables, Parquet, etc.
• SQL inside your Spark app
• Best performance and a more powerful operations API
• Accessed through SQLContext
• Integrated with ML
DF – Performance
Picture from databricks.com
DF Example – Direct from Files
val df = sqlContext.read.json("/demo/data/data_json/customer.json")
df.printSchema()
df.show()
df.select("name").show
df.filter(df("age") >= 35).show()
df.groupBy("age").count().show()
SqlContext deals directly with DataFrames (see the SQL example below)
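The same DataFrame can also be queried with SQL by registering it as a temporary table; a Spark 1.x sketch (the table name is arbitrary):
df.registerTempTable("customers")   // expose the DataFrame to SQL
sqlContext.sql("SELECT name, age FROM customers WHERE age >= 35").show()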
DF Example – Direct from Hive
val a = sqlContext.table("sample_07")
a.printSchema()
a.show(1)
SqlContext deals directly with DataFrames
DF Example – Indirect from Another RDD
val rdd = sc.textFile("/demo/data/data_cust/data1.csv").
  flatMap(_.split(",")).map(x => (x, 1)).reduceByKey(_ + _)
// auto schema (inferred from the tuple)
import sqlContext.implicits._            // needed for rdd.toDF
val df1 = sqlContext.createDataFrame(rdd)
val df2 = rdd.toDF
// specify schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schemaString = "word cnt"            // column names
val schema = StructType(StructField("word", StringType, true) :: StructField("cnt", IntegerType, false) :: Nil)
val rowRDD = rdd.map(p => Row(p._1, p._2))
val dfcust = sqlContext.createDataFrame(rowRDD, schema)
SqlContext deals directly with DataFrames
www.sparkera.ca
BIG DATA is not only about data,
but about understanding the data
and how people use data actively to improve their lives.
Editor's notes
  1. Lightning-fast cluster computing
  2. Resilient Distributed Dataset
  3. Tachyon, Mesos, BlinkDB: roughly 17 TB scanned on 100 nodes in about 2 seconds with 2–10% error
  4. Resilient Distributed Dataset
  5. Largest cluster: Tencent (8,000+ nodes). Largest single-day intake: Tencent (1 PB+/day). Longest-running job: Alibaba (1 week on 1 PB+ of data). Largest shuffle: Databricks PB sort (1 PB). Most interesting app: Jeremy Freeman, Mapping the Brain at Scale (with lasers!)
  6. RDD Persistence
  7. The first layer is the interpreter; Spark uses a Scala interpreter with some modifications. As you enter your code in the Spark console (creating RDDs and applying operators), Spark builds an operator graph. When the user runs an action (like collect), the graph is submitted to the DAG Scheduler. The DAG Scheduler divides the operator graph into (map and reduce) stages. A stage is composed of tasks based on partitions of the input data. The DAG Scheduler pipelines operators together to optimize the graph; for example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG Scheduler is a set of stages. The stages are passed on to the Task Scheduler, which launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The Task Scheduler doesn't know about dependencies among stages. The worker executes the tasks; a new JVM is started per job, and the worker knows only about the code that is passed to it.
  8. Catalyst optimizes query planning for Spark. It takes DataFrame operations and "compiles" them down to RDD operations, often faster than writing RDD code manually. Use DataFrames whenever possible (v1.4+).
  9. Works with structured and semi-structured data. DataFrames simplify working with structured data: read/write from structured data like JSON, Hive tables, Parquet, etc.; SQL inside your Spark app; best performance and a more powerful operations API.