SlideShare une entreprise Scribd logo
1  sur  43
Télécharger pour lire hors ligne
OVERVIEW
Vadim Bichutskiy
@vybstat
Interface Symposium
June 11, 2015
Licensed under: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
WHO AM I
• Computational and Data Sciences, PhD Candidate, George Mason
• Independent Data Science Consultant
• MS/BS Computer Science, MS Statistics
• NOT a Spark expert (yet!)
ACKNOWLEDGEMENTS
• Much of this talk is inspired by SparkCamp at Strata HadoopWorld, San
Jose, CA, February 2015 licensed under: Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License
• Taught by Paco Nathan
BIG NEWSTODAY!
databricks.com/blog/2015/06/11/announcing-apache-spark-1-4.html
4
SPARK HELPSYOU BUILD NEWTOOLS
5
THISTALK…
• Part I: Big Data:A Brief History
• Part II: A Tour of Spark
• Part III: Spark Concepts
6
PART I:
BIG DATA:A BRIEF HISTORY
7
• Web, e-commerce, marketing, other data explosion
• Work no longer fits on a single machine
• Move to horizontal scale-out on clusters of commodity hardware
• Machine learning, indexing, graph processing use cases at scale
DOT COM BUBBLE: 1994-2001
8
GAME CHANGE: C. 2002-2004
Google File System
research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters
research.google.com/archive/mapreduce.html
9
HISTORY: FUNCTIONAL PROGRAMMING FOR BIG DATA
2002 2004 2006 2008 2010 2012 2014
MapReduce @ Google
MapReduce Paper
Hadoop @ Yahoo!
Hadoop Summit
Amazon EMR
Spark @ Berkeley
Spark Paper
Databricks
Spark Summit
Apache Spark
takes off
Databricks Cloud
SparkR
KeystoneML
c. 1979 - MIT, CMU, Stanford, etc.
LISP, Prolog, etc. operations: map, reduce, etc. Slide adapted from SparkCamp, Strata Hadoop World, San Jose, CA, Feb 201510
MapReduce Limitations
• Difficult to program directly in MR
• Performance bottlenecks, batch processing only
• Streaming, iterative, interactive, graph processing,…
MR doesn’t fit modern use cases
Specialized systems developed as workarounds…
11
MapReduce Limitations
MR doesn’t fit modern use cases
Specialized systems developed as workarounds
Slide adapted from SparkCamp, Strata Hadoop World, San Jose, CA, Feb 201512
PART II:
APACHE SPARK TOTHE RESCUE…
13
Apache Spark
• Fast, unified, large-scale data processing engine for modern workflows
• Batch, streaming, iterative, interactive
• SQL, ML, graph processing
• Developed in ’09 at UC Berkeley AMPLab, open sourced in ’10
• Spark is one of the largest Big Data OSS projects
“Organizations that are looking at big data challenges –

including collection, ETL, storage, exploration and analytics –

should consider Spark for its in-memory performance and

the breadth of its model. It supports advanced analytics

solutions on Hadoop clusters, including the iterative model

required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
14
Apache Spark
Spark’s goal was to generalize MapReduce, supporting
modern use cases within same engine!
15
Spark Research
Spark: Cluster Computer withWorking Sets
http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Resilient Distributed Datasets:A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
16
Spark: Key Points
• Same engine for batch, streaming and interactive workloads
• Scala, Java, Python, and (soon) R APIs
• Programming at a higher level of abstraction
• More general than MR
17
WordCount: “Hello World” for Big Data Apps
Slide adapted from SparkCamp, Strata Hadoop World, San Jose, CA, Feb 2015
18
Spark vs. MapReduce
• Unified engine for modern workloads
• Lazy evaluation of the operator graph
• Optimized for modern hardware
• Functional programming / ease of use
• Reduction in cost to build/maintain enterprise apps
• Lower start up overhead
• More efficient shuffles
19
Spark in a nutshell…
20
Spark Destroys Previous Sort Record
Spark: 3x faster with 10x fewer nodes
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
21
Spark is one of the most active Big Data projects…
openhub.net/orgs/apache
22
What does Google say…
23
Spark on Stack Overflow
twitter.com/dberkholz/status/568561792751771648
24
It pays to Spark…
oreilly.com/data/free/2014-data-science-salary-survey.csp
25
Spark Adoption
databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html
26
PART III:
APACHE SPARK CONCEPTS…
27
Resilient Distributed Datasets (RDDs)
• Spark’s main abstraction - a fault-tolerant collection of elements that can be
operated on in parallel
• Two ways to create RDDs:
I. Parallelized collections
val data = Array(1, 2, 3, 4, 5)

data: Array[Int] = Array(1, 2, 3, 4, 5)

val distData = sc.parallelize(data)

distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24970]
II. External Datasets
lines = sc.textFile(“s3n://error-logs/error-log.txt”) 
.map(lambda x: x.split("t"))
28
RDD Operations
• Two types: transformations and actions
• Transformations create a new RDD out of existing one, e.g. rdd.map(…)
• Actions return a value to the driver program after running a computation
on the RDD, e.g., rdd.count()
Figure from SparkCamp, Strata Hadoop World, San Jose, CA, Feb 2015 29
Transformations
spark.apache.org/docs/latest/programming-guide.html
Transformation Meaning
map(func)
Return a new distributed dataset formed by passing each
element of the source through a function func.
filter(func)
Return a new dataset formed by selecting those elements
of the source on which func returns true.
flatMap(func)
Similar to map, but each input item can be mapped to 0 or
more output items (so func should return a Seq rather than
a single item).
mapPartitions(func)
Similar to map, but runs separately on each partition
(block) of the RDD.
mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an
integer value representing the index of the partition.
sample(withReplacement,
fraction, seed)
Sample a fraction fraction of the data, with or without
replacement, using a given random number generator seed.
30
Transformations
spark.apache.org/docs/latest/programming-guide.html
Transformation Meaning
union(otherDataset)
Return a new dataset that contains the union of the elements in
the source dataset and the argument.
intersection(otherDataset)
Return a new RDD that contains the intersection of elements in
the source dataset and the argument.
distinct([numTasks]))
Return a new dataset that contains the distinct elements of the
source dataset.
groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K,
Iterable<V>) pairs.
reduceByKey(func,
[numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K,
V) pairs where the values for each key are aggregated using the
given reduce function func.
sortByKey([ascending],
[numTasks])
When called on a dataset of (K, V) pairs where K implements
Ordered, returns a dataset of (K, V) pairs sorted by keys in
ascending or descending order
31
Transformations
spark.apache.org/docs/latest/programming-guide.html
Transformation Meaning
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a
dataset of (K, (V, W)) with all pairs of elements for each key.
cogroup(otherDataset,
[numTasks])
When called on datasets of type (K, V) and (K, W), returns a
dataset of (K, (Iterable<V>, Iterable<W>)) tuples.
cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of
(T, U) pairs (all pairs of elements).
pipe(command, [envVars])
Pipe each partition of the RDD through a shell command. RDD
elements are written to the process's stdin and lines output to
its stdout are returned as an RDD of strings.
coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions.
Useful for running operations more efficiently after filtering
down a large dataset.
32
Ex:Transformations
Python
>>> x = ['hello world', 'how are you enjoying the conference']
>>> rdd = sc.parallelize(x)
>>> rdd.filter(lambda x: 'hello' in x).collect()
['hello world']
>>> rdd.map(lambda x: x.split(" ")).collect()
[['hello', 'world'], ['how', 'are', 'you', 'enjoying', 'the', 'conference']]
>>> rdd.flatMap(lambda x: x.split(" ")).collect()
['hello', 'world', 'how', 'are', 'you', 'enjoying', 'the', 'conference']
33
Ex:Transformations
Scala
scala> val x = Array(“hello world”, “how are you enjoying the conference”)
scala> val rdd = sc.parallelize(x)
scala> rdd.filter(x => x contains "hello").collect()
res15: Array[String] = Array(hello world)
scala> rdd.map(x => x.split(" ")).collect()
res19: Array[Array[String]] = Array(Array(hello, world),
Array(how, are, you, enjoying, the, conference))
scala> rdd.flatMap(x => x.split(" ")).collect()
res20: Array[String] = Array(hello, world, how, are, you, enjoying,
the, conference)
34
Actions
spark.apache.org/docs/latest/programming-guide.html
Action Meaning
reduce(func)
Aggregate the elements of the dataset using a function func (which
takes two arguments and returns one), func should be commutative
and associative so it can be computed correctly in parallel.
collect()
Return all elements of the dataset as array at the driver program.
Usually useful after a filter or other operation that returns
sufficiently small data.
count() Return the number of elements in the dataset.
first() Return the first element of the dataset (similar to take(1)).
take(n) Return an array with the first n elements of the dataset.
takeSample(withReplacement,
num, [seed])
Return an array with a random sample of num elements of the
dataset, with or without replacement, with optional random
number generator seed.
takeOrdered(n, [ordering])
Return the first n elements of the RDD using either their natural
order or a custom comparator.
35
Actions
spark.apache.org/docs/latest/programming-guide.html
Action Meaning
saveAsTextFile(path)
Write the dataset as a text file (or set of text files) in a given path
in the local filesystem, HDFS or any other Hadoop-supported file
system. Spark will call toString on each element to convert it to a
line of text in the file.
saveAsSequenceFile(path)
(Java and Scala)
Write the dataset as a Hadoop SequenceFile in a given path in the
local filesystem, HDFS or any other Hadoop-supported file system.
saveAsObjectFile(path)
(Java and Scala)
Write the dataset in a simple format using Java serialization, which
can then be loaded using SparkContext.objectFile().
countByKey()
For RDD of type (K, V), returns a hashmap of (K, Int) pairs with the
count of each key.
foreach(func)
Run a function func on each element of the dataset. This is usually
done for side effects such as updating an accumulator variable or
interacting with external storage systems.
36
Ex:Actions
Python
>>> x = [“hello world", "hello there", "hello again”]
>>> rdd = sc.parallelize(x)
>>> wordsCounts = rdd.flatMap(lamdba x: x.split(" “)).map(lambda w: (w, 1)) 
.reduceByKey(add)
>>> wordCounts.saveAsTextFile("/Users/vb/wordcounts")
>>> wordCounts.collect()
[(again,1), (hello,3), (world,1), (there,1)]
>>> from operator import add
37
Ex:Actions
Scala
scala> val x = Array("hello world", "hello there", "hello again")
scala> val rdd = sc.parallelize(x)
scala> val wordsCounts = rdd.flatMap(x => x.split(" ")).map(word => (word, 1)) 
.reduceByKey(_ + _)
scala> wordCounts.saveAsTextFile("/Users/vb/wordcounts")
scala> wordCounts.collect()
res43: Array[(String, Int)] = Array((again,1), (hello,3), (world,1), (there,1))
38
RDD Persistence
• Unlike MapReduce, Spark can persist (or cache) a dataset in
memory across operations
• Each node stores any partitions of it that it computes in memory
and reuses them in other transformations/actions on that RDD
• 10x increase in speed
• One of the most important Spark features
>>> wordCounts = rdd.flatMap(lamdba x: x.split(“ “)) 
.map(lambda w: (w, 1)) 
.reduceByKey(add) 
.cache() 39
RDD Persistence Storage Levels
Storage Level Meaning
MEMORY_ONLY
Store RDD as deserialized Java objects in
the JVM. If the RDD does not fit in
memory, some partitions will not be
cached and will be recomputed on the fly
each time they're needed. This is the
default level.
MEMORY_AND_DISK
Store RDD as deserialized Java objects in
the JVM. If the RDD does not fit in
memory, store the partitions that don't fit
on disk, and read them from there when
they're needed.
MEMORY_ONLY_SER
Store RDD as serialized Java objects (one
byte array per partition). This is generally
more space-efficient than deserialized
objects, especially when using a fast
serializer, but more CPU-intensive to read.
http://spark.apache.org/docs/latest/programming-guide.html
40
More RDD Persistence Storage Levels
Storage Level Meaning
MEMORY_AND_DISK_SER
Similar to MEMORY_ONLY_SER, but spill
partitions that don't fit in memory to disk
instead of recomputing them on the fly
each time they're needed.
DISK_ONLY Store RDD partitions only on disk.
MEMORY_ONLY_2,
MEMORY_AND_DISK_2, etc.
Same as the levels above, but replicate
each partition on two cluster nodes.
OFF_HEAP (experimental)
Store RDD in serialized format in Tachyon.
Compared to MEMORY_ONLY_SER,
OFF_HEAP reduces garbage collection
overhead and allows executors to be
smaller and to share a pool of memory,
making it attractive in environments with
large heaps or multiple concurrent
applications.
http://spark.apache.org/docs/latest/programming-guide.html
41
Under the hood…
spark.apache.org/docs/latest/cluster-overview.html
42
And so much more…
• DataFrames and SQL
• Spark Streaming
• MLlib
• GraphX
spark.apache.org/docs/latest/
43

Contenu connexe

Tendances

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistenceVenkat Datla
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Simplilearn
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 

Tendances (20)

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Spark
SparkSpark
Spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 

En vedette

Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLNick Dimiduk
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)Nick Dimiduk
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark OverviewairisData
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming OverviewStratio
 
Pancasila sebagai konteks ketatanegaraan
Pancasila sebagai konteks ketatanegaraanPancasila sebagai konteks ketatanegaraan
Pancasila sebagai konteks ketatanegaraanElla Feby
 
Spark architecture
Spark architectureSpark architecture
Spark architecturedatamantra
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
 
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...Spark Summit
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 

En vedette (16)

Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQL
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview[Spark meetup] Spark Streaming Overview
[Spark meetup] Spark Streaming Overview
 
Pancasila sebagai konteks ketatanegaraan
Pancasila sebagai konteks ketatanegaraanPancasila sebagai konteks ketatanegaraan
Pancasila sebagai konteks ketatanegaraan
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove...
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 

Similaire à Apache Spark Overview

Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkMuktadiur Rahman
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkJUGBD
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Denny Lee
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiData Con LA
 

Similaire à Apache Spark Overview (20)

Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules DamjiA Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji
 

Dernier

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...only4webmaster01
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 

Dernier (20)

Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 

Apache Spark Overview

  • 1. OVERVIEW Vadim Bichutskiy @vybstat Interface Symposium June 11, 2015 Licensed under: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
  • 2. WHO AM I • Computational and Data Sciences, PhD Candidate, George Mason • Independent Data Science Consultant • MS/BS Computer Science, MS Statistics • NOT a Spark expert (yet!)
  • 3. ACKNOWLEDGEMENTS • Much of this talk is inspired by SparkCamp at Strata HadoopWorld, San Jose, CA, February 2015 licensed under: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License • Taught by Paco Nathan
  • 5. SPARK HELPSYOU BUILD NEWTOOLS 5
  • 6. THISTALK… • Part I: Big Data:A Brief History • Part II: A Tour of Spark • Part III: Spark Concepts 6
  • 7. PART I: BIG DATA:A BRIEF HISTORY 7
  • 8. • Web, e-commerce, marketing, other data explosion • Work no longer fits on a single machine • Move to horizontal scale-out on clusters of commodity hardware • Machine learning, indexing, graph processing use cases at scale DOT COM BUBBLE: 1994-2001 8
  • 9. GAME CHANGE: C. 2002-2004 Google File System research.google.com/archive/gfs.html MapReduce: Simplified Data Processing on Large Clusters research.google.com/archive/mapreduce.html 9
  • 10. HISTORY: FUNCTIONAL PROGRAMMING FOR BIG DATA 2002 2004 2006 2008 2010 2012 2014 MapReduce @ Google MapReduce Paper Hadoop @ Yahoo! Hadoop Summit Amazon EMR Spark @ Berkeley Spark Paper Databricks Spark Summit Apache Spark takes off Databricks Cloud SparkR KeystoneML c. 1979 - MIT, CMU, Stanford, etc. LISP, Prolog, etc. operations: map, reduce, etc. Slide adapted from SparkCamp, Strata Hadoop World, San Jose, CA, Feb 201510
  • 11. MapReduce Limitations • Difficult to program directly in MR • Performance bottlenecks, batch processing only • Streaming, iterative, interactive, graph processing,… MR doesn’t fit modern use cases Specialized systems developed as workarounds… 11
  • 12. MapReduce Limitations MR doesn’t fit modern use cases Specialized systems developed as workarounds Slide adapted from SparkCamp, Strata Hadoop World, San Jose, CA, Feb 201512
  • 13. PART II: APACHE SPARK TOTHE RESCUE… 13
  • 14. Apache Spark • Fast, unified, large-scale data processing engine for modern workflows • Batch, streaming, iterative, interactive • SQL, ML, graph processing • Developed in ’09 at UC Berkeley AMPLab, open sourced in ’10 • Spark is one of the largest Big Data OSS projects “Organizations that are looking at big data challenges –
 including collection, ETL, storage, exploration and analytics –
 should consider Spark for its in-memory performance and
 the breadth of its model. It supports advanced analytics
 solutions on Hadoop clusters, including the iterative model
 required for machine learning and graph analysis.” Gartner, Advanced Analytics and Data Science (2014) 14
  • 15. Apache Spark Spark’s goal was to generalize MapReduce, supporting modern use cases within same engine! 15
  • 16. Spark Research Spark: Cluster Computer withWorking Sets http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf Resilient Distributed Datasets:A Fault-Tolerant Abstraction for In-Memory Cluster Computing https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf 16
  • 17. Spark: Key Points • Same engine for batch, streaming and interactive workloads • Scala, Java, Python, and (soon) R APIs • Programming at a higher level of abstraction • More general than MR 17
  • 18. WordCount: “Hello World” for Big Data Apps Slide adapted from SparkCamp, Strata Hadoop World, San Jose, CA, Feb 2015 18
  • 19. Spark vs. MapReduce • Unified engine for modern workloads • Lazy evaluation of the operator graph • Optimized for modern hardware • Functional programming / ease of use • Reduction in cost to build/maintain enterprise apps • Lower start up overhead • More efficient shuffles 19
  • 20. Spark in a nutshell… 20
  • 21. Spark Destroys Previous Sort Record Spark: 3x faster with 10x fewer nodes databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html 21
  • 22. Spark is one of the most active Big Data projects… openhub.net/orgs/apache 22
  • 23. What does Google say… 23
  • 24. Spark on Stack Overflow twitter.com/dberkholz/status/568561792751771648 24
  • 25. It pays to Spark… oreilly.com/data/free/2014-data-science-salary-survey.csp 25
  • 27. PART III: APACHE SPARK CONCEPTS… 27
  • 28. Resilient Distributed Datasets (RDDs) • Spark’s main abstraction - a fault-tolerant collection of elements that can be operated on in parallel • Two ways to create RDDs: I. Parallelized collections val data = Array(1, 2, 3, 4, 5)
 data: Array[Int] = Array(1, 2, 3, 4, 5)
 val distData = sc.parallelize(data)
 distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24970] II. External Datasets lines = sc.textFile(“s3n://error-logs/error-log.txt”) .map(lambda x: x.split("t")) 28
  • 29. RDD Operations • Two types: transformations and actions • Transformations create a new RDD out of existing one, e.g. rdd.map(…) • Actions return a value to the driver program after running a computation on the RDD, e.g., rdd.count() Figure from SparkCamp, Strata Hadoop World, San Jose, CA, Feb 2015 29
  • 30. Transformations spark.apache.org/docs/latest/programming-guide.html Transformation Meaning map(func) Return a new distributed dataset formed by passing each element of the source through a function func. filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true. flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). mapPartitions(func) Similar to map, but runs separately on each partition (block) of the RDD. mapPartitionsWithIndex(func) Similar to mapPartitions, but also provides func with an integer value representing the index of the partition. sample(withReplacement, fraction, seed) Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed. 30
  • 31. Transformations spark.apache.org/docs/latest/programming-guide.html Transformation Meaning union(otherDataset) Return a new dataset that contains the union of the elements in the source dataset and the argument. intersection(otherDataset) Return a new RDD that contains the intersection of elements in the source dataset and the argument. distinct([numTasks])) Return a new dataset that contains the distinct elements of the source dataset. groupByKey([numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. reduceByKey(func, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func. sortByKey([ascending], [numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order 31
  • 32. Transformations spark.apache.org/docs/latest/programming-guide.html Transformation Meaning join(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) with all pairs of elements for each key. cogroup(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. cartesian(otherDataset) When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). pipe(command, [envVars]) Pipe each partition of the RDD through a shell command. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings. coalesce(numPartitions) Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset. 32
  • 33. Ex:Transformations Python >>> x = ['hello world', 'how are you enjoying the conference'] >>> rdd = sc.parallelize(x) >>> rdd.filter(lambda x: 'hello' in x).collect() ['hello world'] >>> rdd.map(lambda x: x.split(" ")).collect() [['hello', 'world'], ['how', 'are', 'you', 'enjoying', 'the', 'conference']] >>> rdd.flatMap(lambda x: x.split(" ")).collect() ['hello', 'world', 'how', 'are', 'you', 'enjoying', 'the', 'conference'] 33
  • 34. Ex:Transformations Scala scala> val x = Array(“hello world”, “how are you enjoying the conference”) scala> val rdd = sc.parallelize(x) scala> rdd.filter(x => x contains "hello").collect() res15: Array[String] = Array(hello world) scala> rdd.map(x => x.split(" ")).collect() res19: Array[Array[String]] = Array(Array(hello, world), Array(how, are, you, enjoying, the, conference)) scala> rdd.flatMap(x => x.split(" ")).collect() res20: Array[String] = Array(hello, world, how, are, you, enjoying, the, conference) 34
  • 35. Actions spark.apache.org/docs/latest/programming-guide.html Action Meaning reduce(func) Aggregate the elements of the dataset using a function func (which takes two arguments and returns one), func should be commutative and associative so it can be computed correctly in parallel. collect() Return all elements of the dataset as array at the driver program. Usually useful after a filter or other operation that returns sufficiently small data. count() Return the number of elements in the dataset. first() Return the first element of the dataset (similar to take(1)). take(n) Return an array with the first n elements of the dataset. takeSample(withReplacement, num, [seed]) Return an array with a random sample of num elements of the dataset, with or without replacement, with optional random number generator seed. takeOrdered(n, [ordering]) Return the first n elements of the RDD using either their natural order or a custom comparator. 35
  • 36. Actions spark.apache.org/docs/latest/programming-guide.html Action Meaning saveAsTextFile(path) Write the dataset as a text file (or set of text files) in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. saveAsSequenceFile(path) (Java and Scala) Write the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. saveAsObjectFile(path) (Java and Scala) Write the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile(). countByKey() For RDD of type (K, V), returns a hashmap of (K, Int) pairs with the count of each key. foreach(func) Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable or interacting with external storage systems. 36
  • 37. Ex:Actions Python >>> x = [“hello world", "hello there", "hello again”] >>> rdd = sc.parallelize(x) >>> wordsCounts = rdd.flatMap(lamdba x: x.split(" “)).map(lambda w: (w, 1)) .reduceByKey(add) >>> wordCounts.saveAsTextFile("/Users/vb/wordcounts") >>> wordCounts.collect() [(again,1), (hello,3), (world,1), (there,1)] >>> from operator import add 37
  • 38. Ex:Actions Scala scala> val x = Array("hello world", "hello there", "hello again") scala> val rdd = sc.parallelize(x) scala> val wordsCounts = rdd.flatMap(x => x.split(" ")).map(word => (word, 1)) .reduceByKey(_ + _) scala> wordCounts.saveAsTextFile("/Users/vb/wordcounts") scala> wordCounts.collect() res43: Array[(String, Int)] = Array((again,1), (hello,3), (world,1), (there,1)) 38
  • 39. RDD Persistence • Unlike MapReduce, Spark can persist (or cache) a dataset in memory across operations • Each node stores any partitions of it that it computes in memory and reuses them in other transformations/actions on that RDD • 10x increase in speed • One of the most important Spark features >>> wordCounts = rdd.flatMap(lamdba x: x.split(“ “)) .map(lambda w: (w, 1)) .reduceByKey(add) .cache() 39
  • 40. RDD Persistence Storage Levels Storage Level Meaning MEMORY_ONLY Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. http://spark.apache.org/docs/latest/programming-guide.html 40
  • 41. More RDD Persistence Storage Levels Storage Level Meaning MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. DISK_ONLY Store RDD partitions only on disk. MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as the levels above, but replicate each partition on two cluster nodes. OFF_HEAP (experimental) Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. http://spark.apache.org/docs/latest/programming-guide.html 41
  • 43. And so much more… • DataFrames and SQL • Spark Streaming • MLlib • GraphX spark.apache.org/docs/latest/ 43